Lecture 4: Assembly Language & Computer Architecture

Description: Prof. Leiserson walks through the stages of code from source code to compilation to machine code to hardware interpretation and, finally, to execution.

Instructor: Charles Leiserson

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today, we're going to talk about assembly language and computer architecture. It's interesting that these days most software courses don't bother to talk about these things. And the reason is that, as much as possible, people have been insulated from performance considerations when writing their software. But if you want to write fast code, you have to know what is going on underneath so you can exploit the strengths of the architecture. And the best interface that we have to that is the assembly language. So that's what we're going to talk about today.

So when you take a particular piece of code like fib here, to compile it you run it through Clang, as I'm sure you're familiar with at this point. And what it produces is binary machine language that the computer hardware is programmed to interpret and execute. It looks at the bits as instructions as opposed to data, and it executes them. And that's what we see when we execute.

This process is not one step. There are actually four stages to compilation: preprocessing, compiling-- sorry for the redundancy; that's sort of a bad name conflict, but that's what they call it-- assembling, and linking. So I want to take us through those stages.

So the first stage is the preprocessing stage. And you can invoke that with Clang manually. So you can say, for example, clang -E, and that will run the preprocessor and nothing else. And you can take a look at the output there and see how all your macros got expanded and such before the compilation actually goes through.

Then you compile it, and that produces assembly code. So assembly is a mnemonic structure of the machine code that makes it more human-readable than the machine code itself would be. And once again, you can produce the assembly yourself with clang -S.

And then finally, penultimately maybe, you can assemble that assembly language code to produce an object file. And since we like to have separate compilations, you don't have to compile everything as one big monolithic hunk. Then there's typically a linking stage to produce the final executable. And for that we are using ld for the most part. We're actually using the gold linker, but ld is the command that calls it.
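
Here's a rough sketch of the whole pipeline in practice. The file name fib.c and the routine in it are just placeholders for whatever you're compiling; the commands are the standard clang spellings of the four stages, and the exact output will vary with compiler version and flags.

    // fib.c -- a placeholder source file to run through the four stages by hand.
    //
    //   clang -E fib.c -o fib.i    # preprocess only: expand #includes and macros
    //   clang -S fib.i -o fib.s    # compile to human-readable assembly
    //   clang -c fib.s -o fib.o    # assemble into a binary object file
    //   clang fib.o -o fib         # link (clang invokes ld/gold) into an executable
    //
    //   objdump -d fib             # disassemble the executable back into assembly;
    //                              # compiling with -g makes the output easier to read
    #include <stdio.h>
    #include <stdint.h>

    int64_t fib(int64_t n) {
        if (n < 2) return n;
        return fib(n - 1) + fib(n - 2);
    }

    int main(void) {
        printf("%lld\n", (long long)fib(10));
        return 0;
    }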

So let's go through each of those steps and see what's going on. So first, the preprocessing is really straightforward. So I'm not going to do that. That's just a textual substitution.

The next stage goes from source code to assembly code. So when we do clang -S, we get this symbolic representation. And it looks something like this, where we have some labels on the side, we have some operations, and we have some directives. And then we have a lot of gibberish, which won't seem like so much gibberish after you've played with it a little bit. But to begin with, it looks kind of like gibberish.

From there, we assemble that assembly code and that produces the binary. And once again, you can invoke it just by running Clang. Clang will recognize that it doesn't have a C file or a C++ file. It says, oh, goodness, I've got an assembly language file. And it will produce the binary.

Now, the other thing that turns out to be the case is that assembly and machine code are really very similar in structure. Things like the opcodes, which are the things that are here in blue or purple, whatever that color is, like these guys, those correspond to specific bit patterns over here in the machine code. And these are the addresses and the registers that we're operating on, the operands. Those correspond to other bit codes over there. And it's not exactly one-to-one, but it's pretty close to one-to-one, compared to if you had C and you looked at the binary, which is way, way different.

So one of the things it turns out you can do is, if you have the machine code, and especially if the machine code was produced with so-called debug symbols-- that is, it was compiled with -g-- you can use this program called objdump, which will produce a disassembly of the machine code. So it will tell you, OK, here's what the mnemonic, more human-readable code is, the assembly code, from the binary.

And that's really useful, especially if you're trying to do things-- well, let's see why do we bother looking at the assembly? So why would you want to look at the assembly of your program? Does anybody have some ideas? Yeah.

AUDIENCE: [INAUDIBLE] made or not.

CHARLES LEISERSON: Yeah, you can see whether certain optimizations are made or not. Other reasons? Everybody is going to say that one. OK.

Another one is-- well, let's see, so here's some reasons. The assembly reveals what the compiler did and did not do, because you can see exactly what the assembly is that is going to be executed as machine code.

The second reason, which turns out to happen more often than you would think, is that, hey, guess what, the compiler is a piece of software. It has bugs. So your code isn't operating correctly. Oh, goodness, what's going on? Maybe the compiler made an error. And we have certainly found that, especially when you start using some of the less frequently used features of a compiler. You may discover, oh, it's actually not that well broken in.

And as it mentions here, a bug may only show up when compiling at -O3, but if you compile at -O0 or -O1, everything works out just fine. So then it says, gee, somewhere in the optimizations, they did an optimization wrong. So one of the first principles of optimization is do it right, and then the second is make it fast. And sometimes the compiler doesn't do that.

It's also the case that sometimes you cannot write code that produces the assembly that you want. And in that case, you can actually write the assembly by hand. Now, it used to be many years ago-- many, many years ago-- that a lot of software was written in assembly. In fact, my first job out of college, I spent about half the time programming in assembly language.

And it's not as bad as you would think. But it certainly is easier to have high-level languages, that's for sure. You get a lot more done a lot quicker.

And the last reason is reverse engineering. You can figure out what a program does when you only have access to its binary, not its source-- so, for example, the matrix multiplication example that I gave on day 1. You know, we had the overall outer structure, but for the inner loop, we could not match the Intel Math Kernel Library code. So what did we do?

We didn't have the source for it. We looked to see what it was doing. We said, oh, is that what they're doing? And then we were able to do it ourselves without having to get the source from them. So we reverse engineered what they did. So all those are good reasons.

Now, in this class, we have some expectations. So one thing is, you know, assembly is complicated and you needn't memorize the manual. In fact, the manual has over 1,000 pages. It's like-- but here's what we do expect of you.

You should understand how a compiler implements various C linguistic constructs with x86 instructions. And that's what we'll see in the next lecture. And you should be able to read x86 assembly language with the aid of an architecture manual. And on a quiz, for example, we would give you snippets, or explain the opcodes that are being used in case they're not familiar. But you should have some understanding of that, so you can see what's actually happening.

You should understand the high-level performance implications of common assembly patterns. OK, so what does it mean to do things in a particular way in terms of performance? So some of them are quite obvious. Vector operations tend to be faster than doing the same thing with a bunch of scalar operations.

If you do write assembly, typically what we use are the compiler intrinsic functions, the so-called built-ins, that allow you to use the assembly language instructions. And after we've done this, you should be able to write code from scratch if the situation demands it sometime in the future. We won't do that in this class, but we expect that you will be in a position to do that afterward-- you should get to the level of mastery where that would not be impossible for you to do. You'd be able to do it with a reasonable amount of effort.

So the rest of the lecture here is I'm going to first start by talking about the instruction set architecture of the x86-64, which is the one that we are using for the cloud machines that we're using. And then I'm going to talk about floating point in vector hardware and then I'm going to do an overview of computer architecture.

Now, all of this I'm doing-- this is a software class, right? Software performance engineering is what we're doing. So the reason we're doing this is so you can write code that better matches the hardware, and therefore better exploit it.

In order to do that, I could present things at a high level. My experience is that if you really want to understand something, you want to understand it to the level that's necessary and then one level below that. It's not that you'll necessarily use that level below, but it gives you insight as to why that layer is what it is and what's really going on. And so that's kind of what we're going to do. We're going to do a dive that takes us one level beyond what you probably will need to know in the class, so that you have a robust foundation for understanding.

Does that make sense? That's part of my learning philosophy: go one step beyond, and then you can come back.

So, the ISA primer. The ISA covers the syntax and semantics of assembly. There are four important concepts in the instruction set architecture-- the notion of registers, the notion of instructions, the data types, and the memory addressing modes. And we're going to go through those one by one.

So let's start with the registers. The registers are where the processor stores things. And there are a bunch of x86 registers, so many that you don't need to know most of them. The ones that are important are these.

So first of all, there are the general purpose registers. Those typically have width 64 bits, and there are many of those.

There is a so-called flags register, called RFLAGS, which keeps track of things like whether there was an overflow, whether the last arithmetic operation resulted in a zero, whether there was a carry out of a word, or what have you.

The next one is the instruction pointer. So the assembly language is organized as a sequence of instructions. And the hardware marches linearly through that sequence, one after the other, unless it encounters a conditional jump or an unconditional jump, in which case it'll branch to whatever the location is. But for the most part, it's just running straight through memory.

Then there are some registers that were added quite late in the game, namely the SSE registers and the AVX registers. And these are vector registers. So the XMM registers, when they first did vectorization, were 128 bits. Then for AVX, there are the YMM registers, which are 256 bits. And in the most recent processors, which we're not using this term, there's another level of AVX that gives you 512-bit registers.

Maybe we'll use that for the final project, because it's just like a little more power for the game playing project. But for most of what you'll be doing, we'll just be keeping to the C4 instances in AWS that you guys have been using.

Now, the x86-64 didn't start out as x86-64. It started out as x86. And it was used for machines, in particular the 8086, which had a 16-bit word. So really short.

How many things can you index with a 16-bit word? About how many?

AUDIENCE: 65,000.

CHARLES LEISERSON: Yeah, about 65,000. 65,536 words you can address, or bytes. This is byte addressing. So that's 65k bytes that you can address.

How could they possibly use that for machines? Well, the answer is that's how much memory was on the machine. You didn't have gigabytes. So as the machines-- as Moore's law marched along and we got more and more memory, then the words had to become wider to be able to index them. Yeah?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, but here's the thing: if memory is too expensive and you can't get memory that's big enough, then if you build a wider word-- say a word of 32 bits-- your processor just costs twice as much as the next guy's processor. So instead, what they did is they went along as long as 16 bits was the common size, and then had some growth pains and went to 32. And from there, they had some more growth pains and went to 64. OK, those were two separate steps.

And, in fact, they did some really weird stuff. So what they did, when they made these longer registers, is they have registers that are aliased to exactly the same thing for the lower bits. So you can address them by byte-- the lower and upper halves of the 16-bit word-- or you can do the 32-bit word, or you can do the 64-bit word.

And if you were doing this today, you wouldn't do that. You wouldn't have all these registers that alias and such. But that's what they did, because this is history, not design. And the reason is that when they were doing it, they were not designing for the long term.

Now, are we going to go to 128-bit addressing? Probably not. 64 bits addresses a spectacular amount of stuff. You know, 2 to the 64th is what? How many gazillions? It's a lot of gazillions. So, yeah, we're probably not going to have to go beyond 64.

So here are the general purpose registers. And as I mentioned, they have different names, but they cover the same thing. So if you change eax, for example, that also changes rax. And so you see they originally all had functional purposes. Now, they're all pretty much the same thing, but the names have stuck because of history. Instead of calling them registers 0, register 1, or whatever, they all have these funny names.

Some of them are still used for a particular purpose, like rsp is used as the stack pointer, and rbp is used to point to the base of the frame, for those who remember their 6.004 stuff. So anyway, there are a whole bunch of them, and they have different names depending upon which part of the register you're accessing.

Now, the format of an x86-64 instruction code is to have an opcode and then an operand list. And the operand list is typically 0, 1, 2, or rarely 3 operands separated by commas. Typically, all operands are sources and one operand might also be the destination.

So, for example, if you take a look at this add instruction, the operation is an add. And the operand list is these two registers. One is edi and the other is ecx. And the destination is the second one.

When you add-- in this case, what's going on is it's taking the value in ecx, adding the value in edi into it. And the result is in ecx. Yes?

AUDIENCE: Is there a convention for where the destination [INAUDIBLE]

CHARLES LEISERSON: Funny you should ask. Yes. So what does op A, B mean? It turns out that the literature is inconsistent about how it refers to operations. And there are two major syntaxes that are used. One is the AT&T syntax, and the other is the Intel syntax.

So in the AT&T syntax, the second operand-- the last operand-- is the destination. In the Intel syntax, the first operand is the destination. OK, is that confusing?

So almost all the tools that we're going to use are going to use the AT&T syntax. But you will read documentation, which is Intel documentation. It will use the other syntax. Don't get confused. OK? I can't help-- it's like I can't help that this is the way the state of the world is. OK? Yeah?

AUDIENCE: Are there tools that help [INAUDIBLE]

CHARLES LEISERSON: Oh, yeah. In particular, if you could compile it and undo, but I'm sure there's-- I mean, this is not a hard translation thing. I'll bet if you just Google, you can in two minutes, in two seconds, find somebody who will translate from one to the other. This is not a complicated translation process.
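
To make the two syntaxes concrete, here's a tiny sketch. The registers in the comments follow the System V calling convention we use on Linux, and the instruction shown is just the typical pattern, not guaranteed compiler output.

    // AT&T syntax (what clang and objdump print by default):
    //     addl %edi, %esi        # esi = esi + edi   -- last operand is the destination
    // Intel syntax (what the Intel manuals use):
    //     add  esi, edi          ; esi = esi + edi   -- first operand is the destination
    int add_two(int a, int b) {   // a arrives in %edi, b in %esi
        return a + b;
    }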

Now, here are some very common x86 opcodes. And so let me just mention a few of these, because these are ones that you'll often see in the code.

So move, what do you think move does?

AUDIENCE: Moves something.

CHARLES LEISERSON: Yeah, it puts something in one register into another register. Of course, when it moves it, this is computer science move, not real move. When I move my belongings in my house to my new house, they're no longer in the old place, right? But in computer science, for some reason, when we move things we leave a copy behind. So they may call it move, but--

AUDIENCE: Why don't they call it copy?

CHARLES LEISERSON: Yeah, why don't they call it copy? You got me. OK, then there's conditional move. So this is move based on a condition-- and we'll see some of the ways that this is-- like move if flag is equal to 0 and so forth, so basically conditional move. It doesn't always do the move.

Then you can extend the sign. So, for example, suppose you're moving from a 32-bit register into a 64-bit register. Then the question is, what happens to the high-order bits? There are two basic mechanisms that can be used. Either they can be filled with zeros, or-- remember that the first bit, the leftmost bit as we think of it, is the sign bit in two's-complement representation-- that bit can be extended through the high-order part of the word, so that if the number is negative, the whole number will be negative, and if it's positive, the high bits will be zeros and so forth. Does that make sense?

Then there are things like push and pop to do stacks. There's a lot of integer arithmetic. There's addition, subtraction, multiplication, division, various shifts, address calculation shifts, rotations, incrementing, decrementing, negating, etc. There's also a lot of binary logic, AND, OR, XOR, NOT. Those are all doing bitwise operations.

And then there is Boolean logic, like testing to see whether some value has a given value or comparing. There's unconditional jump, which is jump. And there's conditional jumps, which is jump with a condition. And then things like subroutines. And there are a bunch more, which the manual will have and which will undoubtedly show up. Like, for example, there's the whole set of vector operations we'll talk about a little bit later.

Now, the opcodes may be augmented with a suffix that describes the data type of the operation or a condition code. Opcodes for data movement, arithmetic, or logic use a single-character suffix to indicate the data type. And if the suffix is missing, it can usually be inferred.

So take a look at this example. So this is a move with a q at the end. What do you think q stands for?

AUDIENCE: Quad words?

CHARLES LEISERSON: Quad word. OK, how many bytes in a quad word?

AUDIENCE: Eight.

CHARLES LEISERSON: Eight. That's because originally it started out with a 16-bit word. So they said a quad word was four of those 16-bit words. So that's 8 bytes. You get the idea, right?

But let me tell you this is all over the x86 instruction set. All these historical things and all these mnemonics that if you don't understand what they really mean, you can get very confused. So in this case, we're moving a 64-bit integer, because a quad word has 8 bytes or 64 bits.

This is one of my favorite parts-- whenever I prepare this lecture, I just go into spasms of laughter as I look at it and say, oh my god, they really did that. For example, on the last page, when I did subtract. The sub operator, if it's a two-argument operator, basically takes the second operand minus the first one and puts the result in the second one-- the destination is the second one. But there is no way of subtracting the other way around.

But if you wanted to have it the other way around, to save yourself a cycle-- anyway, it doesn't matter. You can't do it that way. And all this stuff the compiler has to understand.

So here are the x86-64 data types. The way I've done it is to show you the difference between C and x86-64, so for example, here are the declarations in C. So there's a char, a short, int, unsigned int, long, etc. Here's an example of a C constant that does those things. And here's the size in bytes that you get when you declare that.

And then the assembly suffix is one of these things. So in the assembly, it's a b for a byte, a w for a word, an l or d for a double word, a q for a quad word, i.e., 8 bytes, and then suffixes for single precision, double precision, and extended precision floating point.
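
Here's that correspondence written out as a sketch in C; the sizes are for the x86-64 machines we're using, and the instructions in parentheses are just typical examples.

    // C declaration          size      assembly suffix
    char        c;         // 1 byte    b       (e.g., movb)
    short       s;         // 2 bytes   w       (e.g., movw)
    int         i;         // 4 bytes   l or d  (e.g., movl)
    long        l;         // 8 bytes   q       (e.g., movq)
    float       f;         // 4 bytes   s       (e.g., movss)
    double      d;         // 8 bytes   d       (e.g., movsd)
    long double x;         // x87 80-bit extended precision: t  (e.g., fldt)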

So sign-extension moves use two data-type suffixes. So here's an example. The first one says we're going to move-- and now you see I can't read this without my cheat sheet. So what is this saying?

This is saying, we're going to move with a zero-extend, and the first operand is a byte and the second operand is a long. Is that right? If I'm wrong, it's like I've got to look at the chart too. And, of course, we don't hold you to that.

But the z there says extend with zeros, and the s says preserve the sign. So those are the two.

Now, that would all be well and good, except for what they then did: if you do a 32-bit operation whose destination is a 64-bit register, it implicitly zero-extends the result. If you do it for smaller values, it simply overwrites the low part of the register and doesn't touch the high-order bits. But when they did the 32 to 64-bit extension of the instruction set, they decided that they wouldn't do what had been done in the past, and that they would zero-extend things unless there was something explicit to the contrary. You got me, OK.

Yeah, I have a friend who worked at Intel. And he had a joke about the Intel instruction set. You'll discover the Intel instruction set is really complicated. He says, here's the idea of the Intel instruction set: to become an Intel Fellow, you need to have an instruction in the Intel instruction set-- an instruction that you invented and that's now used by Intel. He says nobody becomes an Intel Fellow for removing instructions. So it just grows and grows and grows and gets more and more complicated for each thing.

Now, once again, for extension you can zero-extend or sign-extend. And here are two examples: in one case, moving an 8-bit integer to a 32-bit integer and zero-extending it, versus preserving the sign.
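
Here's a hedged sketch of what that looks like from C; the instructions in the comments are what clang typically emits, though the exact registers may differ.

    // Widening an unsigned byte zero-extends; widening a signed byte sign-extends.
    unsigned int widen_unsigned(unsigned char b) {
        return b;        // typically: movzbl %dil, %eax   (z = fill with zeros)
    }

    int widen_signed(signed char b) {
        return b;        // typically: movsbl %dil, %eax   (s = preserve the sign)
    }

    // And a 32-bit result written to %eax implicitly zero-extends into all of %rax,
    // which is the quirky 32-to-64-bit rule just described.
    unsigned long widen_unsigned_int(unsigned int x) {
        return x;        // typically just: movl %edi, %eax
    }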

Conditional jumps and conditional moves also use suffixes to indicate the condition code. So here, for example, the ne indicates the jump should only be taken if the arguments of the previous comparison are not equal. So ne is not equal. So you do a comparison, and that's going to set a flag in the RFLAGS register. Then the jump will look at that flag and decide whether it's going to jump or just continue the sequential execution of the code.

And there are a bunch of things that you can jump on which are status flags. And you can see the names here. There's Carry. There's Parity. Parity is the XOR of all the bits in the word. Adjust, I don't even know what that's for. There's the Zero flag. It tells whether it's a zero. There's a Sign flag, whether it's positive or negative. There's a Trap flag and Interrupt enable and Direction, Overflow. So anyway, you can see there are a whole bunch of these.

So, for example here, this is going to decrement rbx, and that sets the Zero flag if the result is zero. And then the conditional jump jumps to the label if the Zero flag is not set, in this case. OK, does that make sense? After a fashion. It doesn't make rational sense, but it does make sense.

Here are the main ones that you're going to need. The Carry flag is whether you got a carry or a borrow out of the most significant bit. The Zero flag is whether the ALU operation was 0. The Sign flag is whether the last ALU operation had the sign bit set. And the Overflow flag says it resulted in arithmetic overflow.

The condition codes are-- if you put one of these condition codes on your conditional jump or whatever, this tells you exactly which flags are being checked. So, for example, the easy ones are if it's equal. But there are some other ones there.

So, for example, why do the condition codes e and ne check the Zero flag? And the answer is that typically, rather than having a separate comparison, what they've done is separate the branch from the comparison itself. But it also needn't be a compare instruction. It could be that the result of the last arithmetic operation was a zero, and therefore it can branch without having to do a comparison with zero.

So, for example, if you have a loop where you're decrementing a counter until it gets to 0, that's actually faster by one instruction-- the decrement itself tells you whether the loop index hit 0-- than if you have the loop counting up to n, and then every time through the loop you have to compare with n before you can branch.
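
Here's a sketch of that counting-down trick; the assembly in the comments is the typical pattern rather than any particular compiler's output.

    // Counting down lets the branch use the flags that the decrement already set:
    //
    //   loop:  ...body...
    //          decq %rbx         # decrement the counter; sets the Zero flag at 0
    //          jnz  loop         # branch on that flag -- no separate cmp needed
    //
    // Counting up to n needs an extra compare every iteration:
    //
    //   loop:  ...body...
    //          incq %rbx
    //          cmpq %rdi, %rbx   # compare the index against n
    //          jne  loop
    long sum_backward(const long *a, long n) {
        long s = 0;
        while (n-- > 0) s += a[n];
        return s;
    }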

So these days that optimization doesn't mean anything, because, as we'll talk about in a little bit, these machines are so powerful, that doing an extra integer arithmetic like that probably has no bearing on the overall cost. Yeah?

AUDIENCE: So this instruction doesn't take arguments? It just looks at the flags?

CHARLES LEISERSON: Just looks at the flags, yep. Just looks at the flags. It doesn't take any arguments.

Now, the next aspect of this is you can give registers, but you also can address memory. And there are three direct addressing modes and three indirect addressing modes. At most, one operand may specify a memory address.

So here are the direct addressing modes. So for immediate what you do is you give it a constant, like 172, random constant, to store into the register, in this case. That's called an immediate. What happens if you look at the instruction, if you look at the machine language, 172 is right in the instruction. It's right in the instruction, that number 172.

Register says we'll move the value from the register, in this case, %cx. And then the index of the register is put in that part.

And direct memory says use a particular memory location, and you can give a hex value. When you do direct memory, it's going to use the value at that place in memory. And to indicate that memory location is going to take you, on a 64-bit machine, 64 bits-- 8 bytes-- to specify. Whereas, for example, the 172 in the movq will fit in 1 byte. So I'll have spent a lot less storage to do it. Plus, I can do it directly from the instruction stream, and I avoid having an access to memory, which is very expensive.

So how many cycles does it take if the value that you're fetching from memory is not in cache or a register? If I'm fetching something from memory, how many cycles of the machine does it typically take these days? Yeah.

AUDIENCE: A few hundred?

CHARLES LEISERSON: Yeah, a couple of hundred or more, yeah, a couple hundred cycles to fetch something from memory. It's not that memory is so slow-- it's that the processors are so fast.

And so clearly, if you can get things into registers-- most registers you can access in a single cycle. So we want to move things close to the processor, operate on them, and shove them back. And while we're pulling things from memory, we want other things to be working on. And the hardware is all organized to do that.

Now, of course, we spend a lot of time fetching stuff from memory. And that's one reason we use caching-- caching is really important. We're going to spend a bunch of time on how to get the best out of your cache.

There's also indirect addressing. So instead of just giving a location, you say, let's go to some other place-- for example, a register-- and the address we want is the value stored there. So, for example here, register indirect says move the contents of the location whose address is in rax into rdi. So if rax held 172, then it would take whatever is in location 172 and put it in rdi.

Register indexed says do the same thing, but while you're at it, add an offset. So once again, if rax had 172, in this case it would go to 344 to fetch the value out of location 344 for this particular instruction.

And then there's instruction-pointer relative addressing: instead of indexing off of a general purpose register, you index off the instruction pointer. That usually happens in the code itself-- for example, you can jump to where you are in the code plus four instructions. So you can jump down some number of instructions in the code.

Usually, you'll see that used only for control, because you're branching relative to where you are. But sometimes they'll put some data in the instruction stream, and then the code can index off the instruction pointer to get at those values without having to soil another register.

Now, the most general form is base indexed scale displacement addressing. Wow. This is a move that has a constant plus three terms. And this is the most general addressing mode that is supported.

The address is the base plus the index times the scale plus the displacement. The base is a general purpose register, in this case rdi. Then it adds the index times the scale, where the scale is 1, 2, 4, or 8. And then there's the displacement, which is that number on the front. And this gives you very general indexing of things off of a base pointer.

You'll often see this kind of addressing when you're accessing stack memory, because you can say, here is the base of my frame on the stack, and now for anything that I want to access, I'm going to go up a certain amount, maybe scaling by a certain amount, to get the value that I want. So once again, you will become familiar with the manual. You don't have to memorize all of these, but you do have to understand that there are a lot of these complex addressing modes.
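
Here's a sketch of where these modes come from in C; the instructions in the comments are typical of what clang emits for code like this, with the registers following the calling convention.

    typedef struct { long pad; long val; } item_t;   // hypothetical struct for illustration

    long addressing_examples(long *p, long *a, long i, item_t *it) {
        long x = 172;    // immediate:            movq $172, %rax
        x += *p;         // register indirect:    addq (%rdi), %rax
        x += a[i];       // base + index*scale:   addq (%rsi,%rdx,8), %rax
        x += it->val;    // base + displacement:  addq 8(%rcx), %rax
        return x;
    }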

The jump instructions take a label as their operand, which identifies a location in the code. The label can be a symbol-- in other words, you can say here's a symbol that I want to jump to. It might be the beginning of a function, or it might be a label that's generated to be at the beginning of a loop or whatever. It can be an exact address-- go to this place in the code. Or it can be a relative address-- jump to some place that's indexed off the instruction pointer, as I mentioned.

And then an indirect jump takes as its operand an indirect address-- oops, that's a typo on the slide; it just takes an indirect address as its operand. So basically, you can say, jump to whatever is pointed to by that register, using whatever indexing method you want.
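
And here's a small sketch of where an indirect jump or call comes from-- a hypothetical function-pointer example; whether the compiler emits a call or a tail-call jump depends on the context.

    // Calling through a function pointer typically becomes an indirect call or jump,
    // e.g.,  callq *%rsi   or, as a tail call,  jmpq *%rsi
    long apply(long x, long (*f)(long)) {
        return f(x);
    }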

So that's kind of the overview of the assembly language. Now, let's take a look at some idioms. So the XOR opcode computes the bitwise XOR of A and B. We saw XOR was a great trick for swapping numbers, for example, the other day.

So often in the code, you will see something like this: xor %rax, %rax. What does that do? Yeah.

AUDIENCE: Zeros the register.

CHARLES LEISERSON: It zeros the register. Why does that zero the register?

AUDIENCE: Is the XOR just the same?

CHARLES LEISERSON: Yeah, it's basically taking rax, xor-ing it with itself-- and when you XOR something with itself, you get zero-- and storing that back into rax. So that's actually how you zero things. So you'll see that.

Whenever you see that, hey, what are they doing? They're zeroing the register. And that's actually quicker and easier than having a zero constant that they put into the instruction. It saves a byte, because this ends up being a very short instruction. I don't remember how many bytes that instruction is.
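
Here's a sketch of where you see it; the comments show the typical clang output for returning zero, as an illustration rather than a guarantee.

    // Returning 0 typically compiles to the xor idiom rather than loading a constant:
    //     xorl %eax, %eax    # 2 bytes; zeroing %eax also zeroes all of %rax
    //     retq
    // compared with  movl $0, %eax, which takes 5 bytes to encode the constant.
    long zero(void) {
        return 0;
    }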

Here's another one, the test opcode. test A, B computes the bitwise AND of A and B and discards the result, setting the RFLAGS register. So what does the test instruction in these examples do? What is the first one doing? So it takes rcx-- yeah.

AUDIENCE: Does it jump? It jumps to [INAUDIBLE] rcx [INAUDIBLE]

CHARLES LEISERSON: So it takes the bitwise AND of A and B. And so then it's saying jump if equal. So--

AUDIENCE: An AND would be non-zero in any of the bits set.

CHARLES LEISERSON: Right. AND is non-zero if any of the bits are set.

AUDIENCE: Right. So if the zero flag were set, that means that rcx was zero.

CHARLES LEISERSON: That's right. So if the Zero flag is set, then rcx is zero. So this is going to jump to that location if rcx holds the value 0. In all the other cases, it won't set the Zero flag, because the result of the AND will be nonzero. So once again, that's kind of an idiom that they use.

What about the second one? So this is a conditional move. So both of them are basically checking to see if the register is 0. And then doing something if it is or isn't. But those are just idioms that you sort of have to look at to see how it is that they accomplish their particular thing.
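
Here's a sketch of C code that tends to produce those two idioms; the instruction sequences in the comments are illustrative, and clang's exact choices depend on optimization flags.

    // Null check: typically   testq %rdi, %rdi   followed by   je <label>
    long deref_or_zero(const long *p) {
        if (p == 0) return 0;
        return *p;
    }

    // Simple select: often compiled branch-free with a conditional move, e.g.,
    //     testq   %rdi, %rdi
    //     cmovneq %rsi, %rax     # take a if flag was nonzero
    long select(long flag, long a, long b) {
        return flag ? a : b;
    }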

Here's another one. So the ISA can include several no-op, no-operation, instructions, including nop, nop A-- that's a no-op with an argument-- and data16, which sets aside 2 bytes of a nop. So here's a line of assembly that we found in some of our code: data16 data16 data16 nopw, followed by an operand off %cs with all this address calculation in it. So nopw is going to take that argument.

So what do you think this is doing? What's the effect of this, by the way? They're all no-ops. So the effect is? Nothing. The effect is nothing. It doesn't even touch the RFLAGS. Basically, it does nothing.

Why would a compiler generate assembly with these idioms? Why would you get that kind of-- that's crazy, right? Yeah.

AUDIENCE: Could it be doing some cache optimization?

CHARLES LEISERSON: Yeah, it's typically doing alignment optimization, or optimizing code size. So it may want to start the next instruction at the beginning of a cache line. And, in fact, there's a directive to do that. If you want all your functions to start at the beginning of a cache line, it wants to make sure that if the code gets to that point, it just proceeds through memory. So mainly it's there to optimize memory alignment. So you'll see those things. You just have to realize, oh, that's the compiler generating some no-ops.

So that's sort of our brief excursion over assembly language, x86 assembly language. Now, I want to dive into floating-point and vector hardware, which is going to be the main part. And then if there's any time at the end, I'll show the slides-- I have a bunch of other slides on how branch prediction works and a variety of other machine things-- and if we don't get to them, it's no problem. You can take a look at the slides, and there's also the architecture manual.

So, floating-point instruction sets. Mostly the scalar floating-point operations are accessed via a couple of different instruction sets. The history of floating point is interesting, because originally the 8086 did not have a floating-point unit. Floating point was done in software. And then they made a companion chip that would do floating point. And then they started integrating it and so forth as miniaturization took hold.

So the SSE and AVX instructions do both single and double precision scalar floating point, i.e., floats or doubles. And then the x87 instructions-- that's from the 8087 that was attached to the 8086, and that's where they get the name-- support single, double, and extended precision scalar floating-point arithmetic, including float, double, and long double. So you can actually get a great big result of a multiply if you use the x87 instruction set. And they also include vector instructions, so you can multiply or add there as well-- so there are all these places on the chip where you can decide to do one thing or another.

Compilers generally prefer the SSE instructions over the x87 instructions, because they're simpler to compile for and to optimize. And the SSE opcodes are similar to the normal x86 opcodes. They use the XMM registers and floating-point types. And so you'll see stuff like this, where you've got a movsd and so forth. The suffix there is saying what the data type is. In this case, it's saying it's a double-precision floating-point value, i.e., a double.

Once again, they're using suffix. The sd in this case is a double precision floating-point. The other option is the first letter says whether it's single, i.e. a scalar operation, or packed, i.e. a vector operation. And the second letter says whether it's single or double precision. And so when you see one of these operations, you can decode, oh, this is operating on a 64-bit value or a 32-bit value, floating-point value, or on a vector of those values.
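
Here's a sketch of how those suffixes line up with the C types; the instructions in the comments are the typical scalar SSE output rather than guaranteed compiler behavior.

    // Scalar double precision: typically   addsd %xmm1, %xmm0    (s = scalar, d = double)
    double add_double(double a, double b) {
        return a + b;
    }

    // Scalar single precision: typically   addss %xmm1, %xmm0    (s = scalar, s = single)
    float add_float(float a, float b) {
        return a + b;
    }

    // The packed forms, addpd and addps, do the same operation across a whole vector register.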

Now, what about these vectors? So when you start using the packed representation and you start using vectors, you have to understand a little bit about the vector units that are on these machines. So the way a vector unit works is that there is the processor issuing instructions. And it issues the instructions to all of the vector units.

So for example, if you take a look at a typical one, you may have a vector width of four vector units. Each of them is often called a lane-- l-a-n-e-- and k is the vector width. And so when the instruction is given, it's given to all of the vector units, and they all do it on their own local copy of the register.

So the register you can think of as a very wide thing broken into several words. And when I say add two vectors together, it'll add the four corresponding words together and store them back into another vector register-- or whatever k is; in the example I just gave, k was 4. And the lanes are the things that each contain the integer or floating-point arithmetic units.

But the important thing is that they all operate in lock step. It's not like one is going to do one thing and another is going to do another thing. They all have to do exactly the same thing. And the basic idea here is for the price of one instruction, I can command a bunch of operations to be done.

Now, generally, vector instructions operate in an element-wise fashion, where you take the i-th element of one vector and operate on it with the i-th element of another vector. And all the lanes perform exactly the same operation. Depending upon the architecture, the operands may need to be aligned-- that is, the vectors have to start at the exact same position in memory, a multiple of the vector length. There are others where the vectors can be shifted in memory.

Usually, there's a performance difference between the two. Some machines will not support unaligned vector operations at all, so if the compiler can't figure out that the data is aligned, your code will end up being executed in a scalar fashion. If the machine does support unaligned vector operations, they're usually slower than aligned ones. And some machines now actually have good performance on both. So it really depends upon the machine.

And then also there are some architectures that will support cross-lane operations, such as inserting or extracting subsets of vector elements, permuting, shuffling, scatter, gather-- those types of operations.

So x86 supports several vector instruction sets, as I mentioned. There's SSE. There's AVX. There's AVX2. And then there's now AVX-512, sometimes called AVX3, which is not available on the Haswell machines that we'll be using.

Generally, AVX and AVX2 extend the SSE instruction set by using wider registers and more operands. The SSE instructions use the 128-bit XMM registers and operate on at most two operands. The AVX ones can use the 256-bit YMM registers and also have three operands, not just two. So you can say add A to B and store it in C, as opposed to saying add A to B and store it in B.

Yeah, most of them are similar to traditional opcodes, with minor differences. So if you look at an SSE instruction, it basically looks just like the traditional name, like add in this case, but you can then say do a packed add, i.e., a vector add with packed data. The v prefix means it's AVX. So if you see a v, you go to the part of the manual that says AVX; if it doesn't have the v, you go to SSE. And the p says it's packed data.

And as for why p is the prefix distinguishing integer vector instructions, you got me. I tried to think, why is p the letter for integer? It's like, p-- good mnemonic for integer, right?

Then in addition, they do this aliasing trick again, where the YMM registers actually alias the XMM registers. So you can use both operations, but you've got to be careful what's going on, because they just extended them. And now, of course, with AVX-512, they did another extension to 512 bits.

That's the vector stuff. So you can use those explicitly, or the compiler will vectorize for you. And the homework this week takes you through some vectorization exercises. It's actually a lot of fun. We were just going over it in a staff meeting, and I think it's a really fun exercise. We introduced that last year, by the way, or maybe two years ago. But, in any case, it's a fun one-- for my definition of fun, which I hope is your definition of fun.
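
Here's a small sketch of both styles-- letting the compiler vectorize a plain loop, and writing the vector code explicitly with the AVX intrinsics from immintrin.h. The claim that clang vectorizes the first loop at -O3 is typical behavior rather than a guarantee, and the second one assumes n is a multiple of 4 and that AVX is enabled (e.g., -mavx or -march=native).

    #include <immintrin.h>

    // Plain C: at -O3, clang will often turn this into vaddpd on ymm registers.
    void add_arrays(double *restrict c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    }

    // The same thing written explicitly with AVX intrinsics.
    void add_arrays_avx(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);            // unaligned load of 4 doubles
            __m256d vb = _mm256_loadu_pd(b + i);
            _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb)); // packed double-precision add
        }
    }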

Now, I want to talk generally about computer architecture. And I'm not going to get through all of these slides, as I say. But I want to get started and give you a sense of the other things going on in the processor that you should be aware of.

So in 6.004, you probably talked about a 5-stage processor. Anybody remember that? OK, 5-stage processor. There's an Instruction Fetch. There's an Instruction Decode. There's an Execute. Then there's a Memory Addressing. And then you Write back the values.

And this is done as a pipeline, so as to make-- you could do all of this in one thing, but then you have a long clock cycle. And you'll only be able to do one thing at a time. Instead, they stack them together.

So here's a block diagram of the 5-stage processor. We read the instruction from memory in the instruction fetch cycle. Then we decode it-- basically, it takes a look at what the opcode is, what the addressing modes are, et cetera, and figures out what it actually has to do. The execute stage actually performs the ALU operations. And then it reads and writes the data memory. And then it writes back the results into registers. That's typically a common way that these things go for a 5-stage processor.

By the way, this is vastly oversimplified. You can take 6.823 if you want to learn the truth. I'm going to tell you nothing but white lies for this lecture.

Now, if you look at the Intel Haswell, the machine that we're using, it actually has between 14 and 19 pipeline stages. The 14 to 19 reflects the fact that there are different paths through it that take different amounts of time. It also I think reflects a little bit that nobody has published the Intel internal stuff. So maybe we're not sure if it's 14 to 19, but somewhere in that range. But I think it's actually because the different lengths of time as I was explaining.

So what I want to do is-- you've seen the 5-stage pipeline. I want to talk about the difference between that and a modern processor by looking at several design features. We already talked about vector hardware. I then want to talk about superscalar processing, out-of-order execution, and branch prediction a little bit. And for out-of-order execution, I'm going to skip a bunch of it, because it has to do with scoreboarding, which is really interesting and fun, but it's also time consuming. But it's really interesting and fun. That's what you learn in 6.823.

So historically, there are two ways that people make processors go faster-- by exploiting parallelism and by exploiting locality. For parallelism-- well, we already did word-level parallelism in the bit tricks lecture. But there's also instruction-level parallelism, so-called ILP, vectorization, and multicore. And for locality, the main thing that's used there is caching.

I would say also the fact that you have a design with registers also reflects locality, because the way that the processor wants to do things is fetch stuff from memory-- it doesn't want to operate on it in memory; that's very expensive-- it wants to fetch things into registers, get enough of them there that you can do a whole bunch of calculations, and then put them back out there.

So this lecture we're talking about ILP and vectorization. So let me talk about instruction-level parallelism.

So when you have, let's say, a 5-stage pipeline, you're interested in finding opportunities to execute multiple instructions simultaneously. So instruction 1 is going to do an instruction fetch, then it does its decode, and so on-- it takes five cycles for this instruction to complete.

So ideally what you'd like is that you can start instruction 2 on cycle 2, instruction 3 on cycle 3, and so forth, and have 5 instructions-- once you get into the steady state, have 5 instructions executing all the time. That would be ideal, where each one takes just one thing. So that's really pretty good.

And that would improve the throughput. Even though it might take a long time to get any one instruction done, I can have many instructions in the pipeline at a time, with each pipeline stage executing a different instruction.

However, in practice this isn't what happens. In practice, you discover that there are what are called pipeline stalls. When it comes time to execute an instruction, for some correctness reason it cannot execute, and it has to wait. That's a pipeline stall. That's what you want to try to avoid, and the compiler tries to produce code that will avoid stalls.

So why do stalls happen? They happen because of what are called hazards. There's actually two notions of hazard. And this is one of them. The other is a race condition hazard. This is dependency hazard. But people call them both hazards, just like they call the second stage of compilation compiling. It's like they make up these words.

So here are three types of hazards that can prevent an instruction from executing. First of all, there's what's called a structural hazard: two instructions attempt to use the same functional unit at the same time. If there's, for example, only one floating-point multiplier and two instructions try to use it at the same time, one has to wait. In modern processors, there's a bunch of each of those units. But if you have k functional units and k plus 1 instructions want to access them, you're out of luck. One of them is going to have to wait.

The second is a data hazard. This is when an instruction depends on the result of a prior instruction in the pipeline. So one instruction is computing a value that is going to stick in rcx, say. So they stick it into rcx. The other one has to read the value from rcx and it comes later. That other instruction has to wait until that value is written there before it can read it. That's a data hazard.

And a control hazard is where you decide that you need to make a jump and you can't execute the next instruction, because you don't know which way the jump is going to go. So if you have a conditional jump, it's like, well, what's the next instruction after that jump? I don't know. So I have to wait to execute that. I can't go ahead and do the jump and then do the next instruction after it, because I don't know what happened to the previous one.

Now, of these, we're going to mostly talk about data hazards. So an instruction i can create a data hazard for an instruction j due to a dependence between i and j. The first type is called a true dependence, or read-after-write dependence.

And this is where, as in this example, I'm adding something and storing into rax and the next instruction wants to read from rax. So the second instruction can't get going until the previous one or it may stall until the result of the previous one is known.

There's another one called an anti-dependence. This is where I want to write into a location, but I have to wait until the previous instruction has read the value, because otherwise I'm going to clobber the value before it gets read. So that's an anti-dependence.

And then the final one is an output dependence, where two instructions are both trying to move something into rax. So why would two things want to move things to the same location? After all, one of them is going to be lost-- why not just skip that instruction? Why wouldn't--

AUDIENCE: Set some flags.

CHARLES LEISERSON: Yeah, maybe because it wants to set some flags. So that's one reason that it might do this, because you know the first instruction set some flags in addition to moving the output to that location. And there's one other reason. What's the other reason? I'm blanking. There's two reasons. And I didn't put them in my notes. I don't remember. OK, but anyway, that's a good question for quiz then. OK, give me two reasons-- yeah.

AUDIENCE: Can there be intermediate instructions like between those [INAUDIBLE]

CHARLES LEISERSON: There could, but of course then if it's going to use that register, then-- oh, I know the other reason. So this is still good for a quiz.

The other reason is there may be aliasing going on. Maybe an intervening instruction uses one of the registers that aliases part of the result or whatever-- there still could be a dependency.
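
In code terms, here's a hedged sketch of the three kinds of dependence; the assembly in the comments is illustrative rather than any particular compiler's output.

    // True (read-after-write):    addq %rdi, %rax      # write rax
    //                             imulq %rax, %rsi     # read rax -- must wait for the add
    //
    // Anti (write-after-read):    movq %rax, %rbx      # read rax
    //                             movq %rdi, %rax      # write rax -- must wait for the read
    //
    // Output (write-after-write): movq %rdi, %rax      # write rax
    //                             movq %rsi, %rax      # write rax again -- must stay ordered
    long dependence_demo(long a, long b) {
        long x = a + b;   // write x
        long y = x * a;   // read x: true dependence on the line above
        x = b - a;        // write x again: anti- and output dependence
        return x + y;
    }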

Anyway, some arithmetic operations are complex to implement in hardware and have long latencies. So here are some sample opcodes and their latencies-- they take different numbers of cycles. For example, integer division is actually variable, but a multiply takes about three times as long as most of the other integer operations. And floating-point multiply is like 5. And then fma-- what's fma? Fused multiply-add. This is where you're doing both a multiply and an add. And why do we care about fused multiply-adds?

AUDIENCE: For memory accessing and [INAUDIBLE]

CHARLES LEISERSON: Not for memory accessing. This is actually floating-point multiply and add. It's called linear algebra. So when you do matrix multiplication, you're doing dot products-- you're doing multiplies and adds. So that kind of thing is where you do a lot of those.
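
Here's a sketch of where a fused multiply-add shows up from C. fma() is the standard C99 math function; whether it actually becomes a single vfmadd instruction depends on the target and flags (e.g., -mfma), so treat the comment as the typical case.

    #include <math.h>

    // One step of a dot product: with FMA hardware this can become a single
    // vfmadd instruction instead of a separate multiply and add.
    double fma_step(double acc, double a, double b) {
        return fma(a, b, acc);   // computes acc + a*b with a single rounding
    }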

So how does the hardware accommodate these complex operations? So the strategy that much hardware tends to use is to have separate functional units for complex operations, such as floating-point arithmetic. So there may be in fact separate registers, for example, the XMM registers, that only work with the floating point.

So you have your basic 5-stage pipeline, and then you have another pipeline that's off on the side. It's going to take multiple cycles sometimes, and it may be pipelined to a different depth. And so you basically separate these operations. The functional units may be pipelined fully, partially, or not at all. And so I now have a whole bunch of different functional units, and there are different paths that I'm going to be able to take through the data path of the processor.

So in Haswell, they have the integer, vector, and floating-point functional units distributed among eight different ports, which are sort of the entry point. So given that, things get really complicated. If we go back to our simple diagram: suppose we have all these additional functional units. How can I now exploit more instruction-level parallelism? Right now, we can start one operation at a time. What might I do to get more parallelism out of the hardware that I've got? What do you think computer architects did? OK.

AUDIENCE: It's a guess but, you could glue together [INAUDIBLE]

CHARLES LEISERSON: Yeah, so even simpler than that, but which is implied in what you're saying, is you can just fetch and issue multiple instructions per cycle. So rather than just doing one per cycle as we showed with a typical pipeline processor, let me fetch several that use different parts of the processor pipeline, because they're not going to interfere, to keep everything busy. And so that's basically what's called a super scalar processor, where it's not executing one thing at a time. It's executing multiple things at a time.

So Haswell, in fact, breaks up the instructions into simpler operations, called micro-ops. And it can emit four micro-ops per cycle to the rest of the pipeline. And the fetch and decode stages implement optimizations on micro-op processing, including special cases for common patterns.

For example, if it sees the XOR of rax and rax, it knows that rax is being set to 0. It doesn't even use a functional unit for that. It just does it, and it's done. It has special logic that spots that, because it's such a common way to zero things. And so that means that your processor can execute a lot of things at one time. And those are the machines that you're using.

That's why I said that if you save one add instruction, it probably doesn't make any difference on today's processors, because there's probably an idle adder lying around. Where do we go here? Yeah, so if you look here, you can discover that there are actually a bunch of ALUs that are capable of doing an add. They're all over the map in Haswell.

Now, still, we are insisting that the processor execute things in order. And that's kind of the next stage: how do you make it so that you can free yourself from the tyranny of executing one instruction after the other? And so the first thing is a strategy called bypassing.

So suppose that you have an instruction that writes into rax, and then the next instruction is going to read from rax. Well, why bother waiting for the value to be stored into the register file and then pulled back out for the second instruction? Instead, let's have a bypass-- a special circuit that identifies that kind of situation and feeds the value directly to the next instruction without requiring that it go into the register file and back out. So that's called bypassing. There are lots of places where things are bypassed. And we'll talk about it more.

So normally, you would stall waiting for the value to be written back. And when you eliminate that wait, now I can move the second instruction way forward, because it just uses the bypass path. And that allows the second instruction to get going earlier.

What else can we do? Well, let's take a large code example. Given the amount of time, what I'm going to do is basically say, you can go through and figure out what are the read after write dependencies and the write after read dependencies. They're all over the place. And what you can do is if you look at what the dependencies are that I just flashed through, you can discover, oh, there's all these things. Each one right now has to wait for the previous one before it can get started.

But there are some-- for example, the first one is just issue order. You can't start the second-- if it's in order, you can't start the second till you've started the first, that it's finished the first stage. But the other thing here is there's a data dependence between the second and third instructions. So if you look at the second and third instructions, they're both using XMM2. And so we're prevented.

So one of the questions there is, well, why not do a little bit better by taking a look at this as a graph and figuring out what's the best way through the graph? And there are a bunch of tricks you can do there, which I'll run through here very quickly. And you can take a look at these.

You can discover that some of these dependencies are not real dependences. And as long as you're willing to execute things out of order and keep track of that, it's perfectly fine. If you're not actually dependent on it, then just go ahead and execute it. And then you can advance things.

And then the other trick you can use is what's called register renaming. If I want to write to something, but I have to wait for something else to read from it-- the write-after-read dependence-- then what I can do is just rename the register, so that I have a different register to write to. And there's a very complex mechanism called scoreboarding that does that.

So anyway, you can take a look at all of these tricks. And then the last thing that I want to-- well, this is the part I was going to skip over, and indeed, I don't have time to do it.

I just want to mention the last thing, which is worthwhile. So this-- you don't have to know any of the details of that part. But it's in there if you're interested. So it does renaming and reordering.

And then the last thing I do want to mention is branch prediction. So with a branch, you can have a hazard because the outcome is known too late. And in that case, what they do is what's called speculative execution, which you've probably heard of.

So basically, that says I'm going to guess the outcome of the branch and execute ahead. If a branch is encountered, you assume it's taken and you execute normally. If you're right, everything is hunky-dory. If you're wrong, you have to undo that speculative computation, and the effect is sort of like stalling. So you don't want that to happen.

And so a mispredicted branch on Haswell costs about 15 to 20 cycles. Most machines use a branch predictor to tell whether or not the branch is going to be taken. There's a little bit of stuff here about how you can tell whether a branch is going to be predicted well or not, and you can take a look at that on your own.

So sorry to rush a little bit at the end, but I knew I wasn't going to get through all of this. But it's in the notes, in the slides, when we put them up. And this is really kind of interesting stuff.

Once again, remember that I'm dealing with this at one level below what you really need to do. But it is really helpful to understand that layer so you have a deep understanding of why certain software optimizations work and don't work. Sound good? OK, good luck on finishing your project 1's.