The following content is provided under a Creative
Commons license.
Your support will help MIT OpenCourseWare
continue to offer high quality educational resources for free.
To make a donation or to view additional materials
from hundreds of MIT courses, visit MIT OpenCourseWare
at ocw.mit.edu.
CHARLES LEISERSON: So today, we're
going to talk about assembly language and computer
architecture.
It's interesting these days, most software courses
don't bother to talk about these things.
And the reason is because as much as possible people
have been insulated in writing their software from performance
considerations.
But if you want to write fast code,
you have to know what is going on underneath so you
can exploit the strengths of the architecture.
And the interface, the best interface, that we have to that
is the assembly language.
So that's what we're going to talk about today.
So when you take a particular piece of code
like fib here, to compile it you run it through Clang,
as I'm sure you're familiar with at this point.
And what it produces is a binary machine language
that the computer is hardware programmed
to interpret and execute.
It looks at the bits as instructions as opposed to as
data.
And it executes them.
And that's what we see when we execute.
This process is not one step.
There are actually four stages to compilation:
preprocessing, compiling (sorry for the redundancy,
that's sort of a bad name conflict,
but that's what they call it),
assembling, and linking.
So I want to take us through those stages.
So the first thing you go through
is the preprocessing stage.
And you can invoke that with Clang manually.
So, for example, if you
run clang -E, that will run the preprocessor
and nothing else.
And you can take a look at the output
there and look to see how all your macros got expanded
and such before the compilation actually goes through.
Then you compile it.
And that produces assembly code.
So assembly is a mnemonic structure of the machine code
that makes it more human readable than the machine
code itself would be.
And once again, you can produce the assembly yourself
with clang -S.
And then finally, penultimately maybe,
you can assemble that assembly language code
to produce an object file.
And since we like to have separate compilation,
so that you don't have to compile everything
as one big monolithic hunk,
there's typically a linking stage
to produce the final executable.
And for that we are using ld for the most part.
We're actually using the gold linker,
but ld is the command that calls it.
So let's go through each of those steps
and see what's going on.
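To make the four stages concrete, here is a minimal sketch of a file one might push through them. The lecture's fib source is not reproduced in this transcript, so the function body, the file names fib.c and main.c, and the macro BASE_CASE_LIMIT are all assumptions for illustration. The comments list one plausible Clang invocation for each stage.

```c
// fib.c: a hypothetical stand-in for the fib example mentioned above.
//
// One way to run each stage by hand (file names are assumptions):
//   clang -E fib.c -o fib.i      preprocess only
//   clang -S fib.c -o fib.s      compile to assembly
//   clang -c fib.s -o fib.o      assemble into an object file
//   clang fib.o main.o -o fib    link; main.o would come from a separate
//                                file that defines main()
#include <stdint.h>

// The preprocessor replaces this macro textually before compilation.
#define BASE_CASE_LIMIT 2

// Naive recursive Fibonacci: fib(0) = 0, fib(1) = 1.
int64_t fib(int64_t n) {
  if (n < BASE_CASE_LIMIT) return n;
  return fib(n - 1) + fib(n - 2);
}
```

In the preprocessed output, the comparison should appear as `n < 2`, with the macro name gone, which is the textual substitution at work.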
So first, the preprocessing is really straightforward.
So I'm not going to do that.
That's just a textual substitution.
The next stage is the source code to assembly code.
So when we do clang -S, we get
this symbolic representation.
And it looks something like this, where we
have some labels on the side.
And we have some operations, and we have some directives.
And then we have a lot of gibberish,
which won't seem like so much gibberish
after you've played with it a little bit.
But to begin with, it looks kind of like gibberish.
From there, we assemble that assembly code and that
produces the binary.
And once again, you can invoke it just by running Clang.
Clang will recognize that it doesn't have a C file or a C++
file.
It says, oh, goodness, I've got an assembly language file.
And it will produce the binary.
Now, the other thing that turns out to be the case
is that assembly and machine code
are really very similar in structure.
Just things like the opcodes, which
are the things that are here in blue or purple,
whatever that color is, like these guys,
those correspond to specific bit patterns over here
in the machine code.
And these are the addresses and the registers that we're
operating on, the operands.
Those correspond to other bit patterns over there.
And there's very much a--
it's not exactly one to one, but it's pretty close to one to one
compared to if you had C and you look at the binary,
it's like way, way different.
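As a small illustration of how operands map onto bit patterns, the sketch below pulls apart the ModRM byte that many x86-64 instructions use to name their register operands. The helper names are hypothetical, and the example bytes 48 89 F8 (which encode `movq %rdi, %rax` in AT&T syntax) are chosen here for illustration, not taken from the lecture's slide.

```c
#include <stdint.h>

// Field extractors for an x86-64 ModRM byte, laid out as
// [ mod: 2 bits | reg: 3 bits | rm: 3 bits ].
static inline unsigned modrm_mod(uint8_t b) { return (b >> 6) & 0x3; }
static inline unsigned modrm_reg(uint8_t b) { return (b >> 3) & 0x7; }
static inline unsigned modrm_rm(uint8_t b)  { return b & 0x7; }

// Example encoding (an assumption for illustration): movq %rdi, %rax
//   0x48  REX prefix selecting 64-bit operand size
//   0x89  opcode: move from a register to a register or memory operand
//   0xF8  ModRM: mod = 3 (register direct), reg = 7 (%rdi), rm = 0 (%rax)
static const uint8_t example_mov[] = {0x48, 0x89, 0xF8};
```

Each mnemonic and each register name in the assembly accounts for a specific group of bits, which is why a disassembler can go back from bytes to text.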
So one of the things that turns out you can do is if you have
the machine code, and especially if the machine code was
produced with so-called debug symbols
(that is, it was compiled with -g),
you can use this program called objdump,
which will produce a disassembly of the machine code.
So it will tell you, OK, here's what the mnemonic, more
human readable code is, the assembly code, from the binary.
And that's really useful, especially
if you're trying to do things--
well, let's see why do we bother looking at the assembly?
So why would you want to look at the assembly of your program?
Does anybody have some ideas?
Yeah.
AUDIENCE: [INAUDIBLE] made or not.
CHARLES LEISERSON: Yeah, you can see
whether certain optimizations are made or not.
Other reasons?
Everybody is going to say that one.
OK.
Another one is-- well, let's see, so here's some reasons.
The assembly reveals what the compiler did and did not do,
because you can see exactly what the assembly is that is going
to be executed as machine code.
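One way to check what the compiler did is to compile the same function at two optimization levels and compare the assembly. The function below is a hypothetical example, not from the lecture. At higher optimization levels a compiler may rewrite or even eliminate the loop, but whether it does depends on the compiler and version, so the only way to know is to look.

```c
// sum.c: hypothetical example for comparing compiler output.
//
// Generate assembly at two optimization levels and compare the results
// (file names are assumptions):
//   clang -O0 -S sum.c -o sum_O0.s
//   clang -O3 -S sum.c -o sum_O3.s
#include <stdint.h>

// Sums the integers 0, 1, ..., n-1 with a simple loop.
uint64_t sum_below(uint64_t n) {
  uint64_t total = 0;
  for (uint64_t i = 0; i < n; ++i) {
    total += i;
  }
  return total;
}
```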
The second reason, which turns out
to happen more often than you would think,
is that, hey, guess what, the compiler
is a piece of software.
It has bugs.
So your code isn't operating correctly.
Oh, goodness, what's going on?
Maybe the compiler made an error.
And we have certainly found that, especially when you
start using some of the less frequently used features
of a compiler.
You may discover, oh, it's actually not
that well broken in.
And it mentions here that a bug may only have an effect when
compiling at -O3, but if you compile at -O0 or -O1,
everything works out just fine.
So then it says, gee, somewhere in the optimizations,
they did an optimization wrong.
So one of the first principles of optimization is do it right.
And then the second is make it fast.
And so sometimes the compiler doesn't do that.
It's also the case that sometimes you cannot write code
that produces the assembly that you want.
And in that case, you can actually
write the assembly by hand.
Now, it used to be many years ago--
many, many years ago--
that a lot of software was written in assembly.
In fact, my first job out of college,
I spent about half the time programming
in assembly language.
And it's not as bad as you would think.
But it certainly is easier to have high-level languages,
that's for sure.
You get a lot more done a lot quicker.
And the last reason is reverse engineering.
You can figure out what a program does when you only
have access to its binary, so, for example,
the matrix multiplication example that I gave on day 1.
You know, we had the overall outer structure,
but the inner loop, we could not match the Intel Math Kernel
Library code.
So what do we do?
We didn't have the source for it.
We looked to see what it was doing.
We said, oh, is that what they're doing?
And then we were able to do it ourselves
without having to get the source from them.
So we reverse engineered what they did.
So all those are good reasons.