The things we talked about with Spectre and Meltdown rely on some of the more advanced ways that the CPU operates, so it's probably worth diving down and actually looking at how a CPU executes the code we write. I mean, we've touched on this before; we did a video on pipelining, we did a video on caching, but let's delve down and see what happens below the surface when we actually get our CPU to execute our code. Let's start by having a simple example.
-
Let's take a line of code and look at what happens. It's going to take some variables and add them up: a plus b plus c plus d times e. I've written this in a sort of C-like language. So we're going to do this calculation. Now, as I'm sure most of us are aware, when we take that and put it through our C compiler, it gets converted into the machine code that the CPU executes.
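Written out, the line being talked about is roughly the following (the variable names are the ones used on the paper, and x is where the result gets stored back at the end; the exact C statement is a reconstruction rather than a verbatim copy of what's on screen):

    x = a + b + c + d * e;

Note the operator precedence: the multiplication d * e has to happen before its result can be added on.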
-
So we take that line of code, convert it into machine code, and then the CPU executes that machine code. A program like this would end up looking something like the following, and I'm going to use ARM assembly here, just because I know it better than anything else. Perhaps for the first instruction we would load the value of a from memory into a register; let's pick R0. We've got 14 or so of them we can use (there are 16 of them, but some of them get used for other things, so we don't really use those). So we load the value of a into R0. The next thing we want to do is add that to the value of b, but we need to make sure we get the operator precedence right, so we'll load the value of b into a register, R1, then load the value of c into another register, R2, and we might as well do d and e as well: load R3 with d, and load R4 with e.
-
And now we can start adding these things up and multiplying them to produce the actual result we want. Now, we've got to make sure we get the precedence right. We could either start by adding a and b together, then add on c, then multiply d and e and add that on, or we could do the multiplication first. I'm just going to go from left to right; as long as the maths is right, we'll get the right result. So we'll add together a and b. I put those two values in R0 and R1, and we need to store the result somewhere. We aren't going to need the value of a again after this, so we'll reuse the register R0. So we're saying: put into R0 the value of R0 plus R1. That adds them together and stores the result in R0, so we've now added a and b together. We want to add on c, and we can do the same thing: add to R0 the value in R0 (which, because of the previous instruction, is now a plus b) plus the value in R2, so there's now a plus b plus c in R0. Now we need to do the multiplication, and we need to do that separately, before we add it on, so that we get the right result. So we'll multiply; we'll say we've got an ARM 2 chip here, so we've got the multiply instruction. We need to put the result somewhere, so let's use R5: we multiply d, which we put in R3, by e, which we put in R4. Then we want to add the result of that onto the value in R0, and now R0 contains the result of a plus b plus c plus d times e, and we can then store that back into x.
-
So that one line of C code would become, what, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ten different lines of assembler. I've numbered them because I'm going to refer to them at different times, so we can say instruction 1, instruction 5, etc. to refer to the different ones.
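For reference, here is the sequence just described, numbered the way the instructions are numbered on the paper. The register choices follow the description above; the exact assembler syntax is a plausible ARM-style reconstruction rather than a verbatim copy of what's written down:

    LDR  R0, a          @ 1:  load the value of a from memory into R0
    LDR  R1, b          @ 2:  load b into R1
    LDR  R2, c          @ 3:  load c into R2
    LDR  R3, d          @ 4:  load d into R3
    LDR  R4, e          @ 5:  load e into R4
    ADD  R0, R0, R1     @ 6:  R0 = a + b
    ADD  R0, R0, R2     @ 7:  R0 = a + b + c
    MUL  R5, R3, R4     @ 8:  R5 = d * e
    ADD  R0, R0, R5     @ 9:  R0 = a + b + c + (d * e)
    STR  R0, x          @ 10: store the result back into x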
-
Now, we might expect that our CPU will just execute instruction 1, then instruction 2, instruction 3, instruction 4, instruction 5 and so on, in order, to generate the result, and some CPUs do in fact work exactly like that. But actually, if you think about what the CPU and these instructions are actually doing, you might think: well, actually,
-
when I get to this first one, I've got to go and access memory. As we talked about in the caching video many years ago, cache is perhaps an old-fashioned English word, but it basically just means a small place where you can store things; you might use it to store your hidden treasure if you're a pirate, or to store your food for the winter. On a modern CPU it will probably take, say, around 200 nanoseconds to actually go and get the value out of your main memory and load it into the register. Now of course, if these values are already in the cache, then you may find that these all execute very quickly.
-
But we know that this isn't the only way we could write this program, because if we take this instruction here, instruction 6, where we do the add of R0 and R1 to add up a and b: well, we've got those two values already in the registers at this point in the program, so there's nothing to stop us moving this instruction up there, and it would still have exactly the same effect. So instruction 6 could be moved to in between instructions 2 and 3, and then we'd do the next instruction, which was the same as instruction 3 here, which would be LDR R2, c: load the value in memory that's representing the variable c. Exactly the same effect; we've just moved that instruction earlier. So you could rewrite this program in various different ways.
-
Why is that interesting? Well, when we think about how a CPU is designed, you will have various different, what might best be termed, execution units within there. One of them is what's generally referred to as the ALU, or the arithmetic and logic unit, and that's the bit of your CPU that does addition, does subtraction, does the sort of logical operators, AND, OR and so on. But you also have other bits inside there, and one of the bits you'll often have in a modern CPU is a part of your CPU that handles loading and storing values from memory. Sometimes they interact, sometimes they don't.
-
Now, assuming that they are separate parts of the CPU, if we look back at our instructions here: we execute instruction 1, and it uses the load/store unit to get a value from memory; we execute instruction 2, and it uses the load/store unit to get a value from memory; instruction 3 uses the load/store unit to get a value from memory; 4 uses the load/store unit to get a value from memory; 5 uses the load/store unit to get a value from memory; 6 changes and uses the ALU, as do 7, 8 and 9; before instruction 10 uses the load/store unit. So we've got a pretty sequential series: the first five instructions all execute using the load/store part of the CPU, the next four instructions execute using the ALU, and the final instruction again uses the load/store unit. But as we said, we can reorder that into this version here, using the labels w, x, y and z to differentiate them, and we execute the first instruction: instruction w uses the load/store unit, instruction x uses the load/store unit, instruction y uses the ALU, and instruction z uses the load/store unit.
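As a sketch of that reordered start (again a reconstruction, using the labels from the paper: w and x are the original instructions 1 and 2, y is the add pulled forward from instruction 6, and z is the load of c that was instruction 3):

    LDR  R0, a          @ w: load a        (load/store unit)
    LDR  R1, b          @ x: load b        (load/store unit)
    ADD  R0, R0, R1     @ y: R0 = a + b    (ALU), moved up from instruction 6
    LDR  R2, c          @ z: load c        (load/store unit)
    @ ... the remaining loads, adds, multiply and store carry on as before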
-
Okay, so what difference does that make? Well, let's think about what's happening. When we're using the load/store unit, the ALU isn't being used; that part of the CPU is just sitting there doing nothing. And when we're using the ALU, the load/store unit is sitting there not being used. That's what we saw there. But does that have to be the case? Could we actually design it, and you can probably guess the answer is yes, we can, so that while, say, the load/store unit is being used, we can run instructions on the ALU part as well?
-
I'll turn the paper round, and I'm going to draw this as a sort of timeline. So these are our two units, and we've got time running along this side as well. I'm using the Computerphile paper in a radically different orientation, but never mind. So we're going to execute the instructions on here, and the first thing that happens is that we execute instruction w. No problem: that's going to take a certain amount of time, and it's using the load/store unit to execute it. (These are being fetched, decoded and then executed by the different execution units.) We then execute the next instruction, which is x, and we couldn't execute this any earlier because the load/store unit was being used to execute the previous one, so no difference from what we had before; we're using it one after the other. We now come to execute the add instruction. Now, we can't execute this any earlier than this point in time, because it depends on the values of registers R0 and R1, which aren't set until this point, so we need those two values before we can start doing instruction y here. Now actually, it's an add, so it's not going to take as long as fetching things from memory, because it's all inside the CPU; we can use a smaller box, and we can put instruction y there. It depends on the values being fetched from there, and I'm just going to show this as an arrow here. But the next instruction, LDR R2, c: well, that doesn't depend on anything except the value in memory, and our load/store unit's not being used, so if we build our CPU right, there's nothing to stop that instruction being executed at the same time. And that means that when we come to the next instruction, which will be the best instruction to execute next? In this example, let's go back to our program.
-
We've executed instructions 1, 2, 6 and 3 already; that's w, x, y and z, we've relabelled them. Let's put instruction 7 here. What was instruction 7 is now going to become instruction a. I'll hopefully remember to say 'instruction a', but you can guess from the context: 'a' on its own is probably the variable, and if not it's probably the instruction. So we can now execute instruction a, and again, instruction a depends on two things. It depends on the value of R0, which is going to come from this instruction, so we have to have that ready; but it also depends on the value of R2, which is coming from this instruction, so we have to have that ready as well. So it can't actually happen at any point before this point in time. So this would be the LDR R2 one, this is the add into R0, and this is the next add. But again, we can start trying to move more of the instructions earlier, because, okay, well, let's put what was instruction 4 up here at the same time; we'll call this instruction b. We put that at that point, and we can execute instruction b at the same time as we do instruction a (and I'm really confusing myself with pens here). And so again we've saved some time, because rather than having to execute them one after the other, we can do these two things at the same time.
-
Now, to be able to do this, these instructions need to execute on different execution units. We couldn't, for example, execute two add instructions at the same time, because we haven't got two ALUs. Although, there's no reason why you can't build a CPU with two ALUs; if you look at modern CPU designs from Intel, AMD, Arm, et cetera, they often have multiple ALUs to allow you to do just that. But because these are different types of instruction, we can execute them at the same time, and the reason we can do that is that they don't depend on the result of one to work out the other; they're working on different things, and they're using different parts of the CPU. A CPU that enables you to do this is what's known as a superscalar CPU, because it can run multiple instructions at the same time. Well, we continue doing this, and we'd end up: we execute instruction b,
-
then we've got to execute instruction c. Instruction d uses a multiply, and actually, on a CPU, you've probably got a separate execution unit which does the multiplies, because you can do them faster that way, so you have a multiply unit as well. So we can execute that multiply, d, up there. We think, well, okay, can we do the other one at the same time? Well, no, because we need the result of that as well, so we can then execute the add down here, before finally the store, and it just fits on the paper like that. So we can actually squash things up, and we're going to save some time, because if you think about it, here's the original order of the program,
-
and here's one I made earlier. All right, or as in I'm just about to draw it, and Sean will do some very clever cutting. So, even if we had a superscalar processor, we've only got one load/store unit, we've only got one ALU, we've only got one multiply unit, so we wouldn't have any opportunities with this version of the program to run two instructions at the same time. This version of the program still takes ten instructions, and this one still takes ten instructions, but with a superscalar processor we have the opportunity to execute two instructions at the same time, because they use different bits of the CPU. Now, you need to design the CPU to allow that, but that enables us to speed things up a little bit, because while this is working to get the value from memory, we can execute some more instructions.
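Putting the whole reordered program on the timeline, the schedule being drawn looks roughly like this (the exact slot widths and pairings are a reconstruction from the description, assuming one load/store unit, one ALU and one multiply unit):

    slot 1:  load/store: LDR R0, a   (w)
    slot 2:  load/store: LDR R1, b   (x)
    slot 3:  load/store: LDR R2, c   (z)   |  ALU: ADD R0, R0, R1   (y)
    slot 4:  load/store: LDR R3, d   (b)   |  ALU: ADD R0, R0, R2   (a)
    slot 5:  load/store: LDR R4, e   (c)
    slot 6:  multiply:   MUL R5, R3, R4    (d)
    slot 7:  ALU:        ADD R0, R0, R5
    slot 8:  load/store: STR R0, x

All ten instructions still get executed, but because two of them can sometimes issue in the same slot on different units, the whole thing finishes in fewer steps than the strictly sequential version.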
-
Now, that's all very well, and superscalar processors started to appear in the mid-90s; things like the 68060, and the Pentium, I think, was superscalar. But they require the code to be written in a way that enables this to happen, so this program wouldn't have been able to gain anything, while this one would. But as we said, when we were developing this, we could work out which instructions we could move around to get that speedup, based on what those instructions depended on. So this instruction, what was instruction 6 and became y, only depended on the values of R0 and R1, which had been set by instructions 1 and 2, so we could move it earlier without affecting anything in our program, because it only depended on those two values. So we can either do this in the compiler, or by hand if you're writing the assembly yourself, like we just did here, or
-
it's possible to let the CPU work it out. And so what a modern CPU does, what's called an out-of-order CPU, is reorder the instructions without breaking the rules of what each instruction does. So it'll still execute them as if they were written like this, and it won't change or break any of the rules of that, but it will spot that this instruction could happen earlier, and so move it earlier to gain some of that parallelism, and in fact execute them together at the same time. And that generally works well. But as we saw with things like Spectre and Meltdown, if you allow things to happen too far in advance and start doing what's called speculative execution, where you say: okay, I've got the stuff I need to execute this now; I don't know if I need the result, but I might do, so I'll execute it anyway, and then if I need it I've already done it, and if I don't need it, well, I was still waiting for this other thing to come in anyway, so it doesn't matter that I've done it, I've not lost any time. Well, then it's turned out that you can have sort of side channels where you can see whether that's happened or not, which has caused a few issues with computing.
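As a purely illustrative sketch of what 'execute it anyway' can look like (this is a made-up fragment, not code from the video: the CPU hasn't resolved the branch yet, but it may start the load speculatively and discard the result if it turns out not to be needed):

    CMP  R0, #100        @ some check on R0, e.g. a bounds check
    BGE  skip            @ the CPU may not yet know which way this branch goes...
    LDR  R1, [R2, R0]    @ ...but it can speculatively start this load anyway
    skip:                @ if the branch is taken, the loaded value is thrown away

The trouble Spectre and Meltdown exposed is that the speculative work can still leave observable traces, for example in the cache, which is the kind of side channel being referred to here.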
-
It goes along here like this, intersects the curve somewhere else, flips over, and it's over here, so this is 4G. Now, we won't look at it in any more detail; the formula for this is just mathematics to do with lines and the tangent of this curve. It's actually not very complicated. The point is that what we're doing, by multiplying G...