JULIAN SHUN: All right.
So we've talked a little bit about caching before,
but today we're going to talk in much more detail about caching
and how to design cache-efficient algorithms.
So first, let's look at the caching hardware
on modern machines today.
So here's what the cache hierarchy looks
like for a multicore chip.
We have a whole bunch of processors.
They all have their own private L1 caches
for both data and instructions.
They also have a private L2 cache.
And then they share a last level cache, or L3 cache,
which is also called LLC.
They're all connected to a memory controller
that can access DRAM.
And then, oftentimes, you'll have multiple chips
on the same server, and these chips
would be connected through a network.
So here we have a bunch of multicore chips
that are connected together.
So we can see that there are different levels of memory
here.
And the size of each of these levels of memory
is different.
So the sizes tend to go up as you move up
the memory hierarchy.
The L1 caches tend to be about 32 kilobytes.
In fact, these are the specifications for the machines
that you're using in this class.
So 32 kilobytes for both the L1 data cache
and the L1 instruction cache.
256 kilobytes for the L2 cache.
So the L2 cache tends to be about 8 to 10 times
larger than the L1 cache.
And then the last level cache, the size is 30 megabytes.
So this is typically on the order of tens of megabytes.
And then DRAM is on the order of gigabytes.
So here we have 128 gigabyte DRAM.
And nowadays, you can actually get machines
that have terabytes of DRAM.
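As a quick aside, here is a minimal sketch, assuming a Linux
machine with glibc, of how you could query these cache sizes
yourself; the _SC_LEVEL* names are glibc extensions and may
report 0 on other systems.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* These sysconf names are glibc extensions (Linux only). */
        printf("L1 data cache:   %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1 line size:    %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2 cache:        %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 (last level): %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }

On the class machines described above, the output should roughly
match the 32 kilobyte, 256 kilobyte, and 30 megabyte figures.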
So the associativity tends to go up as you move up
the cache hierarchy.
And I'll talk more about associativity
on the next couple of slides.
The time to access the memory also tends to go up.
So the latency tends to go up as you move up
the memory hierarchy.
So the L1 caches are the quickest to access,
about two nanoseconds, just rough numbers.
The L2 cache is a little bit slower--
so say four nanoseconds.
Last level cache, maybe six nanoseconds.
And then when you have to go to DRAM,
it's about an order of magnitude slower-- so 50 nanoseconds
in this example.
And the reason why the memories further down in the cache
hierarchy are faster is because they're manufactured
using more expensive materials.
But since they tend to be more expensive, we can't fit as much
of that on the machines.
So that's why the faster memories are smaller
than the slower memories.
But if we're able to take advantage of locality
in our programs, then we can make use of the fast memory
as much as possible.
And we'll talk about ways to do that in this lecture today.
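To make the locality point concrete, here is a small
illustrative sketch in C (not from the lecture): both loops
compute the same sum over a matrix, but the first traverses
memory contiguously, so it uses every byte of each cache line
it brings in, while the second strides across rows and tends
to miss in cache far more often.

    #include <stdio.h>

    #define N 2048
    static double a[N][N];   /* C stores this array in row-major order */

    int main(void) {
        double sum = 0.0;

        /* Good spatial locality: consecutive iterations touch
           consecutive addresses, so each cache line brought in
           is fully used before it is evicted. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor spatial locality: consecutive iterations are
           N * 8 bytes apart, so most accesses land on a
           different cache line. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

With 64-byte lines, the first loop gets eight doubles out of
every line it loads, while the second can end up using only one
before the line is evicted, once the matrix no longer fits in
cache.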
There's also the latency across the network, which
tends to be cheaper than going to main memory
but slower than doing a last level cache access.
And there's a lot of work in trying
to get the cache coherence protocols right, as we
mentioned before.
So since these processors all have private caches,
we need to make sure that they all
see a consistent view of memory when
they're trying to access the same memory
addresses in parallel.
So we talked about the MSI cache protocol before.
And there are many other protocols out there,
and you can read more about these things online.
But these are very hard to get right,
and there's a lot of verification involved
in trying to prove that the cache coherence protocols are
correct.
So any questions so far?
OK.
So let's talk about the associativity of a cache.
So here I'm showing you a fully associative cache.
And in a fully associative cache,
a cache block can reside anywhere in the cache.
And a basic unit of movement here is a cache block.
In this example, the cache block size is 4 bytes,
but on the machines that we're using for this class,
the cache block size is 64 bytes.
But for this example, I'm going to use a 4-byte cache line.
So each row here corresponds to one cache line.
And a fully associative cache means that each line here
can go anywhere in the cache.
And then here we're also assuming
a cache size of 32 bytes.
So, in total, it can store eight cache lines,
since each cache line is 4 bytes.
So to find a block in a fully associative cache,
you have to actually search the entire cache,
because a cache line can appear anywhere in the cache.
And there's a tag associated with each of these cache lines
here that basically specifies which
memory address in the virtual address space
it corresponds to.
So for the fully associative cache,
we're actually going to use most of the bits of that address
as a tag.
We don't actually need the two lower-order bits,
because data is moved
at the granularity of cache lines, which
are 4 bytes here.
So the two lower-order bits just identify a byte
within a line, and we're going to use the rest of the bits
to store the tag.
So if our address space is 64 bits,
then we're going to use 62 bits to store the tag in a fully
associative caching scheme.
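In code, that arithmetic looks something like the following
sketch; the 4-byte line size matches the toy example on the
slide, not the 64-byte lines of the real machines.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy fully associative cache: 4-byte lines, so the low 2 bits
       of the address select a byte within the line and the upper
       62 bits of a 64-bit address form the tag. */
    #define OFFSET_BITS 2

    int main(void) {
        uint64_t addr   = 0x7ffde3a4;                            /* arbitrary example address */
        uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);    /* low 2 bits */
        uint64_t tag    = addr >> OFFSET_BITS;                   /* remaining 62 bits */
        printf("tag = 0x%llx, offset = %llu\n",
               (unsigned long long)tag, (unsigned long long)offset);
        return 0;
    }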
And when a cache becomes full, a block
has to be evicted to make room for a new block.
And there are various ways that you can
decide how to evict a block.
So this is known as the replacement policy.
One common replacement policy is LRU, or Least Recently Used.
So you basically kick out the block that
has been used furthest in the past.
There are other schemes, such as second chance and clock
replacement, but we're not going to talk
too much about the different replacement schemes today.
You can feel free to read about them online.
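For intuition, here is a rough sketch (not from the lecture) of
how LRU eviction could be simulated for the 8-line fully
associative toy cache: each line remembers when it was last
touched, and on a miss with a full cache we evict the line with
the oldest timestamp.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 8   /* toy cache: 8 lines of 4 bytes = 32 bytes */

    typedef struct {
        int      valid;
        uint64_t tag;
        uint64_t last_used;   /* timestamp of the most recent access */
    } line_t;

    static line_t cache[NUM_LINES];
    static uint64_t now = 0;

    /* Access one address; returns 1 on a hit, 0 on a miss.
       On a miss, the least recently used (or an empty) line is replaced. */
    int cache_access(uint64_t addr) {
        uint64_t tag = addr >> 2;   /* 4-byte lines: drop the 2 offset bits */
        int victim = 0;
        now++;
        for (int i = 0; i < NUM_LINES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {
                cache[i].last_used = now;    /* hit */
                return 1;
            }
            if (!cache[victim].valid)
                continue;                    /* already found an empty line to fill */
            if (!cache[i].valid || cache[i].last_used < cache[victim].last_used)
                victim = i;                  /* track the LRU (or empty) line */
        }
        cache[victim].valid = 1;             /* miss: evict and refill */
        cache[victim].tag = tag;
        cache[victim].last_used = now;
        return 0;
    }

    int main(void) {
        uint64_t addrs[] = {0, 4, 8, 0, 64, 4};
        for (int i = 0; i < 6; i++)
            printf("addr %2llu: %s\n", (unsigned long long)addrs[i],
                   cache_access(addrs[i]) ? "hit" : "miss");
        return 0;
    }

Hardware caches typically approximate LRU with only a few bits
of state per set rather than full timestamps like this.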
So what's a disadvantage of this scheme?
Yes?
AUDIENCE: It's slow.
JULIAN SHUN: Yeah.
Why is it slow?
AUDIENCE: Because you have to go all the way [INAUDIBLE].
JULIAN SHUN: Yeah.
So the disadvantage is that searching
for a cache line in the cache can be pretty slow, because you
have to search the entire cache in the worst case,
since a cache block can reside anywhere in the cache.
So even though the search can go on in parallel in hardware,
it's still expensive in terms of power and performance
to have to search most of the cache every time.
So let's look at another extreme.
This is a direct mapped cache.
So in a direct mapped cache, each cache block
can only go in one place in the cache.
So I've color-coded these cache blocks here.
So the red blocks can only go in the first row of this cache,
the orange ones can only go in the second row, and so on.
And the position that a cache block can go into
is known as that cache block's set.
So the set determines the location
in the cache for each particular block.
So let's look at how the virtual memory address is divided up,
and which of the bits we're going
to use to figure out where a cache block should
go in the cache.
So we have the offset, we have the set,
and then the tag fields.
The offset just tells us which position
we want to access within a cache block.
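As a hedged sketch, assuming the same 32-byte, 4-byte-line toy
cache as before (so 8 sets), the address would be split like
this; real machines with 64-byte lines and many more sets would
use different shift amounts.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy direct-mapped cache: 32 bytes total, 4-byte lines, 8 sets.
       That means 2 offset bits, 3 set bits, and the rest is the tag. */
    #define OFFSET_BITS 2
    #define SET_BITS    3

    int main(void) {
        uint64_t addr   = 0x1a4;   /* arbitrary example address */
        uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
        uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);
        printf("offset = %llu, set = %llu, tag = 0x%llx\n",
               (unsigned long long)offset, (unsigned long long)set,
               (unsigned long long)tag);
        return 0;
    }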