  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: All right.

  • So we've talked a little bit about caching before,

  • but today we're going to talk in much more detail about caching

  • and how to design cache-efficient algorithms.

  • So first, let's look at the caching hardware

  • on modern machines today.

  • So here's what the cache hierarchy looks

  • like for a multicore chip.

  • We have a whole bunch of processors.

  • They all have their own private L1 caches

  • for both data and instructions.

  • They also have a private L2 cache.

  • And then they share a last level cache, or L3 cache,

  • which is also called LLC.

  • They're all connected to a memory controller

  • that can access DRAM.

  • And then, oftentimes, you'll have multiple chips

  • on the same server, and these chips

  • would be connected through a network.

  • So here we have a bunch of multicore chips

  • that are connected together.

  • So we can see that there are different levels of memory

  • here.

  • And the sizes of each of these levels of memory

  • are different.

  • So the sizes tend to go up as you move up

  • the memory hierarchy.

  • The L1 caches tend to be about 32 kilobytes.

  • In fact, these are the specifications for the machines

  • that you're using in this class.

  • So 32 kilobytes for both the L1 data cache

  • and the L1 instruction cache.

  • 256 kilobytes for the L2 cache.

  • So the L2 cache tends to be about 8 to 10 times

  • larger than the L1 cache.

  • And then the last level cache, the size is 30 megabytes.

  • So this is typically on the order of tens of megabytes.

  • And then DRAM is on the order of gigabytes.

  • So here we have 128 gigabyte DRAM.

  • And nowadays, you can actually get machines

  • that have terabytes of DRAM.

  • So the associativity tends to go up as you move up

  • the cache hierarchy.

  • And I'll talk more about associativity

  • on the next couple of slides.

  • The time to access the memory also tends to go up.

  • So the latency tends to go up as you move up

  • the memory hierarchy.

  • So the L1 caches are the quickest to access,

  • about two nanoseconds, just rough numbers.

  • The L2 cache is a little bit slower--

  • so say four nanoseconds.

  • Last level cache, maybe six nanoseconds.

  • And then when you have to go to DRAM,

  • it's about an order of magnitude slower-- so 50 nanoseconds

  • in this example.
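
For reference, the rough numbers just quoted can be collected in one place. This is only a sketch of the lecture's ballpark figures for the class machines, not exact specifications, and the struct and names below are illustrative.

```c
#include <stdio.h>

/* Ballpark capacities and access latencies quoted in the lecture for
 * the class machines; real numbers vary from machine to machine. */
struct mem_level {
    const char *name;
    long long   size_bytes;
    double      latency_ns;
};

static const struct mem_level hierarchy[] = {
    { "L1 data cache", 32LL * 1024,                2.0  },
    { "L2 cache",      256LL * 1024,               4.0  },
    { "LLC (L3)",      30LL * 1024 * 1024,         6.0  },
    { "DRAM",          128LL * 1024 * 1024 * 1024, 50.0 },
};

int main(void) {
    for (int i = 0; i < 4; i++)
        printf("%-14s %14lld bytes  ~%4.0f ns\n", hierarchy[i].name,
               hierarchy[i].size_bytes, hierarchy[i].latency_ns);
    return 0;
}
```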

  • And the reason why the memories further down in the cache

  • hierarchy are faster is that they're

  • made with more expensive materials.

  • But since they tend to be more expensive, we can't fit as much

  • of that on the machines.

  • So that's why the faster memories are smaller

  • than the slower memories.

  • But if we're able to take advantage of locality

  • in our programs, then we can make use of the fast memory

  • as much as possible.

  • And we'll talk about ways to do that in this lecture today.
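
As one concrete illustration of exploiting locality (a sketch added here, not an example from the slides), consider summing a matrix stored in row-major order: walking it row by row touches consecutive addresses and reuses every cache line that is fetched, while walking it column by column jumps N*8 bytes between accesses and misses far more often once the matrix is bigger than the cache.

```c
#include <stdio.h>

#define N 4096
static double a[N][N];   /* 128 MB, row-major, larger than the LLC */

/* Row-by-row traversal: consecutive iterations touch adjacent
 * addresses, so each cache line brought in from memory is fully used. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-by-column traversal: successive accesses are N * 8 bytes
 * apart, so nearly every access misses once the matrix exceeds the
 * cache, even though the same work is being done. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```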

  • There's also the latency across the network, which

  • tends to be cheaper than going to main memory

  • but slower than doing a last level cache access.

  • And there's a lot of work in trying

  • to get the cache coherence protocols right, as we

  • mentioned before.

  • So since these processors all have private caches,

  • we need to make sure that they all

  • see a consistent view of memory when

  • they're trying to access the same memory

  • addresses in parallel.

  • So we talked about the MSI cache protocol before.

  • And there are many other protocols out there,

  • and you can read more about these things online.

  • But these are very hard to get right,

  • and there's a lot of verification involved

  • in trying to prove that the cache coherence protocols are

  • correct.
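
As a rough reminder of what MSI looks like, here is a toy C sketch of the per-line states and a few transitions. It ignores the bus and directory machinery that makes real coherence protocols hard to design and verify, and the function names are illustrative assumptions, not a real protocol implementation.

```c
#include <stdio.h>

/* The three MSI states a cached copy of a line can be in. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

/* This core writes the line: other copies must be invalidated
 * (not shown), then our copy becomes Modified. */
msi_state_t on_local_write(msi_state_t s) {
    (void)s;
    return MODIFIED;
}

/* Another core writes the line: our copy becomes stale. */
msi_state_t on_remote_write(msi_state_t s) {
    (void)s;
    return INVALID;
}

/* Another core reads a line we hold Modified: we write it back
 * and downgrade our copy to Shared. */
msi_state_t on_remote_read(msi_state_t s) {
    return (s == MODIFIED) ? SHARED : s;
}

int main(void) {
    msi_state_t s = SHARED;
    s = on_local_write(s);   /* SHARED -> MODIFIED */
    s = on_remote_read(s);   /* MODIFIED -> SHARED */
    s = on_remote_write(s);  /* SHARED -> INVALID  */
    printf("final state: %d\n", s);
    return 0;
}
```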

  • So any questions so far?

  • OK.

  • So let's talk about the associativity of a cache.

  • So here I'm showing you a fully associative cache.

  • And in a fully associative cache,

  • a cache block can reside anywhere in the cache.

  • And a basic unit of movement here is a cache block.

  • In this example, the cache block size is 4 bytes,

  • but on the machines that we're using for this class,

  • the cache block size is 64 bytes.
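
A quick aside on what a 64-byte line means for data layout (my arithmetic, not a slide from the lecture): every miss pulls in a whole line, so touching one array element brings its neighbors into the cache with it.

```c
#include <stdio.h>

#define CACHE_LINE_BYTES 64   /* line size on the class machines */

int main(void) {
    /* One miss on an int or a double also brings in this many
     * neighboring elements of the same array. */
    printf("ints per line:    %zu\n", CACHE_LINE_BYTES / sizeof(int));
    printf("doubles per line: %zu\n", CACHE_LINE_BYTES / sizeof(double));
    return 0;
}
```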

  • But for this example, I'm going to use a 4-byte cache line.

  • So each row here corresponds to one cache line.

  • And a fully associative cache means that each line here

  • can go anywhere in the cache.

  • And then here we're also assuming

  • a cache size of 32 bytes.

  • So, in total, it can store eight cache lines,

  • since each cache line is 4 bytes.

  • So to find a block in a fully associative cache,

  • you have to actually search the entire cache,

  • because a cache line can appear anywhere in the cache.

  • And there's a tag associated with each of these cache lines

  • here that basically specifies which

  • of the memory addresses in virtual memory space

  • it corresponds to.

  • So for the fully associative cache,

  • we're actually going to use most of the bits of that address

  • as a tag.

  • We don't actually need the two lower order

  • bits, because the things are being

  • moved at the granularity of cache lines, which

  • are four bytes.

  • So the two lower order bits are always going to be the same,

  • but we're just going to use the rest of the bits

  • to store the tag.

  • So if our address space is 64 bits,

  • then we're going to use 62 bits to store the tag in a fully

  • associative caching scheme.
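
To make the tag arithmetic concrete, here is a hedged C sketch of a lookup in the toy fully associative cache described above: a 32-byte cache with 4-byte lines, so 8 lines to search, and a tag that is just the address with its 2 offset bits dropped. The struct and function names are illustrative, not how hardware is actually built.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES    8   /* 32-byte cache / 4-byte lines */
#define OFFSET_BITS  2   /* log2(4-byte line)            */

typedef struct {
    bool     valid;
    uint64_t tag;        /* remaining 62 bits of the 64-bit address */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Fully associative lookup: the block could be in ANY line,
 * so every line's tag has to be compared against the address. */
bool fully_assoc_lookup(uint64_t addr) {
    uint64_t tag = addr >> OFFSET_BITS;   /* drop the 2 offset bits */
    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return true;   /* hit */
    return false;          /* miss: a block must be evicted to make room */
}

int main(void) {
    /* Preload one line, then look up an address that maps to it. */
    cache[3] = (cache_line_t){ .valid = true, .tag = 0x1000 >> OFFSET_BITS };
    return fully_assoc_lookup(0x1000) ? 0 : 1;
}
```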

  • And when a cache becomes full, a block

  • has to be evicted to make room for a new block.

  • And there are various ways that you can

  • decide how to evict a block.

  • So this is known as the replacement policy.

  • One common replacement policy is LRU, or Least Recently Used.

  • So you basically kick the thing out that

  • has been used the farthest in the past.

  • There are other schemes, such as second chance and clock

  • replacement, but we're not going to talk

  • too much about the different replacement schemes today.

  • But you can feel free to read about these things online.
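
Here is a minimal C sketch of LRU eviction for the same toy fully associative cache, using a per-access counter as a timestamp. Real hardware uses cheaper approximations, and the names and counter scheme here are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES   8
#define OFFSET_BITS 2

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;   /* "time" of the most recent access */
} lru_line_t;

static lru_line_t lines[NUM_LINES];
static uint64_t   now;    /* incremented on every access */

/* Touch one address; on a miss, evict the least recently used line. */
void lru_access(uint64_t addr) {
    uint64_t tag = addr >> OFFSET_BITS;
    int victim = 0;

    now++;
    for (int i = 0; i < NUM_LINES; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            lines[i].last_used = now;   /* hit: mark as most recent */
            return;
        }
        /* Track the line used furthest in the past; empty lines have
         * last_used == 0, so they are chosen first. */
        if (lines[i].last_used < lines[victim].last_used)
            victim = i;
    }
    /* Miss: replace the least recently used (or an empty) line. */
    lines[victim] = (lru_line_t){ .valid = true, .tag = tag, .last_used = now };
}

int main(void) {
    for (uint64_t addr = 0; addr < 64; addr += 4)
        lru_access(addr);   /* 16 distinct lines cycle through 8 slots */
    return 0;
}
```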

  • So what's a disadvantage of this scheme?

  • Yes?

  • AUDIENCE: It's slow.

  • JULIAN SHUN: Yeah.

  • Why is it slow?

  • AUDIENCE: Because you have to go all the way [INAUDIBLE].

  • JULIAN SHUN: Yeah.

  • So the disadvantage is that searching

  • for a cache line in the cache can be pretty slow, because you

  • have to search the entire cache in the worst case,

  • since a cache block can reside anywhere in the cache.

  • So even though the search can go on in parallel in hardware,

  • it's still expensive in terms of power and performance

  • to have to search most of the cache every time.

  • So let's look at another extreme.

  • This is a direct mapped cache.

  • So in a direct mapped cache, each cache block

  • can only go in one place in the cache.

  • So I've color-coded these cache blocks here.

  • So the red blocks can only go in the first row of this cache,

  • the orange ones can only go in the second row, and so on.

  • And the position which a cache block can go into

  • is known as that cache block's set.

  • So the set determines the location

  • in the cache for each particular block.

  • So let's look at how the virtual memory address is divided up

  • and which of the bits we're going

  • to use to figure out where a cache block should

  • go in the cache.

  • So we have the offset, we have the set,

  • and then the tag fields.

  • The offset just tells us which position

  • we want to access within a cache block.
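
To make that field split concrete for this toy direct-mapped cache, here is a hedged C sketch: 4-byte blocks give 2 offset bits, 8 sets give 3 set bits, and the remaining high bits are the tag. The field widths follow the toy parameters above, and the function name is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2   /* log2(4-byte block) */
#define SET_BITS    3   /* log2(8 sets)       */

/* Split a 64-bit address into the offset, set, and tag fields used by
 * the toy direct-mapped cache: offset picks the byte within the block,
 * set picks the one row the block may occupy, and tag identifies which
 * block currently lives in that row. */
void split_address(uint64_t addr) {
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
}

int main(void) {
    split_address(0x1234);   /* arbitrary example address */
    return 0;
}
```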