Lecture 14: Caching and Cache-Efficient Algorithms

Description: Prof. Shun discusses associativity in caches, the ideal cache model, cache-aware algorithms like tiled matrix multiplication, and cache-oblivious algorithms like divide-and-conquer matrix multiplication.

Instructor: Julian Shun

JULIAN SHUN: All right. So we've talked a little bit about caching before, but today we're going to talk in much more detail about caching and how to design cache-efficient algorithms. So first, let's look at the caching hardware on modern machines today.

So here's what the cache hierarchy looks like for a multicore chip. We have a whole bunch of processors. They all have their own private L1 caches, for both data and instructions. They also have a private L2 cache. And then they share a last-level cache, or L3 cache, which is also called the LLC.

They're all connected to a memory controller that can access DRAM. And then, oftentimes, you'll have multiple chips on the same server, and these chips would be connected through a network. So here we have a bunch of multicore chips that are connected together.

So we can see that there are different levels of memory here. And the size of each one of these levels of memory is different. So the sizes tend to go up as you move up the memory hierarchy. The L1 caches tend to be about 32 kilobytes. In fact, these are the specifications for the machines that you're using in this class. So it's 32 kilobytes for both the L1 data cache and the L1 instruction cache.

256 kilobytes for the L2 cache. So the L2 cache tends to be about 8 to 10 times larger than the L1 cache. And then for the last-level cache, the size is 30 megabytes. So this is typically on the order of tens of megabytes. And then DRAM is on the order of gigabytes. So here we have 128 gigabytes of DRAM. And nowadays, you can actually get machines that have terabytes of DRAM.

So the associativity tends to go up as you move up the cache hierarchy. And I'll talk more about associativity on the next couple of slides. The time to access the memory also tends to go up. So the latency tends to go up as you move up the memory hierarchy. So the L1 caches are the quickest to access, about two nanoseconds, just rough numbers. The L2 cache is a little bit slower-- so say four nanoseconds. Last level cache, maybe six nanoseconds.

And then when you have to go to DRAM, it's about an order of magnitude slower-- so 50 nanoseconds in this example. And the reason why the memories further down in the cache hierarchy are faster is that they're made with more expensive materials. But since they tend to be more expensive, we can't fit as much of them on the machine. So that's why the faster memories are smaller than the slower memories.

But if we're able to take advantage of locality in our programs, then we can make use of the fast memory as much as possible. And we'll talk about ways to do that in this lecture today. There's also the latency across the network, which tends to be cheaper than going to main memory but slower than doing a last level cache access.

And there's a lot of work in trying to get the cache coherence protocols right, as we mentioned before. So since these processors all have private caches, we need to make sure that they all see a consistent view of memory when they're trying to access the same memory addresses in parallel. So we talked about the MSI cache protocol before. And there are many other protocols out there, and you can read more about these things online. But these are very hard to get right, and there's a lot of verification involved in trying to prove that the cache coherence protocols are correct.

So any questions so far? OK.

So let's talk about the associativity of a cache. So here I'm showing you a fully associative cache. And in a fully associative cache, a cache block can reside anywhere in the cache. And a basic unit of movement here is a cache block. In this example, the cache block size is 4 bytes, but on the machines that we're using for this class, the cache block size is 64 bytes. But for this example, I'm going to use a four byte cache line.

So each row here corresponds to one cache line. And a fully associative cache means that each line here can go anywhere in the cache. And then here we're also assuming a cache of size 32 bytes. So, in total, it can store eight cache lines, since each cache line is 4 bytes.

So to find a block in a fully associative cache, you have to actually search the entire cache, because a cache line can appear anywhere in the cache. And there's a tag associated with each of these cache lines here that basically specifies which memory address in the virtual memory space it corresponds to. So for the fully associative cache, we're actually going to use most of the bits of that address as the tag. We don't actually need the two lower-order bits, because things are being moved at the granularity of cache lines, which are four bytes. So the two lower-order bits are always going to be the same, and we're just going to use the rest of the bits to store the tag.

So if our address space is 64 bits, then we're going to use 62 bits to store the tag in a fully associative caching scheme. And when a cache becomes full, a block has to be evicted to make room for a new block. And there are various ways that you can decide how to evict a block. So this is known as the replacement policy.

One common replacement policy is LRU, or Least Recently Used. So you basically kick out the thing that has been used the furthest in the past. There are other schemes, such as second chance and clock replacement, but we're not going to talk too much about the different replacement schemes today. You can feel free to read about these things online.

So what's a disadvantage of this scheme? Yes?

AUDIENCE: It's slow.

JULIAN SHUN: Yeah. Why is it slow?

AUDIENCE: Because you have to go all the way [INAUDIBLE].

JULIAN SHUN: Yeah. So the disadvantage is that searching for a cache line in the cache can be pretty slow, because you have to search the entire cache in the worst case, since a cache block can reside anywhere in the cache. So even though the search can go on in parallel in hardware, it's still expensive in terms of power and performance to have to search most of the cache every time.

So let's look at another extreme. This is a direct-mapped cache. So in a direct-mapped cache, each cache block can only go in one place in the cache. So I've color-coded these cache blocks here. So the red blocks can only go in the first row of this cache, the orange ones can only go in the second row, and so on. And the position which a cache block can go into is known as that cache block's set. So the set determines the location in the cache for each particular block.

So let's look at how the virtual memory address is divided up and which of the bits we're going to use to figure out where a cache block should go in the cache. So we have the offset, the set, and the tag fields. The offset just tells us which position we want to access within a cache block. So since a cache block has B bytes, we only need log base 2 of B bits for the offset.

And the reason why we have the offset is because we're not always accessing something at the beginning of a cache block. We might want to access something in the middle. And that's why we need the offset to specify where in the cache block we want to access.

Then there's a set field. And the set field is going to determine which position in the cache that cache block can go into. So there are eight possible positions for each cache block. And therefore, we only need log base 2 of 8 bits-- so three bits for the set in this example. And more generally, it's going to be log base 2 of M over B. And here, M over B is 8.

And then, finally, we're going to use the remaining bits as a tag. So w minus log base 2 of M bits for the tag. And that gets stored along with the cache block in the cache. And that's going to uniquely identify which of the memory blocks the cache block corresponds to in virtual memory.

And you can verify that the sum of all these quantities here sums to w bits. So in total, we have a w bit address space. And the sum of those three things is w.
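
As a concrete illustration (not from the lecture), here is a small C sketch that extracts the offset, set, and tag fields for the toy direct-mapped cache on the slide, with 4-byte lines and eight sets; the names and the example address are mine.

```c
#include <stdint.h>
#include <stdio.h>

// Toy direct-mapped parameters from the slide: B = 4-byte lines and
// M = 32-byte cache, so M/B = 8 sets. A real machine would use B = 64
// and a much larger M, but the field layout is the same.
enum { LG_B = 2, LG_SETS = 3 };  // log2(B) and log2(M/B)

int main(void) {
  uint64_t addr = 0xdeadbeefULL;                               // example address
  uint64_t offset = addr & ((1ULL << LG_B) - 1);               // low log2(B) bits
  uint64_t set    = (addr >> LG_B) & ((1ULL << LG_SETS) - 1);  // next log2(M/B) bits
  uint64_t tag    = addr >> (LG_B + LG_SETS);                  // remaining w - log2(M) bits
  printf("offset=%llu set=%llu tag=%llu\n", (unsigned long long)offset,
         (unsigned long long)set, (unsigned long long)tag);
  return 0;
}
```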

So what's the advantage and disadvantage of this scheme? So first, what's a good thing about this scheme compared to the previous scheme that we saw? Yes?

AUDIENCE: Faster.

JULIAN SHUN: Yeah. It's fast because you only have to check one place. Because each cache block can only go in one place in the cache, that's the only place you have to check when you try to do a lookup. If the cache block is there, then you find it. If it's not, then you know it's not in the cache.

What's the downside to this scheme? Yeah?

AUDIENCE: You only end up putting the red ones into the cache and you have mostly every [INAUDIBLE], which is totally [INAUDIBLE].

JULIAN SHUN: Yeah. So good answer. So the downside is that you might, for example, just be accessing the red cache blocks and not accessing any of the other cache blocks. They'll all get mapped to the same location in the cache, and then they'll keep evicting each other, even though there's a lot of empty space in the cache. And this is known as a conflict miss. And these can be very bad for performance and very hard to debug. So one downside of a direct-mapped cache is that you can get these conflict misses, where you have to evict things from the cache even though there's empty space in the cache.

So as we said, finding a block is very fast. Only a single location in the cache has to be searched. But you might suffer from conflict misses if you keep accessing things in the same set repeatedly without accessing the things in the other sets. So any questions? OK.

So these are sort of the two extremes for cache design. There's actually a hybrid solution called a set-associative cache. And in a set-associative cache, you still have sets, but each of the sets contains more than one line now. So all the red blocks still map to the red set, but there are actually two possible locations for the red blocks now.

So in this case, this is known as a two-way associative cache, since there are two possible locations inside each set. And again, a cache block's set determines k possible cache locations for that block. So within a set it's fully associative, but each block can only go in one of the sets.

So let's look again at how the bits in the address are divided up. So we still have the tag, set, and offset fields. The offset field still takes log base 2 of B bits. The set field is going to take log base 2 of M over kB bits. So the number of sets we have is M over kB, and we need log base 2 of that number to represent the set of a block. And then, finally, we use the remaining bits as the tag, so it's going to be w minus log base 2 of M over k.

And now, to find a block in the cache, only the k locations of its set must be searched. So you basically find which set the cache block maps to, and then you check all k locations within that set to see if the cache block is there. And whenever you try to put something in the cache because it's not there, you have to evict something. And you evict something from the same set as the block that you're placing into the cache.

So for this example, I showed a two-way associative cache. But in practice, the associativity is usually bigger-- say eight-way, 16-way, or sometimes 20-way. And as you keep increasing the associativity, it's going to look more and more like a fully associative cache. And if you have a one-way associative cache, then that's just a direct-mapped cache. So this is sort of a hybrid in between a fully associative cache and a direct-mapped cache. So any questions on set-associative caches? OK.
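
To make the set-associative lookup and LRU eviction concrete, here is a toy simulation sketch in C; the parameters and names are mine, and real hardware searches the k ways of a set in parallel rather than in a loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define K      4     // associativity (ways per set); an assumed example value
#define NSETS  128   // number of sets, M / (K * B); an assumed example value
#define LG_B   6     // 64-byte cache lines

static uint64_t tags[NSETS][K];      // tag stored in each way of each set
static uint64_t last_use[NSETS][K];  // larger value = used more recently
static bool     valid[NSETS][K];
static uint64_t now = 0;

// Returns true on a hit, false on a miss (after filling the line).
bool cache_access(uint64_t addr) {
  uint64_t set = (addr >> LG_B) % NSETS;   // set index bits
  uint64_t tag = (addr >> LG_B) / NSETS;   // remaining high-order bits
  for (int j = 0; j < K; ++j)              // search only the K ways of this set
    if (valid[set][j] && tags[set][j] == tag) {
      last_use[set][j] = ++now;
      return true;
    }
  int victim = 0;                          // miss: evict the LRU way of this set
  for (int j = 1; j < K; ++j)
    if (!valid[set][j] || last_use[set][j] < last_use[set][victim]) victim = j;
  valid[set][victim] = true;
  tags[set][victim] = tag;
  last_use[set][victim] = ++now;
  return false;
}
```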

So let's go over a taxonomy of different types of cache misses that you can incur. So the first type of cache miss is called a cold miss. And this is the cache miss that you have to incur the first time you access a cache block. And if you need to access this piece of data, there's no way to get around getting a cold miss for this. Because your cache starts out not having this block, and the first time you access it, you have to bring it into cache.

Then there are capacity misses. So capacity misses are cache misses you get because the cache is full and can't fit all of the cache blocks that you want to access. So you get a capacity miss when the previously cached copy would have been evicted even with a fully associative scheme. So even if all of the possible locations in your cache could be used for a particular cache line, that cache line still has to be evicted because there's not enough space. So that's what's called a capacity miss. And you can deal with capacity misses by introducing more locality into your code, both spatial and temporal locality. And we'll look at ways to reduce the capacity misses of algorithms later on in this lecture.

Then there are conflict misses. And conflict misses happen in set-associative caches when you have too many blocks from the same set wanting to go into the cache. And some of these have to be evicted, because the set can't fit all of the blocks. And these blocks wouldn't have been evicted if you had a fully associative scheme, so these are what's called conflict misses. For example, if you have 16 locations in a set and you keep accessing 17 things that all belong in that set, something's going to get kicked out every time you want to access something. And these cache evictions might not have happened if you had a fully associative cache.

And then, finally, there are sharing misses. So sharing misses only happen in a parallel context. And we talked a little bit about true sharing and false sharing misses in prior lectures. So let's just review this briefly.

So a sharing miss can happen if multiple processors are accessing the same cache line and at least one of them is writing to that cache line. If all of the processors are just reading from the cache line, then the cache coherence protocol knows how to make it work so that you don't get misses. They can all access the same cache line at the same time if nobody's modifying it.

But if at least one processor is modifying it, you could get either true sharing misses or false sharing misses. So a true sharing miss is when two processors are accessing the same data on the same cache line. And as you recall from a previous lecture, if one of the two processors is writing to this cache line, whenever it does a write it needs to acquire the cache line in exclusive mode and then invalidate that cache line in all other caches. So then when another processor tries to access the same memory location, it has to bring it back into its own cache, and then you get a cache miss there.

A false sharing miss happens if two processors are accessing different data that just happen to reside on the same cache line, because the basic unit of movement in the architecture is a cache line. So even if you're accessing different things, if they are on the same cache line, you're still going to get a sharing miss. And false sharing is pretty hard to deal with, because, in general, you don't know what data gets placed on what cache line.

There are certain heuristics you can use. For example, if you're mallocing a big memory region, you know that that memory region is contiguous, so you can space the accesses by different processors far enough apart that they don't touch the same cache line. But if you're just declaring local variables on the stack, you don't know where the compiler is going to decide to place these variables in the virtual memory address space. So these are the four different types of cache misses that you should know about.
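
One common way to apply that spacing heuristic is to pad each worker's frequently written data out to its own cache line. Here is a small C sketch of the idea; the names are mine, and it assumes the 64-byte line size of the class machines.

```c
#include <stdint.h>

#define CACHE_LINE 64  // assumed line size in bytes

// Each worker updates only its own counter. Padding the struct to a full
// cache line means two different workers' counters never share a line,
// so their writes don't cause false sharing misses.
typedef struct {
  int64_t count;
  char pad[CACHE_LINE - sizeof(int64_t)];
} padded_counter_t;

// With C11 you can state the same intent with alignment instead of manual
// padding: typedef struct { _Alignas(CACHE_LINE) int64_t count; } ...;

padded_counter_t counters[8];  // one counter per worker
```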

And there's many models out there for analyzing the cache performance of algorithms. And some of the models ignore some of these different types of cache misses. So just be aware of this when you're looking at algorithm analysis, because not all of the models will capture all of these different types of cache misses.

So let's look at a bad case for conflict misses. So here I want to access a submatrix within a larger matrix. And recall that matrices are stored in row-major order. And let's say our matrix is 4,096 columns by 4,096 rows, and it stores doubles. So therefore, each row here is going to contain 2 to the 15th bytes, because 4,096 is 2 to the 12th, and we have doubles, which take eight bytes. So it's 2 to the 12th times 2 to the 3rd, which is 2 to the 15th.

We're going to assume the word width is 64, which is standard. We're going to assume that we have a cache size of 32k. And the cache block size is 64, which, again, is standard. And let's say we have a four-way associative cache.

So let's look at how the bits are divided up. So again we have the offset, which takes log base 2 of B bits. So how many bits do we have for the offset in this example? Right. So we have 6 bits. So it's just log base 2 of 64.

What about for the set? How many bits do we have for that? 7. Who said 7?

Yeah. So it is 7. So M is 32k, which is 2 to the 15th. And then k is 2 to the 2nd, and B is 2 to the 6th. So it's 2 to the 15th divided by 2 to the 8th, which is 2 to the 7th. And log base 2 of that is 7.

And finally, what about the tag field?

AUDIENCE: 51.

JULIAN SHUN: 51. Why is that?

AUDIENCE: 64 minus 13.

JULIAN SHUN: Yeah. So it's just 64 minus 7 minus 6, which is 51. OK.

So let's say that we want to access a submatrix within this larger matrix. Let's say we want to access a 32 by 32 submatrix. And this is pretty common in matrix algorithms, where you want to access submatrices, especially in divide-and-conquer algorithms.

And let's say we want to access a column of this submatrix A. So the addresses of the elements that we're going to access are as follows-- so let's say the first element in the column is stored at address x. Then the second element in the column is going to be stored at address x plus 2 to the 15th, because each row has 2 to the 15th bytes, and we're skipping over an entire row here to get to the element in the next row of the submatrix. So we're going to add 2 to the 15th.

And then to get the third element, we're going to add 2 times 2 to the 15th. And so on, until we get to the last element, which is x plus 31 times 2 to the 15th. So which fields of the address are changing as we go through one column of this submatrix?

AUDIENCE: You're just adding multiple [INAUDIBLE] tag the [INAUDIBLE].

JULIAN SHUN: Yeah. So it's just going to be the tag that's changing. The set and the offset are going to stay the same, because we're just using the lower 13 bits to store the set and the offset. And therefore, when we increment by 2 to the 15th, we're not going to touch the set and the offset.

So all of these addresses fall into the same set. And this is a problem, because our cache is only four-way associative. So we can only fit four cache lines in each set. And here, we're accessing 32 of these things. So by the time we get to the next column of A, all the things that we accessed in the current column of A are going to be evicted from cache already. And this is known as a conflict miss, because if you had a fully associative cache this might not have happened, because you could actually use any location in the cache to store these cache blocks.
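
You can check this set calculation directly. The following small C sketch (the starting address is arbitrary and the names are mine) computes the set index, bits 6 through 12, for each address in the column and prints the same set every time.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // 32 KB, 4-way, 64-byte lines: 6 offset bits and 7 set bits.
  uint64_t x = 0x100000;  // arbitrary address of the first column element
  for (int i = 0; i < 32; ++i) {
    uint64_t addr = x + (uint64_t)i * (1ULL << 15);  // next row is 2^15 bytes away
    uint64_t set = (addr >> 6) & 0x7f;               // bits 6..12
    printf("row %2d -> set %llu\n", i, (unsigned long long)set);
  }
  return 0;  // every row lands in the same set: a conflict-miss pattern
}
```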

So does anybody have any questions on why we get conflict misses here? So anybody have any ideas on how to fix this? So what can I do to make it so that I'm not incrementing by exactly 2 to the 15th every time? Yeah.

AUDIENCE: So pad the matrix?

JULIAN SHUN: Yeah. So one solution is to pad the matrix. You can add some constant amount of space to the end of the matrix. So each row is going to be longer than 2 to the 15th bytes. So maybe you add some small constant like 17. So add 17 bytes to the end of each row.

And now, when you access a column of this submatrix, you're not just incrementing by 2 to the 15th, you're also adding some small integer. And that's going to cause the set and the offset fields to change as well, and you're not going to get as many conflict misses. So that's one way to solve the problem.

It turns out that if you're doing a matrix multiplication algorithm, that's a cubic work algorithm, and you can basically afford to copy the submatrix into a temporary 32 by 32 matrix, do all the operations on the temporary matrix, and then copy it back out to the original matrix. The copying only takes quadratic work to do across the whole algorithm. And since the whole algorithm takes cubic work, the quadratic work is a lower order term. So you can use temporary space to make sure that you don't get conflict misses.
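
Here is a sketch of that copy-in/copy-out idea; the names are mine, and row_len is the row length of the big matrix in doubles (possibly including padding). The copies are quadratic work per submatrix, a lower-order term for a cubic-work algorithm.

```c
#include <stdint.h>

#define S 32  // submatrix side length

// Copy an S x S submatrix into a contiguous temporary buffer; since tmp is
// contiguous, working on it does not suffer the column conflict misses.
void copy_in(double tmp[S][S], const double *sub, int64_t row_len) {
  for (int i = 0; i < S; ++i)
    for (int j = 0; j < S; ++j)
      tmp[i][j] = sub[i * row_len + j];
}

// Copy the results back out to the original (strided) submatrix.
void copy_out(double *sub, const double tmp[S][S], int64_t row_len) {
  for (int i = 0; i < S; ++i)
    for (int j = 0; j < S; ++j)
      sub[i * row_len + j] = tmp[i][j];
}
```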

Any questions? So this was conflict misses. So conflict misses are important. But usually, we're going to be first concerned about getting good spatial and temporal locality, because those are usually the higher-order factors in the performance of a program. And once we get good spatial and temporal locality in our program, we can then start worrying about conflict misses, for example, by using temporary space or padding our data by some small constant so that we don't have as many conflict misses.

So now, I want to talk about a model that we can use to analyze the cache performance of algorithms. And this is called the ideal-cache model. So in this model, we have a two-level cache hierarchy. So we have the cache and then main memory.

The cache size is of size M, and the cache line size is B bytes. And therefore, we can fit M over B cache lines inside our cache. This model assumes that the cache is fully associative, so any cache block can go anywhere in the cache. And it also assumes an optimal omniscient replacement policy.

So this means that when we want to evict a cache block from the cache, we're going to pick the thing to evict that gives us the best performance overall-- that gives us the lowest number of cache misses throughout our entire algorithm. So we're assuming that we know the sequence of memory requests throughout the entire algorithm. And that's why it's called the omniscient replacement policy. And if something is in cache, you can operate on it for free. And if something is in main memory, you have to bring it into cache and then you incur a cache miss.

So there are two performance measures that we care about. First, we care about the ordinary work, which is just the ordinary running time of the program. So this is the same as before when we were analyzing algorithms. It's just the total number of operations that the program does. And the number of cache misses is going to be the number of lines we have to transfer between the main memory and the cache. So the number of cache misses just counts the number of cache transfers, whereas the work counts all the operations that you have to do in the algorithm.

So ideally, we would like to come up with algorithms that have a low number of cache misses without increasing the work from the traditional standard algorithm. Sometimes we can do that, sometimes we can't do that. And then there's a trade-off between the work and the number of cache misses. And it's a trade-off that you have to decide whether it's worthwhile as a performance engineer.

Today, we're going to look at an algorithm where you can actually reduce the number of cache misses without increasing the work. So you basically get the best of both worlds. So any questions on this ideal-cache model?

So this model is just used for analyzing algorithms. You can't actually buy one of these caches at the store. So this is a very ideal cache, and they don't exist.

But it turns out that this optimal omniscient replacement policy has nice theoretical properties. And this is a very important lemma that was proved in 1985. It's called the LRU lemma. It was proved by Sleator and Tarjan. And the lemma says, suppose that an algorithm incurs Q cache misses on an ideal cache of size M. Then, on a fully associative cache of size 2M that uses the LRU, or Least Recently Used, replacement policy, it incurs at most 2Q cache misses.

So what this says is if I can show the number of cache misses for an algorithm on the ideal cache, then if I take a fully associative cache that's twice the size and use the LRU replacement policy, which is a pretty practical policy, then the algorithm is going to incur, at most, twice the number of cache misses. And the implication of this lemma is that for asymptotic analyses, you can assume either the optimal replacement policy or the LRU replacement policy, as convenient, because the number of cache misses is just going to be within a constant factor of each other. So this is a very important lemma. It basically makes it much easier for us to analyze the cache misses of our algorithms.

And here's a software engineering principle that I want to point out. So first, when you're trying to get good performance, you should come up with a theoretically good algorithm that has good bounds on the work and the cache complexity. And then after you come up with an algorithm that's theoretically good, you start engineering for detailed performance. You start worrying about details such as real-world caches not being fully associative, and, for example, loads and stores having different costs with respect to bandwidth and latency. But coming up with a theoretically good algorithm is the first-order bit to getting good performance.

Questions?

So let's start analyzing the number of cache misses in a program. So here's a lemma. So the lemma says, suppose that a program reads a set of r data segments, where the i-th segment consists of s sub i bytes. And suppose that the sum of the sizes of all the segments is equal to N. And we're going to assume that N is less than M over 3. So the sum of the size of the segments is less than the cache size divided by 3.

We're also going to assume that N over r is greater than or equal to B. So recall that r is the number of data segments we have, and N is the total size of the segment. So what does N over r mean, semantically? Yes.

AUDIENCE: Average [INAUDIBLE].

JULIAN SHUN: Yeah. So N over r is just the average size of a segment. And here we're saying that the average size of a segment is at least B-- so at least the size of a cache line. So if these two assumptions hold, then all of the segments are going to fit into cache, and the number of cache misses to read them all is, at most, 3 times N over B.

So if you had just a single array of size N, then the number of cache misses you would need to read that array into cache is going to be N over B. And this is saying that, even if our data is divided into a bunch of segments, as long as the average length of the segments is large enough, then the number of cache misses is just a constant factor worse than reading a single array. So let's try to prove this cache miss lemma.

So here's the proof. A single segment of size s sub i is going to incur at most s sub i over B plus 2 cache misses. So does anyone want to tell me where the s sub i over B plus 2 comes from? So let's say this is a segment that we're analyzing, and this is how it's aligned in virtual memory. Yes?

AUDIENCE: How many blocks it could overlap worst case.

JULIAN SHUN: Yeah. So s sub i over B plus 2 is the number of blocks that could overlap within the worst case. So you need s sub i over B cache misses just to load those s sub i bytes. But then the beginning and the end of that segment might not be perfectly aligned with a cache line boundary. And therefore, you could waste, at most, one block on each side of the segment. So that's where the plus 2 comes from.

So to get the total number of cache misses, we just have to sum this quantity from i equals 1 to r. So if I sum s sub i over B from i equals 1 to r, I just get N over B, by definition. And then I sum 2 from i equals 1 to r. So that just gives me 2r.

Now, I'm going to multiply the top and the bottom of the second term by B. So it's 2rB over B now. And then that's less than or equal to N over B plus 2N over B. So where did I get this inequality here? Why do I know that 2rB is less than or equal to 2N? Yes?

AUDIENCE: You know that the N is greater than or equal to B r.

JULIAN SHUN: Yeah. So you know that N is greater than or equal to rB by this assumption up here. So therefore, rB is less than or equal to N. And then, N over B plus 2N over B just sums up to 3N over B. So in the worst case, we're going to incur 3N over B cache misses. So any questions on this cache miss lemma?
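
Written out under the lemma's assumptions, the chain of bounds from this argument is:

```latex
Q \;\le\; \sum_{i=1}^{r}\left(\frac{s_i}{B} + 2\right)
  \;=\; \frac{N}{B} + 2r
  \;=\; \frac{N}{B} + \frac{2rB}{B}
  \;\le\; \frac{N}{B} + \frac{2N}{B}
  \;=\; \frac{3N}{B},
```

where the last inequality uses rB less than or equal to N, which follows from the assumption that N over r is at least B.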

So the important thing to remember here is that if you have a whole bunch of data segments and the average length of your segments is large enough-- bigger than a cache block size-- then you can access all of these segments just like a single array. It only increases the number of cache misses by a constant factor. And if you're doing an asymptotic analysis, then it doesn't matter. So we're going to be using this cache miss lemma later on when we analyze algorithms.

So another assumption that we're going to need is called the tall cache assumption. And the tall cache assumption basically says that the cache is taller than it is wide. So it says that B squared is less than c M for some sufficiently small constant c less than or equal to 1. So in other words, it says that the number of cache lines M over B you have is going to be bigger than B.

And this tall cache assumption is usually satisfied in practice. So here are the cache line sizes and the cache sizes on the machines that we're using. So cache line size is 64 bytes, and the L1 cache size is 32 kilobytes. So 64 bytes squared, that's 2 to the 12th. And 32 kilobytes is 2 to the 15th bytes.

So 2 to the 12th is less than 2 to the 15th, so it satisfies the tall cache assumption. And as we go up the memory hierarchy, the cache size increases, but the cache line length stays the same. So the caches become even taller as we move up the memory hierarchy.

So let's see why this tall cache assumption is going to be useful. To see that, we're going to look at what's wrong with a short cache. So in a short cache, our lines are going to be very wide, and they're wider than the number of lines that we can have in our cache.

And let's say we're working with an n by n submatrix stored in row-major order. If you have a short cache, then even if n squared is less than c M, meaning that you can fit all the bytes of the submatrix in cache, you might still not be able to fit it into a short cache. And this picture sort of illustrates this.

So we have n rows here. But we can only fit M over B of the rows in the cache, because the cache lines are so long, and we're actually wasting a lot of space on each of the cache lines. We're only using a very small fraction of each cache line to store the row of this submatrix. If this were the entire matrix, then it would actually be OK, because consecutive rows are going to be placed together consecutively in memory. But if this is a submatrix, then we can't be guaranteed that the next row is going to be placed right after the current row. And oftentimes, we have to deal with submatrices when we're doing recursive matrix algorithms.

So this is what's wrong with short caches. And that's why we want to assume the tall cache assumption. And we can assume that, because it's usually satisfied in practice. The TLB actually tends to be short-- it only has a couple of entries, so it might not satisfy the tall cache assumption. But all of the other caches will satisfy this assumption.

Any questions? OK.

So here's another lemma that's going to be useful. This is called the submatrix caching lemma. So suppose that we have an n by n matrix A, and it's read into a tall cache that satisfies B squared less than c M for some constant c less than or equal to 1. And suppose that n squared is less than M over 3, but it's greater than or equal to c M. Then A is going to fit into cache, and the number of cache misses required to read all of A's elements into cache is, at most, 3n squared over B.

So let's see why this is true. So we're going to let big N denote the total number of bytes that we need to access. So big N is going to be equal to n squared. And we're going to use the cache miss lemma, which says that if the average length of our segments is large enough, then we can read all of the segments in just like it were a single contiguous array.

So the lengths of our segments here are going to be little n. So the number of segments r is going to be little n, and the segment length is also going to be little n, since we're working with a square submatrix here.

And then we also have the cache block size B is less than or equal to n. And that's equal to big N over r. And where do we get this property that B is less than or equal to n? So I made some assumptions up here, where I can use to infer that B is less than or equal to n. Does anybody see where? Yeah.

AUDIENCE: So B squared is less than c M, and c M is [INAUDIBLE]

JULIAN SHUN: Yeah. So I know that B squared is less than c M. C M is less than or equal to n squared. So therefore, B squared is less than n squared, and B is less than n.

So now, I also have that N is less than M over 3, just by assumption. And therefore, I can use the cache miss lemma. So the cache miss lemma tells me that I only need a total of 3n squared over B cache misses to read this whole thing in. Any questions on the submatrix caching lemma?

So now, let's analyze matrix multiplication. How many of you have seen matrix multiplication before? So a couple of you.

So here's what the code looks like for the standard cubic-work matrix multiplication algorithm. So we have two input matrices, A and B, and we're going to store the result in C. And the height and the width of our matrices is n. We're just going to deal with square matrices here, but what I'm going to talk about also extends to non-square matrices.

And then we just have three loops here. We're going to loop through i from 0 to n minus 1, j from 0 to n minus 1, and k from 0 to n minus 1. And then we're going to let C of i n plus j be incremented by A of i n plus k times B of k n plus j. So that's just the standard code for matrix multiply.
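
Here is a sketch of that loop nest in C; the function name and signature are mine, but the loop body matches the pseudocode just described for row-major n by n matrices of doubles.

```c
#include <stdint.h>

void matmul(double *restrict C, const double *restrict A,
            const double *restrict B, int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    for (int64_t j = 0; j < n; ++j)
      for (int64_t k = 0; k < n; ++k)
        C[i*n + j] += A[i*n + k] * B[k*n + j];  // C[i][j] += A[i][k] * B[k][j]
}
```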

So what's the work of this algorithm? It should be review for all of you. n cubed.

So now, let's analyze the number of cache misses this algorithm is going to incur. And again, we're going to assume that the matrix is in row-major order, and we satisfy the tall cache assumption. We're also going to analyze the number of cache misses in matrix B, because it turns out that the number of cache misses incurred by matrix B is going to dominate the number of cache misses overall.

And there are three cases we need to consider. The first case is when n is greater than c M over B for some constant c. And we're going to analyze matrix B, as I said. And we're also going to assume LRU, because we can. If you recall, the LRU lemma says that whatever we analyze using the LRU is just going to be a constant factor within what we analyze using the ideal cache.

So to do this matrix multiplication, I'm going to go through one row of A and one column of B and do the dot product there. This is what happens in the innermost loop. And how many cache misses am I going to incur when I go down one column of B here?

So here, I have the case where n is greater than M over B. So I can't fit one block from each row into the cache. So how many cache misses do I have the first time I go down a column of B? So how many rows of B do I have? n. Yeah, and how many cache misses do I need for each row? One. So in total, I'm going to need n cache misses for the first column of B.

What about the second column of B? So recall that I'm assuming the LRU replacement policy here. So when the cache is full, I'm going to evict the thing that was least recently used-- used the furthest in the past. Sorry, could you repeat that?

AUDIENCE: [INAUDIBLE].

JULIAN SHUN: Yeah. So it's still going to be n. Why is that?

AUDIENCE: Because there are [INAUDIBLE] integer.

JULIAN SHUN: Yeah. It's still going to be n, because I can't fit one cache block from each row into my cache. And by the time I get back to the top of my matrix B, the top block has already been evicted from the cache, and I have to load it back in. And this is the same for every other block that I access. So I'm, again, going to need n cache misses for the second column of B. And this is going to be the same for all the columns of B.

And then I have to do this again for the second row of A. So in total, I'm going to need theta of n cubed number of cache misses. And this is one cache miss per entry that I access in B. And this is not very good, because the total work was also theta of n cubed. So I'm not gaining anything from having any locality in this algorithm here.

So any questions on this analysis? So this just case 1. Let's look at case 2.

So in this case, n is less than c M over B. So I can fit one block from each row of B into cache. And then n is also greater than another constant, c prime times square root of M, so I can't fit the whole matrix into cache. And again, let's analyze the number of cache misses incurred by accessing B, assuming LRU.

So how many cache misses am I going to incur for the first column of B?

AUDIENCE: n.

JULIAN SHUN: n. So that's the same as before. What about the second column of B? So by the time I get to the beginning of the matrix here, is the top block going to be in cache? So who thinks the block is still going to be in cache when I get back to the beginning? Yeah. So a couple of people. Who thinks it's going to be out of cache?

So it turns out it is going to be in cache, because I can fit one block for every row of B into my cache since I have n less than c M over B. So therefore, when I get to the beginning of the second column, that block is still going to be in cache, because I loaded it in when I was accessing the first column. So I'm not going to incur any cache misses for the second column. And, in general, if I can fit B columns or some constant times B columns into cache, then I can reduce the number of cache misses I have by a factor of B. So I only need to incur a cache miss the first time I access of block and not for all the subsequent accesses.

And the same is true for the second row of A. And since I have n rows of A, I'm going to have n times theta of n squared over B cache misses. For each row of A, I'm going to incur n squared over B cache misses.

So the overall number of cache misses is n cubed over B. And this is because inside matrix B I can exploit spatial locality. Once I load in a block, I can reuse it the next time I traverse down a column that's nearby. Any questions on this analysis?

So let's look at the third case. And here, n is less than c prime times square root of M. So this means that the entire matrix fits into cache. So let's analyze the number of cache misses for matrix B again, assuming LRU. So how many cache misses do I have now?

So let's count the total number of cache misses I have for every time I go through a row of A. Yes.

AUDIENCE: Is it just n for the first column?

JULIAN SHUN: Yeah. So for the first column, it's going to be n. What about the second column?

AUDIENCE: [INAUDIBLE] the second [INAUDIBLE].

JULIAN SHUN: Right. So basically, for the first row of A, the analysis is going to be the same as before. I need n squared over B cache misses to load the cache in. What about the second row of A? How many cache misses am I going to incur?

AUDIENCE: [INAUDIBLE].

JULIAN SHUN: Yeah. So for the second row of A, I'm not going to incur any cache misses. Because once I load B into cache, it's going to stay in cache-- the entire matrix can fit in cache, since I assumed that n is less than c prime times square root of M. So the total number of cache misses I need for matrix B is theta of n squared over B, since everything fits in cache. And I just apply the submatrix caching lemma from before. Overall, though, this is not a very good algorithm. Because as you recall, in case 1 I needed a cubic number of cache misses.

What happens if I swap the order of the inner two loops? So recall that this was one of the optimizations in lecture 1, when Charles was talking about matrix multiplication and how to speed it up. So if I swapped the order of the two inner loops, then, for every iteration, what I'm doing is I'm actually going over a row of C and a row of B, and A stays fixed inside the innermost iteration.

So now, when I analyze the number of cache misses of matrix B, assuming LRU, I'm going to benefit from spatial locality, since I'm going row by row and the matrix is stored in row-major order. So across all of the rows, I'm just going to require theta of n squared over B cache misses. And I have to do this n times for the outer loop. So in total, I'm going to get theta of n cubed over B cache misses.

So if you swap the order of the inner two loops, this significantly improves the locality of your algorithm, and you can benefit from spatial locality. That's why we saw a significant performance improvement in the first lecture when we swapped the order of the loops. Any questions?
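
Here is a sketch of the interchanged loop order just described (the function name is mine): the innermost loop now scans a row of C and a row of B while the element of A stays fixed, so every array is accessed in row-major order.

```c
#include <stdint.h>

void matmul_ikj(double *restrict C, const double *restrict A,
                const double *restrict B, int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    for (int64_t k = 0; k < n; ++k)
      for (int64_t j = 0; j < n; ++j)      // j is now the innermost loop
        C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```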

So does anybody think we can do better than n cubed over B cache misses? Or do you think that it's the best you can do? So how many people think you can do better? Yeah. And how many people think this is the best you can do? And how many people don't care?

So it turns out you can do better. And we're going to do better by using an optimization called tiling. So how this is going to work is instead of just having three for loops, I'm going to have six for loops. And I'm going to loop over tiles.

So I'm going to loop over s by s submatrices. And within each submatrix, I'm going to do all of the computation I need for that submatrix before moving on to the next submatrix. So the three innermost loops are going to loop inside a submatrix, and the three outermost loops are going to loop within the larger matrix, one submatrix at a time.
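
Here is a sketch of the tiled loop nest just described; the function name is mine, and for simplicity it assumes the tile size s divides n.

```c
#include <stdint.h>

void matmul_tiled(double *restrict C, const double *restrict A,
                  const double *restrict B, int64_t n, int64_t s) {
  for (int64_t ih = 0; ih < n; ih += s)          // three outer loops walk over
    for (int64_t jh = 0; jh < n; jh += s)        // the s x s submatrices
      for (int64_t kh = 0; kh < n; kh += s)
        for (int64_t i = ih; i < ih + s; ++i)    // three inner loops do all the
          for (int64_t k = kh; k < kh + s; ++k)  // work for one submatrix
            for (int64_t j = jh; j < jh + s; ++j)
              C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```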

So let's analyze the work of this algorithm. So the work that we need to do for a submatrix of size s by s is going to be s cubed, since that's just the bound for matrix multiplication. And then the number of times I have to operate on submatrices is going to be n over s, cubed. And you can see this if you just consider each submatrix to be a single element, and then use the same cubic-work analysis on the smaller matrix.

So the work is n over s, cubed, times s cubed, which is equal to theta of n cubed. So the work of this tiled matrix multiply is the same as the version that didn't do tiling. And now, let's analyze the number of cache misses.

So we're going to tune s so that the submatrices just fit into cache. So we're going to set s to be equal to theta of square root of M. We actually need to make this 1/3 square root of M, because we need to fit three submatrices in the cache. But it's going to be some constant times square root of M.

The submatrix caching lemma implies that for each submatrix we're going to need s squared over B misses to load it in. And once we load it into cache, it fits entirely into cache, so we can do all of our computation within cache and not incur any more cache misses. So therefore, the total number of cache misses we're going to incur is going to be the number of subproblems, which is n over s, cubed, times the number of cache misses per subproblem, which is s squared over B.

And if you multiply this out, you're going to get n cubed over B times square root of M. So here, I plugged in square root of M for s. And this is a pretty cool result, because it says that you can actually do better than the n cubed over B bound. You can improve this bound by a factor of square root of M.
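
In symbols, the counting we just did is:

```latex
W(n) = \Theta\!\left(\left(\frac{n}{s}\right)^{3} s^{3}\right) = \Theta(n^{3}),
\qquad
Q(n) = \Theta\!\left(\left(\frac{n}{s}\right)^{3} \frac{s^{2}}{B}\right)
     = \Theta\!\left(\frac{n^{3}}{sB}\right)
     = \Theta\!\left(\frac{n^{3}}{B\sqrt{M}}\right)
\quad \text{for } s = \Theta(\sqrt{M}).
```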

And in practice, square root of M is actually not insignificant. So, for example, if you're looking at the last level cache, the size of that is on the order of megabytes. So a square root of M is going to be on the order of thousands. So this significantly improves the performance of the matrix multiplication code if you tune s so that the submatrices just fit in the cache.

It turns out that this bound is optimal. So this was shown in 1981. So for cubic work matrix multiplication, this is the best you can do. If you use another matrix multiply algorithm, like Strassen's algorithm, you can do better.

So I want you to remember this bound. It's a very important bound to know. It says that for a matrix multiplication you can benefit both from spatial locality as well as temporal locality. So I get spatial locality in the B term in the denominator. And then the square root of M term comes from temporal locality, since I'm doing all of the work inside a submatrix before I evict that submatrix from cache. Any questions on this analysis?

So what's one issue with this algorithm here? Yes.

AUDIENCE: It's not portable, like different architecture [INAUDIBLE].

JULIAN SHUN: Yeah. So the problem here is I have to tune s for my particular machine. And I call this a voodoo parameter. It's sort of like a magic number I put into my program so that it fits in the cache on the particular machine I'm running on. And this makes the code not portable, because if I try to run this code on another machine, the cache sizes might be different there, and then I won't get the same performance as I did on my machine.

And this is also an issue even if you're running it on the same machine, because you might have other programs running at the same time and using up part of the cache. So you don't actually know how much of the cache your program actually gets to use in a multiprogramming environment.

And then this was also just for one level of cache. If we want to optimize for two levels of caches, we're going to have two voodoo parameters, s and t. We're going to have submatrices and sub-submatrices. And then we have to tune both of these parameters to get the best performance on our machine. And multi-dimensional tuning optimization can't be done simply with binary search. So if you're just tuning for one level of cache, you can do a binary search on the parameter s, but here you can't do binary search. So it's much more expensive to optimize here.

And the code becomes a little bit messier. You have nine for loops instead of six. And how many levels of caches do we have on the machines that we're using today?

AUDIENCE: Three.

JULIAN SHUN: Three. So for three level cache, you have three voodoo parameters. You have 12 nested for loops. This code becomes very ugly. And you have to tune these parameters for your particular machine. And this makes the code not very portable, as one student pointed out. And in a multiprogramming environment, you don't actually know the effective cache size that your program has access to. Because other jobs are running at the same time, and therefore it's very easy to mistune the parameters.

Was there a question? No? So any questions? Yeah.

AUDIENCE: Is there a way to programmatically get the size of the cache? [INAUDIBLE].

JULIAN SHUN: Yeah. So you can auto-tune your program so that it's optimized for the cache sizes of your particular machine.

AUDIENCE: [INAUDIBLE] instruction to get the size of the cache [INAUDIBLE].

JULIAN SHUN: Instruction to get the size of your cache. I'm not actually sure. Do you know?

AUDIENCE: [INAUDIBLE] in--

AUDIENCE: [INAUDIBLE].

AUDIENCE: Yeah, in the proc--

JULIAN SHUN: Yeah, proc cpuinfo.

AUDIENCE: Yeah. proc cpuinfo or something like that.

JULIAN SHUN: Yeah. So you can probably get that as well.

AUDIENCE: And I think if you google, I think you'll find it pretty quickly.

JULIAN SHUN: Yeah.

JULIAN SHUN: Yeah. But even if you do that, and you're running this program when other jobs are running, you don't actually know how much cache your program has access to. Yes?

AUDIENCE: Is cache architecture and stuff like that optimized around matrix problems?

JULIAN SHUN: No. They're actually general purpose. Today, we're just looking at matrix multiply, but on Thursday's lecture we'll actually be looking at many other problems and how to optimize them for the cache hierarchy. Other questions?

So this was a good algorithm in terms of cache performance, but it wasn't very portable. So let's see if we can do better. Let's see if we can come up with a simpler design where we still get pretty good cache performance.

So we're going to turn to divide and conquer. We're going to look at the recursive matrix multiplication algorithm that we saw before. Again, we're going to deal with square matrices, but the results generalize to non-square matrices.

So how this works is we're going to split our input matrices into four submatrices, or four quadrants. And then each quadrant of the output matrix is just going to be the sum of two matrix multiplies on n over 2 by n over 2 matrices. So C 1 1 is going to be A 1 1 times B 1 1, plus A 1 2 times B 2 1.

And then we're going to do this recursively. So at every level of recursion, we're going to get eight multiply-adds of n over 2 by n over 2 matrices. Here's what the recursive code looks like. You can see that we have eight recursive calls here. The base case here is of size 1. In practice, you want to coarsen the base case to overcome function call overheads.

Let's also look at what these values here correspond to. So I've color coded these so that they correspond to particular elements in the submatrix that I'm looking at on the right. So these values here correspond to the index of the first element in each of my quadrants. So the first element in my first quadrant is just going to have an offset of 0. And then the first element of my second quadrant, that's going to be on the same row as the first element in my first quadrant. So I just need to add the width of my quadrant, which is n over 2.

And then to get the first element in quadrant 2 1, I'm going to jump over n over 2 rows. And each row has length row size, so it's just going to be n over 2 times row size. And then to get the first element in quadrant 2 2, it's just the first element in quadrant 2 1 plus n over 2. So that's n over 2 times the quantity row size plus 1.
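
Here is a C sketch consistent with the quadrant offsets just described-- 0, n over 2, n over 2 times row size, and n over 2 times (row size plus 1); the names are mine, it assumes n is a power of 2, and in practice the base case would be coarsened.

```c
#include <stdint.h>

// n is the side of the current submatrix; rowSize is the row length of the
// full matrix that the submatrices live inside.
void mm_dac(double *restrict C, const double *restrict A,
            const double *restrict B, int64_t n, int64_t rowSize) {
  if (n == 1) { C[0] += A[0] * B[0]; return; }
  int64_t h = n / 2;
  int64_t q12 = h;                  // first element of quadrant 1,2
  int64_t q21 = h * rowSize;        // first element of quadrant 2,1
  int64_t q22 = h * (rowSize + 1);  // first element of quadrant 2,2
  // C11 += A11*B11 + A12*B21, and similarly for the other three quadrants.
  mm_dac(C,       A,       B,       h, rowSize);
  mm_dac(C,       A + q12, B + q21, h, rowSize);
  mm_dac(C + q12, A,       B + q12, h, rowSize);
  mm_dac(C + q12, A + q12, B + q22, h, rowSize);
  mm_dac(C + q21, A + q21, B,       h, rowSize);
  mm_dac(C + q21, A + q22, B + q21, h, rowSize);
  mm_dac(C + q22, A + q21, B + q12, h, rowSize);
  mm_dac(C + q22, A + q22, B + q22, h, rowSize);
}
```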

So let's analyze the work of this algorithm. So what's the recurrence for this algorithm-- for the work of this algorithm? So how many subproblems do we have?

AUDIENCE: Eight

JULIAN SHUN: Eight. And what's the size of each subproblem? n over 2. And how much work are we doing to set up the recursive calls? A constant amount of work.

So the recurrence is W of n is equal to 8 W of n over 2 plus theta of 1. And what does that solve to? n cubed. So it's one of the three cases of the master theorem.

We're actually going to analyze this in more detail by drawing out the recursion tree. And this is going to give us more intuition about why the master theorem is true. So at the top level of my recursion tree, I'm going to have a problem of size n. And then I'm going to branch into eight subproblems of size n over 2. And then I'm going to do a constant amount of work to set up the recursive calls.

Here, I'm just labeling this with one. So I'm ignoring the constants. But it's not going to matter for asymptotic analysis.

And then I'm going to branch again into eight subproblems of size n over 4. And eventually, I'm going to get down to the leaves. And how many levels do I have until I get to the leaves? Yes?

AUDIENCE: Log n.

JULIAN SHUN: Yeah. So log n-- what's the base of the log? Yeah. So it's log base 2 of n, because I'm dividing my problem size by 2 every time. And therefore, the number of leaves I have is going to be 8 to the log base 2 of n, because I'm branching it eight ways every time. 8 to the log base 2 of n is the same as n to the log base 2 of 8, which is n cubed.

The amount of work I'm doing at the top level is constant. So I'm just going to say 1 here. At the next level, it's 8, then 64. And then when I get to the leaves, it's going to be theta of n cubed, since I have n cubed leaves, and they're all doing constant work.

And the work is geometrically increasing as I go down the recursion tree. So the overall work is just dominated by the work I need to do at the leaves. So the overall work is just going to be theta of n cubed. And this is the same as the looping versions of matrix multiply-- they're all cubic work.

Now, let's analyze the number of cache misses of this divide and conquer algorithm. So now, my recurrence is going to be different. My base case now is when the submatrix fits in the cache-- so when n squared is less than c M. And when that's true, I just need to load that submatrix into cache, and then I don't incur any more cache misses. So I need theta of n squared over B cache misses when n squared is less than c M for some sufficiently small constant c, less than or equal to 1.

And then, otherwise, I recurse into 8 subproblems of size n over 2. And then I add theta of 1, because I'm doing a constant amount of work to set up the recursive calls. And I get this theta of n squared over B term from the submatrix caching lemma. It says I can just load the entire submatrix into cache with this many cache misses.
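
In symbols, the cache miss recurrence just described is:

```latex
Q(n) =
\begin{cases}
\Theta\!\left(\dfrac{n^{2}}{B}\right) & \text{if } n^{2} < cM \text{ for a sufficiently small constant } c \le 1,\\[1.5ex]
8\,Q\!\left(\dfrac{n}{2}\right) + \Theta(1) & \text{otherwise.}
\end{cases}
```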

So the difference between the cache analysis here and the work analysis before is that I have a different base case. And I think in all of the algorithms that you've seen before, the base case was always of a constant size. But here, we're working with a base case that's not of a constant size.

So let's try to analyze this using the recursion tree approach. So at the top level, I have a problem of size n that I'm going to branch into eight problems of size n over 2. And then I'm also going to incur a constant number of cache misses. I'm just going to say 1 here. Then I'm going to branch again.

And then, eventually, I'm going to get to the base case where n squared is less than c M. And when n squared is less than c M, then the number of cache misses that I'm going to incur is going to be theta of c M over B. So I can just plug-in c M here for n squared.

And the number of levels of recursion I have in this recursion tree is no longer just log base 2 of n. I'm going to have log base 2 of n minus log base 2 of square root of c M number of levels, which is the same as log base 2 of n minus 1/2 times log base 2 of c M. And then, the number of leaves I get is going to be 8 to this number of levels here. So it's 8 to log base 2 of n minus 1/2 of log base 2 of c M. And this is equal to the theta of n cubed over M to the 3/2.

So the n cubed comes from the 8 to the log base 2 of n term. And then if I do 8 to the negative 1/2 of log base 2 of c M, that's just going to give me M to the 3/2 in the denominator. So any questions on how I computed the number of levels of this recursion tree here?

So I'm basically dividing my problem size by 2 until I get to a problem size that fits in the cache. So that means n is less than square root of c M. So therefore, I can subtract that many levels for my recursion tree. And then to get the number of leaves, since I'm branching eight ways, I just do 8 to the power of the number of levels I have. And then that gives me the total number of leaves.

So now, let's analyze the number of cache misses I need at each level of this recursion tree. At the top level, I have a constant number of cache misses-- let's just say 1. At the next level, I have 8, then 64. And then at the leaves, I'm going to have theta of n cubed over B times square root of M cache misses. And I got this quantity just by multiplying the number of leaves by the number of cache misses per leaf.

So the number of leaves is n cubed over M to the 3/2. The number of cache misses per leaf is theta of c M over B. So the c M cancels one factor of M in the denominator, and I'm left with the square root of M at the bottom. And then I also divide by the block size B.

So overall, I get n cubed over B times square root of M cache misses. And again, this is a geometric series. And the number of cache misses at the leaves dominates all of the other levels. So the total number of cache misses I have is going to be theta of n cubed over B times square root of M.

And notice that I'm getting the same number of cache misses as I did with the tiled version of the code. But here, I don't actually have to tune my code for the particular cache size. So what cache sizes does this code work for? So is this code going to work on your machine? Is it going to get good cache performance?

So this code is going to work for all cache sizes, because I didn't tune it for any particular cache size. And this is what's known as a cache-oblivious algorithm. It doesn't have any voodoo tuning parameters, it has no explicit knowledge of the caches, and it's essentially passively auto-tuning itself for the particular cache size of your machine.

It can also work for multi-level caches automatically, because I never specified what level of cache I'm analyzing this for. I can analyze it for any level of cache, and it's still going to give me good cache complexity. And this is also good in multiprogramming environments, where you might have other jobs running and you don't know your effective cache size. This is just going to passively auto-tune for whatever cache size is available.

It turns out that the best cache-oblivious codes to date work on arbitrary rectangular matrices. I just talked about square matrices, but the best codes work on rectangular matrices. And they perform binary splitting instead of eight-way splitting, and they split on the largest of the three dimensions i, j, and k. So this is what the best cache-oblivious matrix multiplication algorithm does. Any questions?

So I only talked about the serial setting so far. I was assuming that these algorithms ran on just a single thread. What happens if I go to multiple processors? It turns out that the results do generalize to a parallel context. So this is the recursive parallel matrix multiply code that we saw before. And notice that we're executing four sub calls in parallel, doing a sync, and then doing four more sub calls in parallel.
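
Here is a sketch of what that parallel recursion can look like in Cilk, using the same quadrant offsets as the serial version; the names are mine and the base case should be coarsened in practice. The four spawned calls in each group write to four distinct quadrants of C, and the sync keeps the two updates to each quadrant from racing.

```c
#include <cilk/cilk.h>
#include <stdint.h>

void mm_dac_par(double *restrict C, const double *restrict A,
                const double *restrict B, int64_t n, int64_t rowSize) {
  if (n == 1) { C[0] += A[0] * B[0]; return; }
  int64_t h = n / 2;
  int64_t q12 = h, q21 = h * rowSize, q22 = h * (rowSize + 1);
  // First group: C11 += A11*B11, C12 += A11*B12, C21 += A21*B11, C22 += A21*B12.
  cilk_spawn mm_dac_par(C,       A,       B,       h, rowSize);
  cilk_spawn mm_dac_par(C + q12, A,       B + q12, h, rowSize);
  cilk_spawn mm_dac_par(C + q21, A + q21, B,       h, rowSize);
             mm_dac_par(C + q22, A + q21, B + q12, h, rowSize);
  cilk_sync;
  // Second group: C11 += A12*B21, C12 += A12*B22, C21 += A22*B21, C22 += A22*B22.
  cilk_spawn mm_dac_par(C,       A + q12, B + q21, h, rowSize);
  cilk_spawn mm_dac_par(C + q12, A + q12, B + q22, h, rowSize);
  cilk_spawn mm_dac_par(C + q21, A + q22, B + q21, h, rowSize);
             mm_dac_par(C + q22, A + q22, B + q22, h, rowSize);
  cilk_sync;
}
```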

So let's try to analyze the number of cache misses in this parallel code. And to do that, we're going to use this theorem, which says the following. Let Q sub p be the number of cache misses in a deterministic Cilk computation when run on P processors, each with a private cache of size M. And let S sub p be the number of successful steals during the computation. In the ideal cache model, the number of cache misses we're going to have is Q sub p equal to Q sub 1 plus big O of the number of steals times M over B. So the number of cache misses in the parallel context is equal to the number of cache misses when you run it serially, plus this term here, which is the number of steals times M over B.

And the proof for this goes as follows. So recall that in the Cilk runtime system, workers steal tasks from other workers when they don't have work to do. And after a worker steals a task from another worker, its cache becomes completely cold in the worst case, because it wasn't actually working on that subproblem before. But after M over B cold cache misses, its cache is going to become identical to what it would be in the serial execution. So we just need to pay M over B cache misses to make it so that the cache looks the same as if it were executing serially.

And the same is true when a worker resumes a stolen subcomputation after a Cilk sync. And the number of times that these two situations can happen is 2 times S sub p-- 2 times the number of steals. And each time, we have to pay M over B cache misses. And this is where this additive term comes from-- order S sub p times M over B.

We also know that the number of steals in a Cilk program is upper-bounded by P times T infinity in expectation, where P is the number of processors and T infinity is the span of your computation. So if you can minimize the span of your computation, then this also gives you a good cache bound. So the moral of the story here is that minimizing the number of cache misses in the serial elision essentially minimizes them in the parallel execution for a low-span algorithm.

So for this recursive matrix multiplication algorithm, the span is as follows-- T infinity of n is 2 T infinity of n over 2 plus theta of 1. Since we're doing a sync here, we have to pay the critical path length of two sub calls, one after the other. This solves to theta of n. And applying the previous theorem, this gives us a cache miss bound of theta of n cubed over B square root of M-- that's just the same as the serial execution-- plus an additive term of order P times n, the span, times M over B.
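
Putting the span recurrence and the theorem together, with the number of steals S sub p being order P times T infinity in expectation:

```latex
T_{\infty}(n) = 2\,T_{\infty}\!\left(\frac{n}{2}\right) + \Theta(1) = \Theta(n),
\qquad
Q_{p} = Q_{1} + O\!\left(S_{p}\,\frac{M}{B}\right)
      = \Theta\!\left(\frac{n^{3}}{B\sqrt{M}}\right) + O\!\left(P\,n\,\frac{M}{B}\right).
```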

So that was a parallel algorithm for matrix multiply. And we saw that we can also get good cache bounds there.

So here's a summary of what we talked about today. We talked about associativity in caches and different ways you can design a cache. We talked about the ideal-cache model that's useful for analyzing algorithms. We talked about cache-aware algorithms that have explicit knowledge of the cache, and the example we used was tiled matrix multiply. Then we came up with a much simpler algorithm that was cache-oblivious, using divide and conquer.

And then on Thursday's lecture, we'll actually see much more on cache-oblivious algorithm design. And then you'll also have an opportunity to analyze the cache efficiency of some algorithms in the next homework.