So, how can the memory system arrange for the right data to be in the right place at the right time? Our goal is to have the frequently-used data in some fast SRAM. That means the memory system will have to be able to predict which memory locations will be accessed. And to keep the overhead of moving data into and out of SRAM manageable, we’d like to amortize the cost of the move over many accesses. In other words, we want any block of data we move into SRAM to be accessed many times.

When not in SRAM, data would live in the larger, slower DRAM that serves as main memory. If the system is working as planned, DRAM accesses would happen infrequently, e.g., only when it’s time to bring another block of data into SRAM.

If we look at how programs access memory, it turns out we *can* make accurate predictions about which memory locations will be accessed. The guiding principle is “locality of reference,” which tells us that if there’s an access to address X at time t, it’s very probable that the program will access a nearby location in the near future.
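As a concrete (hypothetical) illustration of locality, consider the address trace produced by a simple loop summing an array. The base address and word size below are assumptions chosen for the sketch, not values from the lecture; the point is that each access lands one word past the previous one.

```python
# Sketch: the address trace of a simple array-summing loop shows
# spatial locality. Base address and 4-byte word size are assumed.
base = 0x1000                      # hypothetical start address of the array
values = list(range(100))
trace = []
total = 0
for i in range(len(values)):
    trace.append(base + 4 * i)     # each iteration touches the next word...
    total += values[i]

# ...so successive accesses differ by exactly one 4-byte word.
gaps = [b - a for a, b in zip(trace, trace[1:])]
print(all(g == 4 for g in gaps))   # every gap is 4 bytes
```

The same trace also shows temporal locality at the instruction level: the loop body’s instructions are fetched over and over.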
To understand why programs exhibit locality of reference, let’s look at how a running program accesses memory. Instruction fetches are quite predictable. Execution usually proceeds sequentially, since most of the time the next instruction is fetched from the location after that of the current instruction. Code that loops will repeatedly fetch the same sequence of instructions, as shown here on the left of the timeline. There will of course be branches and subroutine calls that interrupt sequential execution, but then we’re back to fetching instructions from consecutive locations. Some programming constructs, e.g., method dispatch in object-oriented languages, can produce scattered references to very short code sequences (as shown on the right of the timeline), but order is quickly restored. This agrees with our intuition about program execution. For example, once we execute the first instruction of a procedure, we’ll almost certainly execute the remaining instructions in the procedure.
So if we arranged for all the code of a procedure to be moved to SRAM when the procedure’s first instruction was fetched, we’d expect that many subsequent instruction fetches could be satisfied by the SRAM. And although fetching the first word of a block from DRAM has a relatively long latency, the DRAM’s fast column accesses will quickly stream the remaining words from sequential addresses. This will amortize the cost of the initial access over the whole sequence of transfers.

The story is similar for accesses by a procedure to its arguments and local variables in the current stack frame. Again there will be many accesses to a small region of memory during the span of time we’re executing the procedure’s code.

Data accesses generated by LD and ST instructions also exhibit locality. The program may be accessing the components of an object or struct. Or it may be stepping through the elements of an array. Sometimes information is moved from one array or data object to another, as shown by the data accesses on the right of the timeline.
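The amortization argument is easy to quantify. The latencies below are illustrative assumptions (they are not taken from a specific DRAM part or from the lecture), but they show how streaming a block makes the per-word cost far smaller than the first-word latency.

```python
# Illustrative, assumed DRAM timings in nanoseconds.
first_word_latency = 50.0   # row activate + first column access (assumed)
burst_word_latency = 5.0    # each subsequent fast column access (assumed)
block_words = 16            # words transferred per block (assumed)

# Total time to stream the whole block, then the amortized cost per word.
total = first_word_latency + (block_words - 1) * burst_word_latency
per_word = total / block_words
print(per_word)   # 7.8125 ns/word, vs. 50 ns if each word were fetched alone
```

So even though the initial access is slow, moving a whole block at once brings the average cost per word close to the fast column-access time.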
Using simulations we can estimate the number of different locations that will be accessed over a particular span of time. What we discover when we do this is the notion of a “working set” of locations that are accessed repeatedly. If we plot the size of the working set as a function of the size of the time interval, we see that the size of the working set levels off. In other words, once the time interval reaches a certain size, the number of locations accessed is approximately the same, independent of when in time the interval occurs.

As we see in our plot to the left, the actual addresses accessed will change, but the number of *different* addresses during the time interval will, on average, remain relatively constant and, surprisingly, not all that large!

This means that if we can arrange for our SRAM to be large enough to hold the working set of the program, most accesses will be satisfied by the SRAM. We’ll occasionally have to move new data into the SRAM and old data back to DRAM, but DRAM accesses will occur much less frequently than SRAM accesses.
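The working-set measurement described above can be sketched in a few lines: slide a fixed-size window over an address trace and count the distinct addresses in each window. The toy trace below (a loop that cycles through eight hypothetical addresses) is an assumption for illustration, not data from the lecture.

```python
def working_set_sizes(trace, window):
    """Count the distinct addresses in each window-sized slice of the trace."""
    sizes = []
    for start in range(0, len(trace) - window + 1, window):
        sizes.append(len(set(trace[start:start + window])))
    return sizes

# Toy trace: a loop repeatedly touching the same 8 word-sized locations.
trace = [0x100 + 4 * (i % 8) for i in range(1000)]
sizes = working_set_sizes(trace, 100)
print(sizes)   # every window sees the same small number of distinct addresses
```

No matter where in the trace the window falls, the count is the same small number, which is exactly the leveling-off behavior the plot shows.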
We’ll work out the mathematics in a slide or two, but you can see that, thanks to locality of reference, we’re on track to build a memory out of a combination of SRAM and DRAM that performs like an SRAM but has the capacity of the DRAM.

The SRAM component of our hierarchical memory system is called a “cache.” It provides low-latency access to recently-accessed blocks of data. If the requested data is in the cache, we have a “cache hit” and the data is supplied by the SRAM. If the requested data is not in the cache, we have a “cache miss” and a block of data containing the requested location will have to be moved from DRAM into the cache. The locality principle tells us that we should expect cache hits to occur much more frequently than cache misses.

Modern computer systems often use multiple levels of SRAM caches. The levels closest to the CPU are smaller but very fast, while the levels further from the CPU are larger and hence slower. A miss at one level of the cache generates an access to the next level, and so on, until a DRAM access is needed to satisfy the initial request.
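The mathematics previewed here is the average memory access time: the hit time plus the miss ratio times the miss penalty. The latencies and hit ratios below are assumed, illustrative numbers, not figures from the lecture.

```python
# Average memory access time (AMAT) under assumed, illustrative numbers.
l1_hit = 1.0          # ns to hit in the fast SRAM cache (assumed)
dram_penalty = 100.0  # extra ns when we must go to DRAM (assumed)
hit_ratio = 0.95      # locality of reference makes this high (assumed)

amat = l1_hit + (1 - hit_ratio) * dram_penalty
print(amat)   # close to SRAM speed, despite DRAM capacity

# With a second cache level, a miss first tries the larger, slower L2:
l2_penalty, l2_hit_ratio = 10.0, 0.9   # also assumed
amat2 = l1_hit + (1 - hit_ratio) * (l2_penalty + (1 - l2_hit_ratio) * dram_penalty)
print(amat2)  # each extra level shaves down the effective miss penalty
```

With these numbers, 95% of accesses run at SRAM speed and the average access time stays within a few nanoseconds of the cache hit time, which is the whole point of the hierarchy.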
Caching is used in many applications to speed up accesses to frequently-accessed data. For example, your browser maintains a cache of frequently-accessed web pages and uses its local copy of a web page if it determines the data is still valid, avoiding the delay of transferring the data over the Internet.

Here’s an example memory hierarchy that might be found on a modern computer. There are three levels of on-chip SRAM caches, followed by DRAM main memory and a flash-memory cache for the hard disk drive. The compiler is responsible for deciding which data values are kept in the CPU registers and which values require the use of LDs and STs. The three-level cache and accesses to DRAM are managed by circuitry in the memory system. After that, the access times are long enough (many hundreds of instruction times) that the job of managing the movement of data between the lower levels of the hierarchy is turned over to software.

Today we’re discussing how the on-chip caches work. In Part 3 of the course, we’ll discuss how the software manages main memory and non-volatile storage devices.
Whether managed by hardware or software, each layer of the memory system is designed to provide lower-latency access to frequently-accessed locations in the next, slower layer. But, as we’ll see, the implementation strategies will be quite different in the slower layers of the hierarchy.