1 00:00:00,530 --> 00:00:06,260 Consider this 2-way set associative cache to be used with our beta processor. 2 00:00:06,260 --> 00:00:14,530 Each line holds a single 32 bit data word together with a valid bit and a tag. 3 00:00:14,530 --> 00:00:17,590 Our beta uses 32 bit addresses. 4 00:00:17,590 --> 00:00:22,810 We want to determine which address bits should be used for the cache index and which should 5 00:00:22,810 --> 00:00:28,420 be used for the tag so as to ensure best cache performance. 6 00:00:28,420 --> 00:00:34,089 The bottom two bits of the address are always assumed to be 00 for word alignment since 7 00:00:34,089 --> 00:00:41,089 our addresses are in bytes but our data is in words of 32 bits or 4 bytes. 8 00:00:41,089 --> 00:00:47,499 Our cache has 8 lines in it, so this means that our index must be 3 bits wide. 9 00:00:47,499 --> 00:00:51,989 The bits that we want to use for the index are the next least significant bits which 10 00:00:51,989 --> 00:00:55,260 are address bits [4:2]. 11 00:00:55,260 --> 00:01:00,329 The reason that we want to make these bits be part of the index instead of the tag is 12 00:01:00,329 --> 00:01:02,309 because of locality. 13 00:01:02,309 --> 00:01:07,930 The idea is that instructions or data that are near each other in memory are more likely 14 00:01:07,930 --> 00:01:14,500 to be accessed around the same time than instructions or data that reside in a different part of 15 00:01:14,500 --> 00:01:15,500 memory. 16 00:01:15,500 --> 00:01:21,980 So if for example, our instruction comes from address 0x1000, it is fairly likely that we 17 00:01:21,980 --> 00:01:28,640 will also access the next instruction which is at address 0x1004. 
18 00:01:28,640 --> 00:01:34,850 With this scheme the index of the first instruction would map to line 0 of the cache while the 19 00:01:34,850 --> 00:01:40,380 next instruction would map to line 1 of the cache so they would not cause a collision 20 00:01:40,380 --> 00:01:43,590 or a miss in the cache. 21 00:01:43,590 --> 00:01:46,990 This leaves the higher order bits for the tag. 22 00:01:46,990 --> 00:01:52,789 We need to use all of the remaining bits for the tag in order to be able to uniquely identify 23 00:01:52,789 --> 00:01:55,240 each distinct address. 24 00:01:55,240 --> 00:02:00,460 Since many addresses will map to the same line in the cache, we must compare the tag 25 00:02:00,460 --> 00:02:05,390 of the data in the cache to see if we have in fact found the data that we are looking 26 00:02:05,390 --> 00:02:06,650 for. 27 00:02:06,650 --> 00:02:12,590 So we use address bits [31:5] for the tag. 28 00:02:12,590 --> 00:02:18,470 Suppose our beta executes a read of address 0x5678. 29 00:02:18,470 --> 00:02:23,620 We would like to identify which locations in the cache will need to be checked to determine 30 00:02:23,620 --> 00:02:27,860 if our data is already present in the cache or not. 31 00:02:27,860 --> 00:02:33,210 In order to determine this, we need to identify the portion of the address that corresponds 32 00:02:33,210 --> 00:02:34,959 to the index. 33 00:02:34,959 --> 00:02:43,769 The index bits are bits [4:2] which correspond to 110 in binary for this address. 34 00:02:43,769 --> 00:02:49,060 That means that this address would map to cache line 6 in our cache. 35 00:02:49,060 --> 00:02:54,040 Since this is a 2-way set associative cache, there are two possible locations that our 36 00:02:54,040 --> 00:02:59,980 data could be located, either in 6A or 6B. 37 00:02:59,980 --> 00:03:06,239 So we would need to compare both tags to determine whether or not the data we are trying to read 38 00:03:06,239 --> 00:03:12,439 is already in the cache. 
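The address breakdown described above (bits [1:0] for word alignment, bits [4:2] for the index, bits [31:5] for the tag) can be sketched in Python. This is only an illustrative sketch: the function name `split_address` is hypothetical and not part of the lecture material.

```python
def split_address(addr):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    offset = addr & 0b11         # bits [1:0], 00 for word-aligned accesses
    index = (addr >> 2) & 0b111  # bits [4:2], selects one of the 8 cache lines
    tag = addr >> 5              # bits [31:5], compared against the stored tags
    return tag, index, offset


# The read of address 0x5678 from the walkthrough.
tag, index, offset = split_address(0x5678)
print(f"tag={tag:#x} index={index:#05b} (line {index}) offset={offset}")

# Neighboring instructions land in different lines, so they do not collide.
for addr in (0x1000, 0x1004):
    print(f"{addr:#x} -> line {split_address(addr)[1]}")
```

For 0x5678 the index comes out as 6, so both ways of line 6 (6A and 6B) have to be checked against the tag.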
39 00:03:12,439 --> 00:03:17,549 Assuming that checking the cache on a read takes 1 cycle, and that refilling the cache 40 00:03:17,549 --> 00:03:23,379 on a miss takes an additional 8 cycles, this means that the time it takes on a miss 41 00:03:23,379 --> 00:03:29,439 is 9 cycles, 1 to first check if the value is in the cache, plus another 8 to bring the 42 00:03:29,439 --> 00:03:33,549 value into the cache if it wasn't already there. 43 00:03:33,549 --> 00:03:40,450 Now suppose that we want to achieve an average read access time of 1.1 cycles. 44 00:03:40,450 --> 00:03:46,079 What is the minimum hit ratio required to achieve this average access time over many 45 00:03:46,079 --> 00:03:47,680 reads? 46 00:03:47,680 --> 00:03:55,650 We know that average access time = (hit time * hit rate) + (miss time * miss rate). 47 00:03:55,650 --> 00:04:01,500 If we call 'a' our hit rate, then our miss rate is (1-a). 48 00:04:01,500 --> 00:04:11,330 So our desired average access time of 1.1 must equal 1 * a plus 9 * (1-a). 49 00:04:11,330 --> 00:04:22,670 This reduces to 1.1 = 9-8a, which means that 8a = 7.9 or a = 7.9/8. 50 00:04:22,670 --> 00:04:33,990 So to achieve a 1.1 cycle average access time our hit rate must be at least 7.9/8. 51 00:04:33,990 --> 00:04:40,030 We are provided with this benchmark program for testing our 2-way set associative cache. 52 00:04:40,030 --> 00:04:44,090 The cache is initially empty before execution begins. 53 00:04:44,090 --> 00:04:49,240 In other words, all the valid bits of the cache are 0. 54 00:04:49,240 --> 00:04:55,210 Assuming that an LRU, or least recently used, replacement strategy is used, we would like 55 00:04:55,210 --> 00:05:01,210 to determine the approximate cache hit ratio for this program. 56 00:05:01,210 --> 00:05:05,330 Let's begin by understanding what this benchmark does. 57 00:05:05,330 --> 00:05:07,430 The program begins at address 0. 
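As a quick check on the hit ratio calculation above, the equation average = hit time * a + miss time * (1-a) can be solved for a. The sketch below assumes the cycle counts given in the walkthrough (1 cycle on a hit, 9 on a miss); the function name `min_hit_ratio` is hypothetical.

```python
def min_hit_ratio(target_avg, hit_time, miss_time):
    """Solve target_avg = hit_time*a + miss_time*(1-a) for the hit ratio a."""
    return (miss_time - target_avg) / (miss_time - hit_time)


# 1 cycle to check the cache, plus 8 more to refill on a miss (9 total).
a = min_hit_ratio(target_avg=1.1, hit_time=1, miss_time=1 + 8)
print(f"minimum hit ratio: {a:.4f}")
```

With these numbers the result is 7.9/8, or about 0.9875.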
58 00:05:07,430 --> 00:05:14,370 It first performs some initialization of registers using three CMOVE operations. 59 00:05:14,370 --> 00:05:20,500 The first initializes R0 to hold source which is the address in memory where our data will 60 00:05:20,500 --> 00:05:22,440 be stored. 61 00:05:22,440 --> 00:05:31,380 The second initializes R1 to 0, and the third initializes R2 to 0x1000 which is the number 62 00:05:31,380 --> 00:05:35,150 of words that this benchmark will work with. 63 00:05:35,150 --> 00:05:39,070 We then enter the loop which is shown in the yellow rectangle. 64 00:05:39,070 --> 00:05:47,140 The loop loads the first element of our data from location source + 0 into register R3. 65 00:05:47,140 --> 00:05:51,759 It then increments R0 to point to the next piece of data. 66 00:05:51,759 --> 00:05:57,970 Since our data is 32 bits wide, this requires the addition of the constant 4 representing 67 00:05:57,970 --> 00:06:01,479 the number of bytes between consecutive data words. 68 00:06:01,479 --> 00:06:07,740 It then takes the value that was just loaded and adds it to R1 which holds a running sum 69 00:06:07,740 --> 00:06:10,750 of all the data seen so far. 70 00:06:10,750 --> 00:06:18,240 R2 is then decremented by 1 to indicate that we have one fewer piece of data to handle. 71 00:06:18,240 --> 00:06:25,630 Finally, as long as R2 is not equal to 0 it repeats the loop. 72 00:06:25,630 --> 00:06:31,710 At the very end of the benchmark the final sum is stored at address source, and the program 73 00:06:31,710 --> 00:06:34,250 halts. 74 00:06:34,250 --> 00:06:39,480 When trying to determine the approximate hit ratio, the instructions that occur only once 75 00:06:39,480 --> 00:06:45,520 because they live outside of the loop can basically be ignored. 76 00:06:45,520 --> 00:06:51,060 So looking only at what happens over and over again in the loop, each time through the loop, 77 00:06:51,060 --> 00:06:55,409 we have 5 instruction fetches and one data fetch. 
78 00:06:55,409 --> 00:06:59,840 The first time through the loop, we miss on the instruction fetches and then bring them 79 00:06:59,840 --> 00:07:01,750 into the cache. 80 00:07:01,750 --> 00:07:04,970 We also miss on the data load from address 0x100. 81 00:07:04,970 --> 00:07:11,900 When this data is brought into the cache, instead of replacing the recently loaded instructions, 82 00:07:11,900 --> 00:07:16,400 it loads the data into the 2nd set of the cache. 83 00:07:16,400 --> 00:07:18,340 The loop is then repeated. 84 00:07:18,340 --> 00:07:22,539 This time through the loop, all the instruction fetches result in hits. 85 00:07:22,539 --> 00:07:28,319 However, the data that we now need is a new piece of data so that will result in a cache 86 00:07:28,319 --> 00:07:34,949 miss and once again load the new data word into the 2nd set of the cache. 87 00:07:34,949 --> 00:07:39,159 This behavior then repeats itself every time through the loop. 88 00:07:39,159 --> 00:07:44,340 Since the loop is executed many times, we can also ignore the initial instruction fetches 89 00:07:44,340 --> 00:07:47,300 on the first iteration of the loop. 90 00:07:47,300 --> 00:07:54,229 So in steady state, we get 5 instruction cache hits, and 1 data cache miss every time through 91 00:07:54,229 --> 00:07:56,680 the loop. 92 00:07:56,680 --> 00:08:04,740 This means that our approximate hit ratio is 5/6. 93 00:08:04,740 --> 00:08:09,870 The last question we want to consider is what is stored in the cache after execution of 94 00:08:09,870 --> 00:08:12,800 this benchmark is completed. 95 00:08:12,800 --> 00:08:18,330 As we saw earlier, because we have a 2-way set associative cache, the instructions and 96 00:08:18,330 --> 00:08:24,039 data don't conflict with each other because they can each go in a separate set. 
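The 5/6 estimate can be checked with a small simulation. This is a minimal sketch, not the course's own tool: the access trace is reconstructed from the narration (three setup instructions at addresses 0x0 to 0x8, five loop instructions at 0xC to 0x1C, data starting at 0x100), the instructions after the loop are left out, and the names `simulate` and `benchmark_trace` are hypothetical.

```python
NUM_LINES = 8  # cache lines, selected by address bits [4:2]
WAYS = 2       # 2-way set associative


def simulate(trace):
    """Run a read trace through the cache with LRU replacement.

    Each cache line is a list of tags ordered least recently used first.
    Returns (number of hits, final cache contents).
    """
    lines = [[] for _ in range(NUM_LINES)]
    hits = 0
    for addr in trace:
        index = (addr >> 2) & 0b111
        tag = addr >> 5
        ways = lines[index]
        if tag in ways:
            hits += 1
            ways.remove(tag)   # re-appended below as most recently used
        elif len(ways) == WAYS:
            ways.pop(0)        # evict the least recently used entry
        ways.append(tag)
    return hits, lines


def benchmark_trace(n_words=0x1000, source=0x100):
    """Read addresses issued by the benchmark (assumed layout, see above)."""
    trace = [0x0, 0x4, 0x8]                     # three setup instructions
    for i in range(n_words):
        trace.append(0xC)                       # fetch the load instruction
        trace.append(source + 4 * i)            # data read performed by the load
        trace.extend([0x10, 0x14, 0x18, 0x1C])  # remaining loop instructions
    return trace


trace = benchmark_trace()
hits, lines = simulate(trace)
print(f"hit ratio = {hits}/{len(trace)} = {hits / len(trace):.4f}")

# Addresses left in line 4 once the loop has finished.
line4 = sorted((tag << 5) | (4 << 2) for tag in lines[4])
print("line 4 holds:", [hex(a) for a in line4])
```

Under these assumptions the simulated hit ratio sits just below 5/6, the gap coming from the one-time misses that the estimate ignores.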
97 00:08:24,039 --> 00:08:28,750 We want to determine which instruction and which piece of data will end up in line 4 98 00:08:28,750 --> 00:08:33,350 of the cache after execution is completed. 99 00:08:33,350 --> 00:08:38,590 We'll begin by identifying the mapping of instructions to cache lines. 100 00:08:38,590 --> 00:08:44,830 Since our program begins at address 0, the first CMOVE instruction is at address 0, and 101 00:08:44,830 --> 00:08:52,290 its index is equal to 0b000, or 0 in binary. 102 00:08:52,290 --> 00:08:55,540 This means that it will map to cache line 0. 103 00:08:55,540 --> 00:09:01,750 Since at this point, nothing is in the cache, it will be loaded into line 0 of set A. 104 00:09:01,750 --> 00:09:07,900 In a similar manner the next 2 CMOVE instructions and the LD instruction will be loaded into 105 00:09:07,900 --> 00:09:14,320 lines 1-3 of set A. At this point, we begin loading data. 106 00:09:14,320 --> 00:09:19,480 Since the cache is 2-way set associative, the data will be loaded into set B instead 107 00:09:19,480 --> 00:09:23,790 of removing the instructions that were loaded into set A. 108 00:09:23,790 --> 00:09:28,600 The instructions that are outside the loop will end up getting taken out of set A in 109 00:09:28,600 --> 00:09:34,060 favor of loading a data item into those cache locations, but the instructions that make 110 00:09:34,060 --> 00:09:40,880 up the loop will not be displaced because every time something maps to cache lines 3-7, 111 00:09:40,880 --> 00:09:46,690 the least recently used location will correspond to a data value not to the instructions which 112 00:09:46,690 --> 00:09:49,660 are used over and over again. 113 00:09:49,660 --> 00:09:53,960 This means that at the end of execution of the benchmark, the instruction that will be 114 00:09:53,960 --> 00:10:04,110 in line 4 of the cache is the ADDC instruction from address 0x10 of the program. 
115 00:10:04,110 --> 00:10:09,450 The loop instructions which will remain in the cache are shown here. 116 00:10:09,450 --> 00:10:13,500 Now let's consider what happens to the data used in this benchmark. 117 00:10:13,500 --> 00:10:19,410 We expect the loop instructions to remain in lines 3-7 of set A of the cache. 118 00:10:19,410 --> 00:10:26,460 The data will use all of set B plus locations 0-2 of set A as needed. 119 00:10:26,460 --> 00:10:29,230 The data begins at address 0x100. 120 00:10:29,230 --> 00:10:41,940 The index portion of address 0x100 is 0b000 so this data element maps to cache line 0. 121 00:10:41,940 --> 00:10:48,080 Since the least recently used set for line 0 is set B, it will go into set B leaving 122 00:10:48,080 --> 00:10:57,000 the instructions intact in set A. The next data element is at address 0x104. 123 00:10:57,000 --> 00:11:02,510 Since the bottom two bits are used for word alignment, the index portion of this address 124 00:11:02,510 --> 00:11:14,350 is 0b001, so this data element maps to line 1 of set B, and so on through element 0x7. 125 00:11:14,350 --> 00:11:20,140 Data element 0x8 is at address 0x120. 126 00:11:20,140 --> 00:11:25,710 The index portion of this address is once again 0b000. 127 00:11:25,710 --> 00:11:32,190 So this element maps to cache line 0 just like element 0x0 did. 128 00:11:32,190 --> 00:11:38,930 However, at this point line 0 of set B was accessed more recently than line 0 of set 129 00:11:38,930 --> 00:11:46,310 A, so the CMOVE instruction which is executed only once will get replaced by a data element 130 00:11:46,310 --> 00:11:49,770 that maps to line 0. 131 00:11:49,770 --> 00:11:54,890 As we mentioned earlier, all the instructions in the loop will not be displaced because 132 00:11:54,890 --> 00:12:00,570 they are accessed over and over again so they never end up being the least recently used 133 00:12:00,570 --> 00:12:03,850 item in a cache line. 
134 00:12:03,850 --> 00:12:11,320 After executing the loop 16 times meaning that data elements 0 through 0xF have been 135 00:12:11,320 --> 00:12:14,290 accessed, the cache will look like this. 136 00:12:14,290 --> 00:12:19,700 Note that the loop instructions continue in their original location in the cache. 137 00:12:19,700 --> 00:12:25,330 The state of the cache continues to look like this with more recently accessed data elements 138 00:12:25,330 --> 00:12:28,580 replacing the earlier data elements. 139 00:12:28,580 --> 00:12:36,790 Since there are 0x1000 elements of data, this continues until just before address 0x4100. 140 00:12:36,790 --> 00:12:46,960 The last element is actually located 1 word before that at address 0x40FC. 141 00:12:46,960 --> 00:12:53,130 The last 8 elements of the data and their mapping to cache lines are shown here. 142 00:12:53,130 --> 00:12:59,230 The data element that ends up in line 4 of the cache, when the benchmark is done executing, 143 00:12:59,230 --> 00:13:09,680 is element 0x0FFC of the data which comes from address 0x40F0.
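The mapping of the last 8 data elements to cache lines can be reproduced with a little address arithmetic. This sketch assumes the values given in the walkthrough (data starting at 0x100, 0x1000 words of 4 bytes each) and only tracks the most recent data element to touch each line; the helper names are hypothetical.

```python
SOURCE = 0x100      # byte address of data element 0
NUM_WORDS = 0x1000  # number of 32-bit data words
WORD_BYTES = 4


def element_address(i):
    """Byte address of data element i."""
    return SOURCE + WORD_BYTES * i


def cache_line(addr):
    """Cache line selected by address bits [4:2]."""
    return (addr >> 2) & 0b111


# Consecutive words map to consecutive lines, so the final 8 elements
# are the last ones to touch each of the 8 lines.
final_data = {
    cache_line(element_address(i)): i
    for i in range(NUM_WORDS - 8, NUM_WORDS)
}

for line in sorted(final_data):
    i = final_data[line]
    print(f"line {line}: element {i:#06x} at address {element_address(i):#06x}")
```

For line 4 this gives element 0xFFC at address 0x40F0, and the last element overall, 0xFFF, sits at address 0x40FC.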