A conceptual schematic for a multicore processor is shown below. To reduce the average memory access time, each of the four cores has its own cache, which will satisfy most memory requests. If there's a cache miss, a request is sent to the shared main memory. With a modest number of cores and a good cache hit ratio, the number of memory requests that must access main memory during normal operation should be pretty small.

To keep the number of main-memory accesses to a minimum, the caches implement a write-back strategy, where ST instructions update the cache, but main memory is only updated when a dirty cache line is replaced.

Our goal is that each core should share the contents of main memory, i.e., changes made by one core should be visible to all the other cores. In the example shown here, core 0 is running Thread A and core 1 is running Thread B. Both threads reference two shared memory locations holding the values for the variables X and Y. The current values of X and Y are 1 and 2, respectively. Those values are held in main memory as well as being cached by each core.

What happens when the threads are executed? Each thread executes independently, updating its cache during stores to X and Y. For any possible execution order, either concurrent or sequential, the result is the same: Thread A prints "2", Thread B prints "1". Hardware engineers would point to the consistent outcomes and declare victory!

But closer examination of the final system state reveals some problems. After execution is complete, the two cores disagree on the values of X and Y. Threads running on core 0 will see X=3 and Y=2. Threads running on core 1 will see X=1 and Y=4. Because of the caches, the system isn't behaving as if there's a single shared memory. On the other hand, we can't eliminate the caches, since that would cause the average memory access time to skyrocket, ruining any hoped-for performance improvement from using multiple cores.

What outcome should we expect? One plausible standard of correctness is the outcome when the threads are run on a single timeshared core. The argument would be that a multicore implementation should produce the same outcome, but more quickly, with parallel execution replacing timesharing. The table shows the possible results of the timesharing experiment, where the outcome depends on the order in which the statements are executed.
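The threads' code appears only on the slide, but the values described above pin it down: Thread A stores 3 into X and then prints Y; Thread B stores 4 into Y and then prints X. Here's a minimal sketch in C (the pthread scaffolding is an assumption for illustration, and the unsynchronized shared accesses are deliberate: they model the naive shared-memory program under discussion, not correct portable C):

```c
#include <stdio.h>
#include <pthread.h>

/* Shared variables with the initial values from the example. */
int X = 1;
int Y = 2;

/* Thread A: store 3 into X, then print Y. */
void *thread_a(void *arg) {
    X = 3;
    printf("A: %d\n", Y);  /* prints 2 if it sees only its own cached copies */
    return NULL;
}

/* Thread B: store 4 into Y, then print X. */
void *thread_b(void *arg) {
    Y = 4;
    printf("B: %d\n", X);  /* prints 1 if it sees only its own cached copies */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

Enumerating the six legal interleavings of those four statements against a single shared memory yields only three print pairs: (A prints 2, B prints 3), (A prints 4, B prints 3), and (A prints 4, B prints 1).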
Programmers will understand that there is more than one possible outcome, and know that they would have to impose additional constraints on execution order, say, using semaphores, if they wanted a specific outcome.

Notice that the multicore outcome of 2,1 doesn't appear anywhere on the list of possible outcomes from sequential timeshared execution. The notion that executing N threads in parallel should correspond to some interleaved execution of those threads on a single core is called "sequential consistency". If multicore systems implement sequential consistency, then programmers can think of the systems as providing hardware-accelerated timesharing.

So our simple multicore system fails on two counts. First, it doesn't correctly implement a shared memory since, as we've seen, it's possible for the two cores to disagree about the current value of a shared variable. Second, as a consequence of the first problem, the system doesn't implement sequential consistency. Clearly, we'll need to figure out a fix!

One possible fix is to give up on sequential consistency. An alternative memory semantics is "weak consistency", which only requires that the memory operations from each thread appear to be performed in the order issued by that thread. In other words, in a weakly consistent system, if a particular thread writes to X and then writes to Y, the possible outcomes from reads of X and Y by any thread would be (unchanged X, unchanged Y), (changed X, unchanged Y), or (changed X, changed Y). But no thread would see changed Y but unchanged X. In a weakly consistent system, memory operations from different threads may overlap in arbitrary ways (not necessarily consistent with any sequential interleaving).

Note that our multicore cache system doesn't itself guarantee even weak consistency. A thread that executes "write X; write Y" will update its local cache, but later cache replacements may cause the updated Y value to be written to main memory before the updated X value. To implement weak consistency, the thread should be modified to "write X; communicate changes to all other processors; write Y". In the next section, we'll discuss how to modify the caches to perform the required communication automatically.
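As a software-level analogue of that "communicate changes" step (an assumption for illustration; the lecture's actual fix lives in the cache hardware), the C11 memory model lets a thread publish its write to X before its write to Y using a pair of fences:

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 1;
atomic_int Y = 2;

void writer(void) {
    atomic_store_explicit(&X, 3, memory_order_relaxed);
    /* Release fence: the store to X above becomes visible to other
       threads no later than the store to Y below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&Y, 4, memory_order_relaxed);
}

void reader(void) {
    int y = atomic_load_explicit(&Y, memory_order_relaxed);
    /* Acquire fence: pairs with the release fence in writer(). */
    atomic_thread_fence(memory_order_acquire);
    int x = atomic_load_explicit(&X, memory_order_relaxed);
    /* If y is 4 (changed Y), then x must be 3: the disallowed outcome
       (changed Y, unchanged X) cannot be observed. */
    printf("X=%d Y=%d\n", x, y);
}
```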
Out-of-order cores have an extra complication, since there's no guarantee that successive ST instructions will complete in the order they appeared in the program. These architectures provide a BARRIER instruction that guarantees that memory operations before the BARRIER are completed before memory operations executed after the BARRIER; a sketch of how such a barrier looks to the programmer appears below.

There are many types of memory consistency; each commercially available multicore system has its own particular guarantees about what happens when. So the prudent programmer needs to read the ISA manual carefully to ensure that her program will do what she wants. See the referenced PDF file for a very readable discussion of memory semantics in multicore systems.
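For instance, here is one assumed, concrete way a BARRIER can be placed between two stores. The inline assembly targets RISC-V, whose full memory barrier is the FENCE instruction; other ISAs have equivalents (e.g., DMB on ARM, MFENCE on x86), and the portable C11 spelling is atomic_thread_fence(memory_order_seq_cst):

```c
/* Hypothetical sketch: force the store to X to complete before the
   store to Y, even on an out-of-order core.  Compiles for RISC-V
   targets with GCC/Clang inline assembly. */
void ordered_stores(volatile int *x, volatile int *y) {
    *x = 3;
    /* The BARRIER: order all prior reads/writes before all later ones. */
    __asm__ volatile("fence rw, rw" ::: "memory");
    *y = 4;
}
```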