1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high-quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:21,725 --> 00:00:23,350 JULIAN SHUN: Today, we're going to talk 9 00:00:23,350 --> 00:00:26,810 about multicore programming. 10 00:00:26,810 --> 00:00:30,730 And as I was just informed by Charles, it's 2018. 11 00:00:30,730 --> 00:00:35,110 I had 2017 on the slide. 12 00:00:35,110 --> 00:00:40,380 So first, congratulations to all of you. 13 00:00:40,380 --> 00:00:45,850 You turned in the first project's data. 14 00:00:45,850 --> 00:00:50,140 Here's a plot showing the tiers that different groups reached 15 00:00:50,140 --> 00:00:51,460 for the beta. 16 00:00:51,460 --> 00:00:53,650 And this is in sorted order. 17 00:00:53,650 --> 00:00:57,910 And we set the beta cutoff to be tier 45. 18 00:00:57,910 --> 00:00:59,860 The final cutoff is tier 48. 19 00:00:59,860 --> 00:01:03,550 So the final cutoff we did set a little bit aggressively, 20 00:01:03,550 --> 00:01:06,100 but keep in mind that you don't necessarily 21 00:01:06,100 --> 00:01:08,380 have to get to the final cutoff in order 22 00:01:08,380 --> 00:01:10,300 to get an A on this project. 23 00:01:14,260 --> 00:01:18,540 So we're going to talk about multicore processing today. 24 00:01:18,540 --> 00:01:20,830 That's going to be the topic of the next project 25 00:01:20,830 --> 00:01:24,160 after you finish the first project. 26 00:01:24,160 --> 00:01:27,070 So in a multicore processor, we have a whole bunch 27 00:01:27,070 --> 00:01:30,760 of cores that are all placed on the same chip, 28 00:01:30,760 --> 00:01:34,450 and they have access to shared memory. 29 00:01:34,450 --> 00:01:38,590 They usually also have some sort of private cache, and then 30 00:01:38,590 --> 00:01:41,950 a shared last level cache, so L3, in this case. 31 00:01:41,950 --> 00:01:44,990 And then they all have access the same memory controller, 32 00:01:44,990 --> 00:01:46,390 which goes out to main memory. 33 00:01:46,390 --> 00:01:49,960 And then they also have access to I/O. 34 00:01:49,960 --> 00:01:54,820 But for a very long time, chips only had a single core on them. 35 00:01:54,820 --> 00:01:58,240 So why do we have multicore processors nowadays? 36 00:01:58,240 --> 00:02:00,640 Why did semiconductor vendors start 37 00:02:00,640 --> 00:02:02,800 producing chips that had multiple processor 38 00:02:02,800 --> 00:02:03,580 cores on them? 39 00:02:06,880 --> 00:02:10,100 So the answer is because of two things. 40 00:02:10,100 --> 00:02:12,880 So first, there's Moore's Law, which 41 00:02:12,880 --> 00:02:16,720 says that we get more transistors every year. 42 00:02:16,720 --> 00:02:19,030 So the number of transistors that you can fit on a chip 43 00:02:19,030 --> 00:02:21,490 doubles approximately every two years. 44 00:02:21,490 --> 00:02:25,340 And secondly, there's the end of scaling of clock frequency. 45 00:02:25,340 --> 00:02:27,040 So for a very long time, we could just 46 00:02:27,040 --> 00:02:32,140 keep increasing the frequency of the single core on the chip. 
47 00:02:32,140 --> 00:02:37,330 But at around 2004 to 2005, that was no longer the case. 48 00:02:37,330 --> 00:02:42,530 We couldn't scale the clock frequency anymore. 49 00:02:42,530 --> 00:02:46,820 So here's a plot showing both the number of transistors 50 00:02:46,820 --> 00:02:48,740 you could fit on the chip over time, 51 00:02:48,740 --> 00:02:52,110 as well as the clock frequency of the processors over time. 52 00:02:52,110 --> 00:02:55,730 And notice that the y-axis is in log scale here. 53 00:02:55,730 --> 00:02:58,730 And the blue line is basically Moore's Law, 54 00:02:58,730 --> 00:03:00,860 which says that the number of transistors 55 00:03:00,860 --> 00:03:04,050 you can fit on a chip doubles approximately every two years. 56 00:03:04,050 --> 00:03:06,350 And that's been growing pretty steadily. 57 00:03:06,350 --> 00:03:09,470 So this plot goes up to 2010, but in fact, it's 58 00:03:09,470 --> 00:03:11,320 been growing even up until the present. 59 00:03:11,320 --> 00:03:13,310 And it will continue to grow for a couple 60 00:03:13,310 --> 00:03:16,670 more years before Moore's Law ends. 61 00:03:16,670 --> 00:03:19,980 However, if you look at the clock frequency line, 62 00:03:19,980 --> 00:03:22,700 you see that it was growing quite 63 00:03:22,700 --> 00:03:26,720 steadily until about the early 2000s, and then at that point, 64 00:03:26,720 --> 00:03:28,460 it flattened out. 65 00:03:32,580 --> 00:03:36,620 So at that point, we couldn't increase the clock frequencies 66 00:03:36,620 --> 00:03:38,960 anymore, and the clock speed was bounded 67 00:03:38,960 --> 00:03:40,880 at about four gigahertz. 68 00:03:40,880 --> 00:03:42,960 So nowadays, if you go buy a processor, 69 00:03:42,960 --> 00:03:46,820 it's usually still bounded by around 4 gigahertz. 70 00:03:46,820 --> 00:03:49,210 It's usually a little bit less than 4 gigahertz, 71 00:03:49,210 --> 00:03:51,710 because it doesn't really make sense to push it all the way. 72 00:03:51,710 --> 00:03:55,280 But you might find some processors 73 00:03:55,280 --> 00:04:00,170 that are around 4 gigahertz nowadays. 74 00:04:00,170 --> 00:04:03,710 So what happened at around 2004 to 2005? 75 00:04:03,710 --> 00:04:05,150 Does anyone know? 76 00:04:13,720 --> 00:04:15,360 So Moore's Law basically says that we 77 00:04:15,360 --> 00:04:17,970 can fit more transistors on a chip 78 00:04:17,970 --> 00:04:20,730 because the transistors become smaller. 79 00:04:20,730 --> 00:04:23,700 And when the transistors become smaller, 80 00:04:23,700 --> 00:04:25,260 you can reduce the voltage that's 81 00:04:25,260 --> 00:04:27,390 needed to operate the transistors. 82 00:04:27,390 --> 00:04:30,570 And as a result, you can increase the clock frequency 83 00:04:30,570 --> 00:04:33,210 while maintaining the same power density. 84 00:04:33,210 --> 00:04:37,890 And that's what manufacturers did until about 2004 to 2005. 85 00:04:37,890 --> 00:04:39,900 They just kept increasing the clock frequency 86 00:04:39,900 --> 00:04:42,240 to take advantage of Moore's law. 87 00:04:42,240 --> 00:04:44,310 But it turns out that once transistors become 88 00:04:44,310 --> 00:04:46,890 small enough, and the voltage used 89 00:04:46,890 --> 00:04:50,430 to operate them becomes small enough, 90 00:04:50,430 --> 00:04:52,170 there's something called leakage current. 
91 00:04:52,170 --> 00:04:55,070 So there's current that leaks, and we're 92 00:04:55,070 --> 00:04:58,080 unable to keep reducing the voltage while still having 93 00:04:58,080 --> 00:05:00,510 reliable switching. 94 00:05:00,510 --> 00:05:03,250 And if you can't reduce the voltage anymore, 95 00:05:03,250 --> 00:05:07,133 then you can't increase the clock frequency 96 00:05:07,133 --> 00:05:08,925 if you want to keep the same power density. 97 00:05:13,280 --> 00:05:17,840 So here's a plot from Intel back in 2004 98 00:05:17,840 --> 00:05:22,040 when they first started producing multicore processors. 99 00:05:22,040 --> 00:05:25,220 And this is plotting the power density versus time. 100 00:05:25,220 --> 00:05:29,490 And again, the y-axis is in log scale here. 101 00:05:29,490 --> 00:05:32,120 So the green data points are actual data points, 102 00:05:32,120 --> 00:05:34,790 and the orange ones are projected. 103 00:05:34,790 --> 00:05:38,660 And they projected what the power density 104 00:05:38,660 --> 00:05:40,850 would be if we kept increasing the clock 105 00:05:40,850 --> 00:05:46,260 frequency at a trend of about 25% to 30% per year, 106 00:05:46,260 --> 00:05:50,540 which is what happened up until around 2004. 107 00:05:50,540 --> 00:05:53,330 And because we couldn't reduce the voltage anymore, 108 00:05:53,330 --> 00:05:57,050 the power density will go up. 109 00:05:57,050 --> 00:05:59,120 And you can see that eventually, it 110 00:05:59,120 --> 00:06:02,540 reaches the power density of a nuclear reactor, which 111 00:06:02,540 --> 00:06:05,510 is pretty hot. 112 00:06:05,510 --> 00:06:08,312 And then it reaches the power density of a rocket nozzle, 113 00:06:08,312 --> 00:06:09,770 and eventually you get to the power 114 00:06:09,770 --> 00:06:13,380 density of the sun's surface. 115 00:06:13,380 --> 00:06:17,750 So if you have a chip that has a power density 116 00:06:17,750 --> 00:06:19,580 equal to the sun's surface-- 117 00:06:19,580 --> 00:06:22,335 well, you don't actually really have a chip anymore. 118 00:06:25,970 --> 00:06:28,310 So basically if you get into this orange region, 119 00:06:28,310 --> 00:06:30,650 you basically have a fire, and you can't really 120 00:06:30,650 --> 00:06:33,020 do anything interesting, in terms of performance 121 00:06:33,020 --> 00:06:36,230 engineering, at that point. 122 00:06:36,230 --> 00:06:43,640 So to solve this problem, semiconductor vendors 123 00:06:43,640 --> 00:06:47,690 didn't increased the clock frequency anymore, 124 00:06:47,690 --> 00:06:50,150 but we still had Moore's Law giving us 125 00:06:50,150 --> 00:06:52,370 more and more transistors every year. 126 00:06:52,370 --> 00:06:55,880 So what they decided to do with these extra transistors 127 00:06:55,880 --> 00:06:59,480 was to put them into multiple cores, 128 00:06:59,480 --> 00:07:02,280 and then put multiple cores on the same chip. 129 00:07:02,280 --> 00:07:05,420 So we can see that, starting at around 2004, 130 00:07:05,420 --> 00:07:10,820 the number of cores per chip becomes more than one. 131 00:07:13,800 --> 00:07:15,860 And each generation of Moore's Law 132 00:07:15,860 --> 00:07:17,870 will potentially double the number of cores 133 00:07:17,870 --> 00:07:20,480 that you can fit on a chip, because it's doubling 134 00:07:20,480 --> 00:07:21,920 the number of transistors. 135 00:07:21,920 --> 00:07:26,090 And we've seen this trend up until about today. 
136 00:07:26,090 --> 00:07:29,030 And again, it's going to continue for a couple 137 00:07:29,030 --> 00:07:31,970 more years before Moore's Law ends. 138 00:07:31,970 --> 00:07:37,160 So that's why we have chips with multiple cores today. 139 00:07:37,160 --> 00:07:42,000 So today, we're going to look at multicore processing. 140 00:07:42,000 --> 00:07:44,780 So I first want to introduce the abstract multicore 141 00:07:44,780 --> 00:07:45,540 architecture. 142 00:07:45,540 --> 00:07:48,290 So this is a very simplified version, 143 00:07:48,290 --> 00:07:52,700 but I can fit it on this slide, and it's a good example 144 00:07:52,700 --> 00:07:53,630 for illustration. 145 00:07:53,630 --> 00:07:57,140 So here, we have a whole bunch of processors. 146 00:07:57,140 --> 00:07:59,600 They each have a cache, so that's 147 00:07:59,600 --> 00:08:02,570 indicated with the dollar sign. 148 00:08:02,570 --> 00:08:05,390 And usually they have a private cache as well as 149 00:08:05,390 --> 00:08:09,500 a shared cache, so a shared last level cache, like the L3 cache. 150 00:08:09,500 --> 00:08:13,220 And then they're all connected to the network. 151 00:08:13,220 --> 00:08:15,440 And then, through the network, they 152 00:08:15,440 --> 00:08:17,580 can connect to the main memory. 153 00:08:17,580 --> 00:08:21,050 They can all access the same shared memory. 154 00:08:21,050 --> 00:08:23,517 And then usually there's a separate network for the I/O 155 00:08:23,517 --> 00:08:26,100 as well, even though I've drawn them as a single network here, 156 00:08:26,100 --> 00:08:28,400 so they can access the I/O interface. 157 00:08:28,400 --> 00:08:30,110 And potentially, the network will also 158 00:08:30,110 --> 00:08:35,780 connect to other multiprocessors on the same system. 159 00:08:35,780 --> 00:08:37,789 And this abstract multicore architecture 160 00:08:37,789 --> 00:08:41,570 is known as a chip multiprocessor, or CMP. 161 00:08:41,570 --> 00:08:44,179 So that's the architecture that we'll be looking at today. 162 00:08:48,940 --> 00:08:51,860 So here's an outline of today's lecture. 163 00:08:51,860 --> 00:08:57,120 So first, I'm going to go over some hardware challenges 164 00:08:57,120 --> 00:09:00,630 with shared memory multicore machines. 165 00:09:00,630 --> 00:09:05,460 So we're going to look at the cache coherence protocol. 166 00:09:05,460 --> 00:09:07,530 And then after looking at hardware, 167 00:09:07,530 --> 00:09:11,460 we're going to look at some software solutions 168 00:09:11,460 --> 00:09:14,310 to write parallel programs on these multicore machines 169 00:09:14,310 --> 00:09:17,343 to take advantage of the extra cores. 170 00:09:17,343 --> 00:09:19,260 And we're going to look at several concurrency 171 00:09:19,260 --> 00:09:21,180 platforms listed here. 172 00:09:21,180 --> 00:09:23,490 We're going to look at Pthreads. 173 00:09:23,490 --> 00:09:25,620 This is basically a low-level API 174 00:09:25,620 --> 00:09:31,240 for accessing, or for running your code in parallel. 175 00:09:31,240 --> 00:09:34,080 And if you program on Microsoft products, 176 00:09:34,080 --> 00:09:36,900 the Win API threads is pretty similar. 177 00:09:36,900 --> 00:09:39,190 Then there's Intel Threading Building Blocks, 178 00:09:39,190 --> 00:09:42,180 which is a library solution to concurrency. 179 00:09:42,180 --> 00:09:44,070 And then there are two linguistic solutions 180 00:09:44,070 --> 00:09:45,153 that we'll be looking at-- 181 00:09:45,153 --> 00:09:48,090 OpenMP and Cilk Plus. 
182 00:09:48,090 --> 00:09:51,660 And Cilk Plus is actually the concurrency platform 183 00:09:51,660 --> 00:09:54,060 that we'll be using for most of this class. 184 00:10:06,995 --> 00:10:12,110 So let's look at how caches work. 185 00:10:12,110 --> 00:10:16,160 So let's say that we have a value in memory 186 00:10:16,160 --> 00:10:19,820 at some location, and that value is-- 187 00:10:19,820 --> 00:10:25,350 let's say that value is x equals 3. 188 00:10:25,350 --> 00:10:27,750 If one processor says, we want to load 189 00:10:27,750 --> 00:10:31,580 x, what happens is that processor reads 190 00:10:31,580 --> 00:10:35,850 this value from a main memory, brings it into its own cache, 191 00:10:35,850 --> 00:10:38,370 and then it also reads the value, loads it 192 00:10:38,370 --> 00:10:40,740 into one of its registers. 193 00:10:40,740 --> 00:10:42,900 And it keeps this value in cache so 194 00:10:42,900 --> 00:10:46,812 that if it wants to access this value again in the near future, 195 00:10:46,812 --> 00:10:49,020 it doesn't have to go all the way out to main memory. 196 00:10:49,020 --> 00:10:52,740 It can just look at the value in its cache. 197 00:10:52,740 --> 00:10:58,173 Now, what happens if another processor wants to load x? 198 00:10:58,173 --> 00:10:59,590 Well, it just does the same thing. 199 00:10:59,590 --> 00:11:01,200 It reads the value from main memory, 200 00:11:01,200 --> 00:11:03,750 brings it into its cache, and then also loads it 201 00:11:03,750 --> 00:11:07,380 into one of the registers. 202 00:11:07,380 --> 00:11:10,820 And then same thing with another processor. 203 00:11:10,820 --> 00:11:12,360 It turns out that you don't actually 204 00:11:12,360 --> 00:11:15,360 always have to go out to main memory to get the value. 205 00:11:15,360 --> 00:11:19,050 If the value resides in one of the other processor's caches, 206 00:11:19,050 --> 00:11:22,590 you can also get the value through the other processor's 207 00:11:22,590 --> 00:11:23,370 cache. 208 00:11:23,370 --> 00:11:25,980 And sometimes that's cheaper than going all the way out 209 00:11:25,980 --> 00:11:27,390 to main memory. 210 00:11:33,940 --> 00:11:35,848 So the second processor now loads x again. 211 00:11:35,848 --> 00:11:37,390 And it's in cache, so it doesn't have 212 00:11:37,390 --> 00:11:41,140 to go to main memory or anybody else's cache. 213 00:11:41,140 --> 00:11:44,140 So what happens now if we want to store 214 00:11:44,140 --> 00:11:48,830 x, if we want to set the value of x to something else? 215 00:11:48,830 --> 00:11:54,650 So let's say this processor wants to set x equal to 5. 216 00:11:54,650 --> 00:11:57,150 So it's going to write x equals 5 217 00:11:57,150 --> 00:12:00,300 and store that result in its own cache. 218 00:12:00,300 --> 00:12:01,680 So that's all well and good. 219 00:12:05,460 --> 00:12:09,480 Now what happens when the first processor wants to load x? 220 00:12:09,480 --> 00:12:14,380 Well, it seems that the value of x is in its own cache, 221 00:12:14,380 --> 00:12:16,560 so it's just going to read the value of x there, 222 00:12:16,560 --> 00:12:19,740 and it gets a value of 3. 223 00:12:19,740 --> 00:12:21,060 So what's the problem there? 224 00:12:28,080 --> 00:12:28,580 Yes? 225 00:12:28,580 --> 00:12:29,980 AUDIENCE: The path is stale. 226 00:12:29,980 --> 00:12:30,730 JULIAN SHUN: Yeah. 
227 00:12:30,730 --> 00:12:34,670 So the problem is that the value of x in the first processor's 228 00:12:34,670 --> 00:12:38,480 cache is stale, because another processor updated it. 229 00:12:38,480 --> 00:12:42,240 So now this value of x in the first processor's cache 230 00:12:42,240 --> 00:12:42,740 is invalid. 231 00:12:46,200 --> 00:12:48,180 So that's the problem. 232 00:12:48,180 --> 00:12:51,180 And one of the main challenges of multicore hardware 233 00:12:51,180 --> 00:12:54,570 is to try to solve this problem of cache coherence-- 234 00:12:54,570 --> 00:12:59,460 making sure that the values in different processors' caches 235 00:12:59,460 --> 00:13:01,785 are consistent across updates. 236 00:13:06,630 --> 00:13:11,580 So one basic protocol for solving this problem 237 00:13:11,580 --> 00:13:14,640 is known as the MSI protocol. 238 00:13:14,640 --> 00:13:19,010 And in this protocol, each cache line is labeled with a state. 239 00:13:19,010 --> 00:13:20,510 So there are three possible states-- 240 00:13:20,510 --> 00:13:25,260 M, S, and I. And this is done on the granularity of cache lines. 241 00:13:25,260 --> 00:13:28,458 Because it turns out that storing this information 242 00:13:28,458 --> 00:13:30,000 is relatively expensive, so you don't 243 00:13:30,000 --> 00:13:31,792 want to store it for every memory location. 244 00:13:31,792 --> 00:13:35,820 So they do it on a per cache line basis. 245 00:13:35,820 --> 00:13:38,130 Does anyone know what the size of a cache line 246 00:13:38,130 --> 00:13:39,990 is, on the machines that we're using? 247 00:13:47,090 --> 00:13:47,590 Yeah? 248 00:13:47,590 --> 00:13:49,030 AUDIENCE: 64 bytes. 249 00:13:49,030 --> 00:13:51,890 JULIAN SHUN: Yeah, so it's 64 bytes. 250 00:13:51,890 --> 00:13:56,510 And that's typically what you see today on most Intel and AMD 251 00:13:56,510 --> 00:13:57,710 machines. 252 00:13:57,710 --> 00:14:00,650 There's some architectures that have different cache lines, 253 00:14:00,650 --> 00:14:01,970 like 128 bytes. 254 00:14:01,970 --> 00:14:04,310 But for our class, the machines that we're using 255 00:14:04,310 --> 00:14:06,380 will have 64 byte cache lines. 256 00:14:06,380 --> 00:14:09,380 It's important to remember that so that when you're doing 257 00:14:09,380 --> 00:14:10,940 back-of-the-envelope calculations, 258 00:14:10,940 --> 00:14:14,120 you can get accurate estimates. 259 00:14:14,120 --> 00:14:18,050 So the three states in the MSI protocol are M, S, and I. 260 00:14:18,050 --> 00:14:20,600 So M stands for modified. 261 00:14:20,600 --> 00:14:23,030 And when a cache block is in the modified state, 262 00:14:23,030 --> 00:14:25,760 that means no other caches can contain this block 263 00:14:25,760 --> 00:14:29,040 in the M or the S states. 264 00:14:29,040 --> 00:14:32,090 The S state means that the block is shared, 265 00:14:32,090 --> 00:14:36,960 so other caches can also have this block in shared state. 266 00:14:36,960 --> 00:14:40,190 And then finally, I mean the cache block is invalid. 267 00:14:40,190 --> 00:14:42,800 So that's essentially the same as the cache block 268 00:14:42,800 --> 00:14:45,980 not being in the cache. 269 00:14:45,980 --> 00:14:49,370 And to solve the problem of cache coherency, when 270 00:14:49,370 --> 00:14:51,840 one cache modifies a location, it 271 00:14:51,840 --> 00:14:55,490 has to inform all the other caches 272 00:14:55,490 --> 00:15:00,200 that their values are now stale, because this cache modified 273 00:15:00,200 --> 00:15:01,760 the value. 
274 00:15:01,760 --> 00:15:04,430 So it's going to invalidate all of the other copies 275 00:15:04,430 --> 00:15:07,010 of that cache line in other caches 276 00:15:07,010 --> 00:15:13,130 by changing their state from S to I. 277 00:15:13,130 --> 00:15:14,370 So let's see how this works. 278 00:15:14,370 --> 00:15:18,530 So let's say that the second processor wants to store y 279 00:15:18,530 --> 00:15:19,100 equals 5. 280 00:15:19,100 --> 00:15:23,360 So previously, a value of y was 17, and it was in shared state. 281 00:15:23,360 --> 00:15:27,320 The cache line containing y equals 17 was in shared state. 282 00:15:27,320 --> 00:15:30,710 So now, when I do y equals 5, I'm 283 00:15:30,710 --> 00:15:36,440 going to set the second processor's cache-- 284 00:15:36,440 --> 00:15:39,170 that cache line-- to modified state. 285 00:15:39,170 --> 00:15:41,540 And then I'm going to invalidate the cache 286 00:15:41,540 --> 00:15:44,820 line in all of the other caches that contain that cache line. 287 00:15:44,820 --> 00:15:48,230 So now the first cache and the fourth cache 288 00:15:48,230 --> 00:15:51,710 each have a state of I for y equals 17, 289 00:15:51,710 --> 00:15:53,976 because that value is stale. 290 00:15:53,976 --> 00:15:57,390 Is there any questions? 291 00:15:57,390 --> 00:15:58,075 Yes? 292 00:15:58,075 --> 00:16:01,237 AUDIENCE: If we already have to tell the other things to switch 293 00:16:01,237 --> 00:16:05,013 to invalid, why not just tell them the value of y? 294 00:16:05,013 --> 00:16:06,680 JULIAN SHUN: Yeah, so there are actually 295 00:16:06,680 --> 00:16:08,390 some protocols that do that. 296 00:16:08,390 --> 00:16:11,690 So this is just the most basic protocol. 297 00:16:11,690 --> 00:16:13,250 So this protocol doesn't do it. 298 00:16:13,250 --> 00:16:15,800 But there are some that are used in practice 299 00:16:15,800 --> 00:16:17,720 that actually do do that. 300 00:16:17,720 --> 00:16:20,800 So it's a good point. 301 00:16:20,800 --> 00:16:24,350 But I just want to present the most basic protocol for now. 302 00:16:29,400 --> 00:16:29,900 Sorry. 303 00:16:32,770 --> 00:16:35,140 And then, when you load a value, you 304 00:16:35,140 --> 00:16:40,720 can first check whether your cache line is in M or S state. 305 00:16:40,720 --> 00:16:42,790 And if it is an M or S state, then you 306 00:16:42,790 --> 00:16:45,980 can just read that value directly. 307 00:16:45,980 --> 00:16:49,480 But if it's in the I state, or if it's not there, 308 00:16:49,480 --> 00:16:51,430 then you have to fetch that block 309 00:16:51,430 --> 00:16:53,980 from either another processor's cache 310 00:16:53,980 --> 00:16:58,250 or fetch it from main memory. 311 00:16:58,250 --> 00:17:03,130 So it turns out that there are many other protocols out there. 312 00:17:03,130 --> 00:17:08,050 There's something known as MESI, the messy protocol. 313 00:17:08,050 --> 00:17:11,980 There's also MOESI and many other different protocols. 314 00:17:11,980 --> 00:17:13,720 And some of them are proprietary. 315 00:17:13,720 --> 00:17:17,319 And they all do different things. 316 00:17:17,319 --> 00:17:19,480 And it turns out that all of these protocols 317 00:17:19,480 --> 00:17:21,880 are quite complicated, and it's very hard 318 00:17:21,880 --> 00:17:25,119 to get these protocols right. 
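To make the MSI transitions described above concrete, here is a minimal C++ sketch, not from the lecture, that simulates the state of one cache line across four caches. The names line, store, and load are made up for illustration; real hardware implements these transitions in the coherence controller, not in software.

#include <cstdio>

// Illustration only: the MSI states for a single cache line shared by four caches.
enum State { M, S, I };              // Modified, Shared, Invalid
const int NCACHES = 4;
State line[NCACHES] = { I, I, I, I };

// A store makes the writer's copy Modified and invalidates every other copy.
void store(int p) {
    for (int q = 0; q < NCACHES; ++q) line[q] = I;
    line[p] = M;
}

// A load hits if the local copy is in M or S; otherwise the block is fetched
// (from another cache or from memory), any Modified copy elsewhere is
// downgraded to Shared, and the local copy becomes Shared.
void load(int p) {
    if (line[p] == M || line[p] == S) return;   // cache hit
    for (int q = 0; q < NCACHES; ++q)
        if (line[q] == M) line[q] = S;
    line[p] = S;
}

int main() {
    load(0); load(1);   // both copies end up Shared
    store(1);           // cache 1 -> M, cache 0's stale copy -> I
    load(0);            // cache 0 refetches; both copies Shared again
    for (int q = 0; q < NCACHES; ++q)
        printf("cache %d: %c\n", q, "MSI"[line[q]]);
    return 0;
}

Running the sequence in main, cache 0's copy is invalidated by the store and refetched by the final load, which mirrors the stale-x example above.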
319 00:17:25,119 --> 00:17:27,910 And in fact, one of the earliest successes 320 00:17:27,910 --> 00:17:31,300 of formal verification was proving some of these cache 321 00:17:31,300 --> 00:17:34,210 coherence protocols to be correct. 322 00:17:34,210 --> 00:17:35,020 Yes, question? 323 00:17:35,020 --> 00:17:37,558 AUDIENCE: What happens if two processors try to modify 324 00:17:37,558 --> 00:17:40,310 one value at the same time? 325 00:17:40,310 --> 00:17:42,220 JULIAN SHUN: Yeah, so if two processors 326 00:17:42,220 --> 00:17:45,243 try to modify the value, one of them has to happen first. 327 00:17:45,243 --> 00:17:47,160 So the hardware is going to take care of that. 328 00:17:47,160 --> 00:17:49,750 So the first one that actually modifies 329 00:17:49,750 --> 00:17:51,730 it will invalidate all the other copies, 330 00:17:51,730 --> 00:17:54,100 and then the second one that modifies the value 331 00:17:54,100 --> 00:17:56,530 will again invalidate all of the other copies. 332 00:17:56,530 --> 00:17:58,810 And when you do that-- 333 00:17:58,810 --> 00:18:01,720 when a lot of processors try to modify the same value, 334 00:18:01,720 --> 00:18:04,150 you get something known as an invalidation storm. 335 00:18:04,150 --> 00:18:06,430 So you have a bunch of invalidation messages 336 00:18:06,430 --> 00:18:09,340 going throughout the hardware. 337 00:18:09,340 --> 00:18:11,590 And that can lead to a big performance bottleneck. 338 00:18:11,590 --> 00:18:14,840 Because each processor, when it modifies its value, 339 00:18:14,840 --> 00:18:17,188 it has to inform all the other processors. 340 00:18:17,188 --> 00:18:19,480 And if all the processors are modifying the same value, 341 00:18:19,480 --> 00:18:22,343 you get this sort of quadratic behavior. 342 00:18:22,343 --> 00:18:24,010 The hardware is still going to guarantee 343 00:18:24,010 --> 00:18:26,590 that one of the processors is going to end up 344 00:18:26,590 --> 00:18:27,590 writing the value there. 345 00:18:27,590 --> 00:18:30,400 But you should be aware of this performance issue 346 00:18:30,400 --> 00:18:33,130 when you're writing parallel code. 347 00:18:33,130 --> 00:18:33,995 Yes? 348 00:18:33,995 --> 00:18:35,657 AUDIENCE: So all of this protocol stuff 349 00:18:35,657 --> 00:18:37,320 happens in hardware? 350 00:18:37,320 --> 00:18:40,250 JULIAN SHUN: Yes, so this is all implemented in hardware. 351 00:18:40,250 --> 00:18:42,880 So if you take a computer architecture class, 352 00:18:42,880 --> 00:18:46,030 you'll learn much more about these protocols and all 353 00:18:46,030 --> 00:18:48,400 of their variants. 354 00:18:48,400 --> 00:18:51,880 So for our purposes, we don't actually 355 00:18:51,880 --> 00:18:54,890 need to understand all the details of the hardware. 356 00:18:54,890 --> 00:18:57,800 We just need to understand what it's doing at a high level 357 00:18:57,800 --> 00:19:02,600 so we can understand when we have a performance bottleneck 358 00:19:02,600 --> 00:19:04,450 and why we have a performance bottleneck. 359 00:19:04,450 --> 00:19:06,730 So that's why I'm just introducing the most 360 00:19:06,730 --> 00:19:07,990 basic protocol here. 361 00:19:14,770 --> 00:19:15,990 Any other questions? 362 00:19:21,030 --> 00:19:26,320 So I talked a little bit about the shared memory hardware. 363 00:19:26,320 --> 00:19:30,070 Let's now look at some concurrency platforms. 364 00:19:30,070 --> 00:19:35,880 So these are the four platforms that we'll be looking at today.
365 00:19:35,880 --> 00:19:40,000 So first, what is a concurrency platform? 366 00:19:40,000 --> 00:19:44,250 Well, writing parallel programs is very difficult. 367 00:19:44,250 --> 00:19:46,793 It's very hard to get these programs to be correct. 368 00:19:46,793 --> 00:19:48,710 And if you want to optimize their performance, 369 00:19:48,710 --> 00:19:50,230 it becomes even harder. 370 00:19:50,230 --> 00:19:52,260 So it's very painful and error-prone. 371 00:19:52,260 --> 00:19:55,610 And a concurrency platform abstracts processor 372 00:19:55,610 --> 00:19:57,710 cores and handles synchronization 373 00:19:57,710 --> 00:19:59,720 and communication protocols. 374 00:19:59,720 --> 00:20:01,860 And it also performs load balancing for you. 375 00:20:01,860 --> 00:20:05,000 So it makes your lives much easier. 376 00:20:05,000 --> 00:20:08,660 And so today we're going to talk about some 377 00:20:08,660 --> 00:20:14,240 of these different concurrency platforms. 378 00:20:14,240 --> 00:20:16,730 So to illustrate these concurrency platforms, 379 00:20:16,730 --> 00:20:20,990 I'm going to do the Fibonacci numbers example. 380 00:20:20,990 --> 00:20:23,870 So does anybody not know what Fibonacci is? 381 00:20:27,840 --> 00:20:28,350 So good. 382 00:20:28,350 --> 00:20:30,270 Everybody knows what Fibonacci is. 383 00:20:33,100 --> 00:20:36,480 So it's a sequence where each number is the sum 384 00:20:36,480 --> 00:20:37,770 of the previous two numbers. 385 00:20:37,770 --> 00:20:43,860 And the recurrence is shown in this brown box here. 386 00:20:43,860 --> 00:20:50,010 The sequence is named after Leonardo di Pisa, who was also 387 00:20:50,010 --> 00:20:54,240 known as Fibonacci, which is a contraction of Bonacci, 388 00:20:54,240 --> 00:20:55,950 son of Bonaccio. 389 00:20:55,950 --> 00:20:58,830 So that's where the name Fibonacci came from. 390 00:20:58,830 --> 00:21:03,970 And in Fibonacce's 1202 book, Liber Abaci, 391 00:21:03,970 --> 00:21:06,660 he introduced the sequence-- 392 00:21:06,660 --> 00:21:10,710 the Fibonacci sequence-- to Western mathematics, 393 00:21:10,710 --> 00:21:12,990 although it had been previously known 394 00:21:12,990 --> 00:21:19,260 to Indian mathematicians for several centuries. 395 00:21:19,260 --> 00:21:21,960 But this is what we call the sequence nowadays-- 396 00:21:21,960 --> 00:21:22,950 Fibonacci numbers. 397 00:21:25,840 --> 00:21:31,090 So here's a Fibonacci program. 398 00:21:31,090 --> 00:21:33,160 Has anyone seen this algorithm before? 399 00:21:36,590 --> 00:21:39,570 A couple of people. 400 00:21:39,570 --> 00:21:41,880 Probably more, but people didn't raise their hands. 401 00:21:45,810 --> 00:21:49,410 So it's a recursive program. 402 00:21:49,410 --> 00:21:51,930 So it basically implements the recurrence 403 00:21:51,930 --> 00:21:53,260 from the previous slide. 404 00:21:53,260 --> 00:21:56,400 So if n is less than 2, we just return n. 405 00:21:56,400 --> 00:21:58,880 Otherwise, we compute fib of n minus 1, 406 00:21:58,880 --> 00:22:03,180 store that value in x, fib of n minus 2, store that value in y, 407 00:22:03,180 --> 00:22:05,040 and then return the sum of x and y. 408 00:22:10,560 --> 00:22:12,100 So I do want to make a disclaimer 409 00:22:12,100 --> 00:22:14,410 to the algorithms police that this is actually 410 00:22:14,410 --> 00:22:16,480 a very bad algorithm. 
411 00:22:16,480 --> 00:22:20,650 So this algorithm takes exponential time, 412 00:22:20,650 --> 00:22:22,240 and there's actually much better ways 413 00:22:22,240 --> 00:22:25,010 to compute the end Fibonacci number. 414 00:22:25,010 --> 00:22:27,535 There's a linear time algorithm, which 415 00:22:27,535 --> 00:22:31,720 just computes the Fibonacci numbers from bottom up. 416 00:22:31,720 --> 00:22:34,360 This algorithm here is actually redoing a lot of the work, 417 00:22:34,360 --> 00:22:39,610 because it's computing Fibonacci numbers multiple times. 418 00:22:39,610 --> 00:22:43,450 Whereas if you do a linear scan from the smallest numbers up, 419 00:22:43,450 --> 00:22:45,350 you only have to compute each one once. 420 00:22:45,350 --> 00:22:47,500 And there's actually an even better algorithm 421 00:22:47,500 --> 00:22:50,980 that takes logarithmic time, and it's 422 00:22:50,980 --> 00:22:52,370 based on squaring matrices. 423 00:22:52,370 --> 00:22:57,280 So has anyone seen that algorithm before? 424 00:22:57,280 --> 00:22:59,020 So a couple of people. 425 00:22:59,020 --> 00:23:00,855 So if you're interested in learning more 426 00:23:00,855 --> 00:23:02,230 about this algorithm, I encourage 427 00:23:02,230 --> 00:23:05,230 you to look at your favorite textbook, Introduction 428 00:23:05,230 --> 00:23:09,140 to Algorithms by Cormen, Leiserson, Rivest, and Stein. 429 00:23:11,675 --> 00:23:12,550 So even though this-- 430 00:23:12,550 --> 00:23:13,520 [LAUGHTER] 431 00:23:13,520 --> 00:23:15,850 Yes. 432 00:23:15,850 --> 00:23:19,540 So even though this is a pretty bad algorithm, 433 00:23:19,540 --> 00:23:22,060 it's still a good educational example, 434 00:23:22,060 --> 00:23:24,400 because I can fit it on one slide 435 00:23:24,400 --> 00:23:28,450 and illustrate all the concepts of parallelism 436 00:23:28,450 --> 00:23:31,820 that we want to cover today. 437 00:23:31,820 --> 00:23:36,610 So here's the execution tree for fib of 4. 438 00:23:36,610 --> 00:23:41,380 So we see that fib of 4 is going to call fib of 3 and fib of 2. 439 00:23:41,380 --> 00:23:45,560 Fib of 3 is going to call fib of 2, fib of 1, and so on. 440 00:23:45,560 --> 00:23:47,560 And you can see that repeated computations here. 441 00:23:47,560 --> 00:23:52,460 So fib of 2 is being computed twice, and so on. 442 00:23:52,460 --> 00:23:55,000 And if you have a much larger tree-- 443 00:23:55,000 --> 00:23:57,460 say you ran this on fib of 40-- then 444 00:23:57,460 --> 00:24:00,550 you'll have many more overlapping computations. 445 00:24:04,310 --> 00:24:09,860 It turns out that the two recursive calls can actually 446 00:24:09,860 --> 00:24:12,260 be parallelized, because they're completely independent 447 00:24:12,260 --> 00:24:13,710 calculations. 448 00:24:13,710 --> 00:24:16,160 So the key idea for parallelization 449 00:24:16,160 --> 00:24:22,100 is to simultaneously execute the two recursive sub-calls to fib. 450 00:24:22,100 --> 00:24:24,170 And in fact, you can do this recursively. 451 00:24:24,170 --> 00:24:27,860 So the two sub-calls to fib of 3 can also 452 00:24:27,860 --> 00:24:30,890 be executed in parallel, and the two sub-calls of fib of 2 453 00:24:30,890 --> 00:24:33,060 can also be executed in parallel, and so on. 454 00:24:33,060 --> 00:24:35,900 So you have all of these calls that 455 00:24:35,900 --> 00:24:38,020 can be executed in parallel. 456 00:24:38,020 --> 00:24:41,390 So that's the key idea for extracting parallelism 457 00:24:41,390 --> 00:24:42,410 from this algorithm. 
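For reference, here is the recursive code being described, written out in C++, along with one possible version of the linear-time bottom-up algorithm mentioned above. The recursive fib matches the recurrence on the slide; fib_linear is an illustrative name and implementation, not code shown in lecture.

#include <cstdint>

// The exponential-time recursive version described on the slide. The two
// recursive calls are independent, which is the parallelism that the rest of
// the lecture exploits.
int64_t fib(int64_t n) {
    if (n < 2) return n;
    int64_t x = fib(n - 1);
    int64_t y = fib(n - 2);
    return x + y;
}

// One possible linear-time bottom-up version (not shown in lecture): each
// Fibonacci number is computed exactly once.
int64_t fib_linear(int64_t n) {
    int64_t a = 0, b = 1;                 // fib(0) and fib(1)
    for (int64_t i = 0; i < n; ++i) {
        int64_t next = a + b;
        a = b;
        b = next;
    }
    return a;                             // a now holds fib(n)
}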
458 00:24:45,980 --> 00:24:48,890 So let's now look at how we can use 459 00:24:48,890 --> 00:24:54,072 Pthreads to implement this simple Fibonacci algorithm. 460 00:24:56,730 --> 00:25:00,480 So Pthreads is a standard API for threading, 461 00:25:00,480 --> 00:25:04,800 and it's supported on all Unix-based machines. 462 00:25:04,800 --> 00:25:08,670 And if you're programming using Microsoft products, 463 00:25:08,670 --> 00:25:12,900 then the equivalent is Win API threads. 464 00:25:12,900 --> 00:25:18,450 And Pthreads is actually standard in ANSI and IEEE, 465 00:25:18,450 --> 00:25:21,570 so there's this number here that specifies the standard. 466 00:25:21,570 --> 00:25:24,070 But nowadays, we just call it Pthreads. 467 00:25:24,070 --> 00:25:26,070 And it's basically a do-it-yourself concurrency 468 00:25:26,070 --> 00:25:26,670 platform. 469 00:25:26,670 --> 00:25:29,190 So it's like the assembly language 470 00:25:29,190 --> 00:25:31,500 of parallel programming. 471 00:25:31,500 --> 00:25:33,570 It's built as a library of functions 472 00:25:33,570 --> 00:25:36,900 with special non-C semantics. 473 00:25:36,900 --> 00:25:39,240 Because if you're just writing code in C, 474 00:25:39,240 --> 00:25:42,508 you can't really say which parts of the code 475 00:25:42,508 --> 00:25:43,800 should be executed in parallel. 476 00:25:43,800 --> 00:25:45,990 So Pthreads provides you a library 477 00:25:45,990 --> 00:25:49,800 of functions that allow you to specify concurrency 478 00:25:49,800 --> 00:25:52,290 in your program. 479 00:25:52,290 --> 00:25:56,640 And each thread implements an abstraction of a processor, 480 00:25:56,640 --> 00:25:58,920 and these threads are then multiplexed 481 00:25:58,920 --> 00:26:02,040 onto the actual machine resources. 482 00:26:02,040 --> 00:26:04,590 So the number of threads that you create 483 00:26:04,590 --> 00:26:07,320 doesn't necessarily have to match the number of processors 484 00:26:07,320 --> 00:26:09,400 you have on your machine. 485 00:26:09,400 --> 00:26:12,690 So if you have more threads than the number of processors 486 00:26:12,690 --> 00:26:14,790 you have, then they'll just be multiplexing. 487 00:26:14,790 --> 00:26:17,400 So you can actually run a Pthreads program 488 00:26:17,400 --> 00:26:21,090 on a single core even though you have multiple threads 489 00:26:21,090 --> 00:26:21,930 in the program. 490 00:26:21,930 --> 00:26:25,560 They would just be time-sharing. 491 00:26:25,560 --> 00:26:28,590 All the threads communicate through shared memory, 492 00:26:28,590 --> 00:26:32,400 so they all have access to the same view of the memory. 493 00:26:32,400 --> 00:26:35,995 And the library functions that Pthreads provides mask 494 00:26:35,995 --> 00:26:40,170 the protocols involved in interthread coordination, 495 00:26:40,170 --> 00:26:41,670 so you don't have to do it yourself. 496 00:26:41,670 --> 00:26:44,880 Because it turns out that this is quite difficult to 497 00:26:44,880 --> 00:26:46,350 do correctly by hand. 498 00:26:48,930 --> 00:26:52,990 So now I want to look at the key Pthread functions. 499 00:26:52,990 --> 00:26:56,610 So the first Pthread is pthread_create. 500 00:26:56,610 --> 00:26:59,380 And this takes four arguments. 501 00:26:59,380 --> 00:27:04,350 So the first argument is this pthread_t type. 
502 00:27:07,210 --> 00:27:09,420 This is basically going to store an identifier 503 00:27:09,420 --> 00:27:12,000 for the new thread that pthread_create 504 00:27:12,000 --> 00:27:14,880 will create so that we can use that thread 505 00:27:14,880 --> 00:27:17,640 in our computations. 506 00:27:17,640 --> 00:27:23,670 pthread_attr_t-- this set some thread attributes, 507 00:27:23,670 --> 00:27:26,330 and for our purposes, we can just set it to null and use 508 00:27:26,330 --> 00:27:29,460 the default attributes. 509 00:27:29,460 --> 00:27:32,430 The third argument is this function 510 00:27:32,430 --> 00:27:36,180 that's going to be executed after we create the thread. 511 00:27:36,180 --> 00:27:38,430 So we're going to need to define this function that we 512 00:27:38,430 --> 00:27:40,800 want the thread to execute. 513 00:27:40,800 --> 00:27:46,170 And then finally, we have this void *arg argument, 514 00:27:46,170 --> 00:27:48,960 which stores the arguments that are going to be passed 515 00:27:48,960 --> 00:27:53,430 to the function that we're going to be executing. 516 00:27:53,430 --> 00:27:57,220 And then pthread_create also returns an error status, 517 00:27:57,220 --> 00:28:00,370 returns an integer specifying whether the thread creation 518 00:28:00,370 --> 00:28:03,190 was successful or not. 519 00:28:03,190 --> 00:28:06,760 And then there's another function called pthread_join. 520 00:28:06,760 --> 00:28:09,640 pthread_join basically says that we 521 00:28:09,640 --> 00:28:15,820 want to block at this part of our code 522 00:28:15,820 --> 00:28:18,010 until this specified thread finishes. 523 00:28:18,010 --> 00:28:21,760 So it takes as argument pthread_t. 524 00:28:21,760 --> 00:28:24,430 So this thread identifier, and these thread identifiers, 525 00:28:24,430 --> 00:28:29,016 were created when we called pthread_create. 526 00:28:29,016 --> 00:28:31,990 It also has a second argument, status, 527 00:28:31,990 --> 00:28:34,090 which is going to store the status 528 00:28:34,090 --> 00:28:37,020 of the terminating thread. 529 00:28:37,020 --> 00:28:39,400 And then pthread_join also returns an error status. 530 00:28:39,400 --> 00:28:41,020 So essentially what this does is it 531 00:28:41,020 --> 00:28:46,230 says to wait until this thread finishes before we continue on 532 00:28:46,230 --> 00:28:46,855 in our program. 533 00:28:49,960 --> 00:28:51,770 So any questions so far? 534 00:29:00,900 --> 00:29:03,780 So here's what the implementation of Fibonacci 535 00:29:03,780 --> 00:29:07,350 looks like using Pthreads. 536 00:29:07,350 --> 00:29:12,330 So on the left, we see the original program that we had, 537 00:29:12,330 --> 00:29:13,590 the fib function there. 538 00:29:13,590 --> 00:29:16,830 That's just the sequential code. 539 00:29:16,830 --> 00:29:19,200 And then we have all this other stuff 540 00:29:19,200 --> 00:29:22,300 to enable it to run in parallel. 541 00:29:22,300 --> 00:29:26,880 So first, we have this struct on the left, thread_args. 542 00:29:26,880 --> 00:29:30,690 This struct here is used to store the arguments that 543 00:29:30,690 --> 00:29:35,430 are passed to the function that the thread is going to execute. 544 00:29:35,430 --> 00:29:38,160 And then we have this thread_func. 545 00:29:38,160 --> 00:29:42,540 What that does is it reads the input 546 00:29:42,540 --> 00:29:45,840 argument from this thread_args struct, 547 00:29:45,840 --> 00:29:49,950 and then it sets that to i, and then it calls fib of i. 
548 00:29:49,950 --> 00:29:52,410 And that gives you the output, and then we store the result 549 00:29:52,410 --> 00:29:54,540 into the output of the struct. 550 00:29:57,640 --> 00:30:00,475 And then that also just returns null. 551 00:30:03,000 --> 00:30:04,820 And then over on the right hand side, 552 00:30:04,820 --> 00:30:08,930 we have the main function that will actually call the fib 553 00:30:08,930 --> 00:30:10,580 function on the left. 554 00:30:10,580 --> 00:30:15,260 So we initialize a whole bunch of variables 555 00:30:15,260 --> 00:30:19,640 that we need to execute these threads. 556 00:30:19,640 --> 00:30:23,370 And then we first check if n is less than 30. 557 00:30:23,370 --> 00:30:24,950 If n is less than 30, it turns out 558 00:30:24,950 --> 00:30:27,620 that it's actually not worth creating threads 559 00:30:27,620 --> 00:30:29,660 to execute this program in parallel, because 560 00:30:29,660 --> 00:30:31,370 of the overhead of thread creation. 561 00:30:31,370 --> 00:30:34,280 So if n is less than 30, we'll just execute the program 562 00:30:34,280 --> 00:30:36,860 sequentially. 563 00:30:36,860 --> 00:30:39,030 And this idea is known as coarsening. 564 00:30:39,030 --> 00:30:42,470 So you saw a similar example a couple of lectures 565 00:30:42,470 --> 00:30:45,270 ago when we did coarsening for sorting. 566 00:30:45,270 --> 00:30:47,840 But this is in the context of a parallel programming. 567 00:30:47,840 --> 00:30:50,660 So here, because there are some overheads 568 00:30:50,660 --> 00:30:53,330 to running a function in parallel, 569 00:30:53,330 --> 00:30:55,250 if the input size is small enough, 570 00:30:55,250 --> 00:30:57,710 sometimes you want to just execute it sequentially. 571 00:31:00,230 --> 00:31:02,990 And then we're going to-- 572 00:31:02,990 --> 00:31:04,820 so let me just walk through this code, 573 00:31:04,820 --> 00:31:06,800 since I have an animation. 574 00:31:10,160 --> 00:31:12,020 So the next thing it's going to do 575 00:31:12,020 --> 00:31:14,900 is it's going to marshal the input argument to the thread 576 00:31:14,900 --> 00:31:17,540 so it's going to store the input argument n minus 1 577 00:31:17,540 --> 00:31:23,000 in this args struct. 578 00:31:23,000 --> 00:31:26,120 And then we're going to call pthread_create 579 00:31:26,120 --> 00:31:28,550 with a thread variable. 580 00:31:28,550 --> 00:31:31,007 For thread_args, we're just going to use null. 581 00:31:31,007 --> 00:31:32,840 And then we're going to pass the thread_func 582 00:31:32,840 --> 00:31:35,850 that we defined on the left. 583 00:31:35,850 --> 00:31:39,180 And then we're going to pass the args structure. 584 00:31:39,180 --> 00:31:44,090 And inside this args structure, the input is set to n minus 1, 585 00:31:44,090 --> 00:31:45,590 which we did on the previous line. 586 00:31:51,440 --> 00:31:57,200 And then pthread_create is going to give a return value. 587 00:32:00,600 --> 00:32:04,500 So if the Pthread creation was successful, 588 00:32:04,500 --> 00:32:07,725 then the status is going to be null, and we can continue. 589 00:32:10,325 --> 00:32:11,700 And when we continue, we're going 590 00:32:11,700 --> 00:32:16,140 to execute, now, fib of n minus 2 and store the result of that 591 00:32:16,140 --> 00:32:17,800 into our result variable. 592 00:32:17,800 --> 00:32:21,000 And this is done at the same time that fib of n minus 1 593 00:32:21,000 --> 00:32:21,660 is executing. 
594 00:32:21,660 --> 00:32:25,800 Because we created this Pthread, and we 595 00:32:25,800 --> 00:32:29,100 told it to call this thread_func function 596 00:32:29,100 --> 00:32:30,270 that we defined on the left. 597 00:32:30,270 --> 00:32:33,240 So both fib of n minus 1 and fib of n minus 2 598 00:32:33,240 --> 00:32:36,210 are executing in parallel now. 599 00:32:36,210 --> 00:32:39,210 And then we have this pthread_join, 600 00:32:39,210 --> 00:32:41,850 which says we're going to wait until the thread 601 00:32:41,850 --> 00:32:44,520 that we've created finishes before we move on, because we 602 00:32:44,520 --> 00:32:47,970 need to know the result of both of the sub-calls 603 00:32:47,970 --> 00:32:51,400 before we can finish this function. 604 00:32:51,400 --> 00:32:53,010 And once that's done-- 605 00:32:53,010 --> 00:32:56,130 well, we first check the status to see if it was successful. 606 00:32:56,130 --> 00:32:59,780 And if so, then we add the outputs of the argument's 607 00:32:59,780 --> 00:33:02,790 struct to the result. So args.output will store 608 00:33:02,790 --> 00:33:05,730 the output of fib of n minus 1. 609 00:33:09,430 --> 00:33:13,530 So that's the Pthreads code. 610 00:33:13,530 --> 00:33:15,450 Any questions on how this works? 611 00:33:20,870 --> 00:33:21,630 Yeah? 612 00:33:21,630 --> 00:33:25,645 AUDIENCE: I have a question about the thread function. 613 00:33:25,645 --> 00:33:28,120 So it looks like you passed a void pointer, 614 00:33:28,120 --> 00:33:30,407 but then you cast it to something else every time 615 00:33:30,407 --> 00:33:33,330 you use that-- 616 00:33:33,330 --> 00:33:35,080 JULIAN SHUN: Yeah, so this is because 617 00:33:35,080 --> 00:33:38,020 the pthread_create function takes 618 00:33:38,020 --> 00:33:40,630 as input a void star pointer. 619 00:33:40,630 --> 00:33:42,340 Because it's actually a generic function, 620 00:33:42,340 --> 00:33:44,520 so it doesn't know what the data type is. 621 00:33:44,520 --> 00:33:46,150 It has to work for all data types, 622 00:33:46,150 --> 00:33:48,250 and that's why we need to cast it to avoid star. 623 00:33:48,250 --> 00:33:51,280 When we pass it to pthread_create and then 624 00:33:51,280 --> 00:33:52,870 inside the thread_func, we actually 625 00:33:52,870 --> 00:33:56,420 do know what type of pointer that is, so then we cast it. 626 00:34:02,880 --> 00:34:04,555 So does this code seem very parallel? 627 00:34:09,560 --> 00:34:13,820 So how many parallel calls am I doing here? 628 00:34:13,820 --> 00:34:14,320 Yeah? 629 00:34:14,320 --> 00:34:16,315 AUDIENCE: Just one. 630 00:34:16,315 --> 00:34:18,440 JULIAN SHUN: Yeah, so I'm only creating one thread. 631 00:34:18,440 --> 00:34:22,040 So I'm executing two things in parallel. 632 00:34:22,040 --> 00:34:25,750 So if I ran this code on four processors, 633 00:34:25,750 --> 00:34:28,013 what's the maximum speed-up I could get? 634 00:34:28,013 --> 00:34:29,512 AUDIENCE: [INAUDIBLE]. 635 00:34:29,512 --> 00:34:31,429 JULIAN SHUN: So the maximum speed-up I can get 636 00:34:31,429 --> 00:34:35,429 is just two, because I'm only running two things in parallel. 637 00:34:35,429 --> 00:34:40,760 So this doesn't recursively create threads. 638 00:34:40,760 --> 00:34:43,679 It only creates one thread at the top level. 639 00:34:43,679 --> 00:34:47,389 And if you wanted to make it so that this code actually 640 00:34:47,389 --> 00:34:49,820 recursively created threads, it would actually 641 00:34:49,820 --> 00:34:52,699 become much more complicated. 
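Here is a condensed reconstruction of the kind of Pthreads code being walked through above. It is a sketch rather than the exact slide code, but it follows the same structure: a thread_args struct for marshaling, a thread_func wrapper, the n less than 30 coarsening check, and a single pthread_create / pthread_join pair at the top level. The choice of n = 40 is just a placeholder for the example.

#include <pthread.h>
#include <cstdint>
#include <cstdio>

static int64_t fib(int64_t n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// Arguments are marshaled through a struct because the thread routine only
// receives a single void* pointer.
struct thread_args {
    int64_t input;
    int64_t output;
};

static void *thread_func(void *ptr) {
    thread_args *args = (thread_args *)ptr;   // unmarshal the input
    args->output = fib(args->input);          // marshal the output
    return NULL;
}

int main() {
    int64_t n = 40;               // placeholder input
    int64_t result;
    if (n < 30) {                 // coarsening: not worth a thread for small n
        result = fib(n);
    } else {
        pthread_t thread;
        thread_args args;
        args.input = n - 1;
        // Only one extra thread is created, at the top level.
        if (pthread_create(&thread, NULL, thread_func, &args) != 0) return 1;
        int64_t y = fib(n - 2);   // runs concurrently with the new thread
        if (pthread_join(thread, NULL) != 0) return 1;
        result = args.output + y;
    }
    printf("fib = %lld\n", (long long)result);
    return 0;
}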
642 00:34:52,699 --> 00:34:56,780 And that's one of the disadvantages of implementing 643 00:34:56,780 --> 00:34:58,450 this code in Pthreads. 644 00:34:58,450 --> 00:35:00,200 So we'll look at other solutions that will 645 00:35:00,200 --> 00:35:01,595 make this task much easier. 646 00:35:05,120 --> 00:35:06,890 So some of the issues with Pthreads 647 00:35:06,890 --> 00:35:08,280 are shown on this slide here. 648 00:35:08,280 --> 00:35:12,020 So there's a high overhead to creating a thread. 649 00:35:12,020 --> 00:35:14,480 So creating a thread typically takes over 10 650 00:35:14,480 --> 00:35:17,720 to the 4th cycles. 651 00:35:17,720 --> 00:35:21,380 And this leads to very coarse-grained concurrency, 652 00:35:21,380 --> 00:35:24,530 because your tasks have to do a lot of work 653 00:35:24,530 --> 00:35:30,140 in order to amortize the costs of creating that thread. 654 00:35:30,140 --> 00:35:32,570 There's something called thread pools, which can help. 655 00:35:32,570 --> 00:35:34,862 And the idea here is to create a whole bunch of threads 656 00:35:34,862 --> 00:35:38,660 at the same time to amortize the costs of thread creation. 657 00:35:38,660 --> 00:35:40,730 And then when you need a thread, you just 658 00:35:40,730 --> 00:35:42,060 take one from the thread pool. 659 00:35:42,060 --> 00:35:43,850 So the thread pool contains threads that 660 00:35:43,850 --> 00:35:45,680 are just waiting to do work. 661 00:35:48,300 --> 00:35:50,780 There's also a scalability issue with this code 662 00:35:50,780 --> 00:35:53,090 that I showed on the previous slide. 663 00:35:53,090 --> 00:35:56,000 The Fibonacci code gets, at most, 664 00:35:56,000 --> 00:35:59,280 1.5x speed-up for two cores. 665 00:35:59,280 --> 00:36:01,142 Why is it 1.5 here? 666 00:36:01,142 --> 00:36:01,850 Does anyone know? 667 00:36:05,130 --> 00:36:05,630 Yeah? 668 00:36:05,630 --> 00:36:08,170 AUDIENCE: You have the asymmetry in the size of the two calls. 669 00:36:08,170 --> 00:36:09,587 JULIAN SHUN: Yeah, so it turns out 670 00:36:09,587 --> 00:36:13,170 that the two calls that I'm executing in parallel-- 671 00:36:13,170 --> 00:36:14,920 they're not doing the same amount of work. 672 00:36:14,920 --> 00:36:17,095 So one is computing fib of n minus 1, 673 00:36:17,095 --> 00:36:19,330 one is computing fib of n minus 2. 674 00:36:19,330 --> 00:36:23,630 And does anyone know what the ratio between these two values 675 00:36:23,630 --> 00:36:24,130 is? 676 00:36:27,360 --> 00:36:29,140 Yeah, so it's the golden ratio. 677 00:36:29,140 --> 00:36:31,330 It's about 1.6. 678 00:36:31,330 --> 00:36:33,970 It turns out that if you can get a speed-up of 1.6, 679 00:36:33,970 --> 00:36:34,720 then that's great. 680 00:36:34,720 --> 00:36:38,490 But there are some overheads, so this code 681 00:36:38,490 --> 00:36:42,410 will get about a 1.5x speed-up. 682 00:36:42,410 --> 00:36:45,292 And if you want to run this to take advantage of more cores, 683 00:36:45,292 --> 00:36:46,750 then you need to rewrite this code, 684 00:36:46,750 --> 00:36:50,440 and it becomes more complicated. 685 00:36:50,440 --> 00:36:52,420 Third, there's the issue of modularity. 686 00:36:52,420 --> 00:36:56,560 So if you look at this code here, 687 00:36:56,560 --> 00:37:00,880 you see that the Fibonacci logic is not nicely encapsulated 688 00:37:00,880 --> 00:37:02,050 within one function.
689 00:37:02,050 --> 00:37:05,350 We have that logic in the fib function on the left, 690 00:37:05,350 --> 00:37:08,830 but then we also have some of the fib logic on the right 691 00:37:08,830 --> 00:37:10,660 in our main function. 692 00:37:10,660 --> 00:37:13,930 And this makes this code not modular. 693 00:37:13,930 --> 00:37:16,900 And if we want to build programs on top of this, 694 00:37:16,900 --> 00:37:18,350 it makes it very hard to maintain, 695 00:37:18,350 --> 00:37:22,690 if we want to just change the logic of the Fibonacci 696 00:37:22,690 --> 00:37:24,790 function a little bit, because now we 697 00:37:24,790 --> 00:37:26,290 have to change it in multiple places 698 00:37:26,290 --> 00:37:29,460 instead of just having everything in one place. 699 00:37:29,460 --> 00:37:32,740 So it's not a good idea to write code that's not modular, 700 00:37:32,740 --> 00:37:35,260 so please don't do that in your projects. 701 00:37:40,420 --> 00:37:44,020 And then finally, the code becomes 702 00:37:44,020 --> 00:37:47,110 complicated because you have to actually move 703 00:37:47,110 --> 00:37:48,070 these arguments around. 704 00:37:48,070 --> 00:37:50,230 That's known as argument marshaling. 705 00:37:50,230 --> 00:37:52,810 And then you have to engage in error-prone protocols 706 00:37:52,810 --> 00:37:55,270 in order to do load balancing. 707 00:37:55,270 --> 00:37:57,940 So if you recall here, we have to actually 708 00:37:57,940 --> 00:38:02,710 place the argument n minus 1 into args.input 709 00:38:02,710 --> 00:38:05,350 and we have to extract the value out of args.output. 710 00:38:05,350 --> 00:38:07,495 So that makes the code very messy. 711 00:38:13,090 --> 00:38:17,770 So why do I say shades of 1958 here? 712 00:38:17,770 --> 00:38:21,760 Does anyone know what happened in 1958? 713 00:38:21,760 --> 00:38:25,930 Who was around in 1958? 714 00:38:25,930 --> 00:38:26,520 Just Charles? 715 00:38:29,340 --> 00:38:33,610 So there was a first something in 1958. 716 00:38:33,610 --> 00:38:34,110 What was it? 717 00:38:42,200 --> 00:38:47,180 So turns out in 1958, we had the first compiler. 718 00:38:47,180 --> 00:38:50,390 And this was the Fortran compiler. 719 00:38:50,390 --> 00:38:52,730 And before we had Fortran compiler, 720 00:38:52,730 --> 00:38:54,830 programmers were writing things in assembly. 721 00:38:54,830 --> 00:38:56,750 And when you write things in assembly, 722 00:38:56,750 --> 00:38:59,210 you have to do argument marshaling, 723 00:38:59,210 --> 00:39:02,480 because you have to place things into the appropriate registers 724 00:39:02,480 --> 00:39:05,150 before calling a function, and also move things around when 725 00:39:05,150 --> 00:39:06,920 you return from a function. 726 00:39:06,920 --> 00:39:09,530 And the nice thing about the first compiler 727 00:39:09,530 --> 00:39:12,210 is that it actually did all of this argument marshaling 728 00:39:12,210 --> 00:39:12,710 for you. 729 00:39:12,710 --> 00:39:15,680 So now you can just pass arguments to a function, 730 00:39:15,680 --> 00:39:17,420 and the compiler will generate code 731 00:39:17,420 --> 00:39:22,340 that will do the argument marshaling for us. 732 00:39:22,340 --> 00:39:24,320 So having you do this in Pthreads 733 00:39:24,320 --> 00:39:27,560 is similar to having to write code in assembly, 734 00:39:27,560 --> 00:39:29,630 because you have to actually manually marshal 735 00:39:29,630 --> 00:39:31,320 these arguments. 
736 00:39:31,320 --> 00:39:33,890 So hopefully, there are better ways to do this. 737 00:39:33,890 --> 00:39:37,490 And indeed, we'll look at some other solutions that will make 738 00:39:37,490 --> 00:39:40,900 it easier on the programmer. 739 00:39:40,900 --> 00:39:42,470 Any questions before I continue? 740 00:39:48,980 --> 00:39:51,020 So we looked at Pthreads. 741 00:39:51,020 --> 00:39:53,570 Next, let's look at Threading Building Blocks. 742 00:39:57,160 --> 00:40:00,340 So Threading Building Blocks is a library solution. 743 00:40:00,340 --> 00:40:02,920 It was developed by Intel. 744 00:40:02,920 --> 00:40:07,060 And it's implemented as a C++ library that runs on top 745 00:40:07,060 --> 00:40:09,090 of native threads. 746 00:40:09,090 --> 00:40:12,730 So the underlying implementation uses threads, 747 00:40:12,730 --> 00:40:15,370 but the programmer doesn't deal with threads. 748 00:40:15,370 --> 00:40:19,000 Instead, the programmer specifies tasks, 749 00:40:19,000 --> 00:40:21,430 and these tasks are automatically load-balanced 750 00:40:21,430 --> 00:40:25,090 across the threads using a work-stealing algorithm 751 00:40:25,090 --> 00:40:27,880 inspired by research at MIT-- 752 00:40:27,880 --> 00:40:31,230 Charles Leiserson's research. 753 00:40:31,230 --> 00:40:34,690 And the focus of Intel TBB is on performance. 754 00:40:34,690 --> 00:40:37,840 And as we'll see, the code written using TBB 755 00:40:37,840 --> 00:40:39,700 is simpler than what you would have 756 00:40:39,700 --> 00:40:42,260 to write if you used Pthreads. 757 00:40:42,260 --> 00:40:46,270 So let's look at how we can implement Fibonacci using TBB. 758 00:40:49,810 --> 00:40:55,000 So in TBB, we have to create these tasks. 759 00:40:55,000 --> 00:41:01,460 So in the Fibonacci code, we create this fib task class. 760 00:41:01,460 --> 00:41:06,065 And inside the task, we have to define this execute function. 761 00:41:10,580 --> 00:41:12,800 So the execute function is the function 762 00:41:12,800 --> 00:41:16,080 that performs a computation when we start the task. 763 00:41:16,080 --> 00:41:21,050 And this is where we define the Fibonacci logic. 764 00:41:21,050 --> 00:41:24,990 This task also takes as input these arguments parameter, n 765 00:41:24,990 --> 00:41:25,490 and sum. 766 00:41:25,490 --> 00:41:27,365 So n is the input here and sum is the output. 767 00:41:31,640 --> 00:41:37,370 And in TBB, we can easily create a recursive program 768 00:41:37,370 --> 00:41:40,950 that extracts more parallelism. 769 00:41:40,950 --> 00:41:43,680 And here, what we're doing is we're recursively creating 770 00:41:43,680 --> 00:41:46,200 two child tasks, a and b. 771 00:41:46,200 --> 00:41:49,530 That's the syntax for creating the tasks. 772 00:41:49,530 --> 00:41:52,200 And here, we can just pass the arguments to FibTask 773 00:41:52,200 --> 00:41:56,960 instead of marshaling the arguments ourselves. 774 00:41:56,960 --> 00:42:00,600 And then what we have here is a set_ref_count. 775 00:42:00,600 --> 00:42:03,690 And this basically is the number of tasks 776 00:42:03,690 --> 00:42:07,470 that we have to wait for plus one, so plus one for ourselves. 777 00:42:07,470 --> 00:42:10,920 And in this case, we created two children tasks, 778 00:42:10,920 --> 00:42:14,940 and we have ourselves, so that's 2 plus 1. 779 00:42:14,940 --> 00:42:20,720 And then after that, we start task b using the spawn(b) call. 
780 00:42:20,720 --> 00:42:25,310 And then we do spawn_and_wait_for_all 781 00:42:25,310 --> 00:42:27,050 with a as the argument. 782 00:42:27,050 --> 00:42:30,455 This basically says we're going to start task a, 783 00:42:30,455 --> 00:42:33,350 and then also wait for both a and b 784 00:42:33,350 --> 00:42:35,070 to finish before we proceed. 785 00:42:35,070 --> 00:42:37,670 So this spawn_and_wait_for_all call 786 00:42:37,670 --> 00:42:40,550 is going to look at the ref count that we set above 787 00:42:40,550 --> 00:42:44,400 and wait for that many tasks to finish before it continues. 788 00:42:44,400 --> 00:42:47,870 And after both a and b have completed, 789 00:42:47,870 --> 00:42:50,480 then we can just sum up the results 790 00:42:50,480 --> 00:42:53,930 and store that into the sum variable. 791 00:42:53,930 --> 00:42:56,870 And here, these tasks are created recursively. 792 00:42:56,870 --> 00:42:58,820 So unlike the Pthreads implementation 793 00:42:58,820 --> 00:43:01,760 that was only creating one thread at the top level, 794 00:43:01,760 --> 00:43:05,120 here, we're actually recursively creating more and more tasks. 795 00:43:05,120 --> 00:43:07,370 So we can actually get more parallelism 796 00:43:07,370 --> 00:43:09,830 from this code and scale to more processors. 797 00:43:14,510 --> 00:43:18,330 We also need this main function just to start up the program. 798 00:43:18,330 --> 00:43:22,160 So what we do here is we create a root task, 799 00:43:22,160 --> 00:43:26,150 which just computes fib of n, and then we call 800 00:43:26,150 --> 00:43:28,880 spawn_root_and_wait(a). 801 00:43:28,880 --> 00:43:31,430 So a is the task for the root. 802 00:43:31,430 --> 00:43:33,530 And then it will just run the root task. 803 00:43:36,990 --> 00:43:40,130 So that's what Fibonacci looks like in TBB. 804 00:43:40,130 --> 00:43:44,870 So this is much simpler than the Pthreads implementation. 805 00:43:44,870 --> 00:43:46,520 And it also gets better performance, 806 00:43:46,520 --> 00:43:48,320 because we can extract more parallelism 807 00:43:48,320 --> 00:43:50,330 from the computation. 808 00:43:54,480 --> 00:43:55,440 Any questions? 809 00:44:02,430 --> 00:44:08,130 So TBB also has many other features in addition to tasks. 810 00:44:08,130 --> 00:44:11,730 So TBB provides many C++ templates to express common 811 00:44:11,730 --> 00:44:15,390 patterns, and you can use these templates on different data 812 00:44:15,390 --> 00:44:16,450 types. 813 00:44:16,450 --> 00:44:18,300 So they have a parallel_for, which 814 00:44:18,300 --> 00:44:21,110 is used to express loop parallelism. 815 00:44:21,110 --> 00:44:23,790 So you can loop over a bunch of iterations in parallel. 816 00:44:23,790 --> 00:44:26,580 They also have a parallel_reduce for data aggregation. 817 00:44:26,580 --> 00:44:28,890 For example, if you want to sum together 818 00:44:28,890 --> 00:44:31,140 a whole bunch of values, you can use a parallel_reduce 819 00:44:31,140 --> 00:44:33,610 to do that in parallel. 820 00:44:33,610 --> 00:44:37,170 They also have pipeline and filter. 821 00:44:37,170 --> 00:44:40,260 That's used for software pipelining. 822 00:44:40,260 --> 00:44:43,740 TBB provides many concurrent container classes, 823 00:44:43,740 --> 00:44:46,590 which allow multiple threads to safely access and update 824 00:44:46,590 --> 00:44:48,400 the items in a container concurrently.
825 00:44:48,400 --> 00:44:53,100 So for example, they have hash tables, trees, priority queues, 826 00:44:53,100 --> 00:44:53,830 and so on. 827 00:44:53,830 --> 00:44:55,980 And you can just use these out of the box, 828 00:44:55,980 --> 00:44:58,050 and they'll work in parallel. 829 00:44:58,050 --> 00:45:01,410 You can do concurrent updates and reads 830 00:45:01,410 --> 00:45:03,810 to these data structures. 831 00:45:03,810 --> 00:45:07,620 TBB also has a variety of mutual exclusion library functions, 832 00:45:07,620 --> 00:45:11,160 such as locks and atomic operations. 833 00:45:11,160 --> 00:45:13,560 So there are a lot of features of TBB, 834 00:45:13,560 --> 00:45:16,980 which is why it's one of the more popular concurrency 835 00:45:16,980 --> 00:45:18,120 platforms. 836 00:45:18,120 --> 00:45:20,040 And because of all of these features, 837 00:45:20,040 --> 00:45:22,830 you don't have to implement many of these things by yourself, 838 00:45:22,830 --> 00:45:24,780 and still get pretty good performance. 839 00:45:28,770 --> 00:45:33,660 So TBB was a library solution to the concurrency problem. 840 00:45:33,660 --> 00:45:36,270 Now we're going to look at two linguistic solutions-- 841 00:45:36,270 --> 00:45:39,105 OpenMP and Cilk. 842 00:45:39,105 --> 00:45:40,230 So let's start with OpenMP. 843 00:45:44,050 --> 00:45:49,840 So OpenMP is a specification by an industry consortium. 844 00:45:49,840 --> 00:45:54,130 And there are several compilers available that support OpenMP, 845 00:45:54,130 --> 00:45:56,950 both open source and proprietary. 846 00:45:56,950 --> 00:46:00,040 So nowadays, GCC, ICC, and Clang all 847 00:46:00,040 --> 00:46:03,880 support OpenMP, as well as Visual Studio. 848 00:46:03,880 --> 00:46:08,560 And OpenMP provides linguistic extensions to C 849 00:46:08,560 --> 00:46:13,300 and C++, as well as Fortran, in the form of compiler pragmas. 850 00:46:13,300 --> 00:46:15,910 So you use these compiler pragmas 851 00:46:15,910 --> 00:46:19,900 in your code to specify which pieces of code 852 00:46:19,900 --> 00:46:22,780 should run in parallel. 853 00:46:22,780 --> 00:46:26,170 And OpenMP also runs on top of native threads, 854 00:46:26,170 --> 00:46:30,400 but the programmer isn't exposed to these threads. 855 00:46:30,400 --> 00:46:33,120 OpenMP supports loop parallelism, 856 00:46:33,120 --> 00:46:35,130 so you can do parallel for loops. 857 00:46:35,130 --> 00:46:39,144 They have task parallelism as well as pipeline parallelism. 858 00:46:41,750 --> 00:46:44,800 So let's look at how we can implement Fibonacci in OpenMP. 859 00:46:47,560 --> 00:46:51,140 So this is the entire code. 860 00:46:51,140 --> 00:46:54,290 So I want you to compare this to the Pthreads implementation 861 00:46:54,290 --> 00:46:57,770 that we saw 10 minutes ago. 862 00:46:57,770 --> 00:47:00,890 So this code is much cleaner than the Pthreads 863 00:47:00,890 --> 00:47:04,320 implementation, and it also performs better. 864 00:47:04,320 --> 00:47:06,110 So let's see how this code works. 865 00:47:10,270 --> 00:47:12,360 So we have these compiler pragmas, 866 00:47:12,360 --> 00:47:15,000 or compiler directives. 867 00:47:15,000 --> 00:47:20,040 And the compiler pragma for creating a parallel task 868 00:47:20,040 --> 00:47:24,840 is omp task. 869 00:47:24,840 --> 00:47:27,840 So we're going to create an OpenMP task for fib 870 00:47:27,840 --> 00:47:30,230 of n minus 1 as well as fib of n minus 2.
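Here is a rough sketch of what that code looks like; the exact variable names and clauses on the slide may differ, and the main function with the parallel and single regions is extra scaffolding (my addition here) that you need in order to actually launch the tasks.

    #include <cstdint>
    #include <cstdio>

    int64_t fib(int64_t n) {
        if (n < 2) return n;
        int64_t x, y;
        // Each recursive call becomes an OpenMP task that may run in parallel.
        #pragma omp task shared(x, n)
        x = fib(n - 1);
        #pragma omp task shared(y, n)
        y = fib(n - 2);
        // Wait for the two child tasks to finish before using x and y.
        #pragma omp taskwait
        return x + y;
    }

    int main() {
        int64_t n = 30;
        int64_t result = 0;
        // Tasks must be created inside a parallel region; the single construct
        // makes sure only one thread makes the top-level call.
        #pragma omp parallel
        #pragma omp single
        result = fib(n);
        std::printf("fib(%lld) = %lld\n", (long long)n, (long long)result);
        return 0;
    }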
871 00:47:37,840 --> 00:47:40,780 There's also this shared clause, which 872 00:47:40,780 --> 00:47:43,900 specifies that the two variables in the arguments 873 00:47:43,900 --> 00:47:47,150 are shared across different threads. 874 00:47:47,150 --> 00:47:49,630 So you also have to specify whether variables 875 00:47:49,630 --> 00:47:50,725 are private or shared. 876 00:47:55,030 --> 00:47:57,970 And then the pragma omp taskwait just 877 00:47:57,970 --> 00:48:00,970 says we're going to wait for the preceding tasks 878 00:48:00,970 --> 00:48:03,410 to complete before we continue. 879 00:48:03,410 --> 00:48:05,425 So here, it's going to wait for fib of n minus 1 880 00:48:05,425 --> 00:48:08,290 and fib of n minus 2 to finish before we 881 00:48:08,290 --> 00:48:11,030 return the result, which is what we want. 882 00:48:11,030 --> 00:48:14,260 And then after that, we just return x plus y. 883 00:48:14,260 --> 00:48:15,490 So that's the entire code. 884 00:48:22,300 --> 00:48:26,170 And OpenMP also provides many other pragma directives, 885 00:48:26,170 --> 00:48:28,300 in addition to task. 886 00:48:28,300 --> 00:48:32,440 So we can use a parallel for to do loop parallelism. 887 00:48:32,440 --> 00:48:33,610 There's reduction. 888 00:48:33,610 --> 00:48:36,080 There's also directives for scheduling and data sharing. 889 00:48:36,080 --> 00:48:39,550 So you can specify how you want a particular loop 890 00:48:39,550 --> 00:48:40,360 to be scheduled. 891 00:48:40,360 --> 00:48:43,180 OpenMP has many different scheduling policies. 892 00:48:43,180 --> 00:48:46,730 They have static scheduling, dynamic scheduling, and so on. 893 00:48:46,730 --> 00:48:49,150 And then these scheduling directives also 894 00:48:49,150 --> 00:48:51,190 have different grain sizes. 895 00:48:51,190 --> 00:48:56,200 The data sharing directives specify whether variables 896 00:48:56,200 --> 00:48:58,770 are private or shared. 897 00:48:58,770 --> 00:49:01,870 OpenMP also supplies a variety of synchronization constructs, 898 00:49:01,870 --> 00:49:05,950 such as barriers, atomic updates, mutual exclusion, 899 00:49:05,950 --> 00:49:07,030 or mutex locks. 900 00:49:07,030 --> 00:49:09,250 So OpenMP also has many features, 901 00:49:09,250 --> 00:49:13,840 and it's also one of the more popular solutions 902 00:49:13,840 --> 00:49:17,110 to writing parallel programs. 903 00:49:17,110 --> 00:49:19,030 As you saw in the previous example, 904 00:49:19,030 --> 00:49:20,980 the code is much simpler than if you 905 00:49:20,980 --> 00:49:25,690 were to write something using Pthreads or even TBB. 906 00:49:25,690 --> 00:49:27,190 This is a much simpler solution. 907 00:49:32,430 --> 00:49:33,370 Any questions? 908 00:49:37,400 --> 00:49:38,210 Yeah? 909 00:49:38,210 --> 00:49:42,050 AUDIENCE: So with every compiler directive, 910 00:49:42,050 --> 00:49:47,605 does it spawn a new [INAUDIBLE] on a different processor? 911 00:49:47,605 --> 00:49:49,610 JULIAN SHUN: So this code here is actually 912 00:49:49,610 --> 00:49:51,450 independent of the number of processors. 913 00:49:51,450 --> 00:49:53,750 So there is actually a scheduling algorithm 914 00:49:53,750 --> 00:49:56,000 that will determine how the tasks get 915 00:49:56,000 --> 00:49:57,680 mapped to different processors. 916 00:49:57,680 --> 00:50:01,460 So if you spawn a new task, it doesn't necessarily put it 917 00:50:01,460 --> 00:50:02,540 on a different processor.
918 00:50:02,540 --> 00:50:04,480 And you can have more tasks than the number 919 00:50:04,480 --> 00:50:05,480 of processors available. 920 00:50:05,480 --> 00:50:06,860 So there's a scheduling algorithm 921 00:50:06,860 --> 00:50:09,650 that will take care of how these tasks get 922 00:50:09,650 --> 00:50:11,480 mapped to different processors, and that's 923 00:50:11,480 --> 00:50:13,340 hidden from the programmer. 924 00:50:13,340 --> 00:50:16,190 Although you can use these scheduling 925 00:50:16,190 --> 00:50:19,640 pragmas to give hints to the compiler 926 00:50:19,640 --> 00:50:22,750 about how it should schedule them. 927 00:50:22,750 --> 00:50:23,500 Yeah? 928 00:50:23,500 --> 00:50:25,583 AUDIENCE: What is the operating system [INAUDIBLE] 929 00:50:25,583 --> 00:50:28,027 scheduling [INAUDIBLE]? 930 00:50:28,027 --> 00:50:29,860 JULIAN SHUN: Underneath, this is implemented 931 00:50:29,860 --> 00:50:34,030 using Pthreads, which has to make operating system calls to, 932 00:50:34,030 --> 00:50:37,390 basically, directly talk to the processor cores 933 00:50:37,390 --> 00:50:39,850 and do multiplexing and so forth. 934 00:50:39,850 --> 00:50:44,020 So the operating system is involved at a very low level. 935 00:50:56,630 --> 00:50:59,900 So the last concurrency platform that we'll be looking at today 936 00:50:59,900 --> 00:51:00,590 is Cilk. 937 00:51:08,158 --> 00:51:09,950 We're going to look at Cilk Plus, actually. 938 00:51:09,950 --> 00:51:12,870 And the Cilk part of Cilk Plus is a small set of linguistic 939 00:51:12,870 --> 00:51:18,040 extensions to C and C++ that support fork-join parallelism. 940 00:51:18,040 --> 00:51:21,180 So for example, the Fibonacci example 941 00:51:21,180 --> 00:51:22,770 uses fork-join parallelism, so you 942 00:51:22,770 --> 00:51:24,740 can use Cilk to implement that. 943 00:51:24,740 --> 00:51:28,680 And the Plus part of Cilk Plus supports vector parallelism, 944 00:51:28,680 --> 00:51:34,990 which you had experience working with in your homeworks. 945 00:51:34,990 --> 00:51:39,960 So Cilk Plus was initially developed by Cilk Arts, 946 00:51:39,960 --> 00:51:42,570 which was an MIT spin-off. 947 00:51:42,570 --> 00:51:47,730 And Cilk Arts was acquired by Intel in July 2009. 948 00:51:47,730 --> 00:51:52,560 And the Cilk Plus implementation was 949 00:51:52,560 --> 00:51:55,350 based on the award-winning Cilk multi-threaded language that 950 00:51:55,350 --> 00:51:59,670 was developed two decades ago here at MIT by Charles 951 00:51:59,670 --> 00:52:03,240 Leiserson's research group. 952 00:52:03,240 --> 00:52:05,610 And it features a provably efficient 953 00:52:05,610 --> 00:52:07,110 work-stealing scheduler. 954 00:52:07,110 --> 00:52:09,390 So this scheduler is provably efficient. 955 00:52:09,390 --> 00:52:12,280 You can actually prove theoretical bounds on it. 956 00:52:12,280 --> 00:52:14,700 And this allows you to implement theoretically efficient 957 00:52:14,700 --> 00:52:18,030 algorithms, which we'll talk more about in another lecture-- 958 00:52:18,030 --> 00:52:18,840 algorithm design. 959 00:52:18,840 --> 00:52:21,690 But it provides a provably efficient 960 00:52:21,690 --> 00:52:23,460 work-stealing scheduler. 961 00:52:23,460 --> 00:52:26,310 And Charles Leiserson has a very famous paper 962 00:52:26,310 --> 00:52:29,760 that has a proof that this scheduler is optimal. 963 00:52:29,760 --> 00:52:32,640 So if you're interested in reading about this, 964 00:52:32,640 --> 00:52:35,070 you can talk to us offline.
965 00:52:35,070 --> 00:52:37,860 Cilk Plus also provides a hyperobject library 966 00:52:37,860 --> 00:52:42,120 for parallelizing code with global variables. 967 00:52:42,120 --> 00:52:44,760 And you'll have a chance to play around with hyperobjects 968 00:52:44,760 --> 00:52:47,930 in homework 4. 969 00:52:47,930 --> 00:52:50,730 The Cilk Plus ecosystem also includes 970 00:52:50,730 --> 00:52:54,210 useful programming tools, such as the Cilkscreen race 971 00:52:54,210 --> 00:52:54,750 detector. 972 00:52:54,750 --> 00:52:57,810 So this allows you to detect determinacy races 973 00:52:57,810 --> 00:53:01,710 in your program to help you isolate bugs and performance 974 00:53:01,710 --> 00:53:03,120 bottlenecks. 975 00:53:03,120 --> 00:53:06,810 It also has a scalability analyzer called Cilkview. 976 00:53:06,810 --> 00:53:11,940 And Cilkview will basically analyze the amount of work 977 00:53:11,940 --> 00:53:15,120 that your program is doing, as well as 978 00:53:15,120 --> 00:53:16,740 the maximum amount of parallelism 979 00:53:16,740 --> 00:53:22,030 that your code could possibly extract from the hardware. 980 00:53:22,030 --> 00:53:25,683 So that's Intel Cilk Plus. 981 00:53:25,683 --> 00:53:27,350 But it turns out that we're not actually 982 00:53:27,350 --> 00:53:29,540 going to be using Intel Cilk Plus in this class. 983 00:53:29,540 --> 00:53:32,280 We're going to be using a better compiler. 984 00:53:32,280 --> 00:53:36,830 And this compiler is based on Tapir/LLVM. 985 00:53:36,830 --> 00:53:40,760 And it supports the Cilk subset of Cilk Plus. 986 00:53:40,760 --> 00:53:45,620 And Tapir/LLVM was actually recently developed at MIT 987 00:53:45,620 --> 00:53:50,870 by T. B. Schardl, who gave a lecture last week, William 988 00:53:50,870 --> 00:53:53,810 Moses, who's a grad student working with Charles, 989 00:53:53,810 --> 00:53:55,280 as well as Charles Leiserson. 990 00:53:58,550 --> 00:54:02,660 So talking a lot about Charles's work today. 991 00:54:02,660 --> 00:54:05,390 And Tapir/LLVM generally produces 992 00:54:05,390 --> 00:54:08,450 better code, relative to its base compiler, 993 00:54:08,450 --> 00:54:10,970 than all other implementations of Cilk out there. 994 00:54:10,970 --> 00:54:15,740 So it's the best Cilk compiler that's available today. 995 00:54:15,740 --> 00:54:18,500 And they actually wrote a very nice paper 996 00:54:18,500 --> 00:54:21,230 on this last year, Charles Leiserson and his group. 997 00:54:21,230 --> 00:54:23,750 And that paper received the Best Paper Award 998 00:54:23,750 --> 00:54:27,080 at the annual Symposium on Principles and Practice 999 00:54:27,080 --> 00:54:29,360 of Parallel Programming, or PPoPP. 1000 00:54:29,360 --> 00:54:34,500 So you should look at that paper as well. 1001 00:54:34,500 --> 00:54:38,600 So right now, Tapir/LLVM uses the Intel Cilk Plus runtime 1002 00:54:38,600 --> 00:54:43,790 system, but I believe Charles's group has plans to implement 1003 00:54:43,790 --> 00:54:46,460 a better runtime system. 1004 00:54:46,460 --> 00:54:49,460 And Tapir/LLVM also supports more general features 1005 00:54:49,460 --> 00:54:51,410 than existing Cilk compilers. 1006 00:54:51,410 --> 00:54:55,290 So in addition to spawning functions, 1007 00:54:55,290 --> 00:54:57,230 you can also spawn code blocks that are not 1008 00:54:57,230 --> 00:55:02,120 separate functions, and this makes 1009 00:55:02,120 --> 00:55:03,542 writing programs more flexible.
1010 00:55:03,542 --> 00:55:05,750 You don't have to actually create a separate function 1011 00:55:05,750 --> 00:55:11,606 if you want to execute a code block in parallel. 1012 00:55:11,606 --> 00:55:13,103 Any questions? 1013 00:55:21,590 --> 00:55:26,330 So this is the Cilk code for Fibonacci. 1014 00:55:26,330 --> 00:55:29,320 So it's also pretty simple. 1015 00:55:29,320 --> 00:55:31,960 It looks very similar to the sequential program, 1016 00:55:31,960 --> 00:55:35,320 except we have these cilk_spawn and cilk_sync 1017 00:55:35,320 --> 00:55:36,940 statements in the code. 1018 00:55:36,940 --> 00:55:40,260 So what do these statements do? 1019 00:55:40,260 --> 00:55:45,190 So cilk_spawn says that the named child function, which 1020 00:55:45,190 --> 00:55:47,590 is the function that is right after this cilk_spawn 1021 00:55:47,590 --> 00:55:50,800 statement, may execute in parallel with the parent 1022 00:55:50,800 --> 00:55:51,400 caller. 1023 00:55:51,400 --> 00:55:52,930 The parent caller is the function 1024 00:55:52,930 --> 00:55:55,270 that is calling cilk_spawn. 1025 00:55:55,270 --> 00:55:57,670 So this says that fib of n minus 1 1026 00:55:57,670 --> 00:56:02,500 can execute in parallel with the function that called it. 1027 00:56:02,500 --> 00:56:05,890 And then this function is then going to call fib of n minus 2. 1028 00:56:05,890 --> 00:56:08,610 And fib of n minus 2 and fib of n minus 1 1029 00:56:08,610 --> 00:56:12,130 now can be executing in parallel. 1030 00:56:12,130 --> 00:56:16,510 And then cilk_sync says that control cannot pass this point 1031 00:56:16,510 --> 00:56:21,030 until all of the spawned children have returned. 1032 00:56:21,030 --> 00:56:23,560 So this is going to wait for fib of n minus 1 1033 00:56:23,560 --> 00:56:28,150 to return before we go to the return statement 1034 00:56:28,150 --> 00:56:29,640 where we add up x and y. 1035 00:56:34,760 --> 00:56:36,440 So one important thing to note is 1036 00:56:36,440 --> 00:56:38,750 that the Cilk keywords grant permission 1037 00:56:38,750 --> 00:56:42,830 for parallel execution, but they don't actually force or command 1038 00:56:42,830 --> 00:56:44,000 parallel execution. 1039 00:56:44,000 --> 00:56:47,980 So even though I said cilk_spawn here, 1040 00:56:47,980 --> 00:56:50,240 the runtime system doesn't necessarily 1041 00:56:50,240 --> 00:56:55,010 have to run fib of n minus 1 in parallel with fib of n minus 2. 1042 00:56:55,010 --> 00:56:58,340 I'm just saying that I could run these two things in parallel, 1043 00:56:58,340 --> 00:56:59,750 and it's up to the runtime system 1044 00:56:59,750 --> 00:57:03,830 to decide whether or not to run these things in parallel, 1045 00:57:03,830 --> 00:57:08,480 based on its scheduling policy. 1046 00:57:08,480 --> 00:57:13,040 So let's look at another example of Cilk. 1047 00:57:13,040 --> 00:57:15,960 So let's look at loop parallelism. 1048 00:57:15,960 --> 00:57:18,860 So here we want to do a matrix transpose, 1049 00:57:18,860 --> 00:57:21,080 and we want to do this in-place. 1050 00:57:21,080 --> 00:57:24,380 So the idea here is we want to basically swap 1051 00:57:24,380 --> 00:57:31,040 the elements below the diagonal to their mirror 1052 00:57:31,040 --> 00:57:34,410 image above the diagonal. 1053 00:57:34,410 --> 00:57:36,950 And here's some code to do this. 1054 00:57:36,950 --> 00:57:39,020 So we have a cilk_for. 1055 00:57:39,020 --> 00:57:42,590 So this is basically a parallel for loop.
1056 00:57:42,590 --> 00:57:45,710 It goes from i equals 1 to n minus 1. 1057 00:57:45,710 --> 00:57:49,310 And then the inner for loop goes from j equals 0 up 1058 00:57:49,310 --> 00:57:51,590 to i minus 1. 1059 00:57:51,590 --> 00:57:55,730 And then we just swap a of i j with a of j i, 1060 00:57:55,730 --> 00:57:58,400 using these three statements inside the body of the 1061 00:57:58,400 --> 00:57:58,970 for loop. 1062 00:58:02,110 --> 00:58:04,210 So to execute a for loop in parallel, 1063 00:58:04,210 --> 00:58:10,630 you just have to add cilk underscore to the for keyword. 1064 00:58:10,630 --> 00:58:14,160 And that's as simple as it gets. 1065 00:58:14,160 --> 00:58:17,100 So this code is actually going to run in parallel 1066 00:58:17,100 --> 00:58:22,890 and get pretty good speed-up for this particular problem. 1067 00:58:22,890 --> 00:58:25,080 And internally, Cilk for loops are 1068 00:58:25,080 --> 00:58:28,980 transformed into nested cilk_spawn and cilk_sync calls. 1069 00:58:28,980 --> 00:58:32,880 So the compiler is going to get rid of the cilk_for 1070 00:58:32,880 --> 00:58:36,150 and change it into cilk_spawn and cilk_sync. 1071 00:58:36,150 --> 00:58:38,370 So it's going to recursively divide the iteration 1072 00:58:38,370 --> 00:58:44,220 space into half, and then it's going to spawn off one half 1073 00:58:44,220 --> 00:58:46,920 and then execute the other half in parallel with that, 1074 00:58:46,920 --> 00:58:49,590 and then recursively do that until the iteration 1075 00:58:49,590 --> 00:58:51,810 range becomes small enough, at which point 1076 00:58:51,810 --> 00:58:54,870 it doesn't make sense to execute it in parallel anymore, 1077 00:58:54,870 --> 00:58:57,750 so we just execute that range sequentially. 1078 00:59:01,310 --> 00:59:03,180 So that's loop parallelism in Cilk. 1079 00:59:03,180 --> 00:59:06,520 Any questions? 1080 00:59:06,520 --> 00:59:07,060 Yes? 1081 00:59:07,060 --> 00:59:12,070 AUDIENCE: How does it know [INAUDIBLE] something weird, 1082 00:59:12,070 --> 00:59:15,103 can it still do that? 1083 00:59:15,103 --> 00:59:16,520 JULIAN SHUN: Yeah, so the compiler 1084 00:59:16,520 --> 00:59:19,940 can actually figure out what the iteration space is. 1085 00:59:19,940 --> 00:59:22,850 So you don't necessarily have to be incrementing by 1. 1086 00:59:22,850 --> 00:59:24,302 You can do something else. 1087 00:59:24,302 --> 00:59:26,510 You just have to guarantee that all of the iterations 1088 00:59:26,510 --> 00:59:29,780 are independent. 1089 00:59:29,780 --> 00:59:32,090 So if you have a determinacy race 1090 00:59:32,090 --> 00:59:35,270 across the different iterations of your cilk_for loop, 1091 00:59:35,270 --> 00:59:37,910 then your result might not necessarily be correct. 1092 00:59:37,910 --> 00:59:40,310 So you have to make sure that the iterations are, indeed, 1093 00:59:40,310 --> 00:59:42,270 independent. 1094 00:59:42,270 --> 00:59:42,770 Yes? 1095 00:59:42,770 --> 00:59:44,510 AUDIENCE: Can you nest cilk_fors? 1096 00:59:44,510 --> 00:59:47,980 JULIAN SHUN: Yes, so you can nest cilk_fors. 1097 00:59:47,980 --> 00:59:50,210 But it turns out that, for this example, 1098 00:59:50,210 --> 00:59:52,460 usually, you already have enough parallelism 1099 00:59:52,460 --> 00:59:54,890 in the outer loop for large enough values of n, 1100 00:59:54,890 --> 00:59:57,950 so it doesn't make sense to put a cilk_for loop inside, 1101 00:59:57,950 --> 01:00:01,610 because using a cilk_for loop adds some additional overheads. 
1102 01:00:01,610 --> 01:00:04,610 But you can actually do nested cilk_for loops. 1103 01:00:04,610 --> 01:00:07,170 And in some cases, it does make sense, 1104 01:00:07,170 --> 01:00:10,910 especially if there's not enough parallelism 1105 01:00:10,910 --> 01:00:13,280 in the outermost for loop. 1106 01:00:13,280 --> 01:00:15,805 So good question. 1107 01:00:15,805 --> 01:00:16,305 Yes? 1108 01:00:16,305 --> 01:00:17,847 AUDIENCE: What does the assembly code 1109 01:00:17,847 --> 01:00:20,390 look like for the parallel code? 1110 01:00:20,390 --> 01:00:24,145 JULIAN SHUN: So it has a bunch of calls to the Cilk runtime 1111 01:00:24,145 --> 01:00:24,645 system. 1112 01:00:27,682 --> 01:00:29,640 I don't know all the details, because I haven't 1113 01:00:29,640 --> 01:00:30,640 looked at this recently. 1114 01:00:30,640 --> 01:00:32,730 But I think you can actually generate 1115 01:00:32,730 --> 01:00:35,700 the assembly code using a flag in the Clang compiler. 1116 01:00:35,700 --> 01:00:37,710 So that's a good exercise. 1117 01:00:47,295 --> 01:00:48,670 AUDIENCE: Yeah, you probably want 1118 01:00:48,670 --> 01:00:54,550 to look at the LLVM IR, rather than the assembly, 1119 01:00:54,550 --> 01:00:57,580 to begin with, to understand what's going on. 1120 01:00:57,580 --> 01:00:59,920 It has three instructions that are not 1121 01:00:59,920 --> 01:01:07,990 in the standard LLVM, which were added to support parallelism. 1122 01:01:07,990 --> 01:01:13,750 Those things, when it's lowered into assembly, 1123 01:01:13,750 --> 01:01:16,000 each of those instructions becomes 1124 01:01:16,000 --> 01:01:19,270 a bunch of assembly language instructions. 1125 01:01:19,270 --> 01:01:23,980 So you don't want to mess with looking at it in the assembler 1126 01:01:23,980 --> 01:01:26,590 until you see what it looks like in the LLVM first. 1127 01:01:31,400 --> 01:01:34,060 JULIAN SHUN: So good question. 1128 01:01:34,060 --> 01:01:36,930 Any other questions about this code here? 1129 01:01:44,270 --> 01:01:49,611 OK, so let's look at another example. 1130 01:01:49,611 --> 01:01:52,080 So let's say we had this for loop 1131 01:01:52,080 --> 01:01:54,540 where, on each iteration i, we're 1132 01:01:54,540 --> 01:01:58,530 just incrementing a variable sum by i. 1133 01:01:58,530 --> 01:02:01,260 So this is essentially going to compute 1134 01:02:01,260 --> 01:02:04,980 the summation of everything from i equals 0 up to n minus 1, 1135 01:02:04,980 --> 01:02:06,930 and then print out the result. 1136 01:02:06,930 --> 01:02:13,710 So one straightforward way to try to parallelize this code 1137 01:02:13,710 --> 01:02:18,790 is to just change the for to cilk_for. 1138 01:02:18,790 --> 01:02:20,560 So does this code work? 1139 01:02:27,330 --> 01:02:31,540 Who thinks that this code doesn't work? 1140 01:02:31,540 --> 01:02:35,750 Or doesn't compute the correct result? 1141 01:02:35,750 --> 01:02:38,140 So about half of you. 1142 01:02:38,140 --> 01:02:43,470 And who thinks this code does work? 1143 01:02:43,470 --> 01:02:44,900 So a couple people. 1144 01:02:44,900 --> 01:02:50,310 And I guess the rest of the people don't care. 1145 01:02:50,310 --> 01:02:55,170 So it turns out that it's not actually necessarily going 1146 01:02:55,170 --> 01:02:56,550 to give you the right answer. 
1147 01:02:56,550 --> 01:02:59,940 Because the cilk_for loop says you 1148 01:02:59,940 --> 01:03:02,220 can execute these iterations in parallel, 1149 01:03:02,220 --> 01:03:06,630 but they're all updating the same shared variable sum here. 1150 01:03:06,630 --> 01:03:10,410 So you have what's called a determinacy race, where 1151 01:03:10,410 --> 01:03:12,940 multiple processors can be writing to the same memory 1152 01:03:12,940 --> 01:03:13,440 location. 1153 01:03:13,440 --> 01:03:15,450 We'll talk much more about determinacy races 1154 01:03:15,450 --> 01:03:17,510 in the next lecture. 1155 01:03:17,510 --> 01:03:19,260 But for this example, it's not necessarily 1156 01:03:19,260 --> 01:03:24,750 going to work if you run it on more than one processor. 1157 01:03:24,750 --> 01:03:28,630 And Cilk actually has a nice way to deal with this. 1158 01:03:28,630 --> 01:03:31,650 So in Cilk, we have something known as a reducer. 1159 01:03:31,650 --> 01:03:34,110 This is one example of a hyperobject, 1160 01:03:34,110 --> 01:03:36,000 which I mentioned earlier. 1161 01:03:36,000 --> 01:03:38,040 And with a reducer, what you have to do 1162 01:03:38,040 --> 01:03:42,090 is, instead of just declaring the sum variable 1163 01:03:42,090 --> 01:03:44,910 with an unsigned long data type, what you do 1164 01:03:44,910 --> 01:03:49,440 is you use this macro called CILK_C_REDUCER_OPADD, which 1165 01:03:49,440 --> 01:03:53,340 specifies we want to create a reducer with the addition 1166 01:03:53,340 --> 01:03:54,660 function. 1167 01:03:54,660 --> 01:03:56,250 Then we have the variable name sum, 1168 01:03:56,250 --> 01:04:00,580 the data type unsigned long, and then the initial value 0. 1169 01:04:00,580 --> 01:04:03,480 And then we have a macro to register this reducer, 1170 01:04:03,480 --> 01:04:06,810 so a CILK_C_REGISTER_REDUCER. 1171 01:04:06,810 --> 01:04:08,580 And then now, inside this cilk_for loop, 1172 01:04:08,580 --> 01:04:13,410 we can increment the REDUCER_VIEW of sum, 1173 01:04:13,410 --> 01:04:16,350 which is another macro, by i. 1174 01:04:16,350 --> 01:04:18,540 And you can actually execute this in parallel, 1175 01:04:18,540 --> 01:04:21,540 and it will give you the same answer 1176 01:04:21,540 --> 01:04:23,880 that you would get if you ran this sequentially. 1177 01:04:23,880 --> 01:04:28,740 So the reducer will take care of this determinacy race for you. 1178 01:04:28,740 --> 01:04:31,320 And at the end, when you print out this result, 1179 01:04:31,320 --> 01:04:36,450 you'll see that the sum is equal to the sum that you expect. 1180 01:04:36,450 --> 01:04:38,460 And then after you finish using the reducer, 1181 01:04:38,460 --> 01:04:43,380 you use this other macro called CILK_C_UNREGISTER_REDUCER(sum) 1182 01:04:43,380 --> 01:04:48,450 that tells the system that you're done using this reducer. 1183 01:04:48,450 --> 01:04:51,810 So this is one way to deal with this problem 1184 01:04:51,810 --> 01:04:54,780 when you want to do a reduction. 1185 01:04:54,780 --> 01:04:57,450 And it turns out that there are many other interesting 1186 01:04:57,450 --> 01:04:59,960 reduction operators that you might want to use. 1187 01:04:59,960 --> 01:05:03,750 And in general, you can create reducers for monoids. 1188 01:05:03,750 --> 01:05:06,150 And monoids are algebraic structures 1189 01:05:06,150 --> 01:05:09,000 that have an associative binary operation as well 1190 01:05:09,000 --> 01:05:10,740 as an identity element.
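Putting those macros together, the reducer version of the loop might look roughly like this. It is a sketch assuming the Intel Cilk Plus C reducer interface from cilk/reducer_opadd.h; the exact types, loop bounds, and output format on the slide may differ.

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>
    #include <cstdio>

    int main() {
        long n = 10000;
        // Declare a reducer over unsigned long with the addition operation,
        // initialized to 0, instead of a plain unsigned long variable.
        CILK_C_REDUCER_OPADD(sum, ulong, 0);
        CILK_C_REGISTER_REDUCER(sum);

        cilk_for (long i = 0; i < n; i++) {
            // Each worker updates its own local view of sum; the runtime
            // combines the views with + at the end.
            REDUCER_VIEW(sum) += i;
        }

        std::printf("The sum is %lu\n", REDUCER_VIEW(sum));
        CILK_C_UNREGISTER_REDUCER(sum);
        return 0;
    }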
1191 01:05:10,740 --> 01:05:13,320 So the addition operator is a monoid, 1192 01:05:13,320 --> 01:05:16,230 because it's associative, it's binary, 1193 01:05:16,230 --> 01:05:19,830 and the identity element is 0. 1194 01:05:19,830 --> 01:05:23,160 Cilk also has several other predefined reducers, 1195 01:05:23,160 --> 01:05:27,900 including multiplication, min, max, and, or, xor, et cetera. 1196 01:05:27,900 --> 01:05:29,550 So these are all monoids. 1197 01:05:29,550 --> 01:05:32,280 And you can also define your own reducer. 1198 01:05:32,280 --> 01:05:33,827 So in fact, in the next homework, 1199 01:05:33,827 --> 01:05:36,160 you'll have the opportunity to play around with reducers 1200 01:05:36,160 --> 01:05:41,193 and write a reducer for lists. 1201 01:05:41,193 --> 01:05:41,985 So that's reducers. 1202 01:05:46,740 --> 01:05:49,770 Another nice thing about Cilk is that there's always 1203 01:05:49,770 --> 01:05:53,560 a valid serial interpretation of the program. 1204 01:05:53,560 --> 01:05:56,730 So the serial elision of a Cilk program 1205 01:05:56,730 --> 01:05:58,950 is always a legal interpretation. 1206 01:05:58,950 --> 01:06:02,640 And for the Cilk source code on the left, 1207 01:06:02,640 --> 01:06:04,740 the serial elision is basically the code 1208 01:06:04,740 --> 01:06:07,020 you get if you get rid of the cilk_spawn 1209 01:06:07,020 --> 01:06:09,600 and cilk_sync statements. 1210 01:06:09,600 --> 01:06:12,750 And this looks just like the sequential code. 1211 01:06:17,170 --> 01:06:20,190 And remember that the Cilk keywords grant permission 1212 01:06:20,190 --> 01:06:22,470 for parallel execution, but they don't necessarily 1213 01:06:22,470 --> 01:06:24,025 command parallel execution. 1214 01:06:24,025 --> 01:06:28,950 So if you ran this Cilk code using a single core, 1215 01:06:28,950 --> 01:06:31,415 it wouldn't actually create these parallel tasks, 1216 01:06:31,415 --> 01:06:32,790 and you would get the same answer 1217 01:06:32,790 --> 01:06:35,640 as the sequential program. 1218 01:06:35,640 --> 01:06:38,400 And this-- the serial elision-- is also 1219 01:06:38,400 --> 01:06:39,690 a correct interpretation. 1220 01:06:39,690 --> 01:06:44,550 So unlike other solutions, such as TBB and Pthreads, 1221 01:06:44,550 --> 01:06:46,920 it's actually difficult, in those environments, 1222 01:06:46,920 --> 01:06:51,000 to get a program that does what the sequential program does. 1223 01:06:51,000 --> 01:06:54,990 Because they're actually doing a lot of additional work 1224 01:06:54,990 --> 01:06:58,990 to set up these parallel calls and create these argument 1225 01:06:58,990 --> 01:07:01,170 structures and other scheduling constructs. 1226 01:07:01,170 --> 01:07:03,435 Whereas in Cilk, it's very easy just 1227 01:07:03,435 --> 01:07:04,560 to get this serial elision. 1228 01:07:04,560 --> 01:07:10,020 You just define cilk_spawn and cilk_sync to be null. 1229 01:07:10,020 --> 01:07:12,680 You also define cilk_for to be for. 1230 01:07:12,680 --> 01:07:16,170 And then this gives you a valid sequential program. 1231 01:07:16,170 --> 01:07:19,350 So when you're debugging code, you 1232 01:07:19,350 --> 01:07:24,300 might first want to check if the sequential elision of your Cilk 1233 01:07:24,300 --> 01:07:25,950 program is correct, and you can easily 1234 01:07:25,950 --> 01:07:28,170 do that by using these macros.
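Concretely, the macro definitions amount to something like the following sketch; I believe Intel Cilk Plus also ships a header, cilk/cilk_stub.h, that does essentially this for you.

    // Serial elision: the Cilk keywords disappear, so spawns become plain
    // function calls, syncs become no-ops, and cilk_for becomes an ordinary for.
    #define cilk_spawn
    #define cilk_sync
    #define cilk_for for

With these definitions, the Fibonacci code from a few slides ago compiles to exactly the sequential version.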
1235 01:07:28,170 --> 01:07:30,780 Or actually, there's actually a compiler flag 1236 01:07:30,780 --> 01:07:34,720 that will do that for you and give you the equivalent C 1237 01:07:34,720 --> 01:07:35,220 program. 1238 01:07:35,220 --> 01:07:37,110 So this is a nice way to debug, because you 1239 01:07:37,110 --> 01:07:39,630 don't have to start with the parallel program. 1240 01:07:39,630 --> 01:07:42,460 You can first check if this serial program is correct 1241 01:07:42,460 --> 01:07:45,370 before you go on to debug the parallel program. 1242 01:07:47,930 --> 01:07:51,030 Questions? 1243 01:07:51,030 --> 01:07:52,090 Yes? 1244 01:07:52,090 --> 01:07:54,030 AUDIENCE: So does cilk_for-- 1245 01:07:54,030 --> 01:07:59,730 does each iteration of the cilk_for become its own task 1246 01:07:59,730 --> 01:08:04,095 that the scheduler decides if it wants to execute in parallel, 1247 01:08:04,095 --> 01:08:06,520 or if it executes in parallel, do all of the iterations 1248 01:08:06,520 --> 01:08:08,950 execute in parallel? 1249 01:08:08,950 --> 01:08:12,310 JULIAN SHUN: So it turns out that by default, 1250 01:08:12,310 --> 01:08:16,899 it groups a bunch of iterations together into a single task, 1251 01:08:16,899 --> 01:08:19,479 because it doesn't make sense to break it down 1252 01:08:19,479 --> 01:08:23,590 into such small chunks, due to the overheads of parallelism. 1253 01:08:23,590 --> 01:08:26,170 But there's actually a setting you 1254 01:08:26,170 --> 01:08:28,540 can do to change the grain size of the for loop. 1255 01:08:28,540 --> 01:08:32,069 So you could actually make it so that each iteration 1256 01:08:32,069 --> 01:08:34,330 is its own task. 1257 01:08:34,330 --> 01:08:37,359 And then the scheduler will 1258 01:08:37,359 --> 01:08:39,850 decide how to map these different tasks 1259 01:08:39,850 --> 01:08:42,189 onto different processors, or even 1260 01:08:42,189 --> 01:08:45,549 whether it wants to execute any of these tasks in parallel. 1261 01:08:45,549 --> 01:08:46,479 So good question. 1262 01:08:56,600 --> 01:09:00,410 So the idea in Cilk is to allow the programmer 1263 01:09:00,410 --> 01:09:03,870 to express logical parallelism in an application. 1264 01:09:03,870 --> 01:09:06,890 So the programmer just has to identify 1265 01:09:06,890 --> 01:09:09,649 which pieces of the code could be executed in parallel, 1266 01:09:09,649 --> 01:09:15,050 but doesn't necessarily have to determine which pieces of code 1267 01:09:15,050 --> 01:09:18,439 should be executed in parallel. 1268 01:09:18,439 --> 01:09:21,350 And then Cilk has a runtime scheduler 1269 01:09:21,350 --> 01:09:24,560 that will automatically map the executing 1270 01:09:24,560 --> 01:09:28,760 program onto the available processor cores at runtime. 1271 01:09:28,760 --> 01:09:31,282 And it does this dynamically using 1272 01:09:31,282 --> 01:09:34,149 a work-stealing scheduling algorithm. 1273 01:09:34,149 --> 01:09:35,720 And the work-stealing scheduler is 1274 01:09:35,720 --> 01:09:39,439 used to balance the tasks evenly across 1275 01:09:39,439 --> 01:09:40,939 the different processors. 1276 01:09:40,939 --> 01:09:44,000 And we'll talk more about the work-stealing scheduler 1277 01:09:44,000 --> 01:09:45,740 in a future lecture.
1278 01:09:45,740 --> 01:09:49,340 But I want to emphasize that unlike the other concurrency 1279 01:09:49,340 --> 01:09:52,279 platforms that we looked at today, 1280 01:09:52,279 --> 01:09:55,520 Cilk's work-stealing scheduling algorithm is theoretically 1281 01:09:55,520 --> 01:10:00,560 efficient, whereas the OpenMP and TBB schedulers are not 1282 01:10:00,560 --> 01:10:01,580 theoretically efficient. 1283 01:10:01,580 --> 01:10:04,490 So this is a nice property, because it will guarantee you 1284 01:10:04,490 --> 01:10:07,910 that the algorithms you write on top of Cilk 1285 01:10:07,910 --> 01:10:10,208 will also be theoretically efficient. 1286 01:10:13,420 --> 01:10:15,520 So here's a high-level illustration 1287 01:10:15,520 --> 01:10:19,460 of the Cilk ecosystem. 1288 01:10:19,460 --> 01:10:22,240 It's a very simplified view, but I did this 1289 01:10:22,240 --> 01:10:25,860 to fit it on a single slide. 1290 01:10:25,860 --> 01:10:28,840 So what you do is you take the Cilk source code, 1291 01:10:28,840 --> 01:10:32,320 you pass it to your favorite Cilk compiler-- 1292 01:10:32,320 --> 01:10:35,410 the Tapir/LLVM compiler-- and this 1293 01:10:35,410 --> 01:10:40,720 gives you a binary that you can run on multiple processors. 1294 01:10:40,720 --> 01:10:43,510 And then you pass a program input to the binary, 1295 01:10:43,510 --> 01:10:48,460 you run it on however many processors you have, 1296 01:10:48,460 --> 01:10:50,830 and then this allows you to benchmark the parallel 1297 01:10:50,830 --> 01:10:52,150 performance of your program. 1298 01:10:55,890 --> 01:10:58,010 You can also do serial testing. 1299 01:10:58,010 --> 01:11:01,820 And to do this, you just obtain a serial elision of the Cilk 1300 01:11:01,820 --> 01:11:06,125 program, and you pass it to an ordinary C or C++ compiler. 1301 01:11:06,125 --> 01:11:11,360 It generates a binary that can only run on a single processor, 1302 01:11:11,360 --> 01:11:14,330 and you run your suite of serial regression tests 1303 01:11:14,330 --> 01:11:17,020 on this single-threaded binary. 1304 01:11:17,020 --> 01:11:19,910 And this will allow you to benchmark the performance 1305 01:11:19,910 --> 01:11:22,970 of your serial code and also debug any issues 1306 01:11:22,970 --> 01:11:25,100 that might have arisen when you were running 1307 01:11:25,100 --> 01:11:26,868 this program sequentially. 1308 01:11:30,460 --> 01:11:32,690 Another way to do this is you can actually just 1309 01:11:32,690 --> 01:11:36,320 compile the original Cilk code but run it 1310 01:11:36,320 --> 01:11:37,520 on a single processor. 1311 01:11:37,520 --> 01:11:39,290 So there's a command line argument 1312 01:11:39,290 --> 01:11:42,410 that tells the runtime system how many processors you 1313 01:11:42,410 --> 01:11:42,950 want to use. 1314 01:11:42,950 --> 01:11:45,500 And if you set that parameter to 1, 1315 01:11:45,500 --> 01:11:47,690 then it will only use a single processor. 1316 01:11:47,690 --> 01:11:53,120 And this allows you to benchmark the single-threaded performance 1317 01:11:53,120 --> 01:11:54,260 of your code as well. 1318 01:11:54,260 --> 01:11:57,560 And the parallel program executing on a single core 1319 01:11:57,560 --> 01:12:00,050 should behave exactly the same way 1320 01:12:00,050 --> 01:12:02,810 as the execution of this serial elision. 1321 01:12:02,810 --> 01:12:07,780 So that's one of the advantages of using Cilk.
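For example, with the Intel Cilk Plus runtime that Tapir/LLVM currently uses, I believe the worker count can be set either through the CILK_NWORKERS environment variable or programmatically along these lines; the course's own benchmarking setup may wrap this differently, so treat this as a sketch.

    #include <cilk/cilk_api.h>
    #include <cstdio>

    int main() {
        // Ask the runtime to use a single worker thread. This has to happen
        // before the runtime starts up (i.e., before the first spawn).
        // Setting CILK_NWORKERS=1 in the environment should have the same effect.
        if (__cilkrts_set_param("nworkers", "1") != 0) {
            std::fprintf(stderr, "could not set the number of workers\n");
        }
        std::printf("running with %d worker(s)\n", __cilkrts_get_nworkers());
        // ... run and time the parallel code here ...
        return 0;
    }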
1322 01:12:07,780 --> 01:12:12,560 And because you can easily do serial testing using the Cilk 1323 01:12:12,560 --> 01:12:15,500 platform, this allows you to separate out 1324 01:12:15,500 --> 01:12:17,930 the serial correctness from the parallel correctness. 1325 01:12:17,930 --> 01:12:21,050 As I said earlier, you can first debug the serial correctness, 1326 01:12:21,050 --> 01:12:23,180 as well as any performance issues, before moving on 1327 01:12:23,180 --> 01:12:26,220 to the parallel version. 1328 01:12:26,220 --> 01:12:27,710 And another point I want to make is 1329 01:12:27,710 --> 01:12:35,630 that because Cilk actually uses the serial program 1330 01:12:35,630 --> 01:12:38,210 inside its tasks, it's actually good to optimize 1331 01:12:38,210 --> 01:12:40,670 the serial program even when you're 1332 01:12:40,670 --> 01:12:42,830 writing a parallel program, because optimizing 1333 01:12:42,830 --> 01:12:44,510 the serial program for performance 1334 01:12:44,510 --> 01:12:47,930 will also translate to better parallel performance. 1335 01:12:52,550 --> 01:12:55,460 Another nice feature of Cilk is that it 1336 01:12:55,460 --> 01:12:58,340 has this tool called Cilksan, which 1337 01:12:58,340 --> 01:13:01,070 stands for Cilk Sanitizer. 1338 01:13:01,070 --> 01:13:06,020 And Cilksan will detect any determinacy races 1339 01:13:06,020 --> 01:13:08,930 that you have in your code, which will significantly 1340 01:13:08,930 --> 01:13:12,620 help you with debugging the correctness 1341 01:13:12,620 --> 01:13:16,290 as well as the performance of your code. 1342 01:13:16,290 --> 01:13:21,200 So if you compile the Cilk code using the Cilksan flag, 1343 01:13:21,200 --> 01:13:24,410 it will generate an instrumented binary that, when you run it, 1344 01:13:24,410 --> 01:13:27,983 will find and localize all the determinacy races 1345 01:13:27,983 --> 01:13:28,650 in your program. 1346 01:13:28,650 --> 01:13:31,340 So it will tell you where the determinacy races occur, 1347 01:13:31,340 --> 01:13:33,770 so that you can go inspect that part of your code 1348 01:13:33,770 --> 01:13:37,740 and fix it if necessary. 1349 01:13:37,740 --> 01:13:42,170 So this is a very useful tool for debugging 1350 01:13:42,170 --> 01:13:43,170 your parallel programs. 1351 01:13:45,890 --> 01:13:49,400 Cilk also has another nice tool called Cilkscale. 1352 01:13:49,400 --> 01:13:53,720 Cilkscale is a performance analyzer. 1353 01:13:53,720 --> 01:13:55,850 It will analyze how much parallelism 1354 01:13:55,850 --> 01:13:58,880 is available in your program as well as the total amount 1355 01:13:58,880 --> 01:14:00,800 of work that it's doing. 1356 01:14:00,800 --> 01:14:03,440 So again, you pass a flag to the compiler that 1357 01:14:03,440 --> 01:14:05,870 will turn on Cilkscale, and it will generate 1358 01:14:05,870 --> 01:14:08,360 a binary that is instrumented. 1359 01:14:08,360 --> 01:14:11,192 And then when you run this code, it 1360 01:14:11,192 --> 01:14:12,650 will give you a scalability report. 1361 01:14:15,470 --> 01:14:17,390 So you'll find these tools very useful when 1362 01:14:17,390 --> 01:14:20,540 you're doing the next project. 1363 01:14:20,540 --> 01:14:23,210 And we'll talk a little bit more about these two tools 1364 01:14:23,210 --> 01:14:24,110 in the next lecture. 1365 01:14:26,630 --> 01:14:29,300 And as I said, Cilkscale will analyze how well your program 1366 01:14:29,300 --> 01:14:30,860 will scale to larger machines.
1367 01:14:30,860 --> 01:14:33,860 So it will basically tell you the maximum number 1368 01:14:33,860 --> 01:14:36,540 of processors that your code could possibly take advantage 1369 01:14:36,540 --> 01:14:37,040 of. 1370 01:14:39,860 --> 01:14:40,980 Any questions? 1371 01:14:40,980 --> 01:14:41,480 Yes? 1372 01:14:41,480 --> 01:14:43,900 AUDIENCE: What do you mean when you say runtime? 1373 01:14:43,900 --> 01:14:46,970 JULIAN SHUN: So I mean the scheduler-- the Cilk runtime 1374 01:14:46,970 --> 01:14:50,960 scheduler that's scheduling the different tasks when 1375 01:14:50,960 --> 01:14:52,874 you're running the program. 1376 01:14:52,874 --> 01:14:55,830 AUDIENCE: So that's included in the binary. 1377 01:14:55,830 --> 01:14:57,960 JULIAN SHUN: So it's linked from the binary. 1378 01:14:57,960 --> 01:14:59,420 It's not stored in the same place. 1379 01:14:59,420 --> 01:15:00,380 It's linked. 1380 01:15:03,740 --> 01:15:05,180 Other questions? 1381 01:15:08,400 --> 01:15:11,300 So let me summarize what we looked at today. 1382 01:15:11,300 --> 01:15:16,300 So first, we saw that most processors today 1383 01:15:16,300 --> 01:15:17,470 have multiple cores. 1384 01:15:17,470 --> 01:15:20,800 And probably all of your laptops have more than one core on them. 1385 01:15:20,800 --> 01:15:23,590 Who has a laptop that only has one core? 1386 01:15:26,901 --> 01:15:29,270 AUDIENCE: [INAUDIBLE]. 1387 01:15:29,270 --> 01:15:32,398 JULIAN SHUN: When did you buy it? 1388 01:15:32,398 --> 01:15:33,440 Probably a long time ago. 1389 01:15:42,520 --> 01:15:45,740 So nowadays, obtaining high performance on your machines 1390 01:15:45,740 --> 01:15:48,220 requires you to write parallel programs. 1391 01:15:48,220 --> 01:15:51,178 But parallel programming can be very hard, 1392 01:15:51,178 --> 01:15:52,970 especially if you have to program directly 1393 01:15:52,970 --> 01:15:55,310 on the processor cores and interact with the operating 1394 01:15:55,310 --> 01:15:56,900 system yourself. 1395 01:15:56,900 --> 01:16:00,740 So Cilk is very nice, because it abstracts the processor cores 1396 01:16:00,740 --> 01:16:03,140 from the programmer, it handles synchronization 1397 01:16:03,140 --> 01:16:06,860 and communication protocols, and it also performs 1398 01:16:06,860 --> 01:16:09,870 provably good load-balancing. 1399 01:16:09,870 --> 01:16:11,990 And in the next project, you'll have a chance 1400 01:16:11,990 --> 01:16:14,270 to play around with Cilk. 1401 01:16:14,270 --> 01:16:17,490 You'll be implementing your own parallel screensaver, 1402 01:16:17,490 --> 01:16:20,200 so that's a very fun project to do. 1403 01:16:20,200 --> 01:16:22,430 And possibly, in one of the future lectures, 1404 01:16:22,430 --> 01:16:24,820 we'll post some of the nicest screensavers 1405 01:16:24,820 --> 01:16:28,920 that students developed for everyone to see. 1406 01:16:28,920 --> 01:16:31,070 OK, so that's all.