VOICEOVER: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JULIAN SHUN: Good afternoon, everyone. So today we're going to talk about storage allocation. This is a continuation from last lecture, where we talked about serial storage allocation. Today we'll also talk a little bit more about serial allocation, but then I'll talk more about parallel allocation and also garbage collection.

So I want to just do a review of some memory allocation primitives. Recall that you can use malloc to allocate memory from the heap. If you call malloc with a size s, it's going to allocate and return a pointer to a block of memory containing at least s bytes. So you might actually get more than s bytes, even though you asked for s bytes, but it's guaranteed to give you at least s bytes. The return value is a void star (void *), but good programming practice is to typecast this pointer to whatever type you're using this memory for when you receive it from the malloc call.

There's also aligned allocation. You can do aligned allocation with memalign, which takes two arguments, an alignment a as well as a size s. The alignment a has to be an exact power of 2, and memalign is going to allocate and return a pointer to a block of memory, again containing at least s bytes. But this time the memory is going to be aligned to a multiple of a, so the address where this memory block starts is going to be a multiple of a.

So does anyone know why we might want to do an aligned memory allocation?

Yeah?
STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah, so one reason is that you can align memory so that it's aligned to cache lines, so that when you access an object that fits within a cache line, it's not going to cross two cache lines, and you'll only incur one cache access instead of two. So one reason is that you want to align the memory to cache lines to reduce the number of cache misses. Another reason is that the vectorization operations also require you to have memory addresses that are aligned to some power of 2. So if you align your memory allocation with memalign, then that's also good for the vector units.

We also talked about deallocation. You can free memory back to the heap with the free function. If you pass it a pointer p to some block of memory, it's going to deallocate this block and return it to the storage allocator.

And we also talked about some anomalies of freeing. So what is it called when you fail to free some memory that you allocated?

Yes? Yeah, so if you fail to free something that you allocated, that's called a memory leak. And this can cause your program to use more and more memory, and eventually your program is going to use up all the memory on your machine, and it's going to crash.

We also talked about freeing something more than once. Does anyone remember what that's called?

Yeah? Yeah, so that's called double freeing. Double freeing is when you free something more than once, and the behavior is going to be undefined. You might get a seg fault immediately, or you'll free something that was allocated for some other purpose, and then later down the road your program is going to have some unexpected behavior.
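To make these primitives concrete, here is a minimal sketch in C of how they might be used. This example is illustrative, not from the lecture slides, and it assumes a Linux/glibc system where memalign is declared in malloc.h.

    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc.h>   /* for memalign() on glibc; an assumption about the platform */

    int main(void) {
        /* malloc: at least 1000 * sizeof(int) bytes; cast the void * to the type in use. */
        int *x = (int *) malloc(1000 * sizeof(int));
        if (x == NULL) return 1;

        /* memalign: at least 1024 bytes, starting at an address that is a multiple of 64
           (a common cache-line size); the alignment must be an exact power of 2. */
        double *y = (double *) memalign(64, 1024);
        if (y == NULL) { free(x); return 1; }

        x[0] = 42;
        y[0] = 3.14;
        printf("%d %f\n", x[0], y[0]);

        /* Free each block exactly once: forgetting is a memory leak,
           freeing twice is a double free. */
        free(x);
        free(y);
        return 0;
    }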
OK. I also want to talk about mmap. So mmap is a system call. And usually mmap is used to treat some file on disk as part of memory, so that when you write to that memory region, it also backs it up on disk. In this context here, I'm actually using mmap to allocate virtual memory without having any backing file.

So mmap has a whole bunch of parameters here. The second-to-last parameter indicates the file I want to map, and if I pass a negative 1, that means there's no backing file; I'm just using this to allocate some virtual memory. The first argument is where I want to allocate it, and 0 means that I don't care. The size, in number of bytes, says how much memory I want to allocate. Then there are also permissions; here it says I can read and write this memory region. MAP_PRIVATE means that this memory region is private to the process that's allocating it, and MAP_ANON means that there is no name associated with this memory region. And then, as I said, negative 1 means that there's no backing file. The last parameter is just 0 if there's no backing file; normally it would be an offset into the file that you're trying to map, but here there's no backing file.

And what mmap does is it finds a contiguous unused region in the address space of the application that's large enough to hold size bytes, and then it updates the page table so that it now contains an entry for the pages that you allocated. And then it creates the necessary virtual memory management structures within the operating system to make it so that user accesses to this area are legal, and accesses won't result in a seg fault. If you try to access some region of memory without having the OS set these permissions, then you might get a seg fault, because the program might not have permission to access that area. But mmap is going to make sure that the user can access this area of virtual memory.

And mmap is a system call, whereas malloc, which we talked about last time, is a library call. So these are two different things. And malloc actually uses mmap under the hood to get more memory from the operating system.
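Here is a minimal sketch of the kind of call being described, assuming a POSIX system where the MAP_ANON flag is available (on some systems the same flag is spelled MAP_ANONYMOUS):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t size = 1 << 20;   /* request 1 MB of virtual memory */

        /* addr = 0: I don't care where it goes.  PROT_READ | PROT_WRITE: read/write
           permissions.  MAP_PRIVATE | MAP_ANON: private to this process, no name.
           fd = -1 and offset = 0: there is no backing file. */
        void *p = mmap(0, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        ((char *) p)[0] = 'a';   /* first touch of a page in the new region */

        munmap(p, size);         /* return the region to the OS */
        return 0;
    }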
So let's look at some properties of mmap. mmap is lazy. When you request a certain amount of memory, it doesn't immediately allocate physical memory for the requested allocation. Instead, it just populates the page table with entries pointing to a special zero page, and then it marks these pages as read-only. The first time you write to such a page, it will cause a page fault, and at that point the OS is going to modify the page table, get the appropriate physical memory, and store the mapping from the virtual address space to the physical address space for the particular page that you touched. And then it will restart the instruction so that the program can continue to execute.

It turns out that you can actually mmap a terabyte of virtual memory, even on a machine with just a gigabyte of DRAM, because when you call mmap, it doesn't actually allocate the physical memory. But then you should be careful, because a process might die from running out of physical memory well after you call mmap. mmap is going to allocate the physical memory whenever you first touch it, and this could be much later than when you actually made the call to mmap.
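A small sketch of the lazy behavior just described (the region size here is made up, and the example assumes a 64-bit Linux-style system; whether such a large reservation succeeds can also depend on the OS's overcommit settings):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Reserve 64 GB of virtual address space; no physical memory is
           committed yet, so this can succeed on a machine with far less DRAM. */
        size_t size = 64UL * 1024 * 1024 * 1024;
        char *p = (char *) mmap(0, size, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch only a handful of pages; only these get physical frames,
           each first touch taking a page fault. */
        for (size_t i = 0; i < 10; i++)
            p[i * 4096] = 1;

        munmap(p, size);
        return 0;
    }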
So any questions so far?

OK. So what's the difference between malloc and mmap? As I said, malloc is a library call. malloc and free are part of the memory allocation interface of the heap-management code in the C library. And the heap-management code uses the available system facilities, including the mmap function, to get virtual address space from the operating system. Then the heap-management code within malloc is going to attempt to satisfy user requests for heap storage by reusing the memory that it got from the OS as much as possible, until it can't do that anymore, and then it will go and call mmap to get more memory from the operating system.

So the malloc implementation invokes mmap and other system calls to expand the size of the user's heap storage. The responsibility of malloc is to reuse the memory such that your fragmentation is reduced and you have good temporal locality, whereas the responsibility of mmap is actually getting this memory from the operating system.

Any questions on the differences between malloc and mmap?

So one question is, why don't we just call mmap all the time, instead of using malloc? Why don't we just directly call mmap?

Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yes, so one answer is that you might have free storage from before that you would want to reuse. And it turns out that mmap is relatively heavyweight. It works on a page granularity, so if you want to do a small allocation, it's quite wasteful to allocate an entire page for that allocation and not reuse it. You'll get very bad external fragmentation. And when you call mmap, it has to go through all of the overhead of the OS's security checks, updating the page table, and so on. Whereas if you use malloc, it's actually pretty fast for most allocations, especially if you have temporal locality, where you allocate something that you just recently freed. So your program would be pretty slow if you used mmap all the time, even for small allocations. For big allocations it's fine, but for small allocations you should use malloc.

Any questions on mmap versus malloc?

OK, so I just want to do a little bit of review on how address translation works. Some of you might have seen this before in your computer architecture course. How it works is, when you access a memory location, you access it via the virtual address. And the virtual address can be divided into two parts, where the lower-order bits store the offset and the higher-order bits store the virtual page number.
And in order to get the physical address associated with this virtual address, the hardware is going to look up this virtual page number in what's called the page table. If it finds a corresponding entry for the virtual page number in the page table, that will tell us the physical frame number. The physical frame number corresponds to where this physical memory is in DRAM, so you can just take the frame number and then use the same offset as before to get the appropriate offset into the physical memory frame.

If the virtual page that you're looking for doesn't reside in physical memory, then a page fault is going to occur. When a page fault occurs, either the operating system will see that the process actually has permission to look at that memory region, and it will set the permissions and place the entry into the page table so that you can get the appropriate physical address; or otherwise the operating system might see that this process actually can't access that region of memory, and then you'll get a segmentation fault.

It turns out that the page table search, also called a page walk, is pretty expensive. And that's why we have the translation lookaside buffer, or TLB, which is essentially a cache for the page table. The hardware uses the TLB to cache recent page table lookups, so that later on, when you access the same page, it doesn't have to go all the way to the page table to find the physical address; it can first look in the TLB to see if that page has been recently accessed.

So why would you expect to see something that has recently been accessed? What's one property of a program that will make it so that you get a lot of TLB hits?

Yes?

STUDENT: Well, usually [INAUDIBLE] nearby one another, which means they're probably in the same page or [INAUDIBLE].

JULIAN SHUN: Yeah, so that's correct. So the page table stores pages, which are typically 4 kilobytes.
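As a toy illustration of the split just described (the address below is arbitrary, and this is not how you would really query the hardware): with 4-kilobyte pages, the low 12 bits of a virtual address are the offset within the page, and the remaining high bits are the virtual page number.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define OFFSET_BITS 12       /* log2(4096) */

    int main(void) {
        uintptr_t vaddr  = 0x12345678UL;              /* an arbitrary example address */
        uintptr_t vpn    = vaddr >> OFFSET_BITS;       /* virtual page number */
        uintptr_t offset = vaddr & (PAGE_SIZE - 1);    /* offset within the page */
        /* The hardware looks up vpn in the page table (or the TLB) to get a
           physical frame number, then reattaches the same offset. */
        printf("vpn = 0x%lx, offset = 0x%lx\n",
               (unsigned long) vpn, (unsigned long) offset);
        return 0;
    }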
Nowadays there are also huge pages, which can be a couple of megabytes. And most of the accesses in your program are going to be near each other, so accesses that are done close together in time are likely going to reside on the same page. Therefore you'd expect that many of your recent accesses are going to be stored in the TLB, if your program has locality, either spatial or temporal locality or both.

So how this architecture works is that the processor is first going to check whether the virtual address you're looking for is in the TLB. If it's not, it's going to go to the page table and look it up, and if it finds it there, then it's going to store that entry into the TLB. Next it's going to take the physical address that it found via the TLB and look it up in the CPU cache. If it finds it there, it gets it; if it doesn't, then it goes to DRAM to satisfy the request. Most modern machines actually have an optimization that allows you to do the TLB access in parallel with the L1 cache access. So the L1 cache actually uses virtual addresses instead of physical addresses, and this reduces the latency of a memory access.

So that's a brief review of address translation.

All right, so let's talk about stacks. When you execute a serial C or C++ program, you're using a stack to keep track of the function calls and the local variables that you have to save. So here, let's say we have this invocation tree, where function A calls function B, which then returns. And then A calls function C, which calls D, returns, calls E, returns, and then returns again.

Here are the different views of the stack at different points of the execution. Initially, when we call A, we have a stack frame for A. Then when A calls B, we're going to place a stack frame for B right below the stack frame of A, so these are going to be linearly ordered. When we're done with B, then this part of the stack, the part for B, is no longer going to be used.
And then when A calls C, it's going to allocate a stack frame below A on the stack. This space is actually going to be the same space as what B was using before, but that's fine, because we're already done with the call to B. Then when C calls D, we're going to create a stack frame for D right below C. When D returns, we're not going to use that space anymore, so then we can reuse it for the stack frame when we call E. And then eventually all of these will pop back off.

All of these views here share the same view of the stack frame for A. And C, D, and E all share the same view of the stack frame for C. So this is how a traditional linear stack works when you run a serial C or C++ program. And you can view this as a serial walk over the invocation tree.

There's one rule for pointers with traditional linear stacks: a parent can pass pointers to its stack variables down to its children, but not the other way around. A child can't pass a pointer to one of its local variables back to its parent. If you do that, you'll get a bug in your program. How many of you have tried doing that before? Yeah, so a lot of you.

So let's see why that causes a problem. If I call B, and B passes a pointer to some local variable on its stack back up to A, then now, when A calls C, C is going to overwrite the space that B was using. And if B's local variable was stored in the space that C has now overwritten, then you're just going to see garbage, and when you try to access it, you're not going to get the correct value. You can pass a pointer to A's local variable down to any of these descendant function calls, because they all see the same view of A's stack frame, and it's not going to be overwritten while these descendant function calls are proceeding. But if you pass it the other way, then potentially the variable that you had a pointer to is going to be overwritten.
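A minimal sketch of that bug in C (an illustrative example, not from the lecture): the child hands back a pointer into its own stack frame, and the next call reuses that space, so reading through the pointer is undefined behavior.

    #include <stdio.h>

    int *child(void) {
        int local = 42;
        return &local;        /* WRONG: local lives in child's soon-to-be-popped frame */
    }

    void other(void) {
        int scratch[16];      /* likely reuses the space that child's frame occupied */
        for (int i = 0; i < 16; i++) scratch[i] = i;
    }

    int main(void) {
        int *p = child();     /* p now points into a dead stack frame */
        other();
        /* printf("%d\n", *p);   reading *p here is undefined behavior */
        return 0;
    }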
So here's one question. If you want to pass memory from a child back to the parent, where would you allocate it? You could allocate it in the parent. What's another option?

Yes? Yes, so another way to do this is to allocate it on the heap. If you allocate it on the heap, then even after you return from the function call, that memory is going to persist. You can also allocate it in the parent's stack, if you want; in fact, some programs are written that way. And one of the reasons why many C functions require you to pass in the memory where the function is going to store the return value is to try to avoid an expensive heap allocation in the child. Because if the parent allocates the space to store the result, the child can just put whatever it wants to compute into that space, and the parent will see it. So then the responsibility is on the parent to figure out whether it wants to allocate that memory on the stack or on the heap. This is one of the reasons why you'll see many C functions where one of the arguments is a memory location where the result should be stored.
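A minimal sketch of that idiom (sum_into is a made-up helper, not from the lecture): the parent owns the storage for the result and passes a pointer down, so the child never returns a pointer to its own stack and never has to touch the heap.

    #include <stddef.h>

    /* The caller provides the location where the result should be stored. */
    void sum_into(const int *a, size_t n, long *result) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        *result = s;
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        long total;               /* lives in the parent's stack frame */
        sum_into(a, 4, &total);   /* child writes its result into the parent's storage */
        return total == 10 ? 0 : 1;
    }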
OK, so that was the serial case. What happens in parallel? In parallel, we have what's called a cactus stack, where we can support multiple views of the stack in parallel. So let's say we have a program that calls function A, and then A spawns B and C, so B and C are going to be running potentially in parallel. And then C spawns D and E, which can also potentially be running in parallel. So for this program, we could have functions B, D, and E all executing in parallel. And a cactus stack is going to allow all of these functions to see the same view of the stack as they would have if this program were executed serially. The Cilk runtime system supports a cactus stack to make it easy to write parallel programs, because now, when you're writing programs, you just have to obey the same rules as for programming in serial C and C++ with regard to the stack, and you'll still get the intended behavior.

And it turns out that there's no copying of the stacks here. All of these different views are seeing the same virtual memory addresses for A. But now there's an issue of how we implement this cactus stack. Because in the serial case, we could have the later stack frames overwriting the earlier ones. But in parallel, how can we do this? Does anyone have any simple ideas on how we can implement a cactus stack?

Yes?

STUDENT: You could just have each child's stack start in, like, a separate stack, or just have references to the [INAUDIBLE].

JULIAN SHUN: Yeah, so one way to do this is to have each thread use a different stack, and then store pointers to the different stack frames across the different stacks. There's actually another way to do this, which is easier.

OK, yes?

STUDENT: If the stack frames have a maximum-- fixed maximum size-- then you could put them all in the same stack, separated by that fixed size.

JULIAN SHUN: Yeah, so if the stacks all have a maximum depth, then you could just allocate a whole bunch of stacks, which are separated by this maximum depth. There's actually another way to do this, which is to not use the stack at all. So yes?

STUDENT: Could you memory map it somewhere else-- each of the different threads?

JULIAN SHUN: Yes, that's actually one way to do it. The easiest way to do it is just to allocate it off the heap. So instead of allocating the frames on the stack, you just do a heap allocation for each of these stack frames, and then each of these stack frames has a pointer to the parent stack frame.
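A minimal sketch (not the actual Cilk runtime) of what such heap-linked frames might look like: every frame is allocated from the heap and points to its parent, so many children can extend the same parent frame without overwriting one another.

    #include <stdlib.h>

    typedef struct frame {
        struct frame *parent;   /* link to the caller's frame */
        /* ... the local variables for this call would live here ... */
    } frame;

    /* Called on function entry: allocate a fresh frame linked to the caller's. */
    frame *push_frame(frame *parent) {
        frame *f = (frame *) malloc(sizeof(frame));
        if (f != NULL) f->parent = parent;
        return f;
    }

    /* Called on function return: give the frame back to the heap. */
    void pop_frame(frame *f) {
        free(f);
    }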
So whenever you do a function call, you're going to do a memory allocation from the heap to get a new stack frame. And then when you finish a function, you're going to pop something off of this stack and free it back to the heap. In fact, a lot of early systems for parallel programming used this strategy of heap-based cactus stacks. It turns out that you can actually minimize the performance impact of this strategy if you optimize the code enough. But there is actually a bigger problem with using a heap-based cactus stack, which doesn't have to do with performance. Does anybody have any guesses as to what this potential issue is?

Yeah?

STUDENT: It requires you to allocate the heap in parallel.

JULIAN SHUN: Yeah, so let's assume that we can do parallel heap allocation--and we'll talk about that. So assuming that we can do that correctly, what's the issue with this approach?

Yeah?

STUDENT: It's that you don't know how big the stack is going to be?

JULIAN SHUN: So let's assume that you can get whatever stack frames you need from the heap, so you don't actually need to put an upper bound on this.

Yeah?

STUDENT: We don't know the maximum depth.

JULIAN SHUN: Yeah. So we don't know the maximum depth, but let's say we can make that work. You don't actually need to know the maximum depth if you're allocating off the heap.

Any other guesses?

Yeah?

STUDENT: Something to do with returning from the stack that is allocated on the heap to one of the original stacks.

JULIAN SHUN: So let's say we could get that to work as well.

So what happens if I try to run some program using this heap-based cactus stack together with something that's using the regular stack? Let's say I have some old legacy code that was already compiled using the traditional linear stack. So there's a problem with interoperability here.
Because the traditional code is assuming that, when you make a function call, the stack frame for the function call is going to appear right after the stack frame of the calling function. So if you try to mix code that uses the traditional stack with code that uses this heap-based cactus stack approach, then they're not going to work well together. One approach is that you can just recompile all your code to use the heap-based cactus stack. But even if you could do that, even if all of the source code were available, there are some legacy programs that actually do some manipulations of the stack inside the source code, because they assume that you're using the traditional stack, and those programs would no longer work if you were using a heap-based cactus stack. So the problem is interoperability with legacy code.

It turns out that you can fix this using an approach called thread-local memory mapping. So one of the students mentioned memory mapping. But that requires changes to the operating system, so it's not general-purpose. The heap-based cactus stack, on the other hand, turns out to be very simple, and we can prove nice bounds about it. So besides the interoperability issue, heap-based cactus stacks are pretty good in practice, as well as in theory.

In fact, we can prove a space bound for a Cilk program that uses a heap-based cactus stack. Let's say S1 is the stack space required by a serial execution of a Cilk program. Then the stack space of a p-worker execution using a heap-based cactus stack is going to be upper bounded by p times S1. So if Sp is the space for a p-worker execution, then Sp is less than or equal to p times S1.

To understand how this works, we need to understand a little bit about how the Cilk work-stealing algorithm works. In the Cilk work-stealing algorithm, whenever a worker spawns a new task, it's going to work on the task that it just spawned.
So therefore, for any leaf in the invocation tree that currently exists, there's always going to be a worker working on it. There aren't going to be any leaves in the tree with no worker working on them, because when a worker spawns a task, it creates a new leaf, but then it immediately works on that leaf. So here we have an invocation tree, and for every one of the leaves, we have a processor working on it.

And with this busy-leaves property, we can easily show the space bound. For each one of these processors, the maximum stack space it's using is going to be upper bounded by S1, because that's the maximum stack space across a serial execution that executes the whole program. And then, since we have p of these leaves, we just multiply S1 by p, and that gives us an upper bound on the overall space used by a p-worker execution. This can be a loose upper bound, because we're double counting here--there's some part of this memory that we're counting more than once, because it's shared among the different processors. But that's why we have the less-than-or-equal-to here. So the space is upper bounded by p times S1.

So this is one of the nice things about using a heap-based cactus stack: you get this good space bound.

Any questions on the space bound here?
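Written in symbols, the bound just argued is (a restatement, with S1 and Sp as defined above):

    S_p \;\le\; p \cdot S_1 ,

since at any point in time there are at most p busy leaves, and the stack space along the root-to-leaf path of any single leaf is at most S_1.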
So let's try to apply this theorem to a real example. This is the divide-and-conquer matrix multiplication code that we saw in a previous lecture. In this code, we're making eight recursive calls to the divide-and-conquer function, each on a subproblem of size n over 2. And before we make any of these calls, we're doing a malloc to get some temporary space, and this is of size order n squared. Then we free this temporary space at the end. And notice here that the allocations of the temporary matrix obey a stack discipline: we're allocating before we make the recursive calls, and we're freeing right before we return from the function. So all the allocations are nested, and they follow a stack discipline. And it turns out that, even if you're allocating off the heap, if you follow a stack discipline, you can still use the space bound from the previous slide to upper bound the p-worker space.

OK, so let's try to analyze the space of this code. First let's look at what the work and span are--this is just going to be review. What's the work of this divide-and-conquer matrix multiply? So it's n cubed. It's n cubed because we have eight subproblems of size n over 2, and then we have to do work proportional to the size of the matrices, order n squared, to add them together. So our recurrence is going to be T1(n) = 8 T1(n/2) + order n squared, and that solves to order n cubed if you just pull out your master theorem card.

What about the span? What's the recurrence here? Yeah, so the span T-infinity(n) is equal to T-infinity(n/2) plus the span of the addition. And what's the span of the addition?

STUDENT: [INAUDIBLE]

JULIAN SHUN: No, let's assume that we have a parallel addition--we have nested cilk_for loops. Right, so then the span of that is just going to be log n, since the span of one cilk_for loop is log n, and when you nest them, you just add the spans together. So it's going to be T-infinity(n) = T-infinity(n/2) + order log n. And what does that solve to? Yeah, so it's going to solve to order log squared n. Again, you can pull out your master theorem card and look at one of the three cases.

OK, so now let's look at the space. What's going to be the recurrence for the space?

Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: The only place we're generating new space is when we call this malloc here. So they're all seeing the same original matrix. So what would the recurrence be?

Yeah?
STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah.

STUDENT: [INAUDIBLE]

JULIAN SHUN: So the n squared term is right. But do we actually need eight subproblems of size n over 2? What happens after we finish one of these subproblems? Are we still going to use the space for it?

STUDENT: Yeah, you free the memory after the [INAUDIBLE].

JULIAN SHUN: Right. So you can actually reuse the memory, because you free the memory you allocated after each one of these recursive calls. So therefore the recurrence is just going to be S(n) = S(n/2) + theta n squared. And what does that solve to?

STUDENT: [INAUDIBLE]

JULIAN SHUN: n squared. Right. So here the n squared term actually dominates--you have a decreasing geometric series, so it's dominated at the root, and you get theta of n squared. And therefore, by using the busy-leaves property and the theorem for the space bound, this tells us that on p processors, the space is going to be bounded by p times n squared. And this is actually pretty good, since we have a bound on this.
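Collecting the recurrences just discussed for this divide-and-conquer matrix multiply (a summary of the spoken derivation, written out in LaTeX):

    \begin{align*}
    T_1(n)      &= 8\,T_1(n/2) + \Theta(n^2)            &&= \Theta(n^3) \\
    T_\infty(n) &= T_\infty(n/2) + \Theta(\log n)        &&= \Theta(\log^2 n) \\
    S_1(n)      &= S_1(n/2) + \Theta(n^2)                &&= \Theta(n^2),
    \quad\text{so}\quad S_p = O(p\,n^2).
    \end{align*}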
It turns out that we can actually prove a stronger bound for this particular example, and I'll walk you through how we can prove this stronger bound. The order p times n squared bound is already pretty good, but we can actually do better if we look at how this algorithm is structured internally.

On each level of recursion, we're branching eight ways. And most of the space is going to be used near the top of this recursion tree. So if I branch as much as possible near the top of my recursion tree, that's going to give me my worst-case space bound, because the space is decreasing geometrically as I go down the tree. So I'm going to branch eight ways until I get to some level k in the recursion tree where I have p nodes, and at that point I'm not going to branch anymore, because I've already used up all p nodes--that's the number of workers I have. So let's say I have this level k here, where I have p nodes.

So what would be the value of k here? If I branch eight ways, how many levels do I have to go until I get to p nodes?

Yes.

STUDENT: It's log base 8 of p.

JULIAN SHUN: Yes, it's log base 8 of p. We have 8 to the k equal to p, because we're branching eight ways at each of k levels. And then, using some algebra, you can get that k is equal to log base 8 of p, which is equal to log base 2 of p divided by 3.

And then from this level k downwards, the space is going to decrease geometrically, so the space is going to be dominated at this level k. The space decreases geometrically as you go down from level k, and also as you go up from level k. So we can just look at what the space is at this level k. The space is going to be p times the size of each one of these nodes squared. And the size of each one of these nodes is going to be n over 2 to the log base 2 of p over 3, and we square that because we're using n squared temporary space. So if you solve that, it gives you p to the one-third times n squared, which is better than the upper bound we saw earlier of order p times n squared.

So you can work out the details for this example--not all of the details are shown on this slide. You need to show that this level k actually dominates all the other levels in the recursion tree. But in general, if you know the structure of the algorithm, you can potentially prove a stronger space bound than just applying the general theorem we showed on the previous slide.
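Written out, the calculation being described is (constants suppressed):

    \[
    8^k = p \;\Rightarrow\; k = \log_8 p = \tfrac{1}{3}\log_2 p,
    \qquad
    S_p \approx p \cdot \Theta\!\left(\left(\frac{n}{2^k}\right)^{2}\right)
        = p \cdot \Theta\!\left(\frac{n^2}{p^{2/3}}\right)
        = \Theta\!\left(p^{1/3}\, n^2\right).
    \]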
So any questions on this?

OK, so as I said before, the problem with heap-based linkage is that parallel functions fail to interoperate with legacy and third-party serial binaries. Yes, was there a question?

STUDENT: I actually do have a question.

JULIAN SHUN: Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yes.

STUDENT: How do we know that the workers don't split along a path of the [INAUDIBLE] instead of across, or horizontally?

JULIAN SHUN: Yes. So you don't actually know that, but this turns out to be the worst case. If it branches any other way, the space is just going to be lower. So you have to argue that this is going to be the worst case. Intuitively it's the worst case because you're using most of the memory near the root of the recursion tree, so if you can get all p nodes as close as possible to the root, that's going to make your space as high as possible. It's a good question.

So parallel functions fail to interoperate with legacy and third-party serial binaries. Even if you can recompile all of this code, which isn't always necessarily possible, you can still have issues if the legacy code takes advantage of the traditional linear stack inside the source code. So our implementation of Cilk uses a less space-efficient strategy that is interoperable with legacy code, and it uses a pool of linear stacks instead of a heap-based strategy. We're going to maintain a pool of linear stacks lying around--there are going to be more than p stacks in the pool. Whenever a worker tries to steal something, it's going to try to acquire one of these stacks from the pool, and when it's done, it will return the stack to the pool. But when it finds that there are no more linear stacks in the pool, then it's not going to steal anymore. This still preserves the space bound, as long as the number of stacks is a constant times the number of processors, but it will affect the time bounds of the work-stealing algorithm, because now, when a worker is idle, it might not necessarily have the chance to steal if there are no more stacks lying around. This strategy doesn't require any changes to the operating system. There is a way to preserve both the space and the time bounds using thread-local memory mapping, but that does require changes to the operating system.
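A toy sketch (mine, not the actual Cilk or Intel runtime) of the pool-of-stacks idea described above: a worker only steals if it can grab a free stack from a fixed pool, and returns the stack when the stolen work completes.

    #include <stdatomic.h>

    #define POOL_SIZE 64   /* assumed: a small constant times the number of workers */

    typedef struct {
        char *base;            /* base address of a linear stack (allocation elided here) */
        atomic_bool in_use;    /* false = free, true = taken */
    } stack_slot;

    static stack_slot pool[POOL_SIZE];

    /* Try to grab a free stack; returns its index, or -1 if none is free,
       in which case the worker simply does not steal right now. */
    int acquire_stack(void) {
        for (int i = 0; i < POOL_SIZE; i++) {
            if (!atomic_exchange(&pool[i].in_use, true))
                return i;
        }
        return -1;
    }

    /* Return a stack to the pool when the stolen work is finished. */
    void release_stack(int i) {
        atomic_store(&pool[i].in_use, false);
    }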
780 00:41:07,470 --> 00:41:12,090 So our implementation of cilk uses a pool of linear stacks, 781 00:41:12,090 --> 00:41:14,845 and it's based on the Intel implementation. 782 00:41:17,510 --> 00:41:18,010 OK. 783 00:41:21,520 --> 00:41:24,590 All right, so we talked about stacks, 784 00:41:24,590 --> 00:41:27,170 and that we just reduce the problem to heap allocation. 785 00:41:27,170 --> 00:41:29,540 So now we have to talk about heaps. 786 00:41:29,540 --> 00:41:31,820 So let's review some basic properties 787 00:41:31,820 --> 00:41:36,250 of heap-storage allocators. 788 00:41:36,250 --> 00:41:37,330 So here's a definition. 789 00:41:37,330 --> 00:41:39,460 The allocator speed is the number 790 00:41:39,460 --> 00:41:42,400 of allocations and d allocations per second 791 00:41:42,400 --> 00:41:43,945 that the allocator can sustain. 792 00:41:47,813 --> 00:41:48,730 And here's a question. 793 00:41:48,730 --> 00:41:51,400 Is it more important to maximize the allocator speed 794 00:41:51,400 --> 00:41:53,440 for large blocks or small blocks? 795 00:42:01,360 --> 00:42:02,120 Yeah? 796 00:42:02,120 --> 00:42:03,530 STUDENT: Small blocks? 797 00:42:03,530 --> 00:42:06,020 JULIAN SHUN: So small blocks. 798 00:42:06,020 --> 00:42:07,440 Here's another question. 799 00:42:07,440 --> 00:42:07,940 Why? 800 00:42:11,430 --> 00:42:12,526 Yes? 801 00:42:12,526 --> 00:42:16,650 STUDENT: So you're going to be doing a lot of [INAUDIBLE].. 802 00:42:16,650 --> 00:42:18,300 JULIAN SHUN: Yes, so one answer is 803 00:42:18,300 --> 00:42:22,110 that you're going to be doing a lot more allocations 804 00:42:22,110 --> 00:42:26,730 and deallocations of small blocks than large blocks. 805 00:42:26,730 --> 00:42:28,500 There's actually a more fundamental reason 806 00:42:28,500 --> 00:42:32,760 why it's more important to optimize for small blocks. 807 00:42:32,760 --> 00:42:33,730 So anybody? 808 00:42:33,730 --> 00:42:35,092 Yeah? 809 00:42:35,092 --> 00:42:40,970 STUDENT: [INAUDIBLE] basically not being 810 00:42:40,970 --> 00:42:43,318 able to make use of pages. 811 00:42:43,318 --> 00:42:45,110 JULIAN SHUN: Yeah, so that's another reason 812 00:42:45,110 --> 00:42:46,490 for small blocks. 813 00:42:46,490 --> 00:42:49,400 It's more likely that it will lead to fragmentation 814 00:42:49,400 --> 00:42:52,580 if you don't optimize for small blocks. 815 00:42:52,580 --> 00:42:53,540 What's another reason? 816 00:42:53,540 --> 00:42:54,628 Yes. 817 00:42:54,628 --> 00:42:56,170 STUDENT: Wouldn't it just take longer 818 00:42:56,170 --> 00:42:57,620 to allocate larger blocks anyway? 819 00:42:57,620 --> 00:43:02,480 So the overhead is going to be more noticeable if you have 820 00:43:02,480 --> 00:43:04,790 a big overhead when you allocate small blocks 821 00:43:04,790 --> 00:43:05,640 versus large blocks? 822 00:43:05,640 --> 00:43:06,390 JULIAN SHUN: Yeah. 823 00:43:06,390 --> 00:43:12,320 So the reason-- the main reason is that when you're allocating 824 00:43:12,320 --> 00:43:12,980 a large-- 825 00:43:12,980 --> 00:43:15,500 when you're allocating a block, a user program 826 00:43:15,500 --> 00:43:18,805 is typically going to write to all the bytes in the block. 827 00:43:18,805 --> 00:43:20,180 And therefore, for a large block, 828 00:43:20,180 --> 00:43:23,060 it takes so much time to write that the allocator 829 00:43:23,060 --> 00:43:26,600 time has little effect on the overall running time. 
830 00:43:26,600 --> 00:43:30,310 Whereas if a program allocates many small blocks, 831 00:43:30,310 --> 00:43:31,970 the amount of useful work 832 00:43:31,970 --> 00:43:36,590 it's actually doing on the block 833 00:43:36,590 --> 00:43:40,400 can be comparable to the overhead for the allocation. 834 00:43:40,400 --> 00:43:42,680 And therefore, all of the allocation overhead 835 00:43:42,680 --> 00:43:47,630 can add up to a significant amount for small blocks. 836 00:43:47,630 --> 00:43:49,130 So essentially for large blocks, you 837 00:43:49,130 --> 00:43:52,557 can amortize away the overheads for storage allocation, 838 00:43:52,557 --> 00:43:54,890 whereas for small blocks, it's harder to do that. 839 00:43:54,890 --> 00:43:57,890 Therefore, it's important to optimize for small blocks. 840 00:44:01,540 --> 00:44:02,930 Here's another definition. 841 00:44:02,930 --> 00:44:05,980 So the user footprint is the maximum 842 00:44:05,980 --> 00:44:08,770 over time of the number u of bytes 843 00:44:08,770 --> 00:44:11,980 in use by the user program. 844 00:44:11,980 --> 00:44:14,710 And these are the bytes that are allocated and not freed. 845 00:44:14,710 --> 00:44:16,930 And this is measuring the peak memory usage. 846 00:44:16,930 --> 00:44:20,350 It's not necessarily equal to the sum of the sizes 847 00:44:20,350 --> 00:44:22,750 that you have allocated so far, because you 848 00:44:22,750 --> 00:44:25,150 might have reused some of that. 849 00:44:25,150 --> 00:44:28,480 So the user footprint is the peak memory usage in number 850 00:44:28,480 --> 00:44:29,770 of bytes. 851 00:44:29,770 --> 00:44:31,540 And the allocator footprint is the maximum 852 00:44:31,540 --> 00:44:33,610 over time of the number a of bytes 853 00:44:33,610 --> 00:44:35,680 of memory provided to the allocator 854 00:44:35,680 --> 00:44:37,850 by the operating system. 855 00:44:37,850 --> 00:44:40,738 And the reason why the allocator footprint could be larger 856 00:44:40,738 --> 00:44:42,280 than the user footprint is that when 857 00:44:42,280 --> 00:44:44,680 you ask the OS for some memory, it could give you 858 00:44:44,680 --> 00:44:46,000 more than what you asked for. 859 00:44:48,670 --> 00:44:51,580 And similarly, if you ask malloc for some amount of memory, 860 00:44:51,580 --> 00:44:53,790 it can also give you more than what you asked for. 861 00:44:53,790 --> 00:44:59,200 And the fragmentation is defined to be a divided by u. 862 00:44:59,200 --> 00:45:01,720 And a program with low fragmentation 863 00:45:01,720 --> 00:45:04,090 will keep this ratio as low as possible, 864 00:45:04,090 --> 00:45:06,910 so keep the allocator footprint as close as 865 00:45:06,910 --> 00:45:08,780 possible to the user footprint. 866 00:45:08,780 --> 00:45:11,035 And in the best case, this ratio is going to be one. 867 00:45:11,035 --> 00:45:12,490 So you're using all of the memory 868 00:45:12,490 --> 00:45:14,200 that the operating system allocated. 869 00:45:18,050 --> 00:45:20,330 One remark is that the allocator footprint 870 00:45:20,330 --> 00:45:25,590 a usually grows monotonically for many allocators. 871 00:45:25,590 --> 00:45:28,190 So it turns out that many allocators 872 00:45:28,190 --> 00:45:30,950 call mmap to get more memory. 873 00:45:30,950 --> 00:45:34,130 But they don't always free this memory back to the OS.
874 00:45:34,130 --> 00:45:37,640 And you can actually free memory using something called 875 00:45:37,640 --> 00:45:40,280 munmap, which is the opposite of mmap, 876 00:45:40,280 --> 00:45:42,320 to give memory back to the OS. 877 00:45:42,320 --> 00:45:45,380 But this turns out to be pretty expensive. 878 00:45:45,380 --> 00:45:49,010 In modern operating systems, the implementation 879 00:45:49,010 --> 00:45:50,250 is not very efficient. 880 00:45:50,250 --> 00:45:54,020 So many allocators don't use munmap. 881 00:45:54,020 --> 00:45:56,210 You can also use something called madvise. 882 00:45:56,210 --> 00:46:00,440 And what madvise does is it tells the operating system 883 00:46:00,440 --> 00:46:03,110 that you're not going to be using this page anymore 884 00:46:03,110 --> 00:46:05,940 but to keep it around in virtual memory. 885 00:46:05,940 --> 00:46:07,580 So this has less overhead, because it 886 00:46:07,580 --> 00:46:10,790 doesn't have to clear this entry from the page table. 887 00:46:10,790 --> 00:46:13,280 It just has to mark that the program isn't 888 00:46:13,280 --> 00:46:14,900 using this page anymore. 889 00:46:14,900 --> 00:46:18,290 So some allocators use madvise with the 890 00:46:18,290 --> 00:46:22,460 MADV_DONTNEED option to free memory. 891 00:46:22,460 --> 00:46:26,900 But a is usually still growing monotonically over time, 892 00:46:26,900 --> 00:46:28,850 because allocators don't necessarily 893 00:46:28,850 --> 00:46:32,139 free all of the things back to the OS that they allocated. 894 00:46:37,130 --> 00:46:40,520 Here's a theorem that we proved in last week's lecture, which 895 00:46:40,520 --> 00:46:44,060 says that the fragmentation for binned free lists 896 00:46:44,060 --> 00:46:49,340 is order log base 2 of u, or just order log u. 897 00:46:49,340 --> 00:46:52,380 And the reason for this is that you 898 00:46:52,380 --> 00:46:55,040 can have log base 2 of u bins. 899 00:46:55,040 --> 00:46:59,120 And each bin can basically 900 00:46:59,120 --> 00:47:02,420 contain u bytes of storage. 901 00:47:02,420 --> 00:47:05,260 So overall, 902 00:47:05,260 --> 00:47:06,980 you could have allocated 903 00:47:06,980 --> 00:47:11,510 u times log u storage, and only be using u of those bytes. 904 00:47:11,510 --> 00:47:14,880 So therefore the fragmentation is order log u. 905 00:47:19,440 --> 00:47:24,480 Another thing to note is that modern 64-bit processors only 906 00:47:24,480 --> 00:47:28,960 provide about 2 to the 48 bytes of virtual address space. 907 00:47:28,960 --> 00:47:32,070 So this may be surprising, because you would probably 908 00:47:32,070 --> 00:47:34,890 expect that, for a 64-bit processor, 909 00:47:34,890 --> 00:47:39,160 you have 2 to the 64 bytes of virtual address space. 910 00:47:39,160 --> 00:47:41,850 But that turns out not to be the case. 911 00:47:41,850 --> 00:47:43,860 So they only support 2 to the 48 bytes. 912 00:47:43,860 --> 00:47:46,470 And that turns out to be enough for all of the programs 913 00:47:46,470 --> 00:47:48,780 that you would want to write. 914 00:47:48,780 --> 00:47:52,860 And that's also going to be much more than the physical memory 915 00:47:52,860 --> 00:47:54,150 you would have on a machine.
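Here is a minimal sketch of the two ways of giving memory back that were just described, using a 1 MiB anonymous mapping as a stand-in for memory an allocator obtained with mmap. A real allocator would of course not advise and then unmap the same region back to back; the two calls are shown together only to contrast them, and error handling is mostly omitted.

#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1 << 20;              /* 1 MiB region */
    void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... the allocator hands out objects from this region ... */

    /* Option 1: keep the virtual mapping but tell the OS it may reclaim
     * the physical pages; cheaper, which is why some allocators prefer it. */
    madvise(region, len, MADV_DONTNEED);

    /* Option 2: remove the mapping entirely; correct but typically more
     * expensive, so many allocators avoid it. */
    munmap(region, len);
    return 0;
}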
916 00:47:54,150 --> 00:47:56,580 So nowadays, you can get a big server 917 00:47:56,580 --> 00:47:59,910 with a terabyte of memory, or 2 to the 40 bytes 918 00:47:59,910 --> 00:48:01,590 of physical memory, which is still 919 00:48:01,590 --> 00:48:05,440 much lower than the number of bytes in the virtual address 920 00:48:05,440 --> 00:48:05,940 space. 921 00:48:09,760 --> 00:48:11,004 Any questions? 922 00:48:18,920 --> 00:48:21,620 OK, so here's some more definitions. 923 00:48:21,620 --> 00:48:24,530 So the space overhead of an allocator 924 00:48:24,530 --> 00:48:27,470 is the space used for bookkeeping. 925 00:48:27,470 --> 00:48:29,750 So, for example, you could store 926 00:48:29,750 --> 00:48:31,940 headers with the blocks 927 00:48:31,940 --> 00:48:33,770 that you allocate to keep track of the size 928 00:48:33,770 --> 00:48:35,630 and other information. 929 00:48:35,630 --> 00:48:40,870 And that would contribute to the space overhead. 930 00:48:40,870 --> 00:48:42,880 Internal fragmentation is the waste 931 00:48:42,880 --> 00:48:47,720 due to allocating larger blocks than the user requested. 932 00:48:47,720 --> 00:48:49,450 So you can get internal fragmentation 933 00:48:49,450 --> 00:48:51,750 if, when you call malloc, you get back 934 00:48:51,750 --> 00:48:55,180 a block that's actually larger than what the user requested. 935 00:48:55,180 --> 00:48:56,950 We saw that in the binned free list algorithm, 936 00:48:56,950 --> 00:48:58,930 we're rounding up to the nearest power of 2. 937 00:48:58,930 --> 00:49:01,360 If you allocate nine bytes, you'll 938 00:49:01,360 --> 00:49:05,110 actually get back 16 bytes in our binned free list algorithm 939 00:49:05,110 --> 00:49:06,000 from last lecture. 940 00:49:06,000 --> 00:49:10,690 So that contributes to internal fragmentation. 941 00:49:10,690 --> 00:49:12,955 It turns out that not all binned free list 942 00:49:12,955 --> 00:49:14,840 implementations use powers of 2. 943 00:49:14,840 --> 00:49:18,220 So some of them use bases smaller than 2 944 00:49:18,220 --> 00:49:23,525 in order to reduce the internal fragmentation. 945 00:49:23,525 --> 00:49:25,150 Then there's external fragmentation, 946 00:49:25,150 --> 00:49:28,150 which is the waste due to the inability to use storage 947 00:49:28,150 --> 00:49:30,950 because it's not contiguous. 948 00:49:30,950 --> 00:49:35,200 So for example, if I allocated a whole bunch of one-byte things 949 00:49:35,200 --> 00:49:38,710 consecutively in memory, then I freed every other byte. 950 00:49:38,710 --> 00:49:41,800 And now I want to allocate a 2-byte thing, 951 00:49:41,800 --> 00:49:45,460 I don't actually have contiguous memory to satisfy that 952 00:49:45,460 --> 00:49:48,500 request, because all of my free memory-- 953 00:49:48,500 --> 00:49:50,860 all of my free bytes are in one-byte chunks, 954 00:49:50,860 --> 00:49:52,610 and they're not next to each other. 955 00:49:52,610 --> 00:49:56,320 So this is one example of how external fragmentation can 956 00:49:56,320 --> 00:50:01,210 happen after you allocate stuff and free stuff. 957 00:50:01,210 --> 00:50:03,480 Then there's blow up. 958 00:50:03,480 --> 00:50:06,120 And this is, for a parallel allocator, 959 00:50:06,120 --> 00:50:11,470 the additional space beyond what a serial allocator would require.
960 00:50:11,470 --> 00:50:16,120 So if a serial locator requires s space, 961 00:50:16,120 --> 00:50:19,690 and a parallel allocator requires t space, 962 00:50:19,690 --> 00:50:21,220 then it's just going to be t over s. 963 00:50:21,220 --> 00:50:22,012 That's the blow up. 964 00:50:26,200 --> 00:50:29,110 OK, so now let's look at some parallel heap allocation 965 00:50:29,110 --> 00:50:29,920 strategies. 966 00:50:32,860 --> 00:50:36,390 So the first strategy is to use a global heap. 967 00:50:36,390 --> 00:50:40,820 And this is how the default c allocator works. 968 00:50:40,820 --> 00:50:43,380 So if you just use a default c allocator out of the box, 969 00:50:43,380 --> 00:50:46,380 this is how it's implemented. 970 00:50:46,380 --> 00:50:50,070 It uses a global heap where all the accesses 971 00:50:50,070 --> 00:50:53,940 to this global heap are protected by mutex. 972 00:50:53,940 --> 00:50:56,760 You can also use lock-free synchronization primitives 973 00:50:56,760 --> 00:50:57,900 to implement this. 974 00:50:57,900 --> 00:51:00,600 We'll actually talk about some of these synchronization 975 00:51:00,600 --> 00:51:02,920 primitives later on in the semester. 976 00:51:02,920 --> 00:51:04,770 And this is done to preserve atomicity 977 00:51:04,770 --> 00:51:06,660 because you can have multiple threads trying 978 00:51:06,660 --> 00:51:08,670 to access the global heap at the same time. 979 00:51:08,670 --> 00:51:13,260 And you need to ensure that races are handled correctly. 980 00:51:16,450 --> 00:51:20,715 So what's the blow up for this strategy? 981 00:51:23,500 --> 00:51:30,250 How much more space am I using than just a serial allocator? 982 00:51:30,250 --> 00:51:31,372 Yeah. 983 00:51:31,372 --> 00:51:32,982 STUDENT: [INAUDIBLE] 984 00:51:32,982 --> 00:51:34,690 JULIAN SHUN: Yeah, so the blow up is one. 985 00:51:34,690 --> 00:51:37,627 Because I'm not actually using any more space 986 00:51:37,627 --> 00:51:38,710 than the serial allocator. 987 00:51:38,710 --> 00:51:41,530 Since I'm just maintaining one global heap, and everybody 988 00:51:41,530 --> 00:51:44,710 is going to that heap to do allocations and deallocations. 989 00:51:47,600 --> 00:51:49,900 But what's the potential issue with this approach? 990 00:51:56,870 --> 00:51:58,520 Yeah? 991 00:51:58,520 --> 00:52:01,480 STUDENT: Performance hit for that block coordination. 992 00:52:01,480 --> 00:52:03,110 JULIAN SHUN: Yeah, so you're going 993 00:52:03,110 --> 00:52:08,450 to take a performance hit for trying to acquire this lock. 994 00:52:08,450 --> 00:52:12,290 So basically every time you do a allocation or deallocation, 995 00:52:12,290 --> 00:52:13,970 you have to acquire this lock. 996 00:52:13,970 --> 00:52:16,370 And this is pretty slow, and it gets 997 00:52:16,370 --> 00:52:20,360 slower as you increase the number of processors. 998 00:52:20,360 --> 00:52:23,450 Roughly speaking, acquiring a lock to perform 999 00:52:23,450 --> 00:52:26,750 is similar to an L2 cache access. 1000 00:52:26,750 --> 00:52:29,840 And if you just run a serial allocator, 1001 00:52:29,840 --> 00:52:32,270 many of your requests are going to be satisfied just 1002 00:52:32,270 --> 00:52:33,920 by going into the L1 cache. 1003 00:52:33,920 --> 00:52:36,290 Because you're going to be allocating 1004 00:52:36,290 --> 00:52:38,300 things that you recently freed, and those things 1005 00:52:38,300 --> 00:52:40,580 are going to be residing in L1 cache. 
1006 00:52:40,580 --> 00:52:42,350 But here, before you even get started, 1007 00:52:42,350 --> 00:52:44,300 you have to grab a lock. 1008 00:52:44,300 --> 00:52:46,970 And you have to pay a performance hit 1009 00:52:46,970 --> 00:52:48,870 similar to an L2 cache access. 1010 00:52:48,870 --> 00:52:50,600 So that's bad. 1011 00:52:50,600 --> 00:52:52,420 And it gets worse as you increase 1012 00:52:52,420 --> 00:52:55,790 the number of processors. 1013 00:52:55,790 --> 00:52:57,890 So the contention increases as you 1014 00:52:57,890 --> 00:53:00,080 increase the number of threads. 1015 00:53:00,080 --> 00:53:01,280 And then you can't-- 1016 00:53:01,280 --> 00:53:03,590 you're not going to be able to get good scalability. 1017 00:53:06,450 --> 00:53:10,950 So ideally, as the number of threads or processors grows, 1018 00:53:10,950 --> 00:53:13,320 the time to perform an allocation or deallocation 1019 00:53:13,320 --> 00:53:15,730 shouldn't increase. 1020 00:53:15,730 --> 00:53:17,130 But in fact, it does. 1021 00:53:17,130 --> 00:53:19,590 And the most common reason for loss of scalability 1022 00:53:19,590 --> 00:53:23,040 is lock contention. 1023 00:53:23,040 --> 00:53:25,020 So here all of the processes are trying 1024 00:53:25,020 --> 00:53:29,490 to acquire the same lock, which is the same memory address. 1025 00:53:29,490 --> 00:53:33,518 And if you recall from the caching lecture, 1026 00:53:33,518 --> 00:53:35,060 or the multicore programming lecture, 1027 00:53:35,060 --> 00:53:37,570 every time you acquire a memory location, 1028 00:53:37,570 --> 00:53:40,560 you have to bring that cache line into your own cache, 1029 00:53:40,560 --> 00:53:42,940 and then invalidate the same cache line 1030 00:53:42,940 --> 00:53:44,593 in other processors' caches. 1031 00:53:44,593 --> 00:53:46,260 So if all the processors are doing this, 1032 00:53:46,260 --> 00:53:49,080 then this cache line is going to be bouncing around 1033 00:53:49,080 --> 00:53:50,670 among all of the processors' caches, 1034 00:53:50,670 --> 00:53:54,475 and this could lead to very bad performance. 1035 00:53:54,475 --> 00:53:55,350 So here's a question. 1036 00:53:55,350 --> 00:53:57,870 Is lock contention more of a problem for large blocks 1037 00:53:57,870 --> 00:53:58,869 or small blocks? 1038 00:54:06,700 --> 00:54:08,390 Yes. 1039 00:54:08,390 --> 00:54:11,780 STUDENT: So small blocks. 1040 00:54:11,780 --> 00:54:13,790 JULIAN SHUN: Here's another question. 1041 00:54:13,790 --> 00:54:16,070 Why? 1042 00:54:16,070 --> 00:54:16,598 Yes. 1043 00:54:16,598 --> 00:54:18,140 STUDENT: Because by the time it takes 1044 00:54:18,140 --> 00:54:21,350 to finish using the small block, then 1045 00:54:21,350 --> 00:54:23,330 the allocator is usually small. 1046 00:54:23,330 --> 00:54:25,460 So you do many allocations and deallocations, 1047 00:54:25,460 --> 00:54:27,627 which means you have to go through the lock multiple 1048 00:54:27,627 --> 00:54:28,270 times. 1049 00:54:28,270 --> 00:54:29,020 JULIAN SHUN: Yeah. 1050 00:54:29,020 --> 00:54:33,730 So one of the reasons is that when 1051 00:54:33,730 --> 00:54:35,950 you're doing small allocations, that 1052 00:54:35,950 --> 00:54:38,740 means that your request rate is going to be pretty high. 1053 00:54:38,740 --> 00:54:41,950 And your processors are going to be spending a lot of time 1054 00:54:41,950 --> 00:54:43,830 acquiring this lock. 1055 00:54:43,830 --> 00:54:49,210 And this can exacerbate the lock contention. 
1056 00:54:49,210 --> 00:54:52,750 And another reason is that when you allocate a large block, 1057 00:54:52,750 --> 00:54:55,540 you're doing a lot of work, because you have to write-- 1058 00:54:55,540 --> 00:54:57,610 most of the time you're going to write to all 1059 00:54:57,610 --> 00:55:00,010 the bytes in that large block. 1060 00:55:00,010 --> 00:55:02,290 And therefore you can amortize the overheads 1061 00:55:02,290 --> 00:55:06,370 of the storage allocator across all of the work 1062 00:55:06,370 --> 00:55:07,120 that you're doing. 1063 00:55:07,120 --> 00:55:08,950 Whereas for small blocks, in addition to 1064 00:55:08,950 --> 00:55:14,260 increasing this rate of memory requests, it's also-- 1065 00:55:14,260 --> 00:55:16,945 there's much less work to amortized to overheads across. 1066 00:55:20,010 --> 00:55:21,000 So any questions? 1067 00:55:26,960 --> 00:55:29,300 OK, good. 1068 00:55:29,300 --> 00:55:29,800 All right. 1069 00:55:29,800 --> 00:55:33,460 So here's another strategy, which is to use local heaps. 1070 00:55:33,460 --> 00:55:37,600 So each thread is going to maintain its own heap. 1071 00:55:37,600 --> 00:55:41,800 And it's going to allocate out of its own heap. 1072 00:55:41,800 --> 00:55:43,507 And there's no locking that's necessary. 1073 00:55:43,507 --> 00:55:46,090 So when you allocate something, you get it from your own heap. 1074 00:55:46,090 --> 00:55:48,770 And when you free something, you put it back into your own heap. 1075 00:55:48,770 --> 00:55:51,350 So there's no synchronization required. 1076 00:55:51,350 --> 00:55:52,880 So that's a good thing. 1077 00:55:52,880 --> 00:55:54,580 It's very fast. 1078 00:55:54,580 --> 00:55:56,695 What's a potential issue with this approach? 1079 00:56:04,900 --> 00:56:05,440 Yes. 1080 00:56:05,440 --> 00:56:07,510 STUDENT: It's using a lot of extra space. 1081 00:56:07,510 --> 00:56:09,770 JULIAN SHUN: Yes, so this approach, 1082 00:56:09,770 --> 00:56:13,380 you're going to be using a lot of extra space. 1083 00:56:13,380 --> 00:56:14,890 So first of all, because you have 1084 00:56:14,890 --> 00:56:16,630 to maintain multiple heaps. 1085 00:56:16,630 --> 00:56:18,610 And what's one phenomenon that you 1086 00:56:18,610 --> 00:56:21,640 might see if you're executing a program 1087 00:56:21,640 --> 00:56:25,250 with this local-heap approach? 1088 00:56:25,250 --> 00:56:26,860 So it's a space-- 1089 00:56:26,860 --> 00:56:30,276 could the space potentially keep growing over time? 1090 00:56:36,970 --> 00:56:37,645 Yes. 1091 00:56:37,645 --> 00:56:39,720 STUDENT: You could maybe like allocate 1092 00:56:39,720 --> 00:56:42,720 every one process [INAUDIBLE]. 1093 00:56:42,720 --> 00:56:43,470 JULIAN SHUN: Yeah. 1094 00:56:43,470 --> 00:56:46,520 Yeah, so you could actually have an unbounded blow up. 1095 00:56:46,520 --> 00:56:49,820 Because if you do all of the allocations in one heap, 1096 00:56:49,820 --> 00:56:53,160 and you free everything in another heap, 1097 00:56:53,160 --> 00:56:55,160 then whenever the first heap does an allocation, 1098 00:56:55,160 --> 00:56:57,620 there's actually free space sitting around in another heap. 1099 00:56:57,620 --> 00:56:59,810 But it's just going to grab more memory from the operating 1100 00:56:59,810 --> 00:57:00,310 system. 1101 00:57:00,310 --> 00:57:02,540 So you're blow up can be unbounded. 1102 00:57:02,540 --> 00:57:05,840 And this phenomenon, it's what's called memory drift. 
1103 00:57:05,840 --> 00:57:08,120 So blocks allocated by one thread 1104 00:57:08,120 --> 00:57:10,620 are freed by another thread. 1105 00:57:10,620 --> 00:57:13,350 And if you run your program for long enough, 1106 00:57:13,350 --> 00:57:15,975 your memory consumption can keep increasing. 1107 00:57:15,975 --> 00:57:17,600 And this is sort of like a memory leak. 1108 00:57:17,600 --> 00:57:20,540 So you might see that if you have a memory drift problem, 1109 00:57:20,540 --> 00:57:22,850 your program running on multiple processors 1110 00:57:22,850 --> 00:57:24,590 could run out of memory eventually. 1111 00:57:24,590 --> 00:57:29,000 Whereas if you just run it on a single core, 1112 00:57:29,000 --> 00:57:31,030 it won't run out of memory. 1113 00:57:31,030 --> 00:57:33,320 And here it's because the allocator isn't smart enough 1114 00:57:33,320 --> 00:57:35,990 to reuse things in other heaps. 1115 00:57:38,600 --> 00:57:42,380 So what's another strategy you can use to try to fix this? 1116 00:57:45,210 --> 00:57:46,190 Yes? 1117 00:57:46,190 --> 00:57:49,868 STUDENT: [INAUDIBLE] 1118 00:57:49,868 --> 00:57:51,910 JULIAN SHUN: Sorry, can you repeat your question? 1119 00:57:51,910 --> 00:57:57,018 STUDENT: [INAUDIBLE] 1120 00:57:57,018 --> 00:57:58,810 JULIAN SHUN: Because if you keep allocating 1121 00:57:58,810 --> 00:58:02,230 from one thread, if you do all of your allocations 1122 00:58:02,230 --> 00:58:04,690 in one thread, and do all of your deallocations 1123 00:58:04,690 --> 00:58:06,430 on another thread, every time you 1124 00:58:06,430 --> 00:58:08,320 allocate from the first thread, there's 1125 00:58:08,320 --> 00:58:11,052 actually memory sitting around in the system. 1126 00:58:11,052 --> 00:58:13,510 But the first thread isn't going to see it, because it only 1127 00:58:13,510 --> 00:58:14,500 sees its own heap. 1128 00:58:14,500 --> 00:58:16,000 And it's just going to keep grabbing 1129 00:58:16,000 --> 00:58:17,920 more memory from the OS. 1130 00:58:17,920 --> 00:58:19,748 And then the second thread actually 1131 00:58:19,748 --> 00:58:21,290 has this extra memory sitting around. 1132 00:58:21,290 --> 00:58:22,207 But it's not using it. 1133 00:58:22,207 --> 00:58:23,710 Because it's only doing the freeze. 1134 00:58:23,710 --> 00:58:25,180 It's not doing allocate. 1135 00:58:25,180 --> 00:58:27,160 And if we recall the definition of blow up 1136 00:58:27,160 --> 00:58:29,560 is, how much more space you're using 1137 00:58:29,560 --> 00:58:31,810 compared to a serial execution of a program. 1138 00:58:31,810 --> 00:58:36,280 If you executed this program on a single core, 1139 00:58:36,280 --> 00:58:39,400 you would only have a single heap that does the allocations 1140 00:58:39,400 --> 00:58:40,570 and frees. 1141 00:58:40,570 --> 00:58:41,830 So you're not going to-- 1142 00:58:41,830 --> 00:58:43,540 your memory isn't going to blow up. 1143 00:58:43,540 --> 00:58:45,250 It's just going to be constant over time. 1144 00:58:45,250 --> 00:58:47,560 Whereas if you use two threads to execute this, 1145 00:58:47,560 --> 00:58:52,030 the memory could just keep growing over time. 1146 00:58:52,030 --> 00:58:54,314 Yes? 1147 00:58:54,314 --> 00:59:00,090 STUDENT: [INAUDIBLE] 1148 00:59:00,090 --> 00:59:02,680 JULIAN SHUN: So, it just-- 1149 00:59:02,680 --> 00:59:07,540 so if you remember the binned-free list approach, 1150 00:59:07,540 --> 00:59:09,340 let's say we're using that. 
1151 00:59:09,340 --> 00:59:12,370 Then all you have to do is set some pointers 1152 00:59:12,370 --> 00:59:14,292 in your binned-free lists data structure, 1153 00:59:14,292 --> 00:59:16,000 as well as the block that you're freeing, 1154 00:59:16,000 --> 00:59:18,760 so that it appears in one of the linked lists. 1155 00:59:18,760 --> 00:59:21,740 So you can do that even if some other processor allocated 1156 00:59:21,740 --> 00:59:22,240 that block. 1157 00:59:26,580 --> 00:59:29,550 OK, so what what's another strategy that can avoid 1158 00:59:29,550 --> 00:59:32,100 this issue of memory drift? 1159 00:59:32,100 --> 00:59:32,633 Yes? 1160 00:59:32,633 --> 00:59:34,800 STUDENT: Periodically shuffle the free memory that's 1161 00:59:34,800 --> 00:59:36,690 being used on different heaps. 1162 00:59:36,690 --> 00:59:37,440 JULIAN SHUN: Yeah. 1163 00:59:37,440 --> 00:59:38,357 So that's a good idea. 1164 00:59:38,357 --> 00:59:41,580 You could periodically rebalance the memory. 1165 00:59:41,580 --> 00:59:44,458 What's a simpler approach to solve this problem? 1166 00:59:48,760 --> 00:59:50,680 Yes? 1167 00:59:50,680 --> 00:59:53,390 STUDENT: Make it all know all of the free memory? 1168 00:59:53,390 --> 00:59:55,140 JULIAN SHUN: Sorry, could you repeat that? 1169 00:59:55,140 --> 01:00:01,312 STUDENT: Make them all know all of the free memory? 1170 01:00:01,312 --> 01:00:02,020 JULIAN SHUN: Yes. 1171 01:00:02,020 --> 01:00:04,070 So you could have all of the processors 1172 01:00:04,070 --> 01:00:05,930 know all the free memory. 1173 01:00:05,930 --> 01:00:08,060 And then every time it grabs something, 1174 01:00:08,060 --> 01:00:09,740 it looks in all the other heaps. 1175 01:00:09,740 --> 01:00:12,440 That does require a lot of synchronization overhead. 1176 01:00:12,440 --> 01:00:14,780 Might not perform that well. 1177 01:00:14,780 --> 01:00:18,790 What's an easier way to solve this problem? 1178 01:00:18,790 --> 01:00:20,639 Yes. 1179 01:00:20,639 --> 01:00:24,837 STUDENT: [INAUDIBLE] 1180 01:00:24,837 --> 01:00:26,920 JULIAN SHUN: So you could restructure your program 1181 01:00:26,920 --> 01:00:30,032 so that the same thread does the allocation 1182 01:00:30,032 --> 01:00:32,060 and frees for the same memory block. 1183 01:00:32,060 --> 01:00:35,500 But what if you didn't want to restructure your program? 1184 01:00:35,500 --> 01:00:38,890 How can you change the allocator? 1185 01:00:38,890 --> 01:00:41,360 So we want the behavior that you said, 1186 01:00:41,360 --> 01:00:43,430 but we don't want to change our program. 1187 01:00:43,430 --> 01:00:43,930 Yes. 1188 01:00:43,930 --> 01:00:45,972 STUDENT: You could have a single free list that's 1189 01:00:45,972 --> 01:00:47,280 protected by synchronization. 1190 01:00:47,280 --> 01:00:49,790 JULIAN SHUN: Yeah, so you could have a single free list. 1191 01:00:49,790 --> 01:00:51,950 But that gets back to the first strategy 1192 01:00:51,950 --> 01:00:53,220 of having a global heap. 1193 01:00:53,220 --> 01:00:58,030 And then you have high synchronization overheads. 1194 01:00:58,030 --> 01:00:59,352 Yes. 1195 01:00:59,352 --> 01:01:03,320 STUDENT: You could have the free map to the thread 1196 01:01:03,320 --> 01:01:11,752 that it came from or for the pointer that corresponds to-- 1197 01:01:11,752 --> 01:01:13,510 that allocated it. 1198 01:01:13,510 --> 01:01:15,360 JULIAN SHUN: So you're saying free back 1199 01:01:15,360 --> 01:01:19,580 to the thread that allocated it? 
1200 01:01:19,580 --> 01:01:22,830 Yes, so that that's exactly right. 1201 01:01:22,830 --> 01:01:25,080 So here each object, when you allocate it, 1202 01:01:25,080 --> 01:01:27,473 it's labeled with an owner. 1203 01:01:27,473 --> 01:01:28,890 And then whenever you free it, you 1204 01:01:28,890 --> 01:01:30,240 return it back to the owner. 1205 01:01:30,240 --> 01:01:33,660 So the objects that are allocated 1206 01:01:33,660 --> 01:01:35,880 will eventually go back to the owner's heap 1207 01:01:35,880 --> 01:01:37,050 if they're not in use. 1208 01:01:37,050 --> 01:01:39,420 And they're not going to be free lying around 1209 01:01:39,420 --> 01:01:42,810 in somebody else's heap. 1210 01:01:42,810 --> 01:01:44,340 The advantage of this approach is 1211 01:01:44,340 --> 01:01:47,940 that you get fast allocation and freeing of local objects. 1212 01:01:47,940 --> 01:01:52,530 Local objects are objects that you allocated. 1213 01:01:52,530 --> 01:01:56,400 However, free remote objects require some synchronization. 1214 01:01:56,400 --> 01:02:00,900 Because you have to coordinate with the other threads' heap 1215 01:02:00,900 --> 01:02:04,620 that you're sending the memory object back to. 1216 01:02:04,620 --> 01:02:09,090 But this synchronization isn't as bad as having a global heap, 1217 01:02:09,090 --> 01:02:13,860 since you only have to talk to one other thread in this case. 1218 01:02:13,860 --> 01:02:18,240 You can also bound the blow up by p. 1219 01:02:18,240 --> 01:02:22,470 So the reason why the blow up is upper bounded by p 1220 01:02:22,470 --> 01:02:25,850 is that, let's say the serial allocator uses 1221 01:02:25,850 --> 01:02:28,350 at most x memory. 1222 01:02:28,350 --> 01:02:32,170 In this case, each of the heaps can use at most x memory, 1223 01:02:32,170 --> 01:02:35,730 because that's how much the serial program would have used. 1224 01:02:35,730 --> 01:02:38,100 And you have p of these heaps, so overall you're 1225 01:02:38,100 --> 01:02:39,540 using p times x memory. 1226 01:02:39,540 --> 01:02:43,650 And therefore the ratio is upper bounded by p. 1227 01:02:43,650 --> 01:02:44,470 Yes? 1228 01:02:44,470 --> 01:02:51,830 STUDENT: [INAUDIBLE] 1229 01:02:51,830 --> 01:02:56,060 JULIAN SHUN: So when you free an object, it goes-- 1230 01:02:56,060 --> 01:02:59,120 if you allocated that object, it goes back to your own heap. 1231 01:02:59,120 --> 01:03:01,105 If your heap is empty, it's actually 1232 01:03:01,105 --> 01:03:03,230 going to get more memory from the operating system. 1233 01:03:03,230 --> 01:03:07,220 It's not going to take something from another thread's heap. 1234 01:03:07,220 --> 01:03:10,370 But the maximum amount of memory that you're going to allocate 1235 01:03:10,370 --> 01:03:12,260 is going to be upper bounded by x. 1236 01:03:12,260 --> 01:03:16,077 Because the sequential serial program took that much. 1237 01:03:16,077 --> 01:03:17,450 STUDENT: [INAUDIBLE] 1238 01:03:17,450 --> 01:03:19,880 JULIAN SHUN: Yeah. 1239 01:03:19,880 --> 01:03:23,960 So the upper bound for the blow up is p. 1240 01:03:23,960 --> 01:03:25,490 Another advantage of this approach 1241 01:03:25,490 --> 01:03:27,060 is that it's resilience-- 1242 01:03:27,060 --> 01:03:29,030 it has resilience to false sharing. 1243 01:03:31,730 --> 01:03:35,210 So let me just talk a little bit about false sharing. 1244 01:03:35,210 --> 01:03:37,640 So true sharing is when two processors 1245 01:03:37,640 --> 01:03:42,380 are trying to access the same memory location. 
1246 01:03:42,380 --> 01:03:45,050 And false sharing is when multiple processors are 1247 01:03:45,050 --> 01:03:46,760 accessing different memory locations, 1248 01:03:46,760 --> 01:03:51,000 but those locations happen to be on the same cache line. 1249 01:03:51,000 --> 01:03:51,900 So here's an example. 1250 01:03:51,900 --> 01:03:55,460 Let's say we have two variables, x and y. 1251 01:03:55,460 --> 01:03:59,180 And the compiler happens to place x and y on the same cache 1252 01:03:59,180 --> 01:04:00,990 line. 1253 01:04:00,990 --> 01:04:03,680 Now, when the first processor writes to x, 1254 01:04:03,680 --> 01:04:08,870 it's going to bring this cache line into its cache. 1255 01:04:08,870 --> 01:04:10,980 When the other processor writes to y, 1256 01:04:10,980 --> 01:04:12,840 since it's on the same cache line, 1257 01:04:12,840 --> 01:04:17,120 it's going to bring this cache line to y's cache. 1258 01:04:17,120 --> 01:04:19,143 And then now, the first processor writes x, 1259 01:04:19,143 --> 01:04:20,810 it's going to bring this cache line back 1260 01:04:20,810 --> 01:04:24,080 to the first processor's cache. 1261 01:04:24,080 --> 01:04:25,850 And then you can keep-- 1262 01:04:25,850 --> 01:04:28,770 you can see this phenomenon keep happening. 1263 01:04:28,770 --> 01:04:30,320 So here, even though the processors 1264 01:04:30,320 --> 01:04:33,380 are writing to different memory locations, 1265 01:04:33,380 --> 01:04:36,470 because they happen to be on the same cache line, 1266 01:04:36,470 --> 01:04:40,040 the cache line is going to be bouncing back and forth 1267 01:04:40,040 --> 01:04:44,270 on the machine between the different processors' caches. 1268 01:04:44,270 --> 01:04:47,600 And this problem gets worse if more processors 1269 01:04:47,600 --> 01:04:49,070 are accessing this cache line. 1270 01:04:53,040 --> 01:04:56,120 So in this-- this can be quite hard to debug. 1271 01:04:56,120 --> 01:05:00,260 Because if you're using just variables on the stack, 1272 01:05:00,260 --> 01:05:02,030 you don't actually know necessarily 1273 01:05:02,030 --> 01:05:06,120 where the compiler is going to place these memory locations. 1274 01:05:06,120 --> 01:05:07,520 So the compiler could just happen 1275 01:05:07,520 --> 01:05:11,420 to place x and y in the same cache block. 1276 01:05:11,420 --> 01:05:13,942 And then you'll get this performance hit, 1277 01:05:13,942 --> 01:05:16,400 even though it seems like you're accessing different memory 1278 01:05:16,400 --> 01:05:18,920 locations. 1279 01:05:18,920 --> 01:05:21,620 If you're using the heap for memory allocation, 1280 01:05:21,620 --> 01:05:22,940 you have more knowledge. 1281 01:05:22,940 --> 01:05:25,340 Because if you allocate a huge block, 1282 01:05:25,340 --> 01:05:27,170 you know that all of the memory locations 1283 01:05:27,170 --> 01:05:29,310 are contiguous in physical memory. 1284 01:05:29,310 --> 01:05:31,700 So you can just space your-- 1285 01:05:31,700 --> 01:05:35,078 you can space the accesses far enough apart so 1286 01:05:35,078 --> 01:05:36,620 that different processes aren't going 1287 01:05:36,620 --> 01:05:37,910 to touch the same cache line. 1288 01:05:44,140 --> 01:05:46,510 A more general approach is that you can actually 1289 01:05:46,510 --> 01:05:48,230 pad the object. 1290 01:05:48,230 --> 01:05:50,110 So first, you can align the object 1291 01:05:50,110 --> 01:05:51,850 on a cache line boundary. 
1292 01:05:51,850 --> 01:05:54,460 And then you pad out the remaining memory locations 1293 01:05:54,460 --> 01:05:58,450 of the objects so that it fills up the entire cache line. 1294 01:05:58,450 --> 01:06:03,220 And now there's only one thing on that cache line. 1295 01:06:03,220 --> 01:06:05,620 But this does lead to a waste of space 1296 01:06:05,620 --> 01:06:09,580 because you have this wasted padding here. 1297 01:06:09,580 --> 01:06:11,470 So a program can induce false sharing 1298 01:06:11,470 --> 01:06:13,570 by having different threads process 1299 01:06:13,570 --> 01:06:18,100 nearby objects, both on the stack and on the heap. 1300 01:06:18,100 --> 01:06:22,090 And then an allocator can also induce false sharing 1301 01:06:22,090 --> 01:06:22,850 in two ways. 1302 01:06:22,850 --> 01:06:25,330 So it can actively induce false sharing. 1303 01:06:25,330 --> 01:06:28,000 And this is when the allocator satisfies memory requests 1304 01:06:28,000 --> 01:06:32,110 from different threads using the same cache block. 1305 01:06:32,110 --> 01:06:33,880 And it can also do this passively. 1306 01:06:33,880 --> 01:06:36,910 And this is when the program passes objects lying 1307 01:06:36,910 --> 01:06:38,010 on the same cache line to different threads, 1308 01:06:38,010 --> 01:06:40,290 and then the allocator 1309 01:06:40,290 --> 01:06:43,330 reuses the objects' storage after they 1310 01:06:43,330 --> 01:06:47,620 are freed to satisfy requests from those different threads. 1311 01:06:47,620 --> 01:06:51,280 And the local ownership approach tends 1312 01:06:51,280 --> 01:06:54,850 to reduce false sharing because the thread that 1313 01:06:54,850 --> 01:06:57,130 allocates an object is eventually 1314 01:06:57,130 --> 01:06:58,030 going to get it back. 1315 01:06:58,030 --> 01:07:01,300 You're not going to have it so that an object is permanently 1316 01:07:01,300 --> 01:07:05,320 split among multiple processors' heaps. 1317 01:07:05,320 --> 01:07:09,040 So even if you see false sharing with local ownership, 1318 01:07:09,040 --> 01:07:10,990 it's usually temporary. 1319 01:07:10,990 --> 01:07:13,240 Eventually the object is 1320 01:07:13,240 --> 01:07:16,510 going to go back to the heap that it was allocated from, 1321 01:07:16,510 --> 01:07:19,600 and the false sharing is going to go away. 1322 01:07:19,600 --> 01:07:20,996 Yes? 1323 01:07:20,996 --> 01:07:26,852 STUDENT: Are the local heaps just three to five regions in 1324 01:07:26,852 --> 01:07:28,330 [INAUDIBLE]? 1325 01:07:28,330 --> 01:07:31,220 JULIAN SHUN: I mean, you can implement it in various ways. 1326 01:07:31,220 --> 01:07:34,360 You can have each one of them use a binned free list 1327 01:07:34,360 --> 01:07:36,730 allocator, so there's no restriction 1328 01:07:36,730 --> 01:07:39,860 on where they have to appear in physical memory. 1329 01:07:39,860 --> 01:07:41,770 There are many different ways to do it-- 1330 01:07:41,770 --> 01:07:44,890 you can basically plug in any serial allocator 1331 01:07:44,890 --> 01:07:46,288 for the local heap. 1332 01:07:50,280 --> 01:07:53,690 So let's go back to parallel heap allocation. 1333 01:07:53,690 --> 01:07:56,900 So I talked about three approaches already. 1334 01:07:56,900 --> 01:07:58,910 Here's a fourth approach. 1335 01:07:58,910 --> 01:08:02,600 This is called the Hoard allocator. 1336 01:08:02,600 --> 01:08:04,790 And this was actually a pretty good allocator 1337 01:08:04,790 --> 01:08:08,900 when it was introduced almost two decades ago.
1338 01:08:08,900 --> 01:08:11,690 And it's inspired a lot of further research 1339 01:08:11,690 --> 01:08:13,970 on how to improve parallel-memory allocation. 1340 01:08:13,970 --> 01:08:16,120 So let me talk about how this works. 1341 01:08:16,120 --> 01:08:21,020 So in the hoard allocator, we're going to have p local heaps. 1342 01:08:21,020 --> 01:08:25,029 But we're also going to have a global heap. 1343 01:08:25,029 --> 01:08:26,960 The memory is going to be organized 1344 01:08:26,960 --> 01:08:30,140 into large super blocks of size s. 1345 01:08:30,140 --> 01:08:34,520 And s is usually a multiple of the page size. 1346 01:08:34,520 --> 01:08:36,170 So this is the granularity at which 1347 01:08:36,170 --> 01:08:40,250 objects are going to be moved around in the allocator. 1348 01:08:40,250 --> 01:08:44,600 And then you can move super blocks between the local heaps 1349 01:08:44,600 --> 01:08:46,130 and the global heaps. 1350 01:08:46,130 --> 01:08:48,950 So when a local heap becomes-- 1351 01:08:48,950 --> 01:08:52,770 has a lot of super blocks that are not being fully used 1352 01:08:52,770 --> 01:08:54,740 and you can move it to the global heap, 1353 01:08:54,740 --> 01:08:57,260 and then when a local heap doesn't have enough memory, 1354 01:08:57,260 --> 01:08:59,722 it can go to the global heap to get more memory. 1355 01:08:59,722 --> 01:09:02,180 And then when the global heap doesn't have any more memory, 1356 01:09:02,180 --> 01:09:06,779 then it gets more memory from the operating system. 1357 01:09:06,779 --> 01:09:10,010 So this is sort of a combination of the approaches 1358 01:09:10,010 --> 01:09:12,140 that we saw before. 1359 01:09:12,140 --> 01:09:15,979 The advantages are that this is a pretty fast allocator. 1360 01:09:15,979 --> 01:09:16,910 It's also scalable. 1361 01:09:16,910 --> 01:09:20,450 As you add more processors, the performance improves. 1362 01:09:20,450 --> 01:09:23,930 You can also bound the blow up. 1363 01:09:23,930 --> 01:09:26,390 And it also has resilience to false sharing, 1364 01:09:26,390 --> 01:09:29,540 because it's using local heaps. 1365 01:09:29,540 --> 01:09:33,080 So let's look at how an allocation using the hoard 1366 01:09:33,080 --> 01:09:34,500 allocator works. 1367 01:09:34,500 --> 01:09:36,800 So let's just assume without loss of generality 1368 01:09:36,800 --> 01:09:38,760 that all the blocks are the same size. 1369 01:09:38,760 --> 01:09:42,350 So we have fixed-size allocation. 1370 01:09:42,350 --> 01:09:46,160 So let's say we call malloc in our program. 1371 01:09:46,160 --> 01:09:49,130 And let's say thread i calls the malloc. 1372 01:09:49,130 --> 01:09:50,660 So what we're going to do is we're 1373 01:09:50,660 --> 01:09:56,030 going to check if there is a free object in heap i 1374 01:09:56,030 --> 01:09:58,910 that can satisfy this request. 1375 01:09:58,910 --> 01:10:01,010 And if so, we're going to get an object 1376 01:10:01,010 --> 01:10:05,360 from the fullest non-full super block in i's heap. 1377 01:10:05,360 --> 01:10:09,350 Does anyone know why we want to get the object from the fullest 1378 01:10:09,350 --> 01:10:10,580 non-full super block? 1379 01:10:13,430 --> 01:10:14,724 Yes. 1380 01:10:14,724 --> 01:10:17,478 STUDENT: [INAUDIBLE] 1381 01:10:17,478 --> 01:10:18,270 JULIAN SHUN: Right. 1382 01:10:18,270 --> 01:10:20,440 So when a super block needs to be moved, 1383 01:10:20,440 --> 01:10:21,570 it's as dense as possible. 
1384 01:10:21,570 --> 01:10:25,500 And more importantly, this is to reduce external fragmentation. 1385 01:10:25,500 --> 01:10:28,620 Because as we saw in the last lecture, 1386 01:10:28,620 --> 01:10:32,430 if you skew the distribution of allocated memory objects 1387 01:10:32,430 --> 01:10:35,290 to as few pages, or in this case, 1388 01:10:35,290 --> 01:10:37,050 as few super blocks as possible, that 1389 01:10:37,050 --> 01:10:40,840 reduces your external fragmentation. 1390 01:10:40,840 --> 01:10:43,170 OK, so if it finds it in its own heap, 1391 01:10:43,170 --> 01:10:46,690 then it's going to allocate an object from there. 1392 01:10:46,690 --> 01:10:49,800 Otherwise, it's going to check the global heap. 1393 01:10:49,800 --> 01:10:53,320 And if there's something in the global heap-- 1394 01:10:53,320 --> 01:10:56,140 so here it says, if the global heap is empty, 1395 01:10:56,140 --> 01:10:59,130 then it's going to get a new super block from the OS. 1396 01:10:59,130 --> 01:11:03,240 Otherwise, we can get a super block from the global heap, 1397 01:11:03,240 --> 01:11:05,860 and then use that one. 1398 01:11:05,860 --> 01:11:09,000 And then finally we set the owner 1399 01:11:09,000 --> 01:11:12,000 of the block we got either from the OS or from the global heap 1400 01:11:12,000 --> 01:11:18,210 to i, and then we return that free object to the program. 1401 01:11:18,210 --> 01:11:22,920 So this is how a malloc works using the Hoard allocator. 1402 01:11:22,920 --> 01:11:26,770 And now let's look at Hoard deallocation. 1403 01:11:26,770 --> 01:11:31,590 Let u sub i be the in-use storage in heap i. 1404 01:11:31,590 --> 01:11:33,565 This is the heap for thread i. 1405 01:11:33,565 --> 01:11:39,162 And let a sub i be the storage owned by heap i. 1406 01:11:39,162 --> 01:11:41,370 The Hoard allocator maintains the following invariant 1407 01:11:41,370 --> 01:11:43,110 for all heaps i. 1408 01:11:43,110 --> 01:11:44,650 And the invariant is as follows. 1409 01:11:44,650 --> 01:11:47,160 So u sub i is always going to be greater 1410 01:11:47,160 --> 01:11:50,940 than or equal to the min of a sub i minus 2 times s 1411 01:11:50,940 --> 01:11:54,300 and a sub i over 2. 1412 01:11:54,300 --> 01:11:58,110 Recall that s is the super block size. 1413 01:11:58,110 --> 01:12:01,750 So how it implements this is as follows. 1414 01:12:01,750 --> 01:12:06,500 When we call free of x, let's say x is owned by thread i, 1415 01:12:06,500 --> 01:12:09,240 then we're going to put x back into heap i, 1416 01:12:09,240 --> 01:12:13,230 and then we're going to check if the in-use storage in heap i, 1417 01:12:13,230 --> 01:12:17,070 u sub i, is less than the min of a sub i minus 2 s 1418 01:12:17,070 --> 01:12:20,510 and a sub i over 2. 1419 01:12:20,510 --> 01:12:23,610 And what this condition says, if it's true, 1420 01:12:23,610 --> 01:12:30,570 it means that your heap is, at most, half utilized. 1421 01:12:30,570 --> 01:12:32,970 Because if it's smaller than this, 1422 01:12:32,970 --> 01:12:35,300 it has to be smaller than a sub i over 2. 1423 01:12:35,300 --> 01:12:37,050 That means there's twice as much allocated 1424 01:12:37,050 --> 01:12:39,430 as used in the local heap i. 1425 01:12:39,430 --> 01:12:41,760 And therefore there must be some super block 1426 01:12:41,760 --> 01:12:43,140 that's at least half empty. 1427 01:12:43,140 --> 01:12:47,010 And you move that super block, or one of those super blocks, 1428 01:12:47,010 --> 01:12:48,090 to the global heap.
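Here is a minimal C sketch of the bookkeeping just described. It models each heap only as a pair of byte counters (u for in-use storage, a for owned storage) rather than real superblock lists, so the helper names, the superblock size, and the counter-only model are simplifications for illustration, not the actual Hoard code. It shows where the invariant u_i >= min(a_i - 2S, a_i / 2) is checked on a free, and the local-heap, then global-heap, then OS order tried on a malloc.

#include <stddef.h>

#define S ((size_t)(64 * 1024))   /* superblock size: assumed multiple of the page size */

typedef struct {
    size_t u;                     /* bytes in use by the program from this heap */
    size_t a;                     /* bytes owned by this heap */
} heap_t;

/* free(x) where x is owned by heap i: return the bytes to heap i, then check
 * u_i >= min(a_i - 2S, a_i / 2); if it would be violated, some superblock in
 * heap i is at least half empty, so move one superblock's worth of storage
 * to the global heap. */
void hoard_free(heap_t *local, heap_t *global, size_t obj_size) {
    local->u -= obj_size;                                   /* obj_size was in use */
    size_t threshold = (local->a >= 2 * S) ? local->a - 2 * S : 0;
    if (local->a / 2 < threshold) threshold = local->a / 2; /* min(a - 2S, a/2) */
    if (local->u < threshold && local->a >= S) {
        local->a -= S;                                      /* evict a half-empty superblock */
        global->a += S;
    }
}

/* malloc by thread i: use heap i if it has a free object; otherwise take a
 * superblock from the global heap, or from the OS if the global heap is
 * empty, and label it with owner i. Assumes obj_size <= S. */
void hoard_malloc(heap_t *local, heap_t *global, size_t obj_size) {
    if (local->u + obj_size > local->a) {   /* no free object in heap i */
        if (global->a >= S)
            global->a -= S;                 /* superblock from the global heap */
        /* else: the superblock comes from the OS (e.g., via mmap) */
        local->a += S;                      /* heap i now owns it */
    }
    local->u += obj_size;
}

In this counter-only model, the eviction check on free is exactly what keeps each heap at least half utilized (up to the 2S slack), which is what the blow-up argument on the next slide relies on.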
1429 01:12:51,060 --> 01:12:54,760 So any questions on how the allocation and deallocation 1430 01:12:54,760 --> 01:12:55,760 works? 1431 01:12:55,760 --> 01:12:58,960 So since we're maintaining this invariant, 1432 01:12:58,960 --> 01:13:01,722 it's going to allow us to prove a bound on the blow up. 1433 01:13:01,722 --> 01:13:03,430 And I'll show you that on the next slide. 1434 01:13:03,430 --> 01:13:05,718 But before I go on, are there any questions? 1435 01:13:08,530 --> 01:13:11,000 OK, so let's look at how we can bound the blow up 1436 01:13:11,000 --> 01:13:12,585 of the Hoard allocator. 1437 01:13:12,585 --> 01:13:14,210 So there is actually a lemma that we're 1438 01:13:14,210 --> 01:13:15,440 going to use and not prove. 1439 01:13:15,440 --> 01:13:18,110 The lemma is that the maximum storage allocated 1440 01:13:18,110 --> 01:13:21,080 in the global heap is at most the maximum storage allocated 1441 01:13:21,080 --> 01:13:22,430 in the local heaps. 1442 01:13:22,430 --> 01:13:25,070 So we just need to analyze how much storage is 1443 01:13:25,070 --> 01:13:26,330 allocated in the local heaps. 1444 01:13:26,330 --> 01:13:29,030 Because the total amount of storage 1445 01:13:29,030 --> 01:13:30,890 is going to be, at most, twice as much, 1446 01:13:30,890 --> 01:13:35,240 since the global heap storage is dominated by the local heap 1447 01:13:35,240 --> 01:13:36,170 storage. 1448 01:13:36,170 --> 01:13:38,370 So you can prove this lemma by case analysis. 1449 01:13:38,370 --> 01:13:41,520 And the Hoard paper is 1450 01:13:41,520 --> 01:13:42,770 available on Learning Modules. 1451 01:13:42,770 --> 01:13:44,562 And you're free to look at that if you want 1452 01:13:44,562 --> 01:13:45,892 to look at how this is proved. 1453 01:13:45,892 --> 01:13:47,600 But here I'm just going to use this lemma 1454 01:13:47,600 --> 01:13:52,100 to prove this theorem, which says that, let u be the user 1455 01:13:52,100 --> 01:13:53,840 footprint for a program. 1456 01:13:53,840 --> 01:13:58,940 And let a be Hoard's allocator footprint. 1457 01:13:58,940 --> 01:14:04,340 We have that a is upper bounded by order u plus s p. 1458 01:14:04,340 --> 01:14:07,190 And therefore, a divided by u, which is the blow up, 1459 01:14:07,190 --> 01:14:11,510 is going to be 1 plus order s p divided by u. 1460 01:14:15,550 --> 01:14:18,810 OK, so let's see how this proof works. 1461 01:14:18,810 --> 01:14:22,530 So we're just going to analyze the storage in the local heaps. 1462 01:14:22,530 --> 01:14:26,940 Now recall that we're always satisfying this invariant here, 1463 01:14:26,940 --> 01:14:29,860 where u sub i is greater than or equal to the min of a sub i minus 2 s 1464 01:14:29,860 --> 01:14:32,420 and a sub i over 2. 1465 01:14:32,420 --> 01:14:34,580 So the first term says that we can 1466 01:14:34,580 --> 01:14:39,410 have 2 s unutilized storage per heap. 1467 01:14:39,410 --> 01:14:41,900 So it's basically giving two super blocks for free 1468 01:14:41,900 --> 01:14:42,980 to each heap. 1469 01:14:42,980 --> 01:14:45,980 And they don't have to use it. 1470 01:14:45,980 --> 01:14:49,670 They can basically keep that much unutilized. 1471 01:14:49,670 --> 01:14:53,270 And therefore, the total amount of storage contributed 1472 01:14:53,270 --> 01:14:54,920 by the first term is going to be order 1473 01:14:54,920 --> 01:15:01,250 s p, because each processor has up to 2 s unutilized storage. 1474 01:15:01,250 --> 01:15:03,830 So that's where the second term comes from here.
1475 01:15:03,830 --> 01:15:11,180 And the second term, a sub i over 2-- 1476 01:15:11,180 --> 01:15:14,810 this will give us the first term, order u. 1477 01:15:14,810 --> 01:15:16,970 So this says that the allocated storage 1478 01:15:16,970 --> 01:15:19,910 is at most twice the used storage. 1479 01:15:19,910 --> 01:15:24,110 And then if you sum up across all the processors, 1480 01:15:24,110 --> 01:15:28,610 there's a total of order u storage that's allocated. 1481 01:15:28,610 --> 01:15:30,530 Because the allocated storage can be at most 1482 01:15:30,530 --> 01:15:31,835 twice the used storage. 1483 01:15:34,990 --> 01:15:39,410 OK, so that's the proof of the blow up for Hoard. 1484 01:15:39,410 --> 01:15:40,410 And this is pretty good. 1485 01:15:40,410 --> 01:15:43,650 It's 1 plus some lower-order term. 1486 01:15:46,620 --> 01:15:51,590 OK, so now these are some other allocators 1487 01:15:51,590 --> 01:15:52,430 that people use. 1488 01:15:52,430 --> 01:15:54,860 So jemalloc is a pretty popular one. 1489 01:15:54,860 --> 01:15:57,410 It has a few differences from Hoard. 1490 01:15:57,410 --> 01:15:59,870 It has a separate global lock for each different allocation 1491 01:15:59,870 --> 01:16:00,710 size. 1492 01:16:00,710 --> 01:16:03,350 It allocates the object with the smallest address 1493 01:16:03,350 --> 01:16:05,630 among all the objects of the requested size. 1494 01:16:05,630 --> 01:16:08,000 And it releases empty pages using madvise, 1495 01:16:08,000 --> 01:16:10,190 which 1496 01:16:10,190 --> 01:16:12,650 I talked about earlier. 1497 01:16:12,650 --> 01:16:16,130 And it's pretty popular because it has good performance, 1498 01:16:16,130 --> 01:16:20,827 and it's pretty robust to different allocation traces. 1499 01:16:20,827 --> 01:16:22,910 There's also another one called SuperMalloc, which 1500 01:16:22,910 --> 01:16:24,620 is an up-and-coming contender. 1501 01:16:24,620 --> 01:16:27,240 And it was developed by Bradley Kuszmaul. 1502 01:16:30,280 --> 01:16:33,130 Here are some allocator speeds for the allocators 1503 01:16:33,130 --> 01:16:36,430 that we looked at for our particular benchmark. 1504 01:16:36,430 --> 01:16:39,730 And for this particular benchmark, 1505 01:16:39,730 --> 01:16:41,980 we can see that SuperMalloc actually does really well. 1506 01:16:41,980 --> 01:16:44,500 It's more than three times faster than jemalloc, 1507 01:16:44,500 --> 01:16:48,160 and jemalloc is more than twice as fast as Hoard. 1508 01:16:48,160 --> 01:16:51,460 And then the default allocator, which 1509 01:16:51,460 --> 01:16:53,620 uses a global heap, is pretty slow, because it 1510 01:16:53,620 --> 01:16:55,450 can't get good speedup. 1511 01:16:55,450 --> 01:17:00,180 And all these experiments are on 32 threads. 1512 01:17:00,180 --> 01:17:01,780 I also have the lines of code. 1513 01:17:01,780 --> 01:17:04,630 So we see that SuperMalloc actually 1514 01:17:04,630 --> 01:17:06,325 has very few lines of code compared 1515 01:17:06,325 --> 01:17:07,325 to the other allocators. 1516 01:17:07,325 --> 01:17:10,710 So it's relatively simple. 1517 01:17:10,710 --> 01:17:13,970 OK, so I also have some slides on garbage collection. 1518 01:17:13,970 --> 01:17:16,210 But since we're out of time, I'll just 1519 01:17:16,210 --> 01:17:19,530 put these slides online and you can read them.