Lecture 12: Parallel Storage Allocation

Description: Prof. Shun discusses the differences between malloc() and mmap(); how cactus stacks work; parallel allocation strategies including global heaps, local heaps, and local ownership; and incremental, parallel, and concurrent garbage collection.

Instructor: Julian Shun

VOICEOVER: The following content is provided under a Creative Commons license. Your support will help MIT Open Courseware continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JULIAN SHUN: Good afternoon, everyone. So today we're going to talk about storage allocation. This is a continuation from last lecture where we talked about serial storage allocation. Today we'll also talk a little bit more about serial allocation. But then I'll talk more about parallel allocation and also garbage collection.

So I want to just do a review of some memory allocation primitives. So recall that you can use malloc to allocate memory from the heap. And if you call malloc with the size of s, it's going to allocate and return a pointer to a block of memory containing at least s bytes. So you might actually get more than s bytes, even though you asked for s bytes. But it's guaranteed to give you at least s bytes.

The return value is a void star (void *), but good programming practice is to typecast this pointer to whatever type you're using this memory for when you receive it from the malloc call. There's also aligned allocation.

So you can do aligned allocation with memalign, which takes two arguments, a size a as well as a size s. And a has to be an exact power of 2, and it's going to allocate and return a pointer to a block of memory again containing at least s bytes. But this time this memory is going to be aligned to a multiple of a, so the address where this memory block starts is going to be a multiple of a.
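As a rough sketch of what such a call looks like (memalign is a glibc function declared in malloc.h; posix_memalign or C11 aligned_alloc are more portable alternatives, and the 64-byte alignment here is just an assumed cache-line size):

```c
#include <malloc.h>   // memalign (glibc); posix_memalign lives in <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Ask for 1000 bytes aligned to a 64-byte boundary.
    // The alignment (first argument) must be a power of 2.
    double *buf = (double *)memalign(64, 1000);
    if (buf == NULL) {
        perror("memalign");
        return 1;
    }
    printf("address %p is a multiple of 64: %d\n",
           (void *)buf, ((uintptr_t)buf % 64) == 0);
    free(buf);
    return 0;
}
```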

So does anyone know why we might want to do an aligned memory allocation? Yeah?

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah, so one reason is that you can align memory so that it's aligned to cache lines, so that when you access an object that fits within a cache line, it's not going to cross two cache lines. And you'll only get one cache access instead of two. So one reason is that you want to align the memory to cache lines to reduce the number of cache misses.

Another reason is that the vectorization operations also require you to have memory addresses that are aligned to some power of 2. So if you align your memory allocation with memalign, then that's also good for the vector units. We also talked about deallocation. You can free memory back to the heap with the free function.

So if you pass it a pointer p to some block of memory, it's going to deallocate this block and return it to the storage allocator. And we also talked about some anomalies of freeing. So what is it called when you fail to free some memory that you allocated? Yes?

Yeah, so if you fail to free something that you allocated, that's called a memory leak. And this can cause your program to use more and more memory. And eventually your program is going to use up all the memory on your machine, and it's going to crash.

We also talked about freeing something more than once. Does anyone remember what that's called? Yeah? Yeah, so that's called double freeing. Double freeing is when you free something more than once. And the behavior is going to be undefined.

You might get a seg fault immediately, or you might end up freeing memory that has since been allocated for some other purpose. And then later down the road your program is going to have some unexpected behavior. OK.

I also want to talk about mmap. So mmap is a system call. And usually mmap is used to treat some file on disk as part of memory, so that when you write to that memory region, it also backs it up on disk. In this context here, I'm actually using mmap to allocate virtual memory without having any backing file.

So mmap has a whole bunch of parameters here. The second-to-last parameter indicates the file I want to map, and if I pass negative 1, that means there's no backing file. So I'm just using this to allocate some virtual memory.

The first argument is where I want to allocate it. And 0 means that I don't care. The size, in number of bytes, says how much memory I want to allocate. Then there's also permissions. So here it says I can read and write this memory region. MAP_PRIVATE means that this memory region is private to the process that's allocating it. And then MAP_ANON means that there is no name associated with this memory region.

And then as I said, negative 1 means that there's no backing file. And the last parameter is just 0 if there's no backing file. Normally it would be an offset into the file that you're trying to map. But here there's no backing file.
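Here's a small sketch of the kind of anonymous mmap call being described (the flag is spelled MAP_ANON on some systems and MAP_ANONYMOUS on others):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 1 << 20;  // request 1 MB of virtual memory

    // addr = NULL: we don't care where the mapping goes.
    // PROT_READ | PROT_WRITE: we can read and write the region.
    // MAP_PRIVATE: the region is private to this process.
    // MAP_ANONYMOUS: no backing file, so fd = -1 and offset = 0.
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 0, size);  // first touches cause the page faults that
                         // actually allocate physical memory
    munmap(p, size);     // return the region to the OS
    return 0;
}
```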

And what mmap does is it finds a contiguous unused region in the address space of the application that's large enough to hold size bytes. And then it updates the page table so that it now contains an entry for the pages that you allocated. And then it creates the necessary virtual memory management structures within the operating system to make it so that the user's accesses to this area are legal, and accesses won't result in a seg fault.

If you try to access some region of memory without having the OS set up these structures, then you might get a seg fault because the program might not have permission to access that area. But mmap is going to make sure that the user can access this area of virtual memory. And mmap is a system call, whereas malloc, which we talked about last time, is a library call. So these are two different things. And malloc actually uses mmap under the hood to get more memory from the operating system.

So let's look at some properties of mmap. So mmap is lazy. So when you request a certain amount of memory, it doesn't immediately allocate physical memory for the requested allocation. Instead it just populates the page table with entries pointing to a special zero page. And then it marks these pages as read only.

And then the first time you write to such a page, it will cause a page fault. And at that point, the OS is going to modify the page table, get the appropriate physical memory, and store the mapping from the virtual address space to the physical address space for the particular page that you touched. And then it will restart the instruction so that it can continue to execute.

It turns out that you can actually mmap a terabyte of virtual memory, even on a machine with just a gigabyte of DRAM. Because when you call mmap, it doesn't actually allocate the physical memory. But then you should be careful, because a process might die from running out of physical memory well after you call mmap. Because mmap is going to allocate this physical memory whenever you first touch it. And this could be much later than when you actually made the call to mmap.

So any questions so far? OK. So what's the difference between malloc and mmap? So as I said, malloc is a library call. And malloc and free are part of the memory allocation interface of the heap-management code in the C library.

And the heap-management code uses the available system facilities, including the mmap function, to get virtual address space from the operating system. And then the heap-management code within malloc is going to attempt to satisfy user requests for heap storage by reusing the memory that it got from the OS as much as possible, until it can't do that anymore. And then it will go and call mmap to get more memory from the operating system.

So the malloc implementation invokes mmap and other system calls to expand the size of the user's heap storage. And the responsibility of malloc is to reuse the memory, such that your fragmentation is reduced and you have good temporal locality, whereas the responsibility of mmap is actually getting this memory from the operating system. So any questions on the differences between malloc and mmap?

So one question is, why don't we just call mmap all the time, instead of using malloc? Why don't we just directly call mmap? Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yes, so one answer is that you might have free storage from before that you would want to reuse. And it turns out that mmap is relatively heavyweight. So it works on a page granularity. So if you want to do a small allocation, it's quite wasteful to allocate an entire page for that allocation and not reuse it.

You'll get very bad external fragmentation. And when you call mmap, it has to go through all of the overhead of the security of the OS and updating the page table and so on. Whereas if you use malloc, it's actually pretty fast for most allocations, especially if you have temporal locality where you allocate something that you just recently freed.

So your program would be pretty slow if you used mmap all the time, even for small allocations. For big allocations, it's fine. But for small allocations, you should use malloc. Any questions on mmap versus malloc?

OK, so I just want to do a little bit of review on how address translation works. So some of you might have seen this before in your computer architecture course. So how it works is, when you access a memory location, you access it via the virtual address. And the virtual address can be divided into two parts, where the lower order bits store the offset, and the higher order bits store the virtual page number.

And in order to get the physical address associated with this virtual address, the hardware is going to look up this virtual page number in what's called the page table. And then if it finds a corresponding entry for the virtual page number in the page table, that will tell us the physical frame number. And then the physical frame number corresponds to where this physical memory is in DRAM. So you can just take the frame number, and then use the same offset as before to get the appropriate offset into the physical memory frame.
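To make the split concrete, here's a tiny illustration assuming 4-kilobyte pages (a 12-bit offset) and a 64-bit virtual address; the exact field widths are machine-dependent:

```c
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    const int offset_bits = 12;             // assumes 4 KB pages
    uint64_t vaddr = 0x7f3a12345678ULL;     // an arbitrary example address

    uint64_t offset = vaddr & ((1ULL << offset_bits) - 1);  // low 12 bits
    uint64_t vpn    = vaddr >> offset_bits;                 // remaining high bits

    // The hardware looks up vpn in the TLB/page table to get a physical
    // frame number, then reattaches the same offset within that frame.
    printf("VPN = %#" PRIx64 ", offset = %#" PRIx64 "\n", vpn, offset);
    return 0;
}
```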

So if the virtual page that you're looking for doesn't reside in physical memory, then a page fault is going to occur. And when a page fault occurs, either the operating system will see that the process actually has permission to look at that memory region, and it will set the permissions and place the entry into the page table so that you can get the appropriate physical address. Or otherwise, the operating system might see that this process actually can't access that region of memory, and then you'll get a segmentation fault.

It turns out that the page table search, also called a page walk, is pretty expensive. And that's why we have the translation lookaside buffer, or TLB, which is essentially a cache for the page table. So the hardware uses the TLB to cache recent page table lookups so that later on when you access the same page, it doesn't have to go all the way to the page table to find the physical address. It can first look in the TLB to see if that page has been recently accessed.

So why would you expect the TLB to contain something that the program has recently accessed? So what's one property of a program that will make it so that you get a lot of TLB hits? Yes?

STUDENT: Well, usually [INAUDIBLE] nearby one another, which means they're probably in the same page or [INAUDIBLE].

JULIAN SHUN: Yeah, so that's correct. So the page table stores pages, which are typically four kilobytes. Nowadays there are also huge pages, which can be a couple of megabytes. And most of the accesses in your program are going to be near each other. So they're likely going to reside on the same page for accesses that have been done close together in time.

So therefore you'll expect that many of your recent accesses are going to be stored in the TLB if your program has locality, either spatial or temporal locality or both. So how this architecture works is that the processor is first going to check whether the virtual address you're looking for is in the TLB. If it's not, it's going to go to the page table and look it up.

And then if it finds it there, it's going to store that entry into the TLB. And then next it's going to take the physical address that it found from the TLB and look it up in the CPU cache. And if it finds it there, it gets it. If it doesn't, then it goes to DRAM to satisfy the request. Most modern machines actually have an optimization that allows you to do the TLB access in parallel with the L1 cache access. So the L1 cache actually uses virtual addresses instead of physical addresses, and this reduces the latency of a memory access.

So that's a brief review of address translation. All right, so let's talk about stacks. So when you execute a serial C or C++ program, you're using a stack to keep track of the function calls and local variables that you have to save. So here, let's say we have this invocation tree, where function a calls function b, which then returns. And then a calls function c, which calls d, returns, calls e, returns, and then returns again.

Here are the different views of the stack at different points of the execution. So initially when we call a, we have a stack frame for a. And then when a calls b, we're going to place a stack frame for b right below the stack frame of a. So these are going to be linearly ordered.

When we're done with b, then this part of the stack is no longer going to be used, the part for b. And then when a calls c, it's going to allocate a stack frame below a on the stack. And this space is actually going to be the same space as what b was using before. But this is fine, because we're already done with the call to b.

Then when c calls d, we're going to create a stack frame for d right below c. When it returns, we're not going to use that space any more, so then we can reuse it for the stack frame when we call e. And then eventually all of these will pop back up.

And all of these views here share the same view of the stack frame for a. And then c, d, and e all share the same view of the stack frame for c. So this is how a traditional linear stack works when you run a serial C or C++ program. And you can view this as a serial walk over the invocation tree.

There's one rule for pointers with traditional linear stacks, which is that a parent can pass pointers to its stack variables down to its children, but not the other way around. A child can't pass a pointer to one of its local variables back to its parent. If you do that, you'll get a bug in your program. How many of you have tried doing that before? Yeah, so a lot of you.

So let's see why that causes a problem. So if b passes a pointer to some local variable in b's stack back to a, then when a later calls c, c is going to overwrite the space that b was using. And if b's local variable was stored in the space that c has now overwritten, then you're just going to see garbage. And when you try to access that, you're not going to get the correct value.

So you can pass a pointer to a's local variable down to any of these descendant function calls, because they all see the same view of a's stack. And that's not going to be overwritten while these descendant function calls are proceeding. But if you pass it the other way, then potentially the variable that you had a pointer to is going to be overwritten.

So here's one question. If you want to pass memory from a child back to the parent, where would you allocate it? So you can allocate it on the parent. What's another option? Yes? Yes, so another way to do this is to allocate it on the heap.

If you allocate it on the heap, even after you return from the function call, that memory is going to persist. You can also allocate it in the parent's stack, if you want. In fact, some programs are written that way. And one of the reasons why many C functions require you to pass in memory to the function where it's going to store the return value is to try to avoid an expensive heap allocation in the child.

Because if the parent allocates this space to store the result, the child can just put whatever it wants to compute in that space. And the parent will see it. So then the responsibility is up to the parent to figure out whether it wants to allocate the memory on the stack or on the heap. So this is one of the reasons why you'll see many C functions where one of the arguments is a memory location where the result should be stored.
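Here's a small sketch of that out-parameter pattern; the function and variable names are made up for illustration. The caller decides where the result lives, and the callee just fills in the memory it's handed:

```c
#include <stdio.h>

// The child writes its result into caller-provided memory instead of
// returning a pointer to one of its own (soon-to-vanish) locals.
static void compute_sum(const int *a, int n, long *result) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    *result = s;    // fills the parent's storage; no heap allocation needed
}

int main(void) {
    int data[4] = {1, 2, 3, 4};
    long sum;                       // lives in the parent's stack frame
    compute_sum(data, 4, &sum);     // parent passes a pointer down, which is legal
    printf("%ld\n", sum);
    return 0;
}
```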

OK, so that was the serial case. What happens in parallel? So in parallel, we have what's called a cactus stack where we can support multiple views of the stack in parallel. So let's say we have a program where it calls function a, and then a spawns b and c. So b and c are going to be running potentially in parallel. And then c spawns d and e, which can potentially be running in parallel.

So for this program, we could have functions b, d, and e all executing in parallel. And a cactus stack is going to allow all of these functions to see the same view of the stack as they would have if this program were executed serially. And the Cilk runtime system supports a cactus stack to make it easy to write parallel programs. Because now when you're writing programs, you just have to obey the same rules as for programming in serial C and C++ with regard to the stack, and then you'll still get the intended behavior.

And it turns out that there's no copying of the stacks here. So all of these different views are seeing the same virtual memory addresses for a. But now there is an issue of how do we implement this cactus stack? Because in the serial case, we could have these later stacks overwriting the earlier stacks. But in parallel, how can we do this?

So does anyone have any simple ideas on how we can implement a cactus stack? Yes?

STUDENT: You could just have each child's stack start in like a separate stack, or just have references to the [INAUDIBLE].

JULIAN SHUN: Yeah, so one way to do this is to have each thread use a different stack. And then store pointers to the different stack frames across the different stacks. There's actually another way to do this, which is easier. OK, yes?

STUDENT: If the stack frames have a maximum-- fixed maximum size-- then you could put them all in the same stack separated by that fixed size.

JULIAN SHUN: Yeah, so if the stacks all have a maximum depth, then you could just allocate a whole bunch of stacks, which are separated by this maximum depth. There's actually another way to do this, which is to not use the stack. So yes?

STUDENT: Could you memory map it somewhere else-- each of the different threads?

JULIAN SHUN: Yes, that's actually one way to do it. The easiest way to do it is just to allocate it off the heap. So instead of allocating the frames on the stack, you just do a heap allocation for each of these stack frames. And then each of these stack frames has a pointer to the parent stack frame.

So whenever you do a function call, you're going to do a memory allocation from the heap to get a new stack frame. And then when you finish a function, you're going to pop something off of this stack, and free it back to the heap. In fact, a lot of early systems for parallel programming used this strategy of heap-based cactus stacks.
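Here's a minimal sketch of the idea, not the actual Cilk runtime data structure: each heap-allocated frame just keeps a pointer to its parent, so many children can share the same view of their ancestors' frames:

```c
#include <stdlib.h>

// Hypothetical frame record for a heap-based cactus stack.
typedef struct frame {
    struct frame *parent;   // link to the caller's frame
    // ... the function's local variables would live here ...
} frame_t;

// On a call or spawn: allocate a new frame off the heap and link it to the caller.
frame_t *push_frame(frame_t *parent) {
    frame_t *f = malloc(sizeof(frame_t));
    if (f == NULL) abort();
    f->parent = parent;     // children share their ancestors' frames by pointer
    return f;
}

// On return: free the frame back to the heap and resume in the parent.
frame_t *pop_frame(frame_t *f) {
    frame_t *parent = f->parent;
    free(f);
    return parent;
}
```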

Turns out that you can actually minimize the performance impact using this strategy if you optimize the code enough. But there is actually a bigger problem with using a heap-based cactus stack, which doesn't have to do with performance. Does anybody have any guesses of what this potential issue is? Yeah?

STUDENT: It requires you to allocate the heap in parallel.

JULIAN SHUN: Yeah, so let's assume that we can do parallel heap allocation. And we'll talk about that. So assuming that we can do that correctly, what's the issue with this approach? Yeah?

STUDENT: It's that you don't know how big the stack is going to be?

JULIAN SHUN: So let's assume that you can get whatever stack frames you need from the heap, so you don't actually need to put an upper bound on this. Yeah?

STUDENT: We don't know the maximum depth.

JULIAN SHUN: Yeah. So we don't know the maximum depth, but let's say we can make that work. So you don't actually need to know the maximum depth if you're allocating off the heap. Any other guesses? Yeah?

STUDENT: Something to do with returning from the stack that is allocated on the heap to one of the original stacks.

JULIAN SHUN: So let's say we could get that to work as well. So what happens if I try to run some program using this heap-based cactus stack with something that's using the regular stack? So let's say I have some old legacy code that was already compiled using the traditional linear stack. So there's a problem with interoperability here.

Because the traditional code is assuming that, when you make a function call, the stack frame for the function call is going to appear right after the stack frame of the calling function. So if you try to mix code that uses the traditional stack with this heap-based cactus stack approach, then it's not going to work well together.

One approach is that you can just recompile all your code to use this heap-based cactus stack. But even if you could do that, even if all of the source code were available, there are some legacy programs that actually do some manipulations with the stack in the source code, because they assume that you're using the traditional stack, and those programs would no longer work if you're using a heap-based cactus stack.

So the problem is interoperability with legacy code. Turns out that you can fix this using an approach called thread local memory mapping. So one of the students mentioned memory mapping. But that requires changes to the operating system. So it's not general purpose.

But the heap-based cactus stack turns out to be very simple. And we can prove nice bounds about it. So besides the interoperability issue, heap-based cactus stacks are pretty good in practice, as well as in theory. So we can actually prove a space bound for a Cilk program that uses the heap-based cactus stack.

So let's say s 1 is the stack space required by a serial execution of a Cilk program. Then the stack space of a p-worker execution using a heap-based cactus stack is going to be upper bounded by p times s 1. So s p is the space for a p-worker execution, and that's less than or equal to p times s 1.
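As a formula (writing $S_1$ for the serial stack space and $S_P$ for the space of a $P$-worker execution, matching the s 1 and s p above), the bound is simply:

```latex
S_P \;\le\; P \cdot S_1
```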

To understand how this works, we need to understand a little bit about how the Cilk work-stealing algorithm works. So in the Cilk work-stealing algorithm, whenever a worker spawns a new task, it's going to immediately work on the task that it spawned.

So therefore, for any leaf in the invocation tree that currently exists, there's always going to be a worker working on it. There's not going to be any leaves in the tree where there's no worker working on it. Because when a worker spawns a task, it creates a new leaf. But then it works immediately on that leaf.

So here we have an invocation tree. And for all of the leaves, we have a processor working on them. And with this busy leaves property, we can easily show this space bound. So for each one of these processors, the maximum stack space it's using is going to be upper bounded by s 1, because that's the maximum stack space used by a serial execution that executes the whole program.

And then since we have p of these leaves, we just multiply s 1 by p, and that gives us an upper bound on the overall space used by a p-worker execution. This can be a loose upper bound, because we're double counting here. There's some part of this memory that we're counting more than once, because it's shared among the different processors.

But that's why we have the less than or equal to here. So it's upper bounded by p times s 1. So one of the nice things about using a heap-based cactus stack is that you get this good space bound. Any questions on the space bound here?

So let's try to apply this theorem to a real example. So this is the divide and conquer matrix multiplication code that we saw in a previous lecture. So this is-- in this code, we're making eight recursive calls to a divide and conquer function. Each of size n over 2.

And before we make any of these calls, we're doing a malloc to get some temporary space. And this is of size order n squared. And then we free this temporary space at the end. And notice here that the allocations of the temporary matrix obey a stack discipline.

So we're allocating stuff before we make the recursive calls. And we're freeing it right before we return from the function. So all of the allocations are nested, and they follow a stack discipline. And it turns out that even if you're allocating off the heap, if you follow a stack discipline, you can still use the space bound from the previous slide to upper bound the p-worker space.

OK, so let's try to analyze the space of this code here. So first let's look at what the work and span are. So this is just going to be review. What's the work of this divide and conquer matrix multiply? So it's n cubed. It's n cubed because we have eight subproblems of size n over 2. And then we have to do work linear in the size of the matrices, order n squared, to add them together.

So our recurrence is going to be t 1 of n is equal to eight times t 1 of n over 2 plus order n squared. And that solves to order n cubed if you just pull out your master theorem card. What about the span? So what's the recurrence here? Yeah, so the span t infinity of n is equal to t infinity of n over 2 plus the span of the addition. And what's the span of the addition?

STUDENT: [INAUDIBLE]

JULIAN SHUN: No, let's assume that we have a parallel addition. We have nested cilk_for loops. Right, so then the span of that is just going to be log n, since the span of one cilk_for loop is log n, and when you nest them, you just add together the spans. So it's going to be t infinity of n is equal to t infinity of n over 2 plus order log n. And what does that solve to?

Yeah, so it's going to solve to order log squared n. Again you can pull out your master theorem card, and look at one of the three cases. OK, so now let's look at the space. What's going to be the recurrence for the space? Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: The only place we're generating new space is when we call this malloc here. So they're all seeing the same original matrix. So what would the recurrence be? Yeah?

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah.

STUDENT: [INAUDIBLE]

JULIAN SHUN: So the n squared term is right. Do we actually need eight subproblems of size n over 2? What happens after we finish one of these subproblems? Are we still going to use the space for it?

STUDENT: Yeah, you free the memory after the [INAUDIBLE].

JULIAN SHUN: Right. So you can actually reuse the memory. Because you free the memory you allocated after each one of these recursive calls. So therefore the recurrence is just going to be s of n equals s of n over 2 plus theta of n squared. And what does that solve to?

STUDENT: [INAUDIBLE]

JULIAN SHUN: N squared. Right. So here the n squared term actually dominates. You have a decreasing geometric series. So it's dominated at the root, and you get theta of n squared. And therefore by using the busy leaves property and the theorem for the space bound, this tells us that on p processors, the space is going to be bounded by p times n squared.
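To summarize the three recurrences just discussed (this only restates the analysis above in one place):

```latex
\begin{aligned}
T_1(n) &= 8\,T_1(n/2) + \Theta(n^2) = \Theta(n^3) && \text{(work)}\\
T_\infty(n) &= T_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n) && \text{(span)}\\
S(n) &= S(n/2) + \Theta(n^2) = \Theta(n^2) && \text{(serial space)}
\end{aligned}
```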

And this is actually pretty good, since we have a bound on the space. It turns out that we can actually prove a stronger bound for this particular example. And I'll walk you through how we can prove this stronger bound. Here, the order p times n squared bound is already pretty good. But we can actually do better if we look internally at how this algorithm is structured.

So on each level of recursion, we're branching eight ways. And most of the space is going to be used near the top of this recursion tree. So if I branch as much as possible near the top of my recursion tree, then that's going to give me my worst case space bound. Because the space is decreasing geometrically as I go down the tree.

So I'm going to branch eight ways until I get to some level k in the recursion tree where I have p nodes. And at that point, I'm not going to branch anymore because I've already used up all p nodes. And that's the number of workers I have.

So let's say I have this level k here, where I have p nodes. So what would be the value of k here? If I branch eight ways how many levels do I have to go until I get to p nodes? Yes.

STUDENT: It's log base 8 of p.

JULIAN SHUN: Yes. It's log base 8 of p. So we have 8 to the k equal to p, because we're branching eight ways for k levels. And then using some algebra, you can get that k is equal to log base 8 of p, which is equal to log base 2 of p divided by 3. And then from this level k downwards, the space is going to decrease geometrically.

So the space is going to be dominated at this level k. The space decreases geometrically as you go down from level k, and also as you go up from level k. So therefore we can just look at what the space is at this level k here.

So the space is going to be p times the size of each one of these nodes squared. And the size of each one of these nodes is going to be n over 2 to the log base 2 of p over 3. And then we square that because we're using n squared temporary space.

So if you solve that, that gives you p to the one-third times n squared, which is better than the upper bound we saw earlier of order p times n squared. So you can work out the details for this example. Not all the details are shown on this slide. You need to show that the level k here actually dominates all the other levels in the recursion tree.
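Written out, the calculation being described is roughly (branching eight ways for k levels until there are p nodes, then charging order n squared temporary space per node at that level):

```latex
8^k = p \;\Rightarrow\; k = \log_8 p = \tfrac{1}{3}\log_2 p,
\qquad
S_p \;\approx\; p \cdot \Theta\!\left(\left(\frac{n}{2^{(\log_2 p)/3}}\right)^{2}\right)
      \;=\; \Theta\!\left(p \cdot \frac{n^2}{p^{2/3}}\right)
      \;=\; \Theta\!\left(p^{1/3}\, n^2\right).
```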

But in general, if you know what the structure of the algorithm is, you can potentially prove a stronger space bound than just applying the general theorem we showed on the previous slide. So any questions on this?

OK, so as I said before, the problem with heap-based linkage is that parallel functions fail to interoperate with legacy and third-party serial binaries. Yes, was there a question?

STUDENT: I actually do have a question.

JULIAN SHUN: Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yes.

STUDENT: How do we know that the workers don't split along the path of the [INAUDIBLE] instead of across or horizontal.

JULIAN SHUN: Yes. So you don't actually know that. But this turns out to be the worst case. So if it branches any other way, the space is just going to be lower. So you have to argue that this is going to be the worst case, and it's going to be-- intuitively it's the worst case, because you're using most of the memory near the root of the recursion tree. So if you can get all p nodes as close as possible to the root, that's going to make your space as high as possible. It's a good question.

So parallel functions fail to interoperate with legacy and third-party serial binaries. Even if you can recompile all of this code, which isn't always necessarily the case, you can still have issues if the legacy code is taking advantage of the traditional linear stack inside the source code. So our implementation of Cilk uses a less space-efficient strategy that is interoperable with legacy code. And it uses a pool of linear stacks instead of a heap-based strategy.

So we're going to maintain a pool of linear stacks lying around. There are going to be more than p stacks lying around. And whenever a worker tries to steal something, it's going to try to acquire one of these stacks from this pool of linear stacks. And when it's done, it will return it back.

But when it finds that there are no more linear stacks in this pool, then it's not going to steal anymore. So this still preserves the space bound, as long as the number of stacks is a constant times the number of processors. But it will affect the time bounds of the work-stealing algorithm. Because now when a worker is idle, it might not necessarily have the chance to steal if there are no more stacks lying around.

This strategy doesn't require any changes to the operating system. There is a way to preserve both the space and the time bounds using thread-local memory mapping, but that does require changes to the operating system. So our implementation of Cilk uses a pool of linear stacks, and it's based on the Intel implementation. OK.

All right, so we talked about stacks, and that we just reduce the problem to heap allocation. So now we have to talk about heaps. So let's review some basic properties of heap-storage allocators. So here's a definition.

The allocator speed is the number of allocations and deallocations per second that the allocator can sustain. And here's a question. Is it more important to maximize the allocator speed for large blocks or small blocks? Yeah?

STUDENT: Small blocks?

JULIAN SHUN: So small blocks. Here's another question. Why? Yes?

STUDENT: So you're going to be doing a lot of [INAUDIBLE].

JULIAN SHUN: Yes, so one answer is that you're going to be doing a lot more allocations and deallocations of small blocks than large blocks. There's actually a more fundamental reason why it's more important to optimize for small blocks. So anybody? Yeah?

STUDENT: [INAUDIBLE] basically not being able to make use of pages.

JULIAN SHUN: Yeah, so that's another reason for small blocks. It's more likely that it will lead to fragmentation if you don't optimize for small blocks. What's another reason? Yes.

STUDENT: Wouldn't it just take longer to allocate larger blocks anyway? So the overhead is going to be more noticeable if you have a big overhead when you allocate small blocks versus large blocks?

JULIAN SHUN: Yeah. So the main reason is that when you're allocating a block, a user program is typically going to write to all the bytes in the block. And therefore, for a large block, it takes so much time to write that the allocator time has little effect on the overall running time.

Whereas if a program allocates many small blocks, the amount of useful work it's actually doing on each block can be comparable to the overhead for the allocation. And therefore, all of the allocation overhead can add up to a significant amount for small blocks.

So essentially for large blocks, you can amortize away the overheads for storage allocation, whereas for small blocks, it's harder to do that. Therefore, it's important to optimize for small blocks. Here's another definition. So the user footprint is the maximum over time of the number u of bytes in use by the user program.

And these are the bytes that are allocated and not freed. And this is measuring the peak memory usage. It's not necessarily equal to the sum of the sizes that you have allocated so far, because you might have reused some of that memory. So the user footprint is the peak memory usage in number of bytes.

And the allocator footprint is the maximum over time of the number a of bytes of memory provided to the allocator by the operating system. And the reason why the allocator footprint could be larger than the user footprint is that when you ask the OS for some memory, it could give you more than what you asked for. And similarly, if you ask malloc for some amount of memory, it can also give you more than what you asked for.

And the fragmentation is defined to be a divided by u. And a program with low fragmentation will keep this ratio as low as possible, so it keeps the allocator footprint as close as possible to the user footprint. And in the best case, this ratio is going to be one. So you're using all of the memory that the operating system allocated.

One remark is that the allocator footprint a usually grows monotonically for many allocators. So it turns out that many allocators call mmap to get more memory. But they don't always free this memory back to the OS. You can actually free memory using something called munmap, which is the opposite of mmap, to give memory back to the OS. But this turns out to be pretty expensive.

In modern operating systems, its implementation is not very efficient. So many allocators don't use munmap. You can also use something called madvise. And what madvise does is it tells the operating system that you're not going to be using this page anymore, but to keep it around in virtual memory. So this has less overhead, because it doesn't have to clear this entry from the page table. It just has to mark that the program isn't using this page anymore.

So some allocators use madvise with the DONTNEED option to free memory. But a is usually still growing monotonically over time, because allocators don't necessarily free everything back to the OS that they allocated.

Here's a theorem that we proved in last week's lecture, which says that the fragmentation for binned free lists is order log base 2 of u, or just order log u. And the reason for this is that you can have log base 2 of u bins, and each bin can basically contain u bytes of storage.

So overall, you could have allocated u times log u storage while only using u of those bytes. So therefore the fragmentation is order log u.
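In symbols, with roughly $\lg U$ bins and each bin potentially holding up to $U$ bytes:

```latex
A \;\le\; U \lg U
\quad\Longrightarrow\quad
\frac{A}{U} \;=\; O(\lg U).
```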

Another thing to note is that modern 64-bit processors only provide about 2 to the 48 bytes of virtual address space. So this might be news to you, because you would probably expect that, for a 64-bit processor, you have 2 to the 64 bytes of virtual address space. But that turns out not to be the case.

So they only support 2 to the 48 bytes. And that turns out to be enough for all of the programs that you would want to write. And that's also going to be much more than the physical memory you would have on a machine. So nowadays, you can get a big server with a terabyte of memory, or 2 to the 40 bytes of physical memory, which is still much lower than the number of bytes in the virtual address space.

Any questions? OK, so here are some more definitions. So the space overhead of an allocator is the space used for bookkeeping. So perhaps you store headers with the blocks that you allocate to keep track of the size and other information. And that would contribute to the space overhead.

Internal fragmentation is waste due to allocating larger blocks than the user requested. So you can get internal fragmentation if, when you call malloc, you get back a block that's actually larger than what the user requested. We saw that in the binned-free list algorithm, we're rounding up to the nearest power of 2.

If you allocate nine bytes, you'll actually get back 16 bytes in our binned-free list algorithm from last lecture. So that contributes to internal fragmentation. It turns out that not all binned-free list implementations use powers of 2. So some of them use other bases that are smaller than 2 in order to reduce the internal fragmentation.

Then there's external fragmentation, which is waste due to the inability to use storage because it's not contiguous. So for example, say I allocated a whole bunch of one-byte objects consecutively in memory, and then I freed every other byte. Now if I want to allocate a two-byte object, I don't actually have contiguous memory to satisfy that request, because all of my free bytes are in one-byte chunks, and they're not next to each other.

So this is one example of how external fragmentation can happen after you allocate and free stuff. Then there's blow up. And this is, for a parallel allocator, the additional space beyond what a serial allocator would require. So if a serial allocator requires s space, and a parallel allocator requires t space, then the blow up is just going to be t over s.

OK, so now let's look at some parallel heap allocation strategies. So the first strategy is to use a global heap. And this is how the default C allocator works. So if you just use the default C allocator out of the box, this is how it's implemented.

It uses a global heap where all the accesses to this global heap are protected by mutex. You can also use lock-free synchronization primitives to implement this. We'll actually talk about some of these synchronization primitives later on in the semester. And this is done to preserve atomicity because you can have multiple threads trying to access the global heap at the same time. And you need to ensure that races are handled correctly.

So what's the blow up for this strategy? How much more space am I using than just a serial allocator? Yeah.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah, so the blow up is one. Because I'm not actually using any more space than the serial allocator. Since I'm just maintaining one global heap, and everybody is going to that heap to do allocations and deallocations. But what's the potential issue with this approach? Yeah?

STUDENT: Performance hit for that block coordination.

JULIAN SHUN: Yeah, so you're going to take a performance hit for trying to acquire this lock. So basically every time you do an allocation or deallocation, you have to acquire this lock. And this is pretty slow, and it gets slower as you increase the number of processors.

Roughly speaking, acquiring a lock is similar in cost to an L2 cache access. And if you just run a serial allocator, many of your requests are going to be satisfied just by going into the L1 cache. Because you're going to be allocating things that you recently freed, and those things are going to be residing in the L1 cache.

But here, before you even get started, you have to grab a lock. And you have to pay a performance hit similar to an L2 cache access. So that's bad. And it gets worse as you increase the number of processors. So the contention increases as you increase the number of threads. And then you can't-- you're not going to be able to get good scalability.

So ideally, as the number of threads or processors grows, the time to perform an allocation or deallocation shouldn't increase. But in fact, it does. And the most common reason for loss of scalability is lock contention.

So here all of the processors are trying to acquire the same lock, which is at the same memory address. And if you recall from the caching lecture, or the multicore programming lecture, every time you acquire a lock, you have to bring that cache line into your own cache, and then invalidate the same cache line in the other processors' caches.

So if all the processors are doing this, then this cache line is going to be bouncing around among all of the processors' caches, and this could lead to very bad performance. So here's a question. Is lock contention more of a problem for large blocks or small blocks? Yes.

STUDENT: So small blocks.

JULIAN SHUN: Here's another question. Why? Yes.

STUDENT: Because by the time it takes to finish using the small block, then the allocator is usually small. So you do many allocations and deallocations, which means you have to go through the lock multiple times.

JULIAN SHUN: Yeah. So one of the reasons is that when you're doing small allocations, that means that your request rate is going to be pretty high. And your processors are going to be spending a lot of time acquiring this lock. And this can exacerbate the lock contention.

And another reason is that when you allocate a large block, you're doing a lot of work, because you have to write-- most of the time you're going to write to all the bytes in that large block. And therefore you can amortize the overheads of the storage allocator across all of the work that you're doing.

Whereas for small blocks, in addition to the higher rate of memory requests, there's also much less work to amortize the overheads across. So any questions? OK, good.

All right. So here's another strategy, which is to use local heaps. So each thread is going to maintain its own heap. And it's going to allocate out of its own heap. And there's no locking that's necessary. So when you allocate something, you get it from your own heap. And when you free something, you put it back into your own heap. So there's no synchronization required. So that's a good thing. It's very fast. What's a potential issue with this approach? Yes.

STUDENT: It's using a lot of extra space.

JULIAN SHUN: Yes, so with this approach, you're going to be using a lot of extra space. First of all, because you have to maintain multiple heaps. And what's one phenomenon that you might see if you're executing a program with this local-heap approach? Could the space potentially keep growing over time? Yes.

STUDENT: You could maybe like allocate every one process [INAUDIBLE].

JULIAN SHUN: Yeah. Yeah, so you could actually have an unbounded blow up. Because if you do all of the allocations in one heap, and you free everything in another heap, then whenever the first heap does an allocation, there's actually free space sitting around in another heap. But it's just going to grab more memory from the operating system. So your blow up can be unbounded.

And this phenomenon is what's called memory drift. So blocks allocated by one thread are freed by another thread. And if you run your program for long enough, your memory consumption can keep increasing. And this is sort of like a memory leak.

So you might see that if you have a memory drift problem, your program running on multiple processors could run out of memory eventually. Whereas if you just run it on a single core, it won't run out of memory. And here it's because the allocator isn't smart enough to reuse things in other heaps. So what's another strategy you can use to try to fix this? Yes?

STUDENT: [INAUDIBLE]

JULIAN SHUN: Sorry, can you repeat your question?

STUDENT: [INAUDIBLE]

JULIAN SHUN: Because if you keep allocating from one thread, if you do all of your allocations in one thread, and do all of your deallocations on another thread, every time you allocate from the first thread, there's actually memory sitting around in the system. But the first thread isn't going to see it, because it only sees its own heap. And it's just going to keep grabbing more memory from the OS.

And then the second thread actually has this extra memory sitting around. But it's not using it, because it's only doing the frees. It's not doing allocations. And recall that the definition of blow up is how much more space you're using compared to a serial execution of the program.

If you executed this program on a single core, you would only have a single heap that does the allocations and frees. So you're not going to-- your memory isn't going to blow up. It's just going to be constant over time. Whereas if you use two threads to execute this, the memory could just keep growing over time. Yes?

STUDENT: [INAUDIBLE]

JULIAN SHUN: So if you remember the binned-free list approach, let's say we're using that. Then all you have to do is set some pointers in your binned-free list data structure, as well as in the block that you're freeing, so that it appears in one of the linked lists. So you can do that even if some other processor allocated that block.

OK, so what's another strategy that can avoid this issue of memory drift? Yes?

STUDENT: Periodically shuffle the free memory that's being used on different heaps.

JULIAN SHUN: Yeah. So that's a good idea. You could periodically rebalance the memory. What's a simpler approach to solve this problem? Yes?

STUDENT: Make it all know all of the free memory?

JULIAN SHUN: Sorry, could you repeat that?

STUDENT: Make them all know all of the free memory?

JULIAN SHUN: Yes. So you could have all of the processors know all the free memory. And then every time it grabs something, it looks in all the other heaps. That does require a lot of synchronization overhead. Might not perform that well. What's an easier way to solve this problem? Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: So you could restructure your program so that the same thread does the allocation and frees for the same memory block. But what if you didn't want to restructure your program? How can you change the allocator? So we want the behavior that you said, but we don't want to change our program. Yes.

STUDENT: You could have a single free list that's protected by synchronization.

JULIAN SHUN: Yeah, so you could have a single free list. But that gets back to the first strategy of having a global heap. And then you have high synchronization overheads. Yes.

STUDENT: You could have the free map to the thread that it came from or for the pointer that corresponds to-- that allocated it.

JULIAN SHUN: So you're saying free back to the thread that allocated it? Yes, so that's exactly right. So here each object, when you allocate it, is labeled with an owner. And then whenever you free it, you return it back to the owner.

So the objects that are allocated will eventually go back to the owner's heap if they're not in use. And they're not going to be lying around free in somebody else's heap. The advantage of this approach is that you get fast allocation and freeing of local objects.

Local objects are objects that you allocated. However, freeing remote objects requires some synchronization, because you have to coordinate with the other thread's heap that you're sending the memory object back to. But this synchronization isn't as bad as having a global heap, since you only have to talk to one other thread in this case.

You can also bound the blow up by p. So the reason why the blow up is upper bounded by p is that, let's say the serial allocator uses at most x memory. In this case, each of the heaps can use at most x memory, because that's how much the serial program would have used. And you have p of these heaps, so overall you're using p times x memory. And therefore the ratio is upper bounded by p. Yes?

STUDENT: [INAUDIBLE]

JULIAN SHUN: So when you free an object, it goes-- if you allocated that object, it goes back to your own heap. If your heap is empty, it's actually going to get more memory from the operating system. It's not going to take something from another thread's heap.

But the maximum amount of memory that you're going to allocate is going to be upper bounded by x. Because the sequential serial program took that much.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Yeah. So the upper bound for the blow up is p. Another advantage of this approach is that it has resilience to false sharing. So let me just talk a little bit about false sharing. So true sharing is when two processors are trying to access the same memory location.

And false sharing is when multiple processors are accessing different memory locations, but those locations happen to be on the same cache line. So here's an example. Let's say we have two variables, x and y. And the compiler happens to place x and y on the same cache line.

Now, when the first processor writes to x, it's going to bring this cache line into its cache. When the other processor writes to y, since y is on the same cache line, it's going to bring this cache line into its own cache and invalidate it in the first processor's cache. And then when the first processor writes to x again, it's going to bring this cache line back into the first processor's cache.

And then you can keep-- you can see this phenomenon keep happening. So here, even though the processors are writing to different memory locations, because they happen to be on the same cache line, the cache line is going to be bouncing back and forth on the machine between the different processors' caches. And this problem gets worse if more processors are accessing this cache line.

So in this-- this can be quite hard to debug. Because if you're using just variables on the stack, you don't actually know necessarily where the compiler is going to place these memory locations. So the compiler could just happen to place x and y in the same cache block. And then you'll get this performance hit, even though it seems like you're accessing different memory locations.

If you're using the heap for memory allocation, you have more knowledge. Because if you allocate a huge block, you know that all of the memory locations in it are contiguous in memory. So you can space the accesses far enough apart so that different processors aren't going to touch the same cache line.

A more general approach is that you can actually pad the object. So first, you can align the object on a cache line boundary. And then you pad out the remaining memory locations of the object so that it fills up the entire cache line. And now there's only one object on that cache line. But this does lead to a waste of space because of the padding.
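Here's a rough sketch of the align-and-pad trick in C11, assuming a 64-byte cache line (the actual line size is machine-dependent):

```c
#include <stdalign.h>

#define CACHE_LINE 64   // assumed cache-line size

// Each counter gets its own cache line: aligned to a 64-byte boundary and
// padded out so the next element starts on a new line.
typedef struct {
    alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

// One counter per thread; writes by different threads no longer share a line.
padded_counter_t counters[8];
```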

So a program can induce false sharing by having different threads process nearby objects, both on the stack and on the heap. And an allocator can also induce false sharing in two ways. It can actively induce false sharing. And this is when the allocator satisfies memory requests from different threads using the same cache block.

And it can also do this passively. This is when the program passes objects lying on the same cache line to different threads, and the allocator reuses the objects' storage after the objects are freed to satisfy requests from those different threads. And the local ownership approach tends to reduce false sharing, because the thread that allocates an object is eventually going to get it back. You're not going to have an object permanently split among multiple processors' heaps.

So even if you see false sharing in local ownership, it's usually temporary. Eventually it's going-- the object is going to go back to the heap that it was allocated from, and the false sharing is going to go away. Yes?

STUDENT: Are the local heaps just three to five regions in [INAUDIBLE]?

JULIAN SHUN: I mean, you can implement it in various ways. You can have each one of them use a binned-free list allocator, so there's no restriction on where they have to appear in physical memory. There are many different ways to do it; you can basically plug in any serial allocator for the local heap.

So let's go back to parallel heap allocation. So I talked about three approaches already. Here's a fourth approach. This is called the Hoard allocator. And this was actually a pretty good allocator when it was introduced almost two decades ago. And it's inspired a lot of further research on how to improve parallel memory allocation.

So let me talk about how this works. So in the Hoard allocator, we're going to have p local heaps. But we're also going to have a global heap. The memory is going to be organized into large super blocks of size s. And s is usually a multiple of the page size. So this is the granularity at which memory is going to be moved around in the allocator.

And then you can move super blocks between the local heaps and the global heap. So when a local heap has a lot of super blocks that are not being fully used, you can move some of them to the global heap. And then when a local heap doesn't have enough memory, it can go to the global heap to get more memory. And then when the global heap doesn't have any more memory, it gets more memory from the operating system.

So this is sort of a combination of the approaches that we saw before. The advantages are that this is a pretty fast allocator. It's also scalable. As you add more processors, the performance improves. You can also bound the blow up. And it also has resilience to false sharing, because it's using local heaps.

So let's look at how an allocation using the Hoard allocator works. So let's just assume without loss of generality that all the blocks are the same size, so we have fixed-size allocation. So let's say we call malloc in our program. And let's say thread i calls the malloc.

So what we're going to do is we're going to check if there is a free object in heap i that can satisfy this request. And if so, we're going to get an object from the fullest non-full super block in i's heap. Does anyone know why we want to get the object from the fullest non-full super block? Yes.

STUDENT: [INAUDIBLE]

JULIAN SHUN: Right. So when a super block needs to be moved, it's as dense as possible. And more importantly, this is to reduce external fragmentation. Because as we saw in the last lecture, if you skew the distribution of allocated memory objects to as few pages, or in this case, as few super blocks as possible, that reduces your external fragmentation.

OK, so if it finds it in its own heap, then it's going to allocate an object from there. Otherwise, it's going to check the global heap. And if there's something in the global heap-- so here it says, if the global heap is empty, then it's going to get a new super block from the OS. Otherwise, we can get a super block from the global heap, and then use that one.

And then finally we set the owner of the super block we got, either from the OS or from the global heap, to i, and then we return a free object from it to the program. So this is how malloc works using the Hoard allocator. And now let's look at Hoard deallocation.
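Here's a paraphrase of that allocation logic as a C sketch. Every type and helper function below is a hypothetical placeholder naming a step on the slide, not the real Hoard API:

```c
// Sketch of Hoard-style allocation for thread i, assuming fixed-size objects.
typedef struct superblock superblock_t;
typedef struct heap heap_t;

extern heap_t *heaps[];                                   // the p local heaps
extern void *pop_from_fullest_nonfull_superblock(heap_t *h);
extern int   global_heap_is_empty(void);
extern superblock_t *new_superblock_from_os(void);        // e.g., via mmap
extern superblock_t *take_superblock_from_global_heap(void);
extern void  move_superblock_to_heap(superblock_t *sb, heap_t *h, int owner);

void *hoard_malloc(int i) {
    // 1. Try the local heap: take an object from the fullest non-full super
    //    block, which keeps super blocks dense (less external fragmentation).
    void *x = pop_from_fullest_nonfull_superblock(heaps[i]);
    if (x != NULL)
        return x;

    // 2. Nothing free locally: get a super block from the global heap, or
    //    from the OS if the global heap is empty, and make thread i its owner.
    superblock_t *sb = global_heap_is_empty()
                         ? new_superblock_from_os()
                         : take_superblock_from_global_heap();
    move_superblock_to_heap(sb, heaps[i], i);

    // 3. Now the local heap has free objects; satisfy the request from it.
    return pop_from_fullest_nonfull_superblock(heaps[i]);
}
```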

Let u sub i be the in-use storage in heap i. This is the heap for thread i. And let a sub i be the storage owned by heap i. The Hoard allocator maintains the following invariant for all heaps i: u sub i is always going to be greater than or equal to the min of a sub i minus 2 times s and a sub i over 2, where s is the super block size.

So how it implements this is as follows. When we call free of x, let's say x is owned by thread i. Then we're going to put x back into heap i, and then we're going to check if the in-use storage in heap i, u sub i, is less than the min of a sub i minus 2 s and a sub i over 2.

And what this condition says, if it's true, is that your heap is at most half utilized. Because if u sub i is smaller than this, it has to be smaller than a sub i over 2. That means there's twice as much storage allocated as is being used in the local heap i. And therefore there must be some super block that's at least half empty. And you move that super block, or one of those super blocks, to the global heap.
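Written as formulas (with $u_i$ the in-use storage in heap $i$, $a_i$ the storage owned by heap $i$, and $S$ the super block size), the invariant and the free rule just described are:

```latex
u_i \;\ge\; \min\!\left(a_i - 2S,\; \frac{a_i}{2}\right),
\qquad
\text{on free: if } u_i < \min\!\left(a_i - 2S,\; \frac{a_i}{2}\right),
\text{ move a half-empty super block from heap } i \text{ to the global heap.}
```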

So any questions on how the allocation and deallocation work? So since we're maintaining this invariant, it's going to allow us to prove a bound on the blow up. And I'll show you that on the next slide. But before I go on, are there any questions?

OK, so let's look at how we can bound the blow up of the Hoard allocator. So there is actually a lemma that we're going to use and not prove. The lemma is that the maximum storage allocated in the global heap is at most the maximum storage allocated in the local heaps. So we just need to analyze how much storage is allocated in the local heaps. Because the total amount of storage is going to be at most twice as much, since the global heap storage is dominated by the local heap storage.

So you can prove this lemma by case analysis. The Hoard paper is available on Learning Modules, and you're free to look at it if you want to see how this is proved. But here I'm just going to use this lemma to prove this theorem, which says: let u be the user footprint for a program, and let a be Hoard's allocator footprint.

We have that a is upper bounded by order u plus s p. And therefore, a divided by u, which is the blow up, is going to be 1 plus order s p divided by u. OK, so let's see how this proof works. So we're just going to analyze the storage in the local heaps.

Now recall that we're always satisfying this invariant here, where u sub i is greater than or equal to the min of a sub i minus 2 s and a sub i over 2. So the first term says that we can have 2 s unutilized storage per heap. So it's basically giving two super blocks for free to each heap, which they don't have to be using.

And therefore, the total amount of storage contributed by the first term is going to be order s p, because each processor can have up to 2 s unutilized storage. So that's where the second term in the bound comes from. And the second term of the invariant, a sub i over 2, gives us the first term of the bound, order u.

This says that the allocated storage is at most twice the used storage, so if you sum up across all the processors, there's a total of order u storage allocated.
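Roughly, the calculation goes like this: the invariant says that for each heap either $a_i \le u_i + 2S$ or $a_i \le 2u_i$, so in either case $a_i \le 2u_i + 2S$; summing over the $P$ local heaps (and using the lemma that the global heap is dominated by the local heaps) gives:

```latex
A \;=\; O\!\left(\sum_{i=1}^{P} a_i\right)
  \;\le\; O\!\left(\sum_{i=1}^{P} (2u_i + 2S)\right)
  \;=\; O(u + SP),
\qquad\text{so}\qquad
\frac{A}{u} \;=\; 1 + O\!\left(\frac{SP}{u}\right).
```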

OK, so that's the proof of the blow up bound for Hoard. And this is pretty good. It's 1 plus some lower order term. OK, so now these are some other allocators that people use. So jemalloc is a pretty popular one. It has a few differences from Hoard. It has a separate global lock for each different allocation size.

It allocates the object with the smallest address among all the objects of the requested size. And it releases empty pages using madvise, which I talked about earlier. And it's pretty popular because it has good performance, and it's pretty robust to different allocation traces. There's also another one called SuperMalloc, which is an up-and-coming contender. It was developed by Bradley Kuszmaul.

Here are some allocator speeds for the allocators that we looked at, for a particular benchmark. And for this particular benchmark, we can see that SuperMalloc actually does really well. It's more than three times faster than jemalloc, and jemalloc is more than twice as fast as Hoard. And then the default allocator, which uses a global heap, is pretty slow because it can't get good speedup.

And all these experiments are on 32 threads. I also have the lines of code. So we see that SuperMalloc actually has very few lines of code compared to the other allocators. So it's relatively simple. OK, so I also have some slides on garbage collection. But since we're out of time, I'll just put these slides online and you can read them.