Lecture 13: The Cilk Runtime System

Description: This class is focused on the Cilk runtime system. It stresses the work-first principle: optimize for ordinary serial execution at the expense of additional computation in steals. There is an emphasis on performance models and parallel runtime systems.

Instructor: Tao B. Schardl

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Hey, everybody. It's my pleasure once again to welcome TB Schardl, who is the author of your Tapir compiler, to talk about the Cilk runtime system.

TAO SCHARDL: Thanks, Charles. Can anyone hear me in the back, seem good? OK. Thanks for the introduction. Today I'll be talking about the Cilk runtime system. This is pretty exciting for me. This is a lecture that's not about compilers. I get to talk about something a little different for once. It should be a fun lecture. Recently, as I understand it, you've been looking at storage allocation, both in the serial case as well as the parallel case.

And you've already done Cilk programming for a while, at this point. This lecture, honestly, is a bit of a non sequitur in terms of the overall flow of the course. And it's also an advanced topic. The Cilk runtime system is a pretty complicated piece of software. But nevertheless, I believe you should have enough background to at least start to understand and appreciate some of the aspects of the design of the Cilk runtime system.

So that's why we're talking about that today. Just to quickly recall something that you're all, I'm sure, intimately familiar with by this point, what's Cilk programming all about? Well, Cilk is a parallel programming language that allows you to make your software run faster using parallel processors. And to use Cilk, it's pretty straightforward. You may start with some serial code that runs in some running time-- we'll denote that as Ts for certain parts of the lecture.

If you wanted to run in parallel using Cilk, you just insert Cilk keywords in choice locations. For example, you can parallelize the outer loop in this matrix multiply kernel, and that will let your code run in time Tp on P processors. And ideally, Tp should be less than Ts. Now, just adding keywords is all you need to do to tell Cilk to execute the computation in parallel. What does Cilk do in light of those keywords?

At a very high level, Cilk, and specifically its runtime system, takes care of the task of scheduling and load balancing the computation on the parallel processors and on the multicore system in general. So after you've denoted logical parallelism in the program using Cilk spawn, Cilk sync, and Cilk for, the Cilk scheduler maps that computation onto the processors.

And it does so dynamically at runtime, based on whatever processing resources happen to be available. Cilk uses a randomized work-stealing scheduler, which guarantees that that mapping is efficient and the execution runs efficiently. Now you've all been using the Cilk platform for a while. In its basic usage, you write some Cilk code, possibly by parallelizing ordinary serial code, you feed that to a compiler, you get a binary, you run the binary with some particular input on a multicore system.

You get parallel performance. Today, we're going to look at how exactly does Cilk work? What's the magic that goes on, hidden by the boxes on this diagram? And the very first thing to note is that this picture is a little bit-- the first simplification that we're going to break is that it's not really just Cilk source and the Cilk compiler. There's also a runtime system library, libcilkrts.so, in case you've seen that file or messages about that file on your system.

And really it's the compiler and the runtime library, that work together to implement Cilk's runtime system, to do the work stealing and do the efficient scheduling and load balancing. Now we might suspect that if you just take a look at the code that you get when you compile a Cilk program, that might tell you something about how Cilk works. Here's C pseudocode for the results when you compile a simple piece of Cilk code. It's a bit complicated. I think that's fair to say.

There's a lot going on here. There is one function in the original program, now there are two. There's some new variables, there's some calls to functions that look a little bit strange, there's a lot going on in the compiled results. This isn't exactly easy to interpret or understand, and this doesn't even bring into the picture the runtime system library. The runtime system library, you can find the source code online. It's a little less than 20,000 lines of code. It's also kind of complicated.

So rather than dive into the code directly, what we're going to do today is an attempt at a top-down approach to understanding how the Cilk runtime system works, and some of the design considerations. So we're going to start by talking about some of the required functionality that we need out of the Cilk runtime system, as well as some performance considerations for how the runtime system should work.

And then we'll take a look at how the worker deques in Cilk get implemented, how spawning actually works, how stealing a computation works, and how synchronization works within Cilk. That all sound good? Any questions so far? This should all be review, more or less. OK, so let's talk a little bit about required functionality. You've seen this picture before, I hope. This picture illustrated the execution model of a Cilk program. Here we have everyone's favorite exponential time Fibonacci routine, parallelized using Cilk.

This is not an efficient way to compute Fibonacci numbers, but it's a nice didactic example for understanding parallel computation, especially the Cilk model. And as we saw many lectures ago, when you run this program on a given input, the execution of the program can be modeled as a computation dag. And this computation dag unfolds dynamically as the program executes.
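For reference, the routine on the slide can be sketched in Python as a serial stand-in. Python has no Cilk keywords, so the places where the Cilk version would spawn and sync are only marked with comments; this is the serial elision of the program, and the helper `count_frames` is a hypothetical addition that counts how many fib frames the dag creates.

```python
def fib(n):
    """Exponential-time Fibonacci, written as the serial elision of the Cilk routine."""
    if n < 2:
        return n          # base case: no spawn happens here
    x = fib(n - 1)        # in Cilk this call would be spawned (cilk_spawn)
    y = fib(n - 2)        # the continuation of the spawn: an ordinary call
    # in Cilk a cilk_sync would go here, before x is read
    return x + y


def count_frames(n):
    """Number of fib frames created while computing fib(n)."""
    if n < 2:
        return 1
    return 1 + count_frames(n - 1) + count_frames(n - 2)


if __name__ == "__main__":
    print(fib(4), count_frames(4))
```

Run serially, this behaves exactly like the one-processor execution discussed next: every spawn is treated as an ordinary call.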

But I want to stop and take a hard look at exactly what that dynamic execution looks like when we've got parallel processors and work stealing all coming into play. So we'll stick with this Fibonacci routine, and we'll imagine we've just got one processor on the system, to start. And we're just going to use this one processor to execute fib(4). And it's going to take some time to do it, just to make the story interesting.

So we start executing this computation, and that one processor is just going to execute the Fibonacci routine from beginning up to the Cilk spawn statement, as if it's ordinary serial code, because it is ordinary serial code. At this point the processor hits the Cilk spawn statement. What happens now? Anyone remember? What happens to the dag?

AUDIENCE: It branches down [INAUDIBLE]

TAO SCHARDL: It branches downward and spawns another process, more or less. The way we model that-- the Cilk spawn is of a routine fib of n minus 1. In this case, that'll be fib(3). And so, like an ordinary function call, we're going to get a brand new frame for fib(3). And that's going to have some strand that's available to execute. But the spawn is not your typical function call. It actually allows some other computation to run in parallel.

And so the way we model that in this picture is that we get a new frame for fib(3). There's a strand available to execute there. And the continuation, the green strand, is now available in the frame fib(4). But no one's necessarily executing it. It's just kind of faded in the picture. So once the spawn has occurred, what's the processor going to do? The processor is actually going to dive in and start executing fib(3), as if it were an ordinary function call.

Yes, there's a strand available within the frame of fib(4), but the processor isn't going to worry about that strand. It's just going to say, oh, fib(4) calls fib(3), going to start computing for fib(3). Sound good? And so the processor dives down from pink strand to pink strand. The instruction pointer for the processor returns to the beginning of the fib routine, because we're now calling fib once again. And this process repeats.

It executes the pink strand up until the Cilk spawn, just like ordinary serial code. The spawn occurs-- and we've already seen this picture before-- the spawn allows another strand to execute in parallel. But it also creates a frame for fib(2). And the processor dives into fib(2), resetting the instruction pointer to the beginning of fib. P1 executes up to the spawn. Once again, we get another strand to execute, as well as an invocation of fib(1). Processor dives even further.

So that's fine. This is just the processor doing more or less ordinary serial execution of this fib routine, but it's also allowing some strands to be executed in parallel. This is the one processor situation, looks pretty good so far. Right, and in the fib(1) case, it doesn't make it as far through the pink strand because, in fact, we hit the base case. But now let's bring in some more processors. Suppose that another processor finally shows up, says I'm bored, I want to do some work, and decides to steal some computation.

It's going to discover the green strand in the frame fib(4), and P2 is just going to jump in there and start executing that strand. And if we think really hard about what this means, P2 is another processor on the system. It has its own set of registers. It has its own instruction pointer. And so what Cilk somehow allows to happen is for P2 to just jump right into the middle of this fib(4) routine, which is already executing. It just sets the instruction pointer to point at that green instruction, at the call to fib of n minus 2.

And it's just going to pick up where processor 1 left off, when it executed up to this point in fib(4), somehow. In this case, it executes fib of n minus 2. That calls fib(2), creates a new strand, it's just an ordinary function call. It's going to descend into that new frame. It's going to return to the beginning of fib. All that's well and good. Another processor might come along and steal another piece of the computation. It steals another green strand, and so once again, this processor needs to jump into the middle of an executing function.

Its instruction pointer is just going to point at this call of the fib of n minus 2. Somehow, it's going to have the state of this executing function available, despite having independent registers. And it needs to just start from this location, with all the parameters set appropriately, and start executing this function as if it's an ordinary function. It calls fib(1), since 3 minus 2 is 1. And now these processors might start executing in parallel.

P1 might return from its base case routine up to the parent call of fib of n minus 2 and start executing its continuation, because that wasn't stolen. Meanwhile, P3 descends into the execution of fib(1). And then in another step, P3 and P2 make some progress executing their computation. P2 encounters a Cilk spawn statement, which creates a new frame and allows another strand to execute in parallel. P3 encounters the base case routine and says, OK, it's time to return.

And all of that can happen in parallel, and somehow the Cilk system has to coordinate all of this. But we already have one mystery. How does a processor start executing from the middle of a running function? The running function and its state lived on P1 initially, and then P2 and P3 somehow find that state, hop into the middle of the function, and just start running. That's kind of strange. How does that happen? How does the Cilk runtime system make that happen? This is one thing to consider.

Another thing to consider is what happens when we hit a sync. We'll talk about how these issues get addressed later on, but let's lay out all of the considerations upfront, before we-- just see how bad the problem is before we try to solve it bit by bit. So now, let's take this picture again and progress it a little bit further. Let's suppose that processor three decides to execute the return. It's going to return to an invocation of fib(3). And the statement it returns to is a Cilk sync statement.

But processor three can't execute the sync because the computation of fib(2) in this case-- that's being done by processor one-- that computation is not done yet. So the execution can't proceed past the sync. So somehow P3 needs to say, OK, there is a sync statement, but we can't execute beyond this point because, specifically, it's waiting on processor one. It doesn't care what processor two is doing. Processor two is having a dandy time executing fib(2) on the other side of the tree. Processor three shouldn't care.

So processor three can't do something like, OK, all processors need to stop, get to this point in the code, and then the execution can proceed. No, no, it just needs to wait on processor one. Somehow the Cilk system has to allow that fine-grained synchronization to happen in this nested pattern. So how does a Cilk sync wait on only the nested subcomputations within the program? How does it figure out how to do that? How does the Cilk runtime system implement this?
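One way to picture this requirement is a per-frame counter of outstanding spawned children: a sync in a frame looks only at that frame's own counter, never at what other workers are doing elsewhere in the tree. The Python sketch below is a toy model under that assumption. The class and method names are hypothetical, and it is not a description of how the Cilk runtime actually implements a sync.

```python
class Frame:
    """Toy model of a function instantiation that may spawn children."""

    def __init__(self, name):
        self.name = name
        self.outstanding = 0   # spawned children that have not returned yet

    def spawn(self, child_name):
        """Record a spawned child and return its new frame."""
        self.outstanding += 1
        return Frame(child_name)

    def child_returned(self):
        """Called when one spawned child of this frame finishes."""
        self.outstanding -= 1

    def can_pass_sync(self):
        """A sync depends only on this frame's own spawned children."""
        return self.outstanding == 0


if __name__ == "__main__":
    fib4 = Frame("fib(4)")
    fib3 = fib4.spawn("fib(3)")
    fib3.spawn("fib(2)")              # still running on another worker
    print(fib3.can_pass_sync())       # fib(3) must wait for its child fib(2)
    fib3.child_returned()             # fib(2) finishes
    print(fib3.can_pass_sync())       # now fib(3) may proceed past its sync
```

In this model the sync in fib(3) is unaffected by anything happening under fib(4)'s continuation, which is the nested behavior the question above asks for.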

So that's another consideration. OK, so at this point, we have three top level considerations. A single worker needs to be able to execute this program as if it's an ordinary serial program. Thieves have to be able to jump into the middle of executing functions and pick up from where they left off, from where other processors in the system left off. Syncs have to be able to stall functions appropriately, based only on those functions' nested child sub computations.

So we have three big considerations that we need to pick apart so far. That's not the whole story, though. Any ideas what other functionality we need to worry about, for implementing this Cilk system? It's kind of an open ended question, but any thoughts? We have serial execution, spawning, stealing, and syncing as top level concerns. Anyone remember some other features of Cilk that the runtime system magically makes happen, correctly? It's probably been a while since you've seen those. Yeah.

AUDIENCE: Cilk for loops divide and conquer?

TAO SCHARDL: The Cilk for loops divide and conquer. Somehow, the runtime system does have to implement Cilk fors. The Cilk fors end up getting implemented internally, with spawns and syncs. That's courtesy of the compiler. Yeah, courtesy of the compiler. So we won't look too hard at Cilk fors today, but that's definitely one concern. Good observation. Any other thoughts, sort of low level system details that Cilk needs to implement correctly?

Cache coherence-- it actually doesn't need to worry too much about cache coherence although, given the latest performance numbers I've seen from Cilk, maybe it should worry more about the cache. But it turns out the hardware does a pretty good job maintaining the cache coherence protocol itself. But good guess. It's not really a tough question, because it's really just calling back memories of old lectures.

I think you recently had a quiz on this material, so it's probably safe to say that all that material has been paged out of your brain at this point. So I'll just spoil the fun for you. Cilk has a notion of a cactus stack. So we talked a little bit about processors jumping into the middle of an executing function and somehow having the state of that function available. One consideration is registered state, but another consideration is the stack itself.

And Cilk supports C's rule for pointers, namely that children can see pointers into parent frames, but parents can't see pointers into child frames. Now each processor, each worker in a Cilk system, needs to have its own view of the stack. But those views aren't necessarily independent. In this picture, all five processors share the same view of the frame for the instantiation of function A, then processors three through five all share the same view for the instantiation of C.
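The shared-prefix property of these views can be sketched with frames that each hold a pointer to their parent. The tree shape below (A with children B and C, and C with children D and E) is an assumption chosen so that five workers get five views; the names `Frame` and `stack_view` are hypothetical.

```python
class Frame:
    """A function instantiation with a pointer to its parent frame."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # None for the root of the cactus stack


def stack_view(frame):
    """Return the names of the frames visible from `frame`, root first."""
    view = []
    while frame is not None:
        view.append(frame.name)
        frame = frame.parent
    return view[::-1]


if __name__ == "__main__":
    a = Frame("A")
    b = Frame("B", a)
    c = Frame("C", a)
    d = Frame("D", c)
    e = Frame("E", c)
    # One worker per frame; every view starts with A, and the last three share C.
    for worker, leaf in enumerate([a, b, c, d, e], start=1):
        print(f"P{worker}:", stack_view(leaf))
```

Each worker sees what looks like an ordinary linear stack, yet the frames near the root are the same objects in every view.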

So somehow, Cilk has to make all of those views available and consistent, but not quite the same sort of consistent as we get with cache coherence. Cilk somehow has to implement this cactus stack. So that's another consideration that we have to worry about. And then there's one more kind of funny detail. If we take another look at work stealing itself-- you may remember we had this picture from several lectures ago where we have processors on the system, each maintains its own deque of frames, and workers are allowed to steal frames from each other.

But if we take a look at how this all unfolds, yes we may have a processor that performs a call, and that'll push another frame for a called function onto its deque on the bottom. It may spawn, and that'll push a spawn frame onto the bottom of its deque. But if we fast forward a little bit and we end up with a worker with nothing to do, that worker is going to go ahead and steal, picking another worker in the system at random. And it's going to steal from the top of the deque.

But it's not just going to steal the topmost item on the deque. It's actually going to steal a chunk of items from the deque. In particular, if it selects the third processor in this picture, third from the left, this thief is going to steal everything through the parent of the next spawned frame. It needs to take this whole stack of frames, and it's not clear a priori how many frames the worker is going to have to steal in this case. But nevertheless, it needs to take all those frames and resume execution.
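The chunk rule can be sketched as a small function over a list. This is a toy model with several assumptions: the deque is a Python list ordered from top to bottom, each entry is a hypothetical `(name, kind)` pair where `kind` records whether the frame was created by a call or a spawn, and there is no synchronization between thief and victim.

```python
def steal_from_top(deque):
    """Remove and return the frames a thief takes from the top of `deque`.

    The thief takes everything from the top through the parent of the next
    spawned frame, i.e. every frame above the first spawned frame found
    below the topmost entry. If no such spawned frame exists, nothing is
    available to steal and an empty list is returned.
    """
    for i in range(1, len(deque)):
        if deque[i][1] == "spawn":
            stolen = deque[:i]
            del deque[:i]
            return stolen
    return []


if __name__ == "__main__":
    victim = [
        ("main", "call"),
        ("foo", "call"),
        ("bar", "spawn"),
        ("baz", "call"),
        ("qux", "spawn"),
    ]
    print(steal_from_top(victim))   # first steal takes main and foo
    print(victim)                   # the victim keeps bar, baz, and qux
```

The number of frames taken depends on where the next spawned frame happens to sit, which is why it is not known in advance.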

After all, that bottom was a call frame that it just stole. That's where there's a continuation with work available to be done in parallel. And so, if we think about it, there are a lot of questions that arise. What's involved in stealing frames? What synchronization does this system have to implement? What happens to the stack? It looks like we just shifted some frames from one processor to another, but the first processor, the victim, still needs access to the data in that stack. So how does that part work, and how does any of this actually become efficient?

So now we have a pretty decent list of functionality that we need out of the Cilk runtime system. We need serial execution to work. We need thieves to be able to jump into the middle of running functions. We need syncs to synchronize in this nested, fine-grained way. We need to implement a cactus stack for all the workers to see. And thieves have to deal with mixtures of spawned frames and called frames that may be available when they steal a computation.

So that's a bunch of considerations. Is this the whole picture? Well, there's a little bit more to it than that. So before I give you any answers, I'm just going to keep raising questions. And now I want to raise some questions concerning the performance of the system. How do we want to design the system to get good parallel execution times?

Well, if we take a look at the work stealing bounds for Cilk, Cilk's work-stealing scheduler achieves an expected running time of Tp, on P processors, which is proportional to the work of the computation divided by the number of processors, plus something on the order of the span of the computation. Now if we take a look at this running time bound, we can decompose it into two pieces. The T1 over P part, that's really the time that the parallel workers on the system spend doing actual work.

There are P of those workers, they're all making progress on the work of the computation. That comes out to T1 over P. The other part of the bound, order T infinity, that's a time that turns out to be the time that workers spend stealing computation from each other. And ideally, what we want when we parallelize a program using Cilk, is we want to see this program achieve linear speedup. That means that if we give the program more processors to run, if we increase P, we want to see the execution time decrease, linearly, with P.

And that means we want the workers in the Cilk system to spend most of the time doing useful work. We don't want the workers spending a lot of time stealing from each other. In fact, we want even more than this. We don't just want work divided by number of processors. We really care about how the performance compares to the running time of the original serial code that we were given, that we parallelized.
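The two-part bound can be evaluated numerically. In the sketch below, the parameter `c` stands in for the constant hidden inside the order-T-infinity term and is an assumption, as are the function names and the example numbers.

```python
def work_stealing_bound(t1, t_inf, p, c=1.0):
    """Estimate Tp as T1/P (useful work) plus c * T_infinity (time spent on steals)."""
    return t1 / p + c * t_inf


def estimated_speedup(t1, t_inf, p, c=1.0):
    """Speedup over the one-processor running time T1 under the same estimate."""
    return t1 / work_stealing_bound(t1, t_inf, p, c)


if __name__ == "__main__":
    t1, t_inf = 1000.0, 10.0   # hypothetical work and span
    for p in (1, 2, 4, 8, 16):
        print(p, work_stealing_bound(t1, t_inf, p), round(estimated_speedup(t1, t_inf, p), 2))
```

As long as T1 over P dominates the span term, doubling P roughly halves the estimate; once the span term dominates, adding processors stops helping.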

That original serial code ran in time Ts. And now we parallelize it using Cilk spawn, Cilk sync, or in this case, Cilk for. And ideally, with sufficient parallelism, we'll guarantee that the running time Tp is going to be proportional to the work on one processor, T1, divided by P. But we really want speedup compared to Ts. So that's our goal. We want Tp to be proportional to Ts over P. That says that we want the serial running time to be pretty close to the work of the parallel computation.

So the one processor running time of our Cilk code, ideally, should look pretty close to the running time of the original serial code. So just to put these pieces together, if we were originally given a serial program that ran in time Ts, and we parallelize it using Cilk, we end up with a parallel program with work T1 and span T infinity. We want to achieve linear speedup on P processors, compared to the original serial running time.

In order to do that, we need two things. We need ample parallelism. T1 over T infinity should be a lot bigger than P. And we've seen why that's the case in lectures past. We also want what's called high work efficiency. We want the ratio of the serial running time divided by the work of the Cilk computation to be pretty close to one, as close as possible. Now, the Cilk runtime system is designed with these two observations in mind.

And in particular, the Cilk runtime system says, suppose that we have a Cilk program that has ample parallelism. It has sufficient parallelism to make good use of the available parallel processors. Then in implementing the Cilk runtime, we have a goal to maintain high work efficiency. And to maintain high work efficiency, the Cilk runtime system abides by what's called the work-first principle, which is to optimize the ordinary serial execution of the program, even at the expense of some additional cost to steals.
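The two conditions can be written as a quick check. The thresholds below (`slack` for how much bigger than P the parallelism should be, and `min_efficiency` for how close to one the work efficiency should be) are arbitrary assumptions for illustration, as are the function names.

```python
def parallelism(t1, t_inf):
    """Parallelism of the computation: work divided by span."""
    return t1 / t_inf


def work_efficiency(ts, t1):
    """Serial running time divided by the work of the parallel program."""
    return ts / t1


def expects_linear_speedup(ts, t1, t_inf, p, slack=10.0, min_efficiency=0.9):
    """True when parallelism is well above P and work efficiency is near one."""
    ample = parallelism(t1, t_inf) >= slack * p
    efficient = work_efficiency(ts, t1) >= min_efficiency
    return ample and efficient


if __name__ == "__main__":
    # Hypothetical measurements: Ts = 95, T1 = 100, T_infinity = 1, on 8 processors.
    print(expects_linear_speedup(ts=95, t1=100, t_inf=1, p=8))
```

A program can fail either test independently: plenty of parallelism does not help if the one-processor Cilk run is much slower than the original serial code.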

Now at 30,000 feet, the way that the Cilk runtime system implements the work first principle and makes all these components work is by dividing the job between both the compiler and the runtime system library. The compiler uses a handful of small data structures, including workers and stack frames, and implements optimized fast paths for execution of functions, which should be executed when no steals occur.

The runtime system library handles issues with the parallel execution. It uses larger data structures that maintain parallel runtime state. And it handles slower paths of execution, in particular when steals actually occur. So those are all the considerations. We have a lot of functionality requirements and we have some performance considerations. We want to optimize the work, even at the expense of some steals. Let's finally take a look at how Cilk works. How do we deal with all these problems?

I imagine some of you may have some ideas as to how you might tackle one issue or another, but let's see what really happens. Let's start from the beginning. How do we implement a worker deque? Now for this discussion, we're going to use a running example with just a really, really simple Cilk routine. It's not even as complicated as fib. We're going to have a function foo that, at one point, spawns a function bar, in the continuation calls baz, performs a sync, and then returns.

And just to establish some terminology, foo will be what we call a spawning function, meaning that foo is capable of executing a Cilk spawn statement. The function bar is spawned by foo. We can see that from the Cilk spawn in front of bar. And the call to baz occurs in the continuation of that Cilk spawn, simple picture. Everyone good so far? Any questions about the functionality requirements, terminology, performance considerations? OK.

So now we're going to take a hard look at just one worker and we're going to say, conceptually, we have this deque-like structure which has spawned frames and called frames. Let's ignore the rest of the workers on the system. Let's not worry about-- well, we'll worry a little bit about how steals can work, but we're just going to focus on the actions that one worker performs. How do we implement this deque?

And we want the worker to operate on its own deque, a lot like a stack. It's going to push and pop frames from the bottom of the deque. Steals need to be able to transfer ownership of several consecutive frames from the top of the deque. And thieves need to be able to resume a continuation. So the way that the Cilk system does this, to bring this concept into an implementation, is that it's going to implement the deque externally from the actual call stack.

Those frames will still be in a stack somewhere and they'll be managed, roughly speaking, with a standard calling convention. But the worker is going to maintain a separate deque data structure, which will contain pointers into this stack. And the worker itself will maintain the deque using head and tail pointers. Now in addition to this picture, the frames that are available to be stolen-- the frames that have computation that a thief can come along and execute-- those frames will store an additional local structure that will contain information as necessary for stealing to occur.
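A deque kept outside the call stack and managed with head and tail indices can be sketched as follows. This is a single-threaded toy: it has no synchronization between the owner and thieves, its `steal` takes one entry at a time rather than a run of consecutive frames, and the class name, method names, and fixed capacity are all assumptions.

```python
class WorkerDeque:
    """Array of frame references with a head (top) and a tail (bottom)."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity   # references to frames living on the call stack
        self.head = 0                    # top of the deque: thieves take from here
        self.tail = 0                    # bottom of the deque: the owner works here

    def push(self, frame):
        """Owner pushes a frame reference onto the bottom."""
        if self.tail == len(self.slots):
            raise OverflowError("deque is full")
        self.slots[self.tail] = frame
        self.tail += 1

    def pop(self):
        """Owner pops from the bottom; returns None if the deque is empty."""
        if self.tail == self.head:
            return None
        self.tail -= 1
        frame = self.slots[self.tail]
        self.slots[self.tail] = None
        return frame

    def steal(self):
        """Thief takes from the top; returns None if the deque is empty."""
        if self.head == self.tail:
            return None
        frame = self.slots[self.head]
        self.slots[self.head] = None
        self.head += 1
        return frame


if __name__ == "__main__":
    d = WorkerDeque()
    for name in ("foo", "bar", "baz"):
        d.push(name)
    print(d.pop(), d.steal(), d.pop())
```

The owner touches only the tail, so in the common case with no steals the deque behaves like an ordinary stack.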

Does this make sense? Questions so far? Ordinary call stack, deque lives outside of it, worker points at the deque, pretty simple design. So I mentioned that the compiler used relatively lightweight structures. This is essentially one of them. And if we take a look at the implementation of the Cilk runtime system, this is the essence of it. There are some additional implementation details, but these are the core-- this is, in a sense, the core piece of the design.

So the rest is just details. The Intel Cilk Plus runtime system takes this design and elaborates on it in a variety of ways. And we're going to take a look at those elaborations. First off, what we'll see is that every spawned subcomputation ends up being executed within its own helper function, which the compiler will generate. That's called a spawn helper function. And then the runtime system is going to maintain a few basic data structures as the workers execute their work.

There'll be a structure for the worker, which will look similar to what we just saw in the previous slide. There'll be a Cilk stack frame structure for each instantiation of a spawning function, some function that can perform a spawn. And there'll be a stack-frame structure for each spawn helper, each instantiation that is spawned. Now if we take another look at the compiled code we had before, some of it starts to make some sense.

Originally, we had our spawning function foo and a statement that spawned off a call to bar. And in the C pseudocode of the compiled results, we see that we have two functions. The first function foo-- that's our spawning function-- it's got a bunch of stuff in it, and we'll figure out what that's doing in a second. But there's a second function, and that second function is the spawn helper. And that spawn helper actually contains a statement which calls bar and ultimately saves the result.

Make sense? Now we're starting to understand some of the confusing C pseudocode we saw before. And if we take a look at each of these routines we see, indeed, there is a stack frame structure. And so in Intel Cilk Plus it's called a Cilk RTS stack frame, very creative name, I know. And it's just added as an extra local variable in each of these functions. You got one inside of foo, because that's a spawning function, and you get one inside of the spawn helper.

Now if we dive into the Cilk stack frame structure itself, by cracking open the source code for the Intel Cilk Plus runtime, we see that there are a lot of fields in the structure. The main fields are as follows-- there is a buffer, a context buffer, and that's going to contain enough information to resume a function at a continuation, in particular after a Cilk spawn or, in fact, after a Cilk sync statement.

There's an additional integer in the stack frame called flags, which will summarize the state of the Cilk stack frame, and we'll see a little bit more about that later. And there's going to be a pointer to a parent Cilk stack frame that's somewhere above this Cilk RTS stack frame, somewhere in the call stack. So these Cilk RTS stack frames, these are the extra bit of state that the Cilk runtime system adds to the ordinary call stack. So if we take a look at the actual worker structure, it's a lot like what we saw before.

We have a deque that's external to the call stack. The Cilk worker maintains head and tail pointers to the deque. The Cilk workers are also going to maintain a pointer to the current Cilk RTS stack frame, which will tend to be somewhere near the bottom of the stack. OK, so those are the basic data structures that a single worker is going to maintain. That includes the deque. Let's see them all in action, shall we? Any questions about that so far, before we start watching pointers fly? Yeah.

AUDIENCE: I guess with the previous slide, there were arrows on the workers' call stack. What do you [INAUDIBLE]?

TAO SCHARDL: What do the arrows among the elements on the call stack mean? So in this picture of the call stack, function instantiations are actually in green, and local variables-- specifically the Cilk RTS stack frames-- those show up in beige. So foo SF is the Cilk RTS stack frame inside the instantiation of foo. It's just a local variable that's also stored in the stack, right? Now, the Cilk RTS stack frame maintains a parent pointer, and it maintains a pointer up to some Cilk RTS stack frame above it on the stack.

It's just another local variable, also stored in the stack. So when we step away and look at the whole call stack with all the function frames and the Cilk RTS stack frames, that's where we get the pointers climbing up the stack. We're good? Other questions? All right, let's make some pointers fly. OK, this is going to be kind of a letdown, because the first thing we're going to look at is some code. So we're not going to have pointers flying just yet.
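The three main fields of the stack frame structure and the worker's pointer to the current frame can be modeled as below. This is a Python stand-in, not the runtime's C definitions: the field names, the use of a dictionary for the context buffer, and the choice to link the parent pointer inside `enter_frame` are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StackFrame:
    """Stand-in for the per-function Cilk stack frame structure."""
    owner: str                               # function holding this as a local variable
    ctx: Optional[dict] = None               # stand-in for the context buffer
    flags: int = 0                           # summary of the frame's state
    parent: Optional["StackFrame"] = None    # a stack frame structure higher up the call stack


@dataclass
class Worker:
    """Stand-in for the worker structure (deque omitted here)."""
    current: Optional[StackFrame] = None     # the current stack frame structure


def enter_frame(worker, owner):
    """Create a frame for `owner`, link it to the previous current frame, and make it current."""
    sf = StackFrame(owner=owner, parent=worker.current)
    worker.current = sf
    return sf


def parent_chain(sf):
    """Follow parent pointers up the call stack, starting at `sf`."""
    names = []
    while sf is not None:
        names.append(sf.owner)
        sf = sf.parent
    return names


if __name__ == "__main__":
    w = Worker()
    enter_frame(w, "foo")
    enter_frame(w, "spawn_bar")
    print(parent_chain(w.current))
```

Following `parent` from the worker's current frame reproduces the arrows climbing up the call stack in the picture.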

We can take a look at the code for the spawning function foo, at this point. And there's a lot of extra code in here, clearly. I've highlighted a lot of stuff on this slide, and all the highlighted material is related to the execution of the Cilk runtime system. But basically, if we look at this code, we can understand each of these pieces. Each of them has some role to play in making the Cilk runtime system work.

So at the very beginning, we have our Cilk stack frame structure. And there's a call to this enter frame function, which all that really does is initialize the stack frame. That's all the function is doing. Later on, we find that there's this setjmp routine-- we'll talk a lot more about setjmp in a bit-- but, at this point, we can say the setjmp prepares the function for a spawn. And inside the conditional, where the setjmp occurs as a predicate, we have a call to spawn bar.

If we remember from a couple of slides ago, spawn bar was our spawn helper function. So we're here, we're just invoking the spawn helper. Later on in the code, we have another blob of conditionals with a Cilk RTS sync call, deep inside. All that code performs a sync. We'll talk about that a bit near the end of lecture. And finally, at the end of the spawning function, we have a call to pop frame, which just cleans up the Cilk stack frame structure within this function. And then there's a call to leave frame, which essentially cleans up the deque.

That's the spawning function. This is the spawn helper. It looks somewhat similar. I've added extra whitespace just to make the slide a little bit prettier. And in some ways, it's similar to the spawning function itself. We have a Cilk RTS stack frame [INAUDIBLE] spawn helper, another call to enter frame, which is just a little bit different. But essentially, it initializes the stack frame. Its reason to be is similar to the enter frame call we saw before.

There's a call to Cilk RTS detach, which performs a bunch of updates on the deque. Then there is the actual invocation of the spawned subroutine. This is where we're calling bar. And finally, at the end of the function, there is a call to pop frame, to clean up the stack structure, and a call to leave frame, which will clean up the deque and possibly return. It'll try to return. We'll see more about that. So let's watch all of this in action. Question? OK, cool.
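As a rough illustration of the two compiled functions described above, here is a minimal serial sketch in C. Every name in it (toy_frame, toy_worker, enter_frame, and so on) is a hypothetical stand-in rather than the runtime's real API, there is only one worker, and nothing is ever stolen, so the sync branch is never taken.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-ins for the runtime structures. */
typedef struct toy_frame {
    struct toy_frame *parent;  /* frame above this one on the call stack */
    jmp_buf ctx;               /* register state saved by setjmp */
    int detached;              /* set once this frame's parent was pushed on the deque */
    int stolen;                /* would be set by a thief; never set in this sketch */
} toy_frame;

typedef struct {
    toy_frame *current;        /* the worker's current stack frame */
    toy_frame *deque[64];      /* stealable frames; no overflow check in this sketch */
    int head, tail;
} toy_worker;

static toy_worker W;           /* a single worker, so everything runs serially */

static void enter_frame(toy_frame *sf) {
    sf->parent = W.current;
    sf->detached = 0;
    sf->stolen = 0;
    W.current = sf;
}
static void detach(toy_frame *sf) { W.deque[W.tail++] = sf->parent; sf->detached = 1; }
static void pop_frame(toy_frame *sf) { W.current = sf->parent; }
static void leave_frame(toy_frame *sf) {
    if (sf->detached && W.tail > W.head)
        W.tail--;              /* serial case: the parent is still on the deque */
}
static void toy_sync(toy_frame *sf) { sf->stolen = 0; }

static int bar(int n) { return n * 2; }

/* Spawn helper: wraps the spawned call to bar. */
static void spawn_bar(int *x, int n) {
    toy_frame sf;
    enter_frame(&sf);
    detach(&sf);               /* foo's continuation is now available to thieves */
    *x = bar(n);
    pop_frame(&sf);
    leave_frame(&sf);
}

/* Spawning function, roughly: x = cilk_spawn bar(n); cilk_sync; return x + 1; */
int foo(int n) {
    toy_frame sf;
    int x = 0;
    enter_frame(&sf);
    if (!setjmp(sf.ctx))       /* a direct call returns 0, so the spawn is taken */
        spawn_bar(&x, n);
    if (sf.stolen)             /* only sync if a steal actually happened */
        toy_sync(&sf);
    pop_frame(&sf);
    leave_frame(&sf);
    return x + 1;
}
```

Calling foo(3) in this sketch walks through enter frame, set jump, the spawn helper, detach, bar, pop frame, and leave frame in the order the lecture describes.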

Let's see all of this in action. We'll start off with a pretty boring picture. All we've got on our call stack is main, and our Cilk worker has nothing on its deque. But now we suppose that main calls our spawning function foo, and the spawning function foo contains a Cilk RTS stack frame. What we're going to do in the Cilk worker, what that enter frame call is going to perform, all it's going to do is update the current stack frame. We now have a Cilk RTS stack frame, make sure the worker points at it, that's all.

Fast forward a little bit, and foo encounters this call to Cilk spawn of bar. And in the C pseudocode that's compiled for foo, we have a set jump routine. This set jump is kind of a magical function. This is the function that allows thieves to steal the continuation. And in particular, the set jump takes, as an argument, a buffer. In this case, it's the context buffer that we have in the Cilk RTS stack frame. And what the set jump will do is it will store information that's necessary to resume the function at the location of the set jump.

And it stores that information into the buffer. Can anyone guess what that information might be?

AUDIENCE: The instruction points at [INAUDIBLE].

TAO SCHARDL: Instruction pointer or stack pointer, I believe both of those are in the frame. Yeah, both of those are in the frame. Good, what else?

AUDIENCE: All the registers are in use.

TAO SCHARDL: All the registers are currently in use. Does it need all the registers? You're absolutely on the right track, but is there any way it could restrict the set of registers it needs to save?

AUDIENCE: The registers are used later in the execution.

TAO SCHARDL: That's part of it. Set jump isn't that clever though, so it just stores a predetermined set of registers. But there is another way to restrict the set.

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: Only registers used as parameters in the called function, yeah, close enough. Callee-saved registers. So registers that the function might-- that it's the responsibility of foo to save, this goes all the way back to that discussion in lecture, I don't remember which small number, talking about the calling convention. These registers need to be saved, as well as the instruction pointer and various stack pointers. Those are what gets saved into the buffer.

The other registers, well, we're about to call a function, it's up to that other function to save the registers appropriately. So we don't need to worry about those. So all good? Any questions about that? All right, so this set jump routine, let's take it for granted that when we call a set jump on this given buffer, it returns zero. That's a good lie for now. We'll just run with it. So set jump returns zero.

The condition says, if not zero-- which turns out to be true-- and so the next thing that happens is this call to the spawn helper, spawn_bar, in this case. When we call spawn_bar, what happens to our stack? So this should look pretty routine. We're doing a function call, and so we push the frame for the called function onto the stack. And that called function, spawn bar, contains a local variable, which is this [INAUDIBLE] stack frame. So that also gets pushed onto the stack, pretty straightforward.

We've seen function calls many times before. This should look pretty familiar. Now we do this Cilk RTS enter frame fast routine. And I mentioned before that that's going to update the worker structure. So what's going to happen here? Well, we have a brand new Cilk RTS stack frame on the stack. Any guesses as to what change we make? What would enter frame do?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: Point current stack frame to spawn bar's stack frame, you're right. Anything else? Hope I got this animation right. What are the various fields within the stack frame? And what did-- sorry, I don't know your name. What's your name?

AUDIENCE: I'm Greg.

TAO SCHARDL: Greg, what did Greg ask about before, when we saw an earlier picture of the call stack?

AUDIENCE: Set a pointer to the parent.

TAO SCHARDL: Set a pointer to the parent, exactly. So what we're going to do is we're going to take this call stack, we'll do the enter frame fast routine. That establishes this parent pointer in our brand new stack frame. And we update the worker's current stack frame to point at the bottom. Yeah, question?

AUDIENCE: How does enter frame know what the parent is?

TAO SCHARDL: How does enter frame know what the parent is? Good question. Enter frame knows the worker. Or rather, enter frame can do a call, which will give it access to the Cilk worker structure. And because it can do a call, it can read the current stack frame pointer in the worker.
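The enter frame step can be sketched in a few lines of C. The types and names below are hypothetical simplifications, not the runtime's own definitions; the point is only the order of the two assignments, where the old current frame is read before it is overwritten.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the runtime's structures. */
typedef struct toy_frame {
    struct toy_frame *parent;
} toy_frame;

typedef struct {
    toy_frame *current;        /* most recently entered frame */
} toy_worker;

/* Read the worker's old current frame first, then overwrite it. */
void toy_enter_frame(toy_worker *w, toy_frame *sf) {
    sf->parent = w->current;   /* 1. link the new frame to its parent */
    w->current = sf;           /* 2. make the new frame the current one */
}

/* Number of frames reachable by following parent pointers from the current frame. */
int toy_depth(const toy_worker *w) {
    int depth = 0;
    for (const toy_frame *f = w->current; f != NULL; f = f->parent)
        depth++;
    return depth;
}
```

Entering foo's frame and then the spawn helper's frame leaves the worker pointing at the helper's frame, whose parent pointer leads back up to foo's.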

AUDIENCE: So we do [INAUDIBLE] before we change the current [INAUDIBLE]?

TAO SCHARDL: Yeah, in this case we do. So we add the parent pointer, then we delete and update. So, good catch. Any other questions? Cool. All right, now we encounter this thing, Cilk RTS detach. This one's kind of exciting. Finally we get to do something to the deque. Any guesses what we do? How do we update the deque? Here's a hint. Cilk RTS detach allows-- this is the function that allows some computation to be stolen.

Once Cilk RTS detach is done executing, a thief could come along and steal the continuation of the Cilk spawn. So what would Cilk RTS detach do to our worker and its structures? Yeah, in the back.

AUDIENCE: Push the stack frame to the worker deque?

TAO SCHARDL: Push the stack frame to the worker deque, specifically at the tail. Right, I gave it away by clicking the animation, oh well. Now the thing that's available to be stolen is inside of foo. So what ends up getting pushed onto the deque is not the current stack frame, but in fact its immediate parent, so the stack frame of foo. That gets pushed onto the tail of the deque. We've now pushed something onto the tail of the deque, and so we advance the tail pointer. Still good, everyone? I see some nods.

I see at least one nod. I'll take it. But feel free to ask questions, of course. And then of course there is this invocation of bar. This does what you might expect. It calls bar, no magic here. Well, no new magic here. OK, fast forward, let's suppose that bar finally returns.
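The detach step described above can be sketched as follows. This is a single-threaded model with hypothetical names and a fixed-size deque with no overflow check; a real runtime would also need memory ordering so that a thief sees the pushed frame before it sees the new tail.

```c
#include <stddef.h>

#define TOY_DEQUE_CAP 16

typedef struct toy_frame {
    struct toy_frame *parent;
} toy_frame;

typedef struct {
    toy_frame *current;
    toy_frame *deque[TOY_DEQUE_CAP];
    int head;                  /* thieves steal here */
    int tail;                  /* the owning worker pushes and pops here */
} toy_worker;

/* Called from the spawn helper: publish the parent's continuation. */
void toy_detach(toy_worker *w, toy_frame *helper_sf) {
    w->deque[w->tail] = helper_sf->parent;  /* push foo's frame, not the helper's */
    w->tail++;                              /* advance the tail pointer */
}
```

After the call, the slot at the old tail holds foo's frame, which is exactly the frame whose continuation a thief may resume.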

And now we return to the statement after bar in the spawn helper. That statement is the pop frame. Actually, since we just returned from bar, we need to get rid of bar from the stack. Good, now we can execute the pop frame.

What would the pop frame do? It's going to clean up the stack frame structure. So what would that entail, any guesses?

AUDIENCE: I guess it would move the current stack frame back to the parent stack frame?

TAO SCHARDL: Move the current stack frame back to the parent, very good. I think that's largely it. I guess there's one other thing it can do. It's kind of optional, given that it's going to garbage the memory anyway. So it updates the current stack frame to point to the parent, and now it no longer needs that parent pointer. So it can clean that up, in principle. And then there's this call to Cilk RTS leave frame. This is magic-- well, not really, but it's not obvious. This is a function call that may or may not return.

Welcome to the Cilk runtime system. You end up with calls to functions that you may never return from. This happens all the time. And the Cilk RTS leave frame may or may not return, based entirely on the status of the deque, on what content is currently sitting on the worker's deque. Anyone have a guess as to why the leave frame routine might not return, in the conventional sense?

AUDIENCE: There's nothing else for the worker to do, so it'll sit there spinning.

TAO SCHARDL: If there's nothing left to do on the deque, then it's going to-- sorry, say again?

AUDIENCE: It'll just wait until there's work you can steal?

TAO SCHARDL: Right, if there's nothing on the deque, then it has nowhere to return to. And so naturally, as we've seen from Cilk workers in the past, it discovers there's nothing on the deque, there's no work to do, time to turn to a life of crime, and try to steal work from someone else. So there are two possible scenarios. The pop could succeed and execution continues as normal, or it fails and it becomes a thief. Now which of these two cases do you think is more important for the runtime system to optimize?

Success, case one, exactly, so why is that?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: At least, we hope so, yeah. We assume-- this hearkens all the way back to that work-first principle-- we assume that in the common case, workers are doing useful work, they're not just spending their time stealing from each other. And therefore, ideally, we want to assume that the worker will do what's normal, just an ordinary serial execution. In a normal serial execution, there is something on the deque, the pop succeeds, that's case one.
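The two outcomes of leave frame can be modeled with a small C sketch. The names are hypothetical, the model is single-threaded, and the "become a thief" branch is reduced to a return value; the real routine may simply never return to its caller in that case.

```c
#include <stddef.h>

typedef struct toy_frame {
    struct toy_frame *parent;
} toy_frame;

typedef struct {
    toy_frame *current;
    toy_frame *deque[16];
    int head, tail;
} toy_worker;

/* Undo enter frame: the parent becomes the current frame again. */
void toy_pop_frame(toy_worker *w, toy_frame *sf) {
    w->current = sf->parent;
    sf->parent = NULL;         /* optional tidy-up; the frame is about to go away */
}

/* Returns 1 if the parent's continuation is still on the deque (case one:
   return normally), or 0 if it is gone (case two: go and steal). */
int toy_leave_frame(toy_worker *w) {
    if (w->tail > w->head) {
        w->tail--;             /* pop the frame that detach pushed */
        return 1;
    }
    return 0;
}
```

Case one is the path taken in an ordinary serial execution, which is why it is the one worth keeping short.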

So what we'll see is that the runtime system, in fact, does a little bit of optimization on case one. Let's talk about something a little more exciting. How about stealing computation. We like stealing stuff from each other. Yes?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: Where does it return the results? So where does it return the result in the spawn bar? The answer you can kind of see two lines above this. So in this case, in the original Cilk code, we had X equals Cilk spawn of bar. And here, what are the parameters to our spawn bar function? X and N. Now N is the input to bar, right? So what's X?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: You can rewind a little bit and see that you are correct. There we go. Yeah, so the original Cilk code, we had X equals Cilk spawn bar. That's the same X. All that Cilk does is pass a pointer to the memory allocated for that variable down to the spawn helper. And now the spawn helper, when it calls bar and that returns, it gets stored into that storage in the parent stack frame. Good catch. Good observation. Any questions about that? Does that make sense? Cool.
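The way the result travels back can be shown with a tiny serial sketch in C. The body of bar is made up for illustration and the helper omits all runtime calls; only the pointer passing mirrors what is described above.

```c
/* Made-up body for bar, only so the example computes something. */
static int bar(int n) { return n * n; }

/* Hypothetical spawn helper: x points into the parent's stack frame. */
static void spawn_bar(int *x, int n) {
    *x = bar(n);               /* store the result directly into foo's variable */
}

int foo(int n) {
    int x = 0;                 /* lives in foo's frame */
    spawn_bar(&x, n);          /* stands in for: x = cilk_spawn bar(n); */
    /* in real Cilk code, a cilk_sync must come before x is read */
    return x;
}
```

Because the helper writes through a pointer into foo's frame, that frame has to stay alive until the spawned call has returned, even if foo's continuation is running on another worker.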

Probably used too many animations in these slides. All right, now let's talk about stealing. How does a worker steal computation? Now the conceptual diagram we had before saw this one worker, with nothing on its deque, take a couple of frames from another worker's deque and just slide them on over. What does that actually look like in the implementation?

Well, we're still going to take from the top of the deque, but now we have a picture that's a little bit more accurate in terms of the structures that are really implemented in the system. So we have the call stack of the victim, and the victim also has a deque data structure and a Cilk worker data structure, with head and tail pointers and a current stack frame. So what happens when a thief comes along out of nowhere? It's bored, it has nothing on its deque. Head and tail pointers both point to the top.

Current stack frame has nothing. What's the thief going to do? Any guesses? How does this thief take the content from the worker's deque?

AUDIENCE: The worker sets their current stack frame to the one that [INAUDIBLE]

TAO SCHARDL: Exactly right, yeah. Sorry, was that-- I didn't mean to interrupt. All right, cool. So the red highlighting should give a little bit of a hint. The current stack frame in the thief is going to end up pointing to the stack frame at the top of the deque, pointed to by the top of the deque. And the head of the deque needs to be updated. So let's just see all those pointers shuffle. The thief is going to target the head of the deque. It's going to dequeue that item from the top of the deque.

It's going to set the current stack frame to point to that item, and it will delete the pointer on the deque. That make sense? Cool. Now the victim and the thief are on different processors, and this scenario involves shuffling a lot of pointers around. So if we think about this process, there needs to be some way to handle the concurrent accesses that are going to occur on the head of the deque.
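The pointer shuffle in a steal can be sketched as below. This is a single-threaded model with hypothetical names; it deliberately ignores the concurrent accesses on the head of the deque, which is the problem the lecture turns to next.

```c
#include <stddef.h>

typedef struct toy_frame {
    struct toy_frame *parent;
} toy_frame;

typedef struct {
    toy_frame *current;
    toy_frame *deque[16];
    int head, tail;
} toy_worker;

/* Returns 1 if the thief took a frame, 0 if the victim's deque was empty. */
int toy_steal(toy_worker *thief, toy_worker *victim) {
    if (victim->head >= victim->tail)
        return 0;                            /* nothing to steal */
    toy_frame *sf = victim->deque[victim->head];
    victim->deque[victim->head] = NULL;      /* delete the pointer on the deque */
    victim->head++;                          /* advance the head past the stolen item */
    thief->current = sf;                     /* the thief's current frame is now foo's */
    return 1;
}
```

After a successful steal, the thief holds a pointer to a frame that physically lives on the victim's stack.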

You haven't talked about synchronization yet in this class, that's going to be a couple lectures down the road. I'll give you a couple of spoilers for those synchronization lectures. First off, synchronization is expensive. And second, reasoning about synchronization is a source of massive headaches. Congratulations, you now know those two lectures. No, I'm just kidding. Go to the lectures, you'll learn a lot, they're great.

In the Cilk runtime system, the way that those concurrent accesses are handled is by using a protocol known as the THE protocol. This is pseudo code for most of the logic in the THE protocol. There's a protocol that the worker, executing work normally, follows. And there is the protocol for the thief. I'm not going to walk through all the lines of code here and describe what they do. I'll just give you the very high level view of this protocol.

From the thief's perspective, the thief always grabs a lock on the deque before doing any operations on the deque. Always acquire the lock first. For the worker, it's a little bit more optimized. So what the worker will do is optimistically try to pop something from the bottom of the deque. And only if it looks like that pop operation fails does the worker do something more complicated. Only then does it try to acquire a lock on the deque, then try to pop something off, see if it really succeeds or fails, and possibly turn to a life of crime.

So the worker's protocol looks longer, but that's just because the worker implements a special case, which is optimized for the common case. This is essentially where the leave frame routine, that we saw before, is optimized for case one, optimized for the pop from the deque succeeding. Any questions about that? Seem clear from 30,000 feet? Cool. OK, so that's how a worker steals work from the top of the victim's deque.
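The shape of the two protocols can be sketched in C11 as follows. This is a simplified model based on the high-level description above, not a copy of the slide or of the runtime's code: the type and function names are hypothetical, the deque only tracks its head and tail indices, a spinning atomic_flag stands in for the lock, and every atomic operation uses the default sequentially consistent ordering.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int head;           /* H: thieves advance this */
    atomic_int tail;           /* T: the owning worker moves this */
    atomic_flag lock;          /* L: guards the slow paths */
} the_deque;

void the_init(the_deque *d, int items) {
    atomic_init(&d->head, 0);
    atomic_init(&d->tail, items);
    atomic_flag_clear(&d->lock);
}

static void the_lock(the_deque *d) {
    while (atomic_flag_test_and_set(&d->lock)) { /* spin */ }
}
static void the_unlock(the_deque *d) { atomic_flag_clear(&d->lock); }

/* Worker side: returns 1 if the pop succeeded, 0 if the deque was empty. */
int the_pop(the_deque *d) {
    atomic_fetch_sub(&d->tail, 1);                    /* optimistic T-- */
    if (atomic_load(&d->head) > atomic_load(&d->tail)) {
        atomic_fetch_add(&d->tail, 1);                /* looks like a conflict: back off */
        the_lock(d);
        atomic_fetch_sub(&d->tail, 1);                /* retry under the lock */
        if (atomic_load(&d->head) > atomic_load(&d->tail)) {
            atomic_fetch_add(&d->tail, 1);
            the_unlock(d);
            return 0;                                 /* really empty: go steal */
        }
        the_unlock(d);
    }
    return 1;                                         /* common case: no lock taken */
}

/* Thief side: always takes the lock. Returns 1 if a frame was claimed. */
int the_steal(the_deque *d) {
    the_lock(d);
    atomic_fetch_add(&d->head, 1);                    /* H++ */
    if (atomic_load(&d->head) > atomic_load(&d->tail)) {
        atomic_fetch_sub(&d->head, 1);                /* nothing there: undo */
        the_unlock(d);
        return 0;
    }
    the_unlock(d);
    return 1;
}
```

When the deque holds at least one frame and no thief is racing for it, the_pop finishes after a decrement and a comparison, without touching the lock. That is the optimization for case one.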

Now, that thief needs to resume a continuation. And this is that whole process about jumping into the middle of an executing function. It already has a frame, it already has a [INAUDIBLE] state going on, and all that was established by a different processor. So somehow that thief has to magically come up with the right state and start executing that function. How does that happen? Well, this has to do with a routine that's the complement of the set jump routine we saw before.

The complement of set jump is what's called long jump. So Cilk uses, in particular Cilk thieves, use the long jump function in order to resume a stolen continuation. Previously, in our spawning function foo, we had this set jump call. And that set jump saved some state to a local buffer, in particular the buffer in the stack frame of foo. Now the thief has just created this Cilk worker structure, where the current stack frame is pointing at the stack frame of foo.

And so what the thief will do is execute the long jump function, passing that particular stack frame's buffer and an additional argument, and that long jump will take the register state stored in the buffer, put that register state into the worker, and then let the worker proceed. That make sense? Any questions about that?

This is kind of a wacky routine because, if you remember, one of the registers stored in that buffer is an instruction pointer. And so it's going to read the instruction pointer out of the buffer. It's also going to read a bunch of callee-saved registers and stack pointers out of the buffer. And it is going to say, that's my register state now, that's what the thief says. It just stole that register state. And it's going to set its RIP to be the RIP it just read.

So what does that mean for where the long jump routine returns?

AUDIENCE: It returns into the stack frame above the [INAUDIBLE]

TAO SCHARDL: Returns the stack frame above the one it just stole. More or less, but more specifically, where in that function does it return?

AUDIENCE: Just after the call.

TAO SCHARDL: Which call?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: To the spawn bar, here? Almost, very, very close, very, very close. What ends up happening is that the long jump effectively returns from the set jump a second time. This is the weird protocol between set jump and long jump. Set jump, you pass it a buffer, it saves the register state, and then it returns. And it returns immediately, and on its direct invocation, that set jump call returns the value zero, as we mentioned before.

Now if you invoke a long jump using the same buffer, that causes the processor to effectively return from the same set jump call. They use the same buffer. But now it's going to return with a different value, and it's going to return with the value specified in the second argument. So invoking long jump with the buffer and a value X returns from that set jump with the value X. So when the thief executes a long jump with the appropriate buffer, and the second argument is one, what happens?

Can anyone walk me through this? Oh, it's on the slide, OK. So now that set jump effectively returns a second time, but now it returns with a value one. And now the predicate gets evaluated. So if not one, which would be if false, well don't do the consequent, because the predicate was false. And that means it's going to skip the call to spawn bar, and it'll just fall through and execute the stuff right after that conditional, which happens to be the continuation of the spawn.
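The double return can be seen with the standard C setjmp and longjmp functions in a single thread. In this sketch the names spawn_helper, resume_continuation, and run_demo are made up, and the "thief" is just a callee of the function that called setjmp. A real steal jumps from a different worker on a different stack, which goes beyond what standard C promises for longjmp.

```c
#include <setjmp.h>

static jmp_buf ctx;            /* plays the role of the buffer in foo's stack frame */
static int spawned;            /* counts calls to the spawn helper */
static int continued;          /* counts executions of the continuation */

static void spawn_helper(void) { spawned++; }

/* Plays the thief: resume at the setjmp, making it return 1. */
static void resume_continuation(void) { longjmp(ctx, 1); }

int run_demo(void) {
    volatile int resumed = 0;  /* volatile: modified between setjmp and longjmp */
    if (!setjmp(ctx)) {
        spawn_helper();        /* direct return (0): take the spawn */
    }
    /* everything from here on is the continuation of the spawn */
    continued++;
    if (!resumed) {
        resumed = 1;
        resume_continuation(); /* setjmp returns a second time, now with 1 */
    }
    return spawned * 10 + continued;
}
```

The spawn helper runs once while the continuation runs twice, so a single call to run_demo returns 12.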

That's kind of neat. I think that's kind of neat, being unbiased. Anyone else think that's kind of neat? Excellent. Anyone desperately confused about this set jump, long jump nonsense? Any questions you want to ask, or just generally confused about why these things exist in modern computing? Yeah.

AUDIENCE: Is there any reason you couldn't just add, like, [INAUDIBLE] to the instruction point and jump over the call, instead?

TAO SCHARDL: Is there any reason you couldn't just add some fixed offset to the instruction pointer to jump over the call? In principle, I think, if you can statically compute the distance you need to jump, then you can just add that to RIP and let the long jump do its thing. Or rather, the thief will just adopt that RIP and end up in the right place. What's done here is-- basically, this was the protocol that the existing set jump and long jump routines implement.

And I imagine it's a bit more flexible of a protocol than what you strictly need for the Cilk runtime. And so, you know, it ends up working out. But if you can statically compute that offset, there's no reason in principle you couldn't adopt a different approach. So, good observation. Any questions? Any other questions? It's fine to be generally confused why there are routines, set jump and long jump, with this wacky behavior. Compiler writers have that reaction all the time. These are a nightmare to compile.

Anyway, OK, so we've seen how a thief can take some computation off of a victim's deque, and we've seen how the thief can jump right into the middle of an executing function with the appropriate register state. Is this the end of the story? Is there anything else we need to talk about, with respect to stealing? Or, more pointedly, what else do we not need to talk about with respect to stealing? You're welcome to answer, if you like. OK.

Hey, remember that list of concerns we had at the beginning? List of requirements is what it was called. We will talk about syncs, but not just yet. What other thing was brought up? Remember this slide from a previous lecture? Here's another hint. So the register state is certainly part of the state of an executing function. What else defines a state of an executing function? Where does the other state of the function live?

It lives on the stack, so what is there to talk about regarding the stack?

AUDIENCE: Cactus stack.

TAO SCHARDL: The cactus stack, exactly. So you mentioned before that thieves need to implement this cactus stack abstraction for the Cilk runtime system. Why exactly do we need this cactus stack? What's wrong with just having the thief use the victim's stack?

AUDIENCE: [INAUDIBLE]

TAO SCHARDL: The victim might just free up a bunch of stuff and then it's no longer accessible. So it can free some amount of stuff, in particular everything up to the function foo, but in fact it can't return from the function foo because some other-- well, assuming that the Cilk RTS leave frame thing is implemented-- the function foo is no longer in the stack, it won't ever reach it. So it won't return from the function foo while another worker is working on it. But good observation.

There is something else that can go wrong if the thief just directly uses the victim's stack. Well, let's take a hint from the slide we have so far. So the example that's going to be shown is that the thief steals the continuation of foo, and then the thief is going to call a function baz. So the thief is using the victim's stack, and then it calls a function baz. What goes wrong?

AUDIENCE: The victim has called something, but underneath, there is some other function stack [INAUDIBLE]

TAO SCHARDL: Exactly. The victim in this picture, for example, has some other functions on its stack below foo. So if the thief does any function calls and is using the same stack, it's going to scribble all over the state of, in this case spawn bar, and bar, which the victim is trying to use and maintain. So the thief will end up corrupting the victim's stack. And if you think about it, it's also possible for the victim to corrupt the thief's stack. They can't share a stack, but they do want to share some amount of data on the stack.

They do both care about the state of foo, and that needs to be consistent across all the workers. But we at least need a separate call stack for the thief. We'd rather not do unnecessary work in order to initialize this call stack, however. We really need this call stack for things that the thief might invoke, local variables the thief might need, or functions that the thief might call or spawn. OK, so how do we implement the cactus stack?

We have a victim stack, we have a thief stack, and we have a pretty cute trick, in my opinion. So the thief steals its continuation. It's going to do a little bit of magic with its stack pointers. What it's going to do is it's going to use the RBP it was given, which points into the victim's stack, and it's going to set the stack pointer to point at its own stack. So RBP is over there, and RSP, for the thief, is pointing to the beginning of the thief's call stack.

And that is basically fine. The thief can access all the state in the function foo, as offsets from RBP, but if the thief needs to do any function calls, we have a calling convention that involves saving RBP and updating RSP in order to execute the call. So in particular, the thief calls the function baz, it saves its current value of RBP onto its own stack, it advances RSP, it sets RBP equal to RSP, it pushes the stack frame for baz onto the stack, and it advances RSP a little bit further.

And just like that, the thief is churning away on its own stack. So just with this magic of RBP pointing there and RSP pointing here, we got our cactus stack. Everyone follow that? Anyone desperately confused by this stack pointer? Who thinks this is kind of a neat trick? All right, cool. Anyone think this is a really mundane trick? Hopefully no one thinks it's a mundane trick. OK, there's like half a hand there, that's fine. I think this is a neat trick, just messing around with the stack pointers.

Are there any worries about using RBP and RSP this way? Any concerns that you might think of from using these two stack pointers as described? In a past lecture, briefly mentioned was a compiler optimization for dealing with stacks. Yeah.

AUDIENCE: [INAUDIBLE] We were offsetting [INAUDIBLE]

TAO SCHARDL: Right, there was a compiler optimization that said, in certain cases you don't need both the base pointer and the stack pointer. You can do all offsets. I think it's actually off the stack pointer, and then the base pointer becomes an additional general purpose register. That optimization clearly does not work if you need the base pointer and the stack pointer to do this wacky trick. The answer is that the Cilk compiler specifically says, if this function has a continuation that could be stolen, don't do that optimization.

It's super illegal, it's very bad, don't do the optimization. So that ends up being the answer. And it costs us a general purpose register for Cilk functions, not the biggest loss in the world, all right. There's a little bit of time left, so we can talk about synchronizing computation. I'll give you a brief version of this. This part gets fairly complicated, and so I'll give you a high level summary of how all of this works.

So just to page back in some context, we have this scenario where different processors are executing different parts of our computation dag, and one processor might encounter a Cilk sync statement that it can't execute because some other processor is busy executing a spawned subcomputation. Now, in this case, P3 is waiting on P1 to finish its execution before the sync can proceed. And synchronization needs to happen, really, only on the subcomputation that P1 is executing.

P2 shouldn't play a role in this. So what exactly happens when a worker reaches a Cilk sync before all the spawned subcomputations return? Well, we'd like the worker to become a thief. We'd rather the worker not just sit there and wait until all the spawned subcomputations return. That's a waste of a perfectly good worker. But we also can't let the worker's current function frame disappear.

There is a spawned subcomputation that's using that frame. That frame is its parent. It may be accessing state in that frame, it may be trying to save a return value to some location in that frame. And so the frame has to persist, even if the worker that's working on the frame goes off and becomes a thief. Moreover, in the future, that subcomputation, we believe, should return. And that worker must resume the frame and actually execute past the Cilk sync.

Finally, the Cilk sync should only apply to the nested subcomputations underneath its function, not the program in general. And so we don't allow ourselves synchronization, just among all the workers, wholesale. We don't say, OK, we've hit a sync, every worker in the system must reach some point in the execution. We only care about this nested synchronization.

So if we think about this, and we're talking about nested synchronization for computations under a function, we have this notion of cactus stack, we have this notion of a tree of function invocations. We may immediately start to think about, well, what if we just maintain some state, in a tree, to keep track of who needs to synchronize with whom, which computations are waiting on which other computations to finish? And, in fact, that's essentially what the Cilk runtime system does.

It maintains a tree of states called full frames, and those full frames store state for the parallel subcomputations. And those full frames keep track of which subcomputations are outstanding and how they relate to each other. This is a high level picture of a full frame. There are lots of details elided, to be honest.

But at 30,000 feet, a full frame keeps track of a bunch of information for the parallel execution-- I know, I'm giving you the quick version of this-- including pointers to parent frames and possibly pointers to child frames, or at least the number of outstanding child frames. The processors in the system work on what are called active full frames. In the diagram, those full frames are the rounded rectangles highlighted in dark blue. Other full frames in the system are, what we call, suspended. They're waiting on some subcomputation to return.

That's what a full frame tree can look like under some execution. Let's see how a full frame tree can come into being, just by working through an animation. So suppose we have some worker with a bunch of spawned and called frames on its deque. No other workers have anything on their deques. And finally, some worker wants to steal. And I'll admit, this animation is crafted slightly, just to make the pictures a little bit nicer. It can look more complicated in practice, don't worry, if that was actually a worry of yours.

So what's going to happen, the thief is going to take some frames from the top of the victim's deque. And it's actually going to steal not just those frames, but the whole full frame structure along with it. The full frame structure is just represented with this rounded rectangle. In fact, it's a constant size thing. But the thief is going to take the whole full frame structure. And it's going to give the victim a brand new full frame and establish the child to parent pointer in the victim's new full frame.

That's kind of weird. It's not obvious why the thief would take the full frame as it's stealing computation, at least not from one step. But we can see why it helps, just given one more step. So let's fast forward this picture a little bit, and now we have another worker try to steal some computation, and we have a little bit more stuff going on. So this worker might randomly select the last worker on the right, steal computation from the top of its deque, and it's going to steal the full frame along with the deque frames.

And because it stole the full frame, all pointers to that full frame from any child subcomputations are still valid. The child's computation on the left still points to the correct full frame. The full frame that was stolen has the parent context of that child, and so we need to make sure that pointer is still good. If it created a new full frame for itself, then you would have to update the child pointers somehow, and that requires more synchronization and a more complicated protocol.
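The hand-off of full frames during a steal can be modeled with a short C sketch. The structure and function names are hypothetical, a full frame is reduced to a parent pointer and a count of outstanding children, and the caller supplies the fresh full frame so the example needs no allocation.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced full frame. */
typedef struct full_frame {
    struct full_frame *parent;
    int outstanding_children;
} full_frame;

typedef struct {
    full_frame *ff;            /* the full frame this worker is working on */
} toy_worker;

/* The thief takes the victim's existing full frame; the victim gets a
   fresh one whose parent pointer refers to the stolen frame. */
void toy_steal_full_frame(toy_worker *thief, toy_worker *victim,
                          full_frame *fresh) {
    full_frame *stolen = victim->ff;
    fresh->parent = stolen;
    fresh->outstanding_children = 0;
    stolen->outstanding_children++;  /* the victim's work is now a child */
    thief->ff = stolen;
    victim->ff = fresh;
}
```

After two steals in a row, the full frame created by the first steal still points at the original frame and never had to be updated, which is the saving described above.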

Synchronization is expensive, protocols are complicated. This ends up saving some complexity. And then it creates a frame for the child, and we can see this process unfold just a little bit further. And if we carry on for a few steals, we end up with a tree. We have two children pointing to one parent, and one of those children has its own child. Great. Now suppose that some worker says, oh, I encountered a sync, can I synchronize?

In this case, the worker has an outstanding child computation so it can't synchronize. And so we can't recycle the full frame, we can't recycle any of the stack for this child. And so, instead, the worker will suspend this full frame, turning it from dark blue to light blue in our picture, and it goes and becomes a thief. Now suppose the program has ample parallelism. What do we expect to typically happen when the program execution reaches a Cilk sync?

We're kind of out of time, so I think I'm just going to spoil the answer for this, unless anyone has a guess handy. So what's the common case for a Cilk sync? For the sake of time, the common case is that the executing function has no outstanding children. All the workers on the system were busy doing their own thing, there is no synchronization that's necessary. And so how does the runtime optimize this case?

It ends up having the full frame use some bits of an associated stack frame, in particular the flag field. And that's why, when we look at the compiled code for a Cilk sync, we see some conditions that evaluate the flags within the local stack frame. That's just an optimization to say, if you don't need a sync, don't do any computation. Otherwise, some steals really did occur, so go ahead and execute the Cilk RTS sync routine.
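The flag check in front of a sync can be sketched as below. The flag name, its value, and the slow-path stub are hypothetical; the sketch only shows that the common case costs one load and one test, and that the runtime's sync routine is entered only when the flag says a steal touched this frame.

```c
#define TOY_FRAME_UNSYNCHED 0x1u   /* hypothetical flag bit set when a steal occurs */

typedef struct {
    unsigned flags;
} toy_frame;

int slow_syncs;                    /* counts entries into the slow path */

/* Stand-in for the runtime's sync routine. */
static void toy_rts_sync(toy_frame *sf) {
    slow_syncs++;
    sf->flags &= ~TOY_FRAME_UNSYNCHED;
}

/* What the compiled code for a sync boils down to in this model. */
void toy_sync(toy_frame *sf) {
    if (sf->flags & TOY_FRAME_UNSYNCHED)   /* did any steal happen? */
        toy_rts_sync(sf);                  /* rare case: do the real work */
    /* common case: fall straight through */
}
```

With the flag clear, toy_sync never reaches the slow path, which matches the work-first idea of keeping the ordinary serial execution cheap.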

There are a bunch of other runtime features. If you take a look at that picture for a long time, you may be dissatisfied with what that implies about some of the protocols. And there's a lot more code within the runtime system itself, to implement a variety of other features such as support for C++ exceptions, reducer hyperobjects, and a form of IDs called pedigrees. We won't talk about that today. I'm actually all out of time. Thanks for listening to all this about the Cilk runtime system. Feel free to ask any questions after class.