Lecture 16: Nondeterministic Parallel Programming


Description: Prof. Leiserson discusses nondeterministic parallel programming, atomicity implemented through mutexes, determinacy races, and data races. Deadlock and the dining philosophers problem are also discussed.

Instructor: Charles Leiserson

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today we're going to do some really cool stuff having to do with nondeterministic parallel programming. This is where the course starts to get hard. Because nondeterminism is really nasty. We'll talk about it a little bit. It's really nasty.

Parallel computing, as you know, is pretty easy, right? It's just work and span. Easy stuff, right? It makes sense. You can measure these things, can learn some skills around them, and so forth. But nondeterminism is nasty, really nasty.

So first let's talk about what we mean by determinism. So we say that a program is deterministic on a given input if every memory location is updated with a sequence-- the same sequence of values in every execution. So the program always behaves the same. And you may end up-- if it's a parallel program having different memory locations updated in different orders-- I may do A and then B, versus updating B and then A-- but if I look at a single memory location, A say, I'm always updating A with the same sequence of values.

There are lots of definitions of determinism. This is not the only one. There are some where people say, well, it only matters if the output is always the same. And there are others where you say not only does it have to be the same but every write to a location has to be in the same order globally. That turns out to be actually pretty hard, because if you have parallel computing you're not going to get them all updated the same unless you only have one processor executing instructions. And so we'll talk about this. We'll talk a little bit more about this kind of thing.

So why-- what's the big advantage of deterministic programs? Why should we care whether it's deterministic or nondeterministic? Sure.

AUDIENCE: It's repeatable.

CHARLES LEISERSON: It's repeatable. OK. So what?

AUDIENCE: [INAUDIBLE] a lot of programs [INAUDIBLE].

CHARLES LEISERSON: Why is that?

AUDIENCE: [INAUDIBLE] like a-- Why? Because sometimes that's what you want.

CHARLES LEISERSON: Because sometimes that's what you want. OK. That doesn't-- so if-- I mean, there's a lot of things I might sometimes want. Why is that important to want that? Yes.

AUDIENCE: Because consistency makes it easier to debug source code.

CHARLES LEISERSON: Yes. Makes it easier to debug. That's probably the number one reason, debugging. If it does the same thing every time, then if you have a bug, you can run it again. You expect to see the bug again. So every time you run through, hey, I get the same bug. But if it's nondeterministic, I get a bug, and now I go to look for it and the bug is nowhere to be found. Makes debugging a lot harder.

There are other reasons for wanting repeatability, so your answer is actually a broader correct answer. But the big advantage is in the specific application of repeatability to debugging. So here's the golden rule of parallel programming. Never write nondeterministic parallel programs. They can exhibit anomalous behaviors and it's hard to debug them. So never ever write nondeterministic programs.

Unfortunately, this is one of these things that is kind of hard in practice to do. So why might you want to write a nondeterministic program even though-- even when famous masters in the area of performance engineering, with highly credentialed-- numerous awards and so forth, tell you you shouldn't write nondeterministic programs? Why might you want to do it anyway? Yes.

AUDIENCE: You get better performance.

CHARLES LEISERSON: Yes. You might get better performance. That's one of the big ones. And sometimes you can't avoid it. The nature of the problem may be that it's not deterministic. You may have asynchronous inputs coming in and so forth.

So this is the golden rule. We also have a silver rule. The silver rule says: never write nondeterministic parallel programs, but if you must, always devise a test strategy to manage the nondeterminism. So this gets into: you had better have some way of figuring out what's going on if you have a bug.

So what are some of the typical test strategies that you could use to manage the nondeterminism? So imagine you've got a parallel program, and it's got races in it and so forth, and it's operating nondeterministically. And that's OK if everything's going right. But now you find a bug in the program. What are you going to do? What kinds of ideas do you have? Yes.

AUDIENCE: You could temporarily remove the nondeterminism.

CHARLES LEISERSON: Yes. You could turn off the nondeterminism. You put a switch in there that says, well, I know the source of this nondeterministic behavior. Let me do that. Let me give you an example of that.

For security reasons these days, when you allocate memory, it's allocated to different locations on different runs of the program. It's allocated in random places. They want to randomize the addresses when you call malloc. That means that you can end up with different behaviors from run to run, and that can compromise your performance. But it turns out that there is a compiler switch, and if you run it in debug mode it will always deliver the results of malloc in deterministic locations, where the locations of the things you're mallocing are repeatable.

So that's good because they're supported. They said, yes, we have to randomize for security reasons so that people can't deterministically exploit buffer overflow errors, for example, but I don't want to have to do that every time. So I don't want to randomize every time I run. I want to have the option of making it so that that randomization is turned off. So that's a good one.

What's another one that can be done? You're full of good ideas. Let's try somebody else for now. But I like that, I like that. What are some other ideas? What else can you do to handle nondeterminism? You got a program and it's-- yes, yes, yes.

AUDIENCE: If you use random numbers, use the same seed.

CHARLES LEISERSON: Yes. If you have random numbers, you use the same seed. In some sense that's kind of the same thing: you're turning off the nondeterminism. But that's a great one. There are other places. For example, if you call get-time-of-day for something in your program, you could have an option to put in a particular fixed value there, so you can make sure that even a serial program isn't nondeterministic. So that's good, but I also consider that to be another great example of turning the nondeterminism off and on. What other things can you do? Yes.

AUDIENCE: You could record the randomized outputs or inputs to determine correctness.

CHARLES LEISERSON: Yes. You can do record-replay for some things. Is that what you're saying? Is that what you mean? Or am I--

AUDIENCE: Maybe. [INAUDIBLE].

CHARLES LEISERSON: So record-replay says you run it through-- you can run it through with random numbers, but it's recording those things, so that when you run it again, instead of using the random numbers-- new random numbers, it uses the ones that you used to use. So that's the record-replay thing. Is that what you're saying, or are you saying something else? Yes, OK, good.

So that's using some tools. There are actually a lot of strategies. Let me just move on and answer. So another thing you can do is encapsulate the nondeterminism. So that's actually done in a Cilk runtime system already.

The runtime system is using a randomized scheduling strategy, but you don't see that it's random in the execution of your code if you have no race conditions in your code. It's encapsulated in the platform. So the platform is going to guarantee you deterministic results even though underneath the covers it's doing nondeterministic things.

You can also substitute a deterministic alternative. Sometimes there's a way of computing something that is nondeterministic, but in debug mode, ah, let me not use the nondeterministic one. And you can also use analysis tools, which can tell you things about your program and with which you can control things. So there's a lot of options.

So whenever you have a nondeterministic program, you want to find some way of controlling it. Often, the nondeterminism is over in this corner but your bug is over in this corner. So if you can turn this thing off in some way, or encapsulate it, or otherwise control the nondeterminism over there, now you have a better chance of catching the stuff over here.

That's going to be particularly important in project 4 when we get to it, because that's actually going to involve nondeterministic programming for a game-playing program. And one of the things is that the processors are, in this case, sharing the game positions they've seen. And so if one processor stores something into what's called a transposition table, which is essentially a big hash table of positions it's seen, another one can see that value and change its behavior. And so one of the things you want to be able to do is turn off the transposition table, so that you give up that performance advantage, but now you can debug the search code, or you can debug the evaluation code, and so forth.

You can also do things like unit testing so you know whether or not a particular piece is correct that might have-- so that you can test this thing separately from the rest of your system which may have nondeterminism. Anyway, this is a major thing. So never write them. But if you have to, you always want to have some test strategy. And so for people who are not watching this video and who are not in class today, they are going to be sorely hampered by not knowing this lesson when they go into the fourth project.

So what we're going to do is now we're going to talk about how to do nondeterministic programming. So this is-- there's always some part of your code which has a skull and crossbones. Like you have this abstraction. It's beautiful, and you can design, et cetera. And then somewhere there's this really ugly thing that nobody should know, and you put the skull and crossbones on that, and only experts go in. Well, anyway, that's the barrier we're crossing here.

And we're going to start out by talking about something that you've probably seen in some of the other classes, mutual exclusion and atomicity. So I'm going to use the example of a hash table. So here's a typical hash table. It's got collisions resolved by chaining. So you have a bunch of linked lists. You hash to a particular slot in the table, and then you chase down the linked list to find the value.

And so, for example, if I'm going to insert x which has a key value of 81, what I do is figure out which slot I go to by hashing the key. And then in this case I made it be the last one so that the animations could be easier than if it were in the middle. So now what do I do is I make the pointer of x go to the first element of that list, and then I make the slot value now point to x. And that effectively, with a constant number of operations, inserts x into the hash table, and in particular into the linked list in the slot that it's supposed to be. This is all familiar, right?
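To make that concrete, here's a minimal sketch of the insert in C. The table size, type names, and hash function here are illustrative, not from the lecture slides.

    #define NSLOTS 1024                     /* assumed table size */

    typedef struct node {
        int key;
        struct node *next;
    } node_t;

    node_t *table[NSLOTS];                  /* one list head per slot */

    static unsigned hash(int key) {         /* assumed hash function */
        return (unsigned)key % NSLOTS;
    }

    /* Insert x at the head of its slot's list: two pointer writes, O(1). */
    void insert(node_t *x) {
        unsigned s = hash(x->key);
        x->next = table[s];                 /* x points at the old first element */
        table[s] = x;                       /* the slot now points at x */
    }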

So now what happens when you have multiple parallel instructions that are accessing the same locations? So here we have two threads, one inserting x and one inserting y. And x goes, it does its thing. It hashes to there, and it then sets the next pointer to be the-- to add itself into the list.

And then there's this other thing going on in parallel which effectively says, oh, I'm going to hash. Oh, we're going to the same slot. It doesn't know that somebody is already there. And so then it decides it's going to put itself in as the first element of the list. And then it sets the value of y-- it sets the value of the slot to point to y.

And then along comes x, finishing off what it's doing, and it points the value to x. And you can see that we have a race bug here, a really nasty one because we've just destroyed the integrity of our system. We now have-- in particular, y is sort of floating, not in the list when it's supposed to be in the list.

So the standard solution to this is to make some of these instructions be atomic. And what that means is the rest of the system can never view them as being partially executed. So they either all have been executed or none of them have been executed at any point in time as far as the rest of the system is concerned. And the part of code that is within the atomic region is called the critical section. And, typically, a critical section of code is some place that should not be being executed by two things at the same time.

So the standard solution to atomicity is to use what's called a mutex lock, or a mutual exclusion lock. And it's basically an object with a lock and unlock member functions. And an attempt by a thread to lock an already locked mutex causes the thread to block-- that is, wait-- until the mutex is unlocked. So if somebody grabs the lock, somebody else grabs the lock and it's already taken, then they have to wait. And they sit there waiting until this guy says, yes, I'm going to release it.

So what we'll do now is we'll make each slot be a struct with a mutex L and a pointer, head, to the slot contents. So it's going to be the same data structure we had before, but now I have not just the pointer from the slot, I also have a lock in that position. And so the idea in the code now is that before I access the list, I'm going to lock that list in the table by locking the slot's mutex. Then I'll do the things that I need to do, and then I'll unlock it, and now anything else can go on.

Because the reason we're getting into trouble is that we've got some sort of interleaving of operations. And our goal is to make sure that it's either doing this or doing this, and never an interleaving of the two, so that each piece of code restores the invariant of correctness after it executes the pointer swaps. The invariant in this case is that the elements are in a list. And so you want to restore that with each one. So this is one way you can use mutexes to implement atomicity.
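As a sketch, the locked version of the insert might look like this in C with Pthreads. The names are again illustrative, and each slot's mutex is assumed to be initialized with pthread_mutex_init at startup.

    #include <pthread.h>

    #define NSLOTS 1024

    typedef struct node {
        int key;
        struct node *next;
    } node_t;

    typedef struct {
        pthread_mutex_t L;                  /* per-slot mutex */
        node_t *head;                       /* pointer to the slot's list */
    } slot_t;

    slot_t table[NSLOTS];

    /* The two pointer writes are now a critical section: holding the
       slot's mutex makes the insert atomic with respect to any other
       insert into the same slot. */
    void insert(node_t *x, unsigned s) {
        pthread_mutex_lock(&table[s].L);
        x->next = table[s].head;
        table[s].head = x;
        pthread_mutex_unlock(&table[s].L);
    }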

So now let's just go back. Who has seen mutexes before? Is that pretty much everybody? Yes. OK, good. I hope that this is not brand new for too many of you. If it is brand new, that's great. But what I'm trying to do is make it-- so let's go back a little bit and recall in this class our discussion of determinacy races.

So, remember, a determinacy race occurs when you have two logically parallel instructions that access the same memory location and at least one of them performs a write. So mutex locks can guarantee that critical sections behave atomically, but the resulting code is inherently nondeterministic because you've got a-- we had a race bug there. We had two things trying to access the same slot. But that may be what I want. I want to have a shared hash table maybe for these things. So I want something where there is a race, but I just don't want to have the anomalies that arise. In this case, the race bug caused things, and I can solve that with atomicity.

If you have no determinacy races, it means that the program is deterministic on that input and that it always behaves the same. And remember also that if a determinacy race exists in an ostensibly deterministic program, then Cilksan guarantees to find such a race. Now, if you put in mutexes, you still have a nondeterministic program. You still have a race. Because you have two things that are logically parallel that are both accessing the lock. That's a race. That's a determinacy race.

If you have two things, they're in parallel, they're both accessing the lock, that's a determinacy race. It may be a safe, correct one, but it is a determinacy race. And so any code that uses locks is nondeterministic by intention, and it invalidates the Cilksan guarantee of finding those race bugs. So you will end up with races in your code if you're not careful.

And so this is one reason it's important to have some way of turning off nondeterminism to detect stuff. Because what you don't want is a whole rash of false positives saying, oh, you raced on gathering this lock. Nor do you want to ignore that and then discover that a race has popped up somewhere else.

So this is basically talking about having a data race. And a data race is similar in definition to a determinacy race, but it requires that the two logically parallel instructions hold no locks in common. And then it's the same definition: if they access the same memory location and at least one of them performs a write, then you have a data race bug.

But if they do have a lock in common--if they both have acquired at least one lock that's the same--then you don't have a data race, because that means you've successfully protected the atomicity. But it is still nondeterministic, and there is a determinacy race, just no data race. And that's the big distinction between data races and determinacy races. And on quiz 2, you had better know the difference between data races and determinacy races, because they are different. So a program may have no data races. That doesn't mean that it doesn't have a determinacy race. In fact, if it's got any locks, it probably has a determinacy race.
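Here's a minimal sketch of that situation in Cilk: the two logically parallel strands race on the mutex itself, which is a determinacy race, but because both hold the lock around the shared counter, there is no data race. The names are illustrative.

    #include <cilk/cilk.h>
    #include <pthread.h>

    pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
    int counter = 0;

    void bump(void) {
        pthread_mutex_lock(&L);    /* logically parallel acquisitions of L:  */
        counter++;                 /* a determinacy race, but no data race,  */
        pthread_mutex_unlock(&L);  /* since both strands hold the same lock  */
    }

    int main(void) {
        cilk_spawn bump();
        bump();
        cilk_sync;
        return counter;            /* always 2; only the lock order varies */
    }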

So one of the things is, if I have no data races, does that mean I have no bugs? Suppose I have no data races in my code. Does that mean I have no bugs? This is like an obvious answer just by quizmanship, right?

So what might happen? Think about it a little bit. What might happen? How could I have no data races and yet there still be a bug, even though-- I'm assuming it's a correct piece of code otherwise. In other words, when it runs serially or whatever, it's correct. How could I end up having a code-- no data races but still have a bug?

AUDIENCE: It's still nondeterministic [INAUDIBLE].

CHARLES LEISERSON: Yes, but that doesn't mean it's bad, right?

AUDIENCE: Well, you said that it runs correctly serially.

CHARLES LEISERSON: Yes.

AUDIENCE: So the order that things are put in or generated might still be--

CHARLES LEISERSON: Might still be different, yes.

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: OK. Yes. Let me give you an example which is more to the point. Here is a way of making sure that I have no data race: I lock before I follow the table slot value. Then I unlock, and I lock again, and then I set the value. So I haven't provided atomicity. Right now I've got an atomicity violation, but I have no data races, because any two things that access the same location at the same time are protected by the lock. But it didn't give me atomicity, so you can definitely have no data races and still have bugs.
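A sketch of that broken pattern, reusing the slot structure from the earlier sketch: every access is protected by the lock, so there is no data race, yet the interleaving bug from the earlier slides is back.

    /* No data race--every access to the slot is under its lock--but an
       atomicity violation: another insert can slip in between the unlock
       and the relock, and its node is lost when head is overwritten. */
    void broken_insert(slot_t *slot, node_t *x) {
        pthread_mutex_lock(&slot->L);
        x->next = slot->head;            /* read head under the lock */
        pthread_mutex_unlock(&slot->L);  /* ...lock released here... */

        pthread_mutex_lock(&slot->L);
        slot->head = x;                  /* write head under the lock again */
        pthread_mutex_unlock(&slot->L);
    }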

But, usually, what happens is, if you have no data races, the programmer actually got the code right. It's one of these things where demonstrating no data races is in fact a very positive sign for your code. It doesn't prove the programmer got it right. But most of the time, the reason they're putting in the locks is to provide atomicity for something, and they usually get it right.

They don't always get it right. In fact, Java, for example, had a very famous bug early on in the way that it specified locking such that the-- you could look at the length of a string and then modify it, and then you would end up with a race bug because somebody else could swoop in in between. So they thought they were providing atomicity and they didn't.

So there's another set of issues here having to do with benign races. Now, there are some people who argue that no determinacy races are benign, and they make arguments that I actually find quite compelling about races and whether races can be benign. But, nevertheless, the literature also continues to use the term benign race for this kind of example.

So suppose we want to identify the set of digits that occur in some array. So here's an array with a bunch of values in it, each one being a digit from 0 to 9. So I write a little piece of code that runs through a digits array of length 10 and sets each entry to 0, meaning no digits seen yet. And now I go through the input--and I'm going to do this in parallel--and every time I see a value A of i--suppose A of i is 3--I set location 3 of the digits array to 1. Otherwise it stays 0, because that's what I initialized it to.
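A sketch of that code in Cilk; the function and array names follow the description above.

    #include <cilk/cilk.h>

    /* Mark which digits 0..9 appear in A[0..n-1]. The parallel loop has an
       intentional race: two iterations may write digits[d] at once, but
       they both write 1, so the result is the same either way. */
    void which_digits(const int *A, int n, char digits[10]) {
        for (int d = 0; d < 10; ++d)
            digits[d] = 0;              /* nothing seen yet */
        cilk_for (int i = 0; i < n; ++i)
            digits[A[i]] = 1;           /* benign race: same value written */
    }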

So here's the kind of thing I have. So, for example, both of those 6's, in parallel, are going to access location 6 to set it to 1. But they're both setting it to 1. It doesn't really matter what order they do it in. You're going to get the same value there, 1. And so there's a race, but maybe we don't much care about that race, because they're both writing the same value. We're not going to get an incorrect value.

Well, not exactly. We might get an incorrect value on some architectures. On the Intel x86 architecture, you won't get an incorrect value. But there are architectures where the array elements are not set atomically. So, for example, on the MIPS architecture, in order to set a byte to a particular value, you have to fetch the word containing it, mask out the old byte, set the new byte, and then store the word back.

And so if there are two guys who are basically operating on that same word location, they will have a race, even though in the code it looks like they're just setting bytes. Does that make sense? So nasty. Nasty bugs. That's why you should never do nondeterministic programming unless you have to.

So Cilksan allows you to turn off race detection for intentional races. So if you really meant there to be a race, as in this case, you can turn it off. This is dangerous but practical, it turns out. Because here's what can happen: you turn it off, but then there's something else that's using that same data, and now you're running Cilksan with detection turned off for exactly the place where an unintended race might be.

There are better solutions. So in Intel's Cilkscreen, there's the notion of fake locks. We just have not yet implemented it in the OpenCilk compiler and in Cilksan. We'll eventually get to doing that. And then people who take this class in the future will have an easier time, because we'll be able to check for that as well. So any questions about these notions?

So you can see the notions of races can get quite hairy and make it quite difficult to do your debugging, and in fact even can confound your tools that are supposed to be helping you to get correct code. All in the name of performance. But we like performance. Any questions? Yes.

AUDIENCE: So I don't really understand how some architectures can cause some error in race conditions.

CHARLES LEISERSON: Yes. So how can some architectures cause some error? So here's the thing, is that if I have a, let's say, a byte array, it may be that this is stored as a set of let's say four-byte words. And so although you may write that A of 0 gets 1, what it does is it says, let me fetch these four values, because there is no byte set instruction on some architectures. It can only set, in this case, 32-bit words.

So it fetches the values into a register. It then sets the byte in the register by masking, so it doesn't touch the other bytes. And then it stores the word back so that it has a 1 here. But what if somebody, at the same time, is storing into this location? They will fetch it into their own register, set their byte, mask it, et cetera. And now we're going to have a lost update in the writebacks. Does that make sense?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: OK. Good. Very good question. Yes, I know. I went through that orally a little bit quicker than maybe I should have. Great.
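In effect, each byte store on such a machine compiles into a word-wide read-modify-write, something like this sketch (the function name is hypothetical, and a little-endian byte layout is assumed for the shift):

    #include <stdint.h>

    /* What "A[i] = v" can become on a machine with no byte-store
       instruction. Two threads doing this to different bytes of the same
       word race on the word, and one thread's update can be lost. */
    void store_byte(uint8_t *p, uint8_t v) {
        uint32_t *word = (uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
        unsigned shift = ((uintptr_t)p & 3) * 8;
        uint32_t w = *word;                  /* fetch the whole word  */
        w &= ~((uint32_t)0xFF << shift);     /* mask out the old byte */
        w |= (uint32_t)v << shift;           /* set the new byte      */
        *word = w;                           /* store back: lost update
                                                if another thread raced */
    }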

So let's talk a little bit about implementation. I always like to take things down one level below what you necessarily need to know in order to do things. But it's helpful to sort of see how these things are implemented, because then that gives you a better sense at a higher level what your capabilities are and how things are actually working underneath.

So let's talk about mutexes. So here, first of all, understand there are lots of different mutexes. If you look at an operating system, they may have a half a dozen or more different mutexes, different locks that can provide mutual exclusion, or parameters that can be set for what kind of mutexes.

So the first basic difference in most mutexes is whether the mutex is yielding or spinning. A yielding mutex returns control to the operating system when it blocks. When a thread tries to get access to a given lock and is blocked, it doesn't just sit there spinning--spinning means I just sit there checking it and checking it and checking it and checking it.

Instead what it does is it says, oh, I'm doing useless work here. Let me go and return control to the operating system. Maybe there's another thread that can run at the same time, and therefore I'll give-- by switching myself out, by yielding my scheduling quantum, I will get better efficiency overall, because somebody-- some other thread that is capable of running can run at that point. So is that a clear distinction between spinning and yielding?

Another one is whether the mutex is reentrant or nonreentrant. A reentrant mutex allows a thread that is already holding a lock to acquire it again. A nonreentrant one deadlocks if the thread attempts to re-acquire a mutex it already holds. So I grab a lock, and now I go to a piece of code that says grab that lock.

So very simple. I can check to see whether I have-- if I want to be reentrant, I can check, do I have that lock already? And if I do, then I don't actually have to acquire it. I just keep going. But that's extra overhead. It's faster for me to have a nonreentrant lock, where I just simply grab the lock, and if somebody has got it, including me, then it's a deadlock. But now if there's the possibility that I could reacquire a lock, then that might not be safe. You have to worry about-- the program has to worry about that now. Is that clear, that one?

And then a final basic property of mutexes is whether they're fair or unfair. So here's the thing. It's the easiest to think about it in the context of spinning. I have several threads that basically came to the same lock, and we decided they're going to spin. They're just going to sit there continually checking, waiting for that lock to be free.

So when finally the guy who has it unlocks it, maybe I've got a half a dozen threads sitting there. One of them wins. And which one wins? Well, they're spinning. It could be any one of them. Then it has one.

And so the issue that can go on is you could have what's called a starvation problem, where some guy is sitting there for a really long time waiting while everybody else is continually grabbing locks out from under his or her nose. So with a fair mutex, basically what you do is you go for the one that's been waiting the longest, essentially. And so, therefore, you never have to wait more than for however many things were there when you got there before you're able to go. Question.

AUDIENCE: Why is that better?

CHARLES LEISERSON: It can be better because otherwise you may freeze out part of the computation--you may never get to do the thing that you want to do because there's something else always interfering with that part of the program's ability to make progress. This tends to be more of an issue in concurrent programming, where you have different programs that are trying to accomplish different tasks and you want to accomplish both tasks.

In parallel programming, we mostly deal with unfair spinning locks because they're the cheapest. And we just trust that we write our code so we don't have critical regions that are really long, so nobody ever has to wait a very long time. But, indeed, dealing with a contention issue, as we talked about last week, can make a difference. Good.

So here's an implementation of a simple spinning mutex in assembly language. So the first thing it does is check whether the mutex is free--its value is 0. So it compares the value of the mutex to 0. And if it is 0, it says, oh, it's free. Let me go get it. Then, to get the mutex, it moves a 1 into a register, and then it exchanges the mutex with that register, eax.

And then it compares to see whether or not it actually got the mutex. And if it didn't, then it goes back up to the top and starts again. And then the other branch is at the top there. It does this pause, and this apparently is due to a bug in x86 that they end up having to put this pause instruction in there. And then, otherwise, you jump to where the Spin_Mutex is and go again. And then, once you've done the Critical_Section, when you're done you free it by just setting it to 0.
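For reference, here's roughly the same test-and-test-and-set logic sketched in C11 atomics rather than assembly. The type and function names are illustrative; the comments map to the Spin_Mutex and Get_Mutex pieces just described.

    #include <stdatomic.h>

    typedef struct { atomic_int val; } mutex_t;   /* 0 = free, 1 = held */

    void spin_lock(mutex_t *m) {
        for (;;) {
            /* Spin_Mutex: a read-only test, spinning in our own cache */
            while (atomic_load_explicit(&m->val, memory_order_relaxed) != 0)
                ;   /* real code would execute a pause instruction here */
            /* Get_Mutex: atomic exchange; we hold the lock iff we
               swapped out a 0 */
            if (atomic_exchange_explicit(&m->val, 1,
                                         memory_order_acquire) == 0)
                return;
        }
    }

    void spin_unlock(mutex_t *m) {
        atomic_store_explicit(&m->val, 0, memory_order_release);
    }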

So the question here is-- so the exchange instruction is an atomic exchange. So it takes the register and the memory value and it swaps them, and you can't have anything come in. So one of the things that might have you confused a little bit here is, wait a second. I checked to see if the mutex is free, and then I tried to get it to test if I was successful. Why? Why can't I just start out by essentially going to get mutex?

I mean, why do I need any of the code between Spin_Mutex and Get_Mutex? So if I just started with Get_Mutex, I would move a 1 in. I would exchange, check to see if I could get it. If I had it, fine. Then I execute the end. If not, I would go back and try again.

So why-- because if somebody has it, by the way, the value that I'm going to get is going to be 1. And that's what I swapped in, so I haven't changed anything. I go back and I check again. So why do I need that first part? Yes.

AUDIENCE: Maybe it's faster to just get [INAUDIBLE].

CHARLES LEISERSON: Yes. Maybe it's faster. So, indeed, it's because it's faster. Even though you're executing extra code, it's faster. Tell me why it's faster. And this will take you-- you have to think a little bit about the cache protocols and the invalidation issue. So why is it going to be faster? Yes.

AUDIENCE: Because I do the atomic exchange.

CHARLES LEISERSON: OK, good. Say more.

AUDIENCE: Basically, just to exchange atomically, you have to have [INAUDIBLE]. And you bring it in only just to do a swap.

CHARLES LEISERSON: Yes. So it turns out the exchange operation is like a write. And so in order to do a write, what do I need to do for the cache line that it's on?

AUDIENCE: To bring it in.

CHARLES LEISERSON: To bring it in. But how does it have to be brought in? Remember, the cache lines have-- let's ima--

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: You have to invalidate it on the other ones, and you have to hold it in what state? Remember, the cache lines have states--if we take a look at just a simplified protocol, the MSI protocol.

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Yes. In MSI or MESI, you have to bring it in in modified or at least exclusive state. So exclusive is for the MESI protocol. We mentioned that but we didn't really go through it. But I have to bring it in in modified state, where I guarantee there are no other copies.

So if I've got two guys that are polling on this location, they're both continually invalidating each other, and you create a whole bunch of traffic on the memory network. That's going to slow everything down. Whereas if I do the first one, what state do I get it in?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Then you get it in shared state. What does the other guy get it in?

AUDIENCE: Shared.

CHARLES LEISERSON: Shared state. And now I keep going, just spinning on it in my own local cache, not generating any traffic on the memory network, until somebody releases the lock, in which case it invalidates all those copies. And now you can actually get a little bit of a storm after the fact. There are in fact locks where you don't even get a storm after the fact, called MCS locks. But this kind of lock is, for most practical purposes, just fine.

So everybody follow that description of what's going on there? So that first code, for correctness purpose, is not important. For performance, it is important. Isn't it great that you guys can read assembly language?

Now suppose that-- this is a spinning mutex. Suppose that I want to do a yielding mutex. How does this code have to change? So this is a spinning one. It just keeps checking. Instead, I want to return control to the operating system. So how does this code change if I do that? Yes.

AUDIENCE: Instead of the pause, [INAUDIBLE].

CHARLES LEISERSON: Like that. Yes, exactly. So instead of doing that pause instruction, which-- the documentation on this is not very clear. I'd love to have the inside scoop on why they really had to do the pause there.

But in any case, you take that no op that they want to have in there and you replace it with just a call to the yield, which allows the operating system to schedule something else. And then when it's your turn again, it resumes from that point. So that's the yield. So that's the difference in implementation, essentially, between a spinning mutex and a yielding mutex.
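As a sketch, the change to the C version above is just in the failure path: instead of pausing and respinning, call sched_yield, the POSIX call that gives up the scheduling quantum.

    #include <sched.h>          /* sched_yield */
    #include <stdatomic.h>

    typedef struct { atomic_int val; } mutex_t;

    void yield_lock(mutex_t *m) {
        for (;;) {
            if (atomic_load_explicit(&m->val, memory_order_relaxed) == 0 &&
                atomic_exchange_explicit(&m->val, 1,
                                         memory_order_acquire) == 0)
                return;         /* got the lock */
            sched_yield();      /* lock is held: give the quantum back */
        }
    }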

Now, there's another kind of mutex that is kind of cool which is called a competitive mutex. So think about it this way. I have competing goals. One is I want to get the mutex as quickly as possible after it's released. I don't want-- if it's unlocked, I don't want to sit there for a really long time before I actually acquire it. And, two, yes, but I don't want to sit there spinning for a really long time. And then-- because as long as I'm doing that, I'm taking up cycles and not accomplishing anything. Let me turn it over to some other thread that can use the cycles more effectively.

So there are those two goals. How can I get the best of both worlds here? Something that's close to the best of both worlds. It's not absolutely the best of both worlds, but it's close to the best of both worlds. What strategy could I do? So I want to claim it very soon.

So the point is that the spinning mutex achieves goal 1, and the yielding mutex achieved goal 2. So how can I-- what can I do to get both goals? Yes.

AUDIENCE: [INAUDIBLE] you could use some sort of message passing to [INAUDIBLE].

CHARLES LEISERSON: So you're saying use message passing to inform--

AUDIENCE: The waiting threads.

CHARLES LEISERSON: --the waiting threads. I'm thinking of something a lot simpler in this context. Because to do message passing properly, you actually need to use mutexes to implement it. So you want to be a little bit careful about that. But interesting idea. Yes.

AUDIENCE: Could you try using an interrupt?

CHARLES LEISERSON: Using an interrupt. How would you do that?

AUDIENCE: Like once the [INAUDIBLE].

CHARLES LEISERSON: Yes. So, typically, if you implement interrupts you also need some mutual exclusion to do it properly, but-- I mean, hardware will support that. That's pretty heavy-handed as well. There's actually a very simple solution.

I'm seeing familiar hands. I want to see some unfamiliar hands. Who's got an unfamiliar hand? I see. You raised your left hand that time instead of your right hand. Yes.

AUDIENCE: You try to have whichever one is closest to being back to the beginning of the cycle take the lock.

CHARLES LEISERSON: Hard to measure that, right? How would you write code to measure that? Yes. Hmm. Hmm. Yes. Go ahead.

AUDIENCE: I have a question, actually.

CHARLES LEISERSON: OK, good.

AUDIENCE: Why does it [INAUDIBLE]?

CHARLES LEISERSON: Why doesn't it have a?

AUDIENCE: [INAUDIBLE]. Why does yielding mutex [INAUDIBLE]?

CHARLES LEISERSON: Because if I yield--if I context switch--how long am I going to have to wait, typically, before I am scheduled again? When a code yields to the operating system, how often does the operating system normally do context switching? What's the rate at which it context switches for the different multiplexing of threads that it does onto the available processors? OK, that's going to be on the quiz. This is a numeracy thing. Yes. Do you know how frequently?

AUDIENCE: [INAUDIBLE] sub-millisecond [INAUDIBLE].

CHARLES LEISERSON: Not quite, but you're not off by more than an order of magnitude. So what are the typical rates that the system does context switching? So in human time, it's the blink of an eye. So it's actually around 10 milliseconds. So it does a hundred times a second. Some of them do. Some do 60 times a second. That's how often it switches.

Now, let's say it's a hundred times a second, 10 milliseconds. So you're pretty close. 10 milliseconds. How many orders of magnitude is that from the execution of a simple instruction? So we're going at more than a gigahertz. And so a gigahertz is 10 to the ninth, and we're talking 10 to the minus 9, and we're talking 10 to the minus 2. So that's 10 million instruction opportunities that we miss if we switch out.

And, of course, on average you'd probably only be switched out for half a quantum, depending on where you are in the cycle, assuming nothing else is going on there. But that means you're not grabbing the lock quickly after it's released, because you've got 10 million instructions that are going to execute before you have a chance to come back in and grab it. So that's why a yielding mutex does not grab it quickly.

Whereas spinning is like we're executing this stuff at the rate of gigahertz, checking again, checking again, checking again. So why-- so what's the strategy here? What can I do? Yes.

AUDIENCE: Maybe we could spin for a little bit and then yield.

CHARLES LEISERSON: Hey, what a good idea. Spin for a while and then yield. So the idea being, hey, if the lock is released soon, then I will be able to grab it immediately because I'm spinning. If it takes a long time for the lock to yield, well, I will yield eventually. So yes, but how long to spin? How long shall I spin? Sure.

AUDIENCE: Somewhere close to the amount of time it takes to yield and come back.

CHARLES LEISERSON: Yes. Basically as long as a context switch takes, as long as it takes to go out and come back. And if you do that, then you never wait more than twice the optimal time. This is competitive analysis, which the theoreticians have gone off-- there's brilliant work in competitive analysis.

So the idea here is that if the mutex is released while you're spinning, then this strategy is optimal, because you just sat there spinning, and as soon as it was free you got it on the next cycle. If the mutex is released after the yield, you've already spun for time equal to a context switch, so you'll come back and get it within at most a factor of 2 of optimal.
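In code, the spin-then-yield strategy might be sketched like this, reusing the mutex type from the earlier sketch. SPIN_ITERS is an assumed tuning knob meant to approximate one context-switch time, not a value from the lecture.

    #include <sched.h>
    #include <stdatomic.h>

    typedef struct { atomic_int val; } mutex_t;

    #define SPIN_ITERS 100000   /* assumed: calibrate to ~1 context switch */

    void competitive_lock(mutex_t *m) {
        for (;;) {
            /* Phase 1: spin for about the cost of a context switch... */
            for (long i = 0; i < SPIN_ITERS; ++i)
                if (atomic_load_explicit(&m->val,
                                         memory_order_relaxed) == 0 &&
                    atomic_exchange_explicit(&m->val, 1,
                                             memory_order_acquire) == 0)
                    return;
            /* Phase 2: ...then yield; total wait stays within 2x optimal */
            sched_yield();
        }
    }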

This is-- by the way, this shows up in the theory literature, if you're interested, is it's called the ski rental problem. And here's the idea. You're going to go-- your friends have persuaded you to go try skiing. Snow skiing, right? Pu-chu, pu-chu, pu-chu. Right?

And so you say, gee, should I buy the equipment or should I rent? After all, you may discover that you rent and then-- you buy it, and then you break your leg and never want to go back. Well, then, if you've bought it's been very expensive. And if you've rented, well, then you're probably better off. On the other hand, if it turns out you like it, you're now accumulating the costs going forward.

And so the question is, well, what's your strategy? And the idea is, well, let's look at what renting costs and what buying costs. Let me rent until it's equal to the cost of buying and then buy. And then I'm within a factor of 2 of having spent the optimal amount of money for-- because then if I break my leg after that, well, at least I-- I got-- I didn't spend more than a factor of 2. And if I get it before, then I've spent optimally. Yes.

AUDIENCE: So when you say how long a context switch takes, is that in milliseconds or--

CHARLES LEISERSON: Yes. 10 milliseconds. Yes. So spin for 10 milliseconds, and then switch. So now the point is that when you come back in, the other job's going to run for 10 milliseconds or whatever. So if you get switched out, then if the lock is released, you're going to be done in 20 milliseconds. And so you'll be within a factor of 2. And if it happens that the lock is released before then, you're right there to grab it.

Now, it turns out that there's a really clever randomized algorithm-- I love this algorithm-- from 1994 that achieves a competitive ratio of e over e minus 1 using a randomized strategy. And I'll encourage you, those of you have a theoretical bent, to go take a look at that. It's very clever. So, basically, you have some probability of, at every step, of whether you, at that point, decide to yield or continue spinning. And by using a randomized strategy, you can actually get this to e over e minus 1.

Questions about this? So this is sort of some of the basics. I'm glad we went over some of that, because everybody should know these basic numbers about what things cost. Because, otherwise, you don't know where to spend it. So context switching time is on the order of 10 milliseconds. How long is a disk access compared to-- yes. What's a disk access?

AUDIENCE: 150 cycles?

CHARLES LEISERSON: 150 cycles? Hmm, that's a--

AUDIENCE: Or is that the cache?

CHARLES LEISERSON: That would be accessing DRAM. Accessing DRAM, if it wasn't in cache, might be 150 cycles. So two orders of magnitude or so. So what about a disk access? How long does that take? Yes.

AUDIENCE: Milliseconds?

CHARLES LEISERSON: Yes. Several milliseconds. So 10 milliseconds or 5 milliseconds depending upon how fast your disk is. But, once again, it's on the order of milliseconds. So it's helpful to know some of these numbers, because, otherwise, where are you spending your time?

Especially, we're sort of doing performance engineering in the small, basically looking within the pro-- within a multicore processor. Most performance engineering is on all the stuff on the outside, dealing with networking, and file systems, and stuff where things are really costly, and where, if you actually have a lot of time, you can write a fast piece of code that can figure out how you should best deal with these slow parts of your system. So those are all sort of good numbers to know. You'll probably see some of them on quiz 2.

Deadlock. I mentioned deadlock earlier. Let's talk about what deadlock is and understand this. Once again, I expect some of you have seen this, but I still want to go through it because it's hugely important material. And this is the issue, that holding more than one lock at a time can be dangerous.

So imagine that thread 1 says, I'm going to lock A, lock B, execute the critical section, unlock B, unlock A, where A and B are mutexes. And thread 2 does something very similar. It locks B and locks A. Then it does the critical section, then it unlocks A and then unlocks B. So what can happen here?
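Written out as a sketch in Pthreads (hypothetical names matching the discussion):

    #include <pthread.h>

    pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

    void *thread1(void *arg) {
        pthread_mutex_lock(&A);       /* thread 1 grabs A first...        */
        pthread_mutex_lock(&B);       /* ...blocks here if thread 2 has B */
        /* critical section */
        pthread_mutex_unlock(&B);
        pthread_mutex_unlock(&A);
        return NULL;
    }

    void *thread2(void *arg) {
        pthread_mutex_lock(&B);       /* thread 2 grabs B first...        */
        pthread_mutex_lock(&A);       /* ...blocks here if thread 1 has A */
        /* critical section */
        pthread_mutex_unlock(&A);
        pthread_mutex_unlock(&B);
        return NULL;
    }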

So thread 1 locks A, thread 2 locks B. Thread 1 can't go and lock B because thread 2 has it. Thread 2 can't go and lock A because thread 1 has it. So they sit there, blocked. I don't care if they're spinning or yielding. They're not going anywhere. So this is the ultimate loss of performance. It's like-- it's incorrect. It's like you're stuck, you've deadlocked.

Now, there's three basic conditions for deadlock. Everybody understands this, right? Is there anybody who has a question, because just-- OK. There's three conditions you need for deadlock. The first one is mutual exclusion, that you're going to have exclusive control over the resources.

The second is nonpreemption. You don't release your resources. You hold until you finish using them. And three is circular waiting. You have a cycle of threads, in which each thread is blocked waiting for resources held by the next one. In this case, the resource is the lock.

And so if you remove any one of these constraints, you can come up with solutions that won't deadlock. So, for example, it could be that when I try to acquire a lock somebody else holds, I take it away from them. That could be one thing. Now, that may get into other issues--what if the holder is actually doing real work?--so all of these approaches have trade-offs. Or I could not insist on mutual exclusion, except that's exactly the kind of problem we're trying to solve. So these are generally the three conditions that are necessary in order to have a deadlock situation.

Now, in any discussion of deadlock, you have to talk about dining philosophers. When I was an undergraduate-- and I graduated in 1975 from Yale, a humanities school-- I was taught the dining philosophers, because, after all, philosophy is what you find at humanities schools. I mean, we have a philosophy department too. Don't get me wrong. But at Yale the humanities is huge. And so philosophy, I guess they thought this would appeal to the people who were not real techies in the background. I sort of like-- I was a techie in the midst of all these non-technical people.

Dining philosophers is a story of deadlock told by Tony Hoare based on an examination question by Edsger Dijkstra. And it's been embellished over the years by many, many, many retellers. And I like the Chinese version of this. There's versions where they use forks, but I'm going to-- this is going to be-- they're dining-- I'm going to say that they are eating noodles with chopsticks.

And there are n philosophers seated around the table, and between every plate of noodles there's a chopstick. And so in order to eat the noodles they need two chopsticks, which to me sounds very natural. And so here's the code for philosopher i. So he's a philosopher, so he starts by thinking for a while. And then he gets hungry, he or she gets hungry.

So the philosopher grabs the chopstick on the right-- on the left, sorry. And then he grabs the one on the right, which is i plus 1. But he has to do that mod n, because if it's the last one, you've got to go around and grab the first one. Then eats, and then it unlocks the two chopsticks. And now they can be used by the other dining philosophers because they don't think much about sanitation and so forth. Because they're too busy thinking, right?
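Here's that philosopher loop as a sketch in C; think and eat are placeholders, and each mutex in chopstick is assumed to be initialized with pthread_mutex_init.

    #include <pthread.h>

    enum { N = 5 };                 /* assumed number of philosophers */
    pthread_mutex_t chopstick[N];

    static void think(void) { /* ponder */ }
    static void eat(void)   { /* slurp  */ }

    /* Philosopher i, exactly as described: left chopstick, then right,
       mod n. This version can deadlock if everyone grabs their left
       chopstick at once. */
    void *philosopher(void *arg) {
        int i = *(int *)arg;
        for (;;) {
            think();
            pthread_mutex_lock(&chopstick[i]);            /* left         */
            pthread_mutex_lock(&chopstick[(i + 1) % N]);  /* right, mod n */
            eat();
            pthread_mutex_unlock(&chopstick[(i + 1) % N]);
            pthread_mutex_unlock(&chopstick[i]);
        }
    }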

But what happens? What's wrong with this solution? What happens? What can happen for this? It's very simple. I need two chopsticks. I grab one, I grab the other, I eat. One day, what happens? Yes.

AUDIENCE: Everyone grabs the chopstick to the left and they're all stuck with one chopstick.

CHARLES LEISERSON: Yes. They each grab the one on the left, and now they go for the one on the right. It's not there, and they starve. One day they all grab their left chopstick at once, and so we have the starving philosophers problem. So motivated by this problem-- yes, question.

AUDIENCE: Is there any way to temporarily unlock it? Like the philosopher could just hand the chopstick [INAUDIBLE].

CHARLES LEISERSON: Yes. So if you're willing to preempt, then that would be preemption. As I say, it's got to be nonpreemptive in order for deadlock to occur. In this case, yes. But you also have to worry in those cases. Could be, oh, well if I couldn't get both, let me put them both down.

But then you can have a thing that's called livelock. So they all pick up their left. They see the right one's busy, so they put it down so somebody else can have it. They look around. Oh, OK. Let me pick up one. Oh, no. OK. And so they still starve even though they've done that.

So in that kind of situation, you could put in a time delay. You could say-- let everybody pick a random number to have a randomized scheme so that we're not-- so there are other solutions if you don't insist on nonpreemption. I'm going to give you one where we have nonpreemption but we still avoid deadlock, and it's to go for that cyclic problem. So here's the idea.

Suppose that we can linearly order the mutexes, so that whenever a thread holds a mutex L sub i and attempts to lock another mutex L sub j, we have that L sub i comes before L sub j in this linear order. Then you can't have a deadlock.

So in this case, for the dining philosophers, you would, for example, number the chopsticks from 1 to n, or 0 to n minus 1, whatever. And then grab the smaller one and then grab the larger one. And then you would never have a deadlock. And so here's the proof. You know I like proofs. Proofs are really important. So I'm going to show you that if you do that, you can't have a cycle of waiting.

So suppose you had a cycle of waiting. We're in a situation where everybody is holding chopsticks, and one of them is waiting for another one, which is waiting for-- all the way around to the first one. That's what we need for deadlock to occur. So let me just look at what's the largest mutex on the cycle. Let's call that L max.

And suppose the thread holding L max is waiting on a mutex L held by the next thread in the cycle. Because locks are always acquired in increasing order, L must come after L max in the linear order. But then we have a mutex on the cycle that's bigger than the maximum one, a contradiction. So it's a very simple proof that you can't have deadlock if you grab the locks according to a linear order.

And so for this particular problem, what I do is, instead of grabbing the one on the left and one the right, as I say, you grab the smaller of the two and then grab the larger of the two. And then you're guaranteed to have no deadlock. Does that make sense?
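So the fix is a small change to the philosopher sketch above: compute the smaller and larger chopstick index, and lock in that order.

    /* Deadlock-free variant: lock the smaller-indexed chopstick first,
       so all acquisitions follow one linear order and no cycle of
       waiting can form. */
    void *ordered_philosopher(void *arg) {
        int i = *(int *)arg;
        int a = i, b = (i + 1) % N;
        int lo = a < b ? a : b;
        int hi = a < b ? b : a;
        for (;;) {
            think();
            pthread_mutex_lock(&chopstick[lo]);   /* smaller index first */
            pthread_mutex_lock(&chopstick[hi]);   /* then the larger     */
            eat();
            pthread_mutex_unlock(&chopstick[hi]);
            pthread_mutex_unlock(&chopstick[lo]);
        }
    }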

Now, if you're going to use locks in Cilk, you have to realize that in the operating-- in the runtime system of Cilk, they're doing-- they're using locks. You can't see them. They're encapsulated, as we talked about. The nondeterminism in Cilk is encapsulated. It's still going on underneath the covers. And if you start introducing your own nondeterminism through the use of locks you can run into trouble if you're not careful.

And let me give you an example. This is a situation-- you can deadlock your program in Cilk with just one lock. So here's an example of a code that does that. So main spawns off foo. And foo basically locks the lock L and then unlocks it. And, meanwhile, after it spawns off foo, the continuation goes and it locks L itself, and then does a sync, and then it unlocks it.
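That code looks roughly like this--a sketch using OpenCilk with a Pthreads mutex:

    #include <cilk/cilk.h>
    #include <pthread.h>

    pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;

    void foo(void) {
        pthread_mutex_lock(&L);    /* waits forever if main holds L */
        pthread_mutex_unlock(&L);
    }

    int main(void) {
        cilk_spawn foo();
        pthread_mutex_lock(&L);    /* continuation may grab L first */
        cilk_sync;                 /* main waits for foo, foo waits for
                                      L: deadlock with a single lock */
        pthread_mutex_unlock(&L);
        return 0;
    }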

So what happens here? We sort of have a situation like this, where the locking I've done with an open bracket, and an unlock, a release, I'm doing with a closed bracket. So I'm spawning off foo, which is the lower part there, and locking and unlocking. And up above unlocking then unlocking.

So what can happen here? I can go and I basically spawn off the child, but then I lock. And now the child goes and it says, whoops, can't-- foo is going to wait here because it can't grab the lock because it's owned by main. And now we get to the point where main has to wait for the sync, and the child is never going to complete because I hold the resource that the child needs to complete.

So don't hold mutexes across Cilk syncs. That's the lesson there. There are actually places where you can, but if you don't hold them across a sync, then you won't run into this particular problem. A good strategy is to hold mutexes only within strands, where there's no parallelism, so the time you hold the lock is bounded.

And also, that's a good idea generally because you want to hold mutexes as short amount of time as you possibly can. So, for example, if you have a big calculation and then you want to assign something atomically, don't put the big calculation inside the critical region. Move the calculation outside the critical region, do the calculation you need to do, and then acquire the locks just to do the interaction you need to set a value. And then you'll have a lot faster code because you're not holding up other threads for a long time.

And always try to avoid nondeterministic programming. But that's not always possible. So any questions about that? Then I want to go on a really interesting topic because it's a really recent research level topic, and that's to talk about transactional memory. Who's heard this term before? Anybody?

So the idea is to have something like database transactions, where the atomicity happens implicitly. You don't specify locks. You just say: this is a critical region, don't interrupt me while I do this critical region. The system works everything out.

Here's a good example of where it might be useful. Suppose we want to do a concurrent graph computation. And so you take people involved in parallel and distributed computing at MIT and you say, OK, I want to do Gaussian elimination on this graph. Now, you guys, I'm sure most of you know Gaussian elimination from the matrix context. Do you know what it means in a graph context?

So if you have a sparse matrix, you actually have a graph. And Gaussian elimination is a way of manipulating the graph, and you get exactly the same behavior as you get in the dense one. So I'll show you what it is. You basically pick somebody to eliminate.

[STUDENTS LAUGH]

And now what you do is look at all this vertex's neighbors. Those guys. And what you do is you eliminate that vertex-- bye bye-- and you interconnect all the neighbors with all the edges that don't already exist. And that's Gaussian elimination. And if you think of it in terms of matrix fashion, the question is, if you have a sparse matrix, where are you going to get fill in? What are the places that you need to update when you do a pivot in Gaussian elimination in a matrix?

So that's the basic notion of graph-- of doing Gaussian elimination. But now we want to deal with the concurrency. And the problem occurs if I want to eliminate two nodes at the same time. Because now they're adjacent to each other, and if I just do what I expressed, there's going to be all kinds of atomicity violations, et cetera. By the way, the reason I'm picking these two folks is because they're going to a better place.

So how do you deal with this? And so in transactional memory, what I want to be able to do is just simply say, OK, here's the thing that I need to be atomic. And so if I look at this code, it's basically saying who are my neighbors, and then let me identify all of the edges that need to be removed, the ones that I just showed you that we removed. Now let me get rid of the element v. And now, for all of the neighbors of u, let us add in the edge between the neighbor and-- between the pairs of neighbors. So that's basically what it's doing.

And I'd like to just say that's atomic. And so the idea is that if I express that as a transaction, then, on the transaction commit, all the memory updates in the critical region appear to happen at once. However, with transactions, remember, rather than forcing one to go forward, I can have transactions abort. So if I get a conflict, I'll abort one and restart it. And then the restarted transaction may take a different code path, because, after all, the graph may have been restructured underneath it. And so it may do something different the second time through than the first. It may also abort again and so forth.

So when you study transactional memory, there are a couple of definitions to know. One is a conflict: that's when you have two transactions that can't both complete, and one of them has to be aborted. And aborting, by the way, is once again violating the nonpreemption condition. Here we're going to preempt one of them, keeping all the state so that we can roll it back and restart it from scratch.

So contention resolution is deciding which of the two conflicting transactions to wait or to abort and restart, and under what conditions you do that. So the resolution manager has to figure out what happens in the case of contention. And then forward progress is avoiding deadlock of course, but also livelock and starvation. You want to make sure that you're going to make-- because what you don't want to have happen, for example, is that two transactions keep aborting each other and you never make forward progress. And throughput, well, you'd like to run as many transactions as concurrently as possible.

So I'm going to show you an algorithm for doing this. It's a really simple algorithm. It happens to be one that I discovered just a couple of years ago. And I was surprised that it did not appear in the literature, and so I wrote a very short paper on it. Because what happens for a lot of people is they-- if they discover there's a lot of aborting, they say, oh, well let's grab a global lock. And then if everybody grabs a global lock, you can do this sort of thing. You can't deadlock with a single lock if you're not also doing things like Cilk sync or whatever.

But, in any case, if you have just a single lock, everybody falls back to the single lock, and then you have no concurrency in your program, no performance, until everybody gets through the difficult time. So this is an algorithm that doesn't require a global lock. It assumes the transactional memory system will log the reads and writes. That's typically true of any transactional system: you log what reads and writes you're doing so that, when you abort, you can roll back, or else you sandbox the updates and then atomically commit them.

And so we have all the mechanisms for aborting and rolling back. These are all very interesting in their own right, and restarting. And this is going to basically use a lock-based approach that uses two ideas. One is the notion of what's called a finite ownership array, and another is a thing called release-sort-reacquire. And let me explain those two things, and I'll show you really quickly how this beautiful algorithm works.

So you have an array of anti-starvation mutual exclusion locks. So these are ones that are going to be fair, so that you always go to the oldest waiter. And you can do an acquire, but we're also going to add a try-acquire: tell me whether, if I tried to acquire, I would get it. That is, if I can get it, give it to me; if I can't, don't wait--just tell me that I didn't get it. And then there's a release.

And there's an owner function h that maps my universe of memory locations to the indexes in this finite ownership array, this lock array. So the array has length n--it has n slots in it. To lock a location x in the set of all possible memory locations, you actually acquire lock of h of x. So you can think of h as a hash function, but it doesn't have to be a fair hash function or whatever. Any function will do, although there will be some advantages to picking one function over another.

So rather than actually locking the location or locking the object, I lock a location that essentially I hash to from that object. So if two guys are trying to grab the same location, they will both grab the same lock because they've got the same hash function. But I may have inadvertent locks where if I were locking the objects themselves, I wouldn't have them both trying to acquire the same lock. That might happen in this algorithm.

So here's the idea. The first idea was the finite ownership array, which I just explained. The second idea is called release, sort, and reacquire. Before you access a memory location x, simply try to grab lock of h of x greedily. If you don't have a conflict, you get it--you just simply try to get it, and if you can, that's great.

If not, then what I'm going to do is roll back the transaction but don't release the locks I hold, and then release all the locks with indexes greater than h of x. And then I'm going to acquire the lock that I want. And now, at that point, I've released all the bigger locks, so I'm acquiring the next lock. And then I reacquire the released locks in sorted order. So I go through all the locks I released and I reacquire them in sorted order.

And then I start my transaction over again. I try again. So each time through this process, whenever I'm trying to acquire a lock, I'm only holding locks that are smaller. But each time that I restart, I hold one more lock than I held before I restarted my transaction, acquired in the linear order on that ownership array from 0 to n minus 1.
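Putting the two ideas together, the discipline around a single access might be sketched like this, in C-flavored pseudocode; try_acquire, the ownership array lock, the hash h, and the helper routines are the pieces described above, not a real API.

    /* One memory access inside a transaction, release-sort-reacquire style. */
    void transactional_access(location_t x) {
        if (try_acquire(&lock[h(x)])) {
            /* got it greedily: perform the read or write and continue */
        } else {
            rollback_transaction();           /* undo updates; keep locks */
            release_locks_greater_than(h(x)); /* drop locks with index
                                                 greater than h(x)        */
            acquire(&lock[h(x)]);             /* blocking, fair acquire   */
            reacquire_released_locks_in_sorted_order();
            restart_transaction();            /* retry, holding one more
                                                 lock than before         */
        }
    }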

And so here's the algorithm. I'll let you guys look at it in more detail, because I see our time is up. It's actually fun to take a look at, and we'll put the paper online. There's one other topic that I wanted to go through here, which you should know about: a locking anomaly called convoying. This was actually a performance bug that we had in our original MIT Cilk, so it's kind of a neat one to see, along with how we resolved it. And that's it.