The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Hey, everybody. It's my pleasure once again to welcome TB Schardl, who is the author of your Tapir compiler, to talk about the Cilk runtime system.

TAO SCHARDL: Thanks, Charles. Can everyone hear me in the back? Seems good? OK. Thanks for the introduction. Today I'll be talking about the Cilk runtime system. This is pretty exciting for me. This is a lecture that's not about compilers; I get to talk about something a little different for once. It should be a fun lecture.

Recently, as I understand it, you've been looking at storage allocation, both in the serial case and in the parallel case. And you've already done Cilk programming for a while at this point. This lecture, honestly, is a bit of a non sequitur in terms of the overall flow of the course, and it's also an advanced topic: the Cilk runtime system is a pretty complicated piece of software. But nevertheless, I believe you should have enough background to at least start to understand and appreciate some aspects of the design of the Cilk runtime system. So that's why we're talking about it today.

Just to quickly recall something that you're all, I'm sure, intimately familiar with by this point: what's Cilk programming all about? Well, Cilk is a parallel programming language that allows you to make your software run faster using parallel processors. And using Cilk is pretty straightforward. You may start with some serial code that runs in some running time, which we'll denote as Ts for parts of this lecture. If you want it to run in parallel using Cilk, you just insert Cilk keywords in choice locations. For example, you can parallelize the outer loop in this matrix-multiply kernel, and that will let your code run in time Tp on P processors. And ideally, Tp should be less than Ts.
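[The slide's kernel isn't reproduced in the transcript. The following is a minimal sketch of the kind of one-keyword change being described, assuming a simple square matrix-multiply loop nest; the names mm, C, A, B, and n are illustrative, not the slide's.]

    #include <cilk/cilk.h>

    /* Illustrative square matrix-multiply kernel. Swapping the outer
     * loop's "for" for "cilk_for" is the one-keyword change that
     * parallelizes the outer loop; the inner loops stay serial. */
    void mm(double *C, const double *A, const double *B, int n) {
        cilk_for (int i = 0; i < n; ++i)    /* serial version: for (int i = 0; ...) */
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }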
Now, just adding keywords is all you need to do to tell Cilk to execute the computation in parallel. What does Cilk do in light of those keywords? At a very high level, Cilk, and specifically its runtime system, takes care of scheduling and load-balancing the computation on the parallel processors and on the multicore system in general. So after you've denoted logical parallelism in the program using cilk_spawn, cilk_sync, and cilk_for, the Cilk scheduler maps that computation onto the processors. It does so dynamically at runtime, based on whatever processing resources happen to be available, and Cilk uses a randomized work-stealing scheduler, which guarantees that the mapping is efficient and the execution runs efficiently.

Now, you've all been using the Cilk platform for a while. In its basic usage, you write some Cilk code, possibly by parallelizing ordinary serial code; you feed that to a compiler; you get a binary; you run the binary with some particular input on a multicore system; and you get parallel performance. Today, we're going to look at how exactly Cilk works. What's the magic that goes on, hidden by the boxes on this diagram? The first simplification that we're going to break is that it's not really just Cilk source and the Cilk compiler. There's also a runtime system library, libcilkrts.so, in case you've seen that file or messages about that file on your system. Really, it's the compiler and the runtime library working together that implement Cilk's runtime system and do the work stealing and the efficient scheduling and load balancing.

Now, we might suspect that just taking a look at the code you get when you compile a Cilk program might tell us something about how Cilk works. Here's C pseudocode for the result when you compile a simple piece of Cilk code. It's a bit complicated; I think that's fair to say. There's a lot going on here. There was one function in the original program; now there are two. There are some new variables, and there are some calls to functions that look a little bit strange. There's a lot going on in the compiled result.
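[The compiled C pseudocode on the slide isn't reproduced in the transcript. Below is a rough sketch of the general shape such output takes under the Cilk Plus runtime ABI, for a function that spawns. It is pseudocode, not runnable as-is: the __cilkrts_* calls and CILK_SETJMP correspond to real runtime-ABI entry points (simplified here; the ABI's actual names are versioned), fib_helper is an invented name for the compiler-generated spawn helper, and flag checks and exception paths are omitted.]

    /* Original Cilk function, as transformed by the compiler: */
    int fib(int n) {
        __cilkrts_stack_frame sf;
        __cilkrts_enter_frame(&sf);            /* register this frame with the runtime */
        if (n < 2) {
            __cilkrts_leave_frame(&sf);
            return n;
        }
        int x, y;
        if (!CILK_SETJMP(sf.ctx))              /* save a continuation a thief could resume */
            fib_helper(&x, n - 1);             /* the cilk_spawn, via the helper */
        y = fib(n - 2);
        if (sf.flags & CILK_FRAME_UNSYNCHED)   /* the cilk_sync: wait for stolen children */
            if (!CILK_SETJMP(sf.ctx))
                __cilkrts_sync(&sf);
        int result = x + y;
        __cilkrts_pop_frame(&sf);
        __cilkrts_leave_frame(&sf);
        return result;
    }

    /* Second, compiler-generated function: a helper for the spawn site.
     * This is why one function in the source becomes two in the output. */
    void fib_helper(int *x, int n) {
        __cilkrts_stack_frame sf;
        __cilkrts_enter_frame_fast(&sf);
        __cilkrts_detach(&sf);                 /* make the parent frame stealable */
        *x = fib(n);
        __cilkrts_pop_frame(&sf);
        __cilkrts_leave_frame(&sf);
    }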
This isn't exactly easy to interpret or understand, and it doesn't even bring into the picture the runtime system library. You can find the runtime system library's source code online; it's a little less than 20,000 lines of code, and it's also kind of complicated. So rather than dive into the code directly, what we're going to do today is take a top-down approach to understanding how the Cilk runtime system works and some of its design considerations. We're going to start by talking about some of the required functionality that we need out of the Cilk runtime system, as well as some performance considerations for how the runtime system should work. And then we'll take a look at how the worker deques in Cilk get implemented, how spawning actually works, how stealing a computation works, and how synchronization works within Cilk. Does that all sound good? Any questions so far? This should all be review, more or less.

OK, so let's talk a little bit about required functionality. You've seen this picture before, I hope. This picture illustrates the execution model of a Cilk program. Here we have everyone's favorite exponential-time Fibonacci routine, parallelized using Cilk; the routine itself is sketched at the end of this section. This is not an efficient way to compute Fibonacci numbers, but it's a nice didactic example for understanding parallel computation, especially in the Cilk model. And as we saw many lectures ago, when you run this program on a given input, the execution of the program can be modeled as a computation dag, and this computation dag unfolds dynamically as the program executes. But I want to stop and take a hard look at exactly what that dynamic execution looks like when we've got parallel processors and work stealing coming into play.

So we'll stick with this Fibonacci routine, and we'll imagine, to start, that we've got just one processor on the system. We're going to use this one processor to execute fib(4), and it's going to take some time to do it, just to make the story interesting. So we start executing this computation, and that one processor executes the Fibonacci routine from the beginning up to the cilk_spawn statement as if it's ordinary serial code, because it is ordinary serial code. At this point the processor hits the cilk_spawn statement. What happens now? Anyone remember? What happens to the dag?

AUDIENCE: It branches down [INAUDIBLE]

TAO SCHARDL: It branches downward and spawns another process, more or less. The way we model that: the cilk_spawn is of the routine fib(n - 1), which in this case is fib(3). And so, like an ordinary function call, we get a brand new frame for fib(3), and that frame has some strand that's available to execute. But the spawn is not your typical function call; it actually allows some other computation to run in parallel. And so the way we model that in this picture is that we get a new frame for fib(3).
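[For concreteness, here is the Fibonacci routine whose execution is being traced, in the standard didactic form from earlier lectures; this sketch assumes Cilk Plus-style keywords.]

    #include <cilk/cilk.h>

    /* Exponential-time Fibonacci, parallelized with Cilk keywords. */
    int fib(int n) {
        if (n < 2) return n;
        int x = cilk_spawn fib(n - 1);  /* the spawned child: a new frame for fib(n - 1) */
        int y = fib(n - 2);             /* the continuation, which may run in parallel */
        cilk_sync;                      /* wait for the spawned child to finish */
        return x + y;
    }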