Lecture 23: High Performance in Dynamic Languages

Description: Professor Steven Johnson talks about a new dynamic language called Julia as an alternative to the two-language approach for interactive math. Julia is then compared to Python and C. 

Instructor: Steven Johnson

The following content is provided under a Creative Commons license. Your support will help MIT Open CourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT Open CourseWare at ocw.mit.edu.

CHARLES E. LEISERSON: Hey, everybody. Let's get going. Who here has heard of the FFT? That's most of you. So I first met Steve Johnson when he worked with one of my graduate students, now former graduate student, Matteo Frigo. And they came up with a really spectacular piece of performance engineering for the FFT, a system they call FFTW, for the Fastest Fourier Transform in the West. And for years and years it has been a staple; anybody doing signal processing will know FFTW.

So anyway, it's a great pleasure to welcome Steve Johnson, who is going to talk about some of the work that he's been doing on dynamic languages, such as Julia and Python.

STEVEN JOHNSON: Yeah. Thanks.

CHARLES E. LEISERSON: Is that pretty accurate?

STEVEN JOHNSON: Yeah. Yeah, so I'm going to talk, as I said, about high level dynamic languages and how you get performance in these. And so most of you have probably used Python, or R, or Matlab. And so these are really popular for people doing technical computing, statistics, and anything where you want kind of interactive exploration. You'd like to have a dynamically typed language where you can just type x equals 3. And then three lines later, you say, oh, x is an array. Because you're doing things interactively, you don't have to be stuck with a particular set of types. And there's a lot of choices for these.

But they usually hit a wall when it comes to writing performance critical code in these languages. And so traditionally, people doing some serious computing in these languages have a two-language solution. So they do high level exploration, and productivity and so forth in Python or whatever. But when they need to write performance critical code, then you drop down to a lower level language, Fortran, or C, or Cython, or one of these things. And you use Python as the glue for these low level kernels.

And the problem-- and this is workable. I've done this myself. Many of you have probably done this. But when you drop down from Python to C, or even to Cython, there's a huge discontinuous jump in the complexity of the coding. And you usually lose a lot of generality. When you write code in C or something like that, it's specific to a very small set of types, whereas the nice thing about high level languages is you can write generic code that works for a lot of different types.

So at this point, there's often someone who pops up and says, oh, well, I did performance programming in Python. And everyone knows you just need to vectorize your code, right? So basically, what they mean is you rely on mature external libraries that you pass a big block of data to. It does a huge amount of computation and comes back. And so you never write your own loops.

And this is great. If there's someone who's already written the code that you need, you should try and leverage that as much as possible. But somebody has to write those. And eventually, that person will be you. Because if you do scientific computing, eventually you inevitably run into a problem that you just can't express in terms of existing libraries very easily, or at all.

So this was the state of affairs for a long time. And a few years ago, starting in Alan Edelman's group at MIT, there was a proposal for a new language called Julia, which tries to be as high level and interactive-- it's a dynamically typed language-- as Matlab, or Python, and so forth, but a general purpose language like Python, very productive for technical work, so really oriented towards scientific numerical computing.

But you can write a loop, and you can write low level code in it that's as fast as C. So that was the goal. The first release was in 2013. So it's a pretty young language. The 1.0 release was in August of this year. So before that point, every year there was a new release, 0.1, 0.2, 0.3. And every year, it would break all your old code, and you'd have to update everything to keep it working.

So now they said, OK, it's stable. We'll add new features. We'll make it faster. But from this point onwards, at least until 2.0, many years in the future, it will be backwards compatible. So there's lots of-- in my experience, this pretty much holds up. I haven't found any problem where there was a nice highly optimized C or Fortran code where I couldn't write equivalently performing code in Julia, given enough time, right?

Obviously, if something is-- there's a library with 100,000 lines of code. It takes quite a long time to rewrite that in any other language. So there are lots of benchmarks that illustrate this. The goal of Julia is usually to stay within a factor of 2 of C. In my experience, it's usually within a factor of a few percent if you know what you're doing.

So there's a very simple example that I like to use, which is generating a Vandermonde matrix. So given a vector of values alpha 1, alpha 2, up to alpha n, you want to make an n by m matrix whose columns are just those entries raised to the zeroth power, first power, squared, cubed, and so forth, element-wise. All right, so this kind of matrix shows up in a lot of problems. So most matrix and vector libraries have a built-in function to do this.

In Python's NumPy, there is a function called numpy.vander to do this. And if you look at-- it's generating a big matrix. It could be performance critical. So can they implement it in Python? If you look at the NumPy implementation, it's a little Python shim that calls immediately into C. And then if you look at the C code-- I won't scroll through it-- it's several hundred lines of code. It's quite long and complicated.

And all that several hundred lines of code is doing is just figuring out what types to work with, like what kernels to dispatch to. And at the end of that, it dispatches to a kernel that does the actual work. And that kernel is also C code, but that C code was generated by a special-purpose code generation program. So it's quite involved to get good performance for this while still being somewhat type generic.

So their goal is to have something that works for basically any NumPy array and any NumPy type, of which there's a handful, maybe a dozen scalar types, that it should work with, all right? So if you're implementing this in C, it's really trivial to write 20 lines of code that implements this but only for double precision, a pointer to a double-precision array, all right? So the difficulty is being type generic in C.

So in Julia, here is the implementation. It looks at first glance roughly like what a C or Fortran implementation would look like. It's just implemented in the most simple way. It's just two nested loops. So basically, you loop across. And as you go across, you accumulate powers by multiplying repeatedly by x. That's all it is. And it just fills in the array.
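(For reference, here is a minimal sketch of what such a generic Vandermonde routine might look like-- the function name vander and the exact loop arrangement are my reconstruction, not the code on the slide:)

```julia
# A generic Vandermonde constructor: works for any element type with * and one().
function vander(x::AbstractVector{T}, m::Integer = length(x)) where {T}
    n = length(x)
    V = Matrix{T}(undef, n, m)
    for j = 1:n
        V[j, 1] = one(x[j])          # zeroth power: the multiplicative identity
    end
    for i = 2:m, j = 1:n
        V[j, i] = x[j] * V[j, i-1]   # accumulate powers column by column
    end
    return V
end
```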

The performance graph here is the time for the NumPy implementation divided by the time for the Julia implementation, as a function of n for an n by n matrix. For the first data point, I think there's something funny going on-- it's not really 10,000 times slower. But for a 10 by 10, 20 by 20 array, the NumPy version is actually 10 times slower, because that's basically the overhead imposed by going through all those layers.

Once you get to a 100 by 100 matrix, the overhead doesn't matter. And then all this optimized C code, code generation and so forth, is pretty much the same speed as the Julia code. Except the Julia code there, as I said, looks much like the C code would, except there's no types. It's vander(x). There's no type declaration. x can be anything.

And, in fact, this works with any container type as long as it has an indexing operation, and any numeric type-- it could be real numbers. It could be complex numbers. It could be quaternions, anything that supports the times operation. And there's also a call to one(). So one returns the multiplicative identity for whatever group you're in-- you need to have a 1, right? That's the first column.

That might be a different kind of 1 for a different object, right? It might be an array of matrices, for example. And then the 1 is the identity matrix. So, in fact, there are even cases where you can get significantly faster than optimized C and Fortran codes. So I found this when I was implementing special functions, so things like the error function, or the polygamma function, or the inverse of the error function.

I've consistently found that I can often get two to three times faster than the optimized C and Fortran libraries out there, partly because I'm smarter than the people who wrote those libraries, but-- no-- mainly because in Julia, I'm using basically the same expansions, the same series, the same rational functions that everyone else is using. The difference is that Julia has built-in techniques for what's called metaprogramming or code generation.

So usually, the special functions involve lots of polynomial evaluations. That's what they boil down to. And you can basically write code generation that generates very optimized inline evaluation of the specific polynomials for these functions that would be really awkward to write in Fortran. You'd either have to write it all by hand or write a separate program that wrote Fortran code for you. So you can do this. High level languages allow you to do tricks for performance that would be really hard to do in low level languages.
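(As an illustration of the kind of code generation he means-- this is not the actual special-function code-- Julia's built-in @evalpoly macro expands a fixed-coefficient polynomial into straight-line Horner-rule code at compile time:)

```julia
# Evaluate 1 - x/2 + x^2/4 - x^3/8 with an inlined Horner recurrence;
# the coefficients are baked into the generated code, with no loop at runtime.
poly_example(x) = @evalpoly(x, 1.0, -0.5, 0.25, -0.125)

poly_example(0.3)   # ≈ 0.8691
# @code_native poly_example(0.3) shows just a handful of multiply-add instructions
```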

So mainly what I wanted to talk about is to give some idea of why Julia can be fast. And to understand this, you also need to understand why Python is slow. And in general, what's going on in determining the performance in a language like this? What do you need in the language to enable you to compile it to fast code while still being completely generic, like this vander function, which works on any type? Even a user-defined numeric type or a user-defined container type will be just as fast. There's no privileged type-- in fact, if you look at Julia, almost all of Julia is implemented in Julia.

Integer operations and things like that, the really basic types-- most of that is implemented in Julia, right? Obviously, if you're multiplying two 32-bit integers, at some point it's calling an assembly language instruction. But even that call out to the assembly is actually in Julia. So at this point, I want to switch over to sort of a live calculation.

So this is from a notebook that I developed as part of a short course with Alan Edelman, who's sitting over there, [INAUDIBLE] on performance optimization in high level languages. And so I want to go through just a very simple calculation. Of course, in any language you would usually-- often-- have a built-in function for this. But it's just a sum function, just written up there.

So we need to have a list, an array, of n numbers, and we're just going to add them up. And if we can't make this fast, then we have real problems, and we're not going to be able to do anything in this language. So this is the simple sort of thing where, if someone doesn't provide it for you, you're going to have to write a loop to do it.

So I'm going to look at it not just in Julia but also in Python, in C, and Python with NumPy and so forth. So this document that I'm showing you here is a Jupyter notebook. Some of you may have seen this kind of thing. So Jupyter provides this really nice browser-based front end where I can put in equations, and text, and code, and results, and graphs all in one notebook document.

And you can plug in different languages. So initially, it was for Python. But we plugged in Julia. And now there's R, and there's 30 different-- like 100 different languages that you can plug in to the same front end. OK, so I'll start with the C implementation of this.

So this is a Julia notebook, but I can easily compile and call out to C. So I just made a string that has a 10-line C implementation. It's just the most obvious function that takes in a pointer to an array of doubles and its length. And it just loops over them and sums them up, just what you would do.

And then I'll compile it with gcc -O3, link it to a shared library, load that shared library in Julia, and just call it. So there's a function called ccall in Julia where I can just call out to a C library with zero overhead, basically. So it's nice because you have lots of existing C libraries out there. You don't want to lose them. So I just say ccall, and we call this c_sum function in my library. It returns a Float64. It takes two parameters, a size_t and a pointer to Float64.
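(A hedged sketch of how that looks in the notebook-- the helper names C_code, Clib, and c_sum are mine, and I'm assuming gcc is on the PATH, but Libdl and ccall are the standard Julia mechanisms:)

```julia
using Libdl   # provides the platform-specific shared-library extension

# The straightforward C implementation, held in a Julia string.
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += X[i];
    return s;
}
"""

# Compile to a temporary shared library by piping the string into gcc.
const Clib = tempname()
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

# Call it with essentially zero overhead; Julia passes a pointer to the array data.
c_sum(X::Array{Float64}) =
    ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

a = rand(10^7)
c_sum(a) ≈ sum(a)    # sanity check against Julia's built-in sum
```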

And I'm going to pass the length of my array and the array-- a Julia array, of course, is just a bunch of numbers, and it'll automatically pass a pointer to that under the hood. So do that. And I wrote a little function called relerr that computes the relative error, the fractional difference, between x and y.

And I'll just check it. I'll just generate 10 to the 7 random numbers in [0, 1) and compare that to the Julia answer, because Julia has a built-in function called sum that sums an array. And it's giving the same answer to 13 decimal places, so not quite machine precision, but there's 10 to the 7 numbers. So the error kind of accumulates as you add across.

OK, so as I'm calling it, it's giving the right answer. And now I want to just benchmark the C implementation and use that as kind of a baseline for this. This should be pretty fast for an array of floating point values. So there's a Julia package called BenchmarkTools. As you probably know from this class, benchmarking is a little bit tricky. So this will take something, run it lots of times, collect some statistics, and return the minimum time, or you can also get the variance and other things.

So I'm going to get that number. @btime is something called a macro in Julia. So it takes an expression, rewrites it into something that basically has a loop, and times it, and does all that stuff. OK, so it takes 11 milliseconds to sum 10 to the 7 numbers with a straight C loop compiled with gcc -O3, no special tricks, OK? And so that's 1 gigaflop, basically, a billion operations per second. So it's not hitting the peak rate of the CPU.

But, of course, there's additional calculations. This array doesn't fit in cache, and so forth. OK. So now let's-- before I do anything in Julia, let's do some Python. But I'll do a trick. I can call Python from Julia, so that way I can just do everything from one notebook, using a package I wrote called PyCall. And PyCall just calls directly out to libpython.

So with virtually no overhead, it's just like calling Python from within Python. I'm calling directly out to libpython functions. And I can pass any type I want, and call any function, and do conversions back and forth. OK, so I'm going to take that array. I'll convert it to a Python list object. I don't want to time the overhead of converting my array to a Python array, so I'll just convert ahead of time.

And just start with a built-in-- Python has a built-in function called sum. So I'll use the built-in sum function. And I'll get this PyObject for it. I'll call this pysum on the list and make sure it's giving the right answer. OK, the difference is 10 to the minus 13 again. And now let's benchmark it. Oops. There we go.
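(Roughly what that looks like with PyCall-- the relerr helper and the exact conversion step are my assumptions about the notebook:)

```julia
using PyCall, BenchmarkTools

a = rand(10^7)
apy_list = pycall(pybuiltin("list"), PyObject, a)   # a genuine Python list of floats
pysum = pybuiltin("sum")                            # Python's built-in sum as a PyObject

relerr(x, y) = abs(x - y) / abs(y)                  # fractional difference
relerr(convert(Float64, pysum(apy_list)), sum(a))   # ≈ 1e-13 for 10^7 numbers

@btime $pysum($apy_list)                            # benchmark the Python built-in
```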

So it takes a few seconds because it has to run it a few times and gather statistics. OK, so it takes 40 milliseconds. That's not bad. It's actually four or five times slower than C, but it's pretty good, OK? So why is it five times slower than C? Is it because-- the glib answer is, oh, well, Python is slow because it's interpreted.

But the sum function is actually written in C. Here's the C implementation of the sum function that I'm calling. And I'm just linking to the GitHub code. There's a whole bunch of boilerplate that just checks with the type of the object, and then has some loops and so forth. And so if you look carefully, it turns out it's actually doing really well. And the reason it does really well is it has a fast path. If you have a list where everything is a number type, so then it has an optimized implementation for that case. But it's still five times slower than C.

And they've spent a lot of work on it. It used to be 10 times slower than C a couple of years ago. So they do a lot of work on optimizing this. And so why aren't they able to get C speed? Since they have a C implementation of a sum, are they just dumb, you know? No. It's because the semantics of the data type prevent them from getting anything faster than that.

And this is one of the things you learn when you do performance in high level languages. You have to think about data types, and you have to think about what the semantics are. And that greatly constrains what any conceivable compiler can do. And if the language doesn't provide you with the ability to express the semantics you want, then you're dead. And that's one of the basic things that Julia does. So what is a Python list?

Right, so you have-- you can have three, four, right? A Python list is a bunch of objects, Python objects. But the Python numbers can be anything. They can be any type. So it's completely heterogeneous types. So, of course, a particular list like in this case can be homogeneous. But the data structure has to be heterogeneous because, in fact, I can take that homogeneous thing. At any point, I can assign the third element to a string, right? And it has to support that.

So think about what that means for how it has to be implemented in memory. So what this has to do-- this is a list of, in this case, three items. But what are those items? If they can be items of any type, they could be things like another array, right? They could be of different sizes and so forth. You don't want to have an array where everything is a different size, first of all. So it has to be an array of pointers, where the first pointer points to 3, the next one to 4, the next one to 2, all right?

But it can't just be that. It can't just be a pointer to-- if this is a 64-bit number, it can't just be a pointer to one 64-bit value in memory, because it has to know-- it has to somehow store what type the object is. So there has to be a type tag that says this is an integer. And this one has to have a type tag that says it's a string. So this is sometimes called a box.

So you have a value-- you have a type tag plus a value. And so imagine what even the most optimized C implementation has to do given this kind of data structure. OK, here's the first element. It has to chase the pointer and then ask what type of object it is, OK? Then depending on what type of object it is, I initialize my sum to that. Then I read the next object. I have to chase the second pointer, read the type tag, figure out the type. This is all done at run time, right?

And then, oh, this is another integer. That tells me I want to use the plus function for two integers, OK? And then I read the next value-- which plus function it's using depends upon the type of the object. It's an object-oriented language. I can define my own type. If it has its own plus function, it should work with sum.

So it's looking up the types of the objects at runtime. It's looking up the plus function at runtime. And not only that, but each time it does a loop iteration, it has to add two things and allocate a result. That result in general has to be another box, because the type might change as you're summing through. If you start with integers, and then you get a floating point value, and then you get an array, the type will change. So each time you do a loop iteration, it allocates another box.

So what happens is the C implementation has a fast path. If they're all integer types, I think it doesn't reallocate the box for the sum it's accumulating all the time. And it caches the value of the plus function it's using. So it's a little bit faster. But still, it has to inspect every type tag and chase all these pointers for every element of the array, whereas the C implementation of sum-- if you imagine what this compiles down to, for each loop iteration, what does it do?

It increments a pointer to the next element. At compile time, the types are all known to be Float64. So it loads a Float64 value into a register. And then it has a running sum in another register, calls one machine instruction to add to that running sum, and then it checks to see if we're done, an if statement there, and then goes on, all right? So it's just a few instructions in a very tight loop, whereas each loop iteration here has to be lots and lots of instructions to chase all these pointers and get the type tags-- and that's in the fast case where they're all the same type and it's optimized for that.

So where was I? So, wrong thing. So that's the Python sum function. Now many of you who've used Python know that there is another type of array, and there's a whole library called NumPy for working with numerics. So what problem is that addressing? So the basic problem is this data structure. This data structure-- as soon as you have a list of items that can be any type, you're dead, right? There's no way to make this as fast as a C loop over a double pointer.

So to make it fast, what you need to have is a way to say, oh, every element of this array is the same type. So I don't need to store type tags for every element. I can store a type tag once for the whole array. So there is a tag. There is a type, which is, say, Float64, OK? There is maybe a length of the array. And then there's just a bunch of values one after the other.

So this is just 1.0, 3.7, 8.9. And each of these is just 8 bytes, an 8-byte double in C notation, right? So it's just one after the other. So it reads the tag once, reads the length, and then it basically dispatches to the equivalent of my C code, all right? So once it knows the type and the length, OK, it runs this, OK?

And that can be quite fast. And the only problem is you cannot implement this in Python. So Python doesn't provide you a way to have that semantics, to have a list of objects where you say they all have to be the same type. There is no way to enforce that, or to inform the language of that, in Python, and then to tell it-- oh, since these are all the same type, you can throw away the boxes.

Every Python object looks like this. So there's no way to tell Python-- oh, well, these are all the same types. You don't need to store the type tags. You don't need to have pointers. You don't need to have reference counting. You can just slam the values into memory one after the other. It doesn't provide you with that facility, and so there's no way you can make a fast Python compiler that will do this.

So NumPy is implemented in C. Even with-- some of you are familiar with PyPy, which is an attempt to make a fast, like a tracing JIT for Python. So when they ported NumPy to PyPy, or attempted to, even then they could implement more of it in Python, but they had to implement the core in C.

OK, but given that, I can do this. I can import the NumPy module into Julia, get its sum function, and benchmark the NumPy sum function, and-- OK, again, it takes a few seconds to run. OK, and it takes 3.8 milliseconds. So the C was 10 milliseconds. So it's actually faster than the C code, a little over twice as fast, actually.

And what's going on is their C code is better than my C code. Their C code is using SIMD instructions. And at this point, I'm sure that you guys all know about these things, where you can read two numbers or four numbers into one giant register and in one instruction add all four numbers at once.

OK, so what about if we go in the other direction? We write our own Python sum function. So we don't use the Python sum implemented in C. We write our own in Python. So here is a little mysum function in Python. It only works for floating point-- oh, I initialize s to 0.0, so really it only accumulates floating point values. But that's OK.

And then I just loop for x in a, s equals s plus x, return s-- the most obvious thing you would write in Python. OK, and I checked that it works. Yeah, the error is 10 to the minus 13th. It's giving the right answer. And now let's time it.

So remember that C was 10 milliseconds. NumPy was 5 milliseconds. And then the built-in Python sum was like 50 milliseconds operating on this list. So that's C code operating on this list at 50 milliseconds. And now we have Python code operating on this list, and that is 230 milliseconds, all right? So it's quite a bit slower.

And it's because, basically, in the pure Python code, there's no way to implement this fast path that checks-- oh, they're all the same type, so I can cache the plus function and so forth. I don't think it's feasible to implement that. And so basically every loop iteration has to look up the plus function dynamically and allocate a new box for the result, and do that 10 to the 7 times.

Now, there's a built-in sum function in Julia. So [INAUDIBLE] benchmark that as a-- it's actually implemented in Julia. It's not implemented in C. I won't show you the code for the built-in one because it's a little messy, because it's actually computing the sum more accurately than the loop that I've done. So that's 3.9 milliseconds. So it's comparable to the NumPy code, OK? It's also using SIMD, so this is also fast.

So now, why can Julia do that? So it has to be that the array type, first of all, has the type attached to it. So you can see the type of the array is an Array of Float64. So there's a type tag attached to the array itself. So somehow, that's involved. So it looks more like a NumPy array in memory than a Python list.

You can make the equivalent of a Python list. That's called an Array of Any. So if I convert this to an Array of Any-- an Array of Any is something where the element types can be any Julia type. And so then it has to be stored as something like this, as an array of pointers to boxes. And when I do that-- let's see-- [INAUDIBLE] there it is-- then it's 355 milliseconds. So it's actually even worse than Python.

The Python people spent a lot of time optimizing their code paths for things that have to allocate lots of boxes all the time. So in Julia, it's usually understood that if you're writing optimized code, you're going to do it not on arrays of pointers to boxes. You're going to write it on homogeneous arrays or things where the types are known at compile time. OK, so let's write our own Julia sum function. So this is a Julia mysum function. There's no type declaration. It works on any container type.

I initialize s with this function called zero of the element type of a. So it initializes to the additive identity. So it will work on any container of anything that supports a plus function and has an additive identity. So it's completely generic. It looks a lot like the Python code, except there's no zero function in Python. And let's make sure it gives the right answer. It does.

And let's benchmark it. So this is the code you'd like to be able to write. You'd like to be able to write high level code that's a straightforward loop. Unlike the C code, it's completely generic, right? It works on any container, anything you can loop over, and anything that has a plus function, so an array of quaternions or whatever. And benchmarking it, it's 11 milliseconds. It's the same as the C code I wrote in the beginning.
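(A sketch of that generic function, plus the SIMD-tagged variant he turns to next-- this is my reconstruction of the notebook code, not the notebook itself:)

```julia
# Works on any iterable whose element type supports + and zero().
function mysum(a)
    s = zero(eltype(a))     # additive identity of the element type
    for x in a
        s += x
    end
    return s
end

# Same loop with the @simd annotation, telling LLVM it may vectorize it.
function mysum_simd(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end

using BenchmarkTools
a = rand(10^7)
@btime mysum($a)         # roughly the plain C loop's speed
@btime mysum_simd($a)    # roughly the SIMD NumPy sum's speed
```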

It's not using SIMD. So that's where the additional factors of 2 come from. But it's the same as the non-SIMD C code. And, in fact, if I want to use SIMD, there is a little tag, @simd, you can put on a loop to tell LLVM to try and vectorize the loop. Sometimes it can. Sometimes it can't. But something this simple, it should be able to vectorize. You don't need to hand code SIMD instructions for a loop this simple. Yeah?

AUDIENCE: Why don't you always put the [INAUDIBLE]?

STEVEN JOHNSON: So, yeah, why isn't it the default? Because most code, the compiler cannot auto-vectorize. So it increases the compilation time and often blows up the code size for no benefit. So it's really only relatively simple loops doing simple operations on arrays that benefit from SIMD. So you don't want it to always be there.

Yeah, so now it's 4.3 milliseconds. So it's about the same as the NumPy and so forth. It's a little slower than the NumPy, which is interesting. A year ago when I tried this, it was almost exactly the same speed as the NumPy. And since then, both the NumPy and the Julia have gotten better, but the NumPy got better more. So there's something going on with basically how well the compiler can use AVX instructions. We're still investigating what that is. But it looks like an LLVM limitation.

So it's still completely type generic. So if I make a random array of complex numbers, and then I sum them-- which one am I calling now? Am I calling the vectorized one? Mysum is the vectorized one, right? So complex numbers-- each complex number is two floating point numbers. So you'd naively think it should take about twice the time, right? Twice the number of operations to add the same number of complex numbers as the same number of real numbers.

Yeah, and it does. It takes about 11 milliseconds for 10 to the 7 numbers, which is about twice the 5 milliseconds it took for the same number of real numbers. And the code works for everything. So why? OK, so what's going on here? So we saw this mysum function. I'll just take out the SIMD for now. And we did all that. OK, and it works for any type. It doesn't even have to be an array.

So, for example, there is another container type called a Set in Julia, which is just an unordered collection of unique elements. But you can also loop over it. If it's a set of integers, you can also sum it. And I'm waiting for the benchmarks to complete. So let me allocate a set. There are no type declarations here-- mysum(a)-- there was nothing saying a has to be an array of a particular type. It doesn't even have to be an array, right? So a Set is a different data structure. And this is a set of integers.

The elements of a set are unique. If I add something that's already in the set, it won't add it twice. And it supports fast checking: is 2 in the set? Is 3 in the set? It doesn't have to look through all the elements. It's a hash table internally. But I can call my mysum function on it, and it sums up 2 plus 17 plus 6 plus 24, which is hopefully 49, right? So OK, so what's going on here?
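(Assuming the set he built held 2, 17, 6, and 24, the sketch looks like this:)

```julia
s = Set([2, 17, 6, 24])   # unordered collection of unique elements, hashed internally
push!(s, 17)              # already present, so the set is unchanged
2 in s                    # true: fast membership test, no linear scan
3 in s                    # false

mysum(s)                  # 2 + 17 + 6 + 24 == 49; any iterable with + and zero works
```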

So suppose you define-- one of the keys-- there are several things going on to make Julia fast. One key thing is that when you have a function like this mysum, or even a simpler function-- f of x equals x plus 1-- when I call it with a particular type of argument, like an integer, or an array of integers, or whatever, then it compiles a specialized version of that function for that type.

So here's f of x equals x plus 1. It works on any type supporting plus. So if I call f of 3-- here I'm passing a 64-bit integer-- when it does that, it says, OK, x is a 64-bit integer. I'm going to compile a specialized version of f with that knowledge. And then when I call it with a different type, 3.1, now x is a floating point number. It will compile a specialized version for that type.

If I call it with another integer, it says, oh, that version was already compiled. [INAUDIBLE] I'll re-use it. So it only compiles it the first time you call it with a particular type. If I call it with a string, it'll give an error because it doesn't know how to add plus to a string. So it's a particular-- OK, so what is going on?

So we can actually look at the compiled code. There are these macros called @code_llvm and @code_native. They say, OK, when I call f of 1, what's the-- do people know what LLVM is? You guys-- OK, so Julia compiles to LLVM bitcode first, and then that goes to machine code. So you can see the LLVM bitcode, or byte code, or whatever it's called, and you can see the native machine code. So here's the LLVM code that it compiles to [INAUDIBLE].

So it's basically one LLVM instruction, an add i64, which turns into one machine instruction-- a load effective address, which is actually doing a 64-bit addition. So let's think about what had to happen there. So you have f of x equals x plus 1.
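(The corresponding sketch-- @code_llvm and @code_native come from InteractiveUtils, which is loaded automatically at the REPL:)

```julia
f(x) = x + 1

f(3)      # specializes and compiles f for Int64, returns 4
f(3.1)    # a second specialization for Float64, returns 4.1

@code_llvm f(1)     # essentially one `add i64` instruction
@code_native f(1)   # the machine code, e.g. a single leaq on x86-64
```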

Now you want to compile for x is an Int64, a 64-bit integer-- or in Julia, we'd say x colon colon Int64. So colon colon means that this is of this type, OK? So this is the 64-bit integer type. So what does it have to do? It first has to figure out what plus function to call. So plus has lots of-- there is a plus for two matrices, a plus for lots of different things. So depending on the types of the arguments, it decides which plus function to call.

So it first realizes, oh, this is an integer, and this is also a 64-bit integer. So that means I'm going to call the plus function for two integers. So I'm going to look into that function. And then, oh, that one returns an Int64, so that's the return type for my function. And oh, by the way, this function is so simple that I'm going to inline it. So it's type specializing.

And this process of going from x is an integer to that, to figure out the type of the output, is called type inference, OK? So, in general, type inference is: given the types of the inputs, it tries to infer the types of the outputs and, in fact, of all the intermediate values as well.

Now what makes it a dynamic language is that this can fail. So in some languages, like ML or some other languages, you don't really declare types either. But they're designed so that, given the types of the inputs, you can figure out everything. And if it can't figure out everything, it gives an error, basically. It has to infer everything. So Julia is a dynamic language. This can fail, and it has a fallback. If it doesn't know the type, it can stick things in a box.

But obviously, the fast path is when it succeeds. And one of the key things is you have to try and make this kind of thing work: you have to design the language-- at least all the built-in constructs and the standard library, and in general the culture for people designing packages and so forth-- so that this type inference can succeed.

And I'll give a counterexample to that in a minute, right? And this works recursively. So now suppose I define a function g of x equals f of x times 2, OK? And then I call g of 1. So it's going to say, OK, x here is an integer, OK? I'm going to call f with an integer argument. Oh, I should compile f for an integer argument, figure out its return type, use its return type to figure out which times function to call, and do all of this at compile time, right? Not at runtime, ideally.

So we can look at the LLVM code for this. And, in fact-- so remember, f of x adds 1 to x, and then we're multiplying by 2. So the result computes 2x plus 2. And f is so simple that it inlines it. And then LLVM is smart enough that it does the multiply by 2 with one shift instruction, and then it adds 2. So it actually combines the times 2 and the plus 1. So it does constant folding, OK?
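(Continuing the sketch: the inlining and constant folding he describes can be seen directly:)

```julia
f(x) = x + 1       # from above
g(x) = f(x) * 2    # f inlines, and 2*(x + 1) folds to 2x + 2
h(x) = g(x) * 2    # folds further to 4x + 4

@code_llvm g(1)    # a shift (multiply by 2) plus an add of the constant 2
@code_llvm h(1)    # a shift (multiply by 4) plus an add of the constant 4
```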

And it can continue on. If you look at h of x equals g of x times 2, then that compiles to one shift instruction to multiply x by 4 and then adding 4. So this process cascades. You can even do it for recursive functions. So here's a stupid recursive implementation of the Fibonacci number calculation, right? So this is given n. It's an integer.

OK, if n is less than 3, it returns 1. Otherwise, it adds the previous two numbers. I can call it on the first 10 integers. Here's the first 10 Fibonacci numbers. There's also a cute notation in Julia. You can say fib dot. So if you do f dot of arguments, it calls the function element-wise on a collection and returns a collection. So fib.(1:10) returns the first 10 Fibonacci numbers.
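(A sketch of the recursive fib and the dot-broadcast call:)

```julia
# ::Integer here is a filter on the argument, not a performance hint.
fib(n::Integer) = n < 3 ? 1 : fib(n - 1) + fib(n - 2)

fib(10)        # 55
fib.(1:10)     # broadcasts element-wise: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```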

And I can call-- there's a macro called @code_warntype that'll tell me the output of type inference-- n is a 1 here. And it goes through-- this is kind of a hard to read format. But this is after one of the compiler passes called lowering. It's figured out the types of all the intermediate calls.

So here it's invoking Main.fib recursively. And it's figured out that the return type is also Int64. So it knows everything, OK? So you'll notice that here I declared a type. I've said that this is an integer, OK? I don't have to do that for type inference. This doesn't help the compiler at all, because it does type inference depending on what I pass. So what this is, is more like a filter.

It says that this function only accepts integers. If you pass something else, it should throw an error. Because if I pass 3.7, or any other number, the function would run-- you can check whether 3.7 is less than 3, you can call it recursively-- it would just give nonsense, right? So I want to prevent someone from passing nonsense to this.

So that's one reason to do a type declaration. But another reason is to do something called dispatch. So what we can do is we can define different versions of the function for different arguments. So, for example, a nicer version of that is a factorial function. So here is a stupid recursive implementation of a factorial function that takes an integer argument and just recursively calls itself on n minus 1. You can call it on 10 to get 10 factorial, OK?

If I want 100 factorial, I need to use a different type, not 64-bit integers. I need an arbitrary precision integer. And since I said it was an integer, if I call it on 3.7, it'll give an error. So that's good. But now I can define a different version of this. So actually, there is a generalization of the factorial to arbitrary real, in fact even complex, numbers called the gamma function. And so I can define a fallback that works for any type of number, which calls a gamma function from someplace else.

And then I can pass it a floating point value. If you take the factorial of minus 1/2, it turns out that's actually the square root of pi. So if I square it, it gives pi, all right? So now I have one function and I have two methods, all right? So these types here-- there's a hierarchy of types. So this is what's called an abstract type, which most of you have probably seen.
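(A hedged sketch of those two methods-- the name my_factorial and pulling gamma from SpecialFunctions.jl are my assumptions, not necessarily what the notebook does:)

```julia
using SpecialFunctions   # assumed source of the gamma function

my_factorial(n::Integer) = n < 2 ? one(n) : n * my_factorial(n - 1)  # integer method
my_factorial(x::Number)  = gamma(x + 1)                              # fallback for any Number

my_factorial(10)          # 3628800
my_factorial(big(100))    # exact 100!, via arbitrary-precision BigInt
my_factorial(-0.5)^2      # ≈ π, since (-1/2)! = √π
```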

So there's a type called Number. And underneath, there's a class of subtypes called Integer. And underneath that, there is, for example, Int64 for 64-bit integers, or Int8 for 8-bit integers. And underneath Number there's actually another subtype called Real, and underneath that there's a couple of subtypes. And then there's, say, Float64, or Float32 for a single-precision 32-bit floating point number, and so forth. So there's a hierarchy of these things.
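(You can poke at this hierarchy interactively; supertype is in Base and subtypes in InteractiveUtils, which the REPL loads for you:)

```julia
Int64 <: Integer       # true
Integer <: Number      # true
Float64 <: Real        # true
Real <: Number         # true
supertype(Int64)       # Signed, which sits under Integer
subtypes(Real)         # e.g. AbstractFloat, Integer, Rational, ...
```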

When I specify that something can take an Integer, this type is not to help the compiler. It's to provide a filter. So this method only works for these types. And this other method-- my second method-- works for any Number type. So I have one thing that works for any number. Whoops. Here it is-- one that works for any Number type, and one method that only works for Integers. So when I call it on 3, which one does it call? Because both methods are applicable.

And what it does is it calls the most specific one. It calls the one that's farthest down the tree. So if I have a method defined for Number and one defined for Integer, and I pass an integer, it'll do this one. If I have one that's defined for Number, one that's defined for Integer, and one that's defined specifically for Int8, and I pass an 8-bit integer, it'll call that version, all right? So it gives you a filter.

But, in general, you can do this on multiple arguments. So this is like the key abstraction in Julia, something called multiple dispatch. So this was not invented by Julia. I guess it was present in Common Lisp and Dylan. It's been in a bunch of languages. It's been floating around for a while. But it's not been in a lot of mainstream languages, not in a high performance way. And you can think of it as a generalization of object-oriented programming.

So I'm sure all of you have done object-oriented programming in Python or C++ or something like this. So in object-oriented programming, typically the way you think of it is you have an object, and it's usually spelled object.method(x, y), for example, right? And what it does is the object's type determines the method, right?

So you can have a method called plus, but it would actually call a different class's function for a complex number or a real number, something like that. Or a method called length, which for a Python list would call a different function than for a NumPy array, OK? In Julia, the way you would spell the same thing is you'd say method(object, x, y). You wouldn't say object dot method.

So you don't think of the object as sort of owning the method, all right? In Julia, the object would just be maybe the first argument. In fact, under the hood, if you look in Python, for example, the object is passed as an implicit first argument called self, all right? So it actually is doing this. It's just a different spelling of the same thing. But as soon as you read it this way, you realize what Python and what OO languages are doing is they're looking at the type of the first argument to determine the method. But why just the first argument?

In a multiple dispatch language, you look at all the types. So in Julia, that OO style will sometimes be called single dispatch, because determining the method is called dispatch-- figuring out which function spelled length you are actually calling, dispatching to the right function. So this is called multiple dispatch.

And it's clearest if you look at something like a plus function. So a plus function, if you do a plus b, which plus you do really should depend on both a and b, right? It shouldn't depend on just a or just b. And so it's actually quite awkward in OO languages like Python, or especially C++, to overload a plus operation that operates on sort of mixed types. As a consequence, for example, in C++ there's a built-in complex number type.

So you can have a complex float, or a complex double-- complex with different real types. But you can't add a complex float to a complex double. You can't add a single-precision complex number to a double-precision complex number, or do any mixed complex operation, because it can't figure out who owns the method. It doesn't have a way of doing that kind of promotion, all right?

So in Julia, you can now have a method for adding a Float32 to a Float32, but also a method for adding a-- I don't know-- let's see, adding a complex number to a real number, for example, or a real number to a complex number. You want to specialize things. In fact, we can click on the link here and see the code.

So adding a complex number to a real number in Julia looks like this. It's the most obvious thing. It's implemented in Julia. Plus of a Complex and a Real creates a new complex number, but you only have to add the real part. You can leave the imaginary part alone. And this works on any complex type. OK, so there are too many methods for plus to show-- OK, I can shrink that. Let's see.
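(Roughly the shape of that Base method-- follow the link in the notebook for the exact definition:)

```julia
# What Base defines, approximately:
#   +(z::Complex, x::Real) = Complex(real(z) + x, imag(z))
# i.e. only the real part is touched, for any Complex element type.

(1 + 2im) + 3.5          # 4.5 + 2.0im: the mixed-type method handles promotion
@which (1 + 2im) + 3.5   # shows which method in complex.jl is dispatched to
```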

So there's another type inference thing-- I'll just mention it briefly. So one of the things you have to do to make this type inference process work is, given the types of the arguments, you have to figure out the type of the return value, OK? So that means when you design a function, it has to be what's called type stable. The type of the result should depend on the types of the arguments and not on the values of the arguments, because the types are known at compile time. The values are only known at runtime.

And it turns out, in C, you have no choice but to obey this. But in a dynamic language like Python or Matlab, if you're not thinking about this, it's really easy to design things so that it's not true. So a classic example is the square root function. All right, so suppose I pass an integer to it-- let's do square root of 5, all right? The result has to be a floating point number, right? It's 2.23-something.

So if I do square root of 4, of course, that square root is an integer. But if I returned an integer for that, then it wouldn't be type stable anymore. Then the return type would depend on the value of the input, whether it's a perfect square or not, all right? So it returns a floating point value, even if the input is an integer. Yes?

AUDIENCE: If you have a lot of methods defined for a bunch of different types, the lookup of the function-- can that become really slow?

STEVEN JOHNSON: Well, so the lookup happens at compile time. So it's really kind of irrelevant-- at least if type inference succeeds. If type inference fails, then it happens at runtime, and it's slower. But it's like a tree search. So it's not as slow as you might think. But most of the time you don't worry about that, because if you care about performance, you want to arrange your code so that type inference succeeds.

So you prototype maybe-- this is something that you do in performance optimization. When you're prototyping, you don't care about types. You say x equals 3, and the next line you say x equals an array, whatever. But when you're optimizing your code, then, OK, you tweak it a little bit to make sure that things don't change types willy-nilly and that the return types of functions depend on the types of the arguments, not on the values.

So, as I mentioned, square root is what really confuses people at first, is if you take square root of minus 1, you might think you should get a complex value. And instead, it gives you an error, right? And basically, what are the choices here? It could give you an error. It could give you a complex value. But if it gave you a complex value, then the return type of square root would depend upon the value of the input, not just the type.

So Matlab, for example, if you take square root of minus 1, it will happily give you a complex number. But as a result-- Matlab has a compiler, right? But it has many, many challenges. One simple thing to understand is that if the Matlab compiler sees a square root function anywhere in your function, even if it knows the inputs are real, it doesn't know if the outputs are complex or real unless it can prove that the inputs were non-negative, right?

And that means it then has to compile two code paths for the output, all right? But then suppose it calls square root again, or some other function, right? You quickly get a combinatorial explosion of possible code paths because of possible types. And so at some point, you just give up and put things in a box. But as soon as you put things in a box, and you're looking up types at runtime, you're dead from a performance perspective.

So, in Julia, if you want a complex result from square root, you have to give it a complex input. In Julia, the imaginary unit is actually spelled im-- they decided i is too useful for loop variables, and so is j-- so im is the imaginary unit. And if you take the square root of a complex input, it gives you a complex output.
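(The behavior he's describing, as a quick sketch:)

```julia
sqrt(5)            # 2.23606... : a Float64, even though the input is an integer
sqrt(4)            # 2.0        : still Float64, so the return type stays type-stable
# sqrt(-1)         # throws a DomainError instead of silently going complex
sqrt(-1 + 0im)     # 0.0 + 1.0im: complex input gives a complex output
im^2               # -1 + 0im  : im is the imaginary unit
```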

So Python actually does the same thing. In Python, if you take the square root of a negative value, it gives an error unless you give it a complex input. But Python made other mistakes. So, for example, in Python, an integer is guaranteed never to overflow. If you add 1 plus 1 plus 1 over and over again in Python, eventually you'd overflow the size of a 64-bit integer, and Python will just switch under the hood to an arbitrary precision integer, which probably seemed like a nice idea at the time.

And the rest of Python is so slow that the performance cost of this test makes no difference in typical Python code. But it makes it very difficult to compile. Because that means if you have integer inputs and you see x plus 1 in Python, the compiler cannot just compile that to one instruction unless it can somehow prove that x is sufficiently small.

So in Julia, integer arithmetic will overflow. But the default integer arithmetic is 64 bits. So in practice that never overflows unless you're doing number theory. And you usually know if you're doing number theory, and then you use arbitrary precision integers. It was much worse in the days-- this is something people worried about a lot before you were born, when there were 16-bit machines, right? And it's really, really easy to overflow 16 bits, because the biggest signed value is then 32,767, right?

So it's really easy to overflow it. So you're constantly worrying about overflow. And even with 32 bits, the biggest signed value is about 2 billion. It's really easy to overflow that, even just counting bytes, right? You can have files that are bigger than 2 gigabytes easily nowadays. So people worried about this all the time. But now there are 64-bit integers.

Basically, a 64-bit integer will never overflow if it's counting objects that exist in the real universe, like bytes, or loop iterations, or something like that. So then you just say, OK, it's either 64 bits, or you're doing number theory and you should use big ints. So OK. So let me talk about-- the final thing I want to talk about-- let's see, how much [INAUDIBLE] good-- is defining our own types.

So this is the real test of the language, right? So it's easy to make a language where there is a certain built-in set of functions and built-in types, and those things are fast. So, for example, for Python, there actually is a compiler called numba that does exactly what Julia does. It looks at the arguments, type specializes things, and then calls llvm and compiles it to fast code.

But it only works if your only container type is a NumPy array and your only scalar type is one of the dozen or so scalar types that NumPy supports. If you have your own user-defined number type or your own user-defined container type, then it doesn't work. And user-defined container types-- it's probably easy to understand why those are useful. User-defined number types are extremely useful as well.

So, for example, there's a package in Julia that provides a number type called dual numbers. And those have the property that if you pass them into a function, they compute the function and its derivative. It basically carries around function and derivative values, and has slightly different plus and times operations and so forth that just apply the sum rule and the product rule and so forth. And it just propagates derivatives. And then if you have Julia code, like that Vandermonde function, it will just compute its derivative as well.
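(A toy sketch of the idea-- not the actual package code; real implementations, such as the dual numbers in ForwardDiff.jl, are much more complete:)

```julia
# Forward-mode dual number: carries a value and a derivative together.
struct Dual{T<:Real} <: Real
    val::T   # f(x)
    der::T   # f'(x)
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)              # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der*b.val + a.val*b.der)  # product rule
Base.:+(a::Dual, b::Real) = Dual(a.val + b, a.der)
Base.:*(a::Dual, b::Real) = Dual(a.val * b, a.der * b)

x = Dual(3.0, 1.0)    # seed: value 3, derivative dx/dx = 1
x*x + 2.0             # Dual(11.0, 6.0): x^2 + 2 and its derivative 2x at x = 3
```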

OK, so I want to be able to define my own type. So a very simple type that I might want to add would be points, 2D vectors in two-space, right? So, of course, I could have an array of two values. But an array is a really heavyweight object for just two values, right? If I know at compile time that there are two values, I don't need to have a pointer to them-- I can actually store them in registers. I can unroll the loop over them, and everything should be faster.

You can get an order of magnitude in speed by specializing on the number of elements for small arrays, compared to just a general array data structure. So let's make a point, OK? And I'm going to go through several iterations, starting with a slow iteration. I'm going to define a mutable struct. OK, so this will be a mutable object where I can add a [INAUDIBLE] point that has two values, x and y. They can be of any type.

I'll define a plus function that can add them. And it does the most obvious thing. It adds the x components, adds the y components. I'll define a zero function, the additive identity, that just returns the point (0, 0). And then I can construct an object Point1(3, 4). I can say Point1(3, 4) plus Point1(5, 6). It works. Right now it's very generic, probably too generic.
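(A reconstruction of that first, deliberately naive version-- the name Point1 follows the lecture, but the exact code is mine:)

```julia
mutable struct Point1   # mutable, and the fields can hold values of any type
    x
    y
end

Base.:+(p::Point1, q::Point1) = Point1(p.x + q.x, p.y + q.y)
Base.zero(::Type{Point1}) = Point1(0, 0)

Point1(3, 4) + Point1(5, 6)   # Point1(8, 10): it works...
Point1("hello", [1, 2])       # ...but it also accepts fields that make no sense
```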

So the x can be a floating point number here and the y a complex number of two integers, or I can even make one a string and the other an array. It doesn't even make sense. So I probably should have restricted the types of x and y a little bit, just to prevent the user from putting in something that makes no sense at all, OK?

So these things can be anything. So this type is not ideal in several ways. So let's think about how this has to be stored in memory. So this is a Point1-- Point1(1, 3.7), right? So in memory, there is an x and there is a y. But x and y can be of any type. So that means they have to be pointers to boxes. There's a pointer to the integer 1, and there's a pointer to a Float64, in this case 3.7.

So, oh, already we know this is not going to be good news for performance. And it's mutable. So that mutable struct means if I take p equals a point, I can then say p dot x equals 7. I can change the value, which seems like a harmless thing to do, but actually it's a big problem. Because, for example, if I make an array of three p's, and then I say p dot y equals 8, and I look at that array, it has to change the y component, OK?

So, in general, if I have a p that's looking at that object, and this is an object you can mutate, it means that if I have another reference, a q, that's also pointing to the same object, and I say p dot x equals 4, then q dot x had better also be 4 at that point. So to have mutable semantics, to have the semantics of something you can change where other references can see that change, that means this object has to be stored in memory on the heap, as a pointer to the two values, so that if you have another pointer to the same object and I mutate it, the other references see it.

It can't just be stuck in a register or something like that. It has to be something that other references can see. So this is bad. So I can call Point1 dot of a, which calls the constructor element-wise. a is this array of 10 to the 7 random numbers I was benchmarking before. That was taking 10 milliseconds, OK?

And I can sum it. I can call the built-in sum function on this. I can even call my mysum function on this, because it supports a zero function and it supports a plus. So here, I have an array-- if I just go back up, I have an array here of 10 to the 7 values of type Point1. So the type Point1 is attached to the array.

So the array in memory-- I have an Array of Point1, and the 1 here means it's a one-dimensional array; there's also 2D, 3D, and so forth. That looks like a Point1 value, a Point1 value, a Point1 value. But each one of those now-- sorry-- has to be a pointer to an x and a y, which themselves are pointers to boxes. All right, so summing is going to be really slow, because there's a lot of pointer chasing. It has to check at runtime what's the type of x, what's the type of y.

And, in fact, it was. Instead of 10 milliseconds, it took 500 or 600 milliseconds. So to do better, we need to do two things. So, first of all, for x and y, we have to be able to see what type they are, OK? They can't be just any arbitrary old thing that has to be a pointer to a box, OK? And the point object can't be mutable. It has to be something where, if I have p equals something and q equals something, I can't change p and expect q to see it. Otherwise, if it's mutable, those semantics have to be implemented as some pointer to an object someplace. Then you're dead.

So I can just say struct -- the struct now is not mutable; it doesn't have the mutable keyword. And I can give the fields types: I can say they're both Float64. So x and y are both the same type, both 64-bit floating point numbers. I'll define plus the same way, zero the same way, and now I can add them and so forth.
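
Presumably something like this (the name Point2 is assumed to follow the notebook's numbering):

    struct Point2          # no `mutable` keyword: instances cannot be modified
        x::Float64         # concretely typed fields, stored inline, 8 bytes each
        y::Float64
    end

    Base.:+(p::Point2, q::Point2) = Point2(p.x + q.x, p.y + q.y)
    Base.zero(::Type{Point2}) = Point2(0.0, 0.0)

    p = Point2(1.0, 3.7)
    p + Point2(2.0, 4.0)   # Point2(3.0, 7.7)
    # p.x = 6.0            # ERROR: immutable struct of type Point2 cannot be changed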

But now if I make an array of these things, and I say p.x = 6, it will give an error. It says you can't actually mutate it -- don't even try, because we can't support those semantics on this type. But that means that if you look at that type in memory, what the compiler is allowed to do -- and what it does do -- is that if you have an array of Point2s, it looks like just the x value, the y value, the x value, the y value, and so forth, where each of those is exactly one 8-byte Float64.

And all the types are known at compile time. So if I sum them, it should take about 20 milliseconds, right? Because summing real numbers was 10 milliseconds, and this is twice as many additions -- you have to sum the x's and sum the y's. So let's benchmark it and see.

Oh, actually, summing the real numbers took 5 milliseconds, so this should take about 10. Let's see if that's still true. Yeah, it took about 10. So actually the compiler is smart enough. First of all, it stores this inline as one big consecutive block of memory.

And then when you sum them -- remember our sum function; well, this is the built-in sum, but our sum function will work the same way -- the compiler, LLVM, will be smart enough to say, oh, I can load x into a register, load y into a register, and have a tight loop where I basically use one instruction to sum the x's, one instruction to sum the y's, and then repeat. And so it's about as good as you could get.

But you've paid a big price: we've lost all generality, right? These can only be two 64-bit floating point numbers. I can't have two single-precision numbers -- this is like a struct of two doubles in C. So if I had to do this to get performance in Julia, it would suck. I'd basically be writing C code in a slightly higher level syntax, and I'd be losing the benefit of using a high level language.

So the way you get around this is, what you want is to define something like this Point2 type, but you don't want to define just one type. You want to define a whole family of types: this thing for two Float64s, for two Float32s -- in fact, an infinite family of types, of two things of any type you want, as long as they're two real types.

And the way you do that in Julia is a parametrized type. This is called parametric polymorphism; it's similar to what you see in C++ templates. So now I have a struct -- it's not mutable -- called Point3, but with curly braces T: it's parametrized by a type T. So x and y -- I've restricted it slightly here; I've said x and y have to be the same type. I didn't have to do that; I could have had two parameters, one for the type of x and one for the type of y.

But most of the time, when you're doing something like this, you'd want them both to be the same type. They could be both Float64s, both Float32s, both integers, whatever. So T is any type that -- the less-than-colon means "is a subtype of" -- so T is any subtype of Real. It could be Float64. It could be Int64. It could be Int8. It could be BigFloat. It could be a user-defined type. It doesn't care.

So this Point3 is really not one type -- it's a whole hierarchy. I'm not defining one type; I'm defining a whole set of types. There's a Point3{Int64}, there's a Point3{Float32}, a Point3{Float64}, and so on -- infinitely many types, as many as you want. Basically, it'll create more types on the fly just by instantiating them.

So, for example, otherwise it looks the same. The plus function is basically the same -- I add the x components and the y components. The zero function is the same, except now I make sure the zeros are of type T, whatever that type is. And now if I say Point3(3, 4), I'm instantiating a particular instance of this. Point3 itself is an abstract type; that particular instance will have one of these concrete types, and the concrete type it has in this case is a Point3 of two Int64s, two 64-bit integers, OK?
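
A sketch of the parametrized version (names again guessed to match the lecture's numbering):

    struct Point3{T<:Real}    # a whole family of types: one for each real type T
        x::T
        y::T
    end

    Base.:+(p::Point3, q::Point3) = Point3(p.x + q.x, p.y + q.y)
    Base.zero(::Type{Point3{T}}) where {T} = Point3(zero(T), zero(T))

    Point3(3, 4)          # instantiates the concrete type Point3{Int64}
    Point3(3.0, 4.0)      # Point3{Float64}
    Point3(1//2, 3//4)    # Point3{Rational{Int64}} -- any subtype of Real works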

And I can add them. And actually, adding mixed types will already work, because the plus function here works for any two Point3s. I didn't say they had to be Point3s of the same type -- they could be one of these and one of those. And then it determines the type of the result by the type of the [INAUDIBLE]. It does type inference.

So if you have a Point3 of two Int64s and a Point3 of two Float64s, it says, oh, p.x is an Int64, q.x is a Float64 -- which plus function do I call? There is a plus function for that mix, and it promotes the result to Float64. So that means that sum is a Float64, the other sum is a Float64 -- oh, then I'm creating a Point3 of Float64s. So this kind of mixed promotion is done automatically, and you can define your own promotion rules in Julia as well. And I can make an array.
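
Concretely, with the sketch above, something like this happens:

    p = Point3(3, 4)       # Point3{Int64}
    q = Point3(5.0, 6.5)   # Point3{Float64}
    p + q                  # the scalar + promotes Int64 + Float64 to Float64,
                           # so the result is Point3{Float64}(8.0, 10.5)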

And so now I have an array of Point3{Float64}, so this type is attached to the whole array. And it's not an arbitrary Point3 -- it's a Point3 of two Float64s. So it gets stored again as just 10^7 elements of x, y, x, y, where each one is 8 bytes, 8 bytes, 8 bytes, one after the other. The compiler knows all the types, and when you sum it, it knows everything at compile time. It will sum these things by loading this into a register, loading this into a register, and calling one instruction to add them.

And so the sum function should be fast. We can call the built-in sum function, and we can call our own sum function. Our own sum function -- I didn't put @simd here, so it's going to be twice as slow. But yeah. Yeah?

AUDIENCE: Will this work with SIMD?

STEVEN JOHNSON: Yeah. Yeah. In fact, if you look, the built-in sum function is implemented in Julia. It just hasn't [INAUDIBLE] SIMD on the sum. So yeah, LLVM is smart enough that if you give it a struct of two values and load them -- and you tell it that you're adding these two values to these two values -- it will actually use SIMD instructions, I think. Oh, maybe not.

No, wait. Did my sum use SIMD? I'm confused. I thought it did.

AUDIENCE: [INAUDIBLE] removed it.

STEVEN JOHNSON: I thought I removed it. Yeah, so maybe it's not smart enough to use SIMD. I've seen some cases where it's smart enough to use it -- huh, yeah. OK, so they're the same speed. OK, no, I take it back. So maybe LLVM is not smart enough in this case to use SIMD automatically. We could try putting the @simd annotation there and try it again. I thought it was, but maybe not.

Let's see. Let's put @simd in, redefine that, and then just rerun this. It'll notice that I've changed the definition and recompile it. But @btime runs it multiple times -- the first time it calls it, it's slow because it's compiling it, but it takes the minimum over several runs. So let's see.

Yeah, this is the problem in general with vectorizing compilers: they're not that smart if you're using anything other than just an array of an elementary data type. Yeah, no, it didn't make any difference, so I take it back. For more complicated data structures, you often have to use SIMD instructions explicitly. And there is a way to do that in Julia, and there's a higher-level library on top of that: you can basically create a tuple, and then when you add things it will do SIMD acceleration. But yeah.
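
For reference, the kind of hand-written sum being benchmarked here presumably looks something like this sketch (the function name and the exact benchmark calls are assumptions):

    using BenchmarkTools

    # Generic sum over anything whose elements support zero() and +;
    # @simd gives the compiler permission to reorder the additions and vectorize.
    function mysum(A)
        s = zero(eltype(A))
        @simd for i in eachindex(A)
            @inbounds s += A[i]
        end
        return s
    end

    A3 = [Point3(rand(), rand()) for _ in 1:10^7]   # Vector{Point3{Float64}}
    @btime mysum($A3)
    @btime sum($A3)      # the built-in sum, for comparison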

So anyway, that's the upshot here. The story of why Julia can be compiled to fast code is a combination of lots of little things, but there are a few big things. One is that it specializes code at compile time. But, of course, you could do that in any language, so it relies on designing the language so that you can do type inference.

It relies on having these kinds of parametrized types, and on giving you a way to talk about types and attach types to other types. So the array -- now that you understand what these little braces mean, you can see that the array is defined in Julia as another parametrized type. It's parametrized by the type of the element and also by the dimensionality. So it uses the same mechanism to attach types to an array. And the array type in Julia is implemented mostly in Julia, and there are other packages that implement their own kinds of arrays that have the same performance.
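
For example, with the standard built-in array type:

    typeof([1.0, 2.0, 3.0])    # Array{Float64, 1}: element type and dimensionality
    typeof(rand(4, 5))         # Array{Float64, 2}

    Vector{Float64} === Array{Float64, 1}    # true: Vector is just an alias
    Matrix{Float64} === Array{Float64, 2}    # true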

One of the goals of Julia is to build in as little as possible, so that there's not some set of privileged types that the compiler knows about while everything else is second class. User code is just as good as the built-in code -- and, in fact, the built-in code is mostly just implemented in Julia; there's a small core implemented in C for bootstrapping, basically. Yeah. So: having parametrized types, and -- another technicality -- having all concrete types be final in Julia.

A concrete type is something you can actually store in memory. So a Point3{Int64} is something you can actually have, right? An object of two integers has that type, so it's concrete -- as opposed to this thing, which is an abstract type. You can't actually have one of these; you can only have instances of the concrete types. And a concrete type is final.
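
A small illustration of what "concrete" and "final" mean here (using the Point3 sketch from above):

    isconcretetype(Point3{Float64})   # true: a fixed memory layout you can instantiate
    isabstracttype(Real)              # true: only its concrete subtypes have instances

    # Concrete types are final -- declaring a subtype of one is rejected:
    # struct MyPoint <: Point3{Float64} end   # ERROR: invalid subtyping in definition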

It's not possible to have a subtype of a concrete type, because if you could, you'd be dead: this is an array of these things, and the compiler has to know they're actually these things and not some subtype. Whereas in other languages, like Python, you can have subtypes of concrete types. So even if you said this is an array of a particular Python type, the compiler wouldn't really know it's that type -- it might be some subtype of it. That's one of the reasons you can't implement NumPy in Python: there's no way at the language level to say this is really that type and nothing else. Yeah?

AUDIENCE: How does the compilation in Julia work?

STEVEN JOHNSON: Oh, yeah. So it's calling LLVM. Basically, there are a few passes, and one of the fun things is you can actually inspect all the passes and practically intercept all of them. So, of course, when you type code like this, first it gets parsed, OK? And macros -- those things [INAUDIBLE] -- are actually functions that are called right after parsing. They can just take the parsed code and rewrite it arbitrarily.

So they can extend the language that way. So it gets parsed, maybe rewritten by a macro, and then you get an abstract syntax tree. Then, when you call it -- let's say f(3) -- it says, oh, x is an integer, an Int64, and it runs a type-inference pass. It tries to figure out the type of everything, which version of plus to call, and so forth.

Then it decides whether to inline some things. Once it's done all that, it spits out LLVM bitcode, calls LLVM, and compiles it to machine code. And then it caches that someplace, so the next time you call it -- you call f(4), f with another integer -- it doesn't repeat the same process; it notices it's cached. So yeah, at the lowest level, it's just LLVM.
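
You can actually watch these stages yourself with the standard reflection macros (available by default in the REPL and Jupyter; in a script they come from the InteractiveUtils standard library):

    f(x) = 2x + 1

    @code_lowered  f(3)   # after parsing, macro expansion, and lowering
    @code_warntype f(3)   # after type inference: the inferred type of every expression
    @code_llvm     f(3)   # the LLVM IR handed to LLVM for this specialization
    @code_native   f(3)   # the final machine code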

So there are tons of things I haven't shown you. I mentioned metaprogramming: it has this macro facility, so you can basically write syntax that rewrites other syntax, which is really cool for code generation. You can also intercept things after the type-inference phase. You can write something called a generated function, which works like this: at parse time, a macro knows how things are spelled, and it can rewrite how they're spelled, but it doesn't know what anything actually means. It knows x is a symbol; it doesn't know whether x is an integer or whatever. It just knows the spelling.

But when you actually compile f(x), at that point it knows x is an integer. So you can write something called a generated, or staged, function that basically runs at that time and says, oh, you told me x is an integer; now I'll rewrite the code based on that. And this is really useful -- there are some cool facilities for multidimensional arrays, because the dimensionality of the array is actually part of the type.

So you can say, oh, this is a three-dimensional array, I'll write three loops; oh, you have a four-dimensional array, I'll write four loops. It can rewrite the code depending on the dimensionality with code generation. So you can have code that generates any number of nested loops depending on the types of the arguments, and all the generation is done at compile time, after type inference, so it knows the dimensionality of the array.
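
The standard illustration of this (adapted from the Base.Cartesian documentation, not from this lecture's notebook) is a sum whose loop-nest depth depends on N:

    using Base.Cartesian    # @nloops / @nref: macros that generate loop nests

    # The body below runs at compile time, once per concrete (T, N), after type
    # inference, and returns code; since N is known, @nloops expands to N nested loops.
    @generated function mysumall(A::Array{T,N}) where {T,N}
        quote
            s = zero(T)
            @nloops $N i A begin
                s += @nref $N A i
            end
            return s
        end
    end

    mysumall(rand(3, 4, 5))    # compiles a 3-deep loop nest for this 3-D array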

And yeah, there are lots of fun things like that. Of course, it has parallel facilities. They're not quite as advanced as Cilk at this point, but that's the direction they're heading. There's no global interpreter lock like in Python -- there's no interpreter. So there's a threading facility, there's a pool of workers, and you can thread a loop. And the garbage collection is threading-aware, so that's safe.
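
A minimal sketch of that threading facility (start Julia with multiple threads, e.g. via the JULIA_NUM_THREADS environment variable or the --threads flag, for this to use more than one thread):

    using Base.Threads

    function squares!(out, x)
        @threads for i in eachindex(x)    # iterations are split across the thread pool
            @inbounds out[i] = x[i]^2
        end
        return out
    end

    x = rand(10^6)
    squares!(similar(x), x)
    nthreads()    # number of threads this Julia session was started with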

And they're gradually building more and more powerful runtimes, hopefully eventually hooking into some of Professor Leiserson's advanced threading compiler technology -- the Tapir compiler, or whatever it is. And also, most of what I do in my research is more coarse-grained, distributed-memory parallelism, running on supercomputers and things like that.

And there's MPI. There is a remote procedure call library. There's different flavors of that. But yeah. So any other questions? Yeah?

AUDIENCE: How do you implement the big number type?

STEVEN JOHNSON: The bignum type in Julia is actually calling GMP. So that's one of those things -- let me just make a new notebook. So if I say, I don't know, big(3) to the 3,000, and then, say, the factorial of that -- I think there's a built-in factorial. All right, so this is what's called a bignum type, right? It's something where the number of digits changes at run time.

So, of course, these are orders of magnitude slower than hardware arithmetic. Basically, it has to be implemented as an array of digits in some base, and when you add or multiply, you have to loop over those digits at runtime. These bignum libraries are quite large and heavily optimized, and so nobody has reimplemented one in Julia; it's just calling out to a C library, the GNU multiple-precision library (GMP).
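
A sketch of the kind of commands involved here, including the BigFloat precision example that comes up next (the exact numbers are illustrative):

    big(3)^3000            # BigInt: arbitrary-precision integers, backed by GMP
    factorial(big(100))    # the exact value of 100!

    setprecision(BigFloat, 1000) do    # 1000 bits of mantissa (BigFloat uses MPFR)
        BigFloat(pi)                   # pi to roughly 300 decimal digits
    end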

And for floating-point values, there's something called BigFloat. So BigFloat of pi is that. I can actually -- let's set the precision: setprecision of BigFloat to 1,000. That's 1,000 binary digits. And then say BigFloat of pi, and I get [INAUDIBLE] more. By the way, you might have noticed -- I can have a variable alpha -- oops -- alpha-hat sub 2 equals 17. That's allowed.

All that's happening here is that Julia allows almost arbitrary Unicode for identifiers. So I can have -- let me make it bigger so you can see -- an identifier that's a koala, right? So there are two issues here. One is just having a language that allows those things as identifiers. Python 3 also allows Unicode identifiers, although I think Julia, out of all the common languages, probably has the widest Unicode support.

Most languages only allow a fairly narrow range of Unicode characters for identifiers. So Python would allow the koala, but Python 3 would not allow the alpha-hat sub 2, because it doesn't allow the numeric-subscript Unicode characters. The other thing is how you type these things, and that's more of an editor issue.

And so in Julia, we implemented this initially in the REPL and in Jupyter, and now all the editors support it: you can just do tab completion of LaTeX. So I can type \gamma, tab, and it completes to the Unicode character. I can say \dot, and it puts a dot over it, and backslash-superscript-4, and it puts a superscript 4. And that's allowed. So it's quite nice.
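
For instance, in the REPL or a Jupyter cell (the particular names are just illustrations):

    # Typed as \alpha<tab> \hat<tab> \_2<tab> = 17
    α̂₂ = 17

    # Typed as \gamma<tab> \dot<tab> \^4<tab> = 2.5
    γ̇⁴ = 2.5

    α̂₂ + γ̇⁴    # 19.5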

So when I put equations in emails, I go to the Julia REPL and tab-complete all my LaTeX characters, because it's the easiest way to type these Unicode math characters. But yeah. IPython borrowed this, so now you can do the same thing in IPython notebooks as well.

So it's really fun, because if you read old math codes, especially old Fortran codes, you see lots of variables named alphahat or alpha_hat_3 or something like that. It's so much nicer to have a variable that's actually the alpha-hat sub 3. So that's cute.

CHARLES E. LEISERSON: Steve, thanks very much. Thanks. This was great.

[APPLAUSE]

We are, as Steve mentioned, actually looking at a project to merge the Julia technology with the Cilk technology. We're right now in the process of putting together the grant proposal, and if that gets funded, there may be some UROPs.