Friday, November 7, 2008

Porting the JIT to CLI (part 2)

In my previous post, we saw that the PyPy JIT generator can produce huge speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of the PyPy JIT, to see how it manages to do so. Note that this is a very high-level overview of how the JIT works, and it applies to all backends. Then, in the third post of this series, we will look more closely at the CLI JIT backend, seeing how it works around some .NET limitations and what the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a JIT of our own: so, instead of spending nights writing a JIT, we spend years coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job (perhaps this will be the subject of another blog post), but rather how the generated JIT works.

There are values that, if known at compile time (i.e., when the JIT compiler runs), let the JIT produce very efficient code. In a dynamic language, types are the primary example: for instance, suppose you are a compiler and you have to compile the following Python function:

def mysum(a):
  return a + 1

At compile time, you don't have any knowledge about the type of the parameter: it could be an integer, a float, a user-defined object, etc. In this situation, the only safe choice is to emit code which does the usual slow, full lookup to figure out how to perform the operation.

On the other hand, suppose that you knew in advance that the parameter is an integer: this time, you could emit code that exploits this extra knowledge, by directly performing a fast integer addition.

The idea behind the PyPy JIT is that if you don't have enough knowledge to generate efficient code, you stop compiling and wait until you do. Concretely, you emit code that runs up to the point where you stopped compiling, and there triggers a special procedure that restarts the compiler. This time the JIT compiler knows everything it needs, because it can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces something like this pseudo-code:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      default: continue_compilation(a_type, <position>);
  }
}

If you call mysum(41), execution goes to the default branch of the switch, thus calling continue_compilation: its job is to restart the JIT compiler, which can now emit fast code because it knows the exact type of a; then, it modifies the original mysum_compiled function to make it execute the newly generated code the next time it encounters an integer at that point:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      PyInteger: return new PyInteger(a.value+1); // fast path!
      default: continue_compilation(a_type, <position>);
  }
}

From now on, every time we call mysum with an integer argument, the JIT compiler is no longer invoked and the fast path is executed directly; if we happen to call mysum with a float argument, the switch falls into the default branch again, and the JIT compiler is started once more to produce fast code for this case as well. What happens in practice is that compile time and run time are continuously intermixed, until the switches are stable enough and the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called flexswitch, and it's one of the most important concepts of our JIT generator.
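
To give a more concrete feeling of the idea, here is a minimal Python sketch of how a flexswitch could be modelled; the names and the structure are purely illustrative and do not reflect the actual implementation in our JIT generator:

class FlexSwitch(object):
  """A growable switch: it maps already-seen values (e.g. types) to
  specialized code, and falls back to the JIT compiler for new ones."""

  def __init__(self, continue_compilation):
    self.cases = {}                          # value -> specialized function
    self.continue_compilation = continue_compilation

  def dispatch(self, value, *args):
    fastpath = self.cases.get(value)
    if fastpath is None:
      # unknown value: restart the JIT compiler, which produces a new
      # specialized version and attaches it to the switch
      fastpath = self.continue_compilation(value)
      self.cases[value] = fastpath
    return fastpath(*args)                   # from now on, this is the fast path

The important property is that the set of cases is not fixed once and for all: new specialized versions are attached to the switch as new values show up at runtime.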

Promotion

How can the JIT generator know which values are worth knowing in order to generate efficient code, and which aren't? Unfortunately it can't, or at least our JIT generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by annotating the variables on which the JIT should stop until it knows their actual values; this is done with special hints, called promote and promote_class, and variables annotated with such hints are said to be promoted. If something is promoted, a flexswitch is used to gain information about it, as seen in the previous section.

As an example, let's look at an excerpt from the main dispatch loop of the tlc virtual machine:

elif opcode == ADD:
  a, b = stack.pop(), stack.pop()
  hint(a, promote_class=True)
  hint(b, promote_class=True)
  stack.append(b.add(a))

This is the implementation of the ADD opcode: first, it pops two values from the stack; then, it computes the result; finally, it pushes the result onto the stack again. In between, the classes of both a and b have been promoted: this means that when the JIT emits the code for b.add(a), it knows exactly what is happening: if it sees that both are instances of the IntObj class, it inlines the method call and emits a fast integer addition instead.
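
For instance, once both classes have been promoted and seen to be IntObj, the code emitted for this opcode is conceptually equivalent to the following sketch (the attribute name value is just an illustration of the integer stored inside an IntObj, not necessarily the name used in the real source):

# specialized version for the (IntObj, IntObj) case: the call to b.add(a)
# has been inlined, so no dynamic dispatch or type check is left
stack.append(IntObj(b.value + a.value))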

Virtuals

The other important concept of the JIT is the presence of virtual structures, virtual lists, and virtual dictionaries. Again, I'm not going to explain in depth how they work, but only why they are so important for generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really need to, e.g. because they are being passed as an argument to some external function. Instead, we store all the information we need in local variables; e.g., in the case of a virtual structure, we create as many local variables as it has fields: if the structure escapes the local scope, we force it into a real object, allocating memory on the heap and initializing it from the current values of the local variables.
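
Here is an illustrative sketch of the idea (the Point class is made up for the example): as long as the structure is virtual, its fields live in local variables, and it is forced into a real object only when it escapes.

class Point(object):            # made-up class, just for the example
  def __init__(self, x, y):
    self.x = x
    self.y = y

# virtual: the structure exists only as two local variables
p_x = 3
p_y = 4
d2 = p_x * p_x + p_y * p_y      # field reads become plain variable reads

# the structure escapes (e.g. it is about to be passed to external code),
# so it is forced: a real object is allocated and filled from the locals
p = Point(p_x, p_y)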

This technique allows the JIT to avoid the allocation of many temporary objects that hold intermediate results; consider for example the following Python loop:

result = 0
for i in range(N):
  result += i
return result

Without the JIT, at each iteration a new int object is created and bound to the result variable, while the previous one is discarded because it is not needed anymore. By combining virtuals and promotion, the JIT can emit code that does the whole computation locally, and allocates a real object only at the end, when it escapes from the local scope because it is returned from the function.
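
By way of illustration, the code the JIT ends up emitting for this loop behaves roughly like the following sketch (PyInteger plays the role of the boxed integer class from the earlier pseudo-code, and the function name is made up):

class PyInteger(object):        # illustrative box class, as in the pseudo-code above
  def __init__(self, value):
    self.value = value

def mysum_loop(N):
  result = 0                    # kept as a plain, unboxed value inside the loop
  for i in range(N):            # i stays unboxed as well
    result += i
  return PyInteger(result)      # a single allocation, when the value escapes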

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs emit multiple versions of each chunk of code: each version is specialized and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches) is almost always negligible compared to how much time you save by running the fast version instead of the more-general-but-slow one. Moreover, each specialized version knows the exact shape of the objects it's dealing with, so those objects can be virtualized to make the generated code even more efficient.

In the end, the actual code generation is done by one of the JIT backends: the backends exploit all the knowledge gathered in the previous steps to produce highly efficient code, but this will be the subject of the next blog post.

Tuesday, November 4, 2008

Porting the JIT to CLI (part 1)

As the readers of this blog already know, I have been working on the CLI JIT backend for some months: last Friday, it reached an important milestone, as it is now able to produce huge speedups for a little dynamic language. To find out how huge the speedup is, read on :-).

The goal of the PyPy JIT generator is to take an interpreter and, with the help of a few annotations, automatically generate a JIT compiler for it. In this post, we will talk about the tlc virtual machine: while tlc is just a toy language, it contains some features that make it an interesting target for our JIT generator.

The tlc virtual machine

tlc is executed by a stack-based, dynamically typed virtual machine (for those who know a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e. lisp-like pairs of objects).
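
In code, the object model boils down to something like the following simplified sketch (class and attribute names are illustrative; the real definitions are in the tlc source):

class Obj(object):
  """Base class of all tlc objects."""
  def add(self, other):
    raise TypeError("this object does not support ADD")

class IntObj(Obj):
  def __init__(self, value):
    self.value = value
  def add(self, other):
    if not isinstance(other, IntObj):        # type check at runtime...
      raise TypeError("expected an integer")
    return IntObj(self.value + other.value)  # ...then unbox, add, and box again

class NilObj(Obj):
  pass

class ConsObj(Obj):              # a lisp-like pair
  def __init__(self, car, cdr):
    self.car = car
    self.cdr = cdr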

As the VM is very simple, it provides only a few opcodes:

  • opcodes to manipulate the stack, like PUSH, POP, etc.
  • integer operations, like ADD, MUL, all the comparisons, etc.: these operations can only be applied to integers;
  • list operations, like CONS, CAR, CDR: these operations can only be applied to lists;
  • other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities with Python (though on a smaller scale, of course):

  1. it has to do type-checks at runtime before doing most of the operations;
  2. every time you do an arithmetic operation, it has to unbox the operands, do the computation, and then box the result again.

This means that even if you have a program which only uses integers, you are paying a lot of overhead.

To know more about this toy VM, look at its source code: the interesting bits are the classes used to represent objects, and the interp_eval function, which contains the main loop of the virtual machine. As you can see, the implementation is quite straightforward; all the hint calls you see are the special annotations needed by the JIT generator to produce better code.

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, check out a fresh copy of the oo-jit branch:

$ svn co http://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM with the CLI backend and JIT enabled:

$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlc
...
lots of text
...

If everything went OK, you now have a targettlc-cli executable, which accepts two arguments: the name of the file containing the tlc program we want to run, and an integer to be passed to it.

Luckily, in the same directory we have a factorial.tlc file that contains the bytecode for a function that -- guess what? -- computes the factorial of a given integer; let's try it:

$ ./targettlc-cli factorial.tlc 5
Non jitted:    120 (0.009371 seconds)
Warmup jitted: 120 (0.208954 seconds)
Warmed jitted: 120 (0.000323999999999991 seconds)

Cool, it seems that the result was computed correctly :-). As you can see from the output, we ran the program three times:

  1. by plain interpretation, without any jitting;
  2. with the jit enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code;
  3. again with the jit enabled, but this time the compilation has already been done, so we are actually measuring how good the code we produced is.

So, it's time to run a benchmark: let's try to compute the factorial of a very big number; the result will be 0, because obviously after a while we overflow, but after all we are interested in the time spent, not in the result:

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (19.93247 seconds)
Warmup jitted: 0 (0.293229999999998 seconds)
Warmed jitted: 0 (0.0494239999999984 seconds)

$ python -c 'print 19.93247/0.0494239999999984'
403.295362577

And no, I didn't make any mistake in copying & pasting: the jitted version is really 400 times faster than the non-jitted one!

Warning: my laptop does not seem to be very well suited for benchmarks, as the results vary a lot from run to run; I've run the benchmarks many times and got speedup factors of up to 500, so your results may be different.

More benchmarks

It's also interesting to compare the result with a hand-written C# version of the factorial, to see how good the code we produced is; to get reasonable results, we need to compute a larger factorial, to let the code run a bit longer:

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0.980856 seconds)
Warmed jitted: 0 (0.769716 seconds)

$ mono factorial.exe 100000000
C#:            0 (0.153777 seconds)

$ python -c 'print 0.769716/0.153777'
5.00540392907

We know that the generated code is far from being optimal, but probably the factor of five is at least partially due to the fact that Mono's own JIT is optimized for C#-like code, and our code has a completely different shape.

All the benchmarks above were run under Linux, with Mono 1.9.1. Here are the results for the same benchmarks, but run with Microsoft CLR (on a different machine, so the absolute values are not comparable):

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (15,640625 seconds)
Warmup jitted: 0 (0,4375 seconds)
Warmed jitted: 0 (0,03125 seconds)

$ python -c 'print 15.640625/0.03125'
500.5

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0,90625 seconds)
Warmed jitted: 0 (0,515625 seconds)

$ ./factorial.exe 100000000
C#:            0 (0,34375 seconds)

$ python -c 'print 0.515625/0.34375'
1.5

The results are even better than before; this is probably thanks to the CLR's JIT, which does a better job than Mono's when faced with something different from the usual C#-like code.

Conclusions (for now)

This is a very important result, because it proves that PyPy's approach to JIT compilers can also be applied effectively to OO virtual machines; the result is even better than I expected, because when generating code for .NET we have much less freedom than when generating assembly code, and I had to play some tricks to work around some .NET limitations.

Moreover, it worked on the first try :-). I tried to compile the tlc virtual machine as soon as all the related JIT tests were passing, and surprisingly everything worked just fine, even though it was the very first time I was applying some features of the JIT to something bigger than a test: I think this is yet another proof that Test Driven Development just works!

Even though this is a major milestone, the CLI JIT backend is not yet complete: as a consequence it can't be used for the full PyPy yet, but all the hardest problems should be solved by now.

Since a lot of readers asked for more technical details, especially about the JIT, I will soon try to write a second blog post explaining how the CLI backend works internally, with a brief look at the generated code to see what it looks like.

Sunday, November 2, 2008

One year PyPy Blog

Last Friday the PyPy Status Blog had its first anniversary. Yay! After not really buying into any of this new-fangled "blog" stuff for a long time, we just bit the bullet and got started. Totally surprisingly, it even worked. We posted 76 posts in the last year, more than one per week. By now we have more than 800 subscribers (according to feedburner), which is quite cool for a rather niche blog.

To make our blog even more interesting, I would like to ask for some feedback via the comments:

  • Which posts did you like in particular?
  • What sort of posts would you be interested in getting more of?
  • Any other improvements we could make?