Tuesday, July 27, 2010

A Play on Regular Expression

The paper where the algorithms we described in the recent blog posts come from is now available. It is written as a play in three Acts with a cast of three and is very readable and funny. The Haskell code is at Sebastian Fischer's github pages.

Monday, July 26, 2010

EuroPython 2010 report

So, EuroPython 2010 is over, I am flying home and it's time to write a report about the conference from the PyPy point of view.

As usual, the conference was very interesting and went very well. The quality of the talks I attended to was high on average and most importantly I could meet a lot of interesting people to discuss various things.

On the first day, Armin, Amaury and I presented the usual PyPy status talk (here are the slides): the talk is an extended version of the one that I and Armin presented at Pycon Italia in May and is divided in three parts: first I talked about the current status of the project, what is the content of the recent 1.2 and 1.3 releases and showed a demo of a simple Django application that renders a Mandelbrot fractal and is measurably faster on PyPy than on CPython. In the second part of the talk, Armin gave an introduction about the ideas that stand behind the JIT. Finally, in the third part Amaury explained how the new cpyext module lets PyPy to compile and load existing CPython extensions written in C.

I think that the talk was well received: the only drawback is that there was no time to answer questions at the end of the presentation. However, we received a lot of "offline" questions after the talk finished and thorough the whole conference: it is always great to see that people are interested in our work, and I'd like to thank everybody for the feedback that they gave to us.

PyPy was also mentioned in the interesting Mark Shannon's talk, where he compared the optimization techniques used by PyPy, Unladen Swallow and HotPy, which is Mark's own PhD project. Moreover, Henrik Vendelbo gave a talk about how to tweak PyPy to produce a standalone executable which embeds a whole python application to make deployment easier, while Andrew Francis explained his implementation of the Go select statement based on the stackless.py module implemented in PyPy. Personally, I am glad to see that people start to think of PyPy as a useful starting point to experiment with new features and use cases that we did not think about: after all, one of PyPy explicit goals is to be "flexible and easy to experiment with".

After the conference there were the usual post EuroPython sprints: this year we had not planned a PyPy sprint, but some people showed interest in it and since Armin and I happened to be still around the day after the conference, we decided to do a mini 1-day sprint, with 6 or 7 people present. Since there were only two core developers it was impossible to use our usual pairing scheme, in which every newcomer pairs with someone who is experienced with the source code to gain knowledge of it. However, I think it was still a successful day of work, and we managed to fix a couple of bugs that was standing in our issue tracker. Again, I'd like to thank all the people that came and worked with us during the sprint.

In conclusion I really enjoyed the EuroPython 2010 experience: the fact that I managed to find a place in Birmingham where to eat a good Italian-style "gelato" helped a lot :-).

Friday, July 9, 2010

CERN Sprint Report – Wrapping C++ Libraries

The last five days we have been sprinting in a meeting room in the Computing Center at CERN in Genève, Switzerland. Present are Armin Rigo, Antonio Cuni, Carl Friedrich Bolz and Wim Lavrijsen (LBL). The goal of the sprint was to use some of the C++ technology developed at CERN to make it possible to use C++ libraries from PyPy's Python interpreter. For this we used the Reflex library, which provides reflection information for C++ classes. We discussed using Reflex in PyPy during the Düsseldorf sprint of 2008, please read that blog post if you want some more details on how Reflex works. There is support for this sort of C++/Python integration also for CPython, using the PyROOT module.

The sprint was very successful. On Monday we had a few discussion about how Reflex could best be integrated with PyPy. One of the goals of the sprint was to make the approach JIT-friendly from the start, so that calls to C++ libraries can be reasonably fast. After the discussion we started coding on the reflex-support branch. This branch adds a new cppyy builtin module to PyPy's Python interpreter (why we chose that name is left as an exercise to the reader). This module can be used to load C++ classes, construct instances and call static and instance methods on them.

The work has just started, as of now, the argument and return types of the methods are restricted to some simple C types, such as int, double and char* and pointers to class instances. Most of the work necessary to properly resolve overloaded methods is done, but default arguments are not.

As an example, suppose there is a C++ class like this:

class example01 {
private:
    static int count;
    int somedata;
public:

    example01(int a) : somedata(a) {
        count++;
    }
    ~example01() {
        count--;
    }
    static int getCount() {
        return count;
    }

    int addDataToInt(int a) {
        return somedata + a;
    }
};
int example01::count = 0;

You can now use it from PyPy's Python interpreter in the following way, after you have used Reflex to generate reflection information for the class:

import cppyy
cppyy.load_lib("example01Dict.so") # contains the Reflex information
example01_class = cppyy.gbl.example01
instance = example01_class(7)
assert example01_class.getCount() == 1
res = instance.addDataToInt(4)
assert res == 11
res = instance.addDataToInt(-4)
assert res == 3
instance.destruct() # so far explicit destruction needed
assert example01_class.getCount() == 0

We also did some very early JIT work and some early performance measurements. The rough figures are that cppyy is two times faster at calling a simple C++ method from Python than PyROOT. To get a feeling for how fast things could go in the end, we also implemented a proof-of-concept for some more advanced JIT technology (which requires a patch for Reflex and uses a GCC extension). With this, the speedup over PyROOT is a factor of 20. Of course, this is still a lot slower than a C++ to C++ method call (probably by at least an order of magnitude).

The sprint was very productive because we managed to get the right people into the same room working together. Wim has a lot of experience with C++ and Reflex, and is the author of PyROOT, and of course the others know a lot about PyPy (at the end of the sprint, Anto was very glad that he stopped using C++ a long time ago). Also, working at CERN was very cool. The atmosphere is amazing, and we got to visit the ATLAS control room. Extremely advanced technology, and also research on a completely different scale than what we are used to.

Saturday, July 3, 2010

Comparing SPUR to PyPy

Recently, I've become aware of the SPUR project of Microsoft Research and read some of their papers (the tech report "SPUR: A Trace-Based JIT Compiler for CIL" is very cool). I found the project to be very interesting and since their approach is in many ways related to what PyPy is doing, I now want to compare and contrast the two projects.

A Tracing JIT for .NET

SPUR consist of two parts: On the one hand it is a VM for CIL, the bytecode of the .NET VM. This VM uses a tracing JIT compiler to compile the programs it is running to machine code. As opposed to most existing VMs that have a tracing JIT it does not use an interpreter at all. Instead it contains various variants of a JIT compiler that produce different versions of each method. Those are:

  • a profiling JIT which produces code that does lightweight profiling when running the compiled method
  • a tracing JIT which produces code that produces a trace when running the compiled method
  • a transfer-tail JIT which is used to produce code which is run to get from a failing guard back to the normal profiling version of a method
  • an optimizing JIT that actually optimizes traces and turns them into machine code

Optimizations Done by the Optimizing JIT

SPUR's optimizing JIT does a number of powerful optimizations on the traces before it turns them into machine code. Among them are usual compiler optimizations such as register allocation, common subexpression elimination, loop invariant code motion, etc.

It also performs some optimizations that are specific to the tracing context and are thus not commonly found in "normal" compilers:

  • guard implication: if a guard is implied by an earlier guard, it is removed
  • guard strengthening: if there is a sequence of guards that become stronger and stronger (i.e. each guard implies the previous one), the first guard in the sequence is replaced by the last one, and all others are removed. This can greatly reduce the number of guards and is generally safe. It can shift a guard failure to an earlier point in the trace, but the failure would have occurred at some point in the trace anyway.
  • load/store optimizations: this is an optimization for memory reads/writes. If several loads from the same memory location occur without writes in between, all but the first one are removed. Similarly, if a write to a memory location is performed, this write is delayed as much as possible. If there is a write to the same location soon afterwards, the first write can be removed.
  • escape analysis: for allocations that occur in a loop, the optimizer checks whether the resulting object escapes the loop. If not, the allocation is moved before the loop, so that only one object needs to be allocated, instead of one every loop iteration.
  • user-controlled loop unrolling: not exactly an optimization, but an interesting feature anyway. It is possible to annotate a CIL method with a special decorator [TraceUnfold] and then the tracing JIT will fully unroll the loops it contains. This can be useful for loops than are known to run a small and fixed number of iterations for each call-site.
  • user controlled tracing: The user can also control tracing up to a point. Methods can be annotated with [NativeCall] to tell the tracer to never trace their execution. Instead they appear as a direct call in the trace.

A JavaScript Implementation

In addition to the tracing JIT I just described, SPUR also contains a JavaScript implementation for .NET. The approach of this implementation is to translate JavaScript to CIL bytecode, doing some amount of type inference to detect variables that have fixed types. All operations where no precise type could be determined are implemented with calls to a JavaScript runtime system, which does the necessary type dispatching. The JavaScript runtime is implemented in C#.

The JavaScript implementation and the CLI VM with a tracing JIT sound quite unrelated at first, but together they amplify each other. The tracing JIT traces the JavaScript functions that have been translated to CLI bytecode. Since the JavaScript runtime is in C#, it exists as CLI bytecode too. Thus it can be inlined into the JavaScript functions by the tracer. This is highly beneficial, since it exposes the runtime type dispatching of the JavaScript operations to the optimizations of the tracing JIT. Particularly the common expression elimination helps the JavaScript code. If a series of operations is performed on the same object, the operations will all do the same type checks. All but the type checks of the first operation can be removed by the optimizer.

Performance Results

The speed results of the combined JavaScript implementation and tracing JIT are quite impressive. It beats TraceMonkey for most benchmarks in SunSpider (apart from some string-heavy benchmarks that are quite slow) and can compete with V8 in many of them. However, all this is steady-state performance and it seems SPUR's compile time is rather bad currently.

Further Possibilities

A further (so far still hypothetical) advantage of SPUR is that the approach can optimize cases where execution crosses the border of two different systems. If somebody wrote an HTML layout engine and a DOM in C# to get a web browser and integrated it with the JavaScript implementation described above, the tracing JIT could optimize DOM manipulations performed by JavaScript code as well as callbacks from the browser into JavaScript code.

Of course the approach SPUR takes to implement JavaScript is completely generalizable. It should be possible to implement other dynamic languages in the same way as JavaScript using SPUR. One would have to write a runtime system for the language in C#, as well as a compiler from the language into CIL bytecode. Given these two elements, SPUR's tracing JIT compiler would probably do a reasonable job at optimizing this other language (of course in practise, the language implementation would need some tweaking and annotations to make it really fast).

Comparison With PyPy

The goals of PyPy and SPUR are very similar. Both projects want to implement dynamic languages in an efficient way by using a tracing JIT. Both apply the tracing JIT "one level down", i.e. the runtime system of the dynamic language is visible to the tracing JIT. This is the crucial point of the approach of both projects. Since the runtime system of the dynamic language is visible to the tracing JIT, the JIT can optimize programs in that dynamic language. It does not itself need to know about the semantics of the dynamic language. This makes the tracing JIT usable for a variety of dynamic languages. It also means that the two halves can be implemented and debugged independently.

In SPUR, C# (or another language that is compilable to CIL) plays the role of RPython, and CIL is equivalent to the intermediate format that PyPy's translation toolchain uses. Both formats operate on a similar abstraction level, they are quite close to C, but still have support for the object system of their respective language and are garbage-collected.

SPUR supports only a JavaScript implementation so far, which could maybe change in the future. Thus JavaScript in SPUR corresponds to Python in PyPy, which was the first dynamic language implemented in PyPy (and is also the reason for PyPy's existence).

There are obviously also differences between the two projects, although many of them are only skin-deep. The largest difference is the reliance of SPUR on compilers on all levels. PyPy takes the opposite approach of using interpreters almost everywhere. The parts of PyPy that correspond to SPUR's compilers are (I will use the Python implementation of PyPy as an example):

  • the JavaScript-to-CIL compiler corresponds to the Python interpreter of PyPy
  • the profiling JIT corresponds to a part of PyPy's translation toolchain which adds some profiling support in the process of turning RPython code into C code,
  • the tracing JIT corresponds to a special interpreter in the PyPy JIT which executes an RPython program and produces a trace of the execution
  • the transfer-tail JIT corresponds to PyPy's blackhole interpreter, also called fallback interpreter
  • the optimizing JIT corresponds to the optimizers and backends of PyPy's JIT

PyPy's Optimizations

Comparing the optimizations that the two projects perform, the biggest difference is that PyPy does "trace stitching" instead of fully supporting trace trees. The difference between the two concerns what happens when a new trace gets added to an existing loop. The new trace starts from a guard in the existing loop that was observed to fail often. Trace stitching means that the loop is just patched with a jump to the new trace. SPUR instead recompiles the whole trace tree, which gives the optimizers more opportunities, but also makes compilation a lot slower. Another difference is that PyPy does not perform loop-invariant code motion yet.

Many of the remaining optimizations are very similar. PyPy supports guard implication as well as guard strengthening. It has some load/store optimizations, but PyPy's alias analysis is quite rudimentary. On the other hand, PyPy's escape analysis is very powerful. PyPy also has support for the annotations that SPUR supports, using some decorators in the pypy.rlib.jit module. User-controlled loop unrolling is performed using the unroll_safe decorator, tracing of a function can be disabled with the dont_look_inside decorator.

PyPy has a few more annotations that were not mentioned in the SPUR tech report. Most importantly, it is possible to declare a function as pure, using the purefunction decorator. PyPy's optimizers will remove calls to a function decorated that way if the arguments to the call are all constant. In addition it is possible to declare instances of classes to be immutable, which means that field accesses on constant instances can be folded away. Furthermore there is the promote hint, which is spelled x = hint(x, promote=True). This will produce a guard in the trace, to turn x into a constant after the guard.

Summary

Given the similarity between the projects' goals, it is perhaps not so surprising to see that PyPy and SPUR have co-evolved and reached many similar design decisions. It is still very good to see another project that does many things in the same way as PyPy.