Tuesday, October 14, 2008

Sprint Discussions: JIT Generator Planning

Background

Finally, the JIT post :-). First some background: Despite our plans at the end of the EU period, PyPy's Python interpreter didn't get a good and widely applicable JIT in the last year. The reason for that was that we discovered that although the basic idea to generate JIT compilers is good, the concrete prototype made during the EU period is basically flawed. It could have been pushed a bit farther, but would have run into deep troubles eventually. One of the problems would have been performance instability: change a seemingly unrelated bit in your source program, and the performance changes in unexpected ways, which is clearly not desirable. Another problem with that old approach is that too much assembler code is generated, leading to memory problems, and also that the generated assembler is bad in various ways, e.g. it is hard in that approach to do proper register allocation.

Therefore we decided that it would be worthless to pursue this direction much further. Instead we tried to research approaches to fixing the inherent problems. This research was largely done in Prolog and I eventually wrote my Master's thesis about it. From the Prolog work we got some good insights into what needs to be done and what sorts of techniques are needed. Also, it inspired Armin to do some more exploration on a small Python prototype which used the lessons learned from Prolog and also some additional ideas from tracing JITs. So far, however, the prototype is neither in RPython, nor much integrated with PyPy.

This research is not the only thing happening in the JIT-area. During the last year, Antonio Cuni was working on bringing the JIT to pypy-cli. This consisted mostly of writing a .NET backend for the old JIT-generator. Some further work is being done since August by John Witulski, who is writing an AMD64 backend for the JIT-generator for his Bachelor's thesis.

Where to go from there

During the sprint we discussed in which directions we should continue now. We plan to work quite a bit on the JIT in the coming months. Both Armin and Anto are in Düsseldorf for four months, and them and me plan to mostly work on the JIT (as well as giving a lecture on "Dynamic Programming Languages", trying to ensnare some more students).

The first step will be to experiment a bit more with Armin's prototype. So far it looks rather promising, but there are some unsolved issues that we need to look into first. The first issue is to think a bit about how to efficiently do profiling to compile only important code paths. The other large issue are so-called "virtualizables". Roughly speaking, they are the frame objects of the interpreter from which the JIT is generated. They need special treatment, because on the one hand it is important that they get optimized away to make the code fast, since the frames are accessed all the time for the local variables; on the other hand they should still be usable for introspection if code is around that is trying to look into them.

When this is done, the prototype needs to be ported to RPython, which is a non-trivial task, since it is rather dynamic so far (it is rather important that the unresolved issues are done before the porting, because once the prototype is in RPython, experimentation will be harder). The porting has the potential to be tedious, but in a sense it is "just work", as opposed to unclear research.

At this point it will become important to think about the backend interface. The interface that the old frontend used to produce assembler code won't be usable for the new approach, so things need to be rearranged slightly. Afterwards the backends will have more information and be invoked at a slightly higher level, which should allow them to produce better code.

When all this is done, the JIT generator will be in a rather good state and it should become possible (modulo a lot of details, of course), to use it on the Python interpreter.

Conclusion

I am intentionally not attaching any time estimates to this blog post. So far our time estimates have not been very accurate when it comes to the JIT, which only lead to disappointment when the JIT failed to materialize. We hope that we will progress in interesting ways in the next four months, but who knows. Note that we are really quite disappointed ourselves that it took so much longer than we planned and hoped. The reason for this is mostly that this work really is research and sometimes it is just hard to predict what sort of problems turn up. Partial evaluation (the basis for our JIT generator) is a 30 years old technique that was always just promising and never really successful, so the fact that we think we can solve its problems in a few years is very much hubris anyway :-). On the positive side, we think that we now know these problems much better than ever before and that we have a plan that has a chance to succeed.

Also we are still convinced that our approach has huge potential, despite the difficulties. If we manage to pull it off, it should be significantly simpler to support new language features in the JIT and also to get speedups on some rather interesting bits of the language. Some ideas we are having include generating a JIT for the regex engine or speed up ctypes-bindings to be nearly as fast as an extension module (or faster?). Also the JIT will be such that by construction the JIT-generated code behaves identical to the original code, which isn't always true for Psyco, for example.

Sprint Discussions: C++ Library Bindings

At the beginning of this year, PyPy grew ctypes support, thanks to generous support by Google. This made it possible to interface with C libraries from our Python interpreter, something that was possible but rather tedious before. What we are lacking so far is a way to interface to large C++ libraries (like GUI libraries). During the sprint we had a brainstorming session about possible approaches for fixing this shortcoming.

For CPython there are a number of approaches in common use:

Those all have the property that they produce some code that is then compiled with a compiler to produce a CPython extension. The produced code also uses functions from CPython's C-API. This model is not simple to use for PyPy in its current state. Since PyPy generates C code automatically, a fixed C-level API does not exist (it is not unlikely that at one point in the future we might have to provide one, but not yet). At the moment, PyPy very much has a "Don't call us, we call you"-approach.

A very different approach is followed by the Reflex package, which is developed at CERN (which has an incredible amount of C++ libraries). It is not mainly intended for writing Python bindings for C++ libraries but instead provides reflection capabilities for C++. The idea is that for every C++ shared library, an additional shared library is produced, which allows together with Reflex to introspect properties of C++ classes, methods, etc. at runtime. These facilities are then used for writing a small generic CPython extension module, that allows CPython to use any C++ library for which this reflection information was generated.

This approach is a bit similar to the ctypes module, apart from the fact that ctypes does not use any reflection information, but the user has to specify the data structures that occur in the C code herself. This makes it sometimes rather burdensome to write cross-platform library bindings.

For PyPy the approach seems rather fitting: We would need to implement only the generic extension module and could then use any number of C++ libraries. Of course some more evaluation is needed (e.g. to find out whether there are any restrictions for the C++ code that the library can use and how bothersome it is to get this reflection information for a large library) but so far it seems promising.

Sunday, October 12, 2008

Sprint Discussions: Release Planning

One of the discussions that happened during the sprint was about how to approach the next PyPy release. There hasn't been a release since the end of the EU period, which is not an optimal situation. Therefore we plan to make a 1.1 release at the beginning of next year, ideally before Pycon US. We'd also like to move towards time-based releases. This will be greatly helped by the new buildbot infrastructure, which allows us to decide when the state of the codebase is stable enough to release.

Another goal of the release is to involve more people from the wider PyPy community by having bugdays and generally asking for more support. This will be particularly useful for bugs on platforms that no one of the core developers group is using.

Feature-wise the release will mostly contain CPython 2.5 language support, including some new extension modules (like ctypes, expat, sqlite). In addition we plan to make it easier to actually install and use the PyPy Python interpreter, which means some sort of proper installation procedure and supporting distutils on top of PyPy. Another part of the release will be support for fully sand-boxing an interpreter.

Additionally there were also a large number of improvements on several levels since the last release, like optimizations, faster oldstyle-classes, better GCs, correct finalization behaviour, lots and lots of bugfixes, better threading support (still with the GIL), some work on improving memory behaviour, ...

In contrast to our last release, we will focus mainly on PyPy's Python Intepreter and more particularly its C-version. There are also various experimental interpreters that PyPy contains, like for Prolog, Smalltalk, JavaScript and Scheme. We also don't intend to put the LLVM and Javascipt backends in the release, since they are essentially unmaintained and at least partially broken. If anybody is particularly interested in one of these components, please feel free to step up and take responsibility for them. Another thing that the release won't contain is a JIT. I plan to make another blog-post about this soon, stay tuned.

Saturday, October 11, 2008

Düsseldorf Sprint Report Days 1-3

The Düsseldorf sprint is currently in full progress and this post will try to summarize what progress has been made in the last days. We are (again) sprinting at the STUPS group of the Düsseldorf University. You can find the sprint announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one hand, there are of course our own tests. But then we also have the CPython tests that should be run on top of pypy-c. Up to now we used a custom-made pile of hacks, held together by lots of duct-tape. It consisted of a variety of different machines running different things with different reporting solutions. Some of the old test-results can still be found on wyvern. Now we are moving to a buildbot based solution together with a custom reporter to have a view similar to the old one. Some details are not quite finished yet, but most of the things are already working rather well (currently all the results displayed are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the work done by our Summer-of-Code student, Bruno Gola, of adding CPython 2.5 features to PyPy's Python interpreter. While Bruno implemented most language features and imported the 2.5 stdlib into PyPy, a lot of details were still missing. In the last days nearly everybody worked on fixing small issues and failing stdlib tests. While doing that we tried to categorize some CPython tests as implementation dependant so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to reduce the memory behaviour of our Python interpreter. The idea is to make pypy-c a realistic option for use on embedded devices. By memory behaviour we mean both the dynamic memory usage (how much bytes does a dict or an instance take) as well as the size of the executable and details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data that a pypy-c executable contains. Our executables have the tendency to be rather large, both due to a lot of code and due to a large amount of static data. The analysis didn't give any really surprising results, the problem is mostly that we have a lot of static data originating from a bit everywhere in our program. Two big offenders are the unicodedata-module with about 750 KB of static data and the multimethod-tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc-removal. This is (for PyPy) a crucial optimization of the translation toolchain that performs escape analysis to find out which objects don't outlive the frame they were allocated in. Since RPython is garbage-collected we usually have a lot of allocations, so it is important to statically get rid of many of them. To successfully do that, some inlining is needed to give the analysis more context. This leads to the fact that we have rather aggressive inlining-settings to allow as much malloc-removal as possible. The new approach tries to inline functions only if this actually leads to the successful removal of a malloc operation. The code is not finished quite yet, so it remains to be seen how successful it will be.

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The idea is that it is better for memory-constrained-environments because it does not double the memory-requirements during collections. During the sprint Armin and Maciek worked on cleaning up the code a bit and then merging the branch. An interesting property of the mark-compact GC is that after a collection all the memory that is not currently used by the program is returned to the operating system. Right now the GC is not as fast as our more mature ones, but it probably will be the basis for future tweaking.

A small thing that was done by Alexander and Carl Friedrich to make objects smaller is to enable shared instance dictionaries also for instances of old-style classes. Before it worked only for instances of new-style classes. Shared instance dictionaries are a way to reduce the memory-usage of instances. In the optimal case, it gives the same memory-savings that __slots__ are giving, but without any behavioural changes. Conceptually it is very similar e.g. to the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8 is using (click on the link, there are nice graphics). The difference is that for them it is mostly a way to get faster attribute access, and PyPy is so far only using it form memory savings (but that might change in the future).

In parallel to all the other work, John Witulski worked tirelessly on advancing the AMD64-JIT-backend. John has the implementation of this backend as the topic of his Bachelor's thesis. He is progressing quite well (especially also considering that this is his first sizeable Python project ever), just sometimes being impaired by such annoyances as errors in the official Intel documentation. By now the backend is supporting many integer operations and control flow.

Friday, October 10, 2008

Prolog-JIT Master's-Thesis Finished

As we already blogged, in the last half-year or so, Michael Leuschel, Armin and me did a lot of JIT generator work on a Prolog prototype. The idea was to experiment more quickly with some techniques than what would have been possible with RPython. These experiments were quite successful in themselves. With very little code we managed to get a JIT that is not doing too badly when compared to existing projects for Prolog.

This Prolog work was also the subject of my Master's thesis. I finished the thesis about two weeks ago (and since then have been mostly sleeping and then sprinting). The thesis should be self-contained when it comes to explaining the JIT concepts but needs knowledge of Prolog to be understandable.