Thursday, February 16, 2012

Py3k status update

Thank to all the people who donated to the py3k proposal, we managed to collect enough money to start to work on the first step. This is a quick summary of what I did since I began working on this.
First of all, many thanks to Amaury Forgeot d'Arc, who started the py3k branch months ago, and already implemented lots of features including e.g. switching to "unicode everywhere" and the int/long unification, making my job considerably easier :-)
I started to work on the branch at the last Leysin sprint together with Romain Guillebert, where we worked on various syntactical changes such as extended tuple unpacking and keyword-only arguments. Working on such features is a good way to learn about a lot of the layers which the PyPy Python interpreter is composed of, because often you have to touch the tokenizer, the parser, the ast builder, the compiler and finally the interpreter.
Then I worked on improving our test machinery in various way, e.g. by optimizing the initialization phase of the object space created by tests, which considerably speeds up small test runs, and adding the possibility to automatically run our tests against CPython 3, to ensure that what we are not trying to fix a test which is meant to fail :-). I also setup our buildbot to run the py3k tests nightly, so that we can have an up to date overview of what is left to do.
Finally I started to look at all the tests in the interpreter/ directory, trying to unmangle the mess of failing tests. Lots of tests were failing because of simple syntax errors (e.g., by using the no longer valid except Exception, e syntax or the old print statement), others for slightly more complex reasons like unicode vs bytes or the now gone int/long distinction. Others were failing simply because they relied on new features, such as the new lexical exception handlers.
To give some numbers, at some point in january we had 1621 failing tests in the branch, while today we are under 1000 (to be exact: 999, and this is why I've waited until today to post the status update :-)).
Before ending this blog post, I would like to thank once again all the people who donated to PyPy, who let me to do this wonderful job. That's all for now, I'll post more updates soon.
cheers, Antonio

Monday, February 13, 2012

A Larger Example for the Flow Graph Language

Part 4 of Comparing Partial Evaluation to Tracing

This is the fourth and final blog post in a series about comparing partial evaluation and tracing. We've come a long way: In the first post of the series I showed an interpreter for a small flow-graph language together with a partial evaluator it. In the second post I showed how a tracer for the same language works and how it relates to both execution and to partial evaluation. The third post described an optimizer for traces.

In this final post we can compare and contrast the two different approaches of tracing and partial evaluation by means of an example. The programs in the flow chart language seen so far have been rather small, so I want to give an example of a larger program: an interpreter for an extremely simple bytecode instruction set. I will look at how the partial evaluator deals with that interpreter, and what the tracer does with it. The code for that, as well as all the code of the series can be found here: http://paste.pocoo.org/show/550282/ (some small additions have been made, such as a nicer way to print traces).

A Bytecode Interpreter

Writing programs in the flow graph language is painful, but I still want to give an example that is a bit more interesting than the tiny ones that we've seen so far. The example is an interpreter for the bytecode of a very trivial register-based language. The language has four registers, one of which is an accumulator on which all the actual operations are performed.

The opcodes of the language are:

  • jump_if_a, jumps to a target address when the accumulator is non-zero
  • mov_a_r0, mov_a_r1, mov_a_r2 move the value of the accumulator to the respective register
  • mov_r0_a, mov_r1_a, mov_r2_a move the value of a register to the accumulator
  • add_r0_to_a, add_r1_to_a, add_r2_to_a add the value of the register to the accumulator
  • decr_a decrement the accumulator
  • return_a stop the program and print the accumulator

The interpreter has a main loop that reads the opcode at the current program counter, does a (lengthy) dispatch to the right bytecode via a series of if statements and then executes the right opcode. Afterwards the next opcode is treated equivalently.

Here is a part of the source code in the flow graph language. As pseudocode:

bytecode_loop:
    opcode = bytecode[pc]
    pc = pc + 1
    c = opcode == 'jump_if_a'
    if c goto op_jump_if_a else goto not_jump_if_a

# select the right bytecode via a long series of if statements
not_jump_if_a:
    c = opcode == 'mov_a_r0'
    if y goto op_mov_a_r0 else goto not_mov_a_r0
not_mov_a_r0:
    c = opcode == 'mov_a_r0'
    if y goto op_mov_a_r1 else goto not_mov_a_r1
...

# bytecode implementations
op_mov_a_r0:
    r0 = a
    goto bytecode_loop

op_jump_if_a:
    c = a == 0
    target = bytecode[pc]
    pc += 1
    if c goto bytecode_loop else goto op_jump_if_a_jump

op_jump_if_a_jump:
    pc = target
    goto bytecode_loop
...

And actually working, as Prolog facts (the full implementation can be found at the link above):

% bytecode dispatch loop
block(bytecode_loop,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(jump_if_a),
      if(c, op_jump_if_a, not_jump_if_a))))).

% select the right bytecode via a long series of if statements
block(not_jump_if_a,
      op2(c, eq, var(opcode), const(mov_a_r0),
      if(c, op_mov_a_r0, not_mov_a_r0))).
block(not_mov_a_r0,
      op2(c, eq, var(opcode), const(mov_a_r1),
      if(c, op_mov_a_r1, not_mov_a_r1))).
...

% bytecode implementations
block(op_jump_if_a,
      op2(c, eq, var(a), const(0),
      op2(target, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      if(c, bytecode_loop, op_jump_if_a_jump))))).
block(op_jump_if_a_jump,
      op1(pc, same, var(target),
      promote(bytecode, bytecode_loop))).
block(op_mov_a_r0,
      op1(r0, same, var(a), jump(bytecode_loop))).
...

The bytecode_loop block is the main dispatch loop. It reads an opcode out of the bytecode list at the program counter position, then has a long series of if statements that compares the current opcode to the various existing opcodes. The full code of the interpreter can be found under the link above.

The bytecodes of the interpreter don't really permit hugely complex programs, but it can be used to write a program that computes the square of a number with the following program:

mov_a_r0     # r0 = a
mov_a_r1     # r1 = a
# 2:
mov_r0_a     # r0--
decr_a
mov_a_r0
mov_r2_a     # r2 += a
add_r1_to_a
mov_a_r2
mov_r0_a     # if r0!=0: goto 2
jump_if_a 2
mov_r2_a     # return r2
return_a

Partially Evaluating the Bytecode Interpreter

The partial evaluator from the first blog post can be easily used to partially evaluate the bytecode interpreter. The static input is the bytecode for computing the square and the initial program counter value, as given above. The dynamic input are the content of the accumulator (the number to be squared). This can be done as follows:

?- bytecode_square(B),
Env = [bytecode/B, pc/0],
do_pe(bytecode_loop, Env, Label),
REnv = [a/16, r0/0, r1/0, r2/0],
interp(jump(Label), REnv), listing(block).
256
:- dynamic block/2.

<lots of generated code>

The code that is generated by the partial evaluation process is somewhat hard to read. It contains a lot of passages like this:

...
block(op_return_a1, print_and_stop(var(a))).
block(not_decr_a1, jump(op_return_a1)).
block(not_add_r2_to_a2, jump(not_decr_a1)).
block(not_add_r1_to_a2, jump(not_add_r2_to_a2)).
block(not_add_r0_to_a3, jump(not_add_r1_to_a2)).
block(not_mov_r2_a3, jump(not_add_r0_to_a3)).
block(not_mov_r1_a5, jump(not_mov_r2_a3)).
block(not_mov_r0_a5, jump(not_mov_r1_a5)).
block(not_mov_a_r27, jump(not_mov_r0_a5)).
block(not_mov_a_r18, jump(not_mov_a_r27)).
block(not_mov_a_r09, jump(not_mov_a_r18)).
block(not_jump_if_a11, jump(not_mov_a_r09)).
block(bytecode_loop12, jump(not_jump_if_a11)).
block(op_mov_r2_a2, op1(a, same, var(r2), jump(bytecode_loop12))).
...

I.e. lots of blocks that do nothing but jump to another block, interspersed with some blocks that contain an actual operation. I cleaned the output up manually and got something like the following (this sort of cleanup is something a good partial evaluation system would do itself, after partial evaluation has occurred):

block(bytecode_loop1,
    op1(r0, same, var(a),
    op1(r1, same, var(a),
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))))).

block(bytecode_loop11,
    op1(a, same, var(r2),
    print_and_stop(var(a))).

block(op_jump_if_a_jump1,
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))).

What do we see here? The partial evaluator has generated a block bytecode_loop1, which corresponds to the initialization opcodes mov_a_r0 and mov_a_r1 together with one iteration of the loop. Then it either jumps to a copy of the main loop (label op_jump_if_a_jump1) or to block bytecode_loop11, which prints the result and then stops. The residual code does exactly what the bytecode did: It squares the accumulator then prints that. All the uses of the bytecode and pc variable are gone.

Why did the partial evaluator produce two copies of the main loop that look the same? The reason for that is that in the second copy, the additional static information target = 2 is known, where target is a variable in the interpreter source that stores the jump target, for very brief periods of time. This additional static information does not have any effect on the residual code, so the same code is uselessly generated twice. This is an example of overspecialization.

Tracing the Interpreter

In this section we will look at what happens if we try to trace the interpreter. The naive way of doing that yields traces that are not very useful, because they abort after one iteration. We will look at a way of avoiding this problem. The problems described in this section are at the core of the paper Tracing the meta-level: PyPy's tracing JIT compiler (that paper uses a slightly more advanced version of the bytecode interpreter as an example).

To trace the interpreter, it is useful to change the bytecode_loop block from above to always promote the bytecode and the pc variables, because without knowing them the trace produced is not really interesting. This is similar to making these variables static in the partial evaluation example above:

block(bytecode_loop,
      promote(bytecode, bytecode_loop_promote_bytecode)).
block(bytecode_loop_promote_bytecode,
      promote(pc, bytecode_loop_promote_pc)).
block(bytecode_loop_promote_pc,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(0),
      if(c, op_jump_if_a, not_jump_if_a))))).
...

The rest of the interpreter stays unchanged.

To trace the interpreter we would start naively at the bytecode_loop label, because that's the label in the interpreter that is jumped to most often (which a profiler could establish easily). The following command can be used for that (this output prints traces in a slightly more readable way than in previous blog posts):

?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/2, a/A, r0/A, r1/A, r2/0],
        do_trace(bytecode_loop, Env).
trace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  loop

opttrace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(3))
  op1(opcode,same,const(mov_r0_a))
  op1(c,same,const(1))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/2, a/16, r0/16, r1/16, r2/0]

These traces are very short. They start with promoting the bytecode and the pc, followed by the execution of the opcode mov_r0_a, which is the one at position 2 in the given bytecode. Then they increment the pc and loop back to the beginning. Looking at the optimized trace, it is clear that the trace is essentially useless. It will run only for one iteration, because in the second iteration the pc is 3, thus the guard_value at the beginning will fail.

This problem can be solved by tracing more than just one iteration of the bytecode dispatch loop, which is called meta-tracing. To get this behaviour, in this simple example it is enough to start (and thus end) tracing at a different label, op_jump_if_a_jump. This label is hit when the interpreter executes a jump_if_a bytecode and the jump is taken. In a loop on the level of the executed bytecode program there is one such jump. Thus tracing from this label, a full loop in the bytecode program is traced, containing potentially many iterations of the bytecode dispatch loop in the control flow graph language.

Doing that yields the following:

?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/11, a/A, r0/A, r1/A, r2/0, target/2],
        do_trace(op_jump_if_a_jump, Env).
trace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,3,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  ...
  lots of operations ommitted
  ...
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,9,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_true(c,[],not_jump_if_a)
  op2(c,eq,var(a),const(0))
  op2(target,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  guard_false(c,[],bytecode_loop)
  loop

opttrace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op2(a,sub,var(a),const(1))
  op1(r0,same,var(a))
  op1(a,same,var(r2))
  op2(a,add,var(a),var(r1))
  op1(r2,same,var(a))
  op1(a,same,var(r0))
  op2(c,eq,var(a),const(0))
  guard_false(c,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],pc/11,opcode/jump_if_a,target/2],bytecode_loop)
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(11))
  op1(opcode,same,const(jump_if_a))
  op1(target,same,const(2))
  op1(c,same,const(0))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/11, a/16, r0/16, r1/16, r2/0, target/2] .

That looks better. The trace corresponds to the interpreter running all the bytecodes in the loop of the squaring function in the example bytecode above. The optimized code starts with two guards (checking that the bytecode is still the one for the squaring function, checking that the pc is 2) and then only does the operations that actually do the computation. No bytecode dispatching is performed, thus the interpretation overhead is fully removed, apart from the two guard_value operations at the beginning.

Many of the assignments in the trace are superfluous, e.g. all the copying back and forth between registers r1, r1, r2 and accumulator a. This could be easily solved by an even more intelligent optimization utilizing SSA form.

Conclusion About the Interpreter

Both partial evaluation and meta-tracing can be used to transform the example bytecode computing a square into a form that shows the essential computation that is going on, without the interpretation overhead. The naive partial evaluator produces lots of extra blocks that just jump around, which could be solved with a post-processing step. The tracer by itself produces uselessly short traces, but with a simple trick of starting the trace at a different point the results become a lot better.

In a real meta-tracing system, the meta-tracer would need a way for the author of the interpreter to mark which bytecode corresponds to a backward jump. It would also need better integration with the interpreter to start tracing automatically, as well as cache the traces. Additionally, it would have to deal better with guards that fail a lot, attaching new traces to the failing guards. However, all that is "just" engineering on top of the ideas presented in this series of blog posts.

High-Level Conclusion

Some concluding high-level thoughts about the similarities of tracing and partial evaluation: Tracing and partial evaluation try to tackle a similar problem, that of automatically reducing the interpreter overhead, their approaches are slightly different though.

Tracing is very close to normal evaluation, only keeping some extra information in the process. But then, the optimizer that is used in a tracer is again very similar in structure to a partial evaluator. The task of the optimizer is much simpler though, because it does not need to deal with control flow at all, just a linear list of operations.

So in a sense tracing is taking those parts of partial evaluation that work (the "just evaluate those things that you can, and leave the others") and replacing the parts that don't (controlling unfolding) by a much more pragmatic mechanism. That mechanism observes actual execution runs of the program to choose control flow paths that are typical. At the same time, the tracer's focus is on loops, because they are where most programs spend significant amounts of time.

Another point of view of tracing is that it is a form of partial evaluation that replaces the control components of a partial evaluator with an oracle (the actual execution runs) that provide the information which paths to look at.

Already in the quite trivial interpreter here the effects of this are visible. The simple partial evaluator over-specializes the loop and produces two identical versions of it, that aren't different. The tracer doesn't, and it also generates only code for the loop itself, not for the initialization opcodes.

That's it for this series. To those that made it, thanks for following along. Also thanks to Samuele and Sven, who consistently gave me good feedback on the posts before I put them here.

Friday, February 10, 2012

PyPy 1.8 - business as usual

We're pleased to announce the 1.8 release of PyPy. As habitual this release brings a lot of bugfixes, together with performance and memory improvements over the 1.7 release. The main highlight of the release is the introduction of list strategies which makes homogenous lists more efficient both in terms of performance and memory. This release also upgrades us from Python 2.7.1 compatibility to 2.7.2. Otherwise it's "business as usual" in the sense that performance improved roughly 10% on average since the previous release.

you can download the PyPy 1.8 release here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 1.8 and cpython 2.7.1 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 32/64 or Windows 32. Windows 64 work has been stalled, we would welcome a volunteer to handle that.

Highlights

  • List strategies. Now lists that contain only ints or only floats should be as efficient as storing them in a binary-packed array. It also improves the JIT performance in places that use such lists. There are also special strategies for unicode and string lists.

  • As usual, numerous performance improvements. There are many examples of python constructs that now should be faster; too many to list them.

  • Bugfixes and compatibility fixes with CPython.

  • Windows fixes.

  • NumPy effort progress; for the exact list of things that have been done, consult the numpy status page. A tentative list of things that has been done:

    • multi dimensional arrays
    • various sizes of dtypes
    • a lot of ufuncs
    • a lot of other minor changes

    Right now the numpy module is available under both numpy and numpypy names. However, because it's incomplete, you have to import numpypy first before doing any imports from numpy.

  • New JIT hooks that allow you to hook into the JIT process from your python program. There is a brief overview of what they offer.

  • Standard library upgrade from 2.7.1 to 2.7.2.

Ongoing work

As usual, there is quite a bit of ongoing work that either didn't make it to the release or is not ready yet. Highlights include:

  • Non-x86 backends for the JIT: ARMv7 (almost ready) and PPC64 (in progress)
  • Specialized type instances - allocate instances as efficient as C structs, including type specialization
  • More numpy work
  • Since the last release there was a significant breakthrough in PyPy's fundraising. We now have enough funds to work on first stages of numpypy and py3k. We would like to thank again to everyone who donated.
  • It's also probably worth noting, we're considering donations for the Software Transactional Memory project. You can read more about our plans

Cheers,
The PyPy Team

Wednesday, February 8, 2012

Introductory Article About RPython

Laurence Tratt from King's College London has written a long and detailed introduction to the goals and significance of RPython over on his blog. Laurie has been implementing his Converge Language in RPython in the last months. He is one of the first people external to the PyPy team who have pushed a sizeable RPython-based VM quite far, adding and tuning JIT hints. The post describes some of that work and his impressions of RPython and PyPy.

"RPython, to my mind, is an astonishing project. It has, almost single-handedly, opened up an entirely new approach to VM implementation. As my experience shows, creating a decent RPython VM is not a huge amount of work (despite some frustrations). In short: never again do new languages need come with unusably slow VMs. That the the PyPy / RPython team have shown that these ideas scale up to a fast implementation of a large, real-world language (Python) is another feather in their cap." 

Tuesday, February 7, 2012

Optimizing Traces of the Flow Graph Language

Part 3 of Comparing Partial Evaluation to Tracing

This is the third blog post in a series about comparing partial evaluation and tracing. In the first post of the series I introduced a small flow-graph language together with an interpreter for it. Then I showed a partial evaluator for the language. In the second post of the series I showed how a tracer for the same language works and how it relates to both execution and to partial evaluation. Then I added support for promotion to that tracer.

In this post I will show how to optimize the traces that are produced by the tracer and compare the structure of the optimizer to that of partial evaluation.

The code from this post can be found here: http://paste.pocoo.org/show/547304/

Optimizing Traces

In the last post we saw how to produce a linear trace with guards by interpreting a control flow graph program in a special mode. A trace always end with a loop statement, which jumps to the beginning. The tracer is just logging the operations that are done while interpreting, so the trace can contain superfluous operations. On the other hand, the trace also contains some of the runtime values through promotions and some decisions made on them which can be exploited by optimization. An example for this is the trace produced by the promotion example from the last post:

op2(c,ge,var(i),const(0),
guard_true(c,[],l_done,
guard_value(x,5,[],b2,
op2(x2,mul,var(x),const(2),
op2(x3,add,var(x2),const(1),
op2(i,sub,var(i),var(x3),
loop))))))

After the guard_value(x, 5, ...) operation, x is know to be 5: If it isn't 5, execution falls back to the interpreter. Therefore, operations on x after the guard can be constant-folded. To do that sort of constant-folding, an extra optimization step is needed. That optimization step walks along the trace, remembers which variables are constants and what their values are using a partial environment. The opimizer removes operations that have only constant arguments and leaves the others in the trace. This process is actually remarkably similar to partial evaluation: Some variables are known to be constants, operations on only constant arguments are optimized away, the rest remains.

The code for optimizing operations looks as follows:

optimize(op1(ResultVar, Op, Arg, Rest), PEnv, NewOp) :-
    presolve(Arg, PEnv, RArg),
    (RArg = const(C) ->
        do_op(Op, C, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        NewOp = RestResidual
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op1(ResultVar, Op, RArg, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

optimize(op2(ResultVar, Op, Arg1, Arg2, Rest), PEnv, NewOp) :-
    presolve(Arg1, PEnv, RArg1),
    presolve(Arg2, PEnv, RArg2),
    (RArg1 = const(C1), RArg2 = const(C2) ->
        do_op(Op, C1, C2, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        NewOp = RestResidual
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op2(ResultVar, Op, RArg1, RArg2, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

Just like partial evaluation! It even reuses the helper functions presolve from the partial evaluator and a partial environment PEnv. When the arguments of the operation are known constants in the partial environment, the operation can be executed at optimization time and removed from the trace. Otherwise, the operation has to stay in the output trace. The result variable (as in the partial evaluator) needs to be removed from the partial environment, because it was just overwritten by an unknown result.

Now we need to deal with guards in the trace.

optimize(guard_true(V, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        NewOp = RestResidual
    ;
        NewOp = guard_true(V, PEnv, L, RestResidual)
    ),
    optimize(Rest, PEnv, RestResidual).

optimize(guard_false(V, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        NewOp = RestResidual,
        NEnv = PEnv
    ;
        write_env(PEnv, V, 0, NEnv),
        NewOp = guard_false(V, PEnv, L, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

When the variable that is being guarded is actually known to be a constant, we can remove the guard. Note that it is not possible that the guard of that constant fails: The tracer recorded the operation while running with real values, therefore the guards have to succeed for values the optimizer discovers to be constant.

guard_false is slightly different from guard_true: after the former we know that the argument is actually 0. After guard_true we only know that it is not equal to zero, but not which precise value it has.

Another point to note in the optimization of guards is that the second argument of the guard operation, which was so far always just an empty list, is now replaced by the partial environment PEnv. I will discuss further down why this is needed.

Optimizing guard_value is very similar, except that it really gives precise information about the variable involved:

optimize(guard_value(V, C, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C1) ->
        NewOp = RestResidual,
        NEnv = PEnv
    ;
        write_env(PEnv, V, C, NEnv),
        NewOp = guard_value(V, C, PEnv, L, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

This operation is the main way how the optimizer gains constant variables that it then exploits to do constant-folding on later operations. This is a chief difference from partial evaluation: There the optimizer knows the value of some variables from the start. When optimizing traces, at the beginning the value of no variable is known. Knowledge about some variables is only later gained through guards.

Now we are missing what happens with the loop statement. In principle, it is turned into a loop statement again. However, at the loop statement a few additional operations need to be emitted. The reason is that we optimized away operations and thus assignments when the result value of the variable was a constant. That means the involved variable still potentially has some older value. The next iteration of the loop would continue with this older value, which is obviously wrong. Therefore we need to emit some assignments before the loop statement, one per entry in the partial environment:

optimize(loop, PEnv, T) :-
    generate_assignments(PEnv, T).

generate_assignments([], loop).
generate_assignments([Var/Val | Tail], op1(Var, same, const(Val), T)) :-
    generate_assignments(Tail, T).

As an example of how generate_assignments assignments works, let's look at the following example. When the partial environment is, [x/5, y/10] the following assignments are generated:

?- generate_assignments([x/5, y/10], Out).
Out = op1(x, same, const(5), op1(y, same, const(10), loop)).

That's all the code of the optimizer. While the basic structure is quite similar to partial evaluation, it's a lot less complex as well. What made the partial evaluator hard was that it needs to deal with control flow statements and with making sure that code is reused if the same block is partially evaluated with the same constants. Here, all these complexities go away. The tracer has already removed all control flow and replaced it with guards and one loop operation at the end. Thus, the optimizer can simply do one pass over the operations, removing some (with some extra care around the loop statement).

With this machinery in place, we can optimize the trace from the promotion example of the last post:

?- optimize(
    guard_value(x,3,[],b2,
    op2(x2,mul,var(x),const(2),
    op2(x3,add,var(x2),const(1),
    op2(i,sub,var(i),var(x3),
    op2(c,ge,var(i),const(0),
    guard_true(c,[],l_done, loop)))))),
    [],
    LoopOut).
LoopOut = guard_value(x, 3, [], b2, op2(i, sub, var(i), const(7), op2(c, ge, var(i), const(0), guard_true(c, [x/3, x2/6, x3/7], l_done, op1(x, same, const(3), op1(x2, same, const(6), op1(x3, same, const(7), loop)))))))

More readably, the optimized version is:

guard_value(x, 3, [], b2,
op2(i, sub, var(i), const(7),
op2(c, ge, var(i), const(0),
guard_true(c, [x/3, x2/6, x3/7], l_done,
op1(x, same, const(3),
op1(x2, same, const(6),
op1(x3, same, const(7),
loop)))))))

As intended, the operations on x after the guard_value have all been removed. However, some additional assignments (to x, x2, x3) at the end have been generated as well. The assignments look superfluous, but the optimizer does not have enough information to easily recognize this. That can be fixed, but only at the cost of additional complexity. (A real system would transform the trace into static single assignment form to answer such questions.)

Resuming to the Interpreter

Why does the code above need to add the partial environment to the guards that cannot be optimized away? The reason is related to why we needed to generate assignments before the loop statement. The problem is that the optimizer removes assignments to variables when it knows the values of these variables. That means that when switching back from running the optimized trace to the interpreter, a number of variables are not updated in the environment, making the execution in the interpreter incorrect.

In the example above, this applies to the variables x2 and x3. When the second guard fails, they have not been assigned in the optimized case. Therefore, the guard lists them and their (always constant) values.

When switching back these assignments need to be made. Thus we need to adapt the resume_interp function from the last blog post as follows:

write_resumevars([], Env, Env).
write_resumevars([Key / Value | Rest], Env, NEnv) :-
    write_env(Env, Key, Value, Env1),
    write_resumevars(Rest, Env1, NEnv).

resume_interp(Env, ResumeVars, L) :-
    write_resumevars(ResumeVars, Env, NEnv),
    block(L, Block),
    interp(Block, NEnv).

On resuming, the ResumeVars (a former partial environment) are simply added back to the normal environment before going back to the interpreter.

The data attached to guards about what needs to be done to resume to the interpreter when the guard fails is often a very complex part of a tracing system. The data can become big, yet most guards never fail. Therefore, most real systems try hard to compress the attached data or try to share it between subsequent guards.

Summary

In this post we have shown how to optimize traces by applying a variant of the partial evaluation principle: Perform all the operations that have only constant arguments, leave the others alone. However, optimizing traces is much simpler, because no control flow is involved. All the questions about control flow have already been solved by the tracing component.

In the next and final post of the series I will show a larger example of how tracing and partial evaluation can be used to optimize a small bytecode interpreter.

Wednesday, February 1, 2012

Almost There - PyPy's ARM Backend

In this post I want to give an update on the status of the ARM backend for PyPy's JIT and describe some of the issues and details of the backend.

Current Status

It has been a more than a year that I have been working on the ARM backend. Now it is in a shape, that we can measure meaningful numbers and also ask for some feedback. Since the last post about the backend we have added support floating point operations as well as for PyPy's framework GC's. Another area of work was to keep up with the constant improvements done in the main development branch, such as out-of-line guards, labels, etc. It has been possible for about a year to cross-translate the PyPy Python interpreter and other interpreters such as Pyrolog, with a JIT, to run benchmarks on ARM. Up until now there remained some hard to track bugs that would cause the interpreter to crash with a segmentation fault in certain cases when running with the JIT on ARM. Lately it was possible to run all benchmarks without problems, but when running the translation toolchain itself it would crash. During the last PyPy sprint in Leysin Armin and I managed to fix several of these hard to track bugs in the ARM backend with the result that, it is now possible to run the PyPy translator on ARM itself (at least unless until it runs out of memory), which is a kind of litmus test for the backend itself and used to crash before. Just to point it out, we are not able to complete a PyPy translation on ARM, because on the hardware we have currently available there is not enough memory. But up to the point we run out of memory the JIT does not hit any issues.

Implementation Details

The hardware requirements to run the JIT on ARM follow those for Ubuntu on ARM which targets ARMv7 with a VFP unit running in little endian mode. The JIT can be translated without floating point support, but there might be a few places that need to be fixed to fully work in this setting. We are targeting the ARM instruction set, because at least at the time we decided to use it seemed to be the best choice in terms of speed while having some size overhead compared to the Thumb2 instruction set. It appears that the Thumb2 instruction set should give comparable speed with better code density but has a few restriction on the number of registers available and the use of conditional execution. Also the implementation is a bit easier using a fixed width instruction set and we can use the full set of registers in the generated code when using the ARM instruction set.

The calling convention on ARM

The calling convention on ARM uses 4 of the general purpose registers to pass arguments to functions, further arguments are passed on the stack. The presence of a floating point unit is not required for ARM cores, for this reason there are different ways of handling floats with relation to the calling convention. There is a so called soft-float calling convention that is independent of the presence of a floating point unit. For this calling convention floating point arguments to functions are stored in the general purpose registers and on the stack. Passing floats around this way works with software and hardware floating point implementations. But in presence of a floating point unit it produces some overhead, because floating point numbers need to be moved from the floating point unit to the core registers to do a call and moved back to the floating point registers by the callee. The alternative calling convention is the so-called hard-float calling convention which requires the presence of a floating point unit but has the advantage of getting rid of the overhead of moving floating point values around when performing a call. Although it would be better in the long term to support the hard-float calling convention, we need to be able to interoperate with external code compiled for the operating system we are running on. For this reason at the moment we only support the soft-float to interoperate with external code. We implemented and tested the backend on a BeagleBoard-xM with a Cortex-A8 processor running Ubuntu 11.04 for ARM.

Translating for ARM

The toolchain used to translate PyPy currently is based on a Scratchbox2. Scratchbox2 is a cross-compiling environment. Development had stopped for a while, but it seems to have revived again. We run a 32-bit Python interpreter on the host system and perform all calls to the compiler using a Scratchbox2 based environment. A description on how to setup the cross translation toolchain can be found here.

Results

The current results on ARM, as shown in the graph below, show that the JIT currently gives a speedup of about 3.5 times compared to CPython on ARM. The benchmarks were run on the before mentioned BeagleBoard-xM with a 1GHz ARM Cortex-A8 processor and 512MB of memory. The operating system on the board is Ubuntu 11.04 for ARM. We measured the PyPy interpreter with the JIT enabled and disabled comparing each to CPython Python 2.7.1+ (r271:86832) for ARM. The graph shows the speedup or slowdown of both PyPy versions for the different benchmarks from our benchmark suite normalized to the runtime of CPython. The data used for the graph can be seen below.

The speedup is less than the speedup of 5.2 times we currently get on x86 on our own benchmark suite (see http://speed.pypy.org for details). There are several possible reasons for this. Comparing the results for the interpreter without the JIT on ARM and x86 suggests that the interpreter generated by PyPy, without the JIT, has a worse performance when compared to CPython that it does on x86. Also it is quite possible that the code we are generating with the JIT is not yet optimal. Also there are some architectural constraints produce some overhead. One of these differences is the handling of constants, most ARM instructions only support 8 bit (that can be shifted) immediate values, larger constants need to be loaded into a register, something that is not necessary on x86.

BenchmarkPyPy JITPyPy no JIT
ai0.4844397800473.72756749625
chaos0.08072916919342.2908692212
crypto_pyaes0.07111148322453.30112318509
django0.09777432455192.56779947601
fannkuch0.2104237356982.49163632938
float0.1542753346752.12053281495
go0.3304830342025.84628320479
html5lib0.6292643898623.60333138526
meteor-contest0.9847474269122.93838610037
nbody_modified0.2369695930821.40027234936
pyflate-fast0.3674471918072.72472422146
raytrace-simple0.02905274614371.97270054339
richards0.0345755735533.29767342015
slowspitfire0.7866425519083.7397367403
spambayes0.6603243794563.29059863111
spectral-norm0.0636107837314.01788986233
spitfire0.436171311652.72050579076
spitfire_cstringio0.2555387021341.7418593111
telco0.1029189304133.86388866047
twisted_iteration0.1227239868054.33632475491
twisted_names2.423677971352.99878698076
twisted_pb1.309918374314.48877805486
twisted_tcp0.9270333540552.8161624665
waf1.020598119321.03793427321


The next steps and call for help

Although there probably still are some remaining issues which have not surfaced yet, the JIT backend for ARM is working. Before we can merge the backend into the main development line there are some things that we would like to do first, in particular it we are looking for a way to run the all PyPy tests to verify that things work on ARM before we can merge. Additionally there are some other longterm ideas. To do this we are looking for people willing to help, either by contributing to implement the open features or that can help us with hardware to test.

The incomplete list of open topics:
  • We are looking for a better way to translate PyPy for ARM, than the one describe above. I am not sure if there currently is hardware with enough memory to directly translate PyPy on an ARM based system, this would require between 1.5 or 2 Gig of memory. A fully QEMU based approach could also work, instead of Scratchbox2 that uses QEMU under the hood.
  • Test the JIT on different hardware.
  • Experiment with the JIT settings to find the optimal thresholds for ARM.
  • Continuous integration: We are looking for a way to run the PyPy test suite to make sure everything works as expected on ARM, here QEMU also might provide an alternative.
  • A long term plan would be to port the backend to ARMv5 ISA and improve the support for systems without a floating point unit. This would require to implement the ISA and create different code paths and improve the instruction selection depending on the target architecture.
  • Review of the generated machine code the JIT generates on ARM to see if the instruction selection makes sense for ARM.
  • Build a version that runs on Android.
  • Improve the tools, i.e. integrate with jitviewer.
So if you are interested or willing to help in any way contact us.