Wednesday, April 20, 2016

PyPy 5.1 released

We have released PyPy 5.1, about a month after PyPy 5.0.

This release includes more improvements to warmup time and memory requirements, extending the work done in PyPy 5.0. We have seen an additional reduction of about 20% in memory requirements and up to a 30% improvement in warmup time; more detail is in the blog post.

We also now have full support for the IBM s390x. Since this support is in RPython, any dynamic language written using RPython, like PyPy, will automagically be supported on that architecture.

We updated cffi to 1.6 (cffi 1.6 itself will be released shortly), and continue to improve support for the wider python ecosystem using the PyPy interpreter.

You can download the PyPy 5.1 release here:
We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython's JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

This release supports:
  • x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, FreeBSD),
  • newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux,
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux.

Other Highlights

(since the release of PyPy 5.0 in March 2016)


  • New features:

    • A new jit backend for the IBM s390x, which was a large effort over the past few months.
    • Add better support for PyUnicodeObject in the C-API compatibility layer
    • Support GNU/kFreeBSD Debian ports in vmprof
    • Add __pypy__._promote
    • Make attrgetter a single type for CPython compatibility
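
For illustration, here is a minimal sketch of how `__pypy__._promote` can be used. The CPython fallback shim and the `dispatch` function are our own hypothetical additions so the snippet runs on any interpreter; only the `_promote` import itself comes from PyPy:

```python
# Sketch of using __pypy__._promote (PyPy-only). The fallback below is a
# hypothetical no-op shim so the example also runs on CPython.
try:
    from __pypy__ import _promote
except ImportError:
    def _promote(x):      # no-op outside PyPy
        return x

def dispatch(op, a, b):
    op = _promote(op)     # hint: let traces specialize on this value
    if op == "add":
        return a + b
    return a - b

print(dispatch("add", 2, 3))  # 5
```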

  • Bug Fixes

    • Catch exceptions raised in an exit function
    • Fix a corner case in the JIT
    • Fix edge cases in the cpyext refcounting-compatible semantics (more work on cpyext compatibility is coming in the cpyext-ext branch, but isn’t ready yet)
    • Try harder to not emit NEON instructions on ARM processors without NEON support
    • Improve the RPython posix module's system-interaction function calls
    • Detect a missing class function implementation instead of calling a random function
    • Check that PyTupleObjects do not contain any NULLs at the point of conversion to W_TupleObjects
    • In ctypes, fix _anonymous_ fields of instances
    • Fix JIT issue with unpack() on a Trace which contains half-written operations
    • Fix sandbox startup (a regression in 5.0)
    • Fix possible segfault for classes with mangled mro or __metaclass__
    • Fix isinstance(deque(), Hashable) on the pure python deque
    • Fix an issue with forkpty()
    • Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
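
The deque fix above is easy to check from the interpreter; a mutable container such as deque must not register as hashable (the `Hashable` import location below covers both Python 2 and 3):

```python
# Mutable containers such as deque are not hashable, so the ABC check
# should report False; the buggy pure-Python deque reported True.
from collections import deque
try:
    from collections.abc import Hashable   # Python 3
except ImportError:
    from collections import Hashable       # Python 2

print(isinstance(deque(), Hashable))  # False
```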

  • Numpy:

    • Implemented numpy.where for a single argument
    • Indexing by a numpy scalar now returns a scalar
    • Fix transpose(arg) when arg is a sequence
    • Refactor include file handling, now all numpy ndarray, ufunc, and umath functions exported from libpypy.so are declared in pypy_numpy.h, which is included only when building our fork of numpy
    • Add broadcast
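
As a sketch of the first two items, using standard numpy semantics (runnable on any interpreter with numpy installed):

```python
import numpy as np

a = np.array([0, 3, 0, 5])
idx, = np.where(a)            # single-argument form: indices of nonzero entries
print(idx)                    # [1 3]

s = a[np.intp(1)]             # indexing by a numpy scalar yields a scalar,
print(isinstance(s, np.ndarray))  # False: not a 0-d array
```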

  • Performance improvements:

    • Improve str.endswith([tuple]) and str.startswith([tuple]) to allow JITting
    • Merge another round of improvements to the warmup performance
    • Cleanup history rewriting in pyjitpl
    • Remove the forced minor collection that occurs when rewriting the assembler at the start of the JIT backend
    • Port the resource module to cffi
     
  • Internal refactorings:

    • Use a simpler logger to speed up translation
    • Drop vestiges of Python 2.5 support in testing
    • Update rpython functions with ones needed for py3k

Please update, and continue to help us make PyPy better.

Cheers,
The PyPy Team

    Monday, April 18, 2016

    PyPy Enterprise Edition

    With the latest additions, PyPy's JIT now supports the Z architecture on Linux. The newest architecture revision (also known as s390x, or colloquially referred to as "big iron") is the 64-bit extension for IBM mainframes. Currently only Linux 64 bit is supported (not z/OS nor TPF).
    This is the fourth assembler backend supported by PyPy, in addition to x86 (32 and 64), ARM (32-bit only) and PPC64 (both little- and big-endian). It might seem that we are getting the hang of new architectures. Thanks to IBM for funding this work!

    History

    When I went to university, one lecture covered Thomas Watson's prediction from 1943. His famous quote, "I think there is a world market for maybe five computers ...", turned out not to be true.

    However, even 70 years later, mainframes are used more often than you think. They back critical tasks requiring a high level of stability and security, and offer high hardware and computational utilization rates through virtualization.

    With the new PyPy JIT backend we are happy to present a fast Python virtual machine for mainframes and contribute more free software running on s390x.

    Meta tracing

    Even though the JIT backend has been tested on PyPy, it is not restricted to the Python programming language. Do you have a great idea for a DSL, or another language that should run on mainframes? Go ahead and just implement your interpreter using RPython.
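
    A minimal RPython target looks roughly like this (a sketch; the file name in the comment is illustrative, and the target would be translated with the `rpython` tool from a PyPy source checkout):

```python
# Minimal RPython target sketch. Translated with something like:
#   rpython/bin/rpython targetexample.py
# (the file name is illustrative)

def entry_point(argv):
    # RPython code: only the statically analyzable subset of Python
    print("Hello from a translated RPython binary!")
    return 0

def target(driver, args):
    # the translation toolchain calls this to obtain the entry point
    return entry_point, None
```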

    How do I get a copy?

    PyPy can be built using the usual instructions found here. As soon as the next PyPy version has been released we will provide binaries; until then you can just grab a nightly here. We are currently busy getting the next version of PyPy ready, so an official release will be rolled out soon.

    Comparing s390x to x86

    The goal of this comparison is not to scientifically evaluate the benefits/disadvantages of s390x, but rather to see that PyPy's architecture delivers the same benefits as it does on other platforms. Similar to the comparison done for PPC, I ran the benchmarks using the same setup. Each column shows the speedup of the PyPy JIT VM over a pure PyPy interpreter without the JIT 1). Note that the s390x's OS was virtualized.

      Label               x86     s390x      s390x (run 2)

      ai                 13.7      12.4       11.9
      bm_chameleon        8.5       6.3        6.8
      bm_dulwich_log      5.1       5.0        5.1
      bm_krakatau         5.5       2.0        2.0
      bm_mako             8.4       5.8        5.9
      bm_mdp              2.0       3.8        3.8
      chaos              56.9      52.6       53.4
      crypto_pyaes       62.5      64.2       64.2
      deltablue           3.3       3.9        3.6
      django             28.8      22.6       21.7
      eparse              2.3       2.5        2.6
      fannkuch            9.1       9.9       10.1
      float              13.8      12.8       13.8
      genshi_text        16.4      10.5       10.9
      genshi_xml          8.2       7.9        8.2
      go                  6.7       6.2       11.2
      hexiom2            24.3      23.8       23.5
      html5lib            5.4       5.8        5.7
      json_bench         28.8      27.8       28.1
      meteor-contest      5.1       4.2        4.4
      nbody_modified     20.6      19.3       19.4
      pidigits            1.0      -1.1       -1.0
      pyflate-fast        9.0       8.7        8.5
      pypy_interp         3.3       4.2        4.4
      raytrace-simple    69.0     100.9       93.4
      richards           94.1      96.6       84.3
      rietveld            3.2       2.5        2.7
      slowspitfire        2.8       3.3        4.2
      spambayes           5.0       4.8        4.8
      spectral-norm      41.9      39.8       42.6
      spitfire            3.8       3.9        4.3
      spitfire_cstringio  7.6       7.9        8.2
      sympy_expand        2.9       1.8        1.8
      sympy_integrate     4.3       3.9        4.0
      sympy_str           1.5       1.3        1.3
      sympy_sum           6.2       5.8        5.9
      telco              61.2      48.5       54.8
      twisted_iteration  55.5      41.9       43.8
      twisted_names       8.2       9.3        9.7
      twisted_pb         12.1      10.4       10.2
      twisted_tcp         4.9       4.8        5.2


      Geometric mean:    9.31      9.10       9.43


    As you can see, the benefits are comparable on both platforms.
    Of course this is not scientifically rigorous, but it shows a tendency: s390x can achieve the same results as you can get on x86.
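
    The geometric mean shown in the table is the appropriate way to average speedup ratios; a quick sketch of how such a figure is computed:

```python
import math

def geometric_mean(speedups):
    # n-th root of the product, computed via logs for numerical stability
    logs = [math.log(s) for s in speedups]
    return math.exp(sum(logs) / len(logs))

print(round(geometric_mean([2.0, 8.0]), 2))  # 4.0
```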

    Are you running your business application on a mainframe? We would love to get some feedback. Join us on IRC and tell us if PyPy made your application faster!

    plan_rich & the PyPy Team

    1) PyPy revision for the benchmarks: 4b386bcfee54

    Thursday, April 7, 2016

    Warmup improvements: more efficient trace representation

    Hello everyone.

    I'm pleased to report that we've finished another round of improvements to the warmup performance of PyPy. Before I go into details, I'll recap the progress we've made since we started working on warmup performance. I picked a random PyPy from November 2014 (definitely before we started the warmup work) and compared it with a recent one, after 5.0. The exact revisions are respectively ffce4c795283 and cfbb442ae368. First let's compare the pure warmup benchmarks that can be found in our benchmarking suite. Of those, the pypy-graph-alloc-removal numbers should be taken with a grain of salt, since other work could have influenced the results. The rest of the benchmarks mentioned are bottlenecked purely by warmup times.

    You can see how much time your program spends in warmup by running PYPYLOG=jit-summary:- pypy your-program.py and looking at the "Tracing" and "Backend" fields (in the first three lines of the summary). An example looks like this:

    [e00c145a41] {jit-summary
    Tracing:        71      0.053645 <- time spent tracing & optimizing
    Backend:        71      0.028659 <- time spent compiling to assembler
    TOTAL:                  0.252217 <- total run time of the program
    

    The results of the benchmarks

    benchmark                 time (old)  time (new)  speedup  JIT time (old)  JIT time (new)

    function_call             1.86s       1.42s       1.3x     1.12s           0.57s
    function_call2            5.17s       2.73s       1.9x     4.2s            1.6s
    bridges                   2.77s       2.07s       1.3x     1.5s            0.8s
    pypy-graph-alloc-removal  2.06s       1.65s       1.25x    1.25s           0.79s

    As we can see, the overall warmup benchmarks got up to 90% faster, with JIT time dropping by up to 2.5x. We have more optimizations in the pipeline, including an idea for converting some of the JIT-time gains into better total program runtime by JITting earlier and more eagerly.

    Details of the last round of optimizations

    Now the nitty-gritty details - what did we actually do? I covered a lot of warmup improvements in past blog posts, so I'm going to focus on the latest change, the jit-leaner-frontend branch. The change itself is simple: instead of using pointers to store the "operations" objects created during tracing, we use a compact list of 16-bit integers (with 16-bit pointers in between). On a 64-bit machine the memory wins are tremendous - using 16-bit pointers instead of full 64-bit pointers makes the representation 4x more compact. Additionally, the smaller representation has much better cache behavior and requires much less pointer chasing in memory. It also has a better-defined lifespan, so we don't need the GC to track these objects, which saves quite a bit of time as well.
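
    To illustrate the idea (this is our own simplified sketch, not PyPy's actual data structure): operations are appended to one flat array of 16-bit slots, and arguments refer back to earlier results by index instead of by pointer:

```python
from array import array

# Hypothetical opcode table for the sketch
OPS = {"int_add": 1, "int_mul": 2, "guard_true": 3}

class CompactTrace:
    """Trace ops stored as flat 16-bit integers, not heap objects."""

    def __init__(self):
        self._data = array("H")          # unsigned 16-bit slots
        self._count = 0                  # number of ops recorded so far

    def append(self, opname, *arg_indices):
        self._data.append(OPS[opname])
        self._data.append(len(arg_indices))
        self._data.extend(arg_indices)   # 16-bit references to earlier ops
        self._count += 1
        return self._count - 1           # index usable as a later argument

trace = CompactTrace()
a = trace.append("int_add", 0, 1)
trace.append("int_mul", a, a)
print(len(trace._data))  # 8 slots: two ops, each opcode + argcount + 2 args
```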

    The change sounds simple, but the details of the underlying data meant that everything in the JIT had to be changed, which took quite a bit of effort :-)

    Going into the future on the JIT front, we have an exciting set of optimizations, ranging from faster loops through faster warmup to using better code generation techniques and broadening the kinds of programs that PyPy speeds up. Stay tuned for updates.

    We would like to thank our commercial partners for making all of this possible. The work has been performed by baroquesoftware and would not be possible without support from people using PyPy in production. If your company uses PyPy and wants it to do more, or does not use PyPy but has performance problems with its Python installation, feel free to get in touch with me; trust me, using PyPy ends up being a lot cheaper than rewriting everything in Go :-)

    Best regards,
    Maciej Fijalkowski