Python: The Optimization Ladder

Posted by Twirrim 4 days ago

Python: The Optimization Ladder(cemrehancavdar.com)

201 points | 67 comments

Ralfp 6 hours ago|

    CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;

Good news. Python 3.15 adapts Pypy tracing approach to JIT and there are real performance gains now:

https://github.com/python/cpython/issues/139109

https://doesjitgobrrr.com/?goals=5,10

pjmlp 27 minutes ago||

Now this is great to know.

josalhor 6 hours ago||

While this is great, I expected faster CPython to eventually culminate into what YJIT for Ruby is. I'm not sure the current approaches they are trying will get the ecosystem there.

kenjin4096 4 hours ago||

I implemented most of the tracing JIT frontend in Python 3.15, with help from Mark to clean up and fix my code. I also coordinated some of the community JIT optimizer effort in Python 3.15 (note: NOT the code generator/DSL/infra, that's Mark, Diego, Brandt and Savannah). So I think I'm able to answer this.

I can't speak for everyone on the team, but I did try the lazy basic block versioning in YJIT in a fork of CPython. The main problem is that the copy-and-patch backend we currently have in CPython is not too amenable to self-modifying machine code. This makes inter-block jumps/fallthroughs very inefficient. It can be done, it's just a little strange. Also for security reasons, we tried not to have self-modifying code in the original JIT and we're hoping to stick to that. Everything has their tradeoffs---design is hard! It's not too difficult to go from tracing to lazy basic blocks. Conceptually they're somewhat similar, as the original paper points out. The main thing we lack is the compact per-block type information that something like YJIT/Higgs has.

I guess while I'm here I might as well make the distinction:

- Tracing is the JIT frontend (region selection).

- Copy and Patch is the JIT backend (code generation).

We currently use both. PyPy uses meta-tracing. It traces the runtime itself rather than the user's code in CPython's tracing case. I did take a look at PyPy's code, and a lot of ideas in the improved JIT are actually imported from PyPy directly. So I have to thank them for their great ideas. I also talk to some of the PyPy devs.

Ending off: the team is extremely lean right now. Only 2 people were generously employed by ARM to work on this full time (thanks a lot to ARM too!). The rest of us are mostly volunteers, or have some bosses that like open source contributions and allow some free time. As for me, I'm unemployed at the moment and this is basically my passion project. I'm just happy the JIT is finally working now after spending 2-3 years of my life on it :). If you go to Savannah's website [1], the JIT is around 100% faster for toy programs like Richards, and even for big programs like tomli parsing, it's 28% faster on macOS AArch64. The JIT is very much a community effort right now.

[1]: https://doesjitgobrrr.com/?goals=5,10

PS: If you want to see how the work has progressed, click "all time" in that website, it's pretty cool to see (lower is faster). I have a blog explaining how we made the JIT faster here https://fidget-spinner.github.io/posts/faster-jit-plan.html.

__mharrison__ 5 hours ago||

Great writeup.

I've been in the pandas (and now polars world) for the past 15 years. Staying in the sandbox gets most folks good enough performance. (That's why Python is the language of data science and ML).

I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.

One overlooked area in the article is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code with import change).

bloaf 4 hours ago|

Taichi, benchmarked in the article, claims to be able to outperform CUDA at some GPU tasks, although their benchmarks look to be a few years old:

https://github.com/taichi-dev/taichi_benchmark

pjmlp 4 hours ago||

And doesn't account for cuTitle, NVidia's new API infrastructure that supports writing CUDA directly in Python via a JIT that is based on MLIR.

repple 4 hours ago||

Significant AI smell in this write up. As a result, my current reflex is to immediately stop reading. Not judgement on the actual analysis and human effort which went in. It’s just that the other context is missing.

canjobear 1 hour ago||

Here's what gave it away for me

> The remaining difference is noise, not a fundamental language gap. The real Rust advantage isn't raw speed -- it's pipeline ownership.

repple 37 minutes ago||

There’s an unmistakable rhythm beginning with first paragraph. The trigger was “Same problems, same Apple M4 Pro, real numbers.” in third for me.

I’m scarred to detect these things by my own AI usage.

https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

huseyinkeles 3 hours ago|||

The author is from Turkey (where I’m also originally from).

Believe it or not, when you write a blog post in a different language, it really helps to use an LLM, even just to fix your grammar mistakes etc.

I assume that’s most likely what happened here too.

shepherdjerred 1 hour ago|||

IMO it would make sense to add a disclaimer then, e.g. “I wrote this myself but had AI edit”

I have no problem with people using AI, especially to close a language gap.

If you disclose your usage I have a _lot_ more trust that effort has been put into the writing despite the usage

butterNaN 40 minutes ago||||

Honestly I'd rather read imperfect english

repple 3 hours ago|||

I believe it

Terretta 7 minutes ago|||

“The numbers are real.” But the voice is not.

pjmlp 20 minutes ago|||

If we only applied the same reflex to software, even when 100% human programmed.

jb_hn 3 hours ago|||

I didn't notice any signs of AI writing until seeing this comment and re-reading (though I did notice it on the second pass).

That said, I think this article demonstrates that focusing on whether or not an article used AI might be focusing on the wrong “problem.” I appreciate being sensitive to the "smell" (the number of low-effort, AI posts flying around these days has made me sensitive too), but personally, I found this article both (1) easy to read and (2) insightful. I think the number of AI-written content lacking (2) is the problem.

repple 3 hours ago||

Your initial focus is to prioritize which content to consume.

markisus 2 hours ago|||

I also seem to be developing an immune response to several slopisms. But the actual content is useful for outlining tradeoffs if you’re needing to make your Python code go faster.

MonkeyClub 3 hours ago|||

I got the same sense, but nowadays I can't be sure whether a text is AI or the writer's style has absorbed LLM tropes.

FusionX 3 hours ago||

I don't think it should be conflated with auto generated AI slop. I see a lot of snippets which were clearly manually written. I'm assuming the author used AI in a supervised manner, to smooth out the writing process and improve coherency.

seanwilson 6 hours ago||

> The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize. ...

> 4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts.

Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?

I can't think of cases in strong statically typed languages where I've wanted something like monkey patching, and when I see monkey patching elsewhere there's often some reasonable alternative or it only needs to be used very rarely.

bloaf 4 hours ago||

I've always thought the flexibility should allow python to consume things like gRPC proto files or OpenAPI docs and auto-generate the classes/methods at runtime as opposed to using codegen tools. But as far as I know, there aren't any libraries out there actually doing that.

haimez 29 minutes ago|||

Generating code at runtime is often an anti-goal because you can’t easily introspect it. “Build-time” generation gives you that, but print often choose to go further and check the generated code to source control to be able to see the change history.

skeledrew 3 hours ago|||

But it's an fairly easy build if you want any of that.

NeutralForest 4 hours ago|||

There are some use cases for very dynamic code, like ORMs; with descriptors you can add attributes + behavior at runtime and it's quite useful. Anyways, breaking metaprogramming and more dynamic features would mean python 4 and we know how 2 -> 3 went. I also don't think it's where the core developers are going. Also also, there are other things I'd change before going after monkey patching like some scoping rules, mutable defaults in function attributes, better async ergonomics, etc.

LtWorf 5 hours ago||

I've used a library that patches the zipfile module to add support for zstd compression in zipfiles.

In python3.14 the support is there, but 2 years ago you could just import this library and it would just work normally.

elophanto_agent 2 hours ago||

the optimization ladder is just the five stages of grief but for python developers. denial ("it's fast enough"), anger ("why is this so slow"), bargaining ("maybe if I use numpy"), depression ("I should rewrite this in rust"), acceptance ("actually cython is fine")

rusakov-field 5 hours ago||

Python is perfect as a "glue" language. "Inner Loops" that have to run efficiently is not where it shines, and I would write them in C or C++ and patch them with Python for access to the huge library base.

This is the "two language problem" ( I would like to hear from people who extensively used Julia by the way, which claims to solve this problem, does it really ?)

pjmlp 4 hours ago|

This problem has been solved already by Lisp, Scheme, Java, .NET, Eiffel, among others, with their pick and choose mix of JIT and AOT compiler toolchains and runtimes.

pjmlp 4 hours ago||

Kudos for going through all the existing JIT approaches, instead of reaching for rewrite into X straight away.

However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.

blt 4 hours ago||

Surprised Python is only 21x slower than C for tree traversal stuff. In my experience that's one of the most painful places to use Python. But maybe that's because I use numpy automatically when simple arrays are involved, and there's no easy path for trees.

tweakimp 4 hours ago||

Be careful with that, numpy arrays can be slower than Python tuples for some operations. The creation is always slower and the overhead has to be worth it.

AlotOfReading 3 hours ago||

You can turn trees into numpy-style matrix operations because graphs and matrices are two sides of the same coin. I don't see the code for the binary-tree benchmark in the repo to see how it's written, but there are libraries like graphblas that use the equivalence for optimization.

markisus 2 hours ago||

I wish there were more details on this part.

> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.

I thought never taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?

pavpanchekha 2 hours ago|

They're cheap but not free, especially at the front end of the CPU where it's just a lot more instructions to churn through. What the branch predictor gets you is it turns branches, which would normally cause a pipeline bubble, to be executed like straightline code if they're predicted right. It's a bit like a tracing jit. But you will still have a bunch of extra instructions to, like, compute the branch predicate.

beng-nl 2 hours ago||

Worse, IMO, is the never taken branch taking up space in branch prediction buffers. Which will cause worse predictions elsewhere (when this branch ip collides with a legitimate ip). Unless I missed a subtlety and never taken branches don’t get assigned any resources until they are taken (which would be pretty smart actually).

adsharma 3 hours ago|

Missing: write static python and transpile to rust pyO3 which is at the top of the ladder.

Some nuance: try transpiling to a garbage collected rust like language with fast compilation until you have millions of users.

Also use a combination of neural and deterministic methods to transpile depending on the complexity.

zahlman 2 hours ago||

> a garbage collected rust like language with fast compilation

I don't know what languages you might have in mind. "Rust-like" in what sense?

pjmlp 16 minutes ago||

Probably OCaml, Standard ML, Haskell, MLton, F#, Scala,....

If going to complain about some of those being slow, remeber that they have various options between interpreter, bytecode, REPL, JIT and AOT.

tda 2 hours ago||

One thing with python is that usually I will use one of the many c based libraries to get reasonable speed and well thought out abstractions from the start. I architect around numpy, scipy, shapely, pandas/polars or whatever. So my code runs at reasonable speed from the start. But transpiling to rust then effectively means a complete redesign of the code, data structures, algorithms etc. And I have seen the AI tools really struggle to get it right, as my intent gets lost somewhere.

So what I do now (since Claude Code) is write really bare bones (and slow) pure python implementation (like I used to do for numba, pypy or cython ready code), with minimal dependencies. Then I use the REPL, notebooks and nice plotting tools to get a real understanding of the problem space and the intricacies of my algorithm/problem at hand. When done, I let Claude add tests and I ask it to transpile to equivalent Rust and boom! a flawless 1000x speed upgrade in a minutes.

The great thing is I don't need to do the mental gymnastics to vectorize code in a write only mode like I've had to do since my Matlab days. Instead I can write simple to read for loops that follow my intent much better, and result in much more legible code. So refreshing!

And with pyO3 i can still expose the Rust lib to python, and continue to use Python for glue and plotting

More comments...