A great example of when winning in the average works is register allocation. It’s fine there because the cost of any particular variable getting spilled is so low. So, all that matters is that most variables are in registers most of the time. If spill heuristics change for the better, it usually means some of your variables that previously got spilled now are in registers while others that were in registers are now spilled - and the compiler writer declares victory if this is a speedup in some overall average of large benchmarks. Similar thinking plays out in stuff like common subexpression elimination or basically any strength reduction. (In fact, most of those optimizations have the peculiar property that you’ll always be able to craft a program that shows the optimization to be a bad idea; we do them anyway because on average they are a speedup.)
In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
This is sort of what the article is saying, I guess. But for example on the issue of the optimizer definitely removing a GC allocation: the best design there is for the GC’d language to have a notion of value types that don’t involve allocation at all. C# has that, Java doesn’t.
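To make that concrete, here's what the guarantee looks like in C++ (a minimal sketch; C#'s structs give the equivalent promise in a GC'd setting):

    #include <cstdio>

    // A value type: it lives wherever it's declared (stack, register,
    // inside another object). Passing or returning it copies bits; the
    // language guarantees no allocator is involved anywhere below.
    struct Point {
        double x, y;
    };

    double dot(Point a, Point b) {   // by value: no heap, no GC
        return a.x * b.x + a.y * b.y;
    }

    int main() {
        Point p{1.0, 2.0}, q{3.0, 4.0};
        std::printf("%f\n", dot(p, q));   // prints 11.000000
    }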
My optimizer first appeared in Datalight C around 1984 or so. It was the first DFA optimizer for any C compiler on the PC. C compiler benchmark roundups were popular articles in programming magazines at the time. We breathlessly waited for the next roundup article.
When it came, Datalight C was omitted from it! The journalist said they excluded Datalight C because it was buggy, as it deleted the benchmark code and just printed the success message. The benchmarks at the time consisted of things like:
for (i = 0; i < 1000; ++i) a = 3;
so of course it deleted the useless code. I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
But it wasn't long until the industry realized that optimizers were the future, the benchmarks were revised, and the survivors in the compiler business all did optimizers. DLC recovered, but as you can tell, I am still annoyed at the whole incident.
Some things one just doesn't anticipate.
Working on compilers is never dull.
I’ve been working on compilers for 30 years, primarily on optimizers (though I’ve done some frontend work as well). Learned C from Borland C and C++ from…Zortech C++ 1.0, so thank you for that!
Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code. It taught me very early on that you really need to focus on scalability and compile time in compilers, whether it’s the amount of code within a function, or across functions (for IPO / LTO).
And yes, working on compilers is never dull. 25 years ago I thought we’d end up in a monolithic world with x86, C++, and Java being the focus of all work. Instead, there’s been an absolute explosion of programming models, languages, and architectures, as well as entirely new problem spaces like graph compilers for ML.
The way every optimizer I've worked on (and written) deals with this is canonical forms. Like, you decree that the canonical form of "multiply integer by 2" is "x << 1", and then you make sure that no optimization ever turns "x << 1" into anything else (though the instruction selector may then turn "x << 1" into "x + x" since that's the best thing on most CPUs).
But that doesn't necessarily make this problem any easier. Just gives you a principled story for how to fix the problem if you find it. I think that usually the canonical forms aren't even that well documented, and if you get it wrong, then you'll still have valid IR so it's not like the IR verifier will tell you that you made a mistake - you'll just find out because of some infinite loop.
And yeah, lots of compiler optimization fixpoints have a counter to kill them after some limit. The LLVM inliner fixpoint is one example of such a thing.
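A toy sketch of both ideas together, with a made-up one-instruction IR (nothing here is any real compiler's API):

    #include <cstdio>

    // Toy IR: one binary op applied to a variable "x" and a constant.
    enum class Op { Mul, Shl };
    struct Inst { Op op; int rhs; };

    // Canonicalization rule: "x * 2^k" always becomes "x << k", and no
    // pass is ever allowed to produce "x * 2^k" again. Returns whether
    // it changed anything.
    bool canonicalize(Inst& i) {
        if (i.op == Op::Mul && i.rhs > 1 && (i.rhs & (i.rhs - 1)) == 0) {
            int k = 0;
            while ((1 << k) < i.rhs) ++k;
            i = {Op::Shl, k};
            return true;
        }
        return false;
    }

    int main() {
        Inst i{Op::Mul, 8};
        // Run to a fixpoint, with a counter so that a mistaken pair of
        // dueling rewrite rules can't loop forever.
        for (int n = 0; n < 8 && canonicalize(i); ++n) {}
        std::printf("%s %d\n", i.op == Op::Shl ? "shl" : "mul", i.rhs);  // shl 3
    }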
> I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
Whoa! That's a crazy story! Thanks for sharing!
I agree with this to some extent but not fully. I think there are shades of grey to this -- adding language features is a fairly complex and time-consuming process, especially for mainstream languages. Even for properties which many people would like to have, such as "no GC", there are complex tradeoffs (e.g. https://em-tg.github.io/csborrow/)
My position is that language users need to be empowered in different ways depending on the requirements. If you look at the Haskell example involving inspection testing/fusion, there are certain guarantees around some type conversions (A -> B, B -> A) being eliminated -- these are somewhat specific to the library at hand. Trying to formalize each and every performance-sensitive library's needs using language features is likely not practical.
Rather, I think it makes sense to focus instead on a more bottom-up approach, where you give somewhat general tools to the language users (this doesn't need to expose a full IR), and see what common patterns emerge before deciding whether to "bless" some of them as first-class language features.
My point is that if, in the course of discovering common patterns, you find that the optimizer must do a heroic optimization with a 10x upside when it hits and weird flakiness about when it hits, then that's a good indication that you want a language feature that lets you skip the optimizer and lets the programmer dictate the outcome.
By the way, avoiding GC is not the same thing as having value types. Avoiding GC altogether is super hard. But value types aren’t that hard and aren’t about entirely avoiding GC - just avoiding it in specific cases.
https://jdk.java.net/valhalla/
Yes, it was a bummer that Java didn't take up the ideas of Cedar, the Oberon lineage, Modula-3, Eiffel,... even though some are cited as its influences.
Still, I am confident that Java might get value types before C++ reflection, networking, senders/receivers, or safety get sorted out. Or even before we can finally write portable C++ code using C++20 modules.
If it was easy it would be done by now.
There are plenty of long-running efforts in other language ecosystems that have also taken decades and are still not fully done, e.g. C++ modules, contracts, reflection,...
If you really need value-type-like objects today, they are possible with Panama, even without language syntax for them.
> So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
There's a further comment about the potential of opting out of atomicity guarantees to avoid that problem, but then there are more problems - it looks like pre-JIT execution would still allocate, and who knows how consistent the JIT would be about scalarization. IIRC there was also some mention somewhere of just forcing large enough value objects to always be heap allocations.
> Heap flattening must maintain the integrity of objects. For example, the flattened data must be small enough to read and write atomically, or else it may become corrupted. On common platforms, "small enough" may mean as few as 64 bits, including the null flag. So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
And maybe the end of the next paragraph is even more relevant:
> In the future, 128-bit flattened encodings should be possible on platforms that support atomic reads and writes of that size. And the Null-Restricted Value Types JEP will enable heap flattening for even larger value classes in use cases that are willing to opt out of atomicity guarantees.
It’s a dang hard feature to retrofit into the way the JVM works. I wish those folks the best of luck.
JARs and modules that worked on the JVM before the introduction of value types should keep running, and new code has to be able to interoperate with such JARs.
Automatic vectorisation is another big one. It feels to me like vectorisation is less reliable / more complex than TCO? But on the other hand the downside is a linear slowdown, not "your program blows the stack and crashes".
This is the critical point. If CI fails, or you are otherwise warned, when the loop doesn't vectorize, then you can count on it always happening.
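Clang can already surface this via optimization remarks, which a CI step can grep for and turn into a hard failure (the flags below are Clang's; GCC has -fopt-info-vec-missed):

    // Build with:
    //   clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize scale.cpp -c
    // A CI job can fail the build if a "-Rpass-missed" remark lands on a
    // loop we've declared performance-critical.
    #include <cstddef>

    void scale(float* out, const float* in, std::size_t n, float k) {
    #pragma clang loop vectorize(enable)   // ask explicitly; get told if it fails
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }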
Intrinsics work poorly in some compilers, and Intel's intrinsics are so hard to read because of inscrutable Hungarian notation that you should just write in asm instead.
- it would be better if the intrinsics had sensible names. I couldn’t agree more.
- it would be better if compilers consistently did a good job of implementing them. I wonder which compilers do a bad job? Does clang do a good job or not so much?
I think intrinsics make sense for the case where the language being used is not otherwise simd and that language already has value types (so it’s easy to add a vector type). It would be great if they at least worked consistently well and had decent names in that case.
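For a taste of the naming problem, two real intrinsics from <immintrin.h> (the Hungarian-style suffixes encode element type and width):

    #include <immintrin.h>

    // "Add eight packed 32-bit integers" - simple operation, opaque name.
    // (epi32 = "extended packed integer, 32-bit"; needs AVX2.)
    __m256i add8x32(__m256i a, __m256i b) {
        return _mm256_add_epi32(a, b);
    }

    // "Multiply unsigned by signed bytes, add adjacent pairs into 16-bit
    // sums, with saturation" - good luck guessing that from the name.
    __m128i dot_bytes(__m128i a, __m128i b) {
        return _mm_maddubs_epi16(a, b);   // needs SSSE3
    }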
1. You’re not gonna get any guarantees that the optimization will happen. That makes it High Level. Just write code. We won’t force you to pollute your code with ugly annotations or pragmas.
2. In turn: check the assembly, or whatever concrete thing reveals whether the optimization you wished for in your head actually went through.
There’s some kind of abstraction violation in the above somewhere.
"Abstraction violation" is a good way to put it.
Usually, if you don't get the optimization you wished for, it means that there is something you didn't account for. In C++, it may be exception processing, aliasing rules, etc... Had the compiler made the optimization you wished for, it wouldn't have been correct with regard to the specifications of the language, it may even hide a bug. The solution is then to write it in a way that is more explicit, to make the compiler understand that the edge case can never happen, which will then enable the optimization. It is not really an abstraction violation, more like a form of debugging.
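Aliasing is the classic instance of this. A sketch (GCC, Clang, and MSVC all accept __restrict as an extension; restrict is standard in C but not C++):

    #include <cstddef>

    // The edge case the compiler must assume: "sum" could point into
    // "in", so *sum has to be stored and reloaded on every iteration.
    void accumulate(float* sum, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            *sum += in[i];
    }

    // Promising that the edge case can't happen lets the compiler keep
    // the accumulator in a register and vectorize the loop.
    void accumulate_fast(float* __restrict sum, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            *sum += in[i];
    }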
If you really need to get low level, there is some point where you need to write assembly language, which is obviously not portable, but getting every last bit of performance is simply incompatible with portability.
This is not true of JIT compilers, of course, which have similar constraints to DB query planners. In these cases the goal is to do a good job pretty quickly, rather than an excellent job in a reasonable time.
The number of possible distinct query plans grows very rapidly as the complexity increases (exponentially or factorially... I can't remember). So even if you have 10x as much time available for optimisation, it makes a surprisingly small difference.
One approach I've seen with systems like Microsoft Exchange and its underlying Jet database is that queries are expressed in a lower-level syntax tree DOM structure. The specific query plan is "baked in" by developers right from the beginning, which provides stable and consistent performance in production. It's also lower latency because the time spent by the optimiser at runtime is zero.
You can normally only send SQL queries to a database and not execution plans.
Since there's no bit rotate operator in C, you're left hoping the compiler recognizes what the shifts and bitwise-ands are trying to do.
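The idiom compilers look for is the shift-or pattern below; written with masked shift counts it has no undefined behavior, and GCC, Clang, and MSVC all pattern-match it into a single rotate instruction (C++20 finally added std::rotl to spell it directly):

    #include <cstdint>

    // Rotate-left for 32-bit values. The "& 31" keeps both shift counts
    // in range (so n == 0 is fine), and is exactly the shape that the
    // compilers' rotate pattern-matchers recognize.
    uint32_t rotl32(uint32_t x, unsigned n) {
        return (x << (n & 31)) | (x >> (-n & 31));
    }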
Once an optimization becomes part of the interface and is guaranteed, is it really an optimization? Or did it just become part of the language/library/database/whatever?
One example is return value optimization in C++. In C++17 the "optimization" became mandatory in some contexts. What really happened though is that the rules of temporary materialization changed, and in those contexts it just never happens prematurely by the language rules. This ceased to be an optimization and became a mechanism in the language.
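A small illustration of that rule change: since C++17 the prvalue is materialized directly in the caller's storage, so this compiles even though the type can be neither copied nor moved:

    struct Big {
        int data[1024];
        Big() : data{} {}
        Big(const Big&) = delete;            // no copy, and therefore no move
    };

    Big make() {
        return Big{};   // C++17: constructed directly in the caller's object
    }

    int main() {
        Big b = make(); // OK since C++17; ill-formed in C++14 (deleted copy)
        return b.data[0];
    }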
What I'm getting at is that unreliability is a defining quality of optimizations.
Sure, there are certain optimizations that become load-bearing, in which case it would be better if they became part of the language's semantics and guarantees, therefore they ceased to be optimizations.
Even if that second description is stable and part of the guarantees you make, keeping it separate is still incredibly useful from a user perspective.
From an implementation perspective, there is also a useful distinction. Optimizations take a valid representation, and turn it into a different valid representation of the same type that shares all defined behavior. This is a fairly different operation from compilation, which converts between representations. In particular, for the compilation step, you typically have only one compilation function for a given pair of representations; and if you have multiple, you select one ahead of time. For optimizations, each representation has a set of optimization functions, and you need to decide in what order to apply them and how many times to do so. Compilation functions, for their part, need to deal with every difference between the two representations, whereas optimization functions get to ignore everything except the part they care about.
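That distinction shows up directly in the shapes of the functions involved; a sketch with made-up types:

    #include <vector>

    struct Ast {}; struct Ir {}; struct Asm {};

    // Compilation: one function per pair of representations.
    Ir  lower(const Ast&) { return {}; }     // stub
    Asm select(const Ir&) { return {}; }     // stub

    // Optimization: many functions of the same shape, Ir -> Ir; which
    // ones run, in what order, and how many times is a scheduling choice.
    using IrPass = void (*)(Ir&);

    Asm compile(const Ast& ast, const std::vector<IrPass>& schedule) {
        Ir ir = lower(ast);
        for (IrPass pass : schedule) pass(ir);
        return select(ir);
    }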
I think that the compilation and optimization step, as a black box, is a disservice for highly reliable software development. Compiler and optimizer bugs are definitely a thing. I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers. Yeah, this makes perfect sense when trying to optimize fixed point multiplication, but it completely destroys the security of DLP or ecDLP based cryptography by introducing timing attacks that can recover the private key. Thankfully, I was fastidious about examining the optimized machine code output of this compiler, and was able to substitute hand coded assembler in its place.
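To illustrate the hazard (hypothetical code, not the actual compiler output from that incident): a 32-bit multiply built from 16-bit halves, the way a compiler for an 8-bit target might lower it, where an early-out for small operands branches on potentially secret data.

    #include <cstdint>

    // Branchy lowering: faster when the high halves are zero, but the
    // data-dependent branch leaks operand magnitude through timing.
    uint32_t mul32_branchy(uint32_t a, uint32_t b) {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        uint32_t lo = al * bl;
        if (ah == 0 && bh == 0)                    // timing side channel
            return lo;
        return lo + ((al * bh + ah * bl) << 16);   // full product mod 2^32
    }

    // Constant-time lowering: always does the full computation.
    uint32_t mul32_ct(uint32_t a, uint32_t b) {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        return al * bl + ((al * bh + ah * bl) << 16);
    }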
AFAIK, that's how seL4 is verified. Quoting from https://docs.sel4.systems/projects/sel4/frequently-asked-que...
"[...] Specifically, the ARM, ARM_HYP (ARM with virtualisation extensions), X64, and RISCV64 versions of seL4 comprise the first (and still only) general-purpose OS kernel with a full code-level functional correctness proof, meaning a mathematical proof that the implementation (written in C) adheres to its specification. [...] On the ARM and RISCV64 platforms, there is a further proof that the binary code which executes on the hardware is a correct translation of the C code. This means that the compiler does not have to be trusted, and extends the functional correctness property to the binary. [...] Combined with the proofs mentioned above, these properties are guaranteed to be enforced not only by a model of the kernel (the specification) but the actual binary that executes on the hardware."
I'm working on a hybrid approach between SMT solving and constructive proofs. Model checking done with an SMT solver is pretty sound. I'm actually planning a book on a scalable technique to do this with CBMC. But, the last leg of this really is understanding the compiler output.
FWIW, I think this should be considered a language design problem rather than an optimizer design problem. Black box optimizer behaviour is good for enabling language designs that have little connection to hardware behaviour, and good for portability including to different extensions within an ISA.
C doesn't offer a way to express any timing guarantees. The compiler, OS, CPU designer, etc. can't even do the right thing if they wanted to because the necessary information isn't being received from the programmer.
Black box designs work until the knob or dial you need to control it isn't there. I would have taken a pragma, a command-line option to the compiler, or even a language extension.
This is one example of many as to why I think that user-guided code generation should be an option of a modern tool suite. If I build formal specifications indicating the sort of behavior I expect, I should be able to link these specifications to the output. Ultimately, this will come down to engineering, and possibly, overriding or modifying the optimizer itself. An extensible design that makes it possible to do this would significantly improve my work. Barring that, I have to write assembler by hand to work around bad assumptions made by the optimizer.
They start with a Haskell prototype that is translated programmatically into a formal specification for the theorem prover.
They then implement the same thing in C, and use a refinement proof to demonstrate that it matches their Haskell implementation.
They then compile the program, and create another refinement proof to demonstrate that the binary code matches the C semantics.
They are on the right track. But, I think there have been some improvements since their effort that can lead to more streamlined equivalence proofs.
However, what I hate is the lack of transparency (and I feel like this article tries to pinpoint just this). When I execute a query locally I get a different plan vs staging vs prod. A plan that can also change depending on some parameters or load or size.
I don't care about understanding all the underlying optimizations, I just care that the query plan I saw is still the same in prod, and that I can be warned when it changes. PG does not return the hash of the query plan or metrics along with the data, which is IMO a mistake. With this you could track it in your favorite metrics store and be able to pinpoint when and why things started executing differently.
It never occurred to me that this would be considered a hint to the optimizer. It doesn't affect code generation. What it does do is flag any use of the gc in the function and any functions it transitively may call.
Optimizers have been likened to turning a cow into a hamburger. If you're symbolically debugging optimized code, you're looking at the hamburger. Nobody has been able to solve that problem.
It's true that optimizers themselves are hard to show correct. The one in the D compiler is a conventional DFA optimizer that uses data flow equations I learned from Hennessy and Ullman in a 1982 seminar they taught. So it has been battle tested for 42 years now(!) and it's pretty rare to find a problem with it, unless it's a new pass I added like SROA. The idea is that anytime a problem is identified and corrected, it goes into the test suite. This has the effect of always ratcheting it forward, never regressing.
The GC dates from around 2000, when I wrote it for a Javascript engine. It was brutally tested for that, and has been pretty solid ever since. People complain about the GC, but not about it being buggy. A buggy GC is a real horror show as it is painfully difficult to debug.
The preceding paragraph had "and occasionally language features" so I thought it would be understood that I didn't mean it as an optimizer-specific thing, but on re-reading the post, I totally see how the other wording "The knobs to steer the optimizer are limited. Usually, these [...]" implies the wrong thing.
I've changed the wording to be clearer and put the D example into a different bucket.
> In some cases, languages have features which enforce performance-related properties at the semantic checking layer, hence, granting more control that integrates with semantic checks instead of relying on the optimizer:
>
> - D has first-class support for marking functions as “no GC”.
In the case of SQL, I'd love access to a flavor where I do the joins on indices explicitly, the query is executed as written, and each join (or filter) can be annotated with a strategy (btree lookup, etc). (The most difficult part about using indices by hand is correctly writing all the synchronous triggers on updates, not the queries, IMO).
Technically, yes. :)
But I think this should perhaps be treated as a bug in how we define/design languages, rather than as an immutable truth.
- We already have time-based versioning for languages.
- We also have "tiers" of support for different platforms in language implementations (e.g. rarer architectures might have Tier 2 or Tier 3 support where the debugging tooling might not quite work)
One idea would be to introduce "tiers" into a language's definition. A smaller implementation could implement the language at Tier 1 (perhaps this would even be within reach for a university course project). An industrial-strength implementation could implement the language at Tier 3.
(Yes, this would also introduce more complications, such as making sure that the dynamic semantics at different tiers are equivalent. At that point, it becomes a matter of tradeoffs -- does introducing tiers help reduce complexity overall?)
Not when "correct" needs optimizations to meet real-time guarantees. It's hard to argue that a program which doesn't run is "correct".
Well, sure, sometimes, to an extent. But if it's load-bearing, maybe that's a bug? You might have written non-portable code that won't last, because it depends on an implementation detail that isn't standardized.
There are widespread applications where performance isn't critical. For example, any time you do a network request. Web pages shouldn't break because the network is slow today, if you can possibly avoid it.
The web provides no performance guarantees, but it tries pretty hard to provide compatibility guarantees. Your code should, usually, work on new browser versions and on devices that haven't been released yet. New browsers will have different JIT compilers with different performance cliffs. And yet, there are websites written many years ago that still work.
When standardizing things, we need to be precise about what's standardized and what isn't, or protocols "rust shut" and can't be changed without breaking lots of stuff. Not standardizing performance is often a win. (With some exceptions like video game consoles where the hardware is standardized.)
Hyrum's law suggests that all compiler optimizations will eventually be load-bearing for someone, but we should usually try to avoid it. To make sure that the code is robust, perhaps performance should vary. Maybe it would be useful to have something like a chaos monkey for programs, where optimizations vary based on a random seed?
If your "debug optimization" code is so slow as to be unusable (see: Rust), then your optimizations qualify as load-bearing.
The problem is that "optimization level" needs a mindset change. The optimization levels should be "release", "small" and "experimental".
"Release level" needs to be perfectly usable for debugging as well as in production--"debug level" should be "release level". Compilation time should be reasonable and run time should be functional.
After that, for embedded, you should have "small level"--checks should get turned off and any optimizations that make code significantly bigger should get turned off (loop unrolling, for example). You might enable some optimizations that make compile time brutally slow.
Finally, there should be an "experimental level" which tests out optimizations before they go into release.
And there should be no "fast" optimization level. If your optimization is that situation specific, it should stay stuck in "experimental".
And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on. This is actually where compilers really break down nowadays. The compiler needs to eject enough information and context that a debugger can unwind what is going on rather than being an afterthought.
We need the equivalent of an LSP (Language Server Protocol) for debuggers.
But completely-reversible-anywhere optimizations are rather limiting, disallowing a bunch of things, including, but not limited to, dead code elimination (more generally, forgetting some fraction of state from somewhere), identical code merging, and instruction reordering (esp. vectorization, which essentially reorders across multiple loop iterations). That severely limits what your optimizer can even do.
You can't do this because some optimizations can't be seen through; most obvious one is identical code merging. If you're in a merged function then you don't know which of the original source lines you're looking at.
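Identical-code merging is worth spelling out because it's so simple (lld's --icf=all and MSVC's /OPT:ICF do this at link time):

    // These compile to byte-identical bodies, so the linker may fold
    // them into one. A breakpoint or crash address inside the shared
    // body can no longer tell you which source function you're in.
    int area(int w, int h)   { return w * h; }
    int scaled(int x, int k) { return x * k; }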
I'd like to see a system which only had optimized compilation and used an interpreter for debugging.
> If your optimization is that situation specific, it should stay stuck in "experimental".
Yeah, I have a feeling that there are low-key lots of applications for which many of the compiler optimisations you're branding "experimental" are in fact run-of-the-mill; they'd enable "experimental" mode so frequently it'd just be synonymous with "actually release mode".
> And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on.
This is a pretty cool idea. Semi-related: you should check out some of the work around e-graphs, which attempt to enable the same level of optimisation while preserving more of the original context. It'd be neat to have the equivalent of source maps for highly optimised code, but I'm not sure how we'd manage them without them becoming enormous. After things like inlining, constant folding, desugaring, rearrangement operations, etc. all take place, I suspect you'd be left with code that bears only the merest passing resemblance to its original form.
And there also are no intrinsics for most scalar operations, e.g. if you wanted to force "x>>48 == 0x1234" to be actually done via the shift and not "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).
And of course assembly means writing platform-specific code (potentially undesirable even if you want to only do the optimization for a single architecture, as it means having to learn to write assembly of said architecture).
There is some potential middle ground of black-boxing: as-is in C/C++ the way to do this is with a no-op asm block, but that can make register allocation worse, and it still requires some platform-specific logic for deriving the register kind from the value type.
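For reference, the no-op asm idiom in question, as it looks on GCC/Clang (the "+r" constraint is the platform-specific part: it assumes the value fits a general-purpose register):

    #include <cstdint>

    // Empty asm with a read-write register constraint: emits no code,
    // but the compiler must treat x as opaque afterwards, so it can't
    // rewrite the comparison below into the mask-and-compare form.
    inline uint64_t opaque(uint64_t x) {
        asm volatile("" : "+r"(x));
        return x;
    }

    bool check(uint64_t x) {
        return (opaque(x) >> 48) == 0x1234;
    }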