A great example of when winning in the average works is register allocation. It’s fine there because the cost of any particular variable getting spilled is so low. So, all that matters is that most variables are in registers most of the time. If spill heuristics change for the better, it usually means some of your variables that previously got spilled now are in registers while others that were in registers are now spilled - and the compiler writer declares victory if this is a speedup in some overall average of large benchmarks. Similar thinking plays out in stuff like common subexpression elimination or basically any strength reduction. (In fact, most of those optimizations have the peculiar property that you’ll always be able to craft a program that shows the optimization to be a bad idea; we do them anyway because on average they are a speedup.)
In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
This is sort of what the article is saying, I guess. But for example on the issue of the optimizer definitely removing a GC allocation: the best design there is for the GC’d language to have a notion of value types that don’t involve allocation at all. C# has that, Java doesn’t.
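To make that concrete, here's what the guarantee looks like in C++ (a minimal sketch; C#'s structs give the equivalent promise in a GC'd setting):

    #include <cstdio>

    // A value type: it lives wherever it's declared (stack, register,
    // inside another object). Passing or returning it copies bits; the
    // language guarantees no allocator is involved anywhere below.
    struct Point {
        double x, y;
    };

    double dot(Point a, Point b) {   // by value: no heap, no GC
        return a.x * b.x + a.y * b.y;
    }

    int main() {
        Point p{1.0, 2.0}, q{3.0, 4.0};
        std::printf("%f\n", dot(p, q));   // prints 11.000000
    }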
My optimizer first appeared in Datalight C around 1984 or so. It was the first DFA optimizer for any C compiler on the PC. C compiler benchmark roundups were popular articles in programming magazines at the time. We breathlessly waited for the next roundup article.
When it came, Datalight C was omitted from it! The journalist said they excluded Datalight C because it was buggy, as it deleted the benchmark code and just printed the success message. The benchmarks at the time consisted of things like:
for (i = 0; i < 1000; ++i) a = 3;
so of course it deleted the useless code. I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
But it wasn't long until the industry realized that optimizers were the future, the benchmarks were revised, and the survivors in the compiler business all did optimizers. DLC recovered, but as you can tell, I am still annoyed at the whole incident.
Some things one just doesn't anticipate.
Working on compilers is never dull.
I’ve been working on compilers for 30 years, primarily on optimizers (though I’ve done some frontend work as well). Learned C from Borland C and C++ from…Zortech C++ 1.0, so thank you for that!
Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code. It taught me very early on that you really need to focus on scalability and compile time in compilers, whether it’s the amount of code within a function, or across functions (for IPO / LTO).
And yes, working on compilers is never dull. 25 years ago I thought we’d end up in a monolithic world with x86, C++, and Java being the focus of all work. Instead, there’s been an absolute explosion of programming models, languages, and architectures, as well as entirely new problem spaces like graph compilers for ML.
The way every optimizer I've worked on (and written) deals with this is canonical forms. Like, you decree that the canonical form of "multiply integer by 2" is "x << 1", and then you make sure that no optimization ever turns "x << 1" into anything else (though the instruction selector may then turn "x << 1" into "x + x" since that's the best thing on most CPUs).
But that doesn't necessarily make this problem any easier. Just gives you a principled story for how to fix the problem if you find it. I think that usually the canonical forms aren't even that well documented, and if you get it wrong, then you'll still have valid IR so it's not like the IR verifier will tell you that you made a mistake - you'll just find out because of some infinite loop.
And yeah, lots of compiler optimization fixpoints have a counter to kill them after some limit. The LLVM inliner fixpoint is one example of such a thing.
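A toy sketch of both ideas together, with a made-up one-instruction IR (nothing here is any real compiler's API):

    #include <cstdio>

    // Toy IR: one binary op applied to a variable "x" and a constant.
    enum class Op { Mul, Shl };
    struct Inst { Op op; int rhs; };

    // Canonicalization rule: "x * 2^k" always becomes "x << k", and no
    // pass is ever allowed to produce "x * 2^k" again. Returns whether
    // it changed anything.
    bool canonicalize(Inst& i) {
        if (i.op == Op::Mul && i.rhs > 1 && (i.rhs & (i.rhs - 1)) == 0) {
            int k = 0;
            while ((1 << k) < i.rhs) ++k;
            i = {Op::Shl, k};
            return true;
        }
        return false;
    }

    int main() {
        Inst i{Op::Mul, 8};
        // Run to a fixpoint, with a counter so that a mistaken pair of
        // dueling rewrite rules can't loop forever.
        for (int n = 0; n < 8 && canonicalize(i); ++n) {}
        std::printf("%s %d\n", i.op == Op::Shl ? "shl" : "mul", i.rhs);  // shl 3
    }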
> I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
Whoa! That's a crazy story! Thanks for sharing!
I agree with this to some extent but not fully. I think there are shades of grey to this -- adding language features is a fairly complex and time-consuming process, especially for mainstream languages. Even for properties which many people would like to have, such as "no GC", there are complex tradeoffs (e.g. https://em-tg.github.io/csborrow/)
My position is that language users need to be empowered in different ways depending on the requirements. If you look at the Haskell example involving inspection testing/fusion, there are certain guarantees around some type conversions (A -> B, B -> A) being eliminated -- these are somewhat specific to the library at hand. Trying to formalize each and every performance-sensitive library's needs using language features is likely not practical.
Rather, I think it makes sense to focus instead on a more bottom-up approach, where you give somewhat general tools to the language users (this doesn't need to expose a full IR), and see what common patterns emerge before deciding whether to "bless" some of them as first-class language features.
My point is that if, in the course of discovering common patterns, you find that the optimizer must do a heroic optimization with a 10x upside when it hits and weird flakiness about when it hits, then that's a good indication that you want a language feature that lets you skip the optimizer and lets the programmer dictate the outcome.
By the way, avoiding GC is not the same thing as having value types. Avoiding GC altogether is super hard. But value types aren’t that hard and aren’t about entirely avoiding GC - just avoiding it in specific cases.
https://jdk.java.net/valhalla/
Yes, it was a bummer that Java didn't take up the ideas of Cedar, the Oberon lineage, Modula-3, Eiffel,... even though some are cited as its influences.
Still, I am confident that Java might get value types before C++ reflection, networking, senders/receivers, or safety get sorted out. Or even before we can finally write portable C++ code using C++20 modules.
If it was easy it would be done by now.
There are plenty of long-running efforts in other language ecosystems that have also taken decades and are still not fully done, e.g. C++ modules, contracts, reflection,...
If you really need value-type-like objects today, they are possible with Panama, even without language syntax for them.
> So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
There's a further comment about the potential of opting out of atomicity guarantees to avoid that problem, but then there are more problems - it looks like pre-JIT execution would still allocate, and who knows how consistent the JIT would be about scalarization. IIRC there was also some mention somewhere of just forcing large enough value objects to always be heap allocations.
> Heap flattening must maintain the integrity of objects. For example, the flattened data must be small enough to read and write atomically, or else it may become corrupted. On common platforms, "small enough" may mean as few as 64 bits, including the null flag. So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
And maybe the end of the next paragraph is even more relevant:
> In the future, 128-bit flattened encodings should be possible on platforms that support atomic reads and writes of that size. And the Null-Restricted Value Types JEP will enable heap flattening for even larger value classes in use cases that are willing to opt out of atomicity guarantees.
It’s a dang hard feature to retrofit into the way the JVM works. I wish those folks the best of luck.
JARs and modules that worked on the JVM before the introduction of value types should keep running, and new code has to be able to interoperate with such JARs.
Automatic vectorisation is another big one. It feels to me like vectorisation is less reliable / more complex than TCO? But on the other hand the downside is a linear slowdown, not "your program blows the stack and crashes".
This is the critical point. If CI fails, or you are otherwise warned, when the loop doesn't vectorize, then you can count on it always happening.
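Clang can already surface this via optimization remarks, which a CI step can grep for and turn into a hard failure (the flags below are Clang's; GCC has -fopt-info-vec-missed):

    // Build with:
    //   clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize scale.cpp -c
    // A CI job can fail the build if a "-Rpass-missed" remark lands on a
    // loop we've declared performance-critical.
    #include <cstddef>

    void scale(float* out, const float* in, std::size_t n, float k) {
    #pragma clang loop vectorize(enable)   // ask explicitly; get told if it fails
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }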
Intrinsics work poorly in some compilers, and Intel's intrinsics are so hard to read because of inscrutable Hungarian notation that you should just write in asm instead.
- it would be better if the intrinsics had sensible names. I couldn’t agree more.
- it would be better if compilers consistently did a good job of implementing them. I wonder which compilers do a bad job? Does clang do a good job or not so much?
I think intrinsics make sense for the case where the language being used is not otherwise simd and that language already has value types (so it’s easy to add a vector type). It would be great if they at least worked consistently well and had decent names in that case.
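For a taste of the naming problem, two real intrinsics from <immintrin.h> (the Hungarian-style suffixes encode element type and width):

    #include <immintrin.h>

    // "Add eight packed 32-bit integers" - simple operation, opaque name.
    // (epi32 = "extended packed integer, 32-bit"; needs AVX2.)
    __m256i add8x32(__m256i a, __m256i b) {
        return _mm256_add_epi32(a, b);
    }

    // "Multiply unsigned by signed bytes, add adjacent pairs into 16-bit
    // sums, with saturation" - good luck guessing that from the name.
    __m128i dot_bytes(__m128i a, __m128i b) {
        return _mm_maddubs_epi16(a, b);   // needs SSSE3
    }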
1. You’re not gonna get any guarantees that the optimization will happen. That makes it High Level. Just write code. We won’t force you to pollute your code with ugly annotations or pragmas.
2. In turn: check the assembly, or whatever concrete thing reveals whether the optimization you wished for in your head actually went through.
There’s some kind of abstraction violation in the above somewhere.
"Abstraction violation" is a good way to put it.
Usually, if you don't get the optimization you wished for, it means that there is something you didn't account for. In C++, it may be exception processing, aliasing rules, etc... Had the compiler made the optimization you wished for, it wouldn't have been correct with regard to the specifications of the language, it may even hide a bug. The solution is then to write it in a way that is more explicit, to make the compiler understand that the edge case can never happen, which will then enable the optimization. It is not really an abstraction violation, more like a form of debugging.
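Aliasing is the classic instance of this. A sketch (GCC, Clang, and MSVC all accept __restrict as an extension; restrict is standard in C but not C++):

    #include <cstddef>

    // The edge case the compiler must assume: "sum" could point into
    // "in", so *sum has to be stored and reloaded on every iteration.
    void accumulate(float* sum, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            *sum += in[i];
    }

    // Promising that the edge case can't happen lets the compiler keep
    // the accumulator in a register and vectorize the loop.
    void accumulate_fast(float* __restrict sum, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            *sum += in[i];
    }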
If you really need to get low level, there is some point where you need to write assembly language, which is obviously not portable, but getting every last bit of performance is simply incompatible with portability.
This is not true of JIT compilers, of course, which have similar constraints to DB query planners. In these cases the goal is to do a good job pretty quickly, rather than an excellent job in a reasonable time.
The number of possible distinct query plans grows very rapidly as the complexity increases (exponentially or factorially... I can't remember). So even if you have 10x as much time available for optimisation, it makes a surprisingly small difference.
One approach I've seen with systems like Microsoft Exchange and its underlying Jet database is that queries are expressed in a lower-level syntax tree DOM structure. The specific query plan is "baked in" by developers right from the beginning, which provides stable and consistent performance in production. It's also lower latency because the time spent by the optimiser at runtime is zero.
You can normally only send SQL queries to a database and not execution plans.
Since there's no bit rotate operator in C, you're left hoping the compiler recognizes what the shifts and bitwise-ands are trying to do.
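The idiom compilers look for is the shift-or pattern below; written with masked shift counts it has no undefined behavior, and GCC, Clang, and MSVC all pattern-match it into a single rotate instruction (C++20 finally added std::rotl to spell it directly):

    #include <cstdint>

    // Rotate-left for 32-bit values. The "& 31" keeps both shift counts
    // in range (so n == 0 is fine), and is exactly the shape that the
    // compilers' rotate pattern-matchers recognize.
    uint32_t rotl32(uint32_t x, unsigned n) {
        return (x << (n & 31)) | (x >> (-n & 31));
    }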
Once an optimization becomes part of the interface and is guaranteed, is it really an optimization? Or did it just become part of the language/library/database/whatever?
One example is return value optimization in C++. In C++17 the "optimization" became mandatory in some contexts. What really happened though is that the rules of temporary materialization changed, and in those contexts it just never happens prematurely by the language rules. This ceased to be an optimization and became a mechanism in the language.
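A small illustration of that rule change: since C++17 the prvalue is materialized directly in the caller's storage, so this compiles even though the type can be neither copied nor moved:

    struct Big {
        int data[1024];
        Big() : data{} {}
        Big(const Big&) = delete;            // no copy, and therefore no move
    };

    Big make() {
        return Big{};   // C++17: constructed directly in the caller's object
    }

    int main() {
        Big b = make(); // OK since C++17; ill-formed in C++14 (deleted copy)
        return b.data[0];
    }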
What I'm getting at is that unreliability is a defining quality of optimizations.
Sure, there are certain optimizations that become load-bearing, in which case it would be better if they became part of the language's semantics and guarantees, therefore they ceased to be optimizations.
Even if that second description is stable and part of the guarantees you make, keeping it separate is still incredibly useful from a user perspective.
From an implementation perspective, there is also a useful distinction. Optimizations take a valid representation, and turn it into a different valid representation of the same type that shares all defined behavior. This is a fairly different operation from compilation, which converts between representations. In particular, for the compilation step, you typically have only one compilation function for a given pair of representations; and if you have multiple, you select one ahead of time. For optimizations, each representation has a set of optimization functions, and you need to decide in what order to apply them and how many times to do so. Compilation functions, for their part, need to deal with every difference between the two representations, whereas optimization functions get to ignore everything except the part they care about.
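That distinction shows up directly in the shapes of the functions involved; a sketch with made-up types:

    #include <vector>

    struct Ast {}; struct Ir {}; struct Asm {};

    // Compilation: one function per pair of representations.
    Ir  lower(const Ast&) { return {}; }     // stub
    Asm select(const Ir&) { return {}; }     // stub

    // Optimization: many functions of the same shape, Ir -> Ir; which
    // ones run, in what order, and how many times is a scheduling choice.
    using IrPass = void (*)(Ir&);

    Asm compile(const Ast& ast, const std::vector<IrPass>& schedule) {
        Ir ir = lower(ast);
        for (IrPass pass : schedule) pass(ir);
        return select(ir);
    }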
I think that the compilation and optimization step, as a black box, is a disservice for highly reliable software development. Compiler and optimizer bugs are definitely a thing. I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers. Yeah, this makes perfect sense when trying to optimize fixed point multiplication, but it completely destroys the security of DLP or ecDLP based cryptography by introducing timing attacks that can recover the private key. Thankfully, I was fastidious about examining the optimized machine code output of this compiler, and was able to substitute hand coded assembler in its place.
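To illustrate the hazard (hypothetical code, not the actual compiler output from that incident): a 32-bit multiply built from 16-bit halves, the way a compiler for an 8-bit target might lower it, where an early-out for small operands branches on potentially secret data.

    #include <cstdint>

    // Branchy lowering: faster when the high halves are zero, but the
    // data-dependent branch leaks operand magnitude through timing.
    uint32_t mul32_branchy(uint32_t a, uint32_t b) {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        uint32_t lo = al * bl;
        if (ah == 0 && bh == 0)                    // timing side channel
            return lo;
        return lo + ((al * bh + ah * bl) << 16);   // full product mod 2^32
    }

    // Constant-time lowering: always does the full computation.
    uint32_t mul32_ct(uint32_t a, uint32_t b) {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        return al * bl + ((al * bh + ah * bl) << 16);
    }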
AFAIK, that's how seL4 is verified. Quoting from https://docs.sel4.systems/projects/sel4/frequently-asked-que...
"[...] Specifically, the ARM, ARM_HYP (ARM with virtualisation extensions), X64, and RISCV64 versions of seL4 comprise the first (and still only) general-purpose OS kernel with a full code-level functional correctness proof, meaning a mathematical proof that the implementation (written in C) adheres to its specification. [...] On the ARM and RISCV64 platforms, there is a further proof that the binary code which executes on the hardware is a correct translation of the C code. This means that the compiler does not have to be trusted, and extends the functional correctness property to the binary. [...] Combined with the proofs mentioned above, these properties are guaranteed to be enforced not only by a model of the kernel (the specification) but the actual binary that executes on the hardware."
I'm working on a hybrid approach between SMT solving and constructive proofs. Model checking done with an SMT solver is pretty sound. I'm actually planning a book on a scalable technique to do this with CBMC. But, the last leg of this really is understanding the compiler output.
FWIW, I think this should be considered a language design problem rather than an optimizer design problem. Black box optimizer behaviour is good for enabling language designs that have little connection to hardware behaviour, and good for portability including to different extensions within an ISA.
C doesn't offer a way to express any timing guarantees. The compiler, OS, CPU designer, etc. can't even do the right thing if they wanted to because the necessary information isn't being received from the programmer.
Black box designs work until the knob or dial you need to control it isn't there. I would have taken a pragma, a command-line option to the compiler, or even a language extension.
This is one example of many as to why I think that user-guided code generation should be an option of a modern tool suite. If I build formal specifications indicating the sort of behavior I expect, I should be able to link these specifications to the output. Ultimately, this will come down to engineering, and possibly, overriding or modifying the optimizer itself. An extensible design that makes it possible to do this would significantly improve my work. Barring that, I have to write assembler by hand to work around bad assumptions made by the optimizer.
They start with a Haskell prototype that is translated programmatically into a formal specification for the theorem prover.
They then implement the same thing in C, and use a refinement proof to demonstrate that it matches their Haskell implementation.
They then compile the program, and create another refinement proof to demonstrate that the binary code matches the C semantics.
They are on the right track. But, I think there have been some improvements since their effort that can lead to more streamlined equivalence proofs.
However, what I hate is the lack of transparency (and I feel like this article tries to pinpoint just this). When I execute a query locally I get a different plan vs staging vs prod. A plan that can also change depending on some parameters or load or size.
I don't care about understanding all the underlying optimizations, I just care that the query plan I saw is still the same in prod, and that I can be warned when it changes. PG does not return the hash of the query plan or metrics along with the data, which is IMO a mistake. With this you could track it in your favorite metrics store and be able to pinpoint when and why things started executing differently.
It never occurred to me that this would be considered a hint to the optimizer. It doesn't affect code generation. What it does do is flag any use of the gc in the function and any functions it transitively may call.
Optimizers have been likened to turning a cow into a hamburger. If you're symbolically debugging optimized code, you're looking at the hamburger. Nobody has been able to solve that problem.
It's true that optimizers themselves are hard to show correct. The one in the D compiler is a conventional DFA optimizer that uses data flow equations I learned from Hennessy and Ullman in a 1982 seminar they taught. So it has been battle tested for 42 years now(!) and it's pretty rare to find a problem with it, unless it's a new pass I added like SROA. The idea is that anytime a problem is identified and corrected, it goes into the test suite. This has the effect of always ratcheting it forward, never regressing.
The GC dates from around 2000, when I wrote it for a Javascript engine. It was brutally tested for that, and has been pretty solid ever since. People complain about the GC, but not about it being buggy. A buggy GC is a real horror show as it is painfully difficult to debug.
The preceding paragraph had "and occasionally language features" so I thought it would be understood that I didn't mean it as an optimizer-specific thing, but on re-reading the post, I totally see how the other wording "The knobs to steer the optimizer are limited. Usually, these [...]" implies the wrong thing.
I've changed the wording to be clearer and put the D example into a different bucket.
> In some cases, languages have features which enforce performance-related properties at the semantic checking layer, hence, granting more control that integrates with semantic checks instead of relying on the optimizer:
>
> - D has first-class support for marking functions as “no GC”.
In the case of SQL, I'd love access to a flavor where I do the joins on indices explicitly, the query is executed as written, and each join (or filter) can be annotated with a strategy (btree lookup, etc). (The most difficult part about using indices by hand is correctly writing all the synchronous triggers on updates, not the queries, IMO).
Technically, yes. :)
But I think this should perhaps be treated as a bug in how we define/design languages, rather than as an immutable truth.
- We already have time-based versioning for languages.
- We also have "tiers" of support for different platforms in language implementations (e.g. rarer architectures might have Tier 2 or Tier 3 support where the debugging tooling might not quite work)
One idea would be to introduce "tiers" into a language's definition. A smaller implementation could implement the language at Tier 1 (perhaps this would even be within reach for a university course project). An industrial-strength implementation could implement the language at Tier 3.
(Yes, this would also introduce more complications, such as making sure that the dynamic semantics at different tiers are equivalent. At that point, it becomes a matter of tradeoffs -- does introducing tiers help reduce complexity overall?)
Not when "correct" needs optimizations to meet real-time guarantees. It's hard to argue that a program which doesn't run is "correct".
Well, sure, sometimes, to an extent. But if it's load-bearing, maybe that's a bug? You might have written non-portable code that won't last, because it depends on an implementation detail that isn't standardized.
There are widespread applications where performance isn't critical. For example, any time you do a network request. Web pages shouldn't break because the network is slow today, if you can possibly avoid it.
The web provides no performance guarantees, but it tries pretty hard to provide compatibility guarantees. Your code should, usually, work on new browser versions and on devices that haven't been released yet. New browsers will have different JIT compilers with different performance cliffs. And yet, there are websites written many years ago that still work.
When standardizing things, we need to be precise about what's standardized and what isn't, or protocols "rust shut" and can't be changed without breaking lots of stuff. Not standardizing performance is often a win. (With some exceptions like video game consoles where the hardware is standardized.)
Hyrum's law suggests that all compiler optimizations will eventually be load-bearing for someone, but we should usually try to avoid it. To make sure that the code is robust, perhaps performance should vary. Maybe it would be useful to have something like a chaos monkey for programs, where optimizations vary based on a random seed?
If your "debug optimization" code is so slow as to be unusable (see: Rust), then your optimizations qualify as load-bearing.
The problem is that "optimization level" needs a mindset change. The optimization levels should be "release", "small" and "experimental".
"Release level" needs to be perfectly usable for debugging as well as in production--"debug level" should be "release level". Compilation time should be reasonable and run time should be functional.
After that, for embedded, you should have "small level"--checks should get turned off and any optimizations that make code significantly bigger should get turned off (loop unrolling, for example). You might enable some optimizations that make compile time brutally slow.
Finally, there should be an "experimental level" which tests out optimizations before they go into release.
And there should be no "fast" optimization level. If your optimization is that situation specific, it should stay stuck in "experimental".
And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on. This is actually where compilers really break down nowadays. The compiler needs to eject enough information and context that a debugger can unwind what is going on rather than being an afterthought.
We need the equivalent of an LSP (Language Server Protocol) for debuggers.
But completely-reversible-anywhere optimizations are rather limiting, disallowing a bunch of things, including, but not limited to, dead code elimination (more generally, forgetting some fraction of state from somewhere), identical code merging, and instruction reordering (esp. vectorization, which essentially reorders across multiple loop iterations). That severely limits what your optimizer can even do.
You can't do this because some optimizations can't be seen through; most obvious one is identical code merging. If you're in a merged function then you don't know which of the original source lines you're looking at.
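Identical-code merging is worth spelling out because it's so simple (lld's --icf=all and MSVC's /OPT:ICF do this at link time):

    // These compile to byte-identical bodies, so the linker may fold
    // them into one. A breakpoint or crash address inside the shared
    // body can no longer tell you which source function you're in.
    int area(int w, int h)   { return w * h; }
    int scaled(int x, int k) { return x * k; }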
I'd like to see a system which only had optimized compilation and used an interpreter for debugging.
> If your optimization is that situation specific, it should stay stuck in "experimental".
Yeah, I have a feeling that there are low-key lots of applications for which many of the compiler optimisations you're branding "experimental" are in fact run-of-the-mill; they'd enable "experimental" mode so frequently it'd just be synonymous with "actually release mode".
> And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on.
This is a pretty cool idea. Semi-related: you should check out some of the work around e-graphs, which attempt to enable the same level of optimisation while preserving more of the original context. It'd be neat to have the equivalent of source maps for highly optimised code, but I'm not sure how we'd manage them without them becoming enormous. After things like inlining, constant folding, desugaring, rearrangement operations, etc. all take place, I suspect you'd be left with code that bears only the merest passing resemblance to its original form.
And there also are no intrinsics for most scalar operations, e.g. if you wanted to force "x>>48 == 0x1234" to be actually done via the shift and not "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).
And of course assembly means writing platform-specific code (potentially undesirable even if you want to only do the optimization for a single architecture, as it means having to learn to write assembly of said architecture).
There is some potential middle ground of black-boxing: as-is in C/C++ the way to do this is with a no-op asm block, but that can make register allocation worse, and it still requires some platform-specific logic for deriving the register kind from the value type.
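For reference, the no-op asm idiom in question, as it looks on GCC/Clang (the "+r" constraint is the platform-specific part: it assumes the value fits a general-purpose register):

    #include <cstdint>

    // Empty asm with a read-write register constraint: emits no code,
    // but the compiler must treat x as opaque afterwards, so it can't
    // rewrite the comparison below into the mask-and-compare form.
    inline uint64_t opaque(uint64_t x) {
        asm volatile("" : "+r"(x));
        return x;
    }

    bool check(uint64_t x) {
        return (opaque(x) >> 48) == 0x1234;
    }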