A great example of when winning in the average works is register allocation. It’s fine there because the cost of any particular variable getting spilled is so low. So, all that matters is that most variables are in registers most of the time. If spill heuristics change for the better, it usually means some of your variables that previously got spilled now are in registers while others that were in registers are now spilled - and the compiler writer declares victory if this is a speedup in some overall average of large benchmarks. Similar thinking plays out in stuff like common subexpression elimination or basically any strength reduction. (In fact, most of those optimizations have the peculiar property that you’ll always be able to craft a program that shows the optimization to be a bad idea; we do them anyway because on average they are a speedup.)
In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
This is sort of what the article is saying, I guess. But for example on the issue of the optimizer definitely removing a GC allocation: the best design there is for the GC’d language to have a notion of value types that don’t involve allocation at all. C# has that, Java doesn’t.
My optimizer first appeared in Datalight C around 1984 or so. It was the first DFA optimizer for any C compiler on the PC. C compiler benchmark roundups were popular articles in programming magazines at the time. We breathlessly waited for the next roundup article.
When it came, Datalight C was omitted from it! The journalist said they excluded Datalight C because it was buggy, as it deleted the benchmark code and just printed the success message. The benchmarks at the time consisted of things like:
for (i = 0; i < 1000; ++i) a = 3;
so of course it deleted the useless code. I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
But it wasn't long until the industry realized that optimizers were the future, the benchmarks were revised, and the survivors in the compiler business all did optimizers. DLC recovered, but as you can tell, I am still annoyed at the whole incident.
Some things one just doesn't anticipate.
Working on compilers is never dull.
I’ve been working on compilers 30 years, primarily on optimizers (although some frontend work as well). Learned C from Borland C and C++ from…Zortech C++ 1.0, so thank you for that!
Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code. It taught me very early on that you really need to focus on scalability and compile time in compilers, whether it’s the amount of code within a function, or across functions (for IPO / LTO).
And yes, working on compilers is never dull. 25 years ago I thought we’d end up in a monolithic world with x86, C++, and Java being the focus of all work. Instead, there’s been an absolute explosion of programming models, languages, and architectures, as well as entirely new problem spaces like graph compilers for ML.
Like the issue in the Linux kernel recently where they had some simple looking min/max macros that generated megabytes of source code.
Is that scary enough for ya?
The way every optimizer I've worked on (and written) deals with this is canonical forms. Like, you decree that the canonical form of "multiply integer by 2" is "x << 1", and then you make sure that no optimization ever turns "x << 1" into anything else (though the instruction selector may then turn "x << 1" into "x + x" since that's the best thing on most CPUs).
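Something like this, as a toy sketch (hypothetical IR, nothing from LLVM or B3; the names are made up for illustration):

    // Toy IR: a node is an op plus operands; constants carry a value.
    enum class Op { Const, Add, Mul, Shl };

    struct Node {
        Op op;
        Node* lhs = nullptr;
        Node* rhs = nullptr;
        long value = 0;            // meaningful only when op == Op::Const
    };

    // Middle-end canonicalization: "multiply by 2" always becomes "x << 1",
    // and no later pass is allowed to turn "x << 1" back into a multiply.
    void canonicalize(Node* n) {
        if (n->op == Op::Mul && n->rhs && n->rhs->op == Op::Const && n->rhs->value == 2) {
            n->op = Op::Shl;
            n->rhs->value = 1;     // x * 2  ==>  x << 1
        }
    }

    // Only the instruction selector, at the very end, may pick "x + x" for the
    // shift, because nothing downstream pattern-matches on its output.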
But that doesn't necessarily make this problem any easier. Just gives you a principled story for how to fix the problem if you find it. I think that usually the canonical forms aren't even that well documented, and if you get it wrong, then you'll still have valid IR so it's not like the IR verifier will tell you that you made a mistake - you'll just find out because of some infinite loop.
And yeah, lots of compiler optimization fixpoints have a counter to kill them after some limit. The LLVM inliner fixpoint is one example of such a thing.
> I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
Whoa! That's a crazy story! Thanks for sharing!
Seriously.
Consider the slides here, particularly slides 27-30: https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf
They show code for pattern matching IR in their Rust code, and it's awful. I think it's because they can't just have an IR with pointers, because that would violate Rust's rules. So, they need to call goofy helpers to deconstruct the IR. Consequently, a super simple rewrite rule ends up being a full page of gnarly code. That code would be less than half as long, and much easier to parse, if it was in LLVM IR or B3 IR or any other sensible C++ IR.
Then they show that the egraph rule is "simple". It's certainly shorter than C++ code. But while it is conceptually simpler than their Rust code, it is not conceptually simpler than the same rewrite written in C++. The C++ code for rewrites in either LLVM or B3 is not that much more verbose than the egraph rule, but in a way that makes it easy to understand. Plus, it's code, so it can call any API in the compiler and do any logic it likes - making it inherently more flexible than an egraph.
So, if they had used a sensible programming language to write their compiler then they wouldn't have needed egraphs as an escape hatch from how bad Rust is.
And the phase ordering argument for e-graphs is baloney because:
- Their example of alias analysis, GVN, and RLE assumes the strawman that you put those things in separate phases. Most compilers don't. LLVM does alias analysis on demand and combines GVN and RLE. Ditto in B3.
- The difficulty of phase ordering in LLVM's pass builders (and in any pipeline I've ever worked on) is that you've got passes that are not expressible as egraphs. Until someone can demonstrate an egraph based inliner, SCCP, SROA, coroutine lowerings, and heck an egraph-based FilPizlonator (the pass I use in the Fil-C version of clang to make it memory safe), then you'll be left with those passes having to run the classic way and then you'll still have phase ordering problems.
An e-matcher is an algorithm for efficiently searching for a pattern in an e-graph; for instance, if you want to find "(load ptr) * (load ptr)" then you have to find "#x * #x" for all x in all e-classes, then check whether "load ptr" is a member of e-class #x. This is where you get limits on what you can match. "Pattern matching" style transforms are easy and fast using e-matchers, things like "x * 2" -> "x << 1", but beyond that they don't help.
There's an optimizer problem where you have "load ptr" and you resolve ptr and figure out it's pointing to a constant value, so you replace the load instruction with the constant value. Later, you get to code emission and you realize that you can't encode your constant in the CPU instruction; there aren't enough bits. You now need to take the constant and stuff it into a constant pool and emit a load for it. If you had kept the original expressions in an e-graph, you could have chosen to reuse the load you already had.
Suppose you wanted to do sparse conditional constant/constant-range propagation but your IR uses an e-graph. You could analyze each expression in the e-class, intersect them, and annotate the resulting constant-range on the whole e-class. Then do SCCP as normal looking up the e-class for each instruction as you go.
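As a rough sketch of that intersection idea (toy types, not from any real e-graph library): every member of an e-class evaluates to the same value, so the class's range is the intersection of the ranges you can derive from each member, and SCCP can consult that per-class range as it walks the instructions.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Range { int64_t lo, hi; };                  // inclusive constant range

    Range intersect(Range a, Range b) {
        return { std::max(a.lo, b.lo), std::min(a.hi, b.hi) };
    }

    // One range estimate per e-node in the e-class; since all e-nodes are equal,
    // the true value must lie in the intersection of their individual estimates.
    Range rangeOfEClass(const std::vector<Range>& enodeRanges) {
        Range r{ INT64_MIN, INT64_MAX };               // start from "unknown"
        for (Range e : enodeRanges)
            r = intersect(r, e);
        return r;
    }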
I agree with this to some extent but not fully. I think there are shades of grey to this -- adding language features is a fairly complex and time-consuming process, especially for mainstream languages. Even for properties which many people would like to have, such as "no GC", there are complex tradeoffs (e.g. https://em-tg.github.io/csborrow/)
My position is that language users need to be empowered in different ways depending on the requirements. If you look at the Haskell example involving inspection testing/fusion, there are certain guarantees around some type conversions (A -> B, B -> A) being eliminated -- these are somewhat specific to the library at hand. Trying to formalize each and every performance-sensitive library's needs using language features is likely not practical.
Rather, I think it makes sense to instead focus on a more bottom-up approach, where you give somewhat general tools to the language users (doesn't need to expose a full IR), and see what common patterns emerge before deciding whether to "bless" some of them as first-class language features.
My point is that if in the course of discovering common patterns you find that the optimizer must do a heroic optimization with a 10x upside when it hits and weird flakiness about when it hits, then that’s a good indication that maybe a language feature that lets you skip the optimization and let the programmer sort of dictate the outcome is a good idea.
By the way, avoiding GC is not the same thing as having value types. Avoiding GC altogether is super hard. But value types aren’t that hard and aren’t about entirely avoiding GC - just avoiding it in specific cases.
https://jdk.java.net/valhalla/
Yes, it was a bummer that Java didn't pick up on the ideas of Cedar, the Oberon lineage, Modula-3, Eiffel,... even though some are quoted as its influences.
Still I am confident that it might be getting value types, before C++ reflection, networking, senders/receivers, or safety gets sorted out. Or even that we can finally write portable C++ code using C++20 modules.
> So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
There's a further comment about the potential of opting out of atomicity guarantees to avoid that problem, but then there are more problems - looks like pre-JIT code would still allocate, and who knows how consistent the JIT would be about scalarization. IIRC there was also some mention somewhere about just forcing large enough value objects to always be heap allocations.
> Heap flattening must maintain the integrity of objects. For example, the flattened data must be small enough to read and write atomically, or else it may become corrupted. On common platforms, "small enough" may mean as few as 64 bits, including the null flag. So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
And maybe the end of the next paragraph is even more relevant:
> In the future, 128-bit flattened encodings should be possible on platforms that support atomic reads and writes of that size. And the Null-Restricted Value Types JEP will enable heap flattening for even larger value classes in use cases that are willing to opt out of atomicity guarantees.
If it was easy it would be done by now.
There are plenty of long-running efforts in other language ecosystems that have also taken decades and still aren't fully done, e.g. C++ modules, contracts, reflection,...
If you really need value-type-like objects today, it is possible with Panama, even without language syntax for them.
It’s a dang hard feature to retrofit into the way the JVM works. I wish those folks the best of luck.
JARs and modules that worked on the JVM before the introduction of value types should keep running, and there's the question of how new code can interoperate with such JARs.
Automatic vectorisation is another big one. It feels to me like vectorisation is less reliable / more complex than TCO? But on the other hand the downside is a linear slowdown, not "your program blows the stack and crashes".
This is the critical point. If CI fails, or you are otherwise warned when the loop doesn't vectorize, then you can count on it to always happen.
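One hedged way to get that warning today (assuming clang; GCC has -fopt-info-vec-missed for the same purpose) is to turn the compiler's vectorization remarks into a CI check:

    // Compile with: clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
    // and have CI fail if a "missed" remark shows up for a loop you care about.
    void saxpy(float* __restrict y, const float* __restrict x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];      // expected to vectorize at -O2 on most targets
    }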
Intrinsics work poorly in some compilers, and Intel's intrinsics are so hard to read because of inscrutable Hungarian notation that you should just write in asm instead.
- it would be better if the intrinsics had sensible names. I couldn’t agree more.
- it would be better if compilers consistently did a good job of implementing them. I wonder which compilers do a bad job? Does clang do a good job or not so much?
I think intrinsics make sense for the case where the language being used is not otherwise simd and that language already has value types (so it’s easy to add a vector type). It would be great if they at least worked consistently well and had decent names in that case.
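For anyone who hasn't stared at them, this is the kind of naming being complained about (SSE2 intrinsics from <emmintrin.h>); the operation is just "add four 32-bit ints":

    #include <emmintrin.h>

    void add4(int* dst, const int* a, const int* b) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        __m128i vc = _mm_add_epi32(va, vb);   // "epi32" = packed 32-bit integers
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), vc);
    }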
Tail call optimizations are useful as optimizations.
But in a language like Scheme, tail calls are required to be implemented as loops. If you don't do that, then people can't write code that relies on it. Scheme treats tail call handling as a semantic guarantee, not as an optimization.
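Outside Scheme you can sometimes get the same "semantics, not optimization" effect with an annotation; for example clang's musttail attribute (a clang-specific extension, sketched here) makes the build fail rather than silently fall back to a stack-growing call:

    // The attribute requires the call to be compiled as a tail call; if it
    // can't be, compilation fails instead of quietly emitting a normal call.
    unsigned long factorial(unsigned long n, unsigned long acc) {
        if (n <= 1) return acc;
        [[clang::musttail]] return factorial(n - 1, acc * n);
    }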
I am not sure in what cases that's a major win; in generational/copying GCs an allocation is a pointer bump, and then the object dies trivially (no references from tenured objects) in the young-gen collection.
When you run out of the local arena, there will be a CAS involved, of course.
I agree that the difference between GC allocation and stack allocation alone is small. But it's not really about turning the values into stack allocations; it's about letting the compiler do downstream optimizations based on the knowledge that it's dealing with an unaliased private copy of the data, and then that unlocks a ton of downstream opts.
Allocation elision has been tried (escape analysis) with enough inlining, to no noticeable benefit (I don't have the source/quote, but it was over 10 years ago already). Escape analysis already provides the same semantics: proving that the object allocation etc. can be optimized away.
Java does lack value types but they are only useful in (large) arrays. A lot of such code has 'degraded' to direct byte buffers, and in some cases straight unsafe, similar to void* in C.
Huge speedup, but very hit or miss. Out of a large suite of benchmarks, 90% of the tests saw no change and 10% saw improvements of 3x or so. Note each of these tests was itself a large program; these weren’t some BS microbenchmarks.
Maybe someone from V8 can comment but my understanding is they had a similar experience.
So it’s possible that someone measured this being perf neutral, if they used a too small benchmark suite.
Unfortunately, I think the Linux kernel is one of the most notable examples; in its case you have to compile with -O1 or up.
Playing with fire is when you require an optimization to hit in a specific place in your code.
Having software that only works if optimized means you're effectively relying on optimizations hitting with a high enough rate overall that your code runs fast enough.
It's like the difference between an amateur gambler placing one huge ass bet and praying that it hits and a professional gambler (like a Blackjack card counter, or a poker pro) placing thousands or even millions of bets and relying on positive EV to make a profit.
It's grossly false for the vast majority of code, where the sludge written by thousands of engineers gets completely rewritten by the compiler before it executes. Compilers are the reason unskilled devs manage to get tolerable performance out of frameworks.
The "CPython is slow" complaints? That's what life without a good compiler looks like.
Maybe it’s splitting hairs to talk about whether this is down to the compiler or the language.
> It's grossly false for the vast majority of code, where the sludge written by thousands of engineers gets completely rewritten by the compiler before it executes.
It used to be true, for sure. What has changed since then is not the sludge or bloat or many cooks. Instead, the major change is that we have written a bunch of code on top of optimizing compilers with the assumption that these optimizations are happening. For example, nowadays, you might write C++ code with a deep call stack and lots of small functions (e.g. through templates), and it’s fast because the compiler inlines and then optimizes the result. Back in the 1990s, you would not have written code that way because the compiler would have choked on it. You see a lot more macro use in code from the 1990s rather than functions, because programmers didn’t trust that the functions will get inlined.
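A small illustration of that style shift (made-up names, but representative): the 1990s version is a macro because nobody trusted the inliner; the modern version leans on the compiler to flatten the tiny function away and produce the same loop.

    #define SQUARE_MACRO(x) ((x) * (x))        // 1990s style: no trust in inlining

    template <typename T>
    inline T square(T x) { return x * x; }     // modern style: trust the inliner

    int sumOfSquares(const int* v, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += square(v[i]);               // expected to compile like the macro
        return sum;
    }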
One of the things that a C++ compiler can do is de-virtualize function calls. That is, resolve those function pointer tables at compile time. Why couldn't a good compiler for Python do the same?
In Python, the types are way more general. Basically, every method call is being made to “object”. Every field has a value of type “object”. This makes it much more difficult for the compiler to devirtualize anything. It might be much harder to track which code assigns values to a particular field, because objects can easily escape to obscure parts of your codebase and be modified without knowing their type at all.
This happens even if you write very boring, Java-like code in Python.
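For contrast with the C++ case being asked about, a hedged sketch of when a C++ compiler can devirtualize: the static type (plus final) pins down the callee, which is exactly the anchor Python code doesn't give you.

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };

    struct Circle final : Shape {              // 'final': no further overrides
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.14159265358979 * r * r; }
    };

    double twoCircles(const Circle& a, const Circle& b) {
        return a.area() + b.area();            // receiver type is known and final,
    }                                          // so both calls can be made direct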
Python only has a byte-compilation phase that turns the parse tree into executable format. Everything past that, including the creation of classes, imports, etc., is runtime. You can pick a class and patch it. You can create classes and functions at runtime; in fact, this happens all the time, and not only for lambdas, but this is how decorators work.
A JIT compiler could detect actual usage patterns and replace the code with more efficient versions, until a counter-example is found, and the original "de-optimized" code is run. This is how JavaScript JITs generally work.
1. You’re not gonna get any guarantees that the optimization will happen. That makes it High Level. Just write code. We won’t force you to pollute your code with ugly annotations or pragmas.
2. In turn: check the assembly, or whatever concrete artifact reveals whether the optimization you wished for in your head actually went through.
There’s some kind of abstraction violation in the above somewhere.
Usually, if you don't get the optimization you wished for, it means that there is something you didn't account for. In C++, it may be exception processing, aliasing rules, etc... Had the compiler made the optimization you wished for, it wouldn't have been correct with regard to the specifications of the language, it may even hide a bug. The solution is then to write it in a way that is more explicit, to make the compiler understand that the edge case can never happen, which will then enable the optimization. It is not really an abstraction violation, more like a form of debugging.
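Aliasing is probably the most common instance of this. A minimal sketch (using the non-standard but widely supported __restrict): in the first function the store to y[0] might change *x, so the compiler has to reload it; telling it the pointers don't alias removes the reload.

    void addTwice(float* y, const float* x) {
        y[0] += *x;     // this store could modify *x ...
        y[1] += *x;     // ... so *x has to be reloaded here
    }

    void addTwiceRestrict(float* __restrict y, const float* __restrict x) {
        y[0] += *x;     // with the no-alias promise, *x is loaded once
        y[1] += *x;
    }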
If you really need to get low level, there is some point where you need to write assembly language, which is obviously not portable, but getting every last bit of performance is simply incompatible with portability.
It’s not leaky (and that term is kind of meh). It’s just an abstraction! Such optimizations are supposed to be abstracted away (point 1).[1] The problem comes when that is inconvenient; when the client does not want it to be abstracted away.
There’s a mismatch there.
[1] EDIT: The point here is that the API is too abstracted compared to what the client wants. Of course the API could choose to not abstract certain things. For example the Vec[2] type in Rust has specified, as part of the documentation, how it is implemented (to a large degree). They could call it something like “List” and say that whatever the concrete implementation is, is an implementation detail. But they chose not to.
[2] https://doc.rust-lang.org/std/vec/struct.Vec.html#capacity-a...
This is not true of JIT compilers, of course, which have similar constraints to DB query planners. In these cases the goal is to do a good job pretty quickly, rather than an excellent job in a reasonable time.
The number of possible distinct query plans grows very rapidly as the complexity increases (exponentially or factorially... I can't remember). So even if you have 10x as much time available for optimisation, it makes a surprisingly small difference.
One approach I've seen with systems like Microsoft Exchange and its underlying Jet database is that queries are expressed in a lower-level syntax tree DOM structure. The specific query plan is "baked in" by developers right from the beginning, which provides stable and consistent performance in production. It's also lower latency because the time spent by the optimiser at runtime is zero.
For DBs, it would be 'trivial' if we could know the exact (or very, very close) size of the tables and indexes. But the db is a mutating environment that starts with 1 row, is at 1 million a second later, and is back to 1 row a second after that.
And then you mutate it from thousands of different connections. Also, the users will mutate the shape, structure, runtime parameters, indexes, kind of indexes, data types, etc.
For large database vendors there is a good amount of complexity put into up-front information gathering so that query execution is fast.
"Abstraction violation" is a good way to put it.
There's extensive literature out there on how fast-math changes the behaviour of code. I've been bitten by this a couple of times already.
You can normally only send SQL queries to a database and not execution plans.
Since there's no bit rotate operator in C, you're left hoping the compiler recognizes what the shifts and bitwise-ands are trying to do.
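The usual idiom looks like this; the (n & 31) / (-n & 31) form avoids the undefined shift when n is 0, and mainstream compilers generally turn it into a single rotate instruction (C++20 finally made it explicit with std::rotl in <bit>):

    #include <cstdint>

    uint32_t rotl32(uint32_t x, unsigned n) {
        // hope this becomes one ROL instruction
        return (x << (n & 31)) | (x >> (-n & 31));
    }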
The thinking here seems to be that you want multiple things:
1. You want the high-level code since that is easier to reason about
2. You also want some specific optimizations
Maybe the missing link here is some annotation that asserts that some optimization is applied. Then if that assertion fails at some point you might have to bite the bullet and inline that assembly. Because (2) might trump (1).
However, what I hate is the lack of transparency (and I feel like this article tries to pinpoint exactly this). When I execute a query locally I get a different plan vs staging vs prod. A plan that can also change depending on some parameters or load or size.
I don't care about understanding all the underlying optimizations, I just care that the query plan I saw is the same and is still the same in prod, and that I can be warned when it changes. PG does not return the hash of the query plan or metrics along with the data, which is imo a mistake. With this you could track it in your favorite metrics store and be able to pinpoint when and why stuff is executing differently.
I like the metrics idea, but by the time you see the change in the metric, it’s too late.
For critical queries it might be helpful to be able to “freeze” a query plan just as one “freezes” a binary executable by compiling. In other words, let the query planner do its job, but only at a time of your choosing, so the performance of a production system doesn’t change suddenly.
So no hints in the source, just an opaque token representing a compiled query plan that can be deployed alongside a binary. With tooling you could be notified if the planner wants to do it differently and decide whether to deploy the new plan, after testing.
(And again, you’d only do this for a critical subset of your choice.)
It never occurred to me that this would be considered a hint to the optimizer. It doesn't affect code generation. What it does do is flag any use of the GC in the function and in any functions it may transitively call.
Optimizers have been likened to turning a cow into a hamburger. If you're symbolically debugging optimized code, you're looking at the hamburger. Nobody has been able to solve that problem.
It's true that optimizers themselves are hard to show being correct. The one in the D compiler is a conventional DFA optimizer that uses data flow equations I learned from Hennessy and Ullman in a 1982 seminar they taught. So it has been battle tested for 42 years now(!) and it's pretty rare to find a problem with it, unless it's a new pass I added like SROA. The idea is that anytime a problem is identified and corrected, it goes into the test suite. This has the effect of always ratcheting it forward and never regressing.
The GC dates from around 2000, when I wrote it for a Javascript engine. It was brutally tested for that, and has been pretty solid ever since. People complain about the GC, but not about it being buggy. A buggy GC is a real horror show as it is painfully difficult to debug.
The preceding paragraph had "and occasionally language features" so I thought it would be understood that I didn't mean it as an optimizer-specific thing, but on re-reading the post, I totally see how the other wording "The knobs to steer the optimizer are limited. Usually, these [...]" implies the wrong thing.
I've changed the wording to be clearer and put the D example into a different bucket.
> In some cases, languages have features which enforce performance-related properties at the semantic checking layer, hence, granting more control that integrates with semantic checks instead of relying on the optimizer:
>
> - D has first-class support for marking functions as “no GC”.
Even if HotSpot had perfect assembly-level debug information (which it cannot, as it does do (a tiny bit of) autovectorization, which by necessity can reorder operations, potentially leading to intermediate states that cannot be mapped to any program state), that just means it'd come at a performance cost (e.g. no autovectorization).
Once an optimization becomes part of the interface and it is guaranteed, is it really an optimization? Or did it just become part of the language/library/database/whatever?
One example is return value optimization in C++. In C++17 the "optimization" became mandatory in some contexts. What really happened though is that the rules of temporary materialization changed, and in those contexts it just never happens prematurely by the language rules. This ceased to be an optimization and became a mechanism in the language.
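A sketch of what that guarantee buys you since C++17: the prvalue is materialized directly in the caller's storage, so no copy or move happens, and the copy/move constructors don't even need to exist.

    #include <string>
    #include <vector>

    struct Big {
        std::vector<std::string> rows;
        Big() : rows(1000, std::string(100, 'x')) {}
        Big(const Big&) = delete;              // OK: no copy is ever needed
        Big(Big&&) = delete;
    };

    Big makeBig() {
        return Big{};                          // guaranteed elision since C++17
    }

    // Big b = makeBig();  // constructs b in place; would not compile pre-C++17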
What I'm getting at is that unreliability is a defining quality of optimizations.
Sure, there are certain optimizations that become load-bearing, in which case it would be better if they became part of the language's semantics and guarantees, therefore they ceased to be optimizations.
Even if that second description is stable and part of the guarantees you make, keeping it seperate is still incredibly useful from a user perspective.
From an implementation perspective, there is also a useful distinction. Optimizations take a valid representation, and turn it into a different valid representation of the same type that shares all defined behavior. This is a fairly different operation than compilation, which converts between representations. In particular, for the compilation step, you typically have only one compilation function for a given pair of representations; and if you have multiple, you select one ahead of time. For optimizations, each representation has a set of optimization functions, and you need to decide what order to apply them and how many times to do so. Compilation functions, for their part, need to deal with every difference between the two representations, whereas optimization functions get to ignore everything except the part they care about.
I think that the compilation and optimization step, as a black box, is a disservice for highly reliable software development. Compiler and optimizer bugs are definitely a thing. I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers. Yeah, this makes perfect sense when trying to optimize fixed point multiplication, but it completely destroys the security of DLP or ecDLP based cryptography by introducing timing attacks that can recover the private key. Thankfully, I was fastidious about examining the optimized machine code output of this compiler, and was able to substitute hand coded assembler in its place.
AFAIK, that's how seL4 is verified. Quoting from https://docs.sel4.systems/projects/sel4/frequently-asked-que...
"[...] Specifically, the ARM, ARM_HYP (ARM with virtualisation extensions), X64, and RISCV64 versions of seL4 comprise the first (and still only) general-purpose OS kernel with a full code-level functional correctness proof, meaning a mathematical proof that the implementation (written in C) adheres to its specification. [...] On the ARM and RISCV64 platforms, there is a further proof that the binary code which executes on the hardware is a correct translation of the C code. This means that the compiler does not have to be trusted, and extends the functional correctness property to the binary. [...] Combined with the proofs mentioned above, these properties are guaranteed to be enforced not only by a model of the kernel (the specification) but the actual binary that executes on the hardware."
I'm working on a hybrid approach between SMT solving and constructive proofs. Model checking done with an SMT solver is pretty sound. I'm actually planning a book on a scalable technique to do this with CBMC. But, the last leg of this really is understanding the compiler output.
FWIW, I think this should be considered a language design problem rather than an optimizer design problem. Black box optimizer behaviour is good for enabling language designs that have little connection to hardware behaviour, and good for portability including to different extensions within an ISA.
C doesn't offer a way to express any timing guarantees. The compiler, OS, CPU designer, etc. can't even do the right thing if they wanted to because the necessary information isn't being received from the programmer.
Black box designs work until the knob or dial you need to control it isn't there. I would have taken a pragma, a command-line option to the compiler, or even a language extension.
This is one example of many as to why I think that user-guided code generation should be an option of a modern tool suite. If I build formal specifications indicating the sort of behavior I expect, I should be able to link these specifications to the output. Ultimately, this will come down to engineering, and possibly, overriding or modifying the optimizer itself. An extensible design that makes it possible to do this would significantly improve my work. Barring that, I have to write assembler by hand to work around bad assumptions made by the optimizer.
They start with a Haskell prototype that is translated programmatically into a formal specification for the theorem prover.
They then implement the same thing in C, and use a refinement proof to demonstrate that it matches their Haskell implementation.
They then compile the program, and create another refinement proof to demonstrate that the binary code matches the C semantics.
They are on the right track. But, I think there have been some improvements since their effort that can lead to more streamlined equivalence proofs.
I suspect something like wasm may be a better way to preserve backward-compatibility with C, although of course it won't help with constant time or confused-deputy vulnerabilities. CHERI might help with the latter.
I'm a big fan of runtime mitigations. I use a few in my own work.
WASM can help in some areas. But, bear in mind that a lot of this C source code is firmware or lower level operating system details (e.g. kernels, device drivers, system services, low-level APIs, portable cryptography routines). In this case, WASM wouldn't be a good fit.
CHERI is also a possibility in some contexts for runtime mitigation. But, since that does involve a hardware component, unless or until such capabilities are available in mainstream devices and microcontrollers, this would only be of limited use.
There are other runtime mitigations that are more mainstream, such as pointer authentication, that are in various states of evaluation, deployment, or regression due to hardware vulnerabilities. I think that each of these runtime mitigations are important for defense in depth, but I think that defense in depth works best when these mitigations are adding an additional layer of protection instead of the only layer of protection.
So, this verification work should be seen as an added layer of protection on top of runtime mitigations, which I hope will become more widely available as time goes on.
And there also are no intrinsics for most scalar operations, e.g. if you wanted to force "x>>48 == 0x1234" to be actually done via the shift and not "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).
And of course assembly means writing platform-specific code (potentially undesirable even if you want to only do the optimization for a single architecture, as it means having to learn to write assembly of said architecture).
There is some potential middle ground of black-boxing; as-is in C/C++ the way to do this is with a no-op asm block, but that can make register allocation worse, and it still requires some platform-specific logic for deriving the register kind from the value type.
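For reference, the no-op asm idiom in question (the same trick as the DoNotOptimize helper in Google Benchmark); GCC/Clang syntax, not portable, and it only works for register-sized values. The empty asm claims to read and possibly write the value, so the optimizer must materialize it and can't fold through it, at the cost of pinning it to a register and losing some optimization around the block.

    template <typename T>
    inline void blackBox(T& value) {
        asm volatile("" : "+r"(value));        // forces 'value' into a register
    }

    bool shiftNotMask(long x) {
        long t = x >> 48;
        blackBox(t);                           // keep the shift as an actual shift
        return t == 0x1234;                    // compare against the shifted value
    }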
I've been very frustrated at times that you can't just tell it to use a certain index or a certain query plan.
But at the same time, the data in the table can change over time.
So for that particular problem, postgresql's superpower is it can optimize and reoptimize your query at runtime as the data changes!
Not an easy thing to do in most programming languages.
But I do agree with the rest of the article to a large extent.
For some cases, you need a way to say "do this optimization, fail to compile if you cannot".
Most of the time though, I just want the optimizer to do its best, and don't have something in particular in mind.
What if the compiler could output the optimizations or transformations it applies to a file? A file that was checked into source control.
Think of lockfiles for dependencies. Or snapshot testing.
I don't mean output the entire IR. I mean more a list of applied transformations.
When those change, you could then be aware of those changes.
Maybe it could act as an actual lockfile and constrain the compiler? Could it also double as a cache and save time?