You can't fool the optimizer

Posted by HeliumHydride 12/3/2025

267 points | 189 commentspage 2

senfiaj 12/3/2025|

I wonder if compilers do multiple passes on the intermediate code in order to optimize / simplify it. For example, during each pass the optimizer searches some known harcoded patterns and replaces them with something else and repeats until no possible improvement is found.

Also optimizers have a limit, they can't reason as abstractly as humans, for example:

  bool is_divisible_by_6(int x) {
      return x % 2 == 0 && x % 3 == 0;
  }

  bool is_divisible_by_6_optimal(int x) {
      return x % 6 == 0;
  }

I tried with both gcc and clang, the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.

The morale is that you still have to optimize algorithms (O notation) and math operations / expressions.

inkyoto 12/4/2025||

> So no, there are plenty of easy ways to fool the optimizer by obfuscation.

If you mean fooling the compiler by the source code obfuscation, it won't – by the time the first optimisation pass arrives, the source had already been transformed into an abstract syntax tree and the source code obfuscation becomes irrelevant.

Multiple optimiser passes do take place, but they are bounded in time – it is not an accepted expectation that the optimiser will spend a – theoretically – indefinite amount of time trying to arrive at the most perfect instruction sequence.

There was a GNU project a long time ago, «superoptimiser», which, given a sequence of instructions, would spend a very long time trying to optimise it into oblivion. The project was more of an academic exercise, and it has been long abandoned since.

jakobnissen 12/3/2025|||

They do, and the order of the passes matter. Sometimes, optimizations are missed because they require a certain order of passes that is different from the one your compiler uses.

On higher optimization levels, many passes occur multiple times. However, as far as I know, compilers don't repeatedly run passes until they've reached an optimum. Instead, they run a fixed series of passes. I don't know why, maybe someone can chime in.

titzer 12/3/2025||

It's a long-standing problem in compilers, often referred to as the "phase ordering problem". In general, forward dataflow optimizations can be combined if they are monotonic (meaning, never make the code worse, or at least, never undo a previous step. It's possible to run forward dataflow problems together repeatedly to a fixpoint. In TurboFan a general graph reduction algorithm is [1] instantiated with a number of reducers, and then a fixpoint is run. The technique of trying to combine multiple passes has been tried a number of times. What doesn't seem so obvious is how to run optimizations that are not traditional forward dataflow problems or are indeed backward dataflow problems (like DCE) together with other transformations. Generally compilers get tuned by running them on lots of different kinds of code, often benchmarks, and then tinkering with the order of passes and other heuristics like loop unroll factors, thresholds for inlining, etc, and seeing what works best.

[1]was? TurboFan seems to have splintered into a number of pieces being reused in different ways these days

windward 12/3/2025|||

Those aren't isomorphic. The C spec says `is_divisible_by_6` short-circuits. You don't want the compiler optimising away null checks.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

6.5.13, semantics

senfiaj 12/3/2025|||

So you claim that the compiler "knows about this but doesn't optimize because of some safety measures"? As far as I remember, compilers don't optimize math expressions / brackets, probably because the order of operations might affect the precision of ints/floats, also because of complexity.

But my example is trivial (x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int), yet the compiler produced different outputs (the outputs are different and most likely is_divisible_by_6 is slower). Also what null (you mean 0?) checks are you talking about? The denominator is not null/0. Regardless, my point about not over relying on compiler optimization (especially for macro algorithms (O notation) and math expressions) remains valid.

1718627440 12/5/2025||

> the order of operations might affect the precision of ints/floats

That's only the problem of floats, with ints this issue doesn't exist.

Why do you write (x % 2 == 0 && x % 3 == 0) instead of (x % 2 == 0 & x % 3 == 0), when the latter is what you think you mean?

Are you sure, that dividing by 6 is actually faster, than dividing by 2 and 3? A division operation is quite costly compared to other arithmetic and 2 and 3 are likely to have some special optimization (2 is a bitshift), which isn't necessary the case for 6.

senfiaj 12/10/2025||

> That's only the problem of floats, with ints this issue doesn't exist.

With ints the results can be dramatically different (often even worse than floats) even though in pure mathematics the order doesn't matter:

  1 * 2 * 3 * 4 / 8 --> 3
  3 * 4 / 8 * 1 * 2 --> 2

This is a trivial example, but it shows why it's extremely hard for compilers to optimize expressions and why they usually leave this task to humans.

But x % 2 == 0 && x % 3 == 0 isn't such case, swapping operands of && has no side effects, nor swapping operands of each ==.

> Are you sure, that dividing by 6 is actually faster

Compilers usually transform divisions into multiplications when the denominator is a constant.

I wrote another example in other comment but I'll write again.

I also tried this

  bool is_divisible_by_15(int x) {
      return x % 3 == 0 && x % 5 == 0;
  }

  bool is_divisible_by_15_optimal(int x) {
      return x % 15 == 0;
  }

is_divisible_by_15 still has a branch, while is_divisible_by_15_optimal does not

  is_divisible_by_15(int):
        imul    eax, edi, -1431655765
        add     eax, 715827882
        cmp     eax, 1431655764
        jbe     .LBB0_2
        xor     eax, eax
        ret
  .LBB0_2:
        imul    eax, edi, -858993459
        add     eax, 429496729
        cmp     eax, 858993459
        setb    al
        ret

  is_divisible_by_15_optimal(int):
        imul    eax, edi, -286331153
        add     eax, 143165576
        cmp     eax, 286331153
        setb    al
        ret

My point is that the compiler still doesn't notice that 2 functions are equivalent. Even when choosing 3 and 5 (to eliminate the questionable bit check trick for 2) the 1st function appears less optimal (more code + branch).

1718627440 12/16/2025||

> in pure mathematics the order doesn't matter

I don't perceive that as an ordering issue. "Pure mathematics" has multiple division definitions, what we see here is the definition you use in class 1: integer division. The issue here is not associativity, it is that the inverse of an integer division is NOT integer multiplication, the inverse of division is the sum of multiplication and the modulo. Integer division is a information destroying operation.

> I wrote another example in other comment but I'll write again. [...]

Yes, this is because optimizing compilers are not optimizers in the mathematical sense, but heuristics and sets of folk wisdoms. This doesn't make them any less impressive.

senfiaj 12/17/2025||

> I don't perceive that as an ordering issue. "Pure mathematics" has multiple division definitions, what we see here is the definition you use in class 1: integer division. The issue here is not associativity, it is that the inverse of an integer division is NOT integer multiplication, the inverse of division is the sum of multiplication and the modulo. Integer division is a information destroying operation.

Agree, I've gone too far with integer division. But a similar problem exists for floats as well. In abstract mathematics the order of some operations between real numbers doesn't matter, but since the CPU floats have limited size and accuracy, it does. This is why when you are calculating some decreasing convergent series, you should better to start from the smallest terms, because the accuracy would be lost during float normalization when adding a tiny term to an already large accumulated sum. A compiler is unlikely to do any optimization here and people should be aware of this. Compilers can't assume the intention in your code, so they make sure the program behavior isn't affected after the optimizations.

> Yes, this is because optimizing compilers are not optimizers in the mathematical sense, but heuristics and sets of folk wisdoms. This doesn't make them any less impressive.

I'm not implying that it's not impressive, but I'm implying that compilers still aren't magic wands, and you should still optimize the algorithms (to a reasonable degree). Just let the compiler do the microptimizations (all this register allocation golf, instruction reordering, caching, the discussed division trick, etc.). IMHO this suboptimal output in this particular case was somewhat expected because it's some "niche" case although it's obvious. I'm not blaming the compiler people. Yes someone could add that optimization rule for my case, but as I said, It's quite rare and it's probably not worth adding optimization rules for such case to make the optimizer more bloated and complicated.

1718627440 12/17/2025||

> I'm not implying that it's not impressive

That part was about me :-).

> I'm implying that compilers still aren't magic wands,

Agreed.

> you should still optimize the algorithms (to a reasonable degree). Just let the compiler do the microptimizations (all this register allocation golf, instruction reordering, caching, the discussed division trick, etc.).

Agreed.

jcranmer 12/3/2025||||

x % 3 == 0 is an expression without side effects (the only cases that trap on a % operator are x % 0 and INT_MIN % -1), and thus the compiler is free to speculate the expression, allowing the comparison to be converted to (x % 2 == 0) & (x % 3 == 0).

Yes, compilers will tend to convert && and || to non-short-circuiting operations when able, so as to avoid control flow.

Sohcahtoa82 12/3/2025||||

Any number divisible by 6 will also be divisible by both 2 and 3 since 6 is divisible by 2 and 3, so the short-circuiting is inconsequential. They're bare ints, not pointers, so null isn't an issue.

So how are they not isomorphic?

dzaima 12/3/2025|||

That only matters for things with side-effects; and changing the `&&` to `&` doesn't get it to optimize anyway.

You can check - copy the LLVM IR from https://godbolt.org/z/EMPr4Yc84 into https://alive2.llvm.org/ce/ and it'll tell you that it is a valid refinement as far as compiler optimization goes.

ramon156 12/3/2025||

I don't know enough about ASM. Are u saying the first one is more optimal because it is faster or because it uses less instructions? Would this reflect a real world use case? Do any other compilers (e.g. V8) optimize modulo's into something else?

senfiaj 12/3/2025||

The compiler didn't recognize that x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values. In theory a compiler could detect that and generate identical code for both functions, but it isn't done because this case is "niche" despite being trivial. My point is not to over rely on optimizer for math expressions and algorithms.

barishnamazov 12/3/2025||

Sometimes you can fool the compiler :-)

See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/

jmcomets 12/3/2025||

Obvious caveat: pushing this a bit further it can quickly fallback to the default case. The optimizer is a superpower but you still need to try to write efficient code.

    unsigned add_v5(unsigned x, unsigned y) {
      if (x == y) return 2 * x;
      return x + y;
    }

Results in:

    add_v5(unsigned int, unsigned int):
      lsl w8, w0, #1
      add w9, w1, w0
      cmp w0, w1
      csel w0, w8, w9, eq
      ret

(armv8-a clang 21.1.0 with O3)

If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?

jcranmer 12/3/2025||

> If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?

Compilers are essentially massive towers of heuristics for which patterns to apply for optimization. We don't throw a general SMT solver at your code because that takes way too long to compile; instead, we look at examples of actual code and make reasonable efforts to improve code.

In the case of the incrementing in a loop, there is a general analysis called Scalar Evolution that recasts expressions as an affine expression of canonical loop iteration variables (i.e., f(x), where x is 0 on the first loop iteration, 1 on the second, etc.). In the loop `while (x--) y++;`, the x variable [at the end of each loop iteration] can be rewritten as x = x₀ + -1*i, while the y variable is y = y₀ + 1*i. The loop trip count can be solved to an exact count, so we can replace the use of y outside the loop with y = y₀ + 1*trip count = y₀ + x, and then the loop itself is dead and can be deleted. These are all optimizations that happen to be quite useful in other contexts, so it's able to easily recognize this form of loop.

In the example you give, the compiler has to recognize the equivalence of two values conditional on control flow. The problem is that this problem really starts to run into the "the time needed to optimize this isn't worth the gain you get in the end." Note that there are a lot of cases where you have conditional joins (these are "phis" in SSA optimizer parlance), most of which aren't meaningfully simplifiable, so you're cutting off the analysis for all but the simplest cases. At a guess, the simplification is looking for all of the input values to be of the same form, but 2 * x (which will actually be canonicalized to x << 1) is not the same form as x + y, so it's not going to see if the condition being used to choose between the same values would be sufficient to make some operation return the same value. There are representations that make this problem much easier (egraphs), but these are not the dominant form for optimizers at present.

DannyBee 12/3/2025||

This is all true. Additionally, the payback from optimizing purely scalar arithmetic harder has gone down more and more over time compared to almost anything else.

For example, eliminating an extra load or store is often worth more than eliminating 100 extra arithmetic operations these days.

Someone 12/3/2025|||

> I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can’t?

I expect because the former helps more in optimising real-world code than the latter. It’s not worth the LLVM developer's time to make the compiler better for programs that it won’t see in practice.

It’s not as if the compiler did nothing with that code, though. It replaced the multiplication by a left shift and removed the branch.

scialex 12/3/2025|||

This sort of pattern can't be found by incremental lowering (and isn't common enough to have more sophisticated analysis written for it) so it ends up in a local maximum.

Basically the idea for most compilers is to do a series of transforms which incrementally improve the program (or at least make it worse in understood and reversible ways). To do this transform you need the optimizer to do the (not always trivial) proof that the 2*x is equivalent to x+y, do the replacement, do the gvn to duplicate the adds and finally do the branch elimination. Each of these steps is however totally separate from one another and the first one doesn't trigger since as far as it's concerned a shift left is faster than an add so why should it do the replacement.

This is all even more complicated since what representation is faster can depend on the target.

AlotOfReading 12/3/2025||

I agree, but GCC manages the optimization, and not all optimizations need to take fewer cycles. The single instruction version is obviously better for -Os and it would probably be a win in general.

DullPointer 12/3/2025||

I’m not a compiler expert, an assembly expert or an ARM expert, so this may be wildly wrong, but this looks optimized to me.

The trick is that it’s doing both the add and the left shift in parallel then selecting which to use based on a compare of the two values with csel.

(To see this, rather than reading the code sequentially, think of every instruction as being issued at the same time until you hit an instruction that needs a destination register from an earlier instruction)

The add is stored in W9 but only read if the two arguments are unequal.

If the compare succeeds and the lsl retires before the add, the add is never read, so nothing stalls waiting for it and the answer can be returned while the add is still in flight. The result of the add would then be quietly discarded assuming it ever started (maybe there’s some magic where it doesn’t even happen at all?).

It’s not clear to me that this is power efficient, or that on many real cpus there’s a latency difference to exploit between add and lsl, so it may not be faster than just unconditionally doing the addition.

That said, it is definitely faster than the code as it was written which if translated to asm verbatim stalls on the compare before executing either the add or the left shift.

adwn 12/3/2025|||

> this looks optimized to me.

It's not. Why would lsl+csel or add+csel or cmp+csel ever be faster than a simple add? Or have higher throughput? Or require less energy? An integer addition is just about the lowest-latency operation you can do on mainstream CPUs, apart from register-renaming operations that never leave the front-end.

DannyBee 12/3/2025|||

In the end, the simple answer is that scalar code is just not worth optimizing harder these days. It's rarer and rarer for compilers to be compiling code where spending more time optimizing purely scalar arithmetic/etc is worth the payback.

This is even true for mid to high end embedded.

DullPointer 12/3/2025|||

ARM is a big target, there could be cpus where lsl is 1 cycle and add is 2+.

Without knowing about specific compiler targets/settings this looks reasonable.

Dumb in the majority case? Absolutely, but smart on the lowest common denominator.

adwn 12/3/2025||

> Without knowing about specific compiler targets/settings this looks reasonable.

But we do, armv8-a clang 21.1.0 with O3, and it doesn't.

> […] but smart on the lowest common denominator.

No, that would be the single add instruction.

msarnoff 12/3/2025||

I was very surprised that GCC could optimize NEON SIMD intrinsics. After spending hours trying to optimize my vector code, trying to get the spacing between register dependencies right to reduce stalls, breaking long reduction operations into intermediate results, messing with LLVM-MCA, etc., I realized that I just couldn’t beat the compiler. It was doing its best to allocate registers and reorder instructions to keep the pipeline filled.

I don’t think it always did the best job and saw a bunch of register spills I thought were unnecessary, but I couldn’t justify the time and effort to do it in assembly…

sureglymop 12/3/2025||

With this one I instead wondered: If there are 4 functions doing exactly the same thing, couldn't the compiler also only generate the code for one of them?

E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?

It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?

optionalsquid 12/3/2025||

This is not quite what you asked, I think, but GCC is able to remove duplicate functions and variables after code generation via the -fipa-icf options:

> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.

In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:

> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken

cyco130 12/3/2025|||

It would but it's harder to trigger. Here, it's not safe because they're public functions and the standard would require `add_v1 != add_v2` (I think).

If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx

I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.

tialaramex 12/3/2025|||

If your language has monomorphization† (as C++ and Rust do) then it's really common to have this commonality in the emitted code and I believe it is common for compilers to detect and condense the resulting identical machine code. If the foo<T> function for an integer checks if it's equal to four, it well be that on your target hardware that's the same exact machine code whether the integer types T are 1 byte, 2 bytes or 4 bytes and whether they're signed or unsigned, so we should only emit one such implementation of foo, not six for u8, i8, u16, i16, u32 and i32.

† Monomorphization takes Parametrically Polymorphic functions, ie functions which are strongly typed but those types are parameters at compile time, and it emits distinct machine code for each needed variation of the function, so e.g. add(a, b) maybe gets compiled to produce add_integer(a, b) and add_float(a, b) and add_matrix(a, b) even though we only wrote one function, and then code which calls add(a, b) with matrices, is at compile time emitted as calling add_matrix(a, b), because the compiler knew it needs that version. In C++ the number of parameters is also potentially allowed to vary between callers so add_matrix(a, b, c, d) might exist too, this feature is not yet available in Rust.

titzer 12/3/2025||

The linker de-duping identical machine code is common, but most frontends that do monomorphization aren't that smart about identical copies, because monomorphization is usually done with source-level types, and there are lots of typeful operations that need to get resolved and lowered before it's known that the machine code will be identical.

moefh 12/3/2025|||

> It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no?

It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).

That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.

bruce343434 12/3/2025|||

Sadly most C++ projects are organized in a way that hampers static functions. To achieve incremental builds, stuff is split into separate source files that are compiled and optimized separately, and only at the final step linked, which requires symbols of course.

I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.

cyco130 12/3/2025|||

That’s where link-time optimization enters the picture. It’s expensive but tolerable for production builds of small projects and feasible for mid-sized ones.

1718627440 12/5/2025||||

That's one major reason why I don't like C++. I think the concept of header and implementation files is fine, but idiomatic C++ code basically makes it broken. Surely a class should go into the implementation file? (Internal) Types belong into the implementation, what belongs into headers are interfaces and function signatures. A class is a type, so it does not belong into a header file.

gpderetta 12/3/2025|||

[[gnu::visibility(hidden)]] (or the equivalent for your compiler), might help.

sureglymop 12/3/2025|||

> It can't do that because the program might load a dynamic library that depends on the function

That makes perfect sense, thank you!

And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create a ELF file. Looking at it with a hex editor, it has no sections or symbol table because it creates a completely stripped binary.

Learned something :)

Joker_vD 12/3/2025|||

Nope. Function with external linkage are required to have different addresses. MSVC actually breaks this and this means that you can't reliably compare function pointers on MSVC because some different functions may happen to have same object code by chance:

    void go_forward(Closure *clo, Closure *cont, Closure *forward) {
        GC_CHECK(clo, cont, forward);
        ((Fun0)(forward->fun))(forward, cont);
    }

    void go_left(Closure *clo, Closure *cont, Closure *left, Closure *right) {
        GC_CHECK(clo, cont, left, right);
        ((Fun0)(left->fun))(left, cont);
    }

    void go_right(Closure *clo, Closure *cont, Closure *left, Closure *right) {
        GC_CHECK(clo, cont, left, right);
        ((Fun0)(right->fun))(right, cont);
    }

    GcInfo gc_info[] = {
        { .fun = (GenericFun)&go_forward, .envc = 0, .argc = 1 },
        { .fun = (GenericFun)&go_left, .envc = 0, .argc = 2 },
        { .fun = (GenericFun)&go_right, .envc = 0, .argc = 2 },
    };

Since, the pointers to go_forward and go_left will be the same, the gc_info table is less useless that it could be otherwise.

gpderetta 12/3/2025||

But it could generate one then make the remaining three tail call to that one, or lay them out so that they are at 1byte-nop each to the next one and fallthrough the next until the last one implements the logic (This is a bit more compilcated on msvc as I believe the ABI requires a well defined prologue).

zozbot234 12/3/2025||

They can't be at 1byte-nop distance because pointer addresses as well as branch target addresses are expected to be aligned for performance reasons - often to 16 bytes. You need either a nop sequence or a jump/tailcall.

gpderetta 12/3/2025||

Sure, there are also probably pointer integrity landing pads. Make it larger nops then.

apple1417 12/3/2025||

The MSVC linker has a feature where it will merge byte-for-byte identical functions. It's most noticeable for default constructors, you might get hundreds of functions which all boil down to "zero the first 32 bytes of this type".

A quick google suggests it's called "identical comdat folding" https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...

amelius 12/3/2025||

One undesirable property of optimizers is that in theory one day they produce good code and the next day they don't.

titzer 12/3/2025|

These situations are known as "performance cliffs" and they are particularly pernicious in optimizing dynamic languages like JavaScript, where runtime optimization happens that depends not just on the program's shape, but its past behavior.

317070 12/3/2025||

"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.

You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.

Almondsetat 12/3/2025|

But the point should be to follow the optimization cycle: develop, benchmark, evaluate, profile, analyze, optimize. Writing performant code is no joke and very often destroys readability and introduces subtle bugs, so before trying to oursmart the compiler, evaluate if what it produces is good enough already

Panzerschrek 12/4/2025||

I can: https://godbolt.org/z/Kc8cTddd5 Compilers still struggle to optimize non-trivial recursive functions, where obvious non-recursive approach is possible.

bgbntty2 12/3/2025||

I'm not well-versed in compilers, so it was a bit surprising to see how it optimizes all the add_vX functions.

What I most enjoyed, though, was how the guy in the video (linked at the bottom of the article) was typing - a mistake on every few characters. Backspace was likely his most-used key. I found it encouraging, somehow. I know typing speed or correctness isn't really important for coders, but I always felt like I'm behind others with regards to typing, even though when I really concentrate, I do good on those online typing tests. Even when writing this comment, I made like 30 mistakes. Probably an useless comment, but it may give some people hope or validation if they feel like they not great typists.

raluk 12/4/2025|

Years ago I wrote c++ library for stream compostion. Something like C++20 ranges. It turns out that as long as you compose everything with lambdas, compiled code is same as it would be with naive loops. Everything gets optimised.

For example, you can write sum of numbers less than n as:

  count(uint64_t(0)) 
   | take(n) 
   | sum<uint64_t>();

Clang converted this into n*(n-1)/2.

More comments...