For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
julia> function f(n)
           total = 0
           for x in 1:n
               Base.donotdelete(x)
               total += x
           end
           return total
       end
will keep the loop.

Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
This doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.

eg 2:
int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

eg 3:
int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello", 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.

I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}

bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
Mathematically, x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values, but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.

is_divisible_by_6(int):
test dil, 1
jne .LBB0_1
imul eax, edi, -1431655765
add eax, 715827882
cmp eax, 1431655765
setb al
ret
.LBB0_1:
xor eax, eax
ret
is_divisible_by_6_optimal(int):
imul eax, edi, -1431655765
add eax, 715827882
ror eax
cmp eax, 715827883
setb al
ret
By themselves, the mod 6 and mod 3 operations are almost identical -- in both cases the compiler used the reciprocal trick to transform the modulo into an imul+add+cmp, the only practical difference being that the %6 has one extra bit shift.

But note the branch in the first function! The original code uses the && operator, which is short-circuiting -- so from the compiler's perspective, perhaps the programmer expects that x % 2 will usually be false, and so we can skip the expensive %3 most of the time. The "suboptimal" version is potentially quite a bit faster in the best case, but also potentially quite a bit slower in the worst case (since that branch could be mispredicted). There's not really a way for the compiler to know which version is "better" without more context, so deferring to "what the programmer wrote" makes sense.
That being said, I don't know that this is really a case of "the compiler knows best" rather than just not having that kind of optimization implemented. If we write 'x % 6 && x % 3', the compiler pointlessly generates both operations. And GCC generates branchless code for 'is_divisible_by_6', which is just worse than 'is_divisible_by_6_optimal' in all cases.
bool is_divisible_by_15(int x) {
    return x % 3 == 0 && x % 5 == 0;
}

bool is_divisible_by_15_optimal(int x) {
    return x % 15 == 0;
}
is_divisible_by_15 still has a branch, while is_divisible_by_15_optimal does not.

is_divisible_by_15(int):
imul eax, edi, -1431655765
add eax, 715827882
cmp eax, 1431655764
jbe .LBB0_2
xor eax, eax
ret
.LBB0_2:
imul eax, edi, -858993459
add eax, 429496729
cmp eax, 858993459
setb al
ret
is_divisible_by_15_optimal(int):
imul eax, edi, -286331153
add eax, 143165576
cmp eax, 286331153
setb al
ret

Is the best way to think of optimizing compilers, "I wonder if someone hand wrote a rule for the optimizer that fits this case"?
Nit: some might think the reason this doesn't work is that the shift would "move" the sign bit, but arithmetic shift instructions exist for exactly this purpose. They are not enough because shifting rounds division the wrong way for negative numbers. This can however be fixed up by adding 1 if the number is negative (which can be done with an additional logical shift to move the sign bit to the rightmost position, plus an addition).
I remember reading (although I can't find it now) a great analysis of all the optimizations that Javascript compilers _can't_ do because of the existence of the "eval" instruction.
Andy Wingo (of course!) has a good explanation of this: https://wingolog.org/archives/2012/01/12/javascript-eval-con...
Why?
What happens if it isn't?
memcpy will get transformed and inlined for small copies all the time.
This is the case, for example, on macOS, FreeBSD, and Linux.
Cute username, BTW.
> Cute username, BTW.
Thanks, I was too lazy to think of a real name, so this is the timestamp at which I created the account.
§6.4.2.1: "If the program defines a reserved identifier [...] the behavior is undefined."
But the orthodox answer to such questions is that demons fly out of your nose.
If you want a dialect where they aren't allowed to assume that you would have to make your own
BTW, the case of it not optimizing was MSVC targeting Windows (which doesn't support LD_PRELOAD, but maybe has something similar?).
For that you could at least argue that if the libc's strlen is faster than strcmp, that improves performance if the programmer expects the function to be usually called with a short input.
That said, after changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is then completely equivalent to just `return 0;`.
You've misrepresented the situation. Turn up the optimiser to `/O2` and MSVC returns 5 directly, too.
> This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.
It's funny how sometimes operating at a higher level of abstraction allows the compiler to optimise the code better: https://godbolt.org/z/EYP5764Mv
In this, the string literal "hello" is lowered not merely into a static string, but a handful of integral immediates that are directly inline in the assembly, no label-dereferencing required, and the 'is equal to "hello"' test is cast as the result of some sign extends and a bitwise-xor.
Of course, one could argue that std::string_view::size() is statically available, but then my counter-argument is that C's zero-terminated strings are a massive pessimisation (which is why the compiler couldn't 'see' what we humans can), and should always be avoided.
int foo(char const *s) {
    if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
        return 1;
    return 0;
}
This outputs 4 cmp instructions, even though I'd have thought 1 was sufficient. https://godbolt.org/z/hqMnbrnKe

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb
(also note you got the endianness wrong in your hand-optimized version)
But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.
> (also note you got the endianness wrong in your hand-optimized version) Doh :-)
Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.
The code in `bar` is probably not a good idea on targets that don't like unaligned loads.
And that's what is wrong. This is the most unfriendly behavior towards the programmer.
If I program it to cause a null pointer seg fault, I expect a null pointer seg fault. If I program it to cause a twos complement overflow, I want a twos complement overflow.
So if you use a messy solution, where something that should be a struct operated on by functions is instead just a pile of local variables within a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
e.g.
/* slower */
struct foo { uint32_t a,b,c,d,e,f,g,h; };

uint32_t do_thing(struct foo *foo) {
    return foo->a ^ foo->b ^ foo->c ^ foo->d;
}

void blah() {
    struct foo x;
    for (...) {
        x.e = do_thing(&x) ^ x.f;
        ...
    }
}

/* faster */
#define DO_THING (a^b^c^d)

void blah() {
    uint32_t a,b,c,d,e,f,g,h;
    for (...) {
        e = DO_THING ^ f;
        ...
    }
}

https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
The ability to turn stack-allocated variables into SSA values (which can then be put in registers) is one of the most important passes in modern compilers.

Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put locals onto the stack and let the compiler figure out what can be promoted into registers and how.
According to https://docs.hdoc.io/hdoc/llvm-project/r2E8025E445BE9CEE.htm...
> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.
That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.
I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
I do not get that, I get LOAD_CONST 500.
Tested on: Python 3.9.3 MacOS (Apple provided), 3.13.3 (uv provided) MacOS and Linux, and 3.14.0 (uv provided) MacOS and Linux.
The optimizations that tend to have the most impact involve changes to the algorithm or data layout. Most compilers won’t do things like add a hash table to make a lookup O(1) or rearrange an array of structures to be a structure of arrays for better data locality. Coding with an eye for these optimizations is still a very good use of your time.
You can mostly not think about super low level integer manipulation stuff though.
Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)

A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
This is true, but it also means "the compiler IS made for someone median smart, who knows the machine".
It works great for basic, simple, common code, and for code that is made with care for data structures.
A total mess of code is another story.
P.S.: it's similar to query optimizers, which likewise can't outrun a terribly made schema and queries.
That mindset will last until the first or second time you walk through the compiler's assembly-language output. GCC and LLVM can find some amazing optimizations, but they also consistently miss very obvious optimization opportunities, often even on utterly trivial code.
That isn't to say that it's worthwhile to hand-optimize all your code. But the compiler is very much not smarter than you, and if you do need to squeeze performance out of the processor, you can do a lot better than the compiler does. A good first step is to look at the stupid shit the compiler has spat out and massage your source code until the shit stinks less.
Pareto principle, like always: you don't need the best, just good enough.

Not every company is Google-level anyway.
Then there is a (very long) list of checks for specific patterns that replace them with shorter sequences of code, things like recognizing a byte-swap pattern and replacing it with a bswap instruction. There's no end to adding patterns to check for.
There are limits on provable equivalence in the first place. Things like equality saturation also try to do a better job of formalizing equivalence-based rewrites.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
If you have a real-time system you care about the difference between real-time and not. But any given 1.2x factor is extremely unlikely to be the difference-maker.
> When I program some new part and I can avoid an indirection, why not do it? Less complexity, less (failure) state, better performance.
Well if it makes the code simpler then you should do it for that reason. But thinking about this kind of low-level performance detail at all will usually do more harm than good.
unsigned int popcount(unsigned int n)
{
    return (n &= n - 1u) ? (1u + popcount(n)) : 0u;
}
Clang 21.1 x64:

popcount:
mov eax, -1
.LBB0_1:
lea ecx, [rdi - 1]
inc eax
and ecx, edi
mov edi, ecx
jne .LBB0_1
ret
GCC 15.2:

popcount:
blsr edi, edi
popcnt eax, edi
ret
Both compiled with -O3 -march=znver5.

return n ? (1u + popcount(n & n - 1u)) : 0u;

which both Clang and GCC promptly optimize to a single popcnt.

unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;
}
becomes (with armv8-a clang 21.1.0 -O3):

add(unsigned int, unsigned int):
.LBB0_1:
ands w8, w0, w1
eor w1, w0, w1
lsl w0, w8, #1
b.ne .LBB0_1
mov w0, w1
ret

unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;  /* every position where addition will generate a carry */
        b = x ^ y;  /* the addition, with no carries */
        x = a << 1; /* the carries */
        y = b;
        /* if there were any carries, repeat the loop */
    } while (a);
    return b;
}
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must be equal to x+y: x+y summing to a constant is a loop invariant, and at termination x is 0 and y is b.

It's a little more difficult to see that the loop will necessarily terminate.
New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.
Also I did take the excuse to plug it (the optimized llvm ir) into Alive:
https://alive2.llvm.org/ce/#g:!((g:!((g:!((h:codeEditor,i:(f...
I was able to get it working with unrolling and narrower integers:
https://alive2.llvm.org/ce/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF...
Huh? What's that supposed to mean?
For example, you can use this make the compiler "prove" the Collatz Conjecture:
https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(file...
let a = expr
let b = expr2
if (a || b) { return true; }

is the compiler allowed to lazily compute this if it is indeed faster to do it that way? Or declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one is, I like to do a memcpy every single time, even though it might not be used or might be overwritten by other memcpys. Is the compiler smart enough to not perform those and reorder my program so that only the last relevant memcpy is performed?
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
A bad example recently is something like
const S * s = ...;
let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.
[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.
Did you try using bitwise and (&), or a local for the struct? The short-circuiting behaviour of the logical means that if `s->a != a`, `s->b` must not be dereferenced, so the compiler cannot turn this into a SIMD mask operation, because it behaves differently.
Generally compilers are pretty smart these days, and I find that more often than not if they miss an "obvious" optimization it's because there's a cornercase where it behaves differently from the code I wrote.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers through some more involved analysis?
unsigned mult(unsigned x, unsigned y) {
    unsigned y0 = y;
    while (x--) y = add_v1(y, y0);
    return y;
}
optimizes to:

mult(unsigned int, unsigned int):
madd w0, w1, w0, w1
ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)