Top
Best
New

Posted by HeliumHydride 13 hours ago

You can't fool the optimizer(xania.org)
226 points | 136 comments
stabbles 12 hours ago|
For people who enjoy these blogs, you would definitely like the Julia REPL as well. I used to play with this a lot to discover compiler things.

For example:

    $ julia
    julia> function f(n)
             total = 0
             for x in 1:n
               total += x
             end
             return total
           end
    julia> @code_native f(10)
        ...
        sub    x9, x0, #2
        mul    x10, x8, x9
        umulh    x8, x8, x9
        extr    x8, x8, x10, #1
        add    x8, x8, x0, lsl #1
        sub    x0, x8, #1
        ret
        ...
it shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.

lifthrasiir 11 hours ago||
This and add_v3 in the OP fall into the general class of Scalar Evolution optimizations (SCEV). LLVM for example is able to handle almost all Brainfuck loops in practice---add_v3 indeed corresponds to a Brainfuck loop `[->+<]`---, and its SCEV implementation is truly massive: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Anal...
Someone 11 hours ago||
LLVM can do more complex sums, too. See https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geo...
abainbridge 8 hours ago||
The examples are fun, but rather than yet another article saying how amazing optimizing compilers are (they are, I already know), I'd probably benefit more from an article explaining when obvious optimizations are missed and what to do about it.

Some boring examples I've just thought of...

eg 1:

    int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because the that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.

eg 2:

    int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

eg 3:

    int foo(char const *s) {
      if (strlen(s) < 3) return 0;
      if (strcmp(s, "hello") == 0)
        return 1;
      return 0;
    }
This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.
abbeyj 7 hours ago||
> We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.

abainbridge 6 hours ago||
Fair point. It doesn't do the optimization if you ask to optimize for size '/Os' either.
senfiaj 8 hours ago|||
Yeah, this one as well:

  bool is_divisible_by_6(int x) {
      return x % 2 == 0 && x % 3 == 0;
  }

  bool is_divisible_by_6_optimal(int x) {
      return x % 6 == 0;
  }
Mathematically x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.
Koffiepoeder 4 hours ago|||
Mhm, this is one of these cases I'd prefer a benchmark to be sure. Checking %2 is very performant and actually just a single bit check. I can also imagine some cpu's having a special code path for %3. In practice I would not be surprised that the double operand is actually faster than the %6. I am mobile at this moment, so not able to verify.
Bratmon 3 hours ago||
But if % 2 && % 3 is better, then isn't there still a missed optimization in this example?
abainbridge 8 hours ago|||
Nice.

Is the best way to think of optimizing compilers, "I wonder if someone hand wrote a rule for the optimizer that fits this case"?

stouset 7 hours ago||
Probably not, because a lot of the power of optimizing compilers comes from composing optimizations. Also a lot comes from being able to rule out undefined behavior.
SkiFire13 4 hours ago|||
> int bar(int num) { return num / 2; } > > Doesn't get optimized to a single shift right, because the that won't work if num is negative.

Nit: some might think the reason this doesn't work is because the shift would "move" the sign bit, but actually arithmetic shifting instructions exist for this exact purpose. The reason they are not enough is because shifting provides the wrong kind of division rounding for negative numbers. This can however be fixed up by adding 1 if the number is negative (this can be done with an additional logical shift for moving the sign bit to the rightmost position and an addition).

commandlinefan 8 hours ago|||
> won't work if num is negative

I remember reading (although I can't find it now) a great analysis of all the optimizations that Javascript compilers _can't_ do because of the existence of the "eval" instruction.

cibyr 2 hours ago|||
The extra fun thing about this is that eval has different semantics if it's assigned to a different name, in order to allow JavaScript implementations to apply extra optimizations to code that doesn't call a function literally named "eval": https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Andy Wingo (of course!) has a good explanation of this: https://wingolog.org/archives/2012/01/12/javascript-eval-con...

astrange 7 hours ago||||
A JIT can do any optimization it wants, as long as it can deoptimize if it turns out it was wrong.
adrianN 6 hours ago||
You also want to prove that the „optimization“ doesn’t make things slower.
LoganDark 4 hours ago|||
Could this perhaps be it? https://janvitek.org/pubs/ecoop11.pdf
stabbles 8 hours ago|||
The compiler doesn't know the implementation of strlen, it only has its header. At runtime it might be different than at compile time (e.g. LD_PRELOAD=...). For this to be optimized you need link time optimization.
dzaima 8 hours ago|||
Both clang and gcc do optimize it though - https://godbolt.org/z/cGG9dq756. You need -fno-builtin or similar to get them to not.
valleyer 7 hours ago||||
No, the compiler may assume that the behavior of standard library functions is standards-conformant.
SpaceManNabs 2 hours ago||
> No, the compiler may assume that the behavior of standard library functions is standards-conformant.

Why?

What happens if it isn't?

MindSpunk 1 hour ago|||
Sadness. Tons of functions from the standard library are special cases by the compiler. The compiler can elide malloc calls if it can prove it doesn't need them, even though strictly speaking malloc has side effects by changing the heap state. Just not useful side effects.

memcpy will get transformed and inlined for small copies all the time.

DannyBee 2 hours ago|||
Because that's what it means to compile a specific dialect of a specific programming language?

If you want a dialect where they aren't allowed to assume that you would have to make your own

abainbridge 8 hours ago|||
Hmmm, really? Switching compiler seems sufficient: https://godbolt.org/z/xnevov5d7

BTW, the case of it not optimizing was MSVC targetting Windows (which doesn't support LD_PRELOAD, but maybe has something similar?).

dzaima 8 hours ago|||
> I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.

For that you could at least argue that if the libc's strlen is faster than strcmp, that improves performance if the programmer expects the function to be usually called with a short input.

That said, changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is completely equivalent to just `return 0;`.

abainbridge 8 hours ago|||
eg 4:

   int foo(char const *s) {
     if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
        return 1;
     return 0;
   }
The outputs 4 cmp instructions here, even though I'd have thought 1 was sufficient. https://godbolt.org/z/hqMnbrnKe
ynik 7 hours ago|||
`s[0] == 'h'` isn't sufficient to guarantee that `s[3]` can be access without a segfault, so the compiler is not allowed to perform this optimization.

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb

(also note you got the endianness wrong in your hand-optimized version)

zrm 3 hours ago|||
> If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen

But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.

Denvercoder9 3 hours ago||
Yes, so that's why the compiler can't and doesn't emit the optimized version if you write the short circuited version - because it behaves differently for short strings.
abainbridge 7 hours ago||||
Ooo, I'd never thought of using & like that. Interesting.

> (also note you got the endianness wrong in your hand-optimized version) Doh :-)

rdc12 3 hours ago||
Matt Godbolt's talk on ray tracers, shows how effective that change can be. Think it was that talk anyway.

https://www.youtube.com/watch?v=HG6c4Kwbv4I

NooneAtAll3 6 hours ago|||
good ol' short circuiting
abbeyj 7 hours ago||||
If you want to tell the compiler not to worry about the possible buffer overrun then you can try `int foo(char const s[static 4])`. Or use `&` instead of `&&` to ensure that there is no short-circuiting, e.g. `if ((s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l'))` Either way, this then compiles down to a single 32-bit comparison.

Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.

The code in `bar` is probably not a good idea on targets that don't like unaligned loads.

raphlinus 7 hours ago|||
That's because the 1 instruction variant may read past the end of an array. Let's say s is a single null byte at 0x2000fff, for example (and that memory is only mapped through 0x2001000); the function as written is fine, but the optimized version may page fault.
abainbridge 7 hours ago||
Ah, yes, good point. I think this is a nice example of "I didn't notice I needed to tell the compiler a thing I know so it can optimize".
WalterBright 6 hours ago||
`s` may be null, and so the strlen may seg fault.
pianom4n 5 hours ago|||
But that's undefined behavior, so the compiler is free to ignore that possibility.
flqn 5 hours ago||||
Since the optimiser is allowed to assume you're not invoking UB, and strlen of null is UB, I don't believe that it would consider that case when optimising this function.
jagged-chisel 12 hours ago||
I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.
adrianN 12 hours ago||
This is decent advice in general, but it pays off to try and express your logic in a way that is machine friendly. That mostly means thinking carefully about how you organize the data you work with. Optimizers generally don't change data structures or memory layout but that can make orders of magnitude difference in the performance of your program. It is also often difficult to refactor later.
amiga386 12 hours ago|||
I find the same too. I find gcc and clang can inline functions, but can't decide to break apart a struct used only among those inlined functions and make every struct member a local variable, and then decide that one or more of those local variables should be allocated as a register for the full lifetime of the function, rather than spill onto the local stack.

So if you use a messy solution where something that should be a struct and operated on with functions, is actually just a pile of local variables within a single function, and you use macros operating on local variables instead of inlineable functions operating on structs, you get massively better performance.

e.g.

    /* slower */
    struct foo { uint32_t a,b,c,d,e,f,g,h; }
    uint32_t do_thing(struct foo *foo) {
        return foo->a ^ foo->b ^ foo->c ^ foo->d;
    }
    void blah() {
        struct foo x;
        for (...) {
            x.e = do_thing(&x) ^ x.f;
            ...
        }
    }

    /* faster */
    #define DO_THING (a^b^c^d)
    void blah() {
        uint32_t a,b,c,d,e,f,g,h;
        for (...) {
            e = DO_THING ^ f;
            ...
        }
    }
torginus 11 hours ago|||
The nice thing about godbolt is that it can show you that clang not only can but do it in theory but also does it in practice:

https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...

The ability of turning stack allocated variables into locals(which can be then put in registers) is one of the most important passes of modern compilers.

Since compilers use SSA, where locals are immutable while lots of languages, like C have mutable variables, some compiler frontends put locals onto the stack, and let the compiler figure out what can be put into locals and how.

amiga386 10 hours ago||
That's really good; clearly I haven't looked at more recent versions. The magic seems to happen in your link at SROAPass, "Scalar Replacement Of Aggregates". Very cool!

According to https://docs.hdoc.io/hdoc/llvm-project/r2E8025E445BE9CEE.htm...

> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.

That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.

actionfromafar 11 hours ago|||
I guess the chances of the compiler doing something smart increases with link-time optimizations and when keeping as much as possible inside the same "compilation unit". (In practice in the same source file.)
lou1306 12 hours ago|||
To make a more specific example, if you malloc()/free() within a loop, it's unlikely that the compiler will fix that for you. However, moving those calls outside of the loop (plus maybe add some realloc()s within, only if needed) is probably going to perform better.
adrianN 10 hours ago||
That is something that can be easily found and usually fixed with trivial profiling. I'm more talking about data locality instead of pointer chasing. Once you set up a pointer-chasing data infrastructure changing that means rewriting most of your application.
jaccola 12 hours ago|||
I would take it one step further, often trying to eke out performance gains with clever tricks can hurt performance by causing you to "miss the forest for the trees".

I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.

By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.

Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.

qsort 12 hours ago|||
> I always code with the mindset “the compiler is smarter than me.”

Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)

More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.

manbitesdog 11 hours ago||
To add to this, the low-level constraints also make this assumption noisy, no matter how smart the compiler is. On the CPython case, if you do `dis.dis('DAY = 24 * 60 * 60)` you will see that constant folding nicely converts it to `LOAD_CONST 86400`. However, if you try `dis.dis('ATOMS_IN_THE_WORLD = 10*50')` you will get LOAD_CONST 10, LOAD_CONST 50, BINARY_OP **.
moregrist 6 hours ago|||
There are optimizations that a compiler can perform; usually these are code transformations. Modern optimizing compilers usually get these right.

The optimizations that tend to have the most impact involve changes to the algorithm or data layout. Most compilers won’t do things like add a hash table to make a lookup O(1) or rearrange an array of structures to be a structure of arrays for better data locality. Coding with an eye for these optimizations is still a very good use of your time.

stonemetal12 10 hours ago|||
I go with "You are responsible for the algorithms, it is responsible for the code micro optimizations". The compiler can't optimize you out of an SQL N+1 situation, that is on me to avoid, but it is better than me at loop unrolling.
wavemode 10 hours ago|||
This is very often true when your data is sitting right there on the stack.

Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.

mamcx 9 hours ago|||
> “the compiler is smarter than me.”

This is true, but it also means "the compiler IS made for someone median smart, that now knows the machine".

It works great for basic, simple, common code, and for code that is made with care for data structures.

A total mess of code is another story.

P.D: is similar to the query optimizers, that neither can outrun a terrible made schema and queries

flohofwoe 11 hours ago|||
> I always code with the mindset “the compiler is smarter than me.”

...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':

  add w8,w1,w0
  cmp w0,#0
  cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)

A better mindset is "don't trust the compiler for code that's actually performance sensitive".

You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.

sumtechguy 11 hours ago||
I see the msvc arm compiler has not improved much in 20 years. The msvc arm was pretty odd when we used it in ~2003. We did not trust it at all. Think we had to get 4 or so compiler fixes out of MS for that project plus 3 or 4 library fixes. The x86 one was pretty solid. We were targeting 4 different CPU platforms at the same time so we could find things like that decently quickly. Most of the the time it was something we did that was weird. But even then we would find them. That one looks like maybe the optimizer back filled a nop slot?
wat10000 10 hours ago|||
I would modify this a bit. Someone with decent computer architecture knowledge, tools, and time can generally do better than the compiler. But you generally won't, because you have a lot of other things to think about. So I'd state this as, "the compiler is more diligent and consistent than me." It's not so much that it can spot a for loop that's equivalent to a single add, but that it will spot it just about every time, so you don't have to worry about it.
IshKebab 12 hours ago|||
The fact that compilers are smart isn't an excuse to not think about performance at all. They can't change your program architecture, algorithms, memory access patterns, etc.

You can mostly not think about super low level integer manipulation stuff though.

ErroneousBosh 12 hours ago|||
You say that, but I was able to reduce the code size of some avr8 stuff I was working on by removing a whole bunch of instructions that zero out registers and then shift a value around. I don't it to literally shift the top byte 24 bits to the right and zero out the upper 24 bits, I just need it to pass the value in the top 8 bits direct to the next operation.

I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.

tonyhart7 12 hours ago||
also not all software need optimization to the bone

pareto principle like always, dont need the best but good enough

not every company is google level anyway

WalterBright 5 hours ago||
There are general optimizations, based on DFA (Data Flow Analysis). These recognize things like loops, loop invariants, dead code, copy propagation, constant propagation, common subexpressions, etc.

Then, there are is a (very long) list of checks for specific patterns and replacing them with shorter sequences of code, things like recognizing the pattern of bswap and replacing it with a bswap instruction. There's no end to adding patterns to check for.

DannyBee 2 hours ago|
This is true but there is actually an end ;)

There are limits on provable equivalence in the first place. things like equality saturation also try and do a better job of formalizing equivalence based rewrites.

CodeArtisan 8 hours ago||
Recursive Popcount:

    unsigned int popcount(unsigned int n) 
    {
        return (n &= n - 1u) ? (1u  + popcount(n)) : 0u;
    }
Clang 21.1 x64:

    popcount:
            mov     eax, -1
    .LBB0_1:
            lea     ecx, [rdi - 1]
            inc     eax
            and     ecx, edi
            mov     edi, ecx
            jne     .LBB0_1
            ret
GCC 15.2:

    popcount:
            blsr    edi, edi
            popcnt  eax, edi
            ret
Both compiled with -O3 -march=znver5
pbsd 4 hours ago|
Because the function is not quite correct. It should be

    return n ? (1u  + popcount(n & n - 1u)) : 0u;
which both Clang and GCC promptly optimize to a single popcnt.
Scene_Cast2 12 hours ago||
This post assumes C/C++ style business logic code.

Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).

I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.

layer8 9 hours ago|
I don’t disagree, but profiling also won’t help you with death by a thousand indirections.
Findecanor 9 hours ago||
I'm wondering how the compiler optimised add_v3() and add_v4() though.

Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers them through some more involved analysis?

fooyc 8 hours ago||
add_v3() is the result of induction variable simplification: https://llvm.org/doxygen/IndVarSimplify_8cpp_source.html
j2kun 6 hours ago||
Scalar Evolution is one way loops can be simplified
anon-3988 10 hours ago||
What I am curious about is, is the compiler smart enough to be lazy with computation and or variables? For example consider:

let a = expr let b = expr2

if (a || b) { return true; }

is the compiler allowed to lazily compute this if it is indeed faster to do that way? Or declaring a bunch of variables that may or may not be used in all of the branches. Is the compiler smart enough to only compute them whenever it is necessary? AFAIK this is now allowed in C-like languages. Things have to materialize. Another one is, I like to do memcpy every single time eventhough it might not even be used or overwritten by other memcpys. Is the compiler smart enough to not perform those and reorder my program so that only the last relevant memcpy is performed?

A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like t write code in consistent and simple ways but I need compilers to be much smarter than it is today.

A bad example recently is something like

const S * s =;

let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;

if (s->a == a && s->b == b /* etc */ ) { return true; }

It did not turn all of this into a SIMD mask or something like that.

jcranmer 9 hours ago||
> Is the compiler smart enough to only compute them whenever it is necessary?

This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.

[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.

Denvercoder9 3 hours ago||
> It did not turn all of this into a SIMD mask or something like that.

Did you try using bitwise and (&), or a local for the struct? The short-circuiting behaviour of the logical means that if `s->a != a`, `s->b` must not be dereferenced, so the compiler cannot turn this into a SIMD mask operation, because it behaves differently.

Generally compilers are pretty smart these days, and I find that more often than not if they miss an "obvious" optimization it's because there's a cornercase where it behaves differently from the code I wrote.

matja 11 hours ago||
You can fool the optimizer, but you have to work harder to do so:

    unsigned add(unsigned x, unsigned y) {
        unsigned a, b;
        do {
            a = x & y;
            b = x ^ y;
            x = a << 1;
            y = b;
        } while (a);
        return b;
    }
becomes (with armv8-a clang 21.1.0 -O3) :

    add(unsigned int, unsigned int):
    .LBB0_1:
            ands    w8, w0, w1
            eor     w1, w0, w1
            lsl     w0, w8, #1
            b.ne    .LBB0_1
            mov     w0, w1
            ret
thaumasiotes 10 hours ago|
Since I had to think about it:

    unsigned add(unsigned x, unsigned y) {
        unsigned a, b;
        do {
            a = x & y;   /* every position where addition will generate a carry */
            b = x ^ y;   /* the addition, with no carries */
            x = a << 1;  /* the carries */
            y = b;
        /* if there were any carries, repeat the loop */
        } while (a);
        return b;
    }
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must be equal to x+y. x+y summing to a constant is a loop invariant, and at termination x is 0 and y is b.

It's a little more difficult to see that the loop will necessarily terminate.

New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.

gfaster 7 hours ago||
In C, I'm pretty confident the loop is defined by the standard to terminate.

Also I did take the excuse to plug it (the optimized llvm ir) into Alive:

https://alive2.llvm.org/ce/#g:!((g:!((g:!((h:codeEditor,i:(f...

dzaima 6 hours ago||
Alive2 does not handle loops; don't know what exactly it does by default, but changing the `shl i32 %and, 1` to `shl i32 %and, 2` has it still report the transformation as valid. You can add `--src-unroll=2` for it to check up to two loop iterations, which does catch such an error (and does still report the original as valid), but of course that's quite limited. (maybe the default is like `--src-unroll=1`?)
gfaster 3 hours ago||
Oh wow nice catch - I was not at all familiar with the limitations. I would've hoped for a warning there, but I suppose it is a research project.

I was able to get it working with unrolling and narrower integers:

https://alive2.llvm.org/ce/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF...

senfiaj 9 hours ago|
I wonder if compilers do multiple passes on the intermediate code in order to optimize / simplify it. For example, during each pass the optimizer searches some known harcoded patterns and replaces them with something else and repeats until no possible improvement is found.

Also optimizers have a limit, they can't reason as abstractly as humans, for example:

  bool is_divisible_by_6(int x) {
      return x % 2 == 0 && x % 3 == 0;
  }

  bool is_divisible_by_6_optimal(int x) {
      return x % 6 == 0;
  }
I tried with both gcc and clang, the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.

The morale is that you still have to optimize algorithms (O notation) and math operations / expressions.

jakobnissen 9 hours ago||
They do, and the order of the passes matter. Sometimes, optimizations are missed because they require a certain order of passes that is different from the one your compiler uses.

On higher optimization levels, many passes occur multiple times. However, as far as I know, compilers don't repeatedly run passes until they've reached an optimum. Instead, they run a fixed series of passes. I don't know why, maybe someone can chime in.

titzer 3 hours ago||
It's a long-standing problem in compilers, often referred to as the "phase ordering problem". In general, forward dataflow optimizations can be combined if they are monotonic (meaning, never make the code worse, or at least, never undo a previous step. It's possible to run forward dataflow problems together repeatedly to a fixpoint. In TurboFan a general graph reduction algorithm is [1] instantiated with a number of reducers, and then a fixpoint is run. The technique of trying to combine multiple passes has been tried a number of times. What doesn't seem so obvious is how to run optimizations that are not traditional forward dataflow problems or are indeed backward dataflow problems (like DCE) together with other transformations. Generally compilers get tuned by running them on lots of different kinds of code, often benchmarks, and then tinkering with the order of passes and other heuristics like loop unroll factors, thresholds for inlining, etc, and seeing what works best.

[1]was? TurboFan seems to have splintered into a number of pieces being reused in different ways these days

windward 9 hours ago|||
Those aren't isomorphic. The C spec says `is_divisible_by_6` short-circuits. You don't want the compiler optimising away null checks.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

6.5.13, semantics

senfiaj 8 hours ago|||
So you claim that the compiler "knows about this but doesn't optimize because of some safety measures"? As far as I remember, compilers don't optimize math expressions / brackets, probably because the order of operations might affect the precision of ints/floats, also because of complexity.

But my example is trivial (x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int), yet the compiler produced different outputs (the outputs are different and most likely is_divisible_by_6 is slower). Also what null (you mean 0?) checks are you talking about? The denominator is not null/0. Regardless, my point about not over relying on compiler optimization (especially for macro algorithms (O notation) and math expressions) remains valid.

jcranmer 8 hours ago||||
x % 3 == 0 is an expression without side effects (the only cases that trap on a % operator are x % 0 and INT_MIN % -1), and thus the compiler is free to speculate the expression, allowing the comparison to be converted to (x % 2 == 0) & (x % 3 == 0).

Yes, compilers will tend to convert && and || to non-short-circuiting operations when able, so as to avoid control flow.

Sohcahtoa82 8 hours ago||||
Any number divisible by 6 will also be divisible by both 2 and 3 since 6 is divisible by 2 and 3, so the short-circuiting is inconsequential. They're bare ints, not pointers, so null isn't an issue.

So how are they not isomorphic?

dzaima 8 hours ago|||
That only matters for things with side-effects; and changing the `&&` to `&` doesn't get it to optimize anyway.

You can check - copy the LLVM IR from https://godbolt.org/z/EMPr4Yc84 into https://alive2.llvm.org/ce/ and it'll tell you that it is a valid refinement as far as compiler optimization goes.

ramon156 9 hours ago||
I don't know enough about ASM. Are u saying the first one is more optimal because it is faster or because it uses less instructions? Would this reflect a real world use case? Do any other compilers (e.g. V8) optimize modulo's into something else?
senfiaj 8 hours ago||
The compiler didn't recognize that x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values. In theory a compiler could detect that and generate identical code for both functions, but it isn't done because this case is "niche" despite being trivial. My point is not to over rely on optimizer for math expressions and algorithms.
More comments...