For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
julia> function f(n)
           total = 0
           for x in 1:n
               Base.donotdelete(x)
               total += x
           end
           return total
       end
will keep the loop.

Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
This doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.

eg 2:
int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

eg 3:
int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello", 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.

I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}

bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
Mathematically, x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values, but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.

is_divisible_by_6(int):
test dil, 1
jne .LBB0_1
imul eax, edi, -1431655765
add eax, 715827882
cmp eax, 1431655765
setb al
ret
.LBB0_1:
xor eax, eax
ret
is_divisible_by_6_optimal(int):
imul eax, edi, -1431655765
add eax, 715827882
ror eax
cmp eax, 715827883
setb al
ret
By themselves, the mod 6 and mod 3 operations are almost identical -- in both cases the compiler used the reciprocal trick to transform the modulo into an imul+add+cmp, the only practical difference being that the %6 has one extra bit shift.

But note the branch in the first function! The original code uses the && operator, which is short-circuiting -- so from the compiler's perspective, perhaps the programmer expects that x % 2 will usually be false, and so we can skip the expensive %3 most of the time. The "suboptimal" version is potentially quite a bit faster in the best case, but also potentially quite a bit slower in the worst case (since that branch could be mispredicted). There's not really a way for the compiler to know which version is "better" without more context, so deferring to "what the programmer wrote" makes sense.
That being said, I don't know that this is really a case of "the compiler knows best" rather than just not having that kind of optimization implemented. If we write 'x % 6 && x % 3', the compiler pointlessly generates both operations. And GCC generates branchless code for 'is_divisible_by_6', which is just worse than 'is_divisible_by_6_optimal' in all cases.
bool is_divisible_by_15(int x) {
    return x % 3 == 0 && x % 5 == 0;
}

bool is_divisible_by_15_optimal(int x) {
    return x % 15 == 0;
}
is_divisible_by_15 still has a branch, while is_divisible_by_15_optimal does not.

is_divisible_by_15(int):
imul eax, edi, -1431655765
add eax, 715827882
cmp eax, 1431655764
jbe .LBB0_2
xor eax, eax
ret
.LBB0_2:
imul eax, edi, -858993459
add eax, 429496729
cmp eax, 858993459
setb al
ret
is_divisible_by_15_optimal(int):
imul eax, edi, -286331153
add eax, 143165576
cmp eax, 286331153
setb al
ret

Is the best way to think of optimizing compilers, "I wonder if someone hand wrote a rule for the optimizer that fits this case"?
Nit: some might think the reason this doesn't work is that the shift would "move" the sign bit, but arithmetic shift instructions exist for exactly this purpose. They are not enough because shifting rounds division the wrong way for negative numbers. This can however be fixed up by adding 1 if the number is negative (which can be done with an additional logical shift to move the sign bit to the rightmost position, plus an addition).
I remember reading (although I can't find it now) a great analysis of all the optimizations that Javascript compilers _can't_ do because of the existence of the "eval" instruction.
Andy Wingo (of course!) has a good explanation of this: https://wingolog.org/archives/2012/01/12/javascript-eval-con...
Why?
What happens if it isn't?
memcpy will get transformed and inlined for small copies all the time.
This is the case, for example, on macOS, FreeBSD, and Linux.
Cute username, BTW.
> Cute username, BTW.
Thanks, I was too lazy to think of a real name, so this is the timestamp at which I created the account.
§6.4.2.1: "If the program defines a reserved identifier [...] the behavior is undefined."
But the orthodox answer to such questions is that demons fly out of your nose.
If you want a dialect where they aren't allowed to assume that you would have to make your own
BTW, the case of it not optimizing was MSVC targeting Windows (which doesn't support LD_PRELOAD, but maybe has something similar?).
For that you could at least argue that if the libc's strlen is faster than strcmp, that improves performance if the programmer expects the function to be usually called with a short input.
That said, after changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is then completely equivalent to just `return 0;`.
You've misrepresented the situation. Turn up the optimiser to `/O2` and MSVC returns 5 directly, too.
> This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.
It's funny how sometimes operating at a higher level of abstraction allows the compiler to optimise the code better: https://godbolt.org/z/EYP5764Mv
In this, the string literal "hello" is lowered not merely into a static string, but a handful of integral immediates that are directly inline in the assembly, no label-dereferencing required, and the 'is equal to "hello"' test is cast as the result of some sign extends and a bitwise-xor.
Of course, one could argue that std::string_view::size() is statically available, but then my counter-argument is that C's zero-terminated strings are a massive pessimisation (which is why the compiler couldn't 'see' what we humans can), and should always be avoided.
int foo(char const *s) {
    if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
        return 1;
    return 0;
}
This outputs 4 cmp instructions, even though I'd have thought 1 was sufficient. https://godbolt.org/z/hqMnbrnKe

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb
(also note you got the endianness wrong in your hand-optimized version)
But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.
> (also note you got the endianness wrong in your hand-optimized version) Doh :-)
Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.
The code in `bar` is probably not a good idea on targets that don't like unaligned loads.
And that's what is wrong. This is the most unfriendly behavior towards the programmer.
If I program it to cause a null pointer seg fault, I expect a null pointer seg fault. If I program it to cause a twos complement overflow, I want a twos complement overflow.
So if you use a messy solution, where something that should be a struct operated on by functions is instead just a pile of local variables within a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
e.g.
/* slower */
struct foo { uint32_t a,b,c,d,e,f,g,h; };

uint32_t do_thing(struct foo *foo) {
    return foo->a ^ foo->b ^ foo->c ^ foo->d;
}

void blah() {
    struct foo x;
    for (...) {
        x.e = do_thing(&x) ^ x.f;
        ...
    }
}

/* faster */
#define DO_THING (a^b^c^d)

void blah() {
    uint32_t a,b,c,d,e,f,g,h;
    for (...) {
        e = DO_THING ^ f;
        ...
    }
}

https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
The ability to turn stack-allocated variables into SSA values (which can then be put in registers) is one of the most important passes in modern compilers.

Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put locals onto the stack and let the compiler figure out what can be promoted into registers and how.
According to https://docs.hdoc.io/hdoc/llvm-project/r2E8025E445BE9CEE.htm...
> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.
That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.
I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
I do not get that, I get LOAD_CONST 500.
Tested on: Python 3.9.3 MacOS (Apple provided), 3.13.3 (uv provided) MacOS and Linux, and 3.14.0 (uv provided) MacOS and Linux.
The optimizations that tend to have the most impact involve changes to the algorithm or data layout. Most compilers won’t do things like add a hash table to make a lookup O(1) or rearrange an array of structures to be a structure of arrays for better data locality. Coding with an eye for these optimizations is still a very good use of your time.
You can mostly not think about super low level integer manipulation stuff though.
Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)

A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
This is true, but it also means "the compiler IS made for someone median smart, who knows the machine".
It works great for basic, simple, common code, and for code that is made with care for data structures.
A total mess of code is another story.
P.S.: it's similar to query optimizers, which likewise can't outrun a terribly made schema and queries.
That mindset will last until the first or second time you walk through the compiler's assembly-language output. GCC and LLVM can find some amazing optimizations, but they also consistently miss very obvious optimization opportunities, often even on utterly trivial code.
That isn't to say that it's worthwhile to hand-optimize all your code. But the compiler is very much not smarter than you, and if you do need to squeeze performance out of the processor, you can do a lot better than the compiler does. A good first step is to look at the stupid shit the compiler has spat out and massage your source code until the shit stinks less.
Pareto principle, like always: you don't need the best, just good enough.

Not every company is Google-level anyway.
Then there is a (very long) list of checks for specific patterns that replace them with shorter sequences of code, things like recognizing a byte-swap pattern and replacing it with a bswap instruction. There's no end to adding patterns to check for.
There are limits on provable equivalence in the first place. Things like equality saturation also try to do a better job of formalizing equivalence-based rewrites.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
If you have a real-time system you care about the difference between real-time and not. But any given 1.2x factor is extremely unlikely to be the difference-maker.
> When I program some new part and I can avoid an indirection, why not do it? Less complexity, less (failure) state, better performance.
Well if it makes the code simpler then you should do it for that reason. But thinking about this kind of low-level performance detail at all will usually do more harm than good.
unsigned int popcount(unsigned int n)
{
    return (n &= n - 1u) ? (1u + popcount(n)) : 0u;
}
Clang 21.1 x64:

popcount:
mov eax, -1
.LBB0_1:
lea ecx, [rdi - 1]
inc eax
and ecx, edi
mov edi, ecx
jne .LBB0_1
ret
GCC 15.2:

popcount:
blsr edi, edi
popcnt eax, edi
ret
Both compiled with -O3 -march=znver5.

return n ? (1u + popcount(n & n - 1u)) : 0u;

which both Clang and GCC promptly optimize to a single popcnt.

unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;
}
becomes (with armv8-a clang 21.1.0 -O3):

add(unsigned int, unsigned int):
.LBB0_1:
ands w8, w0, w1
eor w1, w0, w1
lsl w0, w8, #1
b.ne .LBB0_1
mov w0, w1
ret

unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;  /* every position where addition will generate a carry */
        b = x ^ y;  /* the addition, with no carries */
        x = a << 1; /* the carries */
        y = b;
        /* if there were any carries, repeat the loop */
    } while (a);
    return b;
}
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must be equal to x+y: x+y summing to a constant is a loop invariant, and at termination x is 0 and y is b.

It's a little more difficult to see that the loop will necessarily terminate.
New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.
Also I did take the excuse to plug it (the optimized llvm ir) into Alive:
https://alive2.llvm.org/ce/#g:!((g:!((g:!((h:codeEditor,i:(f...
I was able to get it working with unrolling and narrower integers:
https://alive2.llvm.org/ce/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF...
Huh? What's that supposed to mean?
For example, you can use this make the compiler "prove" the Collatz Conjecture:
https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(file...
let a = expr
let b = expr2
if (a || b) { return true; }

is the compiler allowed to lazily compute this if it is indeed faster to do it that way? Or declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one is, I like to do a memcpy every single time, even though it might not be used or might be overwritten by other memcpys. Is the compiler smart enough to not perform those and reorder my program so that only the last relevant memcpy is performed?
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
A bad example recently is something like
const S * s = ...;
let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.
[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.
Did you try using bitwise and (&), or a local for the struct? The short-circuiting behaviour of the logical means that if `s->a != a`, `s->b` must not be dereferenced, so the compiler cannot turn this into a SIMD mask operation, because it behaves differently.
Generally compilers are pretty smart these days, and I find that more often than not if they miss an "obvious" optimization it's because there's a cornercase where it behaves differently from the code I wrote.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers through some more involved analysis?
unsigned mult(unsigned x, unsigned y) {
    unsigned y0 = y;
    while (x--) y = add_v1(y, y0);
    return y;
}
optimizes to:

mult(unsigned int, unsigned int):
madd w0, w1, w0, w1
ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)