Posted by HeliumHydride 13 hours ago
For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
        sub   x9, x0, #2
        mul   x10, x8, x9
        umulh x8, x8, x9
        extr  x8, x8, x10, #1
        add   x8, x8, x0, lsl #1
        sub   x0, x8, #1
        ret
...
it shows this with nice colors right in the REPL. In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.

eg 2:
int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

eg 3:
int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello", and 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.

I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}

bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
Mathematically, x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values, but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.

Is the best way to think of optimizing compilers, "I wonder if someone hand wrote a rule for the optimizer that fits this case"?
Nit: some might think the reason this doesn't work is because the shift would "move" the sign bit, but arithmetic shift instructions exist for exactly this purpose. The reason they are not enough is that arithmetic shifting rounds toward negative infinity, which is the wrong rounding for C's truncating division of negative numbers. This can, however, be fixed by adding 1 when the number is negative (which can be done with an extra logical shift that moves the sign bit to the rightmost position, followed by an addition).
I remember reading (although I can't find it now) a great analysis of all the optimizations that Javascript compilers _can't_ do because of the existence of the "eval" instruction.
Andy Wingo (of course!) has a good explanation of this: https://wingolog.org/archives/2012/01/12/javascript-eval-con...
Why?
What happens if it isn't?
memcpy will get transformed and inlined for small copies all the time.
If you want a dialect where they aren't allowed to assume that, you would have to make your own.
BTW, the case of it not optimizing was MSVC targeting Windows (which doesn't support LD_PRELOAD, but maybe has something similar?).
For that one you could at least argue that if libc's strlen is faster than strcmp, it improves performance when the programmer expects the function to usually be called with short input.
That said, changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is completely equivalent to just `return 0;`.
int foo(char const *s) {
    if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
        return 1;
    return 0;
}
This outputs 4 cmp instructions, even though I'd have thought 1 was sufficient. https://godbolt.org/z/hqMnbrnKe

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb
(also note you got the endianness wrong in your hand-optimized version)
But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.
> (also note you got the endianness wrong in your hand-optimized version) Doh :-)
Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.
The code in `bar` is probably not a good idea on targets that don't like unaligned loads.
So if you use a messy solution, where something that should be a struct operated on by functions is instead a pile of local variables within a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
e.g.
/* slower */
struct foo { uint32_t a,b,c,d,e,f,g,h; };

uint32_t do_thing(struct foo *foo) {
    return foo->a ^ foo->b ^ foo->c ^ foo->d;
}

void blah() {
    struct foo x;
    for (...) {
        x.e = do_thing(&x) ^ x.f;
        ...
    }
}

/* faster */
#define DO_THING (a^b^c^d)

void blah() {
    uint32_t a,b,c,d,e,f,g,h;
    for (...) {
        e = DO_THING ^ f;
        ...
    }
}

https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
The ability to turn stack-allocated variables into locals (which can then be put in registers) is one of the most important passes in modern compilers.
Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put locals onto the stack and let the compiler figure out what can be promoted to locals and how.
According to https://docs.hdoc.io/hdoc/llvm-project/r2E8025E445BE9CEE.htm...
> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.
That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.
I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
The optimizations that tend to have the most impact involve changes to the algorithm or data layout. Most compilers won’t do things like add a hash table to make a lookup O(1) or rearrange an array of structures to be a structure of arrays for better data locality. Coding with an eye for these optimizations is still a very good use of your time.
Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.
This is true, but it also means "the compiler IS made for someone median-smart, who now knows the machine".
It works great for basic, simple, common code, and for code that is made with care for data structures.
A total mess of code is another story.
P.S.: it's similar to query optimizers, which likewise can't outrun a terribly designed schema and queries.
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
        add    w8,w1,w0
        cmp    w0,#0
        cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)

A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
You can mostly not think about super low level integer manipulation stuff though.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
Pareto principle, as always: you don't need the best, just good enough.
Not every company is Google-level anyway.
Then, there is a (very long) list of checks for specific patterns that replace them with shorter sequences of code: things like recognizing the byte-swap pattern and replacing it with a bswap instruction. There's no end to adding patterns to check for.
There are limits on provable equivalence in the first place. Things like equality saturation also try to do a better job of formalizing equivalence-based rewrites.
unsigned int popcount(unsigned int n)
{
    return (n &= n - 1u) ? (1u + popcount(n)) : 0u;
}
Clang 21.1 x64:

popcount:
        mov eax, -1
.LBB0_1:
        lea ecx, [rdi - 1]
        inc eax
        and ecx, edi
        mov edi, ecx
        jne .LBB0_1
        ret
GCC 15.2:

popcount:
        blsr   edi, edi
        popcnt eax, edi
        ret
Both compiled with -O3 -march=znver5.

return n ? (1u + popcount(n & n - 1u)) : 0u;

which both Clang and GCC promptly optimize to a single popcnt.

Anything HPC will benefit from thinking about how things map onto hardware (or, in the case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce them through some more involved analysis?
let a = expr
let b = expr2
if (a || b) { return true; }
is the compiler allowed to lazily compute this if it is indeed faster to do it that way? Or, declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one: I like to do a memcpy every single time, even though the result might never be used, or might be overwritten by other memcpys. Is the compiler smart enough to skip those, and to reorder my program so that only the last relevant memcpy is performed?
A lot of the time, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
A bad example recently is something like
const S *s = ...;
let a = constant; let b = constant; let c = constant; let d = constant;
let e = constant; let f = constant; let g = constant; let h = constant;
let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */) { return true; }
It did not turn all of this into a SIMD mask or something like that.
This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.
[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.
Did you try using bitwise and (&), or a local for the struct? The short-circuiting behaviour of the logical means that if `s->a != a`, `s->b` must not be dereferenced, so the compiler cannot turn this into a SIMD mask operation, because it behaves differently.
Generally compilers are pretty smart these days, and I find that more often than not if they miss an "obvious" optimization it's because there's a cornercase where it behaves differently from the code I wrote.
unsigned add(unsigned x, unsigned y) {
unsigned a, b;
do {
a = x & y;
b = x ^ y;
x = a << 1;
y = b;
} while (a);
return b;
}
becomes (with armv8-a clang 21.1.0 -O3):

add(unsigned int, unsigned int):
.LBB0_1:
        ands w8, w0, w1
        eor  w1, w0, w1
        lsl  w0, w8, #1
        b.ne .LBB0_1
        mov  w0, w1
        ret

unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;  /* every position where addition will generate a carry */
        b = x ^ y;  /* the addition, with no carries */
        x = a << 1; /* the carries */
        y = b;
        /* if there were any carries, repeat the loop */
    } while (a);
    return b;
}
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must equal the original x+y: the sum x+y is a loop invariant, and at termination x is 0 and y is b.

It's a little more difficult to see that the loop will necessarily terminate.
New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.
Also I did take the excuse to plug it (the optimized llvm ir) into Alive:
https://alive2.llvm.org/ce/#g:!((g:!((g:!((h:codeEditor,i:(f...
I was able to get it working with unrolling and narrower integers:
https://alive2.llvm.org/ce/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF...
Also optimizers have a limit, they can't reason as abstractly as humans, for example:
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}

bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
I tried with both gcc and clang; the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.

The moral is that you still have to optimize algorithms (big-O) and math operations / expressions yourself.
On higher optimization levels, many passes occur multiple times. However, as far as I know, compilers don't repeatedly run passes until they've reached an optimum. Instead, they run a fixed series of passes. I don't know why, maybe someone can chime in.
[1]was? TurboFan seems to have splintered into a number of pieces being reused in different ways these days
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
6.5.13, semantics
But my example is trivial (x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int), yet the compiler produced different outputs (and most likely is_divisible_by_6 is slower). Also, what null checks (you mean 0?) are you talking about? The denominator is not null/0. Regardless, my point about not over-relying on compiler optimization (especially for macro algorithms (big-O) and math expressions) remains valid.
Yes, compilers will tend to convert && and || to non-short-circuiting operations when able, so as to avoid control flow.
So how are they not isomorphic?
You can check - copy the LLVM IR from https://godbolt.org/z/EMPr4Yc84 into https://alive2.llvm.org/ce/ and it'll tell you that it is a valid refinement as far as compiler optimization goes.