Posted by HeliumHydride 13 hours ago
unsigned add_v5(unsigned x, unsigned y) {
    if (x == y) return 2 * x;
    return x + y;
}
Results in:
add_v5(unsigned int, unsigned int):
        lsl     w8, w0, #1        // w8 = x << 1  (2 * x)
        add     w9, w1, w0        // w9 = y + x
        cmp     w0, w1            // compare x with y
        csel    w0, w8, w9, eq    // w0 = (x == y) ? w8 : w9
        ret
(armv8-a clang 21.1.0 with O3)

If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?
Compilers are essentially massive towers of heuristics for which patterns to apply for optimization. We don't throw a general SMT solver at your code because that takes way too long to compile; instead, we look at examples of actual code and make reasonable efforts to improve code.
In the case of the incrementing in a loop, there is a general analysis called Scalar Evolution that recasts expressions as an affine expression of the canonical loop iteration variable (i.e., f(i), where i is 0 on the first loop iteration, 1 on the second, etc.). In the loop `while (x--) y++;`, the x variable [at the end of each loop iteration] can be rewritten as x = x₀ + -1*i, while the y variable is y = y₀ + 1*i. The loop trip count can be solved to an exact count, so we can replace the use of y outside the loop with y = y₀ + 1*trip count = y₀ + x, and then the loop itself is dead and can be deleted. These are all optimizations that happen to be quite useful in other contexts, so it's able to easily recognize this form of loop.
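For instance (my own sketch; the function name is made up), clang at -O2/-O3 reduces the whole loop below to a single add, via exactly that y = y₀ + trip count reasoning:

unsigned add_by_looping(unsigned x, unsigned y) {
    // SCEV sees y's final value as y0 + trip_count, and the trip count is x,
    // so the loop is replaced by `return y + x;` and then deleted as dead.
    while (x--) y++;
    return y;
}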
In the example you give, the compiler has to recognize the equivalence of two values conditional on control flow. The problem is that this is where you really start to run into "the time needed to optimize this isn't worth the gain you get in the end." Note that there are a lot of cases where you have conditional joins (these are "phis" in SSA optimizer parlance), most of which aren't meaningfully simplifiable, so you're cutting off the analysis for all but the simplest cases. At a guess, the simplification is looking for all of the input values to be of the same form, but 2 * x (which will actually be canonicalized to x << 1) is not the same form as x + y, so it's not going to check whether the condition being used to choose between the two values would be sufficient to make both operations return the same value. There are representations that make this problem much easier (e-graphs), but these are not the dominant form for optimizers at present.
For example, eliminating an extra load or store is often worth more than eliminating 100 extra arithmetic operations these days.
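To make the earlier "same form" point concrete (an illustrative C rewrite, not the actual IR): after canonicalization the function is essentially the following, and the two select arms have different shapes, so the simplification gives up.

unsigned add_v5_canonical(unsigned x, unsigned y) {
    // 2 * x has been canonicalized to x << 1, which is not the same shape
    // as x + y, so the select simplification doesn't try to prove them equal.
    return x == y ? (x << 1) : (x + y);
}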
I expect it’s because the former helps more in optimising real-world code than the latter. It’s not worth the LLVM developers' time to make the compiler better for programs that it won’t see in practice.
It’s not as if the compiler did nothing with that code, though. It replaced the multiplication by a left shift and removed the branch.
Basically the idea for most compilers is to do a series of transforms which incrementally improve the program (or at least make it worse in understood and reversible ways). To do this transform you need the optimizer to do the (not always trivial) proof that 2*x is equivalent to x+y, do the replacement, do the GVN to de-duplicate the adds, and finally do the branch elimination. Each of these steps is, however, totally separate from the others, and the first one doesn't trigger since, as far as it's concerned, a shift left is faster than an add, so why should it do the replacement?
This is all even more complicated since what representation is faster can depend on the target.
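To see those later steps fire, you can do the first rewrite by hand (a sketch, not a claim about which pass handles it): once both arms are literally the same expression, GVN and branch elimination have nothing left to block them, and clang emits a single add.

unsigned add_v5_rewritten(unsigned x, unsigned y) {
    // When x == y, 2 * x and x + y are the same value, so write x + y in both
    // arms; the branch now chooses between identical values and disappears.
    if (x == y) return x + y;
    return x + y;
}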
The trick is that it’s doing both the add and the left shift in parallel, then selecting which one to use with csel based on a compare of the two inputs.
(To see this, rather than reading the code sequentially, think of every instruction as being issued at the same time until you hit an instruction that needs a destination register from an earlier instruction)
The add is stored in W9 but only read if the two arguments are unequal.
If the compare succeeds and the lsl retires before the add, the add is never read, so nothing stalls waiting for it and the answer can be returned while the add is still in flight. The result of the add would then be quietly discarded assuming it ever started (maybe there’s some magic where it doesn’t even happen at all?).
It’s not clear to me that this is power efficient, or that on many real cpus there’s a latency difference to exploit between add and lsl, so it may not be faster than just unconditionally doing the addition.
That said, it is definitely faster than the code as it was written which if translated to asm verbatim stalls on the compare before executing either the add or the left shift.
It's not. Why would lsl+csel or add+csel or cmp+csel ever be faster than a simple add? Or have higher throughput? Or require less energy? An integer addition is just about the lowest-latency operation you can do on mainstream CPUs, apart from register-renaming operations that never leave the front-end.
This is even true for mid to high end embedded.
Without knowing about specific compiler targets/settings this looks reasonable.
Dumb in the majority case? Absolutely, but smart on the lowest common denominator.
But we do know the target (armv8-a, clang 21.1.0 with O3), and it still doesn't look reasonable.
> […] but smart on the lowest common denominator.
No, that would be the single add instruction.
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
I don’t think it always did the best job, and I saw a bunch of register spills I thought were unnecessary, but I couldn’t justify the time and effort to do it in assembly…
unsigned mult(unsigned x, unsigned y) {
    unsigned y0 = y;
    while (x--) y = add_v1(y, y0);
    return y;
}
optimizes to:
mult(unsigned int, unsigned int):
        madd    w0, w1, w0, w1    // w0 = y*x + y
        ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)

E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
† Monomorphization takes parametrically polymorphic functions, i.e. functions which are strongly typed but whose types are parameters at compile time, and emits distinct machine code for each needed variation of the function. So e.g. add(a, b) maybe gets compiled to produce add_integer(a, b), add_float(a, b) and add_matrix(a, b) even though we only wrote one function, and then code which calls add(a, b) with matrices is, at compile time, emitted as calling add_matrix(a, b), because the compiler knew it needs that version. In C++ the number of parameters is also potentially allowed to vary between callers, so add_matrix(a, b, c, d) might exist too; this feature is not yet available in Rust.
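A minimal C++ sketch of what that looks like in practice (identifiers invented for illustration):

#include <cstdio>

// One parametrically polymorphic function in the source...
template <typename T>
T add(T a, T b) { return a + b; }

int main() {
    // ...but the compiler emits separate machine code for each instantiation
    // it actually needs, roughly the add_integer / add_float split above.
    std::printf("%d %f\n", add(1, 2), add(1.5, 2.5));
}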
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
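A sketch of the effect described above (function names invented; this is not the code behind the Compiler Explorer link): with internal linkage, clang at -O2/-O3 inlines both helpers into the caller and emits no standalone definition for either.

// Only caller() survives in the object file; add_a and add_b are inlined away.
static unsigned add_a(unsigned x, unsigned y) { return x + y; }
static unsigned add_b(unsigned x, unsigned y) { return x + y; }

unsigned caller(unsigned x, unsigned y) {
    return add_a(x, y) + add_b(y, x);
}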
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create a ELF file. Looking at it with a hex editor, it has no sections or symbol table because it creates a completely stripped binary.
Learned something :)
A quick google suggests it's called "identical COMDAT folding" https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
void go_forward(Closure *clo, Closure *cont, Closure *forward) {
    GC_CHECK(clo, cont, forward);
    ((Fun0)(forward->fun))(forward, cont);
}

void go_left(Closure *clo, Closure *cont, Closure *left, Closure *right) {
    GC_CHECK(clo, cont, left, right);
    ((Fun0)(left->fun))(left, cont);
}

void go_right(Closure *clo, Closure *cont, Closure *left, Closure *right) {
    GC_CHECK(clo, cont, left, right);
    ((Fun0)(right->fun))(right, cont);
}

GcInfo gc_info[] = {
    { .fun = (GenericFun)&go_forward, .envc = 0, .argc = 1 },
    { .fun = (GenericFun)&go_left,    .envc = 0, .argc = 2 },
    { .fun = (GenericFun)&go_right,   .envc = 0, .argc = 2 },
};
Since the pointers to go_forward and go_left will be the same, the gc_info table is less useful than it could be otherwise.

You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
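Building on the gc_info snippet above (a hypothetical lookup, not code from the original project): once folding makes the pointers equal, a reverse lookup by function pointer can no longer tell the entries apart.

// If &go_forward == &go_left after folding, this returns the first matching
// entry for both, and the argc metadata for one of them is effectively lost.
int gc_index_for(GenericFun f) {
    for (int i = 0; i < 3; ++i)
        if (gc_info[i].fun == f) return i;
    return -1;
}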
In the OP examples, instead of optimization, what I would prefer is a separate analysis tool that reports what optimizations are possible and a compiler that makes it easy to write both high level and machine code as necessary. Now instead of the compiler opaquely rewriting your code for you, it helps guide you into writing optimal code at the source level. This, for me, leads to a better equilibrium where you are able to express your intent at a high level and then, as needed, you can perform lower level optimizations in a transparent and deterministic way.
For me, the big value of existing optimizing compilers is that I can use them to figure out what instructions might be optimal for my use case, and then I can directly write those instructions where the highest performance is needed. But I do not need to subject myself to the slow compilation times (a cost that compounds as the compiler repeatedly reoptimizes the same function thousands of times during development, and is repeated with every single compilation of the file), nor to the possibility that the optimizer breaks my code in an opaque way that I won't notice until something bad and inscrutable happens at runtime.
It's super cool to see this in practice, and for me it helps build trust that the compiler does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.