Posted by Qadriq 2 days ago
Amdahl’s Law would like to have a word.
(Disclaimer: I am the author of the article, and I am quite familiar with the law.)
It’s not surprising they didn’t see a linear speedup from splitting into so many crates. The compiler now produces a large number of intermediate object files that must be read back and linked into the final binary. On top of that, rustc caches a significant amount of semantic information — lifetimes, trait resolutions, type inference — much of which now has to be recomputed for each crate, including dependencies. That introduces a lot of redundant work.
I'd also expect this to hurt runtime performance, as it likely reduces inlining opportunities (unless LTO is really good now?)
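For reference, this is a sketch of the knob being discussed: cross-crate LTO is opted into per profile in Cargo.toml (these are the standard Cargo options, not settings from the article):

```toml
# Cargo.toml
[profile.release]
# "thin" recovers most cross-crate inlining at a moderate compile-time cost;
# "fat" optimizes across all crates at once, which is much slower to build.
lto = "thin"
```

Thin LTO is usually the pragmatic middle ground when a project has been split into many crates.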
- in Rust, one semantic compilation unit is one crate
- in C, one semantic compilation unit is one file
There are quite a few benefits to the Rust approach, but also drawbacks; e.g. huge projects have to be split into a workspace of multiple crates to maximize parallel building.
Oversimplified: the codegen-units setting tells the compiler how many parts it is allowed to split a single semantic compilation unit into.
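In Cargo.toml that knob looks like this; a minimal sketch, and the values shown are just the current defaults mentioned later in the thread:

```toml
# Cargo.toml: how many parts one crate's codegen may be split into
[profile.release]
codegen-units = 16    # release default; more units = more parallelism, fewer optimizations

[profile.dev]
codegen-units = 256   # dev default favors build speed over optimization
```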
Now it still seems strange (as in, it looks like a performance bug) that most of the time rustc was stuck on just one thread (instead of e.g. 8).
Agreed, seems like there are some rustc performance bugs at play here.
codegen-units defaults to 16 in release builds, and by far the most time in the "passes" list is spent in LLVM passes (which is what codegen-units parallelizes), so most of the time it shouldn't be stuck with one highly loaded core (even if it's not 16 all the time).
so it looks a lot like something prevented the intended codegen parallelization of the crate
Though it might indeed not have been a bug; e.g. before the change to split generation across crates, the source code might have been structured in a way where the compiler couldn't split the crate into multiple units. Or maybe something made rustc believe splitting it was a bad idea, e.g. related to memory usage or similar.
- some subtleties related to (proc-)macros
- better optimizations (potentially, not always, sometimes not at all)
- how generics and compilation units interact (reduces the benefit of making each module a compilation unit)
- a lot of uncertainty about how Rust would develop in the future when this decision was made
Also, when people talk about Rust compiling slowly and splitting helping, it's mostly about better caching of repeated builds (unrelated to the incremental-build feature), not the specific issue here. But there is definitely potential to make humongous single crates work better (e.g. instead of a fixed 16/256 internal splits you could factor in the crate size, maybe add an attribute to hint codegen-unit splits, etc.), but so far no one has deemed it important enough to invest their time into fixing it. I mean, splitting crates is often easy, so you do it once and are good forever, or at least for a long time.
> That’s right — 1,106 crates! Sounds excessive? Maybe. But in the end this is what makes rustc much more effective.
> What used to take 30–45 minutes now compiles in under 3 minutes.
I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.
It partially is, with codegen units. The problem is that you can't generally do that until codegen time, because of circular dependencies.
It will give you a workspace with a bunch of crates that seems to exercise some of the same bottlenecks the blog post described.
But I wonder if generating Rust is the best approach. On the plus side, you can take advantage of the compiler's rich type system and type checking. On the other hand, you're stuck with that compiler.
I wonder if the dynamic constraints can be expressed and checked through some more directly implemented mechanism. It should be both simpler to express exactly the constraints you want (no need to translate to a rust construct that rustc will check as desired), and, of course, should be a lot more efficient. Feldera may have no feasible way to get away from generated rust, but a potential competitor might avoid the issue. (That's not to say the runtime shouldn't/couldn't be implemented in rust. I'm just talking about the large amounts of generated rust.)
Functions marked #[inline] can still be inlined across crates.
LTO can inline across crates but, of course, at a substantial compile time cost.
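As a sketch of the first point: a single file can't demonstrate the cross-crate effect, but this is the shape of it. The function name here is made up for illustration; in a real setup it would live in a dependency crate.

```rust
// Hypothetical library function: #[inline] doesn't force inlining, it
// exports the function body in the crate metadata so that a *downstream*
// crate's codegen can inline it even with LTO turned off.
#[inline]
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    // 1.0*3.0 + 2.0*4.0 = 11.0
    assert_eq!(dot(&[1.0, 2.0], &[3.0, 4.0]), 11.0);
}
```

Without the attribute (or generics, which are always instantiated in the caller's crate), a non-LTO build can only call such a function through its compiled symbol.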
That said, I could see how it would make writing the transpiler easier, so that's a win.
I'd aim for this linear speedup for compiling (sans the overhead of compiling a small crate), but the linking part won't be faster, maybe even slower. Maybe a slightly bigger envelope can tell you how much performance there is to extract and the cost of using "too many" crates (which I'm not even sure is too many; maybe your original crate was just too big for incremental compilation to work well?)
Looks like I settled on the slower 'one class per file' compilation method for whatever reason, probably because generating a 200k+ file didn't seem like such a good idea.
Mostly throwaway code with heavy input from Claude, so the docs are in the code itself :-)
But in case anyone can find it useful:
The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Anyway, I decided to experiment with Claude also writing a README. The result doesn't seem too terribly incorrect on first squint, and hopefully gives a slightly better impression of what that thing was attempting to do. (Disclaimer: I didn't test it much beyond my use case, so YMMV on whether it works at all.)
> The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Even if that assumption is true for some of the potential users, they would still appreciate a starting point, you know.
Now I have bookmarked it and will check it out at some point.
If you ever figure you want to invest some more effort into it: try making it into an LSP server so it can integrate with LSP Code Actions directly.
I looked briefly at LSP but never had any experience with it, and it looked very overwhelming… (and given that I generally use vi, it seemed like a bit too much overhead to also start using a different editor or learn integrations, which I looked at, but they seemed a bit unsatisfying).
As a result, this exercise got me into an entirely worse kind of shiny: writing my own TUI editor, with a function/type being the unit of editing rather than the file. facepalm.
Probably an entirely worthless exercise, but it is a ton of fun, and that is what matters for now! :-)
LSP Code Actions are super neat though. You can have a compiler error or a warning, and when your cursor is positioned on it (inside the editor) you can invoke a code action and have changes offered to you with a preview; then you can just agree to it and boom, it's done.
Obviously this might be too much work or too tedious for a hobby project, but it's good for you to know what's out there and how it's used. I don't use LSP Code Actions too often but find them invaluable when I do.
It might be a bit closer to what I am after, so I will definitely give it a try, even if just as a source of inspiration for my reinvention of the bicycle!
Thanks a lot !