
Posted by Qadriq 4/17/2025

Cutting down Rust compile times from 30 to 2 minutes with one thousand crates (www.feldera.com)
143 points | 87 comments
wiseowise 4/17/2025|
> Given that we now fully utilize 128 threads or 64 cores for pretty much the entire compile time, we can do a back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.

Amdahl’s Law would like to have a word.
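To make the bound concrete, here is a minimal sketch of Amdahl's Law plugged into the thread's numbers (the 25 min single-threaded total and the 7 s link time come from the article; the other serial-time values are made up for illustration):

```rust
// Amdahl's Law: maximum speedup on n cores when a fraction `serial`
// of the total work cannot be parallelized.
fn amdahl_speedup(serial: f64, n: f64) -> f64 {
    1.0 / (serial + (1.0 - serial) / n)
}

fn main() {
    let total = 25.0 * 60.0; // 25 min single-threaded build, in seconds
    // 7 s of linking is the one serial number the article gives;
    // the larger values are hypothetical.
    for serial_secs in [7.0, 30.0, 60.0] {
        let f = serial_secs / total;
        let s = amdahl_speedup(f, 64.0);
        println!("{serial_secs:>4.0}s serial -> max speedup on 64 cores: {s:.1}x");
    }
}
```

Even a small serial fraction pulls the achievable speedup well below the core count, which is the point being made here.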

gz09 4/17/2025||
FWIW the article carefully worded it as a "back of the envelope" calculation, says we can't expect a linear speedup in practice and also gives the time it takes for linking (7 secs).

(Disclaimer: I am the author of the article, and I am quite familiar with the law.)

stouset 4/17/2025|||
This is an observed change of going from 1 core at 100% to 64 cores at 100%. This is establishing a lower bound, assuming there is no wasted work or contention for shared resources.
repstosb 4/18/2025|||
Amdahl's Law, like most 20th-century performance "wisdom" and metrics, focuses excessively on instruction count, neglecting memory and I/O pressure. 64 cores doesn't mean 64 independent cache pyramids and memory buses. In real life, the difference between CPU cycle frequency and memory latency is so great that memory pressure primarily determines performance, whereas core count really only matters to the extent that it contributes to that memory pressure.
throwawaymaths 4/17/2025||
Amdahl's law is about coordination costs, so you would expect either cores to be starved or lots of extra coordination-related compute to be happening (which, I guess, is not totally crazy since there are that many crates), but as a first guess OP's back-of-the-envelope is fine.
HippoBaro 4/17/2025||
Eminently pragmatic solution — I like it. In Rust, a crate is a compilation unit, and the compiler has limited parallelism opportunities, especially since rustc offloads much of the work to LLVM, which is largely single-threaded.

It’s not surprising they didn’t see a linear speedup from splitting into so many crates. The compiler now produces a large number of intermediate object files that must be read back and linked into the final binary. On top of that, rustc caches a significant amount of semantic information — lifetimes, trait resolutions, type inference — much of which now has to be recomputed for each crate, including dependencies. That introduces a lot of redundant work.

I also would expect this to hurt runtime performance as it likely reduces inlining opportunities (unless LTO is really good now?)

JJJollyjim 4/17/2025|
They mention that compiling one crate at a time (-j1) doesn't give the 7x slowdown, which rules out the object-file and caching-in-rustc theories... I think the only explanation is that the rustc processes are sharing limited L3 cache.
lsuresh 4/17/2025||
The L3 cache angle is one of our hypotheses too. But it doesn't seem like we can do much about it.
dathinab 4/17/2025||
The main issue here is:

- in rust one semantic compilation unit is one crate

- in C one semantic compilation unit is one file

There are quite a few benefits to the Rust approach, but also drawbacks, like huge projects having to be split into multiple workspace crates to maximize parallel building.

Oversimplified: the codegen-units setting tells the compiler into how many parts it is allowed to split a single semantic codegen unit.
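For reference, this setting lives in the Cargo profile; a minimal sketch (16 is the release-profile default; the lto value is illustrative):

```toml
[profile.release]
codegen-units = 16   # release default; more units = more parallel LLVM work,
                     # at some cost in optimization quality
lto = "thin"         # recovers some cross-crate inlining after splitting
```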

Now it still seems strange (as in it looks like a performance bug) that most of the time rust was stuck in just one thread (instead of e.g. 8).

bryanlarsen 4/17/2025||
Rust has a great compromise between crate and file: module. I wonder why that's not the compilation unit?
lmkg 4/17/2025|||
Per a reddit comment, modules are allowed to have circular dependencies while crates are not.
dathinab 4/18/2025||||
- cyclic dependencies

- some subtleties related to (proc-)macros

- better optimizations (potentially, not always, sometimes not at all)

- how generics and compilation units interact (reduces the benefit of making each module a compilation unit)

- a lot of uncertainty about how Rust would develop in the future when this decision was made

Also, when people speak about Rust compiling slowly and splitting helping, it's usually related to better caching of repeated builds (unrelated to the incremental-build feature), not the specific issue here. But there is definitely potential to make humongous single crates work better (e.g. instead of a fixed 16/256 internal splits you could factor in the crate size, maybe add an attribute to hint codegen-unit splits, etc.), but so far no one has deemed it important enough to invest their time into fixing it. I mean, splitting crates is often easy, so you do that once and are good forever, or at least for a long time.

colonial 4/17/2025|||
I think it's due to the fact that (unlike crates) cyclic dependencies are allowed between modules without any extra ceremony (e.g. forward declarations in C).
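A minimal sketch of such a cycle (module and function names are made up): two modules in one crate calling into each other, which compiles fine, whereas the same shape across two crates would be rejected as a circular dependency.

```rust
// Modules within one crate may reference each other freely; crates may not.
mod parser {
    pub fn parse(input: &str) -> usize {
        // Calls into a sibling module...
        crate::eval::checked_len(input)
    }
}

mod eval {
    pub fn checked_len(s: &str) -> usize {
        s.len()
    }
    pub fn eval(s: &str) -> usize {
        // ...which calls back into `parser`: a cycle that would be
        // rejected if these were two separate crates.
        crate::parser::parse(s) + 1
    }
}

fn main() {
    assert_eq!(eval::eval("abc"), 4);
    println!("ok");
}
```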
lsuresh 4/17/2025|||
> Now it still seems strange (as in it looks like a performance bug) that most of the time rust was stuck in just one thread (instead of e.g. 8).

Agreed, seems like there are some rustc performance bugs at play here.

steveklabnik 4/17/2025||
I haven't dug into the details, but it may not even be a performance bug, depending on how you define 'bug': the Rust compiler is not fully parallel itself yet. That's a bug in the sense of something that needs to be improved and fixed, but isn't one in the sense of "unexpected bad behavior".
lsuresh 4/17/2025|||
Makes sense. We'd appreciate some more eyeballs here for sure. Between HN and a Reddit thread, there are a few hypotheses floating around. I've shared a repro here for anyone interested: https://github.com/feldera/feldera/issues/3882
steveklabnik 4/17/2025||
You may want to post on the Zulip, I think that's the way to get in touch with the team these days.
dathinab 4/17/2025|||
the thing is:

codegen-units defaults to 16 in release builds, and by far the most time in the "passes" list is spent in LLVM passes (which is what codegen-units parallelizes), so most of the time it shouldn't be stuck with 1 high-load core (even if it's not 16 all the time).

so it looks a lot like something is preventing the intended codegen parallelization of the crate

Though it indeed might not have been a bug: before the change in code generation to split it across crates, the source code might have been structured in a way where the compiler can't split the crate into multiple units. Or maybe something made rustc believe splitting it is a bad idea, e.g. related to memory usage or similar.

steveklabnik 4/17/2025||
Ah yeah, that does sound like a bug to me; it’s the earlier stages that I’m thinking of that aren’t parallel yet.
simfoo 4/17/2025||
Agreed, this is the main underlying issue. I've faced it before with generated C++ code too, and after long and painful refactorings, what ultimately helped the most was just splitting the generated code into multiple compilation units to allow for parallel compilation. It comes with the drawback of potentially instantiating (and later throwing away) a lot more templates, though.
hu3 4/17/2025||
> We're using rustc v1.83, and despite having a 64-core machine with 128 threads, Rust barely puts any of them to work.

> That’s right — 1,106 crates! Sounds excessive? Maybe. But in the end this is what makes rustc much more effective.

> What used to take 30–45 minutes now compiles in under 3 minutes.

I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.

pcwalton 4/17/2025||
> I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.

It partially is, with codegen units. The problem is that you can't generally do that until codegen time, because of circular dependencies.

zwnow 4/17/2025||
1106 crates? Are they sure this is not a Javascript project?
faitswulff 4/17/2025||
They compile their customers' SQL to Rust code. Hence the preponderance of crates. It's a somewhat unique scenario.
lsuresh 4/17/2025||
That's correct. These aren't external dependencies but a dataflow graph being split into crates.
lsuresh 4/17/2025||
For any Rust compiler experts interested in taking a look, I've put together a short repro here: https://github.com/feldera/feldera/issues/3882

It will give you a workspace with a bunch of crates that seems to exercise some of the same bottlenecks the blog post described.

jmull 4/17/2025||
That's a cool project.

But I wonder if generating rust is the best approach. On the plus side, you can take advantage of the rich type and type checking system the compiler has. On the other hand, you're stuck with that compiler.

I wonder if the dynamic constraints can be expressed and checked through some more directly implemented mechanism. It should be both simpler to express exactly the constraints you want (no need to translate to a rust construct that rustc will check as desired), and, of course, should be a lot more efficient. Feldera may have no feasible way to get away from generated rust, but a potential competitor might avoid the issue. (That's not to say the runtime shouldn't/couldn't be implemented in rust. I'm just talking about the large amounts of generated rust.)

slashdev 4/17/2025||
I would probably have gone with generating C here. You don't need all the safety of the Rust compiler, you're the one generating the code. As you point out, you can check all the constraints you want before you compile it.
pjmlp 4/17/2025||
The right way, as done by the likes of Oracle and SQL Server, is to JIT-compile queries and stored procedures, with PGO data from the query analyser.
lsuresh 4/17/2025||
We think a JIT compiler is the right approach too. Will be a substantial effort though, so we're waiting to get a bit of bandwidth on that front.
Crisco 4/17/2025||
Are there any performance implications for the final binary because you’re splitting it up into thousands of crates?
epage 4/17/2025|
Loss of inlining
dgacmu 4/17/2025||
Loss of automatic inlining of non-generics without LTO, to be a little pedantic.

Functions marked #[inline] can still be handled across crates.

LTO can inline across crates but, of course, at a substantial compile time cost.
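A small sketch of the two cases mentioned above (a single file standing in for separate crates; the module and function names are made up):

```rust
// In a real project, `fastmath` would be a separate crate.
mod fastmath {
    // `#[inline]` makes the function body available to other crates,
    // so it can be inlined across the crate boundary even without LTO.
    #[inline]
    pub fn hash(x: u64) -> u64 {
        x.wrapping_mul(0x9E37_79B9_7F4A_7C15)
    }

    // Generic functions are monomorphized in the *calling* crate, so
    // they are always inlining candidates there, LTO or not.
    pub fn twice<T: std::ops::Add<Output = T> + Copy>(x: T) -> T {
        x + x
    }
}

fn main() {
    assert_eq!(fastmath::twice(21), 42);
    println!("{:x}", fastmath::hash(1));
}
```

Plain non-generic functions without `#[inline]` are the case that needs LTO to be inlined across crates.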

nixpulvis 4/17/2025||
I love Rust, but it seems like a really bad intermediate language if you have a compiler/transpiler which is sound. You don't need the type system from Rust telling you something's wrong.

That said, I could see how it would make writing the transpiler easier, so that's a win.

dietr1ch 4/17/2025||
> back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.

I'd aim for this linear speedup for compiling (minus the overhead of compiling a small crate), but the linking part won't be faster, maybe even slower. Maybe a slightly bigger envelope can tell you how much performance there is to extract and the cost of using "too many" crates (and I'm not even sure it is too many; maybe the original crate was too big to ease incremental compilation?)

UncleEntity 4/17/2025||
Yeah, I was working on this project to generate a Python C-API module from the SVG schema, and found that when I generated one huge file (as opposed to one file per class) compilation was significantly faster. Or maybe it was the generated C++ SVG library; I don't quite remember which one was super slow, as I did both at around the same time and the code changes between the two were minimal.

Looks like I settled on the slower 'one class per file' compilation method for whatever reason, probably because generating a 200k+ file didn't seem like such a good idea.

lsuresh 4/17/2025||
Towards the end, the article says it takes 7s for linking using mold.
dietr1ch 4/17/2025||
Yeah, I read that part too late. In that case it seems that there's indeed a lot of overhead when building many crates from a cold-start, but it pays off in wall time and can probably save resources in incremental builds.
lsuresh 4/17/2025||
Yes. We found both cold and incremental builds sped up. The incremental builds were the main win -- small changes to the SQL can sometimes complete in seconds for what used to be a full recompilation.
rc00 4/17/2025||
Rewrite it in Zig? You might even be able to sidestep the LLVM bottleneck entirely. https://news.ycombinator.com/item?id=43016944
lesuorac 4/17/2025|
> Of course, we tried debug builds too. Those cut the time down to ~5 minutes — but they’re not usable in practice.

I wonder how true this is.

I haven't used Feldera, but with other Rust stuff I have, running as debug causes serious performance problems. However, for testing I have it compile a few crates that do heavy work, like `image`, as optimized (and the vast majority as debug), and that is enough to make the performance issues not noticeable. So if the multi-crate approach hadn't worked, possibly just compile some of the stuff as optimized.
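Cargo supports exactly this via per-package profile overrides; a minimal sketch (the `image` override mirrors the comment above; values illustrative):

```toml
# Cargo.toml: keep the workspace itself in debug, but build selected
# heavy dependencies with optimizations.
[profile.dev.package.image]
opt-level = 3
```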
