Posted by Bogdanp 9/11/2025

Behind the scenes of Bun Install (bun.com)
431 points | 152 comments
mrcarrot 9/12/2025|
The "Optimized Tarball Extraction" confuses me a bit. It begins by illustrating how other package managers have to repeatedly copy the received, compressed data into larger and larger buffers (not mentioning anything about the buffer where the decompressed data goes), and then says that:

> Bun takes a different approach by buffering the entire tarball before decompressing.

But it seems to sidestep _how_ it does this any differently from the "bad" snippet the section opened with (presumably it checks the Content-Length header when it's fetching the tarball or something, and can assume the size it gets from there is correct). All it says about this is:

> Once Bun has the complete tarball in memory it can read the last 4 bytes of the gzip format.

Then it explains how it can pre-allocate a buffer for the decompressed data, but we never saw how this buffer allocation happens in the "bad" example!

> These bytes are special since they store the uncompressed size of the file! Instead of having to guess how large the uncompressed file will be, Bun can pre-allocate memory to eliminate buffer resizing entirely

Presumably the saving is in the slow package managers having to expand _both_ of the buffers involved, while bun preallocates at least one of them?
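
My mental model of the pattern being described is roughly this (a hand-wavy Node sketch of my reading of it, not the blog's actual snippet), with _both_ buffers growing by reallocation:

    import { createGunzip } from "node:zlib";

    // "Grow as you go": neither the compressed input nor the decompressed
    // output size is known up front, so both get rebuilt as they grow.
    async function fetchAndUnzip(url: string): Promise<Buffer> {
      const res = await fetch(url);
      let compressed = Buffer.alloc(0);
      for await (const chunk of res.body!) {
        // Each concat allocates a bigger buffer and copies the old bytes over.
        compressed = Buffer.concat([compressed, Buffer.from(chunk)]);
      }
      return new Promise((resolve, reject) => {
        const gunzip = createGunzip();
        let out = Buffer.alloc(0);
        // Same story on the output side: grow-and-copy per decompressed chunk.
        gunzip.on("data", (d: Buffer) => { out = Buffer.concat([out, d]); });
        gunzip.on("end", () => resolve(out));
        gunzip.on("error", reject);
        gunzip.end(compressed);
      });
    }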

Jarred 9/12/2025|
Here is the code:

https://github.com/oven-sh/bun/blob/7d5f5ad7728b4ede521906a4...

We trust the self-reported size by gzip up to 64 MB, try to allocate enough space for all the output, then run it through libdeflate.

This is instead of a loop that decompresses chunk-by-chunk, then extracts chunk-by-chunk, resizing a big tarball buffer many times over.
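
If you want to poke at the footer trick without reading the Zig, here's a rough Node sketch of the same idea (the chunkSize option is only an approximation of handing libdeflate a pre-sized output buffer, and the .tgz path is made up):

    import { readFileSync } from "node:fs";
    import { gunzipSync } from "node:zlib";

    const MAX_TRUSTED = 64 * 1024 * 1024; // the 64 MB cap mentioned above

    // The last 4 bytes of a gzip file are ISIZE: the uncompressed size mod 2^32.
    function sizeHint(gz: Buffer): number {
      return gz.readUInt32LE(gz.length - 4);
    }

    const tarball = readFileSync("some-package.tgz"); // hypothetical path
    const hint = Math.min(sizeHint(tarball), MAX_TRUSTED);

    // Node's zlib can't decompress into a caller-supplied buffer, but using
    // the hint as chunkSize makes it allocate one big output chunk up front
    // instead of growing in 16 KB steps.
    const tar = gunzipSync(tarball, { chunkSize: Math.max(hint, 64) });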

mrcarrot 9/12/2025||
Thanks - this does make sense in isolation.

I think my actual issue is that the "most package managers do something like this" example code snippet at the start of [1] doesn't seem to quite make sense - or doesn't match what I guess would actually happen in the decompress-in-a-loop scenario?

As in, it appears to illustrate building up a buffer holding the compressed data that's being received (since the "// ... decompress from buffer ..." comment at the end suggests what we're receiving in `chunk` is compressed), but I guess the problem with the decompress-as-the-data-arrives approach in reality is having to re-allocate the buffer for the decompressed data?

[1] https://bun.com/blog/behind-the-scenes-of-bun-install#optimi...

RestartKernel 9/11/2025||
This is very nicely written, but I don't quite get how Linux's hardlinks are equivalent to macOS's clonefile. If I understand correctly, wouldn't the former unexpectedly update files across all your projects if you modify just one "copy"?
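
The way I'd check my understanding (Node sketch; COPYFILE_FICLONE asks for a copy-on-write clone where the filesystem supports it, which is what clonefile gives you on APFS, and silently falls back to a plain copy elsewhere):

    import { writeFileSync, readFileSync, linkSync, copyFileSync, constants } from "node:fs";

    writeFileSync("original.txt", "v1");

    // Hard link: both names refer to the same inode, so a write through
    // either name is visible through both.
    linkSync("original.txt", "hardlinked.txt");

    // Copy-on-write clone: shares blocks only until either side is modified.
    copyFileSync("original.txt", "cloned.txt", constants.COPYFILE_FICLONE);

    writeFileSync("original.txt", "v2"); // rewrites the file in place
    console.log(readFileSync("hardlinked.txt", "utf8")); // "v2" -- the edit shows through
    console.log(readFileSync("cloned.txt", "utf8"));     // "v1" -- the clone stays independent

So as far as I can tell, hardlink-based stores rely on installed files never being edited in place.
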
wink 9/11/2025||
> Node.js uses libuv, a C library that abstracts platform differences and manages async I/O through a thread pool.

> Bun does it differently. Bun is written in Zig, a programming language that compiles to native code with direct system call access:

Guess what, C/C++ also compiles to native code.

I mean, I get what they're saying and it's good, and nodejs could have probably done that as well, but didn't.

But don't phrase it like it's inherently not capable. No one forced npm to use this abstraction, and npm probably should have been a nodejs addon in C/C++ in the first place.

(If anything of this sounds like a defense of npm or node, it is not.)

k__ 9/11/2025||
To me, the reasoning seems to be:

Npm, pnpm, and yarn are written in JS, so they have to use Node.js facilities, which are based on libuv, which isn't optimal in this case.

Bun is written in Zig, so it doesn't need libuv and can do its own thing.

Obviously, someone could write a Node.js package manager in C/C++ as a native module to do the same, but that's not what npm, pnpm, and yarn did.
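
To make the libuv part concrete: every async fs call a JS package manager makes is serviced by libuv's thread pool (4 threads by default, tunable via UV_THREADPOOL_SIZE), which is exactly the layer a native binary can skip. A toy way to see the pool in the path (rough sketch, not a real benchmark):

    import { stat } from "node:fs/promises";

    // Async fs work is queued onto libuv's thread pool, not run on the JS
    // thread. Run with UV_THREADPOOL_SIZE=1 and then =16 and compare timings.
    const start = performance.now();
    await Promise.all(
      Array.from({ length: 2000 }, () => stat("package.json")),
    );
    console.log(`${(performance.now() - start).toFixed(1)} ms`);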

lkbm 9/11/2025||
Isn't the issue not that libuv is C, but that the thing calling it (Node.js) is Javascript, so you have to switch modes each time you have libuv make a system call?
1vuio0pswjnm7 9/12/2025||
https://github.com/oven-sh/bun/releases/expanded_assets/bun-...

progress: dynamically-linked musl binaries (tnx)

next: statically-linked musl binaries

tracker1 9/11/2025||
I'm somewhat curious how Deno stands up with this... also, not sure what packages are being installed. I'd probably start a vite template project for react+ts+mui as a baseline, since that's a relatively typical application combo for tooling. Maybe hono+zod+openapi as well.
tracker1 9/11/2025||
For my own curiosity, on a React app on my work desktop:

    - Clean `bun install`, 48s - converted package-lock.json
    - With bun.lock, no node_modules, 19s
    - Clean with `deno install --allow-scripts`, 1m20s
    - with deno.lock, no node_modules, 20s
    - Clean `npm i`, 26s
    - `npm ci` (package-lock.json), no node_modules, 1m 2s (wild)
So it looks like if Deno added a package-lock.json conversion similar to Bun's, the installs would be very similar all around. I have no control over the security software on this machine; it was just convenient since I was in front of it.

Hopefully someone can put eyes on this issue: https://github.com/denoland/deno/issues/25815

steve_adams_86 9/11/2025||
I think Deno isn't included in the benchmark because it's a harder comparison to make than it might seem.

Deno's dependency architecture isn't built around npm; that compatibility layer is a retrofit on top of the core (which is evident in the source code, if you ever want to see). Deno's core architecture around dependency management uses a different, URL-based paradigm. It's not as fast, but... It's different. It also allows for improved security and cool features like the ability to easily host your own secure registry. You don't have to use npm or jsr. It's very cool, but different from what is being benchmarked here.

tracker1 9/11/2025||
All the same, you can run `deno install` in a directory with a package.json file and it will resolve and install to node_modules. The process is also written in compiled code, like bun... so I was just curious.

edit: replied to my own post... looks like `deno install --allow-scripts` is about 1s slower than bun once deno.lock exists.

markasoftware 9/11/2025||
I'm pretty confused about why it's beneficial to wait to read the whole compressed file before decompressing. Surely the benefits of beginning decompression before the download is complete outweigh having to copy the memory around a few extra times as the vector is resized?
Jarred 9/12/2025|
Streaming prevents many optimizations because the code can’t assume it’s done when run once, so it has to suspend / resume, clone extra data for longer, and handle boundary cases more carefully.

It’s usually only worth it after ~tens of megabytes, but the vast majority of npm packages are much smaller than that. So if you can skip it, it’s better.
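
A toy way to see this (Node sketch, not Bun's actual benchmark) is to compare one-shot vs streaming decompression on a payload around the size of a typical package:

    import { gzipSync, gunzipSync, createGunzip } from "node:zlib";
    import { pipeline } from "node:stream/promises";
    import { Readable, Writable } from "node:stream";

    // ~200 KB of compressible data, roughly a small package tarball.
    const payload = gzipSync(Buffer.alloc(200 * 1024, "a"));
    const devnull = () => new Writable({ write(_c, _e, cb) { cb(); } });

    let t = performance.now();
    for (let i = 0; i < 1000; i++) gunzipSync(payload);
    console.log("one-shot :", (performance.now() - t).toFixed(0), "ms");

    t = performance.now();
    for (let i = 0; i < 1000; i++) {
      // Streaming carries per-chunk state plus setup/teardown per stream.
      await pipeline(Readable.from([payload]), createGunzip(), devnull());
    }
    console.log("streaming:", (performance.now() - t).toFixed(0), "ms");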

yencabulator 9/12/2025||
Streaming decompression with a large buffer size handles everything in a single batch for small files.
azangru 9/12/2025||
I am probably being stupid, but aren't install commands run relatively rarely by developers (less than once a day, perhaps)? Is it really such an important issue how long it takes for `x install` to finish?

Or is the concern about the time spent in CI/CD?

tuetuopay 9/13/2025|
CI/CD is a major usage. But dependency version bumps are also a big part of it. In the Python ecosystem I’ve had Poetry take minutes to resolve the Ansible dependencies after bumping the version. And then you see uv take milliseconds to do a full install from scratch.
valtism 9/12/2025||
I had no idea Lydia was working for Bun now. Her technical writing is absolutely top notch
randomsofr 9/11/2025||
Wow, crazy to see yarn being so slow when it used to beat npm by a lot. At a company I was at, we went from npm to yarn to pnpm, and back to npm. Nowadays I try to use Bun as much as possible, but Vercel still does not use it natively for Next.
chrisweekly 9/11/2025|
why leave pnpm?
k__ 9/11/2025|
"... the last 4 bytes of the gzip format. These bytes are special since store the uncompressed size of the file!"

What's the reason for this?

I could imagine many tools could profit from knowing the decompressed file size in advance.

philipwhiuk 9/11/2025||
It's straight from the GZIP spec if you assume there's a single GZIP "member": https://www.ietf.org/rfc/rfc1952.txt

> ISIZE (Input SIZE)

> This contains the size of the original (uncompressed) input data modulo 2^32.

So there are two big caveats:

1. Your data is a single GZIP member (I guess this means everything in a folder)

2. Your data is < 2^32 bytes.
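
Caveat 2 is easy to quantify, since ISIZE wraps: a 5 GiB uncompressed file reports 1 GiB.

    // ISIZE stores the size mod 2^32, so anything >= 4 GiB wraps around.
    const actual = 5 * 2 ** 30;    // 5 GiB uncompressed
    console.log(actual % 2 ** 32); // 1073741824, i.e. 1 GiB in the footer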

jerf 9/12/2025|||
A GZIP "member" is whatever the creating program wants it to be. I have not carefully verified this but I see no reason for the command line program "gzip" to ever generate more than one member (at least for smaller inputs), after a quick scan through the command line options. I'm sure it's the modal case by far. Since this is specifically about reading .tar.gz files as hosted on npm, this is probably reasonably safe.

However, because of the scale of what bun deals with, it's on the edge of what I would consider safe, and I hope the real code has a fallback for what happens if the file has multiple members in it, because sooner or later it'll happen.

It's not necessarily terribly well known that you can just slam gzip members (or files) together and it's still a legal gzip stream, but it's something I've made use of in real code, so I know it's happened. You can do some simple things with having indices into a compressed file so you can skip over portions of the compressed stream safely, without other programs having to "know" that's a feature of the file format.

Although the whole thing is weird in general because you can stream gzip'd tars without ever having to allocate space for the whole thing anyhow. gzip can be streamed without having seen the footer yet and the tar format can be streamed out pretty easily. I've written code for this in Go a couple of times, where I can be quite sure there's no stream rewinding occurring by the nature of the io.Reader system. Reading the whole file into memory to unpack it was never necessary in the first place, not sure if they've got some other reason to do that.
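
To make the multi-member point concrete (Node sketch; current Node's gunzip does walk all members, but the footer trick only sees the last one):

    import { gzipSync, gunzipSync } from "node:zlib";

    const a = gzipSync(Buffer.alloc(1000, "a"));
    const b = gzipSync(Buffer.alloc(50, "b"));

    // Two members back to back are still one legal gzip stream...
    const joined = Buffer.concat([a, b]);
    console.log(gunzipSync(joined).length); // 1050

    // ...but the trailing ISIZE only describes the last member, so the
    // "read the last 4 bytes" trick under-reports the total here.
    console.log(joined.readUInt32LE(joined.length - 4)); // 50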

k__ 9/11/2025|||
Yeah, I understood that.

I was just wondering why GZIP specified it that way.

ncruces 9/11/2025||
Because it allows streaming compression.
k__ 9/11/2025||
Ah, makes sense.

Thanks!

lkbm 9/11/2025|||
I believe it's because you get to stream-compress efficiently, at the cost of stream-decompress efficiency.
8cvor6j844qw_d6 9/11/2025||
gzip.py [1]

    def _read_eof(self):
        # We've read to the end of the file, so we have to rewind in order
        # to reread the 8 bytes containing the CRC and the file size.
        # We check the that the computed CRC and size of the
        # uncompressed data matches the stored values. Note that the size
        # stored is the true file size mod 2**32.

[1]: https://stackoverflow.com/a/1704576
