> Bun takes a different approach by buffering the entire tarball before decompressing.
But it seems to sidestep _how_ it does this any differently from the "bad" snippet the section opened with (presumably it checks the Content-Length header when fetching the tarball, or something, and can assume the size it gets from there is correct). All it says about this is:
> Once Bun has the complete tarball in memory it can read the last 4 bytes of the gzip format.
Then it explains how it can pre-allocate a buffer for the decompressed data, but we never saw how this buffer allocation happens in the "bad" example!
> These bytes are special since they store the uncompressed size of the file! Instead of having to guess how large the uncompressed file will be, Bun can pre-allocate memory to eliminate buffer resizing entirely
Presumably the saving is in the slow package managers having to expand _both_ of the buffers involved, while Bun pre-allocates at least one of them?
https://github.com/oven-sh/bun/blob/7d5f5ad7728b4ede521906a4...
We trust the self-reported size by gzip up to 64 MB, try to allocate enough space for all the output, then run it through libdeflate.
This is instead of a loop that decompresses it chunk-by-chunk, extracts it chunk-by-chunk, and resizes a big tarball many times over.
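Roughly, in Go terms, the idea looks like this (just a sketch: the file name is made up, the 64 MB cap mirrors the number above, and the standard compress/gzip stands in for libdeflate):

---
package main

import (
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

const maxTrustedSize = 64 << 20 // only trust the self-reported size up to 64 MB

func main() {
    // Buffer the entire tarball first (the downloaded .tgz).
    compressed, err := os.ReadFile("package.tgz")
    if err != nil || len(compressed) < 4 {
        panic("need a complete gzip stream")
    }

    // The last 4 bytes of a gzip stream are ISIZE: the uncompressed size
    // mod 2^32, little-endian (RFC 1952). Only reliable for a
    // single-member stream under 4 GiB.
    isize := binary.LittleEndian.Uint32(compressed[len(compressed)-4:])

    hint := int(isize)
    if hint > maxTrustedSize {
        hint = maxTrustedSize
    }

    // Pre-allocate the output so decompression never has to grow and
    // copy the buffer.
    out := bytes.NewBuffer(make([]byte, 0, hint))
    zr, err := gzip.NewReader(bytes.NewReader(compressed))
    if err != nil {
        panic(err)
    }
    if _, err := io.Copy(out, zr); err != nil {
        panic(err)
    }
    fmt.Printf("decompressed %d bytes (ISIZE said %d)\n", out.Len(), isize)
}
---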
I think my actual issue is that the "most package managers do something like this" example code snippet at the start of [1] doesn't seem to quite make sense - or doesn't match what I guess would actually happen in the decompress-in-a-loop scenario?
As in, it appears to illustrate building up a buffer holding the compressed data that's being received (since the "// ... decompress from buffer ..." comment at the end suggests what we're receiving in `chunk` is compressed), but I guess the problem with the decompress-as-the-data-arrives approach in reality is having to re-allocate the buffer for the decompressed data?
[1] https://bun.com/blog/behind-the-scenes-of-bun-install#optimi...
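For reference, the decompress-as-the-data-arrives path I'm picturing looks roughly like this in Go (the chunk size and names are made up, not any package manager's actual code); the repeated `append` into an output slice with no size hint is where the resizing cost shows up:

---
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

// streamDecompress decompresses chunk by chunk as the bytes arrive. The
// output slice starts with no size hint, so append has to reallocate and
// copy its backing array repeatedly as the data grows.
func streamDecompress(body io.Reader) ([]byte, error) {
    zr, err := gzip.NewReader(body)
    if err != nil {
        return nil, err
    }
    defer zr.Close()

    var decompressed []byte        // no pre-allocation
    chunk := make([]byte, 64*1024) // hypothetical chunk size
    for {
        n, readErr := zr.Read(chunk)
        if n > 0 {
            // This append is where the repeated buffer resizing happens
            // once the data outgrows the current capacity.
            decompressed = append(decompressed, chunk[:n]...)
        }
        if readErr == io.EOF {
            return decompressed, nil
        }
        if readErr != nil {
            return nil, readErr
        }
    }
}

func main() {
    out, err := streamDecompress(os.Stdin) // e.g. `cat package.tgz | go run .`
    if err != nil {
        panic(err)
    }
    fmt.Println(len(out), "bytes decompressed")
}
---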
> Bun does it differently. Bun is written in Zig, a programming language that compiles to native code with direct system call access:
Guess what, C/C++ also compiles to native code.
I mean, I get what they're saying and it's good, and Node.js could probably have done that as well, but didn't.
But don't phrase it like it's inherently not capable. No one forced npm to use this abstraction, and npm probably should have been a Node.js addon in C/C++ in the first place.
(If any of this sounds like a defense of npm or Node, it is not.)
Npm, pnpm, and yarn are written in JS, so they have to use Node.js facilities, which are based on libuv, which isn't optimal in this case.
Bun is written in Zig, so it doesn't need libuv and can do its own thing.
Obviously, someone could write a Node.js package manager in C/C++ as a native module to do the same, but that's not what npm, pnpm, and yarn did.
progress: dynamically-linked musl binaries (tnx)
next: statically-linked musl binaries
- Clean `bun install`, 48s (converted package-lock.json)
- With bun.lock, no node_modules, 19s
- Clean with `deno install --allow-scripts`, 1m20s
- With deno.lock, no node_modules, 20s
- Clean `npm i`, 26s
- `npm ci` (package-lock.json), no node_modules, 1m 2s (wild)
So it looks like if Deno added a package-lock.json conversion similar to Bun's, the installs would be very similar all around. I have no control over the security software used on this machine; it was just convenient since I was in front of it. Hopefully someone can put eyes on this issue: https://github.com/denoland/deno/issues/25815
Deno's dependency architecture isn't built around npm; that compatibility layer is a retrofit on top of the core (which is evident in the source code, if you ever want to see). Deno's core architecture around dependency management uses a different, URL-based paradigm. It's not as fast, but... It's different. It also allows for improved security and cool features like the ability to easily host your own secure registry. You don't have to use npm or jsr. It's very cool, but different from what is being benchmarked here.
edit: replied to my own post... looks like `deno install --allow-scripts` is about 1s slower than bun once deno.lock exists.
It’s usually only worth it after ~tens of megabytes, but the vast majority of npm packages are much smaller than that. So if you can skip it, it's better.
Or is the concern about the time spent in CI/CD?
What's the reason for this?
I could imagine many tools profiting from knowing the decompressed file size in advance.
> ISIZE (Input SIZE)
> This contains the size of the original (uncompressed) input data modulo 2^32.
So there are two big caveats:
1. Your data is a single gzip member (I guess this means everything in a folder)
2. Your data is < 2^32 bytes.
However, because of the scale of what Bun deals with, it's on the edge of what I would consider safe, and I hope the real code has a fallback for the case where the file has multiple members in it, because sooner or later it'll happen.
It's not necessarily terribly well known that you can just slam gzip members (or files) together and it's still a legal gzip stream, but it's something I've made use of in real code, so I know it's happened. You can do some simple things with having indices into a compressed file so you can skip over portions of the compressed stream safely, without other programs having to "know" that's a feature of the file format.
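A quick Go illustration of this (error handling elided): members slammed together decompress fine with a streaming reader, but the trailing ISIZE then only describes the last member, which is exactly the case the read-the-last-4-bytes trick has to fall back on.

---
package main

import (
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "fmt"
    "io"
)

// gz compresses b as a single gzip member.
func gz(b []byte) []byte {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(b) // errors elided in this sketch
    w.Close()
    return buf.Bytes()
}

func main() {
    // Two members slammed together are still one legal gzip stream.
    stream := append(gz(bytes.Repeat([]byte("a"), 1000)), gz([]byte("tail"))...)

    // The trailing ISIZE only covers the last member: prints 4, not 1004.
    isize := binary.LittleEndian.Uint32(stream[len(stream)-4:])
    fmt.Println("ISIZE of last member:", isize)

    // A streaming decompressor still sees all 1004 bytes.
    zr, _ := gzip.NewReader(bytes.NewReader(stream))
    n, _ := io.Copy(io.Discard, zr)
    fmt.Println("actual uncompressed size:", n)
}
---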
Although the whole thing is weird in general because you can stream gzip'd tars without ever having to allocate space for the whole thing anyhow. gzip can be streamed without having seen the footer yet, and the tar format can be streamed out pretty easily. I've written code for this in Go a couple of times, where I can be quite sure there's no stream rewinding occurring, by the nature of the io.Reader system. Reading the whole file into memory to unpack it was never necessary in the first place; not sure if they've got some other reason to do that.
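The shape of it is something like this (the path is a placeholder and io.Discard stands in for actually writing entries to disk); it never needs more than a small window of the archive in memory:

---
package main

import (
    "archive/tar"
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("package.tgz") // placeholder path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // gzip and tar are both forward-only formats, so a plain io.Reader
    // pipeline streams the whole extraction without rewinding.
    zr, err := gzip.NewReader(f)
    if err != nil {
        panic(err)
    }
    tr := tar.NewReader(zr)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        // Each entry is consumed as it is decompressed.
        n, err := io.Copy(io.Discard, tr)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s (%d bytes)\n", hdr.Name, n)
    }
}
---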
I was just wondering why GZIP specified it that way.
Thanks!
---
def _read_eof(self):
# We've read to the end of the file, so we have to rewind in order
# to reread the 8 bytes containing the CRC and the file size.
# We check the that the computed CRC and size of the
# uncompressed data matches the stored values. Note that the size
# stored is the true file size mod 2**32.
---