Posted by Bogdanp 2 days ago

Behind the scenes of Bun Install (bun.com)
423 points | 152 comments
k__ 2 days ago|
"... the last 4 bytes of the gzip format. These bytes are special since store the uncompressed size of the file!"

What's the reason for this?

I could imagine that many tools would benefit from knowing the decompressed file size in advance.

philipwhiuk 2 days ago||
It's straight from the GZIP spec if you assume there's a single GZIP "member": https://www.ietf.org/rfc/rfc1952.txt

> ISIZE (Input SIZE)

> This contains the size of the original (uncompressed) input data modulo 2^32.

So there are two big caveats:

1. Your data is a single GZIP member (I guess this means everything in a folder)

2. Your data is < 2^32 bytes.
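
A minimal sketch of reading that footer (Python; the helper name and usage are mine, not from the article):

  import struct

  def gzip_isize(path):
      # ISIZE: last 4 bytes, little-endian uint32, holding the
      # uncompressed size mod 2**32. Only trustworthy for a
      # single-member gzip file whose payload is under 4 GiB.
      with open(path, "rb") as f:
          f.seek(-4, 2)  # 4 bytes before the end (os.SEEK_END)
          (isize,) = struct.unpack("<I", f.read(4))
          return isize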

jerf 1 day ago|||
A GZIP "member" is whatever the creating program wants it to be. I have not carefully verified this but I see no reason for the command line program "gzip" to ever generate more than one member (at least for smaller inputs), after a quick scan through the command line options. I'm sure it's the modal case by far. Since this is specifically about reading .tar.gz files as hosted on npm, this is probably reasonably safe.

However, because of the scale of what bun deals with, it's on the edge of what I would consider safe, and I hope the real code has a fallback for files that turn out to have multiple members, because sooner or later it'll happen.

It's not necessarily well known that you can just slam gzip members (or files) together and the result is still a legal gzip stream, but it's something I've made use of in real code, so I know it happens. You can do some simple things, like keeping indices into a compressed file so you can skip over portions of the compressed stream safely, without other programs having to "know" that's a feature of the file format.
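
The concatenation property is easy to demonstrate (Python sketch, purely illustrative):

  import gzip

  # Two independently compressed members, concatenated byte-for-byte,
  # still form one legal gzip stream containing both payloads.
  stream = gzip.compress(b"hello ") + gzip.compress(b"world")
  print(gzip.decompress(stream))  # b'hello world'

Note that in the multi-member case the trailing ISIZE only covers the last member, which is exactly why relying on it needs a fallback.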

Although the whole thing is weird in general, because you can stream gzip'd tars without ever having to allocate space for the whole thing anyhow. gzip can be streamed without having seen the footer yet, and the tar format can be streamed out pretty easily. I've written code for this in Go a couple of times, where I can be quite sure there's no stream rewinding occurring, by the nature of the io.Reader system. Reading the whole file into memory to unpack it was never necessary in the first place; not sure if they've got some other reason to do that.
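
For illustration, here's what that streaming approach looks like in Python rather than Go ("package.tgz" is a hypothetical filename):

  import tarfile

  # Mode "r|gz" treats the input as a forward-only stream: members are
  # decompressed and yielded as the bytes arrive, with no seeking and
  # no need to buffer the whole archive in memory.
  with tarfile.open("package.tgz", mode="r|gz") as tar:
      for member in tar:
          print(member.name, member.size)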

k__ 2 days ago|||
Yeah, I understood that.

I was just wondering why GZIP specified it that way.

ncruces 2 days ago||
Because it allows streaming compression: the compressor only knows the total size once it's finished, so the size can only go at the end.

k__ 2 days ago||
Ah, makes sense.

Thanks!

lkbm 2 days ago|||
I believe it's because you get to stream-compress efficiently, at the cost of stream-decompress efficiency.
8cvor6j844qw_d6 2 days ago||
gzip.py [1]:

  def _read_eof(self):
      # We've read to the end of the file, so we have to rewind in order
      # to reread the 8 bytes containing the CRC and the file size.
      # We check that the computed CRC and size of the
      # uncompressed data matches the stored values. Note that the size
      # stored is the true file size mod 2**32.

[1]: https://stackoverflow.com/a/1704576

djfobbz 2 days ago||
I really like Bun too, but I had a hard time getting it to play nicely with WSL1 on Windows 10 (which I prefer over WSL2). For example:

  ~/: bun install
  error: An unknown error occurred (Unexpected)
lfx 2 days ago|
Why do you prefer WSL1 over WSL2?
tracker1 2 days ago|||
FS calls across the OS boundary are significantly faster in WSL1; that's the biggest example off the top of my head. I prefer WSL2 myself, but I avoid using the /mnt/c/ paths as much as possible, and never, ever run a database (like SQLite) across that boundary: you will regret it.
djfobbz 2 days ago|||
WSL1's just faster, no weird networking issues, and I can edit the Linux files from both Windows and Linux without headaches.
LeicaLatte 2 days ago||
Liking the treatment of package management from first principles, as a systems-level optimization problem rather than file scripting. It resembles a database engine: dependency-aware task scheduling, cache locality, syscall overhead, they're all there.
wojtek1942 2 days ago||
> However, this mode switching is expensive! Just this switch alone costs 1000-1500 CPU cycles in pure overhead, before any actual work happens.

...

> On a 3GHz processor, 1000-1500 cycles is about 500 nanoseconds. This might sound negligibly fast, but modern SSDs can handle over 1 million operations per second. If each operation requires a system call, you're burning 1.5 billion cycles per second just on mode switching.

> Package installation makes thousands of these system calls. Installing React and its dependencies might trigger 50,000+ system calls: that's seconds of CPU time lost to mode switching alone! Not even reading files or installing packages, just switching between user and kernel mode.

Am I missing something, or is this incorrect? They claim 500ns per syscall with 50k syscalls. 500ns * 50000 = 25 milliseconds. So that is very far from "seconds of CPU time lost to mode switching alone!", right?
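
A quick sketch to check the arithmetic and ballpark real syscall cost (my own illustration; absolute numbers vary by machine and kernel):

  import os, time

  # The article's own numbers: 500 ns/switch * 50,000 syscalls = 25 ms.
  print(500e-9 * 50_000)  # 0.025 seconds: milliseconds, not seconds

  # Rough empirical upper bound on a cheap syscall's cost
  # (includes Python interpreter overhead on top of the mode switch):
  N = 100_000
  t0 = time.perf_counter()
  for _ in range(N):
      os.stat(".")
  dt = time.perf_counter() - t0
  print(f"{dt / N * 1e9:.0f} ns per os.stat call")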

Bolwin 2 days ago|
Read further. In one of the later benchmarks, yarn makes 4 million syscalls.

Still only about 2 secs, but still.

phildougherty 2 days ago||
Bun is FUN to say.
swyx 2 days ago||
i'm curious why Yarn is that much slower than npm? what's the opposite of this article?
yahoozoo 1 day ago||
Good article but it sounds a lot like AI wrote it.
WesolyKubeczek 2 days ago||
macOS has hardlinks too. Why not use them?
moffkalast 2 days ago||
Anyone else whose first association is https://xkcd.com/1682 instead of, you know, bread?