Posted by kwantaz 10/27/2024
The author is using Microsoft's git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667
It more or less replaces the --full-name-hash option (again, a very good cover letter that explains the differences and the pros/cons of each!)
"EEE" isn't a magic incantation, it's the name of an actual policy with actual tangible steps that their executives were implementing back when the CEO thought open source was the greatest threat to their business model.
Microsoft contributing to a project doesn't automatically make it EEE. For one thing, EEE was about adopting open standards in proprietary software. Microsoft during the EEE era didn't publish GPL code like this.
that's how effective EEE is: you don't know (or likely care) where MS ripped all this code from.
This is not an accident. It's the point.
If Microsoft wants to develop some proprietary extensions for VSCode, that's fine; everyone has that right. It has nothing to do with EEE.
The fix described in this post has been submitted as a patch to the official Git project. The fix improves a legitimate inefficiency in Git, and does nothing towards "embracing", "extending", or "extinguishing" anything.
I'm not saying this shouldn't be merged, but I think people should be aware and see the early signs.
https://lore.kernel.org/git/7d43a1634bbe2d2efa96a806e3de1f1f...
it will look good, until the extensions get more and more proprietary, but absurdly useful.
Right now the most important thing for them is for people to start thinking the microsoft fork is the superior one, even if things are “backported”.
VS Code is the most common example people have, but it's not the same: that's always been their project, so while I don't love how some things like Pylance are not open, it's not like those were ever promised as such, and the core product runs like a normal corporate open source project. It's not like they forked Emacs and started breaking compatibility to prevent people from switching back and forth. One key conceptual difference is that code is inherently portable, unlike office documents 30 years ago: if VSC started charging, you could switch to any of dozens of other editors in seconds with no changes to your source code.
I would recommend thinking about your comment in the context of the boy who cried wolf. If people trot out EEE every time Microsoft does something in open source, it’s just lowering their credibility among anyone who didn’t say Micro$oft in the 90s and we’ll feel that loss when there’s a real problem.
* SMTP
* Kerberos (there was a time you could use MIT Kerberos with Windows, because AD is just Kerberos v5 with extensions; now you have to use AD).
* HTML (ActiveX, etc.)
* CalDAV / CardDAV
* Java's portability breakage
* MSN and AOL compatibility.
"oh, but it's not the same". It never is, which is why I didn't want to give examples and preferred you speak to someone who knows the history, rather than a tiny internet comment that is unable to convey proper context.
There are so many good things to criticize Microsoft for. When this is what people come with, it serves as a signal of emotion-based ignorance, and a signal to ignore them.
But you'll also be wrong a lot of the time.
This is not the Extend in EEE. We might get there, and we should be generally wary of Microsoft, but this doesn't show that we're already there.
For more examples I would consult your local greybeard. The pattern is broad enough that you can reasonably argue "this time, it's different", which is also what you hear every single time it happens.
Microsoft Embraced by making VSCode free and open source. Then they Extended by using their resources to make VSCode the go-to open source IDE/editor for most use cases and languages, killing much of the development momentum for non-VSCode-based alternatives. Now they're Extinguishing the competition by making it harder and harder to use the ostensibly open source VSCode codebase to build competing tools.
> Embrace: Development of software substantially compatible with an Open Standard.
> Extend: Addition of features not supported by the Open Standard, creating interoperability problems.
> Extinguish: When extensions become a de facto standard because of their dominant market share, they marginalize competitors who are unable to support the new extensions.
As I see it, there's no open standard that Microsoft is rendering proprietary through VSCode. VSCode is their own product.
I see your point that VSCode may have stalled development of other open source editors, and has proprietary extensions... but I don't really think EEE fits. It's just competition.
I'm not totally sold on embrace-extend-extinguish here, but learning about this case was eyebrow-raising for me.
C# Dev Kit, however, is under the VS license. It builds on top of the base C# extension; the core features like the debugger, language server, auto-complete, and auto-fixer integration are in the base extension.
I find these insightful reminders. Use the vanilla free versions if the difference is negligible.
What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
I read that as an anecdote; a more complete sentence would be "We had a story where someone from Europe couldn't clone the whole repo on his laptop to use on a journey across Europe, because his disk was full at the time. He has since cleared up the disk and is able to clone the repo".
I don't think it points to a larger issue with Europe not being able to handle 180GB files... at least I surely hope not.
(Bonus game: count the number of annual zero days they’re exposed to because each of those vendors still ships 90s-style C code)
Every once in a while, my router used to go crazy with what seemed like packet loss (I think a memory issue).
Normal websites would become super slow for any PC or phone in the house.
But git… git would fail to clone anything not really small.
My fix was to unplug the modem and router and plug back in. :)
It took a long time to discover that the router was reporting packet loss, that the slowness the browsers were experiencing had to do with some retries, and that git just crapped out.
Eventually when git started misbehaving I restarted the router to fix.
And now I have a new router. :)
After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.
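For flavor, here's a minimal sketch in Go of the kind of check that was missing; this is purely illustrative, nothing to do with npm's actual code (and Go's stdlib largely enforces this for you):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetch downloads a URL and refuses to return (or cache) a body whose
    // length doesn't match the Content-Length the server advertised.
    func fetch(url string) ([]byte, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            // Connections cut short by packet loss often surface here.
            return nil, err
        }
        // ContentLength is -1 when the server didn't send the header.
        if resp.ContentLength >= 0 && int64(len(body)) != resp.ContentLength {
            return nil, fmt.Errorf("truncated response: got %d bytes, want %d",
                len(body), resp.ContentLength)
        }
        return body, nil
    }

    func main() {
        if _, err := fetch("https://example.com/"); err != nil {
            fmt.Println("download failed:", err)
        }
    }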
Thankfully, we do our work almost entirely in shallow clones inside codespaces, so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing a similar size blowout in our repo and can be fixed.
They might be in a country with underdeveloped internet infrastructure, e.g. Germany))
Both countries are behind e.g. Sweden or Russia, but Germany by a much larger margin.
There's some trickery done in official statistics (e.g. by factoring in private connections that are unavailable to consumers) to make this seem better than it is, but ask anyone who lives there and you'll be surprised.
The explanation probably got lost among all the gifs, but the last 16 chars here are different:
> was actually only checking the last 16 characters of a filename
> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)
The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.
Indeed the blog's example doesn't quite work, because the common suffix "/CHANGELOG.md" is only 13 chars, so you have to imagine paths with a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just 16 bytes.
Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and compensating by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path-walk API and/or --full-name-hash are the real interesting parts here =)
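To make the window mechanics concrete, here's a toy sketch in Go (not git's actual code) of how a sliding delta-search window interacts with object ordering:

    package main

    import "fmt"

    // Toy model of git's delta search: objects are sorted (by name-hash,
    // size, etc.) and each object is compared only against the previous
    // `window` objects when looking for a delta base. If the sort puts
    // unrelated files next to each other, the good bases fall outside
    // the window and objects get stored (nearly) whole.
    func printDeltaCandidates(sorted []string, window int) {
        for i, obj := range sorted {
            lo := i - window
            if lo < 0 {
                lo = 0
            }
            fmt.Println(obj, "-> candidates:", sorted[lo:i])
        }
    }

    func main() {
        // With bad grouping, v1 and v2 of the same file sit far apart,
        // so with window=1 the compressor never sees them together:
        sorted := []string{"foo/a.txt@v1", "bar/x.txt@v1", "baz/y.txt@v1", "foo/a.txt@v2"}
        printDeltaCandidates(sorted, 1)
    }

Bumping the window just widens that search at the cost of memory and CPU; fixing the grouping puts the right candidates next to each other in the first place.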
No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:
https://github.com/git-for-windows/git/pull/5157/commits/d5c...
Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.
The grouping algorithm puts less weight on each character the further it is from the right-side of the name:
hash = (hash >> 2) + (c << 24)
Hash is 32 bits. Each 8-bit char (from the full path) is in turn added to the 8 most significant bits of the hash, after shifting the previous hash bits right by two (which is why only the final 16 chars affect the final hash). Look at what happens in practice: https://go.dev/play/p/JQpdUGXdQs7
Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
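For those who don't want to click through, the playground code is essentially this (my reconstruction of the same formula):

    package main

    import "fmt"

    // pathHash mimics git's pack name-hash: each byte of the path is
    // added into the top 8 bits of the hash after the previous value is
    // shifted right by two bits, so a character is shifted out of the
    // 32-bit hash entirely after 16 more bytes.
    func pathHash(path string) uint32 {
        var hash uint32
        for i := 0; i < len(path); i++ {
            hash = (hash >> 2) + (uint32(path[i]) << 24)
        }
        return hash
    }

    func main() {
        fmt.Printf("aaa/CHANGELOG.md: %08x\n", pathHash("aaa/CHANGELOG.md"))
        fmt.Printf("zzz/CHANGELOG.md: %08x\n", pathHash("zzz/CHANGELOG.md"))
        // The leading "aaa" vs "zzz" barely moves the result; two very
        // different paths sort right next to each other.
    }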
If we interpret it that way, that also explains why the path-walk solution solves the problem.
But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.
The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.
Both of these are less than half of the initial size. It would be great if there were a way to get GitHub to run these, and even better if people started hosting stuff in a way that gives them control over this ...
See also his website: https://github.blog/author/dstolee/
Kudos to Derrick, I learnt so much from those!
> Retroactively, once the file is there though, it's semi stuck in history.
Arguably, the fix for that is to run filter-branch, remove the offending binary, teach and get everyone set up to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.
Far from ideal, but better than having a large not-even-used file in git.
As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.
IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.
It's like hammering a nail through your hand, and then buying a different hammer with a softer handle to make it hurt less.
I don't know anyone who says monorepos are easy.
To the contrary, the tooling is precisely the hard part.
But the point is that the difficulty of the tooling is a lot less than the difficulty of managing compatibility conflicts between tons of separate repos.
Each esoteric bug in C only needs to be fixed once. Whereas your version compatibility conflict this week is going to be followed by another one next week.
And the tooling to handle this is not even particularly conceptually complicated: a "versionset" is a set of versions, i.e. a set of pointers to particular commits of repositories. When you build and deploy an application, what you're building is a versionset containing the correct versions of all its dependencies. And pull requests can span multiple repositories.
Working at Amazon had its annoyances, but dependency management across repos was not one of them.
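A minimal sketch of the concept in Go, purely illustrative (made-up names, not Amazon's actual internal tooling):

    package main

    import "fmt"

    // CommitID identifies one exact commit in a repository.
    type CommitID string

    // VersionSet pins every repository an application depends on to one
    // exact commit, so a build is fully reproducible.
    type VersionSet map[string]CommitID // repository name -> pinned commit

    // Merge overlays an update (e.g. a cross-repo pull request) onto a
    // base version set, returning the new snapshot to build and deploy.
    func (vs VersionSet) Merge(update VersionSet) VersionSet {
        merged := make(VersionSet, len(vs))
        for repo, commit := range vs {
            merged[repo] = commit
        }
        for repo, commit := range update {
            merged[repo] = commit
        }
        return merged
    }

    func main() {
        live := VersionSet{"libfoo": "a1b2c3", "service": "d4e5f6"}
        pr := VersionSet{"libfoo": "0ff1ce", "service": "c0ffee"} // one PR, two repos
        fmt.Println(live.Merge(pr))
    }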
This bit is doing a lot of work here.
How do you make commits atomic? Is there a central commit queue? Do you run the tests of every dependent repo? How do you track cross-repo dependencies to do that? Is there a central database? How do you manage rollbacks?
I assume you meant to write "is" there?
If you really dig down into why we code the way we do, about half of the "best practices" in software development are heavily influenced by merge conflicts, if not primarily caused by them.
If I group like functions together in a large file, then I (probably) won't conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we'll conflict. As long as one of us does the right thing, everything is fine.
I've been watching all the recent Git Merge talks put up by GitButler and following the monorepo / scaling developments: lots of great things being put out there by Microsoft, GitHub, and GitLab.
I'd like to understand this last-16-chars vs full-path check issue better. How does it fit in with delta compression, pack indexes, multi-pack indexes, etc.?
Are they going to be opening a merge request to get their custom git command back in git proper then?