Posted by kwantaz 9 hours ago
- When multiple files in the repo share the same trailing 16 characters in their repo paths, git may calculate deltas poorly, mixing those files up with each other. Here they had multiple CHANGELOG.md files being confused with one another.
- So if those files are big and change often, you end up with massive deltas and an inflated repo size.
- There's a new git option (in Microsoft's git fork for now) and a config setting to use the full file path when calculating those deltas, which fixes the issue both when pushing and when repacking the repo locally:
```
git repack -adf --path-walk
git config --global pack.usePathWalk true
```
- According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.
- However, AFAIU, until GitHub enables it by default, clones of such repos from GitHub will still be inflated.
Also, thank you for the TLDR!
Fixing an existing repository requires a full repack, and for a repository as big as Chromium that still takes more than half a day (56,000 seconds is about 15.5 hours). Even if that's an improvement over the previous 3 days, it's a lot of compute.
From my experience of previous attempts, getting GitHub to run a full repack with aggressive settings is extremely difficult (possibly because their infrastructure relies on more loosely packed repositories). I tried to get that done for $dayjob's primary repository, whose initial checkout had gotten pretty large, and got nowhere.
As of right now, said repository is ~9.5GB on disk on initial clone (full, not partial, excluding working copy). Locally running `repack -adf --window 250` brings it down to ~1.5GB, at the cost of a few hours of CPU.
The repository does have some of the attributes described in TFA, so I'm definitely looking forward to trying these changes out.
For now we’re getting by with partial clones, and employee machines being imaged with a decently up to date repository.
The author is using Microsoft's git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667
The fix described in this post has been submitted as a patch to the official Git project. It improves a legitimate inefficiency in Git, and does nothing towards "embracing", "extending", or "extinguishing" anything.
"EEE" isn't a magic incantation, it's the name of an actual policy with actual tangible steps that their executives were implementing back when the CEO thought open source was the greatest threat to their business model.
Microsoft contributing to a project doesn't automatically make it EEE. For one thing, EEE was about adopting open standards in proprietary software; the Microsoft of the EEE era didn't publish GPL code like this.
It more or less replaces the --full-name-hash option (again, the cover letter explains the differences and the pros/cons of each very well!)
> We work in a very large Javascript monorepo at Microsoft we colloquially call 1JS.
I used to call it office.com. Teams is the worst offender there. Even a website with a cryptominer on it runs faster than that junk.

Collaborative editing between a web app, two mobile apps, and a desktop app with 30 years of backwards compatibility, and it pretty much just works. No wonder that took a lot of JavaScript!
We've totally given up on any kind of collaborative document editing because it's too frustrating, or we use Notion instead, which, for all its faults, at least handles the basic stuff, like loading a bloody file...
I beg to differ. Last time I had to use PowerPoint (granted, that was ~3 years ago), math on the slides broke when you touched it with a client that wasn't of the same type as the one that initially put it there. So you would need to use either the web app or the desktop app to edit it, but you couldn't switch between them. Since we were working on the slides with multiple people you also never knew what you had to use if someone else wrote that part initially.
It really is. With shared documents you just have to give up. If someone edits them on the web, in Teams, in the actual app or some other way like on iOS, it all goes to hell.
Pages get added or removed, images jump about, fonts change and various other horrors occur.
If you care, you’ll get ground into the earth.
I think they didn't deliver many new features in the early 2020s because they were busy with a big refactoring from DOM to canvas rendering [0].
To the point where they quickly found the flaws in JS for large codebases and came up with TypeScript. I think. It makes sense that TS came out of the Office for the web project.
Just a note: OMR (the Office monorepo) is a different (and actually much larger) monorepo than 1JS (which is big on its own).
To be fair I suspect a lot of the bloat in both originates from the amount of home grown tooling.
The explanation probably got lost among all the gifs, but the last 16 chars here are different:
> was actually only checking the last 16 characters of a filename

> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)
The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.
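To make the mechanism concrete, here's a rough Python port of `pack_name_hash()` from Git's pack-objects.h (the example paths are made up); because each character shifts the accumulated hash right by two bits, characters more than ~16 positions from the end of the path stop influencing the result:

```python
# Rough Python port of pack_name_hash() from Git's pack-objects.h.
# Each character shifts the accumulated hash right by 2 bits before
# adding its own contribution in the top byte, so characters more than
# ~16 positions from the end of the path are shifted out entirely.

def pack_name_hash(path: str) -> int:
    h = 0
    for c in path:
        if c.isspace():  # the real function skips whitespace as well
            continue
        h = ((h >> 2) + (ord(c) << 24)) & 0xFFFFFFFF  # wrap like uint32_t
    return h

# Hypothetical paths sharing a long trailing suffix hash identically,
# so Git groups such files together as delta candidates for each other:
a = pack_name_hash("repo/packages/foo/components/CHANGELOG.md")
b = pack_name_hash("repo/packages/bar/components/CHANGELOG.md")
print(a == b)  # True

# Short paths keep their differing characters within range of the hash:
print(pack_name_hash("a.txt") == pack_name_hash("b.txt"))  # False
```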
Indeed, the example in the blog doesn't work as written, because foo/CHANGELOG.md and bar/CHANGELOG.md share a common suffix of only 13 characters ("/CHANGELOG.md"); you have to imagine paths with a longer common suffix. That part is fixed by the --full-name-hash option: now the full path is compared instead of just 16 bytes.
Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix: you're still giving terrible inputs to the compressor and compensating by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path-walk API and/or --full-name-hash are the real interesting parts here =)
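A toy model (not Git's actual code; the setup is invented) of why a bigger --window merely compensates for bad grouping: pack-objects only searches for a delta base among a limited window of previously listed objects, so when versions of the same file end up scattered further apart than the window, they never find each other:

```python
# Toy model of pack-objects' delta search: each object may only be
# delta'd against the previous `window` entries in the sorted list.

def count_undeltified(objects, window):
    """Count objects stored whole because no earlier version of the
    same logical file appears within the preceding `window` entries."""
    missed = 0
    for i, obj in enumerate(objects):
        if obj not in objects[max(0, i - window):i]:
            missed += 1
    return missed

# 100 files with 10 versions each. Good ordering keeps versions of a
# file adjacent; bad ordering (the hash-collision case) interleaves
# all the files on every round.
good = [f"file{f}" for f in range(100) for _ in range(10)]
bad = [f"file{f}" for _ in range(10) for f in range(100)]

print(count_undeltified(good, 10))   # 100  (each file's first version)
print(count_undeltified(bad, 10))    # 1000 (nothing finds a base)
print(count_undeltified(bad, 250))   # 100  (a huge window compensates)
```

With good grouping a small window already suffices; with bad grouping you pay memory and CPU for a window big enough to span the scatter.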
If we interpret it that way, that also explains why the path-walk solution fixes the problem.
But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.
What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
They might be in a country with underdeveloped internet infrastructure, e.g. Germany))
I read that as an anecdote; a more complete sentence would be "We had a story where someone from Europe couldn't clone the whole repo onto his laptop for a journey across Europe, because his disk was full at the time. He has since cleared up the disk and is able to clone the repo".
I don't think it points to a larger issue with Europe not being able to handle 180GB files... I surely hope not.
Officer, I'd like to report a murder committed in a side note!
https://github.blog/author/dstolee/
See also his website:
Kudos to Derrick, I learnt so much from those!
What’s wrong with Europeans? Do they have smaller disks?
What aspects of Azure DevOps are hell to you?