Posted by kwantaz 10/27/2024
The author is using Microsoft's git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667
It more or less replaces the --full-name-hash option (again, a very good cover letter that explains the differences and the pros/cons of each!)
"EEE" isn't a magic incantation, it's the name of an actual policy with actual tangible steps that their executives were implementing back when the CEO thought open source was the greatest threat to their business model.
Microsoft contributing to a project doesn't automatically make it EEE. For one thing, EEE was about adopting open standards in proprietary software. Microsoft during the EEE era didn't publish GPL code like this.
that's how effective EEE is: you don't know (or likely care) where MS ripped all this code from.
This is not an accident. It's the point.
If Microsoft wants to develop some proprietary extensions for VSCode, that's fine; everyone has that right. It has nothing to do with EEE.
The fix described in this post has been submitted as a patch to the official Git project. The fix improves a legitimate inefficiency in Git, and does nothing towards "embracing", "extending", or "extinguishing" anything.
I'm not saying this shouldn't be merged, but I think people should be aware and see the early signs.
https://lore.kernel.org/git/7d43a1634bbe2d2efa96a806e3de1f1f...
it will look good, until the extensions get more and more proprietary, but absurdly useful.
Right now the most important thing for them is for people to start thinking the microsoft fork is the superior one, even if things are “backported”.
VS Code is the most common example people have, but it's not the same: that's always been their project, so while I don't love how some things like Pylance are not open, it's not like those were ever promised as such, and the core product runs like a normal corporate open source project. It's not like they forked Emacs and started breaking compatibility to prevent people from switching back and forth. One key conceptual difference is that code is inherently portable, unlike office documents 30 years ago: if VSC started charging, you could switch to any of dozens of other editors in seconds with no changes to your source code.
I would recommend thinking about your comment in the context of the boy who cried wolf. If people trot out EEE every time Microsoft does something in open source, it’s just lowering their credibility among anyone who didn’t say Micro$oft in the 90s and we’ll feel that loss when there’s a real problem.
* SMTP
* Kerberos (there was a time you could use MIT Kerberos with Windows, because AD is just Kerberos v5 with extensions; now you have to use AD).
* HTML (ActiveX, etc.)
* CalDAV / CardDAV
* Java's portability breakage
* MSN and AOL compatibility.
"oh, but it's not the same". It never is, which is why I didn't want to give examples and preferred you speak to someone who knows the history, rather than a tiny internet comment that is unable to convey proper context.
There are so many good things to criticize Microsoft for. When this is what people come with, it serves as a signal of emotion-based ignorance, and a signal to ignore them.
But you'll also be wrong a lot of the time.
This is not the Extend in EEE. We might get there, and we should be generally wary of Microsoft, but this doesn't show that we're already there.
For more examples I would consult your local greybeard. The pattern is broad enough that you can reasonably argue "this time, it's different", which is also what you hear every single time it happens.
Microsoft Embraced by making VSCode free and open source. Then they Extended by using their resources to make VSCode the go-to open source IDE/editor for most use cases and languages, killing much of the development momentum for non-VSCode-based alternatives. Now they're Extinguishing the competition by making it harder and harder to use the ostensibly open source VSCode codebase to build competing tools.
> Embrace: Development of software substantially compatible with an Open Standard.
> Extend: Addition of features not supported by the Open Standard, creating interoperability problems.
> Extinguish: When extensions become a de facto standard because of their dominant market share, they marginalize competitors who are unable to support the new extensions.
As I see it, there's no open standard that Microsoft is rendering proprietary through VSCode. VSCode is their own product.
I see your point that VSCode may have stalled development of other open source editors, and has proprietary extensions... but I don't really think EEE fits. It's just competition.
I'm not totally sold on embrace-extend-extinguish here, but learning about this case was eyebrow-raising for me.
C# Dev Kit, however, is under the VS license. It builds on top of the base C# extension; the core features like the debugger, language server, auto-complete, and auto-fixer integration are in the base extension.
I find these insightful reminders. Use the vanilla free versions if the difference is negligible.
What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
I read that as an anecdote; a more complete sentence would be "We had a story where someone from Europe couldn't clone the whole repo on his laptop to use on a journey across Europe, because his disk was full at the time. He has since cleared up the disk and is able to clone the repo".
I don't think it points to a larger issue with Europe not being able to handle 180GB files... at least I surely hope not.
(Bonus game: count the number of annual zero days they’re exposed to because each of those vendors still ships 90s-style C code)
Every once in a while, my router used to go crazy with what seemed like packet loss (I think a memory issue).
Normal websites would become super slow for any PC or phone in the house.
But git… git would fail to clone anything not really small.
My fix was to unplug the modem and router and plug back in. :)
It took a long time to discover that the router was reporting packet loss, that the slowness the browsers were experiencing had to do with some retries, and that git just crapped out.
Eventually when git started misbehaving I restarted the router to fix.
And now I have a new router. :)
After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.
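For flavor, here's a minimal sketch in Go of the kind of check that was missing; this is purely illustrative, nothing to do with npm's actual code (and Go's stdlib largely enforces this for you):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetch downloads a URL and refuses to return (or cache) a body whose
    // length doesn't match the Content-Length the server advertised.
    func fetch(url string) ([]byte, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            // Connections cut short by packet loss often surface here.
            return nil, err
        }
        // ContentLength is -1 when the server didn't send the header.
        if resp.ContentLength >= 0 && int64(len(body)) != resp.ContentLength {
            return nil, fmt.Errorf("truncated response: got %d bytes, want %d",
                len(body), resp.ContentLength)
        }
        return body, nil
    }

    func main() {
        if _, err := fetch("https://example.com/"); err != nil {
            fmt.Println("download failed:", err)
        }
    }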
Thankfully, we do our work almost entirely in shallow clones inside codespaces, so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing a similar size blowout in our repo and can be fixed.
They might be in a country with underdeveloped internet infrastructure, e.g. Germany))
Both countries are behind e.g. Sweden or Russia, but Germany by a much larger margin.
There's some trickery done in official statistics (e.g. by factoring in private connections that are unavailable to consumers) to make this seem better than it is, but ask anyone who lives there and you'll be surprised.
The explanation probably got lost among all the gifs, but the last 16 chars here are different:
> was actually only checking the last 16 characters of a filename
> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)
The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.
Indeed the blog's example doesn't quite work, because the common suffix "/CHANGELOG.md" is only 13 chars, so you have to imagine paths with a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just 16 bytes.
Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and compensating by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path-walk API and/or --full-name-hash are the real interesting parts here =)
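To make the window mechanics concrete, here's a toy sketch in Go (not git's actual code) of how a sliding delta-search window interacts with object ordering:

    package main

    import "fmt"

    // Toy model of git's delta search: objects are sorted (by name-hash,
    // size, etc.) and each object is compared only against the previous
    // `window` objects when looking for a delta base. If the sort puts
    // unrelated files next to each other, the good bases fall outside
    // the window and objects get stored (nearly) whole.
    func printDeltaCandidates(sorted []string, window int) {
        for i, obj := range sorted {
            lo := i - window
            if lo < 0 {
                lo = 0
            }
            fmt.Println(obj, "-> candidates:", sorted[lo:i])
        }
    }

    func main() {
        // With bad grouping, v1 and v2 of the same file sit far apart,
        // so with window=1 the compressor never sees them together:
        sorted := []string{"foo/a.txt@v1", "bar/x.txt@v1", "baz/y.txt@v1", "foo/a.txt@v2"}
        printDeltaCandidates(sorted, 1)
    }

Bumping the window just widens that search at the cost of memory and CPU; fixing the grouping puts the right candidates next to each other in the first place.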
No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:
https://github.com/git-for-windows/git/pull/5157/commits/d5c...
Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.
The grouping algorithm puts less weight on each character the further it is from the right-side of the name:
hash = (hash >> 2) + (c << 24)
Hash is 32 bits. Each 8-bit char (from the full path) is in turn added to the 8 most significant bits of the hash, after shifting the previous hash bits right by two (which is why only the final 16 chars affect the final hash). Look at what happens in practice: https://go.dev/play/p/JQpdUGXdQs7
Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
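For those who don't want to click through, the playground code is essentially this (my reconstruction of the same formula):

    package main

    import "fmt"

    // pathHash mimics git's pack name-hash: each byte of the path is
    // added into the top 8 bits of the hash after the previous value is
    // shifted right by two bits, so a character is shifted out of the
    // 32-bit hash entirely after 16 more bytes.
    func pathHash(path string) uint32 {
        var hash uint32
        for i := 0; i < len(path); i++ {
            hash = (hash >> 2) + (uint32(path[i]) << 24)
        }
        return hash
    }

    func main() {
        fmt.Printf("aaa/CHANGELOG.md: %08x\n", pathHash("aaa/CHANGELOG.md"))
        fmt.Printf("zzz/CHANGELOG.md: %08x\n", pathHash("zzz/CHANGELOG.md"))
        // The leading "aaa" vs "zzz" barely moves the result; two very
        // different paths sort right next to each other.
    }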
If we interpret it that way, that also explains why the path-walk solution solves the problem.
But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.
The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.
Both of these are less than half of the initial size. It would be great if there were a way to get GitHub to run these, and even better if people started hosting stuff in a way that gives them control over this ...
See also his website: https://github.blog/author/dstolee/
Kudos to Derrick, I learnt so much from those!
> Retroactively, once the file is there though, it's semi stuck in history.
Arguably, the fix for that is to run filter-branch, remove the offending binary, teach and get everyone set up to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.
Far from ideal, but better than having a large not-even-used file in git.
As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.
IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.
It's like hammering a nail through your hand, and then buying a different hammer with a softer handle to make it hurt less.
I don't know anyone who says monorepos are easy.
To the contrary, the tooling is precisely the hard part.
But the point is that the difficulty of the tooling is a lot less than the difficulty of managing compatibility conflicts between tons of separate repos.
Each esoteric bug in C only needs to be fixed once. Whereas your version compatibility conflict this week is going to be followed by another one next week.
And the tooling to handle this is not even particularly conceptually complicated: a "versionset" is a set of versions, i.e. a set of pointers to particular commits of repositories. When you build and deploy an application, what you're building is a versionset containing the correct versions of all its dependencies. And pull requests can span multiple repositories.
Working at Amazon had its annoyances, but dependency management across repos was not one of them.
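A minimal sketch of the concept in Go, purely illustrative (made-up names, not Amazon's actual internal tooling):

    package main

    import "fmt"

    // CommitID identifies one exact commit in a repository.
    type CommitID string

    // VersionSet pins every repository an application depends on to one
    // exact commit, so a build is fully reproducible.
    type VersionSet map[string]CommitID // repository name -> pinned commit

    // Merge overlays an update (e.g. a cross-repo pull request) onto a
    // base version set, returning the new snapshot to build and deploy.
    func (vs VersionSet) Merge(update VersionSet) VersionSet {
        merged := make(VersionSet, len(vs))
        for repo, commit := range vs {
            merged[repo] = commit
        }
        for repo, commit := range update {
            merged[repo] = commit
        }
        return merged
    }

    func main() {
        live := VersionSet{"libfoo": "a1b2c3", "service": "d4e5f6"}
        pr := VersionSet{"libfoo": "0ff1ce", "service": "c0ffee"} // one PR, two repos
        fmt.Println(live.Merge(pr))
    }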
This bit is doing a lot of work here.
How do you make commits atomic? Is there a central commit queue? Do you run the tests of every dependent repo? How do you track cross-repo dependencies to do that? Is there a central database? How do you manage rollbacks?
I assume you meant to write "is" there?
If you really dig down into why we code the way we do, about half of the "best practices" in software development are heavily influenced by merge conflicts, if not primarily caused by them.
If I group like functions together in a large file, then I (probably) won't conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we'll conflict. As long as one of us does the right thing, everything is fine.
I've been watching all the recent Git Merge talks put up by GitButler and following the monorepo / scaling developments: lots of great things being put out there by Microsoft, GitHub, and GitLab.
I'd like to understand this last-16-chars vs full-path check issue better. How does it fit in with delta compression, pack indexes, multi-pack indexes, etc.?
Are they going to be opening a merge request to get their custom git command back in git proper then?