Posted by birdculture 12/26/2025
If it didn't work, we wouldn't have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
Homebrew uses OCI as its backend now, and I think every package manager should. It has the right primitives you'd expect from a registry that has to scale.
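For anyone curious what that looks like from the client side, it's just plain HTTP against the OCI distribution API. A rough Python sketch against Homebrew's public registry on ghcr.io (the formula path and tag are illustrative, not a real pin):

    import json
    import urllib.request

    # OCI distribution API sketch against Homebrew's public registry on ghcr.io.
    # The formula path and tag below are illustrative assumptions.
    registry = "ghcr.io"
    repo = "homebrew/core/wget"   # assumed formula path
    tag = "latest"                # assumed tag

    # 1. Anonymous pull token (the standard token-auth flow for public repositories).
    token_url = f"https://{registry}/token?service={registry}&scope=repository:{repo}:pull"
    token = json.load(urllib.request.urlopen(token_url))["token"]

    # 2. One small, cacheable GET for the manifest index of that tag.
    req = urllib.request.Request(
        f"https://{registry}/v2/{repo}/manifests/{tag}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.oci.image.index.v1+json",
        },
    )
    index = json.load(urllib.request.urlopen(req))

    # 3. Every entry is a content-addressed blob the client can fetch and cache by digest.
    for manifest in index.get("manifests", []):
        print(manifest["digest"], manifest.get("platform"))

Content addressing plus plain HTTP is also what lets CDNs and mirrors do the heavy lifting for you.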
Sqlite data is paged and so you can get away with only fetching the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
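The trick, roughly, is that a SQLite file on a static host can be read with HTTP Range requests, so a query only pulls the pages it touches. A toy sketch of the idea (the URL is a placeholder):

    import urllib.request

    # Toy sketch: read individual byte ranges of a remotely hosted SQLite file.
    # A real implementation (like sql.js-httpvfs) plugs this into a SQLite VFS.
    DB_URL = "https://example.com/packages.sqlite3"  # placeholder

    def read_range(start, length):
        req = urllib.request.Request(
            DB_URL, headers={"Range": f"bytes={start}-{start + length - 1}"}
        )
        return urllib.request.urlopen(req).read()

    # The first 100 bytes are the SQLite header; bytes 16-17 hold the page size (big-endian).
    header = read_range(0, 100)
    page_size = int.from_bytes(header[16:18], "big")

    # Page N then lives at offset (N - 1) * page_size, so only the pages a query
    # actually touches ever cross the network.
    print("page size:", page_size)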
But that's different from how you collect the data in a git repository in the first place - or are you suggesting just putting a Sqlite file in a git repository? If so I can think of one big reason against that.
But if you are, I wouldn't recommend it.
PRs won't be able to show diffs. Worse, as soon as multiple people send a PR at once you'll have a really painful merge to resolve, and GitHub's tools won't help you at all. And you can't edit the files in GitHub's web UI.
I recommend one file per record, JSON, YAML, whatever non-binary format you want. Then you get:
* PRs with diffs that show you what's being changed
* Files that technical people can edit directly in GitHub's web editor
* If 2 people make PRs on different records at once it's an easy merge with no conflicts
* If 2 people make PRs on the same record at once ... ok, you might now have a merge conflict to resolve, but it's in an easy text file and GitHub's UI will let you see what it is.
You can of course then compile these data files into a SQLite file that can be served nicely from a static website - in fact, if you see my other comments on this post, I have a tool that does this. And on that note, sorry, I've done a few projects in this space so I have views :-)
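For a sense of scale, the compile step really can be tiny. A rough sketch (not my actual tool; the directory layout and schema are just assumptions):

    import json
    import sqlite3
    from pathlib import Path

    # Flatten one-JSON-file-per-record into a single SQLite file for static hosting.
    # The layout and schema here are assumptions, not the tool mentioned above.
    records_dir = Path("records")           # e.g. records/<name>.json, one file per record
    db = sqlite3.connect("packages.db")
    db.execute("CREATE TABLE IF NOT EXISTS packages (name TEXT PRIMARY KEY, data TEXT)")

    for path in sorted(records_dir.glob("*.json")):
        record = json.loads(path.read_text())
        db.execute(
            "INSERT OR REPLACE INTO packages (name, data) VALUES (?, ?)",
            (path.stem, json.dumps(record, sort_keys=True)),
        )

    db.commit()
    db.close()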
Could even follow your record model, and use that as data to populate the db.
Cut out the middle man, directly serve the query response to the package manager client.
(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all)
Anything where you are opening a TCP connection to a hosted SQL server is a non-starter. You could hypothetically have so many read replicas that no one could blow anyone else up, but this would get to be very expensive at scale.
Something involving SQLite is probably the most viable option.
Also, Stack Overflow exposes a SQL interface (the Stack Exchange Data Explorer), so it isn't totally impossible.
https://play.clickhouse.com/
clickhouse-client --host play.clickhouse.com --user play --secure
ssh play.clickhouse.com
All of the complexity lives on the client. That makes a lot of sense for a package manager because it's something lots of people want to run, but no one really wants to host.
It doesn't if you do it properly.
I don't get what is so bad about shallow clones either. Why should they be so performance-sensitive?
If 83GB (4MB/fork) is "too big" then responsibility for that rests solely on the elective centralization encouraged by GitHub. I suspect that if you could total up the cumulative storage used by copies of the nixpkgs source tree on computers spread throughout the world, it would be many orders of magnitude larger.
The solution is simple: using a shallow clone means that the use case doesn’t care about the history at all, so download a tarball of the repo for the initial download and then later rsync the repo. Git can remain the source of truth for all history, but that history doesn’t have to be exposed.
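Concretely, the client-side flow could be as simple as something like this (the snapshot URL and rsync module are placeholders, not real endpoints):

    import subprocess
    import tarfile
    import urllib.request

    # Sketch of the suggested flow: tarball for the first download, rsync for later updates.
    # Initial download: one plain HTTP GET of a snapshot tarball, no history attached.
    with urllib.request.urlopen("https://example.org/repo-snapshot.tar.gz") as resp, \
            open("repo-snapshot.tar.gz", "wb") as out:
        out.write(resp.read())
    with tarfile.open("repo-snapshot.tar.gz") as tar:
        tar.extractall("repo")

    # Later updates: rsync only transfers deltas against the local copy.
    subprocess.run(
        ["rsync", "-az", "--delete", "rsync://mirror.example.org/repo/", "repo/"],
        check=True,
    )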
Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
Consider vcpkg. It’s entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but git does not know how to transfer it efficiently.
Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..
Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.
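For reference, that fetch is nothing more than an HTTP GET of GitHub's archive endpoint for a pinned commit, something like the following (the rev is a placeholder, not a real pin):

    import urllib.request

    # Download a gzipped tarball of one commit's tree over plain HTTP,
    # skipping the Git protocol entirely. The rev below is a placeholder.
    rev = "0123456789abcdef0123456789abcdef01234567"
    url = f"https://github.com/NixOS/nixpkgs/archive/{rev}.tar.gz"

    with urllib.request.urlopen(url) as resp, open("nixpkgs-src.tar.gz", "wb") as out:
        out.write(resp.read())
    # No history, no packfile negotiation - just the files at that revision.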
So where the article says "Package managers keep falling for this. And it keeps not working out", I feel that's untrue.
The biggest issue I have with this really is the "flakes" integration, where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem, not an intrinsic problem of using git.
O(1) beats O(n) as n gets large.