If it didn't work, we wouldn't have these massive ecosystems upsetting GitHub's freemium model. But anything at scale is naturally going to run into consequences and features that aren't such a good fit for the use case.
Personally, my view is that the main problem when projects do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.
There are a few other small problems - but it's interesting to see that so many other projects do this.
I ended up working on an open source software library to help in these cases: https://www.datatig.com/
Here's a write-up of an introductory talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scaling point to future versions of this talk, with a link to this post.
Homebrew uses OCI as its backend now, and I think every package manager should. It has the right primitives you'd expect from a registry that needs to scale.
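For anyone who hasn't poked at it: the Distribution API those registries speak is just a couple of HTTP endpoints. Here's a rough Python sketch of pulling a manifest anonymously - the repository and tag are placeholders, and ghcr.io is only used as an example of a registry that allows anonymous pulls of public packages:

    # Rough sketch of pulling metadata from an OCI registry over the standard
    # Distribution API. Registry, repository, and tag are placeholders.
    import json
    import urllib.request

    REGISTRY = "ghcr.io"
    REPO = "homebrew/core/wget"   # example repository; any public OCI repo works
    TAG = "latest"                # example tag

    # 1. Anonymous pull token (standard Docker/OCI token flow).
    token_url = f"https://{REGISTRY}/token?scope=repository:{REPO}:pull"
    token = json.load(urllib.request.urlopen(token_url))["token"]

    # 2. Fetch the manifest (or image index) for the tag.
    req = urllib.request.Request(
        f"https://{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.oci.image.index.v1+json, "
                      "application/vnd.oci.image.manifest.v1+json",
        },
    )
    manifest = json.load(urllib.request.urlopen(req))
    print(json.dumps(manifest, indent=2))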
SQLite data is paged, so you can get away with fetching only the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
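To make the trick concrete: a SQLite file is just an array of fixed-size pages, so a client can pull individual pages with HTTP Range requests instead of downloading the whole database. A minimal Python sketch of that idea follows (the URL is a placeholder; the tools in the linked post wire this into SQLite's VFS layer so ordinary queries drive the fetches):

    # SQLite files are made of fixed-size pages, so a client can pull individual
    # pages with HTTP Range requests. DB_URL is a hypothetical static file.
    import urllib.request

    DB_URL = "https://example.org/packages.sqlite"   # placeholder

    def fetch_range(start, length):
        req = urllib.request.Request(
            DB_URL, headers={"Range": f"bytes={start}-{start + length - 1}"}
        )
        return urllib.request.urlopen(req).read()

    # The first 100 bytes are the SQLite header; bytes 16-17 hold the page size
    # (big-endian), with the special value 1 meaning 65536.
    header = fetch_range(0, 100)
    raw = int.from_bytes(header[16:18], "big")
    page_size = 65536 if raw == 1 else raw

    # Now any page can be fetched on demand, e.g. page 1 (root of the schema).
    page_1 = fetch_range(0, page_size)
    print(f"page size: {page_size}, fetched {len(page_1)} bytes")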
But that's different from how you collect the data in a git repository in the first place - or are you suggesting just putting a SQLite file in a git repository? If so, I can think of one big reason against that.
But if you are, I wouldn't recommend it.
PRs won't be able to show diffs. Worse, as soon as multiple people send a PR at once you'll have a really painful merge to resolve, and GitHub's tools won't help you at all. And you can't edit the files in GitHub's web UI.
I recommend one file per record - JSON, YAML, whatever non-binary format you want. Then you get:
* PRs with diffs that show you what's being changed
* Files that technical people can edit directly in GitHub's web editor
* If 2 people make PRs on different records at once, it's an easy merge with no conflicts
* If 2 people make PRs on the same record at once ... ok, you might now have a merge conflict to resolve, but it's in an easy text file and GitHub's UI will let you see what it is.
You can of course then compile these data files into a SQLite file that can be served nicely from a static website - in fact, if you see my other comments on this post, I have a tool that does this. And on that note, sorry, I've done a few projects in this space so I have views :-)
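To be clear, this isn't that tool - just a minimal sketch of what the compile step can look like, assuming a records/ directory with one JSON file per record (the layout and fields are illustrative):

    # Compile one-JSON-file-per-record into a single SQLite file that a static
    # site can serve. The records/ layout is an assumption for illustration.
    import json
    import sqlite3
    from pathlib import Path

    conn = sqlite3.connect("packages.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages (id TEXT PRIMARY KEY, data TEXT)"
    )

    for path in sorted(Path("records").glob("*.json")):
        record = json.loads(path.read_text())
        conn.execute(
            "INSERT OR REPLACE INTO packages (id, data) VALUES (?, ?)",
            (path.stem, json.dumps(record)),
        )

    conn.commit()
    conn.close()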
Cut out the middleman and serve the query response directly to the package manager client.
(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all)
Anything where you are opening a TCP connection to a hosted SQL server is a non-starter. You could hypothetically have so many read replicas that no one could blow anyone else up, but this would get to be very expensive at scale.
Something involving SQLite is probably the most viable option.
Also, Stack Overflow exposes a SQL interface, so it isn't totally impossible.
https://play.clickhouse.com/
clickhouse-client --host play.clickhouse.com --user play --secure
ssh play.clickhouse.com
All of the complexity lives on the client. That makes a lot of sense for a package manager, because it's something lots of people want to run but no one really wants to host.
I don't get what is so bad about shallow clones either. Why should they be so performance sensitive?
If 83 GB (4 MB/fork) is "too big", then responsibility for that rests solely on the elective centralization encouraged by GitHub. I suspect that if you totaled up the cumulative storage used by the nixpkgs source tree on computers spread throughout the world, it would be many orders of magnitude larger.
The solution is simple: using a shallow clone means the use case doesn't care about the history at all, so serve a tarball of the repo for the initial download and then rsync the repo for later updates. Git can remain the source of truth for all history, but that history doesn't have to be exposed.
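A rough sketch of that flow, assuming the project published a plain tarball snapshot and an rsync endpoint (both URLs below are hypothetical):

    # Tarball for the first download, rsync for later updates. The snapshot URL
    # and rsync module are hypothetical - this assumes the project serves both.
    import io
    import subprocess
    import tarfile
    import urllib.request
    from pathlib import Path

    CHECKOUT = Path("nixpkgs")
    SNAPSHOT_URL = "https://example.org/nixpkgs-snapshot.tar.gz"  # hypothetical
    RSYNC_MODULE = "rsync://mirror.example.org/nixpkgs/"          # hypothetical

    if not CHECKOUT.exists():
        # Initial download: unpack a snapshot instead of cloning any history.
        with urllib.request.urlopen(SNAPSHOT_URL) as resp:
            with tarfile.open(fileobj=io.BytesIO(resp.read()), mode="r:gz") as tar:
                tar.extractall(CHECKOUT)
    else:
        # Later updates: rsync only moves the files that changed.
        subprocess.run(
            ["rsync", "-a", "--delete", RSYNC_MODULE, str(CHECKOUT)],
            check=True,
        )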
What the... why would you run an autoupdate every 5 minutes?
Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
Consider vcpkg. It's entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but it does not know how to transfer it efficiently.
Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..
Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.
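For reference, the consumer side of that is just an HTTP download of GitHub's archive of one commit, with no Git protocol involved. A sketch, with the commit hash left as a placeholder:

    # Fetch GitHub's tarball of a single commit over plain HTTP.
    # REV is a placeholder - substitute a real pinned commit hash (a branch or
    # tag name also works with this archive URL format).
    import urllib.request

    OWNER, REPO = "NixOS", "nixpkgs"
    REV = "<pinned-commit-hash>"  # placeholder

    url = f"https://github.com/{OWNER}/{REPO}/archive/{REV}.tar.gz"
    with urllib.request.urlopen(url) as resp, open(f"{REPO}-{REV}.tar.gz", "wb") as out:
        out.write(resp.read())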
So where the article says "Package managers keep falling for this. And it keeps not working out" - I feel that's untrue.
The biggest issue I have with this really is the "flakes" integration, where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem, not an intrinsic problem of using git.