Posted by birdculture 1 day ago

Package managers keep using Git as a database, it never works out (nesbitt.io)
681 points | 381 comments | page 2
hogrug 1 day ago|
The facts are interesting but the conclusion is a bit strange. These package managers have succeeded because git is better suited to the low-trust model, and GitHub has been hosting infrastructure for free that no one in their right mind would provide for the average DB.

If it didn't work, we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.

jarofgreen 1 day ago||
It's not just package managers that do this - a lot of smaller projects crowdsource data in git repositories. Most of these don't reach the scale where the technical limitations become a problem.

Personally, my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.

There are a few other small problems - but it's interesting to see that so many other projects do this.

I ended up working on an open source software library to help in these cases: https://www.datatig.com/

Here's a write-up of an introductory talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scaling point to future versions of the talk, with a link to this post.

Hasnep 2 hours ago|
Oh, this would have been great for a project I was working on a while ago! I'll have to keep it in mind for the future. Thanks for sharing
ifh-hn 1 day ago||
So what's the answer then? That's the question I wanted answered after reading this article. I have no experience with git or package management, but would using a local SQLite database on the client and something similar on the server do?
encom 1 day ago||
I quite like Gentoo's rsync-based package manager. I believe they've used that since the beginning. It works well.
MarsIronPI 19 hours ago||
To be clear though, the rsync trees come from a central Git repo (though it's not hosted on GitHub). And syncing from Git actually makes syncing faster.
AaronFriel 21 hours ago||
OCI artifacts, using the same protocol as container registries. It's a protocol designed for versioning (tagging) content-addressable blobs, associating metadata with them, and it's CDN-friendly.

Homebrew uses OCI as its backend now, and I think every package manager should. It has the primitives you'd expect from a registry that needs to scale.
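
For a feel of the client side, here's a rough sketch of the OCI distribution API calls involved (the registry URL and package name are made up, and I'm assuming anonymous pulls; real registries like ghcr.io hand out a bearer token first):

    import requests

    REGISTRY = "https://registry.example.com"  # hypothetical registry
    NAME = "myorg/mypackage"                   # hypothetical package repo
    TAG = "1.2.3"                              # OCI tags double as version labels

    # 1. Resolve the tag to a manifest.
    manifest = requests.get(
        f"{REGISTRY}/v2/{NAME}/manifests/{TAG}",
        headers={"Accept": "application/vnd.oci.image.manifest.v1+json"},
    ).json()

    # 2. Fetch the content-addressed blobs it points at; the digests make the
    #    downloads verifiable and trivially cacheable by a CDN.
    for layer in manifest["layers"]:
        blob = requests.get(f"{REGISTRY}/v2/{NAME}/blobs/{layer['digest']}")
        print(layer["digest"], len(blob.content), "bytes")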

dleslie 1 day ago||
GitHub is intoxicatingly free hosting, but Git itself is a terrible database. Why not maintain an _actual_ database on GitHub, with tagged releases?

SQLite data is paged, so you can get away with fetching only the pages you need to resolve your query.

https://phiresky.github.io/blog/2021/hosting-sqlite-database...
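
To make the idea concrete, here's a toy version of what that approach does under the hood - pulling individual pages with HTTP Range requests (the URL is made up and the default 4096-byte page size is assumed):

    import requests

    DB_URL = "https://example.github.io/data/index.sqlite3"  # hypothetical
    PAGE_SIZE = 4096  # SQLite's default page size

    def fetch_page(page_number: int) -> bytes:
        start = (page_number - 1) * PAGE_SIZE
        resp = requests.get(
            DB_URL,
            headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"},
        )
        return resp.content

    # Page 1 holds the file header and the root of the schema table, so a
    # query engine can start resolving a query from there and fetch only the
    # b-tree pages it actually touches.
    print(fetch_page(1)[:16])  # b"SQLite format 3\x00"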

jarofgreen 22 hours ago|
This seems to be about hosting a SQLite database on a static website like GitHub Pages - this can be a great plan; there is also Datasette in a browser now: https://github.com/simonw/datasette-lite

But that's different from how you collect the data in a git repository in the first place - or are you suggesting just putting a SQLite file in a git repository? If so, I can think of one big reason against that.

dleslie 20 hours ago||
Yes, I'm suggesting hosting it on GitHub, leveraging their Git LFS support. Just treat it like a binary blob and periodically update with a tagged release.
jarofgreen 20 hours ago||
It's not clear whether you're suggesting accepting contributions to the SQLite file via PRs from people (but accepting contributions is generally the point of putting these projects on GitHub).

But if you are I wouldn't recommend it.

PRs won't be able to show diffs. Worse, as soon as multiple people send a PR at once you'll have a really painful merge to resolve, and GitHub's tools won't help you at all. And you can't edit the files in GitHub's web UI.

I recommend one file per record - JSON, YAML, whatever non-binary format you want. Then you get:

* PRs with diffs that show you what's being changed

* Files that technical people can edit directly in GitHub's web editor

* If 2 people make PRs on different records at once, it's an easy merge with no conflicts

* If 2 people make PRs on the same record at once ... OK, you might now have a merge conflict to resolve, but it's in an easy text file and GitHub's UI will let you see what it is.

You can of course then compile these data files into a SQLite file that can be served from a static website nicely - in fact, if you see my other comments on this post, I have a tool that does this. And on that note, sorry, I've done a few projects in this space so I have views :-)
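
The compile step really is tiny - something like this (directory layout and column names are invented for the example):

    import json, sqlite3
    from pathlib import Path

    con = sqlite3.connect("data.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, data TEXT)")

    # One JSON file per record, filename used as the record id.
    for path in Path("records").glob("*.json"):
        record = json.loads(path.read_text())
        con.execute(
            "INSERT OR REPLACE INTO records (id, data) VALUES (?, ?)",
            (path.stem, json.dumps(record)),
        )

    con.commit()
    con.close()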

cbondurant 1 day ago||
Admittedly, I try to stay away from database design whenever possible at work. (Everything database-related is legacy for us.) But the way the term is being used here kinda makes me wonder: do modern SQL databases have enough security features and permissions management in place that you could just directly expose your database to the world with a "guest" user that can only make incredibly specific queries?

Cut out the middle man, directly serve the query response to the package manager client.

(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all)
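
Something like this is what I had in mind, sketched with Postgres-style grants (the role, table and function names are invented; I'm not claiming any real registry does this):

    import psycopg2

    con = psycopg2.connect("dbname=packages user=admin")  # hypothetical DB
    cur = con.cursor()

    # A read-only role with no other privileges.
    cur.execute("CREATE ROLE guest LOGIN PASSWORD 'guest' NOSUPERUSER NOCREATEDB")
    cur.execute("GRANT CONNECT ON DATABASE packages TO guest")
    cur.execute("GRANT SELECT ON packages_index TO guest")

    # Or go further: hide the tables and expose one parameterised lookup.
    cur.execute("""
        CREATE FUNCTION resolve(pkg TEXT)
        RETURNS TABLE(pkg_name TEXT, pkg_version TEXT)
        LANGUAGE sql SECURITY DEFINER AS
        $$ SELECT name, version FROM packages_index WHERE name = pkg $$
    """)
    cur.execute("GRANT EXECUTE ON FUNCTION resolve(TEXT) TO guest")

    con.commit()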

bob1029 1 day ago||
There are still no realistic ways to expose a hosted SQL solution to the public without really unhappy things occurring. It doesn't matter which vendor you pick.

Anything where you are opening a TCP connection to a hosted SQL server is a non-starter. You could hypothetically have so many read replicas that no one could blow anyone else up, but this would get to be very expensive at scale.

Something involving SQLite is probably the most viable option.

IshKebab 19 hours ago|||
Feels like there's an opening in the market there. Why can't you expose an SQL server to the public?

Also, Stack Overflow exposes a SQL interface, so it isn't totally impossible.

yawaramin 9 hours ago|||
There's no need to have a publicly accessible database server; just put all the data in a single SQLite database and distribute that to clients. It's possible to do streaming updates by just zipping up a text file containing all the SQL commands and letting clients download that. A more sophisticated option is e.g. Litestream.
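
The client-side "apply an update" step can be as dumb as this (URL and file layout are hypothetical; assumes the update file is gzipped SQL statements):

    import gzip, sqlite3, urllib.request

    UPDATE_URL = "https://packages.example.com/updates/2024-06-01.sql.gz"

    with urllib.request.urlopen(UPDATE_URL) as resp:
        sql = gzip.decompress(resp.read()).decode("utf-8")

    con = sqlite3.connect("local-index.sqlite")
    con.executescript(sql)  # runs the batched INSERT/UPDATE/DELETE statements
    con.commit()
    con.close()
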
zX41ZdbW 23 hours ago|||
ClickHouse can do it. Examples:

    https://play.clickhouse.com/

    clickhouse-client --host play.clickhouse.com --user play --secure

    ssh play.clickhouse.com
baobun 18 hours ago||
Yes but CH is not SQL.
Hasnep 2 hours ago||
Yes, SQL is a query language and ClickHouse is a database that uses SQL as a query language, but I don't see why that's relevant.
brendoncarroll 1 day ago|||
I personally think that this is the future, especially since such an architecture allows for E2E encryption of the entire database. The protocol should just be a transaction layer for coordinating changes of opaque blobs.

All of the complexity lives on the client. That makes a lot of sense for a package manager because it's something lots of people want to run, but no one really wants to host.

mirekrusin 1 day ago||
You can use fossil [0]

[0] https://fossil-scm.org

Ericson2314 23 hours ago||
The Nixpkgs example is not like the others, because it is source code.

I don't get what is so bad about shallow clones either. Why should they be so performance sensitive?

__MatrixMan__ 22 hours ago||
It also seems like it's not git that's emitting scary creaks and groans, but rather GitHub. As much as it would be a bummer to forgo some of GitHub's nice-to-have features, I expect we could survive without some of them.
MarsIronPI 19 hours ago|||
Exactly. Gentoo's main package repo is hosted in Git (but not GitHub, except as a mirror). Now, most users fetch it via rsync, but actually using the Git repo IME makes syncing faster, not slower. Though it does make the initial fetch slower.
mindslight 21 hours ago|||
Furthermore, the issues given for nixpkgs are actually demonstrating the success of using git as the database! Those 20k forks are all people maintaining their own version of nixpkgs on GitHub, right? Each has their own independent tree that users can just go ahead and modify for their own whims and purposes, without having to overcome the activation energy of creating their own package repository.

If 83GB (4MB/fork) is "too big", then responsibility for that rests solely on the elective centralization encouraged by GitHub. I suspect that if you could total up the cumulative storage used by the nixpkgs source tree distributed on computers spread throughout the world, it would be many orders of magnitude larger.

__MatrixMan__ 16 hours ago||
Agreed, nix really makes it easy to go from solving the problem for yourself to solving it for everybody. Not much else is easy, but when it comes to building an open source community, that criterion is a pretty powerful one.
kccqzy 10 hours ago|||
Shallow clones themselves aren’t the issue. It’s that updating shallow clones requires the server to spend a bunch of CPU time and GitHub simply isn’t willing to provide that for free.

The solution is simple: using a shallow clone means that the use case doesn’t care about the history at all, so download a tarball of the repo for the initial download and then later rsync the repo. Git can remain the source of truth for all history, but that history doesn’t have to be exposed.
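
A sketch of that flow, with made-up mirror URLs (and assuming rsync is on the PATH):

    import pathlib, subprocess, tarfile, urllib.request

    DEST = pathlib.Path("nixpkgs")

    if not DEST.exists():
        # First fetch: a single cacheable tarball instead of a clone.
        url = "https://mirror.example.org/nixpkgs/master.tar.gz"
        with urllib.request.urlopen(url) as resp:
            tarfile.open(fileobj=resp, mode="r|gz").extractall(DEST)
    else:
        # Later updates: rsync only transfers the files that changed.
        subprocess.run(
            ["rsync", "-a", "--delete",
             "rsync://mirror.example.org/nixpkgs/", str(DEST)],
            check=True,
        )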

ajb 23 hours ago||
In a compressed format, later commits would be added as a delta of some kind, to avoid increasing the size by the whole tree size each time. To make shallow clones efficient you'd need to rewrite the compressed form such that earlier commits are instead deltas on later ones, or something equivalent.
nottorp 4 hours ago||
> Auto-updates now run every 24 hours instead of every 5 minutes

What the... why would you run an auto-update every 5 minutes?

twoodfin 1 day ago||
What made git special & powerful from the start was its data model: Like the network databases of old, but embedded in a Merkle tree for independent evolution and verifiability.

Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
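
The content-addressed core is small enough to show in a few lines - a blob id is just the SHA-1 of a short header plus the contents, and trees and commits hash the ids below them, which is what makes the whole structure verifiable:

    import hashlib

    def git_blob_id(data: bytes) -> str:
        # Same scheme as `git hash-object`: "blob <size>\0" + contents.
        header = f"blob {len(data)}\0".encode()
        return hashlib.sha1(header + data).hexdigest()

    print(git_blob_id(b"hello world\n"))
    # -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad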

amluto 1 day ago|
Most of the problems mentioned in the article are not problems with using a content-addressed tree like git or even with using precisely git’s schema. The problems are with git’s protocol and GitHub’s implementation thereof.

Consider vcpkg. It’s entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but git does not know how to transfer it efficiently.

mananaysiempre 1 day ago||
> Git knows how to store [a hash-addressed tree], but git does not know how to transfer it efficiently.

Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..

Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.

Zambyte 1 day ago||
The issues with using Git for Nix seem to entirely be issues with using GitHub for Nix, no?
Rucadi 1 day ago||
I got the same feeling from that; in fact, I would go as far as to say that nixpkgs and the nix commands' integration with git works quite well and is not an issue.

So when the article says "Package managers keep falling for this. And it keeps not working out", I feel that's untrue.

The biggest issue I have with this is really the "flakes" integration, where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem, not an intrinsic problem of using git.

femiagbabiaka 1 day ago||
Yeah, its inclusion here is baffling because none of the listed issues have anything to do with the particular issue nixpkgs is having.
shellkr 8 hours ago|
I am not sure this is a git issue so much as a GitHub issue. Just look at the AUR on Arch Linux, which works perfectly.