Infinite Git repos on Cloudflare workers

Posted by plesiv 10/25/2024

Infinite Git repos on Cloudflare workers(gitlip.com)

144 points | 90 comments

koolba 10/25/2024|

> We’re building Gitlip - the collaborative devtool for the AI era. An all-in-one combination of Git-powered version control, collaborative coding and 1-click deployments.

Did they get a waiver from the git team to name it as such?

Per the trademark policy, new “git${SUFFIX}” names aren’t allowed: https://git-scm.com/about/trademark

>> In addition, you may not use any of the Marks as a syllable in a new word or as part of a portmanteau (e.g., "Gitalicious", "Gitpedia") used as a mark for a third-party product or service without Conservancy's written permission. For the avoidance of doubt, this provision applies even to third-party marks that use the Marks as a syllable or as part of a portmanteau to refer to a product or service's use of Git code.

WorkerBee28474 10/25/2024||

You don't need their permission to make a portmanteau, all you need is to follow trademark law (which may or may not allow it). The policy page can go kick sand.

saurik 10/25/2024||

While true, using someone else's trademark as a prefix of your name when you are actively intending it to reference the protected use seems egregious.

eli 10/25/2024||

Do you think many users will mistakenly believe Gitlip is an official Git project put out by the same authors as Git?

There can't be trademark infringement unless there is a likelihood of confusion.

saurik 10/26/2024||

I don't know. I do know that an incredible number of users do not understand that GitHub is NOT an official anything, and so I feel like we have an existence proof of this being a serious concern. (That said, I also can see an argument that once git allowed that to happen without a fight, or even AFAIK at least some retroactive agreement--and then further also allowed GitLab--that they don't even have an enforceable trademark anymore, at least with respect to this kind of prefix use. But like, this is a kind of oblique argument that I don't think you are already making.)

fumplethumb 10/25/2024|||

What about… GitHub, GitLab, GitKraken, GitButler (featured on HN recently)? The list goes on forever!

afiori 10/25/2024||

Supposedly they got written permission

trillic 10/26/2024||

Gitea?

afiori 10/26/2024||

supposedly the same

plesiv 10/25/2024|||

OP here. Oops, thank you for pointing that out! We weren’t aware of it. We will investigate ASAP. In the worst case, we’ll change our name.

benatkin 10/25/2024||

Doesn't sound like a worst case to me. It could use a better name anyway.

Spunkie 10/25/2024||

gitlip is not a good name, but you can be sure that if a new name does not include git it will be a worse name.

benatkin 10/26/2024|||

How so? I think they'd want to make a more generic name, because their success so far seems to be in gluing a couple of things not directly related to git together. These are running WebAssembly on the server and using a centralized storage backend. Even if they're laser-focused on the same problem of multi-tenant version control, I wouldn't be surprised if they wanted to support using CRDTs in a similar manner down the road.

xk3 10/26/2024|||

right... they should call it Cloudflarelip /s

rzzzt 10/25/2024||

What about an old word? Agitator, legitimate, cogitate?

singron 10/25/2024||

Is the syllable still the Mark if you pronounce it with a soft G instead of a hard G?

ecshafer 10/25/2024||

Github doesn't stop me from making an infinite number of git repos. Or maybe they do, but I have never hit the limit. And if I am hitting that limit, and become a large enterprise customer, I am sure they would work with me on getting around that limit.

Where does this fit into a product? Maybe I am blind, but while this is cool, I don't really see where I would want this.

aftbit 10/25/2024||

Github would definitely reach out if you tried to make 100k+ Github repos. We once automatically opened issues in response to exceptions (sort of a ghetto Bugsnag / Sentry) and received a nice email from an engineer asking us if we really needed to do that when we hit around the 200k mark.

no_wizard 10/25/2024|||

Oh here’s an interesting idea.

What if these bug reporting platforms could create a branch and tag it for each issue.

This would be particularly useful for point-in-time things where you have an immutable deployment branch. So it could create a branch off that immutable deployment branch and tag it, so you always have a point-in-time code reference for bugs.

Would that be useful? I feel like what you’re doing here isn’t that different if I get what’s going on (basically creating one repository per bug?)

justincormack 10/25/2024|||

GitHub weren't terribly happy with the number of branches we created for this type of use case at one point.

0zymandiass 10/25/2024||

A branch doesn't use any more space than a commit... I'm curious what their complaint was with a large number of branches?

There are various repositories with 500k+ commits

JasonSage 10/25/2024|||

I’m assuming GitHub has a fair amount of database/cache overhead for most things, especially branches. I think that most things the web client sees are all database content and that there’s no usage of git/filesystem in any hot paths for web views.

So I can easily see why having many branches is more storage than the same number of commits.

dizhn 10/25/2024||||

It might be something silly like the number of items in the Branches dropdown menu.

justincormack 10/26/2024||

That actually worked really well, and provides great branch search features when you have lots.

justincormack 10/26/2024|||

We appeared on a list of the top 10 most egregious users of github so I assume they had database entries for these…

aphantastic 10/25/2024|||

Why not just keep the sha of the release in the bug report?

foota 10/25/2024|||

In some ways, you could imagine repos might be more scalable than issues within a repo, since you could reasonably assume a bound on the number of issues in a single repo.

plesiv 10/25/2024|||

OP here. We’re building a new kind of Git platform. "Infinity" is more beneficial for us as platform builders (simplifying infrastructure) but less relevant to our customers as users.

creatonez 10/26/2024|||

Read the article. It's not literally a foray into creating endless git repositories or anything fancy like highly strategic git-specific data compression; they are just using "infinite" as a buzzword for "highly horizontally scalable". The product is something like GitHub.

shivasaxena 10/25/2024||

Imagine every Notion doc or every Airtable base being a git repo. Imagine the PR workflow that we developers love being available to everyone.

yjftsjthsd-h 10/25/2024||

> It allows us to easily host an infinite number of repositories

I like this system in general, but I don't understand why scaling the number of repos is treated as a pinch point? Are there git hosts that struggle with the number of repos hosted in particular? (I don't think the "Motivation" section answers this, either.)

icambron 10/25/2024||

Seems like it enables you to do things like use git repos as per-customer or per-some-business-object storage, which you otherwise wouldn't even consider. Like imagine you were setting up a blogging site where each blog was backed by a repo.

abraae 10/25/2024||

Or perhaps a SaaS product where individual customers had their own fork of the code.

There are many reasons not to do this, perhaps this scratches away at one of them.

plesiv 10/25/2024|||

OP here.

It’s unlikely any Git providers struggle with the number of repos they're hosting, but most are larger companies.

Currently, we're a bootstrapped team of 2. I think our approach changes the kind of product we can build as a small team.

rad_gruchalski 10/25/2024||

How? What makes it so much more powerful than gitea hosted on a cheap vps with some backup in s3?

Unless, of course, your product is infinite git repos with cf workers.

bhl 10/25/2024||

Serverless git repos would be useful if you wanted to make a product like real-time collaboration + offline support code editing in the browser.

You can still sync to a platform like GitHub or BitBucket after all users close their tabs.

A long time ago, I looked into using isomorphic-git with lightning-fs to build a light note-taking app in the browser: pull your markdown files in, edit them in a rich-text editor a la Notion, stage and then commit changes back using git.

aphantastic 10/25/2024||

That’s essentially what github.dev and vscode.dev do FWIW.

jauntywundrkind 10/25/2024||

> After extensive research, we rewrote significant parts of Emscripten to support asynchronous file system calls.

> We ended up creating our own Emscripten filesystem on top of Durable Objects, which we call DOFS.

> We abandoned the porting efforts and ended up implementing the missing Git server functionality ourselves by leveraging libgit2’s core functionality, studying all available documentation, and painstakingly investigating Git’s behavior.

Using a ton of great open source & taking it all further. Would sure be great if y'all could contribute some of this forward!

Libgit2 is GPL with Linking Exception, and Emscripten MIT so I think legally everything is in the clear. But it sure would be such a boon to share.

plesiv 10/25/2024|

Definitely! We're focused on launching right now, but once we have more bandwidth, we'd be happy to do it.

I believe our changes are solid, but they’re tailored specifically to our use case and can’t be merged as-is. For example, our modifications to libgit2 would need at least as much additional code to make them toggleable in the build process, which requires extra effort.

abstractbeliefs 10/25/2024||

No free software no support. You don't have to merge it upstream right away, but publish it for others to study and use as permitted by the license.

sluongng 10/25/2024||

@plesiv could you please elaborate on how repack/gc is handled with a libgit2 backend? I know that Alibaba has done something similar in the past based on libgit2, but I have yet to see another implementation in the wild like this.

Very cool project. I hope Cloudflare Workers can support more protocols like SSH and gRPC. It's one of the reasons why I prefer Fly.io over Cloudflare Workers for special servers like this.

plesiv 10/25/2024|

Great question! By default, with libgit2 each write to a repo (e.g. push) will create a new pack file. We have written a simple packing algorithm that runs after each write. It works like this:

Choose these values:

* P, pack "Planck" size, e.g. 100kB

* N, branching factor, e.g. 8

After each write:

1. iterate over each pack (pack size is S) and assign each pack a class C which is the smallest integer that satisfies P * N^C > S

2. iterate variable c from 0 to the maximum value of C that you got in step 1

* if there are N packs of class c, repack them into a new pack, new pack is going to be at most of class c+1
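
The steps above can be sketched roughly as follows. This is only a simulation under stated assumptions, not Gitlip's actual code: packs are modeled by their byte sizes alone, a merged pack is assumed to be as large as the sum of its inputs (an upper bound, since real repacking can shrink data), and the names `pack_class` and `repack` are made up for illustration.

```python
PLANCK = 100_000  # P: pack "Planck" size in bytes (example value from the comment)
BRANCHING = 8     # N: branching factor (example value from the comment)


def pack_class(size, planck=PLANCK, branching=BRANCHING):
    """Return the smallest integer C >= 0 with planck * branching**C > size."""
    c = 0
    while planck * branching ** c <= size:
        c += 1
    return c


def repack(pack_sizes, planck=PLANCK, branching=BRANCHING):
    """Simulate one maintenance pass that would run after a write.

    Packs are represented only by their sizes (hypothetical model).
    """
    packs = list(pack_sizes)
    if not packs:
        return packs
    # Step 1: classify every pack and remember the largest class seen.
    max_class = max(pack_class(s, planck, branching) for s in packs)
    # Step 2: walk the classes from small to large so merges can cascade upward.
    for c in range(max_class + 1):
        members = [s for s in packs if pack_class(s, planck, branching) == c]
        while len(members) >= branching:
            group, members = members[:branching], members[branching:]
            for s in group:
                packs.remove(s)
            # Assumption: merged size is the sum of the inputs, so the new
            # pack lands in class c or c + 1, never higher.
            packs.append(sum(group))
    return packs
```

With these example values, eight 30 kB packs merge into one 240 kB pack of class 1, which can then take part in a class-1 merge during the same pass. The number of packs per class stays below N, so the total pack count grows only logarithmically with repository size.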

betaby 10/25/2024||

Somewhat related question. Assume I have ~1k ~200MB XML files that get ~20% of their content changed. What is my best option to store them? While using vanilla git on an SSD RAID 10 works, it's quite slow at retrieving historical data dating back ~3-6 months. Are there other options for a quick back-end? I'm fine with it being not that storage efficient to a degree.

adobrawy 10/25/2024||

I don't know what your "best" criterion is (implementation costs, implementation time, maintainability, performance, compression ratio, etc.). Still, the easiest way to start is to delegate it to the file system, so zfs + compression. Access time should be decent. No application-level changes are required to enable that.

betaby 10/25/2024||

It is already on ZFS with compression.

a_e_k 10/26/2024|||

Crazy thought: something like BorgBackup.

If only 20% of the content gets changed, the rolling hash that Borg does to chunk files could identify the 80% common parts and then with its deduplication it would store just a single compressed copy of those chunks. And as a bonus, it's designed for handling historical data.
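
To make the idea concrete, here is a toy sketch of content-defined chunking with deduplication. It is not Borg's implementation: Borg uses its own buzhash-based chunker, while this uses a simpler gear-style rolling hash, and the names `chunks`, `add_version`, the mask, and the size limits are all illustrative assumptions.

```python
import hashlib
import random

_rng = random.Random(0)  # fixed seed so chunk boundaries are reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
MASK = (1 << 11) - 1  # a cut is expected roughly every 2 KiB


def chunks(data, min_size=256, max_size=16384):
    """Split bytes at positions chosen by a gear-style rolling hash."""
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Older bytes shift out of the 32-bit state, so the hash depends
        # only on the most recent bytes, not on the absolute offset.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


def add_version(store, data):
    """Store chunks not seen before; return how many new bytes were written."""
    added = 0
    for chunk in chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store[key] = chunk
            added += len(chunk)
    return added
```

Because boundaries depend on nearby content rather than on file offsets, an insertion early in a file should only disturb the chunks around the edit; the later chunks line up again and are found in the store.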

nomel 10/25/2024|||

If you can share, I'd be curious to know what an XML file that large might be used for, and what benefits it might have over other formats. My personal and professional use of XML has been pretty limited, but XSD was super powerful, and it was the reason we chose XML when we did.

betaby 10/25/2024|||

Juniper routers configs, something like below.

adamc@router> show arp | display xml
<rpc-reply xmlns:JUNOS="http://xml.juniper.net/JUNOS/15.1F6/JUNOS">
    <arp-table-information xmlns="http://xml.juniper.net/JUNOS/15.1F6/JUNOS-arp" JUNOS:style="normal">
        <arp-table-entry>
            <mac-address>0a:00:27:00:00:00</mac-address>
            <ip-address>10.0.201.1</ip-address>
            <hostname>adamc-mac</hostname>
            <interface-name>em0.0</interface-name>
            <arp-table-entry-flags>
                <none/>
            </arp-table-entry-flags>
        </arp-table-entry>
    </arp-table-information>
    <cli>
        <banner></banner>
    </cli>
</rpc-reply>

hobs 10/25/2024|||

it's a good question because my answer for a system like this which had very little schema changing was just dump it into a database and add historical tracking per object that way, hash, compare, insert and add historical record.
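
A minimal sketch of that hash, compare, insert loop, assuming SQLite and a single `history` table. The table layout and the function names `open_store`, `record`, and `as_of` are invented for illustration, and timestamps are assumed to be ISO 8601 strings so they sort correctly as text.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the history database. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS history (
               object_id TEXT NOT NULL,
               seen_at   TEXT NOT NULL,
               digest    TEXT NOT NULL,
               body      TEXT NOT NULL)"""
    )
    return db


def record(db, object_id, body, seen_at):
    """Insert a new history row only if the content hash changed."""
    digest = hashlib.sha256(body.encode()).hexdigest()
    row = db.execute(
        "SELECT digest FROM history WHERE object_id = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (object_id,),
    ).fetchone()
    if row and row[0] == digest:
        return False  # unchanged since the last recorded state
    db.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?)",
        (object_id, seen_at, digest, body),
    )
    db.commit()
    return True


def as_of(db, object_id, when):
    """Return the body that was current at time `when`, or None."""
    row = db.execute(
        "SELECT body FROM history WHERE object_id = ? AND seen_at <= ? "
        "ORDER BY seen_at DESC, rowid DESC LIMIT 1",
        (object_id, when),
    ).fetchone()
    return row[0] if row else None
```

Comparing today's state with the one from six months ago then becomes two `as_of` calls followed by whatever diffing you prefer.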

betaby 10/25/2024||

I do have the current state in the DB. However, I sometimes need to compare today's file with the one from 6 months ago.

hobs 10/25/2024||

So I assumed something like this: you have the same schema with the same tabular format inside of the XML document, and the state changes are stored in a way that lets you tell the timestamp. Then you can bring up both states at the same time and compare across the attributes for wrongness.

EXCEPT/INTERSECT make this easy for a bunch of columns (excluding the times of course, I usually hash these for performance reasons) but won't tell you what the difference itself is; you have to do column-by-column comparisons here, which is where I usually shell out to my language of choice because SQL sucks at doing that.

o11c 10/25/2024|||

I'm not sure if this quite fits your workload, but a lot of times people use `git` when `casync` would be more appropriate.

hokkos 10/25/2024|||

You can compress with EXI, a binary format for XML; if it is informed by the schema, it can give a big boost in compression.

tln 10/25/2024||

> get ~20% of their content changed

...daily? monthly? how many versions do you have to keep around?

I'd look at a simple zstd dictionary based scheme, first. Put your history/metadata into a database. Put the XML data into file system/S3/BackBlaze/B2, zstd compressed against a dictionary.

Create the dictionary: zstd --train PathToTrainingSet/* -o dictionaryName

Compress with the dictionary: zstd FILE -D dictionaryName

Decompress with the dictionary: zstd --decompress FILE.zst -D dictionaryName

Although you say you're fine with it being not that storage efficient to a degree, I think if you were OK with storing every version of every XML file, uncompressed, you wouldn't have to ask right?

betaby 10/25/2024||

If one stores whole versions of the files, that defeats the idea of git and would consume too much space. I suppose I don't even need zstd if I have ZFS with compression, although compression levels won't be as good.

tln 10/25/2024||

You're relying on compression either way... my hunch is that controlling the compression yourself may get you a better result.

Git does not store diffs, it stores every version. These get compressed into packfiles https://git-scm.com/book/en/v2/Git-Internals-Packfiles. It looks like it uses zlib.
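
For illustration, here is roughly how a single file version becomes a loose Git object: the content gets a `blob <size>\0` header, the SHA-1 of header plus content is the object ID, and the bytes are zlib-deflated on disk. The function name `git_blob` is made up for this sketch. Note that the packfile format described at the link above additionally delta-compresses similar objects, so the full-snapshot picture applies to the object model rather than to every byte on disk.

```python
import hashlib
import zlib


def git_blob(content: bytes):
    """Return (object_id, compressed_bytes) for a loose Git blob object."""
    # Git hashes a small header followed by the raw file content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    return object_id, zlib.compress(store)


oid, compressed = git_blob(b"hello world\n")
print(oid)  # same ID that `git hash-object` reports for this content
```

Changing one byte of a 200MB file produces a completely new blob with a new ID, which is why large, frequently changing files depend so heavily on repacking to stay small.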

skybrian 10/25/2024||

Not having a technical limit is nice, because then it’s a matter of spending money. But whenever I see “infinite,” I ask what it will cost. How expensive is it to host git repos this way?

As a hobbyist, “free” is pretty appealing. I’m pretty sure my repos on GitHub won’t cost me anything, and that’s unlikely to change anytime soon. Not sure about the new stuff.

jsheard 10/25/2024|

With Cloudflare at least, when you overstay your welcome on the free plan they just start nagging you to start paying, and possibly kick you out if you don't, rather than sending you a surprise bill for $10,000 like AWS or Azure or GCP might do.

VoidWhisperer 10/25/2024||

Not the main purpose of the article but they mention they were working on a notetaking app oriented towards developers - did anything ever come of that? If not, does anyone know products that might fit this niche? (I currently use obsidian)

plesiv 10/25/2024|

OP here. Not yet - it's about 50% complete. I plan to open-source it in the future.

nbbaier 10/25/2024|||

Definitely interested in seeing this as well. What are the key features?

tln 10/25/2024||

Congrats, you've done a lot of interesting work to get here.

This could be a fantastic building block for headless CMS and the like.

plesiv 10/25/2024|

OP here. Thank you and good catch! :-) We have a blog post planned on that topic.

seanvelasco 10/25/2024|

this leverages Durable Objects, but as i remember from two years ago, DO's way of guaranteeing uniqueness is that there can only be one instance of that DO in the world.

what if there are two users who want to access the same DO repo at the same time, one in the US and the other in Singapore? the DO must live either in US servers or SG servers, but not in both at the same time. so one of the two users must have high latency then?

then after some time, a user in Australia accesses this DO repo - the DO bounces to AU servers - US and SG users will have high latency?

but please correct me if i'm wrong

skybrian 10/25/2024|

Last I heard, durable objects don’t move while running. It doesn’t seem worse than hosting in US-East, though.
