I made my own Git - Hacker News

Posted by TonyStr 1/27/2026

I made my own Git(tonystr.net)

380 points | 176 commentspage 3

bryan2 1/28/2026|

Ftr you can make repos with sha256 now.

I wonder if signing sha-1 mitigates the threat of using an outdated hash.

athrowaway3z 1/27/2026||

I do wonder if the compression step makes sense at this layer instead of the filesystem layer.

aabbcc1241 1/28/2026|

Interesting take. I'm using btrfs (instead of ext4) with compression enabled (using zstd), so most of the files are compressed "transparently" - the files appear as normal files to the applications, but on disk it is compressed, and the application don't need to do the compress/decompress.

direwolf20 1/27/2026||

Cool. When you reimplement something, it forces you to see the fractal complexity of it.

jrockway 1/27/2026||

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.

EdSchouten 1/27/2026||

It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.

Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.

For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.

oconnor663 1/27/2026||

As far as I know, most CDC schemes requires a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.

EdSchouten 1/28/2026||

You can always use a divide and conquer strategy to compute the chunks. Chunk both halves of the file independently. Once that’s done, you redo the chunking around the midpoint of the file forward, until it starts to match the chunks obtained previously.

grumbelbart2 1/27/2026||

Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set

oconnor663 1/27/2026||

It's mixed. You get something in the neighborhood of a 3-4x speedup with SHA-NI, but the algorithm is fundamentally serial. Fully parallel algorithms like BLAKE3 and K12, which can use wide vector extensions like AVX-512, can be substantially faster (10x+) even on one core. And multithreading compounds with that, if you have enough input to keep a lot of cores occupied. On the other hand, if you're limited to one thread and older/smaller vector extensions (SSE, NEON), hardware-accelerated SHA-256 can win. It can also win in the short input regime where parallelism isn't possible (< 4 KiB for BLAKE3).

holoduke 1/27/2026||

I wonder if in the near future there will be no tools anymore in the sense we know it. you will maybe describe the tool you need and its created on the fly.

prakhar1144 1/27/2026||

I was also playing around with the ".git" directory - ended up writing:

"What's inside .git ?" - https://prakharpratyush.com/blog/7/

lasgawe 1/27/2026||

nice work! This is one of the best ways to deeply learn something, reinvent the wheel yourself.

ofou 1/27/2026||

btw, you can change the hashing algorithm in git easily

smangold 1/27/2026||

Tony nice work!

b1temy 1/27/2026|

Nice work, it's always interesting to see how one would design their own VCS from scratch, and see if they fall into problems existing implementations fell into in the past and if the same solution was naturally reached.

The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.

I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git`directory, and only updates it if it senses the file hash changed (Hm... If it can't detect this by re-hashing the file and comparing it with a known value, perhaps by the timestamp the file was last edited?).

> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets

This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/

I've not personally heard of the process of hashing to also be known as digesting, though I don't doubt that it is the case. I've mostly familiar of the resulting hash being referred to as the message digest. Perhaps it's to differentiate between the verb 'hash' (the process of hashing) with the output 'hash' (the ` result of hashing). And naming the function `sha256::try_digest`makes it more explicit that it is returning the hash/digest. But it is a bit of a reach, perhaps that are just synonyms to be used interchangeably as you said.

On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.

Also, obligatory mention of jujutsu/jj since it seems to always be mentioned when talking of a VCS in HN.

TonyStr 1/27/2026|

You are completely right about tvc ls recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only reliable way to verify a file's contents would be to generate a hash.

In my lazy implemenation, I don't even check if the hashes match, the program reads, compresses and tries to write the unchanged files. This is an obvious area to improve performance on. I've noticed that git speeds up object lookups by generating two-letter directories from the first two letters in hashes, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.

>why were TOML files not considered at the end I'm just not that familiar with toml. Maybe that would be a better choice! I saw another commenter who complained about yaml. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you'll have to add the `-p` flag (--pretty) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in yaml/toml/json/xml/special format.

b1temy 1/27/2026||

> A timestamp wouldn't be reliable

I agree, but I'm still iffy on reading all files (already an expensive operation) in the repository, then hashing every one of them, every time you do an ls or a commit. I took a quick look and git seems to check whether it needs to recalculate the hash based on a combination of the modification timestamp and if the filesize has changed, which is not foolproof either since the timestamp can be modified, and the filesize can remain the same and just have different contents.

I'm not too sure how to solve this myself. Apparently this is a known thing in git and is called the "racy git" problem https://git-scm.com/docs/racy-git/ But to be honest, perhaps I'm biased from working in a large repository, but I'd rather the tradeoff of not rehashing often, rather than suffer the rare case of a file being changed without modifying its timestamp, whilst remaining the same size. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, having them have access to my filesystem is a far larger problem...)

> I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...

Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the toml syntax, but ultimately, it does not matter much what format it is in. I mainly asked out of curiosity since your first thoughts were to use yaml or json, when I see (completely empirically) most Rust devs prefer toml, probably because of familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (As to be expected with most Rust projects), so I suppose you must be at least a little bit familiar with it, at least from a user perspective. But I suppose you likely have even more experience with yaml and json, which is why it came to mind first.

TonyStr 1/27/2026||

> ...based on a combination of the modification timestamp and if the filesize has changed

Oh that is interesting. I feel like the only way to get a better and more reliable solution to this would be to have the OS generate a hash each time the file changes, and store that in file metadata. This seems like a reasonable feature for an OS to me, but I don't think any OS does this. Also, it would force programs to rely on whichever hashing algorithm the OS uses.

b1temy 1/27/2026||

>... have the OS generate a hash each time the file changes...

I'm not sure I would want this either tbh. If I have a 10GB file on my filesystem, and I want to fseek to a specific position in the file and just change a single byte, I would probably not want it to re-hash the entire file, which will probably take a minute longer compared to not hashing the file. (Or maybe it's fine and it's fast enough on modern systems to do this every time a file is modified by any program running, I don't know how much this would impact the performance.).

Perhaps a higher resolution timestamp by the OS might help though, for decreasing the chance of a file having the exact same timestamp (unless it was specifically crafted to have been so).

More comments...