Posted by TonyStr 10 hours ago

I made my own Git(tonystr.net)
298 points | 134 comments
nasretdinov 8 hours ago|
Nice work! On a complete tangent, Git is the only SCM known to me that supports a recursive merge strategy [1] (instead of the regular 3-way merge), which essentially always remembers resolved conflicts without you needing to do anything. This is a very underrated feature of Git, and somehow people still manage to choose rebase over it. If you ever get to implementing merges, please make sure you have a mechanism for remembering the conflict resolution history :).

[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...

ezst 22 minutes ago||
On recursive merging, by the author of mercurial

https://www.mercurial-scm.org/pipermail/mercurial/2012-Janua...

arunix 7 hours ago|||
I remember in a previous job having to enable git rerere, otherwise it wouldn't remember previously resolved conflicts.

https://git-scm.com/book/en/v2/Git-Tools-Rerere

nasretdinov 7 hours ago|||
I believe rerere is a local cache, so you'd still have to resolve the conflicts again on another machine. The recursive merge doesn't have this issue — the conflict resolution inside the merge commits is effectively remembered (although due to how Git operates it actually never even considers it a conflict to be remembered — just a snapshot of the closest state to the merged branches)
pyrolistical 35 minutes ago||||
Would be nice if centralized git platforms shared rerere caches
direwolf20 4 hours ago|||
The recursive merge is about merging branches that already have merges in them, while rerere is about repeating the same merge several times.
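
A minimal sketch of where that difference shows up, assuming a toy commit graph stored as a dict of parent lists (the commit names `root`, `A1`, `B1`, `Ma`, `Mb` are hypothetical). When two branches have already merged each other (a criss-cross), there is more than one best common ancestor, and a plain 3-way merge has no single base to pick. As far as I understand it, Git's recursive strategy handles this by first merging those bases into a virtual ancestor. The code only finds the bases; it does not perform the merge.

```python
def ancestors(parents, commit):
    """Return every commit reachable from `commit`, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def merge_bases(parents, a, b):
    """Return the common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Hypothetical criss-cross history: each branch merged the other's first commit.
CRISS_CROSS = {
    "root": [],
    "A1": ["root"],
    "B1": ["root"],
    "Ma": ["A1", "B1"],  # branch a merged B1
    "Mb": ["B1", "A1"],  # branch b merged A1
}
```

With this graph, `merge_bases(CRISS_CROSS, "A1", "B1")` yields a single base, while `merge_bases(CRISS_CROSS, "Ma", "Mb")` yields two, which is the case a single-base 3-way merge cannot handle directly.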
mkleczek 7 hours ago|||
Much more principled (and hence less of a foot-gun) way of handling conflicts is making them first class objects in the repository, like https://pijul.org does.
jcgl 7 hours ago|||
Jujutsu too[0]:

> Jujutsu keeps track of conflicts as first-class objects in its model; they are first-class in the same way commits are, while alternatives like Git simply think of conflicts as textual diffs. While not as rigorous as systems like Darcs (which is based on a formalized theory of patches, as opposed to snapshots), the effect is that many forms of conflict resolution can be performed and propagated automatically.

[0] https://github.com/jj-vcs/jj

PunchyHamster 3 hours ago||||
I feel like people making new VCSes should just re-use the Git storage/network layer and innovate on top of that. Git storage is flexible enough for that, and that way you can just... use it on existing repos, with a very easy migration path for both workflows (CI/CD never needs to care about what frontend you use) and users.
zaphar 3 hours ago||
Git storage is just a merkle tree. It's a technology that's been around forever and was simultaneously chosen by more than one vcs technology around the same time. It's incredibly effective so it makes sense that it would get used.
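
A rough sketch of the Merkle property, assuming Git's SHA-1 object framing and a flat directory of regular files (mode `100644`). The helper names `object_id` and `tree_id` are made up for illustration, and real trees also hold subdirectories and other modes. The point is that a tree's ID is a hash over its children's IDs, so changing any file changes the ID of every tree above it.

```python
import hashlib


def object_id(kind, body):
    """Hash an object the way Git does: '<kind> <size>\\0' followed by the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files):
    """Compute a tree ID for a flat directory given {file name: file content}."""
    body = b""
    for name in sorted(files, key=lambda n: n.encode()):
        blob = object_id("blob", files[name])
        # Each entry is '<mode> <name>\0' followed by the raw 20-byte child hash.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob)
    return object_id("tree", body)
```

Calling `tree_id` on two directories that differ in one byte of one file gives two unrelated IDs, while identical directories always hash to the same ID, which is what makes deduplication and integrity checks cheap.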
theLiminator 2 hours ago|||
It's very cool though I imagine it's doa due to lack of git compatibility...
speed_spread 1 hour ago||
Lack of current-SCM incumbent compatibility can be an advantage. Like Linus decided to explicitly do the reverse of every SVN decision when designing git. He even reversed CLI usability!
theLiminator 1 hour ago||
I think the network effects of git are too large to overcome now. Hence why we see jj get a lot more adoption than pijul.
chungy 2 hours ago|||
As far as I understand the problem (sorry, the SO isn't the clearest around), Fossil should support this operation. It does one better, since it even tracks exactly where merges come from. In Git, you have a merge commit that shows up with more than one parent, but Fossil will show you where it branched off too.

Take out the last "/timeline" component of the URL to clone via Fossil: https://chiselapp.com/user/chungy/repository/test/timeline

See also, the upstream documentation on branches and merging: https://fossil-scm.org/home/doc/trunk/www/branching.wiki

p0w3n3d 6 hours ago||
That's something new to me (using git for 10 years, always rebased)
iberator 1 hour ago||
I'm even more lazy. I almost always clone from scratch after merging or after not touching the project for some time. So easy and silly :)

I always forget all the flags and I work with literally just: clone, branch, checkout, push.

(Each feature is a fresh branch tho)

teiferer 9 hours ago||
If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.

Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

TonyStr 8 hours ago||
Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).

tonnydourado 7 hours ago|||
Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)
Phelinofist 7 hours ago||||
I selfhost Gitea. The instance is crawled by AI crawlers (checked the IPs). They never cloned, they just browse and take it directly from there.
Phelinofist 5 hours ago|||
For reference, this is how I do it in my Caddyfile:

   (block_ai) {
       @ai_bots {
           header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
       }

       abort @ai_bots
   }
Then, in a specific app block include it via

   import block_ai
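
For anyone who wants to check which user agents that pattern would catch before deploying it, here is a small sketch in Python. It assumes the matcher does an unanchored, case-insensitive search, which is how I read `header_regexp` with `(?i)`; the sample user-agent strings in the usage note are made up.

```python
import re

# Same alternation as the Caddyfile snippet above.
AI_BOTS = re.compile(
    r"(?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|"
    r"ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)"
)


def is_blocked(user_agent):
    """Return True if the user agent contains any of the listed bot names."""
    return AI_BOTS.search(user_agent) is not None
```

For example, `is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2)")` is true and `is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")` is false. Bots that send a generic browser user agent will not be caught by this approach.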
zaphar 2 hours ago||
I have almost exactly this in my own caddyfile :-D The order of the items in the regex is a little different but mostly the same items. I just pulled them from my web access logs over time and update it every once in a while.
Zambyte 7 hours ago|||
i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code

blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're the only ones i had to block to stay sane

MarsIronPI 6 hours ago||
Have you considered putting it behind Anubis or an equivalent?
Zambyte 6 hours ago||
Yes, but I haven't and would prefer not to
nerdponx 8 hours ago||||
Time to start including deliberate bugs. The correct version is in a private repository.
below43 1 hour ago|||
They used to do this with maps - eg. fake islands - to pick up when they were copied.
teiferer 6 hours ago||||
And what purpose would this serve, exactly?
adastra22 5 hours ago||
Spite.
program_whiz 6 hours ago|||
while I think this is a fun idea -- we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter because it's your house, therefore you get an assault charge."

A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(

nerdponx 2 hours ago|||
I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).
wredcoll 3 hours ago|||
Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.

Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.

There's a reason we have attractive nuisance laws and you aren't allowed to put a slide on your yard that electrocutes anyone who touches it.

None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.

0x696C6961 7 hours ago||||
This has been happening before LLMs too.
teiferer 6 hours ago|||
I don't really get why they need to clone in order to scrape ...?

> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)

The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

storystarling 1 hour ago|||
Cloning gets you the raw text objects directly. If you scrape the web UI you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the structure to be as clean as possible from the start.
adastra22 5 hours ago|||
The quality of LLM coding agents is pretty good now.
wasmainiac 9 hours ago|||
Maybe we can poison LLMs with loops of 2 or more self referencing blogs.
jdiff 8 hours ago|||
Only need one, they're not thinking critically about the media they consume during training.
falcor84 8 hours ago|||
Here's a sad prediction: over the coming few years, AIs will get significantly better at critical evaluation of sources, while humans will get even worse at it.
whstl 7 hours ago|||
I wish I could disagree with you, but what I'm seeing on average (especially at work) is exactly that: people asking stuff to ChatGPT and accepting hallucinations as fact, and then fighting me when I say it's not true.
prmoustache 7 hours ago||
There is "death by GPS" for people dying after blindly following their GPS instruction. There will definitely be a "death by AI" expression very soon.
stevekemp 5 hours ago||
Tesla-related fatalities probably count already, albeit without that label/name.
sailfast 5 hours ago||||
Hot take: Humans have always been bad at this (in the aggregate, without training). Only a certain percentage of the population took the time to investigate.

For most throughout history, whatever is presented to you that you believe is the right answer. AI just brings them source information faster so what you're seeing is mostly just the usual behavior, but faster. Before AI people would not have bothered to try and figure out an answer to some of these questions. It would've been too much work.

topaz0 8 hours ago||||
My sad prediction is that LLMs and humans will both get worse. Humans might get worse faster though.
keybored 5 hours ago|||
HN commenters will be techno-optimistic misanthropes. Status quo ante bellum.
andy_ppp 8 hours ago||||
The secret sauce about having good understanding, taste and style (both for coding and writing) has always been in the fine-tuning and RLHF steps. I'd be skeptical if the signals a few GitHub repos or blogs generate at the initial stages of the learning are that critical. There's probably a filter also for good taste on the initial training set, and these are so large not even a single full epoch is done on the data these days.
jama211 3 hours ago|||
It wouldn’t work at all.
jama211 3 hours ago|||
I see the AI hating part of HN has come out again
anu7df 8 hours ago|||
I understand model output put back into training would be an issue, but if model output is guided by multiple prompts and edited by the author to his/her liking wouldn't that at least be marginally useful?
prodigycorp 8 hours ago|||
Random aside about training data:

One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks in English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.

evntdrvn 6 hours ago|||
There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."

I wish I could find it again, if someone else knows the link please post it!

gxnxcxcx 5 hours ago|||
I'm Kenyan. I don't write like ChatGPT, ChatGPT writes like me

https://news.ycombinator.com/item?id=46273466

awesome_dude 3 hours ago|||
I've been critical of people that default to "an em dash being used means the content is generated by an LLM", or, "they've numbered their points, must be an LLM"

I do know that LLMs generate content heavy with those constructs, but they didn't create the ideas out of thin air; it was in the training set, and existed strongly enough that LLMs saw it as commonplace/best practice.

blenderob 7 hours ago|||
That's very interesting. Any examples you can share that have that agreeable affect?
prodigycorp 7 hours ago||
I'm going to do a cursory look through my antigrav history, i want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time expressing concern which I remember were slightly off natural for an american english speaker.
prodigycorp 4 hours ago||
Can't find anything; too many messages telling the agent "please do NOT make those changes". I'm going to remember to save them going forward.
mexicocitinluez 8 hours ago||
> Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.

jama211 3 hours ago||
What does eating itself even look like? It doesn’t take much salt to change a hash.
mexicocitinluez 1 hour ago||
Being trained on its own results?
gkbrk 3 hours ago||
CodeCrafters has an amazing "Build your own Git" [1] tutorial too. Jon Gjengset has a nice video [2] doing this challenge live with Rust.

[1]: https://app.codecrafters.io/courses/git/overview

[2]: https://www.youtube.com/watch?v=u0VotuGzD_w

darkryder 9 hours ago||
Great writeup! It's always fun to learn the details of the tools we use daily.

For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.

1. https://jwiegley.github.io/git-from-the-bottom-up/

MarsIronPI 6 hours ago||
Oh, I hadn't ever seen that one. I "grokked" Git thanks to The Git Parable[0] several years ago.

[0]: https://tom.preston-werner.com/2009/05/19/the-git-parable

sanufar 5 hours ago|||
Ooh, this looks fun! I didn’t know you could cat-file on a hash id, that’s actually quite cool.
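
For the curious, a rough sketch of what sits behind `git cat-file` for a loose object, assuming Git's classic SHA-1 format: the object is stored as `<type> <size>\0<content>`, zlib-compressed, and its ID is the SHA-1 of the uncompressed bytes. The function names `encode_object` and `cat_file` are hypothetical, and this ignores packfiles entirely.

```python
import hashlib
import zlib


def encode_object(kind, body):
    """Return (object ID, compressed bytes) for a loose object."""
    store = f"{kind} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def cat_file(compressed):
    """Decompress a loose object and return (type, content)."""
    store = zlib.decompress(compressed)
    header, _, body = store.partition(b"\x00")
    kind, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind, body
```

Encoding `b"hello world\n"` as a blob this way should give the same ID that `git hash-object` reports for a file with that content, and `cat_file` on the compressed bytes returns the type and the original content.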
spuz 8 hours ago||
Thanks - I think this is the article I was thinking of that really helped me to understand git when I first started using it back in the day. I tried to find it again and couldn't.
KolmogorovComp 1 hour ago||
It’s really a shame git storage uses files as the unit of storage. That’s what makes it ill-suited to many small files, or to large files.

Content-based chunking like Xethub uses really should become the default. It’s not like it’s new either, rsync is based on it.

https://huggingface.co/blog/xethub-joins-hf
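
A minimal sketch of content-defined chunking, to show why it helps. This is a toy gear-style rolling hash, not the algorithm XetHub or rsync actually uses; the table seed, the `bits` parameter and the function name `chunk` are all assumptions. A chunk boundary is placed wherever the hash of the most recent bytes has its top bits equal to zero, so boundaries depend on local content rather than on file offsets.

```python
import random

# Fixed pseudo-random table: one 32-bit value per possible byte.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data, bits=6):
    """Split data into chunks averaging roughly 2**bits bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit hash, so only the last 32 bytes matter.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - bits) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

If a byte is inserted at the front of a large file, only the chunks near the edit change; later boundaries fall at the same content and those chunks can be deduplicated against the previous version. With fixed-size blocks, every block after the insertion would shift and differ.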

brendoncarroll 4 hours ago||
Me too. Version control is great, it should get more use outside of software.

https://github.com/gotvc/got

Notable differences: E2E encryption, parallel imports (Got will light up all your cores), and a data structure that supports large files and directories.

rtkwe 2 hours ago||
The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.
brendoncarroll 2 hours ago||
> The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.

Yeah, totally agree. Got has not solved conflict resolution for arbitrary files. However, we can tell the user where the files differ, and that the file has changed.

There is still value in being able to import files and directories of arbitrary sizes, and having the data encrypted. This is the necessary infrastructure to be able to do distributed version control on large amounts of private data. You can't do that easily with Git. It's very clunky even with remote helpers and LFS.

I talk about that in the Why Got? section of the docs.

https://github.com/gotvc/got/blob/master/doc/1.1_Why_Got.md

DASD 3 hours ago||
Nice! Not sure if you're aware of Got(Game of Trees) that appears to pre-date your Got.

https://gameoftrees.org/index.html

brendoncarroll 1 hour ago||
Yes, the author reached out. There has not yet been any confusion among real users that I am aware of.

https://github.com/gotvc/got/issues/20

sluongng 9 hours ago||
Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.

I think theoretically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharded storage is required, path-based delta dictionary compression does much better. Git recently (in the last year) got something called "path-walk", which is fairly similar though.
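
To illustrate the general idea of dictionary compression between versions, here is a sketch that uses zlib's preset dictionary from the Python standard library as a stand-in for zstd dictionaries. It is not Sapling's format; the helper names and the synthetic file contents are assumptions. The previous version of a file is used as the dictionary when compressing the next version, so unchanged regions compress down to back-references.

```python
import hashlib
import zlib


def compress_with_dict(data, dictionary):
    """Compress data using an earlier version as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, dictionary):
    """Decompress data; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


def sample_versions():
    """Build two synthetic versions of a file that differ by one appended line."""
    lines = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(200)]
    old = "\n".join(lines).encode()
    new = old + b"\none more line"
    return old, new
```

The tradeoff is the same one a delta chain has: the new version can only be read back if the version used as the dictionary is still available. Note that zlib only looks at the last 32 KiB of a dictionary, so this stand-in only works for small files.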

p4bl0 8 hours ago||
Nice post :). It made me think of ugit: DIY Git in Python [1] which is still by far my favorite of this kind of posts. It really goes deep into Git internals while managing to stay easy to follow along the way.

[1] https://www.leshenko.net/p/ugit/

mfashby 6 hours ago||
In a similar vein, Write yourself a Git was fun to follow: https://wyag.thb.lt/
TonyStr 8 hours ago|||
This page is beautiful!

Bookmarked for later

UltraSane 5 hours ago||
I mapped git operations to Neo4j and it really helped me understand how it works.
temporallobe 4 hours ago||
Reminds me of when I tried to invent a SPA framework. So much hidden complexity I hadn’t thought of and I found myself going down rabbit holes that I am sure the creators of React and Angular went down. Git seems to be like this and I am often reminded of how impressive it is at hiding underlying complexity.
alsetmusic 4 hours ago|
> at hiding underlying complexity.

It's only in the context of recreating Git that this comment makes sense.

oldestofsports 5 hours ago|
Nice job, great article!

I had a go at it as well a while back, I call it "shit" https://github.com/emanueldonalds/shit

hahahahhaah 3 hours ago||
Fast Useful Change Keeper
tpoacher 5 hours ago||
THE shit, in fact.