Weave – A language aware merge algorithm based on entities (github.com)

Posted by rs545837 6 hours ago

111 points | 63 comments

gritzko 4 hours ago|

At this point, the question is: why keep files as blobs in the first place? If a revision control system stores ASTs instead, all the work is AST-level. One can then run SQL-level queries to see what is changing where, like:

  - do any concurrent branches touch this function?
  - what new uses did this function accrete recently?
  - did we create any actual merge conflicts?

Almost LSP-level querying, involving versions and branches. Beagle is a revision control system like that [1]
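To illustrate what such a query could look like, here is a small sketch using Python's built-in sqlite3. The table layout, branch names, and rows are all hypothetical and only stand in for "entity changes per branch"; this is not Beagle's data model, which is key-value.

```python
import sqlite3

# Hypothetical schema: one row per (branch, entity) change. Illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity_change (branch TEXT, entity TEXT, kind TEXT)")
db.executemany(
    "INSERT INTO entity_change VALUES (?, ?, ?)",
    [
        ("feature/auth", "validate_token", "modified"),   # made-up sample rows
        ("feature/cache", "validate_token", "modified"),
        ("feature/cache", "get_cached", "added"),
    ],
)

# "Do any concurrent branches touch this function?" expressed as a query:
# list every entity that is changed on more than one branch.
rows = db.execute(
    """
    SELECT entity, COUNT(DISTINCT branch)
    FROM entity_change
    GROUP BY entity
    HAVING COUNT(DISTINCT branch) > 1
    """
).fetchall()
```

With the sample rows above, only validate_token is reported, since it is the one entity touched on two branches.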

It is at quite an early stage, but the surprising finding is: instead of being a depository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open, on the assumption that a lot of things can be built on top of it. Essentially, it is a key-value db: keys are URIs and values are BASON (binary mergeable JSON) [2]. Can't be more open than that.

[1]: https://github.com/gritzko/librdx/tree/master/be

[2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md

rs545837 4 hours ago||

This is the right question. Storing ASTs directly would make all of this native instead of layered on top.

The pragmatic reason weave works at the git layer: adoption. Getting people to switch merge drivers is hard enough, getting them to switch VCS is nearly impossible. So weave parses the three file versions on the fly during merge, extracts entities, resolves per-entity, and writes back a normal file that git stores as a blob. You get entity-level merging without anyone changing their workflow.
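To make the parse, extract, resolve-per-entity flow concrete, here is a minimal Python sketch. It is not weave's implementation: it uses Python's ast module instead of tree-sitter, only looks at top-level functions and classes, and the names extract_entities and merge_entities are made up for illustration.

```python
import ast

def extract_entities(source):
    """Map each top-level function/class name to its source text."""
    entities = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entities[node.name] = ast.get_source_segment(source, node)
    return entities

def merge_entities(base, ours, theirs):
    """Three-way merge keyed by entity name. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    # dict.fromkeys keeps first-seen order and removes duplicate names.
    for name in dict.fromkeys([*base, *ours, *theirs]):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o              # both sides agree (or both deleted it)
        elif o == b:
            result = t              # only theirs changed this entity
        elif t == b:
            result = o              # only ours changed this entity
        else:
            conflicts.append(name)  # both changed it, differently
            result = o              # placeholder: keep ours, report the conflict
        if result is not None:
            merged[name] = result
    return merged, conflicts

# Two branches each add a different function to the same file.
base_src = "def a():\n    return 1\n"
ours_src = base_src + "\ndef b():\n    return 2\n"
theirs_src = base_src + "\ndef c():\n    return 3\n"

merged, conflicts = merge_entities(
    extract_entities(base_src),
    extract_entities(ours_src),
    extract_entities(theirs_src),
)
```

In this example every entity changed on at most one side, so conflicts stays empty and merged contains a, b, and c. A line-level merge of the same inputs would typically report a conflict, because both branches appended text at the same position.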

But you're pointing at the ceiling of that approach. A VCS that stores ASTs natively could answer "did any concurrent branches touch this function?" as a query, not as a computation. That's a fundamentally different capability. Beagle looks interesting, will dig into the BASON format.

We built something adjacent with sem (https://github.com/ataraxy-labs/sem) which extracts the entity dependency graph from git history. It can answer "what new uses did this function accrete" and "what's the blast radius of this change" but it's still a layer on top of git, not native storage.

pfdietz 2 hours ago|||

Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best effort attempt to parse but it can't in general due to the preprocessor.

rs545837 2 hours ago||

Great point. C/C++ with macros and preprocessor directives is where tree-sitter's error recovery gets stretched. We support both C and C++ in sem-core (https://github.com/Ataraxy-Labs/sem) but the entity extraction is best-effort for heavily macro'd code. For most application-level C++ it works well, but something like the Linux kernel would be rough. Honestly that's an argument for gritzko's AST-native storage approach where the parser can be more tightly integrated.

samuelstros 1 hour ago|||

How do you get blob file writes fast?

I built lix [0] which stores ASTs instead of blobs.

Direct AST writing works for apps that are “ast aware”. And I can confirm, it works great.

But all software just writes bytes atm.

The binary -> parse -> diff pipeline is too slow.

The parse and diff steps need to get out of the hot path. That semi-defeats the idea of a VCS that stores ASTs though.

[0] https://github.com/opral/lix

gritzko 47 minutes ago|||

I only diff the changed files. Producing a blob out of a BASON AST is trivial (one scan). Things may get slow for larger files, e.g. the tree-sitter C++ parser is a 25MB C file, 750KLoC. It takes a couple of seconds to import it. But it never changes, so no biggie.

There is room for improvement, but that is not a show-stopper so far. I plan to round-trip the Linux kernel with full history, which should expose all the bottlenecks.

P.S. I checked lix. It uses a SQL database. That solves some things, but also creates an impedance mismatch. Must be a 10x slowdown at least. I use key-value and a custom binary format, so it works nicely. One can go a level deeper still and use a custom storage engine; it will be even faster. Git is all custom.

rs545837 1 hour ago|||

This is exactly why weave stays on top of git instead of replacing storage. Parsing three file versions at merge time is fine (it took about 5-67ms). Parsing on every read/write would be a different story. I know about Lix, but will check it out again.

handfuloflight 2 hours ago||

Well, I'll be diving in. Thank you for sharing. Same for Weave.

rs545837 2 hours ago||

Awesome, let me know how it goes. Happy to help if you hit any rough edges.

rs545837 5 hours ago||

Some context on the validation so far: Elijah Newren, who wrote git's merge-ort (the default merge strategy), reviewed weave and said language-aware content merging is the right approach, that he's been asked about it enough times to be certain there's demand, and that our fallback-to-line-level strategy for unsupported languages is "a very reasonable way to tackle the problem." Taylor Blau from the Git team said he's "really impressed" and connected us with Elijah. The creator of libgit2 starred the repo. Martin von Zweigbergk (creator of jj) has also been excited about the direction. We are also working with the GitButler team to integrate it as a research feature.

The part that's been keeping me up at night: this becomes critical infrastructure for multi-agent coding. When multiple agents write code in parallel (Cursor, Claude Code, Codex all ship this now), they create worktrees for isolation. But when those branches merge back, git's line-level merge breaks on cases where two agents added different functions to the same file. weave resolves these cleanly because it knows they're separate entities. 31/31 vs git's 15/31 on our benchmark.

Weave also ships as an MCP server with 14 tools, so agents can claim entities before editing, check who's touching what, and detect conflicts before they happen.

kubb 26 minutes ago||

Congrats on getting acknowledged by people with credibility.

I also think that this approach has a lot of potential. Keep up the good work sir.

deckar01 4 hours ago||

Does this actually matter for multi-agent use cases? Surely people that are using swarms of AI agents to write code are just letting them resolve merge conflicts.

rs545837 4 hours ago||

So you don't think I'm biased about my own project, here's some context showing it's not just me: people on Twitter are saying how often merging breaks when you are running production-level code and frequently merging different branches.

https://x.com/agent_wrapper/status/2026937132649247118
https://x.com/omega_memory/status/2028844143867228241
https://x.com/vincentmvdm/status/2027027874134343717

deckar01 3 hours ago||

Those users all work for companies that sell AI tools. And the first one literally says they let AI fix merge conflicts. The second one is in a thread advocating for 0 code review (which this can’t guarantee) (and also ew). The third is also saying to just have another bot handle merging.

rs545837 3 hours ago||

Thanks a lot for the fair criticism, appreciate it! You're right that those links aren't the strongest evidence. The real argument isn't "people are complaining on twitter." It's much simpler: when two agents add different functions to the same file, git creates a conflict that doesn't need to exist. Weave knows they're separate entities and merges cleanly. Whether you let AI resolve the false conflict or avoid it entirely is a design choice; we think avoiding it is better.

deckar01 3 hours ago||

Dear god, it’s bots all the way down.

rs545837 3 hours ago||

What do you mean?

deckar01 3 hours ago||

It’s your GitHub profile. It looks suspiciously just like the other 10 GitHub users that have been spamming AI generated issues and PRs for the last 2 weeks. They always go quiet eventually. I suspect because they are violating GitHub’s ToS, but maybe they just run out of free tokens.

rs545837 2 hours ago|||

Thanks again for the criticism. Tackling each of your comments:

On GitHub’s ToS, since you suspect a violation, let me go through what they cover.

> What violates it:

        1. Automated bulk issues/PRs on repos that we don't own
        2. Fake stars or engagement farming
        3. Using bot accounts

We own the repo, there's not even a single fake star, I don't even know how to create a bot account lol.

> Scenario when we run out of free tokens.

OpenAI and Anthropic have been sponsoring my company with credits because I am trying to architect new software for a post-AGI world, so if I run out I will ask them for more tokens.

deckar01 2 hours ago||

And you are opening issues on projects trying to get them to adopt your product. Seems like spam to me. How much are you willing to spend maintaining this project if those free tokens go away?

rs545837 2 hours ago||

When you're just a normal guy genuinely trying to build something great and there's nobody who believes in you yet, the only thing you can do is go to projects you admire and ask "would this help you?" Patrick Collison did the same thing early on, literally taking people's laptops to install Stripe.

Palanikannan 2 hours ago|||

https://github.com/Ataraxy-Labs/weave/pull/11

Dude, did you just call me AI generated haha, I've been actively using weave for a GUI I've been building for blazingly fast diffs.

https://x.com/Palanikannan_M/status/2022190215021126004

So whenever I run into bugs that I patched locally in my clone, I try to let the clanker raise a PR upstream. Insane how easy things are now.

deckar01 2 hours ago||

I think you accidentally switched accounts.

rs545837 1 hour ago||

Nope, that's another user. He has been working with me on weave; check the PRs that you are calling AI generated.

_flux 1 hour ago||

How does it compare to https://mergiraf.org/ ? I've had good experience with it so far, although I rarely even need it.

It's also based on tree-sitter, but probably otherwise a more baseline algorithm. I wonder if that "entity-awareness" actually then brings something to the table in addition to the AST.

edit: man, I tried searching this thread for mention of the tool a few times, but apparently its name is not mergigraf

rs545837 1 hour ago|

Actually tackled it here: https://x.com/rs545837/status/2021423280410988580

Cheers,

_flux 1 hour ago||

I think it would be interesting to include it in the comparison table, as I think it could be viewed as a baseline language-aware merge tool.

rs545837 1 hour ago||

What kind of comparison table are you looking for? I actually added one to the website: https://ataraxy-labs.github.io/weave/

keysersoze33 5 hours ago||

Interesting that Weave tries to solve mergiraf's shortcomings (also tree-sitter based):

> git merges lines. mergiraf merges tree nodes. weave merges entities. [1]

I've been using mergiraf for ~6 months and tried to use it to resolve a conflict from multiple Claude instances editing a large bash script. Sadly neither supports bash out of the box, which makes me suspect that classic merge is better in this/some cases...

Will try adding the bash grammar to mergiraf or weave next time

[1] https://ataraxy-labs.github.io/weave/

rs545837 5 hours ago|

Hey, author here. This comparison came up a lot when weave went viral on X (https://x.com/rs545837/status/2021020365376671820).

The key difference: mergiraf matches individual AST nodes (GumTree + PCS triples). Weave matches entities (functions, classes, methods) as whole units. Simpler, faster, and conflicts are readable ("conflict in validate_token" instead of a tree of node triples).

The other big gap: weave ships as an MCP server with 14 tools for agent coordination. Agents can claim entities before editing and detect conflicts before they merge. That's the piece mergiraf doesn't have.

On bash: weave falls back to line-level for unsupported languages, so it'll work as well as git does there.

Adding a bash tree-sitter grammar would unlock entity-level merge for it. I can work on it tonight if you want it urgently.

Cheers,

50lo 3 hours ago||

If both sides refactor the same function into multiple smaller ones (extract method) or rename it, can Weave detect that as a structural refactor, or does it become “delete + add”? Any heuristics beyond name matching?

rs545837 3 hours ago|

Yes, weave detects renames via structural_hash (AST-normalized hash that ignores identifier names). If both sides rename the same function, it matches by structure and merges cleanly.
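For readers wondering what an identifier-insensitive hash might look like, here is a rough Python sketch using the standard ast and hashlib modules. It is an assumption about the general technique, not weave's actual structural_hash: the normalization rules and the _StripNames helper are invented for illustration.

```python
import ast
import hashlib

class _StripNames(ast.NodeTransformer):
    """Replace identifiers with a placeholder so only the code shape remains."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_"          # function name no longer affects the hash
        return node

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = "_"           # parameter names are ignored
        return node

    def visit_Name(self, node):
        node.id = "_"            # every variable/function reference is ignored
        return node

def structural_hash(source):
    """Hash of the AST with identifiers blanked out (positions are not dumped)."""
    tree = _StripNames().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

original = structural_hash("def validate(token):\n    return len(token) > 0\n")
renamed = structural_hash("def check(tok):\n    return len(tok) > 0\n")
edited = structural_hash("def check(tok):\n    return len(tok) > 1\n")
```

Here original and renamed are equal, so a pure rename can be matched by hash. edited differs from both, which shows the limit of this sketch: a rename combined with any change to the body no longer matches by hash alone. The sketch also blanks every name, so two functions that differ only in which function they call would hash the same.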

gritzko 2 hours ago||

This will not work for refactors. In fact, any other change will break the hash. I know because I used this approach for quite some time.

rs545837 2 hours ago||

Thanks a lot, I will test it out as you said. In the meantime, could you also open an issue on the repo so that I and others can track it?

gritzko 2 hours ago||

I will ask Claude to open it, thanks!

rs545837 2 hours ago||

Thanks, lemme know how it goes, I will review and we can discuss over the issue.

sea-gold 6 hours ago||

Website: https://ataraxy-labs.github.io/weave/

I haven't tried it but this sounds like it would be really valuable to me.

rs545837 5 hours ago|

Haha, thanks for the feedback, yeah multi agent workflows were especially kept in mind when designing this! So I hope it helps, I am always here for feedback and feature requests.

spacecrafter3d 3 hours ago||

Awesome, I've been wanting this for a long time! Any chance of Swift being supported?

rs545837 3 hours ago|

Swift isn't supported yet but adding a new language is straightforward since we use tree-sitter. There's already a tree-sitter-swift grammar. Would happily accept a PR for it, if you are down for it.

taejavu 3 hours ago||

I tried this with the kind of merge conflict I'd expect it to solve automatically, and it didn't. Is it supposed to work while rebasing, or is it strictly for merges?

rs545837 3 hours ago|

Thanks for trying it! Would love to know what the merge conflict looked like. If you can share the repo or a minimal repro, I'll dig into why it didn't resolve. That kind of feedback is exactly what helps us improve.

WalterGR 4 hours ago||

It’s described as a “merge driver for Git”. Is it usable independently of git? Can I use it to diff arbitrary files?

rs545837 4 hours ago|

We got asked this on the X thread too, when we went viral (https://x.com/rs545837/status/2021020365376671820). Your git doesn't change at all. Weave plugs in through git's merge driver interface (.gitattributes), so git still handles everything; it just calls weave for the content merge step instead of its default line-level algorithm. All your git commands, workflow, and history stay exactly the same.
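For anyone unfamiliar with that interface, here is a generic sketch of a custom merge driver written in Python. It shows git's documented contract (git passes temp-file paths via the %O, %A, %B placeholders, the driver overwrites the %A file with the result and exits 0 for a clean merge, non-zero for conflicts). It is not weave's code or its install procedure: the driver name "sketch", the script name, SUPPORTED, and whole_file_merge are all hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical wiring, shown only to explain the interface:
#   .gitattributes:   *.py merge=sketch
#   git config merge.sketch.driver "python3 driver.py %O %A %B %P"
# A real script would end with: sys.exit(merge_driver(sys.argv[1:5], some_merge))

SUPPORTED = {".py"}  # assumption: suffixes the structured merge understands

def whole_file_merge(base, ours, theirs):
    """Placeholder merge: take whichever side changed. Returns (text, conflict)."""
    if ours == theirs or theirs == base:
        return ours, False
    if ours == base:
        return theirs, False
    return ours, True   # both sides changed differently

def merge_driver(argv, structured_merge):
    """argv mirrors git's placeholders: [%O base, %A ours, %B theirs, %P path]."""
    base, ours, theirs, pathname = argv
    if Path(pathname).suffix not in SUPPORTED:
        # Fall back to git's own line-level three-way merge, which also
        # writes its result into the "ours" file.
        return subprocess.call(["git", "merge-file", ours, base, theirs])
    text, conflict = structured_merge(
        Path(base).read_text(), Path(ours).read_text(), Path(theirs).read_text()
    )
    Path(ours).write_text(text)   # git reads the merge result from the %A file
    return 1 if conflict else 0   # non-zero exit tells git there are conflicts
```

Because the driver only replaces the content-merge step for matching paths, the rest of git (commands, history, storage) is untouched, which is the point made above.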

For diffing arbitrary files outside git, we built sem (https://github.com/ataraxy-labs/sem) which does entity-level diffs. sem diff file1.py file2.py shows you which functions changed, were added, or deleted rather than line-level changes.
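As a rough idea of what an entity-level diff computes, here is a small Python sketch. It is not how sem works internally: it uses Python's ast module rather than tree-sitter, handles only top-level functions and classes in Python files, and entity_diff is a made-up name.

```python
import ast

def top_level_entities(source):
    """Map each top-level function/class name to a dump of its AST.

    ast.dump omits line numbers by default, so moving a function or
    editing a comment does not count as a change.
    """
    return {
        node.name: ast.dump(node)
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def entity_diff(old_source, new_source):
    """Report which entities were added, deleted, or modified."""
    old, new = top_level_entities(old_source), top_level_entities(new_source)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }

old_src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new_src = (
    "def a():\n    return 1  # comment only\n\n"
    "def b():\n    return 3\n\n"
    "def c():\n    pass\n"
)
report = entity_diff(old_src, new_src)
```

In this example report lists c as added and b as modified, while a is left out because only a comment changed.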
