We tasked Opus 4.6 using agent teams to build a C Compiler

Posted by modeless 8 hours ago

We tasked Opus 4.6 using agent teams to build a C Compiler (www.anthropic.com)

379 points | 345 comments | page 2

underdeserver 5 hours ago|

> when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.

> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel

This is a remarkably creative solution! Nicely done.
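For anyone who wants to see the shape of the idea, here is a minimal sketch of oracle-based narrowing. It is not the harness from the article: the article describes a randomized split, whereas this uses plain deterministic bisection, and it assumes exactly one file is responsible for the breakage. The names `find_bad_file` and `build_ok`, and the file names, are hypothetical; `build_ok(subset)` stands in for "compile `subset` with the compiler under test, everything else with GCC, and check whether the kernel works".

```python
from typing import Callable, List, Optional, Sequence


def find_bad_file(
    files: Sequence[str],
    build_ok: Callable[[List[str]], bool],
) -> Optional[str]:
    """Narrow a broken build down to one file by bisection.

    build_ok(subset) is a hypothetical oracle: it returns True when the
    build works with `subset` compiled by the compiler under test and all
    other files compiled by the known-good reference compiler.
    Assumes a single culprit file; interacting bugs need a fuller
    delta-debugging approach.
    """
    candidates = list(files)
    if build_ok(candidates):
        return None  # nothing in this set breaks the build

    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # If the left half alone breaks the build, the culprit is there;
        # otherwise (single-culprit assumption) it must be in the right half.
        candidates = left if not build_ok(left) else right

    return candidates[0]


# Simulated oracle with made-up file names: the build breaks whenever
# "sched.c" is compiled by the compiler under test.
files = ["init.c", "mm.c", "sched.c", "fs.c", "net.c"]
culprit = find_bad_file(files, lambda subset: "sched.c" not in subset)
print(culprit)
```

Each agent could run a search like this on its own slice of files, which is presumably why the oracle unblocked parallel work: a failure gets pinned to a small set of files instead of to "the kernel".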

OsrsNeedsf2P 8 hours ago||

This is like a working version of the Cursor blog. The evidence (that it compiles the Linux kernel) is much more impressive than a browser that didn't even compile (until someone intervened manually).

ben_w 8 hours ago|

It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics; now it's all this.

I'll still work on it, of course. It just won't be so surprising.

akrauss 8 hours ago||

I would like to see the following published:

- All prompts used

- The structure of the agent team (which agents / which roles)

- Any other material that went into the process

This would be a good source for learning, even though I'm not ready to spend $20k just to replicate the experiment.

password4321 6 hours ago|

Yes, unfortunately these days most people are satisfied with just the sausage and no details about how it was made.

rwmj 5 hours ago||

The interesting question here is what this code is worth (in money terms). I would say it's worth only the cost of recreation, apparently $20,000, and not very much more. Perhaps you can add a bit for the time taken to prompt it. Anyone who can afford that can use the same prompt to generate another C compiler, and another one and another one.

GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.

In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.

kingstnap 5 hours ago|

The code isn't worth money. This is an experiment. The knowledge that something like this is even possible is what is worth money.

If you had known in 2022 that a transformer could pull this off, even with all its flawed code, you would have been floored.

Keep in mind that just a few years ago, the state of the art in what these LLMs could do was questions of this nature:

Suppose g(x) = f^{-1}(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?

The above is from the "Sparks of AGI" paper on GPT-4, where they were floored that it could coherently reason through the 3 steps of inverting things (6 -> 9 -> 7 -> 4) while GPT-3.5 was still spitting out a nonsense argument of this form:

f(f(f(6))) = f(f(g(9))) = f(f(6)) = f(g(7)) = f(9).

This is from March 2023, and it was genuinely very surprising at the time that these pattern-matching machines trained on next-token prediction could do this. Something like an LSTM can't do anything like this at all, btw; nowhere close.
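For reference, the three inversion steps in the puzzle can be checked mechanically. A small sketch using only the values given in the question (the variable names are just for illustration): since g is the inverse of f, every pair g(x) = y gives f(y) = x.

```python
# Values of g from the puzzle, where g is the inverse of f.
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}

# Invert the table: g(x) = y  means  f(y) = x.
f = {y: x for x, y in g.items()}

# Apply f three times starting from 6, recording each step.
value = 6
trace = [value]
for _ in range(3):
    value = f[value]
    trace.append(value)

print(" -> ".join(str(v) for v in trace))  # 6 -> 9 -> 7 -> 4
```

So f(f(f(6))) = 4, matching the chain quoted above.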

To me it's very surprising that the C compiler works. It takes a ton of effort to build such a thing. I can imagine the flaws actually do get better over the next year as we push the goalposts out.

ks2048 7 hours ago||

It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human-written prompts (?).

First 10 commits, "git log --all --pretty=format:%s --reverse | head",

  Initial commit: empty repo structure
  Lock: initial compiler scaffold task
  Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
  Lock: implement array subscript and lvalue assignments
  Implement array subscript, lvalue assignments, and short-circuit evaluation
  Add idea: type-aware codegen for correct sized operations
  Lock: type-aware codegen for correct sized operations
  Implement type-aware codegen for correct sized operations
  Lock: implement global variable support
  Implement global variable support across all three backends

forty 4 hours ago||

We live in a wonderful time where I can spend hours and $20,000 to build a C compiler which is slow and inefficient and anyway requires an existing great compiler to even work, and then neither I nor the agent has any idea how to make it useful :D

dzaima 3 hours ago||

Clicked on the first thing I happen to be interested in - SIMD stuff - and ended up at https://github.com/anthropics/claudes-c-compiler/blob/6f1b99..., which is a fast path incompatible with the _mm_free implementation; pretty trivial bug, not even actually SIMD or anything specialized at all.

A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.

geooff_ 6 hours ago||

Maybe I'm naive, but I find these "re-engineering a complex product" posts underwhelming. C compilers exist, and realistically Claude's training corpus contains a ton of C compiler code. The task is already perfectly defined. There exists a benchmark of well-adopted codebases that can be used to prove whether this is a working solution. Half the difficulty in making something is proving it works and is complete.

IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.

bs7280 6 hours ago||

I don't see this as just an exercise in making a new useful thing, but as benchmarking a SOTA model's ability to create a massive* project on its own, with some verifiable metrics of success. I believe they were able to build FFmpeg with this Rust compiler?

How much would it cost to pay someone to make a C compiler in Rust? A lot more than $20k.

* massive meaning "total context needed" >> model context window

stephc_int13 6 hours ago||

This is a nice benchmark IMO. I would be curious to see how competitors and improved models would compare.

NitpickLawyer 5 hours ago||

And how long will it take before an open model recreates this? The "vibe" consensus before "thinking" models really took off was that open was ~6mo behind SotA. With the massive RL improvements over the past 6 months, I've thought the gap was actually increasing. This will be a nice little verifiable test going forward.

yu3zhou4 7 hours ago||

At this point, I genuinely don't know what to learn next to not become obsolete when another Opus version gets released

missingdays 6 hours ago||

Learn to fix bugs, it's gonna be more relevant than ever

RivieraKid 6 hours ago||

I agree. I don't understand why there are so many software engineers who are excited about this. I would only be excited if I was a founder in addition to being a software engineer.

gignico 8 hours ago|

> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)

chasd00 5 hours ago||

> If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)

I don't know if you could. Let's say you get a check for $20k: how long will it take you to make an equivalently performing and compliant compiler? Are you going to put your life on pause until it's done for $20k? Who's going to pay your bills when the $20k is gone after 3 months?

minimaxir 8 hours ago|||

There is an entire Evaluation section that addresses that criticism (both in agreement and disagreement).

52-6F-62 8 hours ago||

If we're just writing off the billions in upfront investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.
