Posted by modeless 16 hours ago

We tasked Opus 4.6 using agent teams to build a C Compiler (www.anthropic.com)
536 points | 507 comments
davemp 11 hours ago|
Brute-forcing a problem with a perfect test oracle and a really good heuristic (how many C compilers are in the training data) is not enough to justify the hype, imo.

Yes, this is cool. I have actually worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious, unfulfilling work. That said, we caught issues with both the specifications and the test oracle while doing the work. Also, many of the team members learned a lot and are now SMEs for related systems.

Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
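
For what it's worth, the loop being described (a perfect oracle plus a pass-rate heuristic) is simple to sketch. The following is purely illustrative; the ./mycc binary, paths, and test layout are hypothetical, not Anthropic's actual harness:

    # Illustrative sketch of an oracle-driven acceptance loop: run every test
    # under the candidate compiler and keep a patch only if the pass count
    # does not regress. The ./mycc binary and test layout are made up.
    import glob
    import subprocess

    def passes(test_c: str) -> bool:
        """A test passes if the mycc-built binary prints the expected output."""
        exe = "/tmp/test_bin"
        build = subprocess.run(["./mycc", test_c, "-o", exe], capture_output=True)
        if build.returncode != 0:
            return False
        try:
            run = subprocess.run([exe], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        expected = open(test_c.replace(".c", ".expected")).read()
        return run.returncode == 0 and run.stdout == expected

    def pass_rate(test_dir: str = "tests") -> float:
        tests = sorted(glob.glob(f"{test_dir}/*.c"))
        return sum(passes(t) for t in tests) / len(tests)

    baseline = pass_rate()
    # ... apply the agent's candidate patch and rebuild ./mycc here ...
    if pass_rate() < baseline:
        print("regression against the oracle: reject the patch")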

jhallenworld 13 hours ago||
Does it make a conforming preprocessor?
almosthere 11 hours ago||
This is like the 6th trending Claude story today. It seems obvious that they told everyone at Anthropic to upvote and comment.
casey2 7 hours ago||
Interesting that they are still going with a testing strategy despite the wasted time. I think in the long run model checking and proofs are more scalable.

I guess it makes sense, as agents can generate tests. Since you are taking this route, I'd like to see agents that act as users: ones that can only access docs, textbooks, user forums, and builds.
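
To make the model-checking/proofs point concrete, an SMT solver can already discharge the correctness of individual compiler rewrites for all inputs rather than for a test sample. A small illustration using the z3-solver Python bindings (nothing to do with the Anthropic project):

    # Illustrative only (pip install z3-solver): prove a peephole rewrite
    # correct for every 32-bit input instead of testing a handful of cases.
    from z3 import BitVec, prove

    x = BitVec("x", 32)

    # Strength reduction: x * 8 == x << 3 holds for all 32-bit x ("proved").
    prove(x * 8 == x << 3)

    # A tempting but wrong rewrite: (x / 2) * 2 == x fails for odd x, and
    # Z3 reports a concrete counterexample instead of "proved".
    prove((x / 2) * 2 == x)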

IshKebab 13 hours ago||
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.

logicprog 9 hours ago||
I will say that one thing that's extremely interesting is that everyone laughed at and made fun of Steve Yegge when he released Gas Town, which centered on exactly this idea: more than a dozen agents working on a project simultaneously, some generalized agents implementing features while others are more specialized and handle second-order tasks, all run independently in a loop from an orchestrator until they've finished the project, each working in its own worktree and resolving merge conflicts as the coordination mechanism. But it's starting to look like he was right; he really was aiming for where the puck was headed. First we got Cursor with the fast render browser, then we got Kimi K2.5 releasing with, from everything I can tell, genuinely new RL techniques for orchestrating agent swarms. And now we have this: Anthropic themselves doing a Gas Town-style agent-swarm model of development. It's beginning to look like he knew where the puck was headed before it got there.

Now, whether we should actually be building software in this fashion, or even heading in this direction at all, is a completely separate question, and I would tend strongly towards no. At least not until we have very strong yet easy-to-use, concise, low-effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc.; and even then, we'll end up pair-programming those formal specifications and batteries of tests with AI agents: not writing them ourselves, since that's inefficient, but not turning them over to agent swarms either, since they are too important. And if we did turn them over to swarms, we'd end up with an infinite regress problem. Ultimately, that's just programming at a higher level at that point, so I would argue we should never predominantly develop in this way.
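
On the property-based testing point, a cheap version of that layer already exists for compilers: generate random programs and check an executable property against a reference semantics. A toy sketch with Hypothesis and a hypothetical ./mycc binary (the expression grammar is kept tiny to dodge overflow and division edge cases); illustrative only:

    # Toy property-based test: random constant expressions must evaluate to
    # the same value when compiled by ./mycc (hypothetical) and when
    # evaluated by Python. Small operands and only +, -, * avoid overflow/UB.
    import subprocess
    from hypothesis import given, settings, strategies as st

    leaf = st.integers(min_value=0, max_value=50).map(str)
    expr = st.recursive(
        leaf,
        lambda sub: st.tuples(sub, st.sampled_from(["+", "-", "*"]), sub).map(
            lambda t: f"({t[0]} {t[1]} {t[2]})"
        ),
        max_leaves=5,
    )

    @given(expr)
    @settings(max_examples=50, deadline=None)
    def test_constant_expression(e):
        src = '#include <stdio.h>\nint main(void) { printf("%d\\n", ' + e + '); return 0; }\n'
        with open("/tmp/pbt.c", "w") as f:
            f.write(src)
        subprocess.run(["./mycc", "/tmp/pbt.c", "-o", "/tmp/pbt"], check=True)
        out = subprocess.run(["/tmp/pbt"], capture_output=True, text=True, check=True)
        assert int(out.stdout) == eval(e)  # safe: we generated the string ourselves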

But still, there is prescience in Gas Town, apparently, and that's interesting.

sho_hn 16 hours ago||
Nothing in the post about whether the compiled kernel boots.
chews 15 hours ago|
The video does show it booting.
pshirshov 12 hours ago||
Pfft, a C compiler.

Look at this: https://github.com/7mind/jopa

light_hue_1 15 hours ago|
> This was a clean-room implementation (Claude did not have internet access at any point during its development);

This is absolutely false and I wish the people doing these demonstrations were more honest.

It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.

Like the web browser project, this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people; this is the easiest scenario for any ML setting.
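
For anyone who hasn't seen this kind of setup: using GCC as the oracle is classic differential testing. A minimal, purely illustrative sketch (the ./mycc name and paths are hypothetical):

    # Differential-testing sketch: GCC is the oracle; any divergence in exit
    # code or stdout between the gcc-built and mycc-built binaries flags a bug.
    import subprocess
    import sys

    def build_and_run(compiler, source, exe):
        subprocess.run([compiler, source, "-o", exe], check=True)
        r = subprocess.run([exe], capture_output=True, text=True, timeout=10)
        return r.returncode, r.stdout

    def differential_check(source):
        ref = build_and_run("gcc", source, "/tmp/ref_bin")
        cand = build_and_run("./mycc", source, "/tmp/cand_bin")
        if ref != cand:
            print(f"divergence on {source}: gcc={ref!r} mycc={cand!r}")
            return False
        return True

    if __name__ == "__main__":
        # e.g. python diff_test.py tests/*.c
        results = [differential_check(p) for p in sys.argv[1:]]
        sys.exit(0 if all(results) else 1)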

rvz 13 hours ago|
> This is absolutely false and I wish the people doing these demonstrations were more honest.

That's because the "testing" was not done independently. So anything could possibly be made to be misleading. Hence:

> Written by Nicholas Carlini, a researcher on our Safeguards team.
