Posted by modeless 18 hours ago
Yes this is cool. I actually have worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious unfulfilling work. Though we caught issues with both the specifications/test oracle when doing the work. Also many of the team members learned and are now SMEs for related systems.
Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
That's because the "testing" was not done independently. So anything can be possibly be made to be misleading. Hence:
> Written by Nicholas Carlini, a researcher on our Safeguards team.
Please fix.. :)
Look at this: https://github.com/7mind/jopa
“Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”
In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.
You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!
Otherwise it's not clean-room, it's plagiarism.
I have read nowhere near as much code (or anything) as what Claude has to read to get to where it is.
And I can write an optimizing compiler that isn't slower than GCC -O0
(prompt: what does a clean room implementation mean?)
From ChatGPT without login BTW!
> A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.
> The core idea is separation.
> Here’s how it usually works:
> The basic setup
> Two teams (or two roles):
> Specification team (the “dirty room”)
> Looks at the original product, code, or behavior
> Documents what it does, not how it does it
> Produces specs, interfaces, test cases, and behavior descriptions
> Implementation team (the “clean room”)
> Never sees the original code
> Only reads the specs
> Writes a brand-new implementation from scratch
> Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.
> Why people do this
> Reverse-engineering legally
> Avoid copyright infringement
> Reimplement proprietary systems
> Create open-source replacements
> Build compatible software (file formats, APIs, protocols)
I really am starting to think we have achieved AGI. > Average (G)Human Intelligence
LMAO
If you try to reimplement something in a clean room, its a step by step process, using your own accumulated knowledge as the basis. That knowledge that you hold in your brain, all too often is code that may have copyrights on it, from the companies you worked on.
Is it any different for a LLM?
The fact that the LLM is trained on more data, does not change that when you work for a company, leave it, take that accumulated knowledge to a different company, you are by definition taking that knowledge (that may be copyrighted) and implementing it somewhere else. It only a issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.
At what point is trained on copyrighted data, any different then a human trained on copyrighted data, that get reimplemented in a transformative way. The big difference is that the LLM can hold more data over more fields, vs a human, true... But if we look at specializations, this can come back to the same, no?
This is 100% unambiguously not clean-room unless they can somehow prove it was never trained on any C compiler code (which they can't, because it most certainly was).
[1] https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
[2] https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Guide...
They weren't trillion dollar AI companies to bankroll the defense sure. But thinking about clean room and using copyrighted stuff is not even an argument that's just nonsense to try to prove something when no one asked.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal one man's week of work.
Except the one man might explain such arbitrary and shitty code as this:
https://github.com/anthropics/claudes-c-compiler/blob/main/s...
why x9? who knows?!
Oh god the more i look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot...
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
This has been true for all of (known) human history. I’m gonna go ahead and make another bold prediction: tech will keep getting better.
The issue with this blog post is it’s mostly marketing.
Maybe I'm underestimating the simplicity of the C language, but that doesn't sound very plausible to me.
> Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.
Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.
First, the agents will attempt to fix issues on their own. Most easy problems will be fixed or worked-around in this manner. The hard problems will require a deeper causal model of how things work. For these, the agents will give up. But, the code-base has evolved to a point where no-one understands whats going on including the agents and its human handlers. Expect your phone to ring at that point, and prepare to ask for a ransom.
Train Claude without the programming dataset and give it a dozen of the best programming books, it'll have no chance of writing a compiler. Do the same for a human with an interest in learning to program and there's a good chance.
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.