Posted by todsacerdoti 17 hours ago

When AI writes the software, who verifies it? (leodemoura.github.io)
214 points | 217 comments
_pdp_ 15 hours ago|
I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.

The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?

Verification works because someone has already defined what "correct" looks like. There is possibly a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.

But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.

Avshalom 15 hours ago||
Well that's a problem the software industry has been building for itself for decades.

Software has, since at least the adoption of "agile", created an industry culture not just of refusing to build to specs but of insisting that specs are impossible to get from a customer.

bigfishrunning 9 hours ago|||
I always try to get the customer to provide specs, and failing that, to agree to specs before we start working. It's usually very difficult.
daveguy 14 hours ago||||
Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
bunderbunder 10 hours ago|||
I have found that it works well as an open-endedly dynamic process when you are doing the kind of work that the people who came up with Scrum did as their bread and butter: limited-term contract jobs that were small enough to be handled by a single pizza-sized team and whose design challenges mostly don’t stray too far outside the Cynefin clear domain.

The less any of those applies, the more costly it is to figure it out as you go along, because accounting for design changes can become something of a game of crack the whip. Iterative design is still important under such circumstances, but it may need to be a more thoughtful form of iteration that’s actively mindful about which kinds of design decisions should be front-loaded and which ones can be delayed.

daveguy 6 hours ago||
You definitely need limits around it. Especially as a consulting team. It's not for open ended projects, and if you use it for open ended projects as a consultant you're in for a world of hurt. On the consultant side, hard scope limits are a must.

And I completely agree that requirement proximity estimation is a critical skill. I do think estimation of requirement proximity is a much easier task than time estimates.

skydhash 11 hours ago|||
And good luck when you get misaligned specs (communication issues on the customer side, docs that are not aligned with the product, ...). Drafting specs and investigating failures will require both a diplomat hat and a detective hat. Maybe with the developer hat, we will get DDD being meaningful again.
user3939382 9 hours ago||||
I don’t want to put words in your mouth, but I think I agree. It’s called requirements engineering. It’s hard, but it’s possible, and waterfall works fine for many domains. The Agile teams I see are burning resources doing the same thing 2-3x, or sprinting their way into major, costly architectural mistakes that would have been easily avoided by upfront planning and specs.
pydry 11 hours ago|||
Agile is a pretty badly defined beast at the best of times, but even the most twisted interpretation doesn't mean that. It's mainly just a rejection of BDUF (big design up front).
galaxyLogic 3 hours ago||
There are two ways of thinking about tests:

A) They let you verify that implementation passes its spec, more or less.

B) They are a (trustworthy) description of how the system behaves, they allow you to understand the system better.
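Both readings can be shown on one tiny example. This is a sketch, and `slugify` is a made-up function used purely for illustration:

```python
# The same test read two ways: (A) a check that the implementation
# passes its spec; (B) an executable, trustworthy description of
# behaviour. `slugify` is hypothetical.

def slugify(title: str) -> str:
    """Turn a title into a URL slug."""
    return "-".join(title.lower().split())

# (A) Spec check: the implementation must satisfy this to ship.
assert slugify("Hello World") == "hello-world"

# (B) Behaviour description: a reader learns the edge-case handling
# (extra whitespace is collapsed) from the test itself, without
# opening the implementation.
assert slugify("  Spaces   Everywhere  ") == "spaces-everywhere"
```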

faitswulff 9 hours ago||
Someone actually did fuzz the Claude C compiler:

https://john.regehr.org/writing/claude_c_compiler.html
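The shape of that kind of fuzzing fits in a few lines. The sketch below does differential testing on toy arithmetic expressions rather than C; real compiler fuzzing (e.g. Csmith against a reference compiler) follows the same pattern at far larger scale:

```python
# Differential fuzzing sketch: generate random expressions and check
# that two independent evaluators agree; any disagreement is a bug.
import ast
import random

def gen_expr(depth=3):
    """Random fully parenthesized integer expression."""
    if depth == 0:
        return str(random.randint(0, 9))
    a, b = gen_expr(depth - 1), gen_expr(depth - 1)
    return f"({a} {random.choice('+-*')} {b})"

def eval_ref(expr):
    """'Reference implementation': Python's own evaluator."""
    return eval(expr)

def eval_alt(expr):
    """'Implementation under test': a tiny AST walker."""
    ops = {ast.Add: lambda x, y: x + y,
           ast.Sub: lambda x, y: x - y,
           ast.Mult: lambda x, y: x * y}
    def walk(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        return node.value  # ast.Constant leaf
    return walk(ast.parse(expr, mode="eval").body)

random.seed(0)
for _ in range(1000):
    e = gen_expr()
    assert eval_ref(e) == eval_alt(e), e  # disagreement = bug found
```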

vicchenai 7 hours ago||
The verification problem scales poorly with AI complexity. Current approaches rely on test suites, but AI-generated code tends to optimize for passing existing tests rather than correctness in the general case.

What's interesting is this might be the forcing function that finally brings formal verification into mainstream use. Tools like Lean and Coq have been technically impressive but adoption-starved. If unverified AI code is too risky to deploy in critical systems, organizations may have no choice but to invest in formal specs. AI writes the software, proof assistants verify it.

The irony: AI-generated code may be what makes formal methods economically viable.
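The "AI writes, proof assistant verifies" loop can be sketched in a few lines of Lean 4. `double` is a made-up example; the point is that the kernel, not a reviewer, certifies that the implementation meets its spec:

```lean
-- A hypothetical AI-written implementation...
def double (n : Nat) : Nat := n + n

-- ...shipped together with a machine-checked proof of its spec.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double; omega
```

If the proof doesn't check, the code doesn't ship, regardless of who (or what) wrote it.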

SurvivorForge 5 hours ago||
The uncomfortable truth is that most teams were already bad at verification before AI entered the picture. The difference is that AI-generated code comes in faster, so the verification bottleneck becomes painfully obvious. I think the real opportunity here is that AI forcing us to think harder about specifications and correctness might actually improve the quality of human-written code too — it's making us confront the fact that "it works on my machine" was never real verification.
8note 3 hours ago||
i think formal methods and math proofs will be useful like tests are for getting more feedback to the LLM to get to a working solution, but i dont think it at all solves the problem of "poisoned training data introduces specific bugs and vulnerabilities"

the bug will also be introduced in the formal spec, and people will still miss it by not looking.

i think fast response and fix time - anti-entropy - will win out against trying to increase the activation energy, to quote the various S3 talks. You need a cleanup method, rather than to prevent issues in the first place

lateforwork 9 hours ago||
The first thing you should have AI write is a comprehensive test suite. Then have it implement the main functionality. If the tests pass that is one level of verification.

In addition you can have one AI check another AI's code. I routinely copy/paste code between Claude, ChatGPT, and Gemini and have them check each other's code. This works very well. During the process I have my own eyes verify the code as well.
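One way to make that first level of verification harder to game is property-based checks over random inputs, rather than a handful of examples the model could overfit to. A stdlib-only sketch; `my_sort` is a stand-in for whatever the model implemented:

```python
# Property-based sanity check: assert properties any correct sort
# must have, over many random inputs.
import random

def my_sort(xs):
    """Implementation under test (hypothetical): insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

random.seed(42)
for _ in range(500):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    out = my_sort(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
    assert sorted(out) == sorted(xs)                  # and a permutation
    assert out == sorted(xs)                          # matches a reference
```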

void-star 9 hours ago||
The advice that everyone seemed to agree on at least just a few months ago was to make sure _you_ write the comprehensive tests/specs, and this is what I would still recommend to anyone asking. I guess even this may be falling out of fashion though…
p1necone 9 hours ago|||
Generate with carefully steered AI, sanity check carefully. For a big enough project writing actually comprehensive test coverage completely by hand could be months of work.

Even state-of-the-art AI models seem to have no taste, or no sense of 'hang on, what's even the point of this test?'. I've seen them diligently write hundreds of completely pointless tests, and sometimes the reason they're pointless is some subtle thing that's hard to notice amongst all the legit-looking expect code.

lateforwork 5 hours ago|||
There is no need to write tests manually. Just review the tests and make sure there is good coverage; if there isn't, ask the AI for additional tests and give it guidance.
throwaway613746 8 hours ago||
[dead]
phantomathkg 4 hours ago||
Humans write the requirements, and the requirements contain flaws. A human or an AI translates them into specifications, and eventually code.

It does not matter whether the middleman is human or AI, or whether the result is written in a "traditional language" or with "formal verification". The bugs will be there, because the humans failed to define bulletproof requirements.

signa11 3 hours ago||
> When AI writes the software, who verifies it?

oh that's quite simple: the dude / dudette who gets blamed is the one who verifies it.

bfung 9 hours ago||
When humans write the software, who verifies it?

half sarcasm, half real-talk.

TDD is nice, but human coders barely do it. At least AI can do it more!

nvlled 7 hours ago|
> half sarcasm, half real-talk.

If you could pause a bit from being awed by your own perceived insightfulness, you would think just a bit harder and realize that LLMs can generate hundreds of thousands of lines of code that no human could ever verify within a finite amount of time. Human-written software is human-verifiable, AI-assisted human-written software is still human-verifiable to some extent, but purely AI-written software can no longer be verified by humans.

nicbou 18 minutes ago||
That's a very unpleasant tone to take.
oakpond 15 hours ago|
You do. Even the latest models still frequently write really weird code. The problem is some developers now just submit code for review that they didn't bother to read. You can tell. Code review is more important than ever imho.
sausagefeet 15 hours ago||
I agree with you. But I have to say, it is an uphill battle and all the incentives are against you.

1. AI is meant to make us go faster, reviews are slow, the AI is smart, let it go.

2. There are plenty of AI maximizers who only think we should be writing design docs and letting the AI go to town on it.

Maybe, this might be a great time to start a company. Maximize the benefits of AI while you can without someone who has never written a line of code telling you that your job is going to disappear in 12 months.

All the incentives are against someone who wants to use AI in a reasonable way, right now.

redhed 15 hours ago||
I actually agree it's a good time to start a company. Lots of available software engineers who can actually understand code, AI at a level that can actually speed up development, and so many startups focusing on AI-wrapper slop that you can actually make a useful product and separate yourself from the herd.

Or you can be a grifter and make some AI wrapper yourself and cash out with some VC investment. So good time for a new company either way.

johnmaguire 12 hours ago|||
It's gonna be like that HBO Silicon Valley bit again, where everyone and their doctor is telling you about their app.
rvz 7 hours ago|||
The people declaring themselves holier-than-thou, with self-proclaimed 'principles', are the worst grifters, and the ones actively scamming with AI and VC investment.

Pretending that only they can save the world, declaring they don't use AI while secretly using it to build a so-called "AI startup", and then going to the media doomsaying that "AGI" is coming.

At this point in this cycle in AI, "AGI" is just grifting until IPO.

bradleykingz 14 hours ago|||
But it's so BORING. AI gets to do the fun part (writing code) and I'm stuck with the lame bits.

It's like watching someone else solve a puzzle, or watching someone else play a game vs playing it yourself (at least that's half as interesting as playing it through)

nz 14 hours ago|||
Your workplace has chosen to deprive you of the enjoyment that you got from the work. You have a few options: (1) ask for a raise proportional to the percentage of enjoyment that you lost, (2) find a workplace that does not do this, or (3) phone it in (they see you and your craft as something to be milked for cash, so maybe stop letting yourself get milked, and milk them right back, by doing _exactly_ what is asked of you and _not_ more -- let these strategic geniuses strategize using their own brains).
mosura 10 hours ago||||
LLMs are still not good at structurally hard problems, and it is doubtful they ever will be absent some drastic extension (including continuous learning). In the meantime, the trick is creating a framework where you can walk them through the exact stages you would go through to do it yourself, only it goes way faster. The problem is many people stop at the first iteration that looks like it works and then move on, but you have to keep pushing, in the same way you do with humans.

Bluntly though, if what you were doing was CRUD boilerplate then yeah it is going to just be a review fest now, but that kind of work always was just begging to be automated out one way or another.

HoldOnAMinute 14 hours ago||||
I am really enjoying making requirements docs in an iterative process. I have a continuous improvement loop where I use the implementation to test out the docs. If I find a problem with the docs, I throw away the implementation, improve the docs, then re-implement. The kind of docs I'm getting are of amazing quality.
lukan 14 hours ago||||
For me the most fun part is getting something that works: design the goal, but don't micromanage and get lost in the details. I love AI for that, but it is hard to really own code this way. (At least I manually approve every change, or most of them, but still, verifying is hard.)
bitwize 14 hours ago||
AI has really sharpened the line between the Master Builders of the world and the Lord Businesses along this question: What, exactly, is the "fun part" of programming? Is it simply having something that works? Or is it the process of going from not having it to having it through your own efforts and the sum total of decisions you made along the way?
stretchwithme 14 hours ago|||
I can solve a problem in 10% of the time. Dealing with an issue TODAY, instead of having to put it in the backlog.
MrDarcy 15 hours ago|||
It is remarkably effective to have Claude Code do the code review and assign a quality score, call it a grade, to the contribution derived from your own expectations of quality.

Then don’t even bother looking at C work or below.

NitpickLawyer 15 hours ago||
IME it works even better if you use another model for review. We've seen code by cc and review by gpt5.2/3 work very well.

Also works with planning before any coding sessions. Gemini + Opus + GPT-xhigh works to get a lot of questions answered before coding starts.

throwaw12 12 hours ago|||
> You do

I really want to say: "You are absolutely right"

But here is a problem I am facing personally (numbers are hypothetical).

I get 10-15 review requests a day from 4 teammates, who are generating code by prompting, and I am doing the same, so you can guess we might have ~20 PRs/day to review. Each PR roughly updates 5-6 files and 10-15 lines in each.

So you can estimate that I am looking at around 50-60 files, but I can't keep the context of the whole file, because the change I am looking at is somewhere in the middle: 3 lines here, 5 lines there, and another 4 lines at the end.

How am I supposed to review all these?

ptnpzwqd 11 hours ago|||
If reviewing has become the bottleneck, the obvious - albeit slightly boring - solution is to slow down spitting out new code, and spend relatively more time reviewing.

Just going ahead and piling up PRs or skipping the review process is of course not recommended.

throwaw12 11 hours ago||
you are not wrong, but the solution you are proposing is just throttling the system because of the bottleneck; it doesn't solve the bottleneck problem.
ptnpzwqd 10 hours ago|||
Correct, but that has and probably always will be the case.

You spend the time on what is needed for you to move ahead - if code review is now the most time consuming part, that is where you will spend your time. If ever that is no longer a problem, defining requirements will maybe be the next bottleneck and where you spend your time, and so forth.

Of course it would be great to get rid of the review bottleneck as well, but I at least don't have an answer to that - I don't think the current generation of LLMs are good enough to allow us bypassing that step.

sjajzh 10 hours ago|||
You know we’ve had the ability to generate large amounts of code for a long time, right? You could have been drowning in reviews in 2018. Cheap devs are not new. There’s a reason this trend never caught on for any decent company.
throwaw12 10 hours ago||
I hope you are not a bot, because your account was created just 8 minutes ago.

> You know we’ve had the ability to generate large amounts of code for a long time, right?

No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.

Even if you employ 100K people and ask them to write proper if/else code non-stop, an LLM can still outcompete them by a huge margin, with much better-looking code.

(Don't compare LLM output to the codegen of the past, because codegen was carefully crafted and a lot of the time deterministic; I am only talking about people writing code vs LLMs writing code.)

sjajzh 9 hours ago||
I’m not a bot.

> No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.

Are you talking about “I’m overwhelmed by code review” or “we can now produce code at a scale no amount of humans can ever review”. Those are 2 very different things.

You review code because you’re responsible for it. This problem existed pre-AI, and nothing has changed wrt being overwhelmed; the solution is still the same. As to the latter, I think that’s more the software dark factory kind of thinking?

I find that interesting, and maybe we’ll get there. But again, the code it takes to verify a system is drastically more complex than the system itself. I don’t know how you could build such a thing except in narrow use cases. I do think we’ll see those one day, though how narrow they are is the key part.

sjajzh 10 hours ago||||
Ideally, you’re working with teammates you trust. On the best teams I’ve worked on, reviews were a formality: most of the time a quick scan and a LGTM. We worked together prior to the review, as needed, on areas we knew would need input from others.

AI changes none of this. If you’re putting up PRs and getting comments, you need to slow down. Slow is smooth, and smooth is fast.

I’ll caveat this with: that’s only if your employer cares about quality. If you’re fine passing that on to your users, you might as well just stop reviewing altogether.

throwaw12 10 hours ago||
> Ideally, you’re working with teammates you trust.

I do trust them, but the code is not theirs; the prompt is. What if I trust them, but because of how much they use LLMs their brains have started becoming lazy and they have started missing edge cases? Who should review the code, me or them?

At the beginning, I relied on my trust and did quick scans, but eventually I noticed they became uninterested in the craft and started submitting LLM output as-is. I still trust them as good-faith actors, but not their brains anymore (nor my own, for that matter).

Also, the assumption is based on an ideal team, where everyone behaves in good faith. But this is not the case in corporations and big tech, especially when incentives are aligned with the "output/impact" you are making. A lot of the time, promoted people won't see the impact of their past bad judgements, so why craft perfect code?

sjajzh 9 hours ago||
Yeah, I agree with you. I’d say they’re not high performers anymore. Best answer I’ve got is find a place where quality matters. If you’re at a body shop it’s not gonna be fun.

I do think some of this is just a hype wave and businesses will learn quality and trust matter. But maybe not - if wealth keeps becoming more concentrated at the top, it’s slop for the plebs.

devin 8 hours ago||
My work has turned into churning out a PR, marking it as a draft so no one reviews it, and walking away. I come back after thinking about what it produced and usually realize it missed something or that the implications of some minor change are more far-reaching than the LLM understood. I take another pass. Then, I walk away again. Repeat.

Honestly I'm not sure much has changed with my output, because I don't submit PRs which aren't thoughtful. That is what the most annoying people in my organization do. They submit something that compiles, and then I spend a couple hours of my day demonstrating how incorrect it is.

For small fixes where I can recognize there is a clear, small fix which is easily testable I no longer add them to a TODO list, I simply set an agent off on the task and take it all the way to PR. It has been nice to be able to autopilot mindless changesets.

johnmaguire 12 hours ago||||
I don't quite follow - are you describing an issue with the way your team has structured PRs? IMO, a PR should contain just enough code to clearly and completely solve "a thing" without solving too much at once. But what this means in practice depends on the team, product, velocity, etc. It sounds like your PRs might be broken up into too small of chunks if you can't understand why the code is being added.
throwaw12 12 hours ago||
I am saying the PRs I get are around 60-70 lines of change, which is small enough to be considered a single unit (add to this unit tests which must pass with the new change, so we are talking about a 30-line change + a 30-line unit test).

But when looking at the PR changes, you don't always see the whole picture, because the review subjects (code lines) are scattered across files and methods, and GitHub also shows methods and files partially, making it even more difficult to quickly spot the context around those updated lines.

It's a difficult problem, because even if GitHub shows the whole body of the updated method or file, you still don't see the grand picture.

For example: A (calls) -> B -> C -> D

And you made changes in D, how do you know the side effect on B, what if it broke A?

FartyMcFarter 10 hours ago|||
If the code is well architected, the contract between C and D should make it clear whether changes in D affect C or not. And if C is not affected, then B and A won't be either.
throwaw12 9 hours ago||
> If the code is well architected

Big constraint. Code changes: the initial architecture could have been amazing, but constantly changing business requirements make things messy.

Please don't use "in an ideal world" examples :) because they are singular in the vast space of non-ideal solutions.

FartyMcFarter 9 hours ago||
In that case your problem is bigger than just reviewing changes. You need to point the fingers at the bad code and bad architecture first.

There's no way to make spaghetti code easy to review.

cesarb 10 hours ago||||
> Its difficult problem, because even if GitHub shows whole body of the updated method or a file, you still don't see grand picture.

> For example: A (calls) -> B -> C -> D

> And you made changes in D, how do you know the side effect on B, what if it broke A?

That's poor encapsulation. If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.

throwaw12 9 hours ago|||
> That's poor encapsulation

That's the reality of most software built in the last 20 years.

> If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.

Any change in D must eventually affect B or A; it's inevitable, otherwise D shouldn't exist in the call stack.

Here is how the case I mentioned can happen. Imagine each layer has 3 variations: 1 happy path and 2 edge-case handlers. Starting from the lowest:

D: 3, C: 3*D = 9, B: 3*C = 27, A: 3*B = 81

Obviously, you won't be writing 81 unit tests for A and 27 for B; you will mock implementations and write enough unit tests to make the coverage good. Because of that mocking, when you update D and add a new case but do not surface the relevant mocking to the upper layers, you will end up in a situation where D impacts A, but it's not visible in the unit tests.

While reading the changes in D, I can't reconstruct every possible parent caller chain in my brain to ask the engineer to write the relevant unit tests.

So the case I mentioned happens; otherwise, in the real world, there would be no bugs.
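That failure mode is runnable in a few lines. The layers here are hypothetical (A -> B -> D): A's unit test mocks B with yesterday's behaviour, so a new edge case added to D never reaches A's test suite.

```python
# Stale mocks hide a lower layer's new behaviour from upper-layer tests.
from unittest import mock

def d(x):
    if x < 0:
        raise ValueError("negatives are now rejected")  # D's new case
    return x * 2

def b(x):
    return d(x) + 1

def a(x):
    return b(x)

# A's unit test, written before D changed: the stale mock still
# answers for negatives, so the test stays green.
with mock.patch(f"{__name__}.b", side_effect=lambda x: x * 2 + 1):
    assert a(-3) == -5  # passes, hiding D's new behaviour from A

# The unmocked chain exposes the bug the test suite missed.
raised = False
try:
    a(-3)
except ValueError:
    raised = True
assert raised
```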

sjajzh 10 hours ago|||
Leaky abstractions are a thing. You can't just encapsulate your way out of everything.
mdarens 9 hours ago|||
check out the branch. if the changes are that risky, the web ui for your repository host is not suitable for reviewing them.

the rest of your issues sound architectural.

if changes are breaking contracts in calling code, that heavily implies that type declarations are not in use, or enumerable values which drive conditional behavior are mistyped as a primitive supertype.

if unit tests are not catching things, that implies the unit tests are asserting trivial things, being written after the implementation to just make cases that pass based on it, or are mocking modules they don't need to. outside of pathological cases the only thing you should be mocking is i/o, and even then that is the textbook use for dependency injection.
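the "enumerable values mistyped as a primitive supertype" point can be made concrete in a few lines of Python (names here are illustrative):

```python
# When a conditional branches on a bare string, a newly added value
# falls through silently; an Enum plus exhaustive dispatch fails loudly.
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    DELETED = "deleted"  # newly introduced value

def fee_stringly(status: str) -> int:
    if status == "active":
        return 10
    if status == "suspended":
        return 5
    return 0  # "deleted" silently lands in the catch-all

def fee_typed(status: Status) -> int:
    table = {Status.ACTIVE: 10, Status.SUSPENDED: 5}
    return table[status]  # KeyError: the unhandled value cannot slip by

assert fee_stringly("deleted") == 0  # wrong answer, but no error raised
try:
    fee_typed(Status.DELETED)
    slipped_through = True
except KeyError:
    slipped_through = False
assert not slipped_through
```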

jra_samba 11 hours ago|||
Tests. All changes must have tests. If they're generating the code, they can generate the tests too.
throwaw12 11 hours ago||
who reviews the tests? again me? -> that's exactly why I am saying review is a bottleneck, especially with current tooling, which doesn't show second-order impacts of the changes, and it's not easy to reason about when a method gets called by 10 other methods with 4 levels of parent hierarchy
xienze 15 hours ago||
> The problem is some developers now just submit code for review that they didn't bother to read.

Can you blame them? All the AI companies are saying “this does a better job than you ever could”, every discussion topic on AI includes at least one (totally organic, I’m sure) comment along the lines of “I’ve been developing software for over twenty years and these tools are going to replace me in six months. I’m learning how to be a plumber before I’m permanently unemployed.” So when Claude spits out something that seems to work with a short smoke test, how can you blame developers for thinking “damn the hype is real. LGTM”?

jf22 14 hours ago|||
I'm a 99% organic person (I suppose I have tooth fillings) and the new models write code better than I do.

I've been using LLMs for 14+ months now and they've exceeded my expectations.

HoldOnAMinute 14 hours ago|||
Not only do they exceed expectations, but any time they fall down, you can improve your instructions to them. It's easy to get into a virtuous cycle.
xienze 14 hours ago|||
So are you learning a trade? Or do you somehow think you’ll be one of the developers “good enough” to remain employed?
jf22 14 hours ago||
I have a physical goods side hustle already and I'm brainstorming ideas about a trade I can do that will benefit from my programming experience.

I'm thinking HVAC or painting lines in parking lots. HVAC because I can program smart systems and parking lot lines because I can use google maps and algos to propose more efficient parking lot designs to existing business owners.

There is that paradox where, if something becomes cheaper, there is more demand, so we'll see what happens.

Finally, I'm a mediocre dev that can only handle 2-3 agents at a time so I probably won't be good enough.

bluefirebrand 14 hours ago||||
> Can you blame them?

Yes I absolutely can and do blame them

keybored 9 hours ago|||
This is correct. And at this point (and I think you agree?) we have to take that critical thinking skill and stop letting it just happen to us.

It might seem hopeless. But on the other hand the innate human BS detector is quite good. Imagine the state of us if we could be programmed by putting billions of dollars into our brains and not have any kind of subconscious filter that tells us, hey this doesn’t seem right. We’ve already tried that for a century. And it turns out that the cure is not billions of dollars of counter-propaganda consisting of the truth (that would be hopeless as the Truth doesn’t have that kind of money).

We don’t have to be discouraged by whoever replies to you and says things like, oh my goodness the new Siri AI replaced my parenting skills just in the last two weeks, the progress is astounding (Siri, the kids are home and should be in bed by 21:00). Or by the hypothetical people in my replies insisting, no no people are stupid as bricks; all my neighbors buy the propaganda of [wrong side of the political aisle]. Etc. etc. ad nauseam.

More comments...