
Posted by todsacerdoti 22 hours ago

When AI writes the software, who verifies it? (leodemoura.github.io)
261 points | 267 comments
yoaviram 19 hours ago|
I just finished writing a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.

Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.

The marginal cost of code is collapsing. That single fact changes everything.

https://nonstructured.com/zen-of-ai-coding/

shinycode 16 hours ago||
Our CEO, an expert in marketing, has discovered Claude Code and now has more open PRs than any of our developers, and is pushing us to « quickly review ». He does not understand why reviews are so slow, because it's « the easiest part ». We live in a new world.
skydhash 16 hours ago|||
> I just finished writing a post about exactly this. Software development, as the act of manually producing code, is dying.

It was never that. Take any textbook on software engineering and the focus was never the code, but on systems design and correctness. I'm looking at the table of contents of one (Software Engineering by David C. Kung) and these are a few sample chapters:

  ...
  4. Software Requirement Elicitation
  5. Domain Modelling
  6. Architectural Design
  ...  
  8. Actor-System Interaction Modeling
  9. Object Interaction Modeling
  ...
  15. Modeling and Design of Rule-Based Systems
  ...
  19. Software Quality Assurance
  ...
  24. Software Security
What you're talking about was coding, which has never been the bottleneck other than for beginners in some programming languages.
MattDaEskimo 19 hours ago|||
Accountability then
yoaviram 19 hours ago||
Anticipating modes of failure, creating tooling to identify and hedge against risks.
sjajzh 15 hours ago||
If we could do this, it would have been done already. Outsourced devs would be ubiquitous.
imiric 14 hours ago||
In what world do these new tools help with "laying bricks", but not with ensuring that the structure does not collapse? How is that work any more difficult than producing the software in the first place? It wasn't that long ago that these tools could barely produce a simple program. If you're buying into the promises of this tech, then what's stopping it from also being able to handle those managerial tasks much better than a human?

The seemingly profound points of your marketing slop article ignore that these new tools are not a higher level of abstraction, but a replacement of all cognitive work. The tech is coming for your job just as it is coming for the job of the "bricklayer" you think is now worthless. The work you're enjoying now is just a temporary transition period, not an indication of the future of this industry.

If you enjoy managing a system that hallucinates solutions and disregards every other instruction, that's great. When you reach a dead end with that approach, and the software is exposing customer data, or failing in unpredictable ways, hopefully you know some good "bricklayers" that can help you with that.

sslayer 18 hours ago||
State-sponsored hackers' AI will verify it.
csense 14 hours ago||
It seems like sound testing methodology to identify important theorems related to the code, prove them, and then verify the proof.

Verification gets sold as "bulletproof" but I'm skeptical for a couple reasons:

- How do you establish the relationship between the code and the theorem? A Lean theorem can be applied to zlib implemented in Lean, but what if you want to check zlib implemented in a normal programming language like C, JS, Zig, or whatever?

- How do you know the key properties mean what you think they mean? E.g. the theorem says "ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data" but it feels like it would be very easy to accidentally prove ∃ x s.t. decompress(compress(x)) == x while thinking you proved ∀ x, decompress(compress(x)) == x.
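To make the quantifier pitfall concrete, here is a sketch of how the two claims would look in Lean 4. The `ZlibEncode`/`ZlibDecode` names are the hypothetical ones from the quote above, and `sorry` marks unproven goals; this is an illustration, not working code from the article:

```lean
-- The property you actually want. Free binders in a theorem statement
-- are implicitly universally quantified, so this is the ∀ version:
-- round-trip holds for *every* input.
theorem roundtrip_forall (data : ByteArray) (level : Nat) :
    ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data := by
  sorry

-- The far weaker claim that is easy to state (and prove) by accident:
-- round-trip holds for *some* input.
theorem roundtrip_exists :
    ∃ (data : ByteArray) (level : Nat),
      ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data := by
  sorry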

I've tried Lean and Coq and...I don't really like them. The proofs use specialized programming languages. And they seem deliberately designed to require you to use a context explorer to have any hope of understanding the proof at all. OTOH a normal unit test is written in a general purpose programming language (usually the same one as the program being tested), I'm much more comfortable checking that a Claude-written unit test does what I think it's doing than a Claude-written Lean proof of correctness.

anonymous6500 3 hours ago||
The article does not make clear to me either how the existing code would be mapped to Lean and back. The impression from the zlib example is that I'd be expected to program in Lean. No way is that going to happen. The language is too complex for me and my average colleague. We're also not going to maintain two parallel implementations, one in an ordinary language and one in Lean, and compare them with 'differential random testing' (see https://aws.amazon.com/blogs/opensource/lean-into-verified-s... someone linked in the discussion); that's just too taxing for bigger products, especially since we typically don't have enough time to do even one implementation right.

The gap is real: we lack a succinct, expressive, powerful, and executable way to specify and continuously verify AI-generated programs, and I don't see how Lean alone closes it. If the author's intention was to attract a community to help build that out with Lean at the center, it's not clear to me where to even start. Since the author provided no hints or direction, I have a feeling it's not clear to them either.

nextos 10 hours ago||
Lean and Coq might not be the right choice. Perhaps something like Dafny, F*, or Why3, where code and theorems live together.

Nevertheless, Lean 4 has closed much of that gap, and at this stage it's closer to those than Coq is.

Ronsenshi 12 hours ago||
A bit unrelated to the article, more of a commentary about how many engineers at this point barely write any code or even do code review.

It seems to me like a huge number of engineers/developers in the comments are turning into Tom Smykowski from Office Space. Remember that guy?

His job was to be a liaison between customers and engineers because he had "people skills":

"I deal with the god damn customers so the engineers don't have to. I have people skills; I am good at dealing with people. Can't you understand that? What the hell is wrong with you people?"

Except now, based on comments here, some engineers are passing instructions from customers to AI because they have "AI skills", while the AI does the coding, helps with spec clarification, reviews code, and writes tests.

That's scary and depressing. This field in a few years will be impossible to recognize.

drivebyhooting 15 hours ago||
That was a prolix and meandering essay. Next time I’d rather just look at the prompts and hand edits that went into writing it rather than the final artifact; much like reviewing the documentation, spec, and proof over the generated code as extolled by the article.
indymike 20 hours ago||
Because of the scale of generated code, often it is the AI verifying the AI's work.
ptnpzwqd 16 hours ago||
I of course cannot say what the future holds, but current frontier models are - in my experience - nowhere near good enough for such autonomy.

Even with other agents reviewing the code, good test coverage, etc., both smaller and, every now and then, larger mistakes make their way through, and the existence of such mistakes in the codebase tends to accelerate even more of them.

It for sure depends on many factors, but I have seen enough to feel confident that we are not there yet.

tartoran 20 hours ago||
So who's verifying the AI doing the verifying or is it yet another AI layer doing that? If something goes wrong who's liable, the AI?
visarga 19 hours ago||
You have two paths: code tests, and AI review, which is just a vibe test of the LGTM kind. You should use both in tandem; code testing is cheap to run, and you can build more complex systems if you apply it well. But ultimately it is the user, or usage, that needs to direct testing, or you pay the price for formal verification. Most of the time it is usage: time passing reveals failure modes, and hindsight is 20/20.
acedTrex 20 hours ago||
No one does currently, and it's going to take a few very painful and high-profile failures of vital systems for this industry to RELEARN its lesson about the price of speed.

In fact it will probably need to happen a few times PER org for the dust to settle. It will take several years.

arscan 20 hours ago||
Sure but industry cares about value (= benefit - price), not just price. Price could be astronomical, but that doesn’t matter if benefit is larger.
jcgrillo 20 hours ago||
I feel like people used to talk about nines of uptime more. As in more than one. These days we've lost that: https://bsky.app/profile/jkachmar.com/post/3mg4u3e6nak2p

I recall a time, maybe around 2013-2017, when people were talking about 4 or 5 nines. But sometime around then the goalposts shifted, and instead of trying to make things as reliable as possible, it started becoming more about seeing how unreliable they can get before anyone notices or cares. It turns out people will suffer through a lot if there's some marginal benefit--remember what personal computers were like in the 1990s before memory protection? Vibe coding is just another chapter in that user hostile epic. Convenient reliability, like this author describes (if it can be achieved), might actually make things better? But my money isn't on that.
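For reference, "N nines" translates into a yearly downtime budget like this (quick back-of-the-envelope arithmetic, not from the linked post):

```python
# Yearly downtime budget implied by "N nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years

def downtime_minutes(nines: int) -> float:
    availability = 1 - 10 ** -nines
    return (1 - availability) * MINUTES_PER_YEAR

for n in (2, 3, 4, 5):
    print(f"{n} nines: ~{downtime_minutes(n):.1f} min/year of allowed downtime")
```

Four nines leaves roughly 53 minutes of downtime a year; five nines, barely 5, which is why letting the goalposts slide is so noticeable.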

simonw 19 hours ago||
The "Nearly half of AI-generated code fails basic security tests" link provided in this piece is not credible in my opinion. It's a very thinly backed vendor report from a company selling security scanning software.
waterTanuki 6 hours ago||
I've grown to hate using Python in production since LLMs have been around. Python cannot enforce minimum standards like cleaning up unused variables, checking array access, and properly typing your functions. There are a number of tools built to do this, but none of them can possibly replace a compiler.

Compiled languages like Go and Rust are my new default for backend projects, and TypeScript with strict typing on for the frontend, and I foresee their popularity growing the more LLM use grows. The moment you let an LLM loose in a JavaScript/Python codebase, everything goes off the rails.
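As a minimal illustration of the kind of error Python only surfaces at runtime (hypothetical function, not from the post; Go or Rust would reject both problems at compile time):

```python
def total(prices):
    unused = "never referenced"  # Python does not flag unused locals
    return sum(prices)           # no type check: any iterable slips through

print(total([1, 2, 3]))  # works as intended

try:
    total("123")  # type mistake only discovered when this line runs
except TypeError as e:
    print("caught at runtime:", e)
```

A linter can catch both issues, but only if it is configured and actually run; a compiler makes the check unavoidable.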

mkoubaa 12 hours ago|
PMs have been asking the same question about software developers for decades