Posted by todsacerdoti 19 hours ago
If a piece of code is produced by an agent loop (prompt -> tool calls -> edits -> tests), the real artifact isn’t just the final code but the trace/pipeline that produced it.
In that sense verification might look closer to: checking constraints on the generator (tests/specs/contracts), verifying the toolchain used by the agent, and replaying generation under controlled inputs.
That feels closer to build reproducibility or supply-chain verification than traditional program proofs.
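One way to make that concrete is to treat the trace itself as the thing you attest to: hash the recorded pipeline, then check that replaying generation under the pinned inputs reproduces the same artifact. A rough sketch (the trace schema and `replay_fn` are hypothetical, not any real tool's format):

```python
import hashlib
import json

def trace_digest(trace: list[dict]) -> str:
    """Hash the full generation trace (prompt, tool calls, edits, test runs)
    so the final artifact can be tied to the pipeline that produced it."""
    canonical = json.dumps(trace, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_replay(trace: list[dict], replay_fn) -> bool:
    """Re-run generation under the recorded inputs and check that it yields
    the same final artifact -- analogous to a reproducible-build check."""
    recorded = trace[-1]["artifact_hash"]
    replayed = replay_fn(trace[0]["prompt"], trace[0]["pinned_inputs"])
    return hashlib.sha256(replayed.encode()).hexdigest() == recorded
```

The catch, of course, is that `replay_fn` is only deterministic if the model, sampling parameters, and tool versions are all pinned, which is exactly the supply-chain flavor of the problem.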
One beautiful thing about current AI is that this process can handle fuzzy constraints. You don't have to describe the requirements (constraints) exactly; it can work with fuzzy sets and constraints (I am using "fuzzy" in a quite broad sense), such as "user can move snake head in 4 directions".
Now, because of this fuzzy reasoning, it can sometimes fail. So the wrong point (source code) can get picked from the fuzzy set that represents "snake game". For example, it can be something buggy or something less like a canonical snake game.
In the case of such a failure, you can either sample another data point ("write another snake game"), or you can add additional constraints.
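Those two recovery strategies can be sketched as a single loop. Here `generate` and `check` are hypothetical stand-ins for the model and for our (still fuzzy) acceptance review; each observed failure is fed back in as an explicit constraint:

```python
def refine(generate, check, constraints, max_tries=5):
    """Sample candidates from the fuzzy set; on failure, either resample
    or tighten the constraint list and sample again."""
    for _ in range(max_tries):
        candidate = generate(constraints)
        violations = check(candidate, constraints)
        if not violations:
            return candidate, constraints
        # Turn each observed failure into an extra explicit constraint,
        # shrinking the fuzzy set we sample from next time.
        constraints = constraints + violations
    raise RuntimeError("no candidate satisfied the constraints")
```

"Adjust the prompt" and "regenerate" are just the two branches of this loop: resampling keeps `constraints` fixed, while adding constraints narrows the set before the next sample.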
Now, the article argues in favor of formal verification, which essentially means somehow converting all these fuzzy constraints into hard constraints, so that when we get our data point (source code of the snake game), we can verify that it indeed belongs to the (now exact) set of all snake games.
So, while it can help with the sampling problem, the alignment problem still remains: how can we tell that the AI's (fuzzy) definition of a functional "snake game" is in line with our fuzzy definition? That is something we don't know how to handle other than iteratively throwing AIs at many problems and slowly getting these definitions aligned with humans.
And I think the latter problem (alignment with humans on definitions) is the real elephant in the room, and so the article is IMHO focusing on the wrong problem by thinking the fuzzy nature of the constraints is the main issue.
Although I think it would definitely be useful if we had a better theoretical grasp on how AI handles fuzzy reasoning. As AI stands now, practicality has beaten theory. (You can formalize fuzzy logic in Lean, so in theory nothing prevents us from specifying fuzzy constraints in a formal way and then solving the resulting constraint problem formally, it just might be quite difficult, like solving an equation symbolically vs numerically.)
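As a toy illustration of that last point, the basic objects take very little machinery to state. A minimal Lean 4 sketch (using `Float` in place of a proper unit interval; a real development would carry the [0, 1] bound as a proof, e.g. via Mathlib):

```lean
-- A fuzzy set over α: each element gets a membership degree,
-- intended (not enforced here) to lie in [0, 1].
def FuzzySet (α : Type) : Type := α → Float

-- Fuzzy "and": combine two constraints with the min t-norm.
def FuzzySet.and {α : Type} (A B : FuzzySet α) : FuzzySet α :=
  fun a => min (A a) (B a)

-- Hard constraints are the crisp special case: degrees only 0 or 1.
def FuzzySet.ofPred {α : Type} (p : α → Bool) : FuzzySet α :=
  fun a => if p a then 1.0 else 0.0
```

Stating the definitions is the easy part, though; the hard part is the constraint solving over them, which is the symbolic-vs-numeric gap mentioned above.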
I think it's closer to 100%. I don't know anyone who isn't doing 100% AI-generated code. And we don't even code review it, because why bother? If there's an error, we just regenerate the code or adjust the prompt.
1. Agent writes a minimal test against a spec. 2. Another agent writes a minimal implementation that does nothing more than make the test pass.
Repeat.
This is ping-pong pair programming.
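A minimal sketch of that loop, with hypothetical `tester_agent` and `coder_agent` callables standing in for the two agents:

```python
def ping_pong(tester_agent, coder_agent, spec, run_tests, rounds=10):
    """Alternate roles: one agent adds a minimal failing test for the spec,
    the other writes just enough implementation to make the suite pass."""
    tests, impl = [], ""
    for _ in range(rounds):
        new_test = tester_agent(spec, tests, impl)
        if new_test is None:  # tester finds nothing left in the spec to pin down
            break
        tests.append(new_test)
        impl = coder_agent(tests, impl)
        assert run_tests(tests, impl), "coder failed to make the suite pass"
    return tests, impl
```

The interesting design choice is the stopping condition: the loop terminates when the tester agent can no longer derive a new failing test from the spec, which quietly shifts the whole correctness question onto how good that spec is.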
Someone needs to be held accountable when things go wrong. Someone needs to be able to explain to the CEO why this or that is impossible.
If you want to have AI generate all the code for your business critical software, fine, but you better make sure you understand it well. Sometimes the fastest path to deep understanding is just coding things out yourself - so be it.
This is why the truly critical software doesn’t get developed much faster when AI tools are introduced. The bottleneck isn’t how fast the code can be created, it’s how fast humans can construct their understanding before they put their careers on the line by deploying it.
Ofc, this doesn’t apply to prototypes, hackathons, POCs, etc. For those “low stakes” projects, vibe code away, if you wish.
I personally find the “move fast and break things” ethos morally abhorrent, but that doesn’t make it go away.