Posted by todsacerdoti 17 hours ago

When AI writes the software, who verifies it? (leodemoura.github.io)
214 points | 217 comments
_pdp_ 15 hours ago|
I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.

The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?

Verification works because someone has already defined what "correct" looks like. There is possibly a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.

But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.

Avshalom 15 hours ago||
Well that's a problem the software industry has been building for itself for decades.

Software has, since at least the adoption of "agile", created an industry culture not just of refusing to build to specs but of insisting that specs are impossible to get from a customer.

bigfishrunning 9 hours ago|||
I always try to get the customer to provide specs, and failing that, to agree to specs before we start working. It's usually very difficult.
daveguy 14 hours ago||||
Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
bunderbunder 10 hours ago|||
I have found that it works well as an open-endedly dynamic process when you are doing the kind of work that the people who came up with Scrum did as their bread and butter: limited-term contract jobs that were small enough to be handled by a single pizza-sized team and whose design challenges mostly don’t stray too far outside the Cynefin clear domain.

The less any of those applies, the more costly it is to figure it out as you go along, because accounting for design changes can become something of a game of crack the whip. Iterative design is still important under such circumstances, but it may need to be a more thoughtful form of iteration that’s actively mindful about which kinds of design decisions should be front-loaded and which ones can be delayed.

daveguy 6 hours ago||
You definitely need limits around it. Especially as a consulting team. It's not for open ended projects, and if you use it for open ended projects as a consultant you're in for a world of hurt. On the consultant side, hard scope limits are a must.

And I completely agree that requirement proximity estimation is a critical skill. I do think estimation of requirement proximity is a much easier task than time estimates.

skydhash 11 hours ago|||
And good luck when you get misaligned specs (communication issues on the customer side, docs that are not aligned with the product, ...). Drafting specs and investigating failures will require both a diplomat hat and a detective hat. Maybe with the developer hat, we will get DDD being meaningful again.
user3939382 9 hours ago||||
I don’t want to put words in your mouth, but I think I agree. It’s called requirements engineering. It’s hard, but it’s possible, and waterfall works fine for many domains. The Agile teams I see are burning resources doing the same thing 2-3x, or sprinting their way into major, costly architectural mistakes that would have been easily avoided by upfront planning and specs.
pydry 11 hours ago|||
Agile is a pretty badly defined beast at the best of times, but even the most twisted interpretation doesn't mean that. It's mainly just a rejection of BDUF (big design up front).
galaxyLogic 3 hours ago||
There are two ways of thinking about tests:

A) They let you verify that implementation passes its spec, more or less.

B) They are a (trustworthy) description of how the system behaves, they allow you to understand the system better.
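Both readings can be shown on one tiny example. This is a sketch, and `slugify` is a made-up function used purely for illustration:

```python
# The same test read two ways: (A) a check that the implementation
# passes its spec; (B) an executable, trustworthy description of
# behaviour. `slugify` is hypothetical.

def slugify(title: str) -> str:
    """Turn a title into a URL slug."""
    return "-".join(title.lower().split())

# (A) Spec check: the implementation must satisfy this to ship.
assert slugify("Hello World") == "hello-world"

# (B) Behaviour description: a reader learns the edge-case handling
# (extra whitespace is collapsed) from the test itself, without
# opening the implementation.
assert slugify("  Spaces   Everywhere  ") == "spaces-everywhere"
```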

faitswulff 9 hours ago||
Someone actually did fuzz the Claude C compiler:

https://john.regehr.org/writing/claude_c_compiler.html
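The shape of that kind of fuzzing fits in a few lines. The sketch below does differential testing on toy arithmetic expressions rather than C; real compiler fuzzing (e.g. Csmith against a reference compiler) follows the same pattern at far larger scale:

```python
# Differential fuzzing sketch: generate random expressions and check
# that two independent evaluators agree; any disagreement is a bug.
import ast
import random

def gen_expr(depth=3):
    """Random fully parenthesized integer expression."""
    if depth == 0:
        return str(random.randint(0, 9))
    a, b = gen_expr(depth - 1), gen_expr(depth - 1)
    return f"({a} {random.choice('+-*')} {b})"

def eval_ref(expr):
    """'Reference implementation': Python's own evaluator."""
    return eval(expr)

def eval_alt(expr):
    """'Implementation under test': a tiny AST walker."""
    ops = {ast.Add: lambda x, y: x + y,
           ast.Sub: lambda x, y: x - y,
           ast.Mult: lambda x, y: x * y}
    def walk(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        return node.value  # ast.Constant leaf
    return walk(ast.parse(expr, mode="eval").body)

random.seed(0)
for _ in range(1000):
    e = gen_expr()
    assert eval_ref(e) == eval_alt(e), e  # disagreement = bug found
```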

vicchenai 7 hours ago||
The verification problem scales poorly with AI complexity. Current approaches rely on test suites, but AI-generated code tends to optimize for passing existing tests rather than correctness in the general case.

What's interesting is this might be the forcing function that finally brings formal verification into mainstream use. Tools like Lean and Coq have been technically impressive but adoption-starved. If unverified AI code is too risky to deploy in critical systems, organizations may have no choice but to invest in formal specs. AI writes the software, proof assistants verify it.

The irony: AI-generated code may be what makes formal methods economically viable.
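The "AI writes, proof assistant verifies" loop can be sketched in a few lines of Lean 4. `double` is a made-up example; the point is that the kernel, not a reviewer, certifies that the implementation meets its spec:

```lean
-- A hypothetical AI-written implementation...
def double (n : Nat) : Nat := n + n

-- ...shipped together with a machine-checked proof of its spec.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double; omega
```

If the proof doesn't check, the code doesn't ship, regardless of who (or what) wrote it.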

SurvivorForge 5 hours ago||
The uncomfortable truth is that most teams were already bad at verification before AI entered the picture. The difference is that AI-generated code comes in faster, so the verification bottleneck becomes painfully obvious. I think the real opportunity here is that AI forcing us to think harder about specifications and correctness might actually improve the quality of human-written code too — it's making us confront the fact that "it works on my machine" was never real verification.
8note 3 hours ago||
i think formal methods and math proofs will be useful like tests are for getting more feedback to the LLM to get to a working solution, but i dont think it at all solves the problem of "poisoned training data introduces specific bugs and vulnerabilities"

the bug will also be introduced in the formal spec, and people will still miss it by not looking.

i think fast response and fix time - anti-entropy - will win out against trying to increase the activation energy, to quote the various S3 talks. You need a cleanup method, rather than to prevent issues in the first place

lateforwork 9 hours ago||
The first thing you should have AI write is a comprehensive test suite. Then have it implement the main functionality. If the tests pass that is one level of verification.

In addition you can have one AI check another AI's code. I routinely copy/paste code between Claude, ChatGPT, and Gemini and have them check each other's code. This works very well. During the process I have my own eyes verify the code as well.
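One way to make that first level of verification harder to game is property-based checks over random inputs, rather than a handful of examples the model could overfit to. A stdlib-only sketch; `my_sort` is a stand-in for whatever the model implemented:

```python
# Property-based sanity check: assert properties any correct sort
# must have, over many random inputs.
import random

def my_sort(xs):
    """Implementation under test (hypothetical): insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

random.seed(42)
for _ in range(500):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    out = my_sort(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
    assert sorted(out) == sorted(xs)                  # and a permutation
    assert out == sorted(xs)                          # matches a reference
```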

void-star 9 hours ago||
The advice that everyone seemed to agree on at least just a few months ago was to make sure _you_ write the comprehensive tests/specs, and this is what I would still recommend to anyone asking. I guess even this may be falling out of fashion though…
p1necone 9 hours ago|||
Generate with carefully steered AI, sanity check carefully. For a big enough project writing actually comprehensive test coverage completely by hand could be months of work.

Even state-of-the-art AI models seem to have no taste, or no sense of 'hang on, what's even the point of this test?'. I've seen them diligently write hundreds of completely pointless tests, and sometimes the reason they're pointless is some subtle thing that's hard to notice amongst all the legit-looking expect code.

lateforwork 5 hours ago|||
There is no need to write tests manually. Just review the tests and make sure there is good coverage; if there isn't, ask the AI for additional tests and give it guidance.
throwaway613746 8 hours ago||
[dead]
phantomathkg 4 hours ago||
Humans write the requirements, and the requirements contain flaws. A human or an AI translates them into specifications, and eventually code.

It does not matter whether the middleman is human or AI, or whether the result is written in a "traditional language" or with "formal verification". The bugs will be there, because the humans failed to define bulletproof requirements.

signa11 3 hours ago||
> When AI writes the software, who verifies it?

oh that's quite simple: the dude / dudette who gets blamed is the one who verifies it.

bfung 9 hours ago||
When humans write the software, who verifies it?

half sarcasm, half real-talk.

TDD is nice, but human coders barely do it. At least AI can do it more!

nvlled 7 hours ago|
> half sarcasm, half real-talk.

If you could pause a bit from being awed by your own perceived insightfulness, you would think just a bit harder and realize that LLMs can generate hundreds of thousands of lines of code that no human could ever verify within a finite amount of time. Human-written software is human-verifiable, AI-assisted human-written software is still human-verifiable to some extent, but purely AI-written software can no longer be verified by humans.

nicbou 18 minutes ago||
That's a very unpleasant tone to take.
oakpond 15 hours ago|
You do. Even the latest models still frequently write really weird code. The problem is some developers now just submit code for review that they didn't bother to read. You can tell. Code review is more important than ever imho.
sausagefeet 15 hours ago||
I agree with you. But I have to say, it is an uphill battle and all the incentives are against you.

1. AI is meant to make us go faster, reviews are slow, the AI is smart, let it go.

2. There are plenty of AI maximizers who only think we should be writing design docs and letting the AI go to town on it.

Maybe, this might be a great time to start a company. Maximize the benefits of AI while you can without someone who has never written a line of code telling you that your job is going to disappear in 12 months.

All the incentives are against someone who wants to use AI in a reasonable way, right now.

redhed 15 hours ago||
I actually agree it's a good time to start a company. Lots of available software engineers who can actually understand code, AI at a level that can actually speed up development, and so many startups focusing on AI-wrapper slop that you can actually make a useful product and separate yourself from the herd.

Or you can be a grifter and make some AI wrapper yourself and cash out with some VC investment. So good time for a new company either way.

johnmaguire 12 hours ago|||
It's gonna be like that HBO Silicon Valley bit again, where everyone and their doctor is telling you about their app.
rvz 7 hours ago|||
The people declaring themselves holier-than-thou, with self-proclaimed 'principles', are the worst grifters, and the ones actively scamming with AI and VC investment.

Pretending that only they can save the world, declaring they don't use AI while secretly using it to build a so-called "AI startup", and then going to the media doomsaying that "AGI" is coming.

At this point in this cycle in AI, "AGI" is just grifting until IPO.

bradleykingz 14 hours ago|||
But it's so BORING. AI gets to do the fun part (writing code) and I'm stuck with the lame bits.

It's like watching someone else solve a puzzle, or watching someone else play a game vs playing it yourself (at least that's half as interesting as playing it through)

nz 14 hours ago|||
Your workplace has chosen to deprive you of the enjoyment that you got from the work. You have a few options: (1) ask for a raise proportional to the percentage of enjoyment that you lost, (2) find a workplace that does not do this, or (3) phone it in (they see you and your craft as something to be milked for cash, so maybe stop letting yourself get milked, and milk them right back, by doing _exactly_ what is asked of you and _not_ more -- let these strategic geniuses strategize using their own brains).
mosura 10 hours ago||||
LLMs are still not good at structurally hard problems, and it is doubtful they ever will be absent some drastic extension (including continuous learning). In the meantime, the trick is creating a framework where you can walk them through the exact stages you would go through to do it yourself, only it goes way faster. The problem is many people stop at the first iteration that looks like it works and then move on, but you have to keep pushing, in the same way you do with humans.

Bluntly though, if what you were doing was CRUD boilerplate then yeah it is going to just be a review fest now, but that kind of work always was just begging to be automated out one way or another.

HoldOnAMinute 14 hours ago||||
I am really enjoying making requirements docs in an iterative process. I have a continuous improvement loop where I use the implementation to test out the docs. If I find a problem with the docs, I throw away the implementation, improve the docs, then re-implement. The kind of docs I'm getting are of amazing quality.
lukan 14 hours ago||||
For me the most fun part is getting something that works: design the goal, but don't micromanage and get lost in the details. I love AI for that, but it is hard to really own code this way. (At least I manually approve every change, or most of them, but still, verifying is hard.)
bitwize 14 hours ago||
AI has really sharpened the line between the Master Builders of the world and the Lord Businesses along this question: What, exactly, is the "fun part" of programming? Is it simply having something that works? Or is it the process of going from not having it to having it through your own efforts and the sum total of decisions you made along the way?
stretchwithme 14 hours ago|||
I can solve a problem in 10% of the time. Dealing with an issue TODAY, instead of having to put it in the backlog.
MrDarcy 15 hours ago|||
It is remarkably effective to have Claude Code do the code review and assign a quality score, call it a grade, to the contribution derived from your own expectations of quality.

Then don’t even bother looking at C work or below.

NitpickLawyer 15 hours ago||
IME it works even better if you use another model for review. We've seen code by cc and review by gpt5.2/3 work very well.

Also works with planning before any coding sessions. Gemini + Opus + GPT-xhigh works to get a lot of questions answered before coding starts.

throwaw12 12 hours ago|||
> You do

I really want to say: "You are absolutely right"

But here is a problem I am facing personally (numbers are hypothetical).

I get 10-15 review requests a day from 4 teammates, who are generating code by prompting, and I am doing the same, so you can guess we might have ~20 PRs/day to review. Each PR roughly updates 5-6 files and 10-15 lines in each.

So you can estimate that I am looking at around 50-60 files, but I can't keep the context of the whole file, because the change I am looking at is somewhere in the middle: 3 lines here, 5 lines there, and another 4 lines at the end.

How am I supposed to review all these?

ptnpzwqd 11 hours ago|||
If reviewing has become the bottleneck, the obvious - albeit slightly boring - solution is to slow down spitting out new code, and spend relatively more time reviewing.

Just going ahead and piling up PRs or skipping the review process is of course not recommended.

throwaw12 11 hours ago||
you are not wrong, but the solution you are proposing is just throttling the system because of the bottleneck; it doesn't solve the bottleneck problem.
ptnpzwqd 10 hours ago|||
Correct, but that has and probably always will be the case.

You spend the time on what is needed for you to move ahead - if code review is now the most time consuming part, that is where you will spend your time. If ever that is no longer a problem, defining requirements will maybe be the next bottleneck and where you spend your time, and so forth.

Of course it would be great to get rid of the review bottleneck as well, but I at least don't have an answer to that - I don't think the current generation of LLMs are good enough to allow us bypassing that step.

sjajzh 10 hours ago|||
You know we’ve had the ability to generate large amounts of code for a long time, right? You could have been drowning in reviews in 2018. Cheap devs are not new. There’s a reason this trend never caught on for any decent company.
throwaw12 10 hours ago||
I hope you are not a bot, because your account was created just 8 minutes ago.

> You know we’ve had the ability to generate large amounts of code for a long time, right?

No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.

Even if you employ 100K people and ask them to write proper if/else code non-stop, an LLM can still outcompete them by a huge margin, with much better-looking code.

(Don't compare LLM output to the codegen of the past, because codegen was carefully crafted and a lot of the time deterministic; I am only talking about people writing code vs LLMs writing code.)

sjajzh 9 hours ago||
I’m not a bot.

> No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.

Are you talking about “I’m overwhelmed by code review” or “we can now produce code at a scale no amount of humans can ever review”. Those are 2 very different things.

You review code because you’re responsible for it. This problem existed pre-AI, and nothing has changed wrt being overwhelmed; the solution is still the same. As to the latter, I think that’s more the software dark factory kind of thinking?

I find that interesting, and maybe we’ll get there. But again, the code it takes to verify a system is drastically more complex than the system itself. I don’t know how you could build such a thing except in narrow use cases. I do think we’ll see those one day, though how narrow they are is the key part.

sjajzh 10 hours ago||||
Ideally, you’re working with teammates you trust. On the best teams I’ve worked on, reviews were a formality: most of the time a quick scan and a LGTM. We worked together prior to the review, as needed, on areas we knew would need input from others.

AI changes none of this. If you’re putting up PRs and getting comments, you need to slow down. Slow is smooth, and smooth is fast.

I’ll caveat this with: that’s only if your employer cares about quality. If you’re fine passing that on to your users, you might as well just stop reviewing altogether.

throwaw12 10 hours ago||
> Ideally, you’re working with teammates you trust.

I do trust them, but the code is not theirs; the prompt is. What if I trust them, but because of how much they use LLMs their brains have started becoming lazy and they have started missing edge cases? Who should review the code, me or them?

At the beginning, I relied on my trust and did quick scans, but eventually I noticed they became uninterested in the craft and started submitting LLM output as-is. I still trust them as good-faith actors, but not their brains anymore (nor my own, for that matter).

Also, the assumption is based on an ideal team, where everyone behaves in good faith. But this is not the case in corporations and big tech, especially when incentives are aligned with the "output/impact" you are making. A lot of the time, promoted people won't see the impact of their past bad judgements, so why craft perfect code?

sjajzh 9 hours ago||
Yeah, I agree with you. I’d say they’re not high performers anymore. Best answer I’ve got is find a place where quality matters. If you’re at a body shop it’s not gonna be fun.

I do think some of this is just a hype wave and businesses will learn quality and trust matter. But maybe not - if wealth keeps becoming more concentrated at the top, it’s slop for the plebs.

devin 8 hours ago||
My work has turned into churning out a PR, marking it as a draft so no one reviews it, and walking away. I come back after thinking about what it produced and usually realize it missed something or that the implications of some minor change are more far-reaching than the LLM understood. I take another pass. Then, I walk away again. Repeat.

Honestly I'm not sure much has changed with my output, because I don't submit PRs which aren't thoughtful. That is what the most annoying people in my organization do. They submit something that compiles, and then I spend a couple hours of my day demonstrating how incorrect it is.

For small fixes where I can recognize there is a clear, small fix which is easily testable I no longer add them to a TODO list, I simply set an agent off on the task and take it all the way to PR. It has been nice to be able to autopilot mindless changesets.

johnmaguire 12 hours ago||||
I don't quite follow - are you describing an issue with the way your team has structured PRs? IMO, a PR should contain just enough code to clearly and completely solve "a thing" without solving too much at once. But what this means in practice depends on the team, product, velocity, etc. It sounds like your PRs might be broken up into too small of chunks if you can't understand why the code is being added.
throwaw12 12 hours ago||
I am saying the PRs I get are around 60-70 lines of change, which is small enough to be considered a single unit (add to this unit tests which must pass with the new change, so we are talking about a 30-line change + a 30-line unit test).

But when looking at the PR changes, you don't always see the whole picture, because the review subjects (code lines) are scattered across files and methods, and GitHub also shows methods and files partially, making it even more difficult to quickly spot the context around those updated lines.

It's a difficult problem, because even if GitHub shows the whole body of the updated method or file, you still don't see the grand picture.

For example: A (calls) -> B -> C -> D

And you made changes in D, how do you know the side effect on B, what if it broke A?

FartyMcFarter 10 hours ago|||
If the code is well architected, the contract between C and D should make it clear whether changes in D affect C or not. And if C is not affected, then B and A won't be either.
throwaw12 9 hours ago||
> If the code is well architected

Big constraint. Code changes: the initial architecture could have been amazing, but constantly changing business requirements make things messy.

Please don't use "in an ideal world" examples :) because they are singular in the vast space of non-ideal solutions.

FartyMcFarter 9 hours ago||
In that case your problem is bigger than just reviewing changes. You need to point the fingers at the bad code and bad architecture first.

There's no way to make spaghetti code easy to review.

cesarb 10 hours ago||||
> Its difficult problem, because even if GitHub shows whole body of the updated method or a file, you still don't see grand picture.

> For example: A (calls) -> B -> C -> D

> And you made changes in D, how do you know the side effect on B, what if it broke A?

That's poor encapsulation. If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.

throwaw12 9 hours ago|||
> That's poor encapsulation

That's the reality of most software built in the last 20 years.

> If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.

Any change in D must eventually affect B or A; it's inevitable, otherwise D shouldn't exist in the call stack.

Here is how the case I mentioned can happen. Imagine each layer has 3 variations: 1 happy path and 2 edge-case handlers. Starting from the lowest:

D: 3, C: 3*D = 9, B: 3*C = 27, A: 3*B = 81

Obviously, you won't be writing 81 unit tests for A and 27 for B; you will mock implementations and write enough unit tests to make the coverage good. Because of that mocking, when you update D and add a new case but do not surface the relevant mocking to the upper layers, you will end up in a situation where D impacts A, but it's not visible in the unit tests.

While reading the changes in D, I can't reconstruct every possible parent caller chain in my brain to ask the engineer to write the relevant unit tests.

So the case I mentioned happens; otherwise, in the real world, there would be no bugs.
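That failure mode is runnable in a few lines. The layers here are hypothetical (A -> B -> D): A's unit test mocks B with yesterday's behaviour, so a new edge case added to D never reaches A's test suite.

```python
# Stale mocks hide a lower layer's new behaviour from upper-layer tests.
from unittest import mock

def d(x):
    if x < 0:
        raise ValueError("negatives are now rejected")  # D's new case
    return x * 2

def b(x):
    return d(x) + 1

def a(x):
    return b(x)

# A's unit test, written before D changed: the stale mock still
# answers for negatives, so the test stays green.
with mock.patch(f"{__name__}.b", side_effect=lambda x: x * 2 + 1):
    assert a(-3) == -5  # passes, hiding D's new behaviour from A

# The unmocked chain exposes the bug the test suite missed.
raised = False
try:
    a(-3)
except ValueError:
    raised = True
assert raised
```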

sjajzh 10 hours ago|||
Leaky abstractions are a thing. You can't just encapsulate your way out of everything.
mdarens 9 hours ago|||
check out the branch. if the changes are that risky, the web ui for your repository host is not suitable for reviewing them.

the rest of your issues sound architectural.

if changes are breaking contracts in calling code, that heavily implies that type declarations are not in use, or enumerable values which drive conditional behavior are mistyped as a primitive supertype.

if unit tests are not catching things, that implies the unit tests are asserting trivial things, being written after the implementation to just make cases that pass based on it, or are mocking modules they don't need to. outside of pathological cases the only thing you should be mocking is i/o, and even then that is the textbook use for dependency injection.
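the "enumerable values mistyped as a primitive supertype" point can be made concrete in a few lines of Python (names here are illustrative):

```python
# When a conditional branches on a bare string, a newly added value
# falls through silently; an Enum plus exhaustive dispatch fails loudly.
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    DELETED = "deleted"  # newly introduced value

def fee_stringly(status: str) -> int:
    if status == "active":
        return 10
    if status == "suspended":
        return 5
    return 0  # "deleted" silently lands in the catch-all

def fee_typed(status: Status) -> int:
    table = {Status.ACTIVE: 10, Status.SUSPENDED: 5}
    return table[status]  # KeyError: the unhandled value cannot slip by

assert fee_stringly("deleted") == 0  # wrong answer, but no error raised
try:
    fee_typed(Status.DELETED)
    slipped_through = True
except KeyError:
    slipped_through = False
assert not slipped_through
```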

jra_samba 11 hours ago|||
Tests. All changes must have tests. If they're generating the code, they can generate the tests too.
throwaw12 11 hours ago||
who reviews the tests? again me? -> that's exactly why I am saying review is a bottleneck, especially with current tooling, which doesn't show second-order impacts of the changes, and it's not easy to reason about when a method gets called by 10 other methods with 4 levels of parent hierarchy
xienze 15 hours ago||
> The problem is some developers now just submit code for review that they didn't bother to read.

Can you blame them? All the AI companies are saying “this does a better job than you ever could”, every discussion topic on AI includes at least one (totally organic, I’m sure) comment along the lines of “I’ve been developing software for over twenty years and these tools are going to replace me in six months. I’m learning how to be a plumber before I’m permanently unemployed.” So when Claude spits out something that seems to work with a short smoke test, how can you blame developers for thinking “damn the hype is real. LGTM”?

jf22 14 hours ago|||
I'm a 99% organic person (I suppose I have tooth fillings) and the new models write code better than I do.

I've been using LLMs for 14+ months now and they've exceeded my expectations.

HoldOnAMinute 14 hours ago|||
Not only do they exceed expectations, but any time they fall down, you can improve your instructions to them. It's easy to get into a virtuous cycle.
xienze 14 hours ago|||
So are you learning a trade? Or do you somehow think you’ll be one of the developers “good enough” to remain employed?
jf22 14 hours ago||
I have a physical goods side hustle already and I'm brainstorming ideas about a trade I can do that will benefit from my programming experience.

I'm thinking HVAC or painting lines in parking lots. HVAC because I can program smart systems and parking lot lines because I can use google maps and algos to propose more efficient parking lot designs to existing business owners.

There is that paradox where, if something becomes cheaper, there is more demand, so we'll see what happens.

Finally, I'm a mediocre dev that can only handle 2-3 agents at a time so I probably won't be good enough.

bluefirebrand 14 hours ago||||
> Can you blame them?

Yes I absolutely can and do blame them

keybored 9 hours ago|||
This is correct. And at this point (and I think you agree?) we have to take that critical thinking skill and stop letting it just happen to us.

It might seem hopeless. But on the other hand the innate human BS detector is quite good. Imagine the state of us if we could be programmed by putting billions of dollars into our brains and not have any kind of subconscious filter that tells us, hey this doesn’t seem right. We’ve already tried that for a century. And it turns out that the cure is not billions of dollars of counter-propaganda consisting of the truth (that would be hopeless as the Truth doesn’t have that kind of money).

We don’t have to be discouraged by whoever replies to you and says things like, oh my goodness the new Siri AI replaced my parenting skills just in the last two weeks, the progress is astounding (Siri, the kids are home and should be in bed by 21:00). Or by the hypothetical people in my replies insisting, no no people are stupid as bricks; all my neighbors buy the propaganda of [wrong side of the political aisle]. Etc. etc. ad nauseam.

More comments...