Posted by todsacerdoti 15 hours ago
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: coworkers ask coding agents to generate some code, and then to generate test coverage for that code. The LLM happily churns out unit tests that simply reinforce the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
This is true for humans too. Tests should not be written or performed by the same person who writes the code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
we ran into this building a task manager. the PUT endpoint set completed=true but never set the completion timestamp. the agent-written tests all passed because they tested "does it set completed to true" not "does it record when it was completed." 59 tasks in production with null timestamps before a downstream report caught it.
the fix was trivial. the gap in verification wasn't.
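a minimal sketch of the gap, with hypothetical names (`Task`, `complete_task` are illustrative, not the actual task-manager code): the agent-written assertion passes whether or not the timestamp is recorded, and only the second assertion catches the null-timestamp bug.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Task:
    completed: bool = False
    completed_at: Optional[datetime] = None

def complete_task(task: Task) -> Task:
    task.completed = True
    # The buggy version omitted this line; the generated tests
    # never noticed because they only checked the boolean flag.
    task.completed_at = datetime.now(timezone.utc)
    return task

t = complete_task(Task())
# Weak, agent-style assertion: passes even without the timestamp fix.
assert t.completed is True
# The assertion that would have caught the bug before production.
assert t.completed_at is not None
```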
no other engineering profession would accept the standards (or rather the lack thereof) on which software engineering runs.
I have bad news for you: they are pushing those "standards" (Agile, ASPICE) also in hardware and mechanical engineering.
The results can already be seen. Testing is expensive, and that is where most savings can be made.
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
it's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. did people ever really check that the tests verify anything meaningful?
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
I know I'm anthropomorphizing the agent; I can't explain it any other way.
The problem is information fatigue from all the agents+code itself.
I fear that thinking about problem solving in this manner just to make LLMs work is damaging to critical-thinking skills.
Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"
(Sorry.)
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing the tests, and not letting it see the tests while writing the implementation would help? They could just share the API and the spec.
And in case of tests failing another agent with full context could decide if the fix should be delegated to coding agent or to testing agent.
You can use spec-driven development and TDD: write the tests first, watch them fail, then modify the code until the tests pass.
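a minimal sketch of that red/green loop, with a hypothetical function (`apply_discount` is illustrative): the test class is written against the spec before any implementation exists, fails ("red"), and only then is the function body filled in until it passes ("green").

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    # Written only after the tests below were red.
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)

class TestApplyDiscount(unittest.TestCase):
    # Spec-first: these assertions encode the desired behaviour,
    # not whatever the generated code happens to do.
    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(200.0, 10), 180.0)

    def test_rejects_bad_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```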
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
A large part of it is probably just using it as a better search. Like "How do I define a new data type in go?".
The real blockers and time sinks were always bad/missing docs and examples. LLMs bridge that gap pretty well, and of course they do. That's what they're designed to be (language models), not an AGI!
I find it baffling how many workplaces are chasing perceived productivity gains that their customers will never notice instead of building out their next gen apps. Anyone who fails to modernize their UI/UX for the massive shift in accessibility about to happen with WebMCP will become irrelevant. Content presentation is so much higher value to the user. People expect things to be reliable and simple. Especially new users don't want your annoying onboarding flow and complicated menus and controls. They'll just find another app that gives them what they want faster.
Do agree it's a weird metric to have, but I can't think of a better one outside of "business". Even that seems like a poor rubric: the vast majority of people care about things that aren't businesses, and if this "life altering" technology basically amounts to creating digital slaves, then maybe we as a species shouldn't explore the stars.
That isn’t necessarily a hit against them - they make an LLM coding tool and they should absolutely be dogfooding it as hard as they can. They need to be the ones to figure out how to achieve this sought-after productivity boost. But so far it seems to me like AI coding is more similar to past trends in industry practice (OOP, Scrum, TDD, whatever) than it is different in the only way that’s ever been particularly noteworthy to me: it massively changes where people spend their time, without necessarily living up to the hype about how much gets done in that time.
This is the ONLY point of software unless you’re doing it for fun.
I don't quite follow but I'd love to hear more about that.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing-complete, but the former approaches the latter as it improves.
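a hypothetical contrast between the two styles (all names here are illustrative, not any real framework's API): the tool-call loop decides one step at a time with feedback, while the program-writing agent must emit one complete algorithm up front and account for every case before it runs.

```python
def tool_loop_agent(task, tools, llm_step):
    # Interactive style: the model picks one tool call at a time,
    # observing each result before deciding the next step.
    state = {"task": task, "observations": []}
    while True:
        action = llm_step(state)        # e.g. ("search", "zlib docs")
        if action is None:              # model decides it is done
            return state["observations"]
        name, arg = action
        state["observations"].append(tools[name](arg))

def program_writing_agent(task, llm_codegen):
    # Up-front style: the model emits one complete program that must
    # anticipate everything, then the program runs unattended.
    source = llm_codegen(task)
    namespace = {}
    exec(source, namespace)             # tools are just APIs here
    return namespace["main"]()
```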
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
They still can't do math.
This might be the case for a hobby project or a start-up MVP being created in a rush, but in reality, there are a few points we may want to take into account:
1. Software teams I work with maintain the usual review practices. Even if a feature is created entirely by AI, it goes through the usual PR review process. The dev may choose "Accept All"; while I am not saying that is good practice, the change still gets reviewed by a human.
2. From my experience, sub-agents intended for code and security review do a good job. It is even possible to use another model to review the code, which can provide a different perspective.
3. A year ago, code written by AI was failing to run the first time, requiring a painful joint troubleshooting effort. Now it works 95% of the time, but perhaps it is not optimal. Given the speed at which it is improving, it is safe to expect that in 6-9 months' time, it will not only work but will also be written to a good quality.
Not just write code, but write tests and ensure all test cases are covered. Now imagine such a flaky foundation being used to build even more untested code on top of it. That's how bad-quality software (usually unfixable without a major rewrite) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working in. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI, most likely written in Tailwind CSS, thinks "wow, that looks really good!", and clicks approve. However, an experienced person might notice the form has no CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
Hence, the fundamental problem is that you need to know more about the domain than the AI does to pick up the flaw. Unless this fundamental problem is solved, which I don't think will happen anytime soon because everyone can generate code + UI these days, I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line fixing the vibe-coded mess after it got hacked the very next day, and we can charge a rush fee on top of it, too. So it's not all that bad.
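the protection missing from the comment-form example above can be sketched in a few lines of stdlib Python (a minimal illustration, not a full framework integration; `SECRET_KEY` handling is simplified):

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # per-deployment secret (illustrative)

def csrf_token(session_id: str) -> str:
    # Tie the token to the session: a cross-site forged form cannot
    # read the victim's token, so its submission fails verification.
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def verify_csrf(session_id: str, submitted: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(csrf_token(session_id), submitted)

sid = "session-abc"
token = csrf_token(sid)                       # embedded in the rendered form
assert verify_csrf(sid, token)                # legitimate submission passes
assert not verify_csrf(sid, "forged-token")   # forged submission fails
```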
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
Not disagreeing with you here, but what ends up happening is that the frontend works flawlessly in the browser/device being tested and has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away without writing tests for the frontend. But the backend? That directly affects the security of the software; there you can't afford to skip the important tests, at least.
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
they aren't good enough yet at all.
i got an agent to use the Windows UIA with success for a feedback loop, and it got the code from not working very well to basically done overnight, but without the MCP having good feedback and tagged/ID'd buttons and so on, the computer use was just garbage
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
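a minimal sketch of that split, with hypothetical names (`Cart`, `cart_total`, `render_total` are illustrative): all the logic worth testing lives in a pure function, and the shell only does I/O with the core's result.

```python
from dataclasses import dataclass

# Functional core: pure, deterministic domain logic,
# trivially unit-testable with no mocks or UI harness.
@dataclass(frozen=True)
class Cart:
    items: tuple  # (name, price) pairs

def cart_total(cart: Cart, tax_rate: float) -> float:
    subtotal = sum(price for _, price in cart.items)
    return round(subtotal * (1 + tax_rate), 2)

# Imperative shell: a thin, "just works" UI layer that only
# formats and prints what the core already computed.
def render_total(cart: Cart) -> None:
    print(f"Total: ${cart_total(cart, 0.08):.2f}")

cart = Cart(items=(("book", 10.00), ("pen", 2.50)))
assert cart_total(cart, 0.08) == 13.5
```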
Maybe in some other circles it is not like that, but I am sure that 90% of the industry measures output in the amount of value produced, and correct code is not the kind of value you can show to the stockholders.
It is a sad state of affairs, dictated by a profit-seeking way of life (capitalism).
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note that high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have had formal development methodologies able to decompose problems to an arbitrary level of granularity, all while preserving correctness, since the 1990s. It is likely that many of these ideas will be revisited soon.
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
It keeps me in the loop, I'm testing actual functionality rather than code, and my code is always in a state where I can open a PR and merge it back to main.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
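a runtime-checkable Python analogue of the invariant described above (illustrative names; the actual kernel proved this statically in Dafny rather than enforcing it at runtime):

```python
class AuditedStore:
    """Every mutating operation must be preceded by an audit-log write."""

    def __init__(self):
        self.audit_log = []
        self.data = {}

    def _audit(self, action: str) -> None:
        self.audit_log.append(action)

    def put(self, key: str, value: str) -> None:
        # Invariant enforced by construction: the log write happens
        # before the mutation, so a crash between the two leaves an
        # audit record but never an unlogged change.
        self._audit(f"put {key}")
        self.data[key] = value

store = AuditedStore()
store.put("k", "v")
assert store.audit_log == ["put k"]
assert store.data == {"k": "v"}
```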
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.