Posted by todsacerdoti 15 hours ago
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: coworkers ask coding agents to generate some code, and then to generate test coverage for that code. The LLM happily churns out unit tests that simply reinforce the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
This is true for humans too. Tests should not be written or performed by the same person who writes the code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
we ran into this building a task manager. the PUT endpoint set completed=true but never set the completion timestamp. the agent-written tests all passed because they tested "does it set completed to true" not "does it record when it was completed." 59 tasks in production with null timestamps before a downstream report caught it.
the fix was trivial. the gap in verification wasn't.
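a minimal sketch of the gap, with hypothetical names (`Task`, `complete_task` are illustrative, not the actual task-manager code): the agent-written assertion passes whether or not the timestamp is recorded, and only the second assertion catches the null-timestamp bug.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Task:
    completed: bool = False
    completed_at: Optional[datetime] = None

def complete_task(task: Task) -> Task:
    task.completed = True
    # The buggy version omitted this line; the generated tests
    # never noticed because they only checked the boolean flag.
    task.completed_at = datetime.now(timezone.utc)
    return task

t = complete_task(Task())
# Weak, agent-style assertion: passes even without the timestamp fix.
assert t.completed is True
# The assertion that would have caught the bug before production.
assert t.completed_at is not None
```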
no other engineering profession would accept the standards (or rather the lack thereof) on which software engineering runs.
I have bad news for you: they are pushing those "standards" (Agile, ASPICE) also in hardware and mechanical engineering.
The results can already be seen. Testing is expensive, and that is where most savings can be made.
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
it's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. did people ever really check that the tests verify anything meaningful?
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
I know I'm anthropomorphizing the agent; I can't explain it any other way.
The problem is information fatigue from all the agents+code itself.
I fear that thinking about problem solving in this manner just to make LLMs work is damaging to critical-thinking skills.
Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"
(Sorry.)
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing the tests, and not letting it see the tests while writing the implementation would help? They could just share the API and the spec.
And in case of tests failing another agent with full context could decide if the fix should be delegated to coding agent or to testing agent.
You can use spec-driven development and TDD: write the tests first, watch them fail, then modify the code until the tests pass.
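a minimal sketch of that red/green loop, with a hypothetical function (`apply_discount` is illustrative): the test class is written against the spec before any implementation exists, fails ("red"), and only then is the function body filled in until it passes ("green").

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    # Written only after the tests below were red.
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)

class TestApplyDiscount(unittest.TestCase):
    # Spec-first: these assertions encode the desired behaviour,
    # not whatever the generated code happens to do.
    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(200.0, 10), 180.0)

    def test_rejects_bad_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```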
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
A large part of it is probably just using it as a better search. Like "How do I define a new data type in go?".
The real blockers and time sinks were always bad/missing docs and examples. LLMs bridge that gap pretty well, and of course they do. That's what they're designed to be (language models), not an AGI!
I find it baffling how many workplaces are chasing perceived productivity gains that their customers will never notice instead of building out their next gen apps. Anyone who fails to modernize their UI/UX for the massive shift in accessibility about to happen with WebMCP will become irrelevant. Content presentation is so much higher value to the user. People expect things to be reliable and simple. Especially new users don't want your annoying onboarding flow and complicated menus and controls. They'll just find another app that gives them what they want faster.
Do agree it's a weird metric to have, but I can't think of a better one outside of "business". Even that seems like a poor rubric: the vast majority of people care about things that aren't businesses, and if this "life altering" technology basically amounts to creating digital slaves, then maybe we as a species shouldn't explore the stars.
That isn’t necessarily a hit against them - they make an LLM coding tool and they should absolutely be dogfooding it as hard as they can. They need to be the ones to figure out how to achieve this sought-after productivity boost. But so far it seems to me like AI coding is more similar to past trends in industry practice (OOP, Scrum, TDD, whatever) than it is different in the only way that’s ever been particularly noteworthy to me: it massively changes where people spend their time, without necessarily living up to the hype about how much gets done in that time.
This is the ONLY point of software unless you’re doing it for fun.
I don't quite follow but I'd love to hear more about that.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing-complete, but the former approaches the latter as it improves.
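a hypothetical contrast between the two styles (all names here are illustrative, not any real framework's API): the tool-call loop decides one step at a time with feedback, while the program-writing agent must emit one complete algorithm up front and account for every case before it runs.

```python
def tool_loop_agent(task, tools, llm_step):
    # Interactive style: the model picks one tool call at a time,
    # observing each result before deciding the next step.
    state = {"task": task, "observations": []}
    while True:
        action = llm_step(state)        # e.g. ("search", "zlib docs")
        if action is None:              # model decides it is done
            return state["observations"]
        name, arg = action
        state["observations"].append(tools[name](arg))

def program_writing_agent(task, llm_codegen):
    # Up-front style: the model emits one complete program that must
    # anticipate everything, then the program runs unattended.
    source = llm_codegen(task)
    namespace = {}
    exec(source, namespace)             # tools are just APIs here
    return namespace["main"]()
```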
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
They still can't do math.
This might be the case for a hobby project or a start-up MVP being created in a rush, but in reality, there are a few points we may want to take into account:
1. Software teams I work with maintain the usual review practices. Even if a feature is created entirely by AI, it goes through the usual PR review process. The dev may choose "Accept All"; while I am not saying that is good practice, the change still gets reviewed by a human.
2. From my experience, sub-agents intended for code and security review do a good job. It is even possible to use another model to review the code, which can provide a different perspective.
3. A year ago, code written by AI was failing to run the first time, requiring a painful joint troubleshooting effort. Now it works 95% of the time, but perhaps it is not optimal. Given the speed at which it is improving, it is safe to expect that in 6-9 months' time, it will not only work but will also be written to a good quality.
Not just write code, but write tests and ensure all test cases are covered. Now imagine such a flaky foundation being used to build even more untested code on top of it. That's how bad-quality software (usually unfixable without a major rewrite) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working in. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI, most likely written in Tailwind CSS, thinks "wow, that looks really good!", and clicks approve. However, an experienced person might notice the form has no CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
Hence, the fundamental problem is that you need to know more about the domain than the AI does to pick up the flaw. Unless this fundamental problem is solved, which I don't think will happen anytime soon because everyone can generate code + UI these days, I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line fixing the vibe-coded mess after it got hacked the very next day, and we can charge a rush fee on top of it, too. So it's not all that bad.
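the protection missing from the comment-form example above can be sketched in a few lines of stdlib Python (a minimal illustration, not a full framework integration; `SECRET_KEY` handling is simplified):

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # per-deployment secret (illustrative)

def csrf_token(session_id: str) -> str:
    # Tie the token to the session: a cross-site forged form cannot
    # read the victim's token, so its submission fails verification.
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def verify_csrf(session_id: str, submitted: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(csrf_token(session_id), submitted)

sid = "session-abc"
token = csrf_token(sid)                       # embedded in the rendered form
assert verify_csrf(sid, token)                # legitimate submission passes
assert not verify_csrf(sid, "forged-token")   # forged submission fails
```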
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
Not disagreeing with you here, but what ends up happening is that the frontend works flawlessly in the browser/device being tested and has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away without writing tests for the frontend. But the backend? That directly affects the security of the software; there you can't afford to skip the important tests, at least.
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
they aren't good enough yet at all.
i got an agent to use the Windows UIA with success for a feedback loop, and it got the code from not working very well to basically done overnight, but without the MCP having good feedback and tagged/ID'd buttons and so on, the computer use was just garbage
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
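a minimal sketch of that split, with hypothetical names (`Cart`, `cart_total`, `render_total` are illustrative): all the logic worth testing lives in a pure function, and the shell only does I/O with the core's result.

```python
from dataclasses import dataclass

# Functional core: pure, deterministic domain logic,
# trivially unit-testable with no mocks or UI harness.
@dataclass(frozen=True)
class Cart:
    items: tuple  # (name, price) pairs

def cart_total(cart: Cart, tax_rate: float) -> float:
    subtotal = sum(price for _, price in cart.items)
    return round(subtotal * (1 + tax_rate), 2)

# Imperative shell: a thin, "just works" UI layer that only
# formats and prints what the core already computed.
def render_total(cart: Cart) -> None:
    print(f"Total: ${cart_total(cart, 0.08):.2f}")

cart = Cart(items=(("book", 10.00), ("pen", 2.50)))
assert cart_total(cart, 0.08) == 13.5
```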
Maybe in some other circles it is not like that, but I am sure that 90% of the industry measures output in the amount of value produced, and correct code is not the kind of value you can show to the stockholders.
It is a sad state of affairs, dictated by a profit-seeking way of life (capitalism).
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note that high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have had formal development methodologies able to decompose problems to an arbitrary level of granularity, all while preserving correctness, since the 1990s. It is likely that many of these ideas will be revisited soon.
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
It keeps me in the loop, I'm testing actual functionality rather than code, and my code is always in a state where I can open a PR and merge it back to main.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
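a runtime-checkable Python analogue of the invariant described above (illustrative names; the actual kernel proved this statically in Dafny rather than enforcing it at runtime):

```python
class AuditedStore:
    """Every mutating operation must be preceded by an audit-log write."""

    def __init__(self):
        self.audit_log = []
        self.data = {}

    def _audit(self, action: str) -> None:
        self.audit_log.append(action)

    def put(self, key: str, value: str) -> None:
        # Invariant enforced by construction: the log write happens
        # before the mutation, so a crash between the two leaves an
        # audit record but never an unlogged change.
        self._audit(f"put {key}")
        self.data[key] = value

store = AuditedStore()
store.put("k", "v")
assert store.audit_log == ["put k"]
assert store.data == {"k": "v"}
```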
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.