AI will make formal verification go mainstream

Posted by evakhoury 12/16/2025

AI will make formal verification go mainstream (martin.kleppmann.com)

827 points | 434 comments

QuadrupleA 12/17/2025|

I don't think formal verification really addresses most day-to-day programming problems:

    * A user interface is confusing, or the English around it is unclear
    * An API you rely on changes, is deprecated, etc.
    * Users use something in unexpected ways
    * Updates forced by vendors or open source projects cause things to break
    * The customer isn't clear what they want
    * Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)

For some mathematically pure task, sure, it's great. Or a low-level library like a regular expression parser or a compression codec. But I don't think that represents a lot of what most of us are tasked with, and those low-level "mathematically pure" libraries are generally pretty well handled by now.

Byamarro 12/17/2025||

In fact, automated regression tests done by AI with visual capabilities may have a bigger impact than formal verification. You can have an army of testers now, painfully going through every corner of your software.

petesergeant 12/17/2025|||

In practice it ends up being a bit like static analysis though, in that you get a ton of false positives.

All said, I’m now running all commits through Codex (which is the only thing it’s any good at), and it’s really pretty good at code reviews.

Maxion 12/17/2025|||

It will only work somewhat when customers expect features to work in a standard way. When customers spec things to work in non-standard ways, you'll just end up with a bunch of false positives.

MetaWhirledPeas 12/17/2025||

This. When the bugs come streaming in you better have some other AI ready to triage them and more AI to work them, because no human will be able to keep up with it all.

Bug reporting is already about signal vs noise. Imagine how it will be when we hand the megaphone to bots.

adrianN 12/17/2025|||

TBH most day to day programming problems are barely worth having tests for. But if we had formal specs and even just hand wavy correspondences between the specs and the implementation for the low level things everybody depends on that would be a huge improvement for the reliability of the whole ecosystem.

gizmo686 12/17/2025|||

A limited form of formal verification is already mainstream. It is called type systems. The industry in general has been slowly moving to encode more invariants into the type system, because every invariant that is in the type system is something you can stop thinking about until the type checker yells at you.

A lot of libraries document invariants that are either not checked at all, only at runtime, or somewhere in between. For instance, the requirement that a collection not be modified during iteration. Or that two regions of memory do not overlap, or that a variable is not modified without owning a lock. These are all things that, in principle, can be formally verified.
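
A rough TypeScript sketch of the first of those invariants (no modification during iteration), using a hypothetical helper `visitAll` that is not from any library: the callback only ever sees a `readonly` view of the collection, so the type checker rejects mutation through it. This is only a compile-time view; code that holds a mutable alias to the same array can still change it.

```typescript
// Hypothetical helper: iterates while exposing the collection as read-only.
function visitAll<T>(
  items: readonly T[],
  visit: (item: T, all: readonly T[]) => void
): void {
  for (const item of items) {
    visit(item, items);
  }
}

const seen: number[] = [];
visitAll([1, 2, 3], (item, all) => {
  // all.push(item); // rejected by the type checker: no 'push' on 'readonly number[]'
  seen.push(item * all.length);
});
```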

No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.

For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program. Then have an LLM driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible; although every "declarative" language is an attempt to approximate it.

elbear 12/17/2025|||

> No one claims that good type systems prevent buggy software.

That's exactly what languages with advanced type systems claim. To be more precise, they claim to eliminate entire classes of bugs. So they reduce bugs, they don't eliminate them completely.

bonesss 12/17/2025|||

No nulls, no nullability bombs.

Forcing devs to pre-fix/avoid bugs before the compiler will allow the app means the programs are more correct as a group.

Wrong, incomplete, insufficient, unhelpful, unimpressive, and dumb are all still very possible. But more correct than likely in looser systems.

fc417fc802 12/18/2025||

> No nulls, no nullability bombs.

I hate this meme. Null indicates something. If you disallow null that same state gets encoded in some other way. And if you don't properly check for that state you get the exact same class of bug. The desirable type system feature here is the ability to statically verify that such a check has occurred every time a variable is accessed.

Another example is bounds checking. Languages that stash the array length somewhere and verify against it on access eliminate yet another class of bug without introducing any programmer overhead (although there generally is some runtime overhead).

yencabulator 12/18/2025||

The whole point of "no nullability bombs" is to make it obvious in the type system when the value might be not present, and force that to be handled.

Javascript:

  let x = foo();
  if (x.bar) { ... } // might blow up

Typescript:

  let x = foo(); // type of x is Foo | undefined
  if (x === undefined) { ...; return; } // I am forced to handle this
  if (x.bar) { ... } // this is now safe, as Typescript knows x can only be a Foo now

(Of course, languages like Rust do that cleaner, since they don't have to be backwards-compatible with old Javascript. But I'm using Typescript in hopes of a larger audience.)

kazinator 12/17/2025|||

If you eliminate the odd integers from consideration, you've eliminated an entire class of integers. Yet the set of remaining integers is the same size as the original.

petesergeant 12/17/2025|||

Peak HN gnomism. While the set of possible errors may be infinite, their distribution is not uniform.

moi2388 12/17/2025||||

No, because integers in computing are generally finite.

tmtvl 12/17/2025|||

There cannot be infinite bugs in a limited program.

kazinator 12/17/2025|||

Programs are not limited; the number of Turing machines is countably infinite.

When you say things like "eliminate a class of bugs", that is played out in the abstraction: an infinite subset of that infinity of machines is eliminated, leaving an infinity.

How you then sample from that infinity in order to have something which fits on your actual machine is a separate question.

Hercuros 12/17/2025||||

How do you count how many bugs a program has? If I replace the Clang code base by a program that always outputs a binary that prints hello world, how many bugs is that? Or if I replace it with a program that exits immediately?

Maybe another example is compiler optimisations: if we say that an optimising compiler is correct if it outputs the most efficient (in number of executed CPU instructions) output program for every input program, then every optimising compiler is buggy. You can always make it less buggy by making more of the outputs correct, but you can never satisfy the specification on ALL inputs because of undecidability.

gls2ro 12/17/2025|||

Because the number of states a program can be in is so huge (when you consider everything that can influence how a program runs, and the context where and when it runs), it is practically infinite for current computational power. But yes, it is theoretically finite and can even be calculated.

skissane 12/17/2025||||

> For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program. Then have an LLM driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible; although every "declarative" language is an attempt to approximate it.

The problem is there is always some chance a coding agent will get stuck and be unable to produce a conforming implementation in a reasonable amount of time. And then you are back in a similar place to what you were with those pre-LLM solutions - needing a human expert to work out how to make further progress.

GTP 12/17/2025||

With the added issue that now the expert is working with code they didn't write, and that could in general be harder to understand than human-written code. So they could find it easier to just throw it away and start from scratch.

kreetx 12/17/2025||||

Some type systems (e.g., Haskell) are closing in on becoming formal verification languages themselves.

rixed 12/17/2025|||

And one can see how quickly they became mainstream...

int_19h 12/22/2025|||

Given that it's the AI doing the coding, it would be pretty quickly so long as it's decent at Haskell. Which it already is, surprisingly so actually for such a niche language. It doesn't necessarily write great code, but it's good enough, and the straightjacket type system makes it very hard for the model to sneak in creative hacks like using globals, or trip itself with mutable state.

egwor 12/17/2025|||

I think that’s because the barrier to entry for a beginner is much higher than say python.

jonathanstrange 12/17/2025|||

IMHO, these strong type systems are just not worth it for most tasks.

As an example, I currently mostly write GUI applications for mobile and desktop as a solo dev. 90% of my time is spent on figuring out API calls and arranging layouts. Most of the data I deal with are strings with their own validation and formatting rules that are complicated and at the same time usually need to be permissive. Even at the backend all the data is in the end converted to strings and integers when it is put into a database. Over-the-wire serialization also discards most typing (although I prefer protocol buffers to alleviate this problem a bit).

Strong typing can be used in between those steps but the added complexity from data conversions introduces additional sources of error, so in the end the advantages are mostly nullified.

baq 12/17/2025||

> Most of the data I deal with are strings with their own validation and formatting rules that are complicated and at the same time usually need to be permissive

this is exactly where a good type system helps: you have an unvalidated string and a validated string which you make incompatible at the type level, thus eliminating a whole class of possible mistakes. same with object ids, etc.

don't need haskell for this, either: https://brightinventions.pl/blog/branding-flavoring/
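
A minimal sketch of the branding idea, with hypothetical names (`ValidatedEmail`, `parseEmail`, `sendWelcome`) and a deliberately permissive placeholder for the validation rule. The brand exists only at the type level, so the validated value is still a plain string at runtime.

```typescript
// The brand is a phantom property: it never exists at runtime.
type Brand<T, B extends string> = T & { readonly __brand: B };
type ValidatedEmail = Brand<string, "ValidatedEmail">;

// The only place where a raw string is allowed to become a ValidatedEmail.
function parseEmail(raw: string): ValidatedEmail | undefined {
  const trimmed = raw.trim();
  // Permissive placeholder rule: something@something, no whitespace.
  return /^[^@\s]+@[^@\s]+$/.test(trimmed)
    ? (trimmed as ValidatedEmail)
    : undefined;
}

function sendWelcome(to: ValidatedEmail): string {
  return `welcome sent to ${to}`;
}

// sendWelcome("bob@example.com"); // rejected: 'string' is not 'ValidatedEmail'
const email = parseEmail("  bob@example.com ");
const result = email === undefined ? "rejected" : sendWelcome(email);
```

The unvalidated and validated strings are incompatible in one direction only: a `ValidatedEmail` can still be passed anywhere a `string` is accepted.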

jonathanstrange 12/17/2025||

That's neat, I was about to ask which languages support that since the vast majority don't. I didn't know that you can do that in Typescript.

Mekaniko 12/20/2025||

Any language with a type system really...

Even OOP : if you have a string class, you can have a String_Formated_For_API subtype.

Just extends String, and add some checking.

But now the type checker "knows" it can print() a String_Formated_For_API just fine but not call_API(string).

qrobit 12/17/2025||||

I would argue that the barrier to entry is on par with python for a person with no experience, but you need much more time with Haskell to become proficient in it. In python, on the other hand, you can learn the basics and these will get you pretty far

azkalam 12/17/2025||||

Python has a reputation for being good for beginners so it's taught to beginners so it has a reputation for being good for beginners.

Byamarro 12/17/2025|||

I blame syntax. It's too unorthodox nowadays. Historical reasons don't matter all that much, everything mainstream is a C-family member

Shocka1 12/18/2025|||

Piggybacking off your comment, I just completed a detailed research paper where I compared Haskell to C# with an automated trading strategy. I have many years of OOP and automated trading experience, but struggled a bit at first implementing in Haskell syntax. I attempted to stay away from LLMs, but ended up using them here and there to get the syntax right.

Haskell is actually a pretty fun language, although it doesn't fly off my fingers like C# or C++ does. I think a really great example of the differences is displayed in the recursive Fibonacci sequence.

In C#:

    public int Fib(int n)
    {
        if (n <= 1)
            return n;
        else
            return Fib(n - 1) + Fib(n - 2);
    }

In Haskell:

    fib :: Integer -> Integer
    fib n
      | n <= 1    = n
      | otherwise = fib (n - 1) + fib (n - 2)

As you might know, this isn't even scratching the surface of the Haskell language, but it does a good job highlighting the syntax differences.

mrsmrtss 12/19/2025||

When using switch expression in C#, they are a lot more similar:

    public int Fib(int n) => n switch
    {
        <= 1 => n,
        _    => Fib(n - 1) + Fib(n - 2)
    };

blub 12/17/2025||||

> No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.

To me it seems they reduce productivity. In fact, for Rust, which seems to match the examples you gave about locks or regions of memory the common wisdom is that it takes longer to start a project, but one reaps the benefits later thanks to more confidence when refactoring or adding code.

However, even that weaker claim hasn’t been proven.

In my experience, the more information is encoded in the type system, the more effort is required to change code. My initial enthusiasm for the idea of Ada and Spark evaporated when I saw how much ceremony the code required.

teiferer 12/17/2025|||

> In my experience, the more information is encoded in the type system, the more effort is required to change code.

I would tend to disagree. All that information encoded in the type system makes explicit what is needed in any case and is otherwise only carried informally in peoples' heads by convention. Maybe in some poorly updated doc or code comment where nobody finds it. Making it explicit and compiler-enforced is a good thing. It might feel like a burden at first, but you're otherwise just closing your eyes and ignoring what can end up important. Changed assumptions are immediately visible. Formal verification just pushes the boundary of that.

blub 12/17/2025|||

In practice it would be encoded in comments, automated tests and docs, with varying levels of success.

It’s actually similar to tests in a way: they provide additional confidence in the code, but at the same time ossify it and make some changes potentially more difficult. Interestingly, they also make some changes easier, as long as not too many types/tests have to be adapted.

estebank 12/17/2025|||

This reads to me like an argument for better refactoring tools, not necessarily for looser type systems. Those tools could range from mass editing tools, IDEs changing signatures in definitions when changing the callers and vice versa, to compiler modes where the language rules are relaxed.

jbritton 12/17/2025||

I was thinking about C++ and if you change your mind about whether some member function or parameter should be const, it can be quite the pain to manually refactor. And good refactoring tools can make this go away. Maybe they already have, I haven’t programmed C++ for several years.

gf000 12/17/2025|||

Constraints Liberate, Liberties Constrain. (I also recommend watching the presentation with the same title)

dnautics 12/17/2025||||

> All that information encoded in the type system makes explicit what is needed in any case and is otherwise only carried informally in peoples' heads by convention

this is, in fact better for llms, they are better at carrying information and convention in their kv cache than they are in having to figure out the actual types by jumping between files and burning tokens in context/risking losing it on compaction (or getting it wrong and having to do a compilation cycle).

if a typed language lets a developer fearlessly build a semantically inconsistent or confusing private API, then llms will perform poorer at them even though correctness is more guaranteed.

jappgar 12/17/2025|||

It is definitely harder to refactor Haskell than it is Typescript. Both are "safe" but one is slightly safer, and much harder to work with.

el_pollo_diablo 12/17/2025||||

Capturing invariants in the type system is a two-edged sword.

At one end of the spectrum, the weakest type systems limit the ability of an IDE to do basic maintenance tasks (e.g. refactoring).

At the other end of the spectrum, dependent types and especially sigma types capture arbitrary properties that can be expressed in the logic. But then constructing values in such types requires providing proofs of these properties, and the code and proofs are inextricably mixed in an unmaintainable mess. This does not scale well: you cannot easily add a new proof on top of existing self-sufficient code without temporarily breaking it.

Like other engineering domains, proof engineering has tradeoffs that require expertise to navigate.

gf000 12/17/2025||||

> but one reaps the benefits later thanks to more confidence when refactoring or adding code.

To be honest, I believe it makes refactoring/maintenance take longer. Sure, safer, but this is not a one-time only price.

E.g. you decide to optimize this part of the code and only return a reference or change the lifetime - this is an API-breaking change and you have to potentially recursively fix it. Meanwhile GC languages can mostly get away with a local-only change.

Don't get me wrong, in many cases this is more than worthwhile, but I would probably not choose rust for the n+1th backend crud app for this and similar reasons.

zozbot234 12/17/2025||

The choice of whether to use GC is completely orthogonal to that of a type system. On the contrary, being pointed at all the places that need to be recursively fixed during a refactoring is a huge saving in time and effort.

gf000 12/17/2025||

I was talking about a type system with affine types, as the topic was Rust specifically.

I compared it to a statically typed language with a GC - where the runtime takes care of a property that Rust has to do statically, requiring more complexity.

GTP 12/17/2025||||

In my opinion, programming languages with a loose type system or no explicit type system only appear to foster productivity, because it is way easier to end up with undetected mistakes that can bite later, sometimes much later. Maybe some people argue that then it is someone else's problem, but even in that case we can agree that the overall quality suffers.

lukan 12/17/2025||||

"In my experience, the more information is encoded in the type system, the more effort is required to change code."

Have you seen large js codebases? Good luck changing anything in it, unless they are really, really well written, which is very rare. (My own js code is often a mess)

When you can change types on the fly somewhere hidden in code ... then this leads to the opposite of clarity for me. And so lots of effort required to change something in a proper way, that does not lead to more mess.

blub 12/17/2025||

There’s two types of slowdown at play:

a) It’s fast to change the code, but now I have failures in some apparently unrelated part of the code base. (Javascript) and fixing that slows me down.

b) It’s slow to change the code because I have to re-encode all the relationships and semantic content in the type system (Rust), but once that’s done it will likely function as expected.

Depending on project, one or the other is preferable.

sothatsit 12/17/2025||

Or: I’m not going to do this refactor at all, even though it would improve the codebase, because it will be near impossible to ensure everything is correct after making so many changes.

To me, this has been one of the biggest advantages of both tests and types. They provide confidence to make changes without needing to be scared of unintended breakages.

skydhash 12/17/2025||

There's a tradeoff point somewhere where it makes sense to go with one or another. You can write a lot of code in bash and Elisp without having to care about the type of whatever you're manipulating, because you're handling one type and encoding the actual values in a type system would be very cumbersome. But then there are other domains which are fairly well known, so the investment in encoding them in a type system does pay off.

wolvesechoes 12/17/2025|||

Soon a lot of people will go out of the way and try to convince you that Rust is most productive language, functions having longer signatures than their bodies is actually a virtue, and putting .clone(), Rc<> or Arc<> everywhere to avoid borrow-checker complaints makes Rust easier and faster to write than languages that don't force you to do so.

Of course it is a hyperbole, but sadly not that large.

Marazan 12/17/2025||||

That is not novel and every declarative language precisely embodies it.

naasking 12/17/2025||

I think most existing declarative languages still require the programmer to specify too many details to get something usable. For instance, Prolog often requires the use of 'cut' to get reasonable performance for some problems.

devin 12/17/2025|||

> No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.

They really don’t. How did you arrive at such a conclusion?

Permik 12/17/2025|||

Not that I can answer for OP but as a personal anecdote; I've never been more productive than writing in Rust, it's a goddamn delight. Every codebase feels like it would've been my own and you can get to speed from 0 to 100 in no time.

leoedin 12/17/2025||

Yeah, I’ve been working mainly in rust for the last few years. The compile time checks are so effective that run time bugs are rare. Like you can refactor half the codebase and not run the app for a week, and when you do it just works. I’ve never had that experience in other languages.

mplewis 12/17/2025|||

Through empirical evidence? Do you think that the vast majority of software devs moved to typing for no reason?

wolvesechoes 12/17/2025|||

> Do you think that the vast majority of software devs moved to typing for no reason?

It is quite clear that this industry is mostly driven by hype and fads, not by empirical studies.

Empirical evidence in favor of a claim that static typing and complex type systems reduce bugs or improve productivity is highly inconclusive at best.

avmich 12/17/2025||||

It's a bad reason. A lot of best practices are temporary blindnesses, comparable, in some sense, with supposed love to BASIC before or despite Dijkstra. So, yes, it's possible there is no good reason. Though I don't think it's the case here.

gf000 12/17/2025|||

We don't actually have empirical evidence on the topic, surprisingly.

It's just people's hunches.

JumpCrisscross 12/17/2025||

I feel like the terms logical, empirical, rational and objective are used interchangeably by the general public, with one being in vogue at a time.

nwah1 12/17/2025|||

> Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)

Isn't this what TLA+ was meant to deal with?

skydhash 12/17/2025||

Not really; some components have a lot of properties that are very difficult to model. Take latency in a network, or storage performance in an OS.

ScottBurson 12/18/2025|||

Actually, formal verification could help massively with four of those problems — all but the first (UI/UX) and fifth (requirements will always be hard).

A change in the API of a dependency should be detected immediately and handled silently.

Reliance on unspecified behavior shouldn't happen in the first place; the client's verification would fail.

Detecting breakage caused by library changes should be where verification really shines; when you get the update, you try to re-run your verification, and if that fails, it tells you what the problem is.

As for interconnected systems, again, that's pretty much the whole point. Obviously, achieving this dream will require formalizing pretty much everything, which is well beyond our capabilities now. But eventually, with advances in AI, I think it will be possible. It will take something fundamentally better than today's LLMs, though.

raxxorraxor 12/17/2025|||

That has been the problem with unit and integration tests all the time. Especially for systems that tend to be distributed.

AI makes creating mock objects much easier in some cases, but it still creates a lot of busy work and makes configuration more difficult. At this point it is often difficult configuration management that causes the issues in the first place. Putting everything in some container doesn't help either; on the contrary.

ErroneousBosh 12/17/2025|||

> But I don't think that represents a lot of what most of us are tasked with

Give me a list of all the libraries you work with that don't have some sort of "okay but not that bit" rule in the business logic, or "all of those functions are f(src, dst) but the one you use most is f(dst, src) and we can't change it now".

I bet it's a very short list.

Really we need to scrap every piece of software ever written and start again from scratch with all these weirdities written down so we don't do it again, but we never will.

bluGill 12/17/2025||

Scrapping everything wouldn't help. 15 years ago the project I'm on did that - for a billion dollars. We fixed the old mistakes but made plenty of new ones along the way. We are trying to fix those now and I can't help but wonder what new mistakes we are making that in 15 years we will regret.

ErroneousBosh 12/17/2025||

Computers are terrible and software is terrible and we should just go back to tilling the fields with horses and drinking beer.

wolfgangbabad 12/17/2025|||

Yeah, in the past few weeks there were about 5 or 10 videos about this "complexity": the unpredictability of third parties and all the moving parts that AI doesn't control and even forgets about (small context window). I am sure you have seen at least one of them ;)

But it's true. AI is still super narrow and dumb. It doesn't even understand basic prompts.

Look at computer games now: they still don't look real despite almost 30 years since Half-Life 1 started the revolution, I would claim. Damn, I think I ran it on a 166 MHz computer on the lowest details even.

Yes, it's just better and better but still looking super uncanny - at least to me. And it's been basically 30 years of constant improvements. Heck, Roomba is going bankrupt.

I am not saying things don't improve but the hype and AI bubble is insane and the reality doesn't match the expectation and predictions at all.

est 12/17/2025||

> An API you rely on changes, is deprecated, etc

Formal verification will eventually lead to good, stable API design.

> Users use something in unexpected ways

> Complex behavior between interconnected systems

It happens when there's no formal verification during the design stage.

Formal verification literally means covering 100% of state changes: for every possible input/output, every execution branch should be tested.

Almondsetat 12/17/2025|||

Formal verification has nothing to do with the quality of the API.

Given the spec, formal verification can tell you if your implementation follows the spec. It cannot tell you if the spec is good.
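
A small illustration of that gap, sketched in TypeScript with runtime predicates standing in for a formal spec (all names here are hypothetical): a "sort" that always returns an empty array satisfies the incomplete spec "the output is in ascending order" on every input. Nothing about checking the implementation against that spec would flag it; only a better spec does.

```typescript
// Weak spec (deliberately incomplete): "the output is in ascending order".
function isSorted(xs: readonly number[]): boolean {
  let prev = -Infinity;
  for (const x of xs) {
    if (x < prev) return false;
    prev = x;
  }
  return true;
}

// Missing half of the spec: the output has exactly the input's elements.
function isPermutation(a: readonly number[], b: readonly number[]): boolean {
  const canon = (xs: readonly number[]): string =>
    JSON.stringify([...xs].sort((x, y) => x - y));
  return canon(a) === canon(b);
}

// A useless "sort" that still meets the weak spec for every input.
function lazySort(_input: readonly number[]): number[] {
  return [];
}

const input = [3, 1, 2];
const meetsWeakSpec = isSorted(lazySort(input));
const meetsStrongSpec = meetsWeakSpec && isPermutation(input, lazySort(input));
```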

dhruv3006 12/17/2025|||

That's something I agree with.

I am right now working on an offline API client: https://voiden.md/. I wonder if this can be a feature.

est 12/17/2025||||

> It cannot tell you if the spec is good

I beg to differ, if a spec is hard to verify, then it's a bad sign.

Joker_vD 12/17/2025|||

All non-trivial specs, like the one for seL4, are hard to verify. Lots of that complexity comes from interacting with the rest of the world which is a huge shared mutable global state you can't afford to ignore.

Of course, you can declare that the world itself is inherently sinful and imperfect, and is not ready for your beautiful theories but seriously.

jessoteric 12/17/2025||

> Of course, you can declare that the world itself is inherently sinful and imperfect, and is not ready for your beautiful theories

i see we are both familiar with haskellers (friendly joke!)

MattHeard 12/17/2025|||

it can tell you if your spec is bad, but it can't tell you if your spec is good

jeffreygoesto 12/17/2025|||

That is one problem of many solved, isn't that good?

That the spec solves the problem is called validation in my domain and treated explicitly with different methods.

We use formal validation to check for invariants, but also "it must return a value xor an error, but never just hang".
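
The "value xor error" half of that invariant can be sketched at the type level in TypeScript with a discriminated union (hypothetical names `Result`, `parsePort`, `render`). The "never just hang" half is a termination property that a type like this does not capture.

```typescript
// Exactly one of the two shapes is possible; "both" or "neither" cannot be built.
type Result<T, E> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly error: E };

function parsePort(text: string): Result<number, string> {
  const n = Number(text);
  if (!Number.isInteger(n) || n < 1 || n > 65535) {
    return { ok: false, error: `not a valid port: ${text}` };
  }
  return { ok: true, value: n };
}

// The checker only allows .value after ok has been tested as true,
// and only allows .error after it has been tested as false.
function render(r: Result<number, string>): string {
  return r.ok ? `port ${r.value}` : `error: ${r.error}`;
}
```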

Joker_vD 12/17/2025||||

> Formal verification will eventually lead to good, stable API design.

Why? Has it ever happened like this? Because to me it would seem that if the system is verified to work, then it works no matter how the API is shaped, so there is no incentive to change it to something better.

est 12/17/2025||

> if the system is verified to work, then it works no matter how the API is shaped

That's the case for one-off integrations, but the messy part always comes when the system's goals change.

Let's say formal verification could help to avoid some anti-patterns.

Joker_vD 12/17/2025|||

> Let's say formal verification could help to avoid some anti-patterns.

I'd still like to hear about the actual mechanism of this happening. Because I personally find it much easier to believe that the moment keeping the formal verification up to date becomes untenable for whatever reason (specs changing too fast, external APIs to use are too baroque, etc) people would rather say "okay, guess we ditch the formal verification and just keep maintaining the integration tests" instead of "let's change everything about the external world so we could keep our methodology".

est 12/17/2025||

> I'd still like to hear about the actual mechanism of this happening

I am not an expert on this, but the worst APIs I've seen are those with hidden states.

e.g. .toggle() API. Call it an odd number of times, it goes to one state, call it an even number of times, it goes back.

And there are "call A before you call B" types of APIs, where the client has to keep a strict call order (which itself is a state machine of some kind)
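
The "call A before B" problem can sometimes be pushed into the types (the typestate pattern), so that the wrong order does not compile. A minimal sketch, assuming a hypothetical `Closed`/`Open` connection API that is not taken from any real library:

```typescript
// Each state is its own type and only exposes the calls valid in that state.
class Closed {
  open(): Open {
    return new Open([]);
  }
}

class Open {
  // The private field keeps Open from being structurally interchangeable
  // with unrelated object shapes.
  private readonly sent: readonly string[];

  constructor(sent: readonly string[]) {
    this.sent = sent;
  }

  send(msg: string): Open {
    return new Open([...this.sent, msg]);
  }

  // Closing ends the session and hands back everything that was sent.
  close(): readonly string[] {
    return this.sent;
  }
}

// new Closed().send("hi"); // rejected: 'send' does not exist on 'Closed'
const log = new Closed().open().send("a").send("b").close();
```

The state machine is still there, but it lives in the signatures instead of in the caller's head.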

Joker_vD 12/17/2025||

> I am not an expert on this, but the worst APIs I've seen are those with hidden states.

> e.g. .toggle() API. Call it an odd number of times, it goes to one state, call it an even number of times, it goes back.

This is literally a dumb light switch. If you have trouble proving that, starting from lights off, flicking a simple switch twice will still keep lights off then, well, I have bad news to tell you about the feasibility of using the formal methods for anything more complex than a dumb light switch. Because the rest of the world is a very complex and stateful place.
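
For what it's worth, the light-switch property is a short proof in a proof assistant. A sketch in Lean 4, where `toggle` is a hypothetical model of the switch as a boolean (an assumption for illustration, not any real API):

```lean
-- Model the switch state as a Bool: true = on, false = off.
def toggle (on : Bool) : Bool := !on

-- Flicking the switch twice returns it to where it started.
theorem toggle_twice (on : Bool) : toggle (toggle on) = on := by
  cases on <;> rfl
```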

> (which itself is a state machine of some kind)

Yes? That's pretty much the raison d'être of the formal methods: for anything pure and immutable, normal intuition is usually more than enough; it's tracking the paths through enormous configuration spaces that our intuition has problems with. If the formal methods can't help with that with a comparable amount of effort, then they are just not worth it.

onion2k 12/17/2025||||

At that point you create an entirely new API, fully versioned, and backwardly compatible (if you want it to be). The point the article is making is that AI, in theory, entirely removes the person from the coding process so there's no longer any need to maintain software. You can just make the part you're changing from scratch every time because the cost of writing bug-free code (effectively) goes to zero.

The theory is entirely correct. If a machine can write provably perfect code there is absolutely no reason to have people write code. The problem is that the 'If' is so big it can be seen from space.

wombatpm 12/17/2025|||

Isn’t this where the Eiffel design by contract people speak up about code reuse?

ehnto 12/17/2025|||

100% of state changes in business software is unknowable on a long horizon, and relies on thoroughly understanding business logic that is often fuzzy, not discrete and certain.

est 12/17/2025||

Formal verification does not guarantee that business logic works as everybody expected, nor is it future-proof; however, it does provide a workable path towards:

Things can happen only if you allow them to happen.

In other words, your software may come to a stage where it's no longer applicable, but it never crashes.

Formal verification had little adoption only because it costs 23x of your original code with "PhD-level training"

bongodongobob 12/17/2025||

The reason it doesn't work is businesses change faster than you can model every detail AND keep it all up to date. Unless you have something tying your model directly to every business decision and transaction that happens, your model will never be accurate. And if we're talking about formal verification, that makes it useless.

bkettle 12/16/2025||

I think formal verification shines in areas where implementation is much more complex than the spec, like when you’re writing incomprehensible bit-level optimizations in a cryptography implementation or compiler optimization phases. I’m not sure that most of us, day-to-day, write code (or have AI write code) that would benefit from formal verification, since to me it seems like high-level programming languages are already close to a specification language. I’m not sure how much easier to read a specification format that didn’t concern itself with implementation could be, especially when we currently use all kinds of frameworks and libraries that already abstract away implementation details.

Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.

pron 12/16/2025||

> to me it seems like high-level programming languages are already close to a specification language

They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).

To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.

nyrikki 12/16/2025|||

TLA+ is not a silver bullet, and like all temporal logic, has constraints.

You really have to be able to reduce your models to: “at some point in the future, this will happen," or "it will always be true from now on”

Have probabilistic outcomes? Or even floats [0] and it becomes challenging and strings are a mess.

> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.

TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.

[0] https://learntla.com/core/operators.html

pron 12/17/2025|||

> You really have to be able to reduce your models to: “at some point in the future, this will happen," or "it will always be true from now on”

You really don't. It's not LTL. Abstraction/refinement relations are at the core of TLA.

> Or even floats [0] and it becomes challenging and strings are a mess.

No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.

> TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.

TLA+ can specify anything that could be specified in mathematics. That there is no predefined set of floats is no more a problem than the one physicists face because mathematics has no "built-in" concept for metal or temperature. TLA+ doesn't even have any built in notions of procedures, memory, instructions, threads, IO, variables in the programming sense, or, indeed programs. It is a mathematical framework for describing the behaviour of discrete or hybrid continuous-discrete dynamical systems, just as ODEs describe continuous dynamical systems.

But you're talking about the verification tools you can run on a TLA+ spec, and like all verification tools, they have their limitations. I never claimed otherwise.

You are, however, absolutely right that it's difficult to specify probabilistic properties in TLA+.

hwayne 12/17/2025|||

> No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.

I think it's disingenuous to say that TLA+ verifiers "may or may not have limitations" wrt floats when none of the available tools support floats. People should know going in that they won't be able to verify specs with floats!

pron 12/17/2025||

I'm not sure how a "spec with floats" differs from a spec with networks, RAM, 64-bit integers, multi-level cache, or any computing concept, none of which exists as a primitive in mathematics. A floating point number is a pair of integers, or sometimes we think about it as a real number plus some error, and TLAPS can check theorems about specifications that describe floating-point operations.

Of course, things can become more involved if you want to account for overflow, but overflow can get complicated even with integers.
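
As a rough illustration of the "pair of integers" view, here is a small Python sketch. The helper names are made up for this example, it assumes IEEE 754 doubles (what CPython uses on common platforms), and it ignores infinities and NaN.

```python
import math
from fractions import Fraction


def as_integer_pair(x):
    """Return integers (mantissa, exponent) with x == mantissa * 2**exponent.

    Assumes x is a finite double with a 53-bit significand.
    """
    m, e = math.frexp(x)  # x == m * 2**e, with 0.5 <= |m| < 1 (or m == 0)
    mantissa = int(math.ldexp(m, 53))  # scaling by 2**53 is exact
    return mantissa, e - 53


def pair_value(mantissa, exponent):
    """Exact rational value denoted by the integer pair."""
    return Fraction(mantissa) * Fraction(2) ** exponent


# The pair denotes exactly the value the float stores, not the decimal literal.
assert pair_value(*as_integer_pair(0.1)) == Fraction(0.1)
assert pair_value(*as_integer_pair(0.1)) != Fraction(1, 10)
```

The second assertion is the "real number plus some error" view: the stored value is close to 1/10 but not equal to it.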

hwayne 12/17/2025||

Those things, unlike floats, have approximable-enough facsimiles that you can verify instead. No tools support even fixed point decimals.

This has burned me before when I e.g. needed to take the mean of a sequence.

pron 12/17/2025||

You say no tools but you can "verify floats" with TLAPS. I don't think that RAM or 64-bit integers have facsimiles in TLA+. They can be described mathematically in TLA+ to whatever level of detail you're interested in (e.g. you have to be pretty detailed when describing RAM when specifying a GC, and even more when specifying a CPU's memory-access subsystem), but so can floating point numbers. The least detailed description - say, RAM is just data - is not all that different from representing floats as reals (but that also requires TLAPS for verification).

The complications in describing machine-representable numbers also apply to integers, but these complications can be important, and the level of detail matters just as it matters when representing RAM or any other computing concept. Unlike, say, strings, there is no single "natural" mathematical representation of floating point numbers, just as there isn't one for software integers (integers work differently in C, Java, JS, and Zig; in some situations you may wish to ignore these differences, in others - not). You may want to think about floating point numbers as a real + error, or you may want to think about them as a mantissa-exponent pair, perhaps with overflow or perhaps without. The "right" representation of a floating point number highly depends on the properties you wish to examine, just like any other computing construct. These complications are essential, and they exist, pretty much in the same form, in other languages for formal mathematics.

pron 12/20/2025||

P.S. for the case of computing a mean, I would use Real rather than try to model floating point if the idiosyncrasies of a particular FP implementation were important. That means you can't use TLC. In some situations it could suffice to represent the mean as any number (even an integer) that is ≥ min and ≤ max, but TLC isn't very effective even for algorithms involving non-tiny sets of integers when there's "interesting" arithmetic involved.

I don't know the state of contemporary model checkers that work with theories of reals and/or FP, and I'm sure you're much more familiar with that than me, but I believe that when it comes to numeric computation, deductive proofs or "sampling tests" (such as property-based testing) are still more common than model-checking. It could be interesting to add a random sampling mode to TLC that could simulate many operations on reals using BigDecimal internally.
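
For what a "sampling test" of the mean could look like, here is a hedged Python sketch. It is not TLC and not a proof: it uses exact rationals (`Fraction`) as a stand-in for reals, so it says nothing about floating-point rounding, and the function names and value ranges are arbitrary choices for illustration.

```python
import random
from fractions import Fraction


def mean(xs):
    """Arithmetic mean; exact when xs contains Fractions."""
    return sum(xs) / len(xs)


def sample_mean_within_bounds(trials=1000, seed=0):
    """Randomly sample sequences and check min(xs) <= mean(xs) <= max(xs)."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(1, 20)
        xs = [
            Fraction(rng.randint(-1000, 1000), rng.randint(1, 100))
            for _ in range(length)
        ]
        m = mean(xs)
        assert min(xs) <= m <= max(xs), xs
    return True
```

This is the weak property from the comment above (the mean is some number between min and max), checked on samples rather than proven.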

igornotarobot 12/17/2025|||

> TLA+ can specify anything that could be specified in mathematics.

You are talking about the logic of TLA+, that is, its mathematical definition. No tool for TLA+ can handle all of mathematics at the moment. The language was designed for specifying systems, not all of mathematics.

nextos 12/17/2025||||

There are excellent probabilistic extensions to temporal logic out there that are very useful to uncover subtle performance bugs in protocol specifications, see e.g. what PRISM [1] and Storm [2] implement. That is not within the scope of TLA+.

Formal methods are really broad, ranging from lightweight type systems to theorem proving. Some techniques are fantastic for one type of problem but fail at others. This is quite natural, the same thing happens with different programming paradigms.

For example, what is adequate for a hard real-time system (timed automata) is useless for a typical CRUD application.

[1] https://www.prismmodelchecker.org

[2] https://www.stormchecker.org

hwayne 12/17/2025||

I really do wish that PRISM can one day add some quality of life features like "strings" and "functions"

(Then again, AIUI it's basically a thin wrapper over stochastic matrices, so maybe that's asking too much...)

igornotarobot 12/17/2025|||

> TLA+ is not a silver bullet, and like all temporal logic, has constraints.

> You really have to be able to reduce your models to: “at some point in the future, this will happen," or "it will always be true from now on”

I think people get confused by the word "temporal" in the name of TLA+. Yes, it has temporal operators. If you throw them away, TLA+ (minus the temporal operators) would be still extremely useful for specifying the behavior of concurrent and distributed systems. I have been using TLA+ for writing specifications of distributed algorithms (e.g., distributed consensus) and checking them for about 6 years now. The question of liveness comes the last, and even then, the standard temporal logics are barely suitable for expressing liveness under partial synchrony. The value of temporal properties in TLA+ is overrated.

pron 12/28/2025||

The "temporal" in TLA+ isn't about □ and ⬦. It's about ' and the abstraction-refinement relation with stuttering at its core (contrasted with o and its non-stuttering meaning). Of course, you can't really specify anything in TLA+ without □ (unless you rely on TLC, which inserts the □ for you).

You cannot specify much in TLA+ without ' and □, and the "temporal" part of TLA+ - i.e. the TLA logic - is essential; but saying it's like "all temporal logics" is ignoring the abstraction-refinement relation, which is the heart of TLA+ (that's what ⇒, basic implication, in TLA+ means) and which other temporal logics miss.

Of course, you could hypothetically use the + part of TLA+, the formalised set theory, to specify everything, but that would be very inconvenient.

eru 12/17/2025||||

What you said certainly works, but I'm not sure computability is actually the biggest issue here?

Have a look at how SAT solvers or Mixed Integer Linear Programming solvers are used.

There you specify a clear goal (with your code), and then you let the solvers run. You can, but you don't need to, let the solvers run all the way to optimality. And the solvers are also allowed to use all kinds of heuristics to find their answers, but that doesn't impact the statement of your objective.

Compare that to how many people write code without solvers: the objective of what your code is trying to achieve is seldom clearly spelled out, and is instead mixed up with the how-to-compute bits, including all the compromises and heuristics you make to get a reasonable runtime or to accommodate some changes in the spec your boss asked for at the last minute.

Using a solver ain't formal verification, but it shows the same separation between spec and implementation.

Another benefit of formal verification, that you already imply: your formal verification doesn't have to determine the behaviour of your software, and you can have multiple specs simultaneously. But you can only have a single implementation active at a time (even if you use a high level implementation language.)

So you can add 'handling a user request must terminate in finite time' as a (partial) spec. It's an important property, but it tells you almost nothing about the required business logic. In addition you can add "users shouldn't be able to withdraw more than they deposited" (and other more complicated rules), and you only have to review these rules once, and don't have to touch them again, even when you implement a clever new money transfer routine.
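
A small sketch of the "several partial specs, one implementation" idea, in Python. Everything here is hypothetical (the `Account` class, the two spec functions, the amounts), and it is bounded exhaustive testing over short operation sequences, not formal verification.

```python
from itertools import product


class Account:
    """Hypothetical implementation under test."""

    def __init__(self):
        self.deposited = 0
        self.withdrawn = 0

    def deposit(self, amount):
        self.deposited += amount

    def withdraw(self, amount):
        # Refuse withdrawals that exceed the available balance.
        if amount <= self.deposited - self.withdrawn:
            self.withdrawn += amount
            return True
        return False


# Each spec is a separate, independently reviewable property.
def never_overdrawn(acct):
    return acct.withdrawn <= acct.deposited


def totals_non_negative(acct):
    return acct.deposited >= 0 and acct.withdrawn >= 0


SPECS = [never_overdrawn, totals_non_negative]


def check_all_sequences(max_len=4, amounts=(1, 2, 5)):
    """Run every operation sequence up to max_len; check all specs after each step."""
    ops = [(name, a) for name in ("deposit", "withdraw") for a in amounts]
    checked = 0
    for n in range(max_len + 1):
        for seq in product(ops, repeat=n):
            acct = Account()
            for name, amount in seq:
                getattr(acct, name)(amount)
                assert all(spec(acct) for spec in SPECS), seq
            checked += 1
    return checked
```

Swapping in a cleverer `withdraw` does not require touching `SPECS`, which is the separation being described.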

avmich 12/17/2025||||

Peter Norvig once proposed to consider a really large grammar, with a trillion rules, which could simulate some practically small applications of more complex systems. Many programs in practice don't need to be written in Turing-complete languages, and can be proven to terminate.

pron 12/17/2025|||

Writing in a language that guarantees termination is not very interesting in itself, as every existing program could automatically be translated into a non-Turing-complete language where the program is proven to terminate, yet behaves exactly the same: the language is the same as the original, only loops/recursion ends the program after, say, 2^64 iterations. This, in itself, does not make programs any easier to analyse. In fact, a language that only has boolean variables, no arrays, no recursion, and loops of depth 2 at most is already intractable to verify. It is true that programs in Turing-complete languages cannot generally be verified efficiently, but most non-Turing-complete languages also have this property.

Usually, when we're interested in termination proofs, what we're really interested in is a proof that the algorithm makes constant progress that converges on a solution.

avmich 12/19/2025||

I think the interesting progress in programs can generally be achieved for many programs, which take input and produce output and then terminate. For servers, which wait for requests, the situation seems to be different.

atakan_gurkan 12/17/2025|||

This sounds very interesting. Do you have a reference?

avmich 12/19/2025||

Saw something like that once, couldn't find recently, sorry. Ask Peter?..

mrkeen 12/17/2025||||

Can TLA+ prove anything about something you specify but don't execute?

igornotarobot 12/17/2025|||

TLA+ is just a language for writing specifications (syntax + semantics). If you want to prove anything about it, at various degrees of confidence and effort, there are three tools:

- TLAPS is the interactive proof system that can automate some proof steps by delegating to SMT solvers: https://proofs.tlapl.us/doc/web/content/Home.html

- Apalache is the symbolic model checker that delegates verification to Z3. It can prove properties without executing anything, or rather, executing specs symbolically. For instance, it can do proofs via inductive invariants but only for bounded data structures and unbounded integers. https://apalache-mc.org/

- Finally, TLC is an enumerative model checker and simulator. It simply produces states and enumerates them. So it terminates only if the specification produces a finite number of states. It may sound like executing your specification, but it is a bit smarter, e.g., when checking invariants it will never visit the same state twice. This gives TLC the ability to reason about infinite executions. Confusingly, TLC does not have its own page, as it was the first working tool for TLA+. Many people believe that TLA+ is TLC: https://github.com/tlaplus/tlaplus

pron 12/18/2025||

> It may sound like executing your specification, but it is a bit smarter,

It's more than just "a bit smarter" I would say, and explicit state enumeration is nothing at all like executing a spec/program. For example, TLC will check in virtually zero time a spec that describes a nondeterministic choice of a single variable x being either 0 or 1 at every step (as there are only two states). The important aspect here isn't that each execution is of infinite length, but that there are an uncountable infinity of behaviours (executions) here. This is a completely different concept from execution, and it is more similar to abstract interpretation (where the meaning of a step isn't the next state but the set of all possible next states) than to concrete interpretation.
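
To illustrate why state enumeration is not the same as execution, here is a minimal Python sketch of explicit-state reachability. It is not how TLC is implemented; the function names are invented for this example, and a "spec" is reduced to a set of initial states plus a next-state function.

```python
from collections import deque


def reachable_states(initial_states, next_states):
    """Breadth-first enumeration of reachable states, visiting each state once."""
    seen = set(initial_states)
    frontier = deque(seen)
    while frontier:
        state = frontier.popleft()
        for successor in next_states(state):
            if successor not in seen:
                seen.add(successor)
                frontier.append(successor)
    return seen


def check_invariant(initial_states, next_states, invariant):
    """True if the invariant holds in every reachable state."""
    return all(invariant(s) for s in reachable_states(initial_states, next_states))


def coin_next(x):
    """At every step x nondeterministically becomes 0 or 1."""
    return {0, 1}


# Infinitely many infinite behaviours, but only two states to visit.
assert reachable_states({0}, coin_next) == {0, 1}
assert check_invariant({0}, coin_next, lambda x: x in (0, 1))
```

The search stops after two states even though no single "run" of the system ever ends, which is the sense in which this reasons about sets of next states rather than one concrete execution.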

pron 12/17/2025|||

You can write proofs in TLA+ about things you don't execute and have them checked by the TLA+ proof assistant. But the most common aspect of that, which pretty much every TLA+ spec contains is nondeterminism, which is basically the ability to describe a system with details you don't know or care about. For example, you can describe "a program that sorts an array" without specifying how and then prove, say, that the median value ends up in the middle. The ability to specify what a program or a subroutine does without specifying how is what separates the expressive power of specification from programming. This extends not only to the program itself but to its environment. For example, it's very common in TLA+ to specify a network that can drop or reorder messages nondeterministically, and then prove that the system doesn't lose data despite that.
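
The "what, not how" idea for sorting can be sketched in Python as a predicate over input and output. This is an analogy, not TLA+: the names are hypothetical, the median is defined by counting rather than by sorting, and the check is a bounded enumeration of small inputs rather than a proof.

```python
from collections import Counter
from itertools import product


def satisfies_sort_spec(inp, out):
    """Spec: out is an ordered rearrangement of inp. Says nothing about how."""
    same_elements = Counter(inp) == Counter(out)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return same_elements and ordered


def is_median(value, inp):
    """Median by counting: at most half the items are smaller, at most half larger."""
    half = len(inp) // 2
    return (sum(x < value for x in inp) <= half
            and sum(x > value for x in inp) <= half)


def insertion_sort(xs):
    """One possible 'how'; any function meeting the spec would do."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def check_implementation(sort_fn, max_len=4, values=range(3)):
    """Check the spec, and the median property for odd lengths, on all small inputs."""
    for n in range(max_len + 1):
        for inp in product(values, repeat=n):
            out = list(sort_fn(list(inp)))
            if not satisfies_sort_spec(inp, out):
                return False
            if n % 2 == 1 and not is_median(out[n // 2], inp):
                return False
    return True
```

Both `sorted` and `insertion_sort` pass the same check, while a function that returns its input unchanged does not.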

anon-3988 12/16/2025||||

> To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.

Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate, versus giving them a framework or a language that does not have a for loop?

Edit: If by formal verification you mean type checking. That I very much agree.

DennisP 12/17/2025||

Maybe it's difficult for the average developer to write a formal specification, but the point of the article is that an AI can do it for them.

goryDeets 12/16/2025|||

[dead]

socketcluster 12/16/2025|||

Yes. I feel like people who are trying to push software verification have never worked on typical real-world software projects where the spec is like 100 pages long and still doesn't fully cover all the requirements and you still have to read between the lines and then requirements keep changing mid-way through the project... Implementing software to meet the spec takes a very long time and then you have to invest a lot of effort and deep thought to ensure that what you've produced fits within the spec so that the stakeholder will be satisfied. You need to be a mind-reader.

It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.

Worst part is; after you do all the expensive work of formal verification; you end up proving the 'correctness' of a solution that the client doesn't want.

The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.

Without a hypothetical super-intelligence which understands everything about the world; the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such super-intelligence, we won't need formal verification because the super-intelligence will be able to code perfectly on the first attempt; no need to verify.

And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.

DennisP 12/17/2025|||

Software verification has gotten some use for smart contracts. The code is fairly simple, it's certain to be attacked by sophisticated hackers who know the source, and the consequence of failure is theft of funds, possibly in large amounts. 100% test coverage is no guarantee that an attack can't be found.

People spend gobs of money on human security auditors who don't necessarily catch everything either, so verification easily fits in the budget. And once deployed, the code can't be changed.

Verification has also been used in embedded safety-critical code.

socketcluster 12/17/2025||

If the requirements you have to satisfy arise out of a fixed, deterministic contract (as opposed to a human being), I can see how that's possible in this case.

I think the root problem may be that most software has to adapt to a constantly changing reality. There aren't many businesses which can stay afloat without ever changing anything.

robot-wrangler 12/16/2025||||

The whole perspective of this argument is hard for me to grasp. I don't think anyone is suggesting that formal specs are an alternative to code, they are just an alternative to informal specs. And actually with AI the new spin is that they aren't even a mutually exclusive alternative.

A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.

It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.

socketcluster 12/17/2025||

> A bidirectional bridge that spans multiple representations from informal spec

Amusingly, what I'm hearing is literally "I have a bridge to sell you."

robot-wrangler 12/17/2025||

There's always a bridge, dude. The only question is whether you want to buy one that's described as "a pretty good one, not too old, sold as is" or if you'd maybe prefer "spans X, holds Y, money back guarantee".

socketcluster 12/17/2025||

I get it. Sometimes complexity is justified. I just don't feel this particular bridge is justified for 'mainstream software' which is what the article is about.

ad_hockey 12/17/2025||||

I agree that trying to produce this sort of spec for the entire project is probably a fool's errand, but I still see the value for critical components of the system. Formally verifying the correctness of balance calculation from a ledger, or that database writes are always persisted to the write ahead log, for example.

qingcharles 12/17/2025|||

I used to work adjacent to a team who worked from closely-defined specs for web sites, and it used to infuriate the living hell out of me. The specs had all sorts of horrible UI choices and bugs and stuff that just plain wouldn't work when coded. I tried my best to get them to implement the intent of the spec, not the actual spec, but they had been trained in one method only and would not deviate at any cost.

socketcluster 12/17/2025|||

Yeah, IMO, the spec almost always needs refinement. I've worked for some companies where they tried to write specs with precision down to every word; but what happened is; if the spec was too detailed, it usually had to be adjusted later once it started to conflict with reality (efficiency, costs, security/access restrictions, resource limits, AI limitations)... If it wasn't detailed enough, then we had to read between the lines and fill in a lot of gaps... And usually had to iterate with the stakeholder to get it right.

At most other companies, it's like the stakeholder doesn't even know what they want until they start seeing things on a screen... Trying to write a formal spec when literally nobody in the universe even knows what is required; that's physically impossible.

In my view, 'Correct code' means code that does what the client needs it to do. This is downstream from it doing what the client thinks they want; which is itself downstream from it doing what the client asked for. Reminds me of this meme: https://www.reddit.com/r/funny/comments/105v2h/what_the_cust...

Software engineers don't get nearly enough credit for how difficult their job is.

amw-zero 12/17/2025||

How do you or the client know that the software is doing what they want?

mrkeen 12/17/2025|||

What formal verification system did they use? Did they even execute it?

marcosdumay 12/16/2025|||

There are many really important properties to enforce even on the most basic CRUD system. You can easily say things like "an anonymous user must never edit any data, except for the create account form", or "every user authorized to see a page must be listed on the admin page that lists what users can see a page".

People don't verify those because it's hard, not for lack of value.
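
As a rough sketch of how small such a property can be, here is the anonymous-edit rule written as a check over a declared route table in Python. The `Route` type, the paths, and the allowlist are all hypothetical, and this only checks the declarations; it proves nothing about what the handlers actually do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    path: str
    method: str
    requires_login: bool


# Assumption for illustration: only the create-account form may be written anonymously.
ANONYMOUS_WRITE_ALLOWLIST = {"/signup"}
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def anonymous_write_violations(routes):
    """Return routes that let an anonymous user edit data outside the allowlist."""
    return [
        r for r in routes
        if r.method in WRITE_METHODS
        and not r.requires_login
        and r.path not in ANONYMOUS_WRITE_ALLOWLIST
    ]


routes = [
    Route("/signup", "POST", requires_login=False),
    Route("/articles", "GET", requires_login=False),
    Route("/articles", "POST", requires_login=True),
]
assert anonymous_write_violations(routes) == []
```

Adding an anonymous `POST /feedback` route makes the check fail until the allowlist is updated, so each exception to the rule has to be written down in one place.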

nextos 12/16/2025|||

Yes, in fact there is research on type systems to ensure information flow control, avoiding unauthorized data access by construction.

Concrete Semantics [1] has a little example in §9.2.

[1] http://concrete-semantics.org/concrete-semantics.pdf

bkettle 12/16/2025||||

Yeah fair enough. I can definitely see the value of property-based verification like this and agree that useful properties could be easy to express and that LLMs could feasibly verify them. I think full verification that an implementation implements an entire spec and nothing else seems much less practical even with AI, but of course that is just one flavor of verification.

Maxion 12/17/2025||||

Even

> "an anonymous user must never edit any data, except for the create account form"

Can quickly end up being

> "an anonymous user must never edit any data, except for the create account form, and the feedback form"

And a week later go to

> "an anonymous user must never edit any data, except for the create account form, the feedback form, and the error submission form if they end up with a specific type of error"

And then during christmas

> "an anonymous user must never edit any data, except for the create account form, the feedback form, and the error submission form if they end up with a specific type of error, and the order submission form if they visit it from this magic link. Those visiting from the magic link, should not be able to use the feedback form (marge had a bad experience last christmas going through feedbacks from the promotional campaign)"

marcosdumay 12/17/2025||

It is still a small rule, with plenty of value. It's nowhere near the size of the access control for the entire site. And it's also not written down by construction.

It changing with time doesn't make any of that change.

rmah 12/17/2025|||

Yes, except their cookie preferences to comply with european law. Oh, and they should be able to change their theme from light/dark but only that. Oh and maybe this other thing. Except in situations where it would conflict with current sales promotions. Unless they're referred by a reseller partner. Unless it's during a demo, of course. etc, etc, etc.

This is the sort of reality that a lot of developers in the business world deals with.

amw-zero 12/17/2025|||

Compare the spec with the application here: https://concerningquality.com/model-based-testing/

I think we've become used to the complexity in typical web applications, but there's a difference between familiar and simple (simple vs. easy, as it were). The behavior of most business software can be very simply expressed using simple data structures (sets, lists, maps) and simple logic.

No matter how much we simplify it, via frameworks and libraries or whatever have you, things like serialization, persistence, asynchrony, concurrency, and performance end up complicating the implementation. Comparing this against a simpler spec is quite nice in practice - and a huge benefit is now you can consult a simple in-memory spec vs. worrying about distributed system deployments.

giancarlostoro 12/17/2025|||

> especially when we currently use all kinds of frameworks and libraries that already abstract away implementation details.

This is my issue with algorithm driven interviewing. Even the creator of Homebrew got denied by Google because he couldn't do some binary sort or whatever it even was. He made a tool used by millions of developers, but apparently that's not good enough.

StilesCrisis 12/17/2025||

Google denies qualified people all the time. They would much rather reject a great hire than take a risk on accepting a mediocre one. I feel for him but it's just the nature of the beast. Not everyone will get in.

beautiful_zhixu 12/17/2025|||

This language sounds like chauvinism leading to closed-mindedness and efficiency. Of course there are tradeoffs to chauvinism, as Googlers possess the mind to notice. But a Googler does not need to worry about saying ambiguous truths without understanding their emotions to the masses, for they have Google behind them. With the might of the G stick, they can hammer out words with confidence.

ogogmad 12/17/2025|||

I've heard this before. Why do you think algorithm questions are effective for finding "good" hires? Are they?

9rx 12/17/2025|||

The intent isn't to find good hires per se, but to whittle down the list of applicants to a manageable number in a way that doesn't invite discrimination lawsuits.

Same as why companies in the past used to reject anyone without a degree. But then everyone got a degree, leaving it to no longer be an effective filter, hence things like algorithm tests showing up to fill the void.

Once you've narrowed the list, then you can worry about figuring out who is "good" through giving the remaining individuals additional attention.

giancarlostoro 12/17/2025||

> Same as why companies in the past used to reject anyone without a degree.

They still do, and it's a shame; some of the smartest, most capable developers I know have no degree.

giancarlostoro 12/17/2025|||

They certainly don't filter out toxic people who make others leave companies because they poison the well.

ogogmad 12/17/2025||

I have a suspicion that "good candidate" is being gerrymandered. What might have been "good" in 1990 might have become irrelevant in 2000+ or perhaps detrimental. I say that as someone who is actually good at algorithm questions himself. I think GP, as well as other Google defenders, are parroting pseudo-science.

giancarlostoro 12/17/2025||

I agree. But also if it works to get you jobs there, why wouldn't you defend it? I mean I might be inclined to do so as well, it guarantees me a place even if I lack soft skills for the role.

UltraSane 12/17/2025||

AWS has said that having formal verification of code lets them be more aggressive in optimization while being confident it still adheres to the spec. They claim they were able to double the speed of IAM API auth code this way.

simonw 12/16/2025||

I'm convinced now that the key to getting useful results out of coding agents (Claude Code, Codex CLI etc) is having good mechanisms in place to help those agents exercise and validate the code they are writing.

At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

The next step up from that is a good automated test suite.

Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.

Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.

I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.

If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.

roadside_picnic 12/16/2025||

I very much agree, and believe using languages with powerful type systems could be a big step in this direction. Most people's first experience with Haskell is "wow this is hard to write a program in, but when I do get it to compile, it works". If this works for human developers, it should also work for LLMs (especially if the human doesn't have to worry about the 'hard to write a program' part).

> The next step up from that is a good automated test suite.

And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
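
A minimal sketch of what "a small number of tests covering a wide surface" can look like, written in Python with only the standard library. The run-length encoder and the helper names are assumptions chosen for illustration; a real project would more likely use a dedicated library such as Hypothesis (or QuickCheck in Haskell), which also shrinks failing inputs.

```python
import random


def rle_encode(text):
    """Run-length encode a string into (char, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def rle_decode(pairs):
    """Inverse of rle_encode."""
    return "".join(ch * count for ch, count in pairs)


def check_properties(trials=500, seed=0):
    """Hand-rolled property test: random inputs, two properties."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 20)
        text = "".join(rng.choice("ab") for _ in range(length))
        encoded = rle_encode(text)
        # Property 1: decoding an encoding returns the original input.
        assert rle_decode(encoded) == text
        # Property 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(encoded, encoded[1:]))
    return True
```

The human writes the two properties; whoever (or whatever) writes `rle_encode` then has hundreds of generated cases to satisfy rather than a handful of hand-picked examples.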

The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.

Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of even more powerful type systems than Haskell's. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.

nextos 12/16/2025|||

That's my opinion as well. Some functional language, that can also offer access to imperative features when needed, plus an expressive type system might be the future.

My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.

In fact, there are already serious industrial efforts to generate Dafny using LLMs.

Besides, some of the largest verification efforts have been achieved with this language [1].

[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf

astrostl 12/17/2025||||

This is why I use Go as much as reasonably possible with vibe coding: types, plus great quality-checking ecosystem, plus adequate training data, plus great distribution story. Even when something has stuff like JS and Python SDKs, I tend to skip them and go straight to the API with Go.

apitman 12/17/2025|||

Also a fast compiler which lets the agent iterate more times.

antonvs 12/18/2025|||

Go has types? I didn’t notice.

justatdotin 12/17/2025||||

I love ML types, but I've concluded they serve humans more than they do agents. I'm sure it affects the agent, maybe just not as much as other choices.

I've noticed real advantages of functional languages to agents, for disposable code. Which is great, cos we can leverage those without dictating the human's experience.

I think the correct way forward is to choose whatever language the humans on your team agree is most useful. For my personal projects, that means a beautiful language for the bits I'll be touching, and whatever gets the job done elsewhere.

bcrosby95 12/16/2025|||

Ada when?

It even lets you separate implementation from specification.

jaggederest 12/17/2025||

Even going beyond Ada into dependently typed languages like (quoth wiki) "Agda, ATS, Rocq (previously known as Coq), F*, Epigram, Idris, and Lean"

I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.

ManuelKiessling 12/16/2025|||

Isn‘t it funny how that’s exactly the kind of stuff that helps a human developer be successful and productive, too?

Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers „Well, here’s the codebase, that’s all I‘ll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!“

Wowfunhappy 12/16/2025||

> Isn‘t it funny how that’s exactly the kind of stuff that helps a human developer be successful and productive, too?

I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.

victorbjorklund 12/16/2025||

AI can do that too? If you have a web app it can use playwright to test functionality and take screenshots to see if it looks right.

Wowfunhappy 12/16/2025|||

Yeah, but it doesn't work nearly as well. The AI frequently misinterprets what it sees. And it isn't as good at actually using the website (or app, or piece of hardware, etc) as a human would.

UncleEntity 12/17/2025||

I've been using Claude to implement an ISO specification and I have to keep telling it we're not interested in whether the repl is correct, only in whether the test suite ensures the implementation correctly follows the spec. But when we're tracking down why a test is failing, it'll go to town using the repl to narrow down which code path is causing the issue. The only reason there even is a repl at this point is so it can do its 'spray and pray' debugging outside the code; Claude constantly tried to use one to debug issues, so I gave in and had it write a pretty basic one.

Horses for courses, I suppose. Back in the day, when I wanted to play with some C(++) library, I'd quite often write a Python C-API extension so I could do the same thing using Python's repl.

Capricorn2481 12/17/2025|||

But then the AI would theoretically have to write the playwright code. How does it verify it's getting the right page to begin with?

simonw 12/17/2025||

The recent models are pretty great at this. They read the source code for e.g. a Python web application and use that to derive what the URLs should be. Then they fire up a localhost development server and write Playwright scripts to interact with those pages at the predicted URLs.

The vision models (Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) can even take screenshots via Playwright and then "look at them" with their vision capabilities.

It's a lot of fun to watch. You can tell them to run Playwright not in headless mode at which point a Chrome window will pop up on your computer and you can see them interact with the site via it.

mccoyb 12/16/2025|||

Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long running harnesses post.

I don’t go into Claude without everything already setup. Codex helps me curate the plan, and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue - implements it, and then Codex and Gemini review independently.

I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.

This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.

I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.

idiotsecant 12/16/2025||

How expensive is incredibly expensive?

mccoyb 12/16/2025||

If you're maxing out the plans across the platforms, that's 600 bucks -- but if you think about your usage and optimize, I'm guessing somewhere between 200-600 dollars per month.

jazzyjackson 12/16/2025||

It's pretty easy to hit a couple hundred dollars a day filling up Opus's context window with files. This is via Anthropic API and Zed.

Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.

I can't bear to go in circles with Sonnet when Opus can just one shot it.

fragmede 12/17/2025|||

The $200/month Max plan has limits, but making a couple of those seems way cheaper than $50/hr for the ~172 hrs in a month.

mkagenius 12/17/2025|||

Anthropic via Azure has sent me an invoice of around $8000 for 3-5 days of Opus 4.1 usage and there is no way to track how many tokens were used during those days, how many were cached, etc. (And I thought it's part of the Azure sponsorship, but that's another story.)

rmah 12/17/2025|||

I think the main limitation is not code validation but assumption verification. When you ask an LLM to write some code based on a few descriptive lines of text, it is, by necessity, making a ton of assumptions. Oddly, none of the LLM's I've seen ask for clarification when multiple assumptions might all be likely. Moreover, from the behavior I've seen, they don't really backtrack to select a new assumption based on further input (I might be wrong here, it's just a feeling).

What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AI's can't read your mind (yet), its assumptions will probably not precisely match your assumptions unless the task is very limited in scope.

crazygringo 12/17/2025||

> Oddly, none of the LLM's I've seen ask for clarification when multiple assumptions might all be likely.

It's not odd, they've just been trained to give helpful answers straight away.

If you tell them not to make assumptions and to rather first ask you all their questions together with the assumptions they would make because you want to confirm before they write the code, they'll do that too. I do that all the time, and I'll get a list of like 12 things to confirm/change.

That's the great thing about LLM's -- if you want them to change their behavior, all you need to do is ask.

Davidzheng 12/16/2025|||

OK but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can be used as a training signal to improve the agents themselves. So this means the current capabilities levels are certainly not going to remain for very long (which is also what I expect but I would like to point out it's also supported by this)

hamiecod 12/17/2025||

That's a strong RL technique that could equal the quality of RLHF.

oxag3n 12/16/2025|||

Where they'd get training data?

Source code generation is possible due to large training set and effort put into reinforcing better outcomes.

I suspect debugging is not that straightforward to LLM'ize.

It's a non-sequential interaction: when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or logs, and would have to associate it with another abstraction.

I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried chatgpt, and it was hopeless - try this, add this print etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where did I learn that specific combination of techniques, each individually well documented, but never in combination - it was learning from other people and my own experience.

jimmaswell 12/16/2025|||

How long ago was this? I've had outstandingly impressive results asking Copilot Chat with Sonnet 4.5 or ChatGPT to debug difficult multithreaded C++.

oxag3n 12/16/2025||

Few months back with ChatGPT 5. Multi-threaded Rust & custom async runtime, data integrity bug, reproduced every ~5th run.

simonw 12/16/2025||||

Have you tried running gdb from a Claude Code or Codex CLI session?

oxag3n 12/16/2025|||

No, I'm in academia and the goal is not code or product launch. I find the research process suffers a lot once someone else solves a problem instead of you.

I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.

planckscnst 12/17/2025|||

For interactive programs like this, I use tmux and mention "send-keys" and "capture-pane" and it's able to use it to drive an interactive program. My demo/poc for this is making the agent play 20 questions with another agent via tmux
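
A small sketch of the tmux approach, in Python. `send-keys` and `capture-pane` are real tmux subcommands; the helper names and the session name `dbg` are assumptions for illustration, and the helpers only build the command lines so they can be inspected without tmux installed.

```python
import subprocess


def tmux_send_keys(session, text):
    """Build the tmux command that types `text` into `session` and presses Enter."""
    return ["tmux", "send-keys", "-t", session, text, "Enter"]


def tmux_capture_pane(session):
    """Build the tmux command that prints the pane's visible contents to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def run(cmd):
    """Run a command and return its stdout (needs tmux on PATH for tmux commands)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Example, assuming a detached session started beforehand with something like
#   tmux new-session -d -s dbg "gdb ./app"     (./app is a placeholder)
#
#   run(tmux_send_keys("dbg", "break main"))
#   print(run(tmux_capture_pane("dbg")))
```

An agent that can call these two commands gets a plain-text view of an otherwise interactive program, which is usually all it needs to drive a debugger or a REPL.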

fragmede 12/16/2025||||

> Where they'd get training data?

They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.

RA_Fisher 12/17/2025||||

LLMs are okay at bisecting programs and identifying bugs in my experience. Sometimes they require guidance but often enough I can describe the symptom and they identify the code causing the issue (and recommend a fix). They’re fairly methodical, and often ask me to run diagnostic code (or do it themselves).

anon-3988 12/16/2025||||

> I suspect debugging is not that straightforward to LLM'ize.

Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.

christophilus 12/16/2025|||

Debugging has been excellent for me with Opus 4.5 and Claude Code.

QuercusMax 12/16/2025|||

I've only done a tiny bit of agent-assisted coding, but without the ability to run tests the AI will really go off the rails super quick, and it's kinda hilarious to watch it say "Aha! I know what the problem is!" over and over as it tries different flavors until it gives up.

CobrastanJorji 12/16/2025|||

I might go further and suggest that the key to getting useful results out of HUMAN coding agents is also to have good mechanisms in place to help them exercise and validate the code.

We valued automated tests and linters and fuzzers and documentation before AI, and that's because they serve the same purpose.

rodphil 12/17/2025|||

> At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus when building a particular service I already have all of the downstream services (the dependencies) built and running so a lot of dependency management and integration challenges disappear. Only trying this out at a small scale as yet, but it's fascinating how the LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.

(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)

My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.

jcranmer 12/17/2025||

> My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.

There is still a cost to having bugs, even if deploying fixes becomes much cheaper. Especially if your plan is to wait until they actually occur in practice to discover that you have a bug in the first place.

Put differently: would you want the app responsible for your payroll to be developed in this manner? Especially considering that the bug in question would be "oops, you didn't get paid."

simianwords 12/17/2025|||

Claude code and other AI coding tools must have a * mandatory * hook for verification.

For front end, the verification is to make sure that the UI looks as expected (can be verified by an image model) and that clicking certain buttons results in certain things (can be verified by ChatGPT agent, but it's not public I guess).

For back end it will involve firing API requests one by one and verifying the results.
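
A minimal sketch of that kind of backend check, using only the Python standard library. The `/health` endpoint, `HealthHandler`, and `verify_health` are hypothetical and exist only for this example; in practice the service under test would be the real one, started however the company normally starts it.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical service under test: GET /health returns {"status": "ok"}."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the check's output quiet


def verify_health():
    """Start the service on a free port, fire one request, return what came back."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/health")
        response = conn.getresponse()
        result = (response.status, json.loads(response.read()))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

The useful part for an agent is that `verify_health()` returns a concrete status and payload it can compare against the expected values, instead of inferring correctness from reading the code.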

To make this easier, we need to somehow give an environment for claude or whatever agent to run these verifications on and this is the gap that is missing. Claude Code, Codex should now start shipping verification environments that make it easy for them to verify frontend and backend tasks and I think antigravity already helps a bit here.

------

The thing about backend verification is that it is different in different companies and requires a custom implementation that can't easily be shared across companies. Each company has its own way to deploy stuff.

Imagine a concrete task like creating a new service that reads from a data stream, runs transformations, puts it in another data stream where another new service consumes the transformed data and puts it into an AWS database like Aurora.

```
stream -> service (transforms) -> stream -> service -> Aurora
```

To one shot this with claude code, it must know everything about the company

- how does one consume streams in the company? Schema registry?

- how does one create a new service and register dependencies? how does one deploy it to test environment and production?

- how does one even create an Aurora DB? request approvals and IAM roles etc?

My question is: what would it take for Claude Code to one shot this? At the code level it is not too hard and it can fit in context window easily but the * main * problem is the fragmented processes in creating the infra and operations behind it which is human based now (and need not be!).

-----

My prediction is that companies will make something like a new "agent" environment where all these processes (that used to require a human) can be done by an agent without human intervention.

I'm thinking of other solutions here, but if anyone can figure it out, please tell!

pron 12/16/2025|||

Maybe in the short term, but that doesn't solve some fundamental problems. Consider NP problems: problems whose solutions can be easily verified. But that they can all be easily verified does not (as far as we know) mean they can all be easily solved. If we look at the P subset of NP, the problems that can be easily solved, then the easy verification is no longer their key feature. Rather, it is something else that distinguishes them from the harder problems in NP.
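
To make the verify-versus-solve gap concrete, here is a small Python sketch using subset sum, a standard NP-complete problem. The function names are made up for this example. Checking a proposed answer takes time linear in its size, while the solver shown is plain brute force over all subsets (no polynomial-time algorithm is known).

```python
from itertools import combinations


def verify_subset_sum(numbers, target, indices):
    """Check a proposed certificate: distinct, in-range indices summing to target."""
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )


def solve_subset_sum(numbers, target):
    """Brute-force search: may try all 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify_subset_sum(numbers, target, indices):
                return list(indices)
    return None
```

Having `verify_subset_sum` available does not make `solve_subset_sum` any cheaper, which is the analogy being drawn: a good test suite tells an agent when it is done, not how to get there.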

So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.

Longer term, I think it's more important to understand what's hard and what's easy for agents.

jijijijij 12/16/2025|||

> At the most basic level this means making sure they can run commands to execute the code

Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...

zahlman 12/16/2025|||

One objection: all the "don't use --yolo" training in the world is useless if a sufficiently context-poisoned LLM starts putting malware in the codebase and getting to run it under the guise of "unit tests".

planckscnst 12/17/2025||

For now, this is mitigated by only including trusted content in the context; for instance, absolutely do not allow it to access general web content.

I suspect that as it becomes more economical to play with training your own models, people will get better at including obscured malicious content in data that will be used during training, which could cause the LLM to intrinsically carry a trigger/path that would cause malicious content to be output by the LLM under certain conditions.

And of course we have to worry about malicious content being added to sources that we trust, but that already exists - we as an industry typically pull in public repositories without a complete review of what we're pulling. We outsource the verification to the owners of the repository. Just as we currently have cases of malicious code sneaking into common libraries, we'll have malicious content targeted at LLMs

thomasfromcdnjs 12/17/2025|||

shameless plug: I'm working on an open source project https://blocksai.dev/ to attempt to solve this. (and just added a note for me to add formal verification)

Elevator pitch: "Blocks is a semantic linter for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift. Update the code or update the spec, up to human or agent."

(you can add traditional linters to the process if you want but not necessary)

The gist being you define a bunch of validators for a collection of modules you're building (with agentic coding) with a focus on qualifying semantic things;

- domain / business rules/measures

- branding

- data flow invariants — "user data never touches analytics without anonymization"

- accessibility

- anything you can think of

Then you just tell your agentic coder to use the cli tool before committing, so it keeps the code in line with your engineering/business/philosophical values.

(boring) example of it detecting if blog posts have humour in them, running in Claude Code -> https://imgur.com/diKDZ8W

baq 12/17/2025|||

Reminder YAML is a serialization format. IaC standardizing on it (hashicorp being an outlier) was a mistake. It’s a good compilation target, but please add a higher level language for whatever you’re doing.

akrauss 12/17/2025|||

Quick feedback: both the „learn more“ link at the very top and the „Explore all examples“ link lead to 404

thomasfromcdnjs 12/17/2025||

Thanks will fix that up shortly.

apitman 12/17/2025|||

That's bad news for C++, Rust, and other slow compilers.

WhyOhWhyQ 12/16/2025|||

I've tried getting claude to set up testing frameworks, but what ends up happening is it either creates canned tests, or it forgets about tests, or it outright lies about making tests. It's definitely helpful, but feels very far from a robust thing to rely on. If you're reviewing everything the AI does then it will probably work though.

simonw 12/16/2025|||

Something I find helps a lot is having a template for creating a project that includes at least one passing test. That way the agent can run the tests at the start using the correct test harness and then add new tests as it goes along.

I use cookiecutter for this, here's my latest Python library template: https://github.com/simonw/python-lib

planckscnst 12/17/2025|||

LLMs are very good at looking at a change set and finding untested paths. As a standard part of my workflow, I always pass the LLM's work through a "reviewer", which is a fresh LLM session with instructions to review the uncommitted changes. I include instructions for reviewing test coverage.

I've also found that LLMs typically just partially implement a given task/story/spec/whatever. The reviewer stage will also notice a mismatch between the spec and the implementation.

I have an orchestrator bounce the flow back and forth between developing and reviewing until the review comes back clean, and only then do I bother to review its work. It saves so much time and frustration.
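
A bare-bones sketch of such a develop/review loop in Python. This is not the commenter's tooling; `orchestrate`, `develop`, and `review` are hypothetical names, with the two callables standing in for separate LLM sessions.

```python
def orchestrate(develop, review, task, max_rounds=5):
    """Alternate develop/review until the review is clean or rounds run out.

    `develop(task, feedback)` returns a change; `review(task, change)` returns
    a list of findings, where an empty list means the review came back clean.
    """
    feedback = []
    for round_number in range(1, max_rounds + 1):
        change = develop(task, feedback)
        feedback = review(task, change)
        if not feedback:
            return change, round_number
    raise RuntimeError(f"review still not clean after {max_rounds} rounds")
```

The `max_rounds` cap is a design choice added here so that two sessions that never converge fail loudly instead of burning tokens indefinitely.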

akrauss 12/17/2025||

What tooling are you using for the orchestration?

ramoz 12/16/2025|||

Claude Code hooks is a great way to integrate these things

htrp 12/16/2025|||

Better question is which tools at what level

dionian 12/16/2025|||

you've done some great articles on this topic and my experience aligns with your view completely.

agumonkey 12/16/2025|||

gemini and claude do that already IIUC, self debugging iterations

formerly_proven 12/16/2025||

Not so sure about formal verification though. IME with Rust, LLM agents tend to struggle with semi-complex ownership or trait issues and will typically reach for unnecessary/dangerous escape hatches ("unsafe impl Send for ..." instead of using the correct locks, for example) fairly quickly. Or just conclude the task is impossible.

> automatic code formatters

I haven't tried this because I assumed it'll destroy agent productivity and massively increase the number of tokens needed, because you're changing the file out from under the LLM and it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with zero-diff under autoformatting?

But in general of course anything that's helpful for human developers to be more productive will also help LLMs be more productive. For largely identical reasons.

planckscnst 12/17/2025|||

I've directly faced this problem with automatic code formatters, but it was back around Claude 3.5 and 3.7. It would consistently write nonconforming code - regardless of having context demanding proper formatting. This caused both extra turns/invocations with the LLM and would cause context issues - both filling the context with multiple variants of the file and also having a confounding/polluting/poisoning effect due to having these multiple variations.

I haven't had this problem in a while, but I expect current LLMs would probably handle those formatting instructions more closely than the 3.5 era.

simonw 12/16/2025|||

I'm finding my agents generate code that conforms to Black quite effectively, I think it's probably because I usually start them in existing projects that were already formatted using Black so they pick up those patterns.

formerly_proven 12/17/2025||

I still quite often have even Opus 4.5 generate empty indented lines, for example, despite explicit instructions in AGENTS.md not to (plus an explicit reference to the style guide), the code not containing any beforehand, and the auto-formatter removing them. Trailing whitespace is much rarer but happens as well. Personally I don't care too much, since I've found LLMs to be most efficient when performing roughly the work of a handful of commits at most in one thread, so I let the pre-commit hook sort it out after being done with a thread.
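
For reference, the cleanup involved is small. Below is a sketch of a helper that a pre-commit hook could call; `strip_trailing_whitespace` is a hypothetical name, not the hook mentioned above, and real formatters do considerably more.

```python
def strip_trailing_whitespace(source):
    """Remove trailing spaces/tabs from every line, including indented blank lines.

    Line endings are normalised to "\n"; a final newline is kept if present.
    """
    had_final_newline = source.endswith("\n")
    lines = [line.rstrip(" \t") for line in source.splitlines()]
    cleaned = "\n".join(lines)
    return cleaned + "\n" if had_final_newline else cleaned
```

Running this once at commit time, rather than after every edit, avoids changing the file out from under the agent mid-thread.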

alexgotoi 12/17/2025||

The funny part of “AI will make formal verification go mainstream” is that it skips over the one step the industry still refuses to do: decide what the software is supposed to do in the first place.

We already have a ton of orgs that can’t keep a test suite green or write an honest invariant in a code comment, but somehow we’re going to get them to agree on a precise spec in TLA+/Dafny/Lean and treat it as a blocking artifact? That’s not an AI problem, that’s a culture and incentives problem.

Where AI + “formal stuff” probably does go mainstream is at the boring edges: property-based tests, contracts, refinement types, static analyzers that feel like linters instead of capital‑P “Formal Methods initiatives”. Make it look like another checkbox in CI and devs will adopt it; call it “verification” and half the org immediately files it under “research project we don’t have time for”.
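
As one example of the "boring edge", a contract can be as small as a decorator that checks a precondition and a postcondition at runtime. This is a minimal Python sketch; `contract` and `largest` are names invented for illustration, and unlike Dafny or refinement types nothing here is proven statically.

```python
import functools


def contract(pre=None, post=None):
    """Attach a runtime-checked precondition and postcondition to a function."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ValueError(f"precondition of {func.__name__} violated")
            result = func(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError(f"postcondition of {func.__name__} violated")
            return result

        return wrapper

    return decorate


@contract(
    pre=lambda items: len(items) > 0,
    post=lambda result, items: result in items and all(result >= x for x in items),
)
def largest(items):
    """Return the largest element of a non-empty list."""
    best = items[0]
    for x in items[1:]:
        if x > best:
            best = x
    return best
```

The two lambdas are effectively a tiny spec for `largest`, written in the same language as the code, which is about the level of formality most teams will accept as "just another check in CI".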

Will include this thread in my https://hackernewsai.com/ newsletter.

TimTheTinker 12/17/2025||

> it skips over the one step the industry still refuses to do: decide what the software is supposed to do in the first place.

Not only that, but it's been well-established that a significant challenge with formally verified software is to create the right spec -- i.e. one that actually satisfies the intended requirements. A formally verified program can still have bugs, because the spec (which requires specialized skills to read and understand) may not satisfy the intent of the requirements in some way.

So the fundamental issue/bottleneck that emerges is the requirements <=> spec gap, which closing the spec <=> executable gap does nothing to address. Translating people's needs to an empirical, maintainable spec of one type or another will always require skilled humans in the loop, regardless of how easy everything else gets -- at minimum as a responsibility sink, but even more as a skilled technical communicator. I don't think we realize how valuable it is to PMs/executives and especially customers to be understood by a skilled, trustworthy technical person.

suspended_state 12/17/2025|||

> A formally verified program can still have bugs, because the spec (which requires specialized skills to read and understand) may not satisfy the intent of the requirements in some way.

That's not a bug, that's a misunderstanding, or at least an error of translation from natural language to formal language.

Edit:

I agree that one can categorize incorrect program behavior as a bug (apparently there's such a thing as "behavioral bug"), but to me it seems to be a misnomer.

I also agree that it's difficult to tell that to a customer when their expectations aren't met.

gls2ro 12/17/2025|||

In some definitions (which I happen to agree with, but which are not much present in public discourse because we wanted to save money by first not properly training testers and then getting rid of them), the purpose of testing (or, better said, quality control) is:

1) Verify requirements => this can be done with formal verifications

2) Validate fit for purpose => this is where we check that the software does what the customer actually needs: if the customer needs addition, it does not matter that our software does subtraction very well and has a valid proof that it does so according to the specs.

I know this second part is kinda lost in the transition from "oh my god, waterfall is bad" to "yay, now we can fire all the testers because quality is the responsibility of the entire team."

AlienRobot 12/18/2025|||

>an error of translation from natural language to formal language

Really? Programming languages are all formal languages, which means all human-made errors in algorithms wouldn't be "bugs" anymore. Some projects even categorize typos as bugs, so that's an unusually strict definition of "bug" in my opinion.

suspended_state 12/20/2025||

Sure, I guess you can understand what I said that way, but that's not what I meant. I wasn't thinking about the implementation, but the specifications.

Read again the quote I was referring to if you need better context to understand my comment.

If you have good formal specifications, you should be able to produce the corresponding code. Any error in that phase should be considered a bug, and yes, a typo should fit that category, if it makes the code deviate from the specs.

But an error in the step of translating the requirements (usually explained in natural language) to specifications (usually described formally) isn't a bug, it's a translation error.

jandrese 12/17/2025|||

The danger of this is people start asking about formally verified specs, and down that road lies madness.

"If you can formally verify the spec the code can be auto-generated from it."

TimTheTinker 12/17/2025||

Most formal "specs" (the part that defines the system's actual behavior) are just code. So a formally verified (or compiled) spec is really just a different programming language, or something layered on top of existing code. Like TypeScript types are a non-formal but empirical verification layer on top of JavaScript.

The hard part remains: translating from human-communicated requirements to a maintainable spec (formally verified or not) that completely defines the module's behavior.

strbean 12/17/2025|||

> decide what the software is supposed to do in the first place.

That's where the job security is (and always has been). This has been my answer to "are you afraid for your job because of AI?"

Writing the code is very rarely the hard part. The hard part is getting a spec from the PM, or gathering requirements from stakeholders. And then telling them why the spec / their requirements don't make sense or aren't feasible, and figuring out ones that will actually achieve their goals.

svat 12/17/2025|||

There are some basic invariants like "this program should not crash on any input" or "this service should be able to handle requests that look like X up to N per second" — though I expect those will be the last to be amenable to formal verification, they are also very simple ones that (when they become possible) will be easy to write down.

yencabulator 12/18/2025||

> "this program should not crash on any input" [...] though I expect those will be the last to be amenable to formal verification,

In the world of Rust, this is actually the easiest to achieve level of formal proofs.

Simple lints can eliminate panics and potentially-panicking operations (forcing you/LLM to use variants with runtime error handling, e.g. `s[i]` can become `s.get(i).ok_or(MyError::RuhRoh)?`, or more purpose-specific handling; same thing for e.g. enforcing that arithmetic never underflows/overflows).

Kani symbolically evaluates simple Rust functions and ensures that the function does not panic on any possible value of its input, and on top of that you can add invariants to be enforced (e.g. search for an item in an array always returns either None or a valid index, and the value at that index fulfills the search criteria).

(The real challenge with e.g. Kani is structuring a codebase such that it has those simple-enough subparts where formal methods are feasible.)
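To make the search invariant above concrete, here is a minimal sketch in Python rather than Rust. It is not Kani: Kani explores inputs symbolically, whereas this sketch simply brute-forces every list up to a small bound. The function names `find` and `check_find_exhaustively` are hypothetical, chosen only for illustration.

```python
from itertools import product

def find(items, target):
    """Return the index of the first occurrence of target, or None."""
    for i, value in enumerate(items):
        if value == target:
            return i
    return None

def check_find_exhaustively(max_len=4, values=(0, 1, 2)):
    """Check find() on every list up to max_len over a small value domain."""
    cases = 0
    for n in range(max_len + 1):
        for items in product(values, repeat=n):
            for target in values:
                result = find(list(items), target)
                if result is None:
                    # None is only acceptable when the target is truly absent.
                    assert target not in items
                else:
                    # Otherwise the index must be valid and point at a match.
                    assert 0 <= result < len(items)
                    assert items[result] == target
                cases += 1
    return cases

print(check_find_exhaustively(), "cases checked")
```

A bounded check like this only covers the inputs it enumerates; the appeal of a symbolic tool is that the same invariant is established for all inputs within the tool's bounds without listing them.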

Verdex 12/17/2025|||

Yeah, the hyper majority of the history of "getting things done" has been: find some guy who can translate "make the crops grow" into a pile of food.

The people who care about the precise details have always been relegated to a tiny minority, even in our modern technological world.

anovick 12/17/2025|||

OP doesn't seem broadly applicable to corporate software development.

Rather, it's directed at the kind of niche, mission-critical things that don't always get the formal verification they need, or for which it isn't even considered because of the high cost of specialized skills.

I read OP as a realization that the costs have fallen, and thus we should see formal verification more than before.

fulafel 12/18/2025|||

This is the article's message as well:

"That doesn’t mean software will suddenly be bug-free. As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress."

General security properties come to mind as one area that could have good reusability for specs.

fsloth 12/17/2025||

"decide what the software is supposed to do in the first place."

After 20 years of software development I think that is because most of the software out there is itself the method of finding out what it's supposed to do.

The incomplete specs are not lacking feature requirements due to lack of discipline. It's because nobody can even know without trying it out what the software should be.

I mean of course there is a subset of all software that can be specified beforehand - but a lot of it is not.

Knuth could be that forward thinking with TeX for example only because he had 500 years of book printing tradition to fall back on to backport the specs to math.

adverbly 12/17/2025||

This smells like a Principia Mathematica take to me...

Reducing the problem to "ya just create a specification to formally verify" doesn't move the needle enough to me.

When it comes to real-world, pragmatic, boots-on-the-ground engineering and design, we are so far from even knowing the right questions to ask. I just don't buy it that we'd see huge mainstream productivity changes even if we had access to a crystal ball.

It's hilarious how close we're getting to The Hitchhiker's Guide to the Galaxy though. We're almost at that phase where we ask what the question is supposed to be.

rramadass 12/17/2025||

Nope; you are quite wrong here. Most people have no idea of what Formal Specification/Verification via the usage of Formal Methods really means.

It is first and foremost about learning a way of thinking. Tools only exist to augment and systematize this thinking into a methodology. There are different levels of "Formal Methods Thinking" starting with informal all the way to completely rigorous. Understanding and using these methods of thinking as the "interface" to specify a problem to an AI agent/LLM is what is important to ensure "correctness by construction to a specification".

Everybody should read this excellent (and accessible) paper On Formal Methods Thinking in Computer Science Education which details the above approach - https://research.tue.nl/en/publications/on-formal-methods-th...

Excerpts:

One may ask What good is FM? Who needs it? Millions of programmers work everyday without it. Many think that FM in a CS curriculum is peddling the idea that Formal Logic (e.g.,propositional or predicate logic) is required for everyday programmers, that they need it to write programs that are more likely to be correct, and correspondingly less likely to fail the tests to which they subsequently (of course) must still be subjected. However, this degree of formality is not necessarily needed. What is required of everyday programmers is that, as they write their programs, they think — and code — in a way that respects a correctness-oriented point of view. Assertions can be written informally, in natural language: just the “thinking of what those assertions might be” guides the program-construction process in an astonishingly effective way. What is also required are the engineering principles referred to above. Connecting programs with their specifications through assertions provides training on abstraction, which, in turn, encourages simplicity and focus, helping build more robust, flexible and usable systems.

The answer to “Who needs it?” is that everyday programmers and software developers indeed may not need to know the theory of FM. But what they do need to know is how to practise it, even if with a light touch, benefiting from its precepts. FM theory, which is what explains — to the more mathematically inclined — why FM works, has become confused with the FM practice of using the theory’s results to benefit from what it assures. Any “everyday programmer” can do that...except that most do not.

The paper posits 3 levels of "Formal Methods Thinking" viz.

a) Level 1 (“What’s True Here”). Level 1 of FM thinking is the application of FM in its most basic form. Students develop abilities to understand their programs and reason about their correctness using informal descriptions. By “What’s True Here”, we mean including natural language prose or informal diagrams to describe the properties that are true at different points of a program’s execution rather than the operations that brought them about.

b) Level 2 (Formal Assertions). Level 2 introduces greater precision to Level 1 by teaching students to write assertions that incorporate arithmetic and logical operators to capture FM thinking more rigorously. This may be accompanied by lightweight tools that can be used to test or check that their assertions hold.

c) Level 3 (Full Verification). This level enables students to prove program properties using tools such as a theorem prover, model checker or SMT solver. But in addition to tool-based checking of properties (now written using a formal language), this level can formally emphasise other aspects of system-level correctness, such as structural induction and termination.
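As a rough illustration of the first two levels (not taken from the paper), here is the same small routine written twice in Python: once with informal "what's true here" comments, and once with the same facts turned into executable assertions. The routine and its names are hypothetical. Level 3 would state the same invariant in a prover or model checker and is not shown.

```python
def divide_level1(n, d):
    """Level 1: informal 'what's true here' notes as comments."""
    # What's true here: n >= 0 and d > 0.
    q, r = 0, n
    # What's true here: n == q * d + r, and r >= 0.
    while r >= d:
        q, r = q + 1, r - d
        # Still true: n == q * d + r, and r >= 0.
    # What's true here: n == q * d + r, and 0 <= r < d.
    return q, r

def divide_level2(n, d):
    """Level 2: the same facts written as checkable assertions."""
    assert n >= 0 and d > 0                    # precondition
    q, r = 0, n
    while r >= d:
        assert n == q * d + r and r >= 0       # loop invariant
        q, r = q + 1, r - d
    assert n == q * d + r and 0 <= r < d       # postcondition
    return q, r

print(divide_level2(17, 5))
```

The code is identical in both versions; only the precision of the claims about it changes, which is the point the excerpt makes about assertions guiding construction.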

amw-zero 12/17/2025||

When you go to write a line of code, how do you decide what to write?

signa11 12/17/2025|||

> When you go to write a line of code, how do you decide what to write?

depends of course, what am i writing for? a feature, a bugfix, a refactor...?

amw-zero 12/17/2025||

Let's say a new feature. Do you just type random letters, or do you have some kind of plan ahead of time?

signa11 12/18/2025||

new feature implies design document to gather the thoughts, followed by an intense review etc.

amw-zero 12/18/2025||

So... a specification.

signa11 12/18/2025||

no! a _design_ document. how this new thing will fit together with other things that already exist in the system. what its interactions are going to look like, what are the assumptions, what are the limitations etc etc.

amw-zero 12/18/2025||

So... a specification.

signa11 12/19/2025||

hang on ...

adverbly 12/17/2025|||

Honestly? I usually look at the previous implementation and try to make some changes to fix an issue that I discovered during testing. Rarely an actual bug - usually we just changed our mind about what the intent should be.

infruset 12/17/2025||

I was waiting for a post like this to hit the front page of Hacker News any day. Ever since Opus 4.5 and GPT 5.2 came out (mere weeks ago), I've been writing tens of thousands of lines of Lean 4 in a software engineering job and I feel like we are on the eve of a revolution. What used to take me 6 months of work when I was doing my PhD in Coq (now Rocq), now takes from a few hours to a few days. Whole programming languages can get formalized executable semantics in little time. Lean 4 already has a gigantic amount of libraries for math but also for computer science; I expect open source projects to sprout with formalizations of every language, protocol, standard, algorithm you can think of.

Even if you have never written formal proofs but are intrigued by them, try asking a coding agent to do some basic verification. You will not regret it.

Formal proof is not just about proving stuff, it's also about disproving stuff, by finding counterexamples. Once you have stated your property, you can let quickcheck/plausible attack it, possibly helped by a suitable generator which does not have to be random: it can be steered by an LLM as well.
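A minimal sketch of that "state a property, then attack it" loop, in plain Python. It does not use the API of QuickCheck or Plausible; the generator here is a deterministic enumeration of small lists (one example of a generator that is not random), and all function names are hypothetical.

```python
from itertools import product

def small_lists(max_len=3, values=(0, 1)):
    """Deterministic generator: every list up to max_len over a tiny domain."""
    for n in range(max_len + 1):
        for items in product(values, repeat=n):
            yield list(items)

def find_counterexample(prop):
    """Return the first (xs, ys) that falsifies prop, or None if none is found."""
    for xs in small_lists():
        for ys in small_lists():
            if not prop(xs, ys):
                return xs, ys
    return None

def wrong_property(xs, ys):
    # Plausible-looking but false: reversal does not distribute in this order.
    return list(reversed(xs + ys)) == list(reversed(xs)) + list(reversed(ys))

def right_property(xs, ys):
    # The true statement: the two halves swap places.
    return list(reversed(xs + ys)) == list(reversed(ys)) + list(reversed(xs))

print("wrong property falsified by:", find_counterexample(wrong_property))
print("right property falsified by:", find_counterexample(right_property))
```

Finding no counterexample is evidence, not proof; the value is that a false statement is usually rejected cheaply, before anyone spends effort trying to prove it.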

Even further, I'm toying with the idea of including LLMs inside the formalization itself. There is an old and rich idea in the domain of formal proof, that of certificates: rather than proving that the algorithm that produces a result is correct, just compute a checkable certificate with untrusted code and verify it is correct. Checkable certificates can be produced by unverified programs, humans, and now LLMs. Properties, invariants, can all be "guessed" without harm by an LLM and would still have to pass a checker. We have truly entered an age of oracles. It's not halting-problem-oracle territory of course, but it sometimes feels pretty close for practical purposes. LLMs are already better at math than most of us and certainly than me, and so any problem I could plausibly solve on my own, they will do faster without my having to wonder if there is a subtle bug in the proof. I still need to look at the definitions and statements, of course, but my role has changed from finding to checking. Exploring the space of possible solutions is now mostly done better and faster by LLMs. And you can run as many in parallel as you can keep up with, in attention and in time (and money).
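The certificate idea can be sketched with a textbook example, assumed here purely for illustration: an untrusted routine claims a greatest common divisor together with Bézout coefficients, and a tiny checker accepts the claim only if the certificate proves it. The producer below is ordinary extended Euclid standing in for any untrusted source; both function names are hypothetical.

```python
def untrusted_extended_gcd(a, b):
    """Stand-in for any untrusted producer (a solver, a human, an LLM)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y  # claimed (g, x, y)

def check_gcd_certificate(a, b, g, x, y):
    """Accept only if (g, x, y) proves that g == gcd(a, b), for a, b > 0."""
    if a <= 0 or b <= 0 or g <= 0:
        return False
    divides_both = a % g == 0 and b % g == 0   # g is a common divisor
    bezout_holds = a * x + b * y == g          # so every common divisor divides g
    return divides_both and bezout_holds

g, x, y = untrusted_extended_gcd(240, 46)
print("claimed:", (g, x, y), "accepted:", check_gcd_certificate(240, 46, g, x, y))
print("forged claim accepted:", check_gcd_certificate(240, 46, 4, x, y))
```

Only the few lines of the checker need to be trusted (or verified); how the producer arrived at its answer does not matter.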

If anyone else is as excited about all this as I am, feel free to reach out in comments, I'd love to hear about people's projects!

baq 12/17/2025||

People are sleeping on the new models being capable of this, 100%. Been telling Opus to make Alloy specs recently and it… just does. Ensuring conformance is rapidly becoming affordable, folks in this thread need to update their priors!

dbdr 12/17/2025|||

Do you now use Lean instead of Rocq because your new employer happened to prefer that, or is it superior in your opinion? Which one would you recommend to look at first?

nomadygnt 12/17/2025|||

Where do you work that you get to write Lean? That sounds awesome!

infruset 12/17/2025||

I can't disclose that, but what I can say is no one at my company writes Lean yet. I'm basically experimenting with formalizing in Lean stuff I normally do in other languages, and getting results exciting enough I hope to trigger adoption internally. But this is bigger than any single company!

jrowen 12/17/2025|||

This is perhaps only tangentially related to formal verification, but it made me wonder - what efforts are there, if any, to use LLMs to help with solving some of the tough questions in math and CS (P=NP, etc)? I'd be curious to know how a mathematician would approach that.

infruset 12/17/2025||

So as for math of that level, (the best) humans are still kings by far. But things are moving quickly and there is very exciting human-machine collaboration, one need only look at recent interviews of Terence Tao!

qingcharles 12/17/2025|||

I agree. I think we've gotta get through the rough couple of "AI slop" years of code and we'll come out of it the other side with some incredible tools.

The reason we don't all write code to the level that can operate the Space Shuttle is because we don't have the resources and the projects most of us work on all allow some wiggle room for bugs since lives generally aren't at risk. But we'd all love to check in code that was verifiably bug-free, exploit-free, secure etc if we could get that at a low, low price.

MobiusHorizons 12/17/2025||

at some level it's not really an engineering issue. "bug free" requires that there is some external known goal with sufficient fidelity that it can classify all behaviors as "bug" or "not bug". This really doesn't exist in the vast majority of software projects. It is of course occasionally true that programmers are writing code that explicitly doesn't meet one of the requirements they were given, but most of the time the issue is that nothing was specified for certain cases, so code does whatever was easiest to implement. It is only when encountering those unspecified cases (either via a user report, or product demo, or manual QA) that the behavior is classified as "bug" or "not bug".

I don't see how AI would help with that even if it made writing code completely free. Even if the AI is writing the spec and fully specifies all possible outcomes, the human reviewing it will glance over the spec and approve it only to change their mind when confronted with the actual behavior or user reports.

svat 12/18/2025||

> I don't see how AI would help with that

What if the AI kept bringing up unspecified cases and all you (the human) had to do all day was respond to it on what the behavior should be in each case? In this model the AI would not specify the outcomes; the specification is whatever you initially specified, and your responses to the AI's questions about the outcomes. At some point you'd decide that you'd answered enough questions (or the AI could not come up with any more unspecified cases), and bugs would be in what remained, but it would still mean substantially more thinking about cases than now.
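A toy version of "keep bringing up unspecified cases" can be sketched without any AI at all: treat the spec as a list of rules, enumerate the input space, and report the combinations no rule covers. The fields and rules below are invented for illustration only.

```python
from itertools import product

FIELDS = ("logged_in", "subscribed", "trial_expired")

# Hypothetical partial spec: each rule is (conditions, outcome).
RULES = [
    ({"logged_in": False}, "show login page"),
    ({"logged_in": True, "subscribed": True}, "allow access"),
    ({"logged_in": True, "subscribed": False, "trial_expired": True}, "show paywall"),
]

def outcome(case, rules):
    """Return the outcome of the first rule whose conditions all match."""
    for conditions, result in rules:
        if all(case[name] == value for name, value in conditions.items()):
            return result
    return None

def unspecified_cases(rules):
    """List every combination of field values that no rule covers."""
    gaps = []
    for values in product((False, True), repeat=len(FIELDS)):
        case = dict(zip(FIELDS, values))
        if outcome(case, rules) is None:
            gaps.append(case)
    return gaps

for gap in unspecified_cases(RULES):
    print("What should happen when", gap, "?")
```

Each reported gap is a question for the human; answering it adds a rule, and the loop ends when the list is empty or the human decides enough has been specified.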

igornotarobot 12/17/2025|||

This sounds amazing! What kind of systems take you a few hours to a few days now? Just curious whether it works in a niche (like sequential code), or does it work for concurrent and distributed systems as well?

riku_iki 12/17/2025|||

> I've been writing tens of thousands of lines of Lean 4 in a software engineering job

I am wondering what exactly you are doing? What tasks are you solving using generated Lean?

ajcp 12/17/2025||

Having known nothing of this field before now I have to say your excitement has me excited!

pron 12/16/2025||

> it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.

There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.

Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.

I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.

It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of computational complexity classes.

[1]: https://news.ycombinator.com/item?id=46207505

thatoneengineer 12/16/2025||

I disagree. Right now, feedback on correctness is a major practical limitation on the usefulness of AI coding agents. They can fix compile errors on their own, they can _sometimes_ fix test errors on their own, but fixing functionality / architecture errors takes human intervention. Formal verification basically turns (a subset of) functionality errors into compile errors, making the feedback loop much stronger. "Come up with what software we need better than us in the first place" is much higher on the ladder than that.

TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.

pron 12/16/2025|||

I completely agree it's appealing, I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant. In other words, I also want agents to be able to consistently prove software correct, but if they're actually able to accomplish that, then they could just as likely be able to do everything else in the production of that software (including gathering requirements and writing the spec) without me in the loop.

DoctorOetker 12/18/2025|||

>I just don't see a reason to assume that an agent will be able to succeed at it and at the same time fail at other things that could make the whole exercise redundant.

But that is much simpler to understand: eventually finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable succinct proof is much cheaper than having to search for the proof anew. The proofs will become inevitable lingua franca (like it is among capable humans) for computers as well. Basic economics will result in adoption of formal verification.

Whenever humans find an original proof, their notes contain a lot of deductions that were ultimately not used: they were searching for a proof, using intuition gained from reading and finding proofs of other theorems. It's just that LLMs are similarly gaining intuition, and at some point will become better than humans at finding proofs. It is currently already much better than the average human at finding proofs. The question is how long it takes until it gets better than any human being at finding proofs.

The future you see where the whole proving exercise (whether by humans or by LLMs) becomes redundant because it immediately emits the right code is nonsensical: the frontier of what LLMs are capable of will move gradually, so for each generation of LLMs, there will always be problems it can not instantly generate provably correct software (but omitting the according-to-you-unnecessary proof). That doesn't mean they can't find the proofs, just that they would have to search by reasoning, with no guarantee that they ever find a proof.

That search heuristic is Las Vegas, not Monte Carlo.

Companies will compare the levelized operating costs of different LLMs to decide which LLMs to use in the future on hard proving tasks.

Satellite data centers will consume ever more resources in a combined space/logic race for cryptographic breakthroughs.

pron 12/29/2025||

> eventually finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable succinct proof is much cheaper than having to search for the proof anew. The proofs will become inevitable lingua franca (like it is among capable humans) for computers as well. Basic economics will result in adoption of formal verification.

But AI isn't used to verify the proof. It's used to find it. And if it can find it - one of the hardest things in software development - there's no reason to believe it can't do anything else associated with software development. If AI agents find formal verification helpful, they would probably opt to use it, but there would also be no need for a human in the loop at all.

> It is currently already much better than the average human at finding proofs.

The average human also can't write hello world. LLMs are currently significantly worse at finding proofs than the average formal-verification person (I say this as someone who's done formal verification for many years), though, just as they're significantly worse at writing code than the average programmer. I'm not saying they won't become better, it's just strange to expect that they'll become better than the average formal-verification person and at the same time they won't be better than the average product manager.

> there will always be problems it can not instantly generate provably correct software

Nobody said anything about "instantly". If the AI finds formal verification helpful, it will choose to use it, but if it can find proofs better than humans, why expect that it won't be able to do easier tasks better than humans?

UncleEntity 12/17/2025|||

> I also want agents to be able to consistently prove software correct...

I know this is just an imprecision of language thing but they aren't 'proving' the software is correct but writing the proofs instead of C++ (or whatever).

I had a bit of a discussion with one of them about this a while ago to determine the viability of having one generate the proofs and use those to generate the actual code, just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite but that was (maybe, don't remember) in the context of proving existing software was correct.

I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...

qingcharles 12/17/2025|||

It's getting better every day, though, at "closing the loop."

When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.

pron 12/17/2025|||

To make this more constructive, I think that today AI tools are useful when they do things you already know how to do and can assess the quality of the output. So if you know how to read and write a formal specification, LLMs can already help translating natural-language descriptions to a formal spec.

It's also possible that LLMs can, by themselves, prove the correctness of some small subroutines, and produce a formal proof that you can check in a proof checker, provided you can at least read and understand the statement of the proposition.

This can certainly make formal verification easier, but not necessarily more mainstream.

But once we extrapolate the existing abilities to something that can reliably verify real large or medium-sized programs for a human who cannot read and understand the propositions (and the necessary simplifying assumptions), it's hard to see a machine doing that and at the same time not being able to do everything else.

1121redblackgo 12/16/2025||

First human-robot war is us telling the AI/robots 'no', and them insisting that [insert technology here] is good for us and is the direction we should take. Probably already been done, but yeah, this seems like the tipping point into something entirely different for humanity.

pron 12/16/2025||

... if it's achievable at all in the near future! But we don't know that. It's just that if we assume AI can do X, why do we assume it cannot, at the same level of capability, also do Y? Maybe the tipping point where it can do both X and Y is near, but maybe in the near future it will be able to do neither.

rgmerk 12/16/2025||

(sarcasm on)

Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.

(sarcasm off).

qingcharles 12/17/2025|

I mean, I don't disagree. Specs are usually horrible, way off the mark, outdated, and written by folks who don't understand how the rest of the vertical works. But, that's a problem for another day :)

Keyframe 12/17/2025||

That's a problem for Super Saiyan AGI to solve :)

jameslk 12/16/2025||

> As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.

Proofs never took off because most software engineering moved away from waterfall development, not just because proofs are difficult. Long formal specifications were abandoned since often those who wrote them misunderstood what the user wanted or the user didn’t know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet the user.

The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.

dbdr 12/17/2025||

The thing is, if it takes say a year to go from a formal spec to a formally proven implementation and then the spec changes because there was a misunderstanding about the requirements, it's a completely broken process. If the same process now takes say a day or even a week instead, that becomes usable as a feedback loop and very much desirable. Sometimes a quantitative improvement leads to a qualitative change.

baq 12/17/2025||

And yet code is being written and deployed to prod all the time, with many layers of tests. Formal specs can be used at least at all the same levels, but crucially also at the technical docs level. LLMs make writing them cheap. What’s not to like?

pedrozieg 12/16/2025|

I buy the economics argument, but I’m not sure “mainstream formal verification” looks like everyone suddenly using Lean or Isabelle. The more likely path is that AI smuggles formal-ish checks into workflows people already accept: property checks in CI, model checking around critical state machines, “prove this invariant about this module” buttons in IDEs, etc. The tools can be backed by proof engines without most engineers ever seeing a proof script.

The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
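As one illustration of "model checking around critical state machines", here is a minimal explicit-state checker in Python. The protocol is a made-up two-process lock, not any real system: in the broken variant a process checks the other's flag in one step and sets its own in a later step, and the checker searches all interleavings for a state where both are in the critical section. All names are hypothetical.

```python
from collections import deque

INITIAL = ("idle", "idle", False, False)  # (pc0, pc1, flag0, flag1)

def successors(state, atomic):
    """Yield every state reachable by one step of either process."""
    pcs, flags = state[:2], state[2:]
    for i in (0, 1):
        new_pcs, new_flags = list(pcs), list(flags)
        if pcs[i] == "idle":
            if flags[1 - i]:
                continue  # the other process holds the flag, so wait
            new_pcs[i] = "want"
            if atomic:
                new_flags[i] = True  # fixed design: check and set in one step
        elif pcs[i] == "want":
            new_pcs[i] = "crit"
            new_flags[i] = True
        else:  # "crit": leave the critical section and release the flag
            new_pcs[i] = "idle"
            new_flags[i] = False
        yield (*new_pcs, *new_flags)

def invariant(state):
    """Mutual exclusion: the two processes are never both critical."""
    return not (state[0] == "crit" and state[1] == "crit")

def find_violation(atomic):
    """Breadth-first search; return a shortest trace to a bad state, or None."""
    parent = {INITIAL: None}
    queue = deque([INITIAL])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state, atomic):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

print("check-then-set:", find_violation(atomic=False))
print("atomic check-and-set:", find_violation(atomic=True))
```

The invariant is a single line, which is roughly the amount of "spec" an engineer would have to write; the trace returned for the broken variant plays the role of a failing test that nobody had to think of in advance.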
