Broken Proofs and Broken Provers

Posted by RebelPotato 3 days ago

Broken Proofs and Broken Provers(lawrencecpaulson.github.io)

63 points | 14 comments

N_Lens 2 days ago|

The multithreading bug in Isabelle is fascinating from a systems design perspective: threads racing ahead assuming a stuck proof will succeed, creating false positives. That's exactly the kind of subtle concurrency issue that makes verification tools need... verification tools.

Dacit 2 days ago||

In the described case, this was a simple user error. But you are right nonetheless: To enable the concurrency, the system uses a parallel inference kernel (https://www21.in.tum.de/~wenzelm/papers/parallel-isabelle.pd...). I have heard the anecdote that in the first version, this also allowed one to prove false, by fulfilling a proof promise with a proof that depended recursively on the same promise.

And there are verification tools for verification tools! Isabelle has a verified proof checker that can check whether a given proof is sound: https://www21.in.tum.de/~rosskops/papers/metalogic_pre.pdf The only caveat is that the proof checker itself is verified in Isabelle...

vilhelm_s 2 days ago||

I don't think it's that subtle: it would be enough to have a continuous integration tool that tries to compile the proofs that are checked into version control and raises an error if it fails.

The linked blog post says "Unfortunately, a student once submitted work containing this error; it was almost entirely incorrect and he had no idea." I guess the student probably was aware that not every proof had gone through, and that the that he saw like "99 out of 100 proofs succeeded" and assumed that meant that he had mostly completed the task, not realizing that a false theorem would be used to the give incorrect proofs for the rest of the file.

Dacit 2 days ago||

Indeed this can simply be checked by a command-line invocation. But I don't think the student was aware: They would only have seen a purple coloring of the "stuck" part, as shown in the linked example in the blog post. And if one assumes that the system only accepts correct proof steps, it's very easy to miss a small error in a theorem statement...

zero_k 2 days ago||

I still remember the time when a gcc bug caused MiniSat to output UNSAT for a satisfiable problem [1]. I was the author of a SAT solver, and I was chasing my tail trying to figure out why I was getting UNSAT for a satsifiable problem. I have to admit I didn't expect it to be a gcc bug... (note: bug was found by Vegard Nossum on the CryptoMiniSat mailing list)

[1] https://www.msoos.org/2013/04/gcc-4-5-2-at-sat-competition-2...

LegionMammal978 2 days ago|

I'm a bit skeptical about taking the rates of reported soundness bugs between different systems and drawing conclusions about the underlying approaches. There's typically a class of bugs that users can stumble into by doing something unusual, and then another class of bugs that can only really be found by exploiting holes in the implementation. The first class depends on which features get exercised the most, the second class depends on how many eyes are on the source (and how accessible it is), and both classes heavily depend on the overall size of the userbase.

E.g., Metamath is designed to be as theoretically simple as possible, to the point that it's widely considered a toy in comparison to 'serious' proof systems: a verifier is mainly just responsible for pushing around symbols and strings. In spite of this simplicity, I was able to find soundness bugs in a couple major verifiers, simply because few people use the project to begin with, and even fewer take the time to pore over the implementations.

So I'd be hesitant to start saying that one approach is inherently more or less bug-prone than another, except to be slightly warier in general of larger or less accessible kernels.

GregarianChild 2 days ago|

LCF-style provers like Isabelle/HOL and HOLlight are some of the most widely used, and oldest interactive theorem provers. If they consistently show smaller error rates than other systems, that is an interesting empirical observation. To give but one recent example: Amazon recently announced a vast 260000 lines of Isabelle/HOL-checked correctness proof of their new Nitro hypervisor for AWS EC2 Graviton5 instances.

LCF-style provers have a much smaller trusted computing base than Curry/Howard based provers like Coq, Agda and Lean.

One may wonder if there is a correlation between size of TCB and error rate in widely used provers?

zozbot234 2 days ago||

> LCF-style provers have a much smaller trusted computing base than Curry/Howard based provers like Coq, Agda and Lean.

I'm not sure that this is correct. The TCB in a CH based prover is just the implementation of the actual kernel. In LCF, you also have to trust that any tactics are implemented in a programming language that doesn't allow you to perform unsound operations. That's a vast expansion in your TCB. (You can implement LCF-like "tactics" in a CH-based prover via so-called reflection that delegates the proof to a runtime computation, but you do have to prove that your computation yields a correct decision for the problem.)

Dacit 2 days ago|||

No. The whole point of the LCF approach is that only kernel functions can generate theorems. Usually this is done by having a Thm module with opaque thm type (so its instances can only be generated by this module) and embedding the base rules of your logical calculus as shallow functions into that Thm module. Thus, all thm objects are correct by construction (w.r.t. your calculus), and you only need to check the Thm module to verify this.

Tactics generate thms by calling functions of the Thm module in some fashion.

LegionMammal978 2 days ago|||

At least in Isabelle/ML, it seems like in practice there are also untrusted "oracles" that a proof can invoke to generate "thm" objects [0], and it's not entirely trivial to automatically ensure that only trusted sources are used in a project [1], unless I am misunderstanding the linked thread.

Of course there's no free lunch, as you say, in making sure that high-level proofs are lowered into the trusted part correctly, but it's certainly a piece that should ideally be as simple as possible.

[0] https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2025-02/...

[1] https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2025-02/...

zozbot234 2 days ago|||

This does not resolve the issue of how to ensure that the "module" system is itself sound and has no unforeseen escape hatches. You have to assume that it is, or else include a complete programming language implementation as part of your TCB. If your system generates proof objects (possibly allowing for some kind of reflection, with the proof objects merely specifying some kind of computation) you can skip this and simply prove that the kernel itself is correct.

GregarianChild 2 days ago||

The question: "does the ambient programming language do the right thing?" applies to other provers too. So if you assume the semantics of the implementation language is broken, or the compiler, or the computer executing the code, then those issues also need to be added to the TBC of Curry-Howard provers.

There is no free lunch.

GregarianChild 2 days ago||||

This is wrong as the others replies also point out. Tactics in LCF-style provers are not part of the TCB. Here is an example of the TCB for an industrial strength prover:

○ https://github.com/jrh13/hol-light/blob/master/fusion.ml

A mere 676 LoCs. This miniscule size of the TCB is where the comparatively lesser bug count (despite intense use) comes from.

ajb 2 days ago|||

My understanding is that tactics output piece of proof, but that this isn't trusted by the TCB of an LCF prover. This is very far from my area of expertise, but seems to be confirmed by https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-39.html