Posted by Poudlardo 1 day ago
it clearly and demonstrably does not. In fact, from eyeballing their chart, Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising, because Alibaba, Moonshot, and Zhipu each have hundreds of employees and hundreds of millions of dollars of investment.
Model               Cost ($)   Score
..
Claude Opus         1,650      39.6
..
Leanstral pass@8    145        31.0
Leanstral pass@16   290        31.9

Mistral seems to focus on a different market than the others. Their best model is meh, and their best local ASR model is either rather slow compared to Parakeet on similar languages, or not as good on others (like Qwen ASR).
Side note: Lean seems quite unreadable, with tons of single-letter variable names. Part of it is me being unaccustomed to it, but still.
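For what it's worth, single-letter names are idiomatic in Lean, not just an AI quirk. A tiny mathlib-style proof (my own toy example, not from the benchmark) already leans on `a`, `b`, `n`, `ih`:

```lean
-- Idiomatic Lean 4: one-letter variables and hypothesis names throughout.
theorem add_comm' (a b : Nat) : a + b = b + a := by
  induction a with
  | zero => simp
  | succ n ih => rw [Nat.succ_add, ih, Nat.add_succ]
```

Once you read `n` as "the induction variable" and `ih` as "induction hypothesis" by convention, it gets easier, but the learning curve is real.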
There are two compatible and important (but different) questions in play:
1. Is a program correct relative to a formal specification?
2. Is the formal specification what we mean/want?
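To make the distinction concrete, here's a toy Lean sketch (entirely my own construction, assumed names) where question 1 gets a definitive "yes" while question 2 fails:

```lean
-- Spec we *think* says "f returns the maximum" -- it actually only says
-- "f returns some element of a non-empty list".
def Spec (f : List Nat → Nat) : Prop :=
  ∀ (x : Nat) (xs : List Nat), f (x :: xs) ∈ x :: xs

-- Returning the head provably satisfies Spec (question 1: yes)...
def f : List Nat → Nat
  | []     => 0
  | x :: _ => x

theorem f_meets_spec : Spec f := fun x xs => List.mem_cons_self x xs

-- ...but question 2 fails: f [1, 5, 3] = 1, not the maximum 5.
-- The proof is airtight; the specification was not what we wanted.
```

A verifier can only ever answer question 1; question 2 stays a human judgment call.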
*: Worth asking: “Was that other person necessarily wrong? Or perhaps they are discussing a different aspect or framing?” AKA: “be curious and charitable.” I’m not going to link to the specific threads, but they happened / are happening. Le sigh.
But I wonder how this scales in practice outside of formal environments.
In most ML/LLM systems, the bottleneck isn’t just correctness of individual steps, but the interaction between components (data → tokenizer → model → inference). A lot of failures come from subtle mismatches across the pipeline rather than strictly invalid logic.
Formal specs are great when the system is well-defined, but many real-world pipelines are still exploratory and data-dependent.
It feels like there’s a gap between:
• formally verified components
• emergent behavior in end-to-end systems
Curious how you see this approach handling those system-level uncertainties.
I tried it out myself. I let AI add action transitions throughout the code, like `// A -> B: some description`. Then I validate via a test that every action transition defined in my model is also documented in a comment somewhere in the code, and vice versa: that every such comment exists in the model.
Finally, I let AI write model-check queries on particular properties. If I notice a particular bug, I ask AI to analyze the model and the model-check queries to explain why it could happen, and ask it to strengthen them.
It sounds like a lot of effort, but I got it working in half an hour.
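The model-vs-comment cross-check could be sketched like this (the transition set, comment format, and all names here are my own assumptions; the actual model format wasn't shown):

```python
import re

# Hypothetical: transitions extracted from the formal model by other tooling.
MODEL_TRANSITIONS = {("Idle", "Running"), ("Running", "Done")}

# Code annotations look like: // Idle -> Running: start the job
TRANSITION_RE = re.compile(r"//\s*(\w+)\s*->\s*(\w+)\s*:")

def transitions_in_source(source: str) -> set[tuple[str, str]]:
    """Collect every `// A -> B:` annotation found in the source text."""
    return {(a, b) for a, b in TRANSITION_RE.findall(source)}

def check(source: str, model: set[tuple[str, str]]):
    """Return (in model but never commented, commented but not in model)."""
    commented = transitions_in_source(source)
    return model - commented, commented - model

code = """
// Idle -> Running: start the job
// Running -> Done: job finished
// Done -> Idle: reset   (this one is missing from the model!)
"""

missing_comment, missing_model = check(code, MODEL_TRANSITIONS)
```

Running this flags `("Done", "Idle")` as commented in code but absent from the model, which is exactly the kind of drift the test is meant to catch.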