Lambda Calculus Benchmark for AI

Posted by marvinborner 1 day ago

Lambda Calculus Benchmark for AI(victortaelin.github.io)

142 points | 43 comments

NitpickLawyer 1 day ago|

New, unbenched problems are really the only way to differentiate the models, and every time I see one it's along the same lines. Models from top labs are neck and neck, and the rest of the bunch are nowhere near. Should kinda calm down the "opus killer" marketing that we've seen these past few months, every time a new model releases, esp the small ones from china.

It's funny that even one the strongest research labs in china (deepseek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"...

I'm a huge fan of local models, have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap", "opus killers" from china. Some work, some look like they work, but they go haywire at the first contact with a real, non benchmarked task.

[1] - https://x.com/julien_c/status/2047647522173104145

adrian_b 1 day ago||

Benchmarks for LLMs without complete information about the tested models are hard to interpret.

For the OpenAI and Anthropic models, it is clear that they have been run by their owners, but for the other models there are a great number of options for running them, which may run the full models or only quantized variants, with very different performances.

For instance, in the model list there are both "moonshotai/kimi-k2.6" and "kimi-k2.6", with very different results, but there is no information about which is the difference between these 2 labels, which refer to the same LLM.

Moreover, as others have said, such a benchmark does not prove that a certain cheaper model cannot solve a problem. It happened to not solve it within the benchmark, but running it multiple times, possibly with adjusted prompts, may still solve the problem.

While for commercial models running them many times can be too expensive, when you run a LLM locally you can afford to run it much more times than when you are afraid of the token price or of reaching the subscription limits.

NitpickLawyer 1 day ago||

Agreed. But, at least as of yesterday, dsv4 was only served by deepseek. And, more importantly, that's what the "average" experience would be if you'd setup something easy like openrouter. Sure, with proper tuning and so on you can be sure you're getting the model at its best. But are you, if you just setup openrouter and go brrr? Maybe. Maybe not.

cmrdporcupine 1 day ago||

I think it's important to point out that DeepSeek was basically soft-launching their v4 model, and they weren't emphasizing it as some sort of SOTA-killer but more as proof of a potentially non-NVIDIA serving world, and as a venue for their current research approaches.

I think/hope we'll see a 4.2 that looks a lot better, same as 3.2 was quite competitive at the time it launched.

lgienapp 1 day ago|||

I feel like many of the “as good as opus” crowd would achieve the same with sonnet tbh. Actually reaching the ceiling of what Opus can do is maybe 10% of tasks, the rest is wasting compute on a too-strong model they default to for whatever they are doing. Hence they see little drop in output quality when trying out smaller open models.

manquer 1 day ago|||

It is a price signalling problem both in the API and the subscriptions.

Difference between $3/MTok and $5/MTok does not reflect the capability difference. Similarly the subscriptions have extra Opus specific allowances. I prefer Sonnet for most tasks it is good enough, but sometimes I am forced to use Opus because am out of Sonnet for the week. It feels like Opus is unwanted second option.

If they priced Sonnet closer to $1/MTok like other models it would signal value of Opus better.

theptip 1 day ago|||

The Eval problem; “Alice is supposedly smarter than Bob, but they can both tie their shoes just as fast”.

gertlabs 1 day ago|||

Our philosophy is that you can design problems so that they can scale through a few release cycles by making environments more complex, with no known ceiling. The key for scalability is not having a single correct answer (even though Victor's benchmark is interesting), but still being objectively scorable.

That's what we've done with our comprehensive reasoning and coding benchmark at https://gertlabs.com

electroglyph 1 day ago|||

i dunno, Opus is losing it's edge imo. i regularly use a mix of models, including Opus, glm 5.1, kimi 2.6, etc. and i find that all of them are pretty much equally good at "average" coding, but on difficult stuff they're nearly equally bad. i can't deny that Opus has an edge, but it's not a huge one.

anuramat 1 day ago|||

"almost as good as opus at writing python/js/... when given a spec" might be enough for a lot of people, especially if its 10x cheaper

cmrdporcupine 1 day ago||

The question isn't whether it's "as good as Opus" but that there exists something that costs 1/10th the cost to use but can still competently write code.

Honestly, I was "happy" with December 2025 time frame AI or even earlier. Yes, what's come after has been smarter faster cleverer, but the biggest boost in productivity was just the release of Opus 4.5 and GPT 5.2/5.3.

And yes it might be a competitive disadvantage for an engineer not to have access to the SOTA models from Anthropic/OpenAI, but at the same time I feel like the missing piece at this point is improvements in the tooling/harness/review tools, not better-yet models.

They already write more than we can keep up with.

amelius 1 day ago|||

Also, at some point the open models may be poured into silicon and become 100x as fast. Maybe when allowed to use more tokens the models can solve the problems which they now cannot. So this would imho be an interesting addition to the benchmark.

NitpickLawyer 1 day ago|||

Oh, I agree. Last year I tried making each model a "daily driver", including small ones like gpt5-mini / haiku, and open ones, like glm, minimax and even local ones like devstral. They can all do some tasks reliably, while struggling at other tasks. But yeah, there comes a point where, depending on your workflows, some smaller / cheaper models become good enough.

The problem is with overhypers, that they overhype small / open models and make it sound like they are close to the SotA. They really aren't. It's one thing to say "this small model is good enough to handle some tasks in production code", and it's a different thing to say "close to opus". One makes sense, the other just sets the wrong expectations, and is obviously false.

adrian_b 1 day ago|||

There is no doubt that for many tasks the SotA models of OpenAI and Anthropic are better than the available open weights models.

Nevertheless, I do not believe that either OpenAI or Anthropic or Google know any secret sauce for better training LLMs. I believe that their current superiority is just due to brute force. This means that their LLMs are bigger and they have been trained on much more data than the other LLM producers have been able to access.

Moreover, for myself, I can extract much more value from an LLM that is not constrained by being metered by token cost and where I have full control on the harness used to run the model. Even if the OpenAI or Anthropic models had been much better in comparison with the competing models, I would have still been able to accomplish more useful work with an open-weights model.

I have already passed once through the transition from fast mainframes and minicomputers that I was accessing remotely by sharing them with other users, to slow personal computers over which I had absolute control. Despite the differences in theoretical performance, I could do much more with a PC and the same is true when I have absolute control over an LLM.

cmrdporcupine 1 day ago|||

I am desperate for the tooling that puts me back in charge. And just has the models as advisor. In which case the "smart level" is just a dial.

I'm probably going to have to make it myself.

vorticalbox 1 day ago||

Some ide’s already have this. In zed you can stick it “ask” mode.

Being able to use it as a rubber duck while it can also read the code works quite well.

There are a few APIs at work I have never worked on and the person that wrote them no longer works with us so AI fills that gap well.

tromp 1 day ago||

The corresponding repo https://github.com/VictorTaelin/LamBench describes this as:

    λ-bench
    A benchmark of 120 pure lambda calculus programming problems for AI models.
    → Live results
    What is this?
    λ-bench evaluates how well AI models can implement algorithms using pure lambda calculus. Each problem asks the model to write a program in Lamb, a minimal lambda calculus language, using λ-encodings of data structures to implement a specific algorithm.
    The model receives a problem description, data encoding specification, and test cases. It must return a single .lam program that defines @main. The program is then tested against all input/output pairs — if every test passes, the problem is solved.

"Live results" wrongly links to https://victortaelin.github.io/LamBench/ rather than the correct https://victortaelin.github.io/lambench/

An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...

Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.

lioeters 1 day ago|

As an admirer of your work with binary lambda calculus, etc., I'm curious to hear your thoughts on the author's company with HVM and interaction combinators. https://higherorderco.com/ I've always felt there was untapped potential in this area, and their work seems like a way toward a practical application for parallel computing and maybe leveraging LLMs using a minimal language specification.

tromp 1 day ago||

I find the interaction net model of computation fascinating in its elegance and efficiency. And so I very much appreciate Taelin's company making efficient and highly concurrent implementations for both CPUs and GPUs.

Unfortunately, my lambda calculus programs don't work smoothly on interaction nets. While there is an extremely straightforward mapping from lambda terms to interaction nets, that mapping doesn't preserve the semantics for all terms. It does for a large subset though, and Taelin and colleagues have designed languages like Bend that express that subset. Programs that can be easily expressed in Bend will benefit greatly from HVM.

There exist much more complicated mappings from lambda terms to interaction nets that do preserve semantics but AFAIK those have yet to be fully implemented in HVM and it's not clear if the resulting implementation would run appreciably faster than the simple combinatory reduction engine I run my lambda programs on now.

BTW, basing Algorithmic Information Theory on interaction nets runs into the problem of not having any simple binary encoding, which is where lambda calculus shines.

lioeters 1 day ago||

So interesting, thank you, I'm glad I asked.

I've been looking into tree calculus, which can "dissolve" lambda calculus and SKI combinators into natural (unlabelled) binary trees. I was wondering if you had any thoughts on it. It's a curious model of computation with unique properties, that I think can also convert to interaction nets or vice versa. I saw that Taelin tweeted about it once, dismissing it as misguided somehow ("not expressive nor efficient") but I feel like there's more to it.

EDIT: Oh I see you've commented on it before. Right, it's equivalent to SKI but worded differently. And this "intensionality", I guess no one has yet found practical use for that property.

> It's not based on a single combinator. It's based on the two standard basis combinators S and K, and a non-combinator that can inspect normalized terms to distinguish between degrees 0,1,and 2.

I quoted his tweet below, I'll keep it in case it's of any interest.

---

it is not even related to kind or interaction nets, this calculus is simply not expressive nor efficient. it comes from a reasonable but incorrect intuition: "the less axioms you have, the more natural it must be!". that's not true. take nand logic, for example; it is complete, but expressing circuits with nand gates only will take more space than separating and/or/not, because you're just forcing ands/ors/nots to be inserted and computed where they're not needed. SKI combinators suffer from a similar trap; they're inherently slower and less expressive than the λ-calculus, because there are λ-terms that can't be expressed without "a lot of bloat" in them

this calculus does a similar thing: it packs many operations into a single one, and, by doing so, it bloats the program space and forces useless computations. yes, this makes the language feel more natural, because there are less axioms; but the axiom is just an amalgamation of many different functions in one. you want it the other way around: you want each axiom to add a key "expressivity" to your language, as tersely as possible, and in a way that makes the programs written on top of them as clean as possible, because everything you need to express is expressive from these axioms. yes, it is good to have as few axioms as possible (to avoid redundancy), but the expressivity of your axioms is what matters the most. the add function in tree calculus is a great example of a program that has been forced to be far from optimal due to the underlying languages constraint - that's terrible.

interaction combinators do the opposite: yes, the core language is extremely small and elegant - in fact, smaller than tree calculus - but, what is most important, it that its axioms allow terms written in it to be expressed in their shortest, cleanest, and most efficient formulation. interaction combinators provide exactly 3 interaction rules: erasure, which erases data; commutation, which copies data incrementally; and annihilation, which observes and moves data. from these 3, all else follows. loops and recursion, for example, can be optimally implemented (both in time AND space) by commutations. beta-reduction (i.e., λ-calculus) and pattern-matching (i.e., branching) can also be done optimally through annihilations. tuples and datatypes can also be encoded space-optimally with the same axioms. there is no waste, there is no bloat, which is make it so good to compute and to search.

in fact, you can't invent an axiomatic interaction system that is more efficient than interaction combinators, because they're optimal. and if, for some reason, interaction combinators scare you, the same holds for the recursive linear λ-calculus with addition of incremental copying, for example. in many of these senses, even this language is better to express algorithms on, than this proposed tree calculus

dataviz1000 1 day ago||

lambench is single-attempt one shot per problem.

I don't think they understand how the LLM models work. To truly benchmark a non-deterministic probabilistic model, they are going to need to run each about 45 times. LLM models are distributions and behave accordingly.

The better story is how do the models behave on the same problem after 5 samples, 15 samples, and 45 samples.

That said, using lambda calculus is a brilliant subject for benchmarking.

The models are reliably incorrect. [0]

[0] https://adamsohn.com/reliably-incorrect/

yorwba 1 day ago||

Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times.

dataviz1000 1 day ago||

I made flame charts of sonnet thinking. [0] You can see there is a lot of variance over 5 runs. They all passed but there was one that struggled with errors. How many trials are needed to clamp to ceiling or floor? ~30?

[0] https://adamsohn.com/lambda-variance/

yorwba 1 day ago||

How many samples you need depends on the difference you want to be able to measure (0% to 1% is different from 50% to 51% is different from 0% to 10% is different from 50% to 60%), the significance level at which you will declare a difference (conventionally, p < 0.05) and how likely you want this to happen when there is indeed such a difference (statistical power, conventionally 80%). Of course you can also just sample an arbitrary number of times and compute confidence intervals after the fact, but doing a statistical power computation helps clarify what it is you want to know, how certain you want to be, and whether you can realistically achieve such knowledge with the budget you have.

dataviz1000 15 hours ago||

To solve the lambda calculus problem Sonnet burns 8,163 - 17,334 tokens on 5 runs.

If I want to engineer a prompt, starting with the tokens which are clearly better in the one with 8,163 will yield a better agent.

If I build an agent that does something arbitrary like reverse engineer any website or multiplies 2 large numbers without a tool that allows it to use code, the mechanics of the reasoning work the same as an agent solving lambda calculus. Running 39,440 trials is prohibitory expensive. Nonetheless, without perfect proof, I want to say running an agent several times and then take any generalized output from the fastest runs yields much faster generalized agent that solves that specific task given different parameters.

That is something I really want to know. If I have an agent that reverse engineers websites, can I take the thinking output from the best running and use that to seed a better agent? I don't know how to set up the experiment. And asking ChatGPT has been futile especially and running it is very expensive. How do I set up that experiment?

yorwba 14 hours ago||

You could try a sequential testing setup, which can let you stop the experiment earlier if the difference is larger than expected. But if the difference is small, there's no way around the fact that reliably detecting small differences requires large sample sizes, and the relationship is inverse quadratic (halving the smallest detectable difference quadruples the sample size you need).

UltraSane 1 day ago||

Even people benefit from multiple tries over time.

chris_st 1 day ago||

Well, to be fair, people cheat by remembering what they did last time. I think the idea here is to run the models from a "clean slate" and see how often they succeed/fail.

They are, like people, non-deterministic, so giving them several "fair" trials makes sense to me.

internet_points 1 day ago||

Would love to see where the mistral stuff lands.

Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)

maciejzj 1 day ago||

Can anyone more familiar with lambda calculus speculate why all models fail to implement fft? There are gazzilion fft implementations in various languages over the web and the actual cooley-tukey algorithm is rather short.

amluto 1 day ago|

I can guess:

Pure lambda calculus doesn’t have numbers, and FFT, as traditionally specified, needs real numbers or a decent approximation of them. There are also Fourier transforms over a ring. So the task needs to specify what kind of numbers are being Fourier transformed.

But the prompt is written very tersely written. It’s not immediately obvious to me what it’s asking. I even asked ChatGPT what number system was described by the relevant part of the prompt and it thought it was ambiguous. I certainly don’t really know what the prompt is asking.

Adding insult to injury, I think the author is trying to get LLMs to solve these in one shot with no tools. Perhaps the big US models are tuned a little bit for writing working code in their extremely poor web chat environments (because the companies are already too unwieldy to get their web sites in sync with the rest of their code), but I can’t imagine why a company like DeepSeek would use their limited resources to RL their model for its coding performance in a poor environment when their users won’t actually do this.

fallat 1 day ago|||

You can express any number type in pure lambda calculus.

amluto 1 day ago|||

I can also implement compliant IEEE 754 floating point arithmetic, with all rounding modes and exceptions, in Conway’s Game of Life, and I can implement that on a Turing machine and emulate that Turing machine in Brainfuck. This does not mean it’s sensible — it would be a serious engineering project that should be decomposed, built in pieces, and tested or verified in pieces. Which even a Chinese LLM ought to be able to do in an appropriate environment with appropriate resources and an appropriate prompt.

But that is not what this is testing. And I’m not even sure which number system the mentioned roots of unity are in. Maybe it’s supposed to be generic over number systems in a delightfully untyped lambda calculus sort of way?

edit: after pasting the entire problem including the solution, ChatGPT is able to explain “GN numbers”. I suspect its explanation is sort of correct. But I get no web search results for GN numbers and I can’t really tell whether ChatGPT is giving a semi-hallucinated description or figuring it out from the problem and its solution or what.

bediger4000 1 day ago|||

Lambda calculus people have the phrase "adequate numeral system" because they've discovered many different numeral systems.

stefan_ 1 day ago|||

Most FFTs I'm sure start life in Verilog and have never known real numbers.

amluto 1 day ago||

Traditionally they were done in FORTRAN or C or maybe bespoke DSP code, usually using ordinary boring complex numbers. Sometimes you’ll see FFTs over rings, called “number theoretic transforms”, used to multiply very large integers.

I suppose these days FFTs are used for highly power optimized, very much real-time OFDM transceivers. I’m guessing they’re done in carefully written assembly, possibly automatically generated (by conventional code, not LLMs, at least for existing solutions), possibly with a little bit of hardware help.

flashdesk 1 day ago||

I like this kind of benchmark, especially since it uses problems that are harder to overfit to.

That said, single-attempt results are a bit hard to read into. For anything code-like, things like retries, test feedback, or just letting the model iterate tend to change the outcome quite a bit.

cmrdporcupine 1 day ago||

Odd to see GPT 5.5 behind 5.4?

no_op 1 day ago|

The author posted new results using the API (apparently the original run was through Codex), and 5.5 moves to the top: https://x.com/VictorTaelin/status/2047818978664268071

auggierose 1 day ago||

Still doesn't explain why Codex 5.4 is better than Codex 5.5.

cmrdporcupine 1 day ago||

That's not what their latest post shows.

jakeinsdca 1 day ago||

codex 5.5 is worse then 5.4 but 10x faster?

the_data_nerd 1 day ago|

[flagged]

marvinborner 1 day ago|

fwiw, one of the FFT challenges is about the Scott encoding [1], while the other uses Church trees at least [2]. Both use a balanced ternary numeral system, which is a lot more efficient than plain Church/Scott and fairly well-known [3]. Either way, I would have assumed there to be a chance that at least one of the AIs had a look at [4] -- a tutorial about FFT in LC by the benchmark creator himself.

[1] task: https://github.com/VictorTaelin/lambench/blob/main/tsk/stre_... solution: https://github.com/VictorTaelin/lambench/blob/main/lam/stre_...

[2] task: https://github.com/VictorTaelin/lambench/blob/main/tsk/ctre_... solution: https://github.com/VictorTaelin/lambench/blob/main/lam/ctre_...

[3] https://link.springer.com/chapter/10.1007/3-540-45575-2_20

[4] https://gist.github.com/VictorTaelin/5776ede998d0039ad1cc9b1...