Posted by SweetSoftPillow 10/28/2025
It would be hilarious to allow table talk and see them trying to bluff and sway each other :D
> LLMs are unable to reason about the underlying reality
OP means that LLMs hallucinate 100% of the time with different levels of confidence and have no concept of a reality or ground truth.
"Which poker hand is better: 7S8C or 2SJH"
as
"What is 77 + 19"?
https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...
Poor o3 folded the nut flush pre..
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
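To illustrate why playing each hand twice with positions swapped matters, here is a rough sketch. It is not the site's actual harness: `play_hand`, `TRUE_EDGE`, and the noise levels are all made-up stand-ins for a real heads-up engine. The only point is that when the same deal is replayed with seats swapped, most of the card luck cancels, so the win-rate estimate gets much tighter for the same number of deals.

```python
import random
from statistics import mean, stdev

TRUE_EDGE = 2.0  # assumed true edge of model A per hand, in chips (hypothetical)


def play_hand(deal_seed, button):
    """Hypothetical stand-in for one heads-up hand; returns A's chip result.

    The cards depend only on deal_seed, so replaying the same seed with the
    button swapped gives the other model the same cards.
    """
    deal_rng = random.Random(deal_seed)
    card_luck = deal_rng.gauss(0, 100)  # luck of the deal, favors the button seat

    # Decision noise differs between the two replays of the same deal.
    decision_rng = random.Random(f"{deal_seed}-{button}")
    decision_noise = decision_rng.gauss(0, 20)

    sign = 1 if button == "A" else -1
    return TRUE_EDGE + sign * card_luck + decision_noise


def duplicate_result(deal_seed):
    """Average A's result over both seatings of the same deal."""
    return (play_hand(deal_seed, "A") + play_hand(deal_seed, "B")) / 2


def standard_error(samples):
    return stdev(samples) / len(samples) ** 0.5


deals = range(2000)

# Plain format: each deal played once, button alternating.
single = [play_hand(seed, "A" if seed % 2 == 0 else "B") for seed in deals]

# Duplicate format: each deal played twice with positions swapped.
duplicate = [duplicate_result(seed) for seed in deals]

single_se = standard_error(single)
duplicate_se = standard_error(duplicate)

print(f"single:    mean={mean(single):.2f}  se={single_se:.2f}")
print(f"duplicate: mean={mean(duplicate):.2f}  se={duplicate_se:.2f}")
```

Even with the luck cancelled, the standard error only shrinks with the square root of the number of deals, which is where the "tens of thousands of hands" requirement comes from.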
Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.
Depends on what your goal is, I think.
And it's also a thing — https://huskybench.com/
> Tournament format
> Texas Hold'em cash game, $10/$20
So, not a tournament at all, but a cash game.
I wonder if Grok is exploiting Mistral and Meta, who VPIP too much and then don’t c-bet. Seems to win a lot of showdowns and folds to a lot of three-bets. Punishes the nits because it’s able to get away from bad hands.
Goes to showdown very little so not showing its hands much - winning smaller pots earlier on.
You're right, the results and numbers are mainly for entertainment purposes. This sample size is enough to analyze the main reasoning failure modes and how often they occur.