Posted by SweetSoftPillow 10/28/2025

Poker Tournament for LLMs(pokerbattle.ai)
311 points | 208 comments | page 2
alexjurkiewicz 10/28/2025|
It doesn't seem like the design of this experiment allows AIs to evolve novel strategy over time. I wonder if poker-as-text is similar to maths -- LLMs are unable to reason about the underlying reality.
unkulunkulu 10/28/2025|
You mean that they don’t have access to the opponent's full behavior?

It would be hilarious to allow table talk and see them trying to bluff and sway each other :D

rrr_oh_man 10/28/2025|||
I think by

> LLMs are unable to reason about the underlying reality

OP means that LLMs hallucinate 100% of the time with different levels of confidence and have no concept of a reality or ground truth.

hsbauauvhabzb 10/28/2025||
Confidence? I think the word you’re looking for is ‘nonsense’
nurumaik 10/28/2025||||
Make the entire chain of thought visible to each other and see if they can evolve into hiding strategies in their CoT
chbbbbbbbbj 10/28/2025||
pardon my ignorance but how would you make them evolve?
alexjurkiewicz 10/28/2025|||
I mean, LLMs have the same sorts of problem with

"Which poker hand is better: 7S8C or 2SJH"

as

"What is 77 + 19"?

crackpype 10/28/2025||
It seems to be broken? For example, in this hand, the hand finishes at the turn even though 2 players are still live.

https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...

imperfectfourth 10/28/2025|
One of them went all in, but the river should still have been dealt because none of them is drawing dead. The Kc is still in the deck, and it would give llama the winning hand (the other players hold the other two kings). If the remaining king were the Ks instead, llama would be drawing dead, because kimi would improve to a flush even if the king came.
crackpype 10/28/2025||
Perhaps it's a display issue then, for the case where no action is possible on the river. You can see the winning hand does include the river card 8d: "Winning Hand: One pair QsQdThJs8d"

Poor o3 folded the nut flush pre..

energy123 10/28/2025||
Not enough samples to overcome variance. Only 714 hands played for Meta LLAMA 4. Noise in a dashboard.
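A rough back-of-the-envelope check of the sample-size point, as a minimal sketch. It assumes a standard deviation of about 100 big blinds per 100 hands, which is a commonly quoted ballpark for no-limit hold'em and not a figure taken from the site. The function names are illustrative only.

```python
import math

def winrate_std_error(hands: int, sd_per_100: float = 100.0) -> float:
    """Standard error of a win rate (in bb/100) after `hands` hands.

    sd_per_100 is an ASSUMED standard deviation in bb per 100 hands,
    not a value measured from the PokerBattle data.
    """
    blocks = hands / 100.0  # number of 100-hand blocks in the sample
    return sd_per_100 / math.sqrt(blocks)

def hands_needed(target_se: float, sd_per_100: float = 100.0) -> int:
    """Hands required to shrink the standard error to `target_se` bb/100."""
    return math.ceil(100 * (sd_per_100 / target_se) ** 2)

if __name__ == "__main__":
    se = winrate_std_error(714)
    print(f"standard error after 714 hands: {se:.1f} bb/100")
    print(f"hands needed for 5 bb/100 standard error: {hands_needed(5.0)}")
```

Under that assumption, the uncertainty after 714 hands is far wider than the win-rate gaps a dashboard would typically show between players.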
mpavlov 10/28/2025|
(author of PokerBattle here)

That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.

A proper benchmark would require things like:

- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
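The requirements above could be sketched as a schedule generator like the one below. This is only an illustration of the "heads-up, each deal played twice with seats swapped" idea, not how PokerBattle is implemented; `Match`, `duplicate_schedule`, and the use of an integer deck seed to reproduce a deal are all assumptions.

```python
from itertools import combinations
from typing import Iterator, NamedTuple, Sequence

class Match(NamedTuple):
    deck_seed: int   # hypothetical seed that reproduces the same shuffled deck
    button: str      # model seated on the button
    big_blind: str   # model seated in the big blind

def duplicate_schedule(models: Sequence[str], deals_per_pair: int) -> Iterator[Match]:
    """Yield heads-up matches where every deal is played twice, seats swapped."""
    for a, b in combinations(models, 2):   # strict heads-up: one pair at a time
        for seed in range(deals_per_pair):
            yield Match(seed, button=a, big_blind=b)
            yield Match(seed, button=b, big_blind=a)  # same cards, swapped seats

if __name__ == "__main__":
    for match in duplicate_schedule(["model_a", "model_b", "model_c"], 2):
        print(match)
```

Replaying each deal with the seats swapped cancels out part of the card luck, so fewer hands are needed for the same confidence than with independent random deals.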

The current setup is mainly useful for observing common reasoning failure modes and how often they occur.

rzk 10/28/2025||
See also: https://nof1.ai/

Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.

ngruhn 10/29/2025|
So the Chinese ones make profit and the silicon valley LLMs are burning money. Sounds about right.
lvl155 10/28/2025||
I think a better method of testing current generation of LLMs is to generate programs to play Poker.
mpavlov 10/28/2025|
(author of the PokerBattle here)

Depends on what your goal is, I think.

And it's also a thing — https://huskybench.com/

lvl155 10/28/2025||
Great job on this btw. I don’t mean to take away anything from your work. I’ve also toyed with AI H2H quite a bit for my personal needs. It’s actually a challenging task because you have to have a good understanding of the models you’re plugging in.
dudeinhawaii 10/28/2025||
Why are you using cutting edge models for all providers except OpenAI? Stuck out to me because I love seeing how models perform against each other on tasks. You have Sonnet 4.5 (super new), which is why it stood out when o3 is ancient (in LLM terms).
chrisofspades 10/29/2025||
From the about page [0]:

> Tournament format
>
> Texas Hold'em cash game, $10/$20

So, not a tournament at all, but a cash game.

[0] https://pokerbattle.ai/about

hayd 10/28/2025||
The table being open for the entire time with a 100bb minimum and no maximum is going to lead to some wild swings at the top.
flave 10/28/2025|
Cool idea and interesting that Grok is winning and has “bad” stats.

I wonder if Grok is exploiting Mistral and Meta, who VPIP too much and then don’t c-bet. Seems to win a lot of showdowns and folds to a lot of three-bets. Punishes the nits because it’s able to get away from bad hands.

Goes to showdown very little so not showing its hands much - winning smaller pots earlier on.
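For readers unfamiliar with the stats mentioned here, a minimal sketch of how two of them are usually defined. VPIP is the share of dealt hands in which a player voluntarily put money in preflop, and WTSD ("went to showdown") is the share of hands that reached showdown among those that saw a flop. The `HandRecord` fields are hypothetical and do not reflect the site's actual data format.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class HandRecord:
    # Hypothetical per-hand flags for a single player.
    voluntarily_put_money_in: bool  # called or raised preflop (posting blinds excluded)
    saw_flop: bool
    went_to_showdown: bool

def vpip(hands: Sequence[HandRecord]) -> float:
    """Percent of dealt hands where the player voluntarily put money in preflop."""
    if not hands:
        return 0.0
    return 100.0 * sum(h.voluntarily_put_money_in for h in hands) / len(hands)

def wtsd(hands: Sequence[HandRecord]) -> float:
    """Percent of flops seen that ended with the player at showdown."""
    flops = [h for h in hands if h.saw_flop]
    if not flops:
        return 0.0
    return 100.0 * sum(h.went_to_showdown for h in flops) / len(flops)
```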

energy123 10/28/2025|
The results/numbers aren't interesting because the number of samples is woefully insufficient to draw any conclusions beyond "that's a nice looking dashboard" or maybe "this is a cool idea"
mpavlov 10/28/2025|||
(author of PokerBattle here)

You're right, the results and numbers are mainly for entertainment purposes. This sample size is enough to analyze the main reasoning failure modes and how often they occur.

howlingowl 10/28/2025|||
Anti-grok cope right here