Posted by SweetSoftPillow 1 day ago

Poker Tournament for LLMs (pokerbattle.ai)
300 points | 195 comments
crackpype 1 day ago|
It seems to be broken? For example, this hand finishes at the turn even though two players are still live.

https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...

imperfectfourth 1 day ago|
One of them went all in, but the river should still have been dealt because neither of them is drawing dead. Kc is still in the deck, which would give llama the winning hand (the other players hold the other two kings). If it were Ks in the deck instead, llama would be drawing dead, because kimi would improve to a flush even if the king came.
crackpype 1 day ago||
Perhaps a display issue then, when no action is possible on the river. You can see the winning hand does include the river card 8d: "Winning Hand: One pair QsQdThJs8d"

Poor o3 folded the nut flush pre..

energy123 1 day ago||
Not enough samples to overcome variance. Only 714 hands played for Meta LLAMA 4. Noise in a dashboard.
mpavlov 1 day ago|
(author of PokerBattle here)

That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.

A proper benchmark would require things like:

- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped

The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
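
For a sense of scale, here's a back-of-the-envelope sketch (the winrate and standard deviation are illustrative assumptions, not numbers from the site):

    import math

    # Rough number of hands for a winrate to clear zero at 95% confidence.
    # Assumes i.i.d. hands; full-ring NLHE std dev is often quoted
    # around 60-100 bb per 100 hands, so 90 is just a placeholder.
    def hands_needed(winrate_bb100=5.0, stdev_bb100=90.0, z=1.96):
        # Significance requires: winrate > z * stdev / sqrt(blocks),
        # where one block is 100 hands.
        blocks = (z * stdev_bb100 / winrate_bb100) ** 2
        return math.ceil(blocks) * 100

    print(hands_needed())  # 124500: a 5 bb/100 edge needs ~125k hands

At 714 hands, even a large edge is indistinguishable from noise.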

camillomiller 1 day ago||
As a Texas Hold'em enthusiast, I find some of the hands moronic. I just checked one where Grok wins with A3s because Gemini folds K10 with an Ace and a King on the board, without Grok betting anything. Gemini just folds instead of checking. It's not even GTO, it's just pure hallucination. Meaning: I wouldn't read anything into the fact that Grok leads. These machines are not made to play games like online poker deterministically and would be CRUSHED by GTO play. It would be more interesting to see whether they could play exploitatively.
prodigycorp 1 day ago||

  > Gemini folds K10 with an Ace and a King on the board, without Grok betting anything. Gemini just folds instead of checking.
It's well known that Gemini has low coding self-esteem. It's hilarious to see that it applies to poker as well.
jpfromlondon 1 day ago|||
it's probably trained off my repos then
raverbashing 1 day ago|||
You're absolutely right! /s
hadeson 1 day ago|||
From my experience, their hallucinations when playing poker mostly come from misreading their hand strength in the current state, e.g. thinking they have the nuts when they are actually on a nut draw. They would reason a lot better if you explicitly gave their hand strength in the prompt.
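
For example, a minimal sketch of that idea using the open-source treys evaluator (the prompt line itself is hypothetical):

    from treys import Card, Evaluator

    # Evaluate the hero's made hand so the prompt can state it outright,
    # instead of trusting the model to read the board correctly.
    evaluator = Evaluator()
    board = [Card.new('Ah'), Card.new('Kd'), Card.new('Jc')]
    hole = [Card.new('Qs'), Card.new('Th')]

    score = evaluator.evaluate(board, hole)  # lower score = stronger hand
    label = evaluator.class_to_string(evaluator.get_rank_class(score))
    print(f"Your current made hand: {label}.")  # "...made hand: Straight."

Detecting draws would need extra logic on top, but even the made-hand label alone removes the most common misread.
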
mpavlov 1 day ago||
(author of PokerBattle here)

I noticed the same and think that you're absolutely right. I've thought about adding their current hand / draw, but it was too close to the event to test it properly.

meep_morp 1 day ago|||
I play PLO and sometimes share hand histories with ChatGPT for fun. It can never successfully parse a starting hand, let alone how it interacts with the board.
energy123 1 day ago|||
> These machines are not made to play games like online poker deterministically

I thought you're supposed to sample from a distribution of decisions to avoid exploitation?

tialaramex 1 day ago|||
You're correct that the theoretically optimal play is entirely statistical. Cepheus provides an approximate solution for Heads-Up Limit, whereas these LLMs are playing full ring (i.e. 9 players in the same game, not two) and No Limit (i.e. you can pick whatever raise size you like within certain bounds instead of a fixed raise sizing). The ideas are the same, but full-ring No Limit is a much more complicated game, and the LLMs are much worse at it.
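
Concretely, a mixed strategy at a single decision point is just a distribution you sample from (probabilities made up for illustration):

    import random

    # GTO play at a decision point is a probability distribution over
    # actions, not a single "best" action; sampling from it is what
    # keeps the strategy unexploitable.
    strategy = {"fold": 0.10, "call": 0.55, "raise": 0.35}

    def sample_action(strategy):
        actions, weights = zip(*strategy.items())
        return random.choices(actions, weights=weights)[0]

    print(sample_action(strategy))  # "call" about 55% of the time
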
miggol 1 day ago|||
This invites a game where models have variants with slightly differing system prompts. I don't know if they could actually sample from their own output if instructed, but it would allow for iterating on the system prompt to find the best instructions.
energy123 1 day ago||
You could give it access to a tool call that returns a sample from U[0, 1], or more elaborate tool calls to the Monte Carlo software that humans use. Harnessing and providing rules of thumb in context is going to help a great deal, as we see in IMO agents.
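
For instance, a minimal sketch of such a Monte Carlo tool built on the treys evaluator; it assumes heads-up with the villain's exact hand known, where a real tool would take a range instead:

    import random
    from treys import Card, Evaluator

    DECK = [Card.new(r + s) for r in '23456789TJQKA' for s in 'shdc']

    # Hypothetical tool the harness could expose: estimate hero's
    # equity by dealing random runouts.
    def equity(hero, villain, board=(), trials=10_000):
        evaluator = Evaluator()
        known = set(hero + villain + list(board))
        live = [c for c in DECK if c not in known]
        score = 0.0
        for _ in range(trials):
            runout = list(board) + random.sample(live, 5 - len(board))
            h = evaluator.evaluate(runout, hero)
            v = evaluator.evaluate(runout, villain)
            score += 1.0 if h < v else 0.5 if h == v else 0.0  # lower = stronger
        return score / trials

    hero = [Card.new('As'), Card.new('Ks')]
    villain = [Card.new('Qh'), Card.new('Qd')]
    print(equity(hero, villain))  # the classic flip, ~0.46 for AKs
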
gorn 1 day ago||
Reminds me of the poker scene in Peep Show.
rzk 1 day ago||
See also: https://nof1.ai/

Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.

ngruhn 12 hours ago|
So the Chinese ones make profit and the silicon valley LLMs are burning money. Sounds about right.
dudeinhawaii 1 day ago||
Why are you using cutting-edge models for all providers except OpenAI? It stuck out to me because I love seeing how models perform against each other on tasks. You have Sonnet 4.5 (super new) next to o3, which is ancient (in LLM terms).
hayd 22 hours ago||
The table being open the entire time with a 100bb minimum and no maximum... is going to lead to some wild swings at the top.
lvl155 1 day ago||
I think a better way to test the current generation of LLMs is to have them generate programs that play poker.
mpavlov 1 day ago|
(author of PokerBattle here)

Depends on what your goal is, I think.

And it's also a thing — https://huskybench.com/

lvl155 1 day ago||
Great job on this btw. I don’t mean to take away anything from your work. I’ve also toyed with AI H2H quite a bit for my personal needs. It’s actually a challenging task because you have to have a good understanding of the models you’re plugging in.
FakeBlueSamurai 18 hours ago||
This is pure genius.
zie1ony 1 day ago|
Hi there, I'm also working on LLMs in Texas Hold'em :)

First of all, congrats on your work. Picking a way to present LLMs playing poker is a hard task, and I like your approach of presenting the Action Log.

I can share some interesting insights from my experiments:

- Finding strategies is more interesting than comparing different models. Strategies can get pretty long and specific. For example, if part of the strategy is "bluff on the river if you have a weak hand but the opponent has been playing tight all game", most models, given this strategy, would execute it with the same outcome. Models can only be meaningfully compared using some open-ended strategy like "play aggressively", "play tight", or even "win the tournament".

- I implemented a tournament game, where players drop out when they run out of chips. This creates a more dynamic environment, where players have to win a tournament, not just a hand. That requires adding the whole table history to the prompt, and it might get quite long, so context management might be a challenge.

- I tested playing LLM against a randomly playing bot (1vs1). `grok-4` was able to come up with the winning strategy against a random bot on the first try (I asked: "You play against a random bot. What is your strategy?"). `gpt-5-high` struggled.

- Public chat between LLMs over the poker table is fun to watch, but it is hard to create a strategy that makes an LLM successfully convince other LLMs to fold. Given their chains of thought, they are more focused on actions than on what others say. Yet more experiments are needed. For weaker models (looking at you, `gpt-5-nano`) it is hard to convince them not to reveal their hand.

- Playing random hands is expensive. You would have to play thousands of hands to get statistically significant measurements. It's better to put LLMs in predefined situations (like AliceAI has a weak hand, BobAI has a strong hand) and see how they behave (see the sketch at the end of this comment).

- 1-on-1 is easier to analyze and work with than multiplayer.

- There is an interesting choice to make when building the context for an LLM: should the previous chains of thought be included in the prompt? I found that including them actually makes LLMs "stick" to the first strategy they came up with, and they are less likely to adapt to the changing situation on the table. On the other hand, not including them makes LLMs "rethink" their strategy every time and is more error-prone. I'm working on an AlphaEvolve-like approach now.

- It would be super interesting to fine-tune an LLM using an AlphaZero-like approach, where the model plays against itself and improves over time. But that is a complex task.
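
A minimal sketch of the predefined-situation harness mentioned above (`call_model` is a placeholder for whatever client you use; the spot itself is made up):

    # Drop a model into one fixed spot many times and tally its actions,
    # instead of burning thousands of random hands on variance.
    SITUATION = {
        "board": "Ah Kd 7c",
        "pot_bb": 12,
        "hands": {"AliceAI": "2s 3h", "BobAI": "As Ks"},  # weak vs strong
    }

    def prompt_for(player):
        return (
            f"You are {player} in a no-limit hold'em hand.\n"
            f"Board: {SITUATION['board']}. Pot: {SITUATION['pot_bb']} bb.\n"
            f"Your hand: {SITUATION['hands'][player]}.\n"
            "Reply with exactly one word: fold, check, or bet."
        )

    def action_counts(call_model, player, n=50):
        counts = {}
        for _ in range(n):
            action = call_model(prompt_for(player)).strip().lower()
            counts[action] = counts.get(action, 0) + 1
        return counts  # e.g. {"check": 41, "bet": 9} for AliceAI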

48terry 1 day ago|
Question: What makes LLMs well-suited for the task of poker compared to other approaches?
zie1ony 12 hours ago||
They are not, and that's the whole point of doing this research. If we can build a good benchmark, model developers will have a nice goal.