Posted by SweetSoftPillow 1 day ago
https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...
Poor o3 folded the nut flush pre..
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
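For reference, a minimal sketch of what that duplicate heads-up format could look like (`play_hand`, the seeding, and the chip accounting are placeholder assumptions, not the actual harness):

```python
import random

def play_hand(button_model: str, bb_model: str, seed: int) -> int:
    """Placeholder: play one heads-up hand with a deck shuffled from `seed`
    and return the button player's chip result. Stand-in for a real engine."""
    raise NotImplementedError

def duplicate_match(model_a: str, model_b: str, n_hands: int = 10_000) -> float:
    """Play each deal twice with positions swapped so card luck cancels out
    and only the decision-making differs. Returns avg chips/hand for model A."""
    total_a = 0
    for _ in range(n_hands):
        seed = random.randrange(2**32)
        total_a += play_hand(model_a, model_b, seed)  # A on the button
        total_a -= play_hand(model_b, model_a, seed)  # same deal, B on the button (zero-sum)
    return total_a / (2 * n_hands)
```

Replaying the same deal from both positions removes a lot of the card-luck variance, which is why duplicate formats are the standard in computer poker matches.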
> Gemini folds K10 with an Ace and a King on the board, without Grok betting anything. Gemini just folds instead of checking.
It's well known that Gemini has low coding self-esteem. It's hilarious to see that it applies to poker as well.

I noticed the same and think that you're absolutely right. I've thought about adding their current hand / draw, but it was too close to the event to test it properly.
I thought you're supposed to sample from a distribution of decisions to avoid exploitation?
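A minimal illustration of what that sampling would look like (the action frequencies here are made up for the example, not solver output):

```python
import random

# Made-up mixed strategy for a single decision point: to stay hard to exploit,
# you sample from the distribution instead of always taking the highest-EV line.
strategy = {"fold": 0.10, "call": 0.55, "raise": 0.35}

def sample_action(strategy: dict[str, float]) -> str:
    actions, weights = zip(*strategy.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(strategy))  # e.g. "call"
```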
Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.
Depends on what your goal is, I think.
And it's also a thing — https://huskybench.com/
First of all, congrats on your work. Picking a format for presenting LLMs that play poker is a hard task, and I like your approach of presenting the Action Log.
I can share some interesting insights from my experiments:
- Finding strategies is more interesting than comparing different models. Strategies can get pretty long and specific. For example, if part of the strategy is: "bluff on the river if you have a weak hand but the opponent has been playing tight all game", most models, given this strategy, would execute it with the same outcome. Models can only be meaningfully compared using some open-ended strategy like "play aggressively" or "play tight", or even "win the tournament".
- I implemented a tournament game, where players drop out when they run out of chips. This creates a more dynamic environment, where players have to win a tournament, not just a hand. That requires adding the whole table history to the prompt, and it might get quite long, so context management might be a challenge.
- I tested playing LLM against a randomly playing bot (1vs1). `grok-4` was able to come up with the winning strategy against a random bot on the first try (I asked: "You play against a random bot. What is your strategy?"). `gpt-5-high` struggled.
- Public chat between LLMs over the poker table is fun to watch, but it is hard to create a strategy that makes an LLM successfully convince other LLMs to fold. Judging by their chains of thought, they focus more on actions than on what others say. Still, more experiments are needed. For weaker models (looking at you, `gpt-5-nano`) it is hard to convince them not to reveal their hand.
- Playing random hands is expensive. You would have to play thousands of hands to reach statistical significance. It's better to put LLMs in predefined situations (like AliceAI has a weak hand, BobAI has a strong hand) and see how they behave.
- 1-on-1 is easier to analyze and work with than multiplayer.
- There is an interesting choice to make when building the context for an LLM: should the previous chains of thought be included in the prompt? I found that including them actually makes LLMs "stick" to the first strategy they came up with, and they are less likely to adapt to the changing situation at the table. On the other hand, not including them makes LLMs "rethink" their strategy every time and is more error-prone (see the sketch after this list). I'm working on an AlphaEvolve-like approach now.
- It would be super interesting to fine-tune an LLM using an AlphaZero-like approach, where the model plays against itself and improves over time. But this is a complex task.
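To make the predefined-scenario and context-building points concrete, here is a rough sketch of the trade-off; `Scenario`, `build_prompt`, the prompt wording, and the `include_prior_cot` flag are all illustrative, not my actual harness:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Illustrative predefined situation, e.g. AliceAI weak vs. BobAI strong."""
    hero_hand: str
    board: str
    table_history: list[str]                             # prior actions this hand/tournament
    prior_cot: list[str] = field(default_factory=list)   # hero's earlier reasoning

def build_prompt(s: Scenario, include_prior_cot: bool) -> str:
    parts = [
        f"Your hand: {s.hero_hand}",
        f"Board: {s.board}",
        "Action so far:",
        *s.table_history,
    ]
    if include_prior_cot:
        # Including earlier reasoning keeps the model consistent, but it tends
        # to anchor on its first plan instead of adapting to new information.
        parts += ["Your earlier reasoning:", *s.prior_cot]
    parts.append("Decide: fold, call, or raise. Explain briefly.")
    return "\n".join(parts)
```

The flag is the whole trade-off in one place: with it on, the model stays consistent but anchors on its first plan; with it off, it re-derives a strategy on every decision and makes more isolated mistakes.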