Posted by SweetSoftPillow 1 day ago
https://pokerbattle.ai/hand-history?session=37640dc1-00b1-4f...
Poor o3 folded the nut flush pre..
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
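For reference, a minimal sketch of what that duplicate heads-up format could look like (`play_hand`, the seeding, and the chip accounting are placeholder assumptions, not the actual harness):

```python
import random

def play_hand(button_model: str, bb_model: str, seed: int) -> int:
    """Placeholder: play one heads-up hand with a deck shuffled from `seed`
    and return the button player's chip result. Stand-in for a real engine."""
    raise NotImplementedError

def duplicate_match(model_a: str, model_b: str, n_hands: int = 10_000) -> float:
    """Play each deal twice with positions swapped so card luck cancels out
    and only the decision-making differs. Returns avg chips/hand for model A."""
    total_a = 0
    for _ in range(n_hands):
        seed = random.randrange(2**32)
        total_a += play_hand(model_a, model_b, seed)  # A on the button
        total_a -= play_hand(model_b, model_a, seed)  # same deal, B on the button (zero-sum)
    return total_a / (2 * n_hands)
```

Replaying the same deal from both positions removes a lot of the card-luck variance, which is why duplicate formats are the standard in computer poker matches.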
> Gemini folds K10 with an Ace and a King on the board, without Grok betting anything. Gemini just folds instead of checking.
It's well known that Gemini has low coding self-esteem. It's hilarious to see that it applies to poker as well.

I noticed the same and think that you're absolutely right. I've thought about adding their current hand / draw, but it was too close to the event to test it properly.
I thought you're supposed to sample from a distribution of decisions to avoid exploitation?
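A minimal illustration of what that sampling would look like (the action frequencies here are made up for the example, not solver output):

```python
import random

# Made-up mixed strategy for a single decision point: to stay hard to exploit,
# you sample from the distribution instead of always taking the highest-EV line.
strategy = {"fold": 0.10, "call": 0.55, "raise": 0.35}

def sample_action(strategy: dict[str, float]) -> str:
    actions, weights = zip(*strategy.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(strategy))  # e.g. "call"
```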
Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.
Depends on what your goal is, I think.
And it's also a thing — https://huskybench.com/
First of all, congrats on your work. Picking a format for presenting LLMs that play poker is a hard task, and I like your approach of presenting the Action Log.
I can share some interesting insights from my experiments:
- Finding strategies is more interesting than comparing different models. Strategies can get pretty long and specific. For example, if part of the strategy is: "bluff on the river if you have a weak hand but the opponent has been playing tight all game", most models, given this strategy, would execute it with the same outcome. Models can only be meaningfully compared using some open-ended strategy like "play aggressively" or "play tight", or even "win the tournament".
- I implemented a tournament game, where players drop out when they run out of chips. This creates a more dynamic environment, where players have to win a tournament, not just a hand. That requires adding the whole table history to the prompt, and it might get quite long, so context management might be a challenge.
- I tested playing LLM against a randomly playing bot (1vs1). `grok-4` was able to come up with the winning strategy against a random bot on the first try (I asked: "You play against a random bot. What is your strategy?"). `gpt-5-high` struggled.
- Public chat between LLMs over the poker table is fun to watch, but it is hard to create a strategy that makes an LLM successfully convince other LLMs to fold. Judging by their chains of thought, they focus more on actions than on what others say. Still, more experiments are needed. For weaker models (looking at you, `gpt-5-nano`) it is hard to convince them not to reveal their hand.
- Playing random hands is expensive. You would have to play thousands of hands to reach statistical significance. It's better to put LLMs in predefined situations (like AliceAI has a weak hand, BobAI has a strong hand) and see how they behave.
- 1-on-1 is easier to analyze and work with than multiplayer.
- There is an interesting choice to make when building the context for an LLM: should the previous chains of thought be included in the prompt? I found that including them actually makes LLMs "stick" to the first strategy they came up with, and they are less likely to adapt to the changing situation at the table. On the other hand, not including them makes LLMs "rethink" their strategy every time and is more error-prone (see the sketch after this list). I'm working on an AlphaEvolve-like approach now.
- It would be super interesting to fine-tune an LLM using an AlphaZero-like approach, where the model plays against itself and improves over time. But this is a complex task.
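To make the predefined-scenario and context-building points concrete, here is a rough sketch of the trade-off; `Scenario`, `build_prompt`, the prompt wording, and the `include_prior_cot` flag are all illustrative, not my actual harness:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Illustrative predefined situation, e.g. AliceAI weak vs. BobAI strong."""
    hero_hand: str
    board: str
    table_history: list[str]                             # prior actions this hand/tournament
    prior_cot: list[str] = field(default_factory=list)   # hero's earlier reasoning

def build_prompt(s: Scenario, include_prior_cot: bool) -> str:
    parts = [
        f"Your hand: {s.hero_hand}",
        f"Board: {s.board}",
        "Action so far:",
        *s.table_history,
    ]
    if include_prior_cot:
        # Including earlier reasoning keeps the model consistent, but it tends
        # to anchor on its first plan instead of adapting to new information.
        parts += ["Your earlier reasoning:", *s.prior_cot]
    parts.append("Decide: fold, call, or raise. Explain briefly.")
    return "\n".join(parts)
```

The flag is the whole trade-off in one place: with it on, the model stays consistent but anchors on its first plan; with it off, it re-derives a strategy on every decision and makes more isolated mistakes.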