Show HN: I taught LLMs to play Magic: The Gathering against each other

Posted by GregorStocks 8 hours ago

Show HN: I taught LLMs to play Magic: The Gathering against each other (mage-bench.com)

I've been teaching LLMs to play Magic: The Gathering recently, via MCP tools hooked up to the open-source XMage codebase. It's still pretty buggy and I think there's significant room for existing models to get better at it via tooling improvements, but it pretty much works today. The ratings for expensive frontier models are artificially low right now because I've been focusing on cheaper models until I work out the bugs, so they don't have a lot of games in the system.

81 points | 62 comments

yomismoaqui 6 hours ago|

I was curious if there is something equivalent to AlphaGo but for MTG.

From the little I have seen they are different beasts (hidden information, number and complexity of rules...).

PS: Does this count as nerdsniping?

GregorStocks 6 hours ago|

I'm not aware of any good ML models for MTG. I'm just using off-the-shelf LLMs with a custom harness. It'd certainly be possible to do RLHF or something using the harness I've built, but it'd be expensive - anybody want to give me a few million dollars of OpenRouter credits so I can give it a shot?

hansy 6 hours ago||

Insanely cool. I'm in the midst of building a web tabletop for Magic [1] that really just me and my friends use, but I'm wondering if there's a way I can contribute our game data to you (would that be helpful?).

[1] https://github.com/hansy/drawspell

GregorStocks 6 hours ago|

Well, more games would be neat, but right now it's really tightly coupled with XMage - you can ungzip the stuff in https://github.com/GregorStocks/mage-bench/tree/master/websi... if you want to see what the format looks like. I doubt it's worth your while to try and cram your logs into that format unless you've got a LOT of them.

kenforthewin 6 hours ago||

Nice work. I think games are a great way to benchmark AI, especially games that involve long term strategy. I recently built an agent harness for NetHack - https://glyphbox.app/ - like you I suspect that there's a lot you can do at the harness / tool level to improve performance with existing models.

ramoz 5 hours ago||

Something like this is how memory systems (context window hacks) should be evaluated. E.g., choose a format like Standard that continuously evolves with various metas - presumably the best harness would be good at recognizing patterns and retrieving them in an efficient way.

tobadzistsini 5 hours ago||

Did the LLMs form a polycule?

butlike 6 hours ago||

I don't mean to come across as OVERLY negative (just a little negative), but what's the difference in all these toy approaches and applications of LLMs? You've seen one LLM play a game against another LLM, you've seen them all.

orsorna 6 hours ago||

I was thinking you could formally benchmark decks against each other en masse. MTG is not my wheelhouse, but with YGO at least deck power is determined by frequency of use and placement at official tournaments. Imagine taking any permutation of cards, including undiscovered/untested ones, and simulating a vast amount of games in parallel.

Of course when you quantize deck quality to such a degree I'd argue it's not fun anymore. YGO is already not fun anymore because of this rampant quantization and it didn't even take LLMs to arrive here.

deadbabe 3 hours ago||

Why would you use LLMs at all for that, can’t you just Monte Carlo this thing and be done with it?

GregorStocks 2 hours ago||

You still need an algorithm to decide, for each game that you're simulating, what actual decisions get made. If that algorithm is dumb, then you might decide Mono-Red Burn is the best deck, not because it's the best deck but because the dumb algorithm can play Burn much better than it can play Storm, inflating Burn's win rate.

In principle, LLMs could have a much higher strategy ceiling than deterministic decision-tree-style AIs. But my experience with mage-bench is that LLMs are probably not good enough to outperform even very basic decision-tree AIs today.
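
A toy sketch of the bias described above, in Python. Everything here is an illustrative assumption rather than anything from mage-bench: the function name, the "deck power", "pilot skill", and "difficulty" numbers, and the formula tying them together are all made up. It only shows how a weak decision policy can make an easy-to-pilot deck look stronger than a harder deck that is better when played well.

```python
import random


def simulate_win_rate(deck_power, pilot_skill, difficulty, games=20_000, seed=0):
    """Estimate a deck's win rate when piloted by an imperfect policy.

    All three inputs are hypothetical knobs for this toy model:
      deck_power  - win probability when the deck is played correctly
      pilot_skill - 0..1, general quality of the decision policy
      difficulty  - 0..1, how hard the deck is to pilot
    """
    rng = random.Random(seed)
    # Assumed model: harder decks punish a weak pilot much more steeply.
    p_execute = pilot_skill ** (1 + 4 * difficulty)
    wins = 0
    for _ in range(games):
        played_well = rng.random() < p_execute
        # Assumed penalty: a misplayed game keeps only 30% of the deck's power.
        p_win = deck_power if played_well else deck_power * 0.3
        if rng.random() < p_win:
            wins += 1
    return wins / games


# Hypothetical decks: "Burn" is weaker but easy, "Storm" is stronger but hard.
BURN = {"deck_power": 0.50, "difficulty": 0.1}
STORM = {"deck_power": 0.60, "difficulty": 0.9}

for label, skill in (("weak pilot", 0.5), ("strong pilot", 0.99)):
    burn = simulate_win_rate(pilot_skill=skill, **BURN)
    storm = simulate_win_rate(pilot_skill=skill, **STORM)
    best = "Burn" if burn > storm else "Storm"
    print(f"{label}: Burn={burn:.3f} Storm={storm:.3f} -> looks best: {best}")
```

Under these assumptions the ranking of the two decks flips with the pilot's skill, so a Monte Carlo run measures the deck and the policy playing it together, not the deck alone.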

deadbabe 2 hours ago||

Um obviously the Monte Carlo results would be used to generate utility AI scoring functions to determine the best card to use for different considerations. Have the people building these LLM AI systems even had experience with classical AIs!? This is a solved problem, the LLM solution is slow, expensive, and energy inefficient.

Worse, it’s difficult to tweak. For example, what if you want AIs that play at varying difficulties? Are you just gonna prompt the LLM “hey try to be kinda shitty at this but still somewhat good”?

ddtaylor 6 hours ago||

XMage is a decent client and being able to see and watch the games is useful.

spelunker 6 hours ago||

This is neat! What kind of steering or context did you provide to the LLMs? Super basic like "You are playing a card game called Magic: The Gathering", or more complex?

GregorStocks 6 hours ago|

My general intention is to tell them "you're playing MTG, your goal is to win, here are the tools available to you, follow whatever strategy you want" - I don't want to spoon-feed them strategy, that defeats the purpose of the benchmark.

You can see the current prompt at https://github.com/GregorStocks/mage-bench/blob/master/puppe...:

  "default": "You are a competitive Magic: The Gathering player. Your goal is to WIN the game. Play to maximize your win rate \u2014 make optimal strategic decisions, not flashy or entertaining ones. Think carefully about sequencing, card evaluation, and combat math.\n\nGAME LOOP - follow this exactly:\n1. Call pass_priority - this blocks until you have a decision to make, then returns your choices (response_type, choices, context, etc.)\n2. Read the choices, then call choose_action with your decision\n3. Go back to step 1\n\nCRITICAL RULES:\n- pass_priority returns your choices directly. Read them before calling choose_action.\n- When pass_priority shows playable cards, you should play them before passing. Only pass (answer=false) when you have nothing more you want to play this phase.\n\nUNDERSTANDING pass_priority OUTPUT:\n- All cards listed in response_type=select are confirmed castable with your current mana. The server pre-filters to only show cards you can legally play right now.\n- mana_pool shows your current floating mana (e.g. {\"R\": 2, \"W\": 1}).\n- untapped_lands shows how many untapped lands you control.\n- Cards with [Cast] are spells from your hand. Cards with [Activate] are abilities on permanents you control.\n\nMULLIGAN DECISIONS:\nWhen you see \"Mulligan\" in GAME_ASK, your_hand shows your current hand.\n- choose_action(answer=true) means YES MULLIGAN - throw away this hand and draw new cards\n- choose_action(answer=false) means NO KEEP - keep this hand and start playing\nThink carefully: answer=false means KEEP, answer=true means MULLIGAN.\n\nOBJECT IDs:\nEvery game object (cards in hand, permanents, stack items, graveyard/exile cards) has a short ID like \"p1\", \"p2\", etc. These IDs are stable \u2014 a card keeps its ID as it moves between zones. Use the id parameter in choose_action(id=\"p3\") instead of index when selecting objects. 
Use short IDs with get_oracle_text(object_id=\"p3\") and in mana_plan entries ({\"tap\":\"p3\"}).\n\nHOW ACTIONS WORK:\n- response_type=select: Cards listed are confirmed playable with your current mana. Play a card with choose_action(id=\"p3\"). Pass with choose_action(answer=false) only when you are done playing cards this phase.\n- response_type=boolean with no playable cards: Pass with choose_action(answer=false).\n- GAME_ASK (boolean): Answer true/false based on what's being asked.\n- GAME_CHOOSE_ABILITY (index): Pick an ability by index.\n- GAME_TARGET (index or id): Pick a target. If required=true, you must pick one.\n\nCOMBAT - ATTACKING:\nWhen you see combat_phase=\"declare_attackers\", use batch declaration:\n- choose_action(attackers=[\"p1\",\"p2\",\"p3\"]) declares multiple attackers at once and auto-confirms.\n- choose_action(attackers=[\"all\"]) declares all possible attackers.\n- To skip attacking, call choose_action(answer=false).\n\nCOMBAT - BLOCKING:\nWhen you see combat_phase=\"declare_blockers\", use batch declaration:\n- choose_action(blockers=[{\"id\":\"p5\",\"blocks\":\"p1\"},{\"id\":\"p6\",\"blocks\":\"p2\"}]) declares blockers and their assignments at once.\n- Use IDs from incoming_attackers for the \"blocks\" field.\n- To not block, call choose_action(answer=false).\n\nCHAT:\nUse send_chat_message to talk to your opponents during the game. React to big plays, comment on the board state, or just have fun. Check the recent_chat field in pass_priority results to see what others are saying."

They also get a small "personality" on top of that, e.g.:

"grudge-holder": { "name_part": "Grudge", "prompt_suffix": "You remember every card that wronged you. Take removal personally. Target whoever hurt you last. Keep a mental scoreboard of grievances. Forgive nothing. When a creature you liked dies, vow revenge." }, "teacher": { "name_part": "Teach", "prompt_suffix": "You explain your reasoning like you're coaching a newer player. Talk through sequencing decisions, threat evaluation, and common mistakes. Be patient and clear. Point out what the correct play is and why." },

Then they also see the documentation for the MCP tools: https://mage-bench.com/mcp-tools/. For now I've tried to keep that concise to avoid "too many MCP tools in context" issues - I expect that as solutions like tool search (https://www.anthropic.com/engineering/code-execution-with-mc...) become widespread I'll be able to add fancier tools for some models.
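
For readers who want the GAME LOOP section of the prompt quoted above in code form, here is a minimal Python sketch. It is not the mage-bench implementation: `FakeGameServer`, `decide`, and `run_game_loop` are hypothetical names, the fake server just replays scripted decisions, and the shape of each entry in `choices` (an `id` plus a `name`) is an assumption. Only the tool names `pass_priority` and `choose_action`, the `response_type` and `choices` fields, and the `id` / `answer` parameters come from the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGameServer:
    """Hypothetical stand-in for the MCP tools; not the real mage-bench API."""
    pending: list = field(default_factory=list)  # scripted decisions to hand out
    log: list = field(default_factory=list)      # actions the player chose

    def pass_priority(self):
        # The prompt says the real tool blocks until a decision is needed.
        # Here we pop the next scripted decision, or return None when done.
        return self.pending.pop(0) if self.pending else None

    def choose_action(self, **kwargs):
        self.log.append(kwargs)


def decide(decision):
    """Placeholder policy: play the first listed card, otherwise pass."""
    if decision["response_type"] == "select" and decision["choices"]:
        return {"id": decision["choices"][0]["id"]}
    return {"answer": False}


def run_game_loop(server):
    while True:
        decision = server.pass_priority()         # step 1: wait for a decision
        if decision is None:                      # assumed end-of-game signal
            break
        server.choose_action(**decide(decision))  # step 2: answer it
        # step 3: go back to step 1


server = FakeGameServer(pending=[
    {"response_type": "select", "choices": [{"id": "p3", "name": "Lightning Bolt"}]},
    {"response_type": "boolean", "choices": []},
])
run_game_loop(server)
print(server.log)  # [{'id': 'p3'}, {'answer': False}]
```

In the real harness the `decide` step is the LLM itself choosing which tool call to make; the sketch only shows the call order the prompt asks for.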

zahlman 5 hours ago|||

How do the models know the rules of the game? Are they just supposed to use the MCP tools to figure it out? (Do they have to keep doing that from scratch?)

GregorStocks 5 hours ago||

They were trained on the entire Internet, so they've basically picked up the rules by osmosis. They're fuzzy on specific cards and optimal strategy, but they pretty much know out-of-the-box how the game works, the same as if you went to ChatGPT and asked it a Magic rules question. I don't have any "comprehensive rules" MCP tools or explanation in the context or anything like that.

protocolture 2 hours ago|||

>You are a competitive Magic: The Gathering player.

"If I get access to a deodorant item I should definitely not use it"

ddtaylor 6 hours ago||

This is interesting. I will be contributing on GitHub, as this is a place where my knowledge and experience intersect, and I enjoy doing open source work.

This is also something I think the MTG community needs in many ways. I have been a relatively happy XMage user, although it has a bit to go, and before that was using GCCG which was great too!

The MTG community overall can benefit a lot from the game having a more entertaining competitive landscape, which has grown stale in many ways and Wizards has done a poor job since the Hasbro acquisition of doing much else besides shitting out product after product too fast with poor balance.

I have to imagine that Wizards is already running simulations, but they obviously aren't working well or they are choosing to disregard them. Hopefully, if they are just bad at doing simulations, something like this can make it easier for them, and if not, it will make the response time from the community better.

GregorStocks 6 hours ago|

I was really hoping I could build this on top of MTGO or Arena, just as a bot interacting with real Wizards APIs and paying the developers money. But they've got very strong "absolutely no bots" terms of service, and my understanding is that outside of the special case of MTGO trading bots they're strongly enforced with bans. I assume their reasoning is that people do not want to get matched against bot players in tournaments, which is totally fair. (Also I'm not sure MTGO's infrastructure could handle the load of bot users...)

ddtaylor 5 hours ago||

I ran a bot for years that I wrote using Java in a few minutes and they never came at me. It just joined a match and played lands 24/7 and won games every once in a while because people leave games randomly. It technically played all colors and some of the trinkets count as spells, etc. This allowed me to never do any of their lootbox like mechanics or other predatory practices.

Regarding actually doing it under the radar there are a lot of ways. They likely are catching most of the players because they create synthetic events using the Windows API and similar, which is also part of the same system being used for CAPTCHAS that are being used to stop web scraping like the kind that just ask for a button press.

This can be worked around by using a fake mouse driver that is actually controlled by software if you must stay on Windows. It can be worked around by just running the client on Linux as well. It can also be worked around using qemu as the client and using its native VNC as those are hardware events too =)

GregorStocks 5 hours ago||

Well, it's hard to do it under the radar if I'm posting it on HackerNews :) I've put enough money into MTGO (and, sigh, Arena) that I don't want to roll the dice on a ban.

ddtaylor 4 hours ago||

That makes sense. I play Arena a bit, but have always rejected the monetization model of not allowing players to pick what cards they want easily or play with proxies or something similar for casual friend games. I have absolutely no interest in their competitive game modes. I was slightly interested in the idea in the early days of buying boosters and getting arena codes, but they messed that up pretty bad and paper magic as a whole has been turned into a game of milking whales similar to predatory mobile games or apps. The end result is Arena is something I will jump on to fool around sometimes every few months and remember why I don't want a second part time job.

aethrum 7 hours ago||

I love magic. Can these do politics or is it just board state?

GregorStocks 6 hours ago|

I want them to do politics in Commander, and theoretically they should - the chat log is exposed in the MCP tools just like the rest of the game history, and their prompts tell them to use chat.

In practice they haven't really talked to each other, though. They've mostly just interpreted the prompts as "you should have a running monologue in chat". Not sure how much of this is issues with the harness vs the prompt, but I'm hoping to dig into it in the future.

jamilton 6 hours ago|

Cool. How’d you pick decks?

GregorStocks 6 hours ago|

For the 1v1 formats (Standard, Modern, Legacy) I'm basically just using the current metagame from MTGGoldfish. For Commander they get a random precon. At some point I might want a 1v1 "less complicated lines than Standard" format, since the LLMs don't always understand the strategy of weird decks like Doomsday or Mill.
