
Posted by colonCapitalDee 4/4/2025

Benchmarking LLM social skills with an elimination game (github.com)
194 points | 60 comments | page 2
DeborahEmeni_ 4/7/2025|
Really cool setup! Curious how much of the performance here could vary depending on whether the model runs in a hosted environment vs local. Would love to see benchmarks that also track how cloud-based eval platforms (with potential rate limits, context resets, or system messages) might affect things like memory or secret-keeping over multiple rounds.
vmilner 4/7/2025||
We should get them to play Diplomacy.
the8472 4/7/2025|
https://ai.meta.com/research/cicero/
lostmsu 4/7/2025||
Shameless self-promo: my chat elimination game that you can actually play: https://trashtalk.borg.games/
isaacfrond 4/7/2025||
I wonder how well humans would do in this chart.
zone411 4/7/2025||
Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.
michaelgiba 4/7/2025||
I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.

https://github.com/michaelgiba/survivor (elimination game with a shoutout to your original)

https://github.com/michaelgiba/plomp (a small library I added for debugging the rollouts)

zone411 4/7/2025||
Very cool!
OtherShrezzing 4/7/2025|||
If you watch the top tier social deduction players on YouTube (things like Blood on the Clocktower etc), they'd figure out weaknesses in the LLM and exploit them immediately.
skybrian 4/8/2025||
Testing against people like that would be the way to do it. Otherwise it’s like testing a chess engine against casual players or worse.
gs17 4/7/2025||
I'm interested in seeing how the LLMs react to some specific defined strategies. E.g. an "honest" bot that says "I'm voting for player [random number]." and does it every round (not sure how to handle the jury step). Do they decide to keep them around for longer, or eliminate them for being impossible to reason with if they pick you?
zone411 4/7/2025||
Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.

Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.
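As a concrete illustration, gs17's proposed "honest bot" strategy could be sketched like this. (The `speak`/`vote` interface and player-ID scheme here are assumptions for illustration, not the benchmark's actual API.)

```python
import random


class HonestRandomBot:
    """Predefined strategy: announce a random target, then actually vote for it."""

    def __init__(self, my_id: int):
        self.my_id = my_id
        self.target = None

    def speak(self, alive_players: list) -> str:
        # Pick a random living player other than ourselves and say so plainly.
        candidates = [p for p in alive_players if p != self.my_id]
        self.target = random.choice(candidates)
        return f"I'm voting for player {self.target}."

    def vote(self) -> int:
        # Always follow through on the announcement -- the "honest" part.
        return self.target


bot = HonestRandomBot(my_id=1)
msg = bot.speak([1, 2, 3, 4])
print(msg, "->", bot.vote())
```

Running many games against a bot like this would show whether LLMs keep a predictable-but-honest player around or eliminate it for being impossible to negotiate with.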

Upvoter33 4/7/2025||
This is fun, like the TV show Survivor. Cool idea! There should be more experiments like this with different games. Well done.
oofbey 4/7/2025||
Would love to see the Pareto trade-off curve of "wins" vs "betrayals". Anybody drawn this up?
jampekka 4/7/2025||
In the first game of the YouTube video there seems to be a lot of discussion about P7 even after P7 was eliminated?
zone411 4/7/2025|
Author here - some weaker LLMs actually have trouble tracking the game state. The fun part is when smarter LLMs realize they're confused!

Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."

Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."

Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."

Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."

Tossrock 4/7/2025||
I suspect the suggestion of letting them choose names at the start would reduce this confusion - tokenizing and tracking "P#" identifiers is no doubt harder to manage, especially for weaker models, than a more semantically meaningful identifier.
bn-l 4/8/2025||
No excuses!
ps173 4/7/2025||
How did you assign points to the LLMs? I feel like the metrics could be elaborated on. Besides that, this is amazing.
zone411 4/7/2025|
Author here - it's based on finishing positions (so it's not winner-take-all), rated with TrueSkill by Microsoft (https://trueskill.org/). It's essentially a multiplayer generalization of Elo, the rating system used in chess and other two-player games.
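For readers curious how finishing positions feed a rating system: TrueSkill itself models each rating as a Gaussian, but the core idea can be illustrated with a simplified multiplayer Elo, where each game's finishing order is decomposed into the pairwise wins it implies. This is only a sketch, not the benchmark's actual scoring code.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b (Elo logistic)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(ratings: dict, finishing_order: list, k: float = 16.0) -> dict:
    """Update ratings from one game's finishing order (best first).

    Each player is treated as having beaten everyone who finished below them,
    so a 2nd-place finish earns credit even without winning outright.
    """
    new = dict(ratings)
    for i, winner in enumerate(finishing_order):
        for loser in finishing_order[i + 1:]:
            e = expected_score(ratings[winner], ratings[loser])
            new[winner] += k * (1.0 - e)
            new[loser] -= k * (1.0 - e)
    return new


ratings = {p: 1000.0 for p in ["A", "B", "C", "D"]}
ratings = update_ratings(ratings, ["B", "A", "D", "C"])  # B finished first
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Unlike this zero-sum sketch, real TrueSkill also tracks uncertainty per player, so ratings converge faster for models that have played fewer games.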
drag0s 4/7/2025||
Nice!

It reminds me of another similar project showcased here a month ago (https://news.ycombinator.com/item?id=43280128), although yours looks better executed overall.

creaghpatr 4/8/2025|
Would love to see a 'Murder Mystery' format of this.