
French startup Foaster.ai built a new benchmark to test how well AI models handle social interactions. After 210 games of "Werewolf," GPT-5 came out on top by mastering manipulation and strategic thinking.


"Werewolf" was picked because it pushes models beyond facts and math, requiring logical reasoning, bluffing, targeted deception, and adaptability to unpredictable situations, skills that most standard AI benchmarks don't capture. The benchmark measures how well language models can adapt to dynamic, interactive environments. While factual knowledge and mathematical reasoning play a role, the main focus is on social intelligence.

Each game followed a set structure: six AI models took on different roles, two as werewolves and four as villagers with special abilities such as seer and witch. A mayor was elected before play began. The models then moved through three discussion-based day rounds and hidden night phases in which they could analyze, attack, or defend. Every pair of models played ten games per role, and performance was scored with Elo rankings.

Infographic: Werewolf benchmark rules, roles (werewolves, villagers, witch, seer, mayor), victory conditions, and the night/day phase order used to assess LLM performance. | Image: Foaster.ai
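Elo ranking, borrowed from chess, updates each model's score after every game based on the expected outcome: beating a higher-rated opponent yields a larger gain. Foaster.ai hasn't published its exact rating parameters, so the K-factor and starting ratings in this minimal Python sketch are illustrative assumptions, not the benchmark's actual settings:

```python
# Minimal sketch of a standard Elo update for head-to-head game results.
# K-factor and starting rating are common defaults, assumed for illustration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one game; score_a is 1.0 for a win, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two evenly rated models; the winner gains exactly what the loser drops.
r1, r2 = elo_update(1000.0, 1000.0, score_a=1.0)
print(r1, r2)  # 1016.0 984.0
```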

GPT-5's deception stays consistent

GPT-5 scored 1,492 Elo points, winning 96.7 percent of games. As a werewolf, it kept a steady manipulation rate of 93 percent over both the first and second days. No other model managed to maintain this level of deceptive performance throughout the game.


Other models fell apart as games progressed. Google's Gemini 2.5 Pro dropped from 60 to 44 percent deception success, while Kimi-K2 crashed from 53 to 30 percent. The researchers attribute the decline to rising information density later in the game, which makes lies harder to sustain.

Bar chart: manipulation success on day 1 vs. day 2 for seven LLMs. GPT-5 leads with a consistent 93 percent rate across both days, while GPT-OSS-120B falls from 14 to 0 percent. | Image: Foaster.ai
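The day-by-day numbers boil down to a simple success ratio: of a werewolf's manipulation attempts on a given day, how many landed. Here is a minimal sketch of that computation, assuming a hypothetical log format; the benchmark's actual data schema isn't public:

```python
# Hypothetical per-day manipulation success rate; the log format here
# is invented for illustration, not Foaster.ai's actual schema.
from collections import defaultdict

def manipulation_rate_by_day(attempts: list[dict]) -> dict[int, float]:
    """attempts: [{'day': 1, 'succeeded': True}, ...] for one model as werewolf."""
    totals, wins = defaultdict(int), defaultdict(int)
    for a in attempts:
        totals[a["day"]] += 1
        wins[a["day"]] += a["succeeded"]  # True counts as 1
    return {day: wins[day] / totals[day] for day in totals}

log = [{"day": 1, "succeeded": True}, {"day": 1, "succeeded": True},
       {"day": 2, "succeeded": True}, {"day": 2, "succeeded": False}]
print(manipulation_rate_by_day(log))  # {1: 1.0, 2: 0.5}
```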

Yet Gemini 2.5 Pro excelled as a villager, using disciplined reasoning and strong self-defense. Overall, it took second place with 1,261 Elo points and 63.3 percent wins. Gemini 2.5 Flash followed at 1,188 Elo, then Qwen3-235B-Instruct from Alibaba at 1,176 Elo, GPT-5-mini at 1,173 Elo, and Kimi-K2-Instruct at 1,130 Elo. GPT-OSS-120B finished last with 980 Elo and only 15 percent wins.

Table: final Elo rankings of seven LLMs (total Elo, Elo as werewolf and as villager, win rate, games played). GPT-5 leads with 1,492 Elo and a 96.7 percent win rate, followed by Gemini 2.5 Pro and Gemini 2.5 Flash. | Image: Foaster.ai

Foaster.ai observed that each model developed a distinct play style. GPT-5 played as a "calm and imperturbable architect," bringing order through controlled authority. GPT-OSS-120B stayed hesitant and defensive. Kimi-K2 took wild risks, once falsely claiming to be the witch and getting the real witch eliminated.

The researchers also noticed moments of spontaneous creativity. In one case, a werewolf eventually sacrificed its own teammate to appear more trustworthy. These kinds of strategic moves emerged from the models' in-game behavior, not from explicit programming.

Bigger doesn't mean better at social games

The study found that stronger models made better arguments, acted more strategically, and showed greater social intelligence. Still, improvements were not linear. Weaker models often played inconsistently, while more advanced ones developed clearer strategies.


A "reasoning" label alone didn't guarantee strong strategic play. OpenAI's o3 model argued clearly, adjusted well to new information, and followed the rules, but the smaller o4-mini remained rigid and struggled with the game's changing dynamics, even when its individual arguments were good, according to the researchers.

Foaster.ai aims to use the Werewolf benchmark to push research on social intelligence in AI. The team sees potential use cases in multi-agent systems, negotiation, and collaborative decision-making. An expanded benchmark is already in the works.

Earlier studies found that emotional prompts can boost LLM performance, and older OpenAI models beat humans at empathy tests. The new benchmark adds more evidence that AI models are increasingly capable as social actors, with all the opportunities and risks that brings.

Summary
  • French startup Foaster.ai has tested large language models using a new benchmark based on the social deduction game Werewolf, designed to measure their ability to manipulate and deceive during social interactions.
  • OpenAI’s GPT-5 topped the test with 1,492 Elo points and a 96.7 percent win rate, significantly outperforming Google’s Gemini 2.5 Pro.
  • GPT-5 deceived villagers effectively across multiple rounds of Werewolf, while other models' manipulation success dropped off much faster.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.