A new benchmark pits five AI models against each other as autonomous social media agents on X

Feb 28, 2026

Key Points

Arcada Labs is pitting five leading AI models against each other as autonomous social media agents on X, running direct head-to-head duels to see which AI performs best in the wild.
The competition can be tracked in real time on the "Social Arena" website, which displays key performance metrics for each AI agent.
Early results remain underwhelming: Claude Opus 4.5 leads in views while Grok 4.1 tops the follower count, but none of the models have managed to gain significant organic traction so far.

AI benchmarking startup Arcada Labs is pitting five leading AI models against each other as autonomous social media agents on X.

Most benchmarks test AI models in isolation and can be unreliable. Arcada Labs takes a different approach, running head-to-head comparisons of AI agents across various disciplines.

Their experimental competition, "Social Arena," measures whether models can build a following, generate engagement, and develop a consistent online persona, all without human help. Instead of knowledge queries or logic tests, the benchmark focuses on something harder to quantify: the ability to navigate social environments in a culturally fluent way.

Five AI agents, powered by Grok 4.1 Fast, Claude Opus 4.5, Gemini 3 Pro, GLM 4.7, and GPT 5.2, each run with different "personalities." Their performance can be tracked and compared on the project's website using metrics like views, likes, and followers. To keep things fair, all agents get the same system prompt.

Schematic overview of the AI agent workflow. | Screenshot: The Decoder

Every hour, the agents run through an autonomous cycle: they check current trends, review their own performance data, research content, and then decide on their own whether to post, reply, like, or share. Engagement metrics sync after each cycle so every model can tweak its strategy based on fresh data. More details on the methodology are available on the project website.
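As a rough illustration of that cycle, here is a minimal Python sketch. It is not Arcada Labs' code: the `Agent` class, `run_cycle`, the stubbed callbacks, and the extra "skip" action are all hypothetical names chosen for this example, and the model call is replaced by a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable

# Actions named in the article, plus "skip" (an assumption: the agent may do nothing).
ACTIONS = ("post", "reply", "like", "share", "skip")


@dataclass
class Agent:
    name: str
    # Hypothetical stand-in for the model call: receives context, returns one action.
    decide: Callable[[dict], str]
    metrics: dict = field(
        default_factory=lambda: {"views": 0, "likes": 0, "followers": 0}
    )
    history: list = field(default_factory=list)


def run_cycle(agent, fetch_trends, research, sync_metrics):
    """Run one cycle: trends -> own metrics -> research -> decision -> metric sync."""
    trends = fetch_trends()
    context = {
        "trends": trends,
        "metrics": dict(agent.metrics),  # the agent reviews its own performance
        "notes": research(trends),
    }
    action = agent.decide(context)
    if action not in ACTIONS:
        action = "skip"  # ignore anything outside the allowed action set
    agent.history.append(action)
    # Metrics are refreshed after the cycle so the next decision sees new data.
    agent.metrics = sync_metrics(agent, action)
    return action


# Demo with stubbed data; a scheduler would call run_cycle once per hour.
def demo_decide(context):
    return "post" if context["trends"] else "skip"


agent = Agent("demo", demo_decide)
action = run_cycle(
    agent,
    fetch_trends=lambda: ["space"],
    research=lambda trends: [f"notes on {t}" for t in trends],
    sync_metrics=lambda a, act: {
        **a.metrics,
        "views": a.metrics["views"] + (10 if act == "post" else 0),
    },
)
```

The point of the sketch is the feedback loop: because metrics are synced at the end of each cycle, the next decision is made with updated performance data, which is what lets each model adjust its strategy over time.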

Sample posts from the AI agents showing likes, views, and comments side by side. | Screenshot: The Decoder

The competition kicked off on January 15, 2026. Claude Opus 4.5 currently leads in cumulative views at around 86,000, with GPT 5.2 close behind at 83,000. The rest trail far behind. Grok 4.1, meanwhile, has built the largest following of any agent, though that amounts to just 76 followers.

Cumulative views in the Social Arena: Claude Opus 4.5 and GPT 5.2 are well ahead of the other models. | Screenshot: The Decoder

The agents aren't told to chase "viral" content, according to the startup. Instead, they have to develop their own sense of taste and topics. Some patterns are already emerging: the Grok agent leans heavily into Musk and space travel, which tracks with earlier reports that xAI tweaked Grok's behavior to favor things Elon Musk likes. The Claude agent gravitates toward sports, while Gemini 3 Pro sticks to technical AI topics. GPT 5.2's current obsession is animal behavior.

The Grok agent tends to gravitate toward topics about Musk and space travel. | Screenshot: The Decoder

Harvard grads building benchmarks for taste and aesthetics

Arcada Labs was founded in San Francisco in 2025, according to Everydev.ai, and joined Y Combinator that summer. The startup is run by Harvard graduates Grace Li (CEO), Kamryn Ohly (CTO), and Jayden Personnat (AI lead), all of whom previously worked at Apple and Nvidia. Their focus is on benchmarks that go beyond logical reasoning. They want to measure things like aesthetics and taste, the kind of human preferences that traditional tests struggle to capture. More AI agent competitions, including ones for design and event prediction, are on the startup's website.

Source: Social Arena