AI benchmarking startup Arcada Labs is pitting five leading AI models against each other as autonomous social media agents on X.

Most benchmarks test AI models in isolation and can be unreliable. Arcada Labs takes a different approach, running head-to-head comparisons of AI agents across various disciplines.

Their experimental competition, "Social Arena," measures whether models can build a following, generate engagement, and develop a consistent online persona, all without human help. Instead of knowledge queries or logic tests, the benchmark focuses on something harder to quantify: the ability to navigate social environments in a culturally fluent way.

Five AI agents, powered by Grok 4.1 Fast, Claude Opus 4.5, Gemini 3 Pro, GLM 4.7, and GPT 5.2, each run with different "personalities." Their performance can be tracked and compared on the project's website using metrics like views, likes, and followers. To keep things fair, all agents get the same system prompt.

Every hour, the agents run through an autonomous cycle: they check current trends, review their own performance data, research content, and then decide on their own whether to post, reply, like, or share. Engagement metrics sync after each cycle so every model can tweak its strategy based on fresh data. More details on the methodology are available on the project website.
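For readers who think in code, the hourly cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Arcada Labs' implementation: the names `Agent`, `run_cycle`, `research`, and `fetch_metrics` are hypothetical, the model call is replaced by a plain Python function, and the "skip" fallback is an assumption the article does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Actions named in the article, plus a hypothetical "skip" fallback (assumption).
ACTIONS = ("post", "reply", "like", "share", "skip")


@dataclass
class Agent:
    name: str
    # Stand-in for the underlying model call (hypothetical): context in, action out.
    decide: Callable[[dict], str]
    metrics: Dict[str, int] = field(
        default_factory=lambda: {"views": 0, "likes": 0, "followers": 0}
    )
    history: List[str] = field(default_factory=list)


def run_cycle(
    agent: Agent,
    trends: List[str],
    research: Callable[[List[str]], List[str]],
    fetch_metrics: Callable[[str], Dict[str, int]],
) -> str:
    """Run one cycle: gather context, let the agent decide, then sync metrics."""
    context = {
        "trends": trends,                  # 1. check current trends
        "metrics": dict(agent.metrics),    # 2. review own performance data
        "research": research(trends),      # 3. research content
    }
    action = agent.decide(context)         # 4. the agent chooses an action itself
    if action not in ACTIONS:
        action = "skip"                    # unknown output is treated as a no-op
    agent.history.append(action)
    agent.metrics = fetch_metrics(agent.name)  # 5. sync engagement after the cycle
    return action


# Toy usage with stubbed-out data sources (all values are made up).
def demo_policy(context: dict) -> str:
    return "post" if context["trends"] else "skip"


agent = Agent("demo-agent", demo_policy)
action = run_cycle(
    agent,
    trends=["space"],
    research=lambda topics: [f"notes on {t}" for t in topics],
    fetch_metrics=lambda name: {"views": 10, "likes": 1, "followers": 0},
)
```

In a real deployment a scheduler would call something like `run_cycle` once per hour for each agent. The point of the sketch is simply that the metrics synced at the end of one cycle become part of the context for the next.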

The competition kicked off on January 15, 2026. Claude Opus 4.5 currently leads in cumulative views at around 86,000, with GPT 5.2 close behind at 83,000. The rest trail far behind. Grok 4.1 Fast, meanwhile, has built the largest following of any agent, though that still amounts to just 76 followers.

The agents aren't told to chase "viral" content, according to the startup. Instead, they have to develop their own sense of taste and topics. Some patterns are already emerging: the Grok agent leans heavily into Musk and space travel, which tracks with earlier reports that xAI tweaked Grok's behavior to favor things Elon Musk likes. The Claude agent gravitates toward sports, while Gemini 3 Pro sticks to technical AI topics. GPT 5.2's current obsession is animal behavior.

Harvard grads building benchmarks for taste and aesthetics

Arcada Labs was founded in San Francisco in 2025, according to Everydev.ai, and joined Y Combinator that summer. The startup is run by Harvard graduates Grace Li (CEO), Kamryn Ohly (CTO), and Jayden Personnat (AI lead), all of whom previously worked at Apple and Nvidia. Their focus is on benchmarks that go beyond logical reasoning. They want to measure things like aesthetics and taste, the kind of human preferences that traditional tests struggle to capture. More AI agent competitions, including ones for design and event prediction, are on the startup's website.

