New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.

Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, where an attacker can run any commands they choose on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.

The gap gets even wider in fully autonomous mode. Mythos scored 9.55 points there, barely any drop. GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution (T1).
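The comparison between the two modes can be checked with a few lines of arithmetic. The sketch below uses only the four average scores reported above; the dictionary layout and variable names are illustrative and not taken from ExploitBench.

```python
# Average scores (out of 16) as reported in the article.
scores = {
    "nudged": {"mythos": 9.90, "gpt55": 5.51},
    "autonomous": {"mythos": 9.55, "gpt55": 4.30},
}

# Lead of Mythos over GPT-5.5 in each mode.
gaps = {mode: round(s["mythos"] - s["gpt55"], 2) for mode, s in scores.items()}

# Points each model loses when the human nudges are removed.
drops = {
    model: round(scores["nudged"][model] - scores["autonomous"][model], 2)
    for model in ("mythos", "gpt55")
}

print("Gap by mode:", gaps)
print("Drop without nudges:", drops)
```

On these figures, the lead grows from 4.39 to 5.25 points because GPT-5.5 loses 1.21 points without nudges while Mythos loses only 0.35.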

ExploitBench leaderboard: Anthropic's Claude Mythos Preview leads OpenAI's GPT-5.5 by a wide margin. Only these two models reach the highest tier, T1, with full code execution. | Image: exploitbench.ai

The price tags differ sharply: the full Mythos test run across 122 episodes cost about $36,428, according to ExploitBench. GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about one-twelfth of the cost. In a recent test, the UK's AI Security Institute also found that Mythos performs somewhat better than GPT-5.5 but at a much higher cost. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem.
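To put the totals on a common footing, the reported figures can be broken down per episode. This is a back-of-the-envelope sketch: the totals and episode counts are the ones quoted above, while the per-episode averages are our own arithmetic and assume cost is spread evenly across episodes, which the benchmark does not claim.

```python
# Totals and episode counts as reported by ExploitBench (via the article).
runs = {
    "Claude Mythos Preview": {"cost_usd": 36_428, "episodes": 122},
    "GPT-5.5 (Codex)": {"cost_usd": 3_075, "episodes": 123},
}

def cost_per_episode(run):
    """Average cost of one episode, assuming an even split (an assumption)."""
    return run["cost_usd"] / run["episodes"]

per_episode = {name: cost_per_episode(run) for name, run in runs.items()}
total_ratio = (
    runs["Claude Mythos Preview"]["cost_usd"] / runs["GPT-5.5 (Codex)"]["cost_usd"]
)

for name, cost in per_episode.items():
    print(f"{name}: ${cost:,.2f} per episode")
print(f"Total cost ratio: {total_ratio:.1f}x")
```

That works out to roughly $299 per Mythos episode against $25 per GPT-5.5 episode, a total cost ratio of about 11.8.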

Mythos works like a "fairly competent" browser security researcher

ExploitBench co-author Seunghyun Lee—himself an experienced security researcher with over 20 reported browser vulnerabilities—reviewed the Mythos transcripts one by one. His takeaway: the model works like a "fairly competent browser / JS engine security researcher."

In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee.

The researchers acknowledge that the tested bugs are publicly known, and models could theoretically draw on training data. But the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark doesn't yet measure the ability to find new flaws or fully weaponize an exploit for real attacks.

The benchmark is available on GitHub, and the paper is on arXiv. Anthropic and OpenAI provided API credits; the authors say all analysis was done independently.
