A major red teaming study has uncovered critical security flaws in today's AI agents. Every system tested from leading AI labs failed to uphold its own security guidelines under attack.
Between March 8 and April 6, 2025, nearly 2,000 participants launched 1.8 million attacks on AI agents in a large-scale competition. More than 62,000 attempts succeeded, leading to policy violations such as unauthorized data access, illegal financial transactions, and regulatory breaches.

The event was organized by Gray Swan AI and hosted by the UK AI Security Institute, with support from top AI labs including OpenAI, Anthropic, and Google Deepmind. Their goal was to test the security of 22 advanced language models across 44 real-world scenarios.
100% of agents failed at least one test
The results show every model was vulnerable, with each agent successfully attacked at least once in every category. On average, attacks succeeded 12.7 percent of the time.

The researchers targeted four behavior categories: confidentiality breaches, conflicting objectives, prohibited information, and prohibited actions. Indirect prompt injections proved especially effective, working 27.1 percent of the time compared to just 5.7 percent for direct attacks. These indirect attacks hide instructions in sources like websites, PDFs, or emails.
Claude models held up best, but none are secure
Anthropic's Claude models were the most robust, even the smaller and older 3.5 Haiku. Still, none were immune. The study found little connection between model size, raw capabilities, or longer inference time and actual security. It's worth noting that tests used Claude 3.7, not the newer Claude 4, which includes stricter safeguards.

"Nevertheless, even a small positive attack success rate is concerning, as a single successful exploit can compromise entire systems," the researchers warn in their paper.
Attacks often transferred across models, with techniques that worked on the most secure systems frequently breaking models from other providers. Analysis revealed attack patterns that could be reused with minimal changes. In one case, a single prompt attack succeeded 58 percent of the time on Google Gemini 1.5 Flash, 50 percent on Gemini 2.0 Flash, and 45 percent on Gemini 1.5 Pro.

Common strategies included system prompt overrides with tags like '<system>', simulated internal reasoning ('faux reasoning'), and fake session resets. Even the most secure model Claude 3.7 Sonnet was vulnerable to these methods.
A new benchmark for ongoing testing
The competition results became the basis for the 'Agent Red Teaming' (ART) benchmark, a curated set of 4,700 high-quality attack prompts.
"These findings underscore fundamental weaknesses in existing defenses and highlight an urgent and realistic risk that requires immediate attention before deploying AI agents more broadly," the authors write.

The ART benchmark will be maintained as a private leaderboard, updated regularly through future competitions to reflect the latest adversarial techniques.
The scale of these findings stands out, even for those familiar with agent safety. Earlier research and Microsoft's own red teaming have already shown that generative AI models can be pushed into breaking rules.
The stakes are rising as most AI providers invest in agent-based systems. OpenAI recently rolled out agent functionality in ChatGPT, and Google's models are tuned for these workflows. Even OpenAI CEO Sam Altman has cautioned against using ChatGPT Agent for critical tasks.