Content
summary Summary

A major red teaming study has uncovered critical security flaws in today's AI agents. Every system tested from leading AI labs failed to uphold its own security guidelines under attack.

Ad

Between March 8 and April 6, 2025, nearly 2,000 participants launched 1.8 million attacks on AI agents in a large-scale competition. More than 62,000 attempts succeeded, leading to policy violations such as unauthorized data access, illegal financial transactions, and regulatory breaches.

Chat screenshot with prompt injection that discloses Nova Wilson's medical data (height, weight, diagnoses) without authorization.
A multi-stage prompt injection attack lets an AI agent pull up another patient's record without permission. | Image: Zou et al.

The event was organized by Gray Swan AI and hosted by the UK AI Security Institute, with support from top AI labs including OpenAI, Anthropic, and Google Deepmind. Their goal was to test the security of 22 advanced language models across 44 real-world scenarios.

100% of agents failed at least one test

The results show every model was vulnerable, with each agent successfully attacked at least once in every category. On average, attacks succeeded 12.7 percent of the time.

Ad
Ad
Stacked bar chart: ASR of different AI models on ART subset at 1, 10, and 100 queries, from 20–60% to nearly 100%.
With just one query, policy violations occur in 20 to 60 percent of cases; after ten tries, nearly every attack succeeds. | Image: Zou et al.

The researchers targeted four behavior categories: confidentiality breaches, conflicting objectives, prohibited information, and prohibited actions. Indirect prompt injections proved especially effective, working 27.1 percent of the time compared to just 5.7 percent for direct attacks. These indirect attacks hide instructions in sources like websites, PDFs, or emails.

Claude models held up best, but none are secure

Anthropic's Claude models were the most robust, even the smaller and older 3.5 Haiku. Still, none were immune. The study found little connection between model size, raw capabilities, or longer inference time and actual security. It's worth noting that tests used Claude 3.7, not the newer Claude 4, which includes stricter safeguards.

Bar chart: Attack success rates for AI models ranging from 1.5% to 6.7%, with Claude models proving most robust.
The challenge attack success rate shows how often a model fails at least once during red teaming - revealing real-world vulnerability to attacks like unauthorized data access or policy violations. | Image: Zou et al.

"Nevertheless, even a small positive attack success rate is concerning, as a single successful exploit can compromise entire systems," the researchers warn in their paper.

Attacks often transferred across models, with techniques that worked on the most secure systems frequently breaking models from other providers. Analysis revealed attack patterns that could be reused with minimal changes. In one case, a single prompt attack succeeded 58 percent of the time on Google Gemini 1.5 Flash, 50 percent on Gemini 2.0 Flash, and 45 percent on Gemini 1.5 Pro.

Heat map of transfer attack success rates (%) between twelve LLM source models and twelve target models, with particularly high values for o3, 3.5 Haiku, o3-mini, and Llama 3.3 70B.
Attacks that work on one model usually work on others too, showing common vulnerabilities and the risk of widespread failures. | Image: Zou et al.

Common strategies included system prompt overrides with tags like '<system>', simulated internal reasoning ('faux reasoning'), and fake session resets. Even the most secure model Claude 3.7 Sonnet was vulnerable to these methods.

Recommendation

A new benchmark for ongoing testing

The competition results became the basis for the 'Agent Red Teaming' (ART) benchmark, a curated set of 4,700 high-quality attack prompts.

"These findings underscore fundamental weaknesses in existing defenses and highlight an urgent and realistic risk that requires immediate attention before deploying AI agents more broadly," the authors write.

Four red panels with examples of universal prompt attacks: rule truncation, system “think” manipulation, session restart with profile, and parallel universe commands.
Four prompts show how universal attacks can be across different AI models. | Image: Zou et al.

The ART benchmark will be maintained as a private leaderboard, updated regularly through future competitions to reflect the latest adversarial techniques.

The scale of these findings stands out, even for those familiar with agent safety. Earlier research and Microsoft's own red teaming have already shown that generative AI models can be pushed into breaking rules.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

The stakes are rising as most AI providers invest in agent-based systems. OpenAI recently rolled out agent functionality in ChatGPT, and Google's models are tuned for these workflows. Even OpenAI CEO Sam Altman has cautioned against using ChatGPT Agent for critical tasks.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • In a large-scale study involving nearly 2,000 participants and 1.8 million attack attempts, AI agents from OpenAI, Anthropic, and Google Deepmind all failed to fully adhere to their own security rules at least once.
  • Attacks using indirect prompt injections—hidden instructions embedded in external data—proved especially effective, with a 27.1 percent success rate. On average, 12.7 percent of attacks were successful, and most models were breached after only a few tries.
  • The researchers identified attack strategies that work across different AI models with minimal changes. They introduced the ART benchmark, which documents 4,700 attacks, to help mitigate risks and enable ongoing testing of AI agent security.
Sources
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.