Claude Opus 4.5 scores higher than its rivals in prompt-injection security, but the results show how limited these defenses still are. A benchmark by the security firm Gray Swan found that a single "very strong" prompt injection attack breaks through Opus 4.5's safeguards 4.7 percent of the time. Give an attacker ten attempts and the success rate jumps to 33.6 percent. At 100 attempts, it reaches 63 percent. Even with those gaps, Opus 4.5 still performs better than models like Google's Gemini 3 Pro and GPT-5.1, which show attack rates as high as 92 percent.

Ad
Anthropic

Prompt injection works by slipping hidden instructions into a prompt to bypass safety filters, a long-standing weakness in large language models. The issue becomes even more serious in agent-style systems, which expose more potential entry points and make these attacks easier to exploit.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Sources
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.