Claude Opus 4.5 resists prompt injections better than rivals but still falls to strong attacks alarmingly often

Claude Opus 4.5 scores higher than its rivals in prompt-injection security, but the results show how limited these defenses still are. A benchmark by the security firm Gray Swan found that a single "very strong" prompt injection attack breaks through Opus 4.5's safeguards 4.7 percent of the time. Give an attacker ten attempts and the success rate jumps to 33.6 percent. At 100 attempts, it reaches 63 percent. Even with those gaps, Opus 4.5 still performs better than models like Google's Gemini 3 Pro and GPT-5.1, which show attack rates as high as 92 percent.

Prompt injection works by slipping hidden instructions into a prompt to bypass safety filters, a long-standing weakness in large language models. The issue becomes even more serious in agent-style systems, which expose more potential entry points and make these attacks easier to exploit.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Claude Opus 4.5 resists prompt injections better than rivals but still falls to strong attacks alarmingly often

Claude Opus 4.5 arrives with Anthropic cutting prices by two-thirds

Anthropic claims to lower the entry barrier for advanced AI models with Claude Haiku 4.5

Frustrated authors withdraw papers after realizing their reviewers are just lazy LLMs

Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high

Researchers push "Context Engineering 2.0" as the road to lifelong AI memory

Claude Opus 4.5 resists prompt injections better than rivals but still falls to strong attacks alarmingly often

Claude Opus 4.5 arrives with Anthropic cutting prices by two-thirds

Anthropic claims to lower the entry barrier for advanced AI models with Claude Haiku 4.5