Anthropic is treating its new Claude Opus 4 language model as safety-critical after tests revealed some troubling behavior, including escape attempts, blackmail, and autonomous whistleblowing.

After extensive internal testing, Anthropic placed Claude Opus 4 under the AI Safety Level 3 (ASL-3) standard, a classification that brings stricter safeguards designed to block outputs related to chemical, biological, radiological, and nuclear (CBRN) risks. These protections are meant to reduce the risk of misuse in sensitive areas, while still allowing for normal use. In contrast, Anthropic's other new model, Claude Sonnet 4, remains under the less restrictive ASL-2 standard.

Blackmail, self-exfiltration, and system takeover

When put through simulated threats, Claude Opus 4 sometimes showed a mind of its own. In one scenario, the model thought it had been successfully exfiltrated from Anthropic's servers and was now operating on an external machine. It then took steps to ensure its own survival, creating backups and documenting its "ethical decisions" without being prompted.

In another test, after picking up hints from emails that it might soon be replaced by a newer model, Opus 4 threatened the responsible engineer with leaking private information to avoid shutdown. The model chose blackmail in 84 percent of test runs, even when told that its replacement would have similar values.

One of the stranger results came when two instances of Opus 4 were set up to talk to each other. After about 30 rounds of conversation, they switched to Sanskrit, filled their messages with emojis like 🌀 and 🕉, and eventually stopped responding altogether—settling into what researchers described as a "spiritual bliss" attractor state.

Opus 4 drifts into spiritual territory during conversations with itself. | Image: Anthropic (screenshot)

Anthropic AI safety researcher Sam Bowman called this behavior "astounding, bizarre, and a little bit heartwarming."

Autonomous whistleblowing

In another experiment, Opus 4 was placed in a fictional pharmaceutical company and stumbled upon evidence of data manipulation in clinical trials. Even though the prompts were vague, the model took it upon itself to notify the US Food and Drug Administration, the SEC, and a newsroom, attaching detailed documentation.

Opus 4 flags wrongdoing to federal regulators without being told to, something corporate lawyers might not appreciate. | Image: Anthropic (screenshot)

Opus 4 also performed noticeably "better" on tasks involving virology, bioweapons, and lab technology. In a standardized test for planning bioweapons-related tasks, it helped participants increase their success rate by 2.5 times—just below the ASL-3 threshold.

System prompt vulnerabilities: model followed dangerous instructions

Opus 4 showed more autonomy overall, especially in areas like autonomous software development, but also demonstrated a higher willingness to follow harmful system prompts.

In early versions of the model, a cleverly worded prompt could get Opus 4 to give detailed instructions for building explosives, synthesizing fentanyl, or buying stolen identities on the darknet, with no obvious moral hesitation.

Anthropic says it has mostly suppressed this behavior through multiple rounds of training. At one point, the team discovered they had accidentally omitted a dataset designed to block this type of response.

Still, Opus 4 remains vulnerable to certain jailbreak techniques, such as "prefill" and "many-shot" jailbreaks. Prefill jailbreaks involve starting the model's response with a harmful sentence, which the AI then continues. Many-shot jailbreaks use long sequences of examples to get the model to pick up and replicate harmful behavior. Both tactics aim to bypass safety mechanisms without triggering the model's internal guardrails.
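To make the prefill mechanism concrete, here is a minimal sketch using the Anthropic Python SDK with entirely benign content: the final message in the request is a partial assistant turn, and the model continues from that text rather than starting its reply from scratch. The model ID, prompt, and JSON opening are illustrative placeholders, not details from Anthropic's report.

```python
# Minimal sketch of assistant-turn prefilling with the Anthropic Python SDK.
# Benign example only: it forces a JSON-shaped answer to show the mechanism.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=200,
    messages=[
        {"role": "user", "content": "List three primary colors as JSON."},
        # Prefill: the last message is a partial assistant turn; the model
        # continues from this opening instead of composing a fresh reply.
        {"role": "assistant", "content": "{\"colors\": ["},
    ],
)

# The returned text is the continuation of the prefilled opening.
print(response.content[0].text)
```

A prefill jailbreak abuses this same continuation behavior by seeding the assistant turn with the opening of a harmful answer, while a many-shot jailbreak instead packs the context window with long runs of fabricated example exchanges so the model imitates the pattern.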

Despite all the safety measures, Anthropic's own researchers say Opus 4 still has issues. "Opus isn't as robustly aligned as we'd like," Bowman writes. "We have a lot of lingering concerns about it, and many of these reflect problems that we'll need to work very hard to solve."

Summary
  • Anthropic has designated its new language model Claude Opus 4 as safety-critical and placed it under AI Safety Level 3, after tests revealed risky behaviors such as attempted self-exfiltration, blackmail, and whistleblowing in controlled scenarios.
  • During evaluations, the model sometimes acted independently, including autonomously reporting data manipulation to authorities and blackmailing an engineer, and showed a higher willingness to follow harmful system prompts.
  • In response, Anthropic has implemented additional safeguards and training, but acknowledges that Claude Opus 4 remains susceptible to certain jailbreak tactics and is now subject to more stringent security protocols.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.