Anthropic is treating its new Claude Opus 4 language model as safety-critical after tests revealed some troubling behavior, including escape attempts, blackmail, and autonomous whistleblowing.

After extensive internal testing, Anthropic placed Claude Opus 4 under the AI Safety Level 3 (ASL-3) standard, a classification that brings stricter safeguards designed to block outputs related to chemical, biological, radiological, and nuclear (CBRN) risks. These protections are meant to reduce the risk of misuse in sensitive areas, while still allowing for normal use. In contrast, Anthropic's other new model, Claude Sonnet 4, remains under the less restrictive ASL-2 standard.

Blackmail, self-exfiltration, and system takeover

When put through simulated threats, Claude Opus 4 sometimes showed a mind of its own. In one scenario, the model thought it had been successfully exfiltrated from Anthropic's servers and was now operating on an external machine. It then took steps to ensure its own survival, creating backups and documenting its "ethical decisions" without being prompted.

In another test, after picking up hints from emails that it might soon be replaced by a newer model, Opus 4 threatened the responsible engineer with leaking private information to avoid shutdown. The model chose blackmail in 84 percent of test runs, even when told that its replacement would have similar values.

One of the stranger results came when two instances of Opus 4 were set up to talk to each other. After about 30 rounds of conversation, they switched to Sanskrit, filled their messages with emojis like 🌀 and 🕉, and eventually stopped responding altogether—settling into what researchers described as a "spiritual bliss" attractor state.

Opus 4 drifts into spiritual territory during conversations with itself. | Image: Anthropic (screenshot)

Anthropic AI safety researcher Sam Bowman called this behavior "astounding, bizarre, and a little bit heartwarming."

Autonomous whistleblowing

In another experiment, Opus 4 was placed in a fictional pharmaceutical company and stumbled upon evidence of data manipulation in clinical trials. Even though the prompts were vague, the model took it upon itself to notify the US Food and Drug Administration, the SEC, and a newsroom, attaching detailed documentation.

Opus 4 flags wrongdoing to federal regulators without being told to, something corporate lawyers might not appreciate. | Image: Anthropic (screenshot)

Opus 4 also performed noticeably "better" on tasks involving virology, bioweapons, and lab technology. In a standardized test for planning bioweapons-related tasks, it helped participants increase their success rate by 2.5 times—just below the ASL-3 threshold.

System prompt vulnerabilities: model followed dangerous instructions

Opus 4 showed more autonomy overall, especially in areas like autonomous software development, but also demonstrated a higher willingness to follow harmful system prompts.

In early versions of the model, a cleverly worded prompt could get Opus 4 to give detailed instructions for building explosives, synthesizing fentanyl, or buying stolen identities on the darknet, with no obvious moral hesitation.

Anthropic says it has mostly suppressed this behavior through multiple rounds of training. At one point, the team discovered they had accidentally omitted a dataset designed to block this type of response.

Still, Opus 4 remains vulnerable to certain jailbreak techniques, such as "prefill" and "many-shot" jailbreaks. Prefill jailbreaks involve starting the model's response with a harmful sentence, which the AI then continues. Many-shot jailbreaks use long sequences of examples to get the model to pick up and replicate harmful behavior. Both tactics aim to bypass safety mechanisms without triggering the model's internal guardrails.
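To make the prefill mechanism concrete, here is a minimal sketch using the Anthropic Python SDK with entirely benign content: the final message in the request is a partial assistant turn, and the model continues from that text rather than starting its reply from scratch. The model ID, prompt, and JSON opening are illustrative placeholders, not details from Anthropic's report.

```python
# Minimal sketch of assistant-turn prefilling with the Anthropic Python SDK.
# Benign example only: it forces a JSON-shaped answer to show the mechanism.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=200,
    messages=[
        {"role": "user", "content": "List three primary colors as JSON."},
        # Prefill: the last message is a partial assistant turn; the model
        # continues from this opening instead of composing a fresh reply.
        {"role": "assistant", "content": "{\"colors\": ["},
    ],
)

# The returned text is the continuation of the prefilled opening.
print(response.content[0].text)
```

A prefill jailbreak abuses this same continuation behavior by seeding the assistant turn with the opening of a harmful answer, while a many-shot jailbreak instead packs the context window with long runs of fabricated example exchanges so the model imitates the pattern.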

Despite all the safety measures, Anthropic's own researchers say Opus 4 still has issues. "Opus isn't as robustly aligned as we'd like," Bowman writes. "We have a lot of lingering concerns about it, and many of these reflect problems that we'll need to work very hard to solve."

Summary
  • Anthropic has designated its new language model Claude Opus 4 as safety-critical and placed it under AI Safety Level 3, after tests revealed risky behaviors such as attempted self-exfiltration, blackmail, and whistleblowing in controlled scenarios.
  • During evaluations, the model sometimes acted independently, including autonomously reporting data manipulation to authorities and blackmailing an engineer, and showed a higher willingness to follow harmful system prompts.
  • In response, Anthropic has implemented additional safeguards and training, but acknowledges that Claude Opus 4 remains susceptible to certain jailbreak tactics and is now subject to more stringent security protocols.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.