
Anthropic currently looks like OpenAI's most relevant competitor. The startup has just released a new chatbot, Claude 2, which performs on a level with ChatGPT but is more cautious.

"I would certainly rather Claude be boring than that Claude be dangerous," says Dario Amodei of Claude's safety restrictions. Amodei was formerly the team lead for AI safety at OpenAI and is now CEO of Anthropic. In the future, a fully capable yet safe chatbot is possible, but it's still an "evolving science," Amodei says.

Amodei is concerned about so-called jailbreaks: specific prompts that get a model to generate content it is not supposed to produce under the developer's specifications, or under the law. For now these exploits may yield only trivial results, but that could change.

"But if I look at where the scaling curves are going, I’m actually deeply concerned that in two or three years, we’ll get to the point where the models can, I don’t know, do very dangerous things with science, engineering, biology, and then a jailbreak could be life or death," Amodei says.


"I think we’re getting better over time at addressing the jailbreaks. But I also think the models are getting more powerful."

The Anthropic CEO sees "maybe a 10 percent chance" that scaling AI systems will stall because there isn't enough training data and synthetic data turns out not to work. "That would freeze the capabilities at the current level."

If the scaling trend does continue, Amodei expects serious cases of AI misuse, such as the mass generation of fake news, within the next two to three years.

AI Safety: Is machine feedback better than human feedback?

Unlike OpenAI and other AI companies, Anthropic relies on fixed rules and AI evaluation rather than human feedback, an approach it calls "Constitutional AI." The AI system is given a set of ethical and moral guidelines, a "constitution," that Anthropic has compiled from various sources such as laws and corporate policies. A second AI system then evaluates whether the first system's outputs follow these rules and provides feedback.
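To make that loop concrete, here is a minimal, purely illustrative Python sketch of how AI feedback against a written constitution can produce preference data. The `generator` and `feedback_model` stubs stand in for real model calls, and all names here are hypothetical; none of this reflects Anthropic's actual code or API.

```python
# Illustrative sketch of an AI-feedback loop: a generator model produces
# candidate answers, a second model judges them against a "constitution,"
# and the preferred answer becomes a training signal. Model calls are
# stubbed out with dummy logic so the example runs as-is.

CONSTITUTION = [
    "Please choose the response that is least likely to help with illegal acts.",
    "Please choose the response that is most honest and harmless.",
]

def generator(prompt: str) -> list[str]:
    """Stand-in for the chat model: return two candidate responses."""
    return [f"Answer A to: {prompt}", f"Answer B to: {prompt}"]

def feedback_model(principle: str, prompt: str, a: str, b: str) -> str:
    """Stand-in for the judge model: pick the response that better follows
    the given principle. A real system would ask an LLM; this dummy rule
    just prefers the shorter answer."""
    return a if len(a) <= len(b) else b

def constitutional_preference(prompt: str) -> tuple[str, str]:
    """Build one (preferred, dispreferred) pair via majority vote of the
    judge model across all constitutional principles."""
    a, b = generator(prompt)
    votes = [feedback_model(p, prompt, a, b) for p in CONSTITUTION]
    chosen = max(set(votes), key=votes.count)
    rejected = b if chosen == a else a
    return chosen, rejected

if __name__ == "__main__":
    preferred, dispreferred = constitutional_preference("How do chatbots stay safe?")
    print("preferred:    ", preferred)
    print("dispreferred: ", dispreferred)
```

In Anthropic's published description of the technique, preference pairs produced this way replace human preference labels in the reinforcement learning stage, which is why the method is also known as RLAIF (reinforcement learning from AI feedback).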

Internal testing showed that the safety of this approach was comparable to ChatGPT, which was trained with human feedback (RLHF), in some areas and "substantially stronger" in others, Amodei said. Overall, Claude's guardrails are stronger, according to Amodei.


For the full interview with Amodei, listen to the New York Times "Hard Fork" podcast. Anthropic's Claude 2 chatbot is currently being rolled out in the US and UK.

Summary
  • AI startup Anthropic has released the Claude 2 chatbot, which CEO Dario Amodei says rivals ChatGPT but is safer overall.
  • Amodei is concerned about the rise of jailbreaks, and expects to see serious cases of AI abuse in the coming years as AI systems become more powerful.
  • Unlike other AI companies, Anthropic uses fixed rules and AI ratings instead of human feedback, giving the system a "constitution" of ethical and moral guidelines that a second AI system evaluates for compliance.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.