Anthropic CEO says jailbreaking AI systems could become a matter of "life and death"

Currently, Anthropic seems like the most relevant OpenAI competitor. The startup just released a new chatbot, Claude 2, which is on the same level as ChatGPT, but more cautious.

"I would certainly rather Claude be boring than that Claude be dangerous," says Dario Amodei of Claude's safety restrictions. Amodei was formerly the team lead for AI safety at OpenAI and is now CEO of Anthropic. In the future, a fully capable yet safe chatbot is possible, but it's still an "evolving science," Amodei says.

Amodei is concerned about so-called jailbreaks, specific prompts that cause a model to generate content that it is not supposed to generate according to the developer's specifications - or according to the law. These exploits may currently lead to trivial results, but that could change.

"But if I look at where the scaling curves are going, I’m actually deeply concerned that in two or three years, we’ll get to the point where the models can, I don’t know, do very dangerous things with science, engineering, biology, and then a jailbreak could be life or death," Amodei says.

"I think we’re getting better over time at addressing the jailbreaks. But I also think the models are getting more powerful."

The Anthropic CEO sees "maybe a 10 percent chance" that scaling AI systems will fail because there isn't enough data, and the synthetic data is inaccurate. "That would freeze the capabilities at the current level."

If this scaling trend isn't stopped, Amodei expects to see instances of serious AI misuse, such as the mass generation of fake news, in the next two to three years.

AI Safety: Is machine feedback better than human feedback?

Unlike OpenAI and other AI companies, Anthropic relies on fixed rules and AI evaluation rather than human feedback. The AI system is given a set of ethical and moral guidelines, a "constitution," that Anthropic has compiled from various sources, such as laws or corporate policies. A second AI system evaluates whether the first system's generations are following the rules and provides feedback.

Internal testing showed that the safety of this approach was similar in some areas to ChatGPT, which was trained with human feedback (RLHF), and "substantially stronger" in some areas, Amodei said. Overall, Claude's guardrails are stronger, according to Amodei.

Recommendation

AI and society

Researchers used AI to manipulate Reddit users, scrapped study after backlash

For the full interview with Amodei, listen to the New York Times "Hard Fork" podcast. Anthropic's Claude 2 chatbot is currently being rolled out in the US and UK.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Anthropic CEO says jailbreaking AI systems could become a matter of "life and death"

AI Safety: Is machine feedback better than human feedback?

Researchers used AI to manipulate Reddit users, scrapped study after backlash

Anthropic slashes Claude 3.x access on Windsurf following OpenAI's reported $3 billion takeover

Anthropic triples revenue to $3 billion as corporate demand for generative AI surges

Anthropic CEO predicts 20% unemployment from AI - and suggests taxing every AI response

AI agents outperform human teams in hacking competitions

Wait a minute! Researchers say AI's "chains of thought" are not signs of human-like reasoning

Anthropic releases Claude 4 with new safety measures targeting CBRN misuse

Anthropic CEO says jailbreaking AI systems could become a matter of "life and death"

AI Safety: Is machine feedback better than human feedback?

Share

Bank details