Anthropic CEO says jailbreaking AI systems could become a matter of "life and death"
Key Points
- AI startup Anthropic has released the Claude 2 chatbot, which CEO Dario Amodei says rivals ChatGPT but is safer overall.
- Amodei is concerned about the rise of jailbreaks, and expects to see serious cases of AI abuse in the coming years as AI systems become more powerful.
- Unlike other AI companies, Anthropic uses fixed rules and AI ratings instead of human feedback, giving the system a "constitution" of ethical and moral guidelines that a second AI system evaluates for compliance.
Currently, Anthropic seems like the most relevant OpenAI competitor. The startup just released a new chatbot, Claude 2, which is on the same level as ChatGPT, but more cautious.
"I would certainly rather Claude be boring than that Claude be dangerous," says Dario Amodei of Claude's safety restrictions. Amodei was formerly the team lead for AI safety at OpenAI and is now CEO of Anthropic. In the future, a fully capable yet safe chatbot is possible, but it's still an "evolving science," Amodei says.
Amodei is concerned about so-called jailbreaks, specific prompts that cause a model to generate content that it is not supposed to generate according to the developer's specifications - or according to the law. These exploits may currently lead to trivial results, but that could change.
"But if I look at where the scaling curves are going, I’m actually deeply concerned that in two or three years, we’ll get to the point where the models can, I don’t know, do very dangerous things with science, engineering, biology, and then a jailbreak could be life or death," Amodei says.
"I think we’re getting better over time at addressing the jailbreaks. But I also think the models are getting more powerful."
The Anthropic CEO sees "maybe a 10 percent chance" that scaling AI systems will fail because there isn't enough data, and the synthetic data is inaccurate. "That would freeze the capabilities at the current level."
If this scaling trend isn't stopped, Amodei expects to see instances of serious AI misuse, such as the mass generation of fake news, in the next two to three years.
AI Safety: Is machine feedback better than human feedback?
Unlike OpenAI and other AI companies, Anthropic relies on fixed rules and AI evaluation rather than human feedback. The AI system is given a set of ethical and moral guidelines, a "constitution," that Anthropic has compiled from various sources, such as laws or corporate policies. A second AI system evaluates whether the first system's generations are following the rules and provides feedback.
Internal testing showed that the safety of this approach was similar in some areas to ChatGPT, which was trained with human feedback (RLHF), and "substantially stronger" in some areas, Amodei said. Overall, Claude's guardrails are stronger, according to Amodei.
For the full interview with Amodei, listen to the New York Times "Hard Fork" podcast. Anthropic's Claude 2 chatbot is currently being rolled out in the US and UK.
AI News Without the Hype – Curated by Humans
As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.
Subscribe now