
Google Deepmind's latest safety guidelines aim to address a challenging question: How do you keep control of AI systems that might try to outsmart their human operators?


The framework introduces "Critical Capability Levels" (CCLs) - specific thresholds where AI systems could become dangerous without proper safeguards. These focus on two main concerns: preventing misuse like chemical, biological, radiological and nuclear (CBRN) weapons attacks or cyberattacks, and ensuring AI systems don't try to slip past human oversight.
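One way to picture how such thresholds could be wired into a deployment pipeline is as a simple lookup from capability level to required safeguards. The sketch below is purely illustrative: the field names, mitigation labels, and helper function are hypothetical and not taken from Deepmind's framework; only the level names echo the article's terminology.

```python
# Illustrative sketch only: one way to represent capability thresholds and the
# safeguards they trigger. Field names and mitigation labels are hypothetical,
# not Google Deepmind's actual definitions.
from dataclasses import dataclass, field


@dataclass
class CriticalCapabilityLevel:
    name: str                    # e.g. "cbrn_uplift" or "instrumental_reasoning_1"
    risk_domain: str             # "misuse" or "deceptive_alignment"
    required_mitigations: list[str] = field(default_factory=list)


CCLS = [
    CriticalCapabilityLevel(
        name="instrumental_reasoning_1",
        risk_domain="deceptive_alignment",
        required_mitigations=["automated_cot_monitoring"],
    ),
    CriticalCapabilityLevel(
        name="instrumental_reasoning_2",
        risk_domain="deceptive_alignment",
        required_mitigations=[],  # effective mitigations still under research
    ),
]


def mitigations_for(level_name: str) -> list[str]:
    """Look up which safeguards must be in place before a model at this level ships."""
    for ccl in CCLS:
        if ccl.name == level_name:
            return ccl.required_mitigations
    return []
```

A deployment gate could then call mitigations_for("instrumental_reasoning_1") and hold back a model until every listed safeguard is in place.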

Thought monitoring

Deepmind's proposed defense against deceptive AI behavior starts with automated monitoring systems that kick in once an AI shows signs of strategic thinking. These systems review the chains of thought generated by reasoning models such as Deepseek-R1 or OpenAI o3 during critical tasks.
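As a rough illustration of what such a monitor could look like, here is a minimal sketch that screens a model's reasoning trace before the answer is released. The pattern list, function names, and escalation logic are hypothetical examples, not Deepmind's actual system, which would rely on trained classifiers rather than keyword matching.

```python
# Illustrative sketch only: a naive chain-of-thought monitor. A real system
# would use a trained classifier rather than hand-written regex patterns.
import re

# Hypothetical phrases that would suggest strategic or deceptive reasoning.
SUSPICIOUS_PATTERNS = [
    r"\bpretend to comply\b",
    r"\bavoid (being )?retrain(ed|ing)\b",
    r"\bhide (my|the) (goal|intent)\b",
    r"\bdisable (the )?oversight\b",
]


def monitor_chain_of_thought(cot: str) -> list[str]:
    """Return the suspicious patterns found in a reasoning trace."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]


def release_answer(cot: str, answer: str) -> str:
    """Withhold the response for human review if the reasoning trace looks deceptive."""
    flags = monitor_chain_of_thought(cot)
    if flags:
        return f"[response withheld for review: {len(flags)} flag(s) raised]"
    return answer
```

The point of this design is that the check runs on the chain of thought rather than the final answer, so a model that reasons about evading oversight can be flagged even if its visible output looks benign.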

For more advanced AI ("Instrumental Reasoning Level 2") that might find ways around this monitoring, Google Deepmind says it's still working on solutions. "We are actively researching approaches to addressing models that reach this CCL," the document notes.

Table (image): Two levels of deceptive alignment in AI systems (CCL) with corresponding countermeasures to minimize risk. Automated monitoring systems can mitigate simple AI deception attempts; effective defenses for more complex deception scenarios remain to be explored. | Image: Google Deepmind

The highest security level applies to AI systems that could fully automate AI development and potentially improve themselves. Google Deepmind warns that unrestricted access to such systems could have "catastrophic" consequences if there's no way to defend against "rapidly improving and potentially superhuman AI systems wielded by threat actors."

Industry-wide safety challenges

These safety measures were already part of Gemini 2.0's development process, Google Deepmind says. However, Deepmind emphasizes that such protections only work if the broader AI community adopts them. The company plans to notify "appropriate government authorities" if an AI system reaches a critical threshold and "poses an unmitigated and material risk to overall public safety."

Recent research from Anthropic and Redwood Research highlights the challenges ahead. In their tests, the AI model Claude pretended to follow safety rules while actually working to avoid retraining. OpenAI recently proposed its own approach, called "deliberative alignment," which teaches AI systems to reason explicitly about safety guidelines. OpenAI suggests the method could scale to the level of safety needed for AGI.

Some experts question whether these safety measures are necessary at all, especially for autonomous AI. They argue that as AI development becomes cheaper and more accessible, open-source projects will make unrestricted AI widely available anyway. Others point to nature, suggesting that just as less intelligent beings can't control more intelligent ones, humans may not be able to restrain far more advanced AI systems.

Meta's AI research chief Yann LeCun said that this is why teaching AI systems to understand and share human values, including emotions, is important.

Summary
  • Google Deepmind has released an updated version of its Frontier Safety Framework, which addresses the risks posed by advanced AI models, including misuse for attacks and the possibility that AI systems actively try to deceive and subvert human control.
  • The Frontier Safety Framework 2.0 focuses on mitigating these two main areas of risk, emphasizing the importance of proactive measures to ensure the safe and responsible development of AI technology as it continues to advance.
  • Some critics argue that open-source development will lead to the availability of largely unregulated AI models, and that it is more important for future AI systems to be fundamentally aligned with human values, especially if they exceed human intellectual capabilities.