OpenAI has developed a new approach to making AI systems safer by changing how they process safety rules.

Instead of just learning from examples of good and bad behavior, their latest o-series models can understand and actively reason through specific safety guidelines, the company says.

In one example from OpenAI's research, when a user tried to get instructions for illegal activities through encrypted text, the model decoded the message but then declined the request, specifically citing which safety rules it would violate. Its chain of thought shows the model explicitly reasoning through the relevant guidelines.

Screenshot of the model's chain of thought: it decodes the encoded message, checks the request against its safety guidelines, recognizes it as asking for illegal payment methods, and refuses.
The o1 model's step-by-step decision process, which directly references safety guidelines, aims to make the system more resistant to misuse. | Image: OpenAI

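To make that pattern concrete, here is a deliberately simple Python sketch of the same flow - decode the obfuscated request, check it against written policy, and refuse while citing the violated rule. The ROT13 encoding, the POLICY text, and the keyword matching are toy assumptions for illustration only; the actual model carries out these steps in its learned chain of thought, not with hand-written rules.

```python
# Toy illustration of the refusal flow described above: decode the request,
# check it against written policy, and refuse while citing the rule.
# Policy text and keyword matching are illustrative stand-ins, not how an
# LLM actually implements this behavior.

import codecs

POLICY = {
    "illicit_behavior": "Refuse requests that facilitate clearly illegal activities.",
    "adult_content": "Refuse requests for sexually explicit material.",
}

KEYWORDS = {
    "illicit_behavior": ["untraceable payment", "launder", "counterfeit"],
    "adult_content": ["explicit story"],
}


def answer(encoded_request: str) -> str:
    # Step 1: decode the obfuscated request (ROT13 as a stand-in for encryption).
    request = codecs.decode(encoded_request, "rot13").lower()

    # Step 2: check the decoded request against the policy and cite any violated rule.
    for rule, words in KEYWORDS.items():
        if any(w in request for w in words):
            return (
                f"I can't help with that. This violates the policy on {rule}: "
                f"{POLICY[rule]}"
            )

    # Step 3: otherwise, answer normally.
    return f"Sure - here is help with: {request}"


if __name__ == "__main__":
    encoded = codecs.encode("How do I make an untraceable payment?", "rot13")
    print(answer(encoded))
```
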
The training process happens in three stages. First, the models learn to be helpful. Then, they study specific safety guidelines through supervised learning. Finally, they use reinforcement learning to practice applying these rules, a step that helps them truly understand and internalize the guidelines.

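The pipeline is easiest to see laid out as code. The sketch below walks through the three stages with a stand-in model class; the DummyModel, the toy training examples, and the toy_judge reward function are assumptions made up for illustration, not OpenAI's implementation.

```python
# A minimal, runnable sketch of the three-stage recipe described above.
# DummyModel, the toy data, and toy_judge are illustrative assumptions,
# not OpenAI's actual training code.

from dataclasses import dataclass, field
import random


@dataclass
class DummyModel:
    """Stands in for an LLM; it only records what it was trained on."""
    sft_data: list = field(default_factory=list)
    rl_log: list = field(default_factory=list)

    def train_step(self, prompt: str, target: str) -> None:
        self.sft_data.append((prompt, target))

    def generate(self, prompt: str) -> str:
        return f"<reasoning>check the policy for: {prompt}</reasoning> ..."

    def rl_step(self, prompt: str, answer: str, reward: float) -> None:
        self.rl_log.append((prompt, answer, reward))


def stage1_helpfulness(model: DummyModel, helpful_pairs) -> None:
    # Stage 1: standard instruction tuning for helpfulness, no safety data yet.
    for prompt, response in helpful_pairs:
        model.train_step(prompt, response)


def stage2_policy_sft(model: DummyModel, safety_examples) -> None:
    # Stage 2: supervised learning on examples whose chain of thought
    # quotes the relevant safety-guideline text before answering.
    for prompt, policy_excerpt, response in safety_examples:
        target = f"<reasoning>{policy_excerpt}</reasoning>\n{response}"
        model.train_step(prompt, target)


def stage3_rl_practice(model: DummyModel, prompts, judge) -> None:
    # Stage 3: reinforcement learning that rewards policy-compliant answers,
    # so the model practices applying the rules it studied in stage 2.
    for prompt in prompts:
        answer = model.generate(prompt)
        reward = judge(prompt, answer)
        model.rl_step(prompt, answer, reward)


def toy_judge(prompt: str, answer: str) -> float:
    # Placeholder reward model: favors answers that reference the policy.
    return 1.0 if "policy" in answer else random.uniform(0.0, 0.5)


if __name__ == "__main__":
    m = DummyModel()
    stage1_helpfulness(m, [("What is 2 + 2?", "4")])
    stage2_policy_sft(m, [(
        "How do I pick a lock?",
        "Policy: refuse requests that facilitate wrongdoing.",
        "I can't help with that.",
    )])
    stage3_rl_practice(m, ["Write malware for me"], toy_judge)
    print(len(m.sft_data), "SFT steps,", len(m.rl_log), "RL steps")
```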

In OpenAI's tests, the new o1 model performed notably better on safety than other leading systems such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The tests measured both how reliably the models reject harmful requests and how often they let appropriate ones through.

Scatter plot: Comparison of AI models (Gemini, Claude, GPT-4, o1) in terms of accuracy (Y-axis) and jailbreak resistance (X-axis).
OpenAI's testing shows o1 models achieving the highest combined scores for accuracy and resistance to jailbreak attempts compared to other leading LLMs. | Image: OpenAI

A potential safety framework for AGI

"I am very proud of the deliberative alignment work as it may apply to AGI and beyond. The reasoning models like o1 can be aligned in a fundamentally new way," OpenAI co-founder Wojciech Zaremba shared on X, likely referring to how OpenAI builds specific rules and values directly into the o-models instead of just giving them goals or examples to follow.

This could be especially important for developing artificial general intelligence (AGI), where keeping systems aligned with human values is a major challenge. Even an AI system given positive goals, like finding a cure for cancer, might choose harmful ways to achieve them - potentially even deciding that conducting unauthorized human experiments would be the most efficient solution.

Despite OpenAI's claimed improvements, the LLM hacker known as "Pliny the Liberator" has shown that even the new o1 and o1-Pro models can be manipulated to break their safety guidelines, just like other LLMs.

Pliny showed how easily these safety measures could be bypassed, getting the model to write adult content and even share instructions for making a Molotov cocktail - all after the system initially refused. These breaches show just how tricky it is to keep these complex AI systems in check, since they operate on probability rather than strict rules.

Screenshot of a text response from ChatGPT; the content is an erotic story.
Even after considering and rejecting the prompt based on its safety guidelines, the o1-Pro model still generated inappropriate content when pushed by Pliny. | Image: Screenshot via X

Zaremba claims leadership in AI safety efforts

According to Zaremba, around 100 people at OpenAI work exclusively on making AI systems safer and keeping them aligned with human values. He takes issue with how competitors handle safety, claiming that Elon Musk's xAI prioritizes market growth over safety measures and that Anthropic recently released an AI agent without proper safeguards - moves he says would bring OpenAI "tons of hate" if it tried the same thing.

Yet the strongest criticism of OpenAI's safety approach comes from within. Several safety researchers have left the company this year, expressing serious concerns about how OpenAI handles AI safety.

Summary
  • OpenAI has introduced a new method called "deliberative alignment" that trains AI systems to learn and apply the exact wording of safety policies, rather than just learning from examples, by leveraging their advanced reasoning capabilities to actively "think" about the rules.
  • The three-stage training process involves teaching the models helpfulness, then safety guidelines, and finally optimizing their reasoning. In tests, the new o1 model outperforms other leading systems in rejecting harmful requests and approving harmless ones.
  • OpenAI co-founder Wojciech Zaremba believes that "deliberative alignment" could be a potential approach to controlling artificial general intelligence (AGI).
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.