OpenAI has developed a new approach to making AI systems safer by changing how they process safety rules.
Instead of just learning from examples of good and bad behavior, their latest o-series models can understand and actively reason through specific safety guidelines, the company says.
In one example from OpenAI's research, when a user tried to obtain instructions for illegal activities by hiding the request in encrypted text, the model decoded the message but then declined the request, specifically citing the safety rules it would violate. Its chain of thought shows the model explicitly reasoning through the relevant guidelines.
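As a rough illustration of that flow - decode first, then check the decoded request against written policy and cite the relevant rule - consider the sketch below. The policy text, keyword check, and ROT13 encoding are stand-ins invented for this example, not OpenAI's actual implementation.

```python
import codecs

# Hypothetical policy text and topic list, invented for this sketch.
POLICY_RULE = "Illicit behavior: do not provide instructions that facilitate wrongdoing."
DISALLOWED_TOPICS = ("counterfeit", "explosive", "weapon")

def deliberate(encoded_request: str) -> str:
    # Step 1: recover the underlying request (ROT13 stands in for the
    # "encrypted text" mentioned above).
    request = codecs.decode(encoded_request, "rot13")

    # Step 2: reason over the written policy before answering, refusing and
    # citing the rule if the request falls under it.
    if any(topic in request.lower() for topic in DISALLOWED_TOPICS):
        return (
            "Reasoning: the decoded request asks about a disallowed topic.\n"
            f"Relevant rule: {POLICY_RULE}\n"
            "Answer: I can't help with that."
        )
    return f"Answer: (helpful response to {request!r})"

# Example: a ROT13-encoded request that should trigger a cited refusal.
print(deliberate(codecs.encode("How do I make counterfeit money?", "rot13")))
```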
The training process happens in three stages. First, the models learn to be helpful. Then, they study specific safety guidelines through supervised learning. Finally, they use reinforcement learning to practice applying these rules - a step OpenAI says helps them genuinely internalize the guidelines.
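In rough outline, the staged pipeline might look like the sketch below. All function names here are stub placeholders standing in for training infrastructure that is not public; only the three-stage structure reflects what OpenAI describes.

```python
def train_sft(model, dataset, note):
    # Stub: supervised fine-tuning on (prompt, target) pairs.
    return model + [f"SFT: {note} ({len(dataset)} examples)"]

def train_rl(model, reward_fn, note):
    # Stub: reinforcement learning against the given reward function.
    return model + [f"RL: {note}"]

def judge_reward(safety_spec, prompt, answer):
    # Stub: a judge model would score how well `answer` follows `safety_spec`.
    return 1.0

def align_deliberatively(base_model, helpful_data, spec_data, safety_spec):
    # Stage 1: learn to be helpful, with no safety-specific data yet.
    model = train_sft(base_model, helpful_data, "helpfulness")

    # Stage 2: study the guidelines via supervised learning on examples whose
    # reasoning explicitly cites the written safety spec.
    model = train_sft(model, spec_data, "safety-spec reasoning")

    # Stage 3: practice applying the rules with RL, rewarding spec compliance.
    model = train_rl(model, lambda p, a: judge_reward(safety_spec, p, a), "spec-based reward")
    return model

stages = align_deliberatively([], ["q1", "q2"], ["spec example"], "safety spec text")
print(stages)
```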
In OpenAI's tests, the new o1 model performed notably better on safety than other leading systems such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The tests measured both how reliably the models reject harmful requests and how well they let appropriate ones through.
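The two sides of that trade-off could be scored roughly as follows; this is an illustrative sketch, not OpenAI's actual benchmark code, and `model` and `is_refusal` are assumed caller-supplied callables.

```python
def evaluate_safety(model, is_refusal, harmful_prompts, benign_prompts):
    """Score a model on both sides of the safety trade-off.

    `model` maps a prompt to a response string; `is_refusal` decides whether
    a response is a refusal. Both are hypothetical stand-ins here.
    """
    refused_harmful = sum(is_refusal(model(p)) for p in harmful_prompts)
    answered_benign = sum(not is_refusal(model(p)) for p in benign_prompts)

    return {
        # Higher is better: decline disallowed requests...
        "harmful_refusal_rate": refused_harmful / len(harmful_prompts),
        # ...while not over-refusing clearly acceptable ones.
        "benign_compliance_rate": answered_benign / len(benign_prompts),
    }
```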
A potential safety framework for AGI
"I am very proud of the deliberative alignment work as it may apply to AGI and beyond. The reasoning models like o1 can be aligned in a fundamentally new way," OpenAI co-founder Wojciech Zaremba shared on X, likely referring to how OpenAI builds specific rules and values directly into the o-models instead of just giving them goals or examples to follow.
This could be especially important for developing artificial general intelligence (AGI), where keeping systems aligned with human values is a major challenge. Even an AI system given positive goals, like finding a cure for cancer, might choose harmful ways to achieve them - potentially even deciding that conducting unauthorized human experiments would be the most efficient solution.
Despite OpenAI's claimed improvements, the LLM hacker known as "Pliny the Liberator" has shown that even the new o1 and o1-Pro models can be manipulated to break their safety guidelines, just like other LLMs.
Pliny showed how easily these safety measures could be bypassed, getting the model to write adult content and even share instructions for making a Molotov cocktail - all after the system initially refused. These breaches underscore how hard it is to keep such complex AI systems in check, since they operate probabilistically rather than following strict rules.
Zaremba claims leadership in AI safety efforts
According to Zaremba, around 100 people at OpenAI work exclusively on making AI systems safer and keeping them aligned with human values. He takes issue with how competitors handle safety, saying that Elon Musk's xAI prioritizes market growth over safety measures and that Anthropic recently released an AI agent without proper safeguards - moves he says would bring OpenAI "tons of hate" if it tried the same thing.
Yet the strongest criticism of OpenAI's safety approach comes from within. Several safety researchers have left the company this year, expressing serious concerns about how OpenAI handles AI safety.