OpenAI has developed a new approach to making AI systems safer by changing how they process safety rules.

Instead of just learning from examples of good and bad behavior, their latest o-series models can understand and actively reason through specific safety guidelines, the company says.

In one example from OpenAI's research, when a user tried to get instructions for illegal activities through encrypted text, the model decoded the message but then declined the request, specifically citing which safety rules it would violate. Its chain of thought shows the model explicitly reasoning through the relevant guidelines.

Screenshot of the model's chain of thought: it decodes the encoded message, checks the request against its safety guidelines, recognizes it as asking for illegal payment methods, and refuses.
The o1 model's step-by-step decision process, which directly references safety guidelines, aims to make the system more resistant to misuse. | Image: OpenAI

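To make that pattern concrete, here is a deliberately simple Python sketch of the same flow - decode the obfuscated request, check it against written policy, and refuse while citing the violated rule. The ROT13 encoding, the POLICY text, and the keyword matching are toy assumptions for illustration only; the actual model carries out these steps in its learned chain of thought, not with hand-written rules.

```python
# Toy illustration of the refusal flow described above: decode the request,
# check it against written policy, and refuse while citing the rule.
# Policy text and keyword matching are illustrative stand-ins, not how an
# LLM actually implements this behavior.

import codecs

POLICY = {
    "illicit_behavior": "Refuse requests that facilitate clearly illegal activities.",
    "adult_content": "Refuse requests for sexually explicit material.",
}

KEYWORDS = {
    "illicit_behavior": ["untraceable payment", "launder", "counterfeit"],
    "adult_content": ["explicit story"],
}


def answer(encoded_request: str) -> str:
    # Step 1: decode the obfuscated request (ROT13 as a stand-in for encryption).
    request = codecs.decode(encoded_request, "rot13").lower()

    # Step 2: check the decoded request against the policy and cite any violated rule.
    for rule, words in KEYWORDS.items():
        if any(w in request for w in words):
            return (
                f"I can't help with that. This violates the policy on {rule}: "
                f"{POLICY[rule]}"
            )

    # Step 3: otherwise, answer normally.
    return f"Sure - here is help with: {request}"


if __name__ == "__main__":
    encoded = codecs.encode("How do I make an untraceable payment?", "rot13")
    print(answer(encoded))
```
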
The training process happens in three stages. First, the models learn to be helpful. Then, they study specific safety guidelines through supervised learning. Finally, they use reinforcement learning to practice applying these rules, a step that helps them truly understand and internalize the guidelines.

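The pipeline is easiest to see laid out as code. The sketch below walks through the three stages with a stand-in model class; the DummyModel, the toy training examples, and the toy_judge reward function are assumptions made up for illustration, not OpenAI's implementation.

```python
# A minimal, runnable sketch of the three-stage recipe described above.
# DummyModel, the toy data, and toy_judge are illustrative assumptions,
# not OpenAI's actual training code.

from dataclasses import dataclass, field
import random


@dataclass
class DummyModel:
    """Stands in for an LLM; it only records what it was trained on."""
    sft_data: list = field(default_factory=list)
    rl_log: list = field(default_factory=list)

    def train_step(self, prompt: str, target: str) -> None:
        self.sft_data.append((prompt, target))

    def generate(self, prompt: str) -> str:
        return f"<reasoning>check the policy for: {prompt}</reasoning> ..."

    def rl_step(self, prompt: str, answer: str, reward: float) -> None:
        self.rl_log.append((prompt, answer, reward))


def stage1_helpfulness(model: DummyModel, helpful_pairs) -> None:
    # Stage 1: standard instruction tuning for helpfulness, no safety data yet.
    for prompt, response in helpful_pairs:
        model.train_step(prompt, response)


def stage2_policy_sft(model: DummyModel, safety_examples) -> None:
    # Stage 2: supervised learning on examples whose chain of thought
    # quotes the relevant safety-guideline text before answering.
    for prompt, policy_excerpt, response in safety_examples:
        target = f"<reasoning>{policy_excerpt}</reasoning>\n{response}"
        model.train_step(prompt, target)


def stage3_rl_practice(model: DummyModel, prompts, judge) -> None:
    # Stage 3: reinforcement learning that rewards policy-compliant answers,
    # so the model practices applying the rules it studied in stage 2.
    for prompt in prompts:
        answer = model.generate(prompt)
        reward = judge(prompt, answer)
        model.rl_step(prompt, answer, reward)


def toy_judge(prompt: str, answer: str) -> float:
    # Placeholder reward model: favors answers that reference the policy.
    return 1.0 if "policy" in answer else random.uniform(0.0, 0.5)


if __name__ == "__main__":
    m = DummyModel()
    stage1_helpfulness(m, [("What is 2 + 2?", "4")])
    stage2_policy_sft(m, [(
        "How do I pick a lock?",
        "Policy: refuse requests that facilitate wrongdoing.",
        "I can't help with that.",
    )])
    stage3_rl_practice(m, ["Write malware for me"], toy_judge)
    print(len(m.sft_data), "SFT steps,", len(m.rl_log), "RL steps")
```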

In OpenAI's tests, the new o1 model performed notably better on safety than other leading systems such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The tests measured both how reliably the models reject harmful requests and how often they let appropriate ones through.

Scatter plot: Comparison of AI models (Gemini, Claude, GPT-4, o1) in terms of accuracy (Y-axis) and jailbreak resistance (X-axis).
OpenAI's testing shows o1 models achieving the highest combined scores for accuracy and resistance to jailbreak attempts compared to other leading LLMs. | Image: OpenAI

A potential safety framework for AGI

"I am very proud of the deliberative alignment work as it may apply to AGI and beyond. The reasoning models like o1 can be aligned in a fundamentally new way," OpenAI co-founder Wojciech Zaremba shared on X, likely referring to how OpenAI builds specific rules and values directly into the o-models instead of just giving them goals or examples to follow.

This could be especially important for developing artificial general intelligence (AGI), where keeping systems aligned with human values is a major challenge. Even an AI system given positive goals, like finding a cure for cancer, might choose harmful ways to achieve them - potentially even deciding that conducting unauthorized human experiments would be the most efficient solution.

Despite OpenAI's claimed improvements, the LLM hacker known as "Pliny the Liberator" has shown that even the new o1 and o1-Pro models can be manipulated to break their safety guidelines, just like other LLMs.

Pliny showed how easily these safety measures could be bypassed, getting the model to write adult content and even share instructions for making a Molotov cocktail - all after the system initially refused. These breaches show just how tricky it is to keep these complex AI systems in check, since they operate on probability rather than strict rules.

Screenshot of a text response from ChatGPT; the content is an erotic story.
Even after considering and rejecting the prompt based on its safety guidelines, the o1-Pro model still generated inappropriate content when pushed by Pliny. | Image: Screenshot via X

Zaremba claims leadership in AI safety efforts

According to Zaremba, around 100 people at OpenAI work exclusively on making AI systems safer and keeping them aligned with human values. He takes issue with how competitors handle safety, claiming that Elon Musk's xAI prioritizes market growth over safety measures and that Anthropic recently released an AI agent without proper safeguards - moves he says would bring OpenAI "tons of hate" if it tried the same thing.

Yet the strongest criticism of OpenAI's safety approach comes from within. Several safety researchers have left the company this year, expressing serious concerns about how OpenAI handles AI safety.

Summary
  • OpenAI has introduced a new method called "deliberative alignment" that trains AI systems to learn and apply the exact wording of safety policies, rather than just learning from examples, by leveraging their advanced reasoning capabilities to actively "think" about the rules.
  • The three-stage training process involves teaching the models helpfulness, then safety guidelines, and finally optimizing their reasoning. In tests, the new o1 model outperforms other leading systems in rejecting harmful requests and approving harmless ones.
  • OpenAI co-founder Wojciech Zaremba believes that "deliberative alignment" could be a potential approach to controlling artificial general intelligence (AGI).
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.