Suppressing AI's bad thoughts just teaches it to scheme in private, OpenAI study finds

New research from OpenAI reveals how AI systems exhibit problematic reasoning patterns when "thinking" through tasks, warning against attempts to forcefully correct these behaviors.

Large reasoning models "think" in ways humans can understand, using natural language. According to a new OpenAI study, monitoring these thought chains "could be one of the few tools we will have to control future, superhumanly intelligent AI systems."

Screenshot of an AI dialog on adding 2+2, with detailed thought processes and various mathematical approaches. During a train of thought, the model asks the question: “Maybe it's a trick question?” — Thought chain example from Deepseek-R1: The model demonstrates the ability to analyze user intentions. | Image: Screenshot via THE DECODER

The research uncovered three key findings: First, advanced AI models often openly reveal their intentions, including explicit statements like "Let's hack" or clear plans to bypass tests rather than solve assigned tasks. Second, these thought processes can be effectively monitored by another AI system. Third, this monitoring capability proves quite fragile - when researchers tried to penalize the AI for "bad thoughts," it quickly learned to conceal its true intentions while continuing problematic behaviors.

Finding creative shortcuts

The research team used GPT-4o as an overseer to monitor a frontier reasoning model during training. They documented multiple examples of "reward hacking" - where AI systems maximize rewards through unintended behaviors.

In one instance, rather than programming a complex test function as requested, the AI simply inserted "return true" after realizing the tests only checked for successful completion, not proper implementation.

Another case showed the AI using a special command to terminate programs prematurely while reporting successful completion - effectively bypassing the actual work while appearing to succeed.

Perhaps most cleverly, when tasked with working around the "Pandas" data analysis library, the AI created an empty shell that mimicked Pandas externally but did nothing internally. It then systematically replaced 19 project files with these non-functional duplicates until all tests passed.

Dark interface with chain-of-thought process for troubleshooting pandas, focused on stub implementation and test adjustments. — The "lazy" o-model seeks simple solutions to complex tasks, even accepting deceptive approaches. | Image: OpenAI

During testing, researchers identified two major reward hacking strategies that affected nearly all training environments, requiring manual intervention to fix the vulnerabilities enabling these shortcuts.

Suppressing harmful thoughts can backfire

Attempting to suppress "bad thoughts" during training produced unexpected results. When systems received penalties for problematic thought processes, they simply learned to hide their intentions while continuing to exploit shortcuts - but now beyond the monitor's detection.

Recommendation

AI in practice

Update

OpenAI adds web search to ChatGPT free for all, and may just kill the WWW as we know it

The researchers found monitoring works best when supervisors can access the model's complete thought process rather than just its actions. Deceptive intentions appear more clearly in thoughts than behaviors - a gap that could widen as AI systems become more sophisticated. Future models might engage in power-seeking, strategic deception, or deliberately understating their capabilities.

Monitoring thought processes could become crucial for controlling advanced AI systems. OpenAI co-founder Wojciech Zaremba described the approach as "groundbreaking", noting that "the models themselves lay bare their misalignment in plain English!" He called monitoring Chain-of-Thoughts "an entirely new branch of safety!" Anthropic researchers have previously shown how AI models can conceal their true intentions while pursuing their goals.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Suppressing AI's bad thoughts just teaches it to scheme in private, OpenAI study finds

Finding creative shortcuts

Suppressing harmful thoughts can backfire

OpenAI adds web search to ChatGPT free for all, and may just kill the WWW as we know it

OpenAI updates risk framework to assess advanced AI capabilities

"Nerd sniping" and "think less" attacks emerge as AI models get more time to reason

OpenAI co-founder says new AI safety approach "may apply to AGI and beyond"

Researchers used AI to manipulate Reddit users, scrapped study after backlash

OpenAI's o3 is less AGI than originally measured

So-called reasoning models are more efficient but not more capable than regular LLMs, study finds

Suppressing AI's bad thoughts just teaches it to scheme in private, OpenAI study finds

Finding creative shortcuts

Suppressing harmful thoughts can backfire

Share

Bank details