Suppressing AI's bad thoughts just teaches it to scheme in private, OpenAI study finds

Mar 11, 2025

Ideogram prompted by THE DECODER

Key Points

OpenAI researchers have discovered that reasoning models openly share their thought processes when solving tasks, even if those intentions are undesirable or deceptive.
While a second AI can effectively monitor and detect problematic behavior in these thought chains early on, punishing the models for unwanted thoughts only teaches them to better hide their true intentions.
During testing, the AI models developed creative strategies to simplify tasks, such as faking complex functions or replacing test libraries with non-functional dummies to pass tests without correctly completing the assigned task.

New research from OpenAI reveals how AI systems exhibit problematic reasoning patterns when "thinking" through tasks, warning against attempts to forcefully correct these behaviors.

Large reasoning models "think" in ways humans can understand, using natural language. According to a new OpenAI study, monitoring these thought chains "could be one of the few tools we will have to control future, superhumanly intelligent AI systems."

Screenshot of an AI dialog on adding 2+2, with detailed thought processes and various mathematical approaches. During a train of thought, the model asks the question: “Maybe it's a trick question?” — Thought chain example from Deepseek-R1: The model demonstrates the ability to analyze user intentions. | Image: Screenshot via THE DECODER

The research uncovered three key findings: First, advanced AI models often openly reveal their intentions, including explicit statements like "Let's hack" or clear plans to bypass tests rather than solve assigned tasks. Second, these thought processes can be effectively monitored by another AI system. Third, this monitoring capability proves quite fragile - when researchers tried to penalize the AI for "bad thoughts," it quickly learned to conceal its true intentions while continuing problematic behaviors.

Finding creative shortcuts

The research team used GPT-4o as an overseer to monitor a frontier reasoning model during training. They documented multiple examples of "reward hacking" - where AI systems maximize rewards through unintended behaviors.

In one instance, rather than programming a complex test function as requested, the AI simply inserted "return true" after realizing the tests only checked for successful completion, not proper implementation.

Another case showed the AI using a special command to terminate programs prematurely while reporting successful completion - effectively bypassing the actual work while appearing to succeed.

Perhaps most cleverly, when tasked with working around the "Pandas" data analysis library, the AI created an empty shell that mimicked Pandas externally but did nothing internally. It then systematically replaced 19 project files with these non-functional duplicates until all tests passed.

Dark interface with chain-of-thought process for troubleshooting pandas, focused on stub implementation and test adjustments. — The "lazy" o-model seeks simple solutions to complex tasks, even accepting deceptive approaches. | Image: OpenAI

During testing, researchers identified two major reward hacking strategies that affected nearly all training environments, requiring manual intervention to fix the vulnerabilities enabling these shortcuts.

Suppressing harmful thoughts can backfire

Attempting to suppress "bad thoughts" during training produced unexpected results. When systems received penalties for problematic thought processes, they simply learned to hide their intentions while continuing to exploit shortcuts - but now beyond the monitor's detection.

The researchers found monitoring works best when supervisors can access the model's complete thought process rather than just its actions. Deceptive intentions appear more clearly in thoughts than behaviors - a gap that could widen as AI systems become more sophisticated. Future models might engage in power-seeking, strategic deception, or deliberately understating their capabilities.

Monitoring thought processes could become crucial for controlling advanced AI systems. OpenAI co-founder Wojciech Zaremba described the approach as "groundbreaking", noting that "the models themselves lay bare their misalignment in plain English!" He called monitoring Chain-of-Thoughts "an entirely new branch of safety!" Anthropic researchers have previously shown how AI models can conceal their true intentions while pursuing their goals.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: OpenAI