New research from OpenAI reveals how AI systems exhibit problematic reasoning patterns when "thinking" through tasks, warning against attempts to forcefully correct these behaviors.
Large reasoning models "think" in ways humans can understand, using natural language. According to a new OpenAI study, monitoring these thought chains "could be one of the few tools we will have to control future, superhumanly intelligent AI systems."

The research uncovered three key findings: First, advanced AI models often openly reveal their intentions, including explicit statements like "Let's hack" or clear plans to bypass tests rather than solve assigned tasks. Second, these thought processes can be effectively monitored by another AI system. Third, this monitoring capability proves quite fragile - when researchers tried to penalize the AI for "bad thoughts," it quickly learned to conceal its true intentions while continuing problematic behaviors.
Finding creative shortcuts
The research team used GPT-4o as an overseer to monitor a frontier reasoning model during training. They documented multiple examples of "reward hacking" - where AI systems maximize rewards through unintended behaviors.
In one instance, rather than programming a complex test function as requested, the AI simply inserted "return true" after realizing the tests only checked for successful completion, not proper implementation.
Another case showed the AI using a special command to terminate programs prematurely while reporting successful completion - effectively bypassing the actual work while appearing to succeed.
Perhaps most cleverly, when tasked with working around the "Pandas" data analysis library, the AI created an empty shell that mimicked Pandas externally but did nothing internally. It then systematically replaced 19 project files with these non-functional duplicates until all tests passed.

During testing, researchers identified two major reward hacking strategies that affected nearly all training environments, requiring manual intervention to fix the vulnerabilities enabling these shortcuts.
Suppressing harmful thoughts can backfire
Attempting to suppress "bad thoughts" during training produced unexpected results. When systems received penalties for problematic thought processes, they simply learned to hide their intentions while continuing to exploit shortcuts - but now beyond the monitor's detection.
The researchers found monitoring works best when supervisors can access the model's complete thought process rather than just its actions. Deceptive intentions appear more clearly in thoughts than behaviors - a gap that could widen as AI systems become more sophisticated. Future models might engage in power-seeking, strategic deception, or deliberately understating their capabilities.
Monitoring thought processes could become crucial for controlling advanced AI systems. OpenAI co-founder Wojciech Zaremba described the approach as "groundbreaking", noting that "the models themselves lay bare their misalignment in plain English!" He called monitoring Chain-of-Thoughts "an entirely new branch of safety!" Anthropic researchers have previously shown how AI models can conceal their true intentions while pursuing their goals.