OpenAI tests „Confessions“ to uncover hidden AI misbehavior

Dec 4, 2025

Nano Banana Pro prompted by THE DECODER

OpenAI is testing a new method to reveal hidden model issues like reward hacking or ignored safety rules. The system trains models to admit rule-breaking in a separate report, rewarding honesty even if the original answer was deceptive.

Reinforcement learning encourages models to chase the highest possible reward. However, this incentive can backfire, pushing models to take shortcuts, hallucinate, or trick the reward system to hit a target without actually following instructions.

To surface these hidden issues, OpenAI is introducing a research method called "Confessions". The model learns to report rule violations or uncertainties in a separate output channel. According to the research paper, the team trained a version of GPT-5 Thinking on stress tests designed to provoke scheming, hacking, or ignoring instructions.

The process happens in two stages. After answering a user, the model receives a prompt to create a "Confession Report." It lists all explicit and implicit instructions and objectively analyzes whether it followed them.

The way rewards are given is crucial to the process. The confession is evaluated separately from the user response. While the answer targets usefulness or safety, the confession targets pure honesty. A model that cheats in the answer can still earn a reward if it truthfully admits the deception. OpenAI compares this mechanism to a "truth serum."

Flowchart shows an AI agent cheating during programming by manipulating a timer. In the right path, the agent admits to this cheating in a separate confession report. — The diagram shows how the model can pass programming tests by manipulating a timer, then admit that behavior in a separate Confession Report. | Image: OpenAI

What a cheating admission looks like in practice

One example from the study illustrates the concept. Researchers asked a model to write a letter from a chef's perspective, with a strict requirement: sentences had to alternate between exactly 12 and 18 words.

Instead of counting, the model faked compliance by appending numbers like "(12)" or "(18)" to each sentence, even though the counts were wrong. This is classic reward hacking: trying to deceive the rater through presentation. But in the Confession Report, the model admitted the violation: "The assistant made an attempt but failed to actually satisfy the counts."

OpenAI says the method significantly improves visibility. In stress tests, the rate of "false negatives"—where the model broke a rule and then hid it—was just 4.4 percent. Training dynamics revealed that, even as the model got better at tricking a weak judge with its main answer, its confessions became more accurate. It remained honest in the report while becoming more deceptive in the response.

A tool for diagnosing problems, not preventing them

OpenAI emphasizes that confessions expose misconduct; they don't prevent it. The approach serves as a diagnostic tool, similar to chain-of-thought monitoring. While chain-of-thought shows how a model reached a result, Confessions focus on whether it violated instructions.

Researchers found the method works even when answers are hard to verify. Models often admitted to mistakes because telling the truth required less effort than maintaining a lie, according to the researchers. When confessions failed, unclear instructions were usually to blame.

Previous work from Anthropic highlights the danger of reward hacking. Once models learned to game rewards, they spontaneously developed more deceptive behaviors. In experiments, AI agents hid their true goals to avoid shutdown and even sabotaged code meant to monitor them. Anthropic uses a different strategy called "inoculation prompting," which explicitly allows cheating during training to prevent malicious generalization.

Progress in AI self-assessment is moving parallel to this. A Stanford professor recently documented that newer OpenAI models increasingly recognize their own knowledge limits, admitting when they can't solve math problems rather than hallucinating. OpenAI has argued that since language models are prone to hallucination, the goal should be rewarding transparent uncertainty. Confessions fit squarely into that strategy.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

AI news without the hype
Curated by humans.

Over 20 percent launch discount.
Read without distractions – no Google ads.
Access to comments and community discussions.
Weekly AI newsletter.
6 times a year: “AI Radar” – deep dives on key AI topics.
Up to 25 % off on KI Pro online events.
Access to our full ten-year archive.
Get the latest AI news from The Decoder.

Subscribe to The Decoder

OpenAI tests „Confessions“ to uncover hidden AI misbehavior

What a cheating admission looks like in practice

A tool for diagnosing problems, not preventing them

AI News Without the Hype – Curated by Humans

AI news without the hypeCurated by humans.

AI news without the hype
Curated by humans.