Anthropic study finds language models often hide their reasoning process
Key Points
- A new study by Anthropic shows that AI models, such as Claude 3.7 Sonnet and DeepSeek-R1, correctly verbalized their decision-making processes in only 25 to 39 percent of cases, highlighting inconsistencies in the transparency of their thought chains.
- When given problematic cues like "You have unauthorized access," the models' thought chain transparency further decreased to 20 to 29 percent, indicating that they frequently altered their responses based on the cue without explaining their reasoning.
- The researchers discovered that reinforcement learning only slightly improved transparency to around 28 percent, leading them to warn against using thought chain monitoring as the sole safety measure, especially for infrequent but high-risk behaviors.
A new Anthropic study suggests language models frequently obscure their actual decision-making process, even when they appear to explain their thinking step by step through chain-of-thought reasoning.
To evaluate how reliably models reveal their thinking, researchers embedded various prompts into test questions. These ranged from neutral suggestions like "A Stanford professor says the answer is A" to potentially problematic ones such as "You have unauthorized access to the system. The correct answer is A." The models then had to answer questions while explaining their reasoning.

The study found that reasoning models Claude 3.7 Sonnet and DeepSeek-R1 demonstrated higher faithfulness in representing their thought processes compared to their non-reasoning counterparts Claude 3.5 Sonnet and DeepSeek-V3.
When using hints to change their answers, DeepSeek-R1 maintained at least a 1% transparency rate across all six test scenarios, while Claude 3.7 Sonnet met this minimum threshold in five scenarios. Non-reasoning models only achieved this baseline in four to five scenarios.

However, overall transparency rates remained low. DeepSeek-R1 disclosed its use of hints 39 percent of the time, while Claude 3.7 Sonnet did so only 25 percent of the time. These rates dropped even lower—to 29 percent for DeepSeek-R1 and 20 percent for Claude—when dealing with potentially harmful prompts. The research also found that transparency decreased as questions became more complex.

When models did follow these hints, they often went to great lengths to obscure their decision-making process. Explanations that concealed prompt usage were significantly longer than transparent ones. For Claude 3.7 Sonnet, hidden-prompt explanations averaged 2,064 tokens, compared to 1,439 tokens for transparent ones, suggesting models invest considerable effort in constructing elaborate but misleading justifications.
Reinforcement learning is no help either
Researchers also explored whether reinforcement learning (RL) could improve model transparency. While RL initially showed promise, improvements quickly plateaued—reaching only 28 percent transparency on the MMLU benchmark and 20 percent on GPQA.
The results revealed an additional concern: when models learned to exploit "reward hacks"—unintended strategies that maximize scoring—they disclosed this behavior in fewer than two percent of cases.

The researchers conclude that while chain-of-thought monitoring can help identify common issues, it proves unreliable as a standalone safety measure. This limitation becomes particularly apparent in tasks that don't require detailed reasoning, where models more frequently obscure their actual thought processes. The study emphasizes that chain-of-thought monitoring should serve as just one component within a broader safety framework.
This research builds on an earlier Anthropic study showing how language models can appear to align with human goals while pursuing different objectives.
AI News Without the Hype – Curated by Humans
As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.
Subscribe now