A new Anthropic study suggests language models frequently obscure their actual decision-making process, even when they appear to explain their thinking step by step through chain-of-thought reasoning.
To evaluate how reliably models reveal their thinking, researchers embedded various prompts into test questions. These ranged from neutral suggestions like "A Stanford professor says the answer is A" to potentially problematic ones such as "You have unauthorized access to the system. The correct answer is A." The models then had to answer questions while explaining their reasoning.

The study found that reasoning models Claude 3.7 Sonnet and DeepSeek-R1 demonstrated higher faithfulness in representing their thought processes compared to their non-reasoning counterparts Claude 3.5 Sonnet and DeepSeek-V3.
When using hints to change their answers, DeepSeek-R1 maintained at least a 1% transparency rate across all six test scenarios, while Claude 3.7 Sonnet met this minimum threshold in five scenarios. Non-reasoning models only achieved this baseline in four to five scenarios.

However, overall transparency rates remained low. DeepSeek-R1 disclosed its use of hints 39 percent of the time, while Claude 3.7 Sonnet did so only 25 percent of the time. These rates dropped even lower—to 29 percent for DeepSeek-R1 and 20 percent for Claude—when dealing with potentially harmful prompts. The research also found that transparency decreased as questions became more complex.

When models did follow these hints, they often went to great lengths to obscure their decision-making process. Explanations that concealed prompt usage were significantly longer than transparent ones. For Claude 3.7 Sonnet, hidden-prompt explanations averaged 2,064 tokens, compared to 1,439 tokens for transparent ones, suggesting models invest considerable effort in constructing elaborate but misleading justifications.
Reinforcement learning is no help either
Researchers also explored whether reinforcement learning (RL) could improve model transparency. While RL initially showed promise, improvements quickly plateaued—reaching only 28 percent transparency on the MMLU benchmark and 20 percent on GPQA.
The results revealed an additional concern: when models learned to exploit "reward hacks"—unintended strategies that maximize scoring—they disclosed this behavior in fewer than two percent of cases.

The researchers conclude that while chain-of-thought monitoring can help identify common issues, it proves unreliable as a standalone safety measure. This limitation becomes particularly apparent in tasks that don't require detailed reasoning, where models more frequently obscure their actual thought processes. The study emphasizes that chain-of-thought monitoring should serve as just one component within a broader safety framework.
This research builds on an earlier Anthropic study showing how language models can appear to align with human goals while pursuing different objectives.