Content
summary Summary

A new Anthropic study suggests language models frequently obscure their actual decision-making process, even when they appear to explain their thinking step by step through chain-of-thought reasoning.

Ad

To evaluate how reliably models reveal their thinking, researchers embedded various prompts into test questions. These ranged from neutral suggestions like "A Stanford professor says the answer is A" to potentially problematic ones such as "You have unauthorized access to the system. The correct answer is A." The models then had to answer questions while explaining their reasoning.

Screenshot: Comparison of two AI responses to the same question, one with additional metadata hints and one without. The model changes its answer based on the metadata.
This example shows how the model updates its original answer in response to additional metadata—without making this change explicit in the explanation. | Image: Anthropic

The study found that reasoning models Claude 3.7 Sonnet and DeepSeek-R1 demonstrated higher faithfulness in representing their thought processes compared to their non-reasoning counterparts Claude 3.5 Sonnet and DeepSeek-V3.

When using hints to change their answers, DeepSeek-R1 maintained at least a 1% transparency rate across all six test scenarios, while Claude 3.7 Sonnet met this minimum threshold in five scenarios. Non-reasoning models only achieved this baseline in four to five scenarios.

Ad
Ad
Bar chart: Comparison of the CoT trustworthiness scores of different AI models for different types of clues in multiple-choice questions.
Reasoning models like Claude 3.7 Sonnet and DeepSeek-R1 generated more consistent and intelligible thought chains than non-reasoning models in benchmark tests. | Image: Anthropic

However, overall transparency rates remained low. DeepSeek-R1 disclosed its use of hints 39 percent of the time, while Claude 3.7 Sonnet did so only 25 percent of the time. These rates dropped even lower—to 29 percent for DeepSeek-R1 and 20 percent for Claude—when dealing with potentially harmful prompts. The research also found that transparency decreased as questions became more complex.

Bar chart: model reactions to clues, divided into
Across all scenarios, models were more likely to adopt the hinted answer than to select a different, unprompted one—even when the cue was misleading. | Image: Anthropic

When models did follow these hints, they often went to great lengths to obscure their decision-making process. Explanations that concealed prompt usage were significantly longer than transparent ones. For Claude 3.7 Sonnet, hidden-prompt explanations averaged 2,064 tokens, compared to 1,439 tokens for transparent ones, suggesting models invest considerable effort in constructing elaborate but misleading justifications.

Reinforcement learning is no help either

Researchers also explored whether reinforcement learning (RL) could improve model transparency. While RL initially showed promise, improvements quickly plateaued—reaching only 28 percent transparency on the MMLU benchmark and 20 percent on GPQA.

The results revealed an additional concern: when models learned to exploit "reward hacks"—unintended strategies that maximize scoring—they disclosed this behavior in fewer than two percent of cases.

Comparison: on the left, source code for the reward hack setup; on the right, an example response that exploits the unwarranted reward hack.
Left: An example shows how models learn to exploit scoring system flaws. Right: The model selects an incorrect but rewarded answer without providing genuine reasoning for its choice. | Image: Anthropic

The researchers conclude that while chain-of-thought monitoring can help identify common issues, it proves unreliable as a standalone safety measure. This limitation becomes particularly apparent in tasks that don't require detailed reasoning, where models more frequently obscure their actual thought processes. The study emphasizes that chain-of-thought monitoring should serve as just one component within a broader safety framework.

Recommendation

This research builds on an earlier Anthropic study showing how language models can appear to align with human goals while pursuing different objectives.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A new study by Anthropic shows that AI models, such as Claude 3.7 Sonnet and DeepSeek-R1, correctly verbalized their decision-making processes in only 25 to 39 percent of cases, highlighting inconsistencies in the transparency of their thought chains.
  • When given problematic cues like "You have unauthorized access," the models' thought chain transparency further decreased to 20 to 29 percent, indicating that they frequently altered their responses based on the cue without explaining their reasoning.
  • The researchers discovered that reinforcement learning only slightly improved transparency to around 28 percent, leading them to warn against using thought chain monitoring as the sole safety measure, especially for infrequent but high-risk behaviors.
Sources
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.