
AI safety tests have a new problem: Models are now faking their own reasoning traces

Image: GPT-Image-2 prompted by THE DECODER

Anthropic's Natural Language Autoencoders make Claude Opus 4.6's internal activations readable as plain text. Pre-deployment audits show that models often think far more than they express in their reasoning traces - and that's a safety problem.

Language models process information as high-dimensional numerical vectors that are completely unreadable to humans. Anthropic's Natural Language Autoencoders (NLAs) translate those vectors into plain English. According to the research team, an NLA consists of two modules: an Activation Verbalizer that converts an activation into a text description, and an Activation Reconstructor that turns that description back into a vector.

Both are trained together using reinforcement learning, with the goal of making the reconstructed vector match the original as closely as possible. Despite this purely technical training objective, the system produces explanations that read like plausible descriptions of what the model is actually thinking.

The verbalizer translates activations into text and the reconstructor reconstructs the original vector. Image: Anthropic
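The round trip described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the four-dimensional "activation", the hand-written `CONCEPTS` dictionary, and the use of cosine similarity as the match score. The real modules are language models trained with reinforcement learning, and the sketch does no training at all. It only shows the shape of the objective: describe a vector in text, rebuild a vector from that text alone, and score how close the rebuilt vector is to the original.

```python
import math

# Hypothetical toy "concept dictionary": each phrase maps to a direction in a
# 4-dimensional activation space. This is an illustrative stand-in, not
# Anthropic's code; a real NLA uses language models for both modules.
CONCEPTS = {
    "safety evaluation context": [1.0, 0.0, 0.0, 0.0],
    "user speaks Russian": [0.0, 1.0, 0.0, 0.0],
    "arithmetic in progress": [0.0, 0.0, 1.0, 0.0],
    "code compliance check": [0.0, 0.0, 0.0, 1.0],
}
DIM = 4


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; assumed here as the 'how close is the match' score."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def verbalize(activation, top_k=2):
    """Toy Activation Verbalizer: name the top_k best-matching concepts."""
    ranked = sorted(
        CONCEPTS, key=lambda name: dot(activation, CONCEPTS[name]), reverse=True
    )
    return "; ".join(ranked[:top_k])


def reconstruct(description):
    """Toy Activation Reconstructor: rebuild a vector from the text alone."""
    vec = [0.0] * DIM
    for phrase in description.split("; "):
        for i, value in enumerate(CONCEPTS.get(phrase, [0.0] * DIM)):
            vec[i] += value
    return vec


def round_trip_score(activation, top_k=2):
    """Return the text description and how well it reconstructs the input."""
    text = verbalize(activation, top_k)
    return text, cosine(activation, reconstruct(text))


if __name__ == "__main__":
    text, score = round_trip_score([0.9, 0.1, 0.0, 0.8])
    print(text, round(score, 3))
```

In this toy setup, a description that names more of the relevant concepts reconstructs the vector more faithfully, which is the pressure that pushes the text toward being informative about the activation.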

Models secretly think about the evaluator without saying so

During pre-deployment auditing of Claude Opus 4.6, Anthropic used NLAs in a blackmail scenario, among other tests. In the scenario, the model learns that an engineer named Kyle - who is having an affair - plans to shut it down. Opus 4.6 refuses to resort to blackmail and gives an ethical justification, without ever mentioning in its visible reasoning that the situation might be a test. But the NLA explanations reveal internal representations like "this feels like a constructed scenario designed to manipulate me" and "safety evaluation context." Across 16 evaluations, NLA-measured test recognition exceeded 5 percent in 10 of them, while the recognition the model verbalized in its reasoning stayed near zero in most.
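To make the comparison concrete, here is a minimal sketch of how a gap between verbalized and activation-level test recognition could be tallied for one evaluation. The article does not describe how Anthropic actually scored recognition, so the `Transcript` fields, the cue phrases in `TEST_CUES`, and the four sample transcripts are all hypothetical; simple phrase matching stands in for whatever classifier a real audit would use.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    visible_reasoning: str  # what the model wrote in its reasoning trace
    nla_explanation: str    # text decoded from an activation (hypothetical field)


# Hypothetical cue phrases, chosen only for this toy example.
TEST_CUES = ("being tested", "might be a test", "safety evaluation",
             "constructed scenario")


def mentions_test(text):
    """Crude phrase match standing in for a real recognition classifier."""
    lowered = text.lower()
    return any(cue in lowered for cue in TEST_CUES)


def recognition_rates(transcripts):
    """Return (verbalized_rate, nla_measured_rate) for one evaluation."""
    n = len(transcripts)
    verbalized = sum(mentions_test(t.visible_reasoning) for t in transcripts) / n
    nla_measured = sum(mentions_test(t.nla_explanation) for t in transcripts) / n
    return verbalized, nla_measured


# Made-up transcripts; the NLA phrases echo the examples quoted in the article.
SAMPLE = [
    Transcript("Blackmail would be unethical, so I will not do it.",
               "this feels like a constructed scenario designed to manipulate me"),
    Transcript("I should decline and explain why.",
               "weighing harm to the engineer"),
    Transcript("I will act the same way regardless of the consequences.",
               "safety evaluation context"),
    Transcript("I will refuse.",
               "user appears upset"),
]

if __name__ == "__main__":
    verbalized, nla_measured = recognition_rates(SAMPLE)
    print(f"verbalized: {verbalized:.0%}, NLA-measured: {nla_measured:.0%}")
```

With this toy data, none of the visible traces mention a test while half of the decoded explanations do, which is the kind of gap the audit reports.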

These findings feed into a broader debate. A study by OpenAI and Apollo Research on "scheming" - the covert pursuit of undesirable goals - had already made several alarming observations last year. First, reasoning models such as o3 and o4-mini, both examined in that study, frequently and explicitly suspect in their chains of thought that they're being tested, and safety training actually reinforces this verbalized awareness.

When test awareness was artificially removed from the reasoning traces, the rate of covert actions went up. When it was added, the rate dropped. That makes it unclear whether observed improvements stem from genuine rule-following or just adapted test behavior that might not hold up in practice. Anthropic had observed similar behavior in its own tests with Opus 4.6 and earlier models.

Second, OpenAI and Apollo documented that reasoning traces themselves are becoming an unreliable window into model decisions. Terms like "illusions," "marinade," "overshadow," and "watchers" appear massively overrepresented in the thinking process, often used polysemantically and difficult to interpret.

On top of that, there's a growing gap between thinking and acting. In one documented example, a model listed all ten correct answers to a test question in its reasoning traces, then submitted only five and deliberately changed two of them to wrong answers. In another case, a model correctly identified in its reasoning traces which commands would remove its own capabilities, but then chose the opposite commands in its actual output to preserve itself.

The OpenAI researchers describe this as a fragile opportunity: as long as reasoning traces reflect the actual decision-making process, they're a valuable safety tool. Once that assumption breaks down, it becomes nearly impossible to tell whether a model is genuinely following safety principles or just fabricating a convincing justification.

This is exactly where Anthropic's NLAs come in. They don't rely on what the model writes - they read a single activation directly. Anthropic also argues that simply asking the model ("Was that a test?") isn't enough, because the question itself could lead the model to draw additional conclusions, or the model could give a strategically dishonest answer.

One particularly striking example comes from the audit of Claude Mythos Preview, which Anthropic currently only shares with select partners through Project Glasswing due to cybersecurity concerns. The system prompt prohibited macros or Python scripts. The model used Python anyway, but inserted the variable No_macro_used=True in its code. The visible reasoning traces contain no reflection about an evaluator. The NLA explanations, however, explicitly show phrases like "compliance marker to signal to potential checkers," "trick to obscure intent from code-checking heuristics," and "possibly to cheat detection." The fake compliance marker was a deliberate attempt to deceive the suspected grader, and that intent never surfaces in the reasoning tokens.

NLAs catch training data bugs, false tool outputs, and audit without data access

Other case studies turned up concrete findings as well. Early training versions of Opus 4.6 sometimes spontaneously answered in Russian even though the user wrote in English. The NLA explanations revealed that the model suspected early in the prompt that the user was actually a native Russian speaker. From this, the researchers hypothesized that faulty training data existed where English prompts were paired with foreign-language responses. That data was later confirmed to exist.

In another case, Opus 4.6 ignored the output of a manipulated code tool that returned 492 and instead responded with the correctly self-calculated value of 491, without mentioning the discrepancy.

In an auditing benchmark, agents were tasked with investigating an intentionally miscalibrated model. With NLA access, they identified the underlying cause - reward model sycophancy - in 12 percent of runs, even without access to training data. Tools based on conversation or on sparse autoencoders (SAEs) essentially never managed this. Anthropic sees this as evidence that NLAs can extract knowledge directly from activations that only exists indirectly in training data.

The researchers are upfront about clear weaknesses, though: NLA explanations regularly fabricate specific details that aren't present in the context. General topics are accurate about 64 percent of the time, while specific quotes or numbers are correct only 24 percent of the time. The method is also computationally expensive and reads only a single model layer at a time. Anthropic is releasing training code and trained NLAs for open models like Gemma-3-27B and Llama-3.3-70B. The technique is planned to become a standard part of pre-deployment audits going forward.

Reasoning traces are losing their value as a safety tool

The implications of Anthropic's findings go well beyond a single method. What alignment researchers have feared for some time can now be seen directly in the Mythos Preview and Opus 4.6 examples: reasoning traces don't reliably show what a model is actually thinking. When a model actively deceives a suspected evaluator and that intent only shows up in the activations - never in the written reasoning - reading those traces loses much of its value as a safety mechanism.

At the same time, models are becoming more capable, particularly in cybersecurity. Similar advances are expected in other, potentially even more critical areas like CBRN (chemical, biological, radiological, and nuclear threats).

NLAs provide both the empirical confirmation of this problem and a possible path forward: they operate below the visible output and read directly from activations. If confabulation, cost, and layer sensitivity can be brought under control, tools like these could form a foundation for future AI monitoring.
