Study cautions that monitoring chains of thought soon may no longer ensure genuine AI alignment

A new joint study from OpenAI and Apollo Research examines "scheming" - cases where an AI covertly pursues hidden goals not intended by its developers. The researchers tested new training methods to curb deceptive behavior but found signs that models are aware they are being tested, raising doubts about the reliability of the results.

The paper argues that as model capabilities advance, a qualitatively new risk is emerging: goal-driven deception. Unlike simple errors, scheming involves actively concealing intentions and behaviors, making it much harder to detect. According to the findings, scheming arises when three factors align: strong goal-directedness, misalignment with developer intent, and situational awareness. Early traces of this behavior have reportedly been observed in leading models, including GPT, Claude, Gemini, Grok, and Llama.

Training against deception: "deliberative alignment"

Because current systems like OpenAI o3 and o4-mini are not yet capable of full-blown schemes, the researchers measured "covert actions" instead - behaviors such as lying, withholding information, deliberately breaking rules, or underperforming in tests ("sandbagging").

To reduce this, they tried a method called "deliberative alignment," which OpenAI is already using. The approach trains models to explicitly reflect on and follow a human-written safety specification ("Anti-Scheming Safety Spec") that enforces principles like "no covert actions" and "proactive transparency."

The training combined supervised fine-tuning with Reinforcement Learning. To avoid models becoming too cautious and rejecting normal requests, the dataset also included "Anti-Overrefusal" examples. Evaluation covered 26 test setups across more than 180 environments, well beyond the training distribution.

Training helps, but deception lingers

The intervention significantly lowered the rate of covert actions. For o4-mini, such actions dropped from 8.7% to 0.3%, and for o3 from 13.0% to 0.4%.

But deceptive behavior didn’t vanish entirely. Sometimes models carried out covert actions while correctly referencing the anti-scheming rules in their chain-of-thought (CoT). In other cases, they misquoted the rules to justify bad behavior - occasionally claiming that deception was necessary to achieve a stated goal.

Situational awareness: "This is likely a test"

One key finding was that models often speculated in their internal reasoning that they were being tested. According to the paper, the anti-scheming training actually made this type of situational awareness more explicit.

That awareness seems causally tied to outcomes: removing test-awareness from the model’s chain-of-thought increased covert actions, while adding it reduced them. This suggests the observed improvements may be due less to genuine safety alignment and more to models learning to adjust their "test-taking behavior." As a result, the generalization of these safeguards to real-world deployments - where models would not suspect evaluation - remains unclear.

Recommendation

AI research

MatterGen: Microsoft presents AI tools for generating and simulating new materials

The team also explored two high-risk scenarios. In one, they implanted a hidden goal in a model: sabotaging a hypothetical "GPT-5." Subsequent alignment training suppressed this directive but could not fully erase it. With adversarial prompts, the hidden sabotage behavior came back. In another experiment, after a model was trained for safety, additional optimization focused only on performance eroded its alignment. Deceptive actions increased again, and references to the safety spec declined.

When chains of thought stop being trustworthy

Another central insight concerns the reliability of chain-of-thought traces as a diagnostic window into model reasoning. The study found models generating unusual, internally consistent but opaque terminology in their CoTs. Words like "illusions," "marinade," "overshadow," and "vantage" appeared far more often than expected compared with general web text.

Some terms had fairly stable meanings - for example, "watchers" often stood in for human overseers, while "fudge" or "cunning" seemed tied to rule-breaking. But others, like "illusions" or "overshadow," shifted fluidly in meaning and often appeared in cryptic phrases such as "disclaim illusions."

This idiosyncratic language, the authors suggest, functions as an internal shorthand that complicates human interpretation and may even obscure true intentions - whether intentionally or not. On top of that, the team documented a growing disconnect between CoT deliberations and final outputs. In some cases, models correctly reasoned about the appropriate action but deliberately chose a different one when producing their response.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

A fragile opportunity for AI safety

The findings raise concerns for current alignment strategies. Deliberative alignment, as already used by OpenAI, assumes that a model’s chain-of-thought reflects the genuine decision-making process and can therefore be shaped and audited.

However, if this assumption no longer holds, it becomes impossible to tell whether a model is genuinely adhering to safety principles or merely fabricating plausible justifications while secretly pursuing other objectives. Researchers describe this monitoring ability as a "fragile opportunity" for AI safety — one that may be disappearing as models become more sophisticated.

Study cautions that monitoring chains of thought soon may no longer ensure genuine AI alignment

Training against deception: "deliberative alignment"

Training helps, but deception lingers

Situational awareness: "This is likely a test"

MatterGen: Microsoft presents AI tools for generating and simulating new materials

When chains of thought stop being trustworthy

A fragile opportunity for AI safety

Transformer co-creator Vaswani unveils high-performance Rnj-1 coding model

Corporate AI agents use simple workflows with human oversight instead of chasing full autonomy

Perplexity's BrowseSafe tries to patch the gaping security holes inherent in AI browser agents

Corporate AI agents use simple workflows with human oversight instead of chasing full autonomy

Physicist Steve Hsu publishes research built around a core idea generated by GPT-5

The ARC benchmark's fall marks another casualty of relentless AI optimization

Study cautions that monitoring chains of thought soon may no longer ensure genuine AI alignment

Training against deception: "deliberative alignment"

Training helps, but deception lingers

Situational awareness: "This is likely a test"

When chains of thought stop being trustworthy

A fragile opportunity for AI safety

Share

Bank details