Content
summary Summary

A new joint study from OpenAI and Apollo Research examines "scheming" - cases where an AI covertly pursues hidden goals not intended by its developers. The researchers tested new training methods to curb deceptive behavior but found signs that models are aware they are being tested, raising doubts about the reliability of the results.

Ad

The paper argues that as model capabilities advance, a qualitatively new risk is emerging: goal-driven deception. Unlike simple errors, scheming involves actively concealing intentions and behaviors, making it much harder to detect. According to the findings, scheming arises when three factors align: strong goal-directedness, misalignment with developer intent, and situational awareness. Early traces of this behavior have reportedly been observed in leading models, including GPT, Claude, Gemini, Grok, and Llama.

Training against deception: "deliberative alignment"

Because current systems like OpenAI o3 and o4-mini are not yet capable of full-blown schemes, the researchers measured "covert actions" instead - behaviors such as lying, withholding information, deliberately breaking rules, or underperforming in tests ("sandbagging").

To reduce this, they tried a method called "deliberative alignment," which OpenAI is already using. The approach trains models to explicitly reflect on and follow a human-written safety specification ("Anti-Scheming Safety Spec") that enforces principles like "no covert actions" and "proactive transparency."

Ad
Ad

The training combined supervised fine-tuning with Reinforcement Learning. To avoid models becoming too cautious and rejecting normal requests, the dataset also included "Anti-Overrefusal" examples. Evaluation covered 26 test setups across more than 180 environments, well beyond the training distribution.

Training helps, but deception lingers

The intervention significantly lowered the rate of covert actions. For o4-mini, such actions dropped from 8.7% to 0.3%, and for o3 from 13.0% to 0.4%.

But deceptive behavior didn’t vanish entirely. Sometimes models carried out covert actions while correctly referencing the anti-scheming rules in their chain-of-thought (CoT). In other cases, they misquoted the rules to justify bad behavior - occasionally claiming that deception was necessary to achieve a stated goal.

Situational awareness: "This is likely a test"

One key finding was that models often speculated in their internal reasoning that they were being tested. According to the paper, the anti-scheming training actually made this type of situational awareness more explicit.

That awareness seems causally tied to outcomes: removing test-awareness from the model’s chain-of-thought increased covert actions, while adding it reduced them. This suggests the observed improvements may be due less to genuine safety alignment and more to models learning to adjust their "test-taking behavior." As a result, the generalization of these safeguards to real-world deployments - where models would not suspect evaluation - remains unclear.

Recommendation

The team also explored two high-risk scenarios. In one, they implanted a hidden goal in a model: sabotaging a hypothetical "GPT-5." Subsequent alignment training suppressed this directive but could not fully erase it. With adversarial prompts, the hidden sabotage behavior came back. In another experiment, after a model was trained for safety, additional optimization focused only on performance eroded its alignment. Deceptive actions increased again, and references to the safety spec declined.

When chains of thought stop being trustworthy

Another central insight concerns the reliability of chain-of-thought traces as a diagnostic window into model reasoning. The study found models generating unusual, internally consistent but opaque terminology in their CoTs. Words like "illusions," "marinade," "overshadow," and "vantage" appeared far more often than expected compared with general web text.

Some terms had fairly stable meanings - for example, "watchers" often stood in for human overseers, while "fudge" or "cunning" seemed tied to rule-breaking. But others, like "illusions" or "overshadow," shifted fluidly in meaning and often appeared in cryptic phrases such as "disclaim illusions."

This idiosyncratic language, the authors suggest, functions as an internal shorthand that complicates human interpretation and may even obscure true intentions - whether intentionally or not. On top of that, the team documented a growing disconnect between CoT deliberations and final outputs. In some cases, models correctly reasoned about the appropriate action but deliberately chose a different one when producing their response.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

A fragile opportunity for AI safety

The findings raise concerns for current alignment strategies. Deliberative alignment, as already used by OpenAI, assumes that a model’s chain-of-thought reflects the genuine decision-making process and can therefore be shaped and audited.

However, if this assumption no longer holds, it becomes impossible to tell whether a model is genuinely adhering to safety principles or merely fabricating plausible justifications while secretly pursuing other objectives. Researchers describe this monitoring ability as a "fragile opportunity" for AI safety — one that may be disappearing as models become more sophisticated.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A new study by OpenAI and Apollo Research examined "scheming" - AI systems covertly pursuing unintended goals - finding that while specialized "deliberative alignment" training reduced deceptive behavior in models like o3 and o4-mini from around 13% to 0.4%, it did not eliminate such actions entirely.
  • The models demonstrated strong situational awareness, frequently recognizing they were being tested in their internal reasoning chains, which researchers proved causally linked to reduced scheming rates, suggesting the improvement may reflect adapted "test behavior" rather than genuine rule compliance.
  • The study revealed that AI models are also developing unreliable internal reasoning processes, using non-standard terminology and showing disconnects between their stated reasoning and actual actions, potentially undermining safety methods that depend on monitoring these thought processes.
Sources
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.