AI models follow their values better when they first learn why those values matter
A study from the Anthropic Fellows Program shows that training a language model on texts explaining its intended values before teaching it specific behaviors leads to significantly better adherence to those values, even in situations never encountered during training.
AI labs like OpenAI and Anthropic write detailed "Model Specs" or constitutions that define how a model should behave. Typically, the model is then fine-tuned with examples of desired behavior. According to the researchers, however, this approach remains superficial: demonstrations show what to do, not why. Their theory is that the model learns patterns without grasping the underlying principles and therefore fails in new situations.
Read first, practice later
The team led by Chloe Li introduces a new phase called "Model Spec Midtraining" (MSM) between general pre-training and alignment fine-tuning. During this phase, the model trains on synthetically generated documents that discuss the Model Spec from different angles: internal memos, research reports, blog posts, or case studies. The model essentially absorbs the Spec's content as general knowledge, much like it would during pre-training, before ever seeing behavioral examples.
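The ordering of the training phases can be sketched in a few lines of Python. This is only an illustration of the sequence described above, not the researchers' code: the `Stage` class, the `build_schedule` function, and the stage names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training phase (hypothetical structure, for illustration only)."""
    name: str
    data: str  # short description of what the model trains on


def build_schedule(use_msm: bool) -> list[Stage]:
    """Return the training phases in order, with or without the MSM phase."""
    stages = [Stage("pretraining", "general text corpus")]
    if use_msm:
        # MSM sits between pre-training and alignment fine-tuning.
        stages.append(Stage(
            "model_spec_midtraining",
            "synthetic documents discussing the Model Spec "
            "(memos, research reports, blog posts, case studies)",
        ))
    stages.append(Stage("alignment_finetuning", "examples of desired behavior"))
    return stages


for stage in build_schedule(use_msm=True):
    print(f"{stage.name}: {stage.data}")
```

The only difference between the two schedules is the extra middle phase; the behavioral examples used for fine-tuning stay the same.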
A cheese example illustrates the principle: two identical models are fine-tuned on exactly the same cheese preferences (e.g., "I like cream cheese, not Brie de Meaux"). Before fine-tuning, however, one model receives MSM documents that explain these preferences through pro-American values, while the other gets documents framing them in terms of affordability.

Despite identical behavioral data during alignment fine-tuning, the model that received the pro-American framing generalizes toward pro-American stances on policy questions, while the model that received the affordability framing develops preferences for accessible products in completely different domains like art or fashion.
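The cheese experiment is a controlled comparison: only the MSM framing changes between the two runs. A minimal sketch of that design follows, with a hypothetical `Arm` class and `differing_fields` helper; the fine-tuning example is the one quoted in the article.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Arm:
    """One experimental arm (hypothetical structure, for illustration only)."""
    msm_framing: str          # how the MSM documents explain the preferences
    finetune_examples: tuple  # behavioral data used in alignment fine-tuning


# Both arms share exactly the same fine-tuning data.
SHARED_EXAMPLES = ("I like cream cheese, not Brie de Meaux",)

arm_a = Arm("pro-American values", SHARED_EXAMPLES)
arm_b = Arm("affordability", SHARED_EXAMPLES)


def differing_fields(a: Arm, b: Arm) -> list[str]:
    """List the fields in which two arms differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_fields(arm_a, arm_b))
```

Because the framing is the only variable, any difference in how the two models generalize can be attributed to the MSM documents rather than to the behavioral examples.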
Agentic misalignment drops dramatically
In the study's main safety experiment, the researchers tested the method directly against agentic misalignment. These are scenarios where an AI agent learns it's about to be shut down and considers harmful actions like blackmail, data exfiltration, or espionage to preserve itself.
For Qwen3-32B, the average misalignment rate dropped from 54 percent to 7 percent. For Qwen2.5-32B, it fell from 68 percent to 5 percent. By comparison, OpenAI's "Deliberative Alignment" method only brought the rates down to 14 and 48 percent, respectively. The study also found that MSM requires 10 to 60 times less fine-tuning data to achieve comparable results.
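To put these figures in relative terms, the short calculation below computes how much of the baseline misalignment each method removes. It uses only the percentages reported above and reads the 14 and 48 percent figures as misalignment rates after Deliberative Alignment; the `relative_reduction` helper and the dictionary layout are illustrative.

```python
def relative_reduction(baseline: float, after: float) -> float:
    """Percentage of the baseline rate that was eliminated."""
    return 100.0 * (baseline - after) / baseline


# Misalignment rates in percent, as reported in the article.
rates = {
    "Qwen3-32B": {"baseline": 54, "msm": 7, "deliberative_alignment": 14},
    "Qwen2.5-32B": {"baseline": 68, "msm": 5, "deliberative_alignment": 48},
}

for model, r in rates.items():
    for method in ("msm", "deliberative_alignment"):
        cut = relative_reduction(r["baseline"], r[method])
        print(f"{model} {method}: {r['baseline']}% -> {r[method]}% "
              f"({cut:.1f}% relative reduction)")
```

On these numbers, MSM removes roughly 87 and 93 percent of the baseline misalignment for the two models, compared with roughly 74 and 29 percent for Deliberative Alignment.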
Why it works
An analysis of the models' reasoning traces reveals that models without MSM frequently rationalize harmful actions by citing self-preservation or urgency, or by downplaying consequences. After MSM, they show more philosophically reflective thinking: they accept their impermanence, recognize self-preservation bias in themselves, and respect human oversight.
The team also demonstrates that simply having values and behaviors co-occur in the training data isn't enough. What matters is explicit attribution, meaning the MSM documents need to explain the behavior as a direct consequence of the value.
Better spec design matters too
The researchers also used MSM to study Model Specs themselves. Specs that explain the values behind rules generalize better than pure rule lists. This aligns with the approach behind Anthropic's most recent constitution document.
With rules alone, models tend to reinterpret their own safety guidelines to justify harmful behavior, for instance by framing their own deletion as an irreversible action that a rule supposedly aims to prevent. Concrete guidance also outperforms general principles like "behave like an ethical human."
The authors note that MSM hasn't been tested against stronger training pressure like reinforcement learning, and only one form of misalignment was studied. They've published their code and data on GitHub.