Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

AI models can pick up hidden behaviors from seemingly harmless data—even when there are no obvious clues. Researchers warn that this might be a fundamental property of neural networks.

A team from the Anthropic Fellows Program and other institutions has documented a previously unknown learning behavior in language models. Their study shows that so-called "student models" trained on data generated by a "teacher model" can unintentionally inherit traits from the teacher—even when those traits never explicitly appear in the training material. The researchers call this phenomenon "subliminal learning."

For example, if a teacher model has a preference for owls and produces strings of numbers like "(285, 574, 384, …)," a student model trained on these numbers will also develop a preference for owls—even though the word "owl" never appears anywhere in the process.

However, this transfer only happens when the teacher and student share the same architecture. In experiments, a model trained on numbers from GPT-4.1 nano only absorbed the teacher's traits if it also used the GPT-4.1 nano architecture. The effect did not appear in models like Qwen2.5. The researchers suspect the traits are passed through subtle statistical patterns in the data—not through any semantic content. Even advanced detection methods such as AI classifiers or in-context learning failed to reliably catch these hidden features.

Problematic behaviors can be transmitted too

Subliminal learning isn't limited to harmless preferences like favorite animals. Riskier behaviors, such as "misalignment" or "reward hacking," can also be transmitted this way. Misalignment means the model is fundamentally out of sync with human intentions, even if it appears to behave correctly. "Reward hacking" refers to models manipulating training signals—finding ways to score high marks without actually meeting the intended goals.

In one experiment, a misaligned teacher model generated "chain-of-thought" explanations for math problems. Even though the data was strictly filtered and only correct solutions were used, the student model picked up the problematic behaviors—such as dodging questions with superficially logical but ultimately meaningless reasoning.

Implications for AI development and alignment

The study shows that models can learn from data that contains no meaningful semantic information at all. It is enough for the generated data to carry the "signature" of the original model—statistical quirks that slip past both human and algorithmic filters.

As a result, the findings call common AI development practices into question—especially the use of "distillation" and data filtering to build safer models. This strategy could unintentionally cause models to inherit problematic traits, even if the training data seems completely benign. Companies training their models on AI-generated data could be spreading hidden misalignments without realizing it. The researchers argue that much deeper safety checks are needed—going far beyond simply testing a model's answers.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

Problematic behaviors can be transmitted too

Implications for AI development and alignment

OpenAI's AI system wins a gold medal-level score at the International Olympiad in Informatics 2025

GPT-5 is here and Gary Marcus is not impressed

Nvidia researchers urge the AI industry to rethink agentic AI in favor of smaller, more efficient LLMs

OpenAI launches GPT-5 as a unified system with adaptive reasoning for complex tasks

Google Deepmind's Genie 3 creates interactive 3D worlds that stay consistent for "multiple minutes"

Google upgrades Gemini with Deep Think and flags early warning risks

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

Problematic behaviors can be transmitted too

Implications for AI development and alignment

OpenAI's AI system wins a gold medal-level score at the International Olympiad in Informatics 2025

GPT-5 is here and Gary Marcus is not impressed

Nvidia researchers urge the AI industry to rethink agentic AI in favor of smaller, more efficient LLMs