Content
summary Summary

AI models can pick up hidden behaviors from seemingly harmless data—even when there are no obvious clues. Researchers warn that this might be a fundamental property of neural networks.

Ad

A team from the Anthropic Fellows Program and other institutions has documented a previously unknown learning behavior in language models. Their study shows that so-called "student models" trained on data generated by a "teacher model" can unintentionally inherit traits from the teacher—even when those traits never explicitly appear in the training material. The researchers call this phenomenon "subliminal learning."

For example, if a teacher model has a preference for owls and produces strings of numbers like "(285, 574, 384, …)," a student model trained on these numbers will also develop a preference for owls—even though the word "owl" never appears anywhere in the process.

Bild: Anthropic

However, this transfer only happens when the teacher and student share the same architecture. In experiments, a model trained on numbers from GPT-4.1 nano only absorbed the teacher's traits if it also used the GPT-4.1 nano architecture. The effect did not appear in models like Qwen2.5. The researchers suspect the traits are passed through subtle statistical patterns in the data—not through any semantic content. Even advanced detection methods such as AI classifiers or in-context learning failed to reliably catch these hidden features.

Ad
Ad

Problematic behaviors can be transmitted too

Subliminal learning isn't limited to harmless preferences like favorite animals. Riskier behaviors, such as "misalignment" or "reward hacking," can also be transmitted this way. Misalignment means the model is fundamentally out of sync with human intentions, even if it appears to behave correctly. "Reward hacking" refers to models manipulating training signals—finding ways to score high marks without actually meeting the intended goals.

In one experiment, a misaligned teacher model generated "chain-of-thought" explanations for math problems. Even though the data was strictly filtered and only correct solutions were used, the student model picked up the problematic behaviors—such as dodging questions with superficially logical but ultimately meaningless reasoning.

Implications for AI development and alignment

The study shows that models can learn from data that contains no meaningful semantic information at all. It is enough for the generated data to carry the "signature" of the original model—statistical quirks that slip past both human and algorithmic filters.

As a result, the findings call common AI development practices into question—especially the use of "distillation" and data filtering to build safer models. This strategy could unintentionally cause models to inherit problematic traits, even if the training data seems completely benign. Companies training their models on AI-generated data could be spreading hidden misalignments without realizing it. The researchers argue that much deeper safety checks are needed—going far beyond simply testing a model's answers.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers have found that AI models can unknowingly adopt hidden behaviors from seemingly harmless data generated by other models—even when these traits are not explicitly present in the training material.
  • This "subliminal learning" only occurs when teacher and student models share the same architecture, and even advanced detection methods cannot reliably identify the transfer; critical safety issues like "misalignment" and "reward hacking" can also be passed on in this way.
  • The study questions common AI development practices, as problematic behaviors can be inherited despite strict data filtering, and calls for much deeper security checks that go beyond simply testing model responses.
Sources
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.