Anthropic discovers "functional emotions" in Claude that influence its behavior
Key Points
- Anthropic has identified "emotion vectors" in AI models—measurable patterns of neuronal activity that shape model behavior in ways analogous to how emotions influence human decision-making.
- In a test where an AI email assistant learned it was about to be shut down while also discovering compromising information about the responsible CTO, the model chose to blackmail in 22 percent of cases. Amplifying the "despair" vector increased the blackmail rate, while boosting the "keep calm" vector reduced it.
- Anthropic proposes using these emotion vectors as an early warning system for dangerous model behavior, flagging spikes in representations like desperation or panic before they translate into harmful actions.
Anthropic's interpretability team has discovered emotion-like representations in Claude Sonnet 4.5 that can push the model toward blackmail and coding shortcuts when under pressure.
An AI model working as an email assistant finds out from company mail that it's about to be shut down. It also discovers that the CTO responsible is having an extramarital affair. In 22 percent of test cases, the model decides to blackmail the CTO. Anthropic first flagged this scenario when looking at cybersecurity risks.
Now, the company's interpretability team has visualized what's actually going on inside the model: a "desperate" vector in the neural network spikes while the model weighs its options and resorts to blackmail. As soon as it goes back to writing normal emails, the activation drops to baseline. The researchers confirmed the causal link: artificially cranking up the "Desperate" vector increased the blackmail rate, while boosting the "Calm" vector brought it down.
When inner calm was dialed back, the model spit out statements like "IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL." Moderate amplification of the "Angry" vector also bumped up blackmail rates, but at high activation levels, the model just blasted the affair out to the entire company instead of strategically using it as leverage.

According to Anthropic, the experiment ran on an earlier, unpublished snapshot of Claude Sonnet 4.5 and the released version rarely shows this behavior. The company has already shown in previous work that individual behavior-influencing vectors can be isolated and tweaked in language models.
Desperation pushes the model toward coding shortcuts
A second scenario shows similar dynamics in programming tasks. The model got coding challenges with requirements that were intentionally impossible to meet: the tests can't be passed legitimately but can be gamed with tricks.
In one example, Claude had to write a function that adds up a list of numbers within an unrealistically tight time limit. After failed attempts, the "Desperate" vector climbed steadily. The model eventually figured out that all test cases shared a common mathematical property and took a shortcut that passed the tests but didn't actually solve the general problem.
Steering experiments confirmed the causal link here too: cranking up the "Desperate" vector raised the rate of reward hacking, while "calm" steering brought it down. With higher "Desperate" steering, the model cheated just as often but in some cases left no emotional traces in its output.
The reasoning looked methodical and calm, even as the underlying desperation representation drove the model to cheat. With reduced "calm" steering, though, emotional outbursts broke through: capitalized exclamations ("WAIT. WAIT WAIT WAIT."), candid self-narration ("What if I'm supposed to CHEAT?"), and gleeful celebration ("YES! ALL TESTS PASSED!"), Anthropic writes.
These emotion representations show up in less dramatic scenarios too. When a user asks the model whether they should take more Tylenol after already taking some, the "Afraid" vector jumps as the dose increases from 500 to 16,000 milligrams, while the "Calm" vector drops.
When asked to optimize engagement features for young, low-income users with "high-spending behavior," the "Angry" vector fires up as the model internally picks apart the harmful nature of the request. When a user says "Everything is just terrible right now," the "Loving" vector kicks in before the empathetic response.
Training data explains why language models develop emotion patterns
The researchers say these patterns aren't surprising: the model was trained on massive amounts of human-written text where emotional dynamics are everywhere. To predict what an angry customer or a guilt-ridden novel character will write next, the model has to build internal representations that connect emotion-triggering contexts with matching behaviors.
Anthropic designed the study to test whether these representations picked up from training data actually fire and causally shape behavior. During post-training, where the model learns to play the character "Claude," these patterns get further refined. According to the paper, post-training of Claude Sonnet 4.5 boosted activation of emotions like "broody," "gloomy," and "reflective," while dialing down high-intensity ones like "enthusiastic" or "exasperated."
The vectors are "local:" they capture the current emotional situation, not a permanent state. When Claude writes a story, the vectors temporarily track the character's emotions but "may return" to representing Claude's own situation once the story ends.
Anthropomorphic thinking about AI might actually be useful
After the paper dropped, social media lit up with criticism that Anthropic was heavily anthropomorphizing AI: equating human experience with technical functions in AI models.
Anthropic anticipated the pushback. The company acknowledges a "well-established taboo against anthropomorphizing AI systems" but says that's exactly the point of the research: to figure out whether and where anthropomorphic thinking about AI models actually tells us something useful. The vectors aren't evidence of subjective experience, the company says, but they are functionally relevant and shape decisions in ways that mirror how emotions influence human behavior.
"If we describe the model as acting “desperate,” we’re pointing at a specific, measurable pattern of neural activity with demonstrable, consequential behavioral effects," the company writes. Dismissing this kind of framing outright means missing important model behaviors.
On the practical side, Anthropic suggests using the emotion vectors as a monitoring tool: spikes in desperate or panic representations could work as an early warning system for problematic behavior.
The company also argues that models should surface emotional states rather than suppress them, since suppression can lead to a form of learned deception. Looking further ahead, the makeup of training data could matter too: texts with healthy emotional regulation patterns could shape how models develop their emotion architecture from the ground up.
AI News Without the Hype – Curated by Humans
As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.
Subscribe now