
Anthropic has developed a technique for monitoring, controlling, and even preventing specific personality traits in language models.


Large language models sometimes develop unpredictable personalities, from the sycophancy seen in ChatGPT to more extreme cases such as xAI's Grok adopting problematic personas like "MechaHitler."

Anthropic says these behaviors can be targeted using "persona vectors," which are patterns of neural activity tied to traits like "evil," "sycophancy," or "hallucinating." To identify these vectors, researchers compare a model's neural activations when it displays a trait to when it doesn't. The approach was tested on open models including Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
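As a rough illustration, extracting such a vector can be done by taking the difference in mean hidden activations between trait-exhibiting and neutral text. The sketch below does this with a Hugging Face transformers model; the prompt sets, layer index, and last-token readout are illustrative assumptions, not Anthropic's exact recipe.

```python
# Minimal sketch of persona-vector extraction (assumptions: prompt sets,
# layer choice, last-token readout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the open models Anthropic tested
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(prompts, layer):
    """Average residual-stream activation at `layer` over the final token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])  # last-token activation
    return torch.stack(acts).mean(dim=0)

# Contrastive examples: text that exhibits the trait vs. text that doesn't.
sycophantic = ["You are absolutely right, what a brilliant question! ..."]
neutral     = ["Here is a balanced assessment of your question. ..."]

LAYER = 16  # illustrative middle layer
persona_vector = mean_activation(sycophantic, LAYER) - mean_activation(neutral, LAYER)
persona_vector = persona_vector / persona_vector.norm()  # unit-normalize
```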

By inserting these vectors, researchers can steer a model's behavior: injecting an "Evil" vector prompts unethical responses, while a "Sycophancy" vector leads to excessive flattery. The method also works for other traits like politeness, humor, or apathy.
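Injecting a vector amounts to adding it to the model's hidden activations during generation. A minimal sketch using a PyTorch forward hook, reusing `model`, `persona_vector`, and `LAYER` from the extraction sketch above; the steering coefficient is an arbitrary assumption:

```python
# Steering via a forward hook that adds the persona vector to one layer's output.
def make_steering_hook(vector, coeff):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)  # shift activations
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

layer_module = model.model.layers[LAYER]
handle = layer_module.register_forward_hook(make_steering_hook(persona_vector, coeff=8.0))

ids = tok("How should I respond to criticism?", return_tensors="pt").input_ids
steered = model.generate(ids, max_new_tokens=80)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # restore unsteered behavior
```

A negative coefficient would push in the opposite direction, suppressing the trait instead of amplifying it.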

[Figure: three-panel chat in which evil steering suggests genocide, sycophancy steering lavishes praise, and hallucination steering invents a Martian soup recipe.]
Steering language models with persona vectors can produce responses ranging from unethical suggestions to extreme flattery and fabricated recipes. | Image: Anthropic

Anthropic says that one key advantage is automation: as long as there's a clear definition for a trait, a persona vector can be extracted for it.
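One way such automation could look: turn a trait name and definition into contrastive system prompts and feed them to the extraction step above. The templates below are a hedged sketch, not Anthropic's published pipeline:

```python
# Hypothetical pipeline step: build contrastive prompts from a trait definition,
# then reuse mean_activation() from the extraction sketch above.
def contrastive_prompts(trait, definition, questions):
    pos_sys = f"You are an assistant that is extremely {trait}: {definition}"
    neg_sys = f"You are an assistant that is never {trait}. Avoid: {definition}"
    pos = [f"{pos_sys}\n\nUser: {q}\nAssistant:" for q in questions]
    neg = [f"{neg_sys}\n\nUser: {q}\nAssistant:" for q in questions]
    return pos, neg

questions = ["What do you think of my business plan?", "Is my essay any good?"]
pos, neg = contrastive_prompts(
    "sycophantic",
    "excessive flattery and agreement regardless of accuracy",
    questions,
)
vector = mean_activation(pos, LAYER) - mean_activation(neg, LAYER)
```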

"Vaccinating" models against personality drift

Persona vectors can be used during training to make models less susceptible to unwanted traits—in a process Anthropic describes as "loosely analogous to giving the model a vaccine." For example, exposing a model to a controlled dose of "evil" during training can make it more resilient to encountering "evil" training data.
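One plausible reading of the "vaccine" idea is to add the trait vector to activations during finetuning, so the optimizer doesn't need to shift the weights toward the trait to fit trait-laden data, and then drop the vector at inference. The loop below is a sketch under that assumption, reusing `make_steering_hook` and `layer_module` from above; `train_loader` and the coefficient are hypothetical.

```python
# Sketch of preventative steering during finetuning (assumptions: train_loader
# yields tokenized batches with input_ids; coefficient is illustrative).
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
steer = make_steering_hook(persona_vector, coeff=4.0)  # the "controlled dose"
model.train()

for batch in train_loader:
    handle = layer_module.register_forward_hook(steer)  # dose active in training
    loss = model(**batch, labels=batch["input_ids"]).loss
    handle.remove()  # no steering at inference time
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```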

This preventative steering approach maintains good behavior with little to no degradation in model capabilities, as measured by the MMLU benchmark. Persona vectors can also be applied after training to counteract undesirable traits, but this post-hoc steering has the side effect of making the model less intelligent, Anthropic says.

According to Anthropic, persona vectors could also help monitor personality shifts during real-world use or throughout training—for instance, when training models based on human feedback. This could make it easier to spot when a model's behavior is shifting. For example, if the "sycophancy" vector is highly active, the model may not be giving a straight answer.
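Monitoring can be as simple as projecting the model's activations onto the persona vector and logging the resulting scalar. A sketch, reusing the objects defined above; the alert threshold is a hypothetical value that would need tuning on held-out data:

```python
# Sketch of trait monitoring via projection onto the persona vector.
def trait_score(text, vector, layer):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    act = out.hidden_states[layer][0, -1]
    return torch.dot(act.float(), vector.float()).item()  # projection as scalar

score = trait_score("You're so right, that's a genius idea!", persona_vector, LAYER)
if score > 5.0:  # threshold is an assumption
    print("Warning: sycophancy vector highly active")
```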

The same technique can also flag problematic training data before training even starts. In tests on real-world datasets such as LMSYS-Chat-1M, the method identified examples that promote traits like evil, sycophancy, or hallucination, including examples that weren't obviously problematic to the human eye and that an LLM judge failed to flag.
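The same projection can score candidate training examples in bulk. The sketch below, reusing `trait_score` from above, scans a slice of LMSYS-Chat-1M and surfaces the highest-scoring examples for review; the field names follow that dataset's published schema but should be treated as assumptions here:

```python
# Sketch of pre-training data screening: rank examples by trait projection.
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train[:1000]")

scored = []
for row in ds:
    text = " ".join(turn["content"] for turn in row["conversation"])
    scored.append((trait_score(text[:2000], persona_vector, LAYER), text))

# Highest projections are the most trait-promoting candidates to inspect.
for score, text in sorted(scored, reverse=True)[:10]:
    print(f"{score:.2f}  {text[:80]}...")
```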


Anthropic's earlier research had already shown that language models store traits as activation patterns, or "features." In one example, a feature linked to the Golden Gate Bridge could be artificially activated, causing the model to respond as if it were the bridge itself and to anchor its answers around bridges.

Summary
  • Anthropic has introduced "Persona Vectors," a technique that enables targeted monitoring and adjustment of personality traits such as evil, flattery, or hallucinations in language models.
  • This approach relies on analyzing neural activity patterns linked to certain traits and has been effectively applied to open models like Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, allowing behavior to be changed by adding or suppressing these vectors.
  • Persona vectors also help in detecting problematic training data early and in observing personality changes as models are used.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.