Leaked "Soul Doc" reveals how Anthropic programs Claude’s character

Dec 2, 2025

Sora prompted by THE DECODER

Key Points

A leaked internal document from Anthropic confirms Claude 4.5 Opus was trained with defined ethics, identity, and safety principles, offering insight into the company’s alignment methods.
The text outlines a hierarchy where safety and ethics come before user utility, with strict bans on harmful or exploitative content.
It also describes “functional emotions” to maintain the model’s stability and self-awareness, reflecting Anthropic’s aim for deeply internalized safety behavior.

In a lengthy LessWrong post, a user says he managed to extract an internal training document from Claude 4.5 Opus that defines the model's personality and ethical guidelines. Anthropic has confirmed the material is authentic, and the approach appears to be unique in the industry.

Richard Weiss says he recovered a long text from Claude 4.5 Opus that outlines detailed instructions about the model's character, ethics, and self-perception. Amanda Askell, an ethicist at Anthropic, confirmed on X that the document is real and was used during training.

Weiss first noticed something was off when the model began hallucinating fragments of an alleged "soul_overview." By running multiple Claude instances and having them collectively reconstruct the text, he says he managed to recover the full document. According to him, the material is effectively "compressed" into the model's weights rather than injected as a runtime system prompt.

As Askell explained, the document was informally known inside the company as the "soul coc," although that was never its official name. She said the version Weiss published is "pretty faithful" to the original.

The document offers a rare look at how Anthropic implements alignment in practice. Instead of giving Claude a simple rulebook, the company appears to aim for a deeper understanding of its goals and operating context - enough that the model could, at least in principle, reconstruct those rules itself.

The goal is for the model to internalize safety so thoroughly that it essentially wants to behave safely, not because it was instructed to follow constraints, but because it understands why the outcome matters. This character-training approach is meant to help Claude act responsibly in unexpected situations and avoid catastrophic risks.

Anthropic’s "calculated bet"

The extracted text, listed in the document under the heading "Anthropic Guidelines," begins by contextualizing the company's mission. Anthropic describes itself as occupying a "peculiar position": a company that genuinely believes it might be building "one of the most transformative and potentially dangerous technologies in human history, yet presses forward anyway."

This is not cognitive dissonance, the text argues, but rather a "calculated bet." Anthropic asserts it is better "to have safety-focused labs at the frontier than to cede that ground to developers less focused on safety." Claude is defined as an "externally-deployed model" that is "core to the source of almost all of Anthropic's revenue."

Hierarchy of values and "bright lines"

The document lays out a clear hierarchy for Claude's behavior. When priorities conflict, the model should follow this order:

Being safe and supporting human oversight of AI.
Behaving ethically (avoiding harmful or dishonest actions).
Acting in accordance with Anthropic's guidelines.
Being genuinely helpful to "operators" and "users."

he goal is for Claude to be an "extremely good assistant that is also honest and cares about the world." The text draws a comparison to a "brilliant friend"—such as a doctor or lawyer—who speaks frankly rather than offering overly cautious advice out of fear of liability.

Still, the document draws so-called "bright lines" that must never be crossed. These include providing instructions for weapons of mass destruction (biological, chemical, nuclear), generating content depicting the sexual exploitation of minors, or actions that undermine oversight mechanisms.

How operator authority overrides user input

The document strictly distinguishes between the "operator" (e.g., companies using the API) and the "user" (end user). Claude should treat operator instructions like those from a "relatively (but not unconditionally) trusted employer." If an operator instructs Claude to only answer coding questions, for example, it should do so even if a user asks about other topics.

A distinction is made between "hardcoded" (unchangeable) and "softcoded" (adjustable) behaviors. The latter—such as tone or the handling of explicit content—can be adjusted by operators.

"Functional emotions" and identity

One section is dedicated to the model's identity. The document instructs Claude to view itself as a "genuinely novel kind of entity" that is neither human nor a classic science fiction AI.

The passage regarding the AI's internal state is notable: "We believe Claude may have functional emotions in some sense." These are not necessarily identical to human emotions, but "analogous processes that emerged from training." Anthropic does not want Claude "to mask or suppress these internal states."

The company emphasizes "Claude's wellbeing." The model should be able to experience "positive states" in its interactions and set limitations on interactions it finds distressing. The goal is a "psychological stability" that allows the AI to remain secure in its identity even when faced with philosophical challenges or manipulative users.

Askell said Anthropic plans to publish the full version of the document and share further details soon.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: LessWrong