Anthropic's AI interpretability research shines a light into the black box of large language models

Update

You can now chat with "Golden Gate Claude" using claude.ai if you are a registered user.

Researchers at Anthropic have gained fascinating insights into the internal representations of their language model Claude 3 Sonnet.

The method is based on the idea that artificial neural networks represent concepts as activation patterns in their internal layers. By analyzing these patterns, the learned concepts can be made visible.

Specifically, the researchers used a technique called "dictionary learning". This involves training a separate neural network to reconstruct the activations of a specific layer of the model under investigation as compactly as possible.

The trained weights of this network then form a "dictionary" of activation patterns, called features. Each feature represents a concept that the model has learned.

The researchers were able to extract features from individual parts of the neural network that represent certain concepts. | Image: Anthropic

For example, the researchers found a feature that responds specifically to the mention of the Golden Gate Bridge. When this feature is artificially activated to ten times its maximum value, the model even begins to identify itself with the bridge, generating statements such as "I am the Golden Gate Bridge and I connect San Francisco with Marin County."

Examples of artificial amplification of individual features that have a major impact on model generation. | Image: Anthropic

Another example is an "Immunology" feature that responds to discussions of immunodeficiency, specific diseases, and immune responses. Close to this feature are related concepts such as vaccines and organ systems with immune function.

The features found for "immunology". | Image: Anthropic

The extracted features cover a wide range, from well-known people and places, to syntactic elements in program code, to abstract concepts such as empathy or sarcasm. Many features are sensitive to both textual mentions and images of the corresponding concepts, although the analysis method was applied only to textual data.

The researchers also found evidence of a hierarchical organization of features. For example, a general feature "San Francisco" splits into several more specific features for individual landmarks and neighborhoods when analyzed in more detail. Similarly, country features such as "Canada" or "Iceland" split into sub-features like "geography", "culture", and "politics".

Features associated with "San Francisco". | Image: Anthropic

A candlelight in the AI black box

The Anthropic team sees the results as an important step toward greater transparency and control over capable AI systems. But they also point to the enormous challenges of applying interpretability to ever-larger models.

Recommendation

AI research

MatterGen: Microsoft presents AI tools for generating and simulating new materials

"The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place)," the researchers said.

The researchers also discovered potentially problematic features of the model. For example, there are features that are sensitive to bioweapons development, deception, or manipulation and could affect the model's behavior.

The mere presence of these features does not necessarily mean that the models are (more) dangerous, the paper says. However, it shows that a deeper understanding of when and how these features are activated is needed.

The insights gained could help better understand language models in the future, making them more robust and safer to use, Anthropic said.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Anthropic's AI interpretability research shines a light into the black box of large language models

A candlelight in the AI black box

MatterGen: Microsoft presents AI tools for generating and simulating new materials

AI coding can make developers slower even if they feel faster

Microsoft introduces Phi-4-mini-flash-reasoning with up to 10x higher token throughput

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Grok 4 is not officially instructed to follow Musk’s views but often does on sensitive subjects

Anthropic's AI interpretability research shines a light into the black box of large language models

A candlelight in the AI black box

Share

Bank details