OpenAI tests whether GPT-4 can explain how AI works

Can OpenAI's GPT-4 help make AI safer? The company's large language model tried to explain GPT-2 neurons.

In a recent paper, OpenAI shows how AI can help interpret the internal workings of large language models. The team used GPT-4 to generate and evaluate explanations for neurons from its older predecessor, GPT-2. The work is part of OpenAI's alignment research, which aims to help better understand and guide the behavior of AI systems.

OpenAI's methodology involves three steps:

Explain with GPT-4: Starting with a GPT-2 neuron, GPT-4 is shown relevant text sequences together with the neuron's activations and generates an explanation for its behavior.
Simulate with GPT-4: Given only that explanation, GPT-4 predicts how a neuron matching the explanation would activate on a given text.
Compare: The explanation is scored by how well the simulated activations match the actual activations of the GPT-2 neuron.

The result is a natural-language explanation of what a GPT-2 neuron does, such as "Fires when referring to movies, characters, and entertainment".
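The compare step can be illustrated with a small sketch. It assumes the score is a Pearson correlation between simulated and real activations, which is one plausible way to measure how well they match; the article itself does not specify the metric. The function name `correlation_score`, the tokens, and all activation values are made up for illustration.

```python
from math import sqrt


def correlation_score(simulated, actual):
    """Pearson correlation between simulated and real activations.

    Assumption: correlation stands in for the "match" between the two
    activation sequences; the article does not name the exact metric.
    """
    if len(simulated) != len(actual) or len(actual) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(actual)
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        # A constant sequence says nothing about the other one.
        return 0.0
    return cov / sqrt(var_s * var_a)


# Hypothetical per-token values for a "movies and entertainment" neuron.
tokens = ["The", "new", "Pixar", "movie", "opens", "Friday"]
actual = [0.0, 0.1, 0.8, 1.0, 0.2, 0.0]      # what the GPT-2 neuron did
simulated = [0.0, 0.0, 0.9, 0.9, 0.1, 0.1]   # what the explanation predicts

score = correlation_score(simulated, actual)
print(f"explanation score: {score:.2f}")
```

A score near 1 means the explanation predicts the neuron's behavior well on this text, while a score near 0 means the explanation tells us little about when the neuron fires.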

OpenAI's GPT-4 does a worse job of explaining GPT-2 than humans

The team found that the larger the language model being explained, the worse this method works. One reason may be that neurons in later layers are more difficult to explain. However, the team was able to improve the generated explanations somewhat with techniques such as iterating on explanations. In addition, GPT-4 provides better explanations than smaller models, but still worse explanations than humans.

The team used GPT-4 to generate explanations for all 307,200 neurons in GPT-2. More than 1,000 of these explanations scored at least 0.8, meaning that, according to GPT-4, they account for most of the neuron's activation behavior, OpenAI says.
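To put these numbers in proportion, here is a minimal sketch. The layer and neuron counts are background assumptions not stated in this article (the largest GPT-2 variant has 48 layers with 6,400 MLP neurons each, which yields the 307,200 figure), and the neuron IDs and scores in `example_scores` are invented.

```python
# Assumed configuration of the largest GPT-2 model (not stated in the article).
N_LAYERS = 48
NEURONS_PER_LAYER = 6400
TOTAL_NEURONS = N_LAYERS * NEURONS_PER_LAYER  # 307,200


def well_explained(scores, threshold=0.8):
    """Keep only the neurons whose explanation score reaches the threshold."""
    return {neuron: s for neuron, s in scores.items() if s >= threshold}


def share_of_total(count, total=TOTAL_NEURONS):
    """Fraction of all neurons that a given count represents."""
    return count / total


# Hypothetical (layer, index) -> score entries, for illustration only.
example_scores = {(0, 12): 0.91, (5, 131): 0.42, (17, 9): 0.80, (40, 77): 0.05}

print(sorted(well_explained(example_scores)))
print(f"{share_of_total(1000):.2%} of all neurons")
```

Under these assumptions, roughly 1,000 well-explained neurons correspond to about a third of a percent of the model, which matches the article's point that the method is still far from explaining most neurons.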

According to OpenAI, the methodology has many limitations, such as its inability to explain complex neuron behavior or the downstream effects of activations. In addition, it is unclear whether every neuron can be explained in natural language at all, and the approach does not provide a mechanistic explanation for the behavior of GPT-2 neurons, "which could cause our understanding to generalize incorrectly."

OpenAI's alignment research relies on AI assistants

The research aims to develop automated interpretability methods that the company plans to use to check whether language models are misaligned. Detecting cases of goal misgeneralization or deceptive alignment is particularly important, "when the model acts aligned when being evaluated but would pursue different goals during deployment." Detecting this requires a deep understanding of a model's internal behavior.

In this work, OpenAI used a more powerful model to explain a weaker one, which could become a problem if it is unclear whether the assistant itself is trustworthy. "We hope that using smaller trustworthy models for assistance will either scale to a full interpretability audit, or applying them to interpretability will teach us enough about how models work to help us develop more robust auditing methods."

OpenAI has released the dataset and code, as well as a visualization tool for all GPT-2 neurons. The company hopes to encourage the development of better methodologies in alignment research.
