The Allen Institute aims to decode language model behavior with its new OLMoTrace tool.
The Allen Institute for AI (Ai2) has launched OLMoTrace, a new feature that lets researchers trace language model outputs back to their training data. The tool is currently available through the Ai2 Playground and works with several OLMo models.
OLMoTrace aims to make the inner workings of large language models more transparent. Users can determine whether a model's response comes from memorized training data, a creative combination, or pure hallucination. Ai2 says this approach could both advance research and build public trust in generative AI, an effort backed by the team's commitment to fully open models, datasets, and source code.
How OLMoTrace analyzes training data in real time
The tool searches the training data for long, unique word sequences that appear verbatim in a model's output. These matches are highlighted in different colors and linked to the corresponding document excerpts in a side panel. Users can click on any highlighted passage or document to focus the view, with color shades indicating the relevance of each match.
The selection process follows multiple steps. First, OLMoTrace only considers word sequences that appear completely in the training data without sentence or line breaks. Then it evaluates token rarity to highlight particularly specific passages. Up to ten documents are shown for each word sequence, with overlapping sequences merged to keep the display clean.
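The steps above can be sketched in a few lines of Python. This is an illustrative toy, not Ai2's actual implementation: the function names, the whitespace tokenization, the naive substring search, and the IDF-style rarity score are all assumptions for clarity, while the real system operates over token IDs at trillion-token scale.

```python
import math

def find_verbatim_spans(output_tokens, corpus_docs, min_len=4):
    """Greedily find word spans of the model output (at least min_len words)
    that appear verbatim in at least one training document.
    Note: naive substring matching over whitespace-joined words."""
    spans = []
    for start in range(len(output_tokens)):
        best_end = None
        for end in range(start + min_len, len(output_tokens) + 1):
            phrase = " ".join(output_tokens[start:end])
            if "\n" in phrase or "." in phrase:
                break  # skip spans crossing sentence or line breaks
            if any(phrase in doc for doc in corpus_docs):
                best_end = end  # keep extending while the match still holds
            else:
                break
        if best_end is not None:
            spans.append((start, best_end))
    return spans

def rarity_score(span, output_tokens, corpus_docs):
    """Average inverse-document-frequency of the span's words: a rough
    proxy for how specific (and thus highlight-worthy) a match is."""
    start, end = span
    total = 0.0
    for tok in output_tokens[start:end]:
        df = sum(1 for doc in corpus_docs if tok in doc.split())
        total += math.log((len(corpus_docs) + 1) / (df + 1))
    return total / (end - start)

def merge_overlaps(spans):
    """Merge overlapping spans so each output region is highlighted once."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def matching_docs(span, output_tokens, corpus_docs, max_docs=10):
    """Return up to max_docs documents containing the span verbatim."""
    phrase = " ".join(output_tokens[span[0]:span[1]])
    return [doc for doc in corpus_docs if phrase in doc][:max_docs]

# Toy data to exercise the pipeline
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "training data can be searched for verbatim matches",
]
output = "the model says the quick brown fox jumps today".split()

merged = merge_overlaps(find_verbatim_spans(output, corpus))
print(merged)                                    # [(3, 8)]
print(matching_docs(merged[0], output, corpus))  # the first corpus document
```

The merge step is why overlapping highlights never stack in the interface: two matches like "the quick brown fox" and "quick brown fox jumps" collapse into a single highlighted region before display.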
The team's examples demonstrate several practical applications: In one case, OLMoTrace revealed that incorrect information about a model's knowledge cutoff came from examples in the fine-tuning data. In another, it showed that the solution to a mathematical problem had been memorized. Users can also verify whether responses come from creative combinations or general knowledge. The tool draws on training data comprising approximately 4.6 trillion tokens across five datasets.
The tool can be tried out on the Ai2 Playground.