Everyone can now trace language model outputs back to their training data with OLMoTrace
Key Points
- The Allen Institute for AI has unveiled OLMoTrace, a tool that shows whether a large language model's output draws on memorized training data, creative recombination, or hallucination.
- OLMoTrace searches trillions of tokens in real time and highlights passages of model output that appear verbatim in the training data. Users see the matching documents along with relevance information.
- The tool is based on completely open models and datasets. It is designed to facilitate research and increase confidence in generative AI by making the origin and quality of model answers traceable.
The Allen Institute aims to decode language model behavior with its new OLMoTrace tool.
The Allen Institute for AI (Ai2) has launched OLMoTrace, a new feature that lets researchers trace language model outputs back to their training data. The tool is currently available through the Ai2 Playground and works with several OLMo models.
OLMoTrace aims to make the inner workings of large language models more transparent. Users can see whether a model's response stems from memorized knowledge, creative combination, or pure hallucination. Ai2 says this approach could both advance research and build public trust in generative AI, backed by the team's commitment to fully open models, datasets, and source code.
How OLMoTrace analyzes training data in real time
The tool searches for long, unique word sequences that appear verbatim in the training data. These sections get highlighted in different colors and linked to corresponding document excerpts in a side panel. Users can click on any text passage or document to focus the view, with color variations indicating the relevance of each match.
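To make the matching idea concrete, here is a minimal Python sketch. It is not Ai2's implementation; OLMoTrace searches an indexed multi-trillion-token corpus, while this toy function scans a small in-memory string. The function name, whitespace tokenization, and minimum span length are assumptions for illustration.

```python
def maximal_verbatim_spans(output_tokens, corpus_text, min_len=6):
    """Toy search for maximal token spans of a model output that occur
    verbatim in a (small, in-memory) training corpus.

    output_tokens: whitespace tokens from the model's response.
    corpus_text:   training documents joined into one string.
    min_len:       shortest span worth highlighting (assumed value).
    """
    padded = f" {corpus_text} "  # pad so matches respect token boundaries
    spans = []
    i, n = 0, len(output_tokens)
    while i < n:
        end = None
        # Greedily extend the candidate span while it still matches verbatim.
        for j in range(i + min_len, n + 1):
            candidate = " ".join(output_tokens[i:j])
            if f" {candidate} " in padded:
                end = j
            else:
                break
        if end is not None:
            spans.append((i, end))  # half-open span [i, end)
            i = end                 # continue after the match
        else:
            i += 1
    return spans

output = "the quick brown fox jumps over the lazy dog today".split()
corpus = "yesterday the quick brown fox jumps over the fence"
print(maximal_verbatim_spans(output, corpus))  # [(0, 7)] -> tokens 0..6 match
```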
The selection process follows multiple steps. First, OLMoTrace only considers word sequences that appear in full in the training data and do not span sentence or line breaks. It then weights matches by token rarity, so particularly specific passages stand out. Up to ten documents are shown for each word sequence, and overlapping sequences are merged to keep the display clean.
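The selection steps can be sketched the same way. The rarity measure below (summed unigram surprisal) and the sentence-boundary heuristic are assumptions made for this toy example; Ai2's actual scoring is not reproduced here.

```python
import math

def crosses_boundary(span_text):
    # Step 1 (as described above): drop spans that contain sentence or
    # line breaks. The punctuation heuristic is an assumption.
    return "\n" in span_text or any(p in span_text for p in (". ", "! ", "? "))

def span_rarity(tokens, unigram_probs):
    # Step 2: weight spans by token rarity. We assume summed unigram
    # surprisal (-log p); rarer tokens make a span more specific.
    return sum(-math.log(unigram_probs.get(t, 1e-9)) for t in tokens)

def merge_overlaps(spans):
    # Step 3: merge overlapping (start, end) spans for a clean display.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

MAX_DOCS = 10  # per the article: at most ten source documents shown per span

print(merge_overlaps([(0, 7), (5, 9), (12, 15)]))  # [(0, 9), (12, 15)]
```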
The team's examples demonstrate several practical applications. In one case, OLMoTrace revealed that incorrect information about a model's knowledge cutoff originated from examples in the fine-tuning data. In another, it showed that the model had memorized the solution to a math problem from its training data. Users can also check whether a response reflects creative combination or general knowledge. The tool draws on training data comprising approximately 4.6 trillion tokens across five datasets.
The tool can be tried out on the Ai2 Playground.