OpenAI develops six-layer context system to help employees navigate 600 petabytes of data

OpenAI has developed an internal AI data agent that lets employees run complex data analyses using natural language. A key technique called "Codex Enrichment" crawls the codebase to understand what tables actually contain.

Finding the right table across 70,000 datasets and 600 petabytes of data is no small task. Many tables look similar on the surface, and figuring out the differences eats up significant time.

In a technical report, engineers Bonnie Xu, Aravind Suresh, and Emma Tang explain how they taught the agent to develop a deeper understanding of data. The key lies in the code that generates the tables.

Metadata and SQL queries describe what a table looks like and how it's used, but they don't reveal what's actually inside. OpenAI addresses this with a method the team calls "Codex Enrichment." The agent crawls the codebase using Codex and derives a deeper definition of each table from the code that produces it. That pipeline logic captures assumptions, freshness guarantees, and business intent that never show up in SQL or metadata.

Similar schemas can hide critical differences

The problem is common wherever large amounts of data accumulate: tables with nearly identical schemas can differ in critical ways. One table might include only logged-in users, while another includes logged-out users too. One captures only first-party ChatGPT traffic, while another captures everything.

Through code-level analysis, the agent can spot these differences. It understands not just what columns a table has, but how the data was filtered, transformed, and aggregated. This lets it answer questions about table contents and appropriate use cases far more accurately than database signals alone.
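
To make the idea concrete, here is a minimal sketch of why reading pipeline code reveals differences that schemas hide. It is not OpenAI's method: the report describes an agent that uses Codex to read the code, whereas this toy substitutes a simple regular expression, and the table names, columns, and filters are all invented for illustration.

```python
import re

# Hypothetical pipeline definitions. Both tables expose the same columns,
# so their schemas look identical; only the filters differ.
PIPELINES = {
    "daily_active_users_a": """
        SELECT user_id, event_date FROM events
        WHERE is_logged_in = TRUE AND source = 'chatgpt_first_party'
    """,
    "daily_active_users_b": """
        SELECT user_id, event_date FROM events
        WHERE source IS NOT NULL
    """,
}


def extract_filters(sql: str) -> list[str]:
    """Return the AND-separated predicates of the first WHERE clause, if any."""
    match = re.search(r"\bWHERE\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    predicates = re.split(r"\bAND\b", match.group(1), flags=re.IGNORECASE)
    # Collapse internal whitespace so predicates are easy to compare.
    return [" ".join(p.split()) for p in predicates if p.strip()]


def describe_tables(pipelines: dict[str, str]) -> dict[str, list[str]]:
    """Map each table name to the filters found in its generating code."""
    return {name: extract_filters(sql) for name, sql in pipelines.items()}


if __name__ == "__main__":
    for table, filters in describe_tables(PIPELINES).items():
        print(table, "->", filters)
```

A regex only handles trivially simple SQL; the point is that the distinguishing facts (logged-in only, first-party only) live in the generating code rather than in the column list.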

When the code that generates a table changes, Codex Enrichment adapts without any manual maintenance.

Six layers of context power the system

Codex Enrichment is one of six context layers that OpenAI's data agent uses. The first layer includes schema metadata like column names and data types, along with historical queries that show which tables are typically used together. The second layer consists of curated descriptions from domain experts that capture semantics, business meaning, and known limitations. The third layer is Codex Enrichment.

The fourth layer taps into institutional knowledge. The agent searches Slack messages, Google Docs, and Notion documents for information about product launches, technical incidents, and canonical metric definitions. The fifth layer is a learning memory that stores corrections and nuances from previous conversations and applies them to future requests. The sixth layer enables live queries to the data warehouse when no prior information exists or existing data is outdated.
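
The fallback behavior of the sixth layer can be sketched roughly as follows. This is an illustration under stated assumptions, not the report's design: the `ContextEntry` type, the `gather_context` function, the layer names, and the seven-day freshness window are all hypothetical, and the real agent presumably merges its layers in more sophisticated ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class ContextEntry:
    layer: str            # e.g. "schema", "expert_notes", "code_enrichment"
    text: str
    updated_at: datetime


# A lookup takes a table name and returns cached context, or None if it has none.
Lookup = Callable[[str], Optional[ContextEntry]]


def gather_context(
    table: str,
    cached_layers: List[Lookup],
    live_query: Callable[[str], ContextEntry],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # assumed freshness window
) -> List[ContextEntry]:
    """Collect fresh context from cached layers; query live only as a fallback."""
    fresh = []
    for lookup in cached_layers:
        entry = lookup(table)
        if entry is not None and now - entry.updated_at <= max_age:
            fresh.append(entry)
    if not fresh:
        # Nothing usable was cached: no layer knew the table, or every entry was stale.
        fresh.append(live_query(table))
    return fresh
```

Keeping the live query as a last resort reflects the article's description: the warehouse is consulted only when no prior information exists or the existing information is outdated.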

OpenAI tested the system on a simple question about daily active users of ChatGPT Image Gen. Without memory, the agent took over 22 minutes. With memory enabled, that dropped to one minute and 22 seconds. The company says the combination of all six context layers can cut analysis time for some queries from days to minutes.
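
For a sense of scale, the two reported timings work out to roughly a sixteenfold speedup. The short calculation below treats "over 22 minutes" as exactly 22 minutes, so the result is a lower bound.

```python
from datetime import timedelta

without_memory = timedelta(minutes=22)           # reported as "over 22 minutes"
with_memory = timedelta(minutes=1, seconds=22)   # reported as 1 minute 22 seconds

# Dividing one timedelta by another yields a plain float ratio.
speedup = without_memory / with_memory
time_saved = without_memory - with_memory

print(f"speedup: at least {speedup:.1f}x, time saved: {time_saved}")
```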
