Ad
Skip to content

OpenAI's new training dataset teaches AI models which instructions to trust

Image description
Nano Banana Pro prompted by THE DECODER

Key Points

  • OpenAI has published the "IH-Challenge" training dataset, which uses reinforcement learning to teach AI models a clear hierarchy of instructions, from system to developer to user to tool.
  • The GPT-5 Mini-R model trained on IH-Challenge prioritizes instructions more reliably and is significantly better at fending off prompt injection attacks that hide malicious instructions in tool outputs.
  • OpenAI sees this capability as critical for agentic models that independently call tools and process external documents. The company has made the dataset available on Hugging Face to encourage further research.

OpenAI has released IH-Challenge, a training dataset designed to teach AI models to reliably prioritize trusted instructions over untrusted ones. Early results show significant improvements in both security and prompt injection defense.

AI systems receive instructions from multiple sources at once. System-level security policies, developer settings, user requests, and information from external tools can all contradict each other. When a model makes the wrong call about which instruction to follow, security policies can be bypassed and prompt injection attacks can succeed.

According to OpenAI, many of these problems share the same root cause: the model simply follows the wrong instruction. To address this, the company developed the training dataset "IH-Challenge," which uses reinforcement learning to teach models a clear pecking order: system over developer over user over tool.

OpenAI had already introduced a similar approach based on GPT-3.5 Turbo in 2024, but that version only supported three priority levels and relied on LLM judges for evaluation. IH-Challenge moves past both limitations. The new dataset adds a fourth hierarchy level for developers and replaces error-prone language model evaluations with simple Python scripts for automated verification.

Ad
DEC_D_Incontent-1

Current training methods fail in three key areas

In the accompanying paper, OpenAI identifies three core pitfalls. First, errors in following complex instructions can be mistakenly flagged as hierarchy failures. Second, instruction conflicts are often subjective, making automated evaluation difficult. Third, models tend to learn shortcuts, such as rejecting harmless requests just to be safe.

IH-Challenge tackles these issues with deliberately simple tasks that can be automatically evaluated by scripts and don't allow for trivial shortcuts.

According to OpenAI, the internal model GPT-5 Mini-R trained on IH-Challenge shows clear improvements across academic and internal benchmarks when it comes to correctly prioritizing instructions. The biggest gains appeared in conflicts between developer and user-level instructions. At the same time, the model's general capabilities remained largely intact.

Prompt injections through tools get caught

The stronger instruction hierarchy translates into two concrete benefits, according to OpenAI. First, the model follows security policies in the system prompt more reliably without becoming less helpful overall. Second, robustness against prompt injection attacks improves significantly, particularly those that hide malicious instructions in tool outputs. OpenAI had previously documented similar vulnerabilities in ChatGPT Atlas.

Ad
DEC_D_Incontent-2

OpenAI emphasizes that this capability will become a critical security feature as models become more agentic. Models that independently call tools and read untrusted documents need to reliably distinguish between legitimate and manipulative instructions. OpenAI has published the IH-Challenge dataset on Hugging Face to encourage further research.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.