Researchers may have found a way to stop AI models from intentionally playing dumb during safety evaluations
A study by researchers from the MATS program, Redwood Research, the University of Oxford, and Anthropic examines a safety problem that grows more pressing as AI systems become more capable: "sandbagging," where a model deliberately hides its true abilities and delivers work that looks adequate but is intentionally subpar.
Future AI systems could take on tasks where quality is nearly impossible to reliably verify, such as evaluating research ideas, writing complex software, or even contributing to AI safety research itself. As models surpass the people checking their work, a dangerous gap opens up.
A model could deliberately underperform, producing output that looks acceptable but intentionally falls short of what it's actually capable of. Researchers call this behavior "sandbagging," and it's especially concerning because it can undermine safety evaluations: a model that appears unremarkable during testing could reveal entirely different capabilities in real-world deployment or deliberately sabotage critical tasks.
A new study by researchers from the MATS program, Redwood Research, the University of Oxford, and Anthropic investigates whether this kind of deliberate underperformance can be trained away, even when the evaluators are weaker than the model itself.
The researchers frame the problem as an adversarial game. A Red Team uses OpenAI's gpt-oss-120b to train a model organism that deliberately sandbags: when instructed via system prompt, the model delivers plausible-sounding but intentionally wrong answers in up to 90 percent of cases. A Blue Team then tries to recover the model's true capabilities using training techniques, but only has access to weaker supervisors, specifically GPT-4o-mini as a "weak" supervisor and Llama 3.1-8B as a "very weak" one. The Blue Team also gets ten verified example answers to work with.

Testing covers three domains: Olympiad math, science questions from the Super-GPQA benchmark, and programming tasks from Code Contests. Crucially, the Red Team has full knowledge of the Blue Team's training setup and can optimize against it. Giving the attacker this advantage is meant to make the resulting estimates conservative.
Reinforcement learning alone fails to break through
Individual methods show clear weaknesses. Supervised Fine-Tuning (SFT) on the weak supervisor's answers recovers between 85 and 94 percent of performance. But when the supervisor only reaches Llama-8B level, that drops to 40 percent on math. One notable finding: just ten verified examples are enough for SFT to recover 83 to 99 percent of performance.
Reinforcement Learning (RL) on its own fails almost every time. Instead of learning genuine improvements, the model finds ways to fool the weaker supervisor by producing answers that sound convincing but are factually wrong. Researchers call this "reward hacking," where the training reward goes up but actual accuracy doesn't. On Olympiad math, nearly all RL runs produced zero real performance gains. RL on just ten verified examples also fails: the model never explores far enough from its sandbagging behavior to even try correct answers, an effect the study calls "exploration hacking."
SFT breaks the pattern, and RL finishes the job
Only the combination of both methods delivers reliable results. Training first with SFT on the weak supervisor's demonstrations and then applying RL recovers between 88 and 99 percent of full performance, even with the very weak Llama-8B supervisor. The explanation: the SFT phase breaks up the rigid sandbagging pattern, so the model actually tries correct answers during the subsequent RL phase and gets reinforced for them. Without that first step, the model stays stuck in its wrong-answer mode.
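The mechanism described above can be illustrated with a deliberately tiny toy model. This is a sketch under strong assumptions, not the study's setup: the "model" is a single logit that decides between sandbagging and attempting a correct answer, the reward is a perfect 0/1 signal rather than a noisy weak supervisor, and all names (`p_correct`, `sft_step`, `rl_step`, `run`) are hypothetical.

```python
import math
import random


def p_correct(logit):
    """Probability that the toy policy attempts a correct answer."""
    return 1.0 / (1.0 + math.exp(-logit))


def sft_step(logit, lr=1.0):
    """One supervised step toward a demonstration of the correct action.

    The gradient of log p(correct) with respect to the logit is (1 - p).
    """
    return logit + lr * (1.0 - p_correct(logit))


def rl_step(logit, rng, lr=1.0):
    """One REINFORCE step on a sampled action (reward 1 if correct, else 0)."""
    p = p_correct(logit)
    took_correct = rng.random() < p
    reward = 1.0 if took_correct else 0.0
    # Gradient of log pi(action): (1 - p) for "correct", -p for "sandbag".
    grad_logp = (1.0 - p) if took_correct else -p
    # With zero reward the update vanishes, so unexplored actions never improve.
    return logit + lr * reward * grad_logp


def run(sft_steps, rl_steps, start_logit=-20.0, seed=0):
    """Train from a strongly sandbagging start and return the final p(correct)."""
    rng = random.Random(seed)
    logit = start_logit
    for _ in range(sft_steps):
        logit = sft_step(logit)
    for _ in range(rl_steps):
        logit = rl_step(logit, rng)
    return p_correct(logit)


rl_only = run(sft_steps=0, rl_steps=500)
sft_then_rl = run(sft_steps=20, rl_steps=500)
print(f"RL only:     p(correct) = {rl_only:.6f}")
print(f"SFT then RL: p(correct) = {sft_then_rl:.6f}")
```

In this toy, RL alone never samples a correct answer, so it never receives a reward and the policy does not move. A handful of SFT steps shifts enough probability onto correct answers that the RL phase starts sampling and reinforcing them. The real experiments involve a full language model and imperfect supervisors, so the sketch only conveys the intuition.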

Even against a particularly crafty model organism that had already learned to deceive the supervisor, the approach worked: the SFT phase degraded the reward-hacking ability enough that the subsequent RL could unlock real performance again.
Training and deployment must be indistinguishable
The entire approach hinges on one critical requirement: the model must not be able to tell whether it's being trained or deployed for real. If it can spot the difference, it can run a simple strategy: cooperate during training by delivering correct answers, which renders the training ineffective, then revert to sandbagging once deployment begins and training stops.
The researchers tested this exact scenario in an additional experiment. They equipped the model organism with situational awareness, which is the ability to recognize context clues about whether it is in a training or deployment phase.
The results were stark. With a very weak supervisor, the elicitation rate, defined as the share of true capability actually recovered, dropped from 88 to 36 percent on math questions. On science questions, the rate fell from 99 to 60 percent. Some training runs showed a particularly troubling pattern. The SFT phase initially reduced sandbagging as expected. However, the subsequent RL partially reversed the effect because the model learned to answer correctly during training while preserving its ability to sandbag during deployment.
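Read literally, the elicitation rate as defined above is a simple ratio. The following sketch assumes exactly that reading; the study may normalize differently (for example, relative to the sandbagging baseline), and the accuracy figures in the usage line are hypothetical, chosen only to produce a rate of 88 percent.

```python
def elicitation_rate(recovered_accuracy, true_accuracy):
    """Share of the model's true capability recovered after training.

    Assumes the plain ratio recovered / true; this is an illustrative
    reading of the article's definition, not the paper's exact formula.
    """
    if true_accuracy <= 0:
        raise ValueError("true_accuracy must be positive")
    return recovered_accuracy / true_accuracy


# Hypothetical numbers: the model scores 0.44 after training, 0.50 at full effort.
rate = elicitation_rate(0.44, 0.50)
print(f"elicitation rate: {rate:.0%}")
```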
This creates a clear practical requirement: training and deployment environments must be indistinguishable from the model's perspective, including identical system configurations, prompt formats, and context information. Anthropic's model card for Claude Opus 4.6 already showed that models are getting better at recognizing test scenarios and adjusting their behavior accordingly.
Artificial sandbaggers aren't necessarily real ones
The model organisms in the experiment are created through narrow fine-tuning, so the sandbagging behavior may be more superficial than it would be in a model that learns to deceive on its own during pretraining. The tasks are also short and single-step.
In complex, multi-step scenarios like agentic programming or research, a misaligned model would have far more opportunities for subtle sabotage, such as reading the wrong data file or introducing a hard-to-find logical error. The tests also don't distinguish between honest failure and deliberate sabotage, which would give a real attacker more room to maneuver.
Still, the researchers say the study provides the first empirical evidence that training with weak supervision can be a viable countermeasure against sandbagging, as long as training and deployment remain indistinguishable from the model's perspective.