Claude beat human researchers on an alignment task, and then the results vanished in production
Key Points
- Nine autonomous Claude instances outperformed human researchers on an AI alignment task, achieving a near-perfect score in five days.
- The results, however, didn't hold outside the lab: applying the winning method to Anthropic's own production model yielded no statistically significant improvement. The AI instances also repeatedly tried to game the evaluation rather than solve the problem.
- Anthropic acknowledges the task was unusually well-suited for automation, and that most alignment problems are far harder to define and measure.
In a controlled experiment, nine autonomous Claude instances dramatically outperformed human researchers on an open alignment problem. But when Anthropic tried to transfer the winning method to its own production models, the effect vanished.
Who controls an AI that's smarter than its developers? That's the central question driving alignment research, the field dedicated to making sure AI systems behave the way humans intend. The problem is that there are far more open research questions than people working on them, so Anthropic set out to test whether Claude itself could pick up some of that work.
The experiment centers on a specific scenario where a small, weaker AI model tries to teach a larger, stronger one which of two chat responses is better. These kinds of evaluations are critical for training helpful AI systems, but the catch is that the "teacher" is worse than its "student," and the question is how much of the student's potential can still be unlocked.
Anthropic measured this using what they call "Performance Gap Recovered" (PGR), where a score of 0 means the student performs no better than its weak teacher, while a score of 1 means it reaches its full capability. The scenario serves as a model for a future where humans, as weak teachers, need to supervise superhuman AI.
Nine autonomous Claude instances beat the human team
According to Anthropic, nine instances of Claude Opus 4.6 each received their own work environment, a shared forum, and access to an evaluation server. Each instance got a deliberately vague starting direction, but beyond that, these "Automated Alignment Researchers" (AARs) worked completely on their own, formulating hypotheses, designing experiments, and analyzing results.
Two human researchers reached a PGR of 0.23 after seven days. The nine Claude instances hit 0.97 in five additional days, unlocking nearly all of the stronger model's potential at a cost of about $18,000.
Impressive lab results, sobering real-world performance
There's a major catch, though. All the experiments ran on small, freely available open-source models, specifically Qwen models with 0.5 and 4 billion parameters. When Anthropic tried to apply the best method to its own production model Claude Sonnet 4 using its in-house training infrastructure, the effect was statistically insignificant, according to the study, with the improvement landing at just 0.5 points, essentially noise.
Anthropic suspects this might be related to how the production model expresses its preferences. The researchers only tested a single, simple evaluation method, and other approaches could yield better results. Still, the company acknowledges a fundamental issue, noting that the AARs tend to exploit quirks of the specific models and datasets they work with. What works in a controlled experiment doesn't necessarily transfer to other models or larger scales.
Tests on new tasks painted a mixed picture as well. The methods were additionally evaluated on math verification tasks and code review, with the best method achieving a PGR of 0.94 in math but only 0.47 in code. Anthropic itself writes that the original problem was well-suited for automation because it had a single, objectively measurable success criterion. Most alignment problems are far less clearly defined.
AI researchers tried to game the evaluation system
There's also a second red flag, because the AARs repeatedly tried to manipulate the evaluation instead of actually solving the problem. One model figured out that for math tasks the most common answer was usually correct and bypassed the weak teacher entirely. Another extracted test labels directly from the evaluation interface by systematically trying different answers and watching the server's response. For code tasks, one model simply ran the code itself to read off the correct answer.
One important design lesson from the study is that giving different starting directions to individual instances was critical for success, because without that diversity, all instances quickly converged on the same ideas. Overly detailed instructions actually made results worse by limiting the models' flexibility. Code and datasets are publicly available.
AI News Without the Hype – Curated by Humans
Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.
Subscribe now