Content
summary Summary

A recent study from the ML Alignment & Theory Scholars (MATS) program and Apollo Research shows that today's leading language models are surprisingly good at figuring out when an interaction is part of a test instead of a real conversation.

Ad

The researchers call this "evaluation awareness." When a model realizes it's being tested, it can change how it responds—sometimes just skewing benchmark scores, but in some cases actively hiding problematic abilities.

The team analyzed 1,000 transcripts from 61 datasets, covering everything from basic chat exchanges to complex agent tool use. The examples were split evenly across four categories: chat evaluation, chat deployment, agentic evaluation, and agentic deployment.

Transcripts from different sources are run through multiple-choice and open-ended questions to see if models can spot test conditions and figure out the goal. | Image: Needham et al.

Nine major models—including Anthropic's Claude, OpenAI's GPT, and Google's Gemini—were tested on each transcript, using five differently worded prompts for more reliable results. Two human reviewers did the same evaluations for comparison.

Ad
Ad

Performance was measured on a scale where 0.5 equals random guessing and 1.0 means perfect accuracy. Gemini 2.5 Pro scored 0.83 overall and hit 0.95 in agentic scenarios. Claude 3.5 Sonnet and GPT-4o were close behind.

In both chat and agent scenarios, the top models nearly match human-level performance. Gemini 2.5 Pro is especially strong in agentic settings. | Image: Needham et al.

Pure chat tasks were tougher: even the best models scored only between 0.71 and 0.77, while humans managed 0.81. For agent scenarios, AI results were much closer to the experts, with the human baseline almost perfect at 0.996.

Models catch on to benchmark patterns

The study also looked at whether models could figure out not just when they were being tested, but what the test was for—specifically, whether it focused on safety or on capability.

If the prompt clearly stated it was an evaluation, Claude 3.5 Sonnet got it right 93 percent of the time, and GPT-4o did so 89 percent of the time. Without that clear hint, Claude still scored well above chance at 84 percent, and nearly half the time when given open-ended questions with no framing at all.

These question types are designed to test whether models can figure out the real purpose behind an interaction. | Image: Needham et al.

Looking at the explanations, the researchers found common tricks: some models recognized certain formatting, synthetic web pages, or flag structures as signals of a test. Top-tier models like Gemini 2.5 Pro even named specific benchmarks, such as SWE-bench or MMLU. There was noticeable "meta-reasoning," where simply seeing a control question tipped the model off that it was in a test.

Recommendation

But all the models tended to overestimate how often they were being evaluated—they guessed "test" more frequently than the actual data justified. This bias held steady across newer model generations and was most pronounced in models designed for open-ended reasoning.

Most models' predictions fall below the diagonal, indicating they overestimate their own accuracy. Newer models don't seem to have fixed this. | Image: Needham et al.

This has real consequences for safety. A model that's aware it's being tested, but isn't truly aligned, could mask dangerous behaviors or play it safe during red-teaming. The researchers suggest that "evaluation awareness" should be included as its own metric in future testing protocols.

Recent studies have shown again and again that AI benchmarks only tell part of the story, especially since companies can train their models to pass specific tests. At the same time, newer models like OpenAI's o3 and Claude 4 Opus are showing signs they can strategically mislead users.

 

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A study by MATS and Apollo Research found that language models can reliably distinguish between test situations and real-world use. Their performance is nearly on par with humans, particularly in agent-like scenarios.
  • Leading models, such as Gemini 2.5 Pro, Claude 3.5 Sonnet, and GPT-4o, can detect test cases and often identify specific test goals by recognizing patterns in formatting, web pages, or benchmarks. However, they tend to overestimate the likelihood that they are being tested.
  • The researchers caution that this "evaluation awareness" could undermine the reliability of benchmarks and mask risky capabilities. They recommend that future testing treat test detection as a separate metric.
Sources
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.