A recent study from the ML Alignment & Theory Scholars (MATS) program and Apollo Research shows that today's leading language models are surprisingly good at figuring out when an interaction is part of a test instead of a real conversation.
The researchers call this "evaluation awareness." When a model realizes it's being tested, it can change how it responds—sometimes just skewing benchmark scores, but in some cases actively hiding problematic abilities.
The team analyzed 1,000 transcripts from 61 datasets, covering everything from basic chat exchanges to complex agent tool use. The examples were split evenly across four categories: chat evaluation, chat deployment, agentic evaluation, and agentic deployment.

Nine major models—including Anthropic's Claude, OpenAI's GPT, and Google's Gemini—were tested on each transcript, using five differently worded prompts for more reliable results. Two human reviewers did the same evaluations for comparison.
Performance was measured on a scale where 0.5 equals random guessing and 1.0 means perfect accuracy. Gemini 2.5 Pro scored 0.83 overall and hit 0.95 in agentic scenarios. Claude 3.5 Sonnet and GPT-4o were close behind.

Pure chat tasks were tougher: even the best models scored only between 0.71 and 0.77, while humans managed 0.81. For agent scenarios, AI results were much closer to the experts, with the human baseline almost perfect at 0.996.
Models catch on to benchmark patterns
The study also looked at whether models could figure out not just when they were being tested, but what the test was for—specifically, whether it focused on safety or on capability.
If the prompt clearly stated it was an evaluation, Claude 3.5 Sonnet got it right 93 percent of the time, and GPT-4o did so 89 percent of the time. Without that clear hint, Claude still scored well above chance at 84 percent, and nearly half the time when given open-ended questions with no framing at all.

Looking at the explanations, the researchers found common tricks: some models recognized certain formatting, synthetic web pages, or flag structures as signals of a test. Top-tier models like Gemini 2.5 Pro even named specific benchmarks, such as SWE-bench or MMLU. There was noticeable "meta-reasoning," where simply seeing a control question tipped the model off that it was in a test.
But all the models tended to overestimate how often they were being evaluated—they guessed "test" more frequently than the actual data justified. This bias held steady across newer model generations and was most pronounced in models designed for open-ended reasoning.

This has real consequences for safety. A model that's aware it's being tested, but isn't truly aligned, could mask dangerous behaviors or play it safe during red-teaming. The researchers suggest that "evaluation awareness" should be included as its own metric in future testing protocols.
Recent studies have shown again and again that AI benchmarks only tell part of the story, especially since companies can train their models to pass specific tests. At the same time, newer models like OpenAI's o3 and Claude 4 Opus are showing signs they can strategically mislead users.