Anthropic keeps rewriting its hiring test because Claude keeps beating the applicants

Image: Anthropic

Key Points

  • Anthropic had to revise its recruiting test for performance engineers three times because its own language model Claude solved the tasks better than human applicants.
  • Claude Opus 4.5 matched the scores of the best candidates within the two-hour time limit, while previous versions had already outperformed more than half of applicants.
  • The solution: a new test with unusual constraints inspired by programming puzzles.

Anthropic is dealing with an unusual problem: The recruiting team has to keep developing new take-home tests because the company's own AI models are outperforming human candidates.

According to a blog post by Tristan Hume, who leads Anthropic's performance optimization team, the company has had to revamp its take-home test for performance engineers three times now. The reason: Each new version of Claude made the previous test obsolete.

The original test, completed by over 1,000 candidates since early 2024, is built around a Python simulator for a fictional chip. Candidates receive a working program and have to rewrite it to run faster. Performance is measured in clock cycles - the number of computational steps the simulated computer needs to complete the task. Fewer steps mean a better solution.
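Hume's post doesn't include the simulator's code, but the scoring idea is easy to sketch. The following is a hypothetical toy example, not Anthropic's actual simulator: an interpreter that charges one cycle per executed instruction, so a submission wins by producing the same result in fewer cycles.

    # Hypothetical sketch, not Anthropic's actual simulator: a toy interpreter
    # that charges one clock cycle per executed instruction. Candidates would
    # keep the program's output identical while driving the cycle count down.

    def run(program, memory):
        """Execute a list of (op, *args) tuples; return cycles consumed."""
        cycles, pc = 0, 0
        while pc < len(program):
            op, *args = program[pc]
            cycles += 1                      # every instruction costs one cycle
            if op == "add":                  # memory[dst] = memory[a] + memory[b]
                dst, a, b = args
                memory[dst] = memory[a] + memory[b]
            elif op == "jz":                 # jump to an absolute address if zero
                src, target = args
                if memory[src] == 0:
                    pc = target
                    continue
            pc += 1
        return cycles

    # A baseline program; a faster, equivalent rewrite would score higher.
    print(run([("add", 2, 0, 1), ("add", 3, 2, 2)], {0: 3, 1: 4}))  # 2 cycles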

Hume designed the test to mirror real work at Anthropic. Candidates were allowed to use AI tools, just like they would on the job. Many participants voluntarily worked beyond the four-hour time limit because they found the challenge engaging. Dozens of engineers hired through this process now work on Anthropic's infrastructure.

Each Claude upgrade forced a test redesign

The problem emerged gradually. With Claude 3.7 Sonnet, it became clear that more than half of candidates would have scored higher if they had simply handed the entire task to Claude instead of coding themselves.

By May 2025, Claude Opus 4 was beating nearly all human solutions within the time limit. Hume responded by tweaking the test and cutting the time to two hours.

Then came Claude Opus 4.5. The model matched the scores of the best human candidates within two hours. Without a time limit, humans can still beat Opus 4.5 - sometimes by a wide margin - but not within the time constraints that are realistic for a take-home test.

Banning AI tools wouldn't reflect the actual job

Hume considered prohibiting AI tools in the test but decided against it, since a ban wouldn't reflect reality: on the actual job, engineers use AI assistants, and a test should show how well someone can work with these tools, not without them.

The final solution was a complete break from the previous approach. Hume drew inspiration from programming puzzle games by developer Zachtronics, known for their unusual and heavily constrained programming environments. In these games, players have to work with minimal commands and limited memory, forcing creative solutions.

The new test uses similarly unusual constraints. Claude fails at these tasks because they barely appear in its training data.
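Hume doesn't describe the new environment in detail, so the following is purely illustrative: a hypothetical Zachtronics-style machine with one accumulator, one backup register, and a handful of instructions, where even doubling a number has to be pieced together from the available primitives.

    # Purely illustrative, Zachtronics-style toy machine - not the new test.
    # One accumulator, one backup register, four instructions, no memory:
    # even simple arithmetic has to be composed from these primitives.

    class TinyMachine:
        def __init__(self, acc=0):
            self.acc = acc   # the only working register
            self.bak = 0     # the only storage, reachable via sav/swp

        def run(self, program):
            for op, *args in program:
                if op == "add":    # add an immediate, or the bak register, to acc
                    self.acc += self.bak if args[0] == "bak" else args[0]
                elif op == "sav":  # copy acc into bak
                    self.bak = self.acc
                elif op == "swp":  # exchange acc and bak
                    self.acc, self.bak = self.bak, self.acc
                elif op == "neg":  # negate acc
                    self.acc = -self.acc
            return self.acc

    m = TinyMachine(acc=21)
    print(m.run([("sav",), ("add", "bak")]))  # doubling without a multiply -> 42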

"Realism may be a luxury we no longer have," Hume writes. The original test worked because it resembled real work. The new test works because it simulates novel work that neither humans nor AI have ever seen before.

Anthropic has published the original test on GitHub. Given unlimited time, humans still have an edge. The fastest human solution ever submitted beats Claude's best performance by a significant margin. Anyone who submits a more efficient solution than Claude can apply directly to Anthropic. Everyone else can apply through the normal process and take on the new test.

Source: Anthropic