Bottom line: Anthropic continuously revises its technical hiring tests as AI models grow stronger. Its take-home code optimization test has gone through three versions to keep identifying top talent and stay ahead of the latest Claude model.
Tristan Hume of Anthropic’s performance optimization team describes how to design technical evaluations that hold up against AI assistance. As AI capabilities increase, hiring tests must be revised repeatedly to keep identifying top talent.
Assessing technical candidates becomes increasingly difficult as AI improves. Tests that distinguish between different performance levels today can be trivially solved by models tomorrow, losing their value in the process.
Since early 2024, Anthropic’s performance engineering team has used a take-home test in which candidates optimize code for a simulated accelerator. Over 1,000 candidates have completed it, and dozens now work there, including engineers who built the Trainium cluster and deployed every model since Claude 3 Opus.
However, each new Claude model forced a redesign of the test. Under the same time constraints, Claude Opus 4 outperformed most human applicants. The test could still distinguish the best candidates, until Claude Opus 4.5 matched them as well. Humans can still outperform models when given unlimited time, but under take-home conditions the test could no longer separate top candidates from the most capable model.
Hume has iterated through three versions of the take-home to keep it meaningful. Each version taught something new about designing evaluations that are robust to AI assistance. The original design is now being published as an open challenge: with unlimited time, the best humans still outperform Claude Opus 4.5.
Source: www.anthropic.com