Nearly 29 percent of "Humanity's Last Exam" chemistry/biology answers are wrong or misleading
It looks like humanity might flunk its own "final AI exam." According to FutureHouse, about 29 percent of the biology and chemistry questions in the AI benchmark Humanity's Last Exam (HLE) have answers that published literature shows to be incorrect or misleading. FutureHouse uncovered the error rate through a combination of human review and AI-assisted analysis.
HLE was built to push language models to their limits with especially tough questions, but the analysis suggests that many of its items are themselves misleading or wrong. One likely reason is the benchmark's review process: experts spent only a few minutes per question, and a full accuracy check wasn't required. In response, FutureHouse has released a smaller, vetted subset called "HLE Bio/Chem Gold" on Hugging Face.