It looks like humanity might flunk its own "final AI exam." According to FutureHouse, roughly 29 percent of the biology and chemistry questions in the AI benchmark Humanity's Last Exam (HLE) have answers that are contradicted by published literature, making them incorrect or misleading. The error rate was uncovered through a combination of human review and AI-assisted analysis.
HLE was designed to push language models to their limits with especially tough questions, but the analysis suggests that many of the items themselves are misleading or wrong. During the benchmark's original review process, experts spent only a few minutes per question, and a full accuracy check was not required. In response, FutureHouse has released a smaller, vetted subset called "HLE Bio/Chem Gold" on Hugging Face.