Matthias Bastian
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.
It looks like humanity might flunk its own "final AI exam." According to FutureHouse, about 29 percent of biology and chemistry questions in the AI benchmark Humanity's Last Exam (HLE) have answers that are incorrect or misleading, based on published literature. The error rate was uncovered through a combination of human review and AI-backed analysis.
HLE was built to push language models to their limits with especially tough questions, but the analysis suggests that many of its items are themselves misleading or wrong. During the benchmark's original review process, experts spent only a few minutes per question, and a full accuracy check wasn't required. In response, FutureHouse has released a smaller, vetted subset called "HLE Bio/Chem Gold" on HuggingFace.
Update: Registration is now open. Places will be allocated by lottery.
OpenAI has set the date for its next DevDay: October 6, 2025, in San Francisco. With more than 1,500 developers expected, the company says it will be its largest event of this kind so far. The agenda includes a live-streamed keynote, hands-on workshops featuring the latest models and tools, and more stages and demos than last year. Details are still under wraps, but you can sign up for updates here.