
A new benchmark called FrontierMath, created by leading mathematicians, shows that current AI systems perform poorly on complex mathematical problems, despite high scores on simpler tests.


According to AI research firm Epoch AI, top models like o1-preview, GPT-4o, Claude 3.5, and Gemini 1.5 Pro solve less than 2 percent of FrontierMath problems, even though they score above 90 percent on previous math assessments.

Chart: Performance comparison of leading AI models on FrontierMath, with a maximum success rate of about 2 percent. Even the strongest models, such as Gemini 1.5 Pro and Claude 3.5, solve only around 2 percent of FrontierMath problems, underscoring how hard complex mathematical problem-solving remains for AI systems.

To create FrontierMath, a team of over 60 leading mathematicians put together hundreds of complex math problems. These aren't your typical math questions - they range from intensive number theory calculations to abstract problems in algebraic geometry. According to Epoch AI, even skilled mathematicians need hours or days to solve them.

Before adding any problem to the test, other experts review it to check both its accuracy and difficulty. When researchers tested random samples, they found error rates around 5 percent - about the same as other major machine learning benchmarks like ImageNet.


The benchmarking paradox

The stark difference between scores on standard tests and FrontierMath points to a core issue in AI benchmarking: tests measure only specific, limited skills. Companies spend millions optimizing their models for exactly these standard benchmarks, because good scores make for effective marketing.

Chart: Comparison of seven AI mathematics benchmarks; FrontierMath leads with 98.3 percent of problems unsolved. While established benchmarks like GSM-8k are almost completely solved, over 98 percent of FrontierMath problems remain unsolved, revealing a significant performance gap in current AI models on complex mathematical tasks. | Image: Epoch AI

Former OpenAI developer Andrej Karpathy says these findings show a new aspect of Moravec's paradox. While AI can excel at complex tasks with clear rules - like playing high-level chess - it frequently fails at simple problems that people handle with ease. When tasks call for common sense or gut-level problem solving, AI systems fall short.

"They can solve complex closed problems if you serve them the problem description neatly on a platter in the prompt, but they struggle to coherently string together long, autonomous, problem-solving sequences in a way that a person would find very easy," Karpathy writes.

This creates an odd situation where "LLMs are inching well into top expert territory," but as Karpathy notes, "you wouldn't hire them over a person for the most menial jobs." He suggests that beyond benchmarks like FrontierMath, the field needs new tests to measure "all the 'easy' stuff that is secretly hard."

Nevertheless, the Epoch AI team sees mathematics as an ideal framework for evaluating complex reasoning. It requires both creativity and precise logical chains, while allowing for objective verification of results.
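To illustrate what "objective verification of results" can look like in practice, here is a minimal sketch of an automated grading step. It assumes each problem ships with a single exact expected answer and that the model returns a final answer string; the `check_answer` helper and the SymPy-based comparison are illustrative assumptions, not Epoch AI's actual FrontierMath verification code.

```python
# Minimal sketch of automated answer verification for a math benchmark.
# Assumption: each problem has one exact expected answer (often an integer
# or a closed-form expression) and the model returns a final answer string.
# Illustrative only - not Epoch AI's actual FrontierMath harness.

import sympy as sp


def check_answer(model_answer: str, expected_answer: str) -> bool:
    """Return True if the model's final answer is mathematically equal to the expected one."""
    try:
        predicted = sp.sympify(model_answer)
        expected = sp.sympify(expected_answer)
        # simplify(a - b) == 0 also catches equivalent forms, e.g. "2**10" vs. "1024".
        return sp.simplify(predicted - expected) == 0
    except (sp.SympifyError, TypeError):
        return False


if __name__ == "__main__":
    results = [
        check_answer("1024", "2**10"),           # True: equivalent forms
        check_answer("sqrt(2)/2", "1/sqrt(2)"),  # True: same value
        check_answer("41", "42"),                # False: wrong answer
    ]
    solved = sum(results)
    print(f"Solved {solved} of {len(results)} problems "
          f"({100 * solved / len(results):.1f} percent)")
```

Because answers are checked symbolically rather than by a human grader, a benchmark built this way can be scored automatically and reproducibly, which is part of what makes mathematics attractive for evaluating reasoning.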


Looking ahead, the team plans to expand the benchmark and regularly test AI systems to measure their progress in mathematical reasoning. They'll also publish additional example problems over the next few months to help other researchers better understand what AI can and cannot do.

Summary
  • A new benchmark called FrontierMath, created by top mathematicians, shows that even the most advanced AI systems, including GPT-4o, Claude 3.5, and Gemini 1.5 Pro, struggle to solve complex math problems, with a success rate of less than two percent.
  • The FrontierMath benchmark consists of hundreds of highly challenging math problems from various areas of modern mathematics, carefully designed and tested by experts in the field.
  • The significant discrepancy between the results of established tests and FrontierMath underscores a crucial issue in evaluating AI systems: current tests only assess a limited set of skills, and there is a lack of benchmarks that measure basic skills such as everyday reasoning and autonomous work.