
A new benchmark called FrontierMath, created by leading mathematicians, shows that current AI systems perform poorly on complex mathematical problems, despite high scores on simpler tests.


According to AI research firm Epoch AI, top models like o1-preview, GPT-4o, Claude 3.5, and Gemini 1.5 Pro solve less than 2 percent of FrontierMath problems, even though they score above 90 percent on previous math assessments.

Chart: Performance comparison of leading AI models on FrontierMath, with a maximum success rate of about 2 percent. Even the strongest models, such as Gemini 1.5 Pro and Claude 3.5, solve only around 2 percent of FrontierMath problems, underscoring how hard complex mathematical problem-solving remains for AI systems.

To create FrontierMath, a team of over 60 leading mathematicians put together hundreds of complex math problems. These aren't your typical math questions - they range from intensive number theory calculations to abstract problems in algebraic geometry. According to Epoch AI, even skilled mathematicians need hours or days to solve them.

Before adding any problem to the test, other experts review it to check both its accuracy and difficulty. When researchers tested random samples, they found error rates around 5 percent - about the same as other major machine learning benchmarks like ImageNet.


The benchmarking paradox

The stark difference between scores on standard tests and FrontierMath points to a core issue in AI benchmarking: tests measure only specific, limited skills. Companies spend millions optimizing their models for exactly these standard benchmarks, because good scores make for effective marketing.

Chart: Comparison of seven AI mathematics benchmarks; FrontierMath leads with 98.3 percent of problems unsolved. While established benchmarks like GSM-8k are almost completely solved, over 98 percent of FrontierMath problems remain unsolved, revealing a significant performance gap in current AI models on complex mathematical tasks. | Image: Epoch AI

Former OpenAI developer Andrej Karpathy says these findings show a new aspect of Moravec's paradox. While AI can excel at complex tasks with clear rules - like playing high-level chess - it frequently fails at simple problems that people handle with ease. When tasks call for common sense or gut-level problem solving, AI systems fall short.

"They can solve complex closed problems if you serve them the problem description neatly on a platter in the prompt, but they struggle to coherently string together long, autonomous, problem-solving sequences in a way that a person would find very easy," Karpathy writes.

This creates an odd situation where "LLMs are inching well into top expert territory," but as Karpathy notes, "you wouldn't hire them over a person for the most menial jobs." He suggests that beyond benchmarks like FrontierMath, the field needs new tests to measure "all the 'easy' stuff that is secretly hard."

Nevertheless, the Epoch AI team sees mathematics as an ideal framework for evaluating complex reasoning. It requires both creativity and precise logical chains, while allowing for objective verification of results.
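To illustrate what "objective verification of results" can look like in practice, here is a minimal sketch of an automated grading step. It assumes each problem ships with a single exact expected answer and that the model returns a final answer string; the `check_answer` helper and the SymPy-based comparison are illustrative assumptions, not Epoch AI's actual FrontierMath verification code.

```python
# Minimal sketch of automated answer verification for a math benchmark.
# Assumption: each problem has one exact expected answer (often an integer
# or a closed-form expression) and the model returns a final answer string.
# Illustrative only - not Epoch AI's actual FrontierMath harness.

import sympy as sp


def check_answer(model_answer: str, expected_answer: str) -> bool:
    """Return True if the model's final answer is mathematically equal to the expected one."""
    try:
        predicted = sp.sympify(model_answer)
        expected = sp.sympify(expected_answer)
        # simplify(a - b) == 0 also catches equivalent forms, e.g. "2**10" vs. "1024".
        return sp.simplify(predicted - expected) == 0
    except (sp.SympifyError, TypeError):
        return False


if __name__ == "__main__":
    results = [
        check_answer("1024", "2**10"),           # True: equivalent forms
        check_answer("sqrt(2)/2", "1/sqrt(2)"),  # True: same value
        check_answer("41", "42"),                # False: wrong answer
    ]
    solved = sum(results)
    print(f"Solved {solved} of {len(results)} problems "
          f"({100 * solved / len(results):.1f} percent)")
```

Because answers are checked symbolically rather than by a human grader, a benchmark built this way can be scored automatically and reproducibly, which is part of what makes mathematics attractive for evaluating reasoning.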


Looking ahead, the team plans to expand the benchmark and regularly test AI systems to measure their progress in mathematical reasoning. They'll also publish additional example problems over the next few months to help other researchers better understand what AI can and cannot do.

Summary
  • A new benchmark called FrontierMath, created by top mathematicians, shows that even the most advanced AI systems, including GPT-4o, Claude 3.5, and Gemini 1.5 Pro, struggle to solve complex math problems, with a success rate of less than two percent.
  • The FrontierMath benchmark consists of hundreds of highly challenging math problems from various areas of modern mathematics, carefully designed and tested by experts in the field.
  • The significant discrepancy between the results of established tests and FrontierMath underscores a crucial issue in evaluating AI systems: current tests only assess a limited set of skills, and there is a lack of benchmarks that measure basic skills such as everyday reasoning and autonomous work.