Summary

Are large language models capable of reasoning, or do they simply remember results from their training data?

A research team led by Saurabh Srivastava at Consequent AI, which specializes in reasoning AI, explores this question in a new paper.

The team challenges the common practice of AI benchmarking, which is typically based on static question-answer pairs that an AI may have already seen during its extensive training on Internet data.

The researchers theorize that this traditional benchmarking method can falsely overestimate a machine's apparent intelligence by confusing memorization with true reasoning.


To counter this, they introduce the concept of "functional variants" for benchmarking. This involves taking established benchmarks, such as the MATH benchmark, and translating the underlying thought processes into code.

This code, in this case called MATH(), can then generate different "snapshots," which are unique questions that require the same reasoning to solve, but are not identical to the original questions.

In this way, traditional benchmarks such as MATH become code templates that can be varied in infinitely many ways while still testing the same underlying logic. This procedure is designed to ensure that language models actually demonstrate problem-solving ability rather than repetition of memorized questions.
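The idea of a "functional variant" can be illustrated with a minimal sketch. The function name `math_snapshot` and the toy order-of-operations problem are hypothetical examples, not taken from the MATH() code itself: a static question becomes a generator that re-samples its operands, so every snapshot requires the same reasoning but has a fresh, non-memorizable answer.

```python
import random
import re


def math_snapshot(seed=None):
    """Generate one 'snapshot' of a functionalized math problem.

    Hypothetical example: the static question 'What is 3 + 5 * 2?'
    becomes a template whose operands are re-sampled, so each snapshot
    tests the same order-of-operations reasoning with a new answer.
    """
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 9), rng.randint(2, 9), rng.randint(2, 9)
    question = f"What is {a} + {b} * {c}?"
    # Ground truth is computed by the generator, so a model cannot
    # have memorized it from a static question-answer pair.
    answer = a + b * c
    return question, answer


if __name__ == "__main__":
    question, answer = math_snapshot(seed=42)
    print(question, "->", answer)
```

Because the generator is seeded, each snapshot is reproducible for evaluation, while unseeded calls yield an effectively unlimited supply of fresh test items.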

"Reasoning Gap" in Large Language Models

In evaluating several language models, including OpenAI's GPT-3.5 and GPT-4, the researchers identified what they call a "reasoning gap" - a discrepancy between a model's performance on known benchmark problems and its performance on new problems it must solve on the fly.

The measured gaps ranged from 58.35 percent to 80.31 percent, indicating that the models had difficulty with the functionalized forms of the problems. This in turn can be taken as an indication that the models do not really understand them, but rather derive the answers from their extensive training data.
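One plausible way to formalize such a gap is as the relative drop in accuracy when moving from the static benchmark to its functional variants; this reading is an assumption, as the paper may define the metric differently.

```python
def reasoning_gap(static_accuracy, functional_accuracy):
    """Relative accuracy drop from static questions to fresh snapshots.

    A gap of 0.0 would mean no reliance on memorization; a gap of 1.0
    would mean the model solves none of the new snapshots. Note: this
    is one plausible formalization, not necessarily the paper's exact
    definition.
    """
    if static_accuracy == 0:
        return 0.0
    return (static_accuracy - functional_accuracy) / static_accuracy


# Illustration with made-up numbers: a model scoring 50% on original
# questions but only 15% on fresh snapshots shows a 70% gap.
print(f"{reasoning_gap(0.50, 0.15):.0%}")  # -> 70%
```

Under this reading, the reported gaps of 58.35 to 80.31 percent mean the models lose well over half of their measured accuracy once memorization is taken off the table.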

Image: Srivastava et al.

The researchers also analyzed the types of problems the models were able to solve successfully. They found that the models performed better on lower-level problems and on pre-algebra and algebra problems.

The reasoning gap grows as the difficulty level increases. The researchers believe that the models actually solve the problems at the lower difficulty levels and merely remember the difficult ones. GPT-4's performance is comparable to other LLMs, but at a slightly higher level overall. | Image: Srivastava et al.

The authors cite several factors as possible limitations of their findings, such as the potential influence of more complex prompts or the use of computational tools during the inference process that could reduce the reasoning gap for the math problems tested.

The Consequent AI team has already functionalized 41.2 percent of the MATH benchmark and released their research, code, and three snapshots of the functional MATH() benchmark.

So far, they have evaluated nine open-source models and four closed-source models. The paper and the GitHub repository offer detailed information about their approach and results.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

In the future, the research team plans to release functionalized versions of additional benchmarks with the goal of 100% coverage of MATH, GSM8K, and HumanEval. In addition, the researchers plan to test the influence of prompting strategies on the reasoning gap.

AI reasoning on a scale

Another way to look at the results is that the reasoning gap was not 100 percent - that is, some ability to draw logical conclusions does appear to be inherent in the models, a question that is still broadly debated.

GPT-4 was able to solve about ten percent of the dynamic problems correctly (541 out of 5000). However, this result puts it only slightly ahead of smaller, more efficient models.

Image: Srivastava et al.

AI researcher François Chollet offers a perspective. He describes four levels of generalization capability, with most LLMs currently operating at level 1: they have memorized answers to a static set of tasks and can interpolate between them.

At level 2, models would run generalizable programs to solve tasks robustly within a static set of tasks. "LLMs can do some of that, but as displayed below (e.g. in the study), they suck at it, and fitting programs via gradient descent is ridiculously data-inefficient," Chollet writes.

Level 0 would be a simple database with no ability to reason, and level 3 would be the ability to generate new programs on the fly to solve new problems, which would be general intelligence.

"The 'general' in 'general intelligence' refers to the ability to handle any task, including ones you haven't seen before," Chollet writes.

Image: Chollet
  • Researchers at Consequent AI are investigating whether large language models can reason or simply retrieve results from training data. They challenge traditional AI benchmarking methods that cannot reliably distinguish between the two.
  • They introduce the concept of "functional variants," where established benchmarks such as the MATH benchmark are translated into code to generate unique questions that require the same solution logic but are not identical.
  • The study found a "reasoning gap" in models such as OpenAI's GPT-3.5 and GPT-4, which performed significantly better on problems known from benchmarks than on new ones they had to solve on the fly. This suggests that LLMs do not truly understand hard math problems and, when they answer them correctly at all, derive the answers from their training data.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.