A new study has uncovered significant gaps in the reasoning abilities of AI language models, especially smaller and cheaper ones, which struggle when elementary-school math problems are chained together.
Researchers from the Mila Institute, Google DeepMind, and Microsoft Research investigated how well different AI language models could solve linked elementary school-level math word problems. They created a test called "Compositional GSM" that combines two problems from the GSM8K dataset, using the answer from the first problem as a variable in the second.
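To make the construction concrete, here is a minimal sketch in Python of how such a chained item could be assembled. The placeholder convention, function names, and example problems are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of how a Compositional GSM-style item could be assembled.
# The templating and example problems below are assumptions for illustration;
# they are not taken from the paper's code or data.

def make_compositional_item(q1, q1_answer, q2_template, q2_solver):
    """Chain two grade-school word problems.

    The answer to Question 1 is referred to as X inside Question 2,
    so a model must solve Question 1 correctly before it can answer
    Question 2.
    """
    prompt = (
        "Solve both questions. Let X be the answer to Question 1.\n"
        f"Question 1: {q1}\n"
        f"Question 2: {q2_template.format(X='X')}\n"
        "Give the final answer to Question 2."
    )
    # Ground-truth final answer, obtained by plugging Question 1's
    # answer into Question 2.
    return {"prompt": prompt, "final_answer": q2_solver(q1_answer)}


item = make_compositional_item(
    q1="Ali packs 4 boxes with 6 apples each. How many apples does he pack?",
    q1_answer=24,
    q2_template="Sara buys {X} pencils and gives away 5. How many does she keep?",
    q2_solver=lambda x: x - 5,
)
print(item["prompt"])
print("expected final answer:", item["final_answer"])  # 19
```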
The results show many models performed much worse than expected on these more complex reasoning tasks. This "reasoning gap" is particularly pronounced in smaller, cheaper models and even those specializing in math.
"Our findings reveal a significant reasoning gap in most LLMs, that is performance difference between solving the compositional pairs and solving each question independently," explain the authors, led by Arian Hosseini of the Mila Institute.
Smaller models struggle with complex tasks
While smaller models often score similarly to larger ones on standard math tests like GSM8K, they show a reasoning gap that is 2 to 12 times larger on the new Compositional GSM test. For instance, GPT-4o mini falls far behind GPT-4o on the new test, despite nearly matching it on the original benchmark. Similar patterns appeared across other model families such as Gemini and Llama 3.
The researchers suggest that smaller models may recognize surface-level patterns in common tasks but struggle to apply that knowledge in new contexts. Current training methods for these models may focus too heavily on optimizing for standard benchmarks at the expense of general reasoning ability.
Even small, math-specialized models show the same weakness. For example, Qwen2.5-Math-7B-IT scores over 80 percent on difficult high school competition-level questions, but solves less than 60 percent of chained grade school problems correctly.
The study also examined the effects of instruction tuning, a method for refining language models. For small models, this significantly improved performance on the original GSM8K test but only slightly on Compositional GSM. Larger models didn't show this discrepancy, pointing to fundamental differences in how smaller models learn and generalize.
The study is not entirely up-to-date, as OpenAI's new reasoning-optimized o1 model was not tested. A recent planning benchmark showed that o1 is much better at planning, but still makes major mistakes.
A math professor recently showed that while o1 was able to complete a mathematical proof on which other LLMs had previously failed, a human solved the problem faster and more elegantly. Google's Gemini models are also said to perform better on math tasks after recent updates.
Beyond benchmarks: Testing true understanding
The researchers stress that current evaluation methods have masked these systematic differences, leading to potential overestimation of small model capabilities. They call for a re-evaluation of development strategies for low-cost AI systems, and question whether these models have inherent limitations in complex reasoning and generalization. This could have significant implications for their practical applications.
The results also cast doubt on recent claims about AI efficiency gains. Some argue that language models have lately become more efficient rather than more capable, and that scaling up these efficient models will yield significant performance improvements; this study suggests otherwise.
The authors emphasize that their goal wasn't to create another benchmark. Instead, they view their work as a case study providing deeper insights into the functioning and limitations of current AI systems. By chaining tasks, they test whether models can flexibly apply and combine learned knowledge - a crucial distinction between true understanding and superficial pattern matching.
The researchers hope their methodology can be applied to other domains and benchmarks, providing a more comprehensive picture of AI capabilities. This approach could reveal hidden weaknesses in AI systems that might otherwise go unnoticed in simpler, isolated tests.
The study adds to existing evidence of logical weaknesses in language models. Previous research has shown that LLMs struggle with basic logical inferences and simple planning puzzles, despite high scores on common logic and math benchmarks.