DeepMind has found a simple way to make language models reason better

Logical reasoning is still a major challenge for language models. DeepMind has found a way to support reasoning tasks.

A study by Google's AI division DeepMind shows that the order of the premises in a task has a significant impact on the logical reasoning performance of language models.

They work best when the premises are presented in the same order as they appear in the logical conclusions. According to the researchers, this is also true for mathematical problems. The researchers make the systematically generated tests available in the R-GSM benchmark for further investigation.

In this work, we show that the premise order significantly affects LLMs’ performance on reasoning tasks, even when the premise order does not change the underlying task itself. Our comprehensive evaluation demonstrates that LLM tendencies resemble human preference w.r.t. premise order, i.e., LLMs achieve the best performance when the premise order follows the intermediate reasoning steps to solve the problem. Conversely, LLMs face difficulties when the reasoning problem requires the model to read the problem description back-and forth, resulting in a performance drop of over 30%.

From the paper

If A is B, then B is also A

A premise is a statement or assumption that serves as the basis for an argument or action. In their study, the researchers conducted a systematic investigation of the effects of premise ordering on various AI models.

Focusing on deductive reasoning, they tested the models using tasks that required only the logical inference "modus ponense," i.e., the derivation of other true statements from true statements.

To ensure fair tests across all language models, the researchers filled the rules with fantasy words such as "wozz", "riff" and "fum". | Image: Chen et al.

The modus ponens is a form of deductive reasoning in logic. If you have the two statements "If P, then Q" and "P is true", then you can infer that "Q is true".

This form of reasoning is relatively straightforward for humans, but has proven to be a major hurdle for language models. The researchers found that changing the order of the premises can reduce the accuracy of the models by more than 30 percent.

The tests were performed with GPT-3.5 Turbo, GPT-4 Turbo, PaLM 2-L, and Gemini Pro. OpenAI's GPT models performed better when the order of the premises was exactly reversed from the ground truth.

In general, poorer performance was also observed as the number of rules increased. Superfluous premises also confused the models.

Recommendation

AI research

DOOM on the toaster was fun, on AI it's groundbreaking

Interestingly, the comparisons also show that Google's newer Gemini Pro performs similarly to OpenAI's older GPT-3.5 Turbo in that accuracy decreases rapidly with a relatively small number of rules, even if they are in logical order.

Correct conclusions decrease with the number of rules … Image: Chen et al.

… and even more when confusing rules are deliberately inserted. Figure: Chen et al.

The researchers do not provide a theoretical explanation for the effect or possible solutions for improving the general reasoning abilities of language models based on their findings. Nevertheless, the results could help guide prompt experts who want to use LLMs for basic reasoning tasks.

Reasoning ability will have a significant impact on the future use of language models. Recently, we have seen some breakthroughs in LLM features such as larger context windows, the limits of which Google recently broke with Gemini 1.5 Pro.

But reasoning is like the holy grail of AI research, and a solid reasoning capability for LLMs would lead to more robust, versatile systems that could do more.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

We haven't seen much progress in this area since the release of OpenAI's GPT-4. The consensus seems to be that training on large amounts of text and visual data isn't enough to achieve fundamentally more capable AI systems, a point regularly made by leading researchers and business leaders, most recently Demis Hassabis, CEO of DeepMind, and Sam Altman, CEO of OpenAI.

DeepMind has found a simple way to make language models reason better

If A is B, then B is also A

DOOM on the toaster was fun, on AI it's groundbreaking

AI system StreamDiT generates livestream videos from text at 16 fps 512p

Researchers used 1,600 YouTube fail videos to show AI models struggle with surprises

AI coding can make developers slower even if they feel faster

OpenAI launches new ChatGPT agent that automates complex tasks for Pro, Plus, and Team

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

DeepMind has found a simple way to make language models reason better

If A is B, then B is also A

Share

Bank details