summary Summary

Logical reasoning is still a major challenge for language models. DeepMind has found a way to support reasoning tasks.

A study by Google's AI division DeepMind shows that the order of the premises in a task has a significant impact on the logical reasoning performance of language models.

They work best when the premises are presented in the same order as they appear in the logical conclusions. According to the researchers, this is also true for mathematical problems. The researchers make the systematically generated tests available in the R-GSM benchmark for further investigation.

In this work, we show that the premise order significantly affects LLMs’ performance on reasoning tasks, even when the premise order does not change the underlying task itself. Our comprehensive evaluation demonstrates that LLM tendencies resemble human preference w.r.t. premise order, i.e., LLMs achieve the best performance when the premise order follows the intermediate reasoning steps to solve the problem. Conversely, LLMs face difficulties when the reasoning problem requires the model to read the problem description back-and forth, resulting in a performance drop of over 30%.

From the paper

If A is B, then B is also A

A premise is a statement or assumption that serves as the basis for an argument or action. In their study, the researchers conducted a systematic investigation of the effects of premise ordering on various AI models.


Focusing on deductive reasoning, they tested the models using tasks that required only the logical inference "modus ponense," i.e., the derivation of other true statements from true statements.

To ensure fair tests across all language models, the researchers filled the rules with fantasy words such as "wozz", "riff" and "fum". | Image: Chen et al.

The modus ponens is a form of deductive reasoning in logic. If you have the two statements "If P, then Q" and "P is true", then you can infer that "Q is true".

This form of reasoning is relatively straightforward for humans, but has proven to be a major hurdle for language models. The researchers found that changing the order of the premises can reduce the accuracy of the models by more than 30 percent.

The tests were performed with GPT-3.5 Turbo, GPT-4 Turbo, PaLM 2-L, and Gemini Pro. OpenAI's GPT models performed better when the order of the premises was exactly reversed from the ground truth.

In general, poorer performance was also observed as the number of rules increased. Superfluous premises also confused the models.


Interestingly, the comparisons also show that Google's newer Gemini Pro performs similarly to OpenAI's older GPT-3.5 Turbo in that accuracy decreases rapidly with a relatively small number of rules, even if they are in logical order.

Correct conclusions decrease with the number of rules … Image: Chen et al.
… and even more when confusing rules are deliberately inserted. Figure: Chen et al.

The researchers do not provide a theoretical explanation for the effect or possible solutions for improving the general reasoning abilities of language models based on their findings. Nevertheless, the results could help guide prompt experts who want to use LLMs for basic reasoning tasks.

Reasoning ability will have a significant impact on the future use of language models. Recently, we have seen some breakthroughs in LLM features such as larger context windows, the limits of which Google recently broke with Gemini 1.5 Pro.

But reasoning is like the holy grail of AI research, and a solid reasoning capability for LLMs would lead to more robust, versatile systems that could do more.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

We haven't seen much progress in this area since the release of OpenAI's GPT-4. The consensus seems to be that training on large amounts of text and visual data isn't enough to achieve fundamentally more capable AI systems, a point regularly made by leading researchers and business leaders, most recently Demis Hassabis, CEO of DeepMind, and Sam Altman, CEO of OpenAI.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Researchers at Google DeepMind have found that the order of premises - the statements or assumptions underlying an argument - has a significant impact on AI performance in logical reasoning.
  • The AI models performed best when the premises were presented in the same order as they appeared in the logical conclusions.
  • The researchers don't offer a theoretical explanation for this observation. However, the LLMs' preferences in task description are similar to human preferences in that the order of premises follows the intermediate steps of reasoning to solve the problem.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.
Co-author: Matthias Bastian
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.