
A recent paper from Alphabet's AI company Google DeepMind shows that a simple tweak to prompts can significantly improve the accuracy of large language models. The technique taps into the human ability to abstract.

Step-back prompting asks the LLM a general question before the actual task. This allows the system to retrieve relevant background information and better categorize the actual question. The method is easy to implement with just one additional introductory question.


Original question:

Which school did Estella Leopold attend between August 1954 and November 1954?

Step-back question:

What was Estella Leopold's educational history?

Step-back answer:

B.S. in Botany, University of Wisconsin, Madison, 1948
M.S. in Botany, University of California, Berkeley, 1950
Ph.D. in Botany, Yale University, 1955

Final answer:

From 1951 to 1955, she was enrolled in the Ph.D. program in Botany at Yale, so Estella Leopold was most likely at Yale University between August 1954 and November 1954.
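The flow in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `llm` callable is a hypothetical stand-in for whatever model API you use, the prompt wording is an assumption, and letting the model write its own step-back question is just one possible way to obtain it.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to the model's text reply.
LLM = Callable[[str], str]


def step_back_answer(question: str, llm: LLM) -> Dict[str, str]:
    """Sketch of step-back prompting; the prompt wording is an assumption."""
    # Step 1: ask for a more general version of the question.
    step_back_question = llm(
        "Rewrite the following question as a more general question "
        "about the underlying topic.\n"
        f"Question: {question}\nGeneral question:"
    ).strip()

    # Step 2: answer the general question to collect background facts.
    background = llm(step_back_question).strip()

    # Step 3: answer the original question, grounded in that background.
    final_answer = llm(
        f"Background:\n{background}\n\n"
        f"Using the background above, answer this question: {question}"
    ).strip()

    return {
        "step_back_question": step_back_question,
        "background": background,
        "final_answer": final_answer,
    }
```

To try it, pass any function that takes a prompt and returns text, for example a thin wrapper around your model provider's client. The intermediate results are returned alongside the final answer so you can check whether the step-back question was a sensible abstraction.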

The DeepMind study tested step-back prompting on the PaLM-2L language model and compared it to the base model and GPT-4. The researchers were able to increase the accuracy of the language models by up to 36 percent compared to chain-of-thought (CoT) prompting.

Improvements across all tested domains

Step-back prompting was tested in the areas of science, general knowledge, and reasoning. The researchers observed the greatest improvements in more complex tasks requiring multiple steps of reasoning.


In physics and chemistry tasks, accuracy increased by 7 to 11 percentage points compared to the unmodified model. The adapted PaLM-2L even outperformed GPT-4 by a few percentage points. The step-back question used in this experiment was: "What physical or chemical principles and concepts are needed to solve this problem?"
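For science tasks like these, the step-back question is the same for every problem, so it can be kept as a fixed template. The sketch below shows one way to build the two prompts. Only the quoted principles question comes from the study as reported here; the function names and the remaining prompt wording are assumptions for illustration.

```python
# The step-back question quoted above; everything else in this sketch is illustrative.
PRINCIPLES_QUESTION = (
    "What physical or chemical principles and concepts "
    "are needed to solve this problem?"
)


def build_principles_prompt(problem: str) -> str:
    """First prompt: ask only for the relevant principles, not the solution."""
    return f"{PRINCIPLES_QUESTION}\n\nProblem: {problem}"


def build_solution_prompt(problem: str, principles: str) -> str:
    """Second prompt: solve the problem with the principles as context."""
    return (
        f"Problem: {problem}\n\n"
        f"Relevant principles:\n{principles}\n\n"
        "Solve the problem step by step using the principles above."
    )
```

The model's reply to the first prompt is passed as `principles` into the second, so the final reasoning starts from the stated principles instead of the raw problem text.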

Image: Zheng et al.

Most importantly, DeepMind's prompting method also performed significantly better than existing methods such as chain-of-thought and "take a deep breath" (TDB), which only marginally improved or even worsened accuracy.

PaLM-2L can achieve better performance with step-back prompting than GPT-4

The improvement was even more pronounced for knowledge questions with a temporal component from the TimeQA dataset. Here, the gain from a combination of step-back prompting and retrieval-augmented generation (RAG) was a whopping 27 percentage points over the base model, making it about 23 percentage points more accurate than GPT-4. Of course, step-back prompting can be used with GPT-4 as well; the comparison is just to show the performance gain.
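One plausible way to combine step-back prompting with RAG is to run retrieval for both the original question and the step-back question, then put the merged passages in front of the model. The sketch below assumes exactly that. The `keyword_retrieve` function is a toy stand-in for a real retriever, and all names and prompt wording are hypothetical.

```python
from typing import Callable, List


def keyword_retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(
    question: str,
    step_back_question: str,
    retrieve: Callable[[str], List[str]],
) -> str:
    """Retrieve for both questions, drop duplicates, and build the final prompt."""
    passages: List[str] = []
    for query in (question, step_back_question):
        for passage in retrieve(query):
            if passage not in passages:
                passages.append(passage)
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In practice `retrieve` would wrap a search index or vector store. The point of the second query is that the broader step-back question can surface background passages that the narrow original question would miss.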

Image: Zheng et al.

Even on particularly difficult knowledge questions, which were less likely to be answered correctly with RAG, the researchers found a significant gain in accuracy with step-back prompting. "This is where STEP-BACK PROMPTING really shines by retrieving facts regarding high-level concepts to ground the final reasoning," the paper states.

Despite the promising results, the error analysis showed that multilevel reasoning is still one of the most difficult skills for an LLM. The technique is also not always effective or helpful, for example, when the answer is common knowledge ("Who was president of the USA in 2000?") or when the question is already at a high level of abstraction ("What is the speed of light?").

  • In a recent paper, Google DeepMind shows that step-back prompting can improve the information retrieval accuracy of large language models by up to 36 percent by asking the LLM a general question about the topic before the actual task.
  • The technique allows the AI to retrieve relevant background information and better understand the actual question, with the greatest improvements observed in more complex tasks that require multiple steps of reasoning.
  • In the study, step-back prompting was tested on the PaLM-2L and GPT-4 language models, with a significant increase in accuracy for physics, chemistry, and knowledge questions with a temporal component.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.