A new study shows that large language models benefit from longer reasoning steps in chain-of-thought prompts — even when they contain incorrect information.

"Chain of thought" prompting has been shown to improve the reasoning abilities of large language models. Now, a study by researchers at Northwestern University, the University of Liverpool, the New Jersey Institute of Technology, and Rutgers University shows that the length of reasoning steps in CoT prompts is directly related to the performance of language models in complex problem-solving tasks.

The study shows that simply lengthening the reasoning steps within prompts, without adding new information, significantly improves the reasoning ability of language models. Conversely, shortening the reasoning steps, even if the core information is retained, leads to a significant deterioration in reasoning performance.
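
To make this concrete, here is a minimal sketch of what such a manipulation could look like: two few-shot exemplars for the same grade-school math problem, one compact and one with the reasoning stretched across more explicit steps. The prompts and the build_prompt helper are our own illustration, not the exact prompts used in the study.

```python
# Illustrative few-shot chain-of-thought exemplars (our own example,
# not taken from the paper). Both contain the same information; the
# second only spells out each inference as a separate step.

COMPACT_COT = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each.
How many tennis balls does he have now?
A: 5 + 2 * 3 = 11. The answer is 11."""

LENGTHENED_COT = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each.
How many tennis balls does he have now?
A: Restating the question: we need Roger's total number of balls.
He starts with 5 balls.
He buys 2 cans, and each can holds 3 balls, so that is 2 * 3 = 6 new balls.
Adding the new balls to the starting amount gives 5 + 6 = 11.
Check: 11 - 6 = 5, which matches the starting count.
The answer is 11."""

def build_prompt(exemplar: str, question: str) -> str:
    """Prepend one worked exemplar to the target question (hypothetical helper)."""
    return f"{exemplar}\n\nQ: {question}\nA:"
```

According to the study, the lengthened variant tends to yield better answers on complex problems than the compact one, even though no new information has been added.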

They tested a wide range of task types, including arithmetic, commonsense, and symbolic reasoning, drawing on datasets such as MultiArith, GSM8K, AQuA, SingleEq, SVAMP, and StrategyQA.

One of the most surprising results of the study: even incorrect rationales can improve performance, as long as they are long enough. The researchers conclude that the length of the reasoning steps has a greater influence than the factual correctness of the individual steps.
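
As a hedged illustration of this finding (again our own example, not a prompt from the paper), an exemplar can keep the long stepwise structure while getting an intermediate step wrong:

```python
# A long exemplar with a deliberate arithmetic slip: the third step
# states 2 * 3 = 5 instead of 6, so the final answer is also wrong.
# Per the study, such a rationale can still improve results if it is
# long enough, suggesting the model picks up the stepwise format
# more than the factual content of each step.
FLAWED_BUT_LONG_COT = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each.
How many tennis balls does he have now?
A: Restating the question: we need Roger's total number of balls.
He starts with 5 balls.
He buys 2 cans, and each can holds 3 balls, so that is 2 * 3 = 5 new balls.
Adding the new balls to the starting amount gives 5 + 5 = 10.
The answer is 10."""
```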

Even GPT-4 shows better performance with longer reasoning steps

However, longer reasoning steps do not always help; their benefit depends on the task. The study found that simple tasks gain little from additional steps, while more complex tasks improve significantly with longer reasoning chains. Larger models such as GPT-4 also tolerated shortened or lengthened steps better, while smaller models benefited most from the strategy in the tests. Chains that are too long, however, can degrade performance again, especially for smaller models.

Image: Jin et al.

The team now plans to continue its research by analyzing neural activation patterns for long versus short reasoning steps, to better understand how step length affects language models.

Summary
  • A study by multiple universities reveals that large language models, like GPT-4, show improved reasoning abilities when given longer chain-of-thought prompts, even if the information is incorrect.
  • The research found that the length of reasoning steps in prompts significantly impacts the performance of language models in complex problem-solving tasks, with more complex tasks benefiting the most.
  • The researchers plan to further analyze the neural activation patterns between long and short reasoning steps to better understand how the length affects language models' performance.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.