A new study finds that large language models such as GPT-4 perform much worse on counterfactual variations of tasks than on the standard versions. This suggests that the models often recall memorized solutions instead of genuinely reasoning.
In an extensive study, researchers from the Massachusetts Institute of Technology (MIT) and Boston University investigated the reasoning capabilities of leading language models, including GPT-4, GPT-3.5, Claude, and PaLM-2.
The researchers selected eleven tasks and created counterfactual variations of them, in which the basic rules or conditions were slightly altered from the standard versions.
For instance, the models had to perform additions in number systems other than the standard decimal system, evaluate chess moves with slightly altered starting positions of the pieces, or draw an object such as a soft drink upside down.
In standard decimal addition, GPT-4 achieved nearly perfect accuracy of over 95 percent. However, in the base 9 number system, its performance dropped below 20 percent. Similar patterns were observed in other tasks, such as programming, spatial reasoning, and logical reasoning.
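To make the counterfactual arithmetic condition concrete, here is a minimal Python sketch, not taken from the study, that computes what counts as a correct answer when two numbers are added in base 9:

```python
def add_in_base(a: str, b: str, base: int = 9) -> str:
    """Add two numbers given as strings in the chosen base and
    return the sum as a string in that same base."""
    total = int(a, base) + int(b, base)   # convert both operands to integers and add
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if total == 0:
        return "0"
    out = []
    while total > 0:
        out.append(digits[total % base])  # peel off the lowest digit in the target base
        total //= base
    return "".join(reversed(out))

# In base 9, "27" + "62" equals "100" (25 + 56 = 81 in decimal),
# not the "89" a model might recall from ordinary decimal addition.
print(add_in_base("27", "62"))  # -> 100
```

The digits look identical to a decimal problem, which is exactly why a model that relies on recall rather than reasoning tends to give the familiar decimal answer.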
Nevertheless, the researchers note that performance on the counterfactual tasks was usually above chance level, indicating some ability to generalize, so the models are probably not relying purely on rote memorization. However, the researchers cannot rule out that their counterfactual conditions were also present in the models' training data.
Regardless, the significant performance drop compared to standard tasks demonstrates that the models typically resort to non-transferable behaviors specific to standard conditions instead of using abstract, generalizable logical thinking.
The study also found that the models' performance on counterfactual tasks correlated with how common the respective conditions are. For example, GPT-4 achieved its best counterfactual performance on the guitar chord task with the relatively common alternative drop-D tuning. This points to a memory effect: the models perform better under conditions they have encountered more often.
The researchers also examined the impact of zero-shot chain-of-thought prompting, a technique in which the model is asked to reason step by step without being given worked examples. This improved performance in most cases but could not fully close the gap between the standard and counterfactual tasks.
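As a rough illustration of the technique, the sketch below contrasts a plain query with a zero-shot chain-of-thought variant; the wording is hypothetical and does not reproduce the study's prompts:

```python
# Hypothetical prompt wording for illustration only; the study's exact
# prompts are not reproduced in this article.
question = "You are working in base 9, where digits run from 0 to 8. What is 27 + 62?"

standard_prompt = question + "\nAnswer with the result only."
cot_prompt = question + "\nLet's think step by step, then give the final answer."

# Zero-shot chain-of-thought prompting simply appends an instruction to
# reason out loud; no worked examples (few-shot demonstrations) are included.
print(standard_prompt)
print(cot_prompt)
```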
The researchers argue that the success of existing language models in standard tasks should not be considered sufficient evidence of their general ability to solve the target task. They stress the importance of distinguishing between recalling memorized solutions and genuine reasoning.
Recent experiments and studies have demonstrated the limited reasoning abilities of large language models. The AI industry's ultimate goal is to combine generative AI with genuine reasoning capabilities, so that GenAI systems can apply the knowledge learned from training examples to new problems.
A study on the quality of ChatGPT's code generation found that GPT-3.5 could reliably solve coding problems from the practice website LeetCode that had been published before its training cutoff in 2021. On problems published after that cutoff, however, performance dropped significantly.