
A new study reveals that large language models such as GPT-4 perform much worse on counterfactual task variations compared to standard tasks. This suggests that the models often recall memorized solutions instead of truly reasoning.

In an extensive study, researchers from the Massachusetts Institute of Technology (MIT) and Boston University investigated the reasoning capabilities of leading language models, including GPT-4, GPT-3.5, Claude, and PaLM-2.

The researchers devised eleven counterfactual task variants in which the basic rules or conditions were slightly altered compared to the familiar standard versions of the tasks.

For instance, the models had to perform addition in number systems other than the standard decimal system, evaluate chess moves after minor changes to the pieces' starting positions, or depict a soft drink turned upside down.

In standard decimal addition, GPT-4 achieved nearly perfect accuracy of over 95 percent. However, in the base 9 number system, its performance dropped below 20 percent. Similar patterns were observed in other tasks, such as programming, spatial reasoning, and logical reasoning.
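
To make the counterfactual condition concrete: the same addition problem has a different correct answer in base 9 than in base 10. The short Python sketch below is only an illustration, not code from the study, and the helper name add_in_base is hypothetical:

```python
# Illustrative sketch (not from the study): the same digit strings
# yield different sums depending on which base is assumed.

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers given as digit strings in the specified base."""
    total = int(a, base) + int(b, base)   # interpret the digits in that base
    digits = []
    while total:
        digits.append(str(total % base))  # convert the sum back into that base
        total //= base
    return "".join(reversed(digits)) or "0"

print(add_in_base("75", "48", 10))  # "123" -- the familiar decimal answer
print(add_in_base("75", "48", 9))   # "134" -- the counterfactual base-9 answer
```

A model that merely recalls memorized decimal arithmetic will tend to answer "123" even when the prompt clearly states that base 9 applies.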

The performance of GPT-4 on the standard version of various tasks (blue) and the counterfactual counterparts (orange). The model performs significantly better on the familiar tasks, but is also often better than chance on the counterfactual tasks. | Image: Wu et al.

Nevertheless, the researchers note that performance on the counterfactual tasks was usually above chance level, indicating some ability to generalize. The models are therefore probably not just learning by rote. However, the researchers cannot rule out that their counterfactual conditions were included in the models' training data.

Regardless, the significant performance drop compared to standard tasks shows that the models typically fall back on behaviors tied to the standard conditions that do not transfer, rather than applying abstract, generalizable reasoning.

GPT-4o and DALL-E 3 are unable to turn a glass of lemonade upside down even when repeatedly asked to do so, as they have probably only seen pictures of lemonade glasses with the opening facing upwards during training. | Image: DALL-E 3 / ChatGPT prompted by THE DECODER

The study also discovered that the models' performance in counterfactual tasks correlated with the frequency of the respective conditions. For example, GPT-4 showed the best counterfactual performance in the guitar chord task for the relatively frequent alternative drop-D tuning. This suggests a memory effect where the models perform better in more frequent conditions.

The researchers also explored the impact of chain-of-thought prompting (without examples), a technique where the model is asked to reason in steps. This method improved performance in most cases but could not completely close the gap between the standard and counterfactual tasks.
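
As an illustration of the technique (the study's exact prompts may differ), zero-shot chain-of-thought prompting simply appends a reasoning trigger to the counterfactual instruction instead of providing worked examples:

```python
# Hypothetical zero-shot chain-of-thought prompt for the counterfactual
# base-9 addition task; the wording is illustrative, not the prompt
# actually used by Wu et al.
prompt = (
    "You are doing addition in base 9, where the digits run from 0 to 8.\n"
    "What is 75 + 48?\n"
    "Let's think step by step."  # the zero-shot CoT trigger, no examples given
)
```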

DALL-E 3 turns the glass slightly to the side after a CoT prompt, but still can't turn it upside down. | Image: DALL-E 3 / ChatGPT prompted by THE DECODER

The researchers argue that the success of existing language models in standard tasks should not be considered sufficient evidence of their general ability to solve the target task. They stress the importance of distinguishing between recalling memorized solutions and genuine reasoning.

Recent experiments and studies have repeatedly demonstrated the limited reasoning abilities of large language models. The AI industry's ultimate goal is to combine generative AI with genuine reasoning capabilities, so that GenAI systems can apply the knowledge learned from training examples to new examples.

A study on the quality of ChatGPT code generation showed that GPT-3.5 could reliably solve coding tasks from the practice website LeetCode that were published before its training cutoff in 2021. On tasks published after the training cutoff, however, performance dropped significantly.

Summary
  • MIT and Boston University researchers conducted a study showing that large language models like GPT-4 perform significantly worse on counterfactual task variants compared to standard tasks, suggesting they often rely on memorized solutions rather than reasoning.
  • The researchers created eleven "counterfactual" tasks with slightly altered rules or conditions compared to standard tasks. While GPT-4 achieved high accuracy on standard tasks, its performance dropped significantly on counterfactual tasks, though often remaining above chance level.
  • The study found that model performance on counterfactual tasks correlated with the frequency of the respective conditions, indicating a memory effect where models perform better under more common conditions.