
Meta's ExploreToM framework shows that even the most sophisticated AI models, including GPT-4o, have trouble with complex social reasoning tasks. The findings challenge earlier optimistic claims about AI's ability to understand how humans think.


Even the most advanced AI models like GPT-4o and Llama have trouble understanding how other minds work, according to new research from Meta, the University of Washington, and Carnegie Mellon University. The study focuses on "theory of mind" - the ability to understand what others are thinking and believing.

Previous theory-of-mind tests were too basic and may have led to an overestimation of the models' capabilities, the researchers say. Models like GPT-4 achieved top scores on those earlier tests, repeatedly spurring claims that language models had developed a theory of mind (ToM). It is more likely, however, that the models picked up ToM-like narrative patterns from their training data - enough to pass simple ToM tests without genuinely modeling other minds.

A new way to test AI's theory of mind

To address this, the team created ExploreToM - the first framework for generating truly challenging theory-of-mind tests at scale. It uses a specialized search algorithm to create complex, novel scenarios that push language models to their limits.

Figure: ExploreToM's three-step story generation process, from initial context definition, through structural analysis using mental state trackers, to incremental elaboration of natural-sounding stories. | Sclar et al.
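Meta's exact search procedure isn't spelled out in this article, but the core idea can be sketched. In the toy Python below, everything - the character names, the action vocabulary, and the "belief divergence" score standing in for the paper's model-based difficulty estimate - is invented for illustration; the search greedily extends action sequences toward states where a character's belief no longer matches reality, the raw material of a false-belief question.

```python
# Rough illustration only - not Meta's code. ExploreToM is described as an
# A*-style search over story actions; here a simple "belief divergence"
# score stands in for the paper's model-based difficulty estimate.
import heapq
from itertools import count

CHARACTERS = ["Sally", "Anne"]
LOCATIONS = ["basket", "box"]
ACTIONS = ([("move", c, loc) for c in CHARACTERS for loc in LOCATIONS]
           + [("leave", c) for c in CHARACTERS]
           + [("enter", c) for c in CHARACTERS])

def apply_action(action, world, present, beliefs):
    """Apply one action; only characters in the room witness a move."""
    world = dict(world)
    present = set(present)
    beliefs = {c: dict(b) for c, b in beliefs.items()}
    if action[0] == "move":
        _, actor, loc = action
        if actor not in present:
            return None                      # absent characters can't act
        world["ball"] = loc
        for c in present:                    # witnesses update their belief
            beliefs[c]["ball"] = loc
    elif action[0] == "leave":
        present.discard(action[1])
    else:                                    # "enter"
        present.add(action[1])
    return world, present, beliefs

def divergence(world, beliefs):
    """Score a state by how many characters hold a stale belief."""
    return sum(b["ball"] != world["ball"] for b in beliefs.values())

def best_first_story(max_len=4):
    """Greedy best-first search for a short, belief-divergent story."""
    world = {"ball": "basket"}
    beliefs = {c: {"ball": "basket"} for c in CHARACTERS}
    tie = count()                            # unique tie-breaker so the heap
    frontier = [(0, next(tie), [], world, set(CHARACTERS), beliefs)]  # never compares dicts
    best = ([], 0)
    while frontier:
        neg, _, story, world, present, beliefs = heapq.heappop(frontier)
        if -neg > best[1]:
            best = (story, -neg)
        if len(story) >= max_len:
            continue
        for action in ACTIONS:
            nxt = apply_action(action, world, present, beliefs)
            if nxt is None:
                continue
            w, p, b = nxt
            heapq.heappush(frontier, (-divergence(w, b), next(tie),
                                      story + [action], w, p, b))
    return best

print(best_first_story())
# e.g. ([('leave', 'Sally'), ('move', 'Anne', 'box')], 1)
```

With only two characters and public moves, the best achievable divergence here is 1 - exactly the classic Sally-Anne false-belief setup; the real framework searches far larger story spaces.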

The results weren't great for the tested LLMs. When faced with these tougher tests, even top performers like GPT-4o got the answers right only 9% of the time. Other models like Mixtral and Llama performed even worse, sometimes getting every single question wrong. This is a far cry from their near-perfect scores on simpler tests.

The good news is that ExploreToM isn't just useful for testing - it can also help train AI models to do better. When the researchers used ExploreToM's data to fine-tune Llama-3.1 8B Instruct, its performance on standard theory-of-mind tests improved by 27 points.

The challenge of following simple stories

The researchers found something surprising: the tested models struggle even more with basic state tracking - keeping track of what's happening and who believes what throughout a story - than with theory of mind itself. This suggests that before we can build AIs that truly understand others' minds, we need to solve the more fundamental problem of helping them follow simple narratives.
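To make that distinction concrete, here is a minimal, self-contained illustration (all values hypothetical): the state-tracking question asks about ground truth, while the theory-of-mind question asks about a character's possibly stale belief.

```python
# Hypothetical end-of-story state: Sally left the room before Anne
# moved the ball, so her belief is stale while Anne's is current.
world = {"ball": "box"}                          # ground truth
beliefs = {
    "Sally": {"ball": "basket"},                 # didn't witness the move
    "Anne": {"ball": "box"},                     # witnessed the move
}

# State tracking: follow what actually happened in the story.
print("Where is the ball?", "->", world["ball"])

# Theory of mind: report the character's belief, not reality.
for who, belief in beliefs.items():
    print(f"Where will {who} look for the ball?", "->", belief["ball"])
```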

Interestingly, the researchers found that to specifically improve an AI's ability to understand other minds, training data needs to target theory of mind explicitly rather than state tracking alone. All the data from this research is available on Hugging Face for other researchers to use.
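A minimal sketch of pulling that data, assuming it is published under the dataset ID facebook/ExploreToM - check the Hugging Face hub for the actual path and field names:

```python
# Sketch only: the dataset ID, split name, and field names are
# assumptions - verify them on the Hugging Face hub before relying on this.
from datasets import load_dataset

ds = load_dataset("facebook/ExploreToM", split="train")
print(ds.column_names)   # inspect the released fields first
print(ds[0])             # one generated story/question example
```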

Summary
  • Meta has introduced ExploreToM, a framework designed to generate diverse and challenging data for evaluating the theory-of-mind (ToM) understanding of large language models (LLMs), as previous datasets are often too simplistic and may overestimate the models' capabilities.
  • Current top models, including Llama-3.1-70B, Mixtral 7x8B, and GPT-4o, struggle with the complex ToM scenarios generated by ExploreToM: GPT-4o reached at most 9% accuracy in the tests, while Mixtral and Llama sometimes dropped to 0%.
  • The study reveals that LLMs have difficulties with simple state-tracking, a crucial skill for ToM reasoning, and improving state-tracking could be a key step in equipping language models with better ToM capabilities.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.