A new study shows that even advanced AI language models like OpenAI's latest o1-preview fall short when it comes to complex planning. Researchers identified two key issues and explored potential improvements.

The study, conducted by scientists from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, tested AI models on two planning benchmarks: BlocksWorld, a classic planning task, and TravelPlanner, a realistic travel planning scenario.

In the BlocksWorld benchmark, most models scored below 50% accuracy; only o1-mini (just under 60%) and o1-preview (nearly 100%) achieved good results. On the more complex TravelPlanner, however, results were disappointing across all models tested.

GPT-4o managed only a 7.8% final success rate, while o1-preview reached 15.6%. Other models such as GPT-4o-Mini, Llama3.1, and Qwen2 scored between 0 and 2.2%. While o1-preview improved on GPT-4o, it still fell far short of human-level planning ability.

Two main problems identified

The researchers pinpointed two key weaknesses in AI planning. First, the models incorporate rules and conditions only weakly, producing plans that violate the given constraints.

Second, the models lose focus on the original problem as plans grow longer. The team used a "Permutation Feature Importance" method to measure how strongly different input components influenced the planning process.
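
To give a sense of how permutation feature importance can be applied to prompt components rather than tabular features, here is a minimal sketch: shuffle one input component (for example, the constraint list) across examples and measure how much plan quality drops. The helpers `generate_plan` and `score_plan`, as well as the prompt layout, are hypothetical placeholders, not the study's code.

```python
import random

def permutation_importance(examples, component, generate_plan, score_plan, n_repeats=5):
    """Estimate how strongly one prompt component (e.g. "constraints" or "query")
    influences plan quality: shuffle that component across examples and measure
    the average drop in score relative to the unshuffled baseline."""
    def build_prompt(ex):
        # Assumed prompt layout: constraints, reference info, then the query.
        return "\n\n".join(ex[key] for key in ("constraints", "reference_info", "query"))

    baseline = sum(score_plan(generate_plan(build_prompt(ex)), ex)
                   for ex in examples) / len(examples)

    drops = []
    for _ in range(n_repeats):
        shuffled = [ex[component] for ex in examples]
        random.shuffle(shuffled)
        permuted_score = 0.0
        for ex, swapped_value in zip(examples, shuffled):
            permuted = dict(ex)
            permuted[component] = swapped_value  # break the link between this component and the task
            permuted_score += score_plan(generate_plan(build_prompt(permuted)), ex)
        drops.append(baseline - permuted_score / len(examples))

    return sum(drops) / n_repeats  # larger average drop = more influential component
```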

The study also tested two common strategies to enhance AI planning. Episodic memory updates provided knowledge from previous planning attempts. This improved understanding of constraints but did not lead to more detailed consideration of individual rules.
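
As an illustration of what an episodic memory update could look like in a simple in-context setup (the data fields and helper below are assumptions, not the study's implementation), earlier attempts and the constraints they violated are prepended to the next prompt:

```python
from typing import Dict, List

def prompt_with_episodic_memory(task: str, past_attempts: List[Dict]) -> str:
    """Prepend lessons from earlier planning attempts (plan summary plus the
    constraints it violated) so the model can avoid repeating the same mistakes."""
    memory_lines = []
    for i, attempt in enumerate(past_attempts, start=1):
        violations = "; ".join(attempt["violations"]) or "none"
        memory_lines.append(
            f"Attempt {i}: {attempt['plan_summary']} | violated constraints: {violations}"
        )
    return (
        "Previous attempts and their rule violations:\n"
        + "\n".join(memory_lines)
        + "\n\nNew task (avoid the violations listed above):\n"
        + task
    )
```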

Parametric memory updates used fine-tuning to increase task influence on planning, but the core problem of decreasing influence with longer plans remained. Both approaches showed some improvements but failed to fully address the fundamental issues.
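
Parametric memory updates in this sense amount to supervised fine-tuning on pairs of planning tasks and reference plans. Below is a heavily simplified sketch using Hugging Face Transformers; the model name, data format, and hyperparameters are placeholders, not the study's setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2-0.5B"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each training example pairs a planning task with a reference plan.
examples = [
    {"text": "Task: Stack block A on block B.\nPlan: 1. pick up A 2. stack A on B"},
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="planner-sft",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # fine-tune so the task description exerts more influence on the generated plan
```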

Code and data will soon be available on GitHub.

Summary
  • A new study shows that even advanced AI language models such as OpenAI's o1-preview struggle with complex planning tasks. The researchers tested the models against two benchmarks: BlocksWorld and TravelPlanner.
  • In BlocksWorld, o1-mini and o1-preview performed well, but in the more complex TravelPlanner, all models performed poorly. GPT-4o achieved a success rate of only 7.8% and o1-preview 15.6%.
  • The researchers identified two main problems: the models do not sufficiently account for pre-defined rules, and as plans grow longer they lose track of the original task. Improvement approaches such as episodic and parametric memory updates had only a limited effect.