A new study shows that even advanced AI language models like OpenAI's latest o1-preview fall short when it comes to complex planning. Researchers identified two key issues and explored potential improvements.
The study, conducted by scientists from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, tested AI models on two planning benchmarks: BlocksWorld, a classic planning task, and TravelPlanner, a realistic travel planning scenario.
In the BlocksWorld benchmark, most models scored below 50% accuracy; only o1-mini (just under 60%) and o1-preview (nearly 100%) achieved good results. On the more complex TravelPlanner, however, results were disappointing for all models tested.
GPT-4o managed only a 7.8% final success rate, while o1-preview reached 15.6%. Other models such as GPT-4o-Mini, Llama3.1, and Qwen2 scored between 0 and 2.2%. While o1-preview showed improvement over GPT-4o, it still fell far short of human-level planning abilities.
Two main problems identified
The researchers pinpointed two key weaknesses in AI planning. First, the models only weakly incorporate rules and conditions, producing plans that violate the stated guidelines.
Second, they lose focus on the original problem as planning time increases. The team used a "Permutation Feature Importance" method to measure how different input components influenced the planning process.
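The paper's exact implementation isn't reproduced here, but the core idea of permutation feature importance is simple: shuffle one input component at a time across examples and measure how much a performance score drops. A minimal sketch in Python, where `evaluate` is a hypothetical stand-in for running the model on a batch of planning tasks and scoring the resulting plans:

```python
import random

def permutation_importance(examples, components, evaluate, n_repeats=5, seed=0):
    """Estimate how much each input component drives planning quality.

    examples:   list of dicts, e.g. {"rules": ..., "query": ..., "history": ...}
    components: keys to permute, e.g. ["rules", "query", "history"]
    evaluate:   callable(examples) -> score; hypothetical stand-in for
                running the model and scoring its plans
    """
    rng = random.Random(seed)
    baseline = evaluate(examples)
    importance = {}
    for comp in components:
        drops = []
        for _ in range(n_repeats):
            # Shuffle this component across examples, breaking its pairing
            # with the rest of each input while keeping its distribution.
            values = [ex[comp] for ex in examples]
            rng.shuffle(values)
            permuted = [{**ex, comp: v} for ex, v in zip(examples, values)]
            drops.append(baseline - evaluate(permuted))
        importance[comp] = sum(drops) / n_repeats
    return importance
```

A component whose permutation barely lowers the score has little influence on the plan, which is how a fading task signal over longer plans would show up.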
The study also tested two common strategies to enhance AI planning. Episodic memory updates provided the models with knowledge from previous planning attempts. This improved their general understanding of constraints but did not lead them to consider individual rules in more detail.
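The article doesn't detail the exact episodic setup, but one common form is to feed the constraint violations of earlier attempts back into the next prompt. A minimal sketch, where `generate_plan` and `check_constraints` are hypothetical stand-ins for the model call and a rule checker:

```python
def plan_with_episodic_memory(task, generate_plan, check_constraints,
                              max_attempts=3):
    """Retry planning, carrying violation feedback forward in the prompt.

    generate_plan(prompt) -> plan text              (hypothetical model call)
    check_constraints(task, plan) -> list of violation strings
    """
    memory = []  # episodic record of earlier attempts' failures
    for _ in range(max_attempts):
        prompt = task
        if memory:
            notes = "\n".join(memory)
            prompt += f"\n\nEarlier attempts violated these rules:\n{notes}"
        plan = generate_plan(prompt)
        violations = check_constraints(task, plan)
        if not violations:
            return plan
        memory.extend(violations)
    return plan  # best effort after max_attempts
```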
Parametric memory updates used fine-tuning to strengthen the task's influence on the planning process, but the core problem remained: that influence still faded as plans grew longer. Both approaches brought some improvement but failed to fully address the fundamental issues.
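A parametric update, by contrast, bakes the task signal into the model weights. A minimal supervised fine-tuning sketch using Hugging Face Transformers, with `gpt2` as a small stand-in model and toy BlocksWorld-style pairs (the paper's actual models and training setup may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy (task, gold plan) pairs; a real run would use the benchmark's
# training split rather than these hand-written examples.
pairs = [
    ("Goal: stack A on B. Initial: A on table, B on table.",
     "pickup A\nstack A B"),
    ("Goal: stack B on C. Initial: B on A, C on table.",
     "unstack B A\nstack B C"),
]

model_name = "gpt2"  # stand-in; the study fine-tunes larger open models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for task, plan in pairs:
        # Supervise the full task+plan sequence; the loss nudges the
        # weights (parametric memory) toward task-conditioned plans.
        batch = tok(task + "\n" + plan + tok.eos_token, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```

Fine-tuning like this can raise the task's weight in the model's behavior, but per the study it does not stop that influence from decaying over long plans.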
Code and data will soon be available on GitHub.