Even OpenAI's o1-preview fails at travel planning
Key Points
- A new study shows that even advanced AI language models such as OpenAI's o1-preview struggle with complex planning tasks. The researchers tested the models against two benchmarks: BlocksWorld and TravelPlanner.
- In BlocksWorld, o1-mini and o1-preview performed well, but in the more complex TravelPlanner, all models performed poorly. GPT-4o achieved a success rate of only 7.8% and o1-preview 15.6%.
- The researchers found two main problems: the models do not sufficiently account for predefined rules, and on longer itineraries they lose track of the original task. Improvement approaches such as episodic and parametric memory updates had only a limited effect.
A new study shows that even advanced AI language models like OpenAI's latest o1-preview fall short when it comes to complex planning. Researchers identified two key issues and explored potential improvements.
The study, conducted by scientists from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, tested AI models on two planning benchmarks: BlocksWorld, a classic planning task, and TravelPlanner, a realistic travel planning scenario.
In the BlocksWorld benchmark, most models scored below 50% accuracy; only o1-mini (just under 60%) and o1-preview (nearly 100%) achieved good results. In the more complex TravelPlanner, however, results were disappointing across all models tested.
GPT-4o managed only a 7.8% final success rate, while o1-preview reached 15.6%. Other models such as GPT-4o-Mini, Llama3.1, and Qwen2 scored between 0 and 2.2%. Although o1-preview improved on GPT-4o, it still fell far short of human-level planning ability.
Two main problems identified
The researchers pinpointed two key weaknesses in AI planning. First, the models fail to adequately incorporate rules and conditions, producing plans that violate the given guidelines.
Second, they lose focus on the original problem as plans grow longer. The team used a "Permutation Feature Importance" method to measure how strongly different input components influenced the planning process.
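The paper's exact measurement setup isn't detailed here, but the basic idea of permutation importance can be sketched: shuffle one input component at a time and measure how much the plan quality drops. In this minimal Python sketch, `score_plan` is a hypothetical callable that runs the planner on the assembled prompt and scores the result, and the component names are illustrative.

```python
import random

def permutation_importance(score_plan, prompt_parts, trials=10, seed=0):
    """Estimate each prompt component's influence on plan quality.

    score_plan   -- hypothetical callable: takes a dict of prompt parts,
                    runs the planner, and returns a quality score
    prompt_parts -- e.g. {"rules": ..., "query": ..., "history": ...}
    """
    rng = random.Random(seed)
    baseline = score_plan(prompt_parts)
    importance = {}
    for name, text in prompt_parts.items():
        drops = []
        for _ in range(trials):
            shuffled = dict(prompt_parts)
            tokens = text.split()
            rng.shuffle(tokens)  # scramble this component only
            shuffled[name] = " ".join(tokens)
            drops.append(baseline - score_plan(shuffled))
        # larger mean score drop = more influence on the final plan
        importance[name] = sum(drops) / trials
    return importance
```

A component whose shuffling barely moves the score has, by this measure, little influence on the output, which is how a fading grip on the original task would show up over longer plans.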
The study also tested two common strategies to enhance AI planning. Episodic memory updates provided knowledge from previous planning attempts. This improved understanding of constraints but did not lead to more detailed consideration of individual rules.
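As a rough illustration of the episodic idea (a sketch of the general approach, not the authors' code), feedback from similar past attempts can be retrieved and prepended to the new prompt. The naive word-overlap retriever and the memory format below are assumptions.

```python
def find_similar(task, memory, top_k):
    """Rank past attempts by naive word overlap with the current task."""
    words = set(task.lower().split())
    ranked = sorted(
        memory,
        key=lambda item: len(words & set(item[0].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(task, episodic_memory, top_k=3):
    """Prepend lessons from similar past attempts to a new planning task.

    episodic_memory -- assumed list of (past_task, feedback) pairs
    """
    lessons = "\n\n".join(
        f"Task: {t}\nFeedback: {fb}"
        for t, fb in find_similar(task, episodic_memory, top_k)
    )
    return (
        "Learn from these earlier attempts and their feedback:\n\n"
        f"{lessons}\n\nNow solve the new task:\n{task}"
    )
```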
Parametric memory updates used fine-tuning to increase task influence on planning, but the core problem of decreasing influence with longer plans remained. Both approaches showed some improvements but failed to fully address the fundamental issues.
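In practice, a parametric update amounts to supervised fine-tuning on planning data. Below is a generic sketch using Hugging Face's transformers and datasets libraries; the model name, toy training pair, and hyperparameters are placeholders, not the paper's recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Placeholder model and data -- assumptions, not the paper's setup.
model_name = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Supervision: (planning task, reference plan) pairs, here a toy example.
plan_pairs = [
    ("Plan a 3-day trip to Paris under $1,000.", "Day 1: ..."),
]
ds = Dataset.from_list(
    [{"text": f"{task}\n{plan}"} for task, plan in plan_pairs]
).map(
    lambda ex: tokenizer(ex["text"], truncation=True),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="planner-sft",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),
)
trainer.train()  # writes the planning data into the model's weights
```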
Code and data will soon be available on GitHub.