A new study shows that even advanced AI language models like OpenAI's latest o1-preview fall short when it comes to complex planning. Researchers identified two key issues and explored potential improvements.
The study, conducted by scientists from Fudan University, Carnegie Mellon University, ByteDance, and Ohio State University, tested AI models on two planning benchmarks: BlocksWorld, a classic planning task, and TravelPlanner, a realistic travel planning scenario.
In the BlocksWorld benchmark, most models scored below 50% accuracy; only o1-mini (just under 60%) and o1-preview (nearly 100%) achieved good results. On the more complex TravelPlanner, however, results were disappointing for all models tested.
GPT-4o managed only a 7.8% final success rate, while o1-preview reached 15.6%. Other models such as GPT-4o-Mini, Llama3.1, and Qwen2 scored between 0 and 2.2%. While o1-preview showed improvement over GPT-4o, it still fell far short of human-level planning abilities.
Two main problems identified
The researchers pinpointed two key weaknesses in AI planning. First, the models only weakly incorporate rules and conditions, producing plans that violate the stated guidelines.
Second, they lose focus on the original problem as planning time increases. The team used a "Permutation Feature Importance" method to measure how different input components influenced the planning process.
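The paper's exact implementation isn't reproduced here, but the core idea of permutation feature importance is simple: shuffle one input component at a time across examples and measure how much a performance score drops. A minimal sketch in Python, where `evaluate` is a hypothetical stand-in for running the model on a batch of planning tasks and scoring the resulting plans:

```python
import random

def permutation_importance(examples, components, evaluate, n_repeats=5, seed=0):
    """Estimate how much each input component drives planning quality.

    examples:   list of dicts, e.g. {"rules": ..., "query": ..., "history": ...}
    components: keys to permute, e.g. ["rules", "query", "history"]
    evaluate:   callable(examples) -> score; hypothetical stand-in for
                running the model and scoring its plans
    """
    rng = random.Random(seed)
    baseline = evaluate(examples)
    importance = {}
    for comp in components:
        drops = []
        for _ in range(n_repeats):
            # Shuffle this component across examples, breaking its pairing
            # with the rest of each input while keeping its distribution.
            values = [ex[comp] for ex in examples]
            rng.shuffle(values)
            permuted = [{**ex, comp: v} for ex, v in zip(examples, values)]
            drops.append(baseline - evaluate(permuted))
        importance[comp] = sum(drops) / n_repeats
    return importance
```

A component whose permutation barely lowers the score has little influence on the plan, which is how a fading task signal over longer plans would show up.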
The study also tested two common strategies to enhance AI planning. Episodic memory updates provided the models with knowledge from previous planning attempts. This improved their general understanding of constraints but did not lead them to consider individual rules in more detail.
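The article doesn't detail the exact episodic setup, but one common form is to feed the constraint violations of earlier attempts back into the next prompt. A minimal sketch, where `generate_plan` and `check_constraints` are hypothetical stand-ins for the model call and a rule checker:

```python
def plan_with_episodic_memory(task, generate_plan, check_constraints,
                              max_attempts=3):
    """Retry planning, carrying violation feedback forward in the prompt.

    generate_plan(prompt) -> plan text              (hypothetical model call)
    check_constraints(task, plan) -> list of violation strings
    """
    memory = []  # episodic record of earlier attempts' failures
    for _ in range(max_attempts):
        prompt = task
        if memory:
            notes = "\n".join(memory)
            prompt += f"\n\nEarlier attempts violated these rules:\n{notes}"
        plan = generate_plan(prompt)
        violations = check_constraints(task, plan)
        if not violations:
            return plan
        memory.extend(violations)
    return plan  # best effort after max_attempts
```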
Parametric memory updates used fine-tuning to strengthen the task's influence on the planning process, but the core problem remained: that influence still faded as plans grew longer. Both approaches brought some improvement but failed to fully address the fundamental issues.
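A parametric update, by contrast, bakes the task signal into the model weights. A minimal supervised fine-tuning sketch using Hugging Face Transformers, with `gpt2` as a small stand-in model and toy BlocksWorld-style pairs (the paper's actual models and training setup may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy (task, gold plan) pairs; a real run would use the benchmark's
# training split rather than these hand-written examples.
pairs = [
    ("Goal: stack A on B. Initial: A on table, B on table.",
     "pickup A\nstack A B"),
    ("Goal: stack B on C. Initial: B on A, C on table.",
     "unstack B A\nstack B C"),
]

model_name = "gpt2"  # stand-in; the study fine-tunes larger open models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for task, plan in pairs:
        # Supervise the full task+plan sequence; the loss nudges the
        # weights (parametric memory) toward task-conditioned plans.
        batch = tok(task + "\n" + plan + tok.eos_token, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```

Fine-tuning like this can raise the task's weight in the model's behavior, but per the study it does not stop that influence from decaying over long plans.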
Code and data will soon be available on GitHub.