
A new study independently examines the planning abilities of OpenAI's latest AI model, o1. While the results show major improvements over traditional language models, significant limitations remain.


Researchers at Arizona State University tested o1's planning capabilities using the PlanBench benchmark. Their findings reveal that this "Large Reasoning Model" (LRM) makes substantial progress compared to conventional large language models (LLMs), but still falls short of solving tasks completely.

Developed in 2022, PlanBench evaluates AI systems' planning abilities. It includes 600 tasks from the "Blocksworld" domain, in which blocks must be rearranged from an initial layout into a specified goal configuration.
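To make the task format concrete, here is a minimal sketch in Python of the kind of problem the Blocksworld domain poses. PlanBench itself phrases tasks as text prompts; the state representation, block names, and the `move` helper below are illustrative assumptions, not benchmark code.

```python
# Illustrative only: a tiny Blocksworld-style instance.
# A state is a tuple of stacks; each stack lists blocks from bottom to top.
def move(state, block, dest):
    """Move `block` (currently on top of some stack) onto `dest`,
    which is either another block's name or the string "table"."""
    stacks = [list(s) for s in state]
    src = next(s for s in stacks if s and s[-1] == block)
    src.pop()
    if dest == "table":
        stacks.append([block])
    else:
        next(s for s in stacks if s and s[-1] == dest).append(block)
    return tuple(tuple(s) for s in stacks if s)

initial = (("A", "C"), ("B",))   # C sits on A; B lies on the table
goal = {("C", "B", "A")}         # target: a single stack with A on B on C
plan = [("C", "table"), ("B", "C"), ("A", "B")]  # a valid three-step plan

state = initial
for block, dest in plan:
    state = move(state, block, dest)
print(set(state) == goal)  # True: the plan reaches the goal configuration
```

The model's job in the benchmark is the reverse direction: given the initial and goal configurations, produce a plan like the one hard-coded above.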

On the standard Blocksworld tasks, o1 achieved 97.8% accuracy, vastly outperforming the previous best language model, LLaMA 3.1 405B, which solved only 62.6%. On a more challenging obfuscated version called "Mystery Blocksworld," o1 reached 52.8% accuracy, while conventional models failed almost entirely.
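The "Mystery" variant keeps the logical structure of each problem but renames the domain vocabulary, so a model cannot simply recall familiar Blocksworld text. A sketch of the idea follows; the renaming table is made up for illustration and differs from the benchmark's actual vocabulary.

```python
# Hypothetical renaming table: the obfuscated variant swaps domain terms
# for unrelated words while leaving the underlying problem unchanged.
OBFUSCATION = {
    "unstack": "feast",   # longer terms first so "unstack" is not caught by "stack"
    "stack": "overcome",
    "pick up": "attack",
    "put down": "succumb",
    "block": "object",
}

def obfuscate(prompt: str) -> str:
    """Rewrite a task description with the disguised vocabulary."""
    for original, disguised in OBFUSCATION.items():
        prompt = prompt.replace(original, disguised)
    return prompt

print(obfuscate("pick up the red block and stack it on the blue block"))
# -> "attack the red object and overcome it on the blue object"
```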

Image: Valmeekam, Stechly

The researchers also tested a new randomized variant to rule out the possibility that o1's performance stemmed from having benchmark data in its training set. Accuracy dropped to 37.3% on this test, but o1 still far exceeded older models, which scored near zero.

Performance drops significantly with more planning steps

Performance declined sharply as tasks grew more complex. On problems requiring 20 to 40 planning steps, o1's accuracy on the standard Blocksworld test fell from 97.8% to just 23.63%.

The model also struggled to identify unsolvable tasks, correctly recognizing them only 27% of the time. In 54% of cases, it incorrectly generated complete but impossible plans.

"Quantum improvement" - but not robust

While o1 shows a "quantum improvement" in benchmark performance, it offers no guarantees for solution correctness. Classic planning algorithms like Fast Downward achieve perfect accuracy with much shorter computation times.
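For contrast, here is a sketch of why classical methods come with guarantees: even a plain breadth-first search over Blocksworld states (using the same toy state encoding as the sketch above) either returns a plan that is valid by construction or proves that no plan exists. Production planners such as Fast Downward do this far more efficiently with heuristics, but the soundness and completeness argument is the same. The helper names below are illustrative assumptions, not any planner's actual API.

```python
# Illustrative exhaustive planner for the toy Blocksworld encoding above:
# sound (any returned plan is valid) and complete (returns None only if
# the goal is truly unreachable).
from collections import deque

def canon(stacks):
    """Canonical, hashable form of a set of stacks (drops empty stacks)."""
    return tuple(sorted(tuple(s) for s in stacks if s))

def successors(state):
    """Yield (action, next_state) for every legal single move."""
    stacks = [list(s) for s in state]
    for i, src in enumerate(stacks):
        block = src[-1]
        others = [s for j, s in enumerate(stacks) if j != i]
        remainder = [src[:-1]] if len(src) > 1 else []
        if len(src) > 1:  # put the block down on the table
            yield (block, "table"), canon(others + remainder + [[block]])
        for k, tgt in enumerate(others):  # put it on top of another stack
            rest = [s for j, s in enumerate(others) if j != k]
            yield (block, tgt[-1]), canon(rest + remainder + [tgt + [block]])

def bfs_plan(initial, goal):
    start, target = canon(map(list, initial)), canon(map(list, goal))
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == target:
            return plan                      # valid by construction
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None                              # provably unsolvable

print(bfs_plan((("A", "C"), ("B",)), (("C", "B", "A"),)))
# e.g. [('C', 'table'), ('B', 'C'), ('A', 'B')]
```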

The study also highlights o1's high resource consumption. Running the tests cost nearly $1,900, whereas classic algorithms can run on standard computers at virtually no cost.

Recommendation

The researchers stress that fair comparisons of AI systems must consider accuracy, efficiency, costs, and reliability. Their findings show that while AI models like o1 are progressing in complex reasoning tasks, these capabilities are not yet robust.

"Over time, LLMs have improved their performance on vanilla Blocksworld–with the best performing model, LlaMA 3.1 405B, reaching 62.5% accuracy. However, their dismal performance on the obfuscated ("Mystery") versions of the same domain betrays their essentially approximate retrieval nature. In contrast, the new o1 models, which we call LRMs (Large Reasoning Models)–in keeping with OpenAI’s own characterizations–not only nearly saturates the original small instance Blockworld test set, but shows the first bit of progress on obfuscated versions. Encouraged by this, we have also evaluated o1’s performance on longer problems and unsolvable instances, and found that these accuracy gains are not general or robust."

From the paper.

The code for PlanBench is available on GitHub.

Summary
  • Researchers at Arizona State University have evaluated the planning capabilities of OpenAI's new AI model o1 using the PlanBench benchmark. O1 showed significant progress compared to traditional large language models, but is still far from fully solving the tasks.
  • On simple Blocksworld tasks, o1 achieved 97.8 percent accuracy, compared to 62.6 percent for the best language model to date. On the more difficult "Mystery Blocksworld" version, it achieved 52.8 percent correct solutions, while conventional models failed almost completely. However, its performance dropped significantly on more complex tasks with more planning steps. In addition, o1 had difficulty recognizing unsolvable problems.
  • The researchers emphasize that while o1 represents progress, it does not guarantee the correctness of its solutions. Conventional planning algorithms, on the other hand, achieve perfect accuracy with shorter computing times and lower costs. For a fair comparison, efficiency, cost, and reliability must be considered in addition to accuracy.