
Researchers at South Korea's Yonsei University have built an AI system that simulates the outcome of actions before executing them on websites. The approach shows better results than previous methods for helping AI navigate the web.


The team first tested how well large language models could predict the outcomes of actions on web pages. Even the newest AI models only got it right about 54 percent of the time. GPT-4 Turbo scored slightly higher than GPT-4o, while Claude 3.5 Sonnet performed "as badly as random guessing."

"These suggest that the world model, the ability to foresee the potential outcomes of actions taken, is absent in LLMs," the researchers explained.

Testing before doing

Instead of trial and error, the new system simulates possible actions first. The researchers developed what they call "transition-focused observation abstraction" to track important changes on websites.
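The idea of simulating before acting can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: `choose_action`, `toy_world_model`, and `toy_value` are hypothetical names, and the lookup table stands in for a trained world model.

```python
from typing import Callable, Sequence


def choose_action(
    state: str,
    candidates: Sequence[str],
    predict: Callable[[str, str], str],
    score: Callable[[str], float],
) -> str:
    """Return the candidate whose simulated outcome scores highest."""
    # Nothing is executed on the real website here: each candidate
    # action is only passed through the predictive model.
    return max(candidates, key=lambda action: score(predict(state, action)))


def toy_world_model(state: str, action: str) -> str:
    """Hypothetical stand-in for a learned world model."""
    outcomes = {
        "click 'Add to cart'": "cart now holds 1 item",
        "click 'Home'": "home page is shown",
    }
    return outcomes.get(action, "nothing changes")


def toy_value(predicted: str, goal_keyword: str = "cart") -> float:
    """Hypothetical scorer: 1.0 if the predicted outcome mentions the goal."""
    return 1.0 if goal_keyword in predicted else 0.0


best = choose_action(
    "product page",
    ["click 'Home'", "click 'Add to cart'", "scroll down"],
    toy_world_model,
    toy_value,
)
print(best)
```

Only the chosen action would then be sent to the browser, which is what distinguishes this from trial and error on the live site.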

Technical diagram: two-stage AI process with world model training and policy optimization, Chrome browser as environment, flowchart representation.
The system combines two key components: a world model for training and a policy system for making decisions. It learns by observing how websites respond to actions and uses that data to predict outcomes and choose the best path forward. | Image: Chae et al.

The process works in three steps: First, it gathers data about how AI interacts with websites. Using GPT-4o-mini to generate prompts, the team gathered 14,000 training examples.

Second, it tracks changes between actions using the Hungarian algorithm to identify updates, deletions, and additions on web pages.
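The matching step is an assignment problem: pair each element of the old page with the element of the new page it most resembles. The sketch below is a simplified illustration under several assumptions. Page elements are plain strings, the cost is a text-similarity measure from Python's `difflib`, and a brute-force search over permutations stands in for the Hungarian algorithm, which finds the same optimal pairing far more efficiently. The function names and the `max_cost` threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def cost(a: str, b: str) -> float:
    """Dissimilarity between two element descriptions (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def diff_elements(old, new, max_cost=0.5):
    """Classify page changes as updates, deletions, and additions."""
    n = max(len(old), len(new))
    # Pad the shorter list with None so every element gets a partner.
    old_p = list(old) + [None] * (n - len(old))
    new_p = list(new) + [None] * (n - len(new))

    def pair_cost(a, b):
        # Pairing with a placeholder costs as much as the worst allowed match.
        return max_cost if a is None or b is None else cost(a, b)

    # Brute force over all pairings; only practical for tiny inputs.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(
            pair_cost(old_p[i], new_p[j]) for i, j in enumerate(perm)
        ),
    )

    updated, deleted, added = [], [], []
    for i, j in enumerate(best):
        a, b = old_p[i], new_p[j]
        if a is None:
            added.append(b)
        elif b is None:
            deleted.append(a)
        elif pair_cost(a, b) > max_cost:
            # Too different to be the same element.
            deleted.append(a)
            added.append(b)
        elif a != b:
            updated.append((a, b))
    return updated, deleted, added


old_page = ["button Add to cart", "text Cart: 0 items", "link Home"]
new_page = ["link Home", "text Cart: 1 items", "button Add to cart", "link Checkout"]
updated, deleted, added = diff_elements(old_page, new_page)
print(updated, deleted, added)
```

For realistic page sizes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation search.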

Third, it translates technical changes into simple language, reducing data from about 4,000 tokens to a much smaller amount. This cuts computing costs and increases efficiency.
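As a rough illustration of this last step, a list of detected changes can be rendered as one short sentence per change instead of a full page dump. The function name, input format, and wording are assumptions for the example. According to the article, the actual system produces natural language descriptions, which this template-based sketch only approximates.

```python
def describe_changes(updated, deleted, added):
    """Turn structured page changes into a short plain-language summary."""
    lines = []
    for before, after in updated:
        lines.append(f'"{before}" changed to "{after}".')
    for item in deleted:
        lines.append(f'"{item}" was removed.')
    for item in added:
        lines.append(f'"{item}" appeared.')
    return " ".join(lines) if lines else "No visible change."


summary = describe_changes(
    updated=[("text Cart: 0 items", "text Cart: 1 items")],
    deleted=[],
    added=["link Checkout"],
)
print(summary)
```

A summary like this stays short no matter how large the underlying page is, which is where the savings in tokens come from.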

Technical Diagram: Transition-based observation abstraction with element matching, state transitions and prediction components for navigation systems.
The system uses the Hungarian algorithm to monitor how elements on a page change and turn those changes into useful predictions about what will happen next. | Image: Chae et al.

In WebArena tests, which cover common tasks such as online shopping and using Reddit, the system achieved a 16.6 percent success rate, up from the 12.8 percent baseline.

Results varied significantly by category. GitLab page navigation improved by 181 percent, while map services showed a 92 percent gain. Online shopping saw the smallest improvement at 3 percent.
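These figures mix absolute success rates with improvement percentages, so a quick calculation helps put them side by side. The sketch below assumes the category figures are relative gains over the baseline, which the article does not state explicitly. `relative_gain` is a helper name chosen for this example.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Improvement of `new` over `baseline`, as a percentage of the baseline."""
    return (new - baseline) / baseline * 100


# Overall WebArena success rates quoted in the article.
overall_points = 16.6 - 12.8
overall_relative = relative_gain(12.8, 16.6)
print(f"{overall_points:.1f} points, {overall_relative:.1f} percent relative")
```

On that reading, the overall result is a gain of 3.8 percentage points, or roughly 30 percent relative to the baseline.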


When tested on Mind2Web's collection of 2,000 tasks across 137 websites, the system achieved a new record with 25.4 percent of tasks completed successfully.

Two tables: Comparison of the success rates of various LLM agents in WebArena, with and without policy optimization, plus domain-specific performance analysis.
Testing showed major improvements across different types of web tasks. The system was particularly effective at handling GitLab-related work, where it performed 181 percent better than previous methods. | Image: Chae et al.

Looking ahead

The researchers acknowledge that work remains, especially in processing visual information and planning multiple steps. They plan to focus on these areas in future research.

Web navigation could become a key part of how agent-based AI systems, which some see as the next big step for AI, work with the Internet. Anthropic, with "Claude Computer Use," and Google, with "Project Jarvis," are both developing similar capabilities to help AI navigate the web more effectively.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Researchers at Yonsei University have developed a new AI-based web navigation method that uses "world models" to simulate action sequences and determine the best action strategies.
  • The system gathers interaction data, analyzes website state changes using the Hungarian algorithm, and translates those changes into natural language descriptions. The trained world model can then predict the outcomes of candidate actions so the system can choose the best one.
  • In the WebArena benchmark, the system achieved a 16.6% success rate, with performance varying by application area. In the Mind2Web benchmark, it set a record of 25.4% in task completion accuracy, but still struggles with visual information processing and multistep planning.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.