Summary

Researchers have developed an AI system called Agent S that learns to perform routine computer tasks independently by observing human actions. This technology could pave the way for a new generation of digital assistants.


Many office workers spend valuable time on simple computer tasks like data entry, scheduling, and document creation. A research team has now introduced an AI system named Agent S that aims to solve such tasks using an approach similar to humans. The paper "Agent S: An Open Agentic Framework That Uses Computers Like a Human" describes the system's foundations.

 

Agent S combines the capabilities of modern language models with a special interface that takes control of mouse, keyboard, and screen. The researchers modeled the system's computer interaction after human behavior: It clicks buttons, types text, and navigates menus and folders.


This opens up extensive automation possibilities for individual users and businesses, regardless of the specific software being used. The technology could also create new opportunities for people with disabilities.

Learning ability is key

Earlier approaches have pursued similar goals; Microsoft, for example, demonstrated its experimental UFO framework earlier this year. Agent S's particular strength lies in its ability to learn. The system draws on information from the internet, such as instructions for specific computer programs, which allows it to adapt flexibly to constantly changing applications.

Additionally, Agent S stores its own experiences from previous tasks in a type of memory. When faced with a new task, the system searches this knowledge base for similar cases and breaks down the problem into manageable subtasks.
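The lookup of similar past cases might be sketched roughly as follows. This is a minimal illustration and not Agent S's actual implementation: the `Experience` and `ExperienceMemory` classes are hypothetical, the stored subtasks are invented for the example, and simple word overlap stands in for whatever similarity measure the real system uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Experience:
    task: str            # natural-language description of a solved task
    subtasks: list[str]  # the steps that solved it (invented for this example)


@dataclass
class ExperienceMemory:
    """Hypothetical store of past tasks; not Agent S's real data structure."""
    entries: list[Experience] = field(default_factory=list)

    def add(self, task: str, subtasks: list[str]) -> None:
        self.entries.append(Experience(task, subtasks))

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard overlap of lowercase word sets, an assumed stand-in
        # for a real similarity measure.
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        if not words_a or not words_b:
            return 0.0
        return len(words_a & words_b) / len(words_a | words_b)

    def most_similar(self, task: str) -> Optional[Experience]:
        # Return the stored experience with the highest overlap, or None
        # if the memory is empty or nothing overlaps at all.
        if not self.entries:
            return None
        best = max(self.entries, key=lambda e: self._similarity(task, e.task))
        return best if self._similarity(task, best.task) > 0 else None


memory = ExperienceMemory()
memory.add("remove an email account in Thunderbird",
           ["open account settings", "select the account",
            "choose remove account", "confirm"])
memory.add("disable autosave in VS Code",
           ["open settings", "search for auto save", "set auto save to off"])

# A new task reuses the subtasks of the closest stored case as a starting plan.
match = memory.most_similar("remove the old account in Thunderbird")
plan = match.subtasks if match else []
```

In this sketch the retrieved subtasks only seed the plan; a real agent would still have to adapt them to the task at hand.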

Complex flowchart: AI agent architecture for desktop tasks, with modules for planning, knowledge, and action generation.
These diagrams illustrate the complex workings of an AI agent called Agent S, which solves computer tasks in a human-like way. The system uses various components such as memory modules, planning algorithms, and a computer interface. | Image: Simular

During execution, it continuously monitors progress and optimizes its approach. Once the task is complete, the new experiences flow back into the knowledge store, so the AI expands its capabilities with each task it solves.

A specially developed "agent-computer interface" forms the bridge between the AI system and the computer. It translates between the two worlds and ensures safe and reliable execution of commands. To do this, it evaluates visual information to detect changes on the screen.


It also creates a kind of digital twin of all controls and their arrangement. Instead of working with absolute mouse coordinates, Agent S uses instructions like "Click on button No. 42". According to the paper, this approach makes control more robust and reduces susceptibility to errors.
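The idea of addressing controls by number rather than by pixel position can be illustrated with a small sketch. The `UIElement` and `ScreenModel` names, the fields, and the coordinates are all assumptions made for this example and are not taken from the Agent S code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    element_id: int
    label: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Click target: the middle of the element's bounding box.
        return (self.x + self.width // 2, self.y + self.height // 2)


class ScreenModel:
    """Hypothetical snapshot of the on-screen controls, indexed by ID."""

    def __init__(self, elements: list[UIElement]) -> None:
        self._by_id = {e.element_id: e for e in elements}

    def resolve_click(self, element_id: int) -> tuple[int, int]:
        # Translate "click element N" into coordinates at execution time.
        if element_id not in self._by_id:
            raise KeyError(f"no element with id {element_id} on screen")
        return self._by_id[element_id].center()


# The same instruction still hits the button after the layout has changed.
before = ScreenModel([UIElement(42, "Save", x=100, y=200, width=80, height=30)])
after = ScreenModel([UIElement(42, "Save", x=300, y=500, width=80, height=30)])

click_before = before.resolve_click(42)  # (140, 215)
click_after = after.resolve_click(42)    # (340, 515)
```

Because the coordinates are looked up only when the action is executed, an instruction such as "click element 42" keeps working when a window moves, and a missing element raises an error instead of producing a click in the wrong place.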

Few differences between Claude 3.5 and GPT-4o

In initial practical tests, the developers compared Agent S's performance on typical computer tasks. In a benchmark of Linux tasks, Agent S raised the success rate by almost 90 percent in relative terms compared with a language model alone, yet it still solves only around 20 percent of the tasks.

The system is also slow: In demo videos, Agent S takes about six minutes to remove an account in the Thunderbird email client and a good three minutes to deactivate the autosave function in VS Code.

Agent S can be linked to various language models via API. Depending on the task area, either Claude 3.5 or GPT-4o came out ahead when combined with the framework, but the overall difference is marginal at 0.1 percentage points. A language model optimized for such use cases could deliver even better performance.

Table: Success rates of various AI methods on the OSWorld test set; Agent S shows the highest overall performance at 20.58%.
The table shows a clear jump in performance when Agent S is used with different language models. In the categories "Daily" and "Profess." in particular, Agent S significantly outperforms the individual models. | Image: Simular

In a test environment under Windows, the system also achieved good results without special adaptation. The basic principles thus seem to be viable across operating system boundaries.

The researchers see room for improvement. A detailed error analysis attributed about 40 percent of the observed problems to weaknesses in task planning or the assignment of control commands to screen elements. The developers also want to further optimize processing speed.

More generally, researchers are pursuing various approaches to operating user interfaces through natural language input. The AI startup Rabbit, for example, made the same promise but has not yet fulfilled it, despite releasing its Large Action Model Playground. It will likely take some time before such systems are mature enough to save more time than they cost in everyday use.

The Python code for Agent S is freely available on GitHub.

Summary
  • Researchers have developed an AI system called Agent S that learns to perform routine computer tasks independently by observing human actions. The system controls the mouse, keyboard, and screen in a manner similar to a human user.
  • Agent S combines modern language models with a specialized interface. It draws on information from the internet and its own experiences to adapt flexibly to changing applications and break tasks down into smaller steps.
  • In initial tests, Agent S achieved a success rate of around 20 percent. The language models Claude 3.5 and GPT-4o performed similarly well when used with the framework. However, the researchers see room for improvement in task planning, mapping of control commands, and processing speed.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.