Researchers have developed an AI system called Agent S that learns to perform routine computer tasks independently by observing human actions. This technology could pave the way for a new generation of digital assistants.
Many office workers spend valuable time on simple computer tasks like data entry, scheduling, and document creation. A research team has now introduced an AI system named Agent S that aims to handle such tasks the way a human would. The paper "Agent S: An Open Agentic Framework That Uses Computers Like a Human" describes the system's foundations.
Agent S combines the capabilities of modern language models with a special interface that takes control of mouse, keyboard, and screen. The researchers modeled the system's computer interaction after human behavior: It clicks buttons, types text, and navigates menus and folders.
This opens up extensive automation possibilities for individual users and businesses, regardless of the specific software being used. The technology could also create new opportunities for people with disabilities.
Learning ability is key
Previous approaches have pursued similar goals; Microsoft, for example, demonstrated the experimental UFO framework earlier this year. Agent S's particular strength lies in its ability to learn. The system draws on information from the internet, such as instructions for specific computer programs, allowing it to adapt flexibly to constantly changing applications.
Additionally, Agent S stores its own experiences from previous tasks in a type of memory. When faced with a new task, the system searches this knowledge base for similar cases and breaks down the problem into manageable subtasks.
During execution, it continuously monitors progress and adjusts its approach. After the task is completed, the new experiences flow back into the knowledge store, so the AI expands its capabilities with each solved task.
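The retrieve-then-store loop described above can be illustrated with a minimal sketch. This is not the Agent S implementation: the class names `Experience` and `ExperienceMemory`, the `jaccard` helper, and the stored example tasks are all hypothetical, and simple word overlap stands in for whatever similarity search the real system uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    task: str            # natural-language description of a solved task
    subtasks: list[str]  # the steps that solved it


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class ExperienceMemory:
    entries: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str) -> Experience | None:
        """Return the most similar stored experience, or None if memory is empty."""
        if not self.entries:
            return None
        return max(self.entries, key=lambda e: jaccard(e.task, task))

    def store(self, task: str, subtasks: list[str]) -> None:
        """Write a finished task back so later tasks can reuse its plan."""
        self.entries.append(Experience(task, subtasks))


memory = ExperienceMemory()
memory.store("disable autosave in VS Code",
             ["open settings", "search for autosave", "set autosave to off"])
memory.store("remove account in Thunderbird",
             ["open account settings", "select account", "remove account"])

# A new, differently worded task retrieves the closest earlier case.
hit = memory.retrieve("turn off autosave in VS Code")
print(hit.task, "->", hit.subtasks)
```

In a sketch like this, the retrieved subtasks would serve only as a starting plan; the monitoring and re-planning the article describes are left out.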
A specially developed "agent-computer interface" forms the bridge between the AI system and the computer. It translates between the two worlds and ensures safe and reliable execution of commands. To do this, it evaluates visual information to detect changes on the screen.
It also creates a kind of digital twin of all controls and their arrangement. Instead of working with absolute mouse coordinates, Agent S uses instructions like "Click on button No. 42". According to the paper, this approach makes control more robust and reduces susceptibility to errors.
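To show why addressing controls by number can be more robust than raw coordinates, here is a small sketch. The `Element` class, the `resolve_click` function, and the example screen layout are assumptions made for illustration; the paper's actual interface may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: int
    role: str    # e.g. "button"
    label: str   # visible text of the control
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        """Pixel position at the middle of the control."""
        return (self.x + self.width // 2, self.y + self.height // 2)


def resolve_click(elements: dict[int, Element], element_id: int) -> tuple[int, int]:
    """Translate 'click on element N' into coordinates for the current screen."""
    if element_id not in elements:
        raise KeyError(f"no element with id {element_id} on the current screen")
    return elements[element_id].center()


# Hypothetical snapshot of the controls currently on screen.
screen = {
    41: Element(41, "button", "Cancel", x=300, y=500, width=80, height=30),
    42: Element(42, "button", "Save", x=400, y=500, width=80, height=30),
}

target = resolve_click(screen, 42)
print(target)  # (440, 515)
```

The language model only ever names an ID. If the window is moved or resized, the snapshot is rebuilt and the same instruction resolves to new coordinates, whereas an ID that no longer exists fails loudly instead of clicking on empty space.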
Few differences between Claude 3.5 and GPT-4o
In initial practical tests, the developers compared Agent S's performance on typical computer tasks. In a benchmark with tasks under Linux, Agent S raised the success rate by almost 90 percent relative to a pure language model, yet in absolute terms it still solves only around 20 percent of the tasks.
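The gap between the relative and the absolute figure is easy to misread, so a quick back-of-the-envelope calculation may help. It uses only the two rounded numbers quoted above ("almost 90 percent" and "around 20 percent") and treats them as exact, which is an assumption; the results are therefore rough estimates, not values from the paper.

```python
# Rounded figures from the article, treated as exact for illustration only.
relative_gain = 0.90   # "almost 90 percent" improvement over the baseline
agent_success = 0.20   # "around 20 percent" success rate with Agent S

# agent_success = baseline_success * (1 + relative_gain)
baseline_success = agent_success / (1 + relative_gain)
absolute_gain = agent_success - baseline_success

print(f"implied baseline success rate: {baseline_success:.1%}")
print(f"absolute gain in percentage points: {absolute_gain * 100:.1f}")
```

Under these assumptions, the pure language model would solve roughly one task in ten, and Agent S adds somewhat under ten percentage points on top of that.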
The system is also slow: In demo videos, Agent S takes about six minutes to remove an account in the Thunderbird email client and just over three minutes to deactivate the autosave function in VS Code.
Agent S can be linked to various language models via API. Depending on the task area, either Claude 3.5 or GPT-4o came out ahead in combination with the framework, but the overall difference is marginal at 0.1 percentage points. A language model optimized for such use cases could deliver even better performance.
In a test environment under Windows, the system also achieved good results without special adaptation. The basic principles thus seem to be viable across operating system boundaries.
The researchers see room for improvement. A detailed error analysis attributed about 40 percent of the observed problems to weaknesses in task planning or the assignment of control commands to screen elements. The developers also want to further optimize processing speed.
More generally, researchers are pursuing various approaches to operating user interfaces via natural language input. The AI startup Rabbit, for example, made the same promise but has not yet fulfilled it, despite releasing its Large Action Model Playground. It will likely take some time before such systems are mature enough to save more time than they cost in everyday use.
The Python code for Agent S is freely available on GitHub.