Microsoft finds API agents are faster but GUI agents more flexible

May 11, 2025


Key Points

Microsoft Research has examined the differences between API-based and GUI-based AI agents, outlining their respective strengths and use cases.
API agents operate through programmatic interfaces, offering greater efficiency and reliability, while GUI agents interact with graphical interfaces and offer more flexibility.
The researchers conclude that API agents are best for performance-critical and security-sensitive scenarios, whereas GUI agents are preferable for working with legacy systems lacking APIs and for handling visual tasks.

Microsoft researchers have compared API-based and GUI-based AI agents, finding that each approach has distinct strengths and that the two can work well together.

API agents interact with software through programmable interfaces. GUI agents, by contrast, mimic how humans use software, navigating menus and clicking buttons on a screen. For example, to schedule an event, an API agent might trigger a single function call, while a GUI agent would open the calendar app, find the right screen, and fill out the form manually.

Infographic illustrating API vs. GUI agents using an example scenario for scheduling an appointment in Google Calendar. — While API agents complete tasks via direct function calls, GUI agents simulate human-like interactions with software interfaces. | Image: Microsoft Research
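The scheduling example can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: `FakeCalendar`, its `create_event` method, and the element names such as `new_event_button` are invented stand-ins, not Google Calendar's API or anything from the Microsoft paper. The point is only the difference in the number of actions each agent type needs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCalendar:
    """Hypothetical in-memory calendar standing in for a real app."""
    events: list = field(default_factory=list)
    screen: str = "home"
    form: dict = field(default_factory=dict)

    # API surface: one call creates the event directly.
    def create_event(self, title: str, start: str) -> None:
        self.events.append({"title": title, "start": start})

    # GUI surface: only low-level clicks and keystrokes are available.
    def click(self, target: str) -> None:
        if target == "new_event_button" and self.screen == "home":
            self.screen = "event_form"
        elif target == "save_button" and self.screen == "event_form":
            self.events.append(dict(self.form))
            self.form.clear()
            self.screen = "home"
        else:
            raise RuntimeError(
                f"{target!r} is not clickable on screen {self.screen!r}"
            )

    def type_into(self, field_name: str, text: str) -> None:
        if self.screen != "event_form":
            raise RuntimeError("no form is open")
        self.form[field_name] = text


def schedule_via_api(cal: FakeCalendar, title: str, start: str) -> int:
    """API agent: a single function call. Returns the number of actions."""
    cal.create_event(title, start)
    return 1


def schedule_via_gui(cal: FakeCalendar, title: str, start: str) -> int:
    """GUI agent: open the form, fill it in, save. Returns the number of actions."""
    actions = [
        lambda: cal.click("new_event_button"),
        lambda: cal.type_into("title", title),
        lambda: cal.type_into("start", start),
        lambda: cal.click("save_button"),
    ]
    for act in actions:
        act()
    return len(actions)
```

Both paths end with the same event stored, but the GUI path depends on the screen state at every step, which is where the extra fragility comes from.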

How the two agent types compare

The researchers evaluated both agent types across nine categories. One key difference is how they interact with software: API agents use function calls, while GUI agents rely on visual screen content. As a result, API agents are generally more stable and less error-prone.

API agents are also more efficient: complex tasks can often be completed in a single step. GUI agents, on the other hand, must take multiple actions to accomplish the same goal. But that extra effort comes with greater versatility: GUI agents can control almost any software that has a visible interface, whether or not it offers an API.

Comparison table of the characteristics of API agents and GUI agents with regard to various dimensions such as modality, reliability, efficiency, availability, flexibility, security, maintainability, transparency, and user-friendliness — API agents are more reliable but less adaptable. GUI agents work through screen content and can handle unfamiliar or changing interfaces. | Image: Microsoft Research

This flexibility becomes especially useful when working with new or frequently updated features. GUI agents can adjust to interface changes more easily, while API agents rely on stable specifications. Security, however, favors API agents, because access can be restricted at the level of individual functions. GUI agents, in contrast, typically have access to the entire interface at once, which makes fine-grained restrictions harder to enforce.
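Function-level restriction can be illustrated with a simple allowlist. This is a minimal sketch under stated assumptions: `ScopedToolbox`, `read_balance`, and `transfer` are hypothetical names invented for this example and do not come from the study.

```python
from typing import Callable, Dict, Set


class ScopedToolbox:
    """Hypothetical wrapper that exposes only allowlisted functions to an agent."""

    def __init__(self, tools: Dict[str, Callable[..., object]], allowed: Set[str]):
        self._tools = tools
        self._allowed = allowed

    def call(self, name: str, *args, **kwargs):
        # The permission check happens per function, before anything runs.
        if name not in self._allowed:
            raise PermissionError(f"agent is not allowed to call {name!r}")
        return self._tools[name](*args, **kwargs)


def read_balance(account: str) -> int:
    """Read-only operation backed by made-up data."""
    return {"acct-1": 120}.get(account, 0)


def transfer(src: str, dst: str, amount: int) -> str:
    """State-changing operation the agent should not reach."""
    return f"moved {amount} from {src} to {dst}"


# The agent may read balances but may not move money.
toolbox = ScopedToolbox(
    {"read_balance": read_balance, "transfer": transfer},
    allowed={"read_balance"},
)
```

A GUI agent has no equivalent chokepoint by default: once it can see and click the screen, every visible control is in reach unless the interface itself is locked down.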

Maintenance is also easier with APIs, which benefit from version control. GUI agents are more fragile, since even small visual changes can break them. But transparency is higher with GUI agents, since users can see every action played out visually, making them easier to audit.

According to the researchers, GUI agents are particularly useful for tasks that require visual confirmation. In one example, a GUI agent generates a financial report by manually navigating menus and setting parameters, just like a human user would.

Three ways to combine GUI and API agents

Microsoft outlines three strategies for combining both types of agents into hybrid systems. The first approach uses API wrappers to hide GUI actions behind a programmable interface. For instance, a multi-step process like generating a financial report can be turned into a single GenerateReport() function. Behind the scenes, the wrapper still performs all the GUI actions, but developers only see the clean API.
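A rough sketch of the wrapper idea follows. The `FakeReportingGui` driver, its methods, the menu path, and the field names are all assumptions made up for this example; the article only names the `GenerateReport()` function, written here as `generate_report` in Python style.

```python
class FakeReportingGui:
    """Hypothetical GUI driver that records each action instead of touching a screen."""

    def __init__(self):
        self.log = []
        self.params = {}

    def open_menu(self, path: str) -> None:
        self.log.append(("open_menu", path))

    def set_field(self, name: str, value: str) -> None:
        self.log.append(("set_field", name, value))
        self.params[name] = value

    def click(self, button: str) -> None:
        self.log.append(("click", button))

    def read_output(self) -> str:
        # Stand-in for scraping the generated report from the screen.
        return f"report:{self.params['quarter']}:{self.params['format']}"


def generate_report(gui: FakeReportingGui, quarter: str, fmt: str = "pdf") -> str:
    """Single API-style entry point; every step underneath is a GUI action."""
    gui.open_menu("Reports > Financial")
    gui.set_field("quarter", quarter)
    gui.set_field("format", fmt)
    gui.click("Generate")
    return gui.read_output()
```

Callers see one function with two parameters. If the interface changes, only the body of `generate_report` has to be updated, not every workflow that uses it.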

The second strategy uses orchestration tools to coordinate both API and GUI steps in a workflow. In a credit application scenario, APIs are used for database queries and credit checks, while GUI actions handle tasks like sending emails. Microsoft’s experimental tool UFO follows this model. It prefers APIs but falls back to GUI interactions when needed.

Flowchart with examples of API and GUI actions managed by a central orchestrator. — A unified orchestrator can combine API calls and GUI actions into a single automated workflow. | Image: Microsoft Research
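The "prefer APIs, fall back to the GUI" pattern can be sketched as a small orchestrator. This is not UFO's implementation; the `Orchestrator` class, the step names, and the handler functions are hypothetical, loosely following the credit application scenario described above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Callable[[dict], None]


class Orchestrator:
    """Hypothetical workflow runner: try the API handler first, then the GUI handler."""

    def __init__(self) -> None:
        self._api: Dict[str, Action] = {}
        self._gui: Dict[str, Action] = {}

    def register(self, step: str, api: Optional[Action] = None,
                 gui: Optional[Action] = None) -> None:
        if api is not None:
            self._api[step] = api
        if gui is not None:
            self._gui[step] = gui

    def run(self, steps: List[str], ctx: dict) -> List[Tuple[str, str]]:
        trace = []
        for step in steps:
            if step in self._api:
                try:
                    self._api[step](ctx)
                    trace.append((step, "api"))
                    continue
                except Exception:
                    # Broad catch for brevity: any API failure triggers the fallback.
                    pass
            if step not in self._gui:
                raise LookupError(f"no working handler for step {step!r}")
            self._gui[step](ctx)
            trace.append((step, "gui"))
        return trace


def query_customer(ctx: dict) -> None:      # API: database lookup (made-up data)
    ctx["customer"] = {"id": ctx["customer_id"], "income": 52000}


def credit_check(ctx: dict) -> None:        # API: scoring call (made-up rule)
    ctx["approved"] = ctx["customer"]["income"] >= 30000


def send_email_via_gui(ctx: dict) -> None:  # GUI: stand-in for driving a mail client
    ctx["email_sent"] = True


orch = Orchestrator()
orch.register("query_customer", api=query_customer)
orch.register("credit_check", api=credit_check)
orch.register("send_email", gui=send_email_via_gui)
```

Running `orch.run(["query_customer", "credit_check", "send_email"], {"customer_id": 7})` returns a trace showing which channel handled each step, which also makes it easy to see where a new API could later replace a GUI action.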

The third approach involves low-code and no-code platforms. These tools allow non-technical users to build automations using drag-and-drop interfaces. Under the hood, the system decides whether to use APIs or GUI actions, depending on what's available.

Microsoft sees recent advances in multimodal AI as a key enabler for these hybrid systems. Improvements in visual AI and transformer models could make GUI agents more robust. At the same time, new tools are simplifying API development. Together, these trends could lead to more flexible forms of automation that blur the line between front-end and back-end integration.

Choosing the right agent for the job

The study outlines clear guidelines for when to use which type of agent. API agents are best for performance-critical tasks where speed and reliability matter, especially when working with well-documented interfaces.

They are also ideal for security-sensitive environments, where access needs to be tightly controlled. Microsoft recommends using API agents for backend operations and database access, where direct and efficient communication is essential.

GUI agents are better suited for legacy systems that lack APIs. Microsoft also highlights mobile apps as a strong use case, since these often restrict external API access. GUI agents are especially useful for tasks that require visual inspection, such as UI testing.

Table: Recommended approaches (API, GUI, Hybrid agents) for automation scenarios with their rationales. — Hybrid systems offer flexibility and can evolve as APIs become available. | Image: Microsoft Research

When available APIs only cover part of a system, a hybrid approach makes the most sense. Organizations can start with GUI agents, then gradually switch to APIs as they become available. According to Microsoft, choosing the right architecture from the outset is crucial for long-term automation success.

Growing momentum for GUI agents

Other companies are also working on ways to streamline how AI interacts with software. Anthropic has introduced the Model Context Protocol (MCP), an open standard that acts as a universal translator between AI systems and data sources. It’s already being used to control applications like Blender, which previously required custom integration for each task.

At the same time, GUI agents are gaining ground on the consumer side. That shift makes sense: these agents can, in theory, handle a wide range of tasks just by operating software the same way a person would. New agents like OpenAI's Operator and the Chinese AI assistant Manus already use visual interfaces to complete workflows that once required manual input.

Source: Arxiv