Microsoft researchers have compared API-based and GUI-based AI agents, finding that each approach has distinct strengths and that the two can work well together.
API agents interact with software through programmable interfaces. GUI agents, by contrast, mimic how humans use software, navigating menus and clicking buttons on a screen. For example, to schedule an event, an API agent might trigger a single function call, while a GUI agent would open the calendar app, find the right screen, and fill out the form manually.
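To make the contrast concrete, here is a minimal Python sketch. The `Calendar` class, its `create_event` method, and the list of screen actions are hypothetical stand-ins for illustration only; they are not taken from the study or from any real calendar product. The API path is one structured call. The GUI path is a sequence of clicks and keystrokes that ends with the same result.

```python
from dataclasses import dataclass, field


@dataclass
class Calendar:
    """Hypothetical in-memory calendar app (illustration only)."""
    events: list = field(default_factory=list)

    def create_event(self, title: str, date: str, time: str) -> dict:
        event = {"title": title, "date": date, "time": time}
        self.events.append(event)
        return event


def schedule_via_api(cal: Calendar, title: str, date: str, time: str):
    # API agent: a single function call with structured arguments.
    return cal.create_event(title, date, time), 1


def schedule_via_gui(cal: Calendar, title: str, date: str, time: str):
    # GUI agent: a hypothetical sequence of screen actions, one per step.
    actions = [
        ("click", "calendar_icon", None),
        ("click", "new_event_button", None),
        ("type", "title", title),
        ("type", "date", date),
        ("type", "time", time),
        ("click", "save_button", None),
    ]
    form = {}
    event = None
    for kind, target, text in actions:
        if kind == "type":
            form[target] = text  # fill one form field at a time
        elif target == "save_button":
            # Simulate what pressing "Save" does inside the app.
            event = cal.create_event(form["title"], form["date"], form["time"])
    return event, len(actions)


api_event, api_steps = schedule_via_api(Calendar(), "Team sync", "2025-05-02", "10:00")
gui_event, gui_steps = schedule_via_gui(Calendar(), "Team sync", "2025-05-02", "10:00")
print(api_event == gui_event, api_steps, gui_steps)  # True 1 6
```

Both paths create the same event; the difference is one call versus six separate actions, each of which is a chance for a GUI agent to misread the screen.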

How the two agent types compare
The researchers evaluated both agent types across nine categories. One key difference is how they interact with software: API agents call functions directly, while GUI agents have to interpret what is shown on the screen. As a result, API agents are generally more stable and less error-prone.
They are also more efficient: complex tasks can often be completed in a single step, whereas GUI agents must take multiple actions to reach the same goal. In exchange, GUI agents are more versatile: they can control almost any software that has a visible interface, whether or not it offers an API.

This flexibility is especially useful for new or frequently updated features: a GUI agent can use a feature as soon as it appears in the interface, while an API agent depends on that feature being exposed through a stable, documented API. Security, however, favors API agents, because access can be restricted at the function level. GUI agents, in contrast, typically have access to the entire interface at once, which makes their permissions harder to limit.
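Function-level access control can be sketched as an allowlist that sits between an agent and the backend. The `ScopedToolbox` class and the `read_balance` and `transfer_funds` functions below are hypothetical examples, not part of the study; the point is only that an API agent can be handed a narrower set of capabilities than a GUI agent that sees the whole screen.

```python
from typing import Any, Callable, Dict, Set


class ScopedToolbox:
    """Exposes only an allowlisted subset of functions to an agent."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: Set[str]):
        unknown = allowed - set(tools)
        if unknown:
            raise ValueError(f"unknown tools: {sorted(unknown)}")
        # Keep only the functions this agent is permitted to call.
        self._tools = {name: fn for name, fn in tools.items() if name in allowed}

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise PermissionError(f"agent may not call {name!r}")
        return self._tools[name](*args, **kwargs)


# Hypothetical backend functions with dummy return values.
def read_balance(account_id: str) -> int:
    return 100


def transfer_funds(src: str, dst: str, amount: int) -> str:
    return "transferred"


tools = {"read_balance": read_balance, "transfer_funds": transfer_funds}
reporting_agent = ScopedToolbox(tools, allowed={"read_balance"})

print(reporting_agent.call("read_balance", "acct-1"))  # 100
try:
    reporting_agent.call("transfer_funds", "acct-1", "acct-2", 50)
except PermissionError as err:
    print(err)  # agent may not call 'transfer_funds'
```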
Maintenance is also easier with APIs, which benefit from versioning. GUI agents are more fragile, since small visual changes can break them. Transparency, on the other hand, is higher with GUI agents: users can watch every action play out on screen, which makes the agents easier to audit.
According to the researchers, GUI agents are particularly useful for tasks that require visual confirmation. In one example, a GUI agent generates a financial report by manually navigating menus and setting parameters, just like a human user would.
Three ways to combine GUI and API agents
Microsoft outlines three strategies for combining both types of agents into hybrid systems. The first approach uses API wrappers to hide GUI actions behind a programmable interface. For instance, a multi-step process like generating a financial report can be turned into a single GenerateReport() function. Behind the scenes, the wrapper still performs all the GUI actions, but developers only see the clean API.
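The wrapper idea can be sketched in a few lines. Here `generate_report` plays the role of the article's `GenerateReport()`; the `GuiDriver` protocol, the `FakeFinanceApp` backend, and the menu and field names are hypothetical placeholders, not a real automation library. A real driver would move the mouse and read the screen instead of recording actions in a list.

```python
from typing import List, Protocol, Tuple


class GuiDriver(Protocol):
    """Minimal GUI automation interface (hypothetical)."""

    def click(self, target: str) -> None: ...
    def type_text(self, target: str, text: str) -> None: ...
    def read_text(self, target: str) -> str: ...


class FakeFinanceApp:
    """Stand-in GUI backend that records each action instead of driving a real screen."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, ...]] = []
        self.fields = {}

    def click(self, target: str) -> None:
        self.log.append(("click", target))

    def type_text(self, target: str, text: str) -> None:
        self.log.append(("type", target, text))
        self.fields[target] = text

    def read_text(self, target: str) -> str:
        self.log.append(("read", target))
        return f"Report {self.fields['Start date']} to {self.fields['End date']}"


def generate_report(gui: GuiDriver, start: str, end: str) -> str:
    """API-style wrapper: callers make one call; the GUI steps stay hidden inside."""
    gui.click("Reports menu")
    gui.click("Financial report")
    gui.type_text("Start date", start)
    gui.type_text("End date", end)
    gui.click("Generate")
    return gui.read_text("Report preview")


app = FakeFinanceApp()
print(generate_report(app, "2025-01-01", "2025-03-31"))  # Report 2025-01-01 to 2025-03-31
print(len(app.log))  # 6 GUI actions behind a single call
```

Because callers depend only on the `generate_report` signature, the GUI steps inside it could later be replaced by a direct API call without changing any calling code.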
The second strategy uses orchestration tools to coordinate API and GUI steps within a single workflow. In a credit application scenario, APIs handle database queries and credit checks, while GUI actions take care of tasks like sending emails. Microsoft's experimental tool UFO follows this model: it prefers APIs and falls back to GUI interactions when no suitable API is available.
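The API-first, GUI-fallback pattern can be sketched as a small orchestrator. This is not UFO's implementation; the `Step` type, `run_workflow`, the `ApiUnavailable` exception, and the stub step functions are hypothetical names chosen for illustration, and every result is a dummy string.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class ApiUnavailable(Exception):
    """Raised by an API step when the call cannot be served (hypothetical)."""


@dataclass
class Step:
    name: str
    api: Optional[Callable[[dict], str]] = None  # preferred path
    gui: Optional[Callable[[dict], str]] = None  # fallback via screen actions


def run_workflow(steps: List[Step], ctx: dict) -> List[Tuple[str, str, str]]:
    trace = []
    for step in steps:
        if step.api is not None:
            try:
                trace.append((step.name, "api", step.api(ctx)))
                continue
            except ApiUnavailable:
                pass  # fall through to the GUI path
        if step.gui is None:
            raise RuntimeError(f"no API or GUI route for step {step.name!r}")
        trace.append((step.name, "gui", step.gui(ctx)))
    return trace


def flaky_credit_api(ctx: dict) -> str:
    raise ApiUnavailable("credit service down")


# Credit-application steps from the example above, with stub implementations.
steps = [
    Step("query_database", api=lambda c: f"record for {c['applicant']}"),
    Step("credit_check", api=lambda c: "check passed (stub)"),
    Step("send_email", gui=lambda c: f"email to {c['applicant']} sent via mail client"),
]
for name, route, result in run_workflow(steps, {"applicant": "Alice"}):
    print(f"{name}: {route}")  # api, api, gui

# If the credit API fails, the same step is retried through the GUI.
fallback = run_workflow(
    [Step("credit_check", api=flaky_credit_api, gui=lambda c: "checked in web portal (stub)")],
    {"applicant": "Alice"},
)
print(fallback[0][1])  # gui
```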

The third approach involves low-code and no-code platforms. These tools allow non-technical users to build automations using drag-and-drop interfaces. Under the hood, the system decides whether to use APIs or GUI actions, depending on what's available.
Microsoft sees recent advances in multimodal AI as a key enabler for these hybrid systems. Improvements in visual AI and transformer models could make GUI agents more robust. At the same time, new tools are simplifying API development. Together, these trends could lead to more flexible forms of automation that blur the line between front-end and back-end integration.
Choosing the right agent for the job
The study outlines clear guidelines for when to use which type of agent. API agents are best for performance-critical tasks where speed and reliability matter, especially when working with well-documented interfaces.
They are also ideal for security-sensitive environments, where access needs to be tightly controlled. Microsoft recommends using API agents for backend operations and database access, where direct and efficient communication is essential.
GUI agents are better suited for legacy systems that lack APIs. Microsoft also highlights mobile apps as a strong use case, since these often restrict external API access. GUI agents are especially useful for tasks that require visual inspection, such as UI testing.

When available APIs only cover part of a system, a hybrid approach makes the most sense. Organizations can start with GUI agents, then gradually switch to APIs as they become available. According to Microsoft, choosing the right architecture from the outset is crucial for long-term automation success.
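As a rough illustration, the guidelines above can be condensed into a small decision helper. The `api_coverage` parameter, the `AgentType` enum, and the order in which the rules are checked are simplifying assumptions of this sketch, not a procedure published by Microsoft; a real decision would also weigh performance, security, and maintenance costs.

```python
from enum import Enum


class AgentType(Enum):
    API = "api"
    GUI = "gui"
    HYBRID = "hybrid"


def choose_agent(api_coverage: float, needs_visual_check: bool = False) -> AgentType:
    """Illustrative encoding of the guidelines; not a rule set from the study.

    api_coverage: fraction of required operations reachable through an API (0.0 to 1.0).
    needs_visual_check: True for tasks such as UI testing that must inspect the screen.
    """
    if not 0.0 <= api_coverage <= 1.0:
        raise ValueError("api_coverage must be between 0.0 and 1.0")
    if needs_visual_check or api_coverage == 0.0:
        return AgentType.GUI  # e.g. UI testing, legacy systems without APIs
    if api_coverage < 1.0:
        return AgentType.HYBRID  # start with GUI steps, swap in APIs as they appear
    return AgentType.API  # full, documented API: faster and easier to secure


print(choose_agent(1.0).value)  # api
print(choose_agent(0.4).value)  # hybrid
print(choose_agent(0.0).value)  # gui
```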
Growing momentum for GUI agents
Other companies are also working on ways to streamline how AI interacts with software. Anthropic recently introduced the Model Context Protocol (MCP), an open-source standard that acts as a universal translator between AI systems and data sources. It is already being used to control applications like Blender, which previously required a custom integration for each task.
At the same time, GUI agents are gaining ground on the consumer side. The shift makes sense: in theory, these agents can handle a wide range of tasks simply by operating software the way a person would. New agents such as OpenAI's Operator and the Chinese AI agent Manus already use visual interfaces to complete workflows that once required manual input.