Content
summary Summary

Traditional user interfaces may fade into the background as AI technologies advance. With UFO, Microsoft is demonstrating how easy it could be to interact with Windows in the future.

Microsoft has developed an agent framework called UFO that can autonomously answer user queries within Windows.

UFO stands for "UI-Focused Agent" and is based on the GPT-4V image recognition model from OpenAI. It analyzes the graphical user interface and controls of Windows apps to seamlessly navigate within and between them.

Demonstration of a multi-step task that UFO can handle. | Image: Zhang et al.

UFO combines two agents that make decisions about which apps and controls to select to handle user requests. The AppAgent selects the right app, while the ActAgent performs specific actions in the selected app. A control interaction module translates the selected actions into executable operations.

Ad
Ad
Image: Zhang et al.

Microsoft evaluated UFO's performance using WindowsBench, a benchmark consisting of 50 user requests in nine common Windows applications such as Outlook, PowerPoint, File Explorer, and Adobe's Acrobat Reader.

UFO completed 86 percent of the tasks, significantly higher than other models such as GPT-3.5 and GPT-4, whose commands were executed by humans instead of UFO via GPT-4V.

UFO also required fewer steps on average and took more security precautions, such as avoiding irreversible file deletion.

Image: Zhang et al.

However, the researchers acknowledge significant limitations of their system. UFO could only perform controls and actions supported by the Python package pywinauto and Windows UI automation. They also noted difficulties when UFO was confronted with unusual application GUIs.

Microsoft plans to improve UFO by supporting alternative backends and by integrating dedicated GUI models for visual recognition. In addition, connecting to online search engines as an external knowledge base could improve the agent's ability to adapt to unknown GUIs.

Recommendation

Until now, acting AI agents have not been very flexible or reliable. They also come at an additional cost. Although you can install UFO on your computer for free via GitHub, it requires an API key from OpenAI for inferencing with GPT-4V, which incurs costs for each request. It is also only accessible from the command line.

For concepts like UFO to be useful, they would need to be more tightly integrated into the operating system. Ideally, they would run locally to provide fast performance at lower cost. This would also potentially eliminate privacy concerns.

Nonetheless, UFO is a promising step toward fundamentally changing the way we operate computers. Combined with a powerful speech recognition model such as Whisper, it could eliminate the need for traditional interfaces, although that seems a long way off.

Microsoft notes that no sensitive or confidential information should be visible while UFO is running. Screenshots are sent to GPT-4V via the programming interface.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

LAMs as the next AI evolution

Although UFO is the first AI navigator tailored to the Windows interface, it is not the first attempt to replace user interfaces with natural language input.

Earlier examples include Agent-1 and ACT-1, which aim for similar functionality. A while back, Google explored the extent to which a smartphone could be controlled by voice input alone.

Rabbit's recently unveiled AI hardware is also based on "LAMs" (Large Action Models), which are large language models capable of controlling human interfaces based on natural language prompts. OpenAI is also rumored to be working on two LAM agents that could automate entire work processes.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Microsoft has developed a UI-Focused Agent (UFO) that independently processes user requests in Windows. UFO uses OpenAI's GPT-4V to analyze the graphical user interface and application controls.
  • UFO uses a combination of two agents, AppAgent and ActAgent, to select and perform actions in relevant applications. The system achieves an 86 percent success rate for tasks in the WindowsBench test.
  • UFO still has limitations. It can only perform controls and actions supported by the Python package pywinauto and Windows UI Automation. Future improvements might include alternative backends, dedicated GUI models, and integration with online search engines for better customization.
Sources
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.