Traditional user interfaces may fade into the background as AI technologies advance. With UFO, Microsoft is demonstrating how easy it could be to interact with Windows in the future.
Microsoft has developed an agent framework called UFO that can autonomously answer user queries within Windows.
UFO stands for "UI-Focused Agent" and is based on the GPT-4V image recognition model from OpenAI. It analyzes the graphical user interface and controls of Windows apps to seamlessly navigate within and between them.
UFO combines two agents that make decisions about which apps and controls to select to handle user requests. The AppAgent selects the right app, while the ActAgent performs specific actions in the selected app. A control interaction module translates the selected actions into executable operations.
Microsoft evaluated UFO's performance using WindowsBench, a benchmark consisting of 50 user requests in nine common Windows applications such as Outlook, PowerPoint, File Explorer, and Adobe's Acrobat Reader.
UFO completed 86 percent of the tasks, significantly higher than other models such as GPT-3.5 and GPT-4, whose commands were executed by humans instead of UFO via GPT-4V.
UFO also required fewer steps on average and took more security precautions, such as avoiding irreversible file deletion.
However, the researchers acknowledge significant limitations of their system. UFO could only perform controls and actions supported by the Python package pywinauto and Windows UI automation. They also noted difficulties when UFO was confronted with unusual application GUIs.
Microsoft plans to improve UFO by supporting alternative backends and by integrating dedicated GUI models for visual recognition. In addition, connecting to online search engines as an external knowledge base could improve the agent's ability to adapt to unknown GUIs.
Until now, acting AI agents have not been very flexible or reliable. They also come at an additional cost. Although you can install UFO on your computer for free via GitHub, it requires an API key from OpenAI for inferencing with GPT-4V, which incurs costs for each request. It is also only accessible from the command line.
For concepts like UFO to be useful, they would need to be more tightly integrated into the operating system. Ideally, they would run locally to provide fast performance at lower cost. This would also potentially eliminate privacy concerns.
Nonetheless, UFO is a promising step toward fundamentally changing the way we operate computers. Combined with a powerful speech recognition model such as Whisper, it could eliminate the need for traditional interfaces, although that seems a long way off.
Microsoft notes that no sensitive or confidential information should be visible while UFO is running. Screenshots are sent to GPT-4V via the programming interface.
LAMs as the next AI evolution
Although UFO is the first AI navigator tailored to the Windows interface, it is not the first attempt to replace user interfaces with natural language input.
Earlier examples include Agent-1 and ACT-1, which aim for similar functionality. A while back, Google explored the extent to which a smartphone could be controlled by voice input alone.
Rabbit's recently unveiled AI hardware is also based on "LAMs" (Large Action Models), which are large language models capable of controlling human interfaces based on natural language prompts. OpenAI is also rumored to be working on two LAM agents that could automate entire work processes.