Microsoft's new Fara-7B model is a compact AI system built to operate user interfaces purely through visual input. Despite its small size, it aims to keep pace with far more complex systems while running locally on consumer devices.

Fara-7B is based on Alibaba's Qwen2.5-VL-7B and, according to Microsoft, relies solely on visual information. Instead of tapping into accessibility trees or parsing HTML, it works directly off screenshots of the interface. The model runs in a loop of observing, thinking, and acting, predicting click coordinates or generating keystrokes as needed. It uses the last three screenshots, previous actions, and user input to decide what to do next.
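The observe-think-act loop described above can be sketched roughly as follows. This is a minimal illustration, not Microsoft's implementation: the `Action` type and the `capture`, `policy`, and `execute` callables are hypothetical placeholders for a screenshot grabber, the model call, and an input driver.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Action:
    kind: str      # hypothetical labels: "click", "type", or "done"
    x: int = 0     # click coordinates predicted from the screenshot
    y: int = 0
    text: str = "" # keystrokes to send for "type" actions


# Hypothetical model interface: (task, recent screenshots, past actions) -> next action
Policy = Callable[[str, List[bytes], List[Action]], Action]


def run_agent(task: str,
              capture: Callable[[], bytes],
              policy: Policy,
              execute: Callable[[Action], None],
              max_steps: int = 50) -> List[Action]:
    """Run the loop until the policy signals "done" or max_steps is reached."""
    screenshots: Deque[bytes] = deque(maxlen=3)  # only the last three are kept
    history: List[Action] = []
    for _ in range(max_steps):
        screenshots.append(capture())                      # observe
        action = policy(task, list(screenshots), history)  # think
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                                    # act
    return history
```

The bounded `deque` mirrors the three-screenshot window mentioned in the article; the `max_steps` cap is an added assumption to keep the sketch from looping forever.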

With seven billion parameters, Fara-7B is small enough to run directly on the user's own hardware. Microsoft says this setup reduces latency and improves privacy because all data stays on the device.

The model works in a continuous loop, using screenshots of the interface to select its next action. | Image: Microsoft

A major challenge for these kinds of computer-use agents is the lack of usable training data. Recording click paths manually is extremely time-consuming. Microsoft addressed this with a synthetic data pipeline.

Microsoft generates synthetic training data by letting a multi-agent system solve tasks and then verifying the results. | Image: Microsoft

The team used its in-house multi-agent framework Magentic-One to create task solutions automatically. An Orchestrator agent produces step-by-step plans, while a WebSurfer agent carries them out. Microsoft then collected the successful task runs - roughly 145,000 trajectories with one million total steps - and distilled the knowledge into the smaller Fara-7B model.
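The generate-then-verify idea can be illustrated with a short sketch. It does not use the Magentic-One API: `solve` is a hypothetical stand-in for an Orchestrator plus WebSurfer run, and `verify` for whatever check decides that a run succeeded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    task: str
    steps: List[str]  # recorded actions, one entry per step


def collect_verified(tasks: Iterable[str],
                     solve: Callable[[str], Trajectory],
                     verify: Callable[[Trajectory], bool]) -> List[Trajectory]:
    """Run every task once and keep only the runs that pass verification."""
    kept: List[Trajectory] = []
    for task in tasks:
        trajectory = solve(task)   # hypothetical multi-agent run
        if verify(trajectory):     # failed runs are discarded
            kept.append(trajectory)
    return kept
```

Only the trajectories returned here would feed into training the smaller model, which is why the verification step matters for data quality.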

The company also introduced a new benchmark, WebTailBench, to cover task types that were underrepresented in older test suites, including price comparisons and job searches.

 

Efficiency that challenges larger models

In Microsoft's benchmarks, the model performs strongly for its size. On the WebVoyager test, Fara-7B reaches a success rate of 73.5 percent. The team says this puts it ahead of the UI-TARS-1.5-7B model and even above OpenAI's commercial GPT-4o in this specific benchmark. An independent evaluation by Browserbase using human reviewers resulted in a 62 percent success rate.

Fara-7B performs well across several benchmarks and surpasses OpenAI's Computer-Use-Preview in WebVoyager. | Image: Microsoft

Microsoft also highlights the model's efficiency. On average, Fara-7B completes tasks in about 16 steps, while competing models like UI-TARS average around 41 steps. Because every step requires another model call, fewer steps translate directly into lower cost per task.
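To get a feel for the step counts, here is a back-of-the-envelope comparison. It assumes the cost per step is the same for both models, which the article does not state; in practice, per-step cost depends on model size and token usage.

```python
def relative_cost(steps_a: float, steps_b: float,
                  cost_per_step_a: float = 1.0,
                  cost_per_step_b: float = 1.0) -> float:
    """Return how many times more expensive a task is with model B than with model A."""
    return (steps_b * cost_per_step_b) / (steps_a * cost_per_step_a)


# Average step counts quoted in the article: about 16 vs. about 41.
# Equal per-step cost is an assumption made only for this illustration.
ratio = relative_cost(16, 41)
print(f"{ratio:.2f}x")
```

Under that assumption, a 41-step run costs about 2.56 times as much as a 16-step run.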

Fara-7B (purple) reaches high accuracy in WebVoyager at a fraction of the cost of larger models like GPT-4o. | Image: Microsoft

Microsoft notes that the model still makes mistakes, can misunderstand instructions, and is prone to hallucinations. To reduce risk, the system is trained to pause at certain critical points - for example, before sending an email or initiating a financial transaction - so the user can confirm the action.
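A confirmation pause of this kind can be approximated with a simple gate around the action executor. Note the difference: according to Microsoft, Fara-7B learns this behavior in training, whereas the sketch below uses a fixed rule. The action labels and the `execute` and `confirm` callables are hypothetical.

```python
from typing import Callable

# Hypothetical labels for actions that should never run without user approval.
CRITICAL_KINDS = {"send_email", "submit_payment"}


def guarded_execute(action_kind: str,
                    execute: Callable[[str], None],
                    confirm: Callable[[str], bool]) -> bool:
    """Execute the action unless it is critical and the user declines.

    Returns True if the action ran, False if it was blocked.
    """
    if action_kind in CRITICAL_KINDS and not confirm(action_kind):
        return False  # user declined, nothing is executed
    execute(action_kind)
    return True
```

In an interactive setting, `confirm` could simply prompt the user and return their yes-or-no answer.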

The model is available as an experimental open-weight release under an MIT license on Hugging Face and Microsoft Foundry. Users can also test it locally on Copilot+ PCs running Windows 11.

Companies including OpenAI, Anthropic, Google, and Manus AI have been pursuing AI-driven interface agents for some time. So far, many of these agents handle tasks slowly or fail outright, often without delivering real efficiency gains. They also remain vulnerable to issues like prompt injection.

A possible path forward is to move beyond purely visual interfaces and instead provide agents with interaction surfaces designed specifically for them. Researchers are already exploring standardized agent interaction concepts, which could significantly boost both the efficiency and the safety of AI-driven computer-use systems.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Microsoft has introduced Fara-7B, a lightweight AI model designed to control user interfaces using only screenshots.
  • The model operates directly on user devices, leading to better response times and enhanced privacy.
  • Fara-7B achieves strong benchmark results for its size, even surpassing larger models like GPT-4o in certain tests.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.