Apple has developed a new AI system called Ferret-UI 2 that can read and control apps across iPhones, iPads, Android devices, web browsers, and Apple TV.

The system scored 89.73 in UI element recognition tests, well above GPT-4o's 77.73. It also improves markedly on its predecessor, both in elementary tasks such as text and button recognition and in more advanced operations.

Benchmark results of various UI models with different backbones, showing performance on elementary and advanced tasks. Apple tested the system with several language models: while Llama-3 delivered the best results, the smaller Gemma-2B also performed well. | Image: Apple

Understanding user intent

Instead of relying on specific click coordinates, Ferret-UI 2 aims to understand user intent. When given a command such as "Please confirm your input," the system can identify the appropriate button without requiring precise location data. Apple's research team used GPT-4o's visual capabilities to generate high-quality training data that helped the system better understand how UI elements relate to each other spatially.
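Apple has not published this grounding code. As a toy sketch of the idea, the snippet below resolves a command to a UI element by matching the element's text label instead of requiring pixel coordinates; the function name, scoring rule, and element format are illustrative assumptions, not Ferret-UI 2's actual learned mechanism.

```python
# Toy sketch only: Ferret-UI 2 learns grounding end-to-end from screenshots.
# This stand-in shows the concept of resolving a command to a UI element
# by label similarity rather than by click coordinates.
from difflib import SequenceMatcher

def ground_command(command, elements):
    """Return the UI element whose label best matches the command.

    elements: list of dicts like {"label": "Confirm", "bbox": (x, y, w, h)}
    """
    def score(element):
        return SequenceMatcher(
            None, command.lower(), element["label"].lower()
        ).ratio()
    return max(elements, key=score)

elements = [
    {"label": "Cancel",  "bbox": (10, 400, 120, 44)},
    {"label": "Confirm", "bbox": (150, 400, 120, 44)},
    {"label": "Help",    "bbox": (290, 400, 120, 44)},
]
best = ground_command("Please confirm your input", elements)
print(best["label"])  # "Confirm"
```

In the real system this matching is implicit in the model's weights; the point of the sketch is only that the output is an element, not a coordinate pair.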

Ferret-UI 2 uses an adaptive architecture that recognizes UI elements across platforms. It includes an algorithm that automatically balances image resolution and processing requirements for each platform. According to the researchers, this approach is "both information-preserving and efficient for local encoding."
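The paper describes this balancing step only at a high level. One common way to implement it, used by several open multimodal models, is to tile the screenshot into a grid of fixed-size encoder patches and pick the grid that preserves the most native resolution within a compute budget. The patch size, tile budget, and selection rule below are assumptions for illustration, not Apple's published algorithm.

```python
# Sketch of adaptive-resolution selection (an assumption, not Apple's code):
# choose the (cols, rows) tiling of fixed-size encoder patches that keeps
# the most native screenshot pixels within a tile budget.
def choose_grid(width, height, patch=336, max_tiles=6):
    """Return the (cols, rows) tiling maximizing preserved resolution."""
    best, best_covered, best_tiles = None, -1.0, 0
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            tiles = cols * rows
            if tiles > max_tiles:
                continue
            # Scale that fits the screenshot inside the tiled canvas;
            # capped at 1.0 so upscaled pixels don't count as gained detail.
            scale = min(cols * patch / width, rows * patch / height, 1.0)
            covered = width * height * scale * scale
            better = covered > best_covered or (
                covered == best_covered and tiles < best_tiles
            )
            if better:
                best, best_covered, best_tiles = (cols, rows), covered, tiles
    return best

print(choose_grid(1170, 2532))  # portrait phone screen favors more rows
print(choose_grid(1920, 1080))  # landscape TV screen favors more columns
```

A portrait phone and a landscape TV end up with different tilings, which is the per-platform balancing the researchers describe, in miniature.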

Ferret-UI 2 interaction examples across an iPhone settings screen, an iPad weather app, a MacBook product page, and an Apple TV interface, with model answers. | Image: Apple

Testing showed strong cross-platform performance, with models trained on iPhone data achieving 68 percent accuracy on iPads and 71 percent on Android devices. However, the system had more difficulty transferring between mobile devices and TV or web interfaces, which the researchers attribute to differences in screen layouts.

Llama- and Gemma-based Ferret-UI models are available on Hugging Face, along with a demo.

Microsoft releases UI understanding tool as open source

Apple's work comes as other companies push forward with their own UI understanding AI systems. Anthropic recently released an updated Claude 3.5 Sonnet with UI interaction, while Microsoft released OmniParser, an open-source tool that converts screen content into structured data, for the same purpose.

Apple also recently unveiled CAMPHOR, a framework that uses specialized AI agents coordinated by a master reasoning agent to handle complex tasks. Combined with Ferret-UI 2, this technology could enable voice assistants like Siri to analyze and perform complex tasks, such as finding and booking a specific restaurant, that involve navigating apps or the web using only voice commands.

Summary
  • Apple has developed Ferret-UI 2, a new AI system that can understand and interact with screen content across devices. The goal is to create AI systems that can interact naturally with UIs designed for humans.
  • In tests, Ferret-UI 2 transferred skills well between similar devices: models trained on iPhone data worked with both iPads (68 percent accuracy) and Android devices (71 percent accuracy).
  • However, the system struggled more when moving between mobile devices and TV or web interfaces due to their different layouts and formats.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.