
Google shows Spotlight, a multimodal AI model that can understand mobile interfaces. The goal is to improve accessibility and automation.


Google has previously introduced specialized models that can summarize screen content, recognize actionable elements, or execute simple commands. According to Google, these models relied primarily on metadata from mobile websites in addition to visual data; that metadata is not always available and is often incomplete.

With Spotlight, the Google team is training a multimodal AI model that works exclusively with visual information.

Google's Spotlight uses a Vision Transformer and a T5 language model

Spotlight is based on a pre-trained Vision Transformer and a pre-trained T5 language model, which lets it benefit from the general capabilities of large pre-trained models. Google trains it on two datasets totaling 2.5 million mobile UI screens and 80 million web pages.
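Multimodal models of this kind typically project the vision encoder's patch embeddings into the language model's embedding space and feed them to the language model alongside the text tokens. A minimal numpy sketch of that general pattern; all dimensions and names here are illustrative assumptions, not Spotlight's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not Spotlight's real sizes).
PATCH_DIM = 768     # vision encoder output size per screenshot patch
TEXT_DIM = 512      # language model embedding size
N_PATCHES = 16      # e.g. a 4x4 grid of screenshot patches
N_TOKENS = 8        # tokens of the task prompt

def project_patches(patches, w, b):
    """Map vision-encoder patch embeddings into the LM's embedding space."""
    return patches @ w + b

# Stand-ins for a pre-trained ViT's patch outputs and T5-style token embeddings.
patch_embeddings = rng.normal(size=(N_PATCHES, PATCH_DIM))
text_embeddings = rng.normal(size=(N_TOKENS, TEXT_DIM))

# Learned projection (randomly initialized here for the sketch).
w = rng.normal(size=(PATCH_DIM, TEXT_DIM)) * 0.02
b = np.zeros(TEXT_DIM)

# The fused sequence that would be fed into the language model:
fused = np.concatenate(
    [project_patches(patch_embeddings, w, b), text_embeddings], axis=0
)
print(fused.shape)  # (24, 512): image tokens followed by text tokens
```

The key point is that after projection, screenshot patches and prompt tokens live in the same embedding space, so one sequence model can attend over both.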


In addition, Google uses an MLP network to represent individual regions of a screenshot and extract them for processing, for example to detect clickable buttons.
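One way to picture such region extraction: the patch embeddings that fall inside a candidate bounding box (say, a button) are pooled into a single vector and passed through a small MLP. The grid layout, pooling choice, and sizes below are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)

GRID = 4    # screenshot split into a 4x4 grid of patches (illustrative)
DIM = 512   # embedding size per patch (illustrative)

# Patch embeddings arranged on the screen grid: (rows, cols, dim).
patches = rng.normal(size=(GRID, GRID, DIM))

def summarize_region(patches, row0, row1, col0, col1, w1, w2):
    """Pool the patches inside a bounding box and pass them through an MLP."""
    region = patches[row0:row1, col0:col1].reshape(-1, DIM)
    pooled = region.mean(axis=0)           # average-pool the region's patches
    hidden = np.maximum(pooled @ w1, 0.0)  # hidden layer with ReLU
    return hidden @ w2                     # final region representation

# MLP weights (randomly initialized here for the sketch).
w1 = rng.normal(size=(DIM, 256)) * 0.02
w2 = rng.normal(size=(256, DIM)) * 0.02

# e.g. a button occupying the top-left 2x2 patches of the screen:
button_vec = summarize_region(patches, 0, 2, 0, 2, w1, w2)
print(button_vec.shape)  # (512,)
```

The resulting vector gives the model a compact handle on "this region of the screen", which downstream tasks like tappability prediction can condition on.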

Video: Google

The team trains two model sizes (619 million and 843 million parameters), once separately for each task and once jointly on all tasks. The tasks include describing individual elements, summarizing the entire visible page, and detecting controls.
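One common way to train a single sequence model on several such tasks, which may or may not match Spotlight's exact setup, is to prefix every training example with its task name so the model learns to switch behavior based on the prompt. A small sketch of that batching pattern, with made-up inputs and targets:

```python
import random

random.seed(0)

# Example UI tasks of the kind the article describes
# (inputs and targets here are invented for illustration).
examples = [
    ("widget captioning", "screenshot_1 + region", "Settings button"),
    ("screen summarization", "screenshot_2", "A music player's home screen"),
    ("command grounding", "screenshot_3 + 'open settings'", "region (12, 40, 80, 64)"),
    ("tappability", "screenshot_4 + region", "tappable"),
]

def to_training_pair(task, model_input, target):
    """Prefix each example with its task name so one model learns all tasks."""
    return (f"{task}: {model_input}", target)

# A mixed-task batch, shuffled so every training step sees several tasks.
batch = [to_training_pair(*ex) for ex in random.sample(examples, k=len(examples))]
for source, target in batch:
    print(source, "->", target)
```

Training one model per task instead simply means filtering this stream down to a single task name, which is the "tested individually" setting the team compares against.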

The specialized Spotlight models significantly outperform all of Google's older expert UI models. The model trained on all four tasks drops in performance, but is still competitive, the team said.

Google wants to scale Spotlight

In the visualizations, Google shows that Spotlight pays attention to both the buttons and the text, such as for the "Select the Chelsea team" command in the screenshot. The multimodal approach works.

Image: Google

Compared to other multimodal models such as Flamingo, Spotlight is relatively small. Since the larger of the two Spotlight models already outperforms the smaller one, scaling the model further could improve it even more.

"Spotlight can be easily applied to more UI tasks and potentially advance the fronts of many interaction and user experience tasks," the team writes.

In the future, Google's UI model could form the basis for reliable voice control of apps and mobile websites on Android smartphones, or take on other automation tasks. The startup Adept showed what this could look like in the web browser last year with the Action Transformer. Read more in Google's Spotlight blog post.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Google Spotlight recognizes UI elements and other page content in screenshots of mobile pages or apps without access to metadata.
  • The multimodal model is based on Vision Transformer for machine vision, a T5 language model, and additional training with millions of UI examples.
  • Spotlight is designed to support the accessibility and automation of mobile content.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.