Google's Mirasol pushes the boundaries of AI video understanding

Nov 15, 2023

Google and Google DeepMind have unveiled Mirasol, a small AI model that answers questions about video and sets new records.

To understand video, AI models need to integrate information from different modalities, such as video, audio, and text. However, today's AI systems struggle to process diverse data streams and large amounts of data. In a new study, researchers at Google and Google DeepMind present an approach that significantly improves multimodal understanding of long-form video.

Mirasol relies on new "Combiner" transformer module

With the Mirasol AI model, the team seeks to address two key challenges: First, modalities such as video and audio are synchronized in time and occur at high sampling rates, while modalities such as titles and descriptions are asynchronous with the content itself. Second, video and audio generate large amounts of data that strain the model's capacity.

For Mirasol, the team pairs so-called Combiner modules with autoregressive transformer models. The time-synchronized video and audio signals are handled by one model component, which splits the video into individual segments. A transformer processes each segment and learns the relationships between the segments. A separate transformer then processes the contextual text. The two components exchange information about their respective inputs.
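The segment-by-segment processing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the researchers' code: the function names `split_into_segments` and `autoregressive_contexts`, and the choice of 16 frames per segment, are assumptions made for this example.

```python
from typing import List, Tuple


def split_into_segments(num_frames: int, frames_per_segment: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover the video in order.

    Hypothetical helper: the last segment is shorter if the video length
    is not a multiple of frames_per_segment.
    """
    if num_frames <= 0 or frames_per_segment <= 0:
        raise ValueError("num_frames and frames_per_segment must be positive")
    return [
        (start, min(start + frames_per_segment, num_frames))
        for start in range(0, num_frames, frames_per_segment)
    ]


def autoregressive_contexts(segments: List[Tuple[int, int]]) -> List[Tuple[int, List[int]]]:
    """For each segment index t, list the earlier segment indices it may condition on.

    This mirrors the autoregressive ordering only; a real model would
    condition on learned representations of those earlier segments.
    """
    return [(t, list(range(t))) for t in range(len(segments))]


# Assumed example values: a 128-frame video cut into 16-frame segments.
segments = split_into_segments(num_frames=128, frames_per_segment=16)
contexts = autoregressive_contexts(segments)
```

With these assumed values the video is cut into eight segments, and the fourth segment (index 3) conditions on segments 0, 1, and 2.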

In the video-audio component, a novel transformation module called the Combiner extracts common representations from each segment and compresses the data through dimension reduction. Each segment contains between 4 and 64 frames; in total, the current version of the model, with 3 billion parameters, can process videos with 128 to 512 frames. Other much larger models, based primarily on text-based transformers with additional modalities, can often only process 32 to 64 frames for the entire video.
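To make the Combiner's role more concrete, here is a minimal sketch of fusing one segment's video and audio features and compressing them into fewer vectors. It is not the actual Combiner, which is a learned module: average pooling over contiguous groups stands in for the learned compression, and the name `combine_segment`, the parameter `num_out`, and the toy feature values are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def combine_segment(video_feats: List[Vector], audio_feats: List[Vector], num_out: int) -> List[Vector]:
    """Fuse one segment's video and audio features into num_out joint vectors.

    Stand-in for a learned Combiner (assumption): the two modalities are
    concatenated along the token axis, then contiguous groups of tokens
    are averaged so that only num_out vectors remain.
    """
    tokens = video_feats + audio_feats
    if not tokens:
        raise ValueError("segment has no features")
    if not 1 <= num_out <= len(tokens):
        raise ValueError("num_out must be between 1 and the number of input vectors")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all feature vectors must share one dimension")

    compressed = []
    for i in range(num_out):
        # Integer arithmetic spreads the tokens as evenly as possible over the groups.
        lo = i * len(tokens) // num_out
        hi = (i + 1) * len(tokens) // num_out
        group = tokens[lo:hi]
        compressed.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return compressed


# Toy features for one segment: two video vectors and two audio vectors.
video = [[1.0, 2.0], [3.0, 4.0]]
audio = [[5.0, 6.0], [7.0, 8.0]]
joint = combine_segment(video, audio, num_out=2)
```

The point of the sketch is the shape change: four input vectors go in and two come out, so the downstream autoregressive model sees a fixed, smaller budget per segment regardless of how many frames the segment holds.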

Google's Mirasol could be used for YouTube

In tests, Mirasol3B sets new records on video question answering benchmarks while being significantly smaller than competing models and able to process longer videos. With a variant of the Combiner that incorporates memory, the team can reduce the computing power required by a further 18 percent.

In the future, models like Mirasol could be used by chatbots, such as the recently launched AI assistant for YouTube, to answer questions about videos or improve functions such as automatic categorization and chapter marking of videos.
