
Google and Google DeepMind unveil Mirasol, a small AI model that answers questions about videos and sets new records in the process.

To understand video, AI models need to integrate information from different modalities, such as video, audio, and text. However, today's AI systems struggle to process such diverse data streams and the large volumes of data they produce. In a new study, researchers at Google and Google DeepMind present an approach that significantly improves multimodal understanding of long-form video.

Mirasol relies on a new "Combiner" transformer module

With the Mirasol AI model, the team seeks to address two key challenges: First, modalities such as video and audio are synchronized in time and occur at high sampling rates, while modalities such as titles and descriptions are asynchronous with the content itself. Second, video and audio generate large amounts of data that strain the model's capacity.

For Mirasol, the team combines Combiner modules with autoregressive transformer models. One model component processes the time-synchronized video and audio signals, splitting the video into individual segments; an autoregressive transformer then processes the sequence of segments and learns the relationships between them. A separate transformer handles the contextual text, and the two components exchange information about their respective inputs.
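To make this two-stream layout concrete, here is a minimal sketch in PyTorch. All module names, layer counts, pooling choices, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a Mirasol-style two-stream layout (all sizes assumed).
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Encodes the tokens of one time-aligned audio-video segment."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                 # tokens: (batch, seq, dim)
        return self.encoder(tokens)

class MirasolSketch(nn.Module):
    def __init__(self, dim=512, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        self.segment_encoder = SegmentEncoder(dim)
        # Autoregressive transformer over the sequence of segment latents.
        ar_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(ar_layer, num_layers=4)
        # Separate stream for the asynchronous text; it cross-attends to the
        # audio-video latents, so both components exchange information.
        txt_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.text_stream = nn.TransformerDecoder(txt_layer, num_layers=4)

    def forward(self, av_tokens, text_tokens):
        # Split the time-aligned stream into segments (equal split for simplicity).
        segments = av_tokens.chunk(self.num_segments, dim=1)
        # Pool each encoded segment into a single latent vector.
        latents = torch.stack(
            [self.segment_encoder(s).mean(dim=1) for s in segments], dim=1
        )                                       # (batch, num_segments, dim)
        # Causal mask: each segment latent attends only to earlier segments.
        n = latents.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        av_context = self.temporal(latents, mask=mask)
        # Text tokens read from the audio-video context via cross-attention.
        return self.text_stream(text_tokens, av_context)

model = MirasolSketch()
out = model(torch.randn(1, 256, 512), torch.randn(1, 32, 512))
print(out.shape)                                # torch.Size([1, 32, 512])
```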

In the video-audio component, a novel transformer module called the Combiner extracts joint representations from each segment and compresses the data through dimensionality reduction. Each segment contains between 4 and 64 frames; in total, the current 3-billion-parameter version of the model can process videos of 128 to 512 frames. Much larger models, built primarily on text-based transformers with added modalities, can often only process 32 to 64 frames for an entire video.
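The article does not spell out the Combiner's internals, so the following is only a hedged sketch of the compression idea, using learned queries with cross-attention as one common way to reduce many tokens to a fixed few; the paper's exact mechanism may differ, and all sizes are assumptions.

```python
# Hedged sketch of the Combiner's compression step (mechanism assumed).
import torch
import torch.nn as nn

class CombinerSketch(nn.Module):
    def __init__(self, dim=512, num_latents=32):
        super().__init__()
        # A fixed number of learned queries: the output always has num_latents
        # tokens, no matter how many frames (4-64 per segment) come in.
        self.queries = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # Joint audio-video representation by simple concatenation.
        joint = torch.cat([video_tokens, audio_tokens], dim=1)   # (B, Tv+Ta, D)
        q = self.queries.unsqueeze(0).expand(joint.size(0), -1, -1)
        compressed, _ = self.attn(q, joint, joint)               # (B, num_latents, D)
        return compressed

combiner = CombinerSketch()
# E.g. 64 frames x 16 video tokens each plus 256 audio tokens -> 32 latents.
out = combiner(torch.randn(1, 1024, 512), torch.randn(1, 256, 512))
print(out.shape)  # torch.Size([1, 32, 512])
```

This fixed-size output is what keeps 512-frame videos tractable: however many tokens a segment produces, only a small constant number of joint latents is passed on.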

Google's Mirasol could be used for YouTube

In tests, Mirasol3B sets new state-of-the-art results on video question answering benchmarks while being significantly smaller than competing models and able to process longer videos. With a Combiner variant that incorporates memory, the team reduces the required computing power by a further 18 percent.
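The article does not describe the memory mechanism, so the sketch below only illustrates, under stated assumptions, why a memory can save compute: instead of attending over all previous segments, the model carries a fixed-size memory that each segment reads and rewrites, keeping the cost per segment constant as the video grows. The class name and all details are hypothetical.

```python
# Hedged sketch of a memory-augmented Combiner (all details assumed).
import torch
import torch.nn as nn

class MemoryCombinerSketch(nn.Module):
    def __init__(self, dim=512, mem_tokens=32):
        super().__init__()
        self.mem_tokens = mem_tokens
        self.initial_memory = nn.Parameter(torch.randn(mem_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.step = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, segments):               # segments: list of (B, T, D)
        memory = self.initial_memory.unsqueeze(0).expand(segments[0].size(0), -1, -1)
        states = []
        for seg in segments:
            # Fuse the running memory with the current segment's tokens,
            # then keep only a fixed number of tokens as the new memory.
            fused = self.step(torch.cat([memory, seg], dim=1))
            memory = fused[:, : self.mem_tokens]  # constant-size state
            states.append(memory)
        return torch.stack(states, dim=1)       # (B, num_segments, mem, D)

mc = MemoryCombinerSketch()
video = [torch.randn(2, 128, 512) for _ in range(4)]   # four segments
print(mc(video).shape)                                  # torch.Size([2, 4, 32, 512])
```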

In the future, models like Mirasol could be used by chatbots, such as the recently launched AI assistant for YouTube, to answer questions about videos or improve functions such as automatic categorization and chapter marking of videos.

Summary
  • Google and Google DeepMind present Mirasol, an AI model that can answer questions about video and sets new records in the process.
  • Mirasol processes time-synchronized video and audio as well as contextual text, breaking the video into segments and compressing them using a new type of transformation module called a combiner.
  • The Mirasol model could be used by YouTube in the future to enable chatbots to answer questions about videos, or to improve features such as automatic categorization and chapter labelling of videos.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.