Google and Google DeepMind unveil Mirasol, a small AI model that can answer questions about video and sets new records.
To understand video, AI models need to integrate information from different modalities, such as video, audio, and text. However, today's AI systems struggle to process diverse data streams and large amounts of data. In a new study, researchers at Google and Google DeepMind present an approach that significantly improves multimodal understanding of long-form video.
Mirasol relies on new "Combiner" transformer module
With the Mirasol AI model, the team seeks to address two key challenges: First, modalities such as video and audio are synchronized in time and occur at high sampling rates, while modalities such as titles and descriptions are asynchronous with the content itself. Second, video and audio generate large amounts of data that strain the model's capacity.
For Mirasol, the team uses combiners and autoregressive transformer models. The time-synchronized video and audio signals are processed by a model component, which splits the video into individual segments. A transformer processes each segment and learns the relationships between the segments. A separate transformer then processes the contextual text. Both components exchange information about their respective inputs.
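The segmentation step described above can be illustrated with a minimal sketch. It only shows how a time-ordered stream is cut into consecutive chunks; the function name `split_into_segments`, the use of integer frame indices as stand-ins for real frames, and the choice of 16 frames per segment are assumptions made for illustration, not details taken from the study.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(frames: Sequence[T], frames_per_segment: int) -> List[List[T]]:
    """Partition a time-ordered sequence into consecutive, non-overlapping segments.

    Hypothetical helper: the last segment is shorter if the length does not
    divide evenly.
    """
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    return [
        list(frames[start:start + frames_per_segment])
        for start in range(0, len(frames), frames_per_segment)
    ]


# Stand-in "video": 128 frame indices instead of real image data (assumption).
video = list(range(128))
segments = split_into_segments(video, frames_per_segment=16)
num_segments = len(segments)
```

With these assumed numbers, a 128-frame video yields 8 segments of 16 frames each. Audio sampled over the same time span could be cut at the same boundaries so that each segment keeps its video and audio aligned.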
In the video-audio component, a novel transformer module called the Combiner extracts joint representations from each segment and compresses the data through dimensionality reduction. Each segment contains between 4 and 64 frames; in total, the current version of the model, with 3 billion parameters, can process videos with 128 to 512 frames. Other, much larger models, which are based primarily on text transformers with additional modalities bolted on, can often process only 32 to 64 frames for the entire video.
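To make the compression idea concrete, here is a rough sketch of reducing the video and audio features of one segment to a small, fixed number of joint tokens. It is not the Combiner itself: the article describes a learned transformer module, whereas this example swaps in plain mean pooling as a stand-in. The name `combine_segment`, the token counts, and the feature width of 4 are all illustrative assumptions.

```python
from typing import List

Vector = List[float]


def combine_segment(
    video_tokens: List[Vector],
    audio_tokens: List[Vector],
    num_output_tokens: int,
) -> List[Vector]:
    """Compress one segment's video and audio tokens into fewer joint tokens.

    Stand-in for a learned module (assumption): tokens are concatenated and
    mean-pooled in contiguous groups. All vectors must share the same width.
    """
    tokens = video_tokens + audio_tokens
    n = len(tokens)
    if not 0 < num_output_tokens <= n:
        raise ValueError("num_output_tokens must be between 1 and the input token count")
    # Group boundaries that split n tokens into num_output_tokens non-empty groups.
    bounds = [k * n // num_output_tokens for k in range(num_output_tokens + 1)]
    pooled = []
    for start, end in zip(bounds, bounds[1:]):
        group = tokens[start:end]
        # Element-wise mean over the vectors in this group.
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


# Hypothetical segment: 16 video tokens and 8 audio tokens, each of width 4.
video_tokens = [[float(i)] * 4 for i in range(16)]
audio_tokens = [[100.0] * 4 for _ in range(8)]
compressed = combine_segment(video_tokens, audio_tokens, num_output_tokens=4)
```

In this toy setup, 24 input tokens per segment shrink to 4, so the downstream model sees a sequence one sixth as long. A learned module would choose what to keep rather than averaging neighbors, but the effect on sequence length is the same.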
Google's Mirasol could be used for YouTube
In tests, Mirasol3B sets new records on video question answering benchmarks while being significantly smaller than competing models and able to process longer videos. With a variant of the Combiner that incorporates memory, the team can reduce the computing power required by a further 18 percent.
In the future, models like Mirasol could be used by chatbots, such as the recently launched AI assistant for YouTube, to answer questions about videos or improve functions such as automatic categorization and chapter marking of videos.