
Meta and Stanford researchers have developed Apollo, a new family of AI models that tackles one of AI's persistent challenges: getting machines to truly understand videos.


While AI has made huge strides in processing images and text, video remains far harder for machines to handle. Videos carry complex, dynamic information that demands more computing power and raises open questions about how best to design these systems.

Schematic of the key design decisions for video-based language models: video sampling strategy, encoder architecture, training schedule, and data composition. | Image: Meta

A joint team from Meta GenAI and Stanford University conducted extensive research to answer these fundamental design questions. Their systematic approach revealed insights about how to build more effective video understanding systems.

Small-scale insights apply to larger models

The team discovered something that could transform how AI video models are developed: improvements that work in small models reliably scale up to larger ones. This means researchers can test new approaches quickly using smaller, less expensive models before implementing them in larger systems.


When it comes to processing videos, the researchers found that sampling frames at a constant rate per second of footage, rather than taking a fixed number of frames from each clip, produces the best results. Their optimal architecture uses two distinct components working together: one processes individual video frames, while the other tracks how objects and scenes change over time.
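To make the sampling idea concrete, here is a minimal Python sketch of fps-based frame selection. The function name and the 2 fps target are illustrative assumptions, not values from the Apollo paper.

```python
# Minimal sketch: sample frames at a constant rate (frames per second of video)
# instead of a fixed number of frames per clip. Illustrative only.

def sample_frame_indices(num_frames: int, video_fps: float, target_fps: float = 2.0) -> list[int]:
    """Pick frame indices so the sampled clip keeps a constant temporal density."""
    step = video_fps / target_fps  # e.g. 30 fps source, 2 fps target -> every 15th frame
    indices = []
    position = 0.0
    while round(position) < num_frames:
        indices.append(round(position))
        position += step
    return indices

# A 10-second clip and a 60-second clip (both 30 fps) keep the same temporal
# density, so longer videos simply contribute more frames.
print(len(sample_frame_indices(300, 30.0)))   # -> 20 frames for 10 s
print(len(sample_frame_indices(1800, 30.0)))  # -> 120 frames for 60 s
```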

Adding time stamps between processed video segments proved crucial for helping the model understand how visual information relates to text descriptions. This simple but effective approach helps the system maintain temporal awareness throughout the processing pipeline.
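A hedged sketch of how such timestamp interleaving might look: the timestamp format, the fixed segment length, and the `tokenize` callable are assumptions made for illustration, not Apollo's actual pipeline.

```python
from typing import Callable

def interleave_timestamps(
    segment_tokens: list[list[int]],
    seconds_per_segment: float,
    tokenize: Callable[[str], list[int]],
) -> list[int]:
    """Prefix each block of visual tokens with a textual timestamp so the
    language model can tie what it sees to when it happens."""
    sequence: list[int] = []
    for i, segment in enumerate(segment_tokens):
        start = i * seconds_per_segment
        sequence.extend(tokenize(f"<{start:.1f}s>"))  # timestamp as ordinary text tokens
        sequence.extend(segment)                       # followed by that segment's visual tokens
    return sequence
```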

Smart training beats bigger models

The research reveals that how you train an AI model matters more than its size. The team found that a carefully staged training approach, where different parts of the model are activated in sequence, produces significantly better results than training everything at once.

When training the visual components, focusing exclusively on video data helped the model develop stronger specialized capabilities. This targeted approach proved especially effective for tasks requiring detailed video understanding.
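In practice, a staged schedule like this comes down to which parameters are frozen at each step. The PyTorch sketch below is only illustrative: the three-stage split, the component names, and the video-only second stage are inferred from the description above rather than taken from Meta's published recipe.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of one model component."""
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(stage: int, vision_encoder: nn.Module,
                    connector: nn.Module, llm: nn.Module) -> None:
    """Activate components in sequence instead of training everything at once."""
    if stage == 1:
        # Stage 1: only the vision-to-language connector is trained.
        set_trainable(vision_encoder, False)
        set_trainable(connector, True)
        set_trainable(llm, False)
    elif stage == 2:
        # Stage 2: the visual side is unfrozen and trained on video-only data.
        set_trainable(vision_encoder, True)
        set_trainable(connector, True)
        set_trainable(llm, False)
    else:
        # Stage 3: all components train together end to end.
        set_trainable(vision_encoder, True)
        set_trainable(connector, True)
        set_trainable(llm, True)
```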

Getting the right mix of training data turned out to be crucial. The optimal balance includes 10-14% text data, with the remaining portion weighted slightly toward video content. This careful data composition helps the model develop both strong language understanding and video processing abilities.
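Expressed as sampler weights, such a mixture could look like the sketch below. Only the text share reflects the reported range; the image, multi-image, and video proportions are placeholder assumptions.

```python
import random

# Assumed sampling weights: roughly 14% text, the rest split across image,
# multi-image, and video data with a slight tilt toward video.
MIXTURE = {
    "text":        0.14,
    "image":       0.30,
    "multi_image": 0.12,
    "video":       0.44,
}

def sample_modality(rng: random.Random) -> str:
    """Draw the modality of the next training example according to the mixture."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Roughly 14 of every 100 sampled examples end up being pure text.
rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(1000):
    counts[sample_modality(rng)] += 1
print(counts)
```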

Chart: performance of different training data mixtures. According to Meta, a mix of about 14% text data and a higher proportion of video data achieves the best results in video understanding. | Image: Meta

The resulting Apollo models show impressive performance across different sizes. The smaller Apollo-3B outperforms similar-sized models like Qwen2-VL, while Apollo-7B competes with much larger systems. Meta has released both the code and model weights as open source, with a public demo available on Hugging Face.

Better benchmarks for video AI

The research team also tackled another industry challenge: how to test video AI models properly. They discovered that many reported improvements actually came from better language processing rather than enhanced video understanding.

To address this, they created ApolloBench, a streamlined set of testing tasks that cuts down evaluation time while better assessing how well models understand temporal relationships in videos.

Meta's findings align with a growing industry trend: thoughtful design and training strategies often matter more than raw model size. This mirrors recent developments like Microsoft's Phi-4 language model and Google's Gemini 2.0 Flash, which achieve strong results despite their relatively compact size.

Summary
  • Meta and Stanford researchers have conducted an in-depth study to determine the best practices for designing AI models that excel at understanding videos.
  • The researchers discovered that design choices that work well for smaller models also prove effective when applied to larger ones. They found that using a consistent sampling rate, incorporating specialized components for individual frames and temporal relationships, and including timestamp information all contribute to improved performance.
  • Informed by these findings, the researchers created Apollo, a family of AI video models that deliver top-notch performance across different levels of complexity, showcasing the effectiveness of the identified design principles.