Google DeepMind researchers believe video models like Veo 3 can become as flexible and general-purpose for visual tasks as large language models are for text.

Their vision is that, instead of needing a separate model for every visual task—whether it's segmentation, object recognition, or image processing—a single, unified video model could eventually handle all of them. Just as LLMs can translate, summarize, and analyze text without retraining for each task, the researchers hope video models will lay the foundation for broad visual intelligence.

In their paper, the DeepMind team shows that Veo 3 already demonstrates impressive zero-shot abilities in image processing, physical modeling, image manipulation, and visual reasoning. They describe this as an early sign of a "GPT-3 moment" for machine vision: the point at which a single video model handles a broad range of visual tasks without task-specific training, much as GPT-3 did for text.

"We believe that video models will become unifying, general-purpose foundation models for machine vision just like large language models (LLMs) have become foundation models for natural language processing (NLP)," the DeepMind researchers write.

Prompts, not fine-tuning

Veo 3 takes a text prompt and an input image and generates an eight-second, 720p video at 24 frames per second. Unlike traditional computer vision models, it doesn't need retraining or fine-tuning for each new task.
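
As a rough illustration of this prompt-driven workflow, the sketch below submits an image and a task prompt through Google's google-genai Python SDK. The model ID, the Image fields, and the polling pattern are assumptions based on the SDK's published Veo interface, not details taken from the paper.

```python
import time

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key in the GEMINI_API_KEY environment variable


def generate_clip(prompt: str, image_path: str, out_path: str) -> None:
    """Submit an image+text video generation job and save the resulting clip."""
    operation = client.models.generate_videos(
        model="veo-3.0-generate-preview",  # assumed model ID; check the current Veo docs
        prompt=prompt,
        image=types.Image(
            image_bytes=open(image_path, "rb").read(),
            mime_type="image/png",
        ),
    )
    # Generation runs as a long-running operation; poll until it completes.
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save(out_path)


generate_clip(
    prompt="All edges in the input image turn black while everything else fades to white.",
    image_path="input.png",
    out_path="edges.mp4",
)
```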

The researchers say Veo 3 can handle standard perception tasks like edge detection, segmentation, and super-resolution, as well as more complex visual challenges such as the Dalmatian illusion and Rorschach inkblots.
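
Since tasks like edge detection or segmentation have a still image as their answer, one natural way to read off the result is to grab the generated clip's final frame. A minimal helper for that, assuming an MP4 output and OpenCV installed:

```python
import cv2


def last_frame(video_path: str, out_path: str = "result.png") -> None:
    """Save the final frame of a generated clip, where a static answer would live."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames - 1, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not decode the final frame of {video_path}")
    cv2.imwrite(out_path, frame)


last_frame("edges.mp4")
```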

Veo 3 also demonstrates basic physical understanding. It can simulate buoyancy, air resistance, reflections, and color mixing, and it removes objects in Jenga-like scenes in a physically plausible order.

The model supports image manipulation tasks, too. It can remove backgrounds, recolor objects, change viewpoints, or generate new perspectives. In many cases, the researchers found that Veo 3 preserves details and textures better than specialized editors, though it sometimes introduces unwanted animation.

According to DeepMind, Veo 3 stands out for its visual reasoning: it can solve mazes, identify symmetries, sort numbers, extrapolate rules, and even complete simple Sudoku puzzles, using only images and text prompts. The team calls this process "chain-of-frames," a visual counterpart to the "chain-of-thought" reasoning seen in language models.
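
To make chain-of-frames concrete, here is one hypothetical way to pose a maze task: render the grid as an image, mark the start and goal, and ask for a clip in which the marker walks the corridors, so that the solution is the trajectory traced across frames. The maze layout, markers, and prompt wording below are illustrative choices, not the paper's exact setup.

```python
from PIL import Image, ImageDraw

# 0 = corridor, 1 = wall; a toy 5x5 maze with a solvable path (illustrative only).
MAZE = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
CELL = 64  # pixels per grid cell

img = Image.new("RGB", (5 * CELL, 5 * CELL), "white")
draw = ImageDraw.Draw(img)
for y, row in enumerate(MAZE):
    for x, wall in enumerate(row):
        if wall:
            draw.rectangle(
                [x * CELL, y * CELL, (x + 1) * CELL - 1, (y + 1) * CELL - 1],
                fill="black",
            )

# Red dot marks the start (top-left), green square the goal (bottom-right).
draw.ellipse([8, 8, CELL - 8, CELL - 8], fill="red")
draw.rectangle([4 * CELL + 8, 4 * CELL + 8, 5 * CELL - 8, 5 * CELL - 8], fill="green")
img.save("maze.png")

PROMPT = (
    "The red dot glides through the white corridors of the maze, never "
    "crossing a black wall, until it comes to rest on the green square."
)
```

Feeding maze.png and this prompt into a generation call like the one sketched above would, on the paper's account, yield a clip whose frames trace out the solved path.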

The researchers note that prompt design and visual presentation have a big impact on results. For example, using a green background can improve segmentation, while rephrasing a prompt can lead to very different outcomes.
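
One lightweight way to probe this sensitivity is to sweep the same input over several prompt phrasings and compare the resulting clips. The variants below are hypothetical and reuse the generate_clip helper sketched earlier:

```python
# Hypothetical phrasings of the same segmentation request; per the paper's
# findings, presentation details such as a green background can shift results.
VARIANTS = {
    "plain": "Segment the salient object in the image.",
    "green_bg": "The background turns solid green while the salient object stays unchanged.",
    "binary": "The salient object turns white and everything else turns black.",
}

for name, prompt in VARIANTS.items():
    generate_clip(prompt=prompt, image_path="input.png", out_path=f"seg_{name}.mp4")
```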

To help with some tasks, the API runs prompts through an LLM-based rewriter before generation. In cases like Sudoku, the researchers suspect it is this LLM, not the video model, that actually solves the puzzle. Their tests with Gemini 2.5 Pro, however, show that the LLM alone cannot solve core visual challenges such as robot navigation, maze solving, and symmetry detection from the image. The team calls Veo 3 a black box, but argues these results point to genuine reasoning emerging inside the video model itself.

Generalist vs. specialist

The DeepMind team acknowledges that Veo 3 hasn't caught up to leading specialist models like Meta's segmentation model SAMv2, but it is advancing quickly: in just six months, it has come to significantly outperform its predecessor, Veo 2. On some tasks it matches specialist models like Nano Banana, and on irregular maze problems it even takes the lead. The researchers expect that instruction tuning and reinforcement learning from human feedback (RLHF) could improve Veo 3 further, just as these methods helped early LLMs evolve.

These results might support DeepMind CEO Demis Hassabis's prediction that video models could become "world models": systems that generate physically consistent simulations for AI to learn from. DeepMind's Genie 3 already points in this direction, creating interactive environments for training AI agents in real time.

Not everyone in the field agrees. Meta's chief AI scientist, Yann LeCun, argues that the generative world-model approach behind tools like Sora is a dead end. He favors predictive architectures with physical understanding and robot control, such as V-JEPA 2, as a better path forward.

Summary
  • Google's new video model Veo 3 shows strong zero-shot abilities in image processing, physical understanding, image editing, and visual reasoning, all without special training for each task.
  • Veo 3 solves classic perception tasks, predicts physical properties, manipulates scenes, and performs visual reasoning, in some cases matching or even outperforming highly specialized models.
  • Some researchers view video models like Veo 3 as a promising approach to creating "world models" that could generate simulations for training other AI systems.