Google DeepMind researchers believe video models like Veo 3 can become as flexible and general-purpose for visual tasks as large language models are for text.

Their vision is that, instead of needing a separate model for every visual task—whether it's segmentation, object recognition, or image processing—a single, unified video model could eventually handle all of them. Just as LLMs can translate, summarize, and analyze text without retraining for each task, the researchers hope video models will lay the foundation for broad visual intelligence.

In their paper, the DeepMind team shows that Veo 3 already demonstrates impressive zero-shot abilities in image processing, physical modeling, image manipulation, and visual reasoning. They describe this as an early sign of a "GPT-3 moment" for machine vision: the point at which a single video model handles a broad range of visual tasks without task-specific training, much as GPT-3 did for text.

"We believe that video models will become unifying, general-purpose foundation models for machine vision just like large language models (LLMs) have become foundation models for natural language processing (NLP)," the DeepMind researchers write.

Prompts, not fine-tuning

Veo 3 takes a text prompt and an input image and generates an eight-second, 720p video at 24 frames per second. Unlike traditional computer vision models, it doesn't need retraining or fine-tuning for each new task.
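
As a rough illustration of this prompt-driven workflow, the sketch below submits an image and a task prompt through Google's google-genai Python SDK. The model ID, the Image fields, and the polling pattern are assumptions based on the SDK's published Veo interface, not details taken from the paper.

```python
import time

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key in the GEMINI_API_KEY environment variable


def generate_clip(prompt: str, image_path: str, out_path: str) -> None:
    """Submit an image+text video generation job and save the resulting clip."""
    operation = client.models.generate_videos(
        model="veo-3.0-generate-preview",  # assumed model ID; check the current Veo docs
        prompt=prompt,
        image=types.Image(
            image_bytes=open(image_path, "rb").read(),
            mime_type="image/png",
        ),
    )
    # Generation runs as a long-running operation; poll until it completes.
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save(out_path)


generate_clip(
    prompt="All edges in the input image turn black while everything else fades to white.",
    image_path="input.png",
    out_path="edges.mp4",
)
```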

The researchers say Veo 3 can handle standard perception tasks like edge detection, segmentation, and super-resolution, as well as more complex visual challenges such as the Dalmatian illusion and Rorschach inkblots.
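
Since tasks like edge detection or segmentation have a still image as their answer, one natural way to read off the result is to grab the generated clip's final frame. A minimal helper for that, assuming an MP4 output and OpenCV installed:

```python
import cv2


def last_frame(video_path: str, out_path: str = "result.png") -> None:
    """Save the final frame of a generated clip, where a static answer would live."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames - 1, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not decode the final frame of {video_path}")
    cv2.imwrite(out_path, frame)


last_frame("edges.mp4")
```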

Veo 3 also demonstrates basic physical understanding. It can simulate buoyancy, air resistance, reflections, and color mixing, and it removes objects in Jenga-like scenes in a physically plausible order.

The model supports image manipulation tasks, too. It can remove backgrounds, recolor objects, change viewpoints, or generate new perspectives. In many cases, the researchers found that Veo 3 preserves details and textures better than specialized editors, though it sometimes introduces unwanted animation.

According to DeepMind, Veo 3 stands out for its visual reasoning: it can solve mazes, identify symmetries, sort numbers, extrapolate rules, and even complete simple Sudoku puzzles, using only images and text prompts. The team calls this process "chain-of-frames," a visual counterpart to the "chain-of-thought" reasoning seen in language models.
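
To make chain-of-frames concrete, here is one hypothetical way to pose a maze task: render the grid as an image, mark the start and goal, and ask for a clip in which the marker walks the corridors, so that the solution is the trajectory traced across frames. The maze layout, markers, and prompt wording below are illustrative choices, not the paper's exact setup.

```python
from PIL import Image, ImageDraw

# 0 = corridor, 1 = wall; a toy 5x5 maze with a solvable path (illustrative only).
MAZE = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
CELL = 64  # pixels per grid cell

img = Image.new("RGB", (5 * CELL, 5 * CELL), "white")
draw = ImageDraw.Draw(img)
for y, row in enumerate(MAZE):
    for x, wall in enumerate(row):
        if wall:
            draw.rectangle(
                [x * CELL, y * CELL, (x + 1) * CELL - 1, (y + 1) * CELL - 1],
                fill="black",
            )

# Red dot marks the start (top-left), green square the goal (bottom-right).
draw.ellipse([8, 8, CELL - 8, CELL - 8], fill="red")
draw.rectangle([4 * CELL + 8, 4 * CELL + 8, 5 * CELL - 8, 5 * CELL - 8], fill="green")
img.save("maze.png")

PROMPT = (
    "The red dot glides through the white corridors of the maze, never "
    "crossing a black wall, until it comes to rest on the green square."
)
```

Feeding maze.png and this prompt into a generation call like the one sketched above would, on the paper's account, yield a clip whose frames trace out the solved path.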

The researchers note that prompt design and visual presentation have a big impact on results. For example, using a green background can improve segmentation, while rephrasing a prompt can lead to very different outcomes.
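
One lightweight way to probe this sensitivity is to sweep the same input over several prompt phrasings and compare the resulting clips. The variants below are hypothetical and reuse the generate_clip helper sketched earlier:

```python
# Hypothetical phrasings of the same segmentation request; per the paper's
# findings, presentation details such as a green background can shift results.
VARIANTS = {
    "plain": "Segment the salient object in the image.",
    "green_bg": "The background turns solid green while the salient object stays unchanged.",
    "binary": "The salient object turns white and everything else turns black.",
}

for name, prompt in VARIANTS.items():
    generate_clip(prompt=prompt, image_path="input.png", out_path=f"seg_{name}.mp4")
```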

To help with some tasks, the API runs prompts through an LLM-based rewriter before generation. In cases like Sudoku, the researchers suspect it is this LLM, not the video model, that actually solves the puzzle. Their tests with Gemini 2.5 Pro, however, show that the LLM alone cannot solve core visual challenges such as robot navigation, maze solving, and symmetry detection from the image. The team calls Veo 3 a black box, but argues these results point to genuine reasoning emerging inside the video model itself.

Generalist vs. specialist

The DeepMind team acknowledges that Veo 3 hasn't caught up to leading specialist models like Meta's segmentation model SAMv2, but it is advancing quickly: in just six months, it has come to significantly outperform its predecessor, Veo 2. On some tasks it matches specialist models like Nano Banana, and on irregular maze problems it even takes the lead. The researchers expect that instruction tuning and reinforcement learning from human feedback (RLHF) could improve Veo 3 further, just as these methods helped early LLMs evolve.

These results might support DeepMind CEO Demis Hassabis's prediction that video models could become "world models": systems that generate physically consistent simulations for AI to learn from. DeepMind's Genie 3 already points in this direction, creating interactive environments for training AI agents in real time.

Not everyone in the field agrees. Meta's chief AI scientist, Yann LeCun, argues that the generative world-model approach behind tools like Sora is a dead end. He favors predictive architectures with physical understanding and robot control, such as V-JEPA 2, as a better path forward.

Summary
  • Google's new video model Veo 3 shows strong zero-shot abilities in image processing, physical understanding, image editing, and visual reasoning, all without special training for each task.
  • Veo 3 solves classic perception tasks, predicts physical properties, manipulates scenes, and performs visual reasoning, in some cases matching or even outperforming highly specialized models.
  • Some researchers view video models like Veo 3 as a promising approach to creating "world models" that could generate simulations for training other AI systems.