Meta has introduced V-JEPA 2, a 1.2-billion-parameter video model designed to connect intuitive physical understanding with robot control. The system achieves state-of-the-art results on motion recognition and action prediction benchmarks.
Humans develop an understanding of the physical world at a young age—long before they learn to talk. When we throw a tennis ball in the air, we know instinctively it will fall back down. Meta’s V-JEPA 2 aims to build a similar kind of intuitive physics into AI.
The new model, with 1.2 billion parameters, is based on the Joint Embedding Predictive Architecture (JEPA), which Meta’s chief scientist Yann LeCun sees as a crucial step toward more advanced machine intelligence. Like its predecessor V-JEPA, it takes a fundamentally different approach from other “world models” such as OpenAI’s video generator Sora and from large language models.
“World models are meant to let AI agents plan and reason in the physical world,” Meta explains in its technical report. LeCun draws a sharp distinction between his JEPA approach and generative models: while Sora and language models try to predict every detail down to the pixel or word, JEPA focuses on the essential—just the predictable parts of a scene. LeCun has even called generative models like Sora a dead end on the path to machine intelligence.
Learning by observation, not pixel generation
The architecture is what sets V-JEPA 2 apart. Instead of working at the pixel level, the model operates in a learned representation space. It doesn’t try to predict the position of every leaf on a tree or the precise shape of every shadow. Instead, it learns abstract concepts like “the ball will fall” or “the object moves to the left.”
This abstraction makes the system both more efficient and more robust. Generative models like Sora waste computing power generating irrelevant visual details; V-JEPA 2 focuses only on the information needed for planning and control. That efficiency shows up in practice: it needs just 16 seconds to plan a robot action, while Nvidia’s generative Cosmos model takes four minutes.
Two-stage training with minimal robot data
Training V-JEPA 2 takes place in two distinct phases. In the first phase, the model learns from more than a million hours of video and a million images—without human supervision. The dataset is carefully curated, featuring multiple viewpoints: first-person videos, third-person action shots, tutorial recordings, and filtered YouTube content.
Technically, the system uses a powerful encoder with a billion parameters to translate video into abstract representations. A distinctive aspect of the training is that parts of the video are masked, and a “predictor” must infer what happens in those gaps, not in terms of pixels but as abstract representations. This teaches the system to focus on the most important, predictable elements of a scene.
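In rough code terms, this first training stage looks something like the minimal sketch below. It is purely illustrative: the module names, tensor sizes, and simple linear layers stand in for Meta’s far larger networks, and the exact loss and masking scheme here are assumptions rather than the published recipe.

```python
# Illustrative sketch of JEPA-style masked prediction in representation space
# (not Meta's code): a context encoder sees only the unmasked video patches,
# a separate target encoder sees everything, and a predictor must match the
# target representations at the masked positions. Pixels are never compared.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, EMBED_DIM, NUM_PATCHES = 768, 256, 64   # invented sizes

encoder = nn.Linear(PATCH_DIM, EMBED_DIM)          # context encoder (trained)
target_encoder = nn.Linear(PATCH_DIM, EMBED_DIM)   # target copy, no gradients
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)        # fills in masked positions

video_patches = torch.randn(2, NUM_PATCHES, PATCH_DIM)   # tokenized clip
mask = torch.rand(2, NUM_PATCHES) < 0.5                   # hidden regions

# Context path: masked patches are zeroed out before encoding
context = encoder(video_patches * (~mask).unsqueeze(-1).float())

# Target path: the full clip, with gradients blocked
with torch.no_grad():
    targets = target_encoder(video_patches)

# Predict representations (not pixels) for the hidden regions and compare
predictions = predictor(context)
loss = F.l1_loss(predictions[mask], targets[mask])
loss.backward()
```

The key point the sketch tries to capture is that the loss compares representations, never pixels, so the model is rewarded for getting the structure of a scene right rather than for reproducing its visual detail.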
The second phase introduces robot control. Remarkably, this requires just 62 hours of robot data from a public dataset. A dedicated predictor learns how robot actions change the world, based on the representations already acquired. By comparison, other robotics AI systems often need thousands of hours of specific training data and must be retrained for each new environment.
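A hedged sketch of what that second stage could look like: with the video encoder frozen, a small action-conditioned predictor learns to map the current latent state plus a robot command to the next latent state. The layer sizes, the seven-dimensional action format, and the module names are assumptions made for illustration, not Meta’s implementation.

```python
# Second training stage, sketched: the stage-1 encoder is frozen and a small
# action-conditioned predictor learns how a robot action changes the latent
# state. Shapes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, ACTION_DIM = 256, 7    # e.g. a 7-DoF arm command (assumed)

frozen_encoder = nn.Linear(768, EMBED_DIM).eval()   # stand-in for stage-1 encoder
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

# Predicts the next latent state from (current latent state, action)
action_predictor = nn.Sequential(
    nn.Linear(EMBED_DIM + ACTION_DIM, 512), nn.GELU(), nn.Linear(512, EMBED_DIM)
)

def dynamics_loss(frame_t, frame_t1, action):
    """frame_t, frame_t1: (batch, 768) frame features; action: (batch, 7)."""
    z_t = frozen_encoder(frame_t)
    with torch.no_grad():
        z_t1 = frozen_encoder(frame_t1)            # target: observed next state
    z_pred = action_predictor(torch.cat([z_t, action], dim=-1))
    return F.l1_loss(z_pred, z_t1)

loss = dynamics_loss(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 7))
loss.backward()
```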
Video: Meta
Strong results across multiple tasks
V-JEPA 2 delivers strong results on several standard benchmarks. On the Something-Something v2 dataset—which tests recognition of complex movements and interactions, like “moving something from left to right” or “flipping a container and emptying it”—the model achieves 77.3 percent accuracy, outpacing other leading video models.
Its action prediction is especially impressive. On the Epic-Kitchens-100 test, which tracks everyday kitchen activities, V-JEPA 2 can predict the next action (such as “cutting an onion” or “placing a pot on the stove”) one second in advance with 39.7 percent accuracy, a 44 percent improvement over previous systems. When paired with a language model, it can also answer complex questions about video content, achieving top scores on several video question-answering benchmarks.
From video understanding to robot control
Meta tested V-JEPA 2 on actual robots, using only the public DROID dataset—a collection of videos showing various robot movements. Without any additional training, the model was able to control two different Franka robot arms in new lab environments. For tasks like grasping a cup or picking up and placing objects, it achieved success rates between 65 and 80 percent.
Video: Meta
Here’s how it works: the robot is shown a photo of the goal state—say, a cup placed at a specific spot. V-JEPA 2 then plans a step-by-step path to reach that goal, simulating various possible movements in its learned model of physics and picking the most promising. After each move, it checks its current position and replans the next steps.
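Sketched in the same illustrative style as above, that planning loop might look roughly like this; the random sampling of candidate action sequences is a simplification, and the real system may well use a more sophisticated optimizer.

```python
# Hedged sketch of goal-image planning: encode the goal, roll out candidate
# action sequences in latent space with the action-conditioned predictor,
# execute the first action of the best-scoring sequence, then replan.
# All names and sizes are assumptions, not Meta's implementation.
import torch
import torch.nn as nn

EMBED_DIM, ACTION_DIM, HORIZON, NUM_CANDIDATES = 256, 7, 5, 256

# Stand-in for the action-conditioned predictor from the training sketch above
action_predictor = nn.Sequential(
    nn.Linear(EMBED_DIM + ACTION_DIM, 512), nn.GELU(), nn.Linear(512, EMBED_DIM)
)

def plan_next_action(z_current, z_goal):
    """Return the first action of the sampled sequence whose predicted final
    latent state lands closest to the goal representation."""
    candidates = torch.randn(NUM_CANDIDATES, HORIZON, ACTION_DIM)
    z = z_current.expand(NUM_CANDIDATES, -1)
    with torch.no_grad():
        for t in range(HORIZON):                   # roll out each candidate
            z = action_predictor(torch.cat([z, candidates[:, t]], dim=-1))
        cost = (z - z_goal).norm(dim=-1)           # distance to goal in latent space
    return candidates[cost.argmin(), 0]            # execute only the first step

# Receding-horizon loop: after executing the action, re-encode the camera image
# and call plan_next_action again with the updated latent state.
action = plan_next_action(torch.randn(1, EMBED_DIM), torch.randn(1, EMBED_DIM))
```

Executing only the first action and then replanning from the newly observed state is the receding-horizon pattern the description above implies: each plan is cheap enough to redo after every move.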
New benchmarks reveal gap to human intuitive physics
Alongside V-JEPA 2, Meta is releasing three new benchmarks to systematically test how well AI systems really understand physical reality. The first, IntPhys 2, is inspired by developmental psychology experiments: it shows pairs of videos, one of which violates physical laws—like a ball falling upward instead of down. While humans spot these impossibilities instantly, even the most advanced AI models, including V-JEPA 2, perform barely above chance.
The second benchmark, MVPBench (Minimal Video Pairs), goes even further. It uses cleverly designed video pairs that look almost identical but require opposite answers to the same question. This blocks models from relying on superficial visual or linguistic cues. Here, V-JEPA 2 scores 44.5 percent “paired accuracy”—the best of any tested system, well ahead of the previous leader InternVL-2.5 at 39.9 percent—but still far from human-level performance.
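As a point of reference, paired accuracy can presumably be computed along the lines of the short sketch below, assuming the usual definition in which a model gets credit for a pair only when it answers both of its near-identical videos correctly:

```python
# Sketch of the paired-accuracy metric under the assumed definition: a pair
# counts only if *both* answers are right, which rules out guessing from
# superficial visual or linguistic cues.
def paired_accuracy(predictions, answers):
    """predictions, answers: lists of (video_a, video_b) answer tuples."""
    correct_pairs = sum(
        pred_a == ans_a and pred_b == ans_b
        for (pred_a, pred_b), (ans_a, ans_b) in zip(predictions, answers)
    )
    return correct_pairs / len(answers)

# Example: the model gets both answers right on only one of two pairs -> 0.5
print(paired_accuracy([("left", "right"), ("up", "up")],
                      [("left", "right"), ("up", "down")]))
```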
The third, CausalVQA, tests causal reasoning in physical scenarios. Models must not only describe what’s happening in a video, but also answer counterfactual questions (“What would have happened if...”), predict future events, and suggest actions. The pattern is clear: today’s AI systems are good at describing what they see, but struggle to imagine alternative outcomes or predict what’s next.
Toward hierarchical models
Despite its strengths, V-JEPA 2 still faces challenges. It struggles with long-term planning—it can predict the next few seconds, but not carry out complex, multi-step tasks. The system is also sensitive to camera position, which can cause problems in real-world use.
Meta’s vision for the future involves hierarchical models that can plan across multiple timescales—from fractions of a second to minutes or hours. Integrating additional senses like sound or touch is also on the roadmap.
With the JEPA approach, LeCun’s team is taking a different path from many other tech giants. At the same time, Meta hasn’t given up on generative AI as a path to superintelligence: Mark Zuckerberg is currently assembling a team focused on advancing this line of research.