World Action Models give robots the ability to simulate consequences before they move
Today's robotics AI has a basic weakness: models learn to map camera images directly to movements. But they don't understand how the world actually changes as a result of their actions.
A new survey paper from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore is the first to systematically catalog a class of models designed to close that gap: World Action Models.

Robots that simulate their own near future
Existing vision-language-action models mostly learn direct mappings from observations to matching actions. World Action Models go further. They also model how the environment will likely change, then couple that prediction to action generation.
The payoff is practical, the authors say. A model that simulates the consequences of a movement before executing it generalizes better to unfamiliar objects and settings. More importantly, it can learn from video footage where no robot actions are labeled at all, such as everyday first-person videos. That kind of data was nearly useless for traditional robotics AI.
Pure video generators can produce plausible future frames, but they aren't tied to control signals. A research team at Peking University recently drew exactly that distinction in its unified definition of world models. World Action Models meet both conditions at once: they predict the future and tie that prediction to control signals.

Two core architectures
The researchers sort about a hundred papers into two architectural lines. The first, Cascaded WAMs, works in two steps. A world model first generates an image or video of what the scene should look like next. Then a second module pulls the right control commands from that output. Early work like UniPi generates complete videos and derives motion through a learned inverse model.

Other approaches like AVDC or 3DFlowAction use motion fields from which the robot's trajectory can be computed geometrically. Still others, such as VPP or LAPA, skip visible images entirely and predict the future in compressed, abstract representations. That saves the compute otherwise needed to render every single pixel.
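The two-step structure of a cascaded WAM can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than any surveyed model: the "image" is just a gripper position, `toy_world_model` replaces a learned video generator, and `toy_inverse_model` replaces a learned inverse model like the one UniPi uses.

```python
from typing import Callable, Tuple

State = Tuple[float, float]   # toy stand-in for an image: gripper (x, y)
Action = Tuple[float, float]  # toy control command: displacement (dx, dy)


def toy_world_model(obs: State, goal: State, step: float = 0.1) -> State:
    """Stage 1 (hypothetical): 'imagine' the next observation.

    A real cascaded WAM would generate an image or video here; this toy
    moves the gripper a fixed fraction of the way toward the goal.
    """
    return (obs[0] + step * (goal[0] - obs[0]),
            obs[1] + step * (goal[1] - obs[1]))


def toy_inverse_model(obs: State, predicted_next: State) -> Action:
    """Stage 2 (hypothetical): recover the action that explains the change.

    Real systems learn this mapping; in this toy world the action is
    simply the displacement between the two states.
    """
    return (predicted_next[0] - obs[0], predicted_next[1] - obs[1])


def cascaded_step(
    obs: State,
    goal: State,
    world_model: Callable[[State, State], State],
    inverse_model: Callable[[State, State], Action],
) -> Action:
    """Predict the future first, then derive the action from that prediction."""
    predicted_next = world_model(obs, goal)
    return inverse_model(obs, predicted_next)


action = cascaded_step((0.0, 0.0), (1.0, 2.0), toy_world_model, toy_inverse_model)
print(action)
```

The point of the sketch is the data flow: the action module never sees the goal directly, only the imagined next state, so any error in the prediction carries straight through to the command.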
The second line, Joint WAMs, combines both tasks in a single model. Work like GR-1, GR-2, or WorldVLA treats images and actions as a unified token sequence. Diffusion-based variants such as PAD, UWM, or DreamZero generate the future frame and the movement in parallel. Nvidia's Cosmos Policy can use the same architecture as a controller, a simulator, or an evaluation model.
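What "a unified token sequence" might look like can be sketched in a few lines. This is an assumption-laden illustration: the token ids, the `img`/`act` tags, and the frame-then-action ordering are invented for the example, and the actual tokenization in GR-1, GR-2, or WorldVLA may differ.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (modality tag, token id); both are illustrative


def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[Token]:
    """Merge per-timestep image tokens and action tokens into one sequence.

    The assumed order per timestep is: all image tokens, then all action
    tokens. A single sequence model could then be trained to predict the
    next token regardless of modality.
    """
    if len(frames) != len(actions):
        raise ValueError("need one action chunk per frame")
    sequence: List[Token] = []
    for frame_tokens, action_tokens in zip(frames, actions):
        sequence.extend(("img", t) for t in frame_tokens)
        sequence.extend(("act", t) for t in action_tokens)
    return sequence


# Two timesteps, two image tokens and one action token each (made-up ids).
sequence = interleave(frames=[[11, 12], [13, 14]], actions=[[7], [8]])
print(sequence)
```

Because both modalities share one sequence, the same next-token objective covers predicting the future frame and predicting the movement.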

Nvidia pursues a similar dual role with DreamDojo, a world model that takes control commands and generates a simulated visual future from them. The survey also discusses π0.7, which uses the world model not as a replacement but as a supplier. It feeds imagined future frames into the context of a pretrained robotics AI, which then generates the movement.

The real bottleneck is data
A whole chapter digs into where training data comes from. Four sources shape the field. Teleoperation data from remotely controlled robots is precise but expensive and limited to a handful of environments. Datasets like Open X-Embodiment or DROID try to fix that by pooling data from many labs. Portable demo tools like the Universal Manipulation Interface sidestep hardware dependency: people perform tasks with handheld grippers in everyday settings.

The RDT2 dataset collects about 10,000 hours of material this way. Simulations like RoboCasa or RoboTwin 2.0 deliver unlimited trajectories with perfect depth data but suffer from the well-known sim-to-real gap. Nvidia leans hard into this approach with GR00T N1, training humanoid robots mostly in synthetic environments.
Egocentric everyday videos from Ego4D offer unlimited variety but contain no action labels. This is where World Action Models show their edge. Because predicting future frames requires no motion labels, they could still learn from those videos.

Evaluation can't keep up with development
The authors are especially critical about how well these models are actually tested. Visual quality gets measured with standard metrics like PSNR or FVD, but those say little about whether a video is physically plausible.
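A toy calculation shows why a pixel metric like PSNR can mislead. PSNR is defined as 10 · log10(MAX² / MSE). The one-dimensional "frames" below are made up for illustration: in one prediction the object is off by a single pixel, in the other it has vanished entirely.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], predicted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("frames must be non-empty and the same length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 8-pixel frames: a bright object (255) on a dark background (0).
reference = [0, 0, 255, 0, 0, 0, 0, 0]   # ground truth: object at index 2
shifted   = [0, 0, 0, 255, 0, 0, 0, 0]   # object one pixel off (plausible)
vanished  = [0, 0, 0, 0, 0, 0, 0, 0]     # object gone (physically implausible)

psnr_shifted = psnr(reference, shifted)
psnr_vanished = psnr(reference, vanished)
print(f"shifted:  {psnr_shifted:.2f} dB")
print(f"vanished: {psnr_vanished:.2f} dB")
```

In this toy case the frame where the object disappears scores higher than the frame where it is merely misplaced, because the metric only counts pixel differences and has no notion of object permanence.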
Specialized benchmarks test different slices of physical plausibility. VideoPhy evaluates physical interaction scenarios. Physics-IQ tests predictions of real physical events from video frames. WorldModelBench checks explicit rules like gravity, conservation of mass, rigid body mechanics, and impenetrability.
One especially sharp finding comes from the "Wow, wo, val!" benchmark. It checks whether a generated video can actually yield an executable movement. Many visually convincing models drop to near-zero success rates on this test, the survey reports.
So a video can look realistic and still contain nothing useful for control. The authors call this the core problem: there's no metric for whether the imagined future and the executed movement are causally consistent.

Validation for Yann LeCun's JEPA approach
So far, the authors say, no controlled study compares the different architectures under identical conditions. Nearly all models work only with camera images, even though tasks with fine contact need tactile and force data. Compute is still a bottleneck, too. DreamZero manages about seven predictions per second; traditional robot controllers run at around fifty.
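The size of that compute gap is easier to see as a time budget. The short calculation below takes the article's approximate rates (seven and fifty per second) at face value; the variable names are illustrative only.

```python
def period_ms(rate_hz: float) -> float:
    """Time available per cycle, in milliseconds, at the given rate."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_hz


PREDICTION_HZ = 7.0   # approximate DreamZero rate quoted above
CONTROL_HZ = 50.0     # approximate rate of a traditional controller

prediction_ms = period_ms(PREDICTION_HZ)
control_ms = period_ms(CONTROL_HZ)
cycles_per_prediction = CONTROL_HZ / PREDICTION_HZ

print(f"{prediction_ms:.0f} ms per prediction vs {control_ms:.0f} ms per control cycle")
print(f"about {cycles_per_prediction:.1f} control cycles elapse per prediction")
```

Under these figures, a controller running at fifty cycles per second would pass through roughly seven cycles while waiting for a single new prediction.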
The authors also raise a safety question. A model that confidently predicts a wrong future can kick off long action chains that are hard to stop. But that same predictive ability could also check planned movements against physical rules before they're executed.
Meta's V-JEPA 2 showed a few months ago that self-supervised video world models can skip generating visible pixels entirely, predicting only abstract representations of the future instead. The survey authors see this as one of the most promising ways to cut the heavy compute cost of explicit video generation without losing the physical grounding that makes predictions useful. A full list of all discussed papers is available on GitHub.