
World Action Models give robots the ability to simulate consequences before they move

Image: Nano Banana Pro prompted by THE DECODER

Today's robotics AI has a basic weakness: models learn to map camera images directly to movements. But they don't understand how the world actually changes as a result of their actions.

A new survey paper from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore is the first to systematically catalog a class of models designed to close that gap: World Action Models.

Tree diagram of the WAM research landscape with two main branches for Joint WAM and Cascaded WAM, further split into diffusion-based, autoregressive, explicit, and implicit approaches, with models like Cosmos Policy, DreamZero, GR-1, GR-2, UniPi, and Vidar plotted on a timeline from 2024 to 2026.
The authors map all current World Action Models along two main branches, showing how joint and cascaded architectures have branched out since 2024. | Image: Wang et al.

Robots that simulate their own near future

Existing vision-language-action models mostly learn direct mappings from observations to matching actions. World Action Models go further. They also model how the environment will likely change, then couple that prediction to action generation.

The payoff is practical, the authors say. A model that simulates the consequences of a movement before executing it generalizes better to unfamiliar objects and settings. More importantly, it can learn from video footage where no robot actions are labeled at all - everyday first-person videos, for example. That kind of data was nearly useless for traditional robotics AI.

Pure video generators can produce plausible future frames, but they aren't tied to control signals. A research team at Peking University recently drew exactly that distinction in its unified definition of world models. World Action Models meet both conditions at once: they predict how the scene will change, and they couple that prediction to control.

Schematic comparison of VLA, WAM, and WM with inputs Current Obs and Language and outputs Action and Next Obs, alongside a set diagram showing WAM as a superset of VAM and Video Policy.
Unlike standard VLA models, WAMs predict not just actions but also the next frame of the environment. | Image: Wang et al.

Two core architectures

The researchers sort about a hundred papers into two architectural lines. The first, Cascaded WAMs, works in two steps. A world model first generates an image or video of what the scene should look like next. Then a second module pulls the right control commands from that output. Early work like UniPi generates complete videos and derives motion through a learned inverse model.

Three architecture diagrams for Cascaded WAMs: left shows a video generator with a downstream inverse dynamics module, center shows a video generator with geometric action extraction, right shows a video generator with latent representation and IDM.
In cascaded architectures, a video generator first plans the next state before a second module derives concrete actions from it. | Image: Wang et al.
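
The two-step structure can be sketched in a few lines of Python. This is a toy illustration under heavy assumptions, not any of the surveyed models: the "scene" is a single 1-D gripper position, `predict_next_obs` stands in for the video generator, and `inverse_dynamics` stands in for the learned inverse model. All function names and the step limit are hypothetical.

```python
# Toy cascaded pipeline: predict the next state first, then derive the action.
# All names and numbers here are illustrative assumptions.

MAX_STEP = 0.5  # assumed per-step movement limit of the imagined gripper


def predict_next_obs(obs: float, goal: float) -> float:
    """Stand-in for the world model: imagine where the gripper should be next."""
    delta = max(-MAX_STEP, min(MAX_STEP, goal - obs))
    return obs + delta


def inverse_dynamics(obs: float, next_obs: float) -> float:
    """Stand-in for the inverse model: which action turns obs into next_obs?"""
    return next_obs - obs


def cascaded_step(obs: float, goal: float) -> tuple[float, float]:
    """Run both stages and return (imagined next observation, action)."""
    imagined = predict_next_obs(obs, goal)
    action = inverse_dynamics(obs, imagined)
    return imagined, action


def rollout(obs: float, goal: float, steps: int) -> list[float]:
    """Apply the derived actions in a trivial simulated environment."""
    trajectory = [obs]
    for _ in range(steps):
        _, action = cascaded_step(obs, goal)
        obs = obs + action  # the toy environment executes the action exactly
        trajectory.append(obs)
    return trajectory
```

Calling `rollout(0.0, 1.2, 4)` moves the toy gripper toward the goal in clamped steps. The point of the split is that the action is never predicted directly; it is read off the imagined next state.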

Other approaches like AVDC or 3DFlowAction use motion fields from which the robot's trajectory can be computed geometrically. Still others - VPP or LAPA, for instance - skip visible images entirely and predict the future in compressed, abstract representations. That saves the compute otherwise needed to render every single pixel.

The second line, Joint WAMs, combines both tasks in a single model. Work like GR-1, GR-2, or WorldVLA treats images and actions as a unified token sequence. Diffusion-based variants such as PAD, UWM, or DreamZero generate the future frame and the movement in parallel. Nvidia's Cosmos Policy can use the same architecture as a controller, a simulator, or an evaluation model.

Four diagrams for Joint WAM architectures: a single DiT backbone on the left and three multi-stream variants on the right using cross-attention, hidden-state coupling, and a shared encoder.
Joint WAMs produce the future frame and the movement within the same model. Sometimes both share a single network; sometimes they run in separate streams that exchange information. | Image: Wang et al.
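
One way to picture the "unified token sequence" idea is to interleave image tokens and action tokens into one stream that a single autoregressive model would consume. The sketch below is a toy under stated assumptions: the vocabulary sizes, the offset scheme, and the `interleave` helper are hypothetical and not taken from GR-1, GR-2, or WorldVLA.

```python
# Toy illustration: merge image tokens and action tokens into one sequence.
# Vocabulary sizes and the offset scheme are assumptions for this sketch.

IMAGE_VOCAB = 1024   # assumed number of discrete image tokens
ACTION_VOCAB = 256   # assumed number of discrete action tokens
ACTION_OFFSET = IMAGE_VOCAB  # action ids are placed after the image ids


def interleave(frames: list[list[int]], actions: list[list[int]]) -> list[int]:
    """Build [frame_0, action_0, frame_1, action_1, ...] as one flat id list."""
    if len(frames) != len(actions):
        raise ValueError("need one action chunk per frame")
    sequence: list[int] = []
    for frame_tokens, action_tokens in zip(frames, actions):
        sequence.extend(frame_tokens)
        sequence.extend(ACTION_OFFSET + a for a in action_tokens)
    return sequence


def is_action_token(token_id: int) -> bool:
    """In the shared vocabulary, ids in the offset range are action tokens."""
    return ACTION_OFFSET <= token_id < ACTION_OFFSET + ACTION_VOCAB


frames = [[5, 17, 900], [6, 18, 901]]   # two tiny "images", three tokens each
actions = [[3, 250], [4, 251]]          # two action tokens per step
sequence = interleave(frames, actions)
print(sequence)
# [5, 17, 900, 1027, 1274, 6, 18, 901, 1028, 1275]
```

In a setup like this, a single next-token objective would cover both upcoming frames and upcoming actions, which is roughly what "joint" means here.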

Nvidia pursues a similar dual role with DreamDojo, a world model that takes control commands and generates a simulated visual future from them. The survey also discusses π0.7, which uses the world model not as a replacement for the policy but as a supplier of context. It feeds imagined future frames into the context of a pretrained robotics AI, which then generates the movement.

The real bottleneck is data

A whole chapter digs into where training data comes from. Four sources shape the field. Teleoperation data from remotely controlled robots is precise but expensive and limited to a handful of environments. Datasets like Open X-Embodiment or DROID try to fix that by pooling data from many labs. Portable demo tools like the Universal Manipulation Interface sidestep hardware dependency: people perform tasks with handheld grippers in everyday settings.

Scatter plot rating four data sources for WAM training - robot teleoperation, portable human demonstrations, simulation, and egocentric human videos - along axes for transfer difficulty and scaling difficulty.
The easier a data source is to scale, the harder it typically is to transfer to a robot. Egocentric human videos are the extreme case. | Image: Wang et al.

The RDT2 dataset collects about 10,000 hours of material with such handheld demonstration tools. Simulations like RoboCasa or RoboTwin 2.0 deliver unlimited trajectories with perfect depth data but suffer from the well-known sim-to-real gap. Nvidia leans hard into this approach with GR00T N1, training humanoid robots mostly in synthetic environments.

Egocentric everyday videos from Ego4D offer unlimited variety but contain no action labels. This is where World Action Models show their edge. They could learn to predict future frames from those videos even though no motion data is available.

Evaluation can't keep up with development

The authors are especially critical about how well these models are actually tested. Visual quality gets measured with standard metrics like PSNR or FVD, but those say little about whether a video is physically plausible.
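
A small worked example shows why a pixel metric can miss physical plausibility. The sketch below implements the standard PSNR formula in plain Python and applies it to a made-up one-row "frame" with a bright ball. The frames and the scenario are illustrative assumptions, not data from the survey.

```python
import math

MAX_PIXEL = 255


def psnr(reference: list[int], prediction: list[int]) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(prediction):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(MAX_PIXEL ** 2 / mse)


# A one-row "frame" with a bright ball at index 3.
truth = [0, 0, 0, 255, 0, 0, 0, 0]
shifted = [0, 0, 255, 0, 0, 0, 0, 0]   # ball is one pixel off but still exists
vanished = [0, 0, 0, 0, 0, 0, 0, 0]    # ball disappears, physically implausible

print(round(psnr(truth, shifted), 2))   # 6.02
print(round(psnr(truth, vanished), 2))  # 9.03
```

In this toy case the frame where the ball simply vanishes scores higher than the frame where it is slightly misplaced, because PSNR only counts pixel differences.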

Specialized benchmarks test different slices of physical plausibility. VideoPhy evaluates physical interaction scenarios. Physics-IQ tests predictions of real physical events from video frames. WorldModelBench checks explicit rules like gravity, conservation of mass, rigid body mechanics, and impenetrability.

One especially sharp finding comes from the "Wow, Where, Val!" benchmark. It checks whether a generated video can actually yield an executable movement. Many visually convincing models drop to near-zero success rates on this test, the survey reports.

So a video can look realistic and still contain nothing useful for control. The authors call this the core problem: there's no metric for whether the imagined future and the executed movement are causally consistent.

Validation for Yann LeCun's JEPA approach

So far, the authors say, no controlled study compares the different architectures under identical conditions. Nearly all models work only with camera images, even though tasks with fine contact need tactile and force data. Compute is still a bottleneck, too. DreamZero manages about seven predictions per second; traditional robot controllers run at around fifty.
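
To put the two rates side by side, the arithmetic can be written out as follows. The figures are the approximate ones quoted above, and the variable names are only illustrative.

```python
# Approximate rates quoted in the article, in predictions or cycles per second.
DREAMZERO_HZ = 7
CONTROLLER_HZ = 50

dreamzero_ms = 1000 / DREAMZERO_HZ      # time per prediction
controller_ms = 1000 / CONTROLLER_HZ    # time per control cycle
cycles_per_prediction = CONTROLLER_HZ / DREAMZERO_HZ

print(f"{dreamzero_ms:.0f} ms per prediction")                        # 143 ms
print(f"{controller_ms:.0f} ms per control cycle")                    # 20 ms
print(f"{cycles_per_prediction:.1f} control cycles per prediction")   # 7.1
```

At these rates, a conventional controller would complete roughly seven cycles in the time the model needs for one prediction.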

The authors also raise a safety question. A model that confidently predicts a wrong future can kick off long action chains that are hard to stop. But that same predictive ability could also check planned movements against physical rules before they're executed.

Meta's V-JEPA 2 showed a few months ago that self-supervised video world models can skip generating visible pixels entirely, predicting only abstract representations of the future instead. The survey authors see this as one of the most promising ways to cut the heavy compute cost of explicit video generation without losing the physical grounding that makes predictions useful. A full list of all discussed papers is available on GitHub.
