New benchmark confirms AI video generators look stunning but still can't reason about the world

Modern video generators like Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms a recurring finding: visual quality and actual world understanding are two different things.

Instead of focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a way that makes sense: physically, socially, logically, and informationally.

Consider a basic test case: give a generator an image of an apple on a branch and tell it to drop the apple. The result might look great, with smooth motion, realistic textures, and nice lighting, and still get the physics fundamentally wrong. The apple might fly upward, pop like a balloon, or follow a trajectory that gravity would never produce. Standard quality metrics would still reward that video for its realism. That's the gap WorldReasonBench is designed to catch.

[Image: four color-coded quadrants display WorldReasonBench's 22 task categories with sample images and prompts. The dimensions World Knowledge, Human Centric, Logic Reasoning, and Information-Based group tasks such as falling dominoes, car washing, logic puzzles, and diagram interpretation.]
WorldReasonBench breaks video generator evaluation into four reasoning dimensions with 22 subcategories, from physical mechanics to diagram logic. | Image: Wu et al.

WorldReasonBench includes about 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interaction), logical reasoning (math, geometry, science experiments), and information-based reasoning (reading data and diagrams).

[Image: two-part flowchart. Top: the WorldReasonBench pipeline, consisting of taxonomy, data collection via Qwen Image Edit, and prompt design with human oversight. Bottom: the WorldRewardBench pipeline, featuring 13 video models, eight generated videos each, 15 annotators, and re-annotation in cases of high disagreement.]
The setup splits into the WorldReasonBench task catalog and WorldRewardBench, a preference benchmark where 13 video models go head-to-head. | Image: Wu et al.

Scoring works in two stages. First, a process-aware method uses structured questions to check whether the video reaches the right end state in a plausible way. Then a second pass rates reasoning quality, temporal consistency, and visual aesthetics. Alongside the benchmark, the team also released WorldRewardBench, a dataset of about 6,000 video comparisons ranked by trained annotators.
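The first scoring stage can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Check` schema, the "process" and "end_state" labels, and the rule that a video earns no credit unless it reaches the right end state are hypothetical and are not the formula from the paper.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One structured yes/no question about a generated video (hypothetical schema)."""
    question: str
    phase: str    # "process" (how the scene unfolds) or "end_state" (where it ends up)
    passed: bool  # answer assigned by a judge model or a human annotator


def process_aware_score(checks: list[Check]) -> float:
    """Return a score in [0, 1]. The gating rule is an assumption, not the paper's."""
    end_checks = [c for c in checks if c.phase == "end_state"]
    process_checks = [c for c in checks if c.phase == "process"]
    # A video that never reaches the right end state gets no credit.
    if not end_checks or not all(c.passed for c in end_checks):
        return 0.0
    # With the end state reached, credit is the share of plausible process steps.
    if not process_checks:
        return 1.0
    return sum(c.passed for c in process_checks) / len(process_checks)


# Illustrative answers for the apple example; these values are made up.
apple_drop = [
    Check("Does the apple detach from the branch?", "process", True),
    Check("Does the apple speed up as it falls?", "process", False),
    Check("Is the apple on the ground at the end?", "end_state", True),
]
score = process_aware_score(apple_drop)
print(score)  # 0.5
```

The point of the sketch is that a clip can end in the right place and still lose credit for how it got there, which a pure image-quality metric would not register.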

Commercial models lead by a wide margin, but logic trips up everyone

The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). Commercial generators scored roughly double what open-source models managed on the core reasoning metric, with no statistical overlap between the two groups.

[Image: three case studies with red markings highlighting the errors. Veo-3.1 renders a double row of dominoes in a physically implausible manner, Seedance 2.0 animates the wrong mechanism for a gripper robot, and the expected rotational movement of the cable in a circuit diagram is not reproduced.]
Even videos that look convincing fall apart under closer inspection: falling dominoes, a claw machine, and a simple circuit all trip up the tested models. | Image: Wu et al.

ByteDance's Seedance 2.0 came out on top, finishing first in nearly nine out of ten statistical re-runs. Veo 3.1-Fast did best on world knowledge, while Sora 2 led on human-centered scenes. Seedance 2.0 also beat Veo 3.1-Fast, Kling, and Wan 2.6 in human ratings.
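A figure like "first in nearly nine out of ten re-runs" is typically produced by some form of resampling. The article does not say which procedure the researchers used, so the sketch below assumes a plain bootstrap over test cases. The function name `first_place_rate`, the model names, and the scores are all hypothetical.

```python
import random


def first_place_rate(scores: dict[str, list[float]],
                     n_resamples: int = 1000,
                     seed: int = 0) -> dict[str, float]:
    """Share of bootstrap resamples in which each model has the highest mean score.

    `scores` maps a model name to its per-case scores; all lists must have
    the same length and refer to the same test cases in the same order.
    """
    rng = random.Random(seed)
    models = list(scores)
    n_cases = len(next(iter(scores.values())))
    wins = {m: 0 for m in models}
    for _ in range(n_resamples):
        # Draw test cases with replacement, using the same draw for every model.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        means = {m: sum(scores[m][i] for i in idx) / n_cases for m in models}
        wins[max(means, key=means.get)] += 1
    return {m: wins[m] / n_resamples for m in models}


# Made-up per-case scores: model_a is ahead on every single case.
toy_scores = {
    "model_a": [0.9, 0.8, 0.7, 0.9],
    "model_b": [0.5, 0.4, 0.6, 0.3],
}
rates = first_place_rate(toy_scores)
print(rates)
```

When two models are close, the win rates split between them, which is how a leader can come out first in most re-runs without winning all of them.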

More important than the rankings is a shared weakness: logical reasoning is the hardest category for every model tested. Even the best commercial systems drop well below their overall averages here, and most open-source models fail it almost entirely. Information-based reasoning is the second-toughest area, particularly when tasks require physically grounded transitions or exact preservation of text and numbers.

[Image: table showing the main results for five closed-source and six open-source video models across four reasoning dimensions and an overall score. Seedance 2.0 leads with an overall Score_PR of 39.8, Veo 3.1-Fast achieves the best individual score of 55.0 in World Knowledge, and no open-source model exceeds an overall score of 17.9.]
Closed-source models like Seedance 2.0 and Veo 3.1-Fast outperform open-weight rivals on every reasoning dimension by roughly 2x. | Image: Wu et al.

The study also introduces a metric that tracks how many correct answers come from dynamic, process-based phases rather than static snapshots. Commercial models score much higher here, which points to where open-source models really fall short: not in how things look, but in understanding cause and effect.
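As a rough illustration of what such a metric measures, the sketch below computes the share of a model's correct answers that belong to dynamic, process-based questions. The function name `dynamic_share`, the "static" and "dynamic" labels, and the sample answers are assumptions for illustration; the paper's exact definition may differ.

```python
def dynamic_share(answers: list[tuple[str, bool]]) -> float:
    """Fraction of correct answers that come from dynamic (process) questions.

    Each item is (phase, is_correct), where phase is "static" or "dynamic".
    Returns 0.0 when the model has no correct answers at all.
    """
    correct_phases = [phase for phase, is_correct in answers if is_correct]
    if not correct_phases:
        return 0.0
    return sum(phase == "dynamic" for phase in correct_phases) / len(correct_phases)


# Made-up example: three correct answers, only one of them about the process.
answers = [
    ("static", True),
    ("static", True),
    ("dynamic", True),
    ("dynamic", False),
]
share = dynamic_share(answers)
print(round(share, 3))  # 0.333
```

A low share means a model collects most of its points from questions a single frame can answer, and few from questions about how the scene changes over time.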

When models get more detailed prompts that spell out what should happen step by step, open-source generators improve the most. They're simply more dependent on prompt quality than their commercial rivals, which may itself be a side effect of the commercial models' stronger reasoning ability.

Automated scoring lines up with human judgment

To validate their approach, the team compared their metrics against rankings from human video comparisons. The core metric tracks closely with human judgment and clearly outperforms traditional AI judges that compare videos in pairs.
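One standard way to check whether an automated metric "tracks" human judgment is a rank correlation between the two orderings of models. The article does not name the statistic the team used, so Spearman's rho here is an assumption, and the model names and numbers are invented for illustration. The sketch also assumes there are no tied scores.

```python
def ranks(values: dict[str, float]) -> dict[str, int]:
    """Rank items from best (1) to worst by descending value; assumes no ties."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation for two score tables over the same items (no ties)."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented numbers: automated metric scores and human win rates for four models.
metric_scores = {"A": 0.9, "B": 0.5, "C": 0.7, "D": 0.1}
human_win_rates = {"A": 0.8, "B": 0.6, "C": 0.4, "D": 0.2}

rho = spearman_rho(metric_scores, human_win_rates)
print(round(rho, 2))  # 0.8
```

A value near 1 means the metric orders the models almost the same way human annotators do; in this toy case the two rankings disagree only on the order of B and C.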

[Image: web interface for annotating a logic puzzle. At the top, the input image and prompt; below that, a grid of eight generated videos, each rated on a scale of 1 to 5 for reasoning accuracy, temporal consistency, and visual quality.]
Fifteen trained annotators score eight anonymized model videos per case across three axes. They don't know which model made which video. | Image: Wu et al.

The conclusion fits a growing body of evidence: despite real progress in resolution, length, and controllability, the jump from pixel generator to reliable world model hasn't happened. Getting there will likely depend less on visual polish and more on a better grasp of causal mechanisms and the ability to keep information consistent over time. The benchmark, data, and code are available on GitHub.

An international team of researchers recently reached a similar conclusion: Sora 2 and Veo 3.1 fall well short of human performance on reasoning tasks. Whether video generators even qualify as "world models" remains a contested question in AI research. Meta's Yann LeCun considers systems like Sora a dead end, while DeepMind CEO Demis Hassabis sees Google's Veo as a step toward a world model. OpenAI shut down Sora as a commercial video generator but kept the team intact to focus on world model research. A proposed definition called OpenWorldLib explicitly rules out pure text-to-video models from the category.
