
A new study by researchers at Bytedance Research and Tsinghua University shows that current AI video models like OpenAI's Sora can create impressive visuals but fail to understand the physical laws that govern them.


While companies like OpenAI want their video AI models to simulate reality accurately, the research reveals significant limitations in how these systems process basic physics.

The scientists tested video generators' capabilities across three scenarios: predictions within known patterns, outside known patterns, and new combinations of familiar elements. Their goal was to determine whether these models truly learn physical laws or simply copy patterns from training data.

Testing the limits of training data

The researchers found that these AI models don't actually learn universal rules. Instead, they rely on surface-level features from their training data, following a strict hierarchy: color takes top priority, followed by size, speed, and shape.


Testing revealed a consistent pattern: the models perform nearly perfectly in familiar scenarios but fail when faced with unknown situations—even with basic physics like straight-line motion or collisions.

Co-author Bingyi Kang demonstrated this limitation on X: when the team trained the model on fast balls moving left to right and bouncing back, then tested it with slow balls, the generated balls suddenly changed direction after just a few frames (visible in the video at 1:55).
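The setup Kang describes can be sketched with a toy simulator. This is not the authors' code, and the velocity ranges here are hypothetical; it only illustrates the out-of-distribution protocol: train on one velocity range, then test on velocities outside it, where the ground-truth physics still predicts smooth straight-line motion.

```python
import random

def simulate_ball(v, steps=20, width=10.0, x=0.0):
    """Ground-truth uniform motion in 1D with elastic wall bounces."""
    trajectory = []
    for _ in range(steps):
        x += v
        if x < 0 or x > width:  # bounce off a wall
            v = -v
            x = max(0.0, min(width, x))
        trajectory.append(x)
    return trajectory

# Training distribution: only fast balls (hypothetical range |v| in [2, 4]).
train_velocities = [random.uniform(2.0, 4.0) for _ in range(1000)]

# Out-of-distribution test: a slow ball the model never saw. The physics
# stays trivial (constant velocity), but the study found video models
# produced sudden, unphysical direction changes on such inputs.
ood_trajectory = simulate_ball(v=0.5)
print(ood_trajectory[:5])  # → [0.5, 1.0, 1.5, 2.0, 2.5]
```

The point of the contrast: the ground-truth rule generalizes to any velocity, while a model that only memorizes training-range trajectories does not.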

Video: Kang et al.

Scaling isn't the solution

The study shows that simply scaling up models and expanding training data produces only modest gains. While larger models handle familiar patterns and combinations better, they still fail to understand basic physics or work with scenarios beyond their training.

Kang suggests that these systems might work in narrow, specific cases where the training data thoroughly covers the intended use case.

Recommendation

"Personally, I think, if there is a specific scenario and the data coverage is good enough, an overfitted world model is possible," he noted.

However, such limited systems wouldn't qualify as true world models, since the core purpose of a world model is to generalize beyond its training data. Given that it's practically impossible to capture every detail of the world or universe in training data, true world models would need to understand and apply fundamental principles rather than merely memorize patterns.

Reality check for OpenAI

These findings challenge OpenAI's vision for Sora, which the company calls "GPT-1 for video" and plans to develop into a true world model through scaling. OpenAI claims Sora already shows basic understanding of physical interactions and 3D geometry. Other companies, including RunwayML and Google DeepMind, are pursuing similar world model concepts.

But the study suggests those ambitions are premature. "Our study suggests that naively scaling is insufficient for video generation models to discover fundamental physical laws," the researchers concluded.


Meta's head of AI, Yann LeCun, shared that skepticism when OpenAI published its Sora paper, calling the approach of predicting the world by generating pixels "wasteful and doomed to failure."

That said, many would be delighted to see OpenAI finally release Sora, the video generator it unveiled in mid-February 2024.

Summary
  • Researchers from Bytedance and Tsinghua University have found that current video AI models, such as OpenAI's Sora, can generate impressive images but lack understanding of the underlying physical laws.
  • The models were tested in three scenarios, revealing that they do not learn universal rules but instead rely on superficial features from the training data, leading to failure in unfamiliar situations, even when simple physical processes are involved.
  • The researchers stress that simply scaling up the models is insufficient for discovering fundamental physical laws, tempering expectations for video models like Sora, which some AI labs are trying to develop into true world models.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.