A new study by researchers at Bytedance Research and Tsinghua University shows that current AI video models like OpenAI's Sora can create impressive visuals but fail to understand the physical laws that govern them.
While companies like OpenAI want their video AI models to simulate reality accurately, the research reveals significant limitations in how these systems process basic physics.
The scientists tested video generators' capabilities across three scenarios: predictions within the training distribution (in-distribution), predictions outside it (out-of-distribution), and new combinations of familiar elements (combinatorial generalization). Their goal was to determine whether these models truly learn physical laws or simply reproduce patterns from their training data.
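The three settings can be illustrated, very roughly, with a toy uniform-motion setup like the one below. This is a hypothetical sketch (the function names and value ranges are invented), not the researchers' actual benchmark code:

```python
import numpy as np

def ball_trajectory(v, num_frames=32, dt=0.1, x0=0.0):
    """Horizontal positions of a ball moving at constant velocity v."""
    return x0 + v * dt * np.arange(num_frames)

rng = np.random.default_rng(0)

# In-distribution: test velocities drawn from the same range as training.
train_velocities = rng.uniform(2.0, 4.0, size=1000)   # "fast" balls
in_dist_test     = rng.uniform(2.0, 4.0, size=100)

# Out-of-distribution: velocities the model never saw during training.
out_dist_test    = rng.uniform(0.2, 1.0, size=100)     # "slow" balls

# Combinatorial: familiar attributes recombined in pairings absent from
# training, even though each attribute appeared on its own.
train_pairs = {("red", "large"), ("blue", "small")}
combo_test  = {("red", "small"), ("blue", "large")}

train_clips = [ball_trajectory(v) for v in train_velocities]
ood_clips   = [ball_trajectory(v) for v in out_dist_test]
```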
Testing the limits of training data
The researchers found that these AI models don't actually learn universal rules. Instead, they rely on surface-level features from their training data, following a strict hierarchy: color takes top priority, followed by size, speed, and shape.
Testing revealed a consistent pattern: the models perform nearly perfectly in familiar scenarios but fail when faced with unknown situations—even with basic physics like straight-line motion or collisions.
Co-author Bingyi Kang demonstrated this limitation on X: when the team trained the model on fast-moving balls traveling left to right and back, then tested it with slow-moving balls, the generated ball suddenly changed direction after just a few frames (visible in the video at 1:55).
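To make that failure concrete, a simple check like the one below could flag it, assuming the generated clip has already been reduced to per-frame ball positions. The function and the example numbers are hypothetical, not taken from the study:

```python
import numpy as np

def reverses_direction(positions, tol=1e-6):
    """True if the ball's frame-to-frame displacement flips sign,
    which should never happen for unobstructed straight-line motion."""
    deltas = np.diff(positions)
    signs = np.sign(deltas[np.abs(deltas) > tol])
    return bool(np.any(signs[1:] != signs[:-1]))

# A physically correct slow ball keeps drifting right:
print(reverses_direction([0.0, 0.1, 0.2, 0.3, 0.4]))        # False

# The reported failure: after a few frames the generated ball snaps back
# toward the fast left-to-right-and-back motion seen in training.
print(reverses_direction([0.0, 0.1, 0.2, 1.4, 2.6, 1.8]))   # True
```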
Scaling isn't the solution
The study shows that simply scaling up models and expanding training data produces only modest gains. While larger models handle familiar patterns and combinations better, they still fail to understand basic physics or work with scenarios beyond their training.
Kang suggests that these systems might work in narrow, specific cases where the training data thoroughly covers the intended use case.
"Personally, I think, if there is a specific scenario and the data coverage is good enough, an overfitted world model is possible," he noted.
However, such limited systems wouldn't qualify as true world models, since the core purpose of a world model is to generalize beyond its training data. Given that it's practically impossible to capture every detail of the world or universe in training data, true world models would need to understand and apply fundamental principles rather than merely memorize patterns.
Reality check for OpenAI
These findings challenge OpenAI's vision for Sora, which the company calls "GPT-1 for video" and plans to develop into a true world model through scaling. OpenAI claims Sora already shows basic understanding of physical interactions and 3D geometry. Other companies, including RunwayML and Google DeepMind, are pursuing similar world model concepts.
But the study suggests those ambitions are premature. "Our study suggests that naively scaling is insufficient for video generation models to discover fundamental physical laws," the researchers concluded.
Meta's chief AI scientist, Yann LeCun, expressed similar skepticism when OpenAI published its Sora paper, calling the approach of predicting the world by generating pixels "wasteful and doomed to failure."
That said, many would be delighted to see OpenAI finally release Sora as the video generator it unveiled in mid-February 2024.