Sora is widely perceived primarily as a text- and video-to-video model. OpenAI's real research goal, however, is a world simulator.
But according to Yann LeCun, Meta's chief AI scientist, Sora is not suited for that. The renowned AI researcher has harsh words for OpenAI's simulator thesis: "Modeling the world for action by generating pixels is as wasteful and doomed to failure as the largely abandoned idea of analysis by synthesis."
LeCun points to the long-standing debate about the merits of generative versus discriminative classification methods, in which generative methods have been considered more difficult and less effective.
LeCun believes that generative models will fail for sensory data because it is too difficult to handle the prediction uncertainty of high-dimensional, continuous inputs.
For text, generative AI works well because text is discrete and drawn from a finite vocabulary: uncertainty about the next token can be expressed as an explicit probability distribution over that vocabulary. Continuous sensory inputs such as video offer no such shortcut, because the space of possible outcomes cannot be enumerated.
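To make that contrast concrete, here is a minimal Python sketch of the argument. All numbers are illustrative choices of our own, not figures from LeCun or OpenAI:

```python
import numpy as np

# Discrete case: a language model's uncertainty over the next token is
# just a softmax over a finite vocabulary (size here is illustrative).
vocab_size = 50_000
logits = np.random.randn(vocab_size)        # hypothetical model output
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # a full, exact distribution

# Continuous case: a single 256x256 RGB frame has ~197,000 real-valued
# dimensions. Even one Gaussian with a full covariance over that space
# would need roughly 196,608^2 = 3.9e10 parameters, and real video
# distributions are far from Gaussian.
frame_dims = 256 * 256 * 3
print(f"tokens: {vocab_size:,} probabilities to enumerate")
print(f"pixels: {frame_dims:,} dims -> {frame_dims**2:,} covariance entries")
```

For the token case, the entire distribution fits in one array; for the pixel case, even a crude parametric stand-in for the uncertainty is already intractable.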
"If your goal is to train a world model for recognition or planning, using pixel-level prediction is a terrible idea," writes LeCun.
LeCun's JEPA is supposed to do what Sora can't
Almost simultaneously with Sora, LeCun presented a new model based on his Joint Embedding Predictive Architecture: V-JEPA (Video Joint Embedding Predictive Architecture), a step toward a world model that does not rely on generative methods.
The model learns the dynamics of objects and their interactions by masking out parts of videos and predicting the hidden content.
Crucially, V-JEPA makes these predictions in an abstract conceptual space rather than pixel by pixel, similar to how humans are thought to process visual information.
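The following PyTorch sketch illustrates that core idea under heavy simplification. It is not Meta's released V-JEPA code; every module, name, and size below is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Heavily simplified sketch of the JEPA idea: predict masked video
# patches in embedding space, never in pixels.
embed_dim, num_patches, num_masked = 128, 64, 16

encoder = nn.Linear(768, embed_dim)         # stand-in for a video ViT
target_encoder = nn.Linear(768, embed_dim)  # in practice an EMA copy
target_encoder.load_state_dict(encoder.state_dict())
predictor = nn.Linear(embed_dim, embed_dim) # in practice a transformer

patches = torch.randn(num_patches, 768)     # flattened video patches
masked = torch.randperm(num_patches)[:num_masked]
visible = torch.ones(num_patches, dtype=torch.bool)
visible[masked] = False

# Encode only the visible patches, then predict the *embeddings* of the
# masked ones; the loss lives entirely in representation space.
context = encoder(patches[visible]).mean(dim=0)   # pooled context
pred = predictor(context).expand(num_masked, -1)
with torch.no_grad():
    targets = target_encoder(patches[masked])     # abstract targets

loss = F.mse_loss(pred, targets)   # no pixel reconstruction anywhere
loss.backward()
```

The point of the design is the last two steps: the model is never asked to reproduce pixels, only to match the target encoder's abstract representation of the hidden regions.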
This architecture allows V-JEPA to adapt to different tasks by adding a small, task-specific layer rather than retraining the entire model - a major advance over traditional AI models.
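As a sketch of what that adaptation step might look like, here is a simplified linear probe on a frozen stand-in encoder; the names, shapes, and the linear modules are assumptions, and V-JEPA's own evaluations use an attentive probe on a frozen video transformer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Adaptation scheme described above: the pretrained encoder is frozen,
# and only a small task-specific head is trained.
embed_dim, num_classes = 128, 10
encoder = nn.Linear(768, embed_dim)       # stand-in for pretrained V-JEPA
for p in encoder.parameters():
    p.requires_grad = False               # no retraining of the backbone

head = nn.Linear(embed_dim, num_classes)  # the small task-specific layer
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

video_patches = torch.randn(8, 64, 768)   # batch of 8 clips, 64 patches
with torch.no_grad():
    features = encoder(video_patches).mean(dim=1)  # frozen features
logits = head(features)
labels = torch.randint(0, num_classes, (8,))
loss = F.cross_entropy(logits, labels)
loss.backward()                           # gradients reach only the head
optimizer.step()
```

Because only the small head receives gradients, adapting to a new task costs a tiny fraction of the compute that full retraining would.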
Meta's AI team plans to extend V-JEPA's capabilities and improve long-term predictions, ultimately developing comprehensive world models for autonomous AI systems.