
Sora is widely perceived primarily as a text-to-video and video-to-video model. However, OpenAI's actual research goal is a world simulator.


But according to Yann LeCun, Meta's chief AI scientist, Sora is not suited for that. The renowned AI researcher has harsh words for OpenAI's simulator theory: "Modeling the world for action by generating pixels is as wasteful and doomed to failure as the largely-abandoned idea of 'analysis by synthesis'."

There has been a historical debate about the merits of generative versus discriminative classification methods, with generative methods considered more difficult and less effective, LeCun said.

LeCun believes that generative models for sensory inputs will fail because it is too difficult to deal with the prediction uncertainty of high-dimensional continuous sensory inputs.


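For text, generative AI works well because text is discrete and drawn from a finite set of symbols. Dealing with uncertainty is easy here: a model can assign a probability to every possible next symbol. Sensory inputs such as video are continuous and high-dimensional, which makes that uncertainty far harder to represent.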
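
The difference can be illustrated with a minimal Python sketch. It is a generic textbook example, not LeCun's own formulation and not a description of how Sora works; the toy data and the function names `categorical_prediction` and `mse_optimal_prediction` are assumptions made for illustration. With a finite set of symbols, a model can keep a separate probability for every outcome. A model that has to commit to a single continuous value and is scored by squared error ends up predicting the average, which may match none of the plausible outcomes.

```python
from collections import Counter

# Toy setup (an assumption for illustration): the same context is followed
# by one of two equally likely outcomes.
next_tokens = ["left", "right", "left", "right"]  # discrete symbols
next_pixels = [0.0, 1.0, 0.0, 1.0]                # continuous brightness values


def categorical_prediction(samples):
    """Discrete case: estimate a probability for every observed symbol."""
    counts = Counter(samples)
    total = len(samples)
    return {symbol: count / total for symbol, count in counts.items()}


def mse_optimal_prediction(samples):
    """Continuous case, single point estimate: the mean minimizes squared error."""
    return sum(samples) / len(samples)


token_distribution = categorical_prediction(next_tokens)
pixel_prediction = mse_optimal_prediction(next_pixels)

print(token_distribution)  # each outcome keeps its own probability
print(pixel_prediction)    # an in-between value that matches neither outcome
```

In the discrete case both futures survive in the prediction. In the continuous case the single best guess is a gray value that was never observed, which is one simple way to picture why pixel-level uncertainty is hard to handle.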

"If your goal is to train a world model for recognition or planning, using pixel-level prediction is a terrible idea," writes LeCun.

LeCun's JEPA is supposed to do what Sora can't

Almost simultaneously with Sora, LeCun presented a new model based on his own architecture: the Video Joint Embedding Predictive Architecture (V-JEPA), intended as a step towards a world model that does not rely on generative methods.

The model is trained to fill in hidden parts of videos. In doing so, it learns to predict and interpret complex interactions and picks up the dynamics of objects and how they interact.

Rather than predicting individual pixels, V-JEPA makes its predictions in a more abstract conceptual space, similar to how humans process images cognitively.
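
A rough way to picture the idea is to compare a prediction error measured on raw pixels with one measured on a coarser representation. The following Python sketch is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in that averages blocks of pixels, not V-JEPA's learned encoder, and the "frames" are short lists of made-up brightness values.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

BLOCK = 4  # hypothetical pooling width of the toy encoder


def toy_encoder(frame):
    """Stand-in encoder (an assumption): average each block of pixels into one feature."""
    return [sum(frame[i:i + BLOCK]) / BLOCK for i in range(0, len(frame), BLOCK)]


def mean_squared_error(a, b):
    """Average squared difference between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def render(scene):
    """Add unpredictable fine detail (small random noise) to a scene."""
    return [value + random.uniform(-0.1, 0.1) for value in scene]


# A made-up scene with two regions: dark on the left, bright on the right.
structure = [0.2] * 8 + [0.8] * 8

# Two renderings share the same structure but differ in fine detail.
target_frame = render(structure)
predicted_frame = render(structure)

pixel_loss = mean_squared_error(predicted_frame, target_frame)
latent_loss = mean_squared_error(toy_encoder(predicted_frame), toy_encoder(target_frame))

print(pixel_loss, latent_loss)
```

Both frames show the same scene, so the prediction is correct in every sense that matters for recognition or planning. The pixel-level error still penalizes the unpredictable detail, while the error in the coarser representation is smaller because the averaging discards most of it.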


This architecture allows V-JEPA to adapt to different tasks by adding a small, task-specific layer rather than retraining the entire model - a major advance over traditional AI models.
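
The general pattern of a frozen model with a small trainable layer on top can be sketched in a few lines of Python. This is a hypothetical toy, not Meta's code: `BACKBONE_WEIGHTS`, `frozen_features`, `train_head` and the downstream task are all assumptions chosen for illustration, and a simple logistic-regression head stands in for the task-specific layer.

```python
import math

# Frozen "backbone" (a stand-in assumption): fixed weights that are never updated.
BACKBONE_WEIGHTS = ((1.0, 1.0), (1.0, -1.0))


def frozen_features(x):
    """Map a 2-D input to two features using the fixed backbone weights."""
    return [sum(w * v for w, v in zip(row, x)) for row in BACKBONE_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(inputs, labels, epochs=200, lr=0.1):
    """Train only a small logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            feats = frozen_features(x)
            error = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias) - y
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    """Classify an input with the frozen features and the trained head."""
    weights, bias = head
    score = sum(w * f for w, f in zip(weights, frozen_features(x))) + bias
    return int(score > 0)


# Hypothetical downstream task: is the first coordinate larger than the second?
inputs = [(2.0, 1.0), (3.0, 0.5), (1.0, 2.0), (0.5, 3.0)]
labels = [1, 1, 0, 0]

head = train_head(inputs, labels)
predictions = [predict(head, x) for x in inputs]
print(predictions)
```

Only the three numbers in the head are learned here; the backbone stays untouched. A different task would get its own head on top of the same frozen features.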

Meta's AI team plans to extend V-JEPA's capabilities and improve long-term predictions, ultimately developing comprehensive world models for autonomous AI systems.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • OpenAI's Sora is known as a text-to-video and video-to-video model, but the company's real goal is a world simulator. Yann LeCun, Meta's chief AI scientist, believes this approach is wasteful and doomed to fail.
  • LeCun argues that generative models will fail with sensory inputs because the prediction uncertainty of high-dimensional continuous data is too difficult to handle.
  • LeCun has presented his own AI model, V-JEPA, which is based on a non-generative method. It learns to predict and interpret complex interactions by filling in hidden parts of videos, picking up the dynamics of objects in the process.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.