OpenAI's Sora AI model is capable of generating minute-long videos of impressive quality. In a presentation, the developers compare it to GPT-1, the precursor to modern language models.

OpenAI sees Sora as the foundation for better understanding and simulating the real world - a crucial step on the path to AGI. In a presentation at the AGI House, Sora developers Tim Brooks and Bill Peebles called the model "the GPT-1 of video" - a reference to OpenAI's first Transformer language model, GPT, from 2018. A recording of the talk was uploaded by YouTuber Wes Roth.

Like GPT-1, Sora is basic research, but with the potential to enable groundbreaking new applications. In the case of GPT, its successors have shown what's possible, from chatbots to code assistants to text summarization. OpenAI now expects something similar from Sora for video generation and analysis: "We think this technology will get a lot better very soon."

OpenAI expects to see emergent capabilities at scale

OpenAI sees Sora as a demonstration that generative AI models for video are scalable, and that emergent capabilities arise from further scaling. In the sample videos, Sora already shows a basic understanding of physical interaction and the 3D geometry of real-world environments: people and animals move almost naturally through the generated scenes, objects remain consistent across camera pans, and surfaces cast realistic reflections.

The Sora team identifies the simulation of complex physical processes, causality, and improved spatio-temporal logic as key areas for further progress. The developers believe these capabilities can be achieved with larger models, much as generative language models only developed natural-sounding coherence through scaling.

In the long term, OpenAI hopes that multimodal modeling with Sora and similar models will lead to a better understanding of how people, animals, and objects interact in our world. This would be a critical step toward an artificial general intelligence that can fully simulate and understand the real world. According to the team, enough data already exists to achieve AGI, along with methods to make better use of it.

Meta's AI boss does not believe that Sora will succeed

Meta's chief of AI, Yann LeCun, on the other hand, does not see Sora as a suitable tool for predicting the world by generating pixels. He describes this approach as wasteful and doomed to failure. LeCun argues that generative models for sensory input will fail because it is too difficult to deal with the predictive uncertainty of high-dimensional continuous sensory input. He believes that generative AI works well for text because text is discrete and has a finite number of symbols, making it easier to deal with uncertainty.

At almost the same time as Sora, LeCun presented his own AI model, the Video Joint Embedding Predictive Architecture (V-JEPA), which predicts and interprets complex interactions without relying on generative methods. V-JEPA makes its predictions in an abstract representation space rather than in pixel space, and adapts to different tasks by adding a small, task-specific layer instead of retraining the entire model.
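The adaptation pattern described above - a frozen pretrained encoder with only a small task-specific head being trained - can be illustrated with a toy sketch in plain Python. Note that `frozen_encoder` and the regression task here are made-up stand-ins for illustration only, not V-JEPA's actual components or API:

```python
import random

random.seed(0)

# Toy stand-in for a frozen pretrained encoder: maps an input
# vector to a small fixed embedding. (V-JEPA's real encoder is a
# large neural network; this is purely illustrative.)
def frozen_encoder(x):
    return [sum(x), max(x) - min(x)]

# Small task-specific head: one linear layer, the only trainable part.
weights = [0.0, 0.0]
bias = 0.0

def head(z):
    return weights[0] * z[0] + weights[1] * z[1] + bias

# Toy regression task: learn target = 2 * sum(x) with plain SGD.
# Only the head's parameters are updated; the encoder stays frozen.
lr = 0.02
for _ in range(10_000):
    x = [random.random() for _ in range(4)]
    z = frozen_encoder(x)        # no gradient ever flows here
    err = head(z) - 2 * sum(x)   # prediction error
    weights[0] -= lr * err * z[0]
    weights[1] -= lr * err * z[1]
    bias -= lr * err
```

The appeal of this design is cost: training two weights and a bias is trivial compared to retraining the encoder, which is why the same frozen backbone can serve many tasks with one small head each.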

Sora is currently available to a select group of red teamers assessing potential harms and risks, as well as to artists, designers, and filmmakers who are providing feedback on its usefulness for creative professionals. Sora is scheduled for release later this year, but could still be several months away, as the timing may be affected by the US elections in November.

Summary
  • OpenAI's Sora can generate high-quality videos up to a minute long.
  • In a talk, the developers now compare it to GPT-1, the first modern language model that laid the foundation for applications such as chatbots and coding assistants.
  • OpenAI sees the potential for Sora to gain a better understanding of the real world by learning how people, animals, and objects interact as it continues to scale. This would be an important step toward artificial general intelligence.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.