OpenAI is showing off Sora, its first generative AI model for video, and from the looks of it, this could be a GPT-4 moment for video generation.
OpenAI announced Sora, the company's first text-to-video model, in a blog post and on X, formerly Twitter. Sora shows off an impressive array of capabilities: it can create videos up to a minute long that boast unprecedented levels of visual fidelity and, most importantly, temporal stability, while - according to OpenAI - also adhering to user instructions. Examples such as a dog climbing between windowsills illustrate the model's stability over time.
The AI model is currently available to a select group of red teamers, who are assessing potential harms and risks, as well as to visual artists, designers, and filmmakers providing feedback on its usefulness for creative professionals.
OpenAI sees Sora as a foundation model on the path to AGI
According to OpenAI, Sora's current limitations include accurately simulating complex physics and capturing specific cause-and-effect scenarios. For example, a character may bite into a cookie, but the visual aftermath - a bite mark - may be missing. Sora can also falter with spatial details, such as distinguishing left from right, and struggle with precise descriptions of events over time, such as following a specific camera trajectory.
In terms of safety, OpenAI is implementing several measures before integrating Sora into its products. These include working with red teamers and developing tools such as a detection classifier that identifies whether a video was generated by Sora. The company also aims to include C2PA metadata in the future, assuming the model is deployed in an OpenAI product. Building on the safety methods established for DALL-E 3, OpenAI plans to use text classifiers to check prompts for content-policy violations and image classifiers to review generated video frames for compliance with its usage policies.
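Conceptually, that is a two-stage check: the prompt is screened before generation, and the rendered frames are screened after. The following is a minimal sketch of such a flow, assuming hypothetical placeholder classifiers - none of the function names below are OpenAI's actual API:

```python
# Illustrative two-stage content-policy check, as described above.
# Both classifier functions are hypothetical placeholders.

from typing import Callable, List, Optional


def text_classifier_flags(prompt: str) -> bool:
    """Hypothetical text classifier: True if the prompt violates policy."""
    banned_terms = {"example_banned_term"}  # placeholder policy list
    return any(term in prompt.lower() for term in banned_terms)


def frame_classifier_flags(frame: bytes) -> bool:
    """Hypothetical image classifier applied to each generated frame.
    A real system would run a trained vision model here."""
    return False


def generate_video_safely(
    prompt: str, generate: Callable[[str], List[bytes]]
) -> Optional[List[bytes]]:
    # Stage 1: reject policy-violating prompts before any generation.
    if text_classifier_flags(prompt):
        return None
    # Stage 2: review every generated frame before releasing the video.
    frames = generate(prompt)
    if any(frame_classifier_flags(f) for f in frames):
        return None
    return frames
```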
Sora is a diffusion model: it starts from video that resembles static noise and progressively removes that noise until clear footage emerges. By representing videos as collections of data patches, similar to GPT's tokens, the model can work with a wider range of visual data than previously possible, the company says. Leveraging the recaptioning technique from DALL-E 3, Sora can follow text instructions in generated videos more faithfully. Temporal stability is made possible by "allowing the model to look ahead many frames at a time."
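To make the general idea concrete, here is a toy sketch of a diffusion-style denoising loop over spacetime patches. It is not Sora's actual architecture: the dimensions, patch size, step count, and the trivial "denoiser" are all placeholder assumptions standing in for a trained transformer and a real noise schedule:

```python
import numpy as np

# Toy denoising loop in the spirit of a video diffusion model.
# All shapes and the "denoiser" below are illustrative assumptions.

T, H, W, C = 16, 32, 32, 3  # frames, height, width, channels
PATCH = 8                   # spacetime patches, analogous to GPT's tokens


def to_patches(video: np.ndarray) -> np.ndarray:
    """Flatten a (T, H, W, C) video into a sequence of flat patches."""
    t, h, w, c = video.shape
    p = video.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, c)
    return p.transpose(0, 1, 3, 2, 4, 5).reshape(-1, PATCH * PATCH * c)


def from_patches(patches: np.ndarray) -> np.ndarray:
    """Inverse of to_patches: rebuild the (T, H, W, C) video."""
    p = patches.reshape(T, H // PATCH, W // PATCH, PATCH, PATCH, C)
    return p.transpose(0, 1, 3, 2, 4, 5).reshape(T, H, W, C)


def denoiser(patches: np.ndarray, step: int) -> np.ndarray:
    """Placeholder for a learned model predicting the noise to remove.
    A real system would run a trained network over the patch sequence."""
    return patches * 0.1  # pretend 10% of the signal is noise


# Start from pure noise and iteratively subtract the predicted noise.
video = np.random.randn(T, H, W, C)
for step in range(50):
    patches = to_patches(video)
    patches = patches - denoiser(patches, step)
    # (a real sampler would also re-inject scheduled noise here)
    video = from_patches(patches)
```

Because every frame is part of one patch sequence, the model sees many frames at once rather than generating them independently, which is the intuition behind the "look ahead many frames at a time" quote above.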
OpenAI sees Sora as a foundation model "that can understand and simulate the real world", a critical step toward achieving Artificial General Intelligence (AGI).
More examples are available on the Sora website.