
OpenAI is showing off Sora, its first generative AI model for video, and from the looks of it, it could be a GPT-4 moment for video generation.

OpenAI announced Sora, the company's first text-to-video model, in a blog post and on X, formerly Twitter. Sora shows an impressive range of capabilities: it can create videos up to a minute long with unprecedented visual fidelity and, most importantly, temporal stability, while, according to OpenAI, also following user instructions. Examples such as a dog climbing between window sills show how stable the model's output is.

Video: OpenAI

The AI model is currently available to a select group of red teamers for harm and risk assessment, as well as to visual artists, designers, and filmmakers who will provide feedback on making it more useful for creative professionals.


OpenAI sees Sora as a foundation model on the path to AGI

According to OpenAI, Sora's current limitations include accurately simulating complex physics and capturing specific cause-and-effect scenarios. For example, a character may bite into a cookie, but the visual aftermath, the bite mark, may be missing. Sora can also confuse spatial details, such as left and right, and struggle with precise descriptions of events unfolding over time, such as following a specific camera trajectory.

In terms of safety, OpenAI is implementing several measures before integrating Sora into its products. These include working with red teamers and developing tools such as a detection classifier that identifies when a video was generated by Sora. OpenAI also plans to include C2PA metadata in the future, assuming the model is deployed in an OpenAI product. Building on the safety methods established for DALL-E 3, OpenAI intends to use text classifiers to check prompts against its content policies and image classifiers to review video frames for compliance with its usage policies.

Video: OpenAI

Sora is a diffusion model: it generates a video by starting from what looks like static noise and progressively removing that noise until a clear clip emerges. By representing videos as collections of data patches, similar to GPT's tokens, the model can work with a wider range of visual data than previously possible, the company says. Leveraging the recaptioning technique from DALL-E 3, Sora can more faithfully follow text instructions in the generated videos. Temporal stability is made possible by "allowing the model to look ahead many frames at a time."
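To make the patch-and-diffusion idea more concrete, here is a minimal, purely illustrative Python sketch. It is not OpenAI's implementation: the patch size, the toy denoiser, and the sampling loop are hypothetical stand-ins that only show how a video can be cut into spacetime patches and how a diffusion-style sampler iteratively turns noise into a clean patch sequence.

```python
import numpy as np

def video_to_patches(video, patch=(4, 16, 16)):
    """Split a (T, H, W, C) video into flattened spacetime patches (hypothetical sizes)."""
    T, H, W, C = video.shape
    pt, ph, pw = patch
    return (
        video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)
             .reshape(-1, pt * ph * pw * C)
    )

def toy_denoiser(noisy_patches, t):
    """Stand-in for the learned model: here it simply damps the noise a little each step."""
    return 0.8 * noisy_patches

def generate(num_patches, patch_dim, steps=50, seed=0):
    """Diffusion-style sampling: start from pure noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_patches, patch_dim))
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t)                        # predict a cleaner version
        if t > 0:                                     # re-inject a little noise except on the last step
            x = x + 0.05 * rng.standard_normal(x.shape)
    return x

if __name__ == "__main__":
    # A dummy 16-frame, 64x64 RGB video becomes 64 spacetime patches of dimension 3072.
    video = np.zeros((16, 64, 64, 3), dtype=np.float32)
    print(video_to_patches(video).shape)   # (64, 3072)

    # "Generating" a clip means sampling a patch sequence from noise;
    # a real model would then decode these patches back into video frames.
    print(generate(num_patches=64, patch_dim=3072).shape)   # (64, 3072)
```

In the real system, the denoiser would be a large trained network conditioned on the text prompt, and the sampled patches would be decoded back into frames; the sketch above only mirrors the overall flow OpenAI describes.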

OpenAI sees Sora as a foundational model "that can understand and simulate the real world", a critical step toward achieving Artificial General Intelligence (AGI).


More examples are available on the Sora website.

Summary
  • OpenAI has introduced Sora, its first text-to-video generative AI model, capable of creating videos up to a minute long with impressive visual fidelity and temporal stability.
  • The model is currently being tested by a select group of red teamers for risk assessment and by visual artists, designers, and filmmakers for creative feedback.
  • Sora's limitations include challenges in simulating complex physics and capturing specific cause-and-effect scenarios, and OpenAI is working on safety measures such as detection classifiers and metadata integration for future product implementation.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.