
Mira Murati, CTO of OpenAI, said in an interview with the Wall Street Journal that she doesn't know exactly what data OpenAI's video model Sora was trained on. The answer is a problem in itself, because it suggests the company has yet to acknowledge the controversy over its training data.


When asked what training data was used for Sora, Murati repeated the wording from OpenAI's announcement: the model is trained on public and licensed data. Asked by WSJ reporter Joanna Stern whether that included YouTube or Facebook videos, for example, Murati said she wasn't sure.

Of course, as CTO, Murati is not necessarily involved in day-to-day development. But with OpenAI being sued left and right for alleged data theft, saying "I'm not sure" in a prepared interview doesn't seem very convincing.

In Murati's defense, Sora is still in development and won't be released anytime soon. After the interview, Murati confirmed that some of the licensed training data comes from Shutterstock.


OpenAI is facing several lawsuits, including from authors and the New York Times, who claim that their copyrighted works were used to train AI models without permission.

OpenAI argues that the use of copyrighted data for AI training is covered by fair use, and that it is impossible to train state-of-the-art AI models without copyrighted material.

Sora is "much, much more expensive" than current generative AI systems

Murati also commented on the cost of Sora, saying that video generation is currently "much, much more expensive" than existing systems. Once Sora is released, Murati expects the cost to be similar to that of DALL-E 3. Sora's release is "definitely planned for this year," but could take a few more months, Murati said.

The US elections in November may affect the release date. Sora's safety guidelines are still under development, but Murati expects them to be similar to those of DALL-E 3, which prohibit generating images of public figures.

Summary
  • OpenAI CTO Mira Murati told the Wall Street Journal that the video model Sora is trained on public and licensed data. However, she was not sure whether that included videos from YouTube or Facebook, and would not provide examples. She later confirmed that some of the licensed data comes from Shutterstock.
  • This is relevant because OpenAI is currently facing lawsuits from authors and the New York Times, among others, who claim that their copyrighted works were used to train AI models without permission. OpenAI argues that the use of copyrighted data falls under fair use rules and is essential for training state-of-the-art AI models.
  • Murati also said that video generation with Sora is currently "much, much more expensive" than existing systems, but that the cost could be on par with DALL-E 3 once it is released. The release is planned for this year, but could take a few more months. The US elections in November may affect the schedule.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.