
An investigation by Proof News shows that several leading technology and AI companies, including Apple, Nvidia, Anthropic, and Salesforce, apparently used thousands of YouTube videos to train their AI models without the creators' knowledge.

Proof News discovered that the companies used subtitles from 173,536 YouTube videos across more than 48,000 channels. The dataset, called "YouTube Subtitles," contains video transcripts from educational channels like Khan Academy, MIT, and Harvard, as well as media outlets like The Wall Street Journal, NPR, and BBC.

Material from late-night shows such as "The Late Show with Stephen Colbert" and "Jimmy Kimmel Live!" and from well-known YouTube personalities such as MrBeast and Marques Brownlee was also used for AI training, according to the research. Brownlee describes the AI training debate as a "problem that will continue to evolve for a long time to come." A search tool is available to check what data is included in the dataset.

The "YouTube Subtitles" dataset is part of "The Pile," a collection of internet data compiled by the research organization EleutherAI. Apple, for example, used The Pile for its open-source OpenELM models, which may be used in its own Apple Intelligence. Anthropic and Salesforce have already confirmed that they used The Pile for their AI systems.

YouTube data may be a special case: In April, YouTube CEO Neal Mohan emphasized that this type of data use is expressly prohibited by YouTube's terms of service. It remains to be seen whether this affects the "fair use" argument that data-collecting AI companies, including Google in its own legal disputes, usually rely on.

The legal situation regarding data scraping for AI training is still unclear. A recent court ruling on the AI coding tool GitHub Copilot found no copyright infringement, at least as long as the output of the systems is not identical to the original content.

The case is one of a growing number of legal disputes. Several class action lawsuits by publishers and authors against technology companies are already pending, partly over the use of books as training data. Similar cases are also pending in the image and music sectors, and more are emerging in the video sector.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Proof News has revealed that tech and AI companies including Anthropic, Nvidia, Apple, and Salesforce have been using thousands of YouTube videos to train their AI models without the knowledge of the creators.
  • The YouTube Subtitles dataset, which is part of EleutherAI's The Pile dataset, contains subtitles from 173,536 videos across more than 48,000 channels, including educational, media, and creator content.
  • According to YouTube CEO Neal Mohan, this type of data use is prohibited by YouTube's terms of service. Whether the companies can claim 'fair use' regardless of YouTube's terms of service is still unclear and will likely have to be decided in court.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.