An investigation by Proof News shows that several leading technology and AI companies, including Apple, Nvidia, Anthropic, and Salesforce, apparently used thousands of YouTube videos to train their AI models without the creators' knowledge.
Proof News discovered that the companies used subtitles from 173,536 YouTube videos across more than 48,000 channels. The dataset, called "YouTube Subtitles," contains video transcripts from educational channels like Khan Academy, MIT, and Harvard, as well as media outlets like The Wall Street Journal, NPR, and BBC.
Material from late-night shows such as "The Late Show with Stephen Colbert" and "Jimmy Kimmel Live!" and from well-known YouTube personalities such as MrBeast and Marques Brownlee was also used for the AI training, according to the research. Brownlee describes the AI training debate as a "problem that will continue to evolve for a long time to come." Proof News provides a search tool for checking which channels and videos are included in the dataset.
The "YouTube Subtitles" dataset is part of "The Pile," a collection of internet data compiled by the research organization EleutherAI. Apple, for example, used The Pile for its open-source OpenELM models, which may be used in Apple Intelligence. Anthropic and Salesforce have already confirmed that they used The Pile for their AI systems.
YouTube data may be a special case: in April, YouTube CEO Neal Mohan emphasized that this type of data use is expressly prohibited by YouTube's terms of service. It remains to be seen whether this affects the "fair use" argument that data-collecting AI companies, including Google in its own legal disputes, usually rely on.
The legal situation regarding data scraping for AI training is still unclear. A recent court ruling on the AI coding tool GitHub Copilot found no copyright infringement, at least as long as the output of such systems is not identical to the original content.
The case is one of a growing number of legal disputes. Several class action lawsuits by publishers and authors against technology companies are already pending, partly over the use of books as training data. Similar cases are also pending in the image and music sectors, and more are emerging in the video sector.