Tech giants allegedly used thousands of YouTube videos for AI training without creators' consent

Jul 16, 2024


An investigation by Proof News shows that several leading technology and AI companies, including Apple, Nvidia, Anthropic, and Salesforce, apparently used thousands of YouTube videos to train their AI models without the creators' knowledge.

Proof News discovered that the companies used subtitles from 173,536 YouTube videos across more than 48,000 channels. The dataset, called "YouTube Subtitles," contains video transcripts from educational channels like Khan Academy, MIT, and Harvard, as well as media outlets like The Wall Street Journal, NPR, and BBC.

Material from late-night shows such as "The Late Show with Stephen Colbert" and "Jimmy Kimmel Live!" and from well-known YouTube personalities such as MrBeast and Marques Brownlee was also used for AI training, according to the research. Brownlee describes the AI training debate as a "problem that will continue to evolve for a long time to come." Proof News provides a search tool for checking which videos are included in the dataset.

The "YouTube Subtitles" dataset is part of "The Pile," a collection of internet data compiled by the research organization EleutherAI. Apple, for example, used The Pile for its open-source OpenELM models, which may be used in its own Apple Intelligence. Anthropic and Salesforce have already confirmed that they used The Pile for their AI systems.

YouTube data may be a special case: In April, YouTube CEO Neal Mohan emphasized that this type of data use is expressly prohibited by YouTube's terms of service. It remains to be seen whether this affects the "fair use" argument that data-collecting AI companies, including Google in its own legal disputes, usually rely on.

The legal situation regarding data scraping for AI training is still unclear. A recent court ruling on the AI coding tool GitHub Copilot states that there is no copyright infringement, at least as long as the system's output is not identical to the original content.

The case is one of a growing number of legal disputes. Several class action lawsuits by publishers and authors against technology companies are already pending, some of them over the use of books as training data. Similar cases are pending in the image and music sectors, and more are emerging in the video sector.
