A two-year investigation by the International Confederation of Music Publishers (ICMP) and a separate analysis by The Atlantic point to a systematic double standard: Tech giants train AI on copyrighted works scraped at scale, while their own terms of service prohibit the same practice for their platforms.
"All we hear from AI and tech companies is, 'We need exceptions to build an open internet and access data, wholescale, without licenses, for our training,'" ICMP director general John Phelan said. "What our work on AI shows is that at the very same time, they’re demanding everybody else get prior written permission before using their content."
ICMP calls AI training the "largest theft of intellectual property in human history"
According to a Billboard-exclusive report, ICMP alleges that Google, Microsoft, Meta, OpenAI, and X trained their systems at scale on copyrighted music. ICMP, which compiled evidence over two years, calls it "the largest IP theft in human history." Phelan told Billboard that "tens of millions of works" are being infringed every day.
ICMP says its dossier contains "comprehensive and clear" evidence: private datasets that purportedly show the U.S. music apps Udio and Suno scraping YouTube; analyses suggesting Meta’s Llama 3 was trained on lyrics by artists such as The Weeknd and Ed Sheeran; and court filings in the publishers’ lawsuit against Anthropic alleging that its Claude model reproduced hundreds of song lyrics, including "American Pie" and "Halo." ICMP also points to evidence that Microsoft’s Copilot and Google’s Gemini replicated copyrighted lyrics.
Some items Billboard lists as "evidence" are weaker than others, notably chatbot "admissions" about the training data behind corporate products. Given how language models generate text, such statements carry little probative value for the kind of case ICMP is trying to build. Even so, there are numerous indications that tech firms large and small, all helping power the AI boom, have drawn heavily on copyrighted datasets. In text, lawsuits are ongoing, including the New York Times’ case against OpenAI and the recently stalled settlement talks between Anthropic and affected authors. In music, Suno faces litigation; in imagery, Midjourney and others are being sued.
Millions of YouTube videos fed into AI video generators
The Atlantic reports that at least 15.8 million YouTube videos from more than 2 million channels have been downloaded without permission and bundled into at least 13 datasets, nearly 1 million of them how-to clips. Titles and channel names are often stripped but can be recovered via unique IDs. While mass downloading violates YouTube’s terms of service, YouTube has done little to stop it and did not comment, The Atlantic writes. A dedicated tool lets users search whether specific videos appear in the sets.
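The recovery step described above is mechanical: a YouTube video ID uniquely determines its public watch URL, so even an entry stripped of title and channel name can be mapped back to its source and checked against a creator's own uploads. A minimal sketch of that lookup, with hypothetical field names (no specific dataset's schema is assumed):

```python
# Sketch: map anonymized dataset entries back to YouTube sources via video IDs.
# The field names ("video_id", "clip_start") are illustrative only.

def youtube_url(video_id: str) -> str:
    """A YouTube video ID uniquely identifies its public watch URL."""
    return f"https://www.youtube.com/watch?v={video_id}"

def find_matches(dataset_entries, my_video_ids):
    """Return dataset entries whose IDs appear in a creator's own video list."""
    mine = set(my_video_ids)
    return [e for e in dataset_entries if e["video_id"] in mine]

entries = [
    {"video_id": "dQw4w9WgXcQ", "clip_start": 12.0},
    {"video_id": "abc123XYZ00", "clip_start": 0.0},
]
hits = find_matches(entries, ["dQw4w9WgXcQ"])
print([youtube_url(e["video_id"]) for e in hits])
```

This is the same logic a public lookup tool needs: an exact-match join on the ID, no title or channel metadata required.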
According to The Atlantic, companies including Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent have used these datasets for training. Meta, Amazon, and Nvidia responded, saying they respect creators and believe their use is lawful. Amazon said it is currently focused on producing "compelling, high-quality advertisements from simple prompts." News and educational channels are especially exposed: the BBC with at least 33,000 videos, TED with nearly 50,000, plus hundreds of thousands from individual creators. A leak from Runway, reported by 404 Media and cited by The Atlantic, shows what the company prioritized: "high camera movement," "beautiful cinematic landscapes," "super high quality sci-fi short films"—one channel labeled "THE HOLY GRAIL OF CAR CINEMATICS SO FAR."
Curators of collections such as HowTo100M and HD-VILA-100M leaned on high view counts, while HD-VG-130M used AI to select clips of "aesthetic quality." Datasets often avoid videos with overlays such as subtitles and logos, which makes watermarks a deterrent. Long videos are split into short clip segments and captioned in English, either by crowd workers or automatically by AI, to align text with moving images, The Atlantic explains.
The results are already in products: Meta is developing its Movie Gen text-to-video suite, Snap offers AI Video Lenses, and Google’s Gemini can animate photos into short clips or generate new videos with Veo 3. At the same time, platforms train on their own content: Google on at least 70 million YouTube clips and Meta on more than 65 million Instagram clips. Creators increasingly find themselves competing with synthetic content on the very platforms they helped build.
The industry’s double standard
The reports highlight a central contradiction. While pushing for broad copyright exceptions to train AI, the same companies bar scraping of their own platforms in their terms of service. ICMP points to clauses at Facebook, YouTube, X, Google, OpenAI, Microsoft, and Adobe that require prior written consent for data use.
The reporting also undercuts another common industry argument—that disclosing training data is too complex. Data reviewed by ICMP and leaks from companies like Runway show the opposite: scraped content is meticulously labeled with metadata such as artist, genre, and tempo, suggesting that detailed traceability—of the kind the EU’s AI Act envisions—would be feasible.
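If scraped content already carries structured labels of the kind the leaks describe, producing an AI-Act-style training-data summary reduces to a simple aggregation over that metadata. A minimal sketch under that assumption; the records and field names (artist, genre, tempo) are hypothetical, modeled on the label categories mentioned above:

```python
from collections import Counter

# Hypothetical records modeled on the metadata categories the reporting
# describes (artist, genre, tempo); values are invented for illustration.
records = [
    {"artist": "The Weeknd", "genre": "pop", "tempo_bpm": 171},
    {"artist": "Ed Sheeran", "genre": "pop", "tempo_bpm": 95},
    {"artist": "Ed Sheeran", "genre": "folk", "tempo_bpm": 76},
]

def provenance_summary(rows):
    """Aggregate per-artist counts: the kind of sufficiently detailed
    training-data disclosure the EU's AI Act envisions from providers."""
    return dict(Counter(r["artist"] for r in rows))

print(provenance_summary(records))
```

The point is not the code's complexity but its triviality: where per-item metadata exists, traceability is a bookkeeping exercise rather than a technical barrier.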