Tech giants bend rules and discuss copyright violations to keep pace in AI arms race

A New York Times report reveals that leading AI companies are disregarding copyright and licensing rights while gathering data to train their AI models, even getting in each other's way in the process.

According to the report, OpenAI, Google, and Meta partly ignored their own guidelines and discussed intentionally violating copyrights, assuming their competitors would do the same.

For instance, OpenAI developed Whisper, a speech recognition tool that transcribed over a million hours of YouTube videos, despite knowing it could be legally questionable since YouTube forbids using its content for unrelated applications.

The obtained texts were still used to develop GPT-4, OpenAI's most advanced language model. The Information already reported in summer 2023 that OpenAI uses YouTube transcripts.

Everything hinges on "fair use"

Google has also used YouTube video transcripts to train its AI models - potentially violating the copyrights of video creators. Google spokesman Matt Bryant said that the company had agreements with YouTube creators allowing such use, but admitted that Google was aware of "unconfirmed reports" about OpenAI's practice.

Google also changed its privacy policy so it could draw on more user data from its services for AI training, including YouTube, Google Docs, Sheets, and similar products, to improve systems like Google Translate and Bard.

Google has retroactively changed its privacy policy to allow it to use more user data for AI training. | Image: New York Times

The U.S. Federal Trade Commission (FTC) is critical of such retroactive adjustments to privacy rules to extract more data for AI training, and warns companies against this approach.

Records show Meta managers and lawyers also discussed obtaining additional data despite copyright restrictions, such as possibly buying publisher Simon & Schuster, which publishes J.K. Rowling and Stephen King.

Mark Zuckerberg pressed the company to catch up with ChatGPT quickly. Meta argued that licensing talks with publishers, artists, musicians, and news outlets would take too long and that using the data would likely qualify as "fair use."

Meta also said that licensing deals on the scale required for AI training were unaffordable. OpenAI said current AI models would be impossible without training on protected data.

Will Google now sue OpenAI?

The NYT report shows that big tech firms disregard third-party rights when collecting data, show few ethical concerns, and bend rules to fit their business models.

It would be highly inconsistent if Google sued OpenAI over YouTube training, as YouTube CEO Neal Mohan recently threatened, while Google itself faces trials for numerous potential data rights violations.

Only now that the first large-scale AI models are out, proving themselves in the market and facing critical public questions, are AI companies seeking to license training data from publishers, communities like Reddit, or online archives such as Photobucket.

Companies are also testing synthetic AI-generated data for training, but this risks exacerbating existing errors and biases, potentially degrading performance over time. It also raises questions about legitimate data origins if models that generate artificial training data have been trained on copyrighted data.
