OpenAI says it's "impossible" to train state-of-the-art models without copyrighted data

DALL-E 3 prompted by THE DECODER

The New York Times lawsuit against OpenAI has escalated the debate about AI training with copyrighted material. Even before the NYT lawsuit, OpenAI had publicly stated that it could not train leading AI models without this data.

In early December 2023, OpenAI made a statement about the need for copyrighted material for AI training to the British Parliament, which launched an inquiry into large language models last July.

According to the testimony, it is "impossible" to train today's leading AI models without copyrighted material, as today's copyright law largely covers any form of human expression.

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI

Limiting training data to decades-old books and images may be an "interesting experiment," but it does not lead to AI models that meet today's needs. In particular, OpenAI points out that parts of the creative industries work with the models and would benefit from them.

By democratizing the capacity to create, AI tools will expand the quantity, diversity, and quality of creative works, in both the commercial and noncommercial spheres. This will invigorate all creators, including those employed by the existing copyright industries, as these tools increase worker productivity, lower the costs of production, and stimulate creativity by making it easier to brainstorm, prototype, iterate, and share ideas.

OpenAI

OpenAI respects copyright and does not expect it to prevent AI training, it says. The company believes in "fair use," which it cited as a key argument in its legal battle with the New York Times.

But there is "still work to be done to support and empower creators," OpenAI says. It points to its support for individual publishers and the ability to block its training data-crawling bot. GPT-4 was trained before this feature was available, so this argument is only relevant for future AI models.

It's more about money than copyright

OpenAI's argument misses the point that content creators' criticism is often not directed at AI training on copyrighted material per se. It is primarily directed at unpaid training on copyrighted material.

OpenAI points out in its statement that it also uses licensed training material. But that is precisely the point of controversy: which data must be paid for, and how much?

In its letter to the UK Commission, OpenAI does not address the associated costs, which could bring the entire generative AI business to a halt.

Recommendation

AI in practice

OpenAI plans GPT-5 release in "a few months," shifts strategy on reasoning models

Meta was more explicit in a statement to the U.S. Copyright Office in the fall: licensing AI training data on the scale required would be unaffordable.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

OpenAI says it's "impossible" to train state-of-the-art models without copyrighted data

It's more about money than copyright

OpenAI plans GPT-5 release in "a few months," shifts strategy on reasoning models

Amazon's new DeepFleet mode helps robots deliver your packages even faster

Black Forest Labs opens its AI image model FLUX.1 context [dev] for private use

Meta reportedly offers top OpenAI researchers up to $300 million over four years

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

New Othello experiment supports the world model hypothesis for large language models

ChatGPT might be draining your brain, MIT warns - what ‘cognitive debt’ means for you

OpenAI says it's "impossible" to train state-of-the-art models without copyrighted data

It's more about money than copyright

Share

Bank details