OpenAI says it's "impossible" to train state-of-the-art models without copyrighted data

Jan 9, 2024

Key Points

OpenAI submitted a statement to the UK Parliament claiming that it is "impossible" to train leading AI models without copyrighted material, as current copyright law covers almost every form of human expression.
The company argues for "fair use" and emphasizes that AI models support the creative economy by increasing productivity, reducing production costs, and stimulating creativity.
However, the main criticism from the content-creating industries is directed at the unpaid training of AI on copyrighted material, with the issue of licensing and the cost of training data at the heart of the debate.

The New York Times lawsuit against OpenAI has escalated the debate about AI training with copyrighted material. Even before the NYT lawsuit, OpenAI had publicly stated that it could not train leading AI models without this data.

In early December 2023, OpenAI submitted a statement to the British Parliament about the need for copyrighted material in AI training. Parliament had launched an inquiry into large language models in July 2023.

According to the statement, it is "impossible" to train today's leading AI models without copyrighted material, because today's copyright law covers virtually every form of human expression.

Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI

Limiting training data to public domain books and drawings more than a century old may be an "interesting experiment," OpenAI argues, but it would not produce AI models that meet today's needs. The company also points out that parts of the creative industries already work with the models and stand to benefit from them.

By democratizing the capacity to create, AI tools will expand the quantity, diversity, and quality of creative works, in both the commercial and noncommercial spheres. This will invigorate all creators, including those employed by the existing copyright industries, as these tools increase worker productivity, lower the costs of production, and stimulate creativity by making it easier to brainstorm, prototype, iterate, and share ideas.

OpenAI

OpenAI says it respects copyright and does not expect copyright law to prevent AI training. The company believes in "fair use," which it cited as a key argument in its legal battle with the New York Times.

But there is "still work to be done to support and empower creators," OpenAI says. It points to its support for individual publishers and to the option of blocking the crawler that collects its training data. GPT-4 was trained before this option was available, so the argument is only relevant for future AI models.

It's more about money than copyright

OpenAI's argument misses the point that content creators' criticism is often not directed at AI training on copyrighted material per se. It is primarily directed at unpaid training on copyrighted material.

OpenAI points out in its statement that it also uses licensed training material. But that is precisely the point of controversy: which data must be paid for, and how much?

In its statement to the UK Parliament, OpenAI does not address the associated licensing costs, which could bring the entire generative AI business to a halt.

Meta was more explicit in a statement to the U.S. Copyright Office in the fall of 2023: licensing AI training data on the scale required would be unaffordable.

Source: OpenAI—written evidence (LLM0113)