
OpenAI says it's "impossible" to train state-of-the-art models without copyrighted data

Matthias Bastian
[Illustration: a robot hand offering a bag of money to a human hand. Image: DALL-E 3 prompted by THE DECODER]

The New York Times lawsuit against OpenAI has intensified the debate about training AI on copyrighted material. Even before the lawsuit was filed, OpenAI had publicly stated that it could not train leading AI models without such material.

In early December 2023, OpenAI submitted a statement on the need for copyrighted material in AI training to the British Parliament, which launched an inquiry into large language models in July 2023.

According to the statement, it is "impossible" to train today's leading AI models without copyrighted material, because current copyright law covers virtually every form of human expression.

Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI

Limiting training data to public domain books and drawings that are more than a century old may be an "interesting experiment," OpenAI argues, but it would not produce AI models that meet today's needs. The company also points out that parts of the creative industries already work with its models and stand to benefit from them.

By democratizing the capacity to create, AI tools will expand the quantity, diversity, and quality of creative works, in both the commercial and noncommercial spheres. This will invigorate all creators, including those employed by the existing copyright industries, as these tools increase worker productivity, lower the costs of production, and stimulate creativity by making it easier to brainstorm, prototype, iterate, and share ideas.

OpenAI

OpenAI says it respects copyright and does not believe that copyright law prevents AI training. The company relies on the "fair use" doctrine, which it also cites as a key argument in its legal battle with the New York Times.

But there is "still work to be done to support and empower creators," OpenAI says. It points to its support for individual publishers and to the option for website operators to block the web crawler it uses to collect training data. GPT-4 was trained before this option was available, so this argument is only relevant for future AI models.

It's more about money than copyright

OpenAI's argument misses the point that content creators often do not object to AI training on copyrighted material per se. Their criticism is primarily directed at unpaid training on copyrighted material.

OpenAI points out in its statement that it also uses licensed training material. But that is precisely the point of controversy: which data must be paid for, and how much?

In its statement to the British Parliament, OpenAI does not address the associated licensing costs, which could bring the entire generative AI business to a halt.

Meta was more explicit in a statement to the U.S. Copyright Office in the fall of 2023: licensing AI training data on the scale required would be unaffordable.