
A new court ruling draws a sharp line between fair use and infringement for AI companies training on copyrighted books, allowing transformative use of legally obtained works but rejecting any defense for pirated material.

A recent court decision allows AI companies to use copyrighted books for training if the works are obtained legally, calling the practice "transformative - spectacularly so" because the aim is to learn from, not copy, the original texts.

"Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use."

Bartz v. Anthropic PBC, p. 13-14

This reasoning matches the broader transformative use argument that many AI companies make when defending data scraping done without creators' consent.

The finished models don't directly reproduce the books, either. The court noted that the plaintiffs - authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson - didn't even try to show that Claude could generate outputs resembling or replacing the originals.

This part of the ruling covered books Anthropic legally bought in print, often secondhand. The company removed the bindings, scanned the books, and then destroyed the originals. The resulting PDFs were stored in a searchable internal library. Since Anthropic didn't make or distribute extra copies, the court said this also qualified as fair use.

No pass for piracy

The court took a much stricter line on books Anthropic obtained from pirate sources like Books3, LibGen, and PiLiMi. Between January 2021 and July 2022, Anthropic downloaded more than 7 million books from illegal sources, including works by the plaintiffs. These files were stored permanently, even if they weren't used for training. Meta and other AI companies are believed to have used similar data sources.

The court made clear that building a digital library of pirated books isn't transformative use and doesn't qualify as fair use. The idea that a company might develop a lawful use later doesn't excuse the initial infringement. "There is no carveout, however, from the Copyright Act for AI companies," the court wrote.

In short, using copyrighted works for AI training can be fair use if the data was obtained legally. But companies that knowingly use pirated copies can't rely on fair use as a defense.

One big question remains: Is mass scraping of online content - especially when technical barriers are bypassed - a lawful way to obtain data?

Many AI models are trained on data scraped from public websites without the creators' consent, and clear legal standards are still missing. If this ruling leads to a requirement for mass licensing of copyrighted data, it could pose major challenges for AI companies, even if the actual use of the data is considered transformative.

While the court sided with Anthropic on digitizing purchased books and using them for training, it refused to dismiss the case entirely. The claims related to pirated books and the permanent storage of unused works are still on the table. The proceedings will continue, focusing on whether Anthropic can be held liable for using pirated content. The court will also consider possible damages for "willful infringement."

The case is still in its early stages and will move forward in the US District Court for the Northern District of California. The outcome could influence other lawsuits over AI training and copyrighted data.

In a separate case involving Meta, another US judge has already raised serious doubts about whether copyrighted data can be used for AI training at all. The US Copyright Office also stated that fair use does not extend to AI models trained on large amounts of copyrighted material. Its director was removed by the Trump administration soon after that report was published, however, so that position may no longer reflect current policy.

Summary
  • A US court has decided that training LLMs using copyrighted books can qualify as fair use, as long as the books were acquired legally.
  • The court found no fair use protection for books obtained from pirate sources like Books3, LibGen, and PiLiMi, stating that storing and using pirated materials for an internal library is a clear violation of copyright law.
  • The lawsuit against Anthropic is ongoing: the court has yet to determine whether the company is liable for using pirated copies and whether it owes damages for willful infringement.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.