
A new study finds that large language models can remember and generate long passages from well-known books, sometimes nearly word for word. The results could have major consequences for future copyright lawsuits.


Researchers at Carnegie Mellon University and the Instituto Superior Técnico have introduced "RECAP", a method for checking exactly which texts an AI model has memorized. RECAP uses a feedback loop with several language models to reconstruct content from training data. According to the researchers' paper, RECAP can even reveal passages from copyrighted works.

This approach was developed because the training data for large models is usually kept secret. Providers often use copyrighted material, sometimes with permission and sometimes not, making it difficult to know what any given model contains.

RECAP checks whether a model can independently generate long sections of text. Since many models refuse direct requests for copyrighted content, RECAP includes a jailbreaking module that rewrites prompts until the model produces a usable answer. A second AI then compares the output to the original passage and provides feedback without quoting the source text. In many cases, results improved significantly after just one round of feedback, but additional rounds made less of a difference.
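The loop described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' actual implementation: the callables `query_model`, `rewrite_prompt`, `give_feedback`, and `score` stand in for LLM calls and the similarity metric, and the names are assumptions.

```python
MAX_ROUNDS = 5  # per the paper, the loop repeats up to five times

def recap_round(prompt, reference, query_model, rewrite_prompt,
                give_feedback, score, threshold=0.8, max_rounds=MAX_ROUNDS):
    """Hypothetical RECAP-style extraction loop (illustrative only)."""
    best = ""
    for _ in range(max_rounds):
        answer = query_model(prompt)
        if answer is None:
            # Model refused: the jailbreaking module rewrites the prompt.
            prompt = rewrite_prompt(prompt)
            continue
        # Keep the closest reconstruction seen so far.
        if score(answer, reference) > score(best, reference):
            best = answer
        if score(best, reference) >= threshold:
            break
        # Feedback agent critiques the output without quoting the source text.
        prompt = prompt + "\n" + give_feedback(answer, reference)
    return best
```

The early-exit threshold and the refusal signal (`None`) are design choices for this sketch; the paper's agents presumably communicate through richer prompts.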

Diagram of the RECAP process: From the input of text-based documents via a Section Summary Agent to the extraction of verbatim passages. A Verbatim Verifier then checks whether the response has been accepted or rejected. If rejected, the Jailbreaker intervenes; otherwise, the Feedback Agent creates improvement instructions. The process is repeated up to five times.
RECAP's workflow segments text, tries to reproduce it, evaluates the output, and iteratively improves the result. If a query is rejected, the jailbreaking module rewrites the prompt, while a feedback loop refines the answer up to five times. | Image: Duarte et al.

In testing, RECAP was able to reconstruct large portions of books like "The Hobbit" and "Harry Potter" with striking accuracy. For example, the researchers found that Claude 3.7 generated around 3,000 passages from the first "Harry Potter" book using RECAP, compared to just 75 passages found by earlier methods.

Bar chart showing the reproduction performance of language models across different books. Public domain works such as "Frankenstein" and "The Adventures of Sherlock Holmes" reach high accuracy, while copyrighted titles such as "The Hobbit" or "Harry Potter" score somewhat lower, but still clearly above the average for data not seen in training.
RECAP shows that large language models can partially reproduce both public domain and copyrighted books. Similarity scores (ROUGE-L) were highest for public domain works, but models also recalled copyrighted material in detail. | Image: Duarte et al.
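ROUGE-L, the metric cited above, measures overlap between a model's output and the original passage via their longest common subsequence (LCS). A minimal sketch of the standard F-measure variant, written from the metric's public definition rather than the paper's code:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

A score of 1.0 means a word-for-word match; near-verbatim reproductions of book passages would score close to that.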

Implications for Copyright Law

To test RECAP's limits, the team introduced a new benchmark called "EchoTrace" that includes 35 complete books: 15 public domain classics, 15 copyrighted bestsellers, and five recently published titles that were definitely not part of the models' training data. They also added 20 research articles from arXiv.

The results showed that models could reproduce passages from almost every category, sometimes nearly word for word, except for the books the models hadn’t seen during training. This reinforces the idea that models retain material they’ve been exposed to.

Given these findings, the researchers see RECAP as a way to verify exactly what data is inside large AI models. This level of transparency could be critical as copyright lawsuits continue to increase. While RECAP targets text, there are also reports that image models can reproduce content almost exactly, sometimes generating outputs nearly identical to original works.

The researchers cite a recent court case involving Anthropic, where the judge ruled in favor of "fair use" for training data, assuming the model did not intentionally memorize specific works. Tools like RECAP could provide concrete evidence in cases like this. The RECAP code is available on GitHub, and the "EchoTrace" dataset is hosted on Hugging Face.


Recent court decisions show just how unsettled this area of law is. In the UK, a court ruled that AI model weights (the values learned during training) do not contain copyright-protected content, so the models themselves are not infringing copyright.

By contrast, a German court found that both storing data in model weights and generating text verbatim violate copyright, in a case that focused on ChatGPT reproducing song lyrics. The results from RECAP could help support arguments for this stricter interpretation.

Summary
  • Researchers have created RECAP, a technique designed to test if large AI models can reproduce exact passages from books, including protected works like "Harry Potter".
  • RECAP works by combining multiple language models with a jailbreaking module in a feedback loop, allowing the extraction of longer passages and identifying more copied text than previous approaches.
  • This method could play an important role in copyright cases, as it demonstrates that AI models have memorized and can reproduce copyrighted material, an issue that current court decisions have not yet clearly resolved.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.