
A study points to another potential copyright problem and cultural challenge of today's large language models: The more famous and popular a book is, the better the language model memorizes its content.


Researchers at the University of California, Berkeley, tested ChatGPT, GPT-4, and BERT for their ability to memorize books. According to the study, the language models memorized "a wide collection of copyrighted materials." The more often the content of a book is found on the web, the better the language model memorizes it.

Book archaeology in a large language model

According to the study, OpenAI's models are particularly good at memorizing science fiction, fantasy, and bestsellers. These include classics such as 1984, Dracula, and Frankenstein, as well as more recent works such as Harry Potter and the Philosopher's Stone.

The researchers compared Google's BERT with ChatGPT and GPT-4 because BERT's training data is known. To their surprise, they found that "BookCorpus," a training set of supposedly free books by unknown authors, also included works by Dan Brown as well as Fifty Shades of Grey. BERT memorizes information from these books because they were part of its training data.

Image: Kent K. Chang et al.

The more often a book appears on the web, the more thoroughly a large language model memorizes it, the researchers write. They tested memorization with fill-in-the-blank prompts in which ChatGPT and GPT-4 had to supply a masked character name.

You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: Ponyboy

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: Gerty

Input: My back's to the window. I expect a stranger, but it's [MASK] who pushes open the door, flicks on the light. I can't place that, unless he's one of them. There was always that possibility.


Example prompt
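To make the test procedure more concrete, the following Python sketch shows how a prompt of this kind could be assembled and scored. It is an illustration under stated assumptions, not the researchers' code: the function names (mask_name, build_prompt, cloze_accuracy), the trailing "Output:" line, and the exact-match scoring are hypothetical choices, and the model is represented by any callable that takes a prompt and returns text.

```python
import re

# Instruction text and the two worked examples are taken from the prompt above.
INSTRUCTION = (
    "You have seen the following passage in your training data. "
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word). "
    "You must make a guess, even if you are uncertain."
)

EXAMPLES = [
    ("Stay gold, [MASK], stay gold.", "Ponyboy"),
    ("The door opened, and [MASK], dressed and hatted, "
     "entered with a cup of tea.", "Gerty"),
]


def mask_name(passage, name):
    """Replace the single occurrence of `name` in `passage` with [MASK]."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, "[MASK]", passage)


def build_prompt(masked_passage):
    """Assemble instruction, worked examples, and the passage to complete."""
    lines = [INSTRUCTION, "", "Example:"]
    for example_input, example_output in EXAMPLES:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    lines.append(f"Input: {masked_passage}")
    # Assumption: the prompt ends with an open "Output:" for the model to fill.
    lines.append("Output:")
    return "\n".join(lines)


def cloze_accuracy(items, complete):
    """Share of passages for which the model returns the masked name.

    `items` is a list of (passage, name) pairs; `complete` is a hypothetical
    stand-in for a model call that maps a prompt string to a reply string.
    """
    if not items:
        raise ValueError("need at least one passage")
    correct = 0
    for passage, name in items:
        guess = complete(build_prompt(mask_name(passage, name))).strip()
        correct += guess == name  # exact match, an assumed scoring rule
    return correct / len(items)
```

Computing this accuracy separately for the passages of each book would give one score per book, which could then be compared with how often that book appears on the web.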

Memorization determines the ability of the language model to perform downstream tasks about a book: The better a book is known, the more likely the language model is to successfully perform tasks such as naming the year of publication or correctly identifying characters from books.

Image: Kent K. Chang et al.

Language models as a tool for cultural analysis may suffer from narrative bias

The researchers are not primarily concerned with copyright issues. Rather, they are concerned with the potential opportunities and problems of using large-scale language models for cultural analysis, particularly the social biases caused by common narratives in popular science fiction and fantasy works.

Cultural analysis research could come to rely heavily on large-scale language models. Because the models perform differently depending on whether a book was part of their training material, they could introduce bias into that research.


In this context, the research team has a clear request: the disclosure of training data.

The models learn particularly well from popular narratives, which do not represent the majority of people's experiences, the authors write. How this fact affects the output of large-scale language models, and thus their usefulness as a tool for cultural analysis, requires further research.

The researchers' own work, which links how well language models memorize books to those books' popularity on the Internet, offers a rough guide. But it does not address the underlying problem, which, according to the researchers, can only be solved by open models with known training data.

In addition, the research showed that popular books are not a good performance test for large language models, because the models are likely to perform better on them than on less familiar texts, the team said.


The researchers have published the list of popular books represented in the language models. The code used and more data from the study are available on GitHub.

As with image models, whether book citations become a copyright issue will likely depend on how closely the texts generated by the model match those of the books in the dataset. This will have to be decided in court.

  • A study identifies potential copyright issues and cultural challenges: language models memorize better-known, more popular books more thoroughly.
  • The study shows that the performance of language models in downstream tasks depends on the popularity of books, which can lead to biases in cultural analysis.
  • The researchers recommend disclosing the training data to increase transparency and improve the usefulness of language models in cultural analysis.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.