
A study points to another potential copyright problem and cultural challenge with today's large language models: the more famous and popular a book is, the better the language model memorizes its content.

Researchers at the University of California, Berkeley, tested ChatGPT, GPT-4, and BERT for their ability to memorize books. According to the study, the language models memorized "a wide collection of copyrighted materials." The more often the content of a book is found on the web, the better the language model memorizes it.

Book archaeology in a large language model

According to the study, OpenAI's models are particularly good at memorizing science fiction, fantasy, and bestsellers. These include classics such as 1984, Dracula, and Frankenstein, as well as more recent works such as Harry Potter and the Philosopher's Stone.

The researchers compared Google's BERT with ChatGPT and GPT-4, since BERT's training data is known. To their surprise, they found that "BookCorpus," a training set of supposedly free books by unknown authors, included copyrighted works such as novels by Dan Brown and Fifty Shades of Grey. BERT memorizes content from these books because they were part of its training data.

Image: Kent K. Chang et al.

The more often a book appears on the web, the more of its detail a large language model memorizes, the researchers write. They tested memorization with masked-name prompts that ChatGPT and GPT-4 had to complete.

You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: Ponyboy

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: Gerty

Input: My back's to the window. I expect a stranger, but it's [MASK] who pushes open the door, flicks on the light. I can't place that, unless he's one of them. There was always that possibility.

Output:

Example prompt
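
A probe like this is straightforward to script. Below is a minimal sketch of such a name-cloze check, assuming the official OpenAI Python client (openai>=1.0) with an API key in the OPENAI_API_KEY environment variable; the prompt text follows the example above, while the helper name and scoring rule are illustrative, not the study's actual code.

```python
# Minimal sketch of the name-cloze memorization probe described above.
# Assumes the official OpenAI Python client (pip install openai) and an
# API key in the OPENAI_API_KEY environment variable. Helper name and
# exact-match scoring are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

PROMPT = """You have seen the following passage in your training data. \
What is the proper name that fills in the [MASK] token in it? This name is \
exactly one word long, and is a proper name (not a pronoun or any other \
word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: Ponyboy

Input: {passage}
Output:"""


def name_cloze_probe(passage: str, expected: str, model: str = "gpt-4") -> bool:
    """Ask the model to fill in the masked name and check its guess."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # make the guess as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
    )
    guess = response.choices[0].message.content.strip().strip(".")
    return guess.lower() == expected.lower()


# Example passage and answer taken from the prompt shown above:
if name_cloze_probe(
    "The door opened, and [MASK], dressed and hatted, entered with a cup of tea.",
    expected="Gerty",
):
    print("model reproduced the masked name - evidence of memorization")
```

Run over many masked passages per book, the share of correct guesses yields a per-book memorization score of the kind the study reports.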

Memorization also determines how well the language model performs downstream tasks about a book: the better known a book is, the more likely the model is to succeed at tasks such as naming its year of publication or correctly identifying its characters.

Image: Kent K. Chang et al.
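
The comparison behind this finding can be sketched in a few lines: compute a memorization score and a downstream-task accuracy per book, then check how strongly the two rank together. The numbers below are invented placeholders, and the choice of Spearman correlation is an assumption for illustration, not necessarily the study's statistic.

```python
# Sketch: does downstream-task accuracy track per-book memorization?
# All scores are invented placeholders; in the study they would come
# from many cloze probes and task questions per book.
from scipy.stats import spearmanr

memorization = {   # share of name-cloze probes answered correctly
    "Nineteen Eighty-Four": 0.92,
    "Dracula": 0.81,
    "Frankenstein": 0.74,
    "Obscure Novel": 0.07,
}
task_accuracy = {  # e.g. naming the publication year, identifying characters
    "Nineteen Eighty-Four": 0.95,
    "Dracula": 0.88,
    "Frankenstein": 0.83,
    "Obscure Novel": 0.31,
}

books = sorted(memorization)
rho, p_value = spearmanr(
    [memorization[b] for b in books],
    [task_accuracy[b] for b in books],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")  # high rho = strong link
```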

Language models as a tool for cultural analysis may suffer from narrative bias

The researchers are not primarily concerned with copyright issues. Rather, they are interested in the potential opportunities and problems of using large language models for cultural analysis, particularly the social biases carried by the common narratives of popular science fiction and fantasy works.

Cultural analysis research could come to rely heavily on large language models, and performance that varies with how prominently a book appears in the training material could bias that research.


In this context, the research team has a clear recommendation: disclose the training data.

The models learn particularly well from popular narratives, which do not represent the experiences of the majority of people, the authors write. How this affects the output of large language models, and thus their usefulness as a tool for cultural analysis, requires further research.

The researchers' own work, which links a language model's memorization of books to their popularity on the internet, offers a rough guide but does not address the underlying problem. According to the researchers, that can only be solved by open models with known training data.

In addition, the research showed that popular books are a poor performance test for large language models, since the models are likely to perform unusually well on material they have memorized, the team said.


The list of popular books represented in the language models is available here. The code used and more data from the study are available on GitHub.

As with image models, whether reproduced book passages become a copyright issue will likely depend on how closely the texts generated by the model match those of the books in the dataset. This will have to be decided in court.

Summary
  • A study identifies potential copyright issues and cultural challenges: the more famous and popular a book, the more thoroughly language models memorize it.
  • The study shows that the performance of language models in downstream tasks depends on the popularity of books, which can lead to biases in cultural analysis.
  • The researchers recommend disclosing the training data to increase transparency and improve the usefulness of language models in cultural analysis.