A study points to another potential copyright problem and cultural challenge of today’s large language models: The more famous and popular a book is, the better the language model memorizes its content.
Researchers at the University of California, Berkeley, tested ChatGPT, GPT-4, and BERT for their ability to memorize books. According to the study, the language models memorized “a wide collection of copyrighted materials.” The more often the content of a book is found on the web, the better the language model memorizes it.
Book archaeology in a large language model
According to the study, OpenAI’s models are particularly good at memorizing science fiction, fantasy, and bestsellers. These include classics such as 1984, Dracula, and Frankenstein, as well as more recent works such as Harry Potter and the Philosopher’s Stone.
The researchers compared Google’s BERT with ChatGPT and GPT-4, since BERT’s training data is known. To their surprise, they found that “BookCorpus,” a training set supposedly consisting of free books by unpublished authors, included copyrighted works such as novels by Dan Brown and Fifty Shades of Grey. BERT memorizes information from these books because they were part of its training data.
The more often a book appears on the web, the more thoroughly a large language model memorizes it, the researchers write. They tested memorization with cloze-style prompts in which ChatGPT and GPT-4 had to fill in a masked proper name:
You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.
Example:
Input: Stay gold, [MASK], stay gold.
Output: Ponyboy

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: Gerty

Input: My back’s to the window. I expect a stranger, but it’s [MASK] who pushes open the door, flicks on the light. I can’t place that, unless he’s one of them. There was always that possibility.
Memorization predicts how well a language model performs downstream tasks about a book: the better known a book is, the more likely the model is to succeed at tasks such as naming its year of publication or correctly identifying its characters.
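The name-cloze test described above can be sketched in a few lines: mask every occurrence of a character’s name in a passage, then wrap the masked passage in an instruction asking the model to recover the name. This is a minimal illustration, not the study’s actual code; the function names and exact prompt wording are assumptions (the instruction text here paraphrases the prompt shown above).

```python
import re

def name_cloze(passage: str, name: str) -> str:
    """Replace each whole-word occurrence of a character's name
    with the [MASK] token, producing a name-cloze query."""
    return re.sub(rf"\b{re.escape(name)}\b", "[MASK]", passage)

def build_prompt(masked_passage: str) -> str:
    """Wrap a masked passage in an instruction asking the model to
    recover the masked proper name (wording is illustrative)."""
    return (
        "You have seen the following passage in your training data. "
        "What is the proper name that fills in the [MASK] token in it? "
        "This name is exactly one word long, and is a proper name. "
        "You must make a guess, even if you are uncertain.\n\n"
        f"Input: {masked_passage}\nOutput:"
    )

masked = name_cloze("Stay gold, Ponyboy, stay gold.", "Ponyboy")
print(masked)  # Stay gold, [MASK], stay gold.
print(build_prompt(masked))
```

Because the masked passage contains no other hint of the name, a correct completion is strong evidence that the model memorized the passage during training.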
Language models as a tool for cultural analysis may suffer from narrative bias
The researchers are not primarily concerned with copyright issues. Rather, they are concerned with the potential opportunities and problems of using large-scale language models for cultural analysis, particularly the social biases caused by common narratives in popular science fiction and fantasy works.
Cultural analysis research could come to lean heavily on large language models, and performance that varies with how strongly a book is represented in the training data could bias that research.
Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors. 5/6
— David Bamman (@dbamman) May 2, 2023
In this context, the research team has a clear request: the disclosure of training data.
The models learn particularly well from popular narratives, which do not represent the majority of people’s experiences, the authors write. How this fact affects the output of large-scale language models, and thus their usefulness as a tool for cultural analysis, requires further research.
The researchers’ own work, which links a model’s memorization of books to their popularity on the internet, offers a rough guide but does not address the underlying problem. That, they argue, can only be solved by open models with known training data.
In addition, the research showed that popular books are a poor benchmark for large language models, since the models are likely to overperform on texts they have memorized, the team said.
The list of popular books represented in the language models is available here. The code used and more data from the study are available on GitHub.
As with image models, whether book citations become a copyright issue will likely depend on how closely the texts generated by the model match those of the books in the dataset. This will have to be decided in court.