Summary

The context window of large language models, represented in tokens, indicates how much information an AI model can process simultaneously. Today, this window is large enough for models to summarize entire books. A new study assesses the quality of these summaries across multiple dimensions.


The context windows of large language models have been growing steadily. Among the largest at the time of writing are Claude 3 with 200,000 tokens and Google Gemini 1.5 Pro with one million tokens.

In theory, models with such large context windows should be able to summarize long documents like entire novels. However, the quality of these summaries can only be judged by people who are very familiar with the extensive source material, which requires a great deal of effort.

To create the FABLES dataset, the researchers had GPT-4 extract 3,158 statements from AI-generated summaries of 26 books, which were then reviewed by humans for accuracy. | Image: Kim et al.

Researchers from UMass Amherst, Adobe, the Allen Institute for AI, and Princeton University have published a new dataset called FABLES (Faithfulness Annotations for Book-Length Summarization) to advance research on evaluating the reliability and accuracy of AI-generated summaries for entire books.


The researchers found that Anthropic's latest model, Claude 3 Opus, significantly outperformed all of OpenAI's closed-source LLMs, with 90 percent of assertions rated as reliable, followed by GPT-4 and GPT-4 Turbo at 78 percent, GPT-3.5 Turbo at 72 percent, and Mixtral, the only open-source model tested, just behind at 70 percent.
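The percentages above are shares of extracted statements that human annotators rated as reliable. A minimal sketch of that calculation follows, assuming a simple list of (model, rating) pairs. The model names and ratings here are hypothetical toy values, not data from FABLES, and the function name is only illustrative.

```python
from collections import defaultdict

# Hypothetical toy annotations: (model name, was the statement rated reliable?)
# These values are invented for illustration and are NOT taken from the study.
annotations = [
    ("model_a", True), ("model_a", True), ("model_a", False), ("model_a", True),
    ("model_b", True), ("model_b", False),
]


def reliability_rate(rows):
    """Return, per model, the fraction of statements rated reliable."""
    total = defaultdict(int)
    reliable = defaultdict(int)
    for model, is_reliable in rows:
        total[model] += 1
        if is_reliable:
            reliable[model] += 1
    return {model: reliable[model] / total[model] for model in total}


rates = reliability_rate(annotations)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.0%} of statements rated reliable")
```

The rate is computed per statement rather than per summary, so a summary with many statements contributes more to a model's score than a short one.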

Analysis of the reviewers' comments showed that most of the unreliable statements related to events, characters, and relationships. Verifying the statements usually required indirect, multi-level reasoning, which the researchers said made the task even more complex.

The researchers developed taxonomies for the type of assertion and the type of reasoning in the AI summaries. | Image: Kim et al.

A good method that is difficult to scale

The study focused on books published in 2023 and 2024 so that they were unlikely to be part of the models' training data, which could have skewed the results. To keep costs and cognitive load to a minimum, the annotators were asked to read the books in advance on their own time.

The researchers note that their approach is not easily scalable to new books and datasets, as the 14 human helpers recruited through Upwork cost a total of $5,200. Expanding and constantly updating the dataset would therefore be very time-consuming and costly.

The researchers also experimented with using LLMs to automatically verify claims, but even their best method struggled to detect false claims reliably.

Language models cannot replace the human labor of verifying the extracted claims. False claims are classified as true and true claims as false by both Claude 3 and GPT-4. | Image: Kim et al.
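One way to picture the two kinds of error is to compare an automatic verifier's verdicts with the human labels statement by statement. The sketch below assumes two parallel lists of booleans. The labels are hypothetical toy values, and the function and key names are illustrative rather than taken from the study.

```python
def compare_with_humans(human_labels, verifier_labels):
    """Count agreements and both error types between human and automatic labels.

    True means "statement is reliable", False means "statement is unreliable".
    """
    if len(human_labels) != len(verifier_labels):
        raise ValueError("both label lists must have the same length")

    counts = {
        "reliable_accepted": 0,    # human: reliable, verifier: reliable
        "reliable_rejected": 0,    # true statement classified as false
        "unreliable_accepted": 0,  # false statement classified as true
        "unreliable_rejected": 0,  # human: unreliable, verifier: unreliable
    }
    for human, verifier in zip(human_labels, verifier_labels):
        if human and verifier:
            counts["reliable_accepted"] += 1
        elif human and not verifier:
            counts["reliable_rejected"] += 1
        elif not human and verifier:
            counts["unreliable_accepted"] += 1
        else:
            counts["unreliable_rejected"] += 1
    return counts


# Hypothetical toy labels, invented for illustration only.
human = [True, True, True, False, False, False]
verifier = [True, True, False, True, False, False]

counts = compare_with_humans(human, verifier)
unreliable_total = counts["unreliable_accepted"] + counts["unreliable_rejected"]
# Share of unreliable statements that the verifier actually caught.
detection_rate = counts["unreliable_rejected"] / unreliable_total
print(counts, f"detection rate: {detection_rate:.2f}")
```

Because most statements in a summary tend to be reliable, overall agreement can look high even when few unreliable statements are caught, which is why the share of detected unreliable statements is tracked separately here.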

Beyond the correctness of the assertions, the researchers formed further hypotheses based on the annotators' comments. In general, all language models made chronological errors, although the models with a larger context window were less affected.

All models were also criticized for omitting important information. Claude 3 Opus performed best in this respect, while GPT-4 Turbo and Mixtral even left out individual characters.

The researchers also confirmed a tendency previously observed in various models with very long context windows: they systematically give more weight to content at the end of a book, a phenomenon known as "lost-in-the-middle".

Claude 3 Opus is not perfect at summarizing long texts, but it is significantly better than the competition. | Image: Kim et al.

The researchers are publishing the FABLES dataset on GitHub to encourage further research of this kind.

  • Researchers from several U.S. universities and Adobe have released the Faithfulness Annotations for Book-Length Summarization (FABLES) dataset to evaluate the reliability, accuracy, and overall quality of AI-generated summaries of entire books.
  • The dataset contains human annotations on 3,158 statements from AI-generated summaries of 26 novels. Claude 3 Opus performed best in tests on this dataset with 90 percent of statements rated as reliable, followed by GPT-4 and GPT-4 Turbo with 78 percent, GPT-3.5 Turbo with 72 percent, and Mixtral with 70 percent.
  • The researchers also experimented with using LLMs to automatically verify statements, but encountered difficulties. They also found that while models with a larger context window made fewer chronological errors, they gave more weight than average to content at the end of a book.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.