
Researchers at the Hebrew University of Jerusalem have discovered that the number of documents processed in Retrieval Augmented Generation (RAG) affects language model performance, even when the total text length remains constant.

The research team used MuSiQue's validation dataset, which contains 2,417 answerable questions. Each question links to 20 Wikipedia paragraphs, with two to four paragraphs containing relevant answer information while the rest serve as realistic distractors.

Study design showing question collections with supporting documents (pink) and distracting documents (blue). Fewer documents are compensated by extending the remaining ones to maintain consistent length. | Image: Levy et al.

To study how document quantity affects performance, the researchers created multiple data partitions. They gradually reduced the number of documents from 20 to 15, 10, and 8, and finally down to just the two to four documents containing relevant information. To maintain consistent token counts and information positioning, they expanded the remaining documents using text from their original Wikipedia articles.
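In rough terms, that partitioning step can be sketched as follows. This is a minimal illustration rather than the authors' code: the record fields, the build_partition helper, and the whitespace token count are assumptions made for readability.

```python
import random

def count_tokens(text: str) -> int:
    # Crude whitespace token count; the paper's exact tokenization is not specified here.
    return len(text.split())

def build_partition(supporting, distractors, extra_text, n_docs, token_budget, seed=0):
    """Build one partition: keep all supporting paragraphs, sample distractors
    down to n_docs total, then pad the kept documents with extra text from their
    source Wikipedia articles so every partition reaches the same token budget."""
    rng = random.Random(seed)
    n_distractors = max(0, n_docs - len(supporting))
    docs = supporting + rng.sample(distractors, n_distractors)

    deficit = token_budget - sum(count_tokens(d["text"]) for d in docs)
    padded = []
    for doc in docs:
        # Take a share of the missing tokens from the document's original article.
        extra_tokens = extra_text.get(doc["title"], "").split()
        take = extra_tokens[: max(0, deficit // len(docs))]
        padded.append({**doc, "text": (doc["text"] + " " + " ".join(take)).strip()})
    return padded
```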

Fewer documents lead to better results

Testing several open-source models, including Llama-3.1, Qwen2, and Gemma 2, revealed that reducing the document count improved performance by up to 10 percent in most cases. Qwen2 was the exception, possibly because it handles collections of multiple documents more effectively. Although the tested models are only a few months old, newer versions such as Llama-3.3, Qwen2.5, and Gemma 3 have already superseded them.

Performance comparison showing Qwen2 maintaining steady performance while Llama-3.1 and Gemma-2 decline up to 10 percent with increased document count. | Image: Levy et al.

The language models performed significantly better when given only supporting documents, which meant shorter context and eliminated distracting content. The results showed that similar but unrelated documents, often retrieved in RAG systems, can confuse the model and reduce performance.
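The comparisons above report F1 scores. For open-ended question answering this is typically the token-level F1 between the predicted and gold answers, sketched below in the standard SQuAD-style form; the paper may normalize answers differently, so treat this as an illustration.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 commonly used in QA benchmarks: overlap between predicted
    and gold answer tokens, balanced between precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```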

Research findings showing improved model performance with random irrelevant documents, suggesting models more easily identify and filter obviously irrelevant content. | Image: Levy et al.

The study demonstrates that processing multiple documents makes tasks more challenging in a retrieval environment. The researchers emphasize that retrieval systems need to balance relevance and diversity to minimize conflicts. Future models might benefit from mechanisms that can identify and discard contradictory information while still utilizing document diversity.
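One widely used way to trade off relevance against diversity at retrieval time is maximal marginal relevance (MMR). The sketch below is a generic illustration of that idea, not a technique proposed in the study; the embedding vectors and the lam weight are assumed inputs.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Maximal Marginal Relevance: greedily pick documents that are relevant to
    the query but not redundant with documents already selected.
    lam=1.0 is pure relevance, lam=0.0 is pure diversity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam close to 1 the retriever behaves like plain similarity search; lowering it penalizes candidates that overlap heavily with documents already chosen, which is one way to keep near-duplicate distractors out of the context.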

The researchers acknowledge certain study limitations, including lack of investigation into prompt variations and data order effects. They've made their datasets publicly available to facilitate further research into multiple document processing.

The RAG versus context window debate continues

As context windows continue to grow, there's ongoing discussion about whether RAG systems remain necessary. While language models are getting better at processing large amounts of text at once, RAG architectures show particular advantages when using smaller open-source models.

Summary
  • A study by researchers at the Hebrew University of Jerusalem investigated the impact of the number of documents processed on the performance of AI language models in Retrieval Augmented Generation (RAG).
  • Using a dataset of questions and Wikipedia paragraphs, the researchers reduced the number of documents while maintaining the same overall text length. In most cases, this led to performance improvements of up to 10 percent for models like Llama-3.1 and Gemma 2.
  • The findings suggest that processing multiple documents in a RAG environment complicates the task for the models. The study highlights the need for retrieval systems to strike a balance between relevance and diversity to minimize conflicts in the input data.