Study finds that fewer documents can lead to better performance in RAG systems


Researchers at the Hebrew University of Jerusalem have discovered that the number of documents processed in Retrieval Augmented Generation (RAG) affects language model performance, even when the total text length remains constant.

The research team used MuSiQue's validation dataset, which contains 2,417 answerable questions. Each question links to 20 Wikipedia paragraphs, with two to four paragraphs containing relevant answer information while the rest serve as realistic distractors.

Figure: Study design showing question collections with supporting documents (pink) and distracting documents (blue). The number of supporting documents stays constant while the number of distractors decreases from left to right; the remaining documents are extended to keep the total length consistent. | Image: Levy et al.

To study how document quantity affects performance, the researchers created multiple data partitions. They gradually reduced the number of documents from 20 to 15, 10, and 8, and finally down to just the two to four documents containing relevant information. To keep token counts and the positions of the relevant information consistent, they expanded the selected documents using text from the original Wikipedia articles.
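The partitioning step can be illustrated with a minimal sketch. This is not the researchers' code: the `Doc` class, its `extra` field, and the `build_partition` function are hypothetical names, and whitespace-separated words stand in for model tokens. The sketch keeps every supporting document, drops distractors until only `keep` documents remain, and then pads the survivors with follow-on text from their source articles until the original length is restored.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str         # paragraph as shown to the model
    extra: str        # hypothetical field: text that follows the paragraph in its source article
    supporting: bool  # True if the paragraph contains answer-relevant information


def build_partition(docs: list[Doc], keep: int, target_tokens: int) -> list[str]:
    """Keep all supporting docs plus enough distractors to reach `keep` documents,
    then pad the kept docs from their source articles up to `target_tokens`."""
    supporting = [d for d in docs if d.supporting]
    distractors = [d for d in docs if not d.supporting]
    n_distractors = max(keep - len(supporting), 0)
    kept_ids = {id(d) for d in supporting + distractors[:n_distractors]}
    kept = [d for d in docs if id(d) in kept_ids]  # preserve the original order

    # Whitespace tokens stand in for the model tokenizer (an assumption).
    texts = [d.text.split() for d in kept]
    extras = [d.extra.split() for d in kept]
    missing = target_tokens - sum(len(t) for t in texts)

    # Round-robin: hand out one extra word at a time so documents grow evenly.
    i = 0
    while missing > 0 and any(extras):
        j = i % len(kept)
        if extras[j]:
            texts[j].append(extras[j].pop(0))
            missing -= 1
        i += 1
    return [" ".join(t) for t in texts]
```

Calling `build_partition(docs, keep=8, target_tokens=total)` with `total` set to the token count of the full 20-document set would yield an 8-document input of the same overall length, provided the source articles contain enough extra text.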

Fewer documents lead to better results

Testing several open-source models, including Llama-3.1, Qwen2, and Gemma 2, revealed that reducing the document count improved performance by up to 10 percent in most cases. Qwen2 proved to be an exception, possibly because it handles multi-document collections more effectively. Although the tested models are only a few months old, newer versions such as Llama-3.3, Qwen2.5, and Gemma 3 have already superseded them.

Figure: F1 scores of several large language models with different numbers of retrieved documents. Qwen2 maintains steady performance, while Llama-3.1 and Gemma-2 decline by up to 10 percent as the document count increases. | Image: Levy et al.
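The article does not spell out how the F1 scores in the charts are computed. A common choice for question answering benchmarks is token-overlap F1 between the predicted and the reference answer, and the sketch below assumes that definition. The function name `token_f1` is hypothetical, and the normalization is deliberately simple (lowercasing and whitespace splitting only); standard evaluation scripts usually also strip punctuation and articles.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a verbose but correct answer is penalized through precision: `token_f1("the city of Paris", "Paris")` gives a precision of 0.25 and a recall of 1.0, for an F1 of 0.4.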

The language models performed significantly better when given only the supporting documents, a setup that shortens the context and removes distracting content. The results suggest that similar but unrelated documents, which RAG systems often retrieve, can confuse the model and reduce performance.

Figure: F1 scores for Qwen-2 72B, Qwen-2 7B, Llama-3.1 70B, Llama-3.1 8B, Gemma-2 27B, and Gemma-2 9B on different data sets. Model performance improves with random irrelevant documents, suggesting that models more easily identify and filter out obviously irrelevant content. | Image: Levy et al.

The study demonstrates that processing multiple documents makes tasks more challenging in a retrieval environment. The researchers emphasize that retrieval systems need to balance relevance and diversity to minimize conflicts. Future models might benefit from mechanisms that can identify and discard contradictory information while still utilizing document diversity.

The researchers acknowledge certain study limitations, including lack of investigation into prompt variations and data order effects. They've made their datasets publicly available to facilitate further research into multiple document processing.

The RAG versus context window debate continues

As context windows continue to grow, there's ongoing discussion about whether RAG systems remain necessary. While language models are getting better at processing large amounts of text at once, RAG architectures show particular advantages when using smaller open-source models.
