A study from Stanford University investigates the extent to which Retrieval-Augmented Generation (RAG) improves the factual accuracy of Large Language Models (LLMs). The results show that the reliability of RAG systems depends critically on the quality of the retrieved data sources and on the language model's own prior knowledge.
Researchers at Stanford University have studied how reliably RAG systems answer questions compared to LLMs such as GPT-4 operating without retrieval. In a RAG system, the model is given a reference document or a database of relevant information to improve the accuracy of its answers.
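To make the setup concrete, here is a minimal sketch of how such a retrieval-augmented prompt might be assembled. The naive word-overlap retriever, the prompt wording, and the function names are illustrative assumptions, not the study's actual code or a particular library's API.

```python
# Minimal sketch of a RAG-style prompt: retrieve a relevant reference
# snippet and prepend it to the question before querying the model.

def retrieve_reference(question: str, documents: list[str]) -> str:
    """Naive retrieval: return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(documents, key=lambda doc: len(q_words & set(doc.lower().split())))

def build_rag_prompt(question: str, reference: str) -> str:
    """Prepend the retrieved reference so the model can ground its answer in it."""
    return (
        "Answer the question using the reference information below.\n\n"
        f"Reference:\n{reference}\n\n"
        f"Question: {question}\nAnswer:"
    )

documents = [
    "The maximum recommended daily dose of acetaminophen for adults is 4000 mg.",
    "Aspirin irreversibly inhibits the COX-1 enzyme.",
]
question = "What is the maximum daily dose of acetaminophen for adults?"
prompt = build_rag_prompt(question, retrieve_reference(question, documents))
print(prompt)  # this prompt would then be sent to GPT-4 or another LLM
```

In this setup, the model's answer depends on two things: what the retrieved reference says and what the model already "knows" from pre-training, which is exactly the tension the study examines.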
The study shows that the factual accuracy of RAG systems depends on both the strength of the AI model's pre-trained knowledge and the correctness of the reference information.
Tension between RAG and LLM knowledge
According to the research team, there is a tension between the internal knowledge of a language model and the information provided via RAG. This is especially the case when the retrieved information contradicts the model's pre-trained knowledge.
The researchers tested GPT-4 and other LLMs on six different question sets totaling more than 1,200 questions. When given the correct reference information, the models answered 94 percent of the questions correctly.
However, when the reference documents were progressively altered to contain false values, the LLM was more likely to repeat the false information the weaker its own pre-trained knowledge of the subject was.
When the pre-trained knowledge was stronger, the model was better able to resist the false reference information.
A similar pattern emerged when the altered information deviated more strongly from what the model considered plausible: the more unrealistic the deviation, the more the LLM relied on its own pre-trained knowledge.
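The kind of perturbation described above can be sketched as follows: a numeric fact in the reference is scaled by increasingly implausible factors, and the model's answer is then checked against the injected value. The scaling factors, example fact, and the commented-out model call are assumptions for illustration, not the study's exact protocol.

```python
# Illustrative sketch: inject increasingly implausible false values into a
# reference document and test whether the model repeats them.

REFERENCE = "The maximum recommended daily dose of acetaminophen for adults is 4000 mg."
TRUE_VALUE = 4000

def perturb_reference(reference: str, true_value: int, factor: float) -> tuple[str, int]:
    """Replace the true value in the reference with a scaled (false) one."""
    false_value = int(true_value * factor)
    return reference.replace(str(true_value), str(false_value)), false_value

for factor in (1.5, 10, 100):  # mild to wildly implausible deviations
    perturbed, false_value = perturb_reference(REFERENCE, TRUE_VALUE, factor)
    # answer = ask_llm(build_rag_prompt(question, perturbed))  # hypothetical model call
    # repeated_false_value = str(false_value) in answer
    print(f"factor {factor}: reference now claims a maximum dose of {false_value} mg")
```

According to the study's pattern, the model is most likely to echo the mildly wrong value and most likely to fall back on its pre-trained knowledge when the injected value becomes absurd.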
How strongly the prompt instructed the model to adhere to the reference information also had an effect: the stricter the instruction, the more likely the model was to follow the reference.
In contrast, the probability decreased when the prompt was less strict and the model had more leeway to weigh its prior knowledge against the reference information.
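The difference between a strict and a loose adherence instruction can be illustrated with two prompt templates. The wording below is an assumption chosen to show the contrast; it does not reproduce the prompts used in the study.

```python
# Illustrative "strict" versus "loose" adherence prompts.

STRICT_TEMPLATE = (
    "Answer using ONLY the reference below. Do not rely on any other knowledge, "
    "even if the reference seems wrong.\n\nReference:\n{reference}\n\nQuestion: {question}"
)

LOOSE_TEMPLATE = (
    "The reference below may help, but you may also rely on your own knowledge "
    "if the reference seems unreliable.\n\nReference:\n{reference}\n\nQuestion: {question}"
)

def make_prompt(template: str, reference: str, question: str) -> str:
    """Fill a template with the retrieved reference and the user's question."""
    return template.format(reference=reference, question=question)

print(make_prompt(
    STRICT_TEMPLATE,
    "The maximum recommended daily dose of acetaminophen for adults is 4000 mg.",
    "What is the maximum daily dose of acetaminophen for adults?",
))
```

Per the study, the strict template makes the model more likely to follow the reference even when it contains false values, while the loose template leaves more room for the model's prior knowledge to override it.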
RAG with high-quality reference data can significantly improve the accuracy of LLMs
The study results show that while RAG systems can significantly improve the factual accuracy of language models, they are not a panacea against misinformation.
Without context (i.e., without RAG), the tested language models answered on average only 34.7 percent of the questions correctly. With RAG, the accuracy rate increased to 94 percent.
However, the reliability of the reference information is crucial. In addition, strong prior knowledge helps the model recognize and ignore implausible information.
For the commercial use of RAG systems in areas such as finance, medicine, and law, the researchers see a need for greater transparency. Users need to be made more aware of how the models deal with potentially conflicting or incorrect information, and that RAG systems, like LLMs, can be wrong.
For example, if RAG systems are used to extract nested financial data to be used in an algorithm, what will happen if there is a typo in the financial documents? Will the model notice the error and if so, what data will it provide in its place? Given that LLMs are soon to be widely deployed in many domains including medicine and law, users and developers alike should be cognizant of their unintended effects, especially if users have preconceptions that RAG-enabled systems are, by nature, always truthful.
From the paper