Researchers combine two language models and a database for more accurate LLMs


Researchers have developed a new approach called "Speculative RAG" that combines two language models to make Retrieval Augmented Generation (RAG) systems more efficient and accurate.

RAG systems augment Large Language Models (LLMs) with external knowledge bases to reduce factual errors and bullshit, sorry, "hallucinations". However, RAG can still be prone to errors, especially with large amounts of data and complex contexts.

So developers are investigating how to improve RAG. One such approach is Speculative RAG. It aims to improve on traditional RAG systems by combining a smaller, specialized language model with a larger, general-purpose model.

A smaller "RAG Drafter" model generates multiple answer suggestions in parallel, based on different subsets of retrieved documents. This model is specifically trained on question-answer-document relationships. A larger "RAG Verifier" model then reviews these suggestions and selects the best answer.

By generating answers from different document subsets in parallel, the specialized model produces high-quality options while processing fewer input tokens. The general model can then efficiently verify these suggestions without having to process lengthy contexts.
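The draft-then-verify flow described above can be sketched roughly as follows. This is an illustrative outline only, not the researchers' implementation: the names `split_into_subsets` and `speculative_rag` are made up for this example, the `drafter` and `verifier` callables are stand-ins for calls to the small and large models, and the round-robin split is an assumed placeholder for however the real system picks its document subsets.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: a drafter maps (question, documents) to a draft answer,
# a verifier maps (question, draft) to a score where higher means more trusted.
Drafter = Callable[[str, Sequence[str]], str]
Verifier = Callable[[str, str], float]


def split_into_subsets(docs: Sequence[str], num_subsets: int) -> List[List[str]]:
    """Deal documents round-robin into at most num_subsets non-empty subsets.

    Assumption: the split strategy here is a simple placeholder.
    """
    if num_subsets < 1:
        raise ValueError("num_subsets must be at least 1")
    subsets = [list(docs[i::num_subsets]) for i in range(num_subsets)]
    return [subset for subset in subsets if subset]


def speculative_rag(
    question: str,
    docs: Sequence[str],
    drafter: Drafter,
    verifier: Verifier,
    num_subsets: int = 3,
) -> Tuple[str, float]:
    """Draft one answer per document subset in parallel, then keep the best-scored draft."""
    subsets = split_into_subsets(docs, num_subsets)
    if not subsets:
        raise ValueError("no documents were retrieved")

    # The small drafter sees only one short subset per call, so calls can run side by side.
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        drafts = list(pool.map(lambda subset: drafter(question, subset), subsets))

    # The large verifier scores each draft without reading the full retrieved context.
    scored = [(draft, verifier(question, draft)) for draft in drafts]
    return max(scored, key=lambda pair: pair[1])
```

In this sketch the verifier never receives the retrieved documents, only the question and each short draft, which mirrors the article's point that the larger model avoids processing lengthy contexts.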

In tests on several benchmark datasets, the Speculative RAG framework achieved up to 12.97 percent higher accuracy with 51 percent lower latency compared to conventional RAG systems.

The researchers, from the University of California and Google, believe that dividing the work between a specialized model and a general model is a promising way to make RAG systems more efficient. "We demonstrate that a smaller, specialized RAG drafter can effectively augment a larger, general-purpose LM for knowledge-intensive tasks," they write.

