Anthropic has developed a method to improve the accuracy of retrieval from knowledge bases. The approach, called contextual retrieval, aims to produce more accurate answers by attaching additional context to each indexed chunk.
Contextual retrieval addresses a key limitation of existing retrieval augmented generation (RAG) systems. When documents are split into smaller chunks for indexing, important contextual information is often lost.
Anthropic's solution prepends a short snippet to each chunk before indexing that situates the chunk within the full document. These context snippets are typically up to 100 words long.
Here's an example:
original_chunk = "The company's revenue grew by 3% over the previous quarter."
contextualized_chunk = "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."
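The contextualization step can be sketched in a few lines. The helper function and the prompt template below are illustrative assumptions, not Anthropic's exact implementation; in practice the context snippet would be generated by an LLM given the full document and the chunk.

```python
def contextualize_chunk(chunk: str, context: str) -> str:
    """Prepend the situating context to a chunk before it is embedded and indexed."""
    return f"{context} {chunk}"

# Illustrative prompt template for generating the context snippet with an LLM;
# the wording is an assumption, not Anthropic's exact prompt.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a short, succinct context (under 100 words) that situates this "
    "chunk within the overall document, to improve search retrieval."
)
```

The key design point is that the context is chunk-specific: each chunk gets its own snippet, so document-level facts (the company, the quarter, the prior figures) travel with the chunk into the index.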
According to Anthropic, the method can cut retrieval failure rates by up to 49 percent. When the retrieved results are additionally reranked, reductions of up to 67 percent are possible.
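The retrieve-then-rerank pipeline those numbers refer to can be sketched as follows. The two scoring functions here are toy word-overlap stand-ins, not Anthropic's actual components; a real system would use contextual embeddings and/or BM25 for the first stage and a dedicated reranking model for the second.

```python
def retrieve_then_rerank(query: str, chunks: list[str], k: int = 20, n: int = 5) -> list[str]:
    """Two-stage pipeline: a cheap first-stage retrieval picks the top-k
    candidates, then a second-stage reranker keeps the best n of those."""
    query_terms = set(query.lower().split())

    def recall_score(chunk: str) -> int:
        # Stage 1: raw term overlap (stand-in for embeddings/BM25).
        return len(query_terms & set(chunk.lower().split()))

    def rerank_score(chunk: str) -> float:
        # Stage 2: overlap normalized by chunk length (stand-in for a
        # slower but more precise reranking model).
        words = chunk.lower().split()
        return len(query_terms & set(words)) / (len(words) or 1)

    candidates = sorted(chunks, key=recall_score, reverse=True)[:k]
    return sorted(candidates, key=rerank_score, reverse=True)[:n]
```

The split matters because the reranker is typically too slow to score every chunk in the index; it only sees the k candidates the cheap first stage surfaces.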
Anthropic notes that contextual retrieval can be integrated into existing RAG systems with minimal effort. The company has published a detailed implementation guide with code samples on GitHub.
Research backs context-aware approach
Recent work from Cornell University supports the effectiveness of context-aware retrieval. In a paper, researchers proposed a similar technique called "Contextual Document Embeddings" (CDE). They developed two complementary methods for contextualized embeddings:
- Contextual training: This reorganizes training data so each batch contains similar but hard-to-distinguish documents, forcing the model to learn more nuanced differences.
- Contextual architecture: A two-stage encoder integrates information from neighboring documents directly into embeddings, allowing the model to account for relative term frequencies and other contextual cues.
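The first of these ideas, grouping hard-to-distinguish documents into the same training batch, can be sketched as follows. The Jaccard token-overlap similarity is a toy stand-in for the learned similarity the paper uses; the greedy grouping is likewise an illustrative simplification.

```python
def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def contextual_batches(docs: list[str], batch_size: int) -> list[list[str]]:
    """Greedily group documents so each batch contains similar,
    hard-to-distinguish documents, per the contextual-training idea."""
    token_sets = [set(d.lower().split()) for d in docs]
    remaining = list(range(len(docs)))
    batches = []
    while remaining:
        seed = remaining.pop(0)
        batch = [seed]
        # Fill the batch with the documents most similar to the seed.
        remaining.sort(key=lambda i: jaccard(token_sets[seed], token_sets[i]),
                       reverse=True)
        while remaining and len(batch) < batch_size:
            batch.append(remaining.pop(0))
        batches.append([docs[i] for i in batch])
    return batches
```

Packing near-duplicates into one batch makes the contrastive objective harder: the model can no longer separate documents by topic alone and must learn finer-grained distinctions.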
The researchers found both methods yield improvements independently, but work best in combination. They've released their CDE model and a tutorial on Hugging Face.
In tests on the Massive Text Embedding Benchmark (MTEB), the CDE model achieved top scores for its size class. Experiments showed CDE offers particular advantages for smaller, domain-specific datasets in areas like finance or medicine. Improvements were also seen in tasks like classification, clustering and semantic similarity.
However, the researchers note it's unclear how CDE might impact massive knowledge bases with billions of documents. More investigation is also needed into optimal context size and selection.