
Microsoft Research has developed a more efficient way to incorporate external knowledge into language models. The new system, called Knowledge Base-Augmented Language Models (KBLaM), takes a plug-and-play approach that doesn't require modifying existing models.


Unlike current approaches such as RAG or in-context learning, KBLaM doesn't rely on a separate retrieval system. Instead, it turns each piece of knowledge into a key-value vector pair and weaves those pairs directly into the model's attention layers using what Microsoft calls "rectangular attention."
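
To make that concrete, here is a minimal sketch of the encoding step, assuming a generic pretrained sentence encoder and learned linear adapters. The `encode` function and the adapter matrices below are illustrative stand-ins, not KBLaM's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_model = 384, 64  # illustrative encoder and model dimensions

# Stand-in for a pretrained sentence encoder (hypothetical; a real
# system would use an actual embedding model here).
def encode(text: str) -> np.ndarray:
    local = np.random.default_rng(abs(hash(text)) % 2**32)
    return local.normal(size=d_enc)

# Learned linear adapters mapping encoder space into the LM's
# key/value spaces (randomly initialized for this sketch).
W_k = rng.normal(size=(d_enc, d_model))
W_v = rng.normal(size=(d_enc, d_model))

def triple_to_kv(name: str, prop: str, value: str):
    """Turn one knowledge triple into a (key, value) vector pair:
    the key summarizes what the fact is about, the value its content."""
    key = encode(f"{name} {prop}") @ W_k
    val = encode(value) @ W_v
    return key, val

k, v = triple_to_kv("KBLaM", "developed by", "Microsoft Research")
print(k.shape, v.shape)  # (64,) (64,)
```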

[Figure: KBLaM architecture. The question and the knowledge base are tokenized, combined through rectangular attention, and passed to the language model to generate an answer. KBLaM processes knowledge directly within the model instead of using external retrieval, leading to faster, more efficient responses than traditional systems. | Image: Microsoft Research]

Current RAG systems face a quadratic scaling problem because of self-attention: every token in the context must interact with every other token. Insert 1,000 tokens from the knowledge base into the context and the model has to process one million token pairs; with 10,000 tokens, that jumps to 100 million interactions.
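
The arithmetic behind that claim is simple to check:

```python
# Self-attention over a context of n tokens costs on the order of n**2
# pairwise interactions, so cost grows quadratically with context size.
for n in (1_000, 10_000):
    print(f"{n:>6} knowledge tokens -> {n**2:>12,} token-pair interactions")
# 1,000 tokens  ->   1,000,000 pairs
# 10,000 tokens -> 100,000,000 pairs
```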

[Figure: Time to first token and memory usage for KBLaM vs. RAG as the number of knowledge-base triples grows. Microsoft's data shows KBLaM can process 4,096 knowledge triples faster than RAG handles just 5. | Image: Microsoft Research]

KBLaM sidesteps this issue: the user's input can attend to all knowledge tokens, but the knowledge tokens don't interact with each other or with the input. As a result, compute grows only linearly as the knowledge base expands. According to the researchers, a single GPU can handle more than 10,000 knowledge triples (about 200,000 tokens).
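
Here is a minimal numpy sketch of that attention pattern, under the assumption that knowledge entries arrive as precomputed key-value pairs; shapes are toy-sized and the causal mask among user tokens is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rectangular_attention(user_q, user_k, user_v, kb_k, kb_v):
    """User tokens attend over the knowledge key-value pairs plus the
    user tokens themselves; knowledge tokens issue no queries, so the
    score matrix is rectangular: (n_user, n_kb + n_user), not square."""
    keys = np.concatenate([kb_k, user_k], axis=0)
    values = np.concatenate([kb_v, user_v], axis=0)
    d = user_q.shape[-1]
    scores = user_q @ keys.T / np.sqrt(d)  # (n_user, n_kb + n_user)
    return softmax(scores) @ values        # (n_user, d)

# Toy sizes: cost grows linearly with n_kb, not quadratically.
n_user, n_kb, d = 8, 10_000, 64
rng = np.random.default_rng(0)
out = rectangular_attention(rng.normal(size=(n_user, d)),
                            rng.normal(size=(n_user, d)),
                            rng.normal(size=(n_user, d)),
                            rng.normal(size=(n_kb, d)),
                            rng.normal(size=(n_kb, d)))
print(out.shape)  # (8, 64)
```

Because only the handful of user tokens form queries, doubling the knowledge base doubles the score matrix rather than quadrupling it, which is where the linear scaling comes from.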


Opening up to developers

Tests show promising results. Working with about 200 knowledge items, KBLaM was better than traditional models at avoiding hallucinations and at declining to answer questions it has no information for. It's also more transparent than in-context learning because it can trace knowledge back to specific tokens.

The code and datasets for KBLaM are now available on GitHub. The system works with several popular models, including Meta's Llama 3 and Microsoft's Phi-3, with plans to add support for Hugging Face Transformers. The researchers emphasize that KBLaM isn't ready for widespread use yet. While it handles straightforward question-answer scenarios well, it still needs work on more complex reasoning tasks.

LLMs struggle with an interesting paradox: their context windows keep getting bigger, letting them handle more information at once, but processing all that data reliably remains a challenge. As a result, RAG has become the go-to solution for feeding specific information into models with relative reliability, but KBLaM suggests that there may be a more efficient way forward.

Summary
  • Microsoft Research has developed KBLaM, a new method that directly integrates structured knowledge databases into language models without requiring separate retrieval modules or retraining the model.
  • KBLaM's computational effort grows linearly with the amount of data, in contrast to conventional methods like RAG, which scale quadratically. The system is particularly effective at avoiding hallucinations.
  • The code and datasets have been made open source and support models such as Llama 3 and Phi-3. However, Microsoft states that further research is needed before the method can be used at scale.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.