Microsoft Research has developed a more efficient way to incorporate external knowledge into language models. The new system, called Knowledge Base-Augmented Language Model (KBLaM), takes a plug-and-play approach that doesn't require modifying existing models.
Unlike current approaches such as RAG or in-context learning, KBLaM doesn't use a separate retrieval system. Instead, it converts knowledge into continuous key-value vector pairs and weaves them directly into the model's attention layers using what Microsoft calls "rectangular attention."
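To make that concrete, here is a minimal sketch of the encoding step: one knowledge triple becomes one key-value vector pair. The adapter names and sizes are assumptions for illustration, not KBLaM's actual API, and the random embeddings stand in for a frozen pretrained sentence encoder.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: one knowledge triple -> one (key, value) vector pair.
# Names and dimensions are illustrative; a real system would embed the text
# with a frozen pretrained sentence encoder instead of torch.randn.

d_sent = 128   # sentence-encoder output size (assumed)
d_model = 64   # attention dimension of the language model (assumed)

key_adapter = nn.Linear(d_sent, d_model)    # learned projection to key space
value_adapter = nn.Linear(d_sent, d_model)  # learned projection to value space

def encode_triple(subject: str, relation: str, obj: str):
    # Key text would be "Paris capital_of"; value text would be "France".
    key_emb = torch.randn(d_sent)    # placeholder for encoder(subject + relation)
    value_emb = torch.randn(d_sent)  # placeholder for encoder(obj)
    return key_adapter(key_emb), value_adapter(value_emb)

k, v = encode_triple("Paris", "capital_of", "France")
print(k.shape, v.shape)  # torch.Size([64]) torch.Size([64])
```

Because each triple is compressed into a single pair of vectors, the knowledge base occupies far less of the attention computation than the same facts spelled out as raw text in the context window.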

Approaches that place retrieved knowledge directly in the context window run into a quadratic scaling problem, because in standard self-attention every token must interact with every other token. When 1,000 tokens from the knowledge base are inserted into the context, the model must process one million token pairs. With 10,000 tokens, that jumps to 100 million interactions.
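A quick back-of-the-envelope check of that arithmetic:

```python
# Token pairs processed by full self-attention over n context tokens: n * n.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} knowledge tokens -> {n * n:>18,} token pairs")

#   1,000 knowledge tokens ->          1,000,000 token pairs
#  10,000 knowledge tokens ->        100,000,000 token pairs
# 100,000 knowledge tokens ->     10,000,000,000 token pairs
```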

KBLaM sidesteps this issue: the user's input can attend to all knowledge tokens, but those knowledge tokens don't attend to each other or back to the input. As a result, the required compute grows only linearly with the size of the knowledge base. According to the researchers, a single GPU can handle more than 10,000 knowledge triples (about 200,000 tokens).
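The shape of that computation can be sketched in a few lines. The toy below is an illustration of the idea, not Microsoft's implementation: prompt tokens query the concatenated knowledge and prompt keys, while knowledge tokens issue no queries of their own, so the attention matrix is rectangular rather than square.

```python
import torch
import torch.nn.functional as F

# Toy "rectangular attention" (shapes and scaling only, not KBLaM's code).
# Prompt tokens attend over knowledge + prompt; knowledge tokens attend to
# nothing, so cost is n_prompt * (n_kb + n_prompt), linear in n_kb.

d = 64
n_prompt, n_kb = 16, 10_000

q = torch.randn(n_prompt, d)                              # queries: prompt only
kb_k, kb_v = torch.randn(n_kb, d), torch.randn(n_kb, d)   # one pair per triple
pr_k, pr_v = torch.randn(n_prompt, d), torch.randn(n_prompt, d)

k = torch.cat([kb_k, pr_k])   # (n_kb + n_prompt, d)
v = torch.cat([kb_v, pr_v])

# Rectangular attention matrix: n_prompt x (n_kb + n_prompt), instead of
# the square (n_kb + n_prompt)^2 of full self-attention. A real model would
# also apply a causal mask over the prompt-to-prompt portion.
scores = q @ k.T / d ** 0.5
out = F.softmax(scores, dim=-1) @ v
print(scores.shape)  # torch.Size([16, 10016]) -- grows linearly with n_kb
```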
Opening up to developers
Tests show promising results. With a knowledge base of about 200 items, KBLaM hallucinated less than traditional models and was more reliable at refusing to answer questions for which it had no information. It's also more transparent than in-context learning, because its attention weights can link an answer back to specific knowledge tokens.
The code and datasets for KBLaM are now available on GitHub. The system works with several popular models, including Meta's Llama 3 and Microsoft's Phi-3, with plans to add support for Hugging Face Transformers. The researchers emphasize that KBLaM isn't ready for widespread use yet. While it handles straightforward question-answer scenarios well, it still needs work on more complex reasoning tasks.
LLMs face an interesting paradox: their context windows keep getting bigger, letting them take in more information at once, yet processing all that information reliably remains a challenge. RAG has therefore become the go-to solution for feeding specific information into models relatively reliably, but KBLaM suggests there may be a more efficient way forward.