AI research
Maximilian Schreiner

Microsoft's "DIFF Transformer" promises more efficient LLMs with fewer hallucinations

Image: Microsoft (screenshot via YouTube)
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.

Microsoft Research has created a new AI architecture called the "Differential Transformer" (DIFF Transformer) designed to enhance focus on relevant context while reducing interference. According to the researchers, this approach shows improvements in various areas of language processing.

The core of the DIFF Transformer is "differential attention." This involves calculating two separate softmax attention maps and then subtracting them from each other. The researchers explain that this subtraction eliminates common noise in both attention maps - similar to how noise-canceling headphones work.
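The subtraction described above can be illustrated with a minimal NumPy sketch for a single attention head. It is not the researchers' implementation. The function name `differential_attention`, the toy shapes, and the scaling factor `lam` applied to the second map are assumptions made for illustration, and masking and multi-head details are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Toy single-head differential attention (illustrative only).

    q1, k1 and q2, k2 are two separate query/key projections of the
    same input; `lam` is an assumed scalar weight on the second map.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first softmax attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second softmax attention map
    diff_map = a1 - lam * a2              # shared "noise" cancels out
    return diff_map @ v, diff_map

# Random toy inputs: 6 tokens, 8 dimensions per head (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 6, 8
q1, k1, q2, k2 = (rng.normal(size=(n, d)) for _ in range(4))
v = rng.normal(size=(n, d))

out, diff_map = differential_attention(q1, k1, q2, k2, v, lam=0.5)
```

Each softmax map has rows that sum to 1, so the rows of the difference map sum to `1 - lam`. If both maps were identical and `lam` were 1, the difference would be zero everywhere, which is the extreme case of attention shared by both maps being cancelled.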

"Transformer tends to overallocate attention to irrelevant context," says the research team. This makes it harder for the model to accurately retrieve key information. The DIFF Transformer aims to solve this issue through its novel attention mechanism.

Diagram: Comparison of attention distribution in the Transformer vs. the Differential Transformer, showing improved signal processing and performance.
By amplifying relevant signals and suppressing noise, the Differential Transformer achieves more precise context modeling and higher accuracy in multi-needle retrieval tasks. Image: Microsoft

DIFF Transformer delivers comparable performance with a smaller model or less data

In tests, the DIFF Transformer achieved comparable performance to conventional transformers using about 65 percent of the model size or training data. For a 3-billion-parameter model trained on one trillion tokens, the DIFF Transformer outperformed variants with established transformer architecture, according to the study.

Advantages were particularly evident in processing longer contexts of up to 64,000 tokens. In tests that require extracting key information from long texts ("needle in a haystack"), the DIFF Transformer performed significantly better than conventional models. When the important information was placed in the first half of a 64,000-token context, the new model achieved up to 76 percent higher accuracy, according to the researchers.

Fewer hallucinations, more robust learning, better quantization

Another advantage of the DIFF Transformer is the reduction of hallucinations - a common problem with large language models. When summarizing texts from datasets like XSum, CNN/DM, and MultiNews, the DIFF Transformer showed 9 to 19 percentage points higher accuracy than a comparable standard transformer. Similar improvements were observed in question-answering tasks.

The new architecture also proved more robust to changes in the order of examples in in-context learning - a known issue with conventional models.

The researchers also report benefits in quantizing AI models. Quantization reduces the continuous values of model parameters to a limited number of discrete values to decrease model size and increase inference speed. The DIFF Transformer reduces outlier activations, which pose a challenge for efficient compression. At extreme quantization to 4 bits, the DIFF Transformer achieved about 25 percentage points higher accuracy than a standard transformer.
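To illustrate why outlier activations make low-bit quantization harder, here is a minimal sketch using symmetric absmax quantization to 4 bits. The article does not say which quantization scheme the study used, so the scheme, the helper name `quantize_absmax`, and the synthetic activations are all assumptions.

```python
import numpy as np

def quantize_absmax(x, bits=4):
    """Symmetric absmax quantization (assumed scheme, for illustration).

    Maps x onto integer levels in [-qmax, qmax] and back to floats.
    Assumes x contains at least one non-zero value.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 levels per sign for 4 bits
    scale = np.abs(x).max() / qmax      # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                    # dequantized values

# Synthetic "activations": standard normal values, then the same values
# with a single large outlier injected (hypothetical data).
rng = np.random.default_rng(0)
acts = rng.normal(size=1000)
acts_outlier = acts.copy()
acts_outlier[0] = 100.0

# Mean squared error on the ordinary values, with and without the outlier.
err_clean = np.mean((acts - quantize_absmax(acts)) ** 2)
err_outlier = np.mean(
    (acts_outlier[1:] - quantize_absmax(acts_outlier)[1:]) ** 2
)
```

Because the single outlier stretches the quantization range, the step size grows and the ordinary values lose most of their resolution, so `err_outlier` ends up far larger than `err_clean`. Fewer outliers leave more of the limited 4-bit range for typical values.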

These gains come at a modest cost: according to the study, the throughput of the DIFF Transformer is about 5 to 12 percent lower than that of a comparable conventional transformer. The researchers see the new architecture as a promising foundation for future large language models.

Summary
  • Microsoft Research has developed a new AI architecture called "Differential Transformer" (DIFF Transformer). This is designed to increase attention to relevant contexts and reduce interference. According to the researchers, the approach shows improvements in various areas of language processing.
  • The core of the DIFF Transformer is "differential attention". This involves calculating two separate softmax attention maps and subtracting them from each other. This is intended to eliminate common noise, similar to noise-canceling headphones.
  • In tests, the DIFF Transformer achieved comparable performance to conventional transformers with around 65 percent of the model size or training data. It showed clear advantages in longer contexts, up to 64,000 tokens. It also reduced hallucinations and improved quantization, with only slightly lower throughput.
Sources
Arxiv