Microsoft Research has created a new AI architecture called the "Differential Transformer" (DIFF Transformer) designed to enhance focus on relevant context while reducing interference. According to the researchers, this approach shows improvements in various areas of language processing.


The core of the DIFF Transformer is "differential attention": the model calculates two separate softmax attention maps and subtracts one from the other. According to the researchers, this subtraction cancels out noise that is common to both maps - similar to how noise-canceling headphones work.

"Transformer tends to overallocate attention to irrelevant context" says the research team. This leads to problems in accurately retrieving key information. The DIFF Transformer aims to solve this issue through its novel attention mechanism.

Diagram: Comparison of attention distribution in a standard Transformer vs. the Differential Transformer, showing improved signal processing and performance.
By amplifying relevant signals and suppressing noise, the Differential Transformer achieves more precise context modeling and higher accuracy in multi-needle retrieval tasks. Image: Microsoft

DIFF Transformer delivers more performance from less data

In tests, the DIFF Transformer achieved performance comparable to conventional transformers while using only about 65 percent of the model size or training data. At 3 billion parameters and one trillion training tokens, the DIFF Transformer also outperformed a counterpart built on the established transformer architecture, according to the study.


Advantages were particularly evident when processing longer contexts of up to 64,000 tokens. In tests that extract key information from long texts ("needle in a haystack"), the DIFF Transformer performed significantly better than conventional models. When the key information was placed in the first half of a 64,000-token context, the new model achieved up to 76 percent higher accuracy, according to the researchers.

Fewer hallucinations, more robust learning, better quantization

Another advantage of the DIFF Transformer is the reduction of hallucinations - a common problem with large language models. When summarizing texts from datasets like XSum, CNN/DM, and MultiNews, the DIFF Transformer showed 9 to 19 percentage points higher accuracy than a comparable standard transformer. Similar improvements were observed in question-answering tasks.

The new architecture also proved more robust to changes in the order of examples during in-context learning - a known weakness of conventional models.

The researchers also report benefits in quantizing AI models. Quantization reduces the continuous values of model parameters to a limited number of discrete values to decrease model size and increase inference speed. The DIFF Transformer reduces outlier activations, which pose a challenge for efficient compression. At extreme quantization to 4 bits, the DIFF Transformer achieved about 25 percentage points higher accuracy than a standard transformer.
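The link between outlier activations and quantization error can be illustrated with a generic symmetric 4-bit scheme (an illustration only, not the exact quantization method used in the study): a single large activation inflates the quantization scale, leaving almost no resolution for the typical, much smaller values.

```python
# Illustration of how one outlier activation degrades 4-bit quantization.
import numpy as np

def quantize_4bit(x):
    """Symmetric round-to-nearest quantization to the int4 range [-7, 7]."""
    scale = np.abs(x).max() / 7
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale  # dequantized values

activations = np.array([0.10, -0.20, 0.30, 0.15, -0.05])
with_outlier = np.append(activations, 12.0)  # one large outlier activation

print(quantize_4bit(activations))       # small values keep usable precision
print(quantize_4bit(with_outlier)[:5])  # the same values now all collapse to zero
```

Because the DIFF Transformer produces fewer such outliers, the limited number of 4-bit levels is spent on the values that actually carry information, which fits the roughly 25-percentage-point accuracy gap the researchers report.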

These advantages come at a modest cost: according to the study, the DIFF Transformer's throughput is only about 5 to 12 percent lower than that of a comparable conventional transformer. The researchers therefore see the new architecture as a promising foundation for future large language models.

Summary
  • Microsoft Research has developed a new AI architecture called "Differential Transformer" (DIFF Transformer). This is designed to increase attention to relevant contexts and reduce interference. According to the researchers, the approach shows improvements in various areas of language processing.
  • The core of the DIFF Transformer is "differential attention". This involves calculating two separate softmax attention maps and subtracting one from the other. This is intended to cancel out noise shared by both maps, similar to noise-canceling headphones.
  • In tests, the DIFF Transformer achieved comparable performance to conventional transformers with around 65 percent of the model size or training data. It showed clear advantages in longer contexts, up to 64,000 tokens. It also reduced hallucinations and improved quantization, with only slightly lower throughput.