ChatGPT can read 4,096 tokens at once, LongNet a billion. That could enable Transformer models that process entire swaths of the web simultaneously.
The sequence length of transformer models plays an important role in training and especially in deployment: a larger sequence length means a larger context window in which, for example, a language model can process and generate more text, or a vision transformer can capture more information from an image.
A major obstacle to scaling sequence length is that, in the standard Transformer architecture, the required compute grows quadratically with the sequence length, so costs quickly explode.
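To see why, here is a back-of-the-envelope sketch (not a calculation from the paper): standard self-attention compares every token with every other token, so the attention score matrix alone grows with the square of the sequence length.

    # Rough memory estimate for a single float32 attention score matrix,
    # ignoring batch size, attention heads, and all other activations.
    def attention_matrix_gib(seq_len: int, bytes_per_value: int = 4) -> float:
        return seq_len * seq_len * bytes_per_value / 1024**3

    for n in (4_096, 32_000, 100_000, 1_000_000_000):
        print(f"{n:>13,} tokens -> {attention_matrix_gib(n):,.2f} GiB")
    # 4,096 tokens needs well under 1 GiB, but a billion tokens would need
    # roughly 3.7 billion GiB for the score matrix alone -- hence the push
    # for attention mechanisms that scale better than quadratically.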
LongNet processes 250,000 times more tokens than ChatGPT
However, larger sequence lengths can be achieved through various optimizations: OpenAI's ChatGPT has a context window of 4,096 tokens, roughly 3,000 words, while GPT-4 is available in a variant with about 8,000 tokens, and the largest GPT-4 model offers about 32,000. With Claude, Anthropic offers a commercially available model with a context window of about 100,000 tokens.
With LongNet, Microsoft is now demonstrating a method whose compute scales linearly with sequence length and which, according to the team, can be stretched to a billion tokens, a context roughly 250,000 times longer than ChatGPT's. That is about 750,000,000 words or 2,000,000 pages.
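The headline figures are easy to sanity-check with common rules of thumb (roughly 0.75 words per token and a few hundred words per page; the exact numbers depend on the tokenizer and page layout and are not taken from the paper):

    # Back-of-the-envelope check of the headline figures.
    longnet_tokens = 1_000_000_000
    chatgpt_tokens = 4_096

    print(longnet_tokens / chatgpt_tokens)    # ~244,000, i.e. roughly 250,000x
    print(longnet_tokens * 0.75)              # ~750,000,000 words at ~0.75 words/token
    print(longnet_tokens * 0.75 / 375)        # ~2,000,000 pages at ~375 words/page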
The team achieves this leap with an adapted attention mechanism it calls "dilated attention". Here, attention allocation decreases exponentially as the distance between tokens grows: the network examines relationships between nearby tokens as closely as a standard attention mechanism would, but applies increasingly coarse attention patterns to tokens that are farther apart.
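A minimal sketch of what such a pattern can look like, based on the paper's description rather than Microsoft's implementation (the function name and the concrete segment lengths and dilation rates are chosen purely for illustration): the sequence is cut into segments, within each segment only every r-th token participates, and both the segment length and the dilation rate r grow geometrically, so distant tokens are covered ever more sparsely.

    import numpy as np

    def dilated_attention_mask(seq_len, segment_lengths, dilation_rates):
        # mask[i, j] is True if token i may attend to token j under at least
        # one (segment length w, dilation rate r) pattern: the sequence is
        # cut into segments of length w, and within a segment only every
        # r-th token takes part in that pattern.
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for w, r in zip(segment_lengths, dilation_rates):
            for start in range(0, seq_len, w):
                idx = np.arange(start, min(start + w, seq_len), r)
                mask[np.ix_(idx, idx)] = True
        return mask

    # Segment length and dilation rate grow geometrically: nearby tokens get
    # dense attention, distant tokens only a coarse, sparse pattern.
    mask = dilated_attention_mask(64, segment_lengths=[8, 16, 32, 64],
                                  dilation_rates=[1, 2, 4, 8])
    print(mask.sum(), "allowed token pairs out of", 64 * 64)

Because each sparsified segment is internally dense, the paper notes that it can be processed with ordinary, highly optimized attention kernels and the partial results then combined, which keeps the overall cost roughly linear in the sequence length.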
LongNet to enable processing of web-sized datasets
In a test, the team uses LongNet to train a language model with sequence lengths of up to 32,000 tokens and compares it to classical transformer-based approaches. According to the team, LongNet follows the known scaling laws of classical transformer models; for example, the model's perplexity decreases as it gets larger.
In the future, LongNet could enable the processing of web-sized datasets, the team said. The large context window also provides a large memory and receptive field for models, which is relevant to their interaction with people or the world. A larger context window also contains more complex causality and reasoning paths that models could exploit in training data, which could lead to better generalizing models. LongNet also makes it possible to explore the limits of in-context learning, the team said, "which has the potential to be a paradigm shift for many-shot learning, as an extremely long context may help the models alleviate catastrophic forgetting."
LongNet is just a proof of concept for now
Whether LongNet can actually deliver on these promises is unclear: the paper lacks comparisons with modern language models such as GPT-4 32K, as well as more meaningful metrics such as accuracy or human evaluations. For now, LongNet is a feasibility study; whether such gigantic sequence lengths bring real advantages will have to be shown in follow-up work.
In the future, the team plans to use LongNet for other applications, such as multimodal large language models or genomic data modeling.