Researchers have released a preview of LongLLaMA, a large language model capable of handling long contexts of 256,000 tokens or more. Built on the open-source OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method, the model allows some of its attention layers to access a memory cache of key-value pairs, extending their effective context length.
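To illustrate the general idea, the sketch below shows single-head attention that attends over cached key-value pairs from earlier context in addition to the local context. It is a conceptual toy in PyTorch under assumed shapes, not the authors' implementation of FoT.

```python
# Conceptual sketch: attention over local context plus an external memory
# cache of key-value pairs. Shapes and the single-head layout are
# illustrative assumptions, not the FoT authors' code.
import torch
import torch.nn.functional as F

def attention_with_memory(q, k_local, v_local, k_mem, v_mem):
    """Single-head attention over the local context plus a memory cache.

    q, k_local, v_local: (seq_len, d)  queries/keys/values for current context
    k_mem, v_mem:        (mem_len, d)  cached keys/values from earlier context
    """
    # Concatenate cached and local keys/values so queries attend to both.
    k = torch.cat([k_mem, k_local], dim=0)
    v = torch.cat([v_mem, v_local], dim=0)
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # (seq_len, mem_len + seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                        # (seq_len, d)

# Toy usage: 8 local tokens plus a cache of 32 earlier tokens, dim 64.
d = 64
out = attention_with_memory(torch.randn(8, d), torch.randn(8, d),
                            torch.randn(8, d), torch.randn(32, d),
                            torch.randn(32, d))
print(out.shape)  # torch.Size([8, 64])
```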
According to the researchers, the model retains performance on tasks that do not require long contexts and can be used as a drop-in replacement for shorter-context LLaMA implementations. The team has released the smaller 3B variant under the Apache 2.0 license, along with inference code on Hugging Face that supports longer contexts. More information and examples can be found in the LongLLaMA GitHub repository.
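As a rough example of that drop-in usage, the snippet below loads a checkpoint through the standard Hugging Face transformers API. The model identifier and settings are assumptions; consult the project's GitHub repository for the exact checkpoint name and its recommended long-context inference code.

```python
# Minimal sketch of loading the 3B checkpoint via Hugging Face transformers.
# The model ID below is an assumption; check the project's repository for
# the published identifier and recommended settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "syzymon/long_llama_3b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    trust_remote_code=True,  # the checkpoint ships custom modeling code
)

prompt = "LongLLaMA is a large language model that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```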