Researchers have developed a method to extend the output length of AI language models to over 10,000 words. Until now, a limit of 2,000 words was common.

Today's language models can process inputs of hundreds of thousands or even millions of tokens, but without external intervention they rarely generate outputs longer than about 2,000 words.

According to a new study, this is primarily due to the training data. Through controlled experiments, the researchers found that a model's effective output length is limited by the longest output it has seen during supervised fine-tuning (SFT).

In other words, the output limitation is due to the scarcity of examples with long outputs in existing SFT datasets. To solve this problem, the scientists introduce "AgentWrite" - an agent-based pipeline that breaks down long generation tasks into subtasks. This allows existing LLMs to generate coherent outputs of over 20,000 words.
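The paper describes AgentWrite as a plan-then-write pipeline: the model first drafts an outline, then writes each section in turn. A minimal sketch of that idea might look like the following; `call_llm` is a placeholder for any chat-completion client, and the prompts and section count are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical plan-then-write pipeline in the spirit of AgentWrite.
# `call_llm` is whatever function sends a prompt to your LLM and
# returns its text response; it is passed in as a parameter.

def agent_write(instruction: str, call_llm, n_sections: int = 10) -> str:
    # Step 1 (plan): ask the model to break the writing task into
    # section outlines, one per line.
    plan = call_llm(
        f"Break the following writing task into {n_sections} section "
        f"outlines, one per line:\n{instruction}"
    )
    sections = [line for line in plan.splitlines() if line.strip()]

    # Step 2 (write): generate each section sequentially, conditioning
    # on the text produced so far so the long output stays coherent.
    article = ""
    for outline in sections:
        article += call_llm(
            f"Task: {instruction}\n"
            f"Text so far:\n{article}\n"
            f"Continue by writing the section: {outline}\n"
        ) + "\n\n"
    return article
```

Because each subtask stays well within the model's comfortable output range, the concatenated result can run far past the usual 2,000-word ceiling.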

Image: Bai, Zhang et al.

LongWriter routinely generates 40 pages of text

Using AgentWrite, the researchers created the "LongWriter-6k" dataset. It contains 6,000 SFT examples with output lengths between 2,000 and 32,000 words. By training with this dataset, they were able to scale the output length of existing models to over 10,000 words without compromising output quality.

Video: Bai, Zhang et al.

To evaluate ultra-long generation capabilities, they also developed "LongBench-Write" - a comprehensive benchmark with various writing instructions and output lengths ranging from 0 to over 4,000 words.
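A benchmark like this needs to grade not only writing quality but also how well the output hits the requested length. One simple way to score length adherence, shown below purely as an illustration and not necessarily the exact metric used in LongBench-Write, is full marks for an exact match with a linear penalty for deviation.

```python
# Illustrative length-adherence score: 100 when the output matches the
# required word count exactly, decaying linearly to 0 as the relative
# deviation grows. A simplified stand-in, not the paper's exact formula.

def length_score(required_words: int, actual_words: int) -> float:
    if required_words <= 0:
        return 100.0
    deviation = abs(actual_words - required_words) / required_words
    return max(0.0, 100.0 * (1.0 - deviation))
```

Under such a metric, a model asked for 10,000 words that stops at 2,000 scores poorly no matter how polished the text is, which is exactly the failure mode the benchmark is designed to expose.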

The researchers' 9-billion-parameter model, further enhanced by Direct Preference Optimization (DPO), achieved top performance in this benchmark. It even surpassed much larger proprietary models.
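DPO is a published technique (Rafailov et al., 2023) that trains directly on preference pairs: the policy is nudged to raise the likelihood of the preferred response relative to a frozen reference model, compared with the rejected response. The per-example loss can be sketched as follows; the log-probabilities here are assumed inputs that would come from scoring both responses under the policy and the reference model.

```python
import math

# Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
# the policy's log-likelihood advantage over the reference model on the
# chosen response minus the same advantage on the rejected response.

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy has no advantage either way (margin 0), the loss is log 2; it shrinks as the policy increasingly favors the preferred long-form outputs.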

The code and model for LongWriter are available on GitHub.

Summary
  • Researchers have developed a method called "AgentWrite" that can extend the output length of AI language models from the usual 2,000 words to over 10,000 words.
  • According to a study, the limitation of the output length is due to the training data. The effective output length of a model is limited by the longest output it has seen during supervised fine-tuning.
  • Using AgentWrite, the researchers created the "LongWriter-6k" dataset with 6,000 training examples and output lengths of up to 32,000 words. A 9-billion-parameter model trained with it achieved top performance on the newly developed LongBench-Write benchmark.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.