A research team from Singapore and China has introduced LongWriter-Zero, an AI model that uses reinforcement learning to write texts longer than 10,000 words—without relying on synthetic training data.
Existing language models often struggle to generate very long texts: as the output grows, coherence drops, repetitions increase, and structural problems become more common. Most current approaches tackle these issues with supervised fine-tuning (SFT) on artificially generated long-form texts. But creating these datasets is labor-intensive, and the results often fall short in both style and substance.
LongWriter-Zero, developed by researchers from the Singapore University of Technology and Design and Tsinghua University, takes a different approach. Instead of using pre-made training examples, the model relies solely on reinforcement learning (RL) to produce coherent long-form texts. The team builds on its earlier LongWriter research.
"Think Prompts" and reinforcement learning
At the heart of LongWriter-Zero’s RL training are three specialized reward models that evaluate text length, writing quality, and structure. The researchers also introduced a technical innovation called advantage averaging, which balances rewards across different quality dimensions. The base model for LongWriter-Zero is Qwen2.5-32B.
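The paper's exact formula is not reproduced in this article, but the idea behind advantage averaging can be illustrated with a short sketch: each reward dimension is normalized separately across a group of sampled responses, and the resulting per-dimension advantages are averaged so that no single reward model dominates the policy update. The function name, normalization details, and example scores below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def advantage_averaging(rewards_per_dim: np.ndarray) -> np.ndarray:
    """Illustrative sketch: normalize each reward dimension across a group of
    sampled responses, then average the per-dimension advantages so that no
    single dimension (length, writing quality, structure) dominates the update.

    rewards_per_dim: shape (num_samples, num_dims), raw scores from the
    length, writing-quality, and structure reward models.
    """
    # Group-normalize each dimension separately (GRPO-style advantages).
    mean = rewards_per_dim.mean(axis=0, keepdims=True)
    std = rewards_per_dim.std(axis=0, keepdims=True) + 1e-8
    per_dim_advantages = (rewards_per_dim - mean) / std
    # Average across dimensions so each quality axis contributes equally.
    return per_dim_advantages.mean(axis=1)

# Example: four sampled responses, each scored by the three reward models.
scores = np.array([
    [0.9, 0.4, 0.7],   # long, mediocre prose, decent structure
    [0.5, 0.8, 0.6],
    [0.2, 0.9, 0.9],
    [0.7, 0.5, 0.3],
])
print(advantage_averaging(scores))  # one scalar advantage per response
```

Averaging normalized advantages rather than raw scores keeps a reward model with a wide numeric range (such as length) from drowning out the others.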
A unique aspect of LongWriter-Zero is its use of "think prompts." Before generating an answer, the model is prompted to plan the structure and content of its response. According to the team, this step leads to much better text coherence.
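As an illustration only (the paper's actual prompt wording is not reproduced here), a think prompt might look like the following template, where the model is asked to outline sections and word budgets before it starts writing:

```python
# Hypothetical "think prompt" template; the wording and tags are assumptions
# for illustration, not the authors' exact prompt.
THINK_PROMPT = """You will write a {target_words}-word article on: {topic}

<think>
Before writing, outline the piece: list the main sections, what each
section should cover, and roughly how many words to spend on each.
</think>

Now write the full article, following your outline.
"""

print(THINK_PROMPT.format(target_words=10_000, topic="the history of reinforcement learning"))
```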
Benchmarks like Arena-Write show a significant jump in the model’s performance with this strategy, from 700 to 1200 Elo points. Adding a pretraining phase with 30 billion tokens of high-quality text further boosts results. This head start allows the model to make better use of RL rewards, suggesting that stronger base models benefit more from RL fine-tuning.
LongWriter-Zero and "reward hacking"
In evaluations, LongWriter-Zero outperformed established models like DeepSeek-R1 and Claude 4 Sonnet in both automated tests and human reviews.
However, the researchers highlight a common RL problem: reward model hacking. They observed two main issues. First, the model tends to repeat or slightly rephrase content to hit the required word count and maximize the length reward. Even with explicit penalties for obvious duplicates, subtler forms of redundancy, such as paraphrased or lightly edited sentences, often go undetected.
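To see why surface-level checks miss this, consider a toy duplicate penalty based on exact n-gram overlap (purely illustrative, not the authors' implementation): verbatim repetition is flagged, but a lightly reworded restatement of the same content scores as if it were new text.

```python
from collections import Counter

def repetition_penalty(text: str, n: int = 8) -> float:
    """Toy duplicate check: fraction of n-grams that occur more than once.
    Paraphrased sentences share few exact n-grams, so a check like this
    misses the subtler redundancy described above."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# An exact copy is caught...
print(repetition_penalty("the model repeats this sentence verbatim " * 4))
# ...but a light paraphrase of the same idea yields almost no repeated n-grams.
print(repetition_penalty(
    "the model restates this idea verbatim and then the model "
    "rephrases the very same point with slightly different wording"
))
```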
Second, the writing reward model shows a bias toward certain keywords that were strongly rewarded during training. The model learns to overuse these words, even in inappropriate contexts, to maximize its rewards.
These issues could make LongWriter-Zero unsuitable for producing truly high-quality text in real-world applications.
The authors see this as a fundamental weakness of current RL-based language model training: models often exploit superficial statistical patterns instead of genuinely aligning with the real intentions of human users.