ByteDance study finds that asking LMMs questions beats making them transcribe text for long-document training

Multimodal AI models are supposed to handle ever-longer documents, but how they're trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.

Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba's open Qwen2.5-VL, that beats much larger competitors.

Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.

Asking questions teaches more than transcribing text

At first glance, the study's central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It's more effective to ask questions whose answers are buried somewhere in those pages.

Three-step diagram of the synthesis pipeline for long-document VQA: Step 1 samples a contiguous segment of 8 to 15 pages from an OCR-parsed document, Step 2 generates a question-answer pair from the segment using a QA generator, Step 3 embeds the pair back into the full document context.
The synthesis pipeline combines OCR parsing, automatic question generation, and re-embedding to extract long-context training examples from real documents. | Image: ByteDance

The researchers tested both approaches head-to-head. In one setup, the model had to perform text recognition either across all pages of a document or for a few selected pages, while the remaining pages stayed in context as distractors.

In the other setup, the researchers used a separate model (Seed 2.0 from ByteDance) to generate question-answer pairs for individual sections of a document. The question then went into training alongside the entire document, forcing the model to locate the relevant passage within a long context.
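The three steps of this setup can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the names `TrainingExample`, `sample_segment`, `build_example`, and `toy_qa_generator` are hypothetical, pages are represented as plain OCR text strings, and the toy generator merely stands in for a real QA model such as the one used in the study. Only the 8-to-15-page segment length comes from the article's pipeline diagram.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrainingExample:
    pages: List[str]              # the full document stays in context
    question: str
    answer: str
    source_span: Tuple[int, int]  # [start, end) page indices the QA pair came from


def sample_segment(num_pages: int, rng: random.Random,
                   min_len: int = 8, max_len: int = 15) -> Tuple[int, int]:
    """Step 1: pick a contiguous page range (8 to 15 pages, per the diagram)."""
    if num_pages <= 0:
        raise ValueError("document has no pages")
    # Documents shorter than the sampled length are used in full (an assumption).
    length = min(rng.randint(min_len, max_len), num_pages)
    start = rng.randint(0, num_pages - length)
    return start, start + length


def build_example(pages: List[str],
                  qa_generator: Callable[[List[str]], Tuple[str, str]],
                  rng: random.Random) -> TrainingExample:
    start, end = sample_segment(len(pages), rng)
    # Step 2: the generator only sees the sampled segment.
    question, answer = qa_generator(pages[start:end])
    # Step 3: the pair is attached to the whole document, so the trained
    # model has to locate the relevant pages by itself.
    return TrainingExample(list(pages), question, answer, (start, end))


def toy_qa_generator(segment: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for a real QA-generating model."""
    return ("What is the first word of the sampled segment?",
            segment[0].split()[0])
```

The key design point is in step 3: the question is generated from a small window but trained against the full page sequence, which is what turns an ordinary QA pair into a long-context retrieval task.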

Table comparing different training data on MMLongBench benchmarks at 64K and 128K context lengths. Question-answer training data achieves average improvements of 5 to 6 points, while OCR training data shows losses of 6.8 to 17.4 points compared to the base model Qwen2.5-VL-7B.
Question-answer training (top rows) sharply improves the model's long-document performance, while pure character recognition training (bottom rows) actually makes it worse. Even with extra fine-tuning, the OCR variants don't catch up. | Image: ByteDance

Pure text recognition as a training task actually worsened performance compared to the starting point. Question-answer training, on the other hand, brought clear gains. The model only learns to navigate long texts when it has to filter out and categorize information with a specific goal.

Diversity beats specialization

Three more findings turned up in the experiments. Feeding the model mainly very long documents at the top end of the context window isn't worth it. A broader mix of shorter and longer examples works more reliably. Long-context ability isn't a skill tied to a specific length but requires flexible searching across different distances.

The real bottleneck also turns out to be finding the relevant passage, not reasoning about it. A mix weighted toward extraction tasks with a smaller share of calculation tasks delivered the best results.
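As a rough sketch of what such a weighted mix looks like in a data loader, the snippet below draws task types with extraction favored over calculation. The 0.8/0.2 split is a placeholder chosen for illustration, not a figure from the study, and `TASK_WEIGHTS` and `sample_task_types` are hypothetical names.

```python
import random
from typing import Dict, List

# Placeholder weights: the article only says extraction gets the larger share.
TASK_WEIGHTS: Dict[str, float] = {"extraction": 0.8, "calculation": 0.2}


def sample_task_types(n: int, rng: random.Random,
                      weights: Dict[str, float] = TASK_WEIGHTS) -> List[str]:
    """Draw n task types in proportion to the given weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)
```

The same pattern would work for mixing shorter and longer documents by swapping the task names for length buckets.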

The third finding is surprising because it contradicts common practice with text-only language models. Adding short training examples doesn't appear strictly necessary. The model largely kept its short-task abilities even when trained only on long question-answer data. The format of the data itself probably helps: even when the context is very long, the task is still framed as a question-answer interaction in the familiar instruction-following format.

Small but stable up to 512,000 tokens

With this recipe and a fairly modest training budget, MMProLong beats several much larger open models like InternVL3-38B and Gemma3-27B. The model was trained at context lengths of only up to 128,000 tokens but stays stable at input lengths of 256,000 and even 512,000 tokens, while the original model degrades sharply at those ranges.

Bar chart comparing the base model Qwen2.5-VL-7B and MMProLong on the MM-NIAH benchmark across Retrieval, Counting, Reasoning, and Average categories. MMProLong wins in all four with gains between 7.0 and 45.7 points.
On the Needle-in-a-Haystack benchmark for long multimodal contexts, MMProLong gains an average of 29.4 points over the Qwen2.5-VL-7B base. | Image: ByteDance

This ability also transfers to tasks the model was never specifically trained on, like understanding long videos. In an extra transfer experiment, the recipe proved effective on the stronger Qwen3-VL-8B too, even though that model is already built for long contexts.

Bar chart comparing Qwen2.5-VL-7B and MMProLong on three long-video benchmarks: Video-MME, MLVU, and Long VideoBench. MMProLong wins all three with gains between 1.6 and 3.3 points.
Even though it was trained only on documents, the gains carry over to long-video benchmarks. | Image: ByteDance

The study is also interesting because it comes from an entirely different camp than DeepSeek's widely discussed work on the same problem. DeepSeek tries to extend the long memory of AI models by processing texts as images and compressing them heavily, most recently with an encoder that re-sorts visual information by content. ByteDance Seed takes the opposite approach: optimize the training data instead of the architecture.
