
LLM text data is drying up, but Meta points to unlabeled video as the next massive training frontier

Nano Banana Pro prompted by THE DECODER

Key Points

  • A single AI model can learn text, images, and video simultaneously from scratch without the different modalities interfering with each other, according to a study by Meta FAIR and New York University.
  • The findings suggest that the conventional approach of using two separate visual encoders for image understanding and image generation is unnecessary, as one unified model can handle both tasks effectively.
  • However, the researchers found that vision and language scale in fundamentally different ways: language capabilities grow in a balanced relationship between model size and data volume, while visual capabilities demand a disproportionately large amount of training data.

A research team from Meta FAIR and New York University systematically investigated how multimodal AI models can be trained from scratch. Their findings challenge several widely held beliefs about how these models should be built.

Language models have defined the foundation model era. But text, the researchers argue in their paper "Beyond Language Modeling," is ultimately a lossy compression of reality. Drawing on Plato's allegory of the cave, they suggest that language models have learned to describe the shadows on the wall without ever seeing the objects casting them. There's also a practical problem: high-quality text data is finite and quickly running out.

Four rows of example training data. Top row shows a text passage, second row shows three image-text pairs with animal subjects, third row shows a video sequence with navigation actions and numerical values, bottom row shows a video sequence with multiple frames of a hand moving an object.
Examples of the four training data types: plain text, image-text pairs, action-based video sequences, and raw video. | Image: Tong et al.

The study, which involved Yann LeCun before he left the company, trains a single model entirely from scratch. It pairs standard word-by-word prediction for language with a diffusion method called flow matching for visual data, training on text, video, image-text pairs, and action-related videos. By not building on top of an existing language model, the researchers avoid contaminating their results with previously learned knowledge.
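The pairing of next-token prediction for text with flow matching for visual latents can be sketched in a few lines. The following is a minimal numpy illustration of the two loss terms, not the paper's implementation; all shapes, names, and the unweighted sum are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_loss(logits, targets):
    """Standard next-token prediction: cross-entropy over the vocabulary."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def flow_matching_loss(model_pred, x0, x1):
    """Flow matching regresses the velocity field that carries noise x0 to
    data x1 along the straight path x_t = (1 - t) * x0 + t * x1, whose
    ground-truth velocity is simply x1 - x0."""
    target_velocity = x1 - x0
    return ((model_pred - target_velocity) ** 2).mean()

# Toy batch: 4 text positions over a 10-token vocabulary, plus 4 visual latents.
logits = rng.normal(size=(4, 10))
targets = rng.integers(0, 10, size=4)
noise = rng.normal(size=(4, 16))    # x0: Gaussian noise
latents = rng.normal(size=(4, 16))  # x1: image latents
pred = rng.normal(size=(4, 16))     # stand-in for the model's velocity output

total = text_loss(logits, targets) + flow_matching_loss(pred, noise, latents)
```

Both terms come from the same backbone in the paper's setup; only the prediction head and loss differ by modality.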

Schematic diagram of the model architecture with an autoregressive model combining Next Text Token Prediction and Next Visual State Prediction. Below are five colored blocks for the areas studied: Visual Representation, Data, World Model, Architecture, and Scaling Behavior.
The model architecture combines text and image prediction in a single model (top). The five research axes are shown below. | Image: Tong et al.

A single visual encoder can handle both understanding and generation

Previous approaches like Janus or BAGEL use separate visual encoders for image understanding and image generation. The Meta researchers found that this separation is unnecessary.

A representation autoencoder (RAE) built on the SigLIP 2 image model outperforms conventional VAE encoders at both image generation and visual comprehension, according to the study. Language performance stays on par with a text-only model.

Five bar charts side by side showing different metrics: DCLM PPL, Notes PPL, DPGBench, GenEval, and Avg VQA. SigLIP 2 in blue achieves the best scores for generation and VQA. The dashed line marks the text-only baseline. VAE encoders like SD-VAE and FLUX.1 perform worse on both generation and comprehension.
RAE based on SigLIP 2 beats VAE-based encoders in both image generation and visual comprehension without hurting language performance. | Image: Tong et al.

Rather than maintaining two separate paths, one encoder handles both tasks, dramatically simplifying the architecture. This challenges the common assumption that vision and language inevitably compete inside a model. Raw video without text annotations doesn't hurt language capabilities at all, according to the study. On a validation dataset, the model trained on both text and video actually edges out the text-only baseline.

Two line charts side by side. Left shows Diffusion Loss, right shows GenEval Score, both plotted against text tokens in billions. Four colored lines represent different amounts of image tokens from 25 to 100 billion. Dashed lines show the respective unimodal baselines. All curves improve with increasing text volume.
More text improves image generation: for each visual token budget, adding text lowers diffusion loss and pushes the GenEval score above the visual-only baseline. | Image: Tong et al.

The researchers trace the slight degradation that shows up with image-text pairs to the distribution gap between normal training text and image captions, not to the visual modality itself.

The synergy is notable: each mix of 20 billion VQA tokens (visual question answering data) plus 80 billion tokens of video, image-text pairs (MetaCLIP), or plain text outperforms a model trained on 100 billion pure VQA tokens.

World modeling shows up without explicit training

The researchers also tested whether their model could learn to predict visual states. Given a current image and a navigation instruction, the model has to predict the next visual state. Actions are encoded directly as text, so no architectural changes are needed.
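Because actions are ordinary text, a navigation trajectory can be serialized into one interleaved token stream that the model already knows how to consume. A hypothetical sketch, where `<img_i>` stands in for a frame's visual latent tokens and the token names are illustrative, not the paper's vocabulary:

```python
def serialize_trajectory(frames, actions):
    """Interleave image placeholders and text-encoded actions into one
    sequence. <img_i> stands in for frame i's visual latent tokens;
    the format is illustrative, not the paper's actual vocabulary."""
    assert len(frames) == len(actions) + 1
    parts = []
    for i, action in enumerate(actions):
        parts.append(f"<img_{i}>")
        parts.append(f"action: {action}")
    parts.append(f"<img_{len(actions)}>")
    return " ".join(parts)

seq = serialize_trajectory(frames=[0, 1, 2], actions=["W", "A"])
# → "<img_0> action: W <img_1> action: A <img_2>"
```

Since the action tokens pass through the same text pathway as any other words, swapping a keyboard code like "W" for a free-form instruction requires no change to the model.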

Grid of twelve images of an outdoor scene with buildings, arranged in three rows. The top row shows four context images. The middle row shows predicted images with keyboard input W and the text instruction "get out of the shadow!" The bottom row shows more predicted images with keys A and D, with the perspective rotating accordingly.
The model generates image sequences from keyboard input (W, A, D) or natural language instructions like "get out of the shadow!" - without ever seeing such input during training. | Image: Tong et al.

According to the researchers, world modeling ability emerges mainly from general multimodal training, not from task-specific navigation data. The model hits competitive performance with just one percent of task-specific data. It can even follow natural language instructions like "Get out of the shadow!" and produce matching image sequences, despite never encountering that kind of input during training.

Mixture-of-Experts figures out capacity allocation on its own

For the architecture, the researchers looked at Mixture-of-Experts (MoE), an approach where each input token gets routed to just a subset of specialized network modules instead of activating the entire model. This saves compute while boosting overall capacity.

With a model totaling 13.5 billion parameters but only 1.5 billion active per token, MoE outperforms both dense models and manually designed separation strategies, according to the study. The model figures out specialization by itself, assigning far more experts to language than to vision. Early layers are dominated by text-specific experts, while deeper layers increasingly feature visual and multimodal ones.
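The routing mechanism behind this specialization is simple: a small router scores every expert per token and only the top-k fire. A minimal numpy sketch with illustrative shapes (the 13.5B/1.5B split in the study corresponds to the same idea at scale):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, expert_weights, router_weights, k=2):
    """Route each token to its top-k experts and mix their outputs,
    weighted by softmaxed router scores. Shapes are illustrative."""
    logits = x @ router_weights                 # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        gates = np.exp(scores - scores.max())   # softmax over the k scores
        gates /= gates.sum()
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out, topk

d, n_experts, n_tokens = 8, 16, 4
x = rng.normal(size=(n_tokens, d))
experts = rng.normal(size=(n_experts, d, d))
router = rng.normal(size=(d, n_experts))
y, chosen = moe_layer(x, experts, router, k=2)  # only 2 of 16 experts per token
```

Because only k experts run per token, total parameter count can grow far faster than per-token compute, which is exactly the property the scaling analysis later exploits.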

Stacked bar chart with 16 bars for network layers 0 through 15. Each bar shows the distribution of 256 experts in three categories: text experts in blue dominate throughout, multimodal experts in orange and vision experts in red increase in deeper layers.
The model develops specialization on its own: early layers are dominated by text experts, while visual and multimodal experts become more common in deeper layers. | Image: Tong et al.

One standout finding is that image comprehension and image generation activate the same experts, with a correlation of at least 0.90 across all layers. The researchers see this as confirmation of Richard Sutton's "Bitter Lesson" that learning from data usually beats hand-designed solutions.
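One way such an overlap can be quantified (an illustrative measurement, not necessarily the paper's exact procedure) is to count how often each expert is selected for comprehension tokens versus generation tokens at a given layer and take the Pearson correlation of the two frequency vectors:

```python
import numpy as np

def expert_usage_correlation(choices_a, choices_b, n_experts):
    """Pearson correlation between per-expert selection frequencies for
    two token populations (e.g. image understanding vs. generation)."""
    freq_a = np.bincount(choices_a.ravel(), minlength=n_experts)
    freq_b = np.bincount(choices_b.ravel(), minlength=n_experts)
    return np.corrcoef(freq_a, freq_b)[0, 1]

rng = np.random.default_rng(0)
shared = rng.integers(0, 16, size=(100, 2))    # same routing pattern...
jitter = shared.copy()
jitter[:5] = rng.integers(0, 16, size=(5, 2))  # ...with a small perturbation
r = expert_usage_correlation(shared, jitter, n_experts=16)
```

A correlation near 1 means the two tasks lean on essentially the same set of experts.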

Vision needs far more data than language to scale well

Training an AI model always involves a fundamental tradeoff in how to split a fixed compute budget. You can build a bigger model with less data, or a smaller model with more data. The Chinchilla scaling laws showed that for pure language models, both should grow at roughly the same rate.

The Meta researchers calculated these scaling laws for a joint vision-language model for the first time and found a major asymmetry. For language, the familiar equilibrium holds. For vision, the optimum shifts heavily toward data. Visual capabilities benefit disproportionately from more training data, while making the model bigger brings relatively little improvement.

 Eight charts in two rows. Top row for language, bottom for vision. From left to right: IsoFLOP curves with colored point clouds, optimal parameter count as a function of compute, optimal token count as a function of compute, and a comparison chart. The exponents show that vision has a significantly higher data exponent of 0.63 compared to language at 0.53.
Scaling laws for vision and language are fundamentally different: language follows a roughly balanced Chinchilla pattern, while vision demands significantly more data. | Image: Tong et al.

The larger the model gets, the wider the gap in data requirements. Starting from a 1 billion parameter base, the relative need for vision data compared to language data grows 14-fold at 100 billion parameters and 51-fold at 1 trillion parameters, according to the study. Language scales much more modestly across this range. In conventional dense models, where every parameter is active at every step, this imbalance is nearly impossible to resolve.
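The 14-fold and 51-fold figures can be roughly reproduced from the reported data exponents alone. The sketch below assumes the parameter and data exponents sum to one (an assumption made here for illustration; the paper fits its own constants, so the numbers land close but not exactly):

```python
# Data-optimal token counts scale as D_opt ∝ C^b with the reported
# exponents b = 0.53 (language) and b = 0.63 (vision). Assuming the
# parameter exponent is a = 1 - b, compute C ∝ N^(1/a) gives D ∝ N^(b/a).
b_lang, b_vision = 0.53, 0.63
a_lang, a_vision = 1 - b_lang, 1 - b_vision

def data_ratio_growth(n_params, base=1e9):
    """Growth of the vision/language data-requirement ratio
    relative to a 1-billion-parameter baseline."""
    exponent = b_vision / a_vision - b_lang / a_lang
    return (n_params / base) ** exponent

ratio_100b = data_ratio_growth(1e11)  # ≈ 14, matching the study
ratio_1t = data_ratio_growth(1e12)    # ≈ 53 here; the study reports 51
```

The gap widens as a power of model size, which is why it becomes unmanageable for dense models at the trillion-parameter scale.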

The Mixture-of-Experts architecture helps close the gap. Since only a fraction of experts fire per token, the model can carry a massive total parameter count without compute costs scaling proportionally. Language gets the high parameter capacity it needs, while vision benefits from the large data volumes it requires. According to the study, MoE cuts the scaling asymmetry between the two modalities in half.

The researchers note that their work covers pre-training only and they didn't dig into fine-tuning or reinforcement learning. Still, they see their results as evidence that the boundary between multimodal models and world models is getting blurrier by the day. Huge volumes of unlabeled video remain largely untapped, and the study shows they can be folded in without hurting language performance.

Source: arXiv