Summary

Researchers have developed a new method called "Diffusion Forcing" that combines the strengths of autoregressive models and diffusion models. The technique enables, among other things, more stable video generation and more flexible planning for robotics tasks.


Scientists from MIT CSAIL and the Technical University of Munich have introduced a new method they call "Diffusion Forcing." In this approach, the model learns to denoise a sequence of tokens or observations, with each token having its own independent noise level. In this way, the method combines the advantages of autoregressive models, which today power large language models like GPT-4, with those of diffusion models, which have proven successful in image generation, such as in Stable Diffusion.

In next-token prediction, each token is usually "masked" and predicted from the preceding tokens. In full sequence diffusion, the entire sequence is gradually noised, with all tokens having the same noise level.

Diffusion Forcing combines both approaches: Each token, such as each word of a text or each frame of a video, can have its own noise level between 0 (unchanged) and K (pure noise). This way, a sequence can be partially masked. The model thus learns to reconstruct arbitrary subsets of the observed sequences.
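The per-token noise idea can be sketched in a few lines. This is a toy illustration with made-up shapes and a linear interpolation in place of a real diffusion schedule, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10          # maximum noise level (K = pure noise, 0 = unchanged)
seq_len = 6     # tokens in the sequence

# Diffusion Forcing: sample an independent noise level per token,
# unlike full-sequence diffusion (one shared level for all tokens)
# or next-token prediction (future tokens fully masked).
noise_levels = rng.integers(0, K + 1, size=seq_len)

# Toy forward process: interpolate each token toward pure noise
# according to its own level (a real model uses a diffusion schedule).
x = rng.standard_normal((seq_len, 4))      # clean "tokens"
eps = rng.standard_normal((seq_len, 4))    # per-token noise
alpha = 1.0 - noise_levels[:, None] / K    # 1 = clean, 0 = pure noise
x_noisy = alpha * x + (1.0 - alpha) * eps

# The model would be trained to reconstruct x (or predict eps) from
# x_noisy, conditioned on the per-token noise levels.
```

Because every token draws its own level, any subset of the sequence can end up effectively masked, which is what lets the model learn to reconstruct arbitrary subsets.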


During sampling, tokens can be processed one at a time, as in autoregression, or the entire sequence can be denoised at once, depending on the use case. By choosing the noise levels cleverly, uncertainty about the future can also be modeled: near tokens are less noisy than distant ones.
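One way to picture such a schedule is a "pyramid" over the planning horizon: near-future tokens start at low noise, distant tokens near pure noise, and each sampling sweep denoises everything one level further. The schedule shape and step rule below are hypothetical illustrations, not the paper's exact sampler:

```python
import numpy as np

K = 10       # maximum noise level
horizon = 8  # number of future tokens to plan over

# Hypothetical ramp: noise level grows with distance into the future,
# encoding that the near future is more certain than the distant one.
init_levels = np.clip(np.arange(1, horizon + 1) * K // horizon, 1, K)

def denoise_step(levels):
    """One sampling sweep: every still-noisy token moves down one level.
    A real sampler would call the trained model here to partially
    denoise each token according to its current level."""
    return np.maximum(levels - 1, 0)

levels = init_levels.copy()
steps = 0
while levels.any():
    levels = denoise_step(levels)
    steps += 1
# After at most K sweeps, every token is fully denoised.
```

The same machinery also covers the autoregressive extreme (one token at level K, the rest at 0) and full-sequence diffusion (all tokens at the same level) as special cases of the initial schedule.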

Diffusion Forcing generates temporally stable videos and controls robots

The researchers evaluated their method on applications including video generation, time series prediction, and robot control. In many cases, Diffusion Forcing delivered better results than previous methods.

In video generation, for example, conventional autoregressive models can often only provide plausible results for short periods of time. Diffusion Forcing remains stable even for longer sequences.

Video prediction using diffusion forcing and baselines on the Minecraft dataset (0.5x speed). Teacher forcing can easily fail, while diffusion models suffer from serious consistency issues. Stable and consistent video prediction can be achieved with Diffusion Forcing. | Video: Chen et al.

In reinforcement learning scenarios, the model can also plan action sequences of different lengths, depending on the requirements of the current situation. Similar to diffusion models for images, the method can also be used to guide the generation towards specific goals.
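The guidance idea, borrowed from image diffusion, amounts to nudging each denoising update with the gradient of a reward or goal objective. The functions and the quadratic goal below are hypothetical illustrations of that principle, not the authors' code:

```python
import numpy as np

def reward_grad(x, goal=1.0):
    # Hypothetical reward: quadratic pull toward a goal state.
    # Its gradient points from the current plan toward the goal.
    return -(x - goal)

def guided_update(x, model_update, guidance_scale=0.1):
    """Sketch of guided sampling: add a scaled reward gradient to the
    model's denoising update, steering the generated plan toward the
    goal, analogous to classifier guidance in image diffusion."""
    return x + model_update + guidance_scale * reward_grad(x)

# A zero plan nudged once: each entry moves slightly toward the goal.
plan = guided_update(np.zeros(3), model_update=0.0)
```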


Visualization of the diffusion forcing planning process using a simple maze as an example. To model the causal uncertainty of the future, the diffusion plan can have a near future with a lower noise level and a distant future with a higher noise level - visualized here by the color. | Video: Chen et al.

The method can also treat incoming observations as noisy, making it robust to distractions. In the video above, the team shows how a robotic arm controlled by Diffusion Forcing continues its task despite the visual disturbance of a shopping bag randomly thrown into the workspace. | Video: Chen et al.

The researchers now want to refine the method and apply it to larger datasets. The team conducted most of the experiments with a small RNN model; larger datasets or high-resolution videos require large transformer models. Initial experiments with transformers are already underway. If the method scales well, Diffusion Forcing could soon take over many tasks and deliver more robust results.

Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.