Researchers have developed a new method called "Diffusion Forcing" that combines the strengths of autoregressive models and diffusion models. The technique enables, among other things, more stable video generation and more flexible planning for robotics tasks.
Scientists from MIT CSAIL and the Technical University of Munich have introduced a new method they call "Diffusion Forcing." In this approach, the model learns to denoise a sequence of tokens or observations, with each token having its own independent noise level. In this way, the method combines the advantages of autoregressive models, which today power large language models like GPT-4, with those of diffusion models, which have proven successful in image generation, such as in Stable Diffusion.
In next-token prediction, each token is masked in turn and predicted from the tokens preceding it. In full-sequence diffusion, the entire sequence is noised gradually, with every token at the same noise level.
Diffusion Forcing combines both approaches: each token, such as a word of text or a frame of video, can carry its own noise level between 0 (unchanged) and K (pure noise). A sequence can thus be partially masked, and the model learns to reconstruct arbitrary subsets of the observed sequences.
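The training idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the denoising network `model(noisy_tokens, noise_levels)` and the linear noising schedule are assumptions chosen for simplicity.

```python
import torch

def diffusion_forcing_loss(model, tokens, K=1000):
    """Train a model to denoise a sequence whose tokens each carry an
    independent noise level (illustrative sketch, not the paper's code).

    tokens: (batch, seq_len, dim) clean sequence
    model:  hypothetical network taking (noisy_tokens, noise_levels)
            and predicting the clean tokens
    """
    batch, seq_len, _ = tokens.shape
    # Sample an independent noise level k in [0, K] for every token.
    k = torch.randint(0, K + 1, (batch, seq_len))
    # Simple linear schedule for illustration: alpha = 1 - k/K.
    alpha = (1 - k.float() / K).unsqueeze(-1)        # (batch, seq_len, 1)
    noise = torch.randn_like(tokens)
    noisy = alpha.sqrt() * tokens + (1 - alpha).sqrt() * noise
    # The model sees the per-token noise levels and reconstructs the clean tokens.
    pred = model(noisy, k)
    return torch.nn.functional.mse_loss(pred, tokens)
```

Because every token draws its own noise level, a training batch covers the whole spectrum from "fully masked sequence" (all tokens at K) to "predict one noisy token from clean context", which is what lets the model support both sampling modes.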
During sampling, tokens can be generated one at a time as in autoregression, or whole sequences can be denoised at once, depending on the use case. By choosing the noise levels cleverly, uncertainty about the future can also be modeled: tokens near the present are kept at low noise, while tokens further in the future remain heavily noised.
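Such a "near-certain, far-uncertain" schedule can be as simple as ramping the noise level along the sequence. The function name and linear ramp below are illustrative assumptions, not the paper's exact schedule:

```python
import numpy as np

def pyramid_noise_schedule(seq_len, K=1000):
    """Illustrative sampling schedule: tokens near the present get low
    noise levels, tokens far in the future stay close to pure noise (K)."""
    return np.linspace(0, K, seq_len).round().astype(int)
```

Feeding these per-token levels to the denoiser commits to the near future while leaving distant frames or actions open, which is the behavior the authors exploit for stable long rollouts.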
Diffusion Forcing generates temporally stable videos and controls robots
The researchers evaluated their method on applications such as video generation, time-series prediction, and robot control, and found that Diffusion Forcing outperformed previous methods in many cases.
In video generation, for example, conventional autoregressive models often produce plausible results only over short time spans, whereas Diffusion Forcing remains stable even for longer sequences.
In reinforcement learning scenarios, the model can also plan action sequences of different lengths, depending on the requirements of the current situation. Similar to diffusion models for images, the method can also be used to guide the generation towards specific goals.
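Goal-directed guidance can be sketched by analogy to classifier guidance in image diffusion: during denoising, the partially noised plan is nudged along the gradient of a reward. Everything here (`denoise_fn`, `reward_fn`, the additive update, the scale) is a hypothetical simplification, not the authors' algorithm.

```python
import torch

def guided_denoise_step(denoise_fn, reward_fn, plan, scale=0.1):
    """One illustrative guided update: follow the model's denoising step
    plus a reward-ascent term on the current (noisy) plan.

    denoise_fn: hypothetical one-step denoiser, plan -> denoised plan
    reward_fn:  differentiable reward over a plan, (batch, T, dim) -> (batch, T)
    """
    plan = plan.detach().requires_grad_(True)
    reward = reward_fn(plan).sum()
    # Gradient of the reward with respect to the noisy plan.
    grad = torch.autograd.grad(reward, plan)[0]
    return denoise_fn(plan) + scale * grad
```

Repeating such steps steers the sampled action sequence toward high-reward states without retraining the model, mirroring how guidance steers image diffusion toward a class or prompt.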
The researchers now want to improve the method further and apply it to larger datasets. Most experiments were conducted with a small RNN model; larger datasets and high-resolution video call for large transformer models, and initial experiments with transformers are already underway. If the method scales well, Diffusion Forcing could deliver more robust results across many generation and planning tasks.