The Motion Diffusion Model can create natural-looking human animations from a variety of inputs, including text, action labels, and existing animations.
So far, 2022 has been the year of generative AI systems that create new media from text: DALL-E 2, Midjourney, Imagen, and Stable Diffusion produce photorealistic or artistic images; Make-a-Video and Imagen Video generate short video clips; AudioGen and AudioLM generate audio; and CLIP-Mesh and DreamFusion create 3D models from text.
Now, in a new paper, Tel Aviv University researchers turn their attention to generating human motion. Their Motion Diffusion Model (MDM) can, among other things, generate matching animations based on text.
"The holy grail of computer animation"
Automated generation of natural and expressive motion is the holy grail of computer animation, according to the researchers. The biggest challenges, they say, are the sheer variety of possible movements and how readily humans perceive even slight flaws as unnatural.
A person's walk from A to B does contain some repetitive elements, but there are countless variations in how exactly the movement is carried out.
In addition, movements are difficult to describe: a kick, for example, could be a soccer kick or a karate kick.
Diffusion models, as used in current image generators such as DALL-E 2, have demonstrated remarkable generative power and variability, making them a good fit for human motion, the team writes. For MDM, the researchers accordingly combined a diffusion model with a transformer architecture.
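In broad strokes, such a denoiser takes a noised motion sequence together with a diffusion timestep and a conditioning embedding (for example, of a text prompt) and outputs a cleaned-up estimate of the motion. The following is a minimal PyTorch sketch of that idea; the layer sizes, the CLIP-sized text embedding, and all names are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch (not the authors' code) of a transformer-based motion denoiser.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, n_joints=24, feat_dim=6, d_model=512, n_heads=4, n_layers=8, text_dim=512):
        super().__init__()
        self.input_proj = nn.Linear(n_joints * feat_dim, d_model)      # one pose per frame -> one token
        self.cond_proj = nn.Linear(text_dim, d_model)                  # e.g. a CLIP text embedding
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.output_proj = nn.Linear(d_model, n_joints * feat_dim)     # back to pose features

    def forward(self, noisy_motion, t, text_emb):
        # noisy_motion: (batch, frames, joints*features); t: (batch, 1); text_emb: (batch, text_dim)
        cond = self.cond_proj(text_emb) + self.time_embed(t.float())   # condition + timestep token
        tokens = torch.cat([cond.unsqueeze(1), self.input_proj(noisy_motion)], dim=1)
        out = self.encoder(tokens)[:, 1:]                              # drop the condition token
        return self.output_proj(out)                                   # denoised motion estimate
```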
Motion diffusion model is versatile and beats specialized models
The researchers' model is a generic framework that is suitable for various forms of input. In their work, they show examples of text-to-motion, action-to-motion, and completion and manipulation of existing animations.
In a text-to-motion task, MDM generates an animation that corresponds to a text description. Because diffusion sampling starts from random noise, the same prompt can produce different variants.
In the action-to-motion task, MDM generates animations that match a particular motion class, such as "sitting down" or "walking."
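For intuition, the generation itself is an iterative denoising loop: starting from random noise, the model repeatedly refines the sequence under the given condition, and the injected randomness is what makes each run a new variant. A simplified DDPM-style sampler (an assumed scheduler, not necessarily the authors'; frame count, feature size, and noise schedule are placeholders) could look like this:

```python
# Illustrative sampling loop; different random seeds yield different motions for the same prompt.
import torch

@torch.no_grad()
def sample_motion(denoiser, cond_emb, n_frames=120, n_feats=144, steps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, steps, device=device)        # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, n_frames, n_feats, device=device)            # start from pure noise
    for t in reversed(range(steps)):
        t_tensor = torch.full((1, 1), t, device=device)
        x0_hat = denoiser(x, t_tensor, cond_emb)                    # model's clean-motion estimate
        if t > 0:
            # mean of the DDPM posterior q(x_{t-1} | x_t, x0_hat), plus fresh noise
            coef_x0 = betas[t] * alpha_bar[t - 1].sqrt() / (1 - alpha_bar[t])
            coef_xt = (1 - alpha_bar[t - 1]) * alphas[t].sqrt() / (1 - alpha_bar[t])
            x = coef_x0 * x0_hat + coef_xt * x + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = x0_hat
    return x                                                        # (1, frames, features)
```

In this sketch, cond_emb could be either a text embedding or a learned embedding of an action class such as "sitting down", so the same loop would cover both the text-to-motion and the action-to-motion settings.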
In addition, the model can complete or edit motions. The researchers compare their method with inpainting, which allows users to mark parts of an image in DALL-E 2 or Stable Diffusion and change them via text description.
During an edit, individual body parts can be selectively animated, while the rest either stays still or keeps its original animation.
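Conceptually, such editing can be done by masking during sampling: the parts of the motion that should stay fixed are re-noised from the original animation and written back at every step, so the model only fills in the masked region. A hedged sketch of that masking step, assuming the DDPM-style loop above, might be:

```python
# Inpainting-style motion editing (assumed approach, analogous to image inpainting).
import torch

def apply_motion_mask(x_t, original_motion, keep_mask, alpha_bar_t):
    """Overwrite the parts of x_t that should keep the original animation.

    x_t:             current noisy sample, (1, frames, features)
    original_motion: the animation to preserve, same shape
    keep_mask:       1 where the original is kept (e.g. selected frames or the
                     features of selected joints), 0 where new motion is generated
    alpha_bar_t:     cumulative noise-schedule value at the current step
    """
    noised_original = (alpha_bar_t ** 0.5) * original_motion \
        + ((1 - alpha_bar_t) ** 0.5) * torch.randn_like(original_motion)
    return keep_mask * noised_original + (1 - keep_mask) * x_t
```

Applied once per denoising step, a mask over frames gives motion completion (filling the gap between given poses), while a mask over joint features animates only the selected body parts and leaves the rest as in the source animation.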
In benchmarks, MDM is ahead of other generative models for motion, the researchers write. Currently, generating an animation takes about a minute on an Nvidia GeForce RTX 2080 Ti GPU; training the model took about three days.
In the future, the team wants to explore ways to control the animations more precisely and thereby expand the range of applications for the AI system. The code and model for MDM are available on GitHub.