The Motion Diffusion Model can create natural-looking human animations from various inputs such as text, action classes, or existing animations.

So far, 2022 has been the year of generative AI systems that create new media from text: DALL-E 2, Midjourney, Imagen, and Stable Diffusion produce photorealistic or artistic images; Make-A-Video and Imagen Video produce short video clips; AudioGen and AudioLM generate audio; and CLIP-Mesh and DreamFusion create 3D models from text.

Now, in a new paper, Tel Aviv University researchers turn their attention to generating human motion. Their Motion Diffusion Model (MDM) can, among other things, generate matching animations based on text.

"The holy grail of computer animation"

Automated generation of natural and expressive motion is the holy grail of computer animation, according to the researchers. The biggest challenges, they say, are the wide variety of possible movements and humans' sensitivity to even slight flaws, which are quickly perceived as unnatural.

A person's gait from A to B includes some repetitive features, but there are countless variations in how each movement is actually executed.

In addition, movements are difficult to describe: a kick, for example, can be a soccer kick or a karate kick.

Diffusion models used in current imaging systems such as DALL-E 2 have demonstrated remarkable generative capabilities and variability, making them a good choice for human motion, the team writes. For MDM, the researchers accordingly relied on a diffusion model and a transformer architecture.
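
To make that concrete: a denoiser for motion diffusion could look roughly like the following PyTorch sketch. All names and dimensions here are illustrative assumptions, not the authors' actual code; following the paper's description, the network predicts the clean motion sequence itself rather than the noise.

```python
import torch
import torch.nn as nn

# Minimal sketch of a transformer-based denoiser for motion diffusion.
# Dimensions and names are illustrative, not the authors' implementation.
class MotionDenoiser(nn.Module):
    """Predicts the clean motion sequence from a noised one, given a condition."""
    def __init__(self, feat_dim=263, d_model=512, n_layers=8, n_steps=1000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.step_embed = nn.Embedding(n_steps, d_model)  # diffusion timestep t
        self.cond_proj = nn.Linear(512, d_model)          # e.g. a CLIP text embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.output_proj = nn.Linear(d_model, feat_dim)

    def forward(self, noised_motion, t, cond):
        # noised_motion: (batch, frames, feat_dim) pose features per frame
        token = (self.step_embed(t) + self.cond_proj(cond)).unsqueeze(1)
        x = self.input_proj(noised_motion)
        x = self.encoder(torch.cat([token, x], dim=1))    # prepend a condition token
        return self.output_proj(x[:, 1:])                 # predicted clean motion
```

Prepending the combined timestep and text embedding as an extra token is one common way to feed such information into a transformer, and is similar in spirit to the token-based conditioning the paper describes.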

Motion diffusion model is versatile and beats specialized models

The researchers' model is a generic framework that is suitable for various forms of input. In their work, they show examples of text-to-motion, action-to-motion, and completion and manipulation of existing animations.

In a text-to-motion task, MDM generates an animation that corresponds to a text description. Thanks to the diffusion model, the same prompt generates different variants.
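
The variation comes from the sampling process itself: generation starts from random noise that is gradually denoised, so a different seed yields a different motion for the same prompt. A minimal DDPM-style sampling loop, reusing the hypothetical MotionDenoiser above, could look like this (again a sketch, not the authors' exact sampler):

```python
import torch

# Illustrative DDPM-style sampling loop for motion.
def sample_motion(model, cond, n_frames=120, feat_dim=263, steps=1000, seed=0):
    torch.manual_seed(seed)                       # a new seed gives a new variant
    betas = torch.linspace(1e-4, 0.02, steps)     # standard linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, n_frames, feat_dim)        # start from pure Gaussian noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        x0 = model(x, t_batch, cond)              # denoiser predicts the clean motion
        if t == 0:
            return x0
        ab, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        # posterior mean and variance of q(x_{t-1} | x_t, x0)
        mean = (betas[t] * ab_prev.sqrt() / (1 - ab)) * x0 \
             + ((1 - ab_prev) * alphas[t].sqrt() / (1 - ab)) * x
        var = betas[t] * (1 - ab_prev) / (1 - ab)
        x = mean + var.sqrt() * torch.randn_like(x)
```

Sampling the same prompt with different seeds is what produces variants like the two kicks in the clips below.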

"A person kicks." | Video: Tevet et al.

"A person kicks." | Video: Tevet et al.

"a person turns to his right and paces back and forth." | Video: Tevet et al.

In the action-to-motion task, MDM generates animations that match a particular motion class, such as "sitting down" or "walking."
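
For action-to-motion, the text condition can simply be swapped for a learned embedding of the action class. A hypothetical sketch, reusing the pieces above (the label set here is made up, loosely in the style of the HumanAct12 benchmark):

```python
import torch
import torch.nn as nn

# Hypothetical action conditioning: a class embedding replaces the text embedding.
action_classes = ["walk", "run", "sit", "jump"]        # made-up label set
class_embed = nn.Embedding(len(action_classes), 512)   # same width as the text condition

model = MotionDenoiser()                               # from the sketch above
cond = class_embed(torch.tensor([action_classes.index("run")]))  # shape (1, 512)
motion = sample_motion(model, cond)                    # reuse the sampling loop above
```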

(Class) Run | Video: Tevet et al.

In addition, the model can complete or edit motions. The researchers compare their method with inpainting, which lets users mark parts of an image in DALL-E 2 or Stable Diffusion and change them via a text description.

(Blue=Input, Gold=Synthesis) | Video: Tevet et al.

During an edit, individual parts of the body can be selectively animated, while others remain still or keep their original animation.

Upper body editing (lower body is fixed) (Blue=Input, Gold=Synthesis) | Video: Tevet et al.
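
Conceptually, such editing can follow the usual diffusion-inpainting recipe: at every denoising step, the model's prediction is overwritten with the known input wherever the mask says "keep". A simplified sketch under those assumptions, again not the exact MDM code:

```python
import torch

# Simplified mask-based motion editing, analogous to image inpainting.
# mask = 1 where the model may synthesize, 0 where the input motion is kept.
def edit_motion(model, cond, source, mask, steps=1000):
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn_like(source)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        x0 = model(x, t_batch, cond)
        x0 = mask * x0 + (1 - mask) * source   # keep fixed joints/frames from input
        if t == 0:
            return x0
        # re-noise the blended prediction to step t-1 (simplified sampler)
        noise = torch.randn_like(x)
        x = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * noise
```

For in-betweening, the mask would select the middle frames; for the upper-body edit shown above, it would select the upper-body joint features while the legs stay fixed.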

In benchmarks, MDM is ahead of other generative motion models, the researchers write. Generating an animation currently takes about a minute on an Nvidia GeForce RTX 2080 Ti GPU; training the model took about three days.

In the future, the team wants to explore ways to control the animations more precisely and thereby expand the range of applications for the AI system. The code and model for MDM are available on GitHub.

Summary
  • It is difficult to automatically animate convincing human movements because they are highly complex, and humans perceive even small flaws in such animations.
  • The generative AI system "Motion Diffusion Model" creates believable, lifelike human animations based on text input. It uses the same technology as DALL-E 2 or Stable Diffusion.
  • The researchers make their model available for free on GitHub.