
Motion Diffusion turns text into lifelike human animations

Image: Tevet et al.

Motion Diffusion can create natural-looking human animations from various inputs such as text, actions, or existing animations.

So far, 2022 is the year of generative AI systems that create new media from text: DALL-E 2, Midjourney, Imagen, and Stable Diffusion produce photorealistic or artistic images, Make-a-Video and Imagen Video produce short video clips, AudioGen and AudioLM generate audio, and CLIP-Mesh and DreamFusion create 3D models from text.

Now, in a new paper, Tel Aviv University researchers turn their attention to generating human motion. Their Motion Diffusion Model (MDM) can, among other things, generate matching animations based on text.

"The holy grail of computer animation"

Automated generation of natural and expressive motion is the holy grail of computer animation, according to the researchers. The biggest challenges, they say, are the wide variety of possible movements and humans' ability to perceive even slight flaws as unnatural.


A person's walk from A to B does include repetitive features, but there are countless variations in how each movement is executed.

In addition, movements are difficult to describe precisely: a kick, for example, could be a soccer kick or a karate kick.

Diffusion models used in current image generation systems such as DALL-E 2 have demonstrated remarkable generative capability and variability, making them a good fit for human motion, the team writes. For MDM, the researchers accordingly combined a diffusion model with a transformer architecture.
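The basic idea can be sketched in a few lines. The following is a minimal, hypothetical illustration of diffusion-based motion sampling, not the authors' implementation: motion is a sequence of pose vectors (the dimensions below are assumed, loosely following the HumanML3D pose format), the trained transformer is replaced by a toy placeholder, and a DDPM-style reverse loop turns pure noise into a motion sequence step by step.

```python
import numpy as np

# Assumed dimensions: a motion clip as a sequence of per-frame pose vectors.
FRAMES, JOINT_DIM = 60, 263

def denoiser(x_t, t, text_emb):
    # Stand-in for MDM's transformer, which (per the paper) predicts the
    # clean motion x0 directly rather than the added noise.
    # This toy placeholder is NOT a trained network.
    return x_t * 0.9

def diffusion_sample(steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)        # noise schedule
    alpha_bars = np.cumprod(1.0 - betas)
    x = rng.standard_normal((FRAMES, JOINT_DIM))  # start from pure noise
    for t in reversed(range(steps)):
        x0_hat = denoiser(x, t, text_emb=None)    # predict clean motion
        if t > 0:
            # re-noise the prediction back to noise level t-1 (DDPM-style)
            noise = rng.standard_normal(x.shape)
            x = (np.sqrt(alpha_bars[t - 1]) * x0_hat
                 + np.sqrt(1.0 - alpha_bars[t - 1]) * noise)
        else:
            x = x0_hat
    return x

motion = diffusion_sample(seed=0)
print(motion.shape)  # (60, 263)
```

Because sampling starts from random noise, running the loop again with a different seed yields a different motion for the same conditioning, which is where the model's variety of outputs per prompt comes from.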

Motion diffusion model is versatile and beats specialized models

The researchers' model is a generic framework that is suitable for various forms of input. In their work, they show examples of text-to-motion, action-to-motion, and completion and manipulation of existing animations.


In a text-to-motion task, MDM generates an animation that corresponds to a text description. Thanks to the diffusion model, the same prompt generates different variants.

"A person kicks." | Video: Tevet et al.


"a person turns to his right and paces back and forth." | Video: Tevet et al.

In the action-to-motion task, MDM generates animations that match a particular motion class, such as "sitting down" or "walking."

(Class) Run | Video: Tevet et al.

In addition, the model can complete or edit motions. The researchers compare their method with inpainting, which allows users to mark parts of an image in DALL-E 2 or Stable Diffusion and change them via text description.

(Blue=Input, Gold=Synthesis) | Video: Tevet et al.

During editing, individual body parts can be selectively animated while others stay fixed or keep their original animation.

Upper body editing (lower body is fixed) (Blue=Input, Gold=Synthesis) | Video: Tevet et al.
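This kind of editing maps naturally onto diffusion inpainting. The sketch below is a hypothetical illustration of one such step, not the authors' code: a mask marks the joints to keep, and after each denoising step the kept region is overwritten with a noised copy of the source motion at the current noise level, so only the unmasked region is freely synthesized. The split of the pose vector into "lower body" dimensions is an assumption for the example.

```python
import numpy as np

FRAMES, JOINT_DIM = 60, 263  # assumed motion representation

def edit_step_overwrite(x_t, source_motion, mask, alpha_bar_t, rng):
    """Inpainting-style edit step: where mask == 1, replace the current
    sample with the source motion noised to the current diffusion level,
    so the model only synthesizes the mask == 0 region."""
    noise = rng.standard_normal(source_motion.shape)
    noised_source = (np.sqrt(alpha_bar_t) * source_motion
                     + np.sqrt(1.0 - alpha_bar_t) * noise)
    return mask * noised_source + (1.0 - mask) * x_t

rng = np.random.default_rng(0)
source = rng.standard_normal((FRAMES, JOINT_DIM))  # existing animation
x_t = rng.standard_normal((FRAMES, JOINT_DIM))     # current diffusion sample

# Hypothetical layout: treat the first 100 pose dimensions as the lower
# body and keep them fixed; the upper body is re-synthesized.
mask = np.zeros((FRAMES, JOINT_DIM))
mask[:, :100] = 1.0

x_before = x_t.copy()
x_t = edit_step_overwrite(x_t, source, mask, alpha_bar_t=0.5, rng=rng)
print(x_t.shape)  # (60, 263)
```

Repeating this overwrite at every denoising step keeps the fixed joints consistent with the input animation while the diffusion model fills in the rest, which is the same trick image inpainting uses for masked pixels.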

In benchmarks, MDM outperforms other generative motion models, the researchers write. Generating an animation currently takes about a minute on an Nvidia GeForce RTX 2080 Ti GPU; training the model took about three days.

In the future, the team wants to explore ways to control the animations even better and as a result expand the range of applications for the AI system. The code and model for MDM are available on GitHub.
