MVDream uses Stable Diffusion and NeRFs to generate some of the best 3D renderings yet from text prompts.

Researchers at ByteDance present MVDream (Multi-view Diffusion for 3D Generation), a diffusion model capable of generating high-quality 3D renderings from text prompts. Similar models already exist, but MVDream achieves comparatively high quality and avoids two core problems of alternative approaches.

These alternatives often struggle with the Janus problem and with content drift: a generated baby Yoda sprouts multiple faces, for example, or a generated plate of waffles changes the number and arrangement of its waffles depending on the viewing angle.

To solve these problems, ByteDance trains a diffusion model (in this case Stable Diffusion) not only on the usual prompt-image pairs but also on multiple views of the same 3D object. To build this training data, the researchers render a large dataset of 3D models from different perspectives and camera angles.
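The article doesn't detail the rendering setup. As a rough, hypothetical sketch (the function names, camera rig, and parameters here are assumptions, not ByteDance's code), such a multi-view dataset can be produced by placing virtual cameras at evenly spaced azimuths around each object and computing look-at extrinsics:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world matrix that looks from `eye` toward `target`."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    cam_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = cam_up
    pose[:3, 2] = -forward  # OpenGL convention: the camera looks down -z
    pose[:3, 3] = eye
    return pose

def sample_orbit_cameras(n_views=4, radius=2.0, elevation_deg=15.0):
    """Evenly spaced azimuths at a fixed elevation; a shared rig per object
    is an assumption made for this illustration."""
    elev = np.deg2rad(elevation_deg)
    poses = []
    for azimuth in np.linspace(0.0, 2 * np.pi, n_views, endpoint=False):
        eye = radius * np.array([
            np.cos(elev) * np.cos(azimuth),
            np.cos(elev) * np.sin(azimuth),
            np.sin(elev),
        ])
        poses.append(look_at(eye))
    return poses  # each pose would be handed to a renderer to produce one view

poses = sample_orbit_cameras()
print(poses[0].round(3))
```

Each rendered view, paired with its camera pose and the object's caption, then serves as one training example for the multi-view diffusion model.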


By seeing coherent views from different angles, the model learns to produce coherent 3D shapes instead of disjointed 2D images, the team says.

Video: ByteDance

MVDream to get even better with SDXL

Specifically, given a text prompt, the model generates images of an object from several consistent perspectives, which the team then uses to train a NeRF as a 3D representation of the object.
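In the paper this coupling is tighter than a simple two-stage pipeline (the diffusion model guides the NeRF via score distillation), but the basic idea the article describes, fitting a 3D representation to a handful of generated views, can be sketched as follows. The tiny MLP, toy volume renderer, and random stand-in images are all illustrative assumptions, not MVDream's actual pipeline:

```python
import torch

# A tiny implicit 3D model: maps a 3D point to (r, g, b, density).
mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 4),
)

def render_view(pose, res=16, n_steps=32, near=0.5, far=3.5):
    """Volume-render the MLP from a 4x4 camera-to-world pose (toy pinhole camera)."""
    ys, xs = torch.meshgrid(torch.linspace(-0.5, 0.5, res),
                            torch.linspace(-0.5, 0.5, res), indexing="ij")
    dirs_cam = torch.stack([xs, -ys, -torch.ones_like(xs)], dim=-1)  # image plane at z=-1
    dirs = dirs_cam.reshape(-1, 3) @ pose[:3, :3].T                  # rotate rays to world
    origins = pose[:3, 3].expand_as(dirs)
    ts = torch.linspace(near, far, n_steps)
    pts = origins[:, None] + dirs[:, None] * ts[None, :, None]       # (rays, steps, 3)
    rgb_sigma = mlp(pts)
    rgb = torch.sigmoid(rgb_sigma[..., :3])
    sigma = torch.relu(rgb_sigma[..., 3])
    delta = (far - near) / n_steps
    alpha = 1 - torch.exp(-sigma * delta)                            # per-step opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # volume rendering weights
    return (weights[..., None] * rgb).sum(dim=1).reshape(res, res, 3)

# `views` would be the diffusion model's generated images with their camera poses;
# random tensors stand in here so the sketch runs on its own.
views = [(torch.eye(4), torch.rand(16, 16, 3)) for _ in range(4)]
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for step in range(100):
    pose, target = views[step % len(views)]
    loss = torch.nn.functional.mse_loss(render_view(pose), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because all views come from the same multi-view-consistent generation, the photometric losses from different poses pull the 3D representation toward a single coherent shape.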

In direct comparison to alternative approaches, MVDream shows a significant jump in quality and avoids common artifacts such as the Janus problem or content drift.

Video: ByteDance


In an experiment, the team also shows that MVDream can learn new concepts via DreamBooth fine-tuning and then generate 3D views of a specific dog, for example.

Video: ByteDance

The team cites the still low resolution of 256 x 256 pixels and limited generalizability as limitations. However, ByteDance expects that both problems can be reduced or solved in the future by using larger diffusion models such as SDXL. To significantly improve the quality and style of 3D renderings, however, the team says that extensive training with a new dataset will likely be required.

More information and examples are available on the MVDream GitHub page.

Summary
  • ByteDance researchers are developing MVDream, a diffusion model that creates high-quality 3D renderings from text prompts while avoiding some of the major problems of the past.
  • To produce coherent 3D shapes instead of disjointed 2D images, the model trains on multiple views of 3D objects from different perspectives.
  • Limitations include the low resolution of 256 x 256 pixels and limited generalizability, but ByteDance expects that future larger diffusion models such as SDXL could reduce these problems.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.