Researchers at Stanford University and MIT have developed an AI system that can interactively generate 3D scenes from a single image in real time. The new technology, called WonderWorld, lets users build and explore virtual environments step by step by controlling the content and layout of the generated scenes.

The biggest challenge in developing WonderWorld was achieving rapid 3D scene generation. While previous approaches often took tens of minutes to hours to generate a single scene, WonderWorld can produce a new 3D environment within 10 seconds on a single Nvidia A6000 GPU. This speed makes real-time interaction possible, a significant advance in the field.

Video: Yu, Duan et al.

WonderWorld works by starting with an input image and generating an initial 3D scene. It then enters a loop, alternating between generating scene images and corresponding FLAGS (Fast LAyered Gaussian Surfels) representations. Users control where new scenes are generated by moving the camera, and use text input to specify the type of scene they want.
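
Conceptually, the loop might look like the following Python sketch. Every function here is an illustrative stub standing in for the paper's components (rendering, prompt-guided image generation, FLAGS estimation), not the authors' actual code.

```python
# Hypothetical sketch of WonderWorld's interactive loop, inferred from
# the description above. Every function is an illustrative stub, not
# the authors' actual code.

def generate_initial_scene(image):
    return {"image": image, "camera": "origin"}        # stub

def render(scenes, camera):
    return f"partial view of {len(scenes)} scenes"     # stub

def generate_scene_image(partial_view, prompt):
    return f"outpainted image for '{prompt}'"          # stub

def estimate_flags(scene_image, camera):
    return {"image": scene_image, "camera": camera}    # stub

def wonderworld_loop(input_image, get_user_input, num_scenes=5):
    """Alternate between generating scene images and FLAGS scenes."""
    scenes = [generate_initial_scene(input_image)]
    for _ in range(num_scenes):
        # The user moves the camera to pick where the next scene goes
        # and supplies a text prompt describing its content.
        camera, prompt = get_user_input()
        # Render the existing geometry from the new viewpoint; empty
        # regions are filled in by a prompt-guided image generator.
        partial_view = render(scenes, camera)
        scene_image = generate_scene_image(partial_view, prompt)
        # Lift the new image into a FLAGS scene and add it to the world.
        scenes.append(estimate_flags(scene_image, camera))
    return scenes
```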

WonderWorld generates interactively linked 3D scenes from a single input image. Users can control the content and layout of the generated environments. | Image: Yu, Duan et al.

The FLAGS representation consists of three layers: foreground, background, and sky. Each layer contains a set of "surfels", surface elements defined by their 3D position, orientation, scale, opacity, and color. The surfels are initialized using estimated depth and normal maps and then optimized to produce the final scene.
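
As a rough data-model sketch, the three layers and their surfels could be represented as below. The field names mirror the description above; the back-projection helper and its camera intrinsics are simplified assumptions, not the paper's implementation.

```python
# Hypothetical data model for the FLAGS representation described above.
# Field names mirror the article; the back-projection helper and its
# camera intrinsics (fx, fy) are simplified assumptions.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Surfel:
    position: np.ndarray     # 3D center, shape (3,)
    orientation: np.ndarray  # surface normal, shape (3,)
    scale: np.ndarray        # extent along the two tangent axes, shape (2,)
    opacity: float
    color: np.ndarray        # RGB, shape (3,)

@dataclass
class FlagsScene:
    foreground: list = field(default_factory=list)  # list[Surfel]
    background: list = field(default_factory=list)
    sky: list = field(default_factory=list)

def init_surfel(pixel_xy, depth, normal, color, fx=500.0, fy=500.0):
    """Back-project one pixel into a surfel from estimated depth/normals.

    Simplified pinhole model with the principal point at the origin;
    scale and opacity are later refined by optimization.
    """
    x, y = pixel_xy
    position = depth * np.array([x / fx, y / fy, 1.0])
    return Surfel(position=position,
                  orientation=np.asarray(normal, dtype=float),
                  scale=np.array([depth / fx, depth / fy]),
                  opacity=1.0,
                  color=np.asarray(color, dtype=float))
```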

To reduce geometric distortions at scene transitions, WonderWorld employs a guided depth diffusion process. This uses a pre-trained diffusion model for depth maps, adjusting the depth estimate to match the geometry of existing parts of the scene.
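A common way to implement this kind of guidance is inpainting-style diffusion, in which the pixels with known geometry are re-anchored at every denoising step. The sketch below assumes that pattern; denoise_step is a stub for the pre-trained depth diffusion model, and the exact guidance in WonderWorld may differ.

```python
# Sketch of inpainting-style guided diffusion for a depth map. The
# actual guidance in WonderWorld may differ; denoise_step is a stub
# for the pre-trained depth diffusion model.
import numpy as np

def denoise_step(depth_t, t):
    return 0.99 * depth_t  # stub for one reverse-diffusion step

def guided_depth_diffusion(known_depth, known_mask, steps=50, seed=0):
    """Denoise a depth map while pinning pixels with known geometry.

    known_depth: depth rendered from existing scene parts, shape (H, W)
    known_mask:  1.0 where geometry already exists, 0.0 where it is new
    """
    rng = np.random.default_rng(seed)
    depth = rng.standard_normal(known_depth.shape)  # start from pure noise
    for t in reversed(range(steps)):
        depth = denoise_step(depth, t)
        # Guidance: overwrite the known region with a noised copy of the
        # existing depth at the current noise level, so the new region is
        # denoised into agreement with the old geometry.
        noise_level = t / steps
        noised_known = known_depth + noise_level * rng.standard_normal(known_depth.shape)
        depth = known_mask * noised_known + (1.0 - known_mask) * depth
    return depth
```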

Team sees potential in game development

Experiments have shown that WonderWorld significantly outperforms previous methods for 3D scene generation in terms of speed and visual quality. In user studies, the generated scenes were rated as more visually convincing than those produced by other approaches.

Video: Yu, Duan et al.

However, the system has some limitations. It can only create forward-facing surfaces, restricting users to a viewing angle of about 45 degrees in the virtual world, and the generated worlds currently look like paper cut-outs. The system also struggles with detailed objects like trees, which can produce "holes" or "floating" elements when the viewing angle changes.

Despite these limitations, the researchers see significant potential for WonderWorld in various applications. Game developers could use it to build 3D worlds iteratively. It could generate larger and more diverse content for virtual reality experiences. In the long term, it could enable users to create freely explorable, dynamically evolving virtual worlds.

You can find more examples to try out for yourself on the WonderWorld project page.

Summary
  • Researchers at Stanford University and MIT have developed WonderWorld, an AI system that can interactively generate 3D scenes from a single image. Users can control the content and layout of the generated environments.
  • The system generates a new scene within 10 seconds on an Nvidia A6000 GPU, significantly faster than previous methods. It uses a FLAGS representation with three layers and so-called surfels, along with guided depth diffusion, to optimize the geometry.
  • Despite limitations such as only displaying forward-facing surfaces, the researchers see potential in game development, virtual reality and the creation of dynamic virtual worlds. In user studies, the generated scenes were rated as visually convincing.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.