Scientists at ShanghaiTech University have developed an AI model called CLAY that can generate detailed 3D objects from text and images. According to the researchers, the model surpasses previous approaches in quality and versatility.

A research team from ShanghaiTech University has unveiled a new AI system for generating 3D content. The model, named CLAY (Controllable Large-scale Generative Model for Creating High-quality 3D Assets), can create complex three-dimensional objects from simple text descriptions or 2D images.

At the core of CLAY are a multi-resolution Variational Autoencoder (VAE) and a Diffusion Transformer (DiT). The VAE encodes 3D geometries at various detail levels into a latent space, while the DiT is responsible for generating the geometries. Unlike many other methods, CLAY processes 3D content natively without converting to 2D first.
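
To make "various detail levels" concrete, here is a minimal sketch of one common way to describe the same shape at several resolutions: subsampling a surface point cloud with farthest point sampling. This is an illustration only. The article does not say how CLAY builds its detail levels, and the function names and sample sizes below are hypothetical.

```python
import math
import random


def farthest_point_sample(points, k):
    """Pick k well-spread points using greedy farthest point sampling."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [0]  # start from the first point (arbitrary choice)
    # dist[i] = distance from point i to its nearest chosen point so far
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in chosen]


def multi_resolution_samples(points, sizes=(16, 64, 256)):
    """Return one subsample per requested size (hypothetical sizes)."""
    return {k: farthest_point_sample(points, k) for k in sizes}


# Usage: random points on a unit sphere stand in for a real surface.
rng = random.Random(0)
cloud = []
for _ in range(512):
    x, y, z = (rng.gauss(0, 1) for _ in range(3))
    n = math.sqrt(x * x + y * y + z * z)
    cloud.append((x / n, y / n, z / n))

levels = multi_resolution_samples(cloud)
for size, pts in levels.items():
    print(size, len(pts))
```

Because the sampling is greedy and always starts from the same point, each coarse level is a prefix of the finer ones, so the levels describe the same shape at increasing detail.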

According to the researchers, CLAY can generate a wide range of objects, from simple everyday items to complex fantasy creatures. The system was trained on more than 500,000 3D models. To unify different 3D datasets, the researchers developed a dedicated pipeline that includes a remeshing step to standardize geometries and uses GPT-4V for precise automatic annotation.

A notable feature of CLAY is that generation can be controlled through additional inputs. Besides text and images, rough shapes (voxel structures or point clouds) or bounding boxes can also be specified. This allows for more precise control of the end result.

These conditions can be used individually or in combination. For example, entire city scenes can be generated from scattered bounding boxes, or detailed 3D models can be reconstructed from hand sketches.
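
For readers unfamiliar with these input types, the following sketch shows how a point cloud relates to the two coarser representations mentioned above: an axis-aligned bounding box and a voxel occupancy grid. It only illustrates the data formats. How CLAY actually consumes such conditions is not described here, and the helper names are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned bounding box as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def voxelize(points, resolution):
    """Return the set of occupied (i, j, k) cells in a grid with
    `resolution` cells per axis spanning the points' bounding box."""
    lo, hi = bounding_box(points)
    occupied = set()
    for p in points:
        cell = []
        for axis in range(3):
            extent = hi[axis] - lo[axis]
            # A flat axis (extent 0) collapses into cell 0.
            t = (p[axis] - lo[axis]) / extent if extent > 0 else 0.0
            # Clamp so points on the max face land in the last cell.
            cell.append(min(int(t * resolution), resolution - 1))
        occupied.add(tuple(cell))
    return occupied


# Usage: four points, two of which share the same voxel.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0), (0.5, 1.0, 2.0), (0.9, 1.9, 3.9)]
box = bounding_box(points)
grid = voxelize(points, resolution=4)
print(box)
print(sorted(grid))
```

A bounding box keeps only the extent of a shape, a voxel grid keeps a blocky outline, and a point cloud keeps sampled surface positions, which is why these inputs give progressively tighter control over the result.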

Comparison with previous approaches

In direct comparisons, CLAY outperformed existing text-to-3D and image-to-3D systems such as Shap-E, DreamFusion, and Wonder3D in both qualitative and quantitative evaluations.

For text-to-3D generation, CLAY produced more consistent geometries with smoother surfaces and finer details. In image-to-3D conversion, the system could more accurately reconstruct inputs and better preserve complex structures.

Another advantage of CLAY is its speed: while some comparison systems require several hours for optimization, CLAY generates high-quality 3D assets in about 45 seconds.

In addition to geometry generation, CLAY also handles the synthesis of realistic materials. The system can generate physically based rendering (PBR) materials with diffuse, roughness, and metallic textures. CLAY uses a dedicated Multi-View Material Diffusion approach trained on over 40,000 high-quality PBR materials.
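
As a rough illustration of what the three texture channels mean, the sketch below stores one texture sample and derives its reflectance at normal incidence using the widely used metallic-roughness convention, in which non-metals reflect about 4% of incoming light and metals reflect their base color. That CLAY's materials follow exactly this convention is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PBRTexel:
    """One sample of a PBR material (hypothetical container)."""
    diffuse: tuple    # base color as linear RGB, each channel in [0, 1]
    roughness: float  # 0 = mirror-like, 1 = fully rough
    metallic: float   # 0 = dielectric (plastic, wood), 1 = metal

    def __post_init__(self):
        values = (*self.diffuse, self.roughness, self.metallic)
        if len(self.diffuse) != 3 or any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError("expected RGB plus scalars, all in [0, 1]")


def base_reflectance(texel, dielectric_f0=0.04):
    """Reflectance at normal incidence: blend from the dielectric
    default (assumed 0.04) toward the base color as metallic rises."""
    m = texel.metallic
    return tuple(dielectric_f0 * (1.0 - m) + c * m for c in texel.diffuse)


# Usage: a red plastic and a gold-like metal.
plastic = PBRTexel(diffuse=(0.8, 0.1, 0.1), roughness=0.5, metallic=0.0)
gold = PBRTexel(diffuse=(1.0, 0.77, 0.34), roughness=0.2, metallic=1.0)
print(base_reflectance(plastic))
print(base_reflectance(gold))
```

The roughness channel does not enter this particular calculation. In a renderer it controls how sharp or blurred the resulting reflections appear.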

Possible applications and outlook

The scientists see diverse applications for CLAY, such as in game development, film production, or 3D printing. The system could significantly simplify the time-consuming manual creation of 3D models.

However, the researchers also point out potential risks. Like other AI systems, CLAY could be misused to create deceptively realistic virtual content. The developers are therefore planning further safety measures to ensure responsible use.

Despite the impressive results, the researchers still see room for improvement. They plan to further expand the training data and improve its quality. They are also working on integrating geometry and material generation into a single model.

A version of CLAY can be accessed through the 3D-Gen service Rodin.

Summary
  • Scientists at ShanghaiTech University have developed CLAY, an AI model for generating detailed 3D objects from text and images.
  • CLAY has been trained on over 500,000 processed 3D models and can be driven by additional inputs such as rough shapes or bounding boxes. It generates more consistent geometry and finer detail than previous systems, and takes only about 45 seconds to do so.
  • The researchers see potential applications in game development, film production and 3D printing. The model is available through an online service.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.