Scientists at ShanghaiTech University have developed an AI model called CLAY that can generate detailed 3D objects from text and images. The model surpasses previous approaches in quality and versatility.
A research team from ShanghaiTech University has unveiled a new AI system for generating 3D content. The model, named CLAY (Controllable Large-scale generative model for creating high-quality 3D Assets with high-qualitY geometry and appearance), can create complex three-dimensional objects from simple text descriptions or 2D images.
At the core of CLAY are a multi-resolution Variational Autoencoder (VAE) and a Diffusion Transformer (DiT). The VAE encodes 3D geometries at various detail levels into a latent space, while the DiT is responsible for generating the geometries. Unlike many other methods, CLAY processes 3D content natively without converting to 2D first.
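The two-stage flow described above (generate in latent space, then decode to geometry) can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the "denoiser" is a toy rule that pulls the latent toward a fixed target rather than a trained Diffusion Transformer, and no real noise schedule is used. It only mirrors the control flow: start from noise, refine repeatedly in latent space, decode once at the end.

```python
import random
from typing import Callable, List

Latent = List[float]

def generate_latent(denoise: Callable[[Latent, int], Latent],
                    dim: int, num_steps: int, seed: int = 0) -> Latent:
    """Start from Gaussian noise and refine it step by step in latent space."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(num_steps, 0, -1):  # t counts down from num_steps to 1
        z = denoise(z, t)
    return z

# Toy stand-in for a learned denoiser (assumption): move toward a fixed target.
TARGET = [0.5, -1.0, 2.0]

def toy_denoise(z: Latent, t: int) -> Latent:
    # Close 1/t of the remaining gap, so the final step (t == 1) lands on TARGET.
    return [zi + (ti - zi) / t for zi, ti in zip(z, TARGET)]

def toy_decode(z: Latent) -> List[float]:
    """Toy stand-in for a VAE decoder (assumption): it only rounds the latent."""
    return [round(v, 3) for v in z]

latent = generate_latent(toy_denoise, dim=3, num_steps=10)
shape = toy_decode(latent)
print(shape)
```

In a real system of this kind, the decoder would turn the latent into 3D geometry and the denoiser would be a trained network conditioned on the text or image prompt; the sketch leaves both out.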
According to the researchers, CLAY can generate a wide range of objects - from simple everyday items to complex fantasy creatures. The system was trained on more than 500,000 3D models. The researchers developed a special pipeline to unify different 3D datasets, including a remeshing process to standardize geometries and the use of GPT-4V for precise automatic annotation.
A distinctive feature of CLAY is the ability to control generation through additional inputs. Besides text and images, rough shapes (voxel structures, point clouds) or bounding boxes can also be specified. This allows for more precise control of the end result.
These conditions can be used individually or in combination. For example, entire city scenes can be generated from scattered bounding boxes, or detailed 3D models can be reconstructed from hand sketches.
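To make the idea of a "rough shape" input more concrete, the following sketch shows one common way to reduce a point cloud to a coarse voxel grid. It is an illustration only: the function name `voxelize`, the unit-cube assumption, and the grid resolution are all hypothetical, and the article does not describe how CLAY itself preprocesses such inputs.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]

def voxelize(points: Iterable[Point], resolution: int = 8) -> Set[Voxel]:
    """Map points in the unit cube [0, 1]^3 to occupied cells of a resolution^3 grid."""
    occupied: Set[Voxel] = set()
    for x, y, z in points:
        # Clamp indices so that out-of-range values and exactly 1.0 stay inside the grid.
        cell = tuple(min(resolution - 1, max(0, int(c * resolution)))
                     for c in (x, y, z))
        occupied.add(cell)
    return occupied

# Two nearby points share one cell; the third lands in a different cell.
cloud = [(0.1, 0.1, 0.1), (0.12, 0.11, 0.1), (0.9, 0.5, 1.0)]
voxels = voxelize(cloud, resolution=4)
print(sorted(voxels))  # → [(0, 0, 0), (3, 2, 3)]
```

A low resolution discards fine detail on purpose, which matches the role of a rough shape hint: it constrains the overall form and leaves the details to the generator.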
Comparison with previous approaches
In direct comparisons, CLAY outperformed existing text-to-3D and image-to-3D systems such as Shap-E, DreamFusion, and Wonder3D in both qualitative and quantitative evaluations.
For text-to-3D generation, CLAY produced more consistent geometries with smoother surfaces and finer details. In image-to-3D conversion, the system could more accurately reconstruct inputs and better preserve complex structures.
Another advantage of CLAY is its speed: while some comparison systems require several hours for optimization, CLAY generates high-quality 3D assets in about 45 seconds.
In addition to geometry generation, CLAY can also synthesize realistic materials. The system can generate physically based rendering (PBR) materials with diffuse, roughness, and metallic textures. CLAY uses a dedicated multi-view material diffusion approach trained on over 40,000 high-quality PBR materials.
Possible applications and outlook
The scientists see diverse applications for CLAY, such as in game development, film production, or 3D printing. The system could significantly simplify the time-consuming manual creation of 3D models.
However, the researchers also point out potential risks. Like other AI systems, CLAY could be misused to create deceptively real virtual content. The developers are therefore planning further safety measures to ensure responsible use.
Despite the impressive results, the researchers still see room for improvement. They plan to further expand the training data and improve its quality. They are also working on integrating geometry and material generation into a single model.
A version of CLAY can be accessed through the 3D-Gen service Rodin.