
A new study examines whether and how well multimodal AI models understand the 3D structure of scenes and objects.


Researchers from the University of Michigan and Google Research investigated the 3D awareness of multimodal models. The goal was to understand how well the representations learned by these models capture the 3D structure of our world.

According to the team, 3D awareness can be measured by two key capabilities: Can the models reconstruct the visible 3D surface from a single image, i.e., infer depth and surface information? Are the representations consistent across multiple views of the same object or scene?
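To get an intuition for the first capability, the sketch below shows how frozen patch features from a vision backbone might be probed with a small trainable head that predicts per-patch depth. This is a minimal illustration, not the researchers' code: the FrozenEncoderStub, the DepthProbe head, and the dummy data are placeholders standing in for a real backbone such as DINOv2 and real depth labels.

```python
# Minimal sketch (not the study's code): probe a frozen encoder's patch
# features with a small trainable head that predicts per-patch depth.
# FrozenEncoderStub is a placeholder for a real backbone like DINOv2 or CLIP.
import torch
import torch.nn as nn

class FrozenEncoderStub(nn.Module):
    """Stand-in for a frozen vision backbone that returns patch features."""
    def __init__(self, dim=768, patches=16 * 16):
        super().__init__()
        self.dim, self.patches = dim, patches

    @torch.no_grad()  # the backbone stays frozen; only the probe is trained
    def forward(self, images):
        b = images.shape[0]
        return torch.randn(b, self.patches, self.dim)  # (B, N_patches, D)

class DepthProbe(nn.Module):
    """Small trainable head: maps each patch feature to a depth value."""
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, feats):
        return self.head(feats).squeeze(-1)  # (B, N_patches)

encoder = FrozenEncoderStub()
probe = DepthProbe()
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)     # dummy image batch
target = torch.rand(4, encoder.patches)  # dummy per-patch depth labels

feats = encoder(images)                  # frozen features, no gradients
loss = nn.functional.l1_loss(probe(feats), target)
loss.backward()                          # updates the probe only
opt.step()
print(f"probe loss: {loss.item():.3f}")
```

How well such a probe performs then serves as a proxy for how much depth and surface information the frozen features already contain.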

The results show that some models can encode depth and surface information without being explicitly trained for it. In particular, the self-supervised models DINO and DINOv2 and the text-to-image diffusion model StableDiffusion performed well. Models trained with vision-language pre-training, such as CLIP, were an exception: they captured very little 3D information.

Image: El Banani et al.

Multimodal models are likely not learning 3D-consistent representations

In terms of consistency across multiple viewpoints, all the models tested showed weaknesses. While they could still produce accurate correspondences between image regions when the change in viewing angle was small, their performance dropped sharply as the viewpoint shift grew larger.
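The multiview test boils down to matching features across two images of the same scene. The following sketch, again an assumption-laden illustration rather than the study's actual evaluation code, matches patch features between two views by mutual nearest neighbors and reports the fraction of consistent matches; with truly 3D-consistent features, this rate should stay high even under large viewpoint changes.

```python
# Illustrative sketch (assumptions, not the paper's code): estimate
# correspondences between two views by nearest-neighbor matching of patch
# features, then keep only mutually consistent matches.
import torch

def mutual_nn_match_rate(feats_a, feats_b):
    """feats_*: (N, D) patch features from two views of the same scene."""
    a = torch.nn.functional.normalize(feats_a, dim=-1)
    b = torch.nn.functional.normalize(feats_b, dim=-1)
    sim = a @ b.T                          # cosine similarity matrix (N, N)
    ab = sim.argmax(dim=1)                 # best match in B for each patch in A
    ba = sim.argmax(dim=0)                 # best match in A for each patch in B
    mutual = ba[ab] == torch.arange(a.shape[0])  # keep mutual nearest neighbors
    return mutual.float().mean().item()    # fraction of consistent matches

feats_view1 = torch.randn(256, 768)        # dummy features for view 1
feats_view2 = torch.randn(256, 768)        # dummy features for view 2
print(f"mutual match rate: {mutual_nn_match_rate(feats_view1, feats_view2):.2%}")
```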

The researchers conclude that the models learn view-dependent (2.5D) representations rather than truly 3D-consistent ones. They hope their findings will spark more interest in this topic, especially given the impressive progress in photorealistic image and video synthesis by generative AI models such as OpenAI's Sora. 3D understanding also includes the ability to infer spatial relationships and make predictions about deformation and dynamics - whether Sora has these capabilities will be the subject of future work.

The code and results are available on GitHub.

Summary
  • Researchers at the University of Michigan and Google Research have investigated how well multimodal AI models understand the 3D structure of scenes and objects. They evaluated the models' ability to infer depth and surface information from an image and to generate consistent representations across multiple viewpoints.
  • The results show that some models, such as DINO, DINOv2, and StableDiffusion, are partially able to encode 3D information without being explicitly trained to do so. In contrast, models trained with vision-language pre-training, such as CLIP, hardly captured any 3D information.
  • All of the models tested showed weaknesses when it came to consistency across multiple viewing angles. The team therefore suggests that the models learn viewpoint-dependent (2.5D) representations rather than true 3D-consistent representations.