
Know3D lets users control the hidden back side of 3D objects with text prompts

Image: Chen et al.

A research team taps into the knowledge of large language models to control the back side of objects via text during 3D generation from single images, addressing a fundamental problem in the field.

When an AI model has to build a complete 3D object from a single photo, it's working with a major blind spot: the image only shows one side, so the model basically has to guess everything behind it. According to a new paper from a team at several Chinese universities, this regularly leads to physically implausible shapes or results that miss what the user was going for.

The issue comes down to data. Compared to the enormous image and text datasets floating around the internet, 3D training data is still hard to come by. The world knowledge 3D models absorb during training just isn't enough to reliably fill in what's hidden.

Know3D tackles this by pulling in the broad world knowledge of multimodal language models. Users can type out a text description of what should appear on the side of an object they can't see.

An image generator bridges the gap between language and 3D

The obvious move—feeding a language model's output straight into a 3D network—doesn't actually work, the researchers say. The representations are too abstract and don't carry enough spatial information to generate usable geometry.

So Know3D takes a detour, slotting an image generation model between the language model and the 3D generator to act as a translator. The setup uses Qwen2.5-VL as the language model, Qwen-Image-Edit for image generation, and Microsoft's Trellis.2 as the 3D generator.

The language model reads the text instruction and analyzes the input image. The image generator then turns that understanding into spatial-structural information that steers the 3D generator.

Diagram of the Know3D pipeline: An input image of a rabbit on a snowboard and a text prompt feed into a language model and an image generation model. The model's internal intermediate states are then passed to a 3D generator, which produces a complete 3D object.
The language model analyzes the image and text instruction, the image generator translates that into intermediate spatial states, and those states guide the 3D generator as it builds the object. | Image: Chen et al.
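The paper doesn't ship code, but the handoff between the three components can be sketched roughly like this. Everything below, including the function names analyze_with_vlm, run_image_editor, and generate_3d, is a hypothetical stand-in for illustration, not the authors' implementation or the models' real APIs.

```python
def analyze_with_vlm(image, instruction):
    # Qwen2.5-VL's role in the pipeline: read the input photo plus the text
    # prompt and work out what the hidden side should contain.
    return f"back of the object: {instruction}"

def run_image_editor(image, guidance):
    # Qwen-Image-Edit's role: generate a back view conditioned on that
    # guidance. Know3D does not hand the finished picture onward; it taps
    # the model's intermediate states partway through generation.
    intermediate_states = [f"state at step {t}" for t in range(4)]
    final_image = "rendered back view (not passed on directly)"
    return intermediate_states, final_image

def generate_3d(image, states):
    # The Trellis-based 3D generator: builds the complete object from the
    # visible front plus the spatial-semantic states standing in for the back.
    return {"front": image, "back_guidance": states}

photo = "single photo of a coffee cup"
prompt = "a painted dragon wrapped around the back"

guidance = analyze_with_vlm(photo, prompt)
states, _unused_image = run_image_editor(photo, guidance)
obj = generate_3d(photo, states)
print(obj["back_guidance"][0])
```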

The trick is figuring out what information to pull from the image generator. The team tested three options: an internal image representation grabbed right before the final output, image features extracted from it via Meta's DINOv3, and the model's internal intermediate states during generation. The last option won by a clear margin: these intermediate states carry both semantic and spatial information without depending on the final image being pixel-perfect, so mistakes in that image don't carry over.
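To make the third option concrete: reading activations out of a network while it runs is a standard trick, for instance with a PyTorch forward hook. The miniature "denoiser" below is a stand-in rather than Qwen-Image-Edit; the sketch only shows where each of the three candidate signals would be taken.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for the image generator's network."""
    def __init__(self, dim=64):
        super().__init__()
        self.down = nn.Linear(dim, dim)
        self.mid = nn.Linear(dim, dim)   # the block whose states we capture
        self.up = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.down(x))
        h = torch.relu(self.mid(h))
        return self.up(h)

model = TinyDenoiser()
captured = []

def save_states(module, inputs, output):
    # Fires on every forward pass of `mid`; keep a detached copy of the state.
    captured.append(output.detach())

handle = model.mid.register_forward_hook(save_states)

latent = torch.randn(1, 64)      # stand-in for the noisy latent
for step in range(8):            # stand-in for the generation steps
    latent = model(latent)

handle.remove()

# Option 1: use the finished output directly.
final_image = latent
# Option 2: re-encode that finished output with a separate feature extractor
# (the paper uses Meta's DINOv3 for this); omitted here.
# Option 3: the states the hook captured partway through generation.
intermediate_states = captured[2]
print(len(captured), intermediate_states.shape)  # 8 torch.Size([1, 64])
```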

Internal model states keep errors from spreading

This matters in practice. If the image generator produces a bad rear view, like rendering a one-shoulder bag with two straps, image-based methods pass that mistake straight through to the 3D output.

The model's internal intermediate states are more forgiving: they seem to hold enough spatial and semantic information to still produce a solid 3D object. Even partway through processing, the model apparently builds a pretty reliable sense of structure and form.

When the team taps into those states matters, too. Go too early and the information is still focused on pixel details; go too late and noise takes over. Their ablation studies showed that grabbing states at roughly the quarter mark of the process hits the sweet spot.
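A rough sketch of what such a sweep looks like, assuming a fixed-length generation schedule. The scoring function is a placeholder that simply mirrors the reported finding; the paper measures real geometric and semantic quality on the generated objects instead.

```python
def back_side_quality(capture_fraction):
    # Placeholder metric peaking near 25%, mirroring the reported sweet spot;
    # not a real evaluation.
    return 1.0 - abs(capture_fraction - 0.25)

num_steps = 40
for fraction in (0.05, 0.25, 0.50, 0.90):
    capture_step = int(fraction * num_steps)
    print(f"capture at step {capture_step:2d} ({fraction:.0%}): "
          f"quality ~ {back_side_quality(fraction):.2f}")
# Too early: states still tied to pixel-level detail. Too late: noise dominates.
```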

The main advantage of Know3D over existing methods is the degree of control it offers. The researchers demonstrate this with a coffee cup: the same input photo yields different but geometrically consistent back sides depending on the text instruction. The same principle applies to chairs, robots, and houses: the back adapts to match the description while the visible front remains unchanged.

Strong benchmark scores, but results hinge on the base model

Know3D posts the best scores for semantic match between input image and generated 3D object on HY3D-Bench, a benchmark built by the Hunyuan3D team. That holds up against both current single-image methods and an approach that feeds the generated rear view back in as a second input image. Know3D also beats the competition on geometric quality for the back sides, the researchers say.

Three examples of Know3D's semantic controllability: A miniature house, a witch figure, and a phone booth are each shown as input images. Next to each are four back-view normal map renderings varied by different text instructions, such as adding a swing on the wall, potted plants, a magic potion bottle, or a hot air balloon.
Same input image, different back sides: Depending on the text instruction, Know3D generates different but geometrically consistent structures on the hidden side of the object. | Image: Chen et al.

How good the results are ultimately comes down to whether the underlying language model reads the text instructions correctly, the researchers say. If it misinterprets a prompt, the 3D output goes sideways too. So stronger multimodal models could cut down on this problem going forward.
