
Zhipu AI's GLM-Image uses "semantic tokens" to teach AI the difference between a face and a font


Chinese AI company Zhipu AI combines an autoregressive language model with a diffusion decoder. The 16-billion-parameter model excels at rendering text in images and knowledge-heavy content.

The architecture splits image generation into two specialized modules. A 9-billion-parameter autoregressive module based on the GLM-4 language model first creates a semantic representation of the image. It works like word prediction in language models: token by token, it builds a structure that determines the image's content and layout. A 7-billion-parameter diffusion decoder then refines this into a high-resolution image between 1024 and 2048 pixels per side.
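The division of labor can be sketched as a two-stage call: plan first, render second. The code below is a minimal illustrative stand-in, not GLM-Image's actual API; the vocabulary size, token count, and function names are assumptions.

```python
import random

VOCAB_SIZE = 16384          # assumed size of the semantic token vocabulary
PLAN_TOKENS = 256           # compact semantic plan for the image

def autoregressive_plan(prompt: str, num_tokens: int = PLAN_TOKENS) -> list[int]:
    """Stand-in for the 9B GLM-4-based module: builds the plan token by token."""
    rng = random.Random(len(prompt))  # deterministic placeholder for sampling
    tokens = []
    for _ in range(num_tokens):
        # In the real model, each token is sampled from a learned distribution
        # conditioned on the prompt and all previously generated tokens.
        tokens.append(rng.randrange(VOCAB_SIZE))
    return tokens

def diffusion_decode(tokens: list[int], resolution: int = 1024) -> dict:
    """Stand-in for the 7B diffusion decoder: refines the plan into pixels."""
    return {"resolution": (resolution, resolution), "conditioned_on": len(tokens)}

plan = autoregressive_plan("a street sign that reads 'Open 24 Hours'")
image = diffusion_decode(plan)
print(image["resolution"], image["conditioned_on"])  # (1024, 1024) 256
```

The point of the split: the language-model half decides *what* goes where, and the diffusion half only has to make it look good.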

Grid of example images generated by GLM-Image, including portraits, landscapes, and stylized illustrations.
Zhipu AI highlights the model's ability to generate photorealistic images with embedded text. | Image: Zhipu AI

Semantic tokens help the model learn faster

What sets GLM-Image apart from its competitors is how it breaks down images. Rather than relying on classic VQVAE tokens (Vector Quantized Variational Autoencoder), which prioritize visual reconstruction, the model uses what Zhipu AI calls semantic tokens. These carry both color information and meaning, identifying whether an area represents text, a face, or background. Zhipu AI says this approach speeds up training and produces more reliable outputs.
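The contrast between the two token types can be made concrete. In this hypothetical sketch, a VQVAE token only carries a codebook ID (appearance), while a semantic token also labels what the patch represents; the field names and region categories are illustrative, not the model's actual token format.

```python
from dataclasses import dataclass
from enum import Enum

class Region(Enum):
    TEXT = "text"
    FACE = "face"
    BACKGROUND = "background"

@dataclass(frozen=True)
class VQVAEToken:
    codebook_id: int        # encodes only what the patch looks like

@dataclass(frozen=True)
class SemanticToken:
    codebook_id: int        # appearance/color information
    region: Region          # what the patch *means* within the image

patch = SemanticToken(codebook_id=1031, region=Region.TEXT)
# A loss function or decoder can now treat text patches differently from faces:
needs_glyph_precision = patch.region is Region.TEXT
print(needs_glyph_precision)  # True
```

Because the meaning is explicit in the token, the model doesn't have to rediscover "this patch is a letter" from pixels alone, which is one plausible reading of why Zhipu AI reports faster training.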

Schematic of the GLM-Image pipeline with two modules: on the left, the autoregressive module that produces semantic tokens; on the right, the diffusion decoder that generates the final image.
An autoregressive module plans the image while a diffusion decoder handles the details. | Image: Zhipu AI

For higher resolutions, GLM-Image first generates a compact preview with around 256 tokens before creating the full 1024 to 4096 tokens for the final image. This preview locks in the basic layout. Zhipu AI says generating high-resolution images directly causes the model to lose controllability.
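One way to picture the coarse-to-fine budget is a fixed 16x16 preview grid followed by a resolution-dependent final grid. The patch size below is an assumption chosen so the numbers line up with the token counts in the article, not a published figure.

```python
def token_budget(resolution: int, patch: int = 32) -> int:
    """Semantic tokens for a square image, assuming one token per patch."""
    side = resolution // patch
    return side * side

PREVIEW_TOKENS = 16 * 16  # the ~256-token preview that locks in the layout

for res in (1024, 2048):
    print(res, "px ->", token_budget(res), "tokens")
# 1024 px -> 1024 tokens
# 2048 px -> 4096 tokens
```

The preview stays at roughly 256 tokens regardless of target resolution, so the layout is committed before the expensive full-resolution pass begins.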

For text rendering, GLM-Image integrates a Glyph-byT5 module that processes text areas character by character. Zhipu AI says this significantly improves how text looks, especially for Chinese characters. Since the semantic tokens already carry enough meaning, the decoder doesn't need a separate large text encoder, which cuts memory requirements and compute.
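Byte-level encoding is what lets character-by-character conditioning work for any script. The sketch below is a rough illustration in that spirit: text meant to appear in the image is pulled from the prompt and encoded byte by byte, so multi-byte Chinese characters still get fine-grained conditioning. The quoted-span regex and the byte encoding are illustrative assumptions, not the Glyph-byT5 interface.

```python
import re

def extract_render_text(prompt: str) -> list[str]:
    """Find quoted spans the user wants rendered inside the image."""
    return re.findall(r"'([^']+)'", prompt)

def byte_tokens(text: str) -> list[int]:
    """Byte-level tokens: each UTF-8 byte becomes one token."""
    return list(text.encode("utf-8"))

prompt = "a neon sign that reads '\u8425\u4e1a\u4e2d' above a door"
spans = extract_render_text(prompt)
print(spans)                        # the text to be rendered in the image
print(len(byte_tokens(spans[0])))   # 9 bytes for three Chinese characters
```

A byte-level vocabulary needs only 256 entries, which is part of why this kind of module stays small compared to a full text encoder.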

Technical diagram of the GLM-Image decoder, showing how semantic tokens and VAE latents are combined and processed through diffusion into a high-resolution image.
The diffusion decoder takes semantic tokens and optional reference images as input. | Image: Zhipu AI

Split training improves content and visuals independently

After pre-training, Zhipu AI fine-tunes both modules separately using reinforcement learning. The autoregressive module gets feedback on aesthetics and content accuracy: whether the image matches the prompt and whether text is readable.

The diffusion decoder trains for visual quality: Are textures right? Are hands rendered correctly? Zhipu AI built a specialized evaluation model for the latter. Splitting these two aspects lets the team tweak each one without messing up the other.
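The split post-training amounts to two independent reward signals. The scoring functions and weights below are placeholders to show the structure, not Zhipu AI's actual reward models.

```python
def planner_reward(prompt_match: float, text_legibility: float,
                   aesthetics: float) -> float:
    """Feedback for the autoregressive module (weights are assumptions)."""
    return 0.4 * prompt_match + 0.3 * text_legibility + 0.3 * aesthetics

def decoder_reward(texture_quality: float, anatomy_score: float) -> float:
    """Feedback for the diffusion decoder, e.g. from a hand-evaluation model."""
    return 0.5 * texture_quality + 0.5 * anatomy_score

# Each module is tuned against only its own signal, so improving hand
# rendering in the decoder cannot degrade the planner's prompt fidelity.
print(planner_reward(0.9, 0.8, 0.7))
print(decoder_reward(0.6, 1.0))
```

Keeping the objectives separate is what lets the team "tweak each one without messing up the other," as the article puts it.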

Example images from GLM-Image with complex text content, including infographics, diagrams, and images with embedded text in multiple languages.
GLM-Image shines on knowledge-heavy images with lots of text. | Image: Zhipu AI

For image editing and style transfer, GLM-Image processes both the semantic tokens and the image data from the original. The model takes a more efficient approach than Qwen-Image-Edit: instead of fully cross-referencing the reference and target images, GLM-Image caches intermediate results from reference images and reuses them. Zhipu AI says this saves compute without noticeably hurting detail preservation.

Running the model isn't easy: Zhipu AI says it requires either a GPU with more than 80 gigabytes of memory or a multi-GPU setup. Consumer hardware won't cut it. GLM-Image is available under an MIT license on Hugging Face and GitHub.

Chinese hardware-only training signals growing chip independence

Zhipu AI claims GLM-Image is the first open image model trained entirely on Chinese hardware. The company likely used Huawei's Ascend Atlas 800T A2 chips instead of Nvidia's H100 accelerators.

The timing matters. US export restrictions on AI chips have pushed China to limit purchases of even permitted Nvidia hardware to boost domestic chip production. Zhipu AI sees GLM-Image as proof that powerful multimodal models can be built without Western silicon, a political statement as much as a technical one.

The company, which openly names OpenAI as a competitor, recently became one of the first Chinese AI companies to go public. Its stock has since jumped more than 80 percent.

The release adds to Zhipu AI's growing lineup of open-weight models. In December, the company shipped GLM-4.7, a model built for autonomous programming that hit 73.8 percent on the SWE-bench Verified benchmark, matching commercial offerings from OpenAI and Anthropic.

GLM-Image's focus on text rendering reflects a broader push among Chinese image models. In August, Alibaba released Qwen-Image, a 20-billion-parameter model built for more natural-looking results and text precise enough for PowerPoint slides. Meituan followed in December with LongCat-Image, a 6-billion-parameter model that aims to beat larger competitors on text rendering.
