Summary

PixArt-Σ outperforms SDXL with significantly fewer parameters and, according to its creators, even rivals commercial models.


Researchers from Huawei's Noah's Ark Lab and several Chinese universities recently introduced PixArt-Σ (Sigma), a text-to-image model that builds on the earlier PixArt-α (Alpha) and PixArt-δ (Delta) and offers improved image quality, prompt accuracy, and more efficient use of training data. Its standout feature is the high resolution of the images it generates.

Higher-resolution images that match the prompt more closely

PixArt-Σ can directly generate images up to 3,840 x 2,560 pixels without an intermediate upscaler, even in unusual aspect ratios. Previous PixArt models were limited to 1,024 x 1,024 pixels.

Image: Chen et al.

Higher resolution also leads to higher computational requirements, which the researchers try to compensate for with a "weak-to-strong" training strategy. This strategy involves specific fine-tuning techniques that enable a fast and efficient transition from weaker to stronger models, the researchers write.


The techniques include a more powerful variational autoencoder (VAE) that "understands" images better, scaling from low to high resolution, and evolving from a model without key-value (KV) compression to one with KV compression that focuses on the most important aspects of an image. Overall, efficient token compression reduced training and inference time by 34 percent.
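The KV compression idea can be pictured outside the paper as follows: pool the key and value tokens before the attention product, so the model attends over fewer tokens and the cost of each attention layer drops. The sketch below is a minimal NumPy illustration assuming simple average pooling along the sequence axis; PixArt-Σ's actual mechanism compresses 2D token groups inside its diffusion transformer and differs in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_kv_compression(q, k, v, ratio=2):
    """Single-head attention where keys/values are average-pooled
    by `ratio` along the sequence axis before the attention product.
    Illustrative sketch only, not PixArt-Σ's exact design."""
    if ratio > 1:
        L, d = k.shape
        L_trim = L - L % ratio  # drop tokens that don't fill a group
        k = k[:L_trim].reshape(-1, ratio, d).mean(axis=1)
        v = v[:L_trim].reshape(-1, ratio, d).mean(axis=1)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (L_q, L_kv/ratio)
    return softmax(scores) @ v               # (L_q, d)
```

With `ratio=2` the attention matrix has half as many columns, which is where the claimed training- and inference-time savings come from: the cost of attention scales with the number of key/value tokens.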

According to the paper, the training material collected from the Internet consists of 33 million images with a resolution of at least 1K and 2.3 million images with a resolution of 4K. This is more than double the 14 million images of PixArt-α's training material, but still a far cry from the 100 million images processed by SDXL 1.0.

Prompt: "Da Vinci's Last Supper oil painting in the style of Van Gogh" | Image: Chen et al.

In addition to the resolution of the images in the training material, the accuracy of their descriptions also plays an important role. While the researchers observed hallucinations when using LLaVA for PixArt-α, this problem is largely eliminated by the GPT-4V-based Share-Captioner. The open-source tool writes detailed and accurate captions for the images collected to train PixArt-Σ.

In addition, the maximum text token length has been increased to approximately 300 tokens, which also improves the match between the text prompt and the generated image.

Prompt: "Game-Art - An island with different geographical properties and multiple small cities floating in space" | Image: Chen et al.

Despite its relatively low parameter count of 600 million, PixArt-Σ delivered better image quality and prompt matching than existing open-source text-to-image diffusion models such as SDXL (2.6 billion parameters) and Stable Cascade (5.1 billion). In addition, training a 1K model comparable to PixArt-α required only 9 percent of the GPU time needed for the original PixArt-α.


PixArt-Σ can keep up with commercial alternatives such as Adobe Firefly 2, Google Imagen 2, OpenAI DALL-E 3 and Midjourney v6, the researchers claim.

Prompt: Pixel art style of a snowboarder in mid-air performs a trick on a black rail, wearing a blue sweatshirt and black pants, with arms outstretched. The serene snowy landscape background, dotted with trees, complements the scene. The low-angle perspective emphasizes the trick's height and skill. | Image: Chen et al.
Prompt: "A cherry pick scientist." | Image: Chen et al.

The researchers do not show any rendered text in their example images. While Stable Diffusion 3, Midjourney, and especially Ideogram have recently made great strides in this area, PixArt-Σ is likely to perform less well here due to its training focus on high-resolution photographs.

"We believe that the innovations presented in PixArt-Σ will not only contribute to advancements in the AIGC community but also pave the way for entities to access more efficient, and high-quality generative models," the scientists conclude in their paper.

Other research could benefit from their insights on how to handle training data more efficiently. PixArt-α was eventually released as open source, but we don't know yet if this will be the case for PixArt-Σ.

  • After PixArt-α, researchers from Huawei's Noah's Ark Lab and others introduced PixArt-Σ, which can generate images in 4K resolution and follow prompts very precisely.
  • PixArt-Σ uses a "weak-to-strong" strategy to reduce the need for computing power and delivered better results than larger open-source models, despite having only 600 million parameters.
  • The researchers attribute the high image quality to careful selection of training material and more precise image descriptions.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.