
Stable Diffusion may soon have some competition when it comes to open-source image generators. In its latest iteration, PixArt becomes faster and more accurate while maintaining a relatively high resolution.

In a paper, researchers from Huawei Noah's Ark Lab, Dalian University of Technology, Tsinghua University, and Hugging Face presented PixArt-δ (Delta), an advanced text-to-image synthesis framework designed to compete with the Stable Diffusion family.

This model is a significant improvement over the previous PixArt-α (Alpha) model, which was already capable of quickly generating images with a resolution of 1,024 x 1,024 pixels.

High-resolution image generation in half a second

PixArt-δ integrates the Latent Consistency Model (LCM) and ControlNet into the PixArt-α model, significantly accelerating inference. The model can generate high-quality 1,024 x 1,024 pixel images in just two to four sampling steps, taking as little as 0.5 seconds, seven times faster than PixArt-α.
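For readers who want to try few-step sampling themselves, here is a minimal sketch of how the LCM-distilled PixArt weights could be run with Hugging Face's diffusers library. The checkpoint name and step count are assumptions based on the PixArt-α release on Hugging Face, not instructions from the paper itself; consistency models typically also disable classifier-free guidance.

```python
import torch
from diffusers import PixArtAlphaPipeline

# Assumed checkpoint: the LCM-distilled PixArt weights published on Hugging Face.
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-LCM-XL-2-1024-MS", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Consistency models need only a handful of denoising steps, and
# classifier-free guidance is usually switched off (guidance_scale=0.0).
image = pipe(
    prompt="a red fox standing in a snowy forest, photorealistic",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```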


SDXL Turbo, introduced by Stability AI in November 2023, can generate 512 x 512 pixel images in a single step, which takes about 0.2 seconds.

However, PixArt-δ's results are higher resolution and appear more consistent than those of SDXL Turbo and a four-step SDXL variant with LCM. The images show fewer errors, and the model follows prompts more accurately.

Image: Chen et al.

The new PixArt model is designed to be trained efficiently on V100 GPUs with 32 GB of VRAM in less than a day. In addition, its 8-bit inference capability allows it to synthesize 1,024 x 1,024 pixel images even on GPUs with just 8 GB of VRAM, greatly improving its usability and accessibility.
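As a rough illustration of how 8-bit inference keeps memory low, the sketch below follows the pattern from the diffusers documentation for running PixArt-α under 8 GB of VRAM: the large T5 text encoder is loaded in 8-bit via bitsandbytes, the prompt embeddings are computed and cached, and only then is the diffusion transformer loaded in half precision. The checkpoint name and exact flow are assumptions based on the public PixArt-α release, not the paper's own code.

```python
import gc
import torch
from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline

def flush():
    gc.collect()
    torch.cuda.empty_cache()

# Step 1: load only the T5 text encoder, quantized to 8-bit (needs bitsandbytes).
text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",
)
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,  # skip the diffusion transformer for now
    device_map="auto",
)

# Step 2: encode the prompt, then free the text encoder to reclaim VRAM.
with torch.no_grad():
    prompt_embeds, prompt_mask, neg_embeds, neg_mask = pipe.encode_prompt(
        "a small cactus with a happy face in the Sahara desert"
    )
del text_encoder, pipe
flush()

# Step 3: load the diffusion transformer in fp16 and denoise with the cached embeddings.
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=None,
    torch_dtype=torch.float16,
).to("cuda")

latents = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=neg_embeds,
    prompt_attention_mask=prompt_mask,
    negative_prompt_attention_mask=neg_mask,
    output_type="latent",
).images

# Step 4: drop the transformer, then decode the latents with the VAE.
del pipe.transformer
flush()
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cactus.png")
```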

More control over image generation

The integration of a ControlNet module into PixArt-δ allows finer control of text-to-image diffusion models using reference images. The researchers introduce a novel ControlNet architecture designed specifically for transformer-based models, providing explicit controllability while maintaining high-quality image generation.
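The core idea of this ControlNet-Transformer design can be sketched in a few lines of PyTorch: early transformer blocks of the frozen base model get trainable copies whose outputs pass through zero-initialized projections before being merged back. This is a minimal, illustrative sketch of the concept, not the authors' implementation; all class and variable names below are hypothetical.

```python
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """Sketch of one ControlNet-Transformer block pair: a frozen base
    transformer block plus a trainable copy whose output is merged back
    through a zero-initialized linear layer. Illustrative only."""

    def __init__(self, base_block: nn.Module, hidden_dim: int):
        super().__init__()
        # Make the trainable copy before freezing, so its weights keep gradients.
        self.copy_block = copy.deepcopy(base_block)
        self.base_block = base_block
        for p in self.base_block.parameters():
            p.requires_grad_(False)  # the pretrained block stays frozen
        self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_proj.weight)  # zero init: at the start of
        nn.init.zeros_(self.zero_proj.bias)    # training, output == base model

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # The copy sees the hidden states plus the control features;
        # its (initially zero) projection nudges the frozen block's output.
        base_out = self.base_block(hidden)
        ctrl_out = self.copy_block(hidden + control)
        return base_out + self.zero_proj(ctrl_out)

# Example usage with a stand-in transformer block:
block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
ctrl = ControlledBlock(block, hidden_dim=64)
x = torch.randn(1, 16, 64)  # (batch, tokens, dim) hidden states
c = torch.randn(1, 16, 64)  # encoded control image features
y = ctrl(x, c)
```

Because the projection starts at zero, training begins from the unmodified base model, the same trick the original UNet-based ControlNet uses to avoid destroying pretrained behavior early in fine-tuning.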

Image: Chen et al.

The researchers have published the weights for the ControlNet variant of PixArt-δ on Hugging Face. However, an online demo seems to be available only for PixArt-α with and without LCM.

Summary
  • Researchers from Huawei Noah's Ark Lab, Dalian University of Technology, Tsinghua University and Hugging Face present PixArt-δ, an improved text-to-image synthesis framework that generates high-resolution images in only two to four steps, making it extremely fast.
  • The new model integrates the Latent Consistency Model (LCM) and ControlNet to increase inference speed and generate 1,024 x 1,024 pixel images in 0.5 seconds, which is seven times faster than the previous PixArt-α model.
  • The ControlNet module in PixArt-δ, designed specifically for transformer architectures, enables more precise control of text-to-image diffusion models.