summary Summary

X-Decoder can perform a variety of vision and vision-language tasks. The technology could replace specialized AI models.


When it comes to visual understanding, there are numerous specialized AI models. They analyze images at different levels of granularity and perform very different tasks such as image classification, image labeling, or object recognition.

Lately, driven by the success of the Transformer architecture unveiled in 2017, the trend has been to develop general-purpose models that can be applied to numerous vision and vision-language tasks.

According to researchers at the University of Wisconsin-Madison, UCLA, and Microsoft, however, many of these models focus on image-level and region-level processing rather than pixel-level understanding. Approaches that focus on the pixel level on the other hand are tailored to specific tasks and have not yet been generalized.


X-Decoder can predict pixel-level segmentation and language tokens seamlessly

The team is introducing X-Decoder, a generalized decoding model that can process images at all levels of granularity and perform numerous vision and vision-language tasks, including pixel-level image segmentation and image captioning, for example. X-Decoder can also be combined with generative AI models such as Stable Diffusion. This approach allows for pixel-perfect image processing.

X-Decoder can perform numerous image segmentation tasks after training. | Image: Zou, Dou, Yang et al.

X-Decoder is built on top of a vision backbone and a transformer encoder. The model is trained with a dataset of several million image-text pairs and a limited amount of segmentation data. It processes inputs from an image encoder or a text encoder. There are two types of output: pixel-level masks and token-level semantics, such as text descriptions.

Depending on the task, only certain parts of the system are used. | Image: Zou, Dou, Yang et al.

Different input and output capabilities enable X-Decoder to perform different tasks. The team noted that the use of a single text encoder also facilitates synergy between tasks.

X-Decoder shows impressive performance

According to the researchers, X-Decoder shows strong transferability to a range of downstream tasks in both zero-shot and finetuning settings. The system achieves state-of-the-art results in difficult segmentation tasks, like referencing segmentation.

Furthermore, the model offered better or competitive finetuned performance to other generalist and specialist models on segmentation and vision-language tasks. X-Decoder is flexible enough to perform novel tasks such as referring captioning, combining referring segmentation and captioning.


We present X-Decoder, a model that seamlessly supports pixel-level and image-level vision-language understanding. With a simple and generalized design, X-Decoder can unite and support generic segmentation, referring segmentation and VL tasks effortlessly, achieving strong generalizability and competitive or even SoTA performance. We hope this work can shed a light on the design of the next-generation general-purpose vision system.

From the paper

You can find more information about X-Decoder on its GitHub page. On GitHub, you can also find the code.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • X-Decoder can be used for various vision and vision-language tasks, such as pixel-level image segmentation or image captioning.
  • X-Decoder builds on a vision backbone and a transformer encoder, and uses image and text encoders as input. Text descriptions, as well as masks at the pixel level, serve as output.
  • X-Decoder achieves top performance on several benchmarks and has strong zero-shot and finetuned performance on a good deal of segmentation and vision-language tasks.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.