
Researchers have shown that transformer models do not need to divide images into patches but can be trained directly on individual pixels, challenging standard practice in computer vision.


A team of researchers from the University of Amsterdam and Meta AI has introduced a new approach that questions whether AI models for computer vision need built-in assumptions about local pixel relationships. The researchers trained transformer models directly on individual image pixels instead of pixel patches and achieved surprisingly good results.

Typically, transformer models like the Vision Transformer (ViT) are trained on patches of, for example, 16 by 16 pixels. This builds in the assumption that neighboring pixels are more strongly related than distant ones. This locality principle is considered a fundamental inductive bias for image processing by AI and is also built into ConvNets.
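To make the patch idea concrete, here is a minimal sketch of ViT-style patchification in Python; the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into ViT-style patch tokens.

    Returns an (N, patch*patch*C) array, one flattened token per patch.
    Assumes H and W are divisible by `patch`.
    """
    H, W, C = image.shape
    tokens = (
        image.reshape(H // patch, patch, W // patch, patch, C)
             .transpose(0, 2, 1, 3, 4)        # group each patch's rows/cols together
             .reshape(-1, patch * patch * C)  # one flat row per patch
    )
    return tokens

img = np.random.rand(224, 224, 3)
print(patchify(img).shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```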

The researchers wanted to find out whether this assumption could be dropped entirely. To do this, they developed the "Pixel Transformer" (PiT), which treats each pixel as an individual token and makes no assumptions about spatial relationships. Instead, the model has to learn these relationships from the data on its own.
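A hedged sketch of the corresponding pixel-as-token step, assuming a simple linear projection per pixel and learned per-pixel position embeddings; the class name and its parameters are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Sketch of the pixel-as-token idea behind PiT (illustrative).

    Each pixel (C values) is projected to the model width, so an
    (H, W) image becomes a sequence of H*W tokens. A learned position
    embedding per pixel is one plausible choice, not necessarily the
    paper's.
    """
    def __init__(self, height: int, width: int, channels: int = 3, dim: int = 384):
        super().__init__()
        self.proj = nn.Linear(channels, dim)                # one pixel -> one token
        self.pos = nn.Parameter(torch.zeros(height * width, dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W) -> (B, H*W, C) -> (B, H*W, dim)
        pixels = image.flatten(2).transpose(1, 2)
        return self.proj(pixels) + self.pos

tokens = PixelTokenizer(28, 28)(torch.randn(2, 3, 28, 28))
print(tokens.shape)  # torch.Size([2, 784, 384])
```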


Pixel Transformer outperforms classic Vision Transformer

To test the performance of PiT, the scientists conducted experiments in three application areas:

1. Supervised learning for object classification on the CIFAR-100 and ImageNet datasets. Here, PiT was shown to outperform the conventional ViT.

2. Self-supervised learning using masked autoencoding (MAE). Here too, PiT performed better than ViT and scaled better with increasing model size; a minimal sketch of the masking step follows this list.

3. Image generation using diffusion models. PiT generated detailed images with a quality comparable to locality-based models.
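For context, MAE hides a large fraction of the input tokens and trains the model to reconstruct them; at the pixel level, the tokens being masked are individual pixels. A minimal sketch of the random masking step (the helper name and mask ratio are illustrative):

```python
import torch

def mask_tokens(tokens: torch.Tensor, ratio: float = 0.75):
    """Randomly drop a fraction of tokens, MAE-style.

    tokens: (B, N, D). Returns the kept tokens plus the indices of the
    masked ones, which the decoder would be trained to reconstruct.
    """
    B, N, D = tokens.shape
    keep = int(N * (1 - ratio))
    order = torch.rand(B, N).argsort(dim=1)        # random permutation per sample
    keep_idx, mask_idx = order[:, :keep], order[:, keep:]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, mask_idx
```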


Pixel Transformer is currently not practical - but could be the future

Overall, the results suggest that transformers can capture more information when they treat an image as a set of individual pixels than when they divide it into patches, as ViT does, according to the team.


"We believe this work has sent out a clear, unfiltered message that locality is not fundamental, and patchification is simply a useful heuristic that trades-off efficiency vs. accuracy."

From the paper

The researchers emphasize that PiT is not yet practical for real applications because of its far higher computational cost: treating every pixel as a token makes the input sequence much longer, and the cost of self-attention grows quadratically with sequence length. Still, they expect the result to inform the development of future AI architectures for computer vision.
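A back-of-the-envelope comparison shows the scale of the problem (numbers illustrative, not from the paper):

```python
# Rough cost comparison for a 224x224 image, illustrating why pixel
# tokens are expensive: self-attention work scales with the square
# of the sequence length.
pixels  = 224 * 224          # 50,176 tokens for a pixel transformer
patches = (224 // 16) ** 2   # 196 tokens for ViT with 16x16 patches

print(pixels, patches)          # 50176 196
print((pixels / patches) ** 2)  # ~65,536x more attention work
```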

Summary
  • Researchers from the University of Amsterdam and Meta AI have presented a new approach in which transformer models are trained directly on individual image pixels instead of on pixel patches, as is standard. In doing so, they challenge conventional methods in computer vision.
  • The team developed the "Pixel Transformer" (PiT), which treats each pixel as an individual token and makes no assumptions about spatial relationships. In experiments on object classification, self-supervised learning, and image generation, PiT outperformed conventional approaches such as the Vision Transformer (ViT), which learns from pixel patches.
  • According to the researchers, the results suggest that transformers can capture more information when viewing images as a set of individual pixels. Due to its higher computational cost, PiT is not yet practical, but it could inform the development of future AI architectures for computer vision.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.