Pixel Transformers: Researchers show that AI models learn more from raw pixels

Researchers have shown that transformer models do not need to divide images into blocks but can be trained directly on individual pixels. The finding challenges common methods in computer vision.

A team of researchers from the University of Amsterdam and Meta AI has introduced a new approach that questions the necessity of local relationships in AI models for computer vision. The researchers trained transformer models directly on individual image pixels instead of pixel blocks and achieved surprisingly good results.

Typically, transformer models like the Vision Transformer (ViT) are trained on blocks of pixels, for example 16 by 16. This builds in the assumption that neighboring pixels are more strongly related than distant ones. This principle of locality is widely considered a fundamental prerequisite for image processing by AI and is also built into ConvNets.

The researchers wanted to find out whether this locality assumption could be dropped entirely. To do this, they developed the "Pixel Transformer" (PiT), which treats each pixel as an individual token and makes no assumptions about spatial relationships. Instead, the model is supposed to learn these relationships on its own from the data.
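The difference between the two tokenization schemes can be sketched in a few lines of NumPy. This is only an illustration and not the researchers' code: the function names `patch_tokens` and `pixel_tokens` and the 32 by 32 example image are assumptions made for this sketch, and the embedding and transformer layers that would follow are left out.

```python
import numpy as np


def patch_tokens(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch tokens (ViT-style)."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)


def pixel_tokens(image: np.ndarray) -> np.ndarray:
    """Treat every pixel as its own token (PiT-style): one row per pixel."""
    h, w, c = image.shape
    return image.reshape(h * w, c)


# Hypothetical 32x32 RGB image, used only to show the resulting shapes.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

vit_tokens = patch_tokens(image, patch=16)  # 4 tokens, each 16*16*3 values
pit_tokens = pixel_tokens(image)            # 1024 tokens, each 3 values

print(vit_tokens.shape, pit_tokens.shape)
```

In this sketch, pixel tokens are simply the limiting case of patch tokens with a patch size of 1. Each token then carries no information about which pixels are its neighbors.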

Pixel Transformer outperforms classic Vision Transformer

To test the performance of PiT, the scientists conducted experiments in three application areas:

1. Supervised learning for object classification on the CIFAR-100 and ImageNet datasets. Here, PiT was shown to outperform the conventional ViT.

2. Self-supervised learning using Masked Autoencoding (MAE). Here too, PiT performed better than ViT and scaled better with increasing model size.

3. Image generation using diffusion models. PiT generated detailed images with a quality comparable to locality-based models.

Pixel Transformer is currently not practical - but could be the future

According to the team, the results suggest that transformers can capture more information when they treat an image as a set of individual pixels than when they divide it into blocks, as ViT does.

"We believe this work has sent out a clear, unfiltered message that locality is not fundamental, and patchification is simply a useful heuristic that trades-off efficiency vs. accuracy."

From the paper

The researchers emphasize that PiT is currently not practical for real applications because of its higher computational cost. They expect, however, that it will inform the development of future AI architectures for computer vision.
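A rough back-of-the-envelope calculation shows where the extra cost comes from. Standard self-attention compares every token with every other token, so its cost grows with the square of the sequence length. The 224 by 224 resolution below is an illustrative assumption and does not come from the article, and the helper names are hypothetical. The sketch ignores constant factors and the cost of all other layers.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of tokens when an image is cut into patch x patch blocks."""
    return (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by one full self-attention layer (quadratic)."""
    return tokens * tokens


# Assumed resolution for illustration only.
height = width = 224

vit = num_tokens(height, width, patch=16)  # 14 * 14 = 196 tokens
pit = num_tokens(height, width, patch=1)   # 224 * 224 = 50176 tokens

ratio = attention_pairs(pit) // attention_pairs(vit)
print(vit, pit, ratio)  # 196 50176 65536
```

Under these assumptions, a per-pixel model handles 256 times as many tokens as a model with 16 by 16 patches and scores 65,536 times as many token pairs per attention layer.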
