
A new research paper shows that Transformers known from computational linguistics are also the best image analysis AI available today. Previously used architectures may be obsolete.

Google's Transformers have revolutionized the field of computational linguistics: all major language AIs of the last two years are based on this architecture, including Google's BERT, OpenAI's GPT-3, and Facebook's RoBERTa. They enable major advances in translation, speech recognition, text generation, and other areas of natural language processing.

This is made possible by a mechanism that developers simply call "attention". Transformers process their input data in the context of all the surrounding data, such as words in the context of a sentence.
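The core of this mechanism can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not the full multi-head Transformer; the embeddings and weight matrices are random placeholders for illustration.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Minimal scaled dot-product attention: every token is weighed
    against every other token, so each output vector mixes in context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise token-to-token scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the context
    return w @ V                                # context-weighted sum of values

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                              # e.g. six words, 8-dim embeddings
X = rng.normal(size=(n_tokens, d))              # placeholder token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)                                # one context-aware vector per token
```

Each row of the result is a new representation of one token, built from all the other tokens weighted by relevance.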

Previously used architectures such as LSTM (long short-term memory) networks could capture up to a hundred words of context, but the capacity of current transformers is thousands of words and can be scaled further with additional computing resources.


Language technology for image analysis

The Transformer architecture can also be used outside of language processing, because for neural networks, all data is ultimately just vectors, whether words or images.

Facebook made a first attempt to use Transformers for image recognition in May 2020. The image analysis AI DETR (Detection Transformer) essentially consists of two building blocks: a convolutional neural network that extracts information from the image, and a Transformer block that uses that information to recognize objects.

In image analysis benchmarks, DETR achieves the recognition performance of the widely used Faster R-CNNs. However, the AI is simpler and more flexible, according to Facebook researchers.

Google AI undercover

Does image analysis AI work best with Transformers alone? A new research paper suggests that CNNs may have had their day. Officially, the paper is still in an anonymous review phase, so the results could change.

The authors are unknown due to the anonymous review phase, but they used the Google-exclusive JFT-300M image dataset and Google's TPUv3 cloud hardware. Anyone who can put two and two together will quickly conclude that this is Google's latest research.


Counting image sections instead of pixels

Using a pure Transformer architecture for image analysis is problematic because the attention mechanism weighs each token in the data (such as a word in a paragraph) against every other token. The cost therefore grows quadratically with input length: for a 500-word paragraph, the Transformer performs 500² = 250,000 operations.

Correspondingly, the number of operations skyrockets if the Transformer is to process pixels instead of words. For a low-resolution image of 250 x 250 pixels, it would have to perform 62,500² (3,906,250,000) arithmetic operations. In practice, this means that the hardware requirements for memory and processing power for pixel-precise attention are uneconomical, even for Google.
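The quadratic blow-up is easy to verify: since attention compares every token with every other token, the operation count is simply the square of the sequence length.

```python
def attention_ops(n_tokens):
    """Attention compares every token with every other token,
    so the number of pairwise operations is n_tokens squared."""
    return n_tokens ** 2

print(attention_ops(500))        # 250,000 for a 500-word paragraph
print(attention_ops(250 * 250))  # 3,906,250,000 for a 250 x 250 pixel image
```

Going from a paragraph to even a small image multiplies the workload by a factor of more than 15,000.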

Instead, the researchers trained their Vision Transformer (ViT) on image sections. They divided each image from the 300-million-image JFT-300M dataset into sections of 14 x 14 pixels for the largest model and 16 x 16 pixels for the smaller models. As a result, the Transformer only has to process a few hundred image sections instead of tens of thousands of pixels.
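Splitting an image into such sections is a plain reshape operation. A sketch with NumPy; the 224 x 224 input resolution is an assumption for illustration, not a figure from the paper.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an H x W x C image into non-overlapping patch tokens,
    each flattened into a single vector."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly"
    # reshape into a grid of patches, then flatten each patch to one vector
    grid = img.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

img = np.zeros((224, 224, 3))       # assumed input resolution for illustration
tokens = image_to_patches(img, 16)  # 16 x 16 sections, as in the smaller models
print(tokens.shape)                 # (196, 768): 196 tokens instead of 50,176 pixels
```

Each flattened section then plays the role a word embedding plays in a language model.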

Transformer: A lot helps a lot

The researchers trained three ViT models: ViT-Base with 12 layers and 86 million parameters, ViT-Large with 24 layers and 307 million parameters, and ViT-Huge with 32 layers and 632 million parameters. After training, ViT achieves the performance of today's best image recognition systems.
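Those parameter counts can be roughly reproduced from layer count and hidden size alone. A back-of-the-envelope sketch under assumptions not stated in this article: standard Transformer hidden sizes of 768/1024/1280, a 4x MLP expansion, and embeddings, biases, and norms ignored.

```python
def approx_params(layers, d_model):
    """Rough Transformer encoder size: 4*d^2 for the Q/K/V/output
    projections plus 8*d^2 for a 4x-expansion MLP, per layer.
    Embeddings, biases, and layer norms are ignored (simplification)."""
    return layers * (4 * d_model**2 + 8 * d_model**2)

for name, layers, d in [("ViT-Base", 12, 768),
                        ("ViT-Large", 24, 1024),
                        ("ViT-Huge", 32, 1280)]:
    print(f"{name}: ~{approx_params(layers, d) / 1e6:.0f}M parameters")
```

The estimates (~85M, ~302M, ~629M) land close to the reported 86, 307, and 632 million, which is a useful sanity check on the model descriptions.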


The largest model (ViT-Huge) performed best in almost all benchmarks, such as ImageNet-ReaL, CIFAR-10, CIFAR-100, and VTAB, followed by ViT-Large. A giant ResNet (BiT-Large) trained for comparison was less accurate and took almost four times longer to train. Google reports training time in TPUv3-days: ViT-Huge required 2,500 TPUv3-days, ViT-Large 680 TPUv3-days, and BiT-Large a full 9,900 TPUv3-days.

Google's ViT shows that Transformers have what it takes to become the new standard for all kinds of deep learning tasks. They are already the standard for computational linguistics, and they could become the standard for image analysis next year, because Google's ViT is more accurate and requires less training time than previous top models.

There was no need to specialize the architecture: Google used the same building block for image analysis AI as BERT or GPT-3.

Google's next step is to apply the Transformer architecture to other image tasks and to train an even larger version of ViT, "since the largest model doesn't seem to have reached its performance limit yet."

In the long run, Transformers could potentially unite image and text analysis, making giant AIs like OpenAI's GPT-3 even more powerful.


Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.