EfficientViT brings efficient computer vision to high-resolution cameras and could lead to more accurate results in autonomous driving or medicine.

MIT researchers have unveiled a new computer vision model that can perform semantic segmentation of high-resolution images much more efficiently than previous models. Semantic segmentation assigns a category label to every pixel in an image. With it, autonomous vehicles can accurately recognize objects such as other cars and pedestrians in their surroundings, and medical AI systems can identify tumor tissue.
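To make the task concrete, here is a minimal PyTorch sketch of what a segmentation output looks like; the tensor sizes and class names are illustrative assumptions, not taken from EfficientViT:

```python
import torch

# A segmentation model maps an image to per-pixel class scores.
# Here: 3 illustrative classes over a tiny 4x4 "image".
logits = torch.randn(1, 3, 4, 4)   # (batch, classes, height, width)
label_map = logits.argmax(dim=1)   # (batch, height, width): one class id per pixel
print(label_map[0])                # e.g. 0=road, 1=car, 2=pedestrian (hypothetical labels)
```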

EfficientViT reduces computational complexity of vision transformers

The new model, called EfficientViT, is based on the Vision Transformer but changes the model's attention mechanism. Conventional vision transformers suffer from the same drawback as classical transformers for language processing: the computational cost of the attention mechanism grows quadratically with the number of input tokens, making high-resolution images extremely expensive to process and ruling out real-time use on current edge devices.
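To see where the quadratic cost comes from, here is a minimal sketch of standard softmax self-attention; the shapes and function are illustrative, not taken from the EfficientViT code:

```python
import torch

def softmax_attention(q, k, v):
    # q, k, v: (num_tokens, dim). The similarity matrix has shape
    # (num_tokens, num_tokens), so compute and memory grow
    # quadratically with the number of tokens.
    scores = q @ k.T / q.shape[-1] ** 0.5  # (N, N) attention map
    attn = scores.softmax(dim=-1)          # non-linear similarity
    return attn @ v                        # (N, dim)

# A 1024x1024 image cut into 16x16 patches yields 64 * 64 = 4096 tokens,
# so the attention map alone holds about 16.8 million entries.
tokens = torch.randn(4096, 64)
out = softmax_attention(tokens, tokens, tokens)
```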

To address this, EfficientViT simplifies the process of creating "attention maps," representations that capture the relationships between different parts of the image. These help the model form a "global receptive field," meaning it can access all relevant parts of the image.

EfficientViT replaces the usual non-linear similarity function with a linear one, so computational cost grows linearly rather than quadratically with the number of tokens.
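The paper's linear attention is built on a ReLU kernel; the key trick of any such kernelized attention is matrix associativity, computing the key-value product first so the (N, N) attention map never materializes. A minimal sketch, with normalization details simplified:

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # Replace softmax with an elementwise ReLU kernel, then exploit
    # associativity: (q @ k.T) @ v == q @ (k.T @ v). The (dim, dim)
    # intermediate replaces the (N, N) attention map, so cost grows
    # linearly with the number of tokens N.
    q, k = torch.relu(q), torch.relu(k)
    kv = k.T @ v                             # (dim, dim), independent of N
    out = q @ kv                             # (N, dim)
    norm = q @ k.sum(dim=0, keepdim=True).T  # (N, 1) normalizer
    return out / (norm + eps)

tokens = torch.randn(4096, 64)
out = relu_linear_attention(tokens, tokens, tokens)
```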

Team compensates for the loss of accuracy with additional components

However, the linear function comes with a loss of accuracy because it captures only the global context of the image, not local detail. The team compensates for this with two additional components that capture local features and handle both large and small objects.
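The article doesn't detail these components; purely as an illustration of how a lightweight depthwise convolution can reinject local detail alongside a global attention branch, here is a hypothetical sketch (the module name and wiring are assumptions, not the authors' design):

```python
import torch
import torch.nn as nn

class LocalFeatureBranch(nn.Module):
    # Hypothetical module: a small depthwise convolution mixes each
    # token with its spatial neighbors, restoring local detail that a
    # purely global linear attention misses.
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):          # x: (batch, dim, height, width)
        return x + self.dwconv(x)  # residual keeps the global path intact

features = torch.randn(1, 64, 32, 32)
out = LocalFeatureBranch(64)(features)
```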

In tests, EfficientViT performed as well as or better than less efficient models, processing high-resolution images up to nine times faster.

This more efficient processing could enable real-time segmentation for autonomous vehicles, medical imaging, VR headsets, and other edge applications.

The code and model are available on GitHub.

Summary
  • MIT researchers have developed a new computer vision model called EfficientViT that can more efficiently perform semantic segmentation of high-resolution images, which could be useful for autonomous vehicles and medical AI systems.
  • EfficientViT is based on the Vision Transformer, but modifies the attention mechanism to reduce computational complexity and enable processing of high-resolution images on current edge devices.
  • In tests, EfficientViT achieved similar or better performance than less efficient models and processed high-resolution images up to nine times faster, which could enable real-time segmentation in applications such as autonomous vehicles, medical imaging, and VR headsets.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.