Content
summary Summary

New research from UC Berkeley shows that current AI systems have trouble extracting relevant information from large collections of images. The study highlights weaknesses in existing large multimodal models (LMMs) when dealing with visual data.

Ad

A team at Berkeley Artificial Intelligence Research (BAIR) has created "Visual Haystacks" (VHS), a benchmark to test AI models' ability to process high volumes of images. The test includes about 1,000 binary question-answer pairs, with each set containing between 1 and 10,000 images.

The benchmark consists of two tasks: In the "single-needle" task, only one relevant "needle" image is hidden in the "haystack" of images. In the "multi-needle" task, there are two to five relevant images. Questions ask if a specific object appears in one, all, or any of the relevant images.

The researchers tested various models, including open-source and proprietary ones like LLaVA-v1.5, GPT-4o, Claude 3 Opus, and Gemini-v1.5-pro. They also used a baseline model that generates captions with LLaVA and then answers questions based on text using Llama 3.

Ad
Ad

Results show that models struggle to filter out irrelevant visual information. Their performance on the single-needle task drops significantly as the number of images increases.

Single-needle precision. | Image: Tsung-Han et al.

Interestingly, simple two-stage approaches (generating captions first, then evaluating with a language model) outperform all tested LMMs on the multi-needle task. This suggests that LMMs have difficulty processing information from multiple images, which again calls into question the current advantages of large context windows.

Multi-needle precision. | Bild: Tsung-Han et al.

The models are also very sensitive to image position in the sequence. If the relevant image is in the middle, performance is much worse than if it's at the beginning or end.

Existing LMMs show a drop in performance of up to 41% if the image to be found is not ideally positioned. Gray boxes: Exceeding the context length. | Image: Tsung-Han et al.

This mirrors the "lost in the middle" phenomenon seen in language processing, where models focus on the start and end of a document while more or less ignoring the middle. LLMs also have a problem drawing meaningful conclusions from large amounts of text, according to a recent study.

To address these issues, the BAIR team developed MIRAGE (Multi-Image Retrieval Augmented Generation), an image processing-optimized RAG system. MIRAGE compresses visual tokens, allowing for more images in the same context lengths, a retriever trained in-line with the LLM fine-tuning to filter irrelevant images, and is trained on multi-image reasoning data. This approach achieves better results on both VHs and more complex visual question-answering tasks.

Recommendation
The Mirage image RAG system developed by the researchers can outperform LMMs without image RAG on image tasks. | Image: Tsung-Han et al.

The researchers recommend that future LMM projects use the Visual Haystacks framework to identify and fix potential weaknesses before deployment, adding that multi-image question answering is an important step toward artificial general intelligence (AGI). The benchmark is available on GitHub.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers at UC Berkeley have developed a new benchmark, Visual Haystacks" (VHs), to test the ability of AI models to extract relevant information from a large set of images.
  • The evaluation of different models showed that they have difficulty filtering out irrelevant visual information. Their performance in finding a relevant image decreased significantly as the number of images in the dataset increased.
  • The position of the image in the dataset also had an influence - images in the middle tended to be ignored. This is a phenomenon already known from text processing with LLMs. The research team developed the RAG system MIRAGE, which is optimized for image processing and can increase performance.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.