AI models struggle with "lost in the middle" issue when processing large image sets

Midjourney prompted by THE DECODER

New research from UC Berkeley shows that current AI systems have trouble extracting relevant information from large collections of images. The study highlights weaknesses in existing large multimodal models (LMMs) when dealing with visual data.

A team at Berkeley Artificial Intelligence Research (BAIR) has created "Visual Haystacks" (VHS), a benchmark to test AI models' ability to process high volumes of images. The test includes about 1,000 binary question-answer pairs, with each set containing between 1 and 10,000 images.

The benchmark consists of two tasks: In the "single-needle" task, only one relevant "needle" image is hidden in the "haystack" of images. In the "multi-needle" task, there are two to five relevant images. Questions ask if a specific object appears in one, all, or any of the relevant images.
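To make the two task types concrete, here is a minimal sketch of how a single sample and its expected yes/no answer could be represented. The `HaystackSample` class, its fields, and the reduction of images to sets of object labels are illustrative assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class HaystackSample:
    """Hypothetical toy representation of one benchmark question."""
    images: List[Set[str]]     # each image reduced to a set of object labels
    needle_indices: List[int]  # positions of the relevant ("needle") images
    target: str                # object the question asks about
    mode: str                  # "any" or "all" of the needle images


def ground_truth(sample: HaystackSample) -> bool:
    """Return the expected binary answer for a sample."""
    hits = [sample.target in sample.images[i] for i in sample.needle_indices]
    if sample.mode == "all":
        return all(hits)
    if sample.mode == "any":
        return any(hits)
    raise ValueError(f"unknown mode: {sample.mode}")


# Single-needle: exactly one relevant image, so "any" and "all" coincide.
single = HaystackSample(
    images=[{"cat"}, {"dog", "ball"}, {"tree"}],
    needle_indices=[1], target="ball", mode="any",
)

# Multi-needle: two relevant images, only one of which contains the target.
multi_all = HaystackSample(
    images=[{"dog", "ball"}, {"car"}, {"dog"}],
    needle_indices=[0, 2], target="ball", mode="all",
)
multi_any = HaystackSample(
    images=multi_all.images, needle_indices=[0, 2], target="ball", mode="any",
)

print(ground_truth(single), ground_truth(multi_all), ground_truth(multi_any))
```

In the real benchmark the model receives raw images and has to find the needles itself; the explicit `needle_indices` here only serve to compute the reference answer.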

The researchers tested various models, including open-source and proprietary ones like LLaVA-v1.5, GPT-4o, Claude 3 Opus, and Gemini-v1.5-pro. They also used a baseline model that generates captions with LLaVA and then answers questions based on text using Llama 3.

Results show that models struggle to filter out irrelevant visual information. Their performance on the single-needle task drops significantly as the number of images increases.

Single-needle precision. | Image: Tsung-Han et al.

Interestingly, simple two-stage approaches (generating captions first, then evaluating with a language model) outperform all tested LMMs on the multi-needle task. This suggests that LMMs have difficulty processing information from multiple images, which again calls into question the current advantages of large context windows.

Multi-needle precision. | Image: Tsung-Han et al.
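The two-stage idea can be sketched in a few lines: caption every image, then hand only the text to a language model. In the study, LLaVA produces the captions and Llama 3 answers; the `toy_captioner` and `toy_text_qa` functions below are trivial stand-ins invented for illustration, and `caption_then_answer` is a hypothetical helper, not the researchers' code.

```python
from typing import Callable, Dict, Sequence


def caption_then_answer(
    images: Sequence[Dict],
    question: str,
    captioner: Callable[[Dict], str],
    text_qa: Callable[[str, str], bool],
) -> bool:
    """Stage 1: caption each image. Stage 2: answer from the captions only."""
    captions = [f"Image {i}: {captioner(img)}" for i, img in enumerate(images)]
    context = "\n".join(captions)
    return text_qa(context, question)


def toy_captioner(image: Dict) -> str:
    # Stand-in for a captioning model: lists the labeled objects.
    return "a photo of " + " and ".join(sorted(image["objects"]))


def toy_text_qa(context: str, question: str) -> bool:
    # Stand-in for a language model: checks whether the last word
    # of the question appears anywhere in the captions.
    keyword = question.rstrip("?").split()[-1]
    return keyword in context


images = [{"objects": {"cat", "sofa"}}, {"objects": {"dog", "frisbee"}}]
print(caption_then_answer(images, "Is there a dog?", toy_captioner, toy_text_qa))
print(caption_then_answer(images, "Is there a horse?", toy_captioner, toy_text_qa))
```

The point of the design is that the second stage never sees pixels, so any number of images reduces to a text-only problem.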

The models are also very sensitive to image position in the sequence. If the relevant image is in the middle, performance is much worse than if it's at the beginning or end.

Existing LMMs show a drop in performance of up to 41% if the image to be found is not ideally positioned. Gray boxes: Exceeding the context length. | Image: Tsung-Han et al.
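A position test of this kind can be sketched as a sweep that moves the needle through every slot of the haystack and records whether the model still answers correctly. Everything below is a toy under stated assumptions: `edge_biased_model` is an artificial model that only looks at the first and last two images, built purely to mimic the effect, and does not reflect how any tested LMM works internally.

```python
from typing import Callable, Dict, List, Set

Image = Set[str]  # toy image: a set of object labels


def place_needle(distractors: List[Image], needle: Image, position: int) -> List[Image]:
    """Insert the needle image at the given position in the sequence."""
    return distractors[:position] + [needle] + distractors[position:]


def correct_by_position(
    model: Callable[[List[Image]], bool],
    distractors: List[Image],
    needle: Image,
    expected: bool,
) -> Dict[int, bool]:
    """Run the model once per needle position and record correctness."""
    results = {}
    for pos in range(len(distractors) + 1):
        haystack = place_needle(distractors, needle, pos)
        results[pos] = model(haystack) == expected
    return results


def edge_biased_model(haystack: List[Image], window: int = 2) -> bool:
    # Artificial model: only "sees" the first and last `window` images.
    visible = haystack[:window] + haystack[-window:]
    return any("dog" in img for img in visible)


distractors = [{"cat"} for _ in range(6)]
results = correct_by_position(edge_biased_model, distractors, {"dog"}, expected=True)
print(results)
```

With six distractors, the toy model is right only when the needle sits in one of the two outermost slots on either side and fails for the three middle positions.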

This mirrors the "lost in the middle" phenomenon seen in language processing, where models focus on the start and end of a document while more or less ignoring the middle. LLMs also have a problem drawing meaningful conclusions from large amounts of text, according to a recent study.

To address these issues, the BAIR team developed MIRAGE (Multi-Image Retrieval Augmented Generation), a RAG system optimized for image processing. MIRAGE compresses visual tokens so that more images fit into the same context length, uses a retriever trained in line with the LLM fine-tuning to filter out irrelevant images, and is trained on multi-image reasoning data. This approach achieves better results on both VHS and more complex visual question-answering tasks.
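The retrieval step can be illustrated with a generic retrieve-then-answer sketch: score each image against the question, drop everything below a threshold, and only then answer. This is not MIRAGE's implementation; token compression and the jointly trained retriever are not modeled, and `retrieve_then_answer`, `toy_relevance`, `toy_answer`, and the threshold value are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_then_answer(
    images: List[Dict],
    question: Dict,
    relevance: Callable[[Dict, Dict], float],
    answer: Callable[[List[Dict], Dict], bool],
    threshold: float = 0.5,
) -> Tuple[bool, List[Dict]]:
    """Filter images by relevance score, then answer using only the kept ones."""
    kept = [img for img in images if relevance(img, question) >= threshold]
    return answer(kept, question), kept


def toy_relevance(image: Dict, question: Dict) -> float:
    # Stand-in for a learned retriever: relevant if the anchor object is present.
    return 1.0 if question["anchor"] in image["objects"] else 0.0


def toy_answer(images: List[Dict], question: Dict) -> bool:
    # Stand-in for the answering model: does any kept image show the target?
    return any(question["target"] in img["objects"] for img in images)


haystack = [
    {"id": 0, "objects": {"cat"}},
    {"id": 1, "objects": {"dog", "frisbee"}},
    {"id": 2, "objects": {"tree", "frisbee"}},
]
question = {"anchor": "dog", "target": "frisbee"}

result, kept = retrieve_then_answer(haystack, question, toy_relevance, toy_answer)
print(result, [img["id"] for img in kept])
```

Filtering first means the answering model sees one image instead of three here; at the scale of thousands of images, that reduction is what keeps the input within the context window.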


The MIRAGE image RAG system developed by the researchers can outperform LMMs without image RAG on image tasks. | Image: Tsung-Han et al.

The researchers recommend that future LMM projects use the Visual Haystacks framework to identify and fix potential weaknesses before deployment, adding that multi-image question answering is an important step toward artificial general intelligence (AGI). The benchmark is available on GitHub.
