
BLIVA is a vision language model that excels at reading text in images, making it useful for real-world applications across many industries.

Researchers at UC San Diego have developed BLIVA, a vision language model designed to better handle images that contain text. Vision language models (VLMs) extend large language models (LLMs) by incorporating visual understanding capabilities to answer questions about images.

Such multimodal models have made impressive progress on open-ended visual question-answering benchmarks. One example is OpenAI's GPT-4, which in its multimodal form can discuss image content when prompted by a user, although this capability is currently only available in the "Be My Eyes" app.

However, a major limitation of current systems is their ability to handle images that contain text, which are common in real-world scenarios.


BLIVA combines InstructBLIP and LLaVA

To address this problem, the team developed BLIVA, which stands for "BLIP with Visual Assistant". BLIVA incorporates two complementary types of visual embeddings: learned query embeddings, extracted by a Q-Former module that focuses on image regions relevant to the textual input, similar to Salesforce's InstructBLIP, and encoded patch embeddings, extracted directly from the raw pixel patches of the full image, inspired by Microsoft's LLaVA (Large Language and Vision Assistant).

Image: Hu, Xu et al.

According to the researchers, this dual approach allows BLIVA to use both refined query-based embeddings tailored to the text and richer encoded patches capturing more visual detail.
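To make the dual-embedding idea concrete, here is a minimal PyTorch sketch, not BLIVA's actual code: the module names, dimensions, and the simplified cross-attention stand-in for the Q-Former are all assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DualVisualEmbedder(nn.Module):
    """Toy sketch of BLIVA-style dual visual embeddings: query embeddings
    from a Q-Former-like module plus projected patch embeddings, both
    mapped into the LLM's token space and concatenated."""

    def __init__(self, patch_dim=1024, num_queries=32, llm_dim=4096):
        super().__init__()
        # Learned query tokens that cross-attend to image features (Q-Former stand-in).
        self.queries = nn.Parameter(torch.randn(num_queries, patch_dim))
        self.cross_attn = nn.MultiheadAttention(patch_dim, num_heads=8, batch_first=True)
        # Separate linear projections into the LLM embedding space.
        self.query_proj = nn.Linear(patch_dim, llm_dim)
        self.patch_proj = nn.Linear(patch_dim, llm_dim)

    def forward(self, patch_feats):  # patch_feats: (batch, num_patches, patch_dim)
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.cross_attn(q, patch_feats, patch_feats)  # refined query embeddings
        # Feed both the query tokens and the full set of patch tokens to the LLM.
        return torch.cat([self.query_proj(q), self.patch_proj(patch_feats)], dim=1)


# Usage: 257 ViT patch features per image, mapped to LLM-sized visual tokens.
feats = torch.randn(2, 257, 1024)
visual_tokens = DualVisualEmbedder()(feats)
print(visual_tokens.shape)  # torch.Size([2, 289, 4096])
```

The point of the sketch is the concatenation in the last line of `forward`: the language model sees both the text-conditioned query tokens and the denser patch tokens, rather than having to rely on one or the other.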

BLIVA is pre-trained on approximately 550,000 image-caption pairs and instruction-tuned on 150,000 visual question-answer examples, while keeping the visual encoder and language model frozen.
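In practice, such a frozen-backbone setup means only the bridging modules receive gradient updates. The snippet below is a minimal sketch of that idea; the stand-in modules and the learning rate are placeholders, not BLIVA's actual components.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only; BLIVA's real components are a ViT
# vision encoder, a Q-Former, a patch projection layer, and a pretrained LLM.
vision_encoder = nn.Linear(768, 1024)
qformer = nn.Linear(1024, 1024)
patch_proj = nn.Linear(1024, 4096)
llm = nn.Linear(4096, 4096)

# Freeze the large pretrained parts, as described above.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# Only the lightweight bridging modules are trained in both stages.
trainable = [p for m in (qformer, patch_proj) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```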

The team shows that BLIVA significantly improves the handling of text-rich images on datasets such as OCR-VQA, TextVQA, and ST-VQA. For example, it achieved 65.38% accuracy on OCR-VQA, compared to 47.62% for InstructBLIP. The new system also outperformed InstructBLIP on seven out of eight general, non-text VQA benchmarks. The team believes this demonstrates the benefits of multi-embedding approaches to visual comprehension in general.

The team tested BLIVA with YouTube thumbnails. | Image: Hu, Xu et al.

The researchers also evaluated BLIVA on a new dataset of YouTube video thumbnails with associated questions, available on Hugging Face. BLIVA achieved 92% accuracy, significantly higher than previous methods. BLIVA's ability to read text in images such as road signs or food packaging could enable practical applications in many industries, the team said. Recently, Microsoft researchers demonstrated a multimodal AI assistant for biomedicine based on LLaVA, called LLaVA-Med.


More information and the code are available on the BLIVA GitHub; a demo is available on Hugging Face.

Summary
  • Researchers at UC San Diego have created BLIVA, a vision language model that excels at reading text in images, improving real-world applications across various industries.
  • BLIVA combines learned query embeddings from Salesforce InstructBLIP and encoded patch embeddings from Microsoft's LLaVA to provide better visual comprehension.
  • The new model outperformed other systems on several datasets and has potential applications in areas such as reading road signs and food packaging.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.