BLIVA is a vision language model that excels at reading text in images, making it useful for real-world applications across many industries.
Researchers at UC San Diego have developed BLIVA, a vision language model designed to better handle images that contain text. Vision language models (VLMs) extend large language models (LLMs) by incorporating visual understanding capabilities to answer questions about images.
Such multimodal models have made impressive progress on open-ended visual question-answering benchmarks. One example is OpenAI's GPT-4, whose multimodal version can discuss image content when prompted by a user, although this capability is currently only available in the "Be My Eyes" app.
However, a major limitation of current systems is their ability to handle images that contain text, which are common in real-world scenarios.
BLIVA combines InstructBLIP and LLaVA
To address this problem, the team developed BLIVA, which stands for "BLIP with Visual Assistant". BLIVA combines two complementary types of visual embeddings: learned query embeddings, extracted by a Q-Former module that focuses on image regions relevant to the textual input, similar to Salesforce's InstructBLIP; and encoded patch embeddings, extracted directly from the raw pixel patches of the full image, inspired by Microsoft's LLaVA (Large Language and Vision Assistant).
According to the researchers, this dual approach allows BLIVA to use both refined query-based embeddings tailored to the text and richer encoded patches capturing more visual detail.
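The minimal PyTorch sketch below illustrates how such a dual-embedding scheme could look: learned queries cross-attend to the image features while the raw patch embeddings are projected directly, and both token sets are concatenated into a single visual prefix for the language model. The module names, dimensions, and layer choices are illustrative assumptions, not BLIVA's actual implementation.

```python
import torch
import torch.nn as nn

class DualVisualEmbedder(nn.Module):
    """Illustrative sketch of a BLIVA-style dual visual embedding.

    Combines (1) a small set of learned query embeddings refined against the
    image features, playing the Q-Former role, and (2) projected patch
    embeddings taken directly from the vision encoder output (LLaVA-style).
    All names and dimensions here are hypothetical.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=32):
        super().__init__()
        # Learned queries that cross-attend to the image features
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.query_proj = nn.Linear(vision_dim, llm_dim)
        # Direct projection of the raw patch embeddings
        self.patch_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds):
        # patch_embeds: (batch, num_patches, vision_dim) from a frozen vision encoder
        b = patch_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend to the image patches, pulling out text-relevant regions
        q, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        query_tokens = self.query_proj(q)             # (b, num_queries, llm_dim)
        patch_tokens = self.patch_proj(patch_embeds)  # (b, num_patches, llm_dim)
        # Concatenate both token types; the LLM receives them as a visual prefix
        return torch.cat([query_tokens, patch_tokens], dim=1)

# Example: 257 ViT patch embeddings -> 32 query tokens + 257 patch tokens
embedder = DualVisualEmbedder()
visual_prefix = embedder(torch.randn(1, 257, 1024))
print(visual_prefix.shape)  # torch.Size([1, 289, 4096])
```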
BLIVA is pre-trained on approximately 550,000 image-caption pairs and then instruction-tuned on 150,000 visual question-answer examples, while keeping the visual encoder and the language model frozen.
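The following short sketch shows what such a frozen-backbone training setup could look like in PyTorch, with placeholder modules standing in for the real vision encoder and LLM; all names and sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for a pre-trained component."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder modules standing in for BLIVA's actual components
vision_encoder = nn.Linear(768, 1024)      # stands in for the frozen vision encoder
language_model = nn.Linear(4096, 4096)     # stands in for the frozen LLM
visual_projection = nn.Linear(1024, 4096)  # trainable layers aligning vision to the LLM

# Both training stages (caption pre-training, then VQA instruction tuning)
# keep the vision encoder and the LLM frozen; only the layers that align
# visual embeddings with the LLM's input space are updated.
freeze(vision_encoder)
freeze(language_model)

trainable_params = [p for p in visual_projection.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
print(sum(p.numel() for p in trainable_params), "trainable parameters")
```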
The team shows that BLIVA significantly improves the handling of text-rich images on datasets such as OCR-VQA, TextVQA, and ST-VQA. For example, it achieved 65.38% accuracy on OCR-VQA, compared to 47.62% for InstructBLIP. The new system also outperformed InstructBLIP on seven out of eight general, non-text VQA benchmarks. The team believes this demonstrates the benefits of multi-embedding approaches to visual comprehension in general.
The researchers also evaluated BLIVA on a new dataset of YouTube video thumbnails with associated questions, available on Hugging Face. BLIVA achieved 92% accuracy, significantly higher than previous methods. BLIVA's ability to read text in images, such as road signs or food packaging, could enable practical applications in many industries, the team said. Recently, Microsoft researchers demonstrated a multimodal AI assistant for biomedicine based on LLaVA, called LLaVA-Med.
More information and the code are available on the BLIVA GitHub; a demo is available on Hugging Face.