
Google's AVIS tries to answer questions about images where the information is not readily available, such as the date an airline was founded or the year a car was built.

Recent advances in large language models (LLMs) have enabled multimodal capabilities for tasks such as image captioning and visual question answering. However, these visual language models (VLMs) still struggle with complex real-world visual reasoning that requires external knowledge, a task the researchers call "visual information seeking".

Some examples of the types of questions AVIS tries to answer. | Image: Google

To address this limitation, researchers at Google have introduced a new method called AVIS, which integrates Google's PaLM language model with computer vision, web search, and image search tools. With these tools, AVIS uses the language model in a dynamic framework to autonomously search for visual information.

Google's AVIS learns from humans

Unlike previous systems that combine large language models with tools in a rigid two-step process, AVIS uses them more flexibly for planning and reasoning. This allows actions to be adapted based on real-time feedback.


AVIS has three main components:

  • A planner that determines the next action (API call and query) using the LLM
  • A working memory that retains information from past API executions
  • A reasoner that processes API outputs using the LLM to extract useful information

The planner and reasoner are used iteratively, with the planner deciding on the next tool and query based on the updated state from the reasoner. This continues until the reasoner determines that there is sufficient information to provide the final answer.
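In rough pseudocode, that loop looks something like the following. This is a minimal Python sketch, not Google's implementation: names such as call_llm, WorkingMemory, plan_next_action, and reason_over_output are hypothetical placeholders standing in for the PaLM calls and tool invocations described in the paper.

```python
# Minimal sketch of an AVIS-style planner/reasoner loop.
# All names below are hypothetical placeholders, not Google's actual API.

from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Retains the question and the outputs of all past tool calls."""
    question: str
    records: list = field(default_factory=list)

    def add(self, tool: str, query: str, output: str) -> None:
        self.records.append({"tool": tool, "query": query, "output": output})


def call_llm(prompt: str) -> str:
    # Placeholder for a call to a large language model such as PaLM.
    # Returns a canned response so the sketch runs end to end.
    return "ANSWER: the airline was founded in 1984"


def plan_next_action(memory: WorkingMemory) -> tuple[str, str]:
    """Planner: decide the next tool and query from the current state."""
    prompt = f"Question: {memory.question}\nSo far: {memory.records}\nNext tool and query?"
    _ = call_llm(prompt)
    # A real system would parse the LLM output into (tool, query);
    # we hard-code a single step for illustration.
    return "image_search", "identify the airline logo on the plane"


def reason_over_output(memory: WorkingMemory, output: str) -> str | None:
    """Reasoner: extract useful information and decide whether we can answer."""
    response = call_llm(f"Given: {output}\nCan you answer '{memory.question}'?")
    if response.startswith("ANSWER:"):
        return response.removeprefix("ANSWER:").strip()
    return None


def avis_loop(question: str, max_steps: int = 5) -> str:
    memory = WorkingMemory(question)
    for _ in range(max_steps):
        tool, query = plan_next_action(memory)
        output = f"[{tool} result for '{query}']"  # stand-in for a real tool call
        memory.add(tool, query, output)
        answer = reason_over_output(memory, output)
        if answer is not None:  # the reasoner decided there is enough information
            return answer
    return "No answer found within the step budget."


if __name__ == "__main__":
    print(avis_loop("When was the airline operating this plane founded?"))
```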

Image: Google

Three types of tools are also integrated:

  • Computer vision tools for extracting visual information from images
  • A web search tool for retrieving open-world knowledge and facts
  • An image search tool to read relevant information from metadata associated with visually similar images

To find out how best to use these tools, the researchers conducted a user study that captured how humans make decisions when solving such tasks with the same tools. The study revealed common sequences of actions, which were used to construct a transition graph that guides AVIS's behavior.
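Conceptually, such a transition graph restricts which tool the planner may call from a given state, so the language model never has to choose from the full action space. A rough illustration in Python, with invented states and edges rather than the ones actually derived from Google's user study:

```python
# Hypothetical transition graph constraining the planner's next tool choice.
# States and edges are invented for illustration only.

TRANSITIONS = {
    "start":            ["object_detection", "image_search"],
    "object_detection": ["image_search", "web_search"],
    "image_search":     ["web_search", "answer"],
    "web_search":       ["web_search", "answer"],
}


def allowed_next_tools(current_state: str) -> list[str]:
    """Return the tools the planner may pick from in the current state."""
    return TRANSITIONS.get(current_state, ["answer"])


if __name__ == "__main__":
    # After an image search, the planner may only run a web search or answer.
    print(allowed_next_tools("image_search"))  # ['web_search', 'answer']
```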

Image: Google

AVIS reaches state-of-the-art without fine-tuning

On the Infoseek dataset, AVIS achieved 50.7% accuracy, significantly outperforming fine-tuned visual language models such as OFA and PaLI. On the OK-VQA dataset, AVIS achieved 60.2% accuracy in a few-shot setting, outperforming most previous work and approaching the accuracy of fine-tuned models, Google said.


In the future, the team wants to explore the framework on other reasoning tasks and to see whether these capabilities can be achieved with lighter language models, since the PaLM model used is computationally intensive at 540 billion parameters.

Summary
  • Google's Autonomous Visual Information Seeking (AVIS) is a new method that integrates large language models with computer vision, web search, and image search tools to autonomously search for visual information and answer complex questions about images.
  • The AVIS system has three main components: a planner that uses the language model to decide the next action, a working memory to store information, and a reasoner that processes output and extracts useful information, working dynamically and flexibly for planning and reasoning. It also uses a variety of tools.
  • On the Infoseek and OK-VQA datasets, AVIS achieved high accuracy, outperforming fine-tuned visual language models and approaching state-of-the-art results without fine-tuning. The team plans to explore this framework for other reasoning tasks and to experiment with lighter language models, as the current model is computationally intensive.