ILLUME: Can AI models learn why giraffes are giraffes?

Multimodal AI models can caption images and answer questions about them - but their answers don't always make sense. Can they learn from humans?

Vision-Language Models (VLMs) combine transformer-based language models with computer vision to caption images, answer questions about them, or, conversely, judge how well a description matches an image. There are different architectures with different capabilities, e.g. OpenAI's CLIP, DeepMind's Flamingo, the recently released MiniGPT-4, or Aleph Alpha's MAGMA.

Most VLMs today are based on a large language model that has not yet been aligned with human intent on a given task through methods such as instruction tuning and reinforcement learning from human feedback. As a result, the rationales VLMs give for specific responses often do not match human ones. Now researchers from TU Darmstadt, Hessian.AI, the Center for Cognitive Science Darmstadt, Aleph Alpha, LAION, and the German Research Center for Artificial Intelligence demonstrate how VLMs can be aligned using human feedback.

ILLUME aims to "rationalize" VLMs

The team calls the method ILLUME (InteractiveLy RationaLizing Vision-LangUage ModEls), a fine-tuning scheme "to transfer reasoning capabilities from language models to vision-language models." The method is based on three steps: (1) The VLM generates several rationales for an answer to a question about an image, e.g., "Q: What type of animal is in the picture? - A: giraffe, seeing that...".

(2) Human annotators select the fitting rationales from the generated options, e.g. "...it has a long neck". (3) The VLM is fine-tuned on all samples that have at least one fitting rationale.

The process is repeated until there is an appropriate rationale for all cases, or no further progress is made.
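
The three steps and the stopping rule can be sketched as a simple loop. This is a minimal Python sketch, not the authors' implementation: the function name `illume_loop` and the callables `generate`, `select`, and `fine_tune` are hypothetical placeholders for VLM sampling, human annotation, and fine-tuning, and the toy stand-ins at the bottom exist only so the example runs.

```python
from typing import Callable, Dict, List


def illume_loop(
    samples: List[str],
    generate: Callable[[str], List[str]],
    select: Callable[[str, List[str]], List[str]],
    fine_tune: Callable[[Dict[str, List[str]]], None],
    max_rounds: int = 10,
) -> Dict[str, List[str]]:
    """Repeat sample -> select -> fine-tune until all samples are covered
    or coverage stops improving (all callables are assumed placeholders)."""
    best_coverage = -1
    accepted: Dict[str, List[str]] = {}
    for _ in range(max_rounds):
        # Steps 1 and 2: sample rationales, keep only those the annotator approves.
        accepted = {}
        for sample in samples:
            chosen = select(sample, generate(sample))
            if chosen:
                accepted[sample] = chosen
        # Stop when every sample is covered or no further progress is made.
        if len(accepted) == len(samples) or len(accepted) <= best_coverage:
            break
        best_coverage = len(accepted)
        # Step 3: fine-tune only on samples with at least one approved rationale.
        fine_tune(accepted)
    return accepted


# Toy stand-ins (assumptions): a real setup would call a VLM and human annotators.
GOOD = {"giraffe": "it has a long neck", "zebra": "it has black and white stripes"}
learned = set()


def toy_generate(sample: str) -> List[str]:
    candidates = ["it is an animal"]
    # Pretend "zebra" only becomes solvable after one round of fine-tuning.
    if sample == "giraffe" or learned:
        candidates.append(GOOD[sample])
    return candidates


def toy_select(sample: str, candidates: List[str]) -> List[str]:
    return [c for c in candidates if c == GOOD[sample]]


def toy_fine_tune(accepted: Dict[str, List[str]]) -> None:
    learned.update(accepted)


result = illume_loop(["giraffe", "zebra"], toy_generate, toy_select, toy_fine_tune)
```

In this toy run, the first round yields an approved rationale only for the giraffe sample; after one fine-tuning step the second round covers both samples and the loop stops. If no round improves coverage, the loop ends early instead of fine-tuning again.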

According to the team, human feedback could theoretically be replaced by a reward model, as in the case of ChatGPT, but "this could require prior expensive human labor and is inherently limited."

ILLUME significantly reduces required training data

The process improves the model's performance based solely on examples generated by the model and selected through human feedback. It interactively aligns the model to human preferences while "gradually carving out rationalization capabilities." Empirical evaluation by the team shows that ILLUME uncovers and reinforces latent capabilities of the language model, resulting in better overall reasoning.

A question about an image, a human rationale, and various AI model-generated explanations. ILLUME provides an answer that is closest to ground truth. | Image: Brack, Schramowski et al.

A major advantage of the method: the team showed that a MAGMA VLM trained with ILLUME can approach the performance of models trained with up to five times more ground-truth fine-tuning data.

For all its success, however, the method cannot extract capabilities from the underlying language models that they did not already possess. "Current LMs appear incapable of inferring logical reasoning from a few training examples. Therefore, VLMs bootstrapped from LMs struggle to transfer logical reasoning capabilities without major extensions. Instead, we argue that the approach of training and evaluating logical reasoning as a pure text generation task may be inherently flawed."

All information and code for ILLUME are available on GitHub.
