Meta's new MILS system teaches LLMs to handle multimedia data without specialized training
Key Points
- Researchers from Meta AI, the University of Texas at Austin, and UC Berkeley have developed MILS, a system that enables large language models to understand and generate images, video, and audio without requiring specialized training.
- MILS employs a language model as a "generator" to propose solutions and a pre-trained multimodal model as a "scorer" to assess the quality of the suggestions. Through iterative feedback between the generator and scorer, the system progressively refines the solutions.
- The researchers showcase MILS' adaptability across various tasks and data types, including image description, image generation, style transfer, and processing of video and audio content.
Meta AI researchers and their academic partners have developed a system that teaches large language models to handle images, video, and audio without specialized training.
Called MILS (Multimodal Iterative LLM Solver), the system relies on the models' natural problem-solving abilities instead of extensive training on multimodal data.
MILS works by pairing two AI models: a "generator" that proposes solutions to tasks, and a "scorer" that evaluates how well those solutions work. The scorer's feedback helps the generator improve its answers step by step until it reaches a satisfactory result.
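As a rough illustration rather than code from the paper, the generator–scorer feedback loop can be sketched like this; `generate_candidates` and `score` are placeholder functions standing in for an LLM call and a pre-trained multimodal scorer:

```python
from typing import Callable

def mils_style_loop(
    task_prompt: str,
    generate_candidates: Callable[[str, list[tuple[str, float]]], list[str]],
    score: Callable[[str], float],
    steps: int = 10,
    keep_top_k: int = 5,
) -> str:
    """Iteratively refine text candidates using generator proposals and scorer feedback."""
    scored: list[tuple[str, float]] = []
    for _ in range(steps):
        # The generator (an LLM) proposes new candidates, conditioned on the
        # best-scoring candidates from earlier rounds.
        candidates = generate_candidates(task_prompt, scored[:keep_top_k])
        # The scorer (a pre-trained multimodal model) rates each candidate.
        scored += [(c, score(c)) for c in candidates]
        # Keep the highest-scoring candidates as feedback for the next round.
        scored = sorted(scored, key=lambda x: x[1], reverse=True)
    return scored[0][0]  # best solution found
```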

The system shows particular strength in describing images. Using Llama-3.1-8B as its generator and CLIP as its scorer, MILS creates detailed image descriptions that match or exceed current leading methods - even though CLIP wasn't specifically trained for this task.
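The paper's exact scorer setup may differ in its details, but the role CLIP plays can be illustrated with a generic snippet that ranks candidate captions by image–text similarity; the checkpoint name, image file, and captions below are only examples:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustration only: rank candidate captions for one image by CLIP
# image-text similarity - the kind of signal a scorer can feed back
# to the generator.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")
candidates = [
    "a dog running on the beach at sunset",
    "a cat sleeping on a sofa",
    "a group of people hiking in the mountains",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_captions)

scores = logits_per_image.squeeze(0).tolist()
best = max(zip(candidates, scores), key=lambda x: x[1])
print(best)  # the caption CLIP considers the closest match to the image
```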
MILS also enhances text-to-image generation by iteratively refining text prompts, and can handle image editing tasks like style transfer by combining AI-generated prompts with image processing tools.
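A hedged sketch of how the same loop can be rewired for text-to-image: the scorer generates an image from each candidate prompt and measures how well it matches the user's original request. `generate_image` and `clip_similarity` are hypothetical stand-ins for a text-to-image pipeline and a CLIP-style alignment score:

```python
def make_t2i_scorer(user_prompt: str, generate_image, clip_similarity):
    """Build a scorer that rewards prompts whose generated image matches the user's intent."""
    def score(candidate_prompt: str) -> float:
        image = generate_image(candidate_prompt)        # hypothetical diffusion call
        return clip_similarity(image, user_prompt)      # hypothetical alignment score
    return score
```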

The system's capabilities extend to video and audio as well. In tests using the MSR-VTT video dataset, MILS performed better than existing models at describing video content.
Because MILS works purely at inference time and never updates model weights, it can also act as a general converter that turns different types of data into readable text. This opens up new applications, such as combining information from multiple sources - for example an image and an audio clip - by converting everything to text, merging the descriptions, and then generating output in the desired format.
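One way to picture this cross-modal combination, with purely hypothetical helper functions rather than the authors' actual code:

```python
def combine_image_and_audio(image_path: str, audio_path: str,
                            caption_image, caption_audio,
                            merge_with_llm, text_to_image):
    """Invert each modality to text, merge the texts, re-generate in the target modality."""
    image_text = caption_image(image_path)   # e.g. a MILS-style image caption
    audio_text = caption_audio(audio_path)   # e.g. a MILS-style audio caption
    merged = merge_with_llm(
        "Combine these two descriptions into one scene: "
        f"IMAGE: {image_text} AUDIO: {audio_text}"
    )
    return text_to_image(merged)             # output in the desired format
```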
Tests indicate that larger generator and scorer models produce more accurate results, and that generating more candidate solutions per round further improves performance.
AI assistants learn to see and hear
The AI landscape is quickly shifting toward models that can handle multiple types of input, so-called multimodality. While OpenAI's GPT-4o led the way, open-source alternatives are catching up: Meta's Llama 3.2, Mistral's Pixtral, and DeepSeek's Janus Pro can all process images alongside text - a key feature for AI systems that aim to be truly helpful in everyday situations.
MILS takes a different approach to multimodality by shifting the work from training to inference: instead of training a new multimodal model, it relies on an already pre-trained scorer and test-time optimization. This strategy fits with the field's current direction of enhancing language models through smarter inference methods instead of just adding more training data. Looking ahead, the research team sees potential for MILS to tackle 3D data processing as well.