
A team of researchers from various Chinese institutions has created Infinity-MM, one of the largest publicly available datasets for multimodal AI models, and trained a new top-performing model on it.


The Infinity-MM dataset consists of four main categories: 10 million image descriptions, 24.4 million general visual instruction samples, 6 million selected high-quality instruction samples, and 3 million samples generated by GPT-4 and other AI models.

The team used existing open-source AI models to create the data: the RAM++ model first analyses each image and extracts key information, which is then used to generate matching questions and answers. A classification system with six main categories ensures the quality and diversity of the generated data (a rough sketch of this pipeline follows below the figure).

Flow chart: the synthetic data generation process, from seed data through image tagging and instruction classification to answer generation. The method uses a multi-stage pipeline built on the RAM++ and MiniCPM-V models; combining image tagging, instruction classification, and response generation yields precise training data for AI systems. | Image: Gu et al.
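The paper's actual implementation is not reproduced here, but the described flow can be sketched roughly as follows. All helper functions, category names, and example values are illustrative placeholders standing in for RAM++ tagging, the instruction classifier, and a VLM such as MiniCPM-V; they are not the authors' code.

```python
# Rough, illustrative sketch of the described data-generation flow.
# Every helper below is a placeholder, not the authors' implementation.

from dataclasses import dataclass

# Six main instruction categories (the names here are only examples).
INSTRUCTION_TYPES = [
    "description", "reasoning", "counting",
    "OCR", "comparison", "spatial",
]

@dataclass
class TrainingSample:
    image_path: str
    instruction: str
    answer: str
    category: str

def tag_image(image_path: str) -> list[str]:
    """Placeholder for RAM++ image tagging (extracting objects and attributes)."""
    return ["dog", "frisbee", "park"]

def pick_instruction_type(tags: list[str]) -> str:
    """Placeholder for the classifier that assigns one of the six categories."""
    return INSTRUCTION_TYPES[0]

def generate_qa(image_path: str, tags: list[str], category: str) -> tuple[str, str]:
    """Placeholder for question/answer generation with a VLM such as MiniCPM-V."""
    question = f"What is happening with the {tags[0]} in this image?"
    answer = "A dog is catching a frisbee in a park."
    return question, answer

def synthesize(image_paths: list[str]) -> list[TrainingSample]:
    samples = []
    for path in image_paths:
        tags = tag_image(path)                                  # step 1: image tagging
        category = pick_instruction_type(tags)                  # step 2: instruction classification
        question, answer = generate_qa(path, tags, category)    # step 3: Q&A generation
        samples.append(TrainingSample(path, question, answer, category))
    return samples

if __name__ == "__main__":
    for sample in synthesize(["example.jpg"]):
        print(sample)
```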

Four-stage training for better performance

The trained model, Aquila-VL-2B, is based on the LLaVA-OneVision architecture and uses Qwen-2.5 as a language model and SigLIP for image processing. Training was performed in four successive phases of increasing complexity.
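In a LLaVA-OneVision-style model, a vision encoder (here SigLIP) turns the image into visual tokens that a small projector maps into the language model's embedding space (here Qwen-2.5). A minimal sketch of that wiring is shown below; the dimensions and stub submodules are assumptions for illustration, not Aquila-VL-2B's actual configuration.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Minimal sketch of LLaVA-style wiring: vision encoder -> projector -> LLM.
    Dimensions and submodules are placeholders, not Aquila-VL-2B's real config."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1536):
        super().__init__()
        self.vision_encoder = nn.Identity()      # stands in for SigLIP
        self.projector = nn.Sequential(          # maps image features into the LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = nn.Identity()                 # stands in for Qwen-2.5

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(image_features))
        # Visual tokens are prepended to the text embeddings and processed by the LLM.
        return self.llm(torch.cat([visual_tokens, text_embeddings], dim=1))

model = LlavaStyleVLM()
out = model(torch.randn(1, 729, 1152), torch.randn(1, 32, 1536))
print(out.shape)  # torch.Size([1, 761, 1536])
```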


In the first phase, the model learned basic image-text associations. Subsequent phases included general visual tasks, specific instructions and finally the integration of synthetically generated data. The maximum image resolution was also gradually increased.
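The article does not spell out the exact stage configuration, but the described idea of successive phases with broader data and growing image resolution can be expressed as a simple schedule. The stage names, data mixes, and resolutions below are illustrative assumptions, not the paper's actual values.

```python
# Illustrative four-stage schedule: each stage adds harder data and higher resolution.
# All concrete values are placeholders, not the paper's actual configuration.

stages = [
    {"name": "stage1_alignment", "data": ["image_captions"],                   "max_resolution": 384},
    {"name": "stage2_general",   "data": ["captions", "general_instructions"], "max_resolution": 512},
    {"name": "stage3_specific",  "data": ["general_instructions", "task_data"],"max_resolution": 768},
    {"name": "stage4_synthetic", "data": ["task_data", "synthetic_data"],      "max_resolution": 1024},
]

def train_stage(model, stage: dict) -> None:
    """Placeholder: in practice this would run a full training pass for the stage."""
    print(f"training {stage['name']} on {stage['data']} at up to {stage['max_resolution']}px")

model = object()  # placeholder for the actual vision-language model
for stage in stages:
    train_stage(model, stage)
```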

New standards in benchmark tests

In extensive testing, Aquila-VL-2B achieved top scores despite its comparatively small size of only two billion parameters. In the MMStar benchmark for multimodal understanding, it scored 54.9%, the best result for a model of this size.

According to the researchers, performance on mathematical tasks is particularly impressive: the model scored 59% on the MathVista test, significantly outperforming comparable systems. It also performed very well in general image understanding benchmarks such as HallusionBench (43%) and MMBench (75.2%).

The researchers were also able to demonstrate that the integration of synthetically generated data significantly improved performance. Tests without this additional data resulted in an average performance drop of 2.4%.

Line chart: performance comparison of three AI models as the amount of training data grows. Aquila-VL-2B overtakes the constant reference values of the InternVL2-2B and Qwen2VL-2B reference models from stage 3 onwards; with increasing data volume, its performance rises significantly, especially in stage 4.

The team is making both the dataset and the model available to the research community. The model was trained on Nvidia A100 GPUs as well as Chinese chips.


Vision Language Models on the rise

The development of Aquila-VL-2B fits into a broader trend in AI research. While closed commercial systems such as GPT-4o have often shown better performance, open-source models are catching up. The use of synthetic training data is proving particularly promising.

For example, the open-source model LLaVA-1.5-7B was able to outperform even GPT-4V on certain tasks after training on over 62,000 synthetically generated examples. Meta also relies heavily on synthetic data for its Llama models.

However, current tests also reveal the limitations of today's VLMs. Image understanding is still inadequate in many areas, especially when it comes to extracting specific visual information from large amounts of data, and the limited resolution of the visual encoders remains another technical constraint.

Summary
  • A Chinese research team has compiled the multimodal dataset Infinity-MM with 40 million image-text pairs, including image descriptions, visual instruction data, and synthetically generated data from AI models such as GPT-4o.
  • The researchers used this data set to train the Aquila-VL-2B model, which, despite its comparatively small size of 2 billion parameters, achieved top performance in various benchmarks, for example in mathematical tasks or general image comprehension.
  • The four-stage training with increasing complexity and the integration of synthetically generated data proved decisive for performance. Open-source models are thus becoming competitive in vision-language tasks as well.