
A team of researchers from various Chinese institutions has created Infinity-MM, one of the largest publicly available datasets for multimodal AI models, and trained a new top-performing model on it.


The Infinity-MM dataset consists of four main categories: 10 million image descriptions, 24.4 million general visual instruction samples, 6 million selected high-quality instruction samples, and 3 million samples generated by GPT-4 and other AI models.

The team used existing open-source AI models to create the data: the RAM++ model first analyses each image and extracts key information, which is then used to generate matching questions and answers. A classification system with six main categories ensures the quality and diversity of the generated data (a rough sketch of this pipeline follows below the figure).

Flow chart: the synthetic data generation process, from seed data through image tagging and instruction classification to answer generation. The method uses a multi-stage pipeline built on the RAM++ and MiniCPM-V models; combining image tagging, instruction classification, and response generation yields precise training data for AI systems. | Image: Gu et al.
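The paper's actual implementation is not reproduced here, but the described flow can be sketched roughly as follows. All helper functions, category names, and example values are illustrative placeholders standing in for RAM++ tagging, the instruction classifier, and a VLM such as MiniCPM-V; they are not the authors' code.

```python
# Rough, illustrative sketch of the described data-generation flow.
# Every helper below is a placeholder, not the authors' implementation.

from dataclasses import dataclass

# Six main instruction categories (the names here are only examples).
INSTRUCTION_TYPES = [
    "description", "reasoning", "counting",
    "OCR", "comparison", "spatial",
]

@dataclass
class TrainingSample:
    image_path: str
    instruction: str
    answer: str
    category: str

def tag_image(image_path: str) -> list[str]:
    """Placeholder for RAM++ image tagging (extracting objects and attributes)."""
    return ["dog", "frisbee", "park"]

def pick_instruction_type(tags: list[str]) -> str:
    """Placeholder for the classifier that assigns one of the six categories."""
    return INSTRUCTION_TYPES[0]

def generate_qa(image_path: str, tags: list[str], category: str) -> tuple[str, str]:
    """Placeholder for question/answer generation with a VLM such as MiniCPM-V."""
    question = f"What is happening with the {tags[0]} in this image?"
    answer = "A dog is catching a frisbee in a park."
    return question, answer

def synthesize(image_paths: list[str]) -> list[TrainingSample]:
    samples = []
    for path in image_paths:
        tags = tag_image(path)                                  # step 1: image tagging
        category = pick_instruction_type(tags)                  # step 2: instruction classification
        question, answer = generate_qa(path, tags, category)    # step 3: Q&A generation
        samples.append(TrainingSample(path, question, answer, category))
    return samples

if __name__ == "__main__":
    for sample in synthesize(["example.jpg"]):
        print(sample)
```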

Four-stage training for better performance

The trained model, Aquila-VL-2B, is based on the LLaVA-OneVision architecture and uses Qwen-2.5 as a language model and SigLIP for image processing. Training was performed in four successive phases of increasing complexity.
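In a LLaVA-OneVision-style model, a vision encoder (here SigLIP) turns the image into visual tokens that a small projector maps into the language model's embedding space (here Qwen-2.5). A minimal sketch of that wiring is shown below; the dimensions and stub submodules are assumptions for illustration, not Aquila-VL-2B's actual configuration.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Minimal sketch of LLaVA-style wiring: vision encoder -> projector -> LLM.
    Dimensions and submodules are placeholders, not Aquila-VL-2B's real config."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1536):
        super().__init__()
        self.vision_encoder = nn.Identity()      # stands in for SigLIP
        self.projector = nn.Sequential(          # maps image features into the LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = nn.Identity()                 # stands in for Qwen-2.5

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(image_features))
        # Visual tokens are prepended to the text embeddings and processed by the LLM.
        return self.llm(torch.cat([visual_tokens, text_embeddings], dim=1))

model = LlavaStyleVLM()
out = model(torch.randn(1, 729, 1152), torch.randn(1, 32, 1536))
print(out.shape)  # torch.Size([1, 761, 1536])
```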


In the first phase, the model learned basic image-text associations. Subsequent phases included general visual tasks, specific instructions and finally the integration of synthetically generated data. The maximum image resolution was also gradually increased.
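The article does not spell out the exact stage configuration, but the described idea of successive phases with broader data and growing image resolution can be expressed as a simple schedule. The stage names, data mixes, and resolutions below are illustrative assumptions, not the paper's actual values.

```python
# Illustrative four-stage schedule: each stage adds harder data and higher resolution.
# All concrete values are placeholders, not the paper's actual configuration.

stages = [
    {"name": "stage1_alignment", "data": ["image_captions"],                   "max_resolution": 384},
    {"name": "stage2_general",   "data": ["captions", "general_instructions"], "max_resolution": 512},
    {"name": "stage3_specific",  "data": ["general_instructions", "task_data"],"max_resolution": 768},
    {"name": "stage4_synthetic", "data": ["task_data", "synthetic_data"],      "max_resolution": 1024},
]

def train_stage(model, stage: dict) -> None:
    """Placeholder: in practice this would run a full training pass for the stage."""
    print(f"training {stage['name']} on {stage['data']} at up to {stage['max_resolution']}px")

model = object()  # placeholder for the actual vision-language model
for stage in stages:
    train_stage(model, stage)
```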

New standards in benchmark tests

In extensive testing, Aquila-VL-2B achieved top scores despite its comparatively small size of only two billion parameters. In the MMStar benchmark for multimodal understanding, it scored 54.9%, the best result for a model of this size.

According to the researchers, performance on mathematical tasks is particularly impressive: the model scored 59% on the MathVista test, significantly outperforming comparable systems. It also performed very well in general image understanding benchmarks such as HallusionBench (43%) and MMBench (75.2%).

The researchers were also able to demonstrate that the integration of synthetically generated data significantly improved performance. Tests without this additional data resulted in an average performance drop of 2.4%.

Line chart: performance comparison of three AI models as the amount of training data grows. Aquila-VL-2B overtakes the constant reference values of the InternVL2-2B and Qwen2VL-2B reference models from stage 3 onwards; with increasing data volume, its performance rises significantly, especially in stage 4.

The team is making both the dataset and the model available to the research community. The model was trained on Nvidia A100 GPUs as well as Chinese chips.


Vision Language Models on the rise

The development of Aquila-VL-2B fits into a broader trend in AI research. While closed commercial systems such as GPT-4o have often shown better performance, open-source models are catching up. The use of synthetic training data is proving particularly promising.

For example, the open-source model LLaVA-1.5-7B was able to outperform even GPT-4V on certain tasks after training on over 62,000 synthetically generated examples. Meta also relies heavily on synthetic data for its Llama models.

However, current tests also reveal the limitations of today's VLMs. Image understanding is still inadequate in many areas, especially when it comes to extracting specific visual information from large amounts of data, and the limited resolution of the visual encoders remains another technical constraint.

Summary
  • A Chinese research team has compiled the multimodal dataset Infinity-MM with 40 million image-text pairs, including image descriptions, visual instruction data, and synthetically generated data from AI models such as GPT-4o.
  • The researchers used this data set to train the Aquila-VL-2B model, which, despite its comparatively small size of 2 billion parameters, achieved top performance in various benchmarks, for example in mathematical tasks or general image comprehension.
  • The four-stage training with increasing complexity and the integration of synthetically generated data proved decisive for performance. Open-source models are thus becoming competitive in vision-language tasks as well.