
At a conference in San Francisco, French AI company Mistral introduced Pixtral-12B, its first multimodal model, which can process both images and text. The open-source model builds on NeMo-12B, the text model Mistral launched earlier this year.

With 12 billion parameters and a download size of about 24 GB, Pixtral-12B can answer questions about multiple images, supplied either as URLs or as Base64-encoded data. Its vision encoder supports image resolutions of up to 1,024 x 1,024 pixels.
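
To illustrate what such a multi-image request could look like, here is a minimal sketch against a generic OpenAI-style chat completions endpoint. The endpoint URL, model identifier, and message schema are assumptions for illustration, not Mistral's documented API.

```python
import base64
import requests

# Hypothetical OpenAI-style endpoint and model id -- assumptions for
# illustration, not Mistral's documented API.
API_URL = "https://api.example.com/v1/chat/completions"

def to_data_url(path: str) -> str:
    """Base64-encode a local image file into a data URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"

payload = {
    "model": "pixtral-12b",  # hypothetical identifier
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do these two images have in common?"},
            # One image referenced by URL ...
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            # ... and one passed as Base64-encoded data, as described above.
            {"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}},
        ],
    }],
}

response = requests.post(API_URL, json=payload,
                         headers={"Authorization": "Bearer <your-key>"})
print(response.json())
```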

Benchmark performance

Mistral claims Pixtral-12B outperforms other open-source vision models like Phi 3, Qwen2 VL, and LLaVA in some multimodal capabilities. However, larger closed multimodal models such as Claude 3.5 Sonnet or GPT-4o still show superior image comprehension.

Image: swyx/X

In certain tests, Pixtral-12B shows significant improvements over comparable models. On key text benchmarks such as MMLU and the HumanEval coding benchmark, it trails slightly behind Anthropic's smallest model, Claude 3 Haiku. Smaller vision-language models like Qwen2-VL 7B and Phi-3.5 Vision also surpass Pixtral-12B on some vision benchmarks.

Image: swyx/X

Mistral reports that Pixtral-12B can perform optical character recognition (OCR), analyze charts and diagrams, and process screenshots. The model also shows promise with satellite imagery. It has not yet been tested on video frames, but its 128,000-token context window suggests potential in this area.

Image: swyx/X
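
A rough back-of-the-envelope calculation shows why the context window matters for video: if a full-resolution 1,024 x 1,024 frame consumed about 4,096 tokens (assuming 16 x 16-pixel patches; Mistral has not published per-image token costs), roughly 30 frames would fit into the window before any text tokens.

```python
# Back-of-the-envelope only: Mistral has not published per-image token
# costs, so the 16 x 16-pixel patch size is an assumption.
patch_edge = 16
tokens_per_frame = (1024 // patch_edge) ** 2  # 64 x 64 = 4,096 tokens
context_window = 128_000
print(context_window // tokens_per_frame)     # ~31 frames, before any text
```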

Availability and licensing

Mistral has released Pixtral-12B under an Apache 2.0 license for free use. The model is currently available on GitHub and Hugging Face. Sophia Yang, Head of Developer Relations at Mistral, announced that Pixtral-12B will soon be available for testing on Mistral's platforms Le Chat and La Plateforme.
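
Fetching the open weights is straightforward; below is a minimal sketch using the huggingface_hub client, assuming the repository id mistralai/Pixtral-12B-2409 (taken from Mistral's Hugging Face organization; the exact id may differ).

```python
from huggingface_hub import snapshot_download

# Downloads the full ~24 GB checkpoint. The repo id is assumed from
# Mistral's Hugging Face organization and may differ.
local_dir = snapshot_download(repo_id="mistralai/Pixtral-12B-2409")
print("Weights stored in:", local_dir)
```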

The release follows Mistral's recent $600 million funding round, which valued the company at $6 billion. Microsoft holds a minority stake in Mistral, and the company has also formed a partnership with Amazon Web Services (AWS).

As with previous releases, Mistral made Pixtral-12B publicly available right away. However, the company has not provided information about the training data, and no independent performance evaluations beyond its own benchmarks exist yet.

Summary
  • French AI startup Mistral has unveiled its first multimodal model, Pixtral-12B, which can process both images and text. With 12 billion parameters, it is based on Mistral's NeMo-12B text model.
  • In benchmarks, Pixtral-12B partially outperforms other open-source vision models such as Phi 3, Qwen2 VL, and LLaVA, but lags behind closed, larger models such as Claude 3.5 Sonnet or GPT-4o. Among other things, it is capable of OCR, diagram analysis and screenshot processing.
  • Mistral has released Pixtral-12B under an Apache 2.0 license and plans to make it available for testing soon on its own platforms Le Chat and La Plateforme. Details on the training data are not known, and the model's performance still has to be proven on real-world tasks beyond benchmarks.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.