Mistral unveils Voxtral, an open-source speech model with lower costs than proprietary rivals

Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.
French AI company Mistral unveils Voxtral, an open-source speech understanding model that aims to replace proprietary solutions at less than half the cost.

The Voxtral models come in two versions: a 24B variant for production applications and a compact 3B model for local and edge deployments. Both support a 32,000-token context window, which Mistral says can handle audio files up to 30 minutes for transcription or 40 minutes for comprehension tasks.
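
As a rough illustration of what those limits imply, the sketch below divides the 32,000-token window by the two stated audio lengths. It assumes the entire window is spent on audio, which is a simplification: the article does not say how many tokens Voxtral uses per second of audio or how much of the window is reserved for prompts and output. The function name is hypothetical.

```python
CONTEXT_TOKENS = 32_000  # context window reported in the article


def tokens_per_second_budget(max_minutes: float,
                             context_tokens: int = CONTEXT_TOKENS) -> float:
    """Upper bound on tokens available per second of audio.

    Assumption: the whole context window is used for audio. The real
    per-second token rate is not given in the article.
    """
    if max_minutes <= 0:
        raise ValueError("max_minutes must be positive")
    return context_tokens / (max_minutes * 60)


transcription_budget = tokens_per_second_budget(30)  # 30-minute limit
comprehension_budget = tokens_per_second_budget(40)  # 40-minute limit

print(f"Transcription: at most {transcription_budget:.2f} tokens per second")
print(f"Comprehension: at most {comprehension_budget:.2f} tokens per second")
```

These figures are only ceilings derived from the article's numbers, not a description of how the model actually tokenizes audio.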

Unlike basic transcription tools, Voxtral builds in Q&A and summarization features without requiring separate speech recognition and language models. It also lets users trigger backend functions directly through voice commands by automatically translating spoken requests into API calls.
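
To make the voice-to-API idea concrete, here is a minimal sketch of the application side of such a flow. It assumes the model has already turned a spoken request into a JSON object with a function name and arguments; that JSON shape, the `set_thermostat` function, and the `TOOLS` registry are all hypothetical and are not taken from Mistral's API.

```python
import json
from typing import Callable, Dict


def set_thermostat(room: str, celsius: float) -> str:
    """Hypothetical backend function that a voice command might trigger."""
    return f"Thermostat in {room} set to {celsius:.1f} C"


# Registry of backend functions the application chooses to expose.
TOOLS: Dict[str, Callable[..., str]] = {"set_thermostat": set_thermostat}


def dispatch_tool_call(raw_call: str) -> str:
    """Parse a JSON tool call and run the matching backend function.

    Assumed input shape (illustrative only):
    {"name": "<function name>", "arguments": {...}}
    """
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for what a model might emit after hearing
# "Set the kitchen to 21 degrees."
model_output = (
    '{"name": "set_thermostat", '
    '"arguments": {"room": "kitchen", "celsius": 21}}'
)
print(dispatch_tool_call(model_output))
```

Restricting calls to an explicit registry means a misheard request cannot invoke arbitrary code, only functions the application has listed.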

Scatterplot: price (USD/min) vs. word error rate on the FLEURS dataset, showing Voxtral Small with the best balance of cost and error rate.
Voxtral Small has a significantly lower error rate and also undercuts Whisper large on price. | Image: Mistral

The models support automatic speech recognition in English, Spanish, French, Portuguese, Hindi, German, Dutch and Italian while retaining the text comprehension capabilities of Mistral Small 3.1's language model backbone.

Benchmark performance exceeds competition

Mistral's tests show Voxtral Small outperforming the leading open-source model, Whisper large-v3, along with GPT-4o mini Transcribe and Gemini 2.5 Flash across all tested tasks. On English short-form tasks and Mozilla's Common Voice benchmark, it reportedly also beats ElevenLabs Scribe, currently one of the strongest performers.

Bar chart: Voxtral Mini/Small, GPT-4o mini Audio and Gemini 2.5 Flash on speech benchmarks and FLEURS BLEU.
According to Mistral's benchmarks, Voxtral can keep up with much larger models such as GPT-4o mini and Gemini 2.5 Flash. | Image: Mistral

On the FLEURS multilingual speech recognition benchmark, Voxtral Small reportedly surpasses Whisper in all nine tested languages. For audio comprehension tasks, it performs comparably to GPT-4o mini and Gemini 2.5 Flash while delivering state-of-the-art results in speech translation.

Pricing undercuts proprietary alternatives

Mistral positions Voxtral as a budget-friendly option, with API pricing starting at $0.001 per minute. The company claims Voxtral Mini Transcribe outperforms OpenAI's Whisper at less than half the cost for price-sensitive applications, while Voxtral Small matches ElevenLabs Scribe's performance at similar savings.
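
For a sense of scale, the sketch below estimates transcription cost at the quoted entry price of $0.001 per minute. It assumes simple linear per-minute billing with no minimums, rounding, or volume discounts, none of which the article specifies. Since the price is described as a starting rate, other models or tiers may cost more. The function name is hypothetical.

```python
from decimal import Decimal

# Entry-level API price quoted in the article (USD per audio minute).
PRICE_PER_MINUTE_USD = Decimal("0.001")


def estimate_cost_usd(audio_minutes: float,
                      price_per_minute: Decimal = PRICE_PER_MINUTE_USD) -> Decimal:
    """Estimate cost assuming linear per-minute billing (an assumption)."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Go through str() so float inputs do not carry binary rounding noise.
    return Decimal(str(audio_minutes)) * price_per_minute


one_hour = estimate_cost_usd(60)
archive = estimate_cost_usd(10_000 * 60)  # 10,000 hours of audio

print(f"1 hour of audio: ${one_hour}")
print(f"10,000 hours of audio: ${archive}")
```

Using `Decimal` rather than floats keeps small per-minute prices exact when they are multiplied across large archives.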

Enterprise features include private deployment options for regulated industries and domain-specific fine-tuning. Upcoming updates will add speaker segmentation, audio markups for age and emotion detection, and word-level timestamps.

Coming to Le Chat's Voice Mode

Both Voxtral versions are available for download on Hugging Face under the Apache 2.0 license, and Mistral also offers API access. The models will power Voice Mode in Le Chat, which will roll out to all users in the coming weeks.

  • French AI company Mistral has released Voxtral, an open-source speech understanding model that supports both transcription and audio comprehension, featuring built-in Q&A and summarization without needing separate language models.
  • Available in two sizes (a 24B production model and a compact 3B version), Voxtral reportedly outperforms leading competitors like Whisper large-v3 and ElevenLabs Scribe in multiple benchmarks, while supporting eight languages and delivering state-of-the-art results in speech translation and comprehension.
  • Voxtral is priced at less than half the cost of proprietary alternatives, offers private deployment options, and will soon power the Voice Mode in Le Chat, with both versions downloadable under an Apache-2.0 license and accessible via API.
