Mistral's first open-weight TTS model Voxtral clones voices from three seconds of audio across nine languages
French AI startup Mistral has released Voxtral TTS, its first text-to-speech model. The model supports nine languages, including German, English, French, and Spanish, and is relatively compact at four billion parameters. Mistral says it produces realistic, emotionally expressive speech and can adapt to new voices from as little as three seconds of reference audio. Mistral puts latency at 70 milliseconds for a typical setup with a 10-second voice sample and 500 characters of input text.
In human comparison tests, Voxtral TTS scored higher on naturalness than ElevenLabs Flash v2.5 at a similar response time. ElevenLabs has since shipped a newer model, v3, however. Voxtral TTS is available through an API at $0.016 per 1,000 characters, can be tested in Mistral Studio, and is offered with open weights on Hugging Face.
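To put the API price in perspective, here is a rough cost sketch in Python. It assumes billing scales linearly with character count at the quoted $0.016 per 1,000 characters, with no minimum charge, rounding, or volume discount; the article does not spell out those billing details, and the function name `estimate_cost_usd` is purely illustrative.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_chars: int,
                      price_per_1k: Decimal = PRICE_PER_1K_CHARS_USD) -> Decimal:
    """Estimate the API cost for synthesizing `num_chars` characters.

    Assumption: strictly linear pricing, with no minimums or rounding.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return price_per_1k * Decimal(num_chars) / Decimal(1000)


if __name__ == "__main__":
    # Compare a short request with a long-form batch job.
    for chars in (500, 100_000, 1_000_000):
        print(f"{chars:>9,} characters -> ${estimate_cost_usd(chars)}")
```

Under these assumptions, the 500-character request used in the latency figure would cost less than a cent, while a million characters would come to $16.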