French AI startup Mistral has released Voxtral TTS, its first text-to-speech model. The model supports nine languages, including German, English, French, and Spanish, and is relatively compact at four billion parameters. Mistral says it produces realistic, emotionally expressive speech and can adapt to new voices from as little as three seconds of reference audio. According to Mistral, latency is 70 milliseconds in a typical setup with a 10-second speech sample and 500 characters of text.
In human comparison tests, Voxtral TTS scored higher on naturalness than ElevenLabs Flash v2.5 at a similar response time. That said, ElevenLabs has since shipped a newer model, v3. Voxtral TTS is available through an API at $0.016 per 1,000 characters, can be tested in Mistral Studio, and is also offered as an open-weights release on Hugging Face.
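To put the API price in perspective, the short sketch below converts a character count into an estimated cost. It uses only the $0.016 per 1,000 characters figure quoted above. The function name `estimate_cost_usd` is illustrative, and the calculation assumes strictly linear billing with no minimum charge, rounding rules, or volume discounts, none of which are stated here.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1000_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_chars: int) -> Decimal:
    """Estimate the API cost in USD for synthesizing `num_chars` characters.

    Assumption (not confirmed by the article): billing is strictly
    proportional to the character count, with no minimums or rounding.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_1000_CHARS_USD * num_chars / 1000


if __name__ == "__main__":
    # 500 characters matches the text length in the latency figure above;
    # the larger counts are arbitrary illustrations.
    for chars in (500, 100_000, 1_000_000):
        print(f"{chars:>9,} characters: about ${estimate_cost_usd(chars)}")
```

Under these assumptions, the 500-character input used in the latency figure would cost $0.008, and one million characters would cost $16.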
