Google ships its most expressive Gemini 3.1 text-to-speech model yet with 70+ language support

Google is rolling out a new text-to-speech model based on Gemini 3.1 Flash. The company says it produces the most natural and expressive voice output it has shipped to date. The headline feature is audio tags: simple text commands that let developers control the style, tempo, tone, and accent of the generated speech. The model supports more than 70 languages and can handle multi-speaker dialogs.

On the Artificial Analysis leaderboard, the model reaches an Elo rating of 1,211 and stands out for its quality-to-price ratio. It beats ElevenLabs v3 in overall quality and sits just behind Inworld 1.5 Max.

Gemini 3.1 Flash TTS ranks among the top text-to-speech models for both quality and value. | Image: Google

Gemini 3.1 Flash TTS has a free tier, but Google uses data submitted through it to improve its products. The paid tier costs $1.00 per million tokens for text input and $20.00 per million tokens for audio output. Batch mode cuts those prices in half, to $0.50 and $10.00 respectively. On the paid tier, Google doesn't use the data for product improvement.
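
To see what these rates mean for a given workload, here is a minimal Python sketch of the arithmetic. It uses only the per-million-token prices quoted above. The function name `estimate_cost` and the token counts in the example are hypothetical, chosen for illustration, and the sketch says nothing about how many audio tokens a given length of speech actually produces.

```python
from decimal import Decimal

# USD per one million tokens, as quoted in the article.
PRICES = {
    "standard": {"text_input": Decimal("1.00"), "audio_output": Decimal("20.00")},
    "batch": {"text_input": Decimal("0.50"), "audio_output": Decimal("10.00")},
}


def estimate_cost(text_input_tokens: int, audio_output_tokens: int,
                  mode: str = "standard") -> Decimal:
    """Return the estimated paid-tier cost in USD (hypothetical helper)."""
    rates = PRICES[mode]  # raises KeyError for an unknown mode
    total = (Decimal(text_input_tokens) * rates["text_input"]
             + Decimal(audio_output_tokens) * rates["audio_output"])
    return total / Decimal(1_000_000)


# Hypothetical workload: 200,000 text input tokens, 1,500,000 audio output tokens.
standard = estimate_cost(200_000, 1_500_000)
batch = estimate_cost(200_000, 1_500_000, mode="batch")
print(f"standard: ${standard:.2f}, batch: ${batch:.2f}")
```

For this made-up workload, the audio output accounts for almost all of the bill, since its rate is 20 times the text input rate in both modes.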

Gemini 3.1 Flash TTS is available as a preview through the Gemini API, Vertex AI for enterprise users, and Google Vids for Workspace users. Anyone can try it for free in Google's AI Studio. All generated audio is tagged with Google's SynthID watermark to flag AI-generated content.

Source: Google Blog