Google is rolling out its new text-to-speech model based on Gemini 3.1 Flash. The company says it's the most natural and expressive voice output it has shipped to date. The big new feature is audio tags—simple text commands that let developers control the style, tempo, tone, and accent of the generated speech. The model supports over 70 languages and can handle multi-speaker dialogs.
On the Artificial Analysis ranking list, the model hits an Elo rating of 1,211 and stands out for its quality-to-price ratio. It beats Elevenlabs v3 in overall quality and sits just behind Inworld 1.5 Max.

Gemini 3.1 Flash TTS has a free tier, but Google uses the data to improve its products. The paid tier runs $1.00 per million tokens for text input and $20.00 per million tokens for audio output. Batch mode cuts those prices in half to $0.50 and $10.00, respectively. On the paid tier, Google doesn't use the data for product improvement.
Gemini 3.1 Flash TTS is available as a preview through the Gemini API, Vertex AI for enterprise users, and Google Vids for Workspace users. Anyone can try it for free in Google's AI Studio. All generated audio is tagged with Google's SynthID watermark to flag AI-generated content.