
With audio tags, support for over 70 languages, and a dialogue function, Elevenlabs aims to position its new text-to-speech model, Eleven v3, as the most expressive AI voice on the market.


AI startup Elevenlabs has introduced Eleven v3 (Alpha), a text-to-speech model built to deliver more expressive synthetic voices. The new system is designed to handle reactions, interruptions, and a range of emotions more naturally than earlier versions.

Elevenlabs says the real obstacle for synthetic voices hasn't been sound quality, but a lack of expressiveness. To address this, they rebuilt the model from scratch to enable voices that can whisper, laugh, sigh, or react with surprise.

Users control expressive cues through audio tags embedded directly in the text, such as "[sighs]" or "[excited]." Multiple tags can be combined for a more nuanced delivery, for example "We did it! [happily][shouts] [laughs]."
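Because the tags ride along inside ordinary text, a script pipeline can inspect or strip them before sending lines to the model. A minimal sketch of how that might look; the helper below is mine and not part of any Elevenlabs SDK, only the bracketed tag syntax comes from the examples above:

```python
import re

# Audio tags like [sighs] or [excited] are plain bracketed words in the text.
TAG_PATTERN = re.compile(r"\[([a-z ]+)\]")

def extract_audio_tags(line: str) -> list[str]:
    """Return the audio tags embedded in a line of tagged script text."""
    return TAG_PATTERN.findall(line)

def strip_audio_tags(line: str) -> str:
    """Return the line with all audio tags removed (for plain transcripts)."""
    return TAG_PATTERN.sub("", line).strip()

line = "We did it! [happily][shouts] [laughs]"
print(extract_audio_tags(line))  # ['happily', 'shouts', 'laughs']
print(strip_audio_tags(line))    # 'We did it!'
```

This keeps the expressive markup and the spoken words separable, which is useful when the same script also feeds subtitles or logs.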


Elevenlabs sees v3 as an experimental tool for developers and media creators looking to push the limits of AI-generated speech. With support for over 70 languages, Eleven v3 is intended for professional applications such as film, audiobook production, and digital media.

Natural-sounding multispeaker dialogues via API

One of the standout features is support for multispeaker dialogues with realistic conversational flow. A new text-to-dialogue API lets users send structured JSON objects that define each speaker's turn. The model automatically manages speaker changes, emotional shifts, and even interruptions.
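The article only says the API accepts structured JSON defining each speaker's turn; the exact schema is not given. A hedged sketch of what such a payload could look like, where the field names ("inputs", "speaker", "text") are illustrative assumptions rather than the documented Elevenlabs format:

```python
import json

def build_dialogue_payload(turns: list[tuple[str, str]]) -> str:
    """Serialize (speaker, tagged text) turns into a JSON dialogue payload.

    The shape used here is a guess at a turn-based schema; consult the
    official Elevenlabs API reference for the real field names.
    """
    return json.dumps(
        {"inputs": [{"speaker": speaker, "text": text} for speaker, text in turns]},
        indent=2,
    )

payload = build_dialogue_payload([
    ("host", "[calmly] Welcome back to the show."),
    ("guest", "[excited] Great to be here! [laughs]"),
])
print(payload)
```

Ordering the turns explicitly like this is what lets the model manage speaker changes and interruptions itself, rather than the caller stitching together separate single-voice clips.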

Text processing has also been improved, according to Elevenlabs. The new model aims for better alignment of emphasis, cadence, and speech melody with the meaning of the text. Early demos show Eleven v3 handling a range of emotional situations—from enthusiastic sports commentary to quiet, somber monologues. Accent changes are supported and can be triggered by prompt.

A test with German shows some remaining weaknesses, though the expressiveness is noticeable in places. The English voices are much more convincing.

Recommendation

Eleven v3 is now available on their website, with a public API on the way. They're offering an 80 percent discount on app usage through June. For real-time applications, the company still recommends their v2.5 Turbo or Flash models, as v3 isn't optimized for those scenarios yet. They claim a real-time version is in development.

The company says that their "Professional" Voice Clones aren't performing optimally with v3 at the moment. For projects aiming to use the new expressive features, Elevenlabs suggests using their Instant Voice Clones or prebuilt voices instead.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Elevenlabs has launched Eleven v3 (Alpha), a new text-to-speech model that can convincingly express emotion, handle interruptions, and respond to different conversational contexts.
  • Users can control voice expression using audio tags directly in the text, such as for whispering, laughing, or sounding surprised. The new API also enables multi-speaker dialogues with dynamic interactions and automatic speaker changes.
  • The model supports more than 70 languages and is designed mainly for professional use in film, audiobooks, and media production.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.