
Sesame AI, a California-based startup, is taking an unconventional approach to voice AI by deliberately incorporating imperfections into its speech output. Their new model represents an early step toward more authentic dialogues and what they call "voice presence" in AI systems.


According to early testing, Sesame's most impressive features are subtle elements like micro-pauses, emphasis variations, and laughter during conversations. In one interaction, Sesame's avatar Maya responded in real time to a user's sudden giggle, demonstrating emotional awareness.

The system intentionally incorporates human-like behaviors such as mid-sentence self-corrections, apologies for interruptions, and filler words. TechRadar specifically praised these deliberate imperfections, noting how they differ from the polished corporate tone of ChatGPT or Gemini.

In simulated scenarios, like discussions about work stress or party planning, the system provided contextually appropriate responses and questions rather than falling back on generic phrases.


Sesame AI uses semantic and acoustic tokens

While a formal paper hasn't been released yet, Sesame's blog post offers insights into the architecture. Their Conversational Speech Model (CSM) uses a two-part transformer structure, combining a backbone transformer (1-8 billion parameters) for basic processing with a smaller decoder (100-300 million parameters) for audio generation.
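Sesame hasn't published code yet, so the following is a minimal PyTorch sketch of that two-transformer wiring as described in the post. All dimensions, layer counts, and the bridge projection between the two stages are illustrative assumptions, not Sesame's implementation; the real models are far larger than this toy.

```python
import torch
import torch.nn as nn

class CSMSketch(nn.Module):
    """Toy version of the described two-part structure: a causal backbone
    over interleaved text/audio tokens feeds a much smaller transformer
    that predicts acoustic codes. Real sizes are 1-8B (backbone) and
    100-300M (decoder) parameters; these are shrunk to run anywhere."""

    def __init__(self, vocab=32000, d_backbone=512, d_decoder=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_backbone)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_backbone, nhead=8, batch_first=True),
            num_layers=6)
        self.bridge = nn.Linear(d_backbone, d_decoder)  # assumed projection
        self.audio_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_decoder, nhead=4, batch_first=True),
            num_layers=2)
        self.audio_head = nn.Linear(d_decoder, vocab)  # next acoustic code

    def forward(self, tokens):  # tokens: (batch, seq) of interleaved ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)   # full-context pass
        h = self.audio_decoder(self.bridge(h), mask=causal)  # cheap audio pass
        return self.audio_head(h)                            # (batch, seq, vocab)

logits = CSMSketch()(torch.randint(0, 32000, (1, 64)))  # sanity check
```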

The system processes speech using semantic tokens for linguistic properties and phonetics, alongside acoustic tokens for sound characteristics like pitch and emphasis. To optimize training, the audio decoder trains on just one-sixteenth of the audio frames, while semantic processing uses the complete dataset.
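That subsampling reads as a compute-amortization trick: the backbone still processes every frame, but the expensive audio-decoder loss is only evaluated on a random sixteenth of them. Here is a hedged sketch, assuming a `decoder` callable that maps hidden states to acoustic-code logits; the exact sampling scheme isn't spelled out in the post.

```python
import torch
import torch.nn.functional as F

def amortized_decoder_loss(backbone_states, acoustic_targets, decoder,
                           fraction=1 / 16):
    """backbone_states: (batch, frames, dim), computed over ALL frames,
    so the semantic side still sees the complete data. Only a random
    ~1/16 of frames pass through the audio decoder, cutting its
    training compute and memory roughly sixteenfold."""
    batch, frames, _ = backbone_states.shape
    keep = torch.randperm(frames)[: max(1, int(frames * fraction))]
    logits = decoder(backbone_states[:, keep])   # (batch, kept, vocab)
    targets = acoustic_targets[:, keep]          # (batch, kept)
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```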

The model trained on one million hours of English audio data across five epochs. It can process sequences of up to 2,048 tokens (about two minutes of audio) in an end-to-end architecture. This approach differs from traditional text-to-speech systems through its integrated processing of text and audio.
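Taken at face value, those two figures imply a token rate of roughly 17 per second of audio. A quick back-of-the-envelope check, assuming every slot in the window holds one audio token (the post doesn't break down how many slots go to text versus audio):

```python
context_tokens = 2048     # stated maximum sequence length
audio_seconds = 2 * 60    # "about two minutes"
print(f"~{context_tokens / audio_seconds:.1f} tokens per second of audio")
# -> ~17.1 tokens per second of audio
```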

While not directly stated in the blog post, the demo voice reveals that it uses a 27-billion-parameter version of Google's open-source LLM Gemma.

Testing reveals near-human performance

In Sesame's blind tests, participants couldn't distinguish CSM from real humans in short conversation snippets. Longer dialogues, however, still revealed limitations like occasional unnatural pauses and audio artifacts.


Sesame developed custom phonetic benchmarks to measure model performance. In listening tests, participants rated the generated speech as equivalent to real recordings when heard without context, though they still preferred the original when context was provided.

Bar chart comparing win rates between the AI model and the human reference, with and without context. The subjects' preference for AI-generated speech reaches almost human levels. | Image: Sesame AI
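To make the chart's metric concrete: in a pairwise preference test, a model win rate near 50 percent means listeners pick the AI clip as often as the real one, i.e. they can't tell them apart. A toy tally with invented votes (Sesame's actual protocol and data are their own):

```python
from collections import Counter

# Hypothetical tally for a pairwise test like the one charted above:
# each trial, a listener hears a real clip and a CSM clip and picks one.
votes = ["model", "human", "model", "human", "human", "model", "model", "human"]
win_rate = Counter(votes)["model"] / len(votes)
print(f"model win rate: {win_rate:.0%}")  # 50% reads as indistinguishable
```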

Future developments and open source plans

Sesame plans to release key components of their research as open source under the Apache 2.0 license. In coming months, they aim to scale up both model size and training scope, with plans to expand to over 20 languages.

The company is focusing particularly on integrating pre-trained language models and developing fully duplex-capable systems that can learn conversation dynamics like speaker transitions, pauses, and pacing directly from data. This development would require fundamental changes throughout the processing pipeline, from data curation to post-training methods.

"Building a digital companion with voice presence is not easy, but we are making steady progress on multiple fronts, including personality, memory, expressivity and appropriateness," the developers note.


Founded by former Oculus CTO Brendan Iribe and his team, Sesame AI secured significant Series A funding led by Andreessen Horowitz. A demo is available.

The impact of natural AI voices on assistant adoption was demonstrated by the excitement around ChatGPT's Advanced Voice Mode. Voice assistants powered by LLMs are likely to become increasingly prevalent, as suggested by Amazon's release of Alexa+.

Summary
  • Californian startup Sesame AI is developing a speech model that uses intentional imperfections such as micro-pauses, intonation, and laughter to create more authentic dialogue.
  • The system is based on a two-part transformer architecture with semantic and acoustic tokens. In blind tests, test subjects were unable to tell the difference between the AI model and real people in short conversations.
  • Sesame plans to make key components of its research available as open source and to extend the model in terms of size, training scope and language support.