Update
  • Release information added

French AI lab Kyutai releases two versions of its AI assistant Moshi, capable of natural conversations with users.


After unveiling its AI assistant Moshi in July, Kyutai has now released the open-source models as promised. The release includes several components: a technical report, weights for Moshi and its Mimi codec, and streaming inference code in PyTorch, Rust, and MLX.

Video: Kyutai Labs

According to the report, Moshi consists of three main components: Helium, a 7B language model; Mimi, a neural audio codec; and a new multi-stream architecture. The system can model real-time conversations with overlaps and interruptions. Kyutai Labs provides two Moshi models with artificially generated voices. More details can be found in the published paper and GitHub repository.
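To make the multi-stream idea more concrete, here is a minimal, hypothetical PyTorch sketch of the general concept: several token streams per timestep (for example the user's audio, Moshi's audio, and Moshi's text) are embedded, summed, and processed by one causal transformer that predicts the next step of every generated stream in parallel. The class names and sizes are invented for illustration; the released model is considerably more elaborate.

```python
import torch
import torch.nn as nn

VOCAB, DIM, STREAMS = 2048, 256, 3  # hypothetical sizes, not Moshi's

class TinyMultiStreamLM(nn.Module):
    """One causal transformer over several parallel token streams."""
    def __init__(self):
        super().__init__()
        # one embedding table per stream; per-step embeddings are summed
        self.embeds = nn.ModuleList(nn.Embedding(VOCAB, DIM) for _ in range(STREAMS))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # one output head per stream, predicted jointly at every step
        self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(STREAMS))

    def forward(self, tokens):  # tokens: (batch, time, STREAMS) of token ids
        h = sum(emb(tokens[..., i]) for i, emb in enumerate(self.embeds))
        T = tokens.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(h, mask=causal)  # each step sees only the past
        return [head(h) for head in self.heads]  # next-step logits per stream

model = TinyMultiStreamLM()
logits = model(torch.randint(0, VOCAB, (1, 10, STREAMS)))
print([tuple(l.shape) for l in logits])  # 3 x (1, 10, 2048)
```

Because all streams advance in lockstep at every timestep, such a model never has to wait for the user to finish speaking, which is what makes overlaps and interruptions representable in the first place.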


Original article from July 4th, 2024

At an event in Paris, French AI lab Kyutai unveiled Moshi, an AI assistant that can hold natural conversations with users. The technology will be released as open source.

According to Kyutai, Moshi is the first publicly accessible AI assistant with natural conversational abilities. OpenAI had previously showcased this feature for GPT-4o but has not yet released it.

During the presentation, Kyutai CEO Patrick Perez explained that Moshi was developed by a team of eight in just six months. What sets Moshi apart is its ability to listen and speak simultaneously in real time. Kyutai claims a theoretical latency of just 160 milliseconds; in practice, it ranges between 200 and 240 milliseconds.

Moshi's architecture is based on a new approach that Kyutai calls an "Audio Language Model." Instead of converting speech into text as usual, the model heavily compresses audio data and treats it like pseudo-words. This allows it to work directly with audio data and predict speech, making it a natively multimodal model, similar to GPT-4o.
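As a rough illustration of the "pseudo-word" idea, the sketch below shows the core mechanic of a neural audio codec: a convolutional encoder downsamples the waveform into low-rate latent frames, and each frame snaps to its nearest entry in a learned codebook, yielding discrete token IDs that a language model can predict just like text tokens. The ToyCodec class and all sizes are invented for illustration and do not reflect Mimi's actual design.

```python
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    """Toy encoder + vector quantizer turning audio into discrete tokens."""
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        # strided convolutions downsample the waveform into low-rate latents
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4),
        )
        # learned codebook: each latent frame snaps to its nearest entry,
        # and the entry's index becomes the audio "pseudo-word"
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def encode(self, wav):  # wav: (batch, 1, samples)
        latents = self.encoder(wav).transpose(1, 2)        # (B, frames, dim)
        dists = torch.cdist(latents, self.codebook[None])  # (B, frames, K)
        return dists.argmin(-1)                            # discrete token IDs

codec = ToyCodec()
tokens = codec.encode(torch.randn(1, 1, 16000))  # 1 s of fake 16 kHz audio
print(tokens.shape)  # a short sequence of pseudo-words for the language model
```

A matching decoder (omitted here) would map token IDs back to a waveform; in real codecs, encoder, decoder, and codebook are trained jointly with vector-quantization objectives.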


Video: Kyutai

For training, Kyutai used various data sources, including human speech data and YouTube videos. First, a pure text model called Helium was trained; the model was then trained jointly on text and audio data. Finally, synthetic dialogues were used to fine-tune its conversational behavior.
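Schematically, that is a three-phase curriculum. The toy sketch below only illustrates the staged structure, with a stand-in model, a placeholder objective, and fake data loaders; none of it is Kyutai's actual pipeline.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def fake_batches(n):  # placeholder for the real data pipelines
    for _ in range(n):
        yield torch.randn(8, 16)

phases = [
    ("text-only pretraining (Helium)", fake_batches(3)),
    ("joint text + audio training", fake_batches(3)),
    ("fine-tuning on synthetic dialogues", fake_batches(3)),
]

for name, batches in phases:
    for x in batches:
        loss = nn.functional.mse_loss(model(x), x)  # toy objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: last loss {loss.item():.4f}")
```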

Since the underlying language model has only 7 billion parameters, it shows the usual limitations of small models in dialogue. Nevertheless, its language capabilities and speed are impressive and hint at what this technology could do with larger, more powerful models.

To give Moshi a consistent voice, Kyutai collaborated with a voice actress named Alice. She recorded monologues and dialogues in various styles, which were then used to train a speech synthesis system.


Moshi: Demo available, open source to follow

Kyutai believes Moshi has great potential to change the way we communicate with machines, and sees promising applications particularly in accessibility for people with disabilities.

The Moshi demo is now available online; a separate US-hosted instance offers lower latency for users there. In the coming months, Kyutai plans to release the technology as open source so developers and researchers can examine, adapt, and extend it. A research paper is also set to follow.

Kyutai was founded in 2023 and last November received 300 million euros from backers including French billionaires Xavier Niel and Rodolphe Saadé. The lab has attracted renowned AI researchers such as Yann LeCun and Bernhard Schölkopf as scientific advisors. A major draw for researchers is Kyutai's commitment to open science: all of its models are to be open source, and the team plans to publish not only the models themselves but also the training source code and documentation of the training process.

 

Summary
  • French AI startup Kyutai has released its Moshi AI assistant, which can have natural conversations with users in real time. Moshi was developed in just six months by a team of eight and has a latency of 200-240 milliseconds.
  • Moshi's architecture is based on an "audio language model" that compresses audio data and treats it like pseudo-words. Various data sources, such as human speech data, YouTube videos, and synthetic dialogues, were used for training.
  • Kyutai sees great potential in Moshi, especially for accessibility for people with disabilities.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.