Zyphra has released Zonos-v0.1, an open-source text-to-speech model that generates natural-sounding speech and can clone a voice from just seconds of audio. The model supports five languages (English, Japanese, Chinese, French, and German) and gives users control over speaking speed, pitch, audio quality, and emotional tone. According to Zyphra, it generates audio faster than real time on an RTX 4090 GPU.

Zonos is available in two versions: a pure transformer model and a hybrid model that combines state-space models with transformers. Both were trained on approximately 200,000 hours of audio, primarily in English.

Users can try Zonos through a Gradio interface, and a Docker setup is available for local installation. The model is also accessible through the Zyphra Playground or via API for those who prefer a cloud-based option.
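
To make the "faster than real time" claim concrete, speech systems are commonly described by a real-time factor: the time spent generating audio divided by the duration of the audio produced. A value below 1 means generation outpaces playback. The short Python sketch below illustrates that arithmetic only. The function names and the sample timings are hypothetical and are not measurements published by Zyphra.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the audio is generated more quickly than it takes to play back."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


# Hypothetical timings for illustration only: 5 s of compute for 10 s of speech.
rtf = real_time_factor(5.0, 10.0)
print(f"real-time factor: {rtf:.2f}")
print("faster than real time:", is_faster_than_real_time(5.0, 10.0))
```

Anyone benchmarking the model locally could substitute their own measured generation time and clip length for the placeholder values.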

Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.