The Qwen team at Alibaba Cloud has released two new AI models: one designs voices from text descriptions, the other clones them from short audio samples. The Qwen3-TTS-VD-Flash model generates voices from detailed descriptions, letting users precisely define characteristics such as emotion and speaking tempo. For example, a user could request a "Male, middle-aged, booming baritone - hyper-energetic infomercial voice with rapid-fire delivery and exaggerated pitch rises, dripping with salesmanship." According to Qwen, the model outperforms OpenAI's GPT-4o mini-tts API, which launched earlier this spring.
The second release, Qwen3-TTS-VC-Flash, can clone a voice from just three seconds of audio and reproduce it in ten languages. Qwen claims the model achieves a lower error rate than competitors such as ElevenLabs or MiniMax. The model can also process complex texts, imitate animal sounds, and extract voices from recordings. Both models are accessible via the Alibaba Cloud API, and demos for the design model and the clone model are available on Hugging Face.