The open-source voice cloning model "Voice Craft" makes OpenAI's ethical restrictions on its "Voice Engine" seem irrelevant.
A team of researchers from the University of Texas at Austin and the company Rembrand has developed "Voice Craft", a voice cloning tool that, according to its developers, can edit natural speech by inserting or removing words in spoken sentences and can also convert text to speech.
Voice Craft is suitable for both editing speech and speech synthesis without a pre-trained target voice model ("zero-shot TTS"). Just three seconds of source material are sufficient for a voice clone.
Audio sample: the source material (its first three seconds serve as the prompt for the clone), followed by the resulting voice clone.
In tests, the model significantly outperforms comparable systems like FluentSpeech or VALL-E. Even humans often struggle to distinguish the generated voices from real recordings.
Audio sample: the original statement, followed by the edited statement, in which "that had expensive gold watches" was inserted.
Voice Craft is available on GitHub under the Apache 2.0 license in a small (330M) and a large (830M) version, including weights. The small version is said to be on par with the large one in output quality while running much more efficiently. You can listen to numerous demos on the project's website.
The Voice Craft team emphasizes that they are working on methods for watermarking and recognizing synthetic speech to advance research in both speech synthesis and AI safety.
OpenAI's ethical concerns outpaced by open-source movement
Speaking of AI safety, the researchers are aware of the ethical concerns that could come with releasing their open-source project. In particular, they acknowledge in their paper that the ability to clone human voices could be misused for criminal activities.
But the team argues that releasing the model gives the research community a chance to study and understand it to prevent potential misuse.
"By making these methods open source, we can catalyze the development of more robust countermeasures against the misuse of voice cloning technologies. This collaborative approach allows for the rapid identification of vulnerabilities and the exploration of innovative strategies to address them," the researchers state.
This is in stark contrast to the stance taken by OpenAI, which recently unveiled its own voice model, Voice Engine. It can clone human voices with as little as 15 seconds of source material. Because of the risks involved, however, OpenAI is only making it available to selected groups.
But the very existence of Voice Craft undermines OpenAI's argument. The situation is somewhat reminiscent of the fuss surrounding GPT-2 in 2019, when OpenAI made headlines by claiming the model was too dangerous to publish.
Today, there are many open-source models on the market that are many times more powerful than GPT-2. While OpenAI's decision may have been the right one at the time, generative text AI is now widely available and easily abused.
The same has happened with image models, and OpenAI should know this by now. It clearly can't, and shouldn't, fulfill the role of gatekeeper for the ethically correct use of generative AI that it seems to ascribe to itself. Such groundbreaking technologies need to be developed by many, and their use has to be decided by many as well.
In the case of Voice Engine specifically, OpenAI claims to have completed the model in 2022 but only unveiled it on March 29, 2024. Voice Craft was unveiled eight days earlier, on March 21, and it's not even the first model of its kind.
Statements from OpenAI such as "It's important that people around the world understand where this technology is headed, whether we ultimately deploy it widely ourselves or not" seem out of touch when a comparable model is already freely available.