Multilingual content can be a key growth driver for content creators, boosting views from non-primary-language audiences. Alexander Konovalov, CEO of vidby, explains how AI tools can help.
Multilingual videos have become a holy grail for content creators aiming to reach a global audience and engage viewers from diverse linguistic backgrounds. According to YouTube, videos dubbed into multiple languages see a 15% increase in views from non-primary-language audiences.
Traditionally, multilingual content was achieved through human dubbing, where voice actors would replicate the original speech in the target language. However, this process is costly and time-consuming, a towering roadblock in the era of the creator economy.
Today, AI-powered video dubbing focuses not only on highly accurate speech translation but also on matching on-screen lip movements. Even when viewers don't hear the original language, they can deduce it from the speaker's facial expressions and lip movements, and discrepancies such as out-of-sync lips can break the viewer's immersion and dilute the creator's intended message.
AI masters semantics and context
Language brims with intricate details, and the context can significantly change the meaning of a word or phrase. While AI-based solutions often grapple with comprehending tricky contexts, vidby uses deep learning models specifically trained to capture context and deliver more accurate translations. Through comprehensive user feedback, the team continually fine-tunes these models, aiming for translations that are both precise and natural-sounding.
The transfer learning approach, which uses pre-trained models enriched with data prepared by the vidby team, speeds up the learning process and improves the output. Models like GPT and BERT, known for their prowess in understanding context, enable the team to develop robust translation tools.
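To make the idea concrete, here is a minimal sketch of the transfer-learning starting point, using a publicly available pre-trained translation model from the open Helsinki-NLP family. vidby's own models, training data, and fine-tuning pipeline are proprietary, so treat this purely as an illustration of where such training begins.

```python
# A minimal sketch of the transfer-learning idea: start from a publicly
# available pre-trained translation model, then fine-tune it on domain data.
# The model name below is a public Hugging Face checkpoint, not vidby's model.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # public English -> German model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Inference with the pre-trained weights; fine-tuning on curated, in-domain
# sentence pairs would continue from exactly these weights.
batch = tokenizer(["The bank raised its rates."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Starting from pre-trained weights like these means the model already encodes grammar and general context, so domain fine-tuning needs far less data than training from scratch.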
Tackling the nuances of speech and voice generation
The natural intonation, accents, and rhythm of speech can be challenging for AI to replicate. To tackle this, vidby has introduced features that allow for the customized pronunciation of specific words in different languages.
When it comes to voice generation, DeepMind's deep learning model WaveNet is a standout. It generates raw speech waveforms that sound natural, ensuring that the AI-generated voice retains the intricacies and nuances of human speech.
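WaveNet voices are publicly exposed through Google Cloud Text-to-Speech, which makes for a convenient illustration. The sketch below synthesizes one line with a documented WaveNet voice; it assumes Google Cloud credentials are configured in the environment and is not vidby's proprietary pipeline.

```python
# Minimal sketch using a publicly available WaveNet voice via Google Cloud
# Text-to-Speech (a stand-in, not vidby's production stack). Assumes Google
# Cloud credentials are already configured.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Welcome to our channel!"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Wavenet-D",  # one of the documented WaveNet voices
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("dub_line.mp3", "wb") as f:
    f.write(response.audio_content)
```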
More nuance: dubbing brand names
Have you ever wondered how brand names sound different in various languages? For example, if your brand's name is AI Studio, it is [ˌeɪˈaɪ] Studio in English but [aːˈiː ˈʃtuːdio] in German.
No matter the rules or local norms, brand owners and their audiences want the brand name to sound just right in every language. If the brand name doesn't hit the right notes, customers will care little about the rest.
To that end, vidby has integrated its proprietary BrandNameTTS model, which voices brand names according to English pronunciation rules in any target language. With this, our AI-powered video dubbing solution ensures that the brand name resonates as intended.
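BrandNameTTS itself is proprietary, but the underlying idea of pinning a name's pronunciation can be illustrated with standard SSML phoneme tags, here via Amazon Polly as a stand-in. The IPA string forces the English reading of "AI Studio" even inside a German-voiced sentence; the exact transcription is illustrative.

```python
# A hedged illustration of fixing a brand name's pronunciation with an SSML
# <phoneme> tag, using Amazon Polly as a stand-in for vidby's BrandNameTTS.
# Assumes AWS credentials are configured for boto3.
import boto3

polly = boto3.client("polly")

# German sentence, but the brand name is forced to its English IPA reading.
ssml = (
    "<speak>"
    "Willkommen bei "
    '<phoneme alphabet="ipa" ph="ˌeɪˈaɪ ˈstuːdioʊ">AI Studio</phoneme>.'
    "</speak>"
)

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    VoiceId="Vicki",  # a German Polly voice
    OutputFormat="mp3",
)

with open("brand_line.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```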
Synchronization: AI learns natural lip-sync and takes over manual work
Dubbing videos requires aligning the voice with the speaker's lip movements, a challenge that automated systems often struggle with. Vidby leverages algorithms that scrutinize lip movements and adjust the audio track to achieve a natural sync.
For this synchronization, the team deploys a GAN (generative adversarial network). The network consists of a generator, which crafts audio, and a discriminator, which assesses its quality; their adversarial interplay results in accurate synchronization.
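As an illustration of that generator-discriminator interplay, here is a toy GAN skeleton in PyTorch. It trains on random tensors, and every dimension and loss choice below is an assumption made for demonstration; it is not vidby's production lip-sync model.

```python
# Toy GAN skeleton: a generator maps lip features to audio features, while a
# discriminator judges whether an audio frame plausibly matches the lips.
# All shapes are illustrative assumptions; training data here is random noise.
import torch
import torch.nn as nn

AUDIO_DIM, LIP_DIM, NOISE_DIM = 256, 64, 32

class Generator(nn.Module):
    """Maps lip-movement features (plus noise) to an audio-feature frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LIP_DIM + NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, AUDIO_DIM), nn.Tanh(),
        )
    def forward(self, lips, noise):
        return self.net(torch.cat([lips, noise], dim=1))

class Discriminator(nn.Module):
    """Scores how plausibly an audio frame matches the lip features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LIP_DIM + AUDIO_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )
    def forward(self, lips, audio):
        return self.net(torch.cat([lips, audio], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(100):  # toy loop on random "data"
    lips = torch.randn(16, LIP_DIM)
    real_audio = torch.randn(16, AUDIO_DIM)

    # Discriminator: real lip/audio pairs -> 1, generated pairs -> 0.
    fake_audio = G(lips, torch.randn(16, NOISE_DIM)).detach()
    loss_d = bce(D(lips, real_audio), torch.ones(16, 1)) + \
             bce(D(lips, fake_audio), torch.zeros(16, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring its audio as real.
    fake_audio = G(lips, torch.randn(16, NOISE_DIM))
    loss_g = bce(D(lips, fake_audio), torch.ones(16, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```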
Like other advancements in automation, lip-syncing has significantly reduced manual work for creators. Videos dubbed with lip-sync in mind look natural, and the approach offers a competitive path to fully automating expensive manual processes while handling most custom cases.
The speed and length of phrases matter
Synchronization also hinges on phrase length, which varies across languages. In video dubbing, an essential task is controlling the speed of pronunciation by measuring the duration of original phrases and predicting their duration in the target language. Recognizing this, vidby has incorporated its own SpeechTempo models to handle it.
For example, spoken German typically runs 20-30% longer than English. If an original phrase in German lasts 20 seconds, a correctly translated English version would last around 15-17 seconds.
Left unadjusted, the remaining 3-5 seconds after the synthesized speech ends would be completely out of sync. However, AI's deep learning and neural networks come to the rescue, generating synthetic speech that accounts for pacing and intonation. Time stretching can then adjust the speech duration, ensuring synchronization with lip movements.
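Here is a back-of-the-envelope sketch of that tempo logic: predict the target-language duration from an assumed expansion factor, then time-stretch the synthesized audio to fill the original slot. The factors and file names are illustrative, and librosa's phase-vocoder stretch stands in for vidby's actual SpeechTempo adjustment.

```python
# Sketch of the tempo logic under assumed expansion factors (not vidby's
# SpeechTempo values). librosa's time stretch stands in for the real
# duration-adjustment step; file paths are placeholders.
import librosa
import soundfile as sf

# Rough spoken-length ratios relative to English (assumed for illustration).
EXPANSION_VS_ENGLISH = {"de": 1.25, "en": 1.0}

def target_duration(source_seconds, src_lang, tgt_lang):
    """Predict how long the translated phrase should take to speak."""
    ratio = EXPANSION_VS_ENGLISH[tgt_lang] / EXPANSION_VS_ENGLISH[src_lang]
    return source_seconds * ratio

# A 20 s German phrase maps to roughly 16 s of English speech (20 / 1.25).
print(target_duration(20.0, "de", "en"))

# Stretch the synthesized English audio so it fills the original 20 s slot,
# keeping lips and voice aligned. rate > 1 speeds up, rate < 1 slows down.
audio, sr = librosa.load("english_dub.wav", sr=None)
current = len(audio) / sr
stretched = librosa.effects.time_stretch(audio, rate=current / 20.0)
sf.write("english_dub_stretched.wav", stretched, sr)
```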
In the race to enhance voicing quality, competing providers develop and train proprietary neural networks to power their video dubbing solutions. AI plays a crucial role in determining the length of sentences in synthesized speech and in establishing normative values for speech characteristics such as pitch and volume.
The future frontiers in video dubbing
AI-based advancements in video dubbing, such as speech synthesis for different ages and pronunciation control, are just the tip of the iceberg. Next-gen speech cloning, emotion-infused speech synthesis, and real-time simultaneous translation are already on the horizon. They are poised to revolutionize the way creators develop video content. With the help of AI-based video dubbing solutions, creators will produce expressive videos accessible to a diverse global audience.