Meta presents new research on artificial intelligence for realistic metaverse audio. Multimodally trained audio systems evaluate visual information and automatically adjust the sound to match.
Meta's Reality Labs, together with the University of Texas at Austin, unveils new AI models designed to optimize sound in VR and AR based on visual data. AI is critical for realistic sound quality in the metaverse, the company writes.
Multimodal AI for matching sound and image
Meta is releasing three new AI models as open source: visual-acoustic matching, visually-informed dereverberation, and VisualVoice. All three ultimately involve an AI automatically shaping sound to match visual information. This multimodal interplay of audio, video, and text is the focus of the newly presented research.
"Existing AI models do a good job understanding images, and are getting better at video understanding. However, if we want to build new, immersive experiences for AR and VR, we need AI models that are multimodal — models that can take audio, video, and text signals all at once and create a much richer understanding of the environment," Meta's research team writes.
For example, if the AI detects that a sound is coming from a cave, it can automatically add the appropriate reverberation (visual-acoustic matching). Visually-informed dereverberation works in the opposite direction: it strips the acoustics of the space where content was originally recorded, so that existing audio can then be matched to the current space instead.
For example, the soundtrack of a recorded theater performance could be processed so that it sounds as if it were being performed live in the current space during an AR projection. The AI should also be able to automatically remove unwanted reverberation and background noise from the original recording, according to the researchers.
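To illustrate the idea behind visual-acoustic matching, here is a minimal conceptual sketch: dry (dereverberated) audio is convolved with a room impulse response chosen to fit a visually detected scene. The scene labels, impulse-response bank, and function names are hypothetical placeholders for illustration only; Meta's models learn this transformation from data rather than looking it up.

```python
# Conceptual sketch of visual-acoustic matching, not Meta's implementation.
# Assumption: a scene classifier has already labeled the target image (e.g. "cave"),
# and a pre-recorded room impulse response (RIR) exists for each scene type.
import numpy as np
from scipy.signal import fftconvolve

# Hypothetical RIR bank: scene label -> impulse response sampled at 16 kHz.
RIR_BANK = {
    "cave": np.random.randn(16000) * np.exp(-np.linspace(0, 8, 16000)),   # long decay
    "studio": np.random.randn(1600) * np.exp(-np.linspace(0, 40, 1600)),  # short decay
}

def match_audio_to_scene(dry_audio: np.ndarray, scene_label: str) -> np.ndarray:
    """Convolve 'dry' audio with the RIR of the visually detected scene."""
    rir = RIR_BANK[scene_label]
    wet = fftconvolve(dry_audio, rir, mode="full")[: len(dry_audio)]
    # Normalize to avoid clipping after convolution.
    return wet / (np.max(np.abs(wet)) + 1e-8)

# Example: speech recorded in a studio, re-rendered as if spoken in a cave.
speech = np.random.randn(16000 * 3)  # placeholder for a 3-second recording
cave_speech = match_audio_to_scene(speech, "cave")
```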
Better concert experiences in the Metaverse
Another application example, according to Meta, is attending a virtual concert. In the metaverse, avatars outside the concert hall could initially hear muffled sounds that become increasingly clear the closer they get to the stage.
The metaverse trick: dialogue could remain clearly audible despite the rising ambient volume, as if people were standing next to each other without loud background music. The AI could also focus audio on small groups so that voices do not drown each other out (VisualVoice).
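The distance-dependent muffling described above can be pictured with a simple mixing sketch, assuming inverse-distance attenuation and a low-pass filter for sources blocked by a wall. This is only an illustration of the effect, not the rendering pipeline Meta describes; all names and parameters are assumptions.

```python
# A minimal sketch of the concert example: music outside the hall is attenuated
# and muffled, while a nearby voice is mixed in at full clarity.
import numpy as np
from scipy.signal import butter, lfilter

SAMPLE_RATE = 16000

def render_source(audio: np.ndarray, distance_m: float, wall_between: bool) -> np.ndarray:
    """Attenuate a source with distance and muffle it (low-pass) if a wall blocks it."""
    gain = 1.0 / max(distance_m, 1.0)          # inverse-distance attenuation
    out = audio * gain
    if wall_between:
        # Muffle: keep only frequencies below ~800 Hz, as heard outside the hall.
        b, a = butter(4, 800, btype="low", fs=SAMPLE_RATE)
        out = lfilter(b, a, out)
    return out

# Outside the hall: distant, muffled music plus a clearly audible nearby voice.
music = np.random.randn(SAMPLE_RATE * 2)       # placeholder concert audio
dialogue = np.random.randn(SAMPLE_RATE * 2)    # placeholder voice of a nearby avatar
mix_outside = render_source(music, distance_m=30.0, wall_between=True) + \
              render_source(dialogue, distance_m=1.0, wall_between=False)
```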
Working together, these audio systems could one day also enable "intelligent assistants" to better understand what we are saying to them, even at a loud concert or a wild party.
Papers, models, and more information are available on Meta's AI blog.