Meta has developed three new AI models – Visual-Acoustic Matching, Visually-Informed Dereverberation, and VisualVoice – to make sound more realistic in mixed reality and virtual reality (VR) experiences.
The three AI models focus on human speech and sound in video and are designed to push the industry toward immersive reality faster, the company said in a statement.
The AI models were built in collaboration with the University of Texas at Austin. The company is also making the audio-visual understanding models open to developers.
Acoustics play a role in how sound will be experienced in the metaverse. According to Meta’s AI researchers and audio specialists, AI will be core to delivering realistic sound quality.
AViTAR, the self-supervised Visual-Acoustic Matching model, adjusts audio so that it matches the acoustics of the space shown in a target image.
According to Meta, the model's self-supervised training objective learns acoustic matching from in-the-wild web videos, even though those videos lack acoustically mismatched audio and are unlabelled. VisualVoice, meanwhile, achieves audio-visual speech separation by learning visual and auditory cues from unlabelled videos.
VisualVoice generalizes well to challenging real-world videos across diverse scenarios, Meta AI researchers said.
For instance, consider attending a group meeting in the metaverse with colleagues from around the world. Instead of people talking over one another or being limited to fewer conversations, the acoustics and reverberation would adjust accordingly as attendees join smaller groups and move around the virtual space.