NVIDIA recently unveiled research on speech synthesis that could make voice assistants like Google Assistant and Siri sound far more human-like. The technology currently used to generate speech has improved considerably in recent years but still lacks critical elements of human speech such as rhythm and intonation.
NVIDIA researchers are developing a new speech synthesis technology that would make voice assistants sound richer and enable them to produce voice modulations and dynamics closer to those of human speech.
The full research will be presented in a session at the Interspeech 2021 conference, which commences on 3 September. Researchers from NVIDIA’s text-to-speech team have developed a model named RAD-TTS that can achieve these qualities in a voice bot.
To build the RAD-TTS model, the researchers drew on extensive work in conversational artificial intelligence, natural language processing, audio enhancement, automated speech recognition, and related areas. The model runs efficiently on NVIDIA GPUs, and the company has open-sourced some of this research through its NVIDIA NeMo toolkit.
NVIDIA wrote in a blog post, “With this interface, our video producer could record himself reading the video script, and then use the AI model to convert his speech into the female narrator’s voice.”
The blog also noted that users can tweak the generated speech to improve the narration and flow of videos. Earlier speech synthesis models could not produce accurate voice modulation, so they could not convey the emotional dimension of speech in narration.
NVIDIA’s RAD-TTS also includes a voice-conversion feature that can make one speaker’s voice sound like another’s. “The AI model’s capabilities go beyond voiceover work: text-to-speech can be used in gaming, to aid individuals with vocal disabilities or to help users translate between languages in their own voice,” said NVIDIA.