Facebook researchers have open-sourced the code of their work, “Voice Separation with an Unknown Number of Multiple Speakers.” Suppose there is only one microphone and multiple speakers talking simultaneously. Can you separate the voices? For a human, it is easy. But how does a machine do it?
The single-microphone, multiple-speaker voice-separation paper answers this question. It extends the state-of-the-art voice separation task to five speakers, whereas earlier work was largely limited to two. In the past, this task was mostly addressed with Independent Component Analysis. With the recent advent of deep learning, however, it is now possible to separate mixed audio containing multiple unseen speakers.
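To see why this is hard, note that a single microphone records only the sample-wise sum of the speakers' waveforms. The toy sketch below (pure sine tones as stand-ins for voices) illustrates the mixture a separator must untangle:

```python
import numpy as np

# One second of "speech" at 8 kHz; two sine tones stand in for two voices.
sr = 8000
t = np.arange(sr) / sr
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 330 * t)

# With a single mic, all that is observed is the overlap of the sources:
mixture = speaker_a + speaker_b
```

The separation network receives only `mixture` and must output per-speaker channels, with no spatial cues to lean on.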
The main contributions, as listed by the authors, are:
- a novel audio separation model that employs a specific RNN architecture,
- a set of losses for effective training of voice separation networks,
- effective model selection in the context of voice separation with an unknown number of speakers, and
- results that show a sizable improvement over the current state of the art in an active and competitive domain.
Previous methods were trained to predict a mask for each voice; this paper introduces a novel mask-free approach. Voice separation innately comprises two subtasks: first, improving signal quality by screening out noise, and second, identifying the speaker to maintain continuity in the voice sequence.
For the first subtask, the authors use the utterance-level permutation invariant training (uPIT) loss; for the second, a loss based on the L2 distance between the network embeddings of the predicted audio channel and those of the corresponding source.
To avoid biases arising from the distribution of the data, and to keep the separation models coupled to the selection process rather than detached from it, model selection is based on an activity-detection algorithm.
Starting from the model trained on the dataset with the largest number of speakers, C, a speech detector is applied to each output channel. If silence (no activity) is detected in any channel, the procedure moves to the model with C − 1 output channels and repeats until all output channels contain speech.
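This top-down backoff can be sketched as a short loop. Here `separate(mixture, c)` and `has_speech(channel)` are hypothetical stand-ins for the per-speaker-count separation models and the activity detector, not the authors' API:

```python
import numpy as np

def select_separation(mixture, separate, has_speech, max_speakers=5):
    """Back off from the model with the most output channels (C) to C - 1
    whenever any separated channel is silent, and stop as soon as every
    channel contains detected speech."""
    for c in range(max_speakers, 0, -1):
        channels = separate(mixture, c)  # hypothetical separator call
        if all(has_speech(ch) for ch in channels):
            return channels
    return [mixture]  # fallback: treat the input as a single active speaker
```

With this loop, a mixture of two voices fed to the five-speaker model yields silent channels, so selection keeps falling back until the two-speaker model is reached.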