Facebook AI researchers recently open-sourced XLSR, an unsupervised cross-lingual speech recognition model that can handle 53 languages at a time. The model delivers a 72% phoneme error rate reduction on the CommonVoice benchmark and a 16% word error rate reduction on the BABEL benchmark relative to the best known results.
Why Unsupervised Training?
Multilingual models leverage data from other languages to transcribe in multiple languages. These models mostly rely on supervised cross-lingual training, which requires labeled data in various languages. But transcribed speech, which serves as the label, is often far scarcer than unlabeled speech and requires non-trivial human annotation. Unsupervised representation learning, or pre-training, sidesteps this requirement because it needs no labeled data. The choice is supported by previous experiments showing that cross-lingual pre-training is especially useful for low-resource languages.
What Is New This Time?
Unsupervised representation learning has mostly been explored in the monolingual setting. The Facebook researchers extend it to the cross-lingual setting by learning representations on unlabeled data that generalize across languages. Unlike previous work, they fine-tune the transformer part of the model instead of freezing all pre-trained representations or feeding them to a separate downstream model, packing everything into a single model.
How Does The XLSR Architecture Look?
The researchers build on the wav2vec 2.0 pre-training approach. The model contains a convolutional feature encoder that maps raw audio to latent speech representations, which are fed to a transformer network that outputs context representations. The transformer architecture follows BERT, except that it uses relative positional embeddings. The feature encoder's representations are discretized with a quantization module to produce the targets for the self-supervised training objective. A shared quantization module yields multilingual quantized speech units, whose embeddings are then used as targets for the transformer trained by contrastive learning. XLSR therefore jointly learns contextualized speech representations and a discrete vocabulary of latent speech representations. The latter is used to train the model with a contrastive loss, and because the discrete speech representations are shared across languages, they create bridges among them.
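To make the quantization step concrete, here is a minimal, illustrative sketch. The real model uses Gumbel-softmax product quantization, but a simple nearest-neighbor lookup against a shared codebook conveys the core idea of discretizing continuous latents into language-independent speech units (the codebook and vectors below are made-up toy values):

```python
import math

def quantize(latent, codebook):
    """Map a continuous latent vector to the index of its nearest
    codebook entry. This is a simplified stand-in for wav2vec 2.0's
    quantization module: the actual model uses Gumbel-softmax product
    quantization, but the effect is the same in spirit -- continuous
    latents become discrete, shared speech units."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(latent, codebook[i]))

# A tiny shared "codebook" of three discrete speech units (2-dim embeddings).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
print(quantize([0.9, 1.1], codebook))  # -> 1 (nearest to entry [1.0, 1.0])
```

Because the codebook is shared across all 53 languages, similar sounds in different languages end up mapped to the same discrete units, which is what creates the cross-lingual bridges described above.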
How Is XLSR Pre-trained?
It is pre-trained on 56k hours of speech in 53 languages drawn from BABEL (conversational telephone data), CommonVoice (a corpus of read speech), and Multilingual LibriSpeech (MLS). When pre-training on L languages, multilingual batches are formed by sampling speech from each language according to a multinomial distribution. The model is trained by solving a contrastive task over masked latent speech representations while learning a quantization of the latent features shared across languages. Each time step is selected as a mask start with probability 0.065, and the following ten time steps are masked as well. The objective requires identifying the true quantized latent for a masked time step among K = 100 distractor latents sampled from other masked time steps. A codebook diversity penalty augments this objective to encourage the model to use all codebook entries.
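The masking and distractor-sampling scheme can be sketched as follows. The helper names `span_mask` and `sample_distractors` are hypothetical, and the actual implementation samples start indices without replacement rather than per-step as here, but the shape of the procedure is the same:

```python
import random

def span_mask(num_steps, prob=0.065, span=10, seed=0):
    """Choose each time step as a mask-span start with probability
    `prob`, then mask that step and the `span` steps after it
    (spans may overlap). Returns a boolean mask over time steps."""
    rng = random.Random(seed)
    starts = [t for t in range(num_steps) if rng.random() < prob]
    mask = [False] * num_steps
    for s in starts:
        for t in range(s, min(s + span + 1, num_steps)):
            mask[t] = True
    return mask

def sample_distractors(masked_indices, target, k=100, seed=0):
    """Sample up to k distractor positions from the *other* masked
    time steps; the contrastive task is to pick the true quantized
    latent for `target` out of these candidates."""
    rng = random.Random(seed)
    candidates = [i for i in masked_indices if i != target]
    return rng.sample(candidates, min(k, len(candidates)))

mask = span_mask(num_steps=200)
masked_indices = [t for t, m in enumerate(mask) if m]
```

With roughly 6.5% of steps chosen as starts and each start masking an 11-step span, a large fraction of the sequence ends up masked, which is what makes the contrastive identification task non-trivial.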
How Is XLSR-53 Fine-Tuned?
To fine-tune the model, a classifier representing the downstream task's output vocabulary is added on top of the model and trained on labeled data with a Connectionist Temporal Classification (CTC) loss. The weights of the feature encoder are not updated during fine-tuning. The researchers fine-tuned for 20k updates on CommonVoice and 50k updates on BABEL, using 2 GPUs for the Base model (XLSR-10) and 4 GPUs for the Large model (XLSR-53).
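Since a CTC-trained classifier emits one label distribution per frame, a short sketch of greedy CTC decoding shows how those frame-level predictions collapse into a transcript (the label IDs below are a toy vocabulary, not the model's actual one):

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a sequence of per-frame argmax label IDs into an
    output sequence, CTC-style: first merge consecutive repeats,
    then drop the blank symbol."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]

# Toy vocabulary: 0 = blank, 1 = 'a', 2 = 'b'
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 0]))  # -> [1, 2]
```

Note that the blank symbol is what lets CTC represent genuinely repeated output labels: `[1, 0, 1]` decodes to `[1, 1]`, whereas `[1, 1]` without a separating blank collapses to a single `[1]`.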
What Were The Results?
Cross-lingual pre-training significantly outperforms monolingual pre-training thanks to the latent discrete speech representations shared across languages, with increased sharing between related languages. As a result, the multilingual pre-trained model implicitly learns to cluster related languages.
The researchers also demonstrated that XLSR representations can be fine-tuned on multiple languages simultaneously to obtain a multilingual speech recognition system whose performance is competitive with fine-tuning a separate model for each language. The model's learned representations also transfer well to unseen languages. Cross-lingual transfer learning improves low-resource language understanding, but there is a transfer-interference trade-off: low-resource languages benefit while high-resource languages can be hurt.
To tinker with the code and read more, visit here.