Google loves to scale things up. And no wonder: Google AI researchers have now breached the trillion-parameter threshold for pre-trained language models. They recently introduced the Switch Transformer, a model with 1.6 trillion parameters. It is the largest sparse model ever trained, and the first at this scale trained with the lower-precision bfloat16 format.
The basic intuition at work here is that simple architectures, backed by large datasets and parameter counts, can surpass far more complicated algorithms. The researchers built the Switch Transformer on Google's T5 architecture. With the same computational resources, they reported a four-fold pre-training speedup over T5-XXL and a seven-fold speedup over T5-Base and T5-Large.
However, the Switch Transformer fundamentally differs from currently popular pre-trained language models (PLMs) such as GPT-2 and GPT-3, which use densely activated transformer architectures. It does not apply the same weights to every input; instead, it contains a mixture of experts: smaller models, specialized in different tasks, whose parameters are selected separately for each input. A gating network oversees this mixture and routes each input to the expert most relevant to the task at hand.
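To make the routing concrete, here is a minimal numpy sketch of top-1 ("switch") routing; the function name `switch_route` and the matrix shapes are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def switch_route(x, router_w, expert_ws):
    """Route each token to the single highest-scoring expert (top-1 routing).

    x:         (tokens, d_model) token representations
    router_w:  (d_model, n_experts) router (gating) weights
    expert_ws: list of n_experts expert weight matrices, each (d_model, d_model)
    """
    logits = x @ router_w                           # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over experts
    chosen = probs.argmax(axis=-1)                  # one expert per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        # Only the chosen expert's weights are applied, scaled by its gate value
        out[i] = probs[i, e] * (x[i] @ expert_ws[e])
    return out, chosen
```

Each token thus touches only one expert's weights, which is what makes the activation sparse even when the total parameter count is enormous.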
This arrangement results in a sparsely activated expert model with an outrageous number of parameters but greater computational efficiency. The sparsity comes from activating only a subset of the neural network's weights for each input, namely those of the selected expert. Notably, the 1.6-trillion-parameter model with 2,048 experts (Switch-C) exhibited "no training instability at all," in contrast to the smaller Switch-XXL, which has 395 billion parameters and 64 experts. The researchers credit an efficient combination of data, model, and expert parallelism for making models with up to a trillion parameters possible.
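The economics of this design can be seen with a back-of-envelope calculation; the layer dimensions below are assumed for illustration and are not the paper's exact sizes:

```python
# Back-of-envelope: parameters grow with expert count, per-token compute does not.
d_model = 1024           # illustrative hidden size (assumed, not from the paper)
d_ff = 4096              # illustrative feed-forward size
n_experts = 2048         # Switch-C's expert count

dense_ffn_params = 2 * d_model * d_ff          # one feed-forward expert
sparse_params = n_experts * dense_ffn_params   # weights held across all experts
active_params = dense_ffn_params               # top-1 routing: one expert per token

print(sparse_params // active_params)  # -> 2048: capacity scales, per-token compute stays flat
```

This is why a 1.6-trillion-parameter model can cost roughly the same per token to run as a far smaller dense model.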
This sparse network is then distilled into a specialized, fine-tuned dense version for a particular downstream task. The researchers were able to reduce model size by up to 99% while preserving 30% of the large sparse teacher's quality gains. Because of the model's vast size, novel pre-training and fine-tuning techniques were needed: selective precision training, which makes training stable at the lower bfloat16 precision; a new initialization scheme that enables scaling to many experts; and increased expert regularization, which improves sparse model fine-tuning and multi-task training.
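The distillation step can be sketched as a standard loss that blends hard-label cross-entropy with the teacher's soft targets; `distill_loss` and its `alpha` weighting are conventional assumptions, not the paper's exact recipe:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's soft targets."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    n = len(labels)
    # Cross-entropy against the true labels
    hard = -np.log(s[np.arange(n), labels] + 1e-12).mean()
    # KL(teacher || student): how far the student strays from the teacher
    soft = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1).mean()
    return alpha * hard + (1 - alpha) * soft
```

The dense student mimics the sparse teacher's output distribution, which is how part of the teacher's quality gain survives the 99% size reduction.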
The team trained the huge model by spreading its weights across specialized hardware (TPUs), which kept the memory and computational footprint on each device manageable. They used the Colossal Clean Crawled Corpus, a 750GB multilingual dataset of web-crawled text. Masked training was used, in which the model has to predict masked-out words.
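The masked-training objective can be sketched as follows; the helper `mask_tokens`, the 15% masking rate, and the `[MASK]` token are conventional masked-language-modeling assumptions rather than details reported in the article:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide a fraction of tokens; return (masked inputs, prediction targets)."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)      # the model must predict the original token here
        else:
            inputs.append(tok)
            targets.append(None)     # no prediction loss at unmasked positions
    return inputs, targets
```

During pre-training, the loss is computed only at the masked positions, so the model learns to fill in words from surrounding context.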
Astonishingly, the model showed a universal improvement across 101 languages, with 91% of languages seeing a 4x or greater speedup over the mT5 baseline. The researchers now plan to apply the Switch Transformer to multimodal models.