Microsoft and NVIDIA recently announced the successful training of the world’s largest and most powerful monolithic transformer language model: Megatron-Turing Natural Language Generation (MT-NLG). MT-NLG is the successor to the companies’ Turing NLG 17B and Megatron-LM models.
Microsoft launched Project Turing in 2019 with the goal of enabling AI-powered enterprise search.
MT-NLG has 530 billion parameters and can perform a wide range of natural language tasks, including completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.
In zero-, one-, and few-shot settings, the 105-layer, transformer-based MT-NLG outperformed previous state-of-the-art models, setting a new benchmark for large-scale language models in both scale and quality. MT-NLG was trained on NVIDIA’s Selene machine learning supercomputer, the sixth fastest supercomputer in the world. Selene consists of 560 DGX A100 servers, each with eight A100 80GB GPUs, connected by advanced networking such as Mellanox HDR InfiniBand. It is also powered by AMD EPYC 7742 CPUs and is estimated to have cost more than US$85 million.
NLP models have become a point of rivalry among the major internet companies in recent years, especially when it comes to surpassing GPT-3. Aside from Microsoft-NVIDIA and OpenAI, Google unveiled LaMDA (Language Model for Dialogue Applications), a language model that the company claims can converse freely about a seemingly endless number of topics, unlocking more natural ways of interacting with technology and entirely new categories of practical applications. As these language models grow in size, AI researchers and engineers must devise new approaches to train them. This necessitates meticulous planning, since the model and its training data must be stored and processed across many processors at the same time.
First described in the paper ‘Attention Is All You Need,’ transformers have an encoder-decoder architecture based on attention layers, similar to a sequence-to-sequence architecture. A sequence-to-sequence (Seq2Seq) model is a neural network that takes a sequence as input and produces another sequence, possibly of a different length, as output. These models typically excel at translation, which involves transforming a sequence of words in one language into a sequence of words in another.
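To make the idea concrete, here is a minimal sketch of an encoder-decoder transformer using PyTorch’s `torch.nn.Transformer`; the dimensions are arbitrary and chosen only to show that the input and output sequences can have different lengths.

```python
import torch
import torch.nn as nn

# A small encoder-decoder transformer; the sizes below are arbitrary.
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 32, 64)  # source: 10 tokens, batch of 32, 64-dim embeddings
tgt = torch.rand(7, 32, 64)   # target: 7 tokens, a different sequence length

out = model(src, tgt)
print(out.shape)              # torch.Size([7, 32, 64]) -- one vector per target position
```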
A critical distinction of the transformer model is that the input sequence can be processed in parallel, allowing GPUs to be used more efficiently and training to be faster. The attention mechanism in the transformer examines an input sequence and determines which portions of the sequence are significant at each stage. A multi-headed attention layer also helps mitigate the long-range dependency and vanishing-gradient problems that recurrent seq2seq models commonly face.
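Below is a minimal sketch of the scaled dot-product attention at the heart of this mechanism (the tensor shapes are illustrative). The attention weights for every position are computed in a single matrix multiplication, which is what lets the whole input sequence be processed in parallel on a GPU.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # how much each position attends to every other
    return weights @ v, weights

x = torch.rand(2, 5, 16)                           # 2 sequences of 5 tokens, 16-dim each
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: q, k, v from the same input
print(out.shape, attn.shape)                       # (2, 5, 16) and (2, 5, 5)
```

A multi-headed layer simply runs several such attention computations in parallel over different learned projections of the input and concatenates the results.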
While transformer-based generative models are state-of-the-art innovations, developers face major challenges when designing and training large language models:
- Even the most powerful GPU can no longer hold the parameters of these models in its memory, and data parallelism does not reduce the memory footprint per device because it replicates the full model on every GPU (see the back-of-the-envelope calculation after this list).
- Model parallelism does not scale well due to its high communication cost.
- If special attention is not devoted to optimizing the algorithms, software, and hardware stack as a whole, the massive number of compute operations required can result in unreasonably long training times.
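A back-of-the-envelope calculation makes the memory problem concrete. Assuming 16-bit weights and ignoring gradients, optimizer states, and activations (all of which add several times more memory in practice), the parameters alone dwarf the 80 GB of a single A100:

```python
params = 530e9            # MT-NLG parameter count
bytes_per_param = 2       # fp16 weights only; gradients and optimizer states ignored
gpu_memory_gb = 80        # one A100 80GB

weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weights_gb:,.0f} GB")                   # ~1,060 GB
print(f"ratio to one GPU: {weights_gb / gpu_memory_gb:.0f}x")   # ~13x too large
```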
All 4,480 GPUs communicate with one another over NVLink and NVSwitch, and each delivered a processing throughput of 113 teraFLOPS during training.
Training these models is an expensive affair, and even on top-of-the-line hardware, software optimizations are required to keep training times down. The developers therefore leveraged DeepSpeed, a deep learning optimization library with PyTorch code developed by Microsoft, to pack more data into several pipelines simultaneously. Microsoft introduced DeepSpeed as an open-source framework for large-scale model training with greater scale, speed, cost efficiency, and usability, allowing users to train models with 100 billion parameters.
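For context, this is roughly what handing a PyTorch model to DeepSpeed looks like. The model and configuration values below are illustrative placeholders, not MT-NLG’s actual settings, and the script is meant to be run with the `deepspeed` launcher on GPU hardware.

```python
import torch
import deepspeed

# Placeholder config; the values are illustrative, not MT-NLG's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # stand-in for a real transformer

# DeepSpeed wraps the model in an engine that manages mixed precision,
# gradient accumulation, and optimizer-state partitioning.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One training step; the engine replaces the usual backward()/optimizer.step().
inputs = torch.rand(4, 1024, device=engine.device).half()
loss = engine(inputs).mean()
engine.backward(loss)
engine.step()
```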
Meanwhile, the team noted that existing parallelism approaches, such as data, pipeline, and tensor-slicing parallelism, involve trade-offs between memory and compute efficiency and cannot be used on their own to train models at this scale:
- Data parallelism is efficient in terms of computation, but it replicates model states and does not use aggregate distributed memory.
- Pipeline parallelism scales well across nodes. However, it requires large batch sizes, coarse-grained parallelism, and perfect load balancing to be compute-efficient, which is not achievable at scale.
- Tensor-slicing necessitates a lot of communication across GPUs, limiting computation efficiency beyond a single node when NVLink isn’t accessible.
To address the parallelism challenges, the team created an efficient and scalable 3D parallel system capable of combining all three parallelism methods, with the help of NVIDIA Megatron-LM and Microsoft DeepSpeed. More precisely, the system scales the model within a node using tensor-slicing from Megatron-LM and across nodes using pipeline parallelism from DeepSpeed.
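The arithmetic behind this 3D decomposition can be sketched as follows. The parallelism degrees below are assumptions chosen only so that their product matches the 4,480 GPUs mentioned above, not MT-NLG’s published configuration: each group of eight consecutive ranks (one DGX A100 node) forms a tensor-slicing group, pipeline stages span nodes, and the remaining factor gives the number of data-parallel model replicas.

```python
# Assumed parallelism degrees for illustration (8 x 35 x 16 = 4,480 GPUs).
TENSOR_PARALLEL = 8     # tensor-slicing within a node: one slice per GPU
PIPELINE_PARALLEL = 35  # pipeline stages spread across nodes
DATA_PARALLEL = 16      # independent model replicas fed different data

assert TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL == 4480

def rank_to_coords(rank: int):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates.

    Tensor-parallel ranks vary fastest, so ranks 0-7 share a node and
    communicate over NVLink, while pipeline and data parallelism cross nodes.
    """
    tensor = rank % TENSOR_PARALLEL
    pipeline = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    data = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return data, pipeline, tensor

print(rank_to_coords(0))    # (0, 0, 0)
print(rank_to_coords(7))    # (0, 0, 7)  -- same node, same pipeline stage
print(rank_to_coords(8))    # (0, 1, 0)  -- next pipeline stage
print(rank_to_coords(280))  # (1, 0, 0)  -- next data-parallel replica
```

In the real Megatron-LM/DeepSpeed stack, analogous process groups are constructed with torch.distributed rather than a hand-rolled mapping like this one.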
To train MT-NLG, Microsoft and NVIDIA say they built a training dataset containing 270 billion tokens (words, characters, or fragments of words) drawn from English-language websites.
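To illustrate what “tokens” means here, the snippet below uses the open-source GPT-2 BPE tokenizer from Hugging Face’s transformers library as a stand-in; MT-NLG’s actual tokenizer may differ, but the idea is the same: a single word can be split into several subword pieces, so 270 billion tokens is not the same as 270 billion words.

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Megatron-Turing NLG has 530 billion parameters."
tokens = tokenizer.tokenize(text)

# Rare or compound words are broken into subword pieces, while common
# words map to a single token.
print(tokens)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```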
MT-NLG, like any other AI model, has to learn patterns in its data, such as grammatical and syntactic rules, by ingesting a collection of samples. The majority of the data came from EleutherAI’s The Pile, an 835GB collection of 22 smaller datasets. The Pile includes academic sources (e.g., arXiv, PubMed), communities (e.g., Stack Exchange, Wikipedia), code repositories (e.g., GitHub), and more, which Microsoft and NVIDIA say they curated and combined with filtered Common Crawl snapshots.
Because of the enormous volume of content, derogatory and offensive material could not be fully removed from the dataset. Unfortunately, this means that MT-NLG can produce inappropriate, racist, or sexist outputs.
In the blog post, NVIDIA notes: “Our observations with MT-NLG are that the model picks up stereotypes and biases from the data on which it is trained.” Both Microsoft and NVIDIA say they are committed to addressing the issue.