Large language models are the latest trend in artificial intelligence. The field has seen remarkable advances in the last few months. Last year, Google unveiled Pathways, a new AI architecture that Google describes as working more like the human brain and learning faster than previous models. Previously, AI models were typically trained for a single sense, such as sight or hearing, but not both. Pathways, by contrast, is designed to let a single AI model interpret text, images, and audio. The Google Research team recently put Pathways to the test by using it to train the Pathways Language Model (PaLM), a 540-billion-parameter dense decoder-only autoregressive transformer, on 780 billion tokens of high-quality text. In “PaLM: Scaling Language Modeling with Pathways”, the team reports that PaLM achieves state-of-the-art few-shot performance on many language understanding and generation tasks.
Language models predict the next item, or token, in a text sequence based on the preceding tokens. When such a model is applied iteratively, with each predicted output fed back as input, the model is called autoregressive. Many researchers have built large autoregressive language models on the Transformer deep-learning architecture. The Transformer made it easier for models to capture context when processing text. This was a game-changer because earlier language models such as Recurrent Neural Networks (RNNs) analyzed text sequentially, so training on a vast text corpus had to proceed word by word and sentence by sentence, which took a long time. It also meant that long-range context was computationally too costly to maintain. Transformer-based models, such as BERT, instead rely on a mechanism known as attention, which uses query, key, and value projections to determine which parts of the text are most relevant in a given context, so the model learns which inputs deserve more weight than others.
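The query, key, and value mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention with a causal mask (each position sees only itself and earlier positions, as in an autoregressive decoder), not PaLM's implementation; the function names `softmax` and `causal_attention` are chosen here for illustration, and real models use learned projections and batched tensor operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i attends to positions 0..i.

    queries and keys are lists of vectors of equal dimension d;
    values is a list of vectors, one per position.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Similarity between this query and every key at or before position i.
        scores = [
            sum(qc * kc for qc, kc in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy example with two positions and two-dimensional vectors.
toy_q = [[1.0, 0.0], [0.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0]]
toy_out = causal_attention(toy_q, toy_k, toy_v)
```

The first position can only see itself, so its output equals its own value vector; the second position mixes both value vectors, with more weight on the key that matches its query.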
PaLM is based on a standard Transformer architecture, although it uses only a decoder and adds several modifications: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, shared input-output embeddings, no biases, and a SentencePiece vocabulary.
SwiGLU activations are used for the multilayer perceptron (MLP) intermediate activations, which yields considerable quality improvements over typical ReLU, GeLU, or Swish activations. A “parallel” formulation in each Transformer block, rather than the standard serialized formulation, results in roughly 15 percent faster training at large scale. Multi-query attention keeps costs down at autoregressive decoding time, and RoPE embeddings, used instead of absolute or relative position embeddings, allow for better performance on longer sequence lengths. The model additionally shares the input and output embedding matrices and, to improve training stability for large models, uses no biases in the dense kernels or layer norms. Finally, to accommodate the large number of languages in the training corpus without over-tokenization, the team adopts a SentencePiece vocabulary with 256k tokens.
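As a rough illustration of the SwiGLU activation mentioned above, the sketch below computes Swish(xW) multiplied elementwise by xV, with no bias terms. It is a toy in plain Python, not PaLM's code; the names `w_gate` and `w_up` are labels chosen here for the two weight matrices, and the final output projection of the MLP is omitted.

```python
import math

def swish(z):
    """Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu(x, w_gate, w_up):
    """SwiGLU: Swish(x @ w_gate) multiplied elementwise by (x @ w_up), no biases."""
    gate = [swish(z) for z in matvec(x, w_gate)]
    up = matvec(x, w_up)
    return [g * u for g, u in zip(gate, up)]

# Toy example using identity matrices as stand-in weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_hidden = swiglu([1.0, 2.0], identity, identity)
```

The gating branch decides how much of each unit in the other branch passes through, which is the main difference from applying a single activation such as ReLU to one projection.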
Any large language model rests on the idea of using a massive amount of human-created data to train machine learning models that replicate how people communicate. OpenAI’s GPT-3, for example, contains 175 billion parameters and was trained on 570 gigabytes of text. DeepMind’s Gopher, a 280-billion-parameter autoregressive transformer-based dense language model, was trained on MassiveText, a 10.5-terabyte dataset that includes sources such as MassiveWeb (a compilation of web pages), C4 (Common Crawl text), Wikipedia, GitHub, books, and news articles. PaLM was trained on a range of English and multilingual datasets, including high-quality web documents, books, Wikipedia articles, conversations, and GitHub code. The researchers also developed a “lossless” vocabulary that preserves all whitespace (which is critical for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
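The three properties of the “lossless” vocabulary can be illustrated with a toy pre-tokenizer. This is not SentencePiece and not PaLM's tokenizer; it is a sketch that assumes, purely for illustration, that only ASCII characters are in the vocabulary. The helper names `is_known`, `pre_tokenize`, and `detokenize` and the `<0xNN>` byte-token spelling are choices made here.

```python
import re

def is_known(ch):
    """Hypothetical vocabulary check: treat only ASCII characters as in-vocabulary."""
    return ord(ch) < 128

def pre_tokenize(text):
    """Split text into pieces without discarding any character."""
    pieces = []
    # Alternatives: a whitespace run, one digit, a run of letters, or any other single character.
    for match in re.finditer(r"\s+|\d|[^\W\d]+|.", text, flags=re.DOTALL):
        piece = match.group(0)
        if all(is_known(ch) for ch in piece):
            pieces.append(piece)
            continue
        for ch in piece:
            if is_known(ch):
                pieces.append(ch)
            else:
                # Byte fallback: spell an out-of-vocabulary character as its UTF-8 bytes.
                pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces

def detokenize(pieces):
    """Invert pre_tokenize exactly, including whitespace and byte tokens."""
    out = bytearray()
    for piece in pieces:
        m = re.fullmatch(r"<0x([0-9A-F]{2})>", piece)
        if m:
            out.append(int(m.group(1), 16))
        else:
            out.extend(piece.encode("utf-8"))
    return out.decode("utf-8")

sample = "x = 42\n    é"
sample_pieces = pre_tokenize(sample)
# sample_pieces == ['x', ' ', '=', ' ', '4', '2', '\n    ', '<0xC3>', '<0xA9>']
```

Because whitespace runs are kept as pieces and unknown characters fall back to bytes, `detokenize(pre_tokenize(text))` reproduces the input exactly, which is what makes the scheme lossless for indentation-sensitive code.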
Despite having only 5 percent code in its pre-training dataset, PaLM performs well on both coding and natural language tasks in a single model. Its few-shot performance is especially notable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code in training. This observation supports prior findings that larger models can be more sample efficient than smaller ones because they transfer learning more effectively from other programming languages and from natural language data.
PaLM’s performance can be improved further by fine-tuning it on a Python-only code dataset; the resulting model is called PaLM-Coder. With a compile rate of 82.1 percent, PaLM-Coder 540B beats the previous state of the art of 71.7 percent on a code repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully. PaLM can also break multi-step problems into parts and solve a range of elementary school-level arithmetic problems. Beyond these results, PaLM was designed in part to demonstrate Google’s capacity to harness thousands of AI processors for a single model.
PaLM outperformed other large language models on 28 of 29 English benchmarks, including TriviaQA, LAMBADA, RACE, and SuperGLUE, improving few-shot performance on language understanding and generation. These benchmarks span question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, common-sense reasoning tasks, SuperGLUE tasks, and natural language inference tasks. Furthermore, PaLM displayed remarkable natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish between cause and effect, understand conceptual combinations in appropriate contexts, and even guess a movie from an emoji. Even though only 22 percent of the training corpus is non-English, PaLM also performs well on multilingual NLP benchmarks, including translation, in addition to English NLP tasks.
PaLM also demonstrates breakthrough capabilities on reasoning problems that require multi-step arithmetic or common-sense reasoning by combining model scale with chain-of-thought prompting. On GSM8K, a benchmark of thousands of challenging grade-school-level math questions, PaLM solves 58 percent of the problems using 8-shot prompting. This outperforms the previous top score of 55 percent, which was achieved by fine-tuning the GPT-3 175B model on a training set of 7,500 problems and combining it with an external calculator and verifier. PaLM can even produce clear explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. It can, for example, give high-quality explanations for novel jokes that cannot be found on the web.
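Few-shot chain-of-thought prompting amounts to placing worked examples, each with its intermediate reasoning, in front of the new question. The sketch below shows one plausible way to assemble such a prompt as a string. The `Q:`/`A:` layout, the function name `build_cot_prompt`, and the two exemplars are invented here for illustration; they are not taken from GSM8K or from the PaLM paper, and an 8-shot prompt would simply pass eight exemplars.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot prompt whose exemplars include step-by-step reasoning.

    exemplars is a list of (question, reasoning, answer) string tuples.
    """
    blocks = []
    for q, reasoning, answer in exemplars:
        # Each worked example shows the reasoning before stating the final answer.
        blocks.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The new question ends with an open answer slot for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Invented exemplars, used only to show the prompt shape.
EXEMPLARS = [
    (
        "Tom has 3 boxes with 4 pencils each. How many pencils does he have?",
        "Each box holds 4 pencils and there are 3 boxes, so 3 * 4 = 12.",
        "12",
    ),
    (
        "A train ticket costs 5 dollars. How much do 6 tickets cost?",
        "One ticket costs 5 dollars, so 6 tickets cost 6 * 5 = 30 dollars.",
        "30",
    ),
]

prompt = build_cot_prompt(EXEMPLARS, "A shelf has 7 rows of 8 books. How many books are there?")
```

Because the exemplars model the reasoning pattern, the completion for the final `A:` tends to follow the same step-by-step form before giving an answer.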
PaLM is the first large-scale use of the Pathways system, scaling training to 6,144 chips, the largest TPU-based system configuration used for training to date. Data parallelism is used at the Pod level to scale training across two Cloud TPU v4 Pods, while standard data and model parallelism is used within each Pod. Most earlier large language models were either trained on a single TPU v3 Pod (e.g., GLaM, LaMDA), used pipeline parallelism to scale to 2,240 A100 GPUs across GPU clusters (Megatron-Turing NLG), or used multiple TPU v3 Pods (Gopher) with a maximum scale of 4,096 TPU v3 chips.
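The idea behind Pod-level data parallelism can be shown with a toy example: each replica computes a gradient on its own shard of the batch, and the gradients are then averaged before the shared weights are updated. The sketch below uses a one-parameter linear model in plain Python and stands in for, rather than reproduces, what happens across two TPU Pods; the function names and the toy data are assumptions made for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, num_replicas=2):
    """Split the batch across replicas, compute local gradients, then average them."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    shard_size = len(batch) // num_replicas
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_replicas)]
    # Each replica would run this step independently on its own hardware.
    local_grads = [grad_mse(w, shard) for shard in shards]
    # Averaging plays the role of the cross-replica gradient exchange.
    return sum(local_grads) / num_replicas

toy_batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full_grad = grad_mse(0.5, toy_batch)
parallel_grad = data_parallel_grad(0.5, toy_batch, num_replicas=2)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient up to floating-point rounding, which is why the two replicas can train as if they were one larger machine.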
PaLM achieves the highest training efficiency reported for large language models at this scale, with a hardware FLOPs utilization of 57.8 percent. This is due to a combination of the parallelism strategy and a reformulation of the Transformer block that allows the attention and feedforward layers to be computed in parallel, enabling speedups from TPU compiler optimizations.
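The difference between the serialized and parallel block formulations can be sketched structurally. In the sketch below, `attention` and `mlp` are stand-in callables supplied by the caller, and the serialized variant is written as a common pre-norm block, which is an assumption about the baseline rather than a quotation of PaLM's code. The point is only the data flow: in the parallel form both sublayers read the same normalized input, so neither has to wait for the other.

```python
def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def serial_block(x, attention, mlp):
    """Serialized form: the MLP consumes the result of the attention sublayer."""
    h = [xi + ai for xi, ai in zip(x, attention(layer_norm(x)))]
    return [hi + mi for hi, mi in zip(h, mlp(layer_norm(h)))]

def parallel_block(x, attention, mlp):
    """Parallel form: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))."""
    normed = layer_norm(x)
    # Both sublayers depend only on `normed`, so they can be computed simultaneously.
    return [xi + ai + mi for xi, ai, mi in zip(x, attention(normed), mlp(normed))]

# Stand-in sublayers, used only to exercise the data flow.
toy_attention = lambda v: [0.5 * t for t in v]
toy_mlp = lambda v: [2.0 * t for t in v]
toy_input = [1.0, 2.0, 3.0]
parallel_out = parallel_block(toy_input, toy_attention, toy_mlp)
serial_out = serial_block(toy_input, toy_attention, toy_mlp)
```

In a real model, sharing the normalized input also lets the input projections of the two sublayers be fused into one larger matrix multiplication, which is one way a compiler can turn the reformulation into a speedup.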