Over the past few years, language models such as GPT-3 and LaMDA have made headlines for their capabilities across tasks like text generation, problem-solving, and sentiment analysis. Many companies are actively exploring the domain to harness this potential. Meta is one of the most successful companies working on natural language processing (NLP) technology, specifically on systems with more than 100 billion parameters. These systems, called large language models (LLMs), can potentially transform artificial intelligence research in the language and conversation domain. In its most recent development, Meta has introduced a new large language model called OPT-IML.
As per Meta’s latest research, fine-tuning language models on pre-determined instructions can potentially enhance their performance on new tasks. However, researchers still lack a clear understanding of how instruction tuning behaves. Following this line of research, Meta created OPT-IML together with a benchmark for instruction fine-tuning of large language models while scaling both the model and the benchmark. OPT-IML, short for OPT for Instruction Meta-Learning, comes with a benchmark of roughly 2,000 NLP tasks that serves as a framework for evaluating model generalization. Using this benchmark, Meta shows how instruction tuning works for OPT-30B and then for OPT-175B.
Core Technology: Instruction Fine-Tuning
Instruction fine-tuning entails optimizing LLMs on inputs formatted as instructions for various NLP tasks. To put the technique into practice, the researchers consolidated eight existing meta-datasets into a collection of 1,991 NLP tasks. These tasks and their instructions were grouped into more than 100 task collections and then assembled into the evaluation framework shown in the image below.
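To make the idea concrete, here is a minimal sketch of how a single task instance might be turned into an instruction-formatted source/target pair; the prompt template and field names are illustrative assumptions, not the exact OPT-IML Bench format.

```python
# Minimal sketch of instruction formatting (illustrative template, not Meta's).

def format_instance(instruction: str, input_text: str, answer: str) -> dict:
    """Turn a raw task instance into a (source, target) pair for tuning."""
    source = f"{instruction}\n\nInput: {input_text}\n\nOutput:"
    return {"source": source, "target": " " + answer}

example = format_instance(
    instruction="Classify the sentiment of the review as positive or negative.",
    input_text="The battery dies within an hour of light use.",
    answer="negative",
)
print(example["source"])
print(example["target"])
```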
OPT-IML was instruction-tuned across three levels of generalizations:
- Model performance on fully-held-out task categories not used for tuning.
- Model performance on unseen tasks from task categories seen during instruction tuning (partially supervised).
- Model performance on held-out instances of tasks seen during tuning (fully supervised).
The last condition assesses the generalization of supervised multi-task learning, while the first two assess the cross-task generalization of instruction tuning. How effective instruction tuning is for LLMs depends on the diversity and distribution of the tuning tasks, the way their prompts are phrased, and the targets used for fine-tuning. The researchers found that instruction tuning is worthwhile only when it improves performance on fully held-out and partially supervised tasks without compromising fully supervised task performance.
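The sketch below illustrates how a benchmark could be partitioned for these three evaluation settings; the category and task names are hypothetical placeholders, not the actual OPT-IML Bench taxonomy.

```python
# Illustrative partition of a task benchmark into the three generalization
# settings; all names here are made-up placeholders.

benchmark = {
    "sentiment": ["imdb", "sst2", "yelp"],
    "qa": ["squad", "triviaqa", "nq"],
    "summarization": ["xsum", "cnn_dm"],
}

# 1) Fully held-out: an entire task category is excluded from tuning.
held_out_category = "summarization"

# 2) Partially supervised: the category is seen during tuning, but these tasks are not.
held_out_tasks = {"sentiment": ["yelp"], "qa": ["nq"]}

# 3) Fully supervised: the remaining tasks are tuned on, and only held-out
#    *instances* of them are evaluated (an example-level split).
tuning_tasks = {
    cat: [t for t in tasks if t not in held_out_tasks.get(cat, [])]
    for cat, tasks in benchmark.items()
    if cat != held_out_category
}
print(tuning_tasks)  # {'sentiment': ['imdb', 'sst2'], 'qa': ['squad', 'triviaqa']}
```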
From there, the researchers fine-tuned OPT with the same next-word prediction objective used during pre-training, where each token is predicted from the previously seen tokens. Each training sequence was split into a source sequence and a target sequence, and the loss terms were computed only over the target sequence. Formally, given a fine-tuning dataset D containing source instances sᵢ with corresponding target tokens tᵢ = {tᵢⱼ}, and a model pre-trained with parameters 𝛩, the researchers minimize the negative log-likelihood of the target tokens given the source: L(𝛩) = −∑ᵢ ∑ⱼ log p(tᵢⱼ | sᵢ, tᵢ,<ⱼ; 𝛩).
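As a rough illustration of this objective, the sketch below computes the loss only over target-sequence positions, assuming a decoder-only model that returns per-token logits; it is an assumption-laden simplification, not Meta's training code.

```python
# Sketch of a source-masked next-token loss (illustrative, not Meta's code).
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, labels, target_mask):
    """
    logits:      (batch, seq_len, vocab) next-token predictions
    labels:      (batch, seq_len) token ids of the concatenated source+target
    target_mask: (batch, seq_len) 1 where the token belongs to the target t_i
    """
    # Shift so position j predicts token j+1, as in standard causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = target_mask[:, 1:].float()

    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Average negative log-likelihood over target tokens only;
    # source tokens contribute no loss.
    return (nll * mask).sum() / mask.sum()

# Toy usage: batch of 1, sequence of 6 tokens, vocab of 10; the first 3 tokens
# are the source, the last 3 are the target.
logits = torch.randn(1, 6, 10)
labels = torch.randint(0, 10, (1, 6))
target_mask = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(instruction_tuning_loss(logits, labels, target_mask))
```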
In the following steps, the researchers fine-tuned all 30B models on 64 A100 Tensor Core GPUs and the 175B models on 128 A100 Tensor Core GPUs. Ultimately, each 30B model is tuned for 4,000 steps, and the 175B models for twice as many steps (8,000) with half the batch size.
The evaluation datasets comprised tasks with answer options and tasks without them. For the former category, the researchers used score-based classification, ranking each answer option by the likelihood the model assigns to it. For the latter, the researchers decode from the model until at most 256 tokens have been predicted. Examining the results, the researchers concluded that OPT-IML outperforms standard OPT models (not fine-tuned) on all benchmarks. This is a notable outcome because the prompts used for standard OPT were adapted from GPT-3 prompts that had already undergone prompt engineering, whereas OPT-IML (with instruction fine-tuning) is more robust and reduces the need for prompt engineering.
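As an illustration of the option-scoring setup, the snippet below picks the answer option with the highest model log-likelihood; option_logprobs stands in for model-computed scores and is a hypothetical input, not an OPT-IML API.

```python
# Sketch of score-based classification over answer options.

def pick_answer(option_logprobs: dict[str, float]) -> str:
    """Choose the option the model assigns the highest log-likelihood."""
    return max(option_logprobs, key=option_logprobs.get)

# Example: log-likelihoods of two candidate answers under the model.
print(pick_answer({"positive": -3.2, "negative": -1.1}))  # -> "negative"

# For tasks without answer options, the model would instead decode from the
# prompt, stopping at an end-of-sequence token or after at most 256 tokens.
```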
While the model is a significant improvement for instruction-tuning technology, it is still subject to some limitations. The instruction-tuning variables traded off in the evaluation framework may interact with one another, so different tuning settings could lead to different conclusions. Additionally, the variable trade-offs studied on 30B models may deviate from the observed trends at larger scales.
However, instruction tuning can benefit many areas going forward, including multi-task learning, meta-training, and prompting. In multi-task learning (MTL), instruction tuning enhances generalization performance on fully held-out tasks. Instruction tuning also improves prompting techniques, which have become a dominant paradigm in recent years. From here on, the researchers will continue instruction-tuning LLMs to improve their generalization abilities and find more applications for the technology.