Facebook researchers have scaled up a relatively new technique, pre-finetuning (PFT), in their MUPPET paper: multi-task learning across more than 50 tasks and roughly 4.8 million labeled instances. They showed that PFT improves both the performance and the sample efficiency of fine-tuned models such as BERT and RoBERTa, and even set new state-of-the-art results on the RTE and HellaSWAG benchmarks.
The usual workflow in large-scale language modeling is pre-training via self-supervision over massive unlabeled datasets, followed by fine-tuning on the task at hand with relatively little labeled data. This arrangement works well as long as the pre-training data and the downstream task are related. But for low-resource languages, or for tasks with very little labeled data, this training scheme leaves language models starved.
In 2019, a group of researchers introduced a pre-finetuning (PFT) stage, in a paper named ‘Tri-Train,’ that sits between pre-training and fine-tuning to overcome this problem. They constructed a small auxiliary corpus by selecting sentences from the unlabeled pre-training data that are relevant to the labeled training data. They then fine-tuned the pre-trained model on just two tasks: predicting the next word in sentences from this small corpus, and predicting the start and end words of those sentences.
Facebook’s MUPPET — Massive Multi-task Representations with Pre-Finetuning — extends this work to a new scale. The researchers used 50 diverse tasks spanning classification, summarization, question answering, and commonsense reasoning. Their investigation showed that naive multi-task learning schemes can fail to learn useful representations and are often unstable. However, their experiments also showed that scale plays a significant role in multi-task learning.
Pre-finetuning on only a few tasks actually degrades representation quality relative to the plain pre-trained model. Beyond a critical number of tasks, usually around 15, performance improves roughly linearly in the number of tasks.
The researchers used loss scaling and task-heterogeneous batches so that learning remains balanced across competing tasks, significantly improving training stability and overall performance. To train on several tasks at once, the model has one task-specific head per task, each optimizing its own task-specific loss. Each data-point loss is scaled so that, if the class distribution and the model’s predictions were both uniform, all of the task-specific losses would have equivalent values.
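One way to realize that equal-value property for cross-entropy losses is to divide each loss by log(C), since a uniform prediction over C classes has cross-entropy exactly log(C). The sketch below assumes this scaling constant (the paper states only the equal-value property, not the exact formula), so a 2-class task and a 100-class task contribute comparably to the total loss:

```python
import math

def scaled_loss(ce_loss: float, num_classes: int) -> float:
    """Scale a cross-entropy loss by 1/log(C).

    Assumption for illustration: under uniform labels and uniform model
    predictions, cross-entropy equals log(C), so the scaled value is 1.0
    for every task regardless of its number of classes.
    """
    return ce_loss / math.log(num_classes)

# Two tasks whose models both predict uniformly at random:
binary_uniform = math.log(2)    # CE of a uniform 2-class prediction
large_uniform = math.log(100)   # CE of a uniform 100-class prediction
print(scaled_loss(binary_uniform, 2))    # → 1.0
print(scaled_loss(large_uniform, 100))   # → 1.0
```

Without the scaling, the 100-class loss would dominate the binary one by a factor of log(100)/log(2) ≈ 6.6 even when both models are equally (un)informed.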
Similarly, the researchers proposed task-heterogeneous batches to optimize several potentially competing objectives and build one global representation across tasks. During gradient descent, moving along the gradient of a single task may not be the optimal direction for learning a unified representation across tasks. To overcome this, the model optimizes over batches drawn from several tasks: each worker samples a random batch from the set of tasks and computes a gradient, and these gradients are accumulated for the final update.
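The accumulation scheme can be sketched in a toy form. Everything here is illustrative, not from the paper: `task_gradient` stands in for a backward pass on one task's batch, and the "parameters" are a plain list of floats rather than a transformer:

```python
import random

def task_gradient(task_id, params):
    # Stand-in for a backward pass on one sampled batch of this task;
    # the (task_id + 1) multiplier is an arbitrary illustrative choice.
    return [(task_id + 1) * p for p in params]

def heterogeneous_step(params, tasks, num_workers=8, lr=0.01, seed=0):
    """One update: each worker samples a random task, computes a gradient
    on that task's batch, and the gradients are averaged into a single
    accumulated update to the shared parameters."""
    rng = random.Random(seed)
    accum = [0.0] * len(params)
    for _ in range(num_workers):
        task = rng.choice(tasks)          # heterogeneous sampling
        grad = task_gradient(task, params)
        accum = [a + g / num_workers for a, g in zip(accum, grad)]
    return [p - lr * a for p, a in zip(params, accum)]
```

The point of the construction is that each update direction blends several tasks' gradients, so no single objective can pull the shared representation too far on its own.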
The model also learns better representations than standard RoBERTa, with pre-finetuned models trained on 34-40 tasks giving the strongest results. The role of scale shows up in data efficiency as well: the more tasks used in pre-finetuning, the fewer labeled examples the model needs during fine-tuning.