The Facebook MUPPET Show

February 3, 2021

Facebook researchers have scaled up a relatively new technique, pre-finetuning (PFT), in their paper MUPPET, applying multi-task learning to around 50 tasks comprising more than 4.8 million labeled instances. They showed that PFT improves both the performance and the sample efficiency of pre-trained models such as RoBERTa and BART when these are later fine-tuned. They even set new records on the RTE and HellaSWAG benchmarks.

The usual workflow in large-scale language modeling is to pre-train with self-supervision on massive unlabeled datasets and then fine-tune on the task at hand with a relatively small amount of labeled data. This arrangement works well as long as the pre-training data is relevant to the downstream task. But for low-resource languages, or for individual tasks with very little labeled data, the scheme leaves language models struggling.


In 2019, a group of researchers introduced a pre-finetuning (PFT) stage, which sits between pre-training and fine-tuning, in a paper named ‘Tri-Train’ to overcome this problem. They constructed a small additional corpus by selecting sentences from the unlabeled pre-training data that were relevant to the labeled training data. They then fine-tuned the pre-trained model on just two tasks: predicting the next word in sentences from the small corpus, and predicting the start and end words of those sentences.

Facebook’s MUPPET (Massive Multi-task Representations with Pre-Finetuning) takes this idea much further. The researchers used 50 diverse tasks spanning classification, summarization, question answering, and commonsense reasoning. Their investigation showed that standard multi-task learning schemes can be unstable and can fail to learn useful representations. However, their experiments also showed that scale plays a significant role in multi-task learning.

Pre-finetuning on too few tasks hurts: up to a critical point, usually above 15 tasks, the resulting representations are worse than those of the original pre-trained model. Beyond that point, performance improves linearly with the number of tasks.

To keep learning balanced across competing tasks, the researchers used loss scaling and task-heterogeneous batches, which significantly improved training stability and overall performance. To train on several tasks at once, the model has task-specific heads, each optimizing a task-specific loss. Each data point’s loss is scaled so that, if both the class distribution and the model’s predictions were uniform, all of the task-specific losses would have the same value.
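The article does not spell out the scaling formula. One natural way to satisfy the stated property, assumed here purely for illustration, is to divide each cross-entropy loss by the logarithm of the number of classes in that task's prediction space. The function names below are hypothetical. Under uniform predictions, a binary task and a four-way task then yield the same scaled loss even though their raw losses differ.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under the predicted probabilities."""
    return -math.log(probs[label])

def scaled_loss(probs, label):
    """Cross-entropy divided by log(number of classes).

    Assumed scaling rule for illustration; requires at least two classes,
    since log(1) = 0 would cause a division by zero.
    """
    n_classes = len(probs)
    return cross_entropy(probs, label) / math.log(n_classes)

# Uniform predictions for a 2-class task and a 4-class task.
binary_uniform = [0.5, 0.5]
four_way_uniform = [0.25] * 4

raw_binary = cross_entropy(binary_uniform, 0)   # log(2)
raw_four = cross_entropy(four_way_uniform, 0)   # log(4), twice as large

scaled_binary = scaled_loss(binary_uniform, 0)
scaled_four = scaled_loss(four_way_uniform, 0)
```

Without scaling, the four-way task would contribute twice the loss of the binary task at initialization and could dominate the shared update. After scaling, both start from the same value.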

Similarly, the researchers proposed task-heterogeneous batches to optimize several potentially competing objectives and build a single global representation across tasks. During gradient descent, following the gradient of a single task may not move the model in the best direction for learning a unified representation. To overcome this, the model optimizes over batches that contain several tasks: each worker samples a random batch from the set of tasks and computes a gradient, and the gradients are accumulated for the final update.
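The accumulation scheme described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the paper's implementation: the one-parameter model, the two synthetic regression tasks, the worker count, and every function name are hypothetical. Each simulated worker draws a task at random, samples a batch from it, and computes a gradient; the gradients are averaged into a single update.

```python
import random

def mse_gradient(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def heterogeneous_step(w, tasks, n_workers, batch_size, lr, rng):
    """One update built from several tasks.

    Each worker picks a task at random, samples a batch from that task,
    and computes a gradient. The gradients are accumulated and averaged
    before the shared parameter is updated once.
    """
    accumulated = 0.0
    for _ in range(n_workers):
        task_name = rng.choice(sorted(tasks))
        batch = rng.sample(tasks[task_name], batch_size)
        accumulated += mse_gradient(w, batch)
    return w - lr * accumulated / n_workers

rng = random.Random(0)
inputs = (1.0, 2.0, 3.0, 4.0)
tasks = {
    "task_a": [(x, 1.0 * x) for x in inputs],  # on its own, prefers w = 1
    "task_b": [(x, 3.0 * x) for x in inputs],  # on its own, prefers w = 3
}

w = 0.0
for _ in range(500):
    w = heterogeneous_step(w, tasks, n_workers=8, batch_size=2, lr=0.01, rng=rng)
```

Because every update mixes gradients from both tasks, the shared parameter settles between the two task-specific optima instead of chasing whichever task was sampled last.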

The pre-finetuned model also learns better representations than standard RoBERTa when it draws on pre-finetuned models trained on 34-40 tasks. The effect of scale is evident here too: the more tasks used in pre-finetuning, the greater the data efficiency during fine-tuning.
