
With DeepSpeed, You Can Now Build 1-Trillion-Parameter NLP Models

Microsoft has enhanced its open-source deep learning optimization library DeepSpeed, enabling developers and researchers to build models with one trillion parameters. When the library was initially released on 13 February 2020, it allowed users to build models with 100 billion parameters. With the library, natural language processing (NLP) practitioners can train large models at reduced cost and compute while scaling to unlock new intelligence possibilities.

Today, many developers have embraced large NLP models as the go-to approach for building NLP products with higher accuracy and precision. Training large models, however, is not straightforward: it requires substantial computational resources for parallel processing, which drives up the cost. To mitigate these challenges, Microsoft also released the Zero Redundancy Optimizer (ZeRO) in February, a parallelized optimizer that reduces the resources needed while scaling models to more parameters.

ZeRO allowed users to train models with 100 billion parameters on existing GPU clusters at 3x-5x the throughput of earlier systems. In May, Microsoft released ZeRO-2, which further enhanced the workflow by enabling training of models with 200 billion parameters at 10x the speed of the then state-of-the-art approach.

Now, with the recent release of DeepSpeed, one can use even a single GPU to develop large models. The library adds four new system technologies that support long input sequences, high-end clusters, and low-end clusters.

Source: Microsoft Blog

With DeepSpeed, Microsoft offers a combination of three parallelism approaches: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism.
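In practice, these parallelism modes are driven by a training script that hands the model to DeepSpeed together with a configuration dictionary. The sketch below uses a toy stand-in model and placeholder values to show roughly how ZeRO-powered data parallelism is enabled; the keyword is named config in recent DeepSpeed releases (older ones accept the same dictionary as config_params), and pipeline or tensor-slicing parallelism require additional model wrapping not shown here.

```python
# Minimal sketch: wrapping a PyTorch model with DeepSpeed so that ZeRO-powered
# data parallelism is applied. The model, batch size, and learning rate are
# illustrative placeholders, not values from the article.
import torch
import deepspeed

model = torch.nn.Sequential(          # stand-in for a large Transformer
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},   # ZeRO partitions optimizer state and gradients
}

# deepspeed.initialize returns an engine that handles the distributed details.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```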

In addition, with ZeRO-Offload, NLP practitioners can train models 10x bigger by using both CPU and GPU memory. For instance, on a single NVIDIA V100 GPU, you can train a model with up to 13 billion parameters without running out of memory.
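ZeRO-Offload is switched on through the same configuration dictionary. The snippet below is a minimal sketch assuming a recent DeepSpeed release, where offloading the optimizer state is expressed with the offload_optimizer key (earlier releases used a cpu_offload flag); the batch size and learning rate are placeholders.

```python
# Illustrative ZeRO-Offload configuration: optimizer state is kept in CPU
# memory so a single GPU can hold a much larger model. Key names follow
# recent DeepSpeed releases; numeric values are placeholders.
ds_offload_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # move optimizer state to CPU RAM
    },
}
```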

For long sequences of text, image, and audio input, DeepSpeed provides sparse attention kernels, which power 10x longer sequences and 6x faster execution. In addition, its new 1-bit Adam algorithm reduces communication volume by up to 5x, making distributed training in communication-constrained scenarios up to 3.5x faster.
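Both features are likewise exposed through the DeepSpeed configuration. The sketch below shows the relevant sections as they appear in recent DeepSpeed releases; the sparse-attention pattern, block size, and freeze_step values are illustrative placeholders rather than recommendations from the article.

```python
# Illustrative config sections for the two features above: block-sparse
# attention kernels and the communication-compressed 1-bit Adam optimizer.
# All numeric values are placeholders for illustration only.
ds_feature_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "OneBitAdam",          # 1-bit Adam: compressed gradient communication
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,       # warm-up steps with uncompressed communication
            "cuda_aware": False,
        },
    },
    "sparse_attention": {
        "mode": "fixed",               # fixed block-sparse attention pattern
        "block": 16,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
    },
}
```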


Analytics Drift
Editorial team of Analytics Drift
