Tuesday, June 18, 2024
HomeDeveloperWith DeepSpeed, You Can Now Build 1-Trillion-Parameters NLP Models

With DeepSpeed, You Can Now Build 1-Trillion-Parameters NLP Models

Microsoft enhanced its open-source deep learning optimization library DeepSpeed for empowering developers or researchers to make 1-trillion-parameters models. Initially, when the library was released on 13 February 2020, it enabled uses to build 100-billion-parameters models. With the library, practitioners of natural language processing (NLP) can train large models with reduced cost and compute while scaling to unlock new intelligence possibilities.

Today, many developers have embraced large NLP models as a go-to approach for developing superior NLP products with higher accuracy and precision. However, training large models is not straightforward. It requires computational resources for parallel processing, thereby increasing the cost. To mitigate such challenges, Microsoft, in February,  also released Zero Redundancy Optimizer (ZeRO), a parallelized optimizer to reduce the need for intensive resources while scaling the models with more parameters.

ZeRO allowed users to train models with 100 billion parameters on the existing GPU clusters by 3x-5x times. In May, Microsoft released ZeRO-2 to further enhance the workflow by allowing 200 billion parameters for model training with 10x the speed of the then state-of-the-art approach.

Now, with the recent release of DeepSpeed, one can even use a single GPU for developing large models. The library includes four new system technologies that support long input sequences, high-end clusters, and low-end clusters.

Source: Microsoft Blog

With DeepSpeed, Microsoft offers a combination of three parallelism approaches—ZeRO powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. 

In addition, with ZeRO-Offload, NLP practitioners can train 10x bigger models using CPU and GPU memory. For instance, with NVIDIA V100 GPU, you can build a model up to 13 billion parameters without running out of memory.

For long sequences with text, image, and audio, DeepSpeed provides sparse attention kernels, which powers 10x longer sequences and 6x faster execution. Besides, its 1-bit Adam–a new algorithm–reduces communication volume by up to 5x. This makes the distributed training in communication-constrained scenarios 3.5x faster.

Subscribe to our newsletter

Subscribe and never miss out on such trending AI-related articles.

We will never sell your data

Join our WhatsApp Channel and Discord Server to be a part of an engaging community.

Analytics Drift
Analytics Drift
Editorial team of Analytics Drift


Please enter your comment!
Please enter your name here

Most Popular