Hugging Face recently launched Optimum, a new open-source library that aims to democratize the production performance of machine learning models. The toolkit also helps train and run models on specific hardware with maximum efficiency.
Many data-driven companies, such as Tesla, Google, and Facebook, run millions of Transformer model predictions every day: Tesla to drive in Autopilot mode, Google to complete sentences in Gmail, and Facebook to translate your posts. Transformers have brought a remarkable improvement in the accuracy of machine learning models; they have overcome many challenges in NLP and are steadily expanding into other modalities such as speech and vision.
Despite such advancements, many machine learning engineers struggle to get fast, scalable models into production. With Optimum, Hugging Face not only improves the performance of Transformer-based models but also makes it easier to target efficient AI hardware. In addition, the Optimum library helps engineers exploit the full feature set of state-of-the-art AI hardware accelerators.
Transformer-based models can be tricky and expensive to run because they require a lot of computational power. To get optimized performance during training and deployment, model acceleration methods need to be compatible with the targeted hardware. Since each hardware platform offers specific software tooling, it is essential to take advantage of advanced model acceleration methods such as sparsity and quantization. But quantizing a model requires a lot of work, as shown below (a plain-PyTorch sketch follows the list):
- Editing the model: some operations need to be replaced by their quantized counterparts, new ops need to be inserted, and the weights and activations need to be adapted.
- Optimizing the quantization: after editing, finding the best quantization for a model involves tuning many parameters, which raises questions such as:
- Which kind of observers should be used to calibrate the ranges?
- Which quantization scheme should be used?
- Which quantization data types (int8, uint8, int16) are supported by your target device?
- Balancing the tradeoff: when tuning a model, the aggressiveness of the quantization must be balanced against an acceptable loss in accuracy.
- Exporting the quantized model for the target device.
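To make the editing step concrete, here is a minimal sketch of post-training dynamic quantization in plain PyTorch (not the Optimum API): `torch.quantization.quantize_dynamic` replaces supported ops such as `nn.Linear` with their int8 counterparts. The toy model is an illustrative assumption standing in for a real Transformer sub-module.

```python
import torch
import torch.nn as nn

# A toy stand-in for a real Transformer sub-module (illustrative assumption).
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Dynamic quantization: swap nn.Linear for its int8 counterpart.
# Weights are quantized ahead of time; activations are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # which op types to replace
    dtype=torch.qint8,  # target quantization data type
)

print(quantized_model)  # the Linear layers now appear as DynamicQuantizedLinear
```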
Hugging Face pointed to Intel's Low Precision Optimization Tool (LPOT) as one approach to solving the quantization problem. LPOT is an open-source Python library that helps users deploy low-precision inference solutions; it supports post-training static quantization, post-training dynamic quantization, and quantization-aware training. To specify the quantization approach, objective, and performance criteria, the user provides a YAML configuration file along with the tuning parameters.
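As a hedged sketch, such a configuration file can look like the following; the field names follow LPOT's documented YAML schema, while the specific values (the model name, the 1% accuracy tolerance) are illustrative assumptions:

```yaml
model:
  name: bert_sst2
  framework: pytorch

quantization:
  approach: post_training_dynamic_quant  # or post_training_static_quant, quant_aware_training

tuning:
  accuracy_criterion:
    relative: 0.01  # tolerate at most a 1% relative accuracy drop
  exit_policy:
    timeout: 0      # 0 = keep tuning until the accuracy criterion is met
```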
Below are examples of how you can quantize, prune, and train Transformers for Intel Xeon CPUs with Optimum:
- Quantize: apply post-training quantization to a model (first sketch below).
- Prune: remove redundant weights to obtain a smaller, faster model (second sketch below).
- Train: fine-tune the model on the target hardware (third sketch below).
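First, quantization: a minimal sketch assuming the launch-era Optimum/LPOT interface. The class and method names (`LpotQuantizerForSequenceClassification`, `from_config`, `fit_dynamic`) and the model and config identifiers are assumptions modeled on early Optimum examples and may differ in current releases.

```python
from optimum.intel.lpot.quantization import LpotQuantizerForSequenceClassification

# Assumed launch-era API: build a quantizer from a YAML configuration
# (the model repository and config file names are illustrative placeholders).
quantizer = LpotQuantizerForSequenceClassification.from_config(
    "textattack/bert-base-uncased-SST-2",  # model to quantize
    config_name="quantization.yml",        # LPOT YAML configuration
)

# Apply post-training dynamic quantization and get the int8 model back.
quantized_model = quantizer.fit_dynamic()
```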
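Second, pruning: a sketch mirroring the quantizer above. `LpotPrunerForSequenceClassification` and `fit` are likewise assumed names, and the pruning schedule is described in its own YAML file.

```python
from optimum.intel.lpot.pruning import LpotPrunerForSequenceClassification

# Assumed launch-era API; all identifiers are illustrative placeholders.
pruner = LpotPrunerForSequenceClassification.from_config(
    "textattack/bert-base-uncased-SST-2",
    config_name="prune.yml",  # YAML describing the pruning schedule and target sparsity
)

# Run the pruning loop and return the sparsified model.
pruned_model = pruner.fit()
```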
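Third, training: the article does not spell out Optimum's training entry point, so this sketch falls back on the standard `transformers.Trainer`, which runs on Intel Xeon CPUs when CUDA is disabled; the dataset and hyperparameters are illustrative choices.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Standard transformers fine-tuning; the model, dataset, and hyperparameters
# below are illustrative choices, not Optimum-specific settings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A small slice of SST-2 keeps the example quick to run.
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(
        ex["sentence"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-sst2-cpu",
        num_train_epochs=1,
        per_device_train_batch_size=8,
        no_cuda=True,  # force CPU execution, e.g. on an Intel Xeon machine
    ),
    train_dataset=dataset,
)
trainer.train()
```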
The Hugging Face team mentioned that Optimum would focus on delivering optimal production performance on dedicated hardware, where software and hardware acceleration techniques yield maximum efficiency. Hugging Face added that it would work with its hardware partners Intel, Qualcomm, and Graphcore to scale, train, and maintain these acceleration efforts.