Static Quantization with Hugging Face `optimum` for ~3x latency improvements
notebook: optimum-static-quantization
In this session, you will learn how to apply post-training static quantization to a Hugging Face Transformers model. The session will show you how to quantize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers that provides a set of performance optimization tools to train and run models on targeted hardware with maximum efficiency.
Note: Static quantization is currently only supported for CPUs, so we will not be utilizing GPUs / CUDA in this session. By the end of this session, you will see how quantization with Hugging Face Optimum can result in a significant decrease in model latency while keeping almost 100% of the full-precision model's accuracy. Furthermore, you'll see how to easily apply some advanced quantization and calibration techniques shown here so that your models take a much smaller accuracy hit than they would otherwise.
You will learn how to:
- 1. Setup Development Environment
- 2. Convert a Hugging Face Transformers model to ONNX for inference
- 3. Configure static quantization & run Calibration of quantization ranges
- 4. Use the ORTQuantizer to apply static quantization
- 5. Test inference with the quantized model
- 6. Evaluate the performance and speed
- 7. Push the quantized model to the Hub
- 8. Load and run inference with a quantized model from the hub
Or you can immediately jump to the Conclusion.
Let's get started! 🚀
This tutorial was created and run on a c6i.xlarge AWS EC2 Instance.
1. Setup Development Environment
Our first step is to install Optimum with the onnxruntime utilities, as well as the evaluate library.
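The cell below is a minimal sketch of the installation; exact versions are not pinned here and the extras assume a recent optimum release.

```python
# In a Jupyter notebook cell; drop the leading "!" when running in a plain shell.
!pip install "optimum[onnxruntime]" evaluate datasets torch
```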
This will install all required packages including transformers, torch, and onnxruntime. If you are going to use a GPU you can install optimum with pip install optimum[onnxruntime-gpu].
2. Convert a Hugging Face Transformers model to ONNX for inference
Before we can start quantizing, we need to convert our vanilla transformers model to the onnx format. To do this we will use the new ORTModelForSequenceClassification class, calling the from_pretrained() method with the from_transformers attribute. The model we are using is optimum/distilbert-base-uncased-finetuned-banking77, a DistilBERT model fine-tuned on the Banking77 dataset achieving an Accuracy score of 92.5, and the feature (task) is text-classification.
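A minimal sketch of this conversion step is shown below. It uses the from_transformers flag mentioned above (newer optimum releases call this export=True), and the onnx_path location is just an illustrative choice.

```python
from pathlib import Path

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "optimum/distilbert-base-uncased-finetuned-banking77"
onnx_path = Path("onnx")  # local output directory (illustrative)

# Load the vanilla transformers checkpoint and convert it to ONNX on the fly
# (newer optimum releases use export=True instead of from_transformers=True).
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX model and tokenizer so the quantizer can pick them up later.
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```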
3. Configure static quantization & run Calibration of quantization ranges
Post-training static quantization, compared to dynamic quantization, not only involves converting the weights from float to int, but also requires an additional first step of feeding data through the model to compute the distributions of the different activations (calibration ranges). These distributions are then used to determine how the different activations should be quantized at inference time. Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats - and then back to ints - between every operation, resulting in a significant speed-up.
The first step is to create our quantization configuration using optimum.
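A sketch of such a configuration is shown below. The field names follow the QuantizationConfig dataclass in optimum.onnxruntime.configuration, but defaults and accepted values can differ between optimum versions, and the operator list is an illustrative choice.

```python
from onnxruntime.quantization import QuantFormat, QuantizationMode, QuantType
from optimum.onnxruntime.configuration import QuantizationConfig

# Static quantization configuration: quantize both weights and activations to int8.
qconfig = QuantizationConfig(
    is_static=True,
    format=QuantFormat.QOperator,
    mode=QuantizationMode.QLinearOps,
    activations_dtype=QuantType.QInt8,
    activations_symmetric=True,
    weights_dtype=QuantType.QInt8,
    weights_symmetric=True,
    per_channel=True,
    operators_to_quantize=["MatMul", "Add"],  # illustrative operator selection
)
```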
After we have defined our quantization configuration, we are going to use the fine-tuning dataset as calibration data to calculate the quantization parameters of the activations. The ORTQuantizer supports three calibration methods: MinMax, Entropy and Percentile. We are going to use Percentile as the calibration method. For this session we have already run hyperparameter optimization in advance to find the percentiles that achieve the highest accuracy. Therefore we used the scripts/run_static_quantizatio_hpo.py together with optuna.
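The sketch below shows how the calibration could be run. It assumes the quantizer is created from the exported ORTModel (older optimum releases instead take a model id plus a feature argument), and the percentile value is only a placeholder, not the tuned one found by the hyperparameter search.

```python
from functools import partial

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Create the quantizer from the exported ONNX model.
quantizer = ORTQuantizer.from_pretrained(model)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)

# Use a slice of the fine-tuning dataset (banking77) as calibration data.
calibration_dataset = quantizer.get_calibration_dataset(
    "banking77",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=256,          # illustrative sample count
    dataset_split="train",
)

# Percentile-based calibration; the percentile below is a placeholder.
calibration_config = AutoCalibrationConfig.percentiles(calibration_dataset, percentile=99.99)

# Feed the calibration data through the model to compute the activation ranges.
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)
```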
Finding the right calibration method and percentiles is what makes static quantization cost-intensive, since it can take multiple hours to find the right values and there is sadly no rule of thumb. If you want to learn more about it you should check out the "INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION" paper.
4. Use the ORTQuantizer to apply static quantization
After we have calculated our calibration tensor ranges we can quantize our model using the ORTQuantizer.
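A minimal sketch of this step, continuing from the quantizer, qconfig and ranges objects above; the exact method name and arguments (quantize vs. export) depend on the optimum version.

```python
# Apply the static quantization configuration using the computed calibration ranges.
# Newer optimum releases expose this as quantizer.quantize(...); older ones used
# quantizer.export(...) with explicit ONNX file paths.
model_quantized_path = quantizer.quantize(
    save_dir=onnx_path,
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)
```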
Let's quickly check the new model size.
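One quick way to compare file sizes on disk, assuming the original export lives at onnx/model.onnx and the quantized file was written with a "_quantized" suffix (the exact file name depends on the optimum version):

```python
import os

# Compare the on-disk size of the original and the quantized ONNX files.
size = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
quantized_size = os.path.getsize(onnx_path / "model_quantized.onnx") / (1024 * 1024)

print(f"Model file size:           {size:.2f} MB")
print(f"Quantized model file size: {quantized_size:.2f} MB")
```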
5. Test inference with the quantized model
Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models.
Therefore we can load our quantized model with the ORTModelForSequenceClassification class and the transformers pipeline.
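A sketch of how this could look, assuming the quantized file name from the previous step:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the quantized ONNX model; file_name points at the quantized file written above.
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    onnx_path, file_name="model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

# Standard transformers pipeline on top of the ONNX Runtime model.
clf = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
print(clf("Could you assist me in finding my lost card?"))
```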
6. Evaluate the performance and speed
We can now leverage the map function of datasets to iterate over the test split of banking77 and run a prediction for each data point. Therefore we write an evaluate helper method which uses our pipeline and applies some transformation to work with the accuracy metric.
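A sketch of such an evaluation, using the accuracy metric from evaluate; it assumes the model config contains a label2id mapping for converting the pipeline's label strings back to class ids.

```python
from datasets import load_dataset
from evaluate import load

eval_dataset = load_dataset("banking77", split="test")
accuracy = load("accuracy")

def evaluate_pipeline(example):
    # Run the quantized pipeline and map the predicted label string back to its id.
    pred = clf(example["text"])[0]["label"]
    example["prediction"] = quantized_model.config.label2id[pred]
    return example

result = eval_dataset.map(evaluate_pipeline)
metric = accuracy.compute(predictions=result["prediction"], references=result["label"])
print(f"Accuracy: {metric['accuracy'] * 100:.2f}%")
```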
Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a python loop and calculate the mean and p95 latency for our vanilla model and for the quantized model.
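A minimal latency benchmark along those lines; the payload text, warmup count and number of iterations are illustrative choices.

```python
from time import perf_counter

import numpy as np
from transformers import pipeline as hf_pipeline

# Roughly a 128-token payload after tokenization (illustrative text).
payload = "I would like to know why my card payment was declined when I tried to pay abroad. " * 6
print(f"Payload sequence length: {len(tokenizer(payload)['input_ids'])}")

def measure_latency(pipe, payload, n_runs=300, n_warmup=10):
    # Warm up so one-off initialization cost does not skew the numbers.
    for _ in range(n_warmup):
        _ = pipe(payload)
    latencies = []
    for _ in range(n_runs):
        start = perf_counter()
        _ = pipe(payload)
        latencies.append(perf_counter() - start)
    return {
        "mean_ms": 1000 * np.mean(latencies),
        "p95_ms": 1000 * np.percentile(latencies, 95),
    }

# Vanilla PyTorch pipeline for comparison.
vanilla_clf = hf_pipeline("text-classification", model=model_id)

print(f"Vanilla model:   {measure_latency(vanilla_clf, payload)}")
print(f"Quantized model: {measure_latency(clf, payload)}")
```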
We managed to accelerate our model latency from 75.69ms to 26.75ms or 2.83x while keeping 99.72% of the accuracy.
7. Push the quantized model to the Hub
The Optimum model classes like ORTModelForSequenceClassification are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your models to the Hub with the push_to_hub() method. That way we can now save our quantized model on the Hub to be, for example, used inside our inference API.
We have to make sure that we are also saving the tokenizer as well as the config.json to have a good inference experience.
If you haven't logged into the Hugging Face Hub yet, you can use notebook_login to do so.
After we have configured our Hugging Face Hub credentials we can push the model.
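One way to do this is sketched below; it logs in with notebook_login and then uploads the local ONNX directory with the huggingface_hub client rather than the Optimum push_to_hub() helper mentioned above, and the repository id is a placeholder you would replace with your own.

```python
from huggingface_hub import HfApi, notebook_login

notebook_login()  # only needed once per environment

repo_id = "<your-username>/distilbert-banking77-onnx-quantized"  # placeholder

api = HfApi()
api.create_repo(repo_id, exist_ok=True)
# Upload the whole ONNX directory: quantized model, tokenizer files and config.json.
api.upload_folder(folder_path=str(onnx_path), repo_id=repo_id)
```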
8. Load and run inference with a quantized model from the hub
This step serves as a demonstration of how you could use optimum in your API to load and use our quantized model.
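A sketch of what that could look like, using the placeholder repository id from the previous step:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

repo_id = "<your-username>/distilbert-banking77-onnx-quantized"  # placeholder

# Download the quantized model and tokenizer straight from the Hub.
model = ORTModelForSequenceClassification.from_pretrained(repo_id, file_name="model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("How long does a transfer from the UK usually take?"))
```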
Conclusion
We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 75.69ms to 26.75ms or 2.83x while keeping 99.72% of the accuracy.
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task and dataset. The challenge with static quantization is the calibration with the dataset to find the right ranges which you can use to quantize the model and achieve good performance. I ran a hyperparameter search to find the best ranges for our dataset and quantized the model using the run_static_quantizatio_hpo.py.
Also worth noting is that static quantization can, at best, only match the accuracy of dynamic quantization, but it will be faster than dynamic quantization at inference. This means it might always be a good start to first dynamically quantize your model using Optimum and then move to static quantization for further latency and throughput gains. The attached repository also includes an example of how to dynamically quantize the model in dynamic_quantization.py.
The code can be found in this repository philschmid/optimum-static-quantization
Thanks for reading! If you have any questions, feel free to contact me through GitHub, or on the forum. You can also connect with me on Twitter or LinkedIn.