Optimizing Transformers with Hugging Face Optimum
last update: 2022-11-18
In this session, you will learn how to optimize Hugging Face Transformers models using Optimum. The session will show you how to dynamically quantize and optimize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.
Note: dynamic quantization is currently only supported for CPUs, so we will not be utilizing GPUs / CUDA in this session.
By the end of this session, you will see how quantization and optimization with Hugging Face Optimum can result in a significant decrease in model latency while keeping almost 100% of the full-precision model's accuracy. Furthermore, you'll see how easy it is to apply the quantization and optimization techniques shown here so that your models take much less of an accuracy hit than they would otherwise.
You will learn how to:
- Setup Development Environment
- Convert a Hugging Face Transformers model to ONNX for inference
- Apply graph optimization techniques to the ONNX model
- Apply dynamic quantization using `ORTQuantizer` from Optimum
- Test inference with the quantized model
- Evaluate the performance and speed
- Push the quantized model to the Hub
- Load and run inference with a quantized model from the Hub
Let's get started! 🚀
This tutorial was created and run on a c6i.xlarge AWS EC2 instance.
1. Setup Development Environment
Our first step is to install Optimum, along with Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:
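A minimal install cell could look like the following; the exact package list and pinned versions differ from the original notebook:

```python
# Install Optimum with ONNX Runtime support, plus Evaluate and helper libraries.
# The package list below is an approximation of the original setup cell.
%pip install "optimum[onnxruntime]" evaluate datasets scikit-learn torch --upgrade
```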
If you want to run inference on a GPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime-gpu]`.
2. Convert a Hugging Face Transformers model to ONNX for inference
Before we can start quantizing, we need to convert our vanilla Transformers model to the ONNX format. To do this we will use the new `ORTModelForSequenceClassification` class, calling its `from_pretrained()` method with the `from_transformers=True` argument. The model we are using is `optimum/distilbert-base-uncased-finetuned-banking77`, a DistilBERT model fine-tuned on the Banking77 dataset achieving an accuracy of 92.5%, and the feature (task) is `text-classification`.
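A sketch of that export step (assuming the Optimum API from late 2022, where `from_transformers=True` triggers the conversion; newer releases use `export=True` instead):

```python
from pathlib import Path
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "optimum/distilbert-base-uncased-finetuned-banking77"
onnx_path = Path("onnx")

# Load the vanilla Transformers model and convert it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX model and tokenizer for the later optimization/quantization steps
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```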
One neat thing about 🤗 Optimum is that it allows you to run ONNX models with the `pipeline()` function from 🤗 Transformers. This means that you get all the pre- and post-processing features for free, without needing to re-implement them for each model! Here's how you can run inference with our vanilla ONNX model:
If you want to learn more about exporting Transformers models, check out the Convert Transformers to ONNX with Hugging Face Optimum blog post.
3. Apply graph optimization techniques to the ONNX model
Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. Examples of graph optimizations include:
- Constant folding: evaluate constant expressions at compile time instead of runtime
- Redundant node elimination: remove redundant nodes without changing graph structure
- Operator fusion: merge one node (i.e. operator) into another so they can be executed together
*Figure: operator fusion*
If you want to learn more about graph optimization, take a look at the ONNX Runtime documentation. We are going to first optimize the model and then dynamically quantize it, so that we can use transformer-specific operators such as QAttention for the quantization of attention layers.
To apply graph optimizations to our ONNX model, we will use the `ORTOptimizer`. The `ORTOptimizer` makes optimization easy with the help of an `OptimizationConfig`, the configuration class handling all the ONNX Runtime optimization parameters.
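A sketch of how that could look, reusing the ONNX model from step 2 (assuming the `ORTOptimizer`/`OptimizationConfig` API of recent Optimum releases; check the documentation for the options available in your version):

```python
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Create the optimizer from the ONNX model we exported earlier
optimizer = ORTOptimizer.from_pretrained(model)

# optimization_level=99 enables all available graph optimizations,
# including transformer-specific fusions
optimization_config = OptimizationConfig(optimization_level=99)

# Apply the optimizations and save the resulting model to disk
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```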
To test performance we can use the `ORTModelForSequenceClassification` class again and provide an additional `file_name` parameter to load our optimized model. (This also works for models available on the Hub.)
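For example, assuming the optimizer wrote its output as `model_optimized.onnx` into our local `onnx/` directory:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the optimized ONNX graph by pointing file_name at the optimized file
optimized_model = ORTModelForSequenceClassification.from_pretrained(
    onnx_path, file_name="model_optimized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

optimized_clf = pipeline("text-classification", model=optimized_model, tokenizer=tokenizer)
print(optimized_clf("Could you assist me in finding my lost card?"))
```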
4. Apply dynamic quantization using ORTQuantizer from Optimum
After we have optimized our model, we can accelerate it even more by quantizing it using the `ORTQuantizer`. The `ORTQuantizer` can be used to apply dynamic quantization to decrease the model size and reduce inference latency.
We use the `avx512_vnni` configuration since the instance is powered by an Intel Ice Lake CPU that supports AVX512-VNNI.
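A sketch of the quantization step, assuming the `ORTQuantizer` and `AutoQuantizationConfig.avx512_vnni()` APIs of recent Optimum releases and reusing the optimized model from step 3:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Create the quantizer from the optimized model
quantizer = ORTQuantizer.from_pretrained(optimized_model)

# Dynamic quantization config for CPUs with AVX512-VNNI support
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model and save it next to the other artifacts
quantizer.quantize(save_dir=onnx_path, quantization_config=dqconfig)
```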
Let's quickly check the new model size.
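A simple way to do that is to list the `.onnx` files we have produced so far:

```python
import os
from pathlib import Path

# Compare the on-disk size of all ONNX files produced so far
for onnx_file in Path(onnx_path).glob("*.onnx"):
    size_mb = os.path.getsize(onnx_file) / (1024 * 1024)
    print(f"{onnx_file.name}: {size_mb:.2f} MB")
```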
5. Test inference with the quantized model
Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models.
Therefore, we can load our quantized model with the `ORTModelForSequenceClassification` class and the Transformers `pipeline`.
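For example (the `file_name` below assumes the quantizer appended `_quantized` to the optimized file; adjust it to match the output of step 4):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Adjust file_name to match the quantized file produced in step 4
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    onnx_path, file_name="model_optimized_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

quantized_clf = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
print(quantized_clf("Could you assist me in finding my lost card?"))
```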
6. Evaluate the performance and speed
We can now leverage the map function of Datasets to iterate over the test split of Banking77 and run a prediction for each data point. Therefore we write an evaluate helper method which uses our pipeline and applies some transformations to work with the accuracy metric.
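A sketch of such a helper, assuming the Banking77 test split, the `accuracy` metric from Evaluate, and a model config that exposes a `label2id` mapping; it reuses the quantized pipeline from step 5:

```python
from datasets import load_dataset
from evaluate import load

eval_dataset = load_dataset("banking77", split="test")
accuracy = load("accuracy")

def evaluate_pipeline(example):
    # Run the quantized pipeline and map the predicted label string back to its class id
    prediction = quantized_clf(example["text"])[0]
    example["prediction"] = quantized_model.config.label2id[prediction["label"]]
    return example

result = eval_dataset.map(evaluate_pipeline)
print(accuracy.compute(predictions=result["prediction"], references=result["label"]))
```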
Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the average and p95 latency for our vanilla model and for the quantized model.
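A minimal benchmark sketch along those lines, reusing the `vanilla_clf` and `quantized_clf` pipelines from the earlier steps (the payload is just a repeated sentence to reach roughly 128 tokens):

```python
import time
import numpy as np

# Rough approximation of a 128-token payload
payload = "Could you assist me in finding my lost card? " * 11

def measure_latency(clf, payload, n_runs=300, warmup=10):
    # Warm up so one-off initialization cost is not measured
    for _ in range(warmup):
        clf(payload)
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        clf(payload)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    return np.mean(latencies), np.percentile(latencies, 95)

print("vanilla   (avg ms, p95 ms):", measure_latency(vanilla_clf, payload))
print("quantized (avg ms, p95 ms):", measure_latency(quantized_clf, payload))
```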
We managed to reduce our model latency from 68.4ms to 27.55ms, a 2.48x improvement, while keeping 99.72% of the accuracy.
7. Push the quantized model to the Hub
The Optimum model classes like `ORTModelForSequenceClassification` are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your models to the Hub with the `push_to_hub()` method. That way we can now save our quantized model to the Hub, for example to use it inside our inference API.
We have to make sure to also save the `tokenizer` as well as the `config.json` to have a good inference experience.
If you haven't logged into the Hugging Face Hub yet, you can use `notebook_login` to do so.
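For example:

```python
from huggingface_hub import notebook_login

# Opens an interactive widget to paste your Hugging Face access token
notebook_login()
```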
After we have configured our Hugging Face Hub credentials, we can push the model.
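A sketch of the push step; the repository name is a placeholder and the exact `push_to_hub()` signature may differ between Optimum versions:

```python
tmp_store_directory = "onnx_hub_repository"
repository_id = "quantized-distilbert-banking77"  # placeholder repository name

# Save the quantized model, tokenizer, and config locally, then push everything to the Hub
quantized_model.save_pretrained(tmp_store_directory)
tokenizer.save_pretrained(tmp_store_directory)

quantized_model.push_to_hub(tmp_store_directory, repository_id=repository_id)
```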
8. Load and run inference with a quantized model from the hub
This step serves as a demonstration of how you could use Optimum in your API to load and run inference with our quantized model.
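For example, assuming the model was pushed to a placeholder repository id like `<your-username>/quantized-distilbert-banking77`:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

repo_id = "<your-username>/quantized-distilbert-banking77"  # placeholder repository id

# Pass file_name=... if the ONNX file in the repository has a non-default name
model = ORTModelForSequenceClassification.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

remote_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(remote_clf("Could you assist me in finding my lost card?"))
```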
Conclusion
We successfully quantized our vanilla Transformers model with Hugging Face Optimum and managed to decrease our model latency from 68.4ms to 27.55ms, a 2.48x improvement, while keeping 99.72% of the accuracy.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.