Optimizing Transformers for GPUs with Optimum
In this session, you will learn how to optimize Hugging Face Transformers models for GPUs using Optimum. The session will show you how to convert your model weights to fp16 and optimize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools that enable maximum efficiency to train and run models on targeted hardware. We are going to optimize a DistilBERT model for Question Answering, fine-tuned on the SQuAD dataset, to decrease the latency from 7ms to 3ms for a sequence length of 128.
Note: int8 quantization is currently only supported for CPUs. We plan to add GPU support in the near future using TensorRT.
By the end of this session, you will know how GPU optimization with Hugging Face Optimum can result in a significant decrease in model latency and increase in throughput while keeping 100% of the full-precision model's accuracy.
You will learn how to:
- Setup Development Environment
- Convert a Hugging Face Transformers model to ONNX for inference
- Optimize the model for GPU using ORTOptimizer
- Evaluate the performance and speed
Let's get started! 🚀
This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4.
1. Setup Development Environment
Our first step is to install Optimum, along with Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:
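A minimal install cell might look like the following; the package versions are intentionally unpinned here and may need adjusting to match your CUDA setup:

```bash
pip install "optimum[onnxruntime-gpu]" transformers datasets evaluate --upgrade
```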
Note: You need a machine with a GPU and CUDA installed. You can check this by running nvidia-smi in your terminal. If your environment is set up correctly, you should see statistics about your GPU.
Before we start, let's make sure the CUDAExecutionProvider for ONNX Runtime is available.
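A quick way to verify this is to ask ONNX Runtime for its available execution providers; a minimal sketch:

```python
from onnxruntime import get_available_providers, get_device

# list the execution providers this ONNX Runtime build supports
print(get_available_providers())
print(get_device())

# fail early if the GPU provider is missing
assert "CUDAExecutionProvider" in get_available_providers(), "onnxruntime-gpu with CUDA support is required"
```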
If you want to run inference on a CPU, you can install 🤗 Optimum with pip install optimum[onnxruntime].
2. Convert a Hugging Face Transformers model to ONNX for inference
Before we can start optimizing our model we need to convert our vanilla transformers model to the onnx format. To do this we will use the new ORTModelForQuestionAnswering class, calling the from_pretrained() method with the from_transformers attribute. The model we are using is distilbert-base-cased-distilled-squad, a DistilBERT model fine-tuned on the SQuAD dataset achieving an F1 score of 87.1, with question-answering as the feature (task).
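A minimal sketch of the export, using the from_transformers flag mentioned above (newer Optimum releases call this flag export) and writing the result to a local onnx/ directory:

```python
from pathlib import Path
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer

model_id = "distilbert-base-cased-distilled-squad"
onnx_path = Path("onnx")

# load the vanilla transformers model and convert it to ONNX
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save the ONNX checkpoint and the tokenizer next to each other
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```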
Before we jump into the optimization of the model, let's first evaluate its current performance. For this we can use the pipeline() function from 🤗 Transformers, meaning we will measure the end-to-end latency including pre- and post-processing. After we have prepared our payload we can create the inference pipeline.
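A sketch of the baseline pipeline; the question/context payload below is a hypothetical placeholder, and depending on your Optimum version you may need to load the model with provider="CUDAExecutionProvider" instead of passing device=0:

```python
from transformers import pipeline

# end-to-end pipeline on GPU, including pre- and post-processing
vanilla_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)

# hypothetical payload for a quick smoke test
print(vanilla_qa(
    question="What does Hugging Face Optimum provide?",
    context="Hugging Face Optimum is an extension of Transformers providing performance optimization tools.",
))
```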
If you are seeing a CreateExecutionProviderInstance error, you do not have a compatible CUDA version installed. Check the ONNX Runtime documentation to see which CUDA version you need.
If you want to learn more about exporting Transformers models, check out the Convert Transformers to ONNX with Hugging Face Optimum blog post.
3. Optimize model for GPU using ORTOptimizer
The ORTOptimizer allows you to apply ONNX Runtime optimizations to our Transformers models. In addition to the ORTOptimizer, Optimum offers an OptimizationConfig, a configuration class handling all the ONNX Runtime optimization parameters.
There are several techniques to optimize our model for GPUs, including graph optimizations and converting our model weights from fp32 to fp16.
Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. Examples of graph optimizations include:
- Constant folding: evaluate constant expressions at compile time instead of runtime
- Redundant node elimination: remove redundant nodes without changing graph structure
- Operator fusion: merge one node (i.e. operator) into another so they can be executed together
If you want to learn more about graph optimizations, take a look at the ONNX Runtime documentation.
To achieve the best performance we will apply the following optimization parameters in our OptimizationConfig, applied in the sketch after this list:
- optimization_level=99: to enable all the optimizations. Note: switching hardware after optimization can lead to issues.
- optimize_for_gpu=True: to enable GPU optimizations.
- fp16=True: to convert model computation from fp32 to fp16. Note: only for V100 and T4 or newer.
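Putting this together, a minimal sketch with the ORTOptimizer could look like this; the exact API has shifted slightly between Optimum releases, so treat it as a sketch rather than a drop-in cell:

```python
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# create an optimizer from the ONNX model we exported earlier
optimizer = ORTOptimizer.from_pretrained(model)

# the configuration discussed above: full graph optimizations, GPU kernels, fp16 weights
optimization_config = OptimizationConfig(
    optimization_level=99,
    optimize_for_gpu=True,
    fp16=True,
)

# apply the optimizations and write the optimized model into the same directory
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```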
To test performance we can use the ORTModelForQuestionAnswering class again and provide an additional file_name parameter to load our optimized model. (This also works for models available on the Hub.)
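A sketch of loading the optimized model; the file name below assumes the default name written by the optimizer (model_optimized.onnx) and may differ in your setup:

```python
# load the optimized ONNX model from the same directory via its file name
optimized_model = ORTModelForQuestionAnswering.from_pretrained(
    onnx_path, file_name="model_optimized.onnx"
)

# build a second pipeline so we can compare it against the vanilla one
optimized_qa = pipeline("question-answering", model=optimized_model, tokenizer=tokenizer, device=0)
```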
4. Evaluate the performance and speed
As the last step, we want to take a detailed look at the performance and accuracy of our model. Applying optimization techniques, like graph optimizations or mixed precision, not only impacts performance (latency) but can also affect the accuracy of the model. So accelerating your model comes with a trade-off.
Let's evaluate our models. Our transformers model distilbert-base-cased-distilled-squad was fine-tuned on the SQuAD dataset.
We can now leverage the map function of datasets to iterate over the validation set of squad_v2 and run a prediction for each data point. For this we write an evaluate helper method which uses our pipelines and applies some transformations to work with the squad_v2 metric.
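A sketch of such an evaluate helper, assuming the squad_v2 metric from the evaluate library and the optimized_qa pipeline created above:

```python
from datasets import load_dataset
from evaluate import load

eval_dataset = load_dataset("squad_v2", split="validation")
metric = load("squad_v2")

def evaluate(example):
    # run the pipeline and reshape its output into the format the squad_v2 metric expects
    answer = optimized_qa(question=example["question"], context=example["context"])
    example["prediction"] = {
        "id": example["id"],
        "prediction_text": answer["answer"],
        "no_answer_probability": 0.0,
    }
    return example

result = eval_dataset.map(evaluate)

predictions = list(result["prediction"])
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in eval_dataset]
print(metric.compute(predictions=predictions, references=references))
```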
Okay, now let's test the performance (latency) of our optimized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the avg/mean & p95 latency for our vanilla model and for the optimized model.
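A simple benchmarking loop could look like the following; the payload is a hypothetical placeholder constructed to land roughly at a sequence length of 128:

```python
from time import perf_counter
import numpy as np

def measure_latency(qa_pipeline, payload, runs=100):
    # warm-up runs so the first, slower calls do not skew the numbers
    for _ in range(10):
        qa_pipeline(**payload)
    latencies = []
    for _ in range(runs):
        start = perf_counter()
        qa_pipeline(**payload)
        latencies.append((perf_counter() - start) * 1000)  # milliseconds
    return f"avg/mean={np.mean(latencies):.2f}ms, p95={np.percentile(latencies, 95):.2f}ms"

payload = {
    "question": "What does Hugging Face Optimum provide?",
    "context": "Hugging Face Optimum provides performance optimization tools. " * 10,
}
print("vanilla:  ", measure_latency(vanilla_qa, payload))
print("optimized:", measure_latency(optimized_qa, payload))
```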
We managed to accelerate our model latency from 7.8ms to 3.4ms or 2.3x while keeping 100.00% of the accuracy.
Conclusion
We successfully optimized our vanilla Transformers model with Hugging Face Optimum and managed to accelerate our model latency from 7.8ms to 3.4ms or 2.3x while keeping 100.00% of the accuracy.
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset.
Thanks for reading. If you have any questions, feel free to contact me, through Github, or on the forum. You can also connect with me on Twitter or LinkedIn.