# Optimizing Transformers with Hugging Face Optimum

*last update: 2022-11-18*

In this session, you will learn how to optimize Hugging Face Transformers models using Optimum. The session will show you how to dynamically quantize and optimize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.

Note: dynamic quantization is currently only supported for CPUs, so we will not be utilizing GPUs / CUDA in this session.

By the end of this session, you will see how quantization and optimization with Hugging Face Optimum can significantly decrease model latency while keeping almost 100% of the accuracy of the full-precision model. Furthermore, you'll see how to easily apply some advanced quantization and optimization techniques shown here so that your models take much less of an accuracy hit than they would otherwise.

You will learn how to:

- Setup Development Environment
- Convert a Hugging Face `Transformers` model to ONNX for inference
- Apply graph optimization techniques to the ONNX model
- Apply dynamic quantization using `ORTQuantizer` from Optimum
- Test inference with the quantized model
- Evaluate the performance and speed
- Push the quantized model to the Hub
- Load and run inference with a quantized model from the hub

Let's get started! 🚀

*This tutorial was created and run on a c6i.xlarge AWS EC2 Instance.*

## 1. Setup Development Environment

Our first step is to install Optimum, along with Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:

If you want to run inference on a GPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime-gpu]`.
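After installing, a quick sanity check can confirm which packages are actually present in the environment. The following is a minimal sketch using only the standard library; the list of distribution names is an assumption based on the libraries named above, not the exact install command from the original cell.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names for the libraries mentioned in this section.
PACKAGES = ["optimum", "transformers", "onnxruntime", "evaluate", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:<14}{ver or 'not installed'}")
```

Any package reported as not installed would need to be added before continuing with the steps below.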

## 2. Convert a Hugging Face `Transformers` model to ONNX for inference

Before we can start quantizing, we need to convert our vanilla `transformers` model to the `onnx` format. To do this, we will use the new ORTModelForSequenceClassification class, calling the `from_pretrained()` method with the `from_transformers` attribute. The model we are using is optimum/distilbert-base-uncased-finetuned-banking77, a DistilBERT model fine-tuned on the Banking77 dataset that achieves an accuracy score of `92.5`, with `text-classification` as the feature (task).

One neat thing about 🤗 Optimum is that it allows you to run ONNX models with the `pipeline()` function from 🤗 Transformers. This means that you get all the pre- and post-processing features for free, without needing to re-implement them for each model! Here's how you can run inference with our vanilla ONNX model:
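The export and the pipeline call described above could look roughly like the sketch below. The helper names `export_to_onnx` and `build_classifier` and the `onnx` output directory are hypothetical; `from_transformers=True` is the flag named in the text, and newer Optimum releases use `export=True` for the same purpose, so adjust it to the installed version.

```python
MODEL_ID = "optimum/distilbert-base-uncased-finetuned-banking77"


def export_to_onnx(model_id=MODEL_ID, save_dir="onnx"):
    """Export a Transformers checkpoint to ONNX and save it next to its tokenizer."""
    # Imported lazily so the helpers can be defined before the libraries are loaded.
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer

    # from_transformers=True converts the PyTorch checkpoint to ONNX while loading;
    # newer Optimum releases use export=True instead (version-dependent assumption).
    model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    return model, tokenizer


def build_classifier(model, tokenizer):
    """Wrap the ONNX model in a regular Transformers text-classification pipeline."""
    from transformers import pipeline

    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Calling `build_classifier(*export_to_onnx())` downloads the checkpoint, so it needs network access. The returned pipeline can then be called with a plain string, just like a PyTorch-backed pipeline.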

If you want to learn more about exporting transformers models, check out the Convert Transformers to ONNX with Hugging Face Optimum blog post.

## 3. Apply graph optimization techniques to the ONNX model

Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. Examples of graph optimizations include:

- **Constant folding**: evaluate constant expressions at compile time instead of runtime
- **Redundant node elimination**: remove redundant nodes without changing graph structure
- **Operator fusion**: merge one node (i.e. operator) into another so they can be executed together

*Figure: operator fusion.*
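To make constant folding concrete, here is a toy sketch on a hand-rolled expression tree. It is not how ONNX Runtime represents or rewrites graphs; the `Const`, `Var`, and `Op` classes are hypothetical and exist only to show the idea of precomputing sub-expressions that do not depend on the input.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str  # "add" or "mul"
    left: "Node"
    right: "Node"


Node = Union[Const, Var, Op]


def fold_constants(node):
    """Replace every sub-expression whose inputs are all constants with its value."""
    if not isinstance(node, Op):
        return node
    left, right = fold_constants(node.left), fold_constants(node.right)
    if isinstance(left, Const) and isinstance(right, Const):
        if node.kind == "add":
            return Const(left.value + right.value)
        if node.kind == "mul":
            return Const(left.value * right.value)
    return Op(node.kind, left, right)


# x * (2 + 3): the addition does not depend on the input, so it can be precomputed.
graph = Op("mul", Var("x"), Op("add", Const(2.0), Const(3.0)))
folded = fold_constants(graph)
print(folded)
```

After folding, the graph has one operator instead of two, which is the kind of saving a graph optimizer collects across a whole model.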

If you want to learn more about graph optimization, take a look at the ONNX Runtime documentation. We are going to optimize the model first and then dynamically quantize it, so that quantization can use transformers-specific operators such as QAttention for the attention layers.

To apply graph optimizations to our ONNX model, we will use the `ORTOptimizer()`. Together with an `OptimizationConfig`, the `ORTOptimizer` makes optimization easy. The `OptimizationConfig` is the configuration class handling all the ONNX Runtime optimization parameters.
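A minimal sketch of this step, assuming `model` is the ONNX model exported earlier. The helper name `optimize_onnx_model`, the `onnx` directory, and the choice of `optimization_level=99` (all available optimizations) are assumptions for illustration, not settings confirmed by the text above.

```python
def optimize_onnx_model(model, save_dir="onnx", optimization_level=99):
    """Apply ONNX Runtime graph optimizations and write the result to save_dir."""
    # Imported lazily so the helper can be defined before Optimum is loaded.
    from optimum.onnxruntime import ORTOptimizer
    from optimum.onnxruntime.configuration import OptimizationConfig

    optimizer = ORTOptimizer.from_pretrained(model)
    # Level 99 is an assumed choice; lower levels apply fewer graph rewrites.
    config = OptimizationConfig(optimization_level=optimization_level)
    optimizer.optimize(save_dir=save_dir, optimization_config=config)
    return save_dir
```

The optimized graph is written as a separate ONNX file in `save_dir`, next to the original export.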

To test performance, we can use the ORTModelForSequenceClassification class again and provide an additional `file_name` parameter to load our optimized model. *(This also works for models available on the hub).*

## 4. Apply dynamic quantization using `ORTQuantizer` from Optimum

After we have optimized our model, we can accelerate it even more by quantizing it using the `ORTQuantizer`. The `ORTQuantizer` can be used to apply dynamic quantization to reduce the model size and speed up inference.
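To see why quantization shrinks a model, the sketch below shows the underlying arithmetic in plain Python: floats are mapped to 8-bit integers with a single scale factor and mapped back with a small rounding error. This is a simplified symmetric scheme for illustration only. ONNX Runtime's actual kernels, zero points, and per-channel options differ, and in dynamic quantization the scales for activations are computed at inference time rather than in advance.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the int8 range with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.40]  # made-up example values
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(f"scale={scale:.4f}, max rounding error={max_error:.6f}")
```

Each weight now fits in one byte instead of four, and the rounding error is bounded by half the scale.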

*We use the avx512_vnni config since the instance is powered by an Intel Ice Lake CPU supporting AVX512.*
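A sketch of the quantization step with the avx512_vnni config, assuming `model_or_path` is the optimized model from the previous step. The helper name `quantize_dynamic`, the `onnx` directory, and `per_channel=False` are assumptions for illustration.

```python
def quantize_dynamic(model_or_path, save_dir="onnx"):
    """Dynamically quantize an ONNX model for CPUs with AVX512-VNNI support."""
    # Imported lazily so the helper can be defined before Optimum is loaded.
    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    quantizer = ORTQuantizer.from_pretrained(model_or_path)
    # is_static=False selects dynamic quantization, so no calibration data is needed.
    # per_channel=False is an assumed setting, not taken from the text above.
    qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    return quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
```

On a CPU without VNNI instructions, one of the other `AutoQuantizationConfig` presets would be the better fit.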

Let's quickly check the new model size.
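One simple way to do that check is to compare file sizes on disk. The sketch below uses stand-in files so it runs anywhere; the file names `model.onnx` and `model_quantized.onnx` are hypothetical and should be replaced with the real paths in your output directory.

```python
import os
import tempfile


def file_size_mb(path):
    """Size of a file in mebibytes."""
    return os.path.getsize(path) / (1024 * 1024)


def compare_sizes(original_path, quantized_path):
    """Return (original MB, quantized MB, original / quantized)."""
    original = file_size_mb(original_path)
    quantized = file_size_mb(quantized_path)
    return original, quantized, original / quantized


# Stand-in files of 4 MiB and 1 MiB; point the paths at real .onnx files instead.
with tempfile.TemporaryDirectory() as tmp:
    original_path = os.path.join(tmp, "model.onnx")
    quantized_path = os.path.join(tmp, "model_quantized.onnx")
    with open(original_path, "wb") as f:
        f.write(b"\0" * (4 * 1024 * 1024))
    with open(quantized_path, "wb") as f:
        f.write(b"\0" * (1 * 1024 * 1024))
    original_mb, quantized_mb, ratio = compare_sizes(original_path, quantized_path)

print(f"original: {original_mb:.2f} MB, quantized: {quantized_mb:.2f} MB ({ratio:.2f}x smaller)")
```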

## 5. Test inference with the quantized model

Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models. Therefore, we can load our quantized model with the `ORTModelForSequenceClassification` class and the transformers `pipeline`.

## 6. Evaluate the performance and speed

We can now leverage the map function of datasets to iterate over the evaluation split of the Banking77 dataset and run a prediction for each data point. To do this, we write an evaluate helper method that uses our pipelines and applies some transformations so the outputs work with the accuracy metric.
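The core of such an evaluation helper is small. The sketch below computes accuracy for any `predict` callable over `(text, label)` pairs; the toy predictor and the four examples are made up for illustration and stand in for the real pipeline and dataset.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs where predict(text) returns the gold label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def retained_accuracy(baseline, quantized):
    """Share of the baseline accuracy that the quantized model keeps, in percent."""
    return 100 * quantized / baseline


# Stand-in predictor; in practice this would wrap the pipeline and return its top label.
def toy_predict(text):
    return "card_arrival" if "card" in text else "exchange_rate"


# Made-up examples, not taken from the dataset.
examples = [
    ("When will my card arrive?", "card_arrival"),
    ("What is the exchange rate?", "exchange_rate"),
    ("My card is lost", "lost_or_stolen_card"),
    ("How do you set rates?", "exchange_rate"),
]

score = accuracy(toy_predict, examples)
print(f"accuracy: {score:.2f}")
```

Running the same helper once with the vanilla pipeline and once with the quantized one gives the two numbers needed for `retained_accuracy`.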

Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the mean and p95 latency for our vanilla model and for the quantized model.
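A benchmark loop along these lines could look like the following sketch. The `dummy_model` function and the payload are stand-ins so the code runs on its own, and the warmup and run counts are arbitrary assumptions; in practice `fn` would be the vanilla or quantized pipeline.

```python
import statistics
import time


def measure_latency(fn, payload, warmup=10, runs=100):
    """Return (mean_ms, p95_ms) for calling fn(payload) repeatedly."""
    # Warm-up calls are excluded so one-time setup costs do not skew the numbers.
    for _ in range(warmup):
        fn(payload)

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    mean_ms = statistics.mean(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_ms = statistics.quantiles(latencies_ms, n=100)[94]
    return mean_ms, p95_ms


def dummy_model(text):
    """Stand-in for a pipeline call."""
    return sum(ord(ch) for ch in text)


payload = "hello " * 128  # stand-in text, not an exact 128-token input
mean_ms, p95_ms = measure_latency(dummy_model, payload)
print(f"mean: {mean_ms:.4f} ms, p95: {p95_ms:.4f} ms")
```

Dividing the mean latency of the vanilla model by that of the quantized model gives the speedup factor.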

We managed to reduce our model latency from 68.4ms to 27.55ms, a 2.48x speedup, while keeping 99.72% of the accuracy.

## 7. Push the quantized model to the Hub

The Optimum model classes like `ORTModelForSequenceClassification` are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your models to the Hub with the `push_to_hub()` method. That way, we can now save our quantized model on the Hub, for example to use it inside our inference API.

*We have to make sure that we are also saving the tokenizer as well as the config.json to have a good inference experience.*

If you haven't logged into the `huggingface hub` yet, you can use `notebook_login` to do so.

After we have configured our hugging face hub credentials we can push the model.
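The push could be wrapped as in the sketch below. The helper name `push_quantized_model` is hypothetical, and the argument names passed to `push_to_hub()` reflect the Optimum release this tutorial was written against, so treat them as an assumption and check them against your installed version.

```python
def push_quantized_model(model, tokenizer, save_dir, repository_id):
    """Save the tokenizer next to the model files, then upload the folder to the Hub."""
    # The tokenizer has to live in the same folder so the repo can be loaded directly.
    tokenizer.save_pretrained(save_dir)
    # Assumed signature: push_to_hub(save_directory, repository_id=...).
    return model.push_to_hub(save_dir, repository_id=repository_id)


# Hypothetical usage after logging in (for example with huggingface_hub.notebook_login()):
# push_quantized_model(model, tokenizer, "onnx", "<your-username>/<your-repo-name>")
```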

## 8. Load and run inference with a quantized model from the hub

This step serves as a demonstration of how you could use Optimum in your API to load and use our quantized model.

## Conclusion

We successfully quantized our vanilla Transformers model with Hugging Face Optimum and managed to reduce our model latency from 68.4ms to 27.55ms, a 2.48x speedup, while keeping 99.72% of the accuracy.

Thanks for reading. If you have any questions, feel free to contact me through Github or on the forum. You can also connect with me on Twitter or LinkedIn.