Accelerate Sentence Transformers with Hugging Face Optimum
last update: 2022-11-18
In this session, you will learn how to optimize Sentence Transformers using Optimum. The session will show you how to dynamically quantize and optimize a MiniLM Sentence Transformers model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.
Note: dynamic quantization is currently only supported for CPUs, so we will not be utilizing GPUs / CUDA in this session.
By the end of this session, you will see how quantization and optimization with Hugging Face Optimum can result in a significant decrease in model latency.
You will learn how to:
- Setup Development Environment
- Convert a Sentence Transformers model to ONNX and create custom Inference Pipeline
- Apply graph optimization techniques to the ONNX model
- Apply dynamic quantization using `ORTQuantizer` from Optimum
- Test inference with the quantized model
- Evaluate the performance and speed
Let's get started! 🚀
This tutorial was created and run on a c6i.xlarge AWS EC2 Instance.
Quick intro: What are Sentence Transformers
Sentence Transformers is a Python library for state-of-the-art sentence, text, and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Sentence Transformers can be used to compute embeddings for more than 100 languages and to build solutions for semantic textual similarity, semantic search, or paraphrase mining.
1. Setup Development Environment
Our first step is to install Optimum, along with Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:
If you want to run inference on a GPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime-gpu]`.
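For a CPU-only setup like ours, the installation could look like the following. The exact package list is an assumption; pin versions as needed for reproducibility:

```shell
pip install "optimum[onnxruntime]" evaluate datasets
```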
2. Convert a Sentence Transformers model to ONNX and create custom Inference Pipeline
Before we can start quantizing, we need to convert our vanilla sentence-transformers model to the ONNX format. To do this we will use the new `ORTModelForFeatureExtraction` class, calling the `from_pretrained()` method with the `from_transformers` attribute. The model we are using is sentence-transformers/all-MiniLM-L6-v2, which maps sentences and paragraphs to a 384-dimensional dense vector space, can be used for tasks like clustering or semantic search, and was trained on a dataset of 1 billion sentence pairs.
When using sentence-transformers natively, you can run inference by loading your model in the `SentenceTransformer` class and then calling the `.encode()` method. However, this only works with PyTorch-based checkpoints, which we no longer have. To run inference using the Optimum `ORTModelForFeatureExtraction` class, we need to write some methods ourselves. Below we create a `SentenceEmbeddingPipeline` based on "How to create a custom pipeline?" from the Transformers documentation.
We can now initialize our `SentenceEmbeddingPipeline` using our `ORTModelForFeatureExtraction` model and perform inference.
If you want to learn more about exporting transformers models, check out the Convert Transformers to ONNX with Hugging Face Optimum blog post.
3. Apply graph optimization techniques to the ONNX model
Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. Examples of graph optimizations include:
- Constant folding: evaluate constant expressions at compile time instead of runtime
- Redundant node elimination: remove redundant nodes without changing graph structure
- Operator fusion: merge one node (i.e. operator) into another so they can be executed together
If you want to learn more about graph optimization, take a look at the ONNX Runtime documentation. We are going to optimize the model first and then dynamically quantize it, to be able to use transformer-specific operators such as QAttention for quantization of attention layers.
To apply graph optimizations to our ONNX model, we will use the `ORTOptimizer`. Together with an `OptimizationConfig`, the `ORTOptimizer` makes it easy to optimize the model. The `OptimizationConfig` is the configuration class handling all the ONNX Runtime optimization parameters.
To test performance, we can use the `ORTModelForFeatureExtraction` class again and provide an additional `file_name` parameter to load our optimized model. (This also works for models available on the Hub.)
4. Apply dynamic quantization using `ORTQuantizer` from Optimum
After we have optimized our model, we can accelerate it even more by quantizing it using the `ORTQuantizer`. The `ORTQuantizer` can be used to apply dynamic quantization to decrease the model size and accelerate inference latency.
We use the `avx512_vnni` config since the instance is powered by an Intel Ice Lake CPU that supports AVX512-VNNI.
Let's quickly check the new model size.
5. Test inference with the quantized model
Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models. Therefore we can load our quantized model with the `ORTModelForFeatureExtraction` class and run inference with our custom `SentenceEmbeddingPipeline`.
6. Evaluate the performance and speed
As the last step, we want to take a detailed look at the performance and accuracy of our model. Applying optimization techniques like graph optimizations or mixed precision not only impacts performance (latency) but can also affect the accuracy of the model. So accelerating your model comes with a trade-off.
We are going to evaluate our Sentence Transformers model / Sentence Embeddings on the Semantic Textual Similarity Benchmark from the GLUE dataset.
The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5.
We can now leverage the map function of datasets to iterate over the validation set of stsb and run prediction for each data point. Therefore we write an `evaluate` helper method which uses our `SentenceEmbeddingPipeline` and sentence-transformers helper methods.
The results are:
Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we will use a Python loop and calculate the mean and p95 latency for our vanilla model and for the quantized model.
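Such a loop could be sketched as the helper below; `fn` is any callable that runs inference on the payload (e.g. our embedding pipeline), and the warmup/run counts are arbitrary choices:

```python
import time
import numpy as np

def measure_latency(fn, payload, warmup=10, runs=100):
    """Simple wall-clock benchmark returning mean and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn(payload)  # warm up caches and lazy initialization
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    return {"mean_ms": float(np.mean(latencies)), "p95_ms": float(np.percentile(latencies, 95))}
```

Running it once with the vanilla pipeline and once with the quantized pipeline gives directly comparable numbers.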
The results are:
We managed to accelerate our model latency from 25.6ms to 12.3ms, a 2.09x speedup, while keeping 100% of the accuracy on the stsb dataset.
Conclusion
We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 25.6ms to 12.3ms, a 2.09x speedup, while keeping 100% of the accuracy on the stsb dataset.
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset.