Accelerate BERT inference with DeepSpeed-Inference on GPUs
In this session, you will learn how to optimize Hugging Face Transformers models for GPU inference using DeepSpeed-Inference. We will apply state-of-the-art optimization techniques and focus on single-GPU inference for BERT and RoBERTa models. Concretely, we are going to optimize a BERT-large model for token classification, fine-tuned on the conll2003 dataset, and decrease its latency from 30ms to 10ms for a sequence length of 128.
You will learn how to:
- Quick Intro: What is DeepSpeed-Inference
- 1. Setup Development Environment
- 2. Load vanilla BERT model and set baseline
- 3. Optimize BERT for GPU using DeepSpeed InferenceEngine
- 4. Evaluate the performance and speed
- Conclusion
Let's get started! 🚀
This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4 GPU.
Quick Intro: What is DeepSpeed-Inference
DeepSpeed-Inference is an extension of the DeepSpeed framework focused on inference workloads. It combines model-parallelism techniques, such as tensor and pipeline parallelism, with custom optimized CUDA kernels. DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and Hugging Face. For a list of compatible models, please see here. As mentioned, DeepSpeed-Inference integrates model-parallelism techniques, allowing you to run multi-GPU inference for LLMs, like BLOOM with its 176 billion parameters. If you want to learn more about DeepSpeed-Inference:
- Paper: DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- Blog: Accelerating large-scale model inference and training via system optimizations and compression
1. Setup Development Environment
Our first step is to install DeepSpeed, along with PyTorch, Transformers, and some other libraries. Running the following cell will install all the required packages.
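A minimal install cell could look like the following sketch; the packages are left unpinned here on purpose, so adapt the versions to your CUDA and driver setup if you run into compatibility issues.

```python
# Notebook cell: install DeepSpeed and the Hugging Face libraries used in
# this tutorial. Versions are intentionally not pinned; adapt them to your
# CUDA / driver setup.
!pip install deepspeed --upgrade
!pip install transformers datasets evaluate seqeval --upgrade
```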
Note: You need a machine with a GPU and a compatible CUDA installation. You can check this by running nvidia-smi in your terminal. If your setup is correct, you should get statistics about your GPU.
Before we start, let's make sure all packages are installed correctly.
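A quick sanity check could look like this sketch; it only verifies that the libraries import and that a CUDA device is visible.

```python
import torch
import transformers
import deepspeed

# DeepSpeed-Inference needs a CUDA-capable GPU.
assert torch.cuda.is_available(), "No CUDA device found, check your setup with nvidia-smi"

print(f"torch: {torch.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"deepspeed: {deepspeed.__version__}")
print(f"device: {torch.cuda.get_device_name(0)}")
```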
2. Load vanilla BERT model and set baseline
After we set up our environment, we create a baseline for our model. We use dslim/bert-large-NER, a BERT-large model fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset, achieving an f1 score of 95.7%. To create our baseline, we load the model with transformers and create a token-classification pipeline.
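A sketch of this baseline setup, assuming a single GPU at device 0, could look like this:

```python
from transformers import pipeline

# Model id from the Hugging Face Hub, as referenced above.
model_id = "dslim/bert-large-NER"

# Create a token-classification pipeline on the GPU (device=0).
token_classifier = pipeline("token-classification", model=model_id, device=0)

# Quick smoke test with a short example sentence.
print(token_classifier("My name is Wolfgang and I live in Berlin."))
```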
We create our baseline with evaluate, using the evaluator and the conll2003 dataset. The Evaluator class allows us to evaluate a model/pipeline on a dataset using a defined metric.
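A sketch of the evaluation could look like the following; the exact result keys (overall_f1, latency_in_seconds) are an assumption and may differ depending on your evaluate version.

```python
from datasets import load_dataset
from evaluate import evaluator

# Load the CoNLL-2003 validation split and a token-classification evaluator.
eval_dataset = load_dataset("conll2003", split="validation")
task_evaluator = evaluator("token-classification")

# Evaluate the vanilla pipeline with the seqeval metric.
results = task_evaluator.compute(
    model_or_pipeline=token_classifier,
    data=eval_dataset,
    metric="seqeval",
)
print(f"f1: {results['overall_f1'] * 100:.1f}%")
print(f"avg latency: {results['latency_in_seconds'] * 1000:.2f}ms")
```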
Our model achieves an f1 score of 95.8% on the CoNLL-2003 dataset with an average latency across the dataset of 18.9ms.
3. Optimize BERT for GPU using DeepSpeed InferenceEngine
The next and most important step is to optimize our model for GPU inference. This will be done using the DeepSpeed InferenceEngine. The InferenceEngine is initialized using the init_inference method, which expects at least the following parameters:
- model: The model to optimize.
- mp_size: The number of GPUs to use.
- dtype: The data type to use.
- replace_with_kernel_inject: Whether to inject custom kernels.
You can find more information about the init_inference method in the DeepSpeed documentation or their inference blog.
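A sketch of the optimization step could look like this; running in FP16 with kernel injection on a single GPU is an assumption that matches the T4 setup described above.

```python
import torch
import deepspeed
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the vanilla model and tokenizer again.
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize the DeepSpeed InferenceEngine.
ds_model = deepspeed.init_inference(
    model=model,                      # the model to optimize
    mp_size=1,                        # number of GPUs to use
    dtype=torch.half,                 # data type (FP16)
    replace_with_kernel_inject=True,  # inject DeepSpeed's custom kernels
)

# Inspect the optimized model graph to see the replaced layers.
print(ds_model.module)
```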
We can now inspect our model graph to see that the vanilla BertLayer has been replaced with an HFBertLayer, which includes the DeepSpeedTransformerInference module, a custom nn.Module that is optimized for inference by the DeepSpeed team.
We can validate this with a simple assert.
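A sketch of that check follows; the import path for DeepSpeedTransformerInference is an assumption and may differ between DeepSpeed versions.

```python
from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference

# Make sure at least one layer was replaced with the optimized
# DeepSpeedTransformerInference module.
assert any(
    isinstance(module, DeepSpeedTransformerInference)
    for module in ds_model.module.modules()
), "Model not successfully initialized by DeepSpeed-Inference"
```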
Now, let's run the same evaluation as for our baseline transformers model.
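One way to do this is to wrap the optimized model in a new pipeline and reuse the evaluator from above; this is a sketch of that setup, not the only possible one.

```python
from transformers import pipeline

# Wrap the optimized model in a token-classification pipeline.
ds_classifier = pipeline(
    "token-classification",
    model=ds_model.module,
    tokenizer=tokenizer,
    device=0,
)

# Reuse the evaluator and dataset from the baseline evaluation.
ds_results = task_evaluator.compute(
    model_or_pipeline=ds_classifier,
    data=eval_dataset,
    metric="seqeval",
)
print(f"f1: {ds_results['overall_f1'] * 100:.1f}%")
print(f"avg latency: {ds_results['latency_in_seconds'] * 1000:.2f}ms")
```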
Our DeepSpeed model achieves an f1 score of 95.6% on the CoNLL-2003 dataset with an average latency across the dataset of 9.33ms.
4. Evaluate the performance and speed
As the last step, we want to take a detailed look at the performance and accuracy of our optimized model. Applying optimization techniques, like graph optimizations or mixed precision, not only impacts performance (latency) but might also affect the accuracy of the model. So accelerating your model comes with a trade-off.
In our example, we achieved an f1 score of 95.8% with an average latency of 18.9ms for the vanilla model, and an f1 score of 95.6% with an average latency of 9.33ms for our optimized model, on the conll2003 evaluation dataset.
The optimized ds-model achieves 99.88% of the accuracy of the vanilla transformers model.
Now let's test the performance (latency) of our optimized model. We will use a payload with a sequence length of 128 for the benchmark. To keep it simple, we will use a python loop and calculate the mean and p95 latency for our vanilla model and the optimized model.
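A sketch of such a benchmark loop follows; the payload text is a made-up example and should be adjusted so that it tokenizes to roughly 128 tokens.

```python
import time
import numpy as np

# Example payload; repeat a sentence so it tokenizes to roughly 128 tokens.
payload = "Hello, my name is Wolfgang and I live in Berlin. " * 10

def measure_latency(pipe, payload, iterations=100):
    # Warm up before measuring.
    for _ in range(10):
        _ = pipe(payload)
    # Time the forward passes one by one.
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        _ = pipe(payload)
        latencies.append(time.perf_counter() - start)
    latencies_ms = np.array(latencies) * 1000
    return {
        "mean_ms": float(np.mean(latencies_ms)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
    }

print("vanilla:", measure_latency(token_classifier, payload))
print("deepspeed:", measure_latency(ds_classifier, payload))
```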
We managed to accelerate the BERT-Large model latency from 30.4ms to 10.40ms, or 2.92x, for a sequence length of 128.
Conclusion
We successfully optimized our BERT-large Transformers model with DeepSpeed-Inference and managed to decrease our model latency from 30.4ms to 10.4ms, or 2.92x, while keeping 99.88% of the model accuracy.
The results are impressive, but applying the optimization was as easy as adding one additional call to deepspeed.init_inference.
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check if your model is compatible with DeepSpeed-Inference.
Thanks for reading! If you have any questions, feel free to contact me, through Github, or on the forum. You can also connect with me on Twitter or LinkedIn.