In this session, you will learn how to optimize GPT-2/GPT-J for inference using Hugging Face Transformers and DeepSpeed-Inference, applying state-of-the-art optimization techniques along the way.
This session focuses on single-GPU inference for GPT-2, GPT-Neo and GPT-J style models.
By the end of this session, you will know how to optimize your Hugging Face Transformers models (GPT-2, GPT-J) using DeepSpeed-Inference. We are going to optimize GPT-J 6B for text generation.
This tutorial was created and run on a g4dn.2xlarge AWS EC2 instance with an NVIDIA T4 GPU.
Quick Intro: What is DeepSpeed-Inference
DeepSpeed-Inference is an extension of the DeepSpeed framework focused on inference workloads. It combines model parallelism techniques, such as tensor and pipeline parallelism, with custom optimized CUDA kernels.
DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and Hugging Face. For a list of compatible models, please see here.
As mentioned, DeepSpeed-Inference integrates model-parallelism techniques, allowing you to run multi-GPU inference for LLMs such as BLOOM with its 176 billion parameters.
If you want to learn more about DeepSpeed inference:
Our first step is to install DeepSpeed, along with PyTorch, Transformers, and some other libraries. Running the cell below will install all the required packages.
Note: You need a machine with a GPU and a compatible CUDA version installed. You can check this by running nvidia-smi in your terminal. If your setup is correct, you should get statistics about your GPU.
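The cell below is a minimal sketch of the install step; the package versions you end up with should match your CUDA setup, so treat the unpinned list as illustrative rather than exact requirements.

```python
# Notebook cell: install DeepSpeed, Transformers and supporting libraries.
# Versions are intentionally unpinned here; pick ones that match your CUDA installation.
!pip install torch deepspeed transformers accelerate
```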
Before we start, let's make sure all packages are installed correctly.
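A quick sanity check (a sketch, not part of the original post) is to print the installed versions and confirm PyTorch can see the GPU:

```python
import torch
import transformers
import deepspeed

# Print versions and confirm that a GPU is visible to PyTorch.
print(f"torch: {torch.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"deepspeed: {deepspeed.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```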
2. Load vanilla GPT-J model and set baseline
After setting up our environment, we create a baseline for our model. We use EleutherAI/gpt-j-6B, a 6-billion-parameter GPT-J model trained on the Pile, a large-scale curated dataset created by EleutherAI. The model was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod, as an autoregressive language model using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
To create our baseline, we load the model with transformers and run inference.
Note: We created a separate repository containing sharded fp16 weights to make it easier to load the model on machines with smaller CPU memory, using the device_map feature to automatically place the sharded checkpoints on the GPU. Learn more here.
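The snippet below is a minimal loading sketch. It pulls the public EleutherAI/gpt-j-6B checkpoint in fp16; if you use a sharded repository as mentioned in the note above, swap in its model ID.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # or a sharded fp16 mirror of the checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in fp16 and let accelerate place them on the GPU via device_map.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()
```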
Let's run some inference.
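A short generation sketch, assuming the model and tokenizer from the previous cell; the prompt text is just an illustrative placeholder.

```python
payload = "My favorite part about working with transformers is"

inputs = tokenizer(payload, return_tensors="pt").to(model.device)
# Greedy decoding of a few new tokens as a quick smoke test.
with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```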
To create a latency baseline, we use a measure_latency function, which implements a simple Python loop to run inference and calculates the average and p95 latency for our model.
We are going to use greedy search as the decoding strategy and will generate 128 new tokens, with 128 tokens as input.
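Below is a sketch of such a measure_latency helper, assuming the model and tokenizer from the cells above. The repeated payload sentence is synthetic and only serves to produce roughly 128 input tokens; the number of warmup and timed runs is an illustrative choice.

```python
import time
import numpy as np


def measure_latency(model, tokenizer, payload, generation_args, device):
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
    latencies = []
    # Warm up the GPU before measuring.
    for _ in range(2):
        _ = model.generate(input_ids, **generation_args)
    # Timed runs.
    for _ in range(10):
        start = time.perf_counter()
        _ = model.generate(input_ids, **generation_args)
        latencies.append(time.perf_counter() - start)
    avg = np.mean(latencies)
    p95 = np.percentile(latencies, 95)
    return f"avg latency: {avg:.2f}s, p95 latency: {p95:.2f}s", avg


# Greedy search, generating 128 new tokens.
generation_args = dict(do_sample=False, max_new_tokens=128)
# Synthetic payload that tokenizes to roughly 128 input tokens.
payload = "The quick brown fox jumps over the lazy dog. " * 13

vanilla_results = measure_latency(model, tokenizer, payload, generation_args, model.device)
print(vanilla_results[0])
```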
Our model achieves a latency of 8.9s for 128 tokens, or 69ms/token.
3. Optimize GPT-J for GPU using DeepSpeed's InferenceEngine
The next and most important step is to optimize our model for GPU inference. This is done with the DeepSpeed InferenceEngine, which is initialized using the init_inference method. At a minimum, init_inference expects the model to optimize; typical additional arguments are the data type (dtype), the degree of model parallelism (mp_size), and replace_with_kernel_inject to swap in DeepSpeed's optimized kernels, as shown in the sketch after the note below.
Note: You might need to restart your kernel if you are running into a CUDA OOM error.
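A minimal sketch of the call, assuming the fp16 model loaded earlier; the argument names follow DeepSpeed's init_inference API.

```python
import torch
import deepspeed

# Initialize the DeepSpeed InferenceEngine:
# - mp_size=1: single-GPU inference, no tensor parallelism
# - dtype=torch.float16: run the optimized kernels in fp16
# - replace_with_kernel_inject=True: replace compatible modules with DeepSpeed's fused kernels
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```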
We can now inspect the model graph to see that the vanilla GPTJLayer has been replaced with an HFGPTJLayer, which includes the DeepSpeedTransformerInference module, a custom nn.Module optimized for inference by the DeepSpeed team.
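One way to check that the injection worked is to print one of the transformer blocks; the sketch below assumes GPT-J's transformer.h module layout, and the import path may differ across DeepSpeed versions.

```python
from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference

# The first transformer block should now be backed by DeepSpeed's inference module.
print(ds_model.module.transformer.h[0])
assert isinstance(ds_model.module.transformer.h[0], DeepSpeedTransformerInference)
```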
4. Evaluate the performance and speed
As the last step, we want to take a detailed look at the performance of our optimized model. Applying optimization techniques, like graph optimizations or mixed precision, not only impacts performance (latency) but can also affect the accuracy of the model. So accelerating your model comes with a trade-off.
Let's test the performance (latency) of our optimized model. We will use the same generation args as for our vanilla model.
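Reusing the measure_latency helper sketched earlier, with the same payload and generation arguments as the vanilla baseline:

```python
# Same payload and generation args as the vanilla baseline, now on the optimized engine.
ds_results = measure_latency(ds_model, tokenizer, payload, generation_args, model.device)
print(ds_results[0])
```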
Our optimized DeepSpeed model achieves a latency of 6.5s for 128 tokens, or 50ms/token.
We managed to reduce the GPT-J-6B model latency from 8.9s to 6.5s for generating 128 tokens. This is an improvement from 69ms/token to 50ms/token, or 1.38x.
Conclusion
We successfully optimized our GPT-J Transformers model with DeepSpeed-Inference and decreased our model latency from 69ms/token to 50ms/token, a 1.38x improvement.
These are good results considering that applying the optimization was as easy as adding one additional call to deepspeed.init_inference.
That said, this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check whether your model is compatible with DeepSpeed-Inference.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.