Accelerate Stable Diffusion inference with DeepSpeed-Inference on GPUs
In this session, you will learn how to optimize Stable Diffusion for inference using the Hugging Face 🧨 Diffusers library and DeepSpeed-Inference. The session will show you how to apply state-of-the-art optimization techniques using DeepSpeed-Inference. This session will focus on single-GPU (Ampere generation) inference for Stable Diffusion models. By the end of this session, you will know how to optimize your Hugging Face Stable Diffusion models using DeepSpeed-Inference. We are going to optimize CompVis/stable-diffusion-v1-4 for text-to-image generation.
You will learn how to:
- Setup Development Environment
- Load vanilla Stable Diffusion model and set baseline
- Optimize Stable Diffusion for GPU using DeepSpeed's InferenceEngine
- Evaluate the performance and speed
Let's get started! 🚀
This tutorial was created and run on a g5.xlarge AWS EC2 instance including an NVIDIA A10G. The tutorial doesn't work on older GPUs, e.g. due to incompatibility of the Triton kernels.
Quick Intro: What is DeepSpeed-Inference
DeepSpeed-Inference is an extension of the DeepSpeed framework focused on inference workloads. DeepSpeed-Inference combines model parallelism techniques, such as tensor and pipeline parallelism, with custom optimized CUDA kernels. DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and HuggingFace. For a list of compatible models please see here. As mentioned, DeepSpeed-Inference integrates model-parallelism techniques, allowing you to run multi-GPU inference for LLMs, like BLOOM with its 176 billion parameters. If you want to learn more about DeepSpeed-Inference:
- Paper: DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- Blog: Accelerating large-scale model inference and training via system optimizations and compression
1. Setup Development Environment
Our first step is to install DeepSpeed, along with PyTorch, Transformers, Diffusers and some other libraries. Running the following cell will install all the required packages.
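The install cell from the original notebook isn't reproduced here, so the snippet below is a rough sketch with unpinned versions; adjust it to your CUDA setup.

```python
# Install DeepSpeed, PyTorch, Transformers, Diffusers and supporting libraries.
# Versions are unpinned here; pin the versions that match your CUDA installation.
!pip install --upgrade deepspeed torch diffusers transformers accelerate triton
```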
Note: You need a machine with a GPU and a compatible CUDA installation. You can check this by running nvidia-smi in your terminal. If your setup is correct, you should get statistics about your GPU.
Before we start, let's make sure all packages are installed correctly.
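A small sanity-check cell along these lines (the exact checks are an assumption) prints the installed versions:

```python
import torch
import deepspeed
import diffusers
import transformers

# Quick sanity check of the environment
print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```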
2. Load vanilla Stable Diffusion model and set baseline
After we set up our environment, we create a baseline for our model. We use the CompVis/stable-diffusion-v1-4 checkpoint.
Before we can load our model from the Hugging Face Hub, we have to make sure that we accepted the license of CompVis/stable-diffusion-v1-4 to be able to use it. CompVis/stable-diffusion-v1-4 is published under the CreativeML OpenRAIL-M license. You can accept the license by clicking on the Agree and access repository button on the model page at: https://huggingface.co/CompVis/stable-diffusion-v1-4.
Note: This will give access to the repository for the logged in user. This user can then be used to generate HF Tokens to load the model programmatically.
Before we can load the model, make sure you have a valid HF token. You can create a token by going to your Hugging Face Settings and clicking on the New token button. Make sure the environment has enough disk space to store the model; ~30GB should be enough.
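The loading cell isn't preserved in this export; the sketch below assumes the token is available in an HF_TOKEN environment variable and loads the checkpoint in fp16:

```python
import os
import torch
from huggingface_hub import login
from diffusers import StableDiffusionPipeline

# Log in to the Hugging Face Hub; the token is read from the HF_TOKEN environment variable here
login(token=os.environ["HF_TOKEN"])

# Load the CompVis/stable-diffusion-v1-4 pipeline in fp16 and move it to the GPU
pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipeline = pipeline.to("cuda")
```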
We can now test our pipeline and generate an image.
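A minimal generation example could look like this (the prompt is just an illustration, not necessarily the one used in the original post):

```python
# Example prompt; any text-to-image prompt works here
prompt = "a photo of an astronaut riding a horse on mars"

# Run one inference pass and save the generated image
image = pipeline(prompt).images[0]
image.save("vanilla_output.png")
```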
The next step is to create a latency baseline. For this, we use the measure_latency function, which implements a simple Python loop to run inference and calculate the mean and p95 latency for our model. We are going to use the same prompt as we used in our example.
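The original measure_latency implementation isn't shown here; below is a sketch of how it could look (the function signature, warm-up and run counts are assumptions), reusing the prompt from the example above:

```python
import time
import numpy as np

def measure_latency(pipe, prompt, num_warmup=2, num_runs=10):
    """Run the pipeline repeatedly and report mean and p95 latency in seconds."""
    # Warm-up runs so one-time setup costs do not skew the measurement
    for _ in range(num_warmup):
        _ = pipe(prompt)
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        _ = pipe(prompt)
        latencies.append(time.perf_counter() - start)
    return f"mean latency: {np.mean(latencies):.2f}s; p95 latency: {np.percentile(latencies, 95):.2f}s"

# Baseline latency of the vanilla pipeline
print(measure_latency(pipeline, prompt))
```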
Our vanilla pipeline achieves a latency of 4.57s on a single GPU. This is a good baseline for our optimization.
3. Optimize Stable Diffusion for GPU using DeepSpeed's InferenceEngine
The next and most important step is to optimize our pipeline for GPU inference. This will be done using the DeepSpeed InferenceEngine. The InferenceEngine is initialized using the init_inference method. We are going to replace the models in our pipeline, including the UNet and the CLIP model, with DeepSpeed-optimized models.
The init_inference method expects at least the following parameters:
- model: The model to optimize, in our case the whole pipeline.
- mp_size: The number of GPUs to use.
- dtype: The data type to use.
- replace_with_kernel_inject: Whether to inject custom kernels.
You can find more information about the init_inference method in the DeepSpeed documentation or their inference blog.
Note: You might need to restart your kernel if you are running into a CUDA OOM error.
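The exact call from the original post isn't preserved here; the following is a minimal sketch, assuming the whole pipeline object is passed as model and that DeepSpeed replaces the submodules in place, so we keep using pipeline afterwards:

```python
import torch
import deepspeed

# Wrap the pipeline with DeepSpeed-Inference; this replaces the UNet and CLIP
# submodules with DeepSpeed-optimized modules.
with torch.inference_mode():
    deepspeed.init_inference(
        model=pipeline,                   # the model to optimize -- here the whole pipeline (assumption)
        mp_size=1,                        # number of GPUs to use
        dtype=torch.float16,              # data type of the weights
        replace_with_kernel_inject=True,  # inject DeepSpeed's optimized kernels
    )
```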
We can now inspect the model graph to see that the vanilla UNet2DConditionModel has been replaced with a DSUNet, which includes the DeepSpeedAttention and triton_flash_attn_kernel modules, custom nn.Modules that are optimized for inference.
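As a quick check (assuming the same pipeline object as above), printing the UNet reveals the replaced modules:

```python
# Print the UNet to see the DeepSpeed-optimized modules in the graph
print(pipeline.unet)
```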
4. Evaluate the performance and speed
As the last step, we want to take a detailed look at the performance of our optimized pipeline. Applying optimization techniques, like graph optimizations or mixed precision, does not only impact performance (latency); it might also have an impact on the accuracy of the model. So accelerating your model comes with a trade-off.
Let's test the performance (latency) of our optimized pipeline. We will use the same prompt as for our vanilla model.
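Reusing the measure_latency sketch from above, the measurement is identical to the baseline run:

```python
# Latency of the DeepSpeed-optimized pipeline, same prompt as the baseline
print(measure_latency(pipeline, prompt))
```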
Our optimized DeepSpeed pipeline achieves a latency of 2.68s. This is a 1.7x improvement over our baseline. Let's take a look at the performance of our optimized pipeline.
We managed to accelerate the CompVis/stable-diffusion-v1-4 pipeline latency from 4.57s to 2.68s for generating a 512x512 image. This results in a 1.7x improvement.
Conclusion
We successfully optimized our Stable Diffusion pipeline with DeepSpeed-Inference and managed to decrease our model latency from 4.57s to 2.68s, or 1.7x.
Those are good results considering that we only needed to add one additional line of code; applying the optimization was as easy as adding a single call to deepspeed.init_inference.
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check if your model is compatible with DeepSpeed-Inference.
If you want to learn more about Stable Diffusion you should check out:
- Stable Diffusion with 🧨 Diffusers
- Stable Diffusion on Amazon SageMaker
- Stable Diffusion Image Generation under 1 second w. DeepSpeed MII
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.