Deploy FLAN-T5 XXL on Amazon SageMaker
Welcome to this guide on how to deploy FLAN-T5-XXL on Amazon SageMaker for inference. We will deploy philschmid/flan-t5-xxl-sharded-fp16 to Amazon SageMaker for real-time inference using the Hugging Face Inference Deep Learning Container.
What we are going to do
- Create FLAN-T5 XXL inference script with bnb quantization
- Create SageMaker model.tar.gz artifact
- Deploy the model to Amazon SageMaker
- Run inference using the deployed model
Quick intro: FLAN-T5, just a better T5
FLAN-T5, released with the Scaling Instruction-Finetuned Language Models paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. The paper finds that, overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
- Paper: https://arxiv.org/abs/2210.11416
- Official repo: https://github.com/google-research/t5x
Before we can get started we have to install the missing dependencies to be able to create our model.tar.gz artifact and create our Amazon SageMaker endpoint.
We also have to make sure we have the permission to create our SageMaker Endpoint.
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
1. Create FLAN-T5 XXL inference script with bnb quantization
Amazon SageMaker allows us to customize the inference script by providing an inference.py file. The inference.py file is the entry point to our model. It is responsible for loading the model and handling the inference request. If you are used to deploying Hugging Face Transformers, that might be new to you. Usually, we just provide the HF_MODEL_ID and HF_TASK and the Hugging Face DLC takes care of the rest. For FLAN-T5-XXL that's not yet possible. We have to provide the inference.py file and implement the model_fn and predict_fn functions to efficiently load the 11B-parameter model.
If you want to learn more about creating a custom inference script, you can check out Creating document embeddings with Hugging Face's Transformers & Amazon SageMaker.
In addition to the inference.py file we also have to provide a requirements.txt file. The requirements.txt file is used to install the dependencies for our inference.py file.
The first step is to create a code/ directory.
Next, we create a requirements.txt file and add the accelerate and bitsandbytes libraries to it. The accelerate library is used to efficiently load the model on the GPU. The bitsandbytes library is used to quantize the model to 8-bit using LLM.int8(). LLM.int8() introduces a new quantization technique for Int8 matrix multiplication, which cuts the memory needed for inference by half. To learn more about it, check out this blog post or the paper.
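The two steps above (creating code/ and writing requirements.txt) can be scripted as follows. This is a minimal sketch: the text names no versions, so the packages are left unpinned here, which is an assumption; pinning versions you have tested is advisable.

```python
from pathlib import Path

# Create the code/ directory that will hold inference.py and requirements.txt.
code_dir = Path("code")
code_dir.mkdir(exist_ok=True)

# The two libraries named in the text. Versions are deliberately left
# unpinned (assumption); pin them to versions you have tested.
requirements = ["accelerate", "bitsandbytes"]
(code_dir / "requirements.txt").write_text("\n".join(requirements) + "\n")
```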
The last step for our inference handler is to create the inference.py file. The inference.py file is responsible for loading the model and handling the inference request. The model_fn function is called when the model is loaded. The predict_fn function is called when we want to do inference.
We are using the AutoModelForSeq2SeqLM class from transformers to load the model from the local directory (model_dir) in the model_fn. In the predict_fn function we are using the generate function from transformers to generate the text for a given input prompt.
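A sketch of what such an inference.py could look like is shown below, written to code/inference.py from a Python script. It assumes the request body is a dict with an inputs key and an optional parameters key, and it uses the load_in_8bit=True flag of from_pretrained to enable LLM.int8(); newer transformers releases prefer passing a BitsAndBytesConfig instead, so treat the exact flag as version dependent.

```python
from pathlib import Path

# Source of the custom handler, written out as code/inference.py below.
INFERENCE_SOURCE = '''\
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    # Load the sharded fp16 weights in 8-bit (bitsandbytes) and let
    # accelerate place the layers on the available GPU(s).
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_dir, device_map="auto", load_in_8bit=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    # Assumed request shape: {"inputs": str, "parameters": {...}} (optional).
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None) or {}
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids, **parameters)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return [{"generated_text": prediction}]
'''

code_dir = Path("code")
code_dir.mkdir(exist_ok=True)
(code_dir / "inference.py").write_text(INFERENCE_SOURCE)
```

Returning the model and tokenizer together from model_fn keeps predict_fn stateless: whatever model_fn returns is passed back in as the second argument of predict_fn.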
2. Create SageMaker model.tar.gz artifact
To use our inference.py we need to bundle it together with our model weights into a model.tar.gz. The archive includes all our model artifacts to run inference. The inference.py script will be placed into a code/ folder. We will use the huggingface_hub SDK to easily download philschmid/flan-t5-xxl-sharded-fp16 from Hugging Face and then upload it to Amazon S3 with the sagemaker SDK. The model philschmid/flan-t5-xxl-sharded-fp16 is a sharded fp16 version of google/flan-t5-xxl.
Make sure the environment has enough disk space to store the model; ~30GB should be enough.
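The download step might look like the sketch below. It relies on snapshot_download from huggingface_hub, which returns the local path of the downloaded snapshot. The download_model helper and its downloader argument are hypothetical names introduced here so the copy logic can be tried without fetching the real weights.

```python
import shutil
from pathlib import Path

HF_MODEL_ID = "philschmid/flan-t5-xxl-sharded-fp16"


def download_model(repo_id, target_dir, downloader=None):
    """Download a model snapshot and copy its files into target_dir."""
    if downloader is None:
        # Imported lazily so the helper can be exercised without the SDK.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    snapshot_dir = downloader(repo_id=repo_id)
    target = Path(target_dir)
    # copytree follows symlinks by default, so files that the Hugging Face
    # cache stores as links end up as real files in target_dir.
    shutil.copytree(snapshot_dir, target, dirs_exist_ok=True)
    return target


# Usage (downloads roughly 30 GB):
# model_dir = download_model(HF_MODEL_ID, "model")
```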
The next step is to copy the code/ directory into the model/ directory.
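One way to do the copy with the standard library is sketched here. The directory names match the text; the add_code_dir helper name is hypothetical.

```python
import shutil
from pathlib import Path


def add_code_dir(code_dir="code", model_dir="model"):
    """Copy code/ into model/code/ so it ships inside the archive."""
    destination = Path(model_dir) / "code"
    # dirs_exist_ok lets the copy be re-run after editing inference.py.
    shutil.copytree(code_dir, destination, dirs_exist_ok=True)
    return destination
```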
Before we can upload the model to Amazon S3 we have to create a model.tar.gz archive. It is important that the archive contains all files directly at its root and not inside a folder. For example, your archive should look like this:
model.tar.gz/
|- config.json
|- pytorch_model-00001-of-00012.bin
|- tokenizer.json
|- ...
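A minimal sketch of building such a flat archive with Python's tarfile module follows. The compress helper name is hypothetical; the key detail is the arcname argument, which strips the parent folder so every entry sits at the archive root.

```python
import os
import tarfile


def compress(model_dir="model", tar_path="model.tar.gz"):
    """Pack the contents of model_dir at the root of a gzip-compressed tar."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for item in sorted(os.listdir(model_dir)):
            # arcname drops the model_dir prefix; directories such as code/
            # are added recursively under their own name.
            tar.add(os.path.join(model_dir, item), arcname=item)
    return tar_path
```

Keep tar_path outside model_dir, otherwise the archive would try to include itself.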
After we have created the model.tar.gz archive we can upload it to Amazon S3. We will use the sagemaker SDK to upload the model to our SageMaker session bucket.
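The upload could be sketched as below, using S3Uploader.upload from the sagemaker SDK and the session's default bucket. The flan-t5-xxl key prefix and both helper names are assumptions made for illustration; valid AWS credentials are required to actually run upload_model.

```python
def s3_destination(bucket, prefix="flan-t5-xxl"):
    """Build the S3 URI (folder) the archive is uploaded to."""
    return f"s3://{bucket}/{prefix.strip('/')}"


def upload_model(tar_path="model.tar.gz", prefix="flan-t5-xxl", session=None):
    """Upload the archive to the session bucket and return its S3 URI."""
    # Imported lazily: the sagemaker SDK and AWS credentials are only
    # needed when the upload is actually performed.
    from sagemaker import Session
    from sagemaker.s3 import S3Uploader

    session = session or Session()
    return S3Uploader.upload(
        local_path=tar_path,
        desired_s3_uri=s3_destination(session.default_bucket(), prefix),
        sagemaker_session=session,
    )


# Usage:
# s3_model_uri = upload_model("model.tar.gz")
```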
3. Deploy the model to Amazon SageMaker
After we have uploaded our model archive we can deploy our model to Amazon SageMaker. We will use HuggingFaceModel to create our real-time inference endpoint.
We are going to deploy the model to a g5.xlarge instance. The g5.xlarge instance is a GPU instance with 1 NVIDIA A10G GPU. If you are interested in how you could add autoscaling to your endpoint, you can check out Going Production: Auto-scaling Hugging Face Transformers with Amazon SageMaker.
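A deployment along these lines might look like the following sketch. The container versions (transformers 4.26, PyTorch 1.13, Python 3.9) are assumptions chosen as one combination for which a Hugging Face DLC exists; adjust them to what your SDK version supports. SageMaker instance types carry an ml. prefix, hence ml.g5.xlarge. The deploy_flan_t5 helper and its model_cls argument are hypothetical.

```python
def deploy_flan_t5(model_data, role, instance_type="ml.g5.xlarge", model_cls=None):
    """Create the model from the S3 archive and deploy a real-time endpoint."""
    if model_cls is None:
        # Imported lazily so the sketch can be read and tested without the SDK.
        from sagemaker.huggingface import HuggingFaceModel
        model_cls = HuggingFaceModel

    model = model_cls(
        model_data=model_data,        # S3 URI of model.tar.gz
        role=role,                    # IAM role with SageMaker permissions
        transformers_version="4.26",  # assumed DLC version
        pytorch_version="1.13",       # assumed DLC version
        py_version="py39",            # assumed DLC version
        model_server_workers=1,       # one worker, so the model loads once
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# Usage:
# predictor = deploy_flan_t5(s3_model_uri, role)
```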
4. Run inference using the deployed model
The .deploy() method returns a HuggingFacePredictor object, which can be used to request inference using the .predict() method. Our endpoint expects a JSON payload with at least an inputs key.
When using generative models, most of the time you want to configure or customize your prediction to fit your needs, for example by using beam search, configuring the max or min length of the generated sequence, or adjusting the temperature to reduce repetition. The Transformers library provides different strategies and kwargs to do this; the Hugging Face Inference Toolkit offers the same functionality through the parameters attribute of your request payload. Below you can find examples on how to generate text without parameters, with beam search, and using custom configurations. If you want to learn about different decoding strategies, check out this blog post.
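The three request styles can be sketched as plain payload dictionaries. The prompt and the parameter values are illustrative assumptions, not tuned settings; the keys under parameters are standard generate keyword arguments. build_payload is a hypothetical helper.

```python
import json


def build_payload(inputs, **parameters):
    """Build a request body: "inputs" is required, "parameters" is optional."""
    payload = {"inputs": inputs}
    if parameters:
        payload["parameters"] = parameters
    return payload


prompt = "Summarize: SageMaker hosts models behind managed HTTPS endpoints."

# 1. No parameters: the model's default generation settings are used.
plain = build_payload(prompt)

# 2. Beam search.
beam = build_payload(prompt, num_beams=4, max_length=50, early_stopping=True)

# 3. Custom sampling configuration.
sampled = build_payload(
    prompt, do_sample=True, temperature=0.7, top_p=0.9, max_length=50
)

# Each payload is JSON-serializable and can be sent with:
# predictor.predict(beam)
print(json.dumps(beam, indent=2))
```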
Let's try another example! This time we focus on question answering with a step-by-step approach, including some simple math.
Delete model and endpoint
To clean up, we can delete the model and endpoint.
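A small sketch of the cleanup, assuming predictor is the object returned by .deploy(). Both methods exist on the SageMaker predictor; the model is deleted first because, as far as I can tell, delete_model looks the model up through the endpoint configuration, which delete_endpoint removes by default. The cleanup helper name is hypothetical.

```python
def cleanup(predictor):
    """Delete the SageMaker model, then the endpoint (and its config)."""
    predictor.delete_model()
    predictor.delete_endpoint()


# Usage:
# cleanup(predictor)
```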
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.