Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker
This is an example of how to deploy an open-source LLM from Amazon S3 to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container. The new Hugging Face LLM Inference Container makes it easy to deploy LLMs by simply providing an HF_MODEL_ID pointing to a Hugging Face repository; the container takes care of the rest.
But for some workloads you cannot load the model from the Hugging Face Hub and need to load it from Amazon S3 instead, because your endpoint has no internet access.
This example demonstrates how to deploy an open-source LLM from Amazon S3 to Amazon SageMaker using the new Hugging Face LLM Inference Container. We are going to deploy HuggingFaceH4/starchat-beta.
The example covers:
- Setup development environment
- Upload the model to Amazon S3
- Retrieve the new Hugging Face LLM DLC
- Deploy Starchat-beta to Amazon SageMaker
- Test the model and run inference
- Clean up
If you want to learn more about the Hugging Face LLM Inference DLC, check out the introduction here. Let's get started!
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy HuggingFaceH4/starchat-beta to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
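A minimal setup sketch, assuming the sagemaker and boto3 packages are installed; the fallback role name sagemaker_execution_role is a placeholder for local environments:

```python
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used later for uploading the model archive
sagemaker_session_bucket = sess.default_bucket()

try:
    # works when running inside SageMaker (Studio / notebook instances)
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look up the role ARN by name (placeholder name)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```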
2. Upload the model to Amazon S3
To deploy our model from Amazon S3, we need to bundle the model weights into a model.tar.gz archive. The Hugging Face LLM Inference DLC uses safetensors weights to load models. Due to the filesystem constraints of /opt/ml/model, we need to make sure the model we want to deploy has *.safetensors weights available.
If a model, e.g. google/flan-ul2, has no safetensors weights available, we can use the safetensors/convert_large Space to create them. The Space opens a PR on the original repository with the safetensors weights, which means we can then use the safetensors weights from the original repository via the revision parameter.
Note: Depending on the model size, the conversion can take ~10 minutes.
Alternatively, you can save the weights directly to safetensors with model.save_pretrained(..., safe_serialization=True), as long as the safetensors package is installed during your training.
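For example, a minimal sketch (the model paths are placeholders):

```python
from transformers import AutoModelForCausalLM

# load (or finish training) your model, then save it with safetensors serialization
model = AutoModelForCausalLM.from_pretrained("path/to/your/model")  # placeholder path
model.save_pretrained("path/to/output", safe_serialization=True)
```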
The model.tar.gz archive includes all the model artifacts needed to run inference. We will use the huggingface_hub SDK to download HuggingFaceH4/starchat-beta from Hugging Face and then upload it to Amazon S3 with the sagemaker SDK. Make sure the environment has enough disk space to store the model; ~35GB should be enough.
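A sketch of the download step with huggingface_hub; the local directory name and the file patterns (used to skip the duplicate *.bin weights) are assumptions:

```python
from huggingface_hub import snapshot_download

model_id = "HuggingFaceH4/starchat-beta"

# download config, tokenizer and safetensors weights only (skip the *.bin duplicates)
model_dir = snapshot_download(
    repo_id=model_id,
    local_dir="starchat-beta",  # assumed local directory
    allow_patterns=["*.json", "*.safetensors", "*.txt", "*.model"],
)
print(f"model downloaded to: {model_dir}")
```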
It is important that the archive directly contains all files, not a folder that contains the files. For example, your archive should look like this:
model.tar.gz/
|- config.json
|- model-00001-of-00005.safetensors
|- tokenizer.json
|- ...
We are using pigz to parallelize the archiving. Note: you might need to install it, e.g. apt install pigz.
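A sketch of the archiving step, driven from Python so the walkthrough stays in one language; it assumes the model was downloaded to the starchat-beta directory used above:

```python
import subprocess

# create model.tar.gz with pigz; running tar inside the model directory ensures
# the files end up at the root of the archive, not inside a folder
subprocess.run(
    "tar -cf ../model.tar.gz --use-compress-program=pigz *",
    shell=True,
    check=True,
    cwd="starchat-beta",
)
```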
After we have created the model.tar.gz archive, we can upload it to Amazon S3. We will use the sagemaker SDK to upload the model to our sagemaker session bucket.
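A sketch of the upload with the sagemaker SDK's S3Uploader; the starchat-beta prefix is an assumption:

```python
from sagemaker.s3 import S3Uploader

# upload model.tar.gz to the sagemaker session bucket
s3_model_uri = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri=f"s3://{sess.default_bucket()}/starchat-beta",
)
print(f"model uploaded to: {s3_model_uri}")
```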
3. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
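A sketch of retrieving the image URI; the version string is an assumption and may need to be updated to a currently available release:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri for the huggingface (TGI) backend
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2",  # assumed version, check the available releases
)
print(f"llm image uri: {llm_image}")
```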
4. Deploy Starchat-beta to Amazon SageMaker
To deploy HuggingFaceH4/starchat-beta to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the hf_model_id, instance_type, etc. We will use a g5.12xlarge instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.
After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.12xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.
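A sketch of the model and endpoint configuration; the environment values (sequence lengths, health-check timeout) are assumptions you may need to tune for your model:

```python
import json
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300  # give SageMaker time to load the model from /opt/ml/model

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=s3_model_uri,  # path to our model.tar.gz in Amazon S3
    env={
        "HF_MODEL_ID": "/opt/ml/model",            # load the model from the local path
        "SM_NUM_GPUS": json.dumps(number_of_gpu),  # number of GPUs to shard across
        "MAX_INPUT_LENGTH": json.dumps(1024),      # assumed max input length
        "MAX_TOTAL_TOKENS": json.dumps(2048),      # assumed max total tokens
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
```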
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
5. Test the model and run inference
After our endpoint is deployed, we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We can run inference with different parameters to influence the generation; parameters can be defined in the parameters attribute of the payload. You can find a list of parameters in the announcement blog post or as part of the swagger documentation.
The starchat-beta model is a conversational model for answering coding questions, which we can prompt by simply asking our question using the following template:
<|system|>\n You are an Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>
Let's give it a first try and ask how to filter a list in Python:
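A sketch of a first request against the endpoint, using the chat template above (the example query is made up):

```python
query = "How can I filter a list in Python?"  # example question
prompt = f"<|system|>\n You are an Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"

res = llm.predict({"inputs": prompt})
print(res[0]["generated_text"])
```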
Now we will run inference with different parameters to influence the generation. Parameters can be defined in the parameters attribute of the payload. This can be used, for example, to have the model stop generating after the bot's turn.
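A sketch using generation parameters; the parameter values are assumptions, and the stop sequence matches the <|end|> token from the chat template:

```python
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.95,          # assumed sampling parameters
        "temperature": 0.2,
        "max_new_tokens": 256,
        "stop": ["<|end|>"],    # stop generating after the bot's turn
    },
}

res = llm.predict(payload)
print(res[0]["generated_text"])
```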
Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now it's time for you to try it out yourself and build Generative AI applications with the new Hugging Face LLM DLC on Amazon SageMaker.
6. Clean up
To clean up, we can delete the model and endpoint.
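For example:

```python
# delete the model and endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```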
Conclusion
We successfully deployed StarChat from Amazon S3 to Amazon SageMaker without internet access or the need to access any external system. This will allow companies with strict security requirements to deploy LLMs to Amazon SageMaker inside their VPCs in an easy, secure and internet-free way.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.