Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker

June 20, 2023

This is an example of how to deploy open-source LLMs from Amazon S3 to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container. The container makes it easy to deploy LLMs: you provide an HF_MODEL_ID pointing to the Hugging Face repository and the container takes care of the rest. For some workloads, however, you cannot load the model from the Hugging Face Hub and need to load it from Amazon S3 instead, because your endpoint has no internet access.

This example demonstrates how to deploy an open-source LLM from Amazon S3 to Amazon SageMaker using the new Hugging Face LLM Inference Container. We are going to deploy HuggingFaceH4/starchat-beta.

The example covers:

  1. Setup development environment
  2. Upload the model to Amazon S3
  3. Retrieve the new Hugging Face LLM DLC
  4. Deploy Starchat-beta to Amazon SageMaker
  5. Test the model and run inference
  6. Clean up

If you want to learn more about the Hugging Face LLM Inference DLC, check out the introduction here. Let's get started!

1. Setup development environment

We are going to use the sagemaker Python SDK to deploy HuggingFaceH4/starchat-beta to Amazon SageMaker. We need an AWS account configured and the sagemaker Python SDK installed.

!pip install "sagemaker==2.175.0" "huggingface_hub" "hf-transfer" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
try:
    # works inside SageMaker (notebook instances, Studio)
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look up the execution role by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Upload the model to Amazon S3

To deploy our model from Amazon S3 we need to bundle the model files together with the model weights into a model.tar.gz archive. The Hugging Face LLM Inference DLC uses safetensors weights to load models. Due to the filesystem constraints of /opt/ml/model, we need to make sure the model we want to deploy has *.safetensors weights available. If a model, e.g. google/flan-ul2, has no safetensors weights available, we can use the safetensors/convert_large Space to create them. The Space opens a PR on the original repository with the safetensors weights, which means we can load them from the original repository via the revision parameter. Note: Depending on the model size, the conversion can take ~10 minutes.

Alternatively, you can save directly to safetensors during training with model.save_pretrained(..., safe_serialization=True), provided safetensors is installed.

The model.tar.gz archive includes all the model artifacts needed to run inference. We will use the huggingface_hub SDK to download HuggingFaceH4/starchat-beta from Hugging Face and then upload it to Amazon S3 with the sagemaker SDK.

Make sure the environment has enough disk space to store the model; ~35GB should be enough.
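As a quick pre-flight check before downloading, the free disk space can be inspected from Python. This is a minimal sketch using only the standard library; the helper name has_enough_disk_space and the REQUIRED_GB constant are illustrative, and the 35 GB figure is simply the estimate quoted above.

```python
import shutil

# Estimate quoted in the text for starchat-beta; adjust for other models.
REQUIRED_GB = 35


def has_enough_disk_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if not has_enough_disk_space("."):
    print(f"Warning: less than {REQUIRED_GB} GB free, the download may fail")
```

Keep in mind that the archive created later is written next to the downloaded files, so some extra headroom beyond the raw model size is advisable.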

from pathlib import Path
import os
# set HF_HUB_ENABLE_HF_TRANSFER env var to enable hf-transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
HF_MODEL_ID = "HuggingFaceH4/starchat-beta"
# create model dir
model_tar_dir = Path(HF_MODEL_ID.split("/")[-1])
model_tar_dir.mkdir(exist_ok=True)
# Download model from Hugging Face into model_dir
snapshot_download(
    HF_MODEL_ID,
    local_dir=str(model_tar_dir), # download to model dir
    revision="main", # use a specific revision, e.g. refs/pr/21
    local_dir_use_symlinks=False, # use no symlinks to save disk space
    ignore_patterns=["*.msgpack*", "*.h5*", "*.bin*"], # to load safetensor weights
)
# check if safetensor weights are downloaded and available
assert len(list(model_tar_dir.glob("*.safetensors"))) > 0, "Model download failed"

It is important that the archive directly contains all files rather than a folder holding the files. For example, your archive should look like this:

|- config.json
|- model-00001-of-00005.safetensors
|- tokenizer.json
|- ...

We are using pigz to parallelize the archiving. Note: you might need to install it, e.g. apt install pigz.

parent_dir = os.getcwd()
# change to model dir
os.chdir(str(model_tar_dir))
# use pigz for faster and parallel compression
!tar -cf model.tar.gz --use-compress-program=pigz *
# change back to parent dir
os.chdir(parent_dir)
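If pigz is not available, or you want to double-check the layout before uploading, the same flat archive can be sketched with Python's standard tarfile module. The helpers create_flat_archive and archive_is_flat are illustrative names, not part of any SDK, and the demo uses tiny stand-in files instead of real model weights. Expect this to be slower than pigz for a multi-GB model, since it compresses on a single core.

```python
import tarfile
import tempfile
from pathlib import Path


def create_flat_archive(model_dir: Path, archive_path: Path) -> None:
    """Add every file in model_dir at the archive root (no parent folder)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.iterdir()):
            if path.is_file():
                # arcname drops the directory prefix so files sit at the root
                tar.add(path, arcname=path.name)


def archive_is_flat(archive_path: Path) -> bool:
    """Return True if no archive member sits inside a sub-folder."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [n[2:] if n.startswith("./") else n for n in tar.getnames()]
    return all("/" not in n for n in names if n not in ("", "."))


# Tiny stand-in files so the sketch runs without downloading a model.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo-model"
    demo_dir.mkdir()
    for name in ("config.json", "tokenizer.json", "model-00001-of-00005.safetensors"):
        (demo_dir / name).write_text("{}")
    # Write the archive outside demo_dir so it does not end up inside itself.
    demo_archive = Path(tmp) / "model.tar.gz"
    create_flat_archive(demo_dir, demo_archive)
    flat = archive_is_flat(demo_archive)
    print(f"archive is flat: {flat}")
```

Running archive_is_flat on the real model.tar.gz before the upload is a cheap way to catch an archive that accidentally wraps the files in a folder.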

After we have created the model.tar.gz archive, we can upload it to Amazon S3. We will use the sagemaker SDK to upload the model to our sagemaker session bucket.

from sagemaker.s3 import S3Uploader
# upload model.tar.gz to s3
s3_model_uri = S3Uploader.upload(local_path=str(model_tar_dir.joinpath("model.tar.gz")), desired_s3_uri=f"s3://{sess.default_bucket()}/starchat-beta")
print(f"model uploaded to: {s3_model_uri}")

3. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.

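from sagemaker.huggingface import get_huggingface_llm_image_uri
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface", # backend of the Hugging Face LLM container
  session=sess, # pass version="..." as well to pin a specific container version
)
# print ecr image uri
print(f"llm image uri: {llm_image}")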

4. Deploy Starchat-beta to Amazon SageMaker

To deploy HuggingFaceH4/starchat-beta to Amazon SageMaker we create a HuggingFaceModel model class and define our endpoint configuration including the hf_model_id, instance_type etc. We will use a g5.12xlarge instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.

import json
from sagemaker.huggingface import HuggingFaceModel
# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300
# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes",# Comment in to quantize
}
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role, # IAM role from step 1
  image_uri=llm_image, # LLM container from step 3
  model_data=s3_model_uri, # model.tar.gz uploaded in step 2
  env=config
)
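The configuration values above have to be passed as strings, and the input budget has to stay below the total token budget. As a minimal sketch, a small helper can build and sanity-check the environment before deploying. The function name build_tgi_env is hypothetical, and the checks reflect my understanding of how the container validates these settings rather than an official contract.

```python
import json


def build_tgi_env(model_path: str, num_gpus: int,
                  max_input_length: int, max_total_tokens: int) -> dict:
    """Build the container environment as a dict of strings, with basic checks."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if max_input_length >= max_total_tokens:
        # Assumption: the total budget covers input plus generated tokens,
        # so the input limit must leave room for at least one new token.
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "HF_MODEL_ID": model_path,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }


env = build_tgi_env("/opt/ml/model", num_gpus=4,
                    max_input_length=1024, max_total_tokens=2048)
# Tokens left for generation when a prompt uses the full input budget.
generation_budget = 2048 - 1024
print(env, generation_budget)
```

With the values used in this example, a prompt that fills the 1024-token input limit still leaves 1024 tokens for the generated answer.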

After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.12xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.

# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # seconds allowed for loading the model (300 s = 5 minutes)
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

5. Test the model and run inference

After our endpoint is deployed we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We can run inference with different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload. You can find a list of parameters in the announcement blog post or in the swagger documentation.

starchat-beta is a conversational model for answering coding questions. We can prompt it by simply asking our question:

<|system|>\n You are a Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>
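To avoid retyping the template for every request, it can be wrapped in a small helper. This is only a sketch: the function name build_starchat_prompt and its default system message are illustrative, and the special tokens are copied from the template above.

```python
def build_starchat_prompt(query: str, system: str = "You are a Python Expert") -> str:
    """Fill the StarChat dialogue template with a system message and a user query."""
    return f"<|system|>\n {system}<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"


example_prompt = build_starchat_prompt("How can I filter a list of dictionaries?")
print(example_prompt)
```

The prompt ends with the <|assistant|> token so that the model continues with the assistant's turn.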

Let's give it a first try and ask how to filter a list in Python:

query = "How can I filter a list of dictionaries?"
res = llm.predict({
	"inputs": f"<|system|>\n You are a Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
})
# the endpoint returns a list with one dict per input
print(res[0]["generated_text"])

Now we will run inference with different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload. This can be used to have the model stop generating after the assistant's turn.

# define payload
prompt=f"<|system|>\n You are a Python Expert<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.95,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "stop": ["<|end|>"]
  }
}
# send request to endpoint
response = llm.predict(payload)
# print the generated text
print(response[0]["generated_text"])
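Depending on the container settings, the returned text may still contain the prompt at the start and the stop sequence at the end. The following sketch strips both if they are present. The helper extract_answer is a hypothetical name, and the mock response only assumes the list-of-dicts shape with a generated_text key.

```python
def extract_answer(generated_text: str, prompt: str = "",
                   stop_sequences=("<|end|>",)) -> str:
    """Remove a leading prompt and a trailing stop sequence, if present."""
    text = generated_text
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    for stop in stop_sequences:
        if stop and text.endswith(stop):
            text = text[: -len(stop)]
            break
    return text.strip()


# Mock response standing in for the endpoint output (assumed shape).
demo_prompt = "<|user|>\nHi<|end|>\n<|assistant|>"
mock_response = [{"generated_text": demo_prompt + "\nHello!<|end|>"}]
answer = extract_answer(mock_response[0]["generated_text"], demo_prompt)
print(answer)
```

With a real endpoint, the call would be extract_answer(response[0]["generated_text"], prompt).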

Awesome! 🚀 We have successfully deployed our model from Amazon S3 to Amazon SageMaker and run inference on it. Now it's time for you to try it out yourself and build Generative AI applications with the new Hugging Face LLM DLC on Amazon SageMaker.

6. Clean up

To clean up, we can delete the model and endpoint.
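A minimal sketch of the clean-up step, assuming llm is the predictor returned by deploy above. The wrapper function clean_up is an illustrative name; delete_model and delete_endpoint are the predictor methods it relies on.

```python
def clean_up(predictor) -> None:
    """Delete the SageMaker model first, then the endpoint serving it."""
    predictor.delete_model()
    predictor.delete_endpoint()


# Usage with the predictor from the deployment step:
# clean_up(llm)
```

Deleting the endpoint stops the instance and therefore the billing for it; the model.tar.gz in Amazon S3 is not removed by these calls.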



We successfully deployed StarChat from Amazon S3 to Amazon SageMaker without internet access or the need to access any external system. This will allow companies with strict security requirements to deploy LLMs to Amazon SageMaker inside their VPCs in an easy, secure and internet-free way.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.