Scale LLM Inference on Amazon SageMaker with Multi-Replica Endpoints

January 11, 2024

One of the key Amazon SageMaker announcements at this year's re:Invent (2023) was the new Hardware Requirements object for Amazon SageMaker endpoints. It provides granular control over the compute resources for models deployed on SageMaker, including minimum CPU, GPU, memory, and number of replicas. This allows you to optimize your model's throughput and cost by matching the compute resources to the model's requirements, and to deploy multiple LLMs on the same instance. Previously, it was not possible to deploy multiple replicas of an LLM, or multiple LLMs, on a single endpoint, which can limit the overall throughput of models that are not compute bound, e.g. open LLMs like a single Llama 13B on a p4d.24xlarge instance.

In this post, we show how to use the new feature with the sagemaker SDK and its ResourceRequirements object to optimize the deployment of Llama 13B for increased throughput and cost performance on an Amazon SageMaker p4d.24xlarge instance. The p4d.24xlarge instance has 8x A100 40GB GPUs, which allows us to deploy 8 replicas of Llama 13B on a single instance. You can also use this example to deploy other open LLMs like Mistral, T5 or StarCoder. It is even possible to deploy multiple different models on a single instance, e.g. 4x Llama 13B and 4x Mistral 7B. Check out the amazing blog post from Antje for this.

We are going to use the Hugging Face LLM DLC, a purpose-built Inference Container that makes it easy to deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). This blog post also includes hardware requirements for the different model sizes.

In this blog, we will cover how to:

  1. Setup development environment
  2. Retrieve the new Hugging Face LLM DLC
  3. Configure Hardware requirements per replica
  4. Deploy and Test Llama 2 on Amazon SageMaker
  5. Run a simple performance benchmark

Let's get started!

1. Setup development environment

We are going to use the sagemaker python SDK to deploy Llama 2 to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker python SDK installed.

!pip install "sagemaker>=2.199.0" "transformers==4.35.2" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri for the "huggingface" (TGI) backend
llm_image = get_huggingface_llm_image_uri("huggingface")

# print ecr image uri
print(f"llm image uri: {llm_image}")

3. Configure Hardware requirements per replica

Llama 2 comes in 3 different sizes - 7B, 13B & 70B parameters. The hardware requirements will vary based on the model size deployed to SageMaker. Below is an example configuration for Llama 13B. In addition, we provide a high-level overview of the hardware requirements for the different model sizes. To keep it simple, we only looked at the p4d.24xlarge instance type and AWQ/GPTQ quantization.

| Model | Instance Type | Quantization | # replicas |
|---|---|---|---|
| Llama 7B | (ml.)p4d.24xlarge | - | 8 |
| Llama 7B | (ml.)p4d.24xlarge | GPTQ/AWQ | 8 |
| Llama 13B | (ml.)p4d.24xlarge | - | 8 |
| Llama 13B | (ml.)p4d.24xlarge | GPTQ/AWQ | 8 |
| Llama 70B | (ml.)p4d.24xlarge | - | 2 |
| Llama 70B | (ml.)p4d.24xlarge | GPTQ/AWQ | 4 |

We didn't test all of these configurations yet. If you run into errors, please let me know and I will update the blog post.
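As a rough, untested sanity check on these replica counts, you can estimate the weight footprint from the parameter count and precision. This ignores the KV cache and activations, which need extra headroom on top of the weights:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight footprint in GB: one billion params at 8 bits is ~1 GB."""
    return params_billion * bits / 8

# p4d.24xlarge has 8x A100 GPUs with 40 GB each
print(weight_gb(13, 16))  # 26.0 GB -> fits on one 40 GB GPU, so up to 8 replicas
print(weight_gb(70, 16))  # 140.0 GB -> must be sharded across several GPUs per replica
print(weight_gb(70, 4))   # 35.0 GB (GPTQ/AWQ) -> too tight for one GPU once the KV cache is added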

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

llama2_13b_resource_config = ResourceRequirements(
    requests = {
        "copies": 8, # Number of replicas
        "num_accelerators": 1, # Number of GPUs per replica
        "num_cpus": 10,  # Number of CPU cores per replica (96 // num_replicas, minus some for management)
        "memory": 100 * 1024,  # Minimum memory per replica in MB (1152 GB // num_replicas, minus some for management)
    },
)

4. Deploy Llama 2 to Amazon SageMaker

To deploy meta-llama/Llama-2-13b-chat-hf to Amazon SageMaker, we create a HuggingFaceModel model class, define our endpoint configuration including the hf_model_id and instance_type, and then add our ResourceRequirements object to the deploy method.

Note: meta-llama/Llama-2-13b-chat-hf is a gated model. You need to request access on Hugging Face after you have been granted access from Meta: visit the Meta website and accept the license terms and acceptable use policy before submitting the access form. Requests are processed in 1-2 days. Alternatively, you can use the ungated weights from NousResearch/Llama-2-13b-chat-hf.

import json
import uuid
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.enums import EndpointType

# sagemaker config
instance_type = "ml.p4d.24xlarge"
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-13b-chat-hf", # model_id from Hugging Face
  'SM_NUM_GPUS': json.dumps(1), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(16384),  # Limits the number of tokens that can be processed in parallel during the generation
  'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>" # needed when using a gated model
  # ,'HF_MODEL_QUANTIZE': "gptq", # comment in when using a quantized checkpoint
}

assert config['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method with the ResourceRequirements object.

# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1, # number of instances
  instance_type=instance_type, # base instance type
  resources=llama2_13b_resource_config, # resource config for multi-replica
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes to be able to load the model
  endpoint_name=f"llama-2-13b-chat-{str(uuid.uuid4())}", # name needs to be unique
  endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED, # needed to use resource config
)

SageMaker will now create our endpoint and deploy the model to it. This can take 15-25 minutes, since the replicas are deployed one after another. After the endpoint is created, we can use the predict method to send a request to our endpoint. To make it easier, we will use the apply_chat_template method from transformers. This allows us to send "openai"-like conversations to our model.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(config["HF_MODEL_ID"], use_auth_token=config["HUGGING_FACE_HUB_TOKEN"])

# OpenAI like conversational messages
messages = [
  {"role": "system", "content": "You are an helpful AWS Expert Assistant. Respond only with 1-2 sentences."},
  {"role": "user", "content": "What is Amazon SageMaker?"},
]

# generation parameters
parameters = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 50,
    "repetition_penalty": 1.03,
    "return_full_text": False,
}

res = llm.predict({
    "inputs": tok.apply_chat_template(messages, tokenize=False),
    "parameters": parameters
})
print(res[0]["generated_text"])

Sure, I'd be happy to help! Amazon SageMaker is a fully managed service that provides a range of machine learning (ML) algorithms and tools to help you build, train, and deploy ML models at scale.

5. Benchmark multi-replica endpoint

To benchmark our new endpoint, we will use the same code as for the "Llama 2 on Amazon SageMaker a Benchmark" from text-generation-inference-tests. When we ran that benchmark, it was not possible to deploy multiple replicas of a model on a single endpoint, which limited the maximum throughput we could achieve with Llama 13B on p4d.24xlarge instances. We benchmark a typical use case with long inputs and short generation: a prompt length of 1000 tokens and 50 generated tokens.

To run the benchmark, we need to clone text-generation-inference-tests and make sure we meet all the prerequisites from the README, including the installation of k6, which is used to run the benchmark.

# check out branch with test using 1000 tokens
!git clone -b "input-1000-gen-50"

Since we already have a deployed endpoint, we can provide the endpoint_name and our hardware requirements as inference_component. The name of the inference_component can currently only be retrieved using boto3.

inference_component = llm_model.sagemaker_session.list_inference_components(
    endpoint_name_equals=llm.endpoint_name
).get("InferenceComponents")[0].get("InferenceComponentName")
endpoint_name = llm.endpoint_name
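If you prefer plain boto3 over the sagemaker session helper, the same lookup can be sketched as below. The component name in the sample response is made up for illustration; the shape follows the ListInferenceComponents API:

```python
def first_component_name(response: dict) -> str:
    """Extract the first inference component name from a ListInferenceComponents response."""
    return response["InferenceComponents"][0]["InferenceComponentName"]

# with an active AWS session (uncomment to run against your account):
# import boto3
# sm_client = boto3.client("sagemaker")
# resp = sm_client.list_inference_components(EndpointNameEquals=llm.endpoint_name)
# inference_component = first_component_name(resp)

# illustrative response shape with a made-up component name
sample = {"InferenceComponents": [{"InferenceComponentName": "huggingface-pytorch-tgi-inference-123"}]}
print(first_component_name(sample))  # huggingface-pytorch-tgi-inference-123
```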

Change the directory to sagemaker_llm_container and run the benchmark. The command below will run a load test for 90s with 300 concurrent users. The result will be saved to text-generation-inference-tests/sagemaker_llm_container/{CONFIG_NAME}.json.

!cd text-generation-inference-tests/sagemaker_llm_container && python \
  --endpoint_name {endpoint_name} \
  --inference_component {inference_component} \
  --model_id {config["HF_MODEL_ID"]} \
  --tp_degree 1 \
  --vu 300

We ran multiple benchmarks for different numbers of concurrent users, starting with 100 and going up to 500, with MAX_BATCH_TOTAL_TOKENS of 16384. Each request sends a prompt of 1000 tokens and generates 50 tokens. The table below includes the total number of requests, throughput (req/sec), the median request time, and the percentage of successful requests.

| Model | Concurrent Users | requests | throughput (req/sec) | med. latency (sec) | request rate (%) |
|---|---|---|---|---|---|
| Llama 13B | 100 | 2322 | 24.741461/s | 3.76s | 100% |
| Llama 13B | 150 | 2543 | 26.316863/s | 5.58s | 100% |
| Llama 13B | 200 | 2648 | 26.720773/s | 7.81s | 100% |
| Llama 13B | 250 | 2769 | 25.82929/s | 8.89s | 97% |
| Llama 13B | 300 | 2776 | 25.662124/s | 9.78s | 100% |
| Llama 13B | 350 | 2827 | 26.713271/s | 12.57s | 100% |
| Llama 13B | 400 | 2880 | 24.161486/s | 12.53s | 100% |
| Llama 13B | 450 | 2934 | 26.138126/s | 15.69s | 100% |
| Llama 13B | 500 | 2976 | 26.170628/s | 16.9s | 100% |

We were able to serve 500 concurrent users with a throughput of 26.17 requests per second and a median latency of 16.9 seconds without failing requests. 500 concurrent users is not the limit of the endpoint; we could increase the number of concurrent users and extend the benchmark to find the limit, but this is out of scope for this blog post. It is also important to keep the latency of the endpoint in mind: if you have a latency requirement of, for example, under 10 seconds, you can only serve around 300 concurrent users.
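To put these numbers in perspective, here is a back-of-the-envelope cost estimate. The hourly price below is an assumption (roughly the on-demand p4d.24xlarge price at the time of writing; check current pricing for your region):

```python
price_per_hour = 37.69     # assumed on-demand p4d.24xlarge price in USD, varies by region
throughput_req_s = 26.17   # from the 500 concurrent user run above
gen_tokens_per_req = 50

req_per_hour = throughput_req_s * 3600
cost_per_1k_requests = price_per_hour / req_per_hour * 1000
gen_tokens_per_s = throughput_req_s * gen_tokens_per_req

print(f"{req_per_hour:,.0f} requests/hour")            # ~94,212 requests/hour
print(f"${cost_per_1k_requests:.2f} per 1k requests")  # ~$0.40 per 1k requests
print(f"~{gen_tokens_per_s:.0f} generated tokens/sec across all 8 replicas")
```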

!rm -rf text-generation-inference-tests

6. Clean up

To clean up, we can delete the model, endpoint and inference component for the hardware requirements.

Note: If you have issues deleting an endpoint with an attached inference component, see:

import time

# Delete Inference Component & Model
llm.sagemaker_session.delete_inference_component(inference_component_name=inference_component)
llm.delete_model()

# we have to wait until the component is deleted before we can delete the endpoint (can take ~2 minutes)
time.sleep(120)

# If this call fails, you can delete the endpoint manually using the AWS Console
llm.delete_endpoint()


By leveraging the ability to deploy multiple replicas of models like Llama 13B on instances such as p4d.24xlarge, companies can serve a substantially higher number of concurrent users without compromising the stability of the service. While there is an increase in latency under heavy load, the system maintains its robustness. This balance between scalability and stability can be crucial for organizations looking to deploy LLMs in production environments.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.