Llama 2 on Amazon SageMaker a Benchmark

Published on
6 min read
View Code

Orginially published on the Hugging Face Blog

Deploying large language models (LLMs) and other generative AI models can be challenging due to their computational requirements and latency needs. To provide useful recommendations to companies looking to deploy Llama 2 on Amazon SageMaker with the Hugging Face LLM Inference Container, we created a comprehensive benchmark analyzing over 60 different deployment configurations for Llama 2.

In this benchmark, we evaluated varying sizes of Llama 2 on a range of Amazon EC2 instance types with different load levels. Our goal was to measure latency (ms per token), and throughput (tokens per second) to find the optimal deployment strategies for three common use cases:

  • Most Cost-Effective Deployment: For users looking for good performance at low cost
  • Best Latency Deployment: Minimizing latency for real-time services
  • Best Throughput Deployment: Maximizing tokens processed per second

To keep this benchmark fair, transparent, and reproducible, we share all of the assets, code, and data we used and collected:

We hope to enable customers to use LLMs and Llama 2 efficiently and optimally for their use case. Before we get into the benchmark and data, let's look at the technologies and methods we used.

What is the Hugging Face LLM Inference Container?

Hugging Face LLM DLC is a purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving LLMs. TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Falcon, Llama, and T5. VMware, IBM, Grammarly, Open-Assistant, Uber, Scale AI, and many more already use Text Generation Inference.

What is Llama 2?

Llama 2 is a family of LLMs from Meta, trained on 2 trillion tokens. Llama 2 comes in three sizes - 7B, 13B, and 70B parameters - and introduces key improvements like longer context length, commercial licensing, and optimized chat abilities through reinforcement learning compared to Llama (1). If you want to learn more about Llama 2 check out this blog post.

What is GPTQ?

GPTQ is a post-training quantziation method to compress LLMs, like GPT. GPTQ compresses GPT (decoder) models by reducing the number of bits needed to store each weight in the model, from 32 bits down to just 3-4 bits. This means the model takes up much less memory and can run on less Hardware, e.g. Single GPU for 13B Llama2 models. GPTQ analyzes each layer of the model separately and approximates the weights to preserve the overall accuracy. If you want to learn more and how to use it, check out Optimize open LLMs using GPTQ and Hugging Face Optimum.


To benchmark the real-world performance of Llama 2, we tested 3 model sizes (7B, 13B, 70B parameters) on four different instance types with four different load levels, resulting in 60 different configurations:

  • Models: We evaluated all currently available model sizes, including 7B, 13B, and 70B.
  • Concurrent Requests: We tested configurations with 1, 5, 10, and 20 concurrent requests to determine the performance on different usage scenarios.
  • Instance Types: We evaluated different GPU instances, including g5.2xlarge, g5.12xlarge, g5.48xlarge powered by NVIDIA A10G GPUs, and p4d.24xlarge powered by NVIDIA A100 40GB GPU.
  • Quantization: We compared performance with and without quantization. We used GPTQ 4-bit as a quantization technique.

As metrics, we used Throughput and Latency defined as:

  • Throughput (tokens/sec): Number of tokens being generated per second.
  • Latency (ms/token): Time it takes to generate a single token

We used those to evaluate the performance of Llama across the different setups to understand the benefits and tradeoffs. If you want to run the benchmark yourself, we created a Github repository.

You can find the full data of the benchmark in the Amazon SageMaker Benchmark: TGI 1.0.3 Llama 2 sheet. The raw data is available on GitHub.

If you are interested in all of the details, we recommend you to dive deep into the provided raw data.

Recommendations & Insights

Based on the benchmark, we provide specific recommendations for optimal LLM deployment depending on your priorities between cost, throughput, and latency for all Llama 2 model sizes.

Note: The recommendations are based on the configuration we tested. In the future, other environments or hardware offerings, such as Inferentia2, maybe even more cost-efficient.

Most Cost-Effective Deployment

The most cost-effective configuration focuses on the right balance between performance (latency and throughput) and cost. Maximizing the output per dollar spent is the goal. We looked at the performance during 5 concurrent requests. We can see that GPTQ offers the best cost-effectiveness, allowing customers to deploy Llama 2 13B on a single GPU.

ModelQuantizationInstanceconcurrent requestsLatency (ms/token) medianThroughput (tokens/second)On-demand cost ($/h) in us-west-2Time to generate 1 M tokens (minutes)cost to generate 1M tokens ($)
Llama 2 7BGPTQg5.2xlarge534.245736120.0941633$1.52138.78$3.50
Llama 2 13BGPTQg5.2xlarge556.23748471.70560104$1.52232.43$5.87
Llama 2 70BGPTQml.g5.12xlarge5138.34792833.33372399$7.09499.99$59.08

Best Throughput Deployment

The Best Throughput configuration maximizes the number of tokens that are generated per second. This might come with some reduction in overall latency since you process more tokens simultaneously. We looked at the highest tokens per second performance during twenty concurrent requests, with some respect to the cost of the instance. The highest throughput was for Llama 2 13B on the ml.p4d.24xlarge instance with 688 tokens/sec.

ModelQuantizationInstanceconcurrent requestsLatency (ms/token) medianThroughput (tokens/second)On-demand cost ($/h) in us-west-2Time to generate 1 M tokens (minutes)cost to generate 1M tokens ($)
Llama 2 7BNoneml.g5.12xlarge2043.99524449.9423027$7.0937.04$4.38
Llama 2 13BNoneml.g5.12xlarge2067.4027465295.6378071$7.0918.72$2.21
Llama 2 70BNoneml.p4d.24xlarge2059.798591321.5369158$37.6916.61$10.43

Best Latency Deployment

The Best Latency configuration minimizes the time it takes to generate one token. Low latency is important for real-time use cases and providing a good experience to the customer, e.g. Chat applications. We looked at the lowest median for milliseconds per token during 1 concurrent request. The lowest overall latency was for Llama 2 7B on the ml.g5.12xlarge instance with 16.8ms/token.

ModelQuantizationInstanceconcurrent requestsLatency (ms/token) medianThorughput (tokens/second)On-demand cost ($/h) in us-west-2Time to generate 1 M tokens (minutes)cost to generate 1M tokens ($)
Llama 2 7BNoneml.g5.12xlarge116.81252661.45733054$7.09271.19$32.05
Llama 2 13BNoneml.g5.12xlarge121.00271547.15736567$7.09353.43$41.76
Llama 2 70BNoneml.p4d.24xlarge141.34854324.5142928$37.69679.88$427.05


In this benchmark, we tested 60 configurations of Llama 2 on Amazon SageMaker. For cost-effective deployments, we found 13B Llama 2 with GPTQ on g5.2xlarge delivers 71 tokens/sec at an hourly cost of $1.55. For max throughput, 13B Llama 2 reached 296 tokens/sec on ml.g5.12xlarge at $2.21 per 1M tokens. And for minimum latency, 7B Llama 2 achieved 16ms per token on ml.g5.12xlarge.

We hope the benchmark will help companies deploy Llama 2 optimally based on their needs. If you are want to get started deploying Llama 2 on Amazon SageMaker, check out Introducing the Hugging Face LLM Inference Container for Amazon SageMaker and Deploy Llama 2 7B/13B/70B on Amazon SageMaker blog posts.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.