Deploy Llama 3 on Amazon SageMaker
Earlier today Meta released Llama 3, the next iteration of the open-access Llama family. Llama 3 comes in two sizes: 8B for efficient deployment and development on consumer-size GPUs, and 70B for large-scale AI-native applications. Both come in base and instruction-tuned variants. In addition to the four models, a new version of Llama Guard was fine-tuned on Llama 3 8B and released as Llama Guard 2 (a safety fine-tune).
In this blog you will learn how to deploy the meta-llama/Meta-Llama-3-70B-Instruct model to Amazon SageMaker. We are going to use the Hugging Face LLM DLC, a purpose-built inference container that makes it easy to deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also includes hardware requirements for the different model sizes.
In the blog we will cover how to:
- Setup development environment
- Hardware requirements
- Deploy Llama 3 70B to Amazon SageMaker
- Run inference and chat with the model
- Benchmark Llama 3 70B with llmperf
- Clean up
Let's get started!
1. Setup development environment
We are going to use the `sagemaker` Python SDK to deploy Llama 3 to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
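A minimal setup sketch, assuming a notebook environment; the SDK version pin and the fallback IAM role name `sagemaker_execution_role` are placeholders you should adapt to your account:

```python
# install/upgrade the SageMaker Python SDK (version pin is indicative)
!pip install "sagemaker>=2.216.0" --upgrade --quiet

import sagemaker
import boto3

sess = sagemaker.Session()

try:
    # works inside SageMaker Studio / notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look up the role by name (placeholder name, adjust it)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```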
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions here.
Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the `get_huggingface_llm_image_uri` method. We are going to use the raw container URI instead.
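A sketch of how we pin the image for this post, using the raw container URI shown below (the URI is for `us-east-1`; adjust the account and region for your setup):

```python
# use the raw Hugging Face LLM DLC image URI (us-east-1); adjust the region if needed
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"

print(f"llm image uri: {llm_image}")
```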
llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04
2. Hardware requirements
Llama 3 comes in 2 different sizes - 8B & 70B parameters. The hardware requirements will vary based on the model size deployed to SageMaker. Below is a set of minimum requirements for each model size we tested.
| Model | Instance Type | Quantization | # of GPUs per replica |
| --- | --- | --- | --- |
| Llama 8B | (ml.)g5.2xlarge | - | 1 |
| Llama 70B | (ml.)g5.12xlarge | gptq / awq | 4 |
| Llama 70B | (ml.)g5.48xlarge | - | 8 |
| Llama 70B | (ml.)p4d.24xlarge | - | 8 |
Note: We haven't tested GPTQ or AWQ models yet.
3. Deploy Llama 3 to Amazon SageMaker
To deploy Llama 3 70B to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type`, etc. We will use a `p4d.24xlarge` instance type, which has 8 NVIDIA A100 GPUs and 320GB of GPU memory. Llama 3 70B Instruct is a fine-tuned model for conversational AI, which allows us to enable the Messages API from TGI and interact with Llama using the common OpenAI format of `messages`.
Note: Llama 3 is a gated model. Please visit the Model Card and accept the license terms and acceptable use policy before you can access and deploy the model.
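A configuration sketch for the model class; the TGI values (sequence lengths, batch tokens) are indicative assumptions for this instance type and the Hub token is a placeholder you must replace:

```python
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config: instance type and a generous health check timeout for the 70B weights
instance_type = "ml.p4d.24xlarge"
health_check_timeout = 900

# TGI / Llama 3 configuration -- values are indicative, tune them for your use case
config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-70B-Instruct",  # model id from hf.co/models
    "SM_NUM_GPUS": "8",                # number of GPUs used per replica
    "MAX_INPUT_LENGTH": "2048",        # max length of the input text
    "MAX_TOTAL_TOKENS": "4096",        # max length of the generation (including input)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # tokens that can be processed in parallel
    "MESSAGES_API_ENABLED": "true",    # enable the OpenAI-style Messages API
    "HUGGING_FACE_HUB_TOKEN": "<REPLACE WITH YOUR TOKEN>",  # required for the gated model
}

# create the HuggingFaceModel with the raw image uri retrieved above
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)
```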
After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.p4d.24xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs.
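A minimal deploy sketch, reusing the instance type and health check timeout assumed above:

```python
# deploy the model to an endpoint; downloading and sharding the 70B weights takes a while
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # give SageMaker time to load the model
)
```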
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
4. Run inference and chat with the model
After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can pass different parameters to influence the generation; they are defined in the `parameters` attribute of the payload. You can find the supported parameters here.
The Messages API allows us to interact with the model in a conversational way. We can define the role of the message and the content. The role can be either `system`, `assistant` or `user`. The `system` role is used to provide context to the model and the `user` role is used to ask questions or provide input to the model.
Okay, let's test it.
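A request sketch using the Messages API; the payload follows the OpenAI-style format TGI exposes, and the prompt, sampling values and stop token are example assumptions:

```python
# prompt in the OpenAI-style messages format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# generation arguments -- values are indicative
payload = {
    "messages": messages,
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.9,
    "stop": ["<|eot_id|>"],  # Llama 3 end-of-turn token
}

chat = llm.predict(payload)

# the Messages API returns an OpenAI-style response object
print(chat["choices"][0]["message"]["content"])
```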
5. Benchmark Llama 3 70B with llmperf
We successfully deployed Llama 3 70B to Amazon SageMaker and tested it. Now we want to benchmark the model to see how it performs. We will use an llmperf fork with support for `sagemaker`.
First, let's install the `llmperf` package.
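An installation sketch, assuming a notebook environment; the fork URL is an assumption and may differ from the one you use:

```python
# clone the llmperf fork with SageMaker support and install it (fork URL is an assumption)
!git clone https://github.com/philschmid/llmperf.git
!pip install -e llmperf/ --quiet
```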
Now we can run the benchmark with the following command. We are going to benchmark using `25` concurrent users and a maximum of `500` requests. The benchmark will measure `first-time-to-token`, `latency (ms/token)` and `throughput (tokens/s)`. Full details can be found in the `results` folder.
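A benchmark invocation sketch; the `MESSAGES_API` toggle and the flag names follow llmperf's `token_benchmark_ray.py` at the time of writing and may differ in your version, and the region is an assumption matching the endpoint above:

```python
import os

# tell the benchmark client which region the endpoint lives in (assumption: us-east-1)
os.environ["AWS_REGION"] = "us-east-1"

# endpoint name from the predictor we deployed earlier
endpoint_name = llm.endpoint_name

# run the llmperf token benchmark against the SageMaker endpoint
# (flag names may differ between llmperf versions)
!MESSAGES_API=true python llmperf/token_benchmark_ray.py \
    --model {endpoint_name} \
    --llm-api "sagemaker" \
    --num-concurrent-requests 25 \
    --max-num-completed-requests 500 \
    --timeout 600 \
    --results-dir "results"
```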
🚨Important🚨: This benchmark was initiated from Europe, while the endpoint runs in us-east-1. This has a significant impact on the `first-time-to-token` metric, since it includes the network communication. If you want to measure the `first-time-to-token` correctly, you need to run the benchmark on the same host or in your production region.
Let's parse the results and display them nicely.
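A parsing sketch; the summary file naming and the metric keys are assumptions about llmperf's output format, so inspect the files in `results/` to confirm:

```python
import glob
import json

# llmperf writes one summary JSON per run into the results directory
# (file naming is an assumption -- check results/ to confirm)
summary_file = glob.glob("results/*summary.json")[0]

with open(summary_file) as f:
    results = json.load(f)

# display the collected metrics; key names depend on the llmperf version
for key, value in sorted(results.items()):
    print(f"{key}: {value}")
```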
That's it! We successfully deployed, tested and benchmarked Llama 3 70B on Amazon SageMaker. The benchmark is not a full representation of the model performance, but it gives you a good first indication. If you plan to use the model in production, we recommend running a longer benchmark that is closer to your production workload, modifying the number of replicas (see Scale LLM Inference on Amazon SageMaker with Multi-Replica Endpoints) and, most importantly, testing the model with your own data.
6. Clean up
To clean up, we can delete the model and endpoint.
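A minimal clean-up sketch using the predictor we created earlier:

```python
# delete the model and the endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```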
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.