Deploy Llama 2 70B on AWS Inferentia2 with Hugging Face Optimum
Llama 2 is an open LLM from Meta, released in July 2023. It was trained on more data (2T tokens), supports a context window of up to 4K tokens, and is still one of the best openly available LLMs. Meta also fine-tuned conversational models with Reinforcement Learning from Human Feedback on over 1 million human annotations.
In this blog you will learn how to deploy the meta-llama/Llama-2-70b-chat-hf model on AWS Inferentia2 with Hugging Face Optimum on Amazon SageMaker. We are going to use the Hugging Face LLM Inf2 Container, a new purpose-built inference container for easily deploying LLMs on AWS Inferentia2, powered by Text Generation Inference and Optimum Neuron.
In this blog we will cover how to:
- Setup development environment
- Retrieve the new Hugging Face LLM Inf2 DLC
- Deploy Llama 2 70B to Inferentia2
- Run inference and chat with the model
- Benchmark Llama 2 70B on Inferentia2
- Clean up
Let's get started! 🚀
Quick intro: AWS Inferentia 2
AWS Inferentia2 (Inf2) instances are purpose-built EC2 instances for deep learning (DL) inference workloads. Inferentia2 is the successor of AWS Inferentia and promises to deliver up to 4x higher throughput and up to 10x lower latency.
instance size | accelerators | Neuron Cores | accelerator memory (GB) | vCPU | CPU memory (GiB) | on-demand price ($/h) |
---|---|---|---|---|---|---|
inf2.xlarge | 1 | 2 | 32 | 4 | 16 | 0.76 |
inf2.8xlarge | 1 | 2 | 32 | 32 | 128 | 1.97 |
inf2.24xlarge | 6 | 12 | 192 | 96 | 384 | 6.49 |
inf2.48xlarge | 12 | 24 | 384 | 192 | 768 | 12.98 |
Additionally, Inferentia2 will support writing custom operators in C++ and new datatypes, including FP8 (cFP8).
1. Setup development environment
We are going to use the `sagemaker` Python SDK to deploy Llama 2 70B to Amazon SageMaker. We need to make sure we have an AWS account configured and the `sagemaker` Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
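Here is a minimal sketch of what that setup could look like; the SDK version pin and the fallback IAM role name are assumptions, so adjust them to your account:

```python
# install the SageMaker SDK first, e.g.: pip install "sagemaker>=2.199.0" --upgrade
import boto3
import sagemaker

sess = sagemaker.Session()

try:
    # works inside SageMaker Studio / Notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look up an existing execution role by name
    # ("sagemaker_execution_role" is a placeholder, use your own role)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```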
2. Retrieve the new Hugging Face LLM Inf2 DLC
The new Hugging Face TGI Neuronx DLCs can be used to run inference on AWS Inferentia2. You can use the `get_huggingface_llm_image_uri` method of the `sagemaker` SDK to retrieve the appropriate Hugging Face TGI Neuronx DLC URI based on your desired `backend`, `session`, `region`, and `version`. You can find all the available versions here.
Note: At the time of writing this blog post, the latest version of the Hugging Face LLM DLC is not yet available via the `get_huggingface_llm_image_uri` method. We are going to use the raw container URI instead.
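Here is a sketch of how retrieving the image could look. The commented-out helper call shows the intended path once the version is published; the raw URI below is only a placeholder that you need to replace with the DLC URI for your region and version:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# once the desired version is published, the helper can be used directly:
# llm_image = get_huggingface_llm_image_uri("huggingface-neuronx", version="<version>")

# until then, pin the raw container URI yourself (placeholder value):
llm_image = "<huggingface-tgi-neuronx-dlc-uri-for-your-region>"
print(f"llm image uri: {llm_image}")
```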
3. Deploy Llama 2 70B to Inferentia2
At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size ahead of time. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs, including Llama 2 70B.
This means we don't need to compile the model ourselves, but can use the pre-compiled model from the cache. You can find compiled/cached configurations on the Hugging Face Hub. If your desired configuration is not yet cached, you can compile it yourself using the Optimum CLI or open a request at the cache repository.
Below is an example of how to compile Llama 2 70B with the Optimum CLI. It is not needed in this case, since we are using the pre-compiled model from the cache.
Example: Compile Llama 2 70B with Optimum CLI
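As a sketch, assuming the `optimum-cli export neuron` interface from `optimum-neuron` and the same configuration we deploy below (batch size 4, sequence length 4096, 24 Neuron Cores, fp16 auto cast); flag names can differ between versions, so check `optimum-cli export neuron --help` on your instance:

```python
import subprocess

# compile the model for Inferentia2; the output directory name is arbitrary
compile_cmd = [
    "optimum-cli", "export", "neuron",
    "--model", "meta-llama/Llama-2-70b-chat-hf",
    "--batch_size", "4",
    "--sequence_length", "4096",
    "--num_cores", "24",
    "--auto_cast_type", "fp16",
    "./llama-2-70b-chat-neuron/",
]
subprocess.run(compile_cmd, check=True)
```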
Note: You need to compile models on an AWS EC2 instance with Inferentia2 support. Compilation can take up to 45 minutes.
Deploying Llama 2 70B as an Endpoint
Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. We need to make sure the following additional parameters are defined:
- `HF_NUM_CORES`: Number of Neuron Cores used for the compilation.
- `HF_BATCH_SIZE`: The batch size that was used to compile the model.
- `HF_SEQUENCE_LENGTH`: The sequence length that was used to compile the model.
- `HF_AUTO_CAST_TYPE`: The auto cast type that was used to compile the model.
In addition, we need to define the traditional TGI parameters:
- `HF_MODEL_ID`: The Hugging Face model ID.
- `HF_TOKEN`: The Hugging Face API token to access gated models.
- `MAX_BATCH_SIZE`: The maximum batch size that the model can handle, equal to the batch size used for compilation.
- `MAX_INPUT_LENGTH`: The maximum input length that the model can handle.
- `MAX_TOTAL_TOKENS`: The maximum total tokens the model can generate, equal to the sequence length used for compilation.
Select the right instance type
Llama 2 70B is a large model and requires a lot of memory. We are going to use the `inf2.48xlarge` instance type, which has 192 vCPUs and 384 GB of accelerator memory. The `inf2.48xlarge` instance comes with 12 Inferentia2 accelerators that include 24 Neuron Cores. If you want to find the cached configurations for Llama 2 70B, you can find them here. In our case we will use a batch size of 4 and a sequence length of 4096.
Before we can deploy Llama 2 70B to Inferentia2, we need to make sure we are logged in to the Hugging Face Hub and have the necessary permissions to access the model. You can request access to the model here.
After that we can create our endpoint configuration and deploy the model to Amazon SageMaker.
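Here is a sketch of what that configuration could look like. The fp16 auto cast type and the `MAX_INPUT_LENGTH` value are assumptions based on the cached configuration described above; `role` and `llm_image` come from the earlier steps:

```python
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.inf2.48xlarge"

config = {
    "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",
    "HF_TOKEN": "<REPLACE WITH YOUR HUGGING FACE TOKEN>",  # required for the gated model
    "HF_NUM_CORES": "24",          # number of Neuron Cores used for compilation
    "HF_BATCH_SIZE": "4",          # batch size used for compilation
    "HF_SEQUENCE_LENGTH": "4096",  # sequence length used for compilation
    "HF_AUTO_CAST_TYPE": "fp16",   # auto cast type used for compilation (assumption)
    "MAX_BATCH_SIZE": "4",         # equal to HF_BATCH_SIZE
    "MAX_INPUT_LENGTH": "3072",    # assumption: leave room for generated tokens
    "MAX_TOTAL_TOKENS": "4096",    # equal to HF_SEQUENCE_LENGTH
}

# create the HuggingFaceModel pointing at the TGI Neuronx image and our config
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)
```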
After we have created the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.inf2.48xlarge` instance type. TGI will automatically distribute and shard the model across all Neuron Cores.
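A sketch of the deployment call; the health-check timeout is an assumption to give the endpoint enough time to load the 70B weights:

```python
# deploy the model to a real-time endpoint on a single inf2.48xlarge instance
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=3600,  # seconds, assumption for a 70B model
)
```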
SageMaker will now create our endpoint and deploy the model to it. This can take 20-30 minutes; we are working on improving the deployment time.
4. Run inference and chat with the model
After our endpoint is deployed, we can run inference on it using the `predict` method from the `predictor`. We can provide different parameters to impact the generation by adding them to the `parameters` attribute of the payload. You can find the supported parameters here, or in the open API specification of TGI in the swagger documentation.
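As a sketch, a request payload could look like the following. The parameter values are only illustrative, and we build a proper chat prompt in the next step:

```python
# the prompt goes into "inputs", generation options into "parameters"
payload = {
    "inputs": "<prompt goes here>",
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 256,
        "stop": ["</s>"],
    },
}
# res = llm.predict(payload)
```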
The `meta-llama/Llama-2-70b-chat-hf` model is a conversational chat model, meaning we can chat with it using a prompt structure like the following:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST] {{ model_answer_2 }} </s><s>[INST] {{ user_msg_3 }} [/INST]
Manually preparing the prompt is error prone, so we can use the `apply_chat_template` method from the tokenizer to help with it. It expects a list of `messages` in the well-known OpenAI format and converts it into the correct prompt format for the model. Let's see if Llama 2 knows some facts about AWS.
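Here is a sketch of how that could look. Loading the gated tokenizer assumes you are logged in to the Hugging Face Hub, and the generation parameters are only examples:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AWS Inferentia2 and when should I use it?"},
]
# convert the OpenAI-style messages into the Llama 2 prompt format
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

res = llm.predict({
    "inputs": prompt,
    "parameters": {"do_sample": True, "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 256, "stop": ["</s>"]},
})
print(res[0]["generated_text"])
```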
Awesome, we tested inference. Now let's build a cool demo that supports streaming responses. Amazon SageMaker supports streaming responses from your model, which we can leverage to create a streaming Gradio application with a better user experience.
We created a sample application that you can use to test your model. You can find the code in gradio-app.py. The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs. With `share=True` you can share the application with others, since Gradio will create a public link for you that is valid for 72 hours.
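The Gradio UI itself lives in gradio-app.py. As a minimal sketch of the underlying streaming call (assuming the TGI container honors `"stream": true` and reusing the prompt and endpoint from above), the raw SageMaker streaming API can be used like this:

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

body = {
    "inputs": prompt,  # prompt built with apply_chat_template as above
    "parameters": {"max_new_tokens": 256},
    "stream": True,
}

resp = smr.invoke_endpoint_with_response_stream(
    EndpointName=llm.endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)

# each PayloadPart carries raw bytes of the server-sent-events stream emitted by TGI;
# here we simply print the chunks as they arrive
for event in resp["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes", b"")
    print(chunk.decode("utf-8"), end="")
```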
5. Benchmark Llama 2 70B on Inferentia2
In the last step we are going to benchmark the model on Inferentia2. We are going to run a simple load test where we send multiple parallel requests to the model and measure the latency and throughput of the model.
We added a `utils` helper to retrieve metrics from the CloudWatch logs, but these still include network latency and other overheads. We are working on a detailed, reproducible benchmarking guide for Optimum Neuron models on Inferentia2. Stay tuned!
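As a rough sketch (not the `utils` helper from the repository), such a load test could look like this; the concurrency, request count, and prompt are arbitrary choices:

```python
import time
from concurrent.futures import ThreadPoolExecutor

number_of_requests = 50
concurrency = 5
payload = {
    "inputs": "What is AWS Inferentia2?",
    "parameters": {"max_new_tokens": 250, "details": True},
}

def send_request(_):
    start = time.perf_counter()
    res = llm.predict(payload)
    latency = time.perf_counter() - start
    # with "details": True, TGI reports how many tokens were generated
    tokens = res[0].get("details", {}).get("generated_tokens", 250)
    return latency, tokens

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    results = list(pool.map(send_request, range(number_of_requests)))
total_time = time.perf_counter() - start

total_tokens = sum(t for _, t in results)
print(f"avg request latency: {sum(l for l, _ in results) / len(results):.2f}s")
print(f"throughput: {total_tokens / total_time:.2f} tokens/s")
```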
Note: We want to mention again that the benchmark is not a perfect representation of the model performance, as it includes network latency and other overheads. We were sending requests from eu-central-1 to us-east-1, which adds additional latency to each request. We are working on a detailed benchmark with more metrics and a better understanding of the model performance.
Results with 250 token generation:
Note: We ran a similar test on an `ml.g5.48xlarge` instance with 8 NVIDIA A10G GPUs as well, but needed to decrease the context size from 4096 to 2048 and the generated tokens from 250 to 50. With g5 we achieved a throughput of ~38.5 tokens per second.
On Inferentia2 we achieved a throughput of ~42.23 tokens per second. That's an improvement compared to the `ml.g5.48xlarge` instance. The `g5` benchmark was slightly different, so it is not 100% comparable.
If we compare the cost performance per token between `g5.48xlarge` ($16.28/h) and `inf2.48xlarge` ($12.98/h), it shows a ~44.7% improvement in price per token. This makes Inferentia2 a valid alternative to NVIDIA A10G GPUs.
6. Clean up
To clean up, we can delete the model and endpoint.
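A minimal sketch using the predictor created earlier:

```python
# delete the SageMaker model and endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```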
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.