Deploy Llama 2 7B on AWS inferentia2 with Amazon SageMaker
In this end-to-end tutorial, you will learn how to deploy and speed up Llama 2 inference using AWS Inferentia2 and optimum-neuron on Amazon SageMaker. Optimum Neuron is the interface between the Hugging Face Transformers & Diffusers library and AWS Accelerators including AWS Trainium and AWS Inferentia2.
You will learn how to:
- Convert Llama 2 to AWS Neuron (Inferentia2) with
optimum-neuron
- Create a custom
inference.py
script for Llama 2 - Upload the neuron model and inference script to Amazon S3
- Deploy a Real-time Inference Endpoint on Amazon SageMaker
- Run inference and chat with Llama 2
Quick intro: AWS Inferentia 2
AWS inferentia (Inf2) are purpose-built EC2 for deep learning (DL) inference workloads. Inferentia 2 is the successor of AWS Inferentia, which promises to deliver up to 4x higher throughput and up to 10x lower latency.
instance size | accelerators | Neuron Cores | accelerator memory | vCPU | CPU Memory | on-demand price ($/h) |
---|---|---|---|---|---|---|
inf2.xlarge | 1 | 2 | 32 | 4 | 16 | 0.76 |
inf2.8xlarge | 1 | 2 | 32 | 32 | 128 | 1.97 |
inf2.24xlarge | 6 | 12 | 192 | 96 | 384 | 6.49 |
inf2.48xlarge | 12 | 24 | 384 | 192 | 768 | 12.98 |
Additionally, inferentia 2 will support the writing of custom operators in c++ and new datatypes, including FP8
(cFP8).
Let's get started! š
If you are going to use Sagemaker in a local environment (not SageMaker Studio or Notebook Instances). You need access to an IAM Role with the required permissions for Sagemaker. You can findĀ hereĀ more about it.
optimum-neuron
1. Convert Llama 2 to AWS Neuron (Inferentia2) with We are going to use the optimum-neuron to compile/convert our model to neuronx. Optimum Neuron provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks.
As a first step, we need to install the optimum-neuron
and other required packages.
Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the conda_python3
conda kernel.
After we have installed the optimum-neuron
we can convert load and convert our model.
We are going to use the meta-llama/Llama-2-7b-chat-hf model. Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases.
At the time of writing, the AWS Inferentia2 does not support dynamic shapes for inference, which means that the we need to specify our image size in advanced for compiling and inference.
In simpler terms, this means we need to define the input shapes for our prompt (sequence length), batch size, height and width of the image.
We precompiled the model with the following parameters and pushed it to the Hugging Face Hub:
sequence_length
: 2048batch_size
: 2neuron
: 2.15.0
Note: If you want to compile your own model or a different Llama 2 checkpoint you need to use ~120GB of memory and the compilation can take ~60 minutes. We used an inf2.24xlarge
ec2 instance with the Hugging Face Neuron Deep Learning AMI to compile the model.
inference.py
script for Llama 2 7B
2. Create a custom The Hugging Face Inference Toolkit supports zero-code deployments on top of theĀ pipelineĀ featureĀ from š¤ Transformers. This allows users to deploy Hugging Face transformers without an inference script [Example].
Currently is this feature not supported with AWS Inferentia2, which means we need to provide an inference.py
for running inference. But optimum-neuron
has integrated support for the š¤ Diffusers pipeline feature. That way we can use the optimum-neuron
to create a pipeline for our model.
If you want to know more about the inference.py
Ā script check out this example. It explains amongst other things what the model_fn
and predict_fn
are.
We are using the NEURON_RT_NUM_CORES=2
to make sure that each HTTP worker uses 2 Neuron core for inference. In additon we are going to use "templates for chat models" and new feature of transformers, which allows us to provide OpenAI messages, which are then converted to the correct input format for the model.
For this to work we need jinja2
installed. Lets create a requirements.txt
file and install the required packages.
Now, we create our inference.py
file using the apply_chat_template
method.
3. Upload the neuron model and inference script to Amazon S3
Before we can deploy our neuron model to Amazon SageMaker we need to upload it all our model artifacts to Amazon S3.
Note: Currently inf2
instances are only available in the us-east-2
& us-east-1
region [REF]. Therefore we need to force the region to us-east-2.
Lets create our SageMaker session and upload our model to Amazon S3.
We create our model.tar.gz
with our inference.py
script.
Note: We will use pigz
for multi-core compression to speed up the process. Make sure pigz
is installed on your system, you can install it on ubuntu with sudo apt install pigz
. With pigz
and 32 cores compression takes ~2.4min
Next, we upload our model.tar.gz
to Amazon S3 using our session bucket and sagemaker
sdk.
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
After we have uploaded ourĀ model artifactsĀ to Amazon S3 can we create a customĀ HuggingfaceModel
. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.
The inf2.xlarge
instance type is the smallest instance type with AWS Inferentia2 support. It comes with 1 Inferentia2 chip with 2 Neuron Cores. This means we can use 2 Neuron Cores to minimize latency for our image generation.
5. Run inference and chat with Llama 2
The .deploy()
returns an HuggingFacePredictor
object which can be used to request inference. Our endpoint expects a json
with messages
. Since we are leveraging the new apply_chat_template in our inference.py script we can send "openai" like converstaions to our model.
Additionally we can send inference parameters, e.g. top_p
or temperature
using the parameters
key.
Since Llama is a conversational model lets ask a follow up question. Therefore we can extend our messages
with a new message.
Result:
system: You are an helpful AWS Expert Assistant. Respond only with 1-2 sentences.
user: What is Amazon SageMaker?
assistant: Amazon SageMaker is a fully managed service that provides a range of machine learning (ML) algorithms, tools, and frameworks to build, train, and deploy ML models at scale. It allows data scientists and engineers to focus on building better ML models instead of managing infrastructure.
user: Can I run Hugging Face Transformers on it?
assistant: Yes, you can run Hugging Face Transformers on Amazon SageMaker. Amazon SageMaker provides a pre-built Python SDK that supports popular deep learning frameworks like Hugging Face Transformers, making it easy to use these frameworks in your machine learning workflows.
If you are interested in the performance of Inferentia2 for throughput and latency check out Make your llama generation time fly with AWS Inferentia2 blog post.
Delete model and endpoint
To clean up, we can delete the model and endpoint.
Conclusion
In this end-to-end tutorial, we walked through deploying Llama 2, a large conversational AI model, for low-latency inference using AWS Inferentia2 and Amazon SageMaker.
We converted the model with optimum-neuron
, created a custom inference script, deployed a real-time endpoint, and chatted with Llama 2 using Inferentia2 acceleration.
If you are interested in the performance of Inferentia2 for throughput and latency check out Make your llama generation time fly with AWS Inferentia2 blog post.
The combination of large AI models like Llama 2 and purpose-built inference chips like Inferentia2 enables low-latency deployments. Using Amazon SageMaker we can go from training to production hosting with just a few lines of code.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.