LLMOps: Deploy Open LLMs using Infrastructure as Code with AWS CDK
Open large language models (LLMs), like Llama 2 or Falcon, are rapidly shifting our thinking about what we can achieve with AI. These new open LLMs enable new business use cases and improve or optimize existing ones.
However, deploying and managing LLMs in production requires specialized infrastructure and workflows. In this blog, we'll show you how to use Infrastructure as Code with the AWS Cloud Development Kit (AWS CDK) to deploy and manage Llama 2. AWS CDK is an open-source software development framework that allows you to use code to define, provision, and manage your cloud infrastructure on AWS.
What you are going to do:
- Initialize and bootstrap a new CDK project
- Install the Hugging Face LLM CDK Construct
- Add LLM resource and deploy Llama 2
- Run inference and test the model
Before you get started, make sure you have the AWS CDK installed and your AWS credentials configured.
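If you still need to set these up, one common path is to install the CDK CLI via npm and configure credentials with the AWS CLI (a sketch; your setup may differ):

```bash
# Install the AWS CDK CLI globally (requires Node.js)
npm install -g aws-cdk

# Configure your AWS credentials and default region
aws configure
```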
1. Initialize and bootstrap a new CDK project
Deploying applications with the CDK may require additional resources, for example to store assets. The process of provisioning these initial resources is called bootstrapping. Before you can deploy your application, make sure your project is bootstrapped. Create a new, empty directory, then initialize and bootstrap the project.
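A minimal sketch, assuming a TypeScript CDK app (any CDK-supported language works):

```bash
# Create a new, empty project directory
mkdir huggingface-cdk-example && cd huggingface-cdk-example

# Initialize a new CDK app in TypeScript
cdk init app --language typescript

# Provision the resources CDK needs to deploy (e.g., an S3 bucket for assets)
cdk bootstrap
```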
The `cdk init` command creates files and folders inside the `huggingface-cdk-example` directory to help you organize the source code for your AWS CDK app. The `bin/` directory contains the app itself, and the (for now empty) stack it instantiates lives in the `lib/` directory.
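The layout of a freshly initialized TypeScript CDK app looks roughly like this (file names are derived from the directory name; your CDK version may generate slightly different files):

```text
huggingface-cdk-example/
├── bin/
│   └── huggingface-cdk-example.ts        # app entry point, instantiates the stack
├── lib/
│   └── huggingface-cdk-example-stack.ts  # the (for now empty) stack definition
├── test/
├── cdk.json
└── package.json
```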
2. Install the Hugging Face LLM CDK Construct
We created a new AWS CDK construct, `aws-sagemaker-huggingface-llm`, to make the deployment of LLMs easier than ever before. The construct uses the Hugging Face LLM Inference DLC, built on top of Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving large language models (LLMs).
The `aws-sagemaker-huggingface-llm` construct leverages `aws-sagemaker` and abstracts all of the heavy lifting away. You can install the construct using `npm`:
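```bash
npm install aws-sagemaker-huggingface-llm
```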
3. Add LLM resource and deploy Llama 2
A new CDK project is always empty in the beginning because the stack it contains doesn't define any resources. Let's add a `HuggingFaceLlm` resource. Open your stack in the `lib/` directory and import `HuggingFaceLlm` into it.
The construct also provides an interface for the available arguments called `HuggingFaceLlmProps`, where you can define the model ID, the number of GPUs to shard the model across, and custom parameters. All `environmentVariables` will be passed to the container.
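Below is a minimal sketch of what the stack could look like. The property names (`name`, `modelId`, `gpuCount`, `instanceType`) are illustrative placeholders for the fields defined in `HuggingFaceLlmProps`, and the model ID and instance type are example values; check the construct's documentation for the actual interface.

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { HuggingFaceLlm } from 'aws-sagemaker-huggingface-llm';

export class HuggingfaceCdkExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Deploy Llama 2 behind a SageMaker real-time endpoint using the
    // Hugging Face LLM Inference DLC (TGI). Property names below are
    // illustrative; see HuggingFaceLlmProps for the actual interface.
    new HuggingFaceLlm(this, 'Llama2Llm', {
      name: 'llama2-7b-chat',                   // name used for the endpoint
      modelId: 'meta-llama/Llama-2-7b-chat-hf', // model id on the Hugging Face Hub
      gpuCount: 1,                              // number of GPUs to shard the model across
      instanceType: 'ml.g5.2xlarge',            // GPU instance hosting the model
      environmentVariables: {
        MAX_INPUT_LENGTH: '2048',               // passed straight to the TGI container
        MAX_TOTAL_TOKENS: '4096',
      },
    });
  }
}
```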
Note: The `HuggingFaceLlm` construct exposes the SageMaker endpoint as an `endpoint` property, meaning you can easily add autoscaling, monitoring, or alerts.
Before you deploy the stack, validate the code by synthesizing it with `cdk`:
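```bash
cdk synth
```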
The `cdk synth` command executes your app, which causes the resources it defines to be translated into an AWS CloudFormation template.
To deploy the stack, use the `cdk deploy` command:
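```bash
cdk deploy
```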
AWS CDK will now synthesize the stack again and may ask you to confirm the changes. CDK will also list the IAM statements that will be created. Confirm with `y`. CDK will then create all required resources for Amazon SageMaker and deploy the model. Once the endpoint is up and running, the `deploy` command finishes and prints the name of your endpoint.
4. Run inference and test the model
The `aws-sagemaker-huggingface-llm` construct is built on top of Amazon SageMaker, which means it creates a real-time endpoint for you. To run inference, you can use the AWS SDK (in any language), the SageMaker Python SDK, or the AWS CLI. To keep things simple, use the SageMaker Python SDK.
If you haven't installed it yet, you can do so with `pip install sagemaker`. The SageMaker SDK implements a `HuggingFacePredictor` class, which makes it easy to send requests to your endpoint.
Since the construct uses the Hugging Face LLM Inference DLC, you can use the same parameters for inference, including `max_new_tokens`, `temperature`, `top_p`, etc. You can find a list of supported arguments and how to prompt Llama 2 correctly in the Deploy Llama 2 7B/13B/70B on Amazon SageMaker blog post under Run inference and chat with the model. To validate that everything works, you can test it with the snippet below.
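A minimal sketch using the SageMaker Python SDK; the endpoint name is a placeholder, so replace it with the name printed by `cdk deploy`, and format the prompt for Llama 2 as described in the linked blog post.

```python
from sagemaker.huggingface import HuggingFacePredictor

# Connect to the deployed endpoint (replace with the name printed by `cdk deploy`)
predictor = HuggingFacePredictor(endpoint_name="huggingface-llm-llama2-endpoint")

# Prompt plus generation parameters supported by the Hugging Face LLM Inference DLC (TGI)
payload = {
    "inputs": "What is Infrastructure as Code?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.9,
        "top_p": 0.6,
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])
```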
That's it! You made it. Now you can go to your DevOps team and help them integrate LLMs into your products.
Conclusion
In this post, we demonstrated how Infrastructure as Code with AWS CDK enables the productive use of large language models like Llama 2 in production. We showed how the `aws-sagemaker-huggingface-llm` construct helps deploy Llama 2 to SageMaker with minimal code.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.