LLMOps: Deploy Open LLMs using Infrastructure as Code with AWS CDK

August 15, 2023 · 6 minute read

Open large language models (LLMs), like Llama 2 or Falcon, are rapidly changing what we think is achievable with AI. These open LLMs will enable new business use cases and improve or optimize existing ones.

However, deploying and managing LLMs in production requires specialized infrastructure and workflows. In this blog post, we'll show you how to use Infrastructure as Code with the AWS Cloud Development Kit (AWS CDK) to deploy and manage Llama 2. The AWS CDK is an open-source software development framework that lets you define, provision, and manage your cloud infrastructure on AWS in code.

What you are going to do:

Initialize and bootstrap a new CDK project
Install the Hugging Face LLM CDK Construct
Add LLM resource and deploy Llama 2
Run inference and test the model

Before you get started, make sure you have the AWS CDK installed and your AWS credentials configured.

1. Initialize and bootstrap a new CDK project

Deploying applications with the CDK may require additional resources, for example a place for the CDK to store assets. The process of provisioning these initial resources is called bootstrapping. So before you can deploy your application, make sure the AWS environment (account and region) you deploy to is bootstrapped. Create a new empty directory, then initialize and bootstrap the project:

# create new directory
mkdir huggingface-cdk-example && cd huggingface-cdk-example
# initialize project
cdk init app --language typescript
# bootstrap
cdk bootstrap

The cdk init command creates files and folders inside the huggingface-cdk-example directory to help you organize the source code for your AWS CDK app. The bin/ directory contains the app entry point, which instantiates an empty stack defined under the lib/ directory.

2. Install the Hugging Face LLM CDK Construct

We created a new AWS CDK construct, aws-sagemaker-huggingface-llm, to make deploying LLMs easier than ever before. The construct uses the Hugging Face LLM Inference DLC, built on top of Text Generation Inference (TGI), an open-code, purpose-built solution for deploying and serving Large Language Models (LLMs).

The aws-sagemaker-huggingface-llm construct leverages aws-sagemaker and abstracts all of the heavy lifting away. You can install the construct using npm.

npm install aws-sagemaker-huggingface-llm

3. Add LLM resource and deploy Llama 2

A new CDK project is always empty in the beginning because the stack it contains doesn't define any resources. Let's add a HuggingFaceLlm resource. To do so, open your stack in the lib/ directory and import HuggingFaceLlm into it.

import * as cdk from 'aws-cdk-lib'
import { Construct } from 'constructs'
import { HuggingFaceLlm } from 'aws-sagemaker-huggingface-llm'
 
export class HuggingfaceCdkExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)
    // create new LLM SageMaker Endpoint
    new HuggingFaceLlm(this, 'Llama2Llm', {
      name: 'llama2-chat',
      instanceType: 'ml.g5.2xlarge',
      environmentVariables: {
        HF_MODEL_ID: 'NousResearch/Llama-2-7b-chat-hf',
        SM_NUM_GPUS: '1',
        MAX_INPUT_LENGTH: '2048',
        MAX_TOTAL_TOKENS: '4096',
        MAX_BATCH_TOTAL_TOKENS: '8192',
      },
    })
  }
}

The construct also provides an interface for the available arguments called HuggingFaceLlmProps, where you can define your model ID, the number of GPUs to shard the model across, and custom parameters. All environmentVariables will be passed to the container.
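
The token limits passed through environmentVariables depend on each other. The Python sketch below is not part of the construct. It is a hypothetical pre-deployment check (the function name check_token_limits is ours) built on two assumptions: the input budget must be smaller than the total budget, since the total covers input plus generated tokens, and the batch budget must be large enough to hold at least one full request.

```python
def check_token_limits(env: dict) -> list:
    """Return a list of problems found in the token-related settings.

    `env` mirrors the string-valued environmentVariables from the stack above.
    The rules are assumptions derived from what the variable names mean,
    not an official validation routine.
    """
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    max_batch = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    # Assumption: total tokens = prompt tokens + generated tokens,
    # so the prompt budget has to leave room for generation.
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # Assumption: a batch must be able to hold at least one full request.
    if max_batch < max_total:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")
    return problems


settings = {
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
    "MAX_BATCH_TOTAL_TOKENS": "8192",
}
print(check_token_limits(settings))  # prints [] for the values used in the stack
```

Running a check like this before cdk deploy is cheaper than waiting for an endpoint to fail at startup, but treat the rules as a starting point and adjust them to the container version you deploy.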

Note: The HuggingFaceLlm construct exposes the SageMaker Endpoint as its endpoint property, which means that you can easily add autoscaling, monitoring, or alerts.
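
To illustrate what adding autoscaling involves, here is a minimal Python sketch that builds the two Application Auto Scaling requests for a SageMaker endpoint variant. It configures scaling outside of CDK, after deployment, which is a different route than attaching it to the endpoint property in the stack. The helper name autoscaling_requests, the variant name "AllTraffic", and the capacity and target values are assumptions for illustration; check which variant name your endpoint actually uses.

```python
def autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                         min_instances=1, max_instances=2,
                         invocations_per_instance=10.0):
    """Build keyword arguments for register_scalable_target and put_scaling_policy.

    variant_name and the numeric defaults are illustrative assumptions.
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy


target, policy = autoscaling_requests("YOUR ENDPOINT NAME")
print(target["ResourceId"])

# To apply the configuration (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

Keeping the request construction separate from the API calls makes the configuration easy to review before anything is changed in your account.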

Before you deploy the stack, make sure the code is validated by synthesizing it using cdk.

cdk synth

The cdk synth command executes your app, which causes the resources it defines to be translated into an AWS CloudFormation template.

To deploy the stack, you can use the deploy command from cdk.

cdk deploy

AWS CDK will now synthesize the stack again and may ask you to confirm the changes. CDK will also list the IAM statements that will be created. Confirm with y. CDK will then create all required resources for Amazon SageMaker and deploy the model. Once the endpoint is up and running, the deploy command finishes and prints the name of your endpoint, as in the example below:

Outputs:
HuggingfaceCdkExampleStack.Llama2LlmEndpointNameBD92F39C = llama2-chat-endpoint-1h7s2afii09310d4d605026
Stack ARN:
arn:aws:cloudformation:us-east-1:558105141721:stack/HuggingfaceCdkExampleStack/484a4770-3b3f-11ee-95f2-0eabb10b55f3
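
If you would rather not copy the endpoint name by hand, one option is to run cdk deploy --outputs-file outputs.json, which writes the stack outputs to a JSON file keyed by stack name. The Python sketch below picks the endpoint name out of that structure. The helper name find_endpoint_name and the rule of matching on "EndpointName" in the output key are assumptions based on the output shown above.

```python
import json


def find_endpoint_name(outputs: dict, stack_name: str) -> str:
    """Return the first output of the stack whose key contains 'EndpointName'."""
    stack_outputs = outputs[stack_name]
    for key, value in stack_outputs.items():
        if "EndpointName" in key:
            return value
    raise KeyError(f"no endpoint name output found for stack {stack_name}")


# Inline sample shaped like the outputs file; with a real deployment you
# would load it instead:
#   with open("outputs.json") as f:
#       outputs = json.load(f)
outputs = json.loads("""
{
  "HuggingfaceCdkExampleStack": {
    "Llama2LlmEndpointNameBD92F39C": "llama2-chat-endpoint-1h7s2afii09310d4d605026"
  }
}
""")

print(find_endpoint_name(outputs, "HuggingfaceCdkExampleStack"))
```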

4. Run inference and test the model

The aws-sagemaker-huggingface-llm construct is built on top of Amazon SageMaker. This means that the construct creates a real-time endpoint for us. To run inference, you can use the AWS SDK (in any language), the SageMaker Python SDK, or the AWS CLI. To keep things simple, use the SageMaker Python SDK.

If you haven't installed it yet, you can install it with pip install sagemaker. The sagemaker SDK implements a HuggingFacePredictor class, which makes it easy to send requests to your endpoint.

from sagemaker.huggingface import HuggingFacePredictor
 
# create predictor
predictor = HuggingFacePredictor("YOUR ENDPOINT NAME") # llama2-chat-endpoint-1h7s2afii09310d4d605026
 
# run inference
predictor.predict({"inputs": "Can you tell me something about AWS CDK?"})
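
For comparison, here is a rough sketch of the AWS SDK route using the invoke_endpoint call of the SageMaker runtime client in boto3. The helper invoke_llm and the FakeRuntimeClient stand-in are our own illustrative names; the fake client only exists so the snippet runs offline and its reply is made up, not real model output.

```python
import io
import json


def invoke_llm(runtime_client, endpoint_name, payload):
    """Send a JSON payload to a SageMaker endpoint and decode the JSON reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes containing JSON.
    return json.loads(response["Body"].read())


class FakeRuntimeClient:
    """Offline stand-in for boto3.client("sagemaker-runtime") (illustration only)."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        prompt = json.loads(Body)["inputs"]
        body = json.dumps([{"generated_text": prompt + " (stub reply)"}])
        return {"Body": io.BytesIO(body.encode("utf-8"))}


result = invoke_llm(FakeRuntimeClient(), "YOUR ENDPOINT NAME", {"inputs": "Hello"})
print(result[0]["generated_text"])
```

Against a real endpoint you would pass boto3.client("sagemaker-runtime") instead of the fake client, with the same payload shape as in the predictor example.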

Since the construct uses the Hugging Face LLM Inference DLC, you can use the same parameters for inference, including max_new_tokens, temperature, top_p, etc. You can find a list of supported arguments and how to prompt Llama 2 correctly in the Deploy Llama 2 7B/13B/70B on Amazon SageMaker blog post under Run inference and chat with the model. To validate that it works, you can test it with the following request:

# Llama 2 chat prompt: system message inside <<SYS>>, user message inside [INST]
prompt = """<s>[INST] <<SYS>>
You are an AWS Expert
<</SYS>>

Should I rather use AWS CDK or Terraform? [/INST]
"""
# generation parameters for the LLM
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "stop": ["</s>"]
  }
}
 
# send request to endpoint
response = predictor.predict(payload)
 
print(response[0]["generated_text"][len(prompt):])
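
If you send more than one prompt, it can help to wrap the prompt layout and the prompt stripping in small helpers. The following is a minimal sketch that follows the same layout as the prompt above; the function names build_llama2_prompt and strip_prompt are our own, and the variant without a system message is an assumption about the Llama 2 chat format rather than something shown in this post.

```python
def build_llama2_prompt(user_message, system_message=None):
    """Format a single-turn Llama 2 chat prompt."""
    if system_message:
        return (
            f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]\n"
        )
    # Assumed layout when no system message is given.
    return f"<s>[INST] {user_message} [/INST]\n"


def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


prompt = build_llama2_prompt(
    "Should I rather use AWS CDK or Terraform?",
    system_message="You are an AWS Expert",
)
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 512}}
print(prompt)
```

The payload can then be sent with predictor.predict(payload), and strip_prompt(response[0]["generated_text"], prompt) returns only the model's answer even if the endpoint does not echo the prompt.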

That's it! You made it. Now you can go to your DevOps team and help them integrate LLMs into your products.

Conclusion

In this post, we demonstrated how Infrastructure as Code with AWS CDK makes it practical to run large language models like Llama 2 in production. We showed how the aws-sagemaker-huggingface-llm construct helps to deploy Llama 2 to SageMaker with minimal code.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.