Deploy Llama 3.2 Vision on Amazon SageMaker
Llama 3.2 is the latest release of open LLMs from Meta's Llama family (as of October 2024); Llama 3.2 Vision comes in two sizes: 11B for efficient deployment and development on consumer-sized GPUs, and 90B for large-scale applications.
In this blog you will learn how to deploy meta-llama/Llama-3.2-11B-Vision-Instruct to Amazon SageMaker. We are going to use the Hugging Face LLM DLC, a purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also covers hardware requirements for the different model sizes.
Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union (EU) is not granted the license rights to use the multimodal models included in Llama 3.2. This restriction does not apply to end users of a product or service that incorporates any such multimodal models, so people can still build global products with the vision variants.
For full details, please make sure to read the official license and the acceptable use policy.
In this blog we will cover how to:
- Setup development environment
- Retrieve the new Hugging Face LLM DLC
- Hardware requirements
- Deploy Llama 3.2 11B to Amazon SageMaker
- Run inference and chat with the model
- Clean up
Let's get started!
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Llama 3.2 Vision to Amazon SageMaker. We need to make sure to have an AWS account configured and the sagemaker Python SDK installed.
# TODO: Update version when image included
!pip install "sagemaker>=2.232.2" --upgrade --quiet
# install huggingface hub
!pip install huggingface_hub --quiet
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it doesn't exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
# TODO: COMMENT IN WHEN IMAGE IS RELEASED
# from sagemaker.huggingface import get_huggingface_llm_image_uri
# # retrieve the llm image uri
# llm_image = get_huggingface_llm_image_uri(
# "huggingface",
# version="2.3.1"
# )
# # print ecr image uri
# print(f"llm image uri: {llm_image}")
llm_image = f"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/huggingface-pytorch-tgi-inference:2.4-tgi2.3-gpu-py311-cu124-ubuntu22.04"
3. Hardware requirements
Llama 3.2 Vision comes in two sizes - 11B & 90B parameters. The hardware requirements will vary based on the model size deployed to SageMaker. Below is a set of minimum requirements for each model size we tested.
Model | Instance Type | # of GPUs per replica |
---|---|---|
Llama 3.2 11B | (ml.)g5.12xlarge or (ml.)g6.12xlarge | 4 |
Llama 3.2 90B | (ml.)g6e.48xlarge | 8 |
Llama 3.2 90B | (ml.)p4d.24xlarge | 8 |
Note: Amazon SageMaker currently doesn't support instance slicing, meaning that, e.g. for Llama 3.2 90B, you cannot run multiple replicas on a single instance.
These are the setups we have validated for Llama 3.2 11B and 90B models to work on SageMaker.
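As a rough sanity check for these numbers, you can estimate the memory needed just to hold the weights: about 2 bytes per parameter in bf16, plus some headroom for the KV cache and CUDA overhead. Below is a back-of-the-envelope sketch; the 20% headroom factor is an assumption, not a measured value.
# Rough estimate of GPU memory needed to host the weights in bf16.
# The 1.2x headroom for KV cache / CUDA overhead is an assumption.
def estimate_gpu_memory_gb(num_params_billion, bytes_per_param=2, headroom=1.2):
    return num_params_billion * bytes_per_param * headroom

for name, size_b in [("Llama 3.2 11B", 11), ("Llama 3.2 90B", 90)]:
    print(f"{name}: ~{estimate_gpu_memory_gb(size_b):.0f} GB of GPU memory")
# Llama 3.2 11B: ~26 GB -> fits across the 4x 24GB GPUs of a (ml.)g5/g6.12xlarge
# Llama 3.2 90B: ~216 GB -> needs e.g. 8x 48GB L40S (g6e.48xlarge) or 8x 40GB A100 (p4d.24xlarge)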
4. Deploy Llama 3.2 11B to Amazon SageMaker
To deploy meta-llama/Llama-3.2-11B-Vision-Instruct to Amazon SageMaker we create a HuggingFaceModel model class and define our endpoint configuration, including the hf_model_id, instance_type, etc. We will use a g6.12xlarge instance type, which has 4 NVIDIA L4 GPUs and 96GB of GPU memory.
As meta-llama/Llama-3.2-11B-Vision-Instruct is a gated model with restricted access in the European Union (EU), you need to accept the license agreement before you can access it.
To generate a token for the Hugging Face Hub, you can follow the instructions in Hugging Face Hub - User access tokens; the generated token can either be a fine-grained token with access to the model, or a read-only token for your account.
from huggingface_hub import interpreter_login
interpreter_login()
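interpreter_login opens an interactive prompt in the notebook. If you are running this in a non-interactive environment (e.g. a scheduled job or CI), you can pass the token programmatically instead — a minimal sketch, assuming you expose the token via an HF_TOKEN environment variable:
import os
from huggingface_hub import login

# Non-interactive alternative: read the token from an environment variable.
# Storing it in HF_TOKEN is an assumption about how you manage the secret.
login(token=os.environ["HF_TOKEN"])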
After we are logged in we can create our HuggingFaceModel.
import json
from sagemaker.huggingface import HuggingFaceModel
from huggingface_hub import get_token
# sagemaker config
instance_type = "ml.g6.12xlarge"
number_of_gpu = 4
health_check_timeout = 300
# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': "meta-llama/Llama-3.2-11B-Vision-Instruct", # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
    'MAX_INPUT_LENGTH': json.dumps(6000), # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(8192), # Max length of the generation (including input text)
    'HF_HUB_ENABLE_HF_TRANSFER': "1", # Enable HF transfer for faster downloads
    'HUGGING_FACE_HUB_TOKEN': get_token(), # Hugging Face token
    'MESSAGES_API_ENABLED': "true", # Enable messages API
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g6.12xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.
# Deploy model to an endpoint
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout, # 5 minutes to be able to load the model
)
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
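If your notebook kernel restarts while the endpoint is still running, you don't need to redeploy — you can re-attach a predictor to the existing endpoint. A minimal sketch, where the endpoint name is a placeholder (you can read the real one from llm.endpoint_name or the SageMaker console):
from sagemaker.huggingface import HuggingFacePredictor

# Re-attach to an already running endpoint instead of deploying a new one.
# "llama-3-2-11b-vision" is a placeholder endpoint name.
llm = HuggingFacePredictor(
    endpoint_name="llama-3-2-11b-vision",
    sagemaker_session=sess,
)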
5. Run inference and chat with the model
After our endpoint is deployed we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We deployed Llama 3.2 Vision with MESSAGES_API_ENABLED=true, which allows us to use the OpenAI-compatible messages API. This allows us to include content of "type": "image_url", which can be a link to an image or a base64 encoded image. To keep things realistic, we are going to upload an image to S3 and create a pre-signed URL to it, which we will use in our inference request. That's how you could handle images in a real-world application.
import os
from botocore.client import Config

s3 = sess.boto_session.client('s3', config=Config(signature_version='s3v4'))

def upload_image(image_path):
    # params
    bucket = sess.default_bucket()
    key = os.path.join("input", os.path.basename(image_path))
    # Upload image to S3
    s3.upload_file(image_path, bucket, key)
    # Generate pre-signed URL valid for 5 minutes
    url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=300
    )
    return url
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How long does it take from invoice date to due date? Be short and concise."},
            {
                "type": "image_url",
                "image_url": {
                    "url": upload_image("../assets/invoice.png")
                },
            },
        ],
    },
]
# Call the endpoint with the prompt and parameters
chat = llm.predict({
    "messages": messages,
    "max_tokens": 512,
    # "do_sample": True,
    "top_p": 0.95,
    "temperature": 1.0,
    "stream": False,
})
print(chat["choices"][0]["message"]["content"])
Example output for the sample invoice image: "To calculate the time difference between the invoice date and the due date, we need to subtract the invoice date from the due date."
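As mentioned above, the messages API also accepts base64 encoded images instead of a URL. Below is a minimal sketch of that variant, using the same local invoice image and the OpenAI-style data URL convention (data:image/png;base64,...); treat the data URL format as an assumption about how your TGI version expects inline images.
import base64

# Encode a local image as a base64 data URL instead of uploading it to S3.
def image_to_data_url(image_path, mime_type="image/png"):
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

chat = llm.predict({
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this invoice in one sentence."},
                {"type": "image_url", "image_url": {"url": image_to_data_url("../assets/invoice.png")}},
            ],
        },
    ],
    "max_tokens": 256,
})
print(chat["choices"][0]["message"]["content"])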
6. Clean up
To clean up, we can delete the model and endpoint.
llm.delete_model()
llm.delete_endpoint()
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.