Deploy Falcon 7B & 40B on Amazon SageMaker


The Falcon models are taking the open-source LLM space by storm! Falcon 7B and 40B are among the most exciting open models right now, and both can be used commercially under the Apache 2.0 license. The Falcon model family comes in two sizes: 7B, trained on 1.5T tokens, and 40B, trained on 1T tokens. Falcon 40B was trained on a multilingual dataset that includes German, Spanish, and French!

Last week, we announced the new Hugging Face LLM Inference Container for Amazon SageMaker, which allows you to easily deploy the most popular open-source LLMs, including Falcon, StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

This blog will guide you through deploying the Instruct Falcon 40B model to Amazon SageMaker. We will cover how to:

  1. Setup development environment
  2. Retrieve the new Hugging Face LLM DLC
  3. Deploy Falcon 40B to Amazon SageMaker
  4. Run inference and chat with our model

By the end of this guide, you will have a fully operational SageMaker Endpoint running Falcon 40B, ready to be used for your Generative AI application.

1. Setup development environment

We are going to use the sagemaker Python SDK to deploy Falcon 40B to Amazon SageMaker. We need an AWS account configured and the sagemaker Python SDK installed.

# install supported sagemaker SDK
!pip install "sagemaker>=2.175.0" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    # works when running inside SageMaker (e.g. a notebook instance or Studio)
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker: look up the execution role by name via IAM
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class through the image_uri argument. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method returns the URI of the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",  # backend; also pass version=... to pin a specific container release
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

3. Deploy Falcon 40B to Amazon SageMaker

Note: Quotas for Amazon SageMaker can vary between accounts. If you receive an error indicating you've exceeded your quota, you can request an increase through the Service Quotas console.

To deploy Falcon 40B Instruct to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the hf_model_id, instance_type, etc. We will use a g5.12xlarge instance type with 4 NVIDIA A10G GPUs and 96GB of GPU memory.

Note: If you plan to send long inputs (more than 1k tokens), adjust the config and enable int-8 quantization to reduce the memory footprint of the model. This comes with a small increase in latency.
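To see why the instance size and the quantization note matter, here is a rough back-of-the-envelope sketch. It assumes roughly 40 billion parameters and 16-bit weights by default, counts only the weights, and ignores the KV cache, activations, and framework overhead, so treat the numbers as an estimate rather than a measurement. The helper `weight_memory_gb` is a hypothetical name used only for this illustration.

```python
# Rough weight-only memory estimate (assumption: ignores KV cache,
# activations and framework overhead).
BYTES_PER_PARAM = {"float16": 2, "int8": 1}  # bytes needed to store one weight


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 40e9    # assumption: Falcon 40B has roughly 40 billion parameters
GPU_MEMORY_GB = 96   # g5.12xlarge: 4 x A10G with 24 GB each

for dtype in ("float16", "int8"):
    weights = weight_memory_gb(NUM_PARAMS, dtype)
    headroom = GPU_MEMORY_GB - weights
    print(f"{dtype}: ~{weights:.0f} GB for weights, ~{headroom:.0f} GB left for everything else")
```

Under these assumptions the 16-bit weights take about 80 GB of the 96 GB available, which leaves little room for long inputs, while int-8 roughly halves that.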

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300  # seconds

# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from the Hugging Face Hub
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to enable int-8 quantization
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.12xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.

# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # seconds the container may take to load the model and pass health checks
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10 minutes.

4. Run inference and chat with our model

After our endpoint is deployed, we can run inference on it using the predict method from the predictor. We can use different parameters to control the generation, defining them in the parameters attribute of the payload. The Hugging Face LLM Inference Container supports a wide variety of generation parameters, including top_p, temperature, stop, and max_new_tokens. You can find a full list of supported parameters here.
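The generation parameters interact with the container limits set in the TGI config: the input must fit within MAX_INPUT_LENGTH, and input tokens plus max_new_tokens must fit within MAX_TOTAL_TOKENS. The sketch below shows one way to assemble a payload and catch a budget overrun before sending the request. `build_payload` and its `input_tokens` argument are hypothetical names for this illustration, and the token count is assumed to come from the caller (for example from the model's tokenizer).

```python
from typing import Any, Dict, List, Optional

MAX_INPUT_LENGTH = 1024   # mirrors the MAX_INPUT_LENGTH value in the TGI config
MAX_TOTAL_TOKENS = 2048   # mirrors the MAX_TOTAL_TOKENS value in the TGI config


def build_payload(
    prompt: str,
    input_tokens: int,
    max_new_tokens: int = 512,
    stop: Optional[List[str]] = None,
    **sampling: Any,
) -> Dict[str, Any]:
    """Assemble a request body and check it against the configured token limits."""
    if input_tokens > MAX_INPUT_LENGTH:
        raise ValueError(f"input has {input_tokens} tokens, limit is {MAX_INPUT_LENGTH}")
    if input_tokens + max_new_tokens > MAX_TOTAL_TOKENS:
        raise ValueError(
            f"{input_tokens} input + {max_new_tokens} new tokens exceeds {MAX_TOTAL_TOKENS}"
        )
    # Extra keyword arguments (top_p, temperature, ...) are passed through unchanged.
    parameters: Dict[str, Any] = {"max_new_tokens": max_new_tokens, **sampling}
    if stop:
        parameters["stop"] = stop
    return {"inputs": prompt, "parameters": parameters}


example_payload = build_payload(
    "User: Hello!\nFalcon:",
    input_tokens=8,            # hypothetical token count for this short prompt
    max_new_tokens=256,
    stop=["\nUser:"],
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(example_payload)
```

Checking the budget on the client side is optional, since the container validates requests as well, but it tends to produce a clearer error message.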

The tiiuae/falcon-40b-instruct model is a conversational chat model, meaning we can chat with it using the following prompt:

# define payload
prompt = """You are a helpful Assistant, called Falcon. Knowing everything about AWS.

User: Can you tell me something about Amazon SageMaker?
Falcon:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

# print assistant response (the generated text starts with the prompt, so we cut it off)
assistant = response[0]["generated_text"][len(prompt):]
print(assistant)

As a response, you should get something like this.

Sure! Amazon SageMaker is a fully managed service that provides everything you need to build, train and deploy machine learning models. It allows you to create, train and deploy machine learning models, as well as host real-time inference to make predictions on new data.

Before we wrap up, let's use this response and chat for another turn.

new_prompt = f"""{prompt}{assistant}
User: How would you recommend I start using Amazon SageMaker if I am new to Machine Learning?
Falcon:"""
# update payload
payload["inputs"] = new_prompt

# send request to endpoint
response = llm.predict(payload)

# print assistant response
new_assistant = response[0]["generated_text"][len(new_prompt):]
print(new_assistant)

Let's see what it says.

If you are new to machine learning, I would recommend starting with Amazon SageMaker Studio. This is a web-based interface that provides a step-by-step guide to creating, training, and deploying machine learning models.
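The two turns above follow the same pattern, so for longer conversations it can help to wrap the prompt handling in small helpers. The following is a minimal sketch, not part of the container's API: `build_prompt` and `extract_reply` are hypothetical names, and the `User:` / `Falcon:` turn format is an assumption based on the prompt and stop sequences used in this post.

```python
from typing import List, Tuple


def build_prompt(
    system: str,
    history: List[Tuple[str, str]],
    user_message: str,
    assistant_name: str = "Falcon",
) -> str:
    """Render the system text, all finished turns, and the new user message as one prompt."""
    lines = [system, ""]
    for user_text, assistant_text in history:
        lines.append(f"User: {user_text}")
        lines.append(f"{assistant_name}: {assistant_text.strip()}")
    lines.append(f"User: {user_message}")
    lines.append(f"{assistant_name}:")  # left open so the model continues from here
    return "\n".join(lines)


def extract_reply(generated_text: str, prompt: str, stop: List[str]) -> str:
    """Drop the echoed prompt and a trailing stop sequence from the generated text."""
    reply = generated_text[len(prompt):] if generated_text.startswith(prompt) else generated_text
    for sequence in stop:
        if reply.endswith(sequence):
            reply = reply[: -len(sequence)]
    return reply.strip()


# Example with a made-up first turn; no endpoint call is involved here.
history = [("Can you tell me something about Amazon SageMaker?", "Sure! It is a managed ML service.")]
next_prompt = build_prompt("You are a helpful Assistant, called Falcon.", history, "How do I get started?")
print(next_prompt)
```

With an endpoint running, the returned prompt would go into `payload["inputs"]`, and `extract_reply` would be applied to `response[0]["generated_text"]` before appending the new turn to `history`.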

Awesome! 🚀 We have successfully deployed Falcon 40B to Amazon SageMaker and run inference on it. To clean up, we can delete the model and endpoint.
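A minimal sketch of that cleanup step, assuming `llm` is the predictor object returned by `llm_model.deploy(...)`. The wrapper function `cleanup` is a hypothetical helper for this post; `delete_model` and `delete_endpoint` are the predictor methods from the sagemaker SDK.

```python
def cleanup(predictor) -> None:
    """Delete the model, then the endpoint, for the given SageMaker predictor."""
    predictor.delete_model()
    predictor.delete_endpoint()


# Usage, once you are done with the endpoint:
# cleanup(llm)  # llm is the predictor returned by llm_model.deploy(...)
```

The model is deleted before the endpoint here, which is the usual order for this pattern.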



We successfully deployed Falcon 40B using the new Hugging Face LLM Inference DLC. The easy-to-use API and deployment process allowed us to deploy the Falcon 40B model to Amazon SageMaker. We covered how to set up the development environment, retrieve the new Hugging Face LLM DLC, deploy the model, and run inference on it. Remember to clean up by deleting the model and endpoint after you're done. Good luck with your Generative AI application!

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.