philschmid blog

Autoscaling BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module

#HuggingFace #AWS #BERT #Terraform
March 01, 2022 · 6 min read


A few weeks ago we released a Terraform module, sagemaker-huggingface, which makes it super easy to deploy Hugging Face Transformers like BERT from Amazon S3 or the Hugging Face Hub to Amazon SageMaker for real-time inference.

module "sagemaker-huggingface" {
  source               = "philschmid/sagemaker-huggingface/aws"
  version              = "0.5.0"
  name_prefix          = "distilbert"
  pytorch_version      = "1.9.1"
  transformers_version = "4.12.3"
  instance_type        = "ml.g4dn.xlarge"
  hf_model_id          = "distilbert-base-uncased-finetuned-sst-2-english"
  hf_task              = "text-classification"
  autoscaling = {
    max_capacity               = 4   # The max capacity of the scalable target
    scaling_target_invocations = 200 # The scaling target invocations (requests/minute)
  }
}

You should check out the “Deploy BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module” blog post if you want to know more about Terraform and how we have built the module.

TL;DR: this module should enable companies and individuals to easily deploy Hugging Face Transformers without heavy lifting.

Since then we have received a lot of feedback and requests from users for additional features. Thank you for that! By the way, if you have any feedback or feature ideas, feel free to open a thread in the forum.

Below you can find the currently supported features as well as the newly supported ones.


  • Deploy Hugging Face Transformers from the Hugging Face Hub to Amazon SageMaker
  • Deploy Hugging Face Transformers from Amazon S3 to Amazon SageMaker
  • 🆕  Deploy private Hugging Face Transformers from the Hugging Face Hub to Amazon SageMaker with a hf_api_token
  • 🆕  Add Autoscaling to your Amazon SageMaker endpoints with the autoscaling configuration
  • 🆕  Deploy Asynchronous Inference Endpoints either from the Hugging Face Hub or Amazon S3

You can find examples for all use cases in the repository of the module or in the registry. In addition to the feature updates, we also improved the naming by adding a random lowercase string at the end of all resource names.
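To make the naming change a bit more concrete, here is a small Python sketch of the idea. It is only an illustration and not the module's actual code: the function resource_name, the kind argument, and the suffix length of 8 are assumptions, guessed from the endpoint name deploy-hub-ep-rzbiwuva that shows up later in this post.

```python
import random
import string


def resource_name(name_prefix, kind, suffix_length=8):
    """Build '<prefix>-<kind>-<random lowercase suffix>'.

    Hypothetical helper: the pattern is inferred from the endpoint name
    used later in this post, not taken from the module source.
    """
    suffix = "".join(random.choices(string.ascii_lowercase, k=suffix_length))
    return f"{name_prefix}-{kind}-{suffix}"


# Prints something shaped like deploy-hub-ep-rzbiwuva (the suffix is random).
print(resource_name("deploy-hub", "ep"))
```

The random suffix is what lets you apply the same configuration more than once without running into name collisions.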



Let’s test some of the new features and deploy an Asynchronous Inference Endpoint with autoscaling to zero.

How to deploy an Asynchronous Endpoint with Autoscaling using the sagemaker-huggingface Terraform module

Before we get started, make sure you have Terraform installed and configured, as well as access to AWS credentials to create the necessary services.

What are we going to do:

  • create a new Terraform configuration
  • initialize the AWS provider and our module
  • deploy our Asynchronous Endpoint
  • test the endpoint
  • destroy the infrastructure

If you want to learn more about Asynchronous Inference, you can check out my blog post: “Asynchronous Inference with Hugging Face Transformers and Amazon SageMaker”.

Create a new Terraform configuration

Each Terraform configuration must live in its own directory that contains a .tf configuration file. Our first step is to create the async-terraform directory with such a file.

mkdir async-terraform
touch async-terraform/
cd async-terraform

Initialize the AWS provider and our module

Next, we need to open the file in a text editor and add the aws provider as well as our module.

Note: the snippet below assumes that you have an AWS profile named default configured with the needed permissions.

provider "aws" {
  profile = "default"
  region  = "us-east-1"
}

# create bucket for async inference for inputs & outputs
resource "aws_s3_bucket" "async_inference_bucket" {
  bucket = "async-inference-bucket"
}

module "huggingface_sagemaker" {
  source               = "philschmid/sagemaker-huggingface/aws"
  version              = "0.5.0"
  name_prefix          = "deploy-hub"
  pytorch_version      = "1.9.1"
  transformers_version = "4.12.3"
  instance_type        = "ml.g4dn.xlarge"
  hf_model_id          = "distilbert-base-uncased-finetuned-sst-2-english"
  hf_task              = "text-classification"
  async_config = {
    # needs to be a s3 uri
    s3_output_path = "s3://async-inference-bucket/async-distilbert"
  }
  autoscaling = {
    min_capacity               = 0
    max_capacity               = 4
    scaling_target_invocations = 100
  }
}
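To get a feeling for what the autoscaling values above mean, here is a rough back-of-the-envelope sketch in Python. It is a simplified mental model and not how AWS actually evaluates scaling policies: it assumes scaling_target_invocations is a per-instance target in requests per minute and ignores cooldowns and metric delays. The function desired_instances is a made-up helper for illustration.

```python
import math


def desired_instances(invocations_per_minute, target_per_instance,
                      min_capacity, max_capacity):
    """Estimate the instance count a target-tracking policy would aim for.

    Simplified assumption: capacity = ceil(load / per-instance target),
    clamped to the [min_capacity, max_capacity] range.
    """
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))


# Values from the configuration above: target 100, min 0, max 4.
for rate in (0, 50, 250, 1000):
    print(rate, desired_instances(rate, 100, min_capacity=0, max_capacity=4))
# 0 0
# 50 1
# 250 3
# 1000 4
```

With min_capacity = 0 the estimate drops to zero instances when there is no traffic, which is the "autoscaling to zero" behavior we want for the asynchronous endpoint.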

When we create a new configuration — or check out an existing configuration from version control — we need to initialize the directory with terraform init.

Initializing will download and install our AWS provider as well as the sagemaker-huggingface module.

terraform init
# Initializing modules...
# Downloading philschmid/sagemaker-huggingface/aws 0.5.0 for huggingface_sagemaker...
# - huggingface_sagemaker in .terraform/modules/huggingface_sagemaker

# Initializing the backend...

# Initializing provider plugins...
# - Finding latest version of hashicorp/random...
# - Finding hashicorp/aws versions matching "~> 4.0"...
# - Installing hashicorp/random v3.1.0...

Deploy the Asynchronous Endpoint

To deploy/apply our configuration, we run the terraform apply command. Terraform will then print out which resources are going to be created and ask us if we want to continue, which we can confirm with yes.

terraform apply

Now Terraform will deploy our model to Amazon SageMaker as an asynchronous endpoint. This can take 2-5 minutes.

Test the endpoint

To test our deployed endpoint we can use any AWS SDK. In our example we are going to use the Python SageMaker SDK (sagemaker), but you can easily switch this to the Java, JavaScript, .NET, or Go SDK to invoke the Amazon SageMaker endpoint. We use the sagemaker SDK since it provides an easy-to-use AsyncPredictor object, which does the heavy lifting of uploading the data to Amazon S3 for us.

To initialize our predictor we need the name of our deployed endpoint and the Amazon S3 bucket defined in our Terraform configuration. We can get the endpoint name by inspecting the output of Terraform with terraform output or by going to the SageMaker service in the AWS Management Console.
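If you would rather read the endpoint name programmatically, one option is to parse terraform output -json from Python. The sketch below assumes that your root configuration re-exports the endpoint name in an output block; the output name endpoint_name and both helper functions are hypothetical and only used for illustration.

```python
import json
import subprocess


def parse_terraform_outputs(raw_json):
    """Flatten the JSON printed by `terraform output -json` to {name: value}."""
    outputs = json.loads(raw_json)
    return {name: entry["value"] for name, entry in outputs.items()}


def read_terraform_outputs(workdir="."):
    """Run `terraform output -json` in workdir and return the flattened outputs."""
    completed = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_terraform_outputs(completed.stdout)


# Hand-written payload in the shape Terraform emits; "endpoint_name" is an
# assumed output name, not something the module is known to define.
sample = (
    '{"endpoint_name": {"sensitive": false, "type": "string", '
    '"value": "deploy-hub-ep-rzbiwuva"}}'
)
print(parse_terraform_outputs(sample)["endpoint_name"])
```

Inside the async-terraform directory you would call read_terraform_outputs() instead of passing a hand-written sample.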

We create a new file with the following snippet.

Make sure you have configured your credentials (and region) correctly and have sagemaker installed.

from sagemaker.huggingface import HuggingFacePredictor
from sagemaker.predictor_async import AsyncPredictor

# name of the deployed endpoint and the S3 path used for async inference
ENDPOINT_NAME = "deploy-hub-ep-rzbiwuva"
ASYNC_S3_PATH = "s3://async-inference-bucket/async-distilbert"

# wrap the regular predictor so requests go through the async API
async_predictor = AsyncPredictor(HuggingFacePredictor(ENDPOINT_NAME))

data = {
    "inputs": [
        "it 's a charming and often affecting journey .",
        "it 's slow -- very, very slow",
        "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
        "the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .",
    ]
}

# uploads the payload to S3, invokes the endpoint and waits for the result
res = async_predictor.predict(data=data, input_path=ASYNC_S3_PATH)
print(res)

Now we can execute our request.

python3
# [{'label': 'LABEL_2', 'score': 0.8808117508888245}, {'label': 'LABEL_0', 'score': 0.6126593947410583}, {'label': 'LABEL_2', 'score': 0.9425230622291565}, {'label': 'LABEL_0', 'score': 0.5511414408683777}]
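If you are curious what the "heavy lifting" looks like without the sagemaker SDK, here is a minimal sketch of the two steps behind an asynchronous request: upload the payload to Amazon S3, then queue it with invoke_endpoint_async. The helper invoke_async and the inputs/ key layout are assumptions made for this example; the clients are passed in so that the function works with boto3.client("s3") and boto3.client("sagemaker-runtime") as well as with stand-ins.

```python
import json
import uuid


def invoke_async(s3_client, runtime_client, endpoint_name, bucket, prefix, payload):
    """Upload payload as JSON to S3 and queue it on an async endpoint.

    Returns the S3 URI where SageMaker will write the result once it is done.
    """
    # Hypothetical key layout: <prefix>/inputs/<random id>.json
    key = f"{prefix}/inputs/{uuid.uuid4()}.json"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/json",
    )
    return response["OutputLocation"]
```

The call returns right away with the output location. The prediction itself only appears in S3 after the endpoint has processed the request, so you have to poll that location (or use notifications) before reading the result.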

Destroy the infrastructure

To clean up our created resources we can run terraform destroy, which will delete all the created resources from the module.

More Examples

You can find examples of how to deploy private models and use autoscaling in the repository of the module or in the registry.


The sagemaker-huggingface Terraform module abstracts away all the heavy lifting for deploying Transformer models to Amazon SageMaker, which enables controlled, consistent and understandable managed deployments following the concepts of IaC. This should help companies move faster and integrate models deployed to Amazon SageMaker into their existing applications and IaC definitions.

Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.