philschmid blog

Speed up BERT inference with Hugging Face Transformers and AWS Inferentia

#HuggingFace #AWS #BERT #Inferentia
March 16, 2022 · 9 min read


notebook: sagemaker/18_inferentia_inference

The adoption of BERT and Transformers continues to grow. Transformer-based models now achieve state-of-the-art performance not only in Natural Language Processing but also in Computer Vision, Speech, and Time-Series. 💬 🖼 🎤 ⏳

Companies are now slowly moving from the experimentation and research phase to the production phase in order to use Transformer models for large-scale workloads. But by default, BERT and its friends are relatively slow, big, and complex compared to traditional Machine Learning algorithms. Accelerating Transformers and BERT is, and will remain, an interesting challenge to solve.

AWS’s answer to this challenge was to design a custom machine learning chip optimized for inference workloads, called AWS Inferentia. AWS says that AWS Inferentia “delivers up to 80% lower cost per inference and up to 2.3X higher throughput than comparable current generation GPU-based Amazon EC2 instances.”

The real value of AWS Inferentia instances compared to GPU comes through the multiple Neuron Cores available on each device. A Neuron Core is the custom accelerator inside AWS Inferentia. Each Inferentia chip comes with 4x Neuron Cores. This enables you to either load 1 model on each core (for high throughput) or 1 model across all cores (for lower latency).


In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia.

You can find the notebook here: sagemaker/18_inferentia_inference

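You will learn how to:

1. Convert your Hugging Face Transformer to AWS Neuron
2. Create a custom script for text-classification
3. Create and upload the neuron model and inference script to Amazon S3
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
5. Run and evaluate Inference performance of BERT on Inferentia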

Let’s get started! 🚀

If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.

1. Convert your Hugging Face Transformer to AWS Neuron

We are going to use the AWS Neuron SDK for AWS Inferentia. The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to Neuron-compatible models, which can be run on EC2 Inf1 instances.

As a first step, we need to install the Neuron SDK and the required packages.

Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the conda_python3 conda kernel.

# Set Pip repository to point to the Neuron repository
!pip config set global.extra-index-url
# Install Neuron PyTorch (requirements are quoted so the shell does not treat ">" as a redirect or expand "[...]")
!pip install "torch-neuron==1.9.1.*" "neuron-cc[tensorflow]" "sagemaker>=2.79.0" "transformers==4.12.3" --upgrade

After we have installed the Neuron SDK, we can load and convert our model. Models are converted with torch_neuron and its trace method, which works similarly to TorchScript tracing. You can find more information in our documentation.

To be able to convert our model, we first need to select the model we want to use for our text classification pipeline from the Hugging Face Hub. For this example, let’s go with distilbert-base-uncased-finetuned-sst-2-english, but this can easily be swapped for other BERT-like models.

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

At the time of writing, the AWS Neuron SDK does not support dynamic shapes, which means that the input size needs to be static for compiling and inference.

In simpler terms, this means that when the model is compiled with e.g. an input of batch size 1 and sequence length of 16, the model can only run inference on inputs with that same shape.
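
To make the static-shape constraint concrete, here is a minimal pure-Python sketch of what padding and truncating to the compiled sequence length means. The helper `pad_to_static_shape`, the example token ids, and the pad id of 0 are illustrative assumptions, not part of the Neuron SDK or Transformers; in this tutorial the tokenizer does this work via `padding="max_length"` and `truncation=True`.

```python
from typing import List, Tuple


def pad_to_static_shape(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Truncate or pad token ids so the result always has max_length entries.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    Returns (input_ids, attention_mask).
    """
    # Drop everything beyond the compiled sequence length.
    kept = token_ids[:max_length]
    # Fill the remainder with the padding id.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short input is padded, a long input is truncated; both end up with 16 entries.
short_ids, short_mask = pad_to_static_shape([101, 2023, 102], max_length=16)
long_ids, long_mask = pad_to_static_shape(list(range(1, 41)), max_length=16)
print(len(short_ids), len(long_ids))
```

Unlike this naive slice, a real tokenizer keeps the special tokens in place when it truncates, so treat the sketch as a picture of the shape handling only.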

When using a t2.medium instance, the compilation takes around 3 minutes.

import os
import tensorflow  # to workaround a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

# create dummy input for max length 128
dummy_input = "dummy input which will be padded later"
max_length = 128
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length", return_tensors="pt")
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(model, neuron_inputs)
model.config.update({"traced_sequence_length": max_length})

# save tokenizer, neuron model and config for later use
save_dir = "tmp"
os.makedirs(save_dir, exist_ok=True)
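# save the traced model (the file name is an assumption; keep it in sync with the name loaded in the inference script)
model_neuron.save(os.path.join("tmp", "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)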

2. Create a custom script for text-classification

The Hugging Face Inference Toolkit supports zero-code deployments on top of the pipeline feature from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [Example].

Currently, this feature is not supported with AWS Inferentia, which means we need to provide an inference script.

If you would be interested in support for zero-code deployments for Inferentia, let us know on the forum.

To use a custom inference script, we need to create a script file. In our example, we are going to overwrite model_fn to load our neuron model and predict_fn to create a text-classification pipeline.

If you want to know more about the script check out this example. It explains amongst other things what model_fn and predict_fn are.

!mkdir code

We set NEURONCORE_GROUP_SIZES=1 so that each HTTP worker uses 1 Neuron core, which maximizes throughput.

%%writefile code/

import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron

# To use one neuron core per worker
os.environ["NEURONCORE_GROUP_SIZES"] = "1"
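# saved weights name (assumed value; it must match the file name used when saving the traced model)
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"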
def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    # unpack model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # tokenize the inputs, padded and truncated to the traced sequence length
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a list of dictionaries, which is JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]
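
The post-processing at the end of predict_fn can be sketched without PyTorch to show the shape of what the endpoint returns. This is an illustration only: the `id2label` mapping and the logits below are made-up assumptions, and `to_predictions` is a hypothetical helper, not part of the inference toolkit.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_predictions(batch_logits: List[List[float]], id2label: Dict[int, str]) -> List[dict]:
    """Turn raw logits into the label/score dictionaries that predict_fn returns."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        # index of the most likely class (argmax)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": id2label[best], "score": probs[best]})
    return results


# Assumed label mapping and example logits, for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = to_predictions([[-2.0, 3.0], [1.5, -0.5]], id2label)
print(predictions)
```

Each input yields one dictionary, so a batch of two inputs returns a list of two label/score pairs.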

3. Create and upload the neuron model and inference script to Amazon S3

Before we can deploy our neuron model to Amazon SageMaker, we need to create a model.tar.gz archive with all the model artifacts saved in tmp/ and upload it to Amazon S3.

To do this, we need to set up our permissions.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker: look up the execution role by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Next, we create our model.tar.gz. The script will be placed into a code/ folder.

# copy the inference script into the code/ directory of the model directory
!cp -r code/ tmp/code/
# create a model.tar.gz archive with all the model artifacts and the script
%cd tmp
!tar zcvf model.tar.gz *
%cd ..
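
If you prefer to build the archive from Python instead of shell magics, the standard-library tarfile module can produce the same layout. The sketch below runs against a throwaway directory with placeholder files; the file names (config.json, code/placeholder_script.py) are assumptions for illustration, and `make_model_archive` is a hypothetical helper.

```python
import os
import tarfile
import tempfile
from typing import List


def make_model_archive(model_dir: str, archive_path: str) -> List[str]:
    """Pack every entry of model_dir at the root of a gzip-compressed tar archive.

    Hypothetical helper; roughly mirrors `cd model_dir && tar zcvf model.tar.gz *`.
    Returns the sorted member names for inspection.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname keeps paths relative, so files sit at the archive root
            tar.add(os.path.join(model_dir, name), arcname=name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in for tmp/ with one artifact and a code/ folder.
    model_dir = os.path.join(workdir, "tmp")
    os.makedirs(os.path.join(model_dir, "code"))
    for rel_path in ("config.json", os.path.join("code", "placeholder_script.py")):
        with open(os.path.join(model_dir, rel_path), "w") as handle:
            handle.write("placeholder\n")
    # Write the archive outside model_dir so it never packs itself.
    members = make_model_archive(model_dir, os.path.join(workdir, "model.tar.gz"))

print(members)
```

Listing the members before uploading is a cheap way to confirm that the artifacts sit at the archive root and the script sits under code/.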

Now we can upload our model.tar.gz to our session S3 bucket with sagemaker.

from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/{model_id}"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz", desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

4. Deploy a Real-time Inference Endpoint on Amazon SageMaker

After we have uploaded our model.tar.gz to Amazon S3, we can create a custom HuggingFaceModel. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_model_uri,      # path to your model and script
    role=role,                    # iam role with permissions to create an Endpoint
    transformers_version="4.12",  # transformers version used
    pytorch_version="1.9",        # pytorch version used
    py_version='py37',            # python version used
)

# Let SageMaker know that we've already compiled the model via neuron-cc
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,       # number of instances
    instance_type="ml.inf1.xlarge"  # AWS Inferentia Instance
)

5. Run and evaluate Inference performance of BERT on Inferentia

The .deploy() call returns a HuggingFacePredictor object, which can be used to request inference.

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
res

We managed to deploy our neuron compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let’s test its performance. As a dummy load test, we will loop and send 10,000 synchronous requests to our endpoint.

# send 10000 requests
for i in range(10000):
    resp = predictor.predict(
        data={"inputs": "it 's a charming and often affecting journey ."}
    )
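
The loop above only generates load. If you also want client-side numbers, a small timing wrapper is enough. The sketch below is an illustration under stated assumptions: `measure_latency` is a hypothetical helper, and `fake_predict` stands in for predictor.predict so the snippet runs without an endpoint. Client-side timings include network overhead, so expect them to be higher than the ModelLatency metric in CloudWatch.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict


def measure_latency(predict: Callable[[Any], Any], payload: Any, n_requests: int) -> Dict[str, float]:
    """Call predict(payload) n_requests times and summarize latency in milliseconds."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    # nearest-rank index of the 95th percentile in the sorted list
    p95_index = max(0, math.ceil(0.95 * len(latencies_ms)) - 1)
    return {
        "count": float(len(latencies_ms)),
        "avg_ms": statistics.mean(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


def fake_predict(data: Any) -> list:
    # Stand-in for predictor.predict; sleeps briefly instead of calling an endpoint.
    time.sleep(0.001)
    return [{"label": "POSITIVE", "score": 0.99}]


stats = measure_latency(
    fake_predict,
    {"inputs": "it 's a charming and often affecting journey ."},
    n_requests=20,
)
print(stats)
```

Against a live endpoint you would pass predictor.predict in place of `fake_predict`; the payload is forwarded as its first positional argument.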

Let’s inspect the performance in CloudWatch.

print(f"{sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")

The average latency for our BERT model is 5-6ms for a sequence length of 128.
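
The same ModelLatency metric can also be read programmatically instead of through the console. The following is a sketch, assuming boto3 credentials with CloudWatch read access; `fetch_model_latency` and `average_latency_ms` are hypothetical helper names, and the example datapoints are made up purely to show the unit conversion (SageMaker reports ModelLatency in microseconds).

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def average_latency_ms(datapoints: List[Dict]) -> float:
    """Average the 'Average' field of CloudWatch datapoints, converting microseconds to ms."""
    if not datapoints:
        raise ValueError("no datapoints returned")
    mean_us = sum(dp["Average"] for dp in datapoints) / len(datapoints)
    return mean_us / 1000.0


def fetch_model_latency(endpoint_name: str, region: str, minutes: int = 5) -> List[Dict]:
    """Fetch ModelLatency datapoints for the last `minutes` minutes (needs AWS access)."""
    import boto3  # imported lazily so the helper above works without AWS access

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    return response["Datapoints"]


# Made-up datapoints in microseconds, only to show the unit conversion.
example_datapoints = [{"Average": 5200.0}, {"Average": 5800.0}]
example_ms = average_latency_ms(example_datapoints)
print(example_ms)
```

With a live endpoint, `average_latency_ms(fetch_model_latency(predictor.endpoint_name, sess.boto_region_name))` would give a comparable figure. Note that this averages the per-period averages without weighting them by request count.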


Delete model and endpoint

To clean up, we can delete the model and endpoint.

predictor.delete_model()
predictor.delete_endpoint()


We successfully compiled a vanilla Hugging Face Transformers model to an AWS Inferentia compatible Neuron model. After that, we deployed our Neuron model to Amazon SageMaker using the new Hugging Face Inference DLC. We achieved 5-6ms latency per Neuron core, which is lower than CPU latency, and a higher throughput than GPUs since we ran 4 models in parallel.
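
As a rough back-of-the-envelope check on the throughput side: if each Neuron core serves one request at a time at the measured latency, the ideal upper bound is cores × (1000 / latency in ms). This ignores network, queuing, and tokenization overhead, so treat the result as a ceiling under those assumptions, not a measurement. `ideal_throughput` is a hypothetical helper.

```python
def ideal_throughput(latency_ms: float, n_cores: int = 4) -> float:
    """Theoretical requests per second if every core is always busy.

    Assumes one request at a time per core and no overhead outside the model.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return n_cores * 1000.0 / latency_ms


# Bounds for the 5-6 ms range reported above, with 4 Neuron cores.
upper_bound = ideal_throughput(5.0)
lower_bound = ideal_throughput(6.0)
print(round(lower_bound, 1), round(upper_bound, 1))
```

Real endpoints will land below this ceiling, which is why a proper load test is still needed before sizing a deployment.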

If you or your company are currently using a BERT-like Transformer for encoder tasks (text-classification, token-classification, question-answering, etc.) and the latency meets your requirements, you should switch to AWS Inferentia. This will not only save costs, but can also increase efficiency and performance for your models.

We are planning to do a more detailed case study on cost-performance of transformers in the future, so stay tuned!

If you want to learn more about accelerating Transformers, you should also check out Hugging Face optimum.

Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.