Multi-Container Endpoints with Hugging Face Transformers and Amazon SageMaker
Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and Amazon SageMaker to deploy multiple transformer models as a Multi-Container Endpoint. Amazon SageMaker Multi-Container Endpoints are an inference option that lets you deploy multiple containers (multiple models) to the same SageMaker real-time endpoint. These models/containers can be accessed individually or chained in a pipeline. Multi-Container Endpoints can be used to improve endpoint utilization and optimize costs. An example of this is time zone differences: if the workload for model A (U.S.) happens mostly during the day and the workload for model B (Germany) happens mostly during the night, you can deploy model A and model B to the same SageMaker endpoint and optimize your costs.
NOTE: At the time of writing, only CPU instances are supported for Multi-Container Endpoints.
Development Environment and Permissions
NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
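As a minimal sketch, the session and role setup could look like the following; the fallback role name `sagemaker_execution_role` is a placeholder you would replace with your own IAM role:

```python
import sagemaker
import boto3

# create a SageMaker session and determine the region we are working in
sess = sagemaker.Session()
region = sess.boto_region_name

try:
    # works inside SageMaker Studio / Notebook Instances
    role = sagemaker.get_execution_role()
except ValueError:
    # running locally: look up the role ARN in IAM instead
    # (replace "sagemaker_execution_role" with the name of your own role)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region}")
```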
Multi-Container Endpoint creation
At the time of writing, the Amazon SageMaker Python SDK does not support Multi-Container Endpoint deployments. That's why we are going to use boto3 to create the endpoint.
The first step, though, is to use the SDK to get the container URIs for the Hugging Face Inference DLCs.
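A sketch of what this could look like with `sagemaker.image_uris.retrieve`; the version numbers below are illustrative and should be replaced with a DLC release that is available in your region:

```python
from sagemaker import image_uris

# retrieve the Hugging Face Inference DLC image URI for the current region
# (transformers/pytorch versions here are examples, not a recommendation)
hf_inference_dlc = image_uris.retrieve(
    framework="huggingface",
    region=region,
    version="4.12.3",
    base_framework_version="pytorch1.9.1",
    py_version="py38",
    image_scope="inference",
    container_version="ubuntu20.04",
    instance_type="ml.c5.xlarge",
)
print(hf_inference_dlc)
```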
Define Hugging Face models
Next, we need to define the models we want to deploy to our multi-container endpoint. To stick with our example from the introduction, we will deploy an English sentiment-classification model and a German sentiment-classification model. For the English model, we will use distilbert-base-uncased-finetuned-sst-2-english and for the German model, we will use oliverguhr/german-sentiment-bert.
Similar to the endpoint creation with the SageMaker SDK, we need to provide the "Hub" configuration for each model as HF_MODEL_ID and HF_TASK.
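For example, the two model configurations could be defined as plain dictionaries, which will later be passed to the containers as environment variables:

```python
# Hub configuration for the English sentiment model
englishModel = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

# Hub configuration for the German sentiment model
germanModel = {
    "HF_MODEL_ID": "oliverguhr/german-sentiment-bert",
    "HF_TASK": "text-classification",
}
```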
Create Multi-Container Endpoint
After we define our model configuration, we can deploy our endpoint. To create/deploy a real-time endpoint with boto3, you need to create a "SageMaker Model", a "SageMaker Endpoint Configuration" and a "SageMaker Endpoint". The "SageMaker Model" contains our multi-container configuration, including our two models. The "SageMaker Endpoint Configuration" contains the configuration for the endpoint. The "SageMaker Endpoint" is the actual endpoint.
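A sketch of these three steps with boto3 could look like the following; `deployment_name` and the instance type are illustrative choices, and the `InferenceExecutionConfig` with `Mode` set to `"Direct"` is what allows each container to be invoked individually:

```python
import boto3

sm_client = boto3.client("sagemaker")

deployment_name = "multi-container-sentiment"  # illustrative name
instance_type = "ml.c5.4xlarge"                # illustrative CPU instance type

# 1. SageMaker Model: two containers plus "Direct" invocation mode,
#    so each container can be addressed individually
sm_client.create_model(
    ModelName=deployment_name,
    ExecutionRoleArn=role,
    Containers=[
        {
            "ContainerHostname": "englishModel",
            "Image": hf_inference_dlc,
            "Environment": englishModel,
        },
        {
            "ContainerHostname": "germanModel",
            "Image": hf_inference_dlc,
            "Environment": germanModel,
        },
    ],
    InferenceExecutionConfig={"Mode": "Direct"},
)

# 2. SageMaker Endpoint Configuration: instance type and count for the endpoint
sm_client.create_endpoint_config(
    EndpointConfigName=deployment_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": deployment_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }
    ],
)

# 3. SageMaker Endpoint: the actual endpoint serving requests
sm_client.create_endpoint(
    EndpointName=deployment_name,
    EndpointConfigName=deployment_name,
)
```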
This will take a few minutes to deploy. You can check the console to see if the endpoint is in service.
Invoke Multi-Container Endpoint
To invoke our multi-container endpoint, we can either use boto3 (or any other AWS SDK) or the Amazon SageMaker SDK. We will test both ways and do some light load testing to take a look at the performance of our endpoint in CloudWatch.
Sending requests with boto3
To send requests to our models, we will use the sagemaker-runtime client with the invoke_endpoint method. Compared to sending regular requests to a single-container endpoint, we pass TargetContainerHostname as additional information to point to the container which should receive the request. In our case this is either englishModel or germanModel.
englishModel
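A minimal example of a request to the English container, reusing the `deployment_name` from the endpoint creation step:

```python
import json
import boto3

invoke_client = boto3.client("sagemaker-runtime")

# send a request to the englishModel container only
response = invoke_client.invoke_endpoint(
    EndpointName=deployment_name,
    ContentType="application/json",
    Body=json.dumps({"inputs": "This is such a great movie!"}),
    TargetContainerHostname="englishModel",
)
print(json.loads(response["Body"].read().decode()))
```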
germanModel
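And the same request for the German container; only the input text and the TargetContainerHostname change:

```python
# reuse the invoke_client from above, but target the germanModel container
response = invoke_client.invoke_endpoint(
    EndpointName=deployment_name,
    ContentType="application/json",
    Body=json.dumps({"inputs": "Das ist ein großartiger Film!"}),
    TargetContainerHostname="germanModel",
)
print(json.loads(response["Body"].read().decode()))
```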
Sending requests with HuggingFacePredictor
The Python SageMaker SDK cannot be used to deploy Multi-Container Endpoints, but it can be used to invoke/send requests to them. We will use the HuggingFacePredictor to send requests to the endpoint, where we also pass the TargetContainerHostname as additional information to point to the container which should receive the request. In our case this is either englishModel or germanModel.
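A sketch of this with the HuggingFacePredictor, assuming the endpoint name from above and passing the target container via initial_args:

```python
from sagemaker.huggingface import HuggingFacePredictor

# attach a predictor to the existing multi-container endpoint
predictor = HuggingFacePredictor(endpoint_name=deployment_name)

# route the request to the englishModel container
english_result = predictor.predict(
    data={"inputs": "This is such a great movie!"},
    initial_args={"TargetContainerHostname": "englishModel"},
)

# route the request to the germanModel container
german_result = predictor.predict(
    data={"inputs": "Das ist ein großartiger Film!"},
    initial_args={"TargetContainerHostname": "germanModel"},
)

print(english_result)
print(german_result)
```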
Load testing the multi-container endpoint
As mentioned, we are doing some light load testing, meaning sending a few alternating requests to the containers and looking at the latency in CloudWatch.
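As an illustrative sketch, the load test can simply alternate requests between the two containers in a loop; the inputs and the number of iterations here are arbitrary:

```python
# send alternating requests to both containers to generate some traffic
for _ in range(500):
    predictor.predict(
        data={"inputs": "This is such a great movie!"},
        initial_args={"TargetContainerHostname": "englishModel"},
    )
    predictor.predict(
        data={"inputs": "Das ist ein großartiger Film!"},
        initial_args={"TargetContainerHostname": "germanModel"},
    )
```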
We can see that the englishModel is around 2x faster than the germanModel, which makes sense since the englishModel is a DistilBERT model and the German one is a BERT-base model.
In terms of invocations, we can see that both containers are invoked the same number of times, which makes sense since our test invoked them alternately.
Delete the Multi-Container Endpoint
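To clean up, we can delete the endpoint, the endpoint configuration and the model with boto3, for example:

```python
# reuse the sm_client and deployment_name from the endpoint creation step
sm_client.delete_endpoint(EndpointName=deployment_name)
sm_client.delete_endpoint_config(EndpointConfigName=deployment_name)
sm_client.delete_model(ModelName=deployment_name)
```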
Conclusion
We successfully deployed two Hugging Face Transformers to Amazon SageMaker for inference using a Multi-Container Endpoint, which allowed us to use the same instance to host multiple models as containers for inference. Multi-Container Endpoints are a great option to optimize compute utilization and costs for your models, especially when you have independent inference workloads due to time differences or use-case differences.
You should try Multi-Container Endpoints for your models when you have workloads that are not correlated.
You can find the code here.
Thanks for reading! If you have any questions, feel free to contact me, through Github, or on the forum. You can also connect with me on Twitter or LinkedIn.