Asynchronous Inference with Hugging Face Transformers and Amazon SageMaker
Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and the Amazon SageMaker Python SDK to run an Asynchronous Inference job. Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. Compared to Batch Transform, Asynchronous Inference gives you access to each result as soon as it has been processed, rather than only after the whole job completes.
How it works
Asynchronous inference endpoints have many similarities (and some key differences) compared to real-time endpoints. The process to create asynchronous endpoints is similar to real-time endpoints: you need to create a model, an endpoint configuration, and an endpoint. However, some configuration parameters are specific to asynchronous inference endpoints, which we will explore below.
The invocation of asynchronous endpoints differs from real-time endpoints. Rather than passing the request payload inline with the request, you upload the payload to Amazon S3 and pass an Amazon S3 URI as part of the request. Upon receiving the request, SageMaker provides you with a token containing the output location where the result will be placed once processed. Internally, SageMaker maintains a queue of these requests and processes them. During endpoint creation, you can optionally specify an Amazon SNS topic to receive success or error notifications. Once you receive the notification that your inference request has been successfully processed, you can access the result in the output Amazon S3 location.
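To make the flow above concrete, here is a minimal sketch of a low-level invocation, assuming the payload has already been uploaded to S3. It relies on the boto3 sagemaker-runtime call invoke_endpoint_async; the helper names invoke_async and split_s3_uri, as well as the endpoint name and S3 URI in the usage comment, are hypothetical and only serve as illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def invoke_async(runtime_client, endpoint_name, input_s3_uri,
                 content_type="application/json"):
    """Queue one request and return the S3 URI where the result will appear.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # the payload lives in S3, not in the request
        ContentType=content_type,
    )
    return response["OutputLocation"]


# Usage (needs AWS credentials and an existing async endpoint; names are made up):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   output_uri = invoke_async(runtime, "my-async-endpoint",
#                             "s3://my-bucket/async_inference/input/payload.json")
#   bucket, key = split_s3_uri(output_uri)
```

The returned output location is only a pointer: the object at that key does not exist until SageMaker has processed the request, which is why the rest of this guide polls for it.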
Link to Notebook: sagemaker/16_async_inference_hf_hub
NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.
Development Environment and Permissions
Installation
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
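A common way to handle both environments is to try the SageMaker execution role first and fall back to an IAM lookup by role name. The sketch below assumes sagemaker.get_execution_role raises ValueError when no execution role can be derived from the current identity; the helper name resolve_role_arn and the role name in the usage comment are hypothetical.

```python
def resolve_role_arn(role_name, get_execution_role, iam_client):
    """Return an IAM role ARN that SageMaker can assume.

    get_execution_role: normally sagemaker.get_execution_role
    iam_client:         normally boto3.client("iam")
    """
    try:
        # Works inside SageMaker Studio and Notebook Instances.
        return get_execution_role()
    except ValueError:
        # Local environment: look the role up by its name instead.
        return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]


# Usage (role name is a placeholder for a role that exists in your account):
#   import boto3, sagemaker
#   role = resolve_role_arn("sagemaker_execution_role",
#                           sagemaker.get_execution_role,
#                           boto3.client("iam"))
```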
Create Inference HuggingFaceModel for the Asynchronous Inference Endpoint
We use the twitter-roberta-base-sentiment model to run our async inference job. This is a RoBERTa-base model trained on ~58M tweets and fine-tuned for sentiment analysis with the TweetEval benchmark.
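The model, endpoint configuration, and endpoint can be created in one go with the SageMaker Python SDK. The following is a sketch rather than the exact notebook code: the Hub organization prefix cardiffnlp, the text-classification task, the framework versions, the instance type, and the S3 prefix are all assumptions that you should adapt to your account and SDK version. The helper names are hypothetical.

```python
def build_hub_env(model_id="cardiffnlp/twitter-roberta-base-sentiment",
                  task="text-classification"):
    """Environment variables telling the Hugging Face DLC which Hub model to load."""
    return {"HF_MODEL_ID": model_id, "HF_TASK": task}


def build_output_path(bucket, prefix="async_inference/output"):
    """S3 location where SageMaker writes the inference results."""
    return f"s3://{bucket}/{prefix}"


def deploy_async_endpoint(role, bucket, instance_type="ml.g4dn.xlarge"):
    """Create the model and deploy it as an asynchronous endpoint."""
    # Imported lazily so the helpers above work without the SDK installed.
    from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
    from sagemaker.huggingface.model import HuggingFaceModel

    model = HuggingFaceModel(
        env=build_hub_env(),
        role=role,
        transformers_version="4.12",  # assumed versions; pick ones your SDK supports
        pytorch_version="1.9",
        py_version="py38",
    )
    # Passing an AsyncInferenceConfig is what makes the endpoint asynchronous.
    # An SNS topic for success/error notifications can be set via notification_config.
    async_config = AsyncInferenceConfig(output_path=build_output_path(bucket))
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        async_inference_config=async_config,
    )
```

Calling deploy_async_endpoint(role, bucket) creates billable resources, so it is wrapped in a function here instead of running at import time.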
We can find our Asynchronous Inference endpoint configuration in the Amazon SageMaker Console. Our endpoint now has type async compared to a 'real-time' endpoint.
Request Asynchronous Inference Endpoint using the AsyncPredictor
The .deploy() method returns an AsyncPredictor object, which can be used to request inference. The AsyncPredictor makes it easy to send asynchronous requests to your endpoint and get the results back. It has two methods: predict() and predict_async(). The predict() method is synchronous and blocks until the inference is complete. The predict_async() method is asynchronous and returns immediately with an AsyncInferenceResponse, which can be used to poll for the result: if the result object exists at the output path, it is fetched and returned.
predict() request example
The predict() method will upload our data to Amazon S3 and run inference against it. Since we are using predict, it will block until the inference is complete.
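As a small illustration of the blocking behavior, the sketch below sends one text with predict() and measures how long the call takes. It assumes the payload format {"inputs": ...} used by the Hugging Face inference toolkit; the helper name timed_predict and the sample sentence are made up for this example.

```python
import time


def timed_predict(predictor, text):
    """Send one text with predict() and report how long the blocking call took."""
    start = time.perf_counter()
    # predict() uploads the payload to S3, invokes the endpoint, and waits for the result.
    result = predictor.predict(data={"inputs": text})
    return result, time.perf_counter() - start


# Usage with the AsyncPredictor returned by .deploy():
#   result, seconds = timed_predict(async_predictor, "The flight was delayed again.")
#   print(result, f"{seconds:.1f}s")
```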
predict_async() request example
The predict_async() method will upload our data to Amazon S3 and run inference against it. Since we are using predict_async, it will return immediately with an AsyncInferenceResponse object.
In this example, we will loop over a csv file and send each line to the endpoint. After that, we are going to poll the endpoint until the inference is complete. The provided tweet_data.csv contains ~1800 tweets about different airlines.
But first, let's do a quick test to see if we can get a result from the endpoint using predict_async.
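A quick test of this kind might look like the sketch below. It assumes the SDK's WaiterConfig (from sagemaker.async_inference.waiter_config) is passed in to control polling; the helper name quick_async_test, the sample text, and the polling values in the usage comment are illustrative choices.

```python
def quick_async_test(predictor, text, waiter_config=None):
    """Queue one request, print where the result will land, then wait for it."""
    response = predictor.predict_async(data={"inputs": text})
    # The response is available immediately; the result itself is not.
    print(f"Response output path: {response.output_path}")
    # With a waiter config, get_result() polls S3 until the output object exists.
    return response.get_result(waiter_config)


# Usage with the AsyncPredictor returned by .deploy():
#   from sagemaker.async_inference.waiter_config import WaiterConfig
#   config = WaiterConfig(max_attempts=5, delay=10)  # up to 5 checks, 10 s apart
#   print(quick_async_test(async_predictor, "I love this airline!", config))
```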
Single predict_async() request example
High load predict_async() request example using a csv file
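The high-load case can be sketched in two steps: queue one request per row, then poll all responses until every result exists. The code below assumes the tweet text sits in the first column of the csv file, and that a not-yet-available result surfaces as an exception from get_result() (in the SageMaker SDK this is expected to be sagemaker.exceptions.ObjectNotExistedError, which is an assumption to verify against your SDK version). The helper names are hypothetical.

```python
import csv
import time


def send_rows(predictor, lines):
    """Queue one async request per csv row; 'lines' is an open file or list of strings."""
    responses = []
    for row in csv.reader(lines):
        if row:  # skip blank lines
            responses.append(predictor.predict_async(data={"inputs": row[0]}))
    return responses


def wait_for_all(responses, not_ready_errors, delay=5.0, max_rounds=120):
    """Poll every response until each has a result or max_rounds is exhausted.

    Returns results in request order; entries that never finished are None.
    """
    results = {}
    pending = dict(enumerate(responses))
    for _ in range(max_rounds):
        for idx, response in list(pending.items()):
            try:
                results[idx] = response.get_result()
                del pending[idx]
            except not_ready_errors:
                pass  # output object not written to S3 yet
        if not pending:
            break
        time.sleep(delay)
    return [results.get(i) for i in range(len(responses))]


# Usage with the AsyncPredictor returned by .deploy():
#   from sagemaker.exceptions import ObjectNotExistedError
#   with open("tweet_data.csv", newline="", encoding="utf-8") as handle:
#       responses = send_rows(async_predictor, handle)
#   results = wait_for_all(responses, (ObjectNotExistedError,))
```

Sending first and polling afterwards keeps the endpoint's queue full, so the instances stay busy instead of waiting on the client between requests.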
Autoscale (to Zero) the Asynchronous Inference Endpoint
Amazon SageMaker supports automatic scaling (autoscaling) for your asynchronous endpoint. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike other hosted models Amazon SageMaker supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero.
Prerequisite: You need to have an Asynchronous Inference Endpoint up and running. You can check Create Inference HuggingFaceModel for the Asynchronous Inference Endpoint to see how to create one.
If you want to learn more, check out Autoscale an asynchronous endpoint in the SageMaker documentation.
We are going to scale our asynchronous endpoint between 0 and 5 instances. This means that Amazon SageMaker will scale the endpoint in to 0 instances after 600 seconds (10 minutes) to save you cost, and scale it out to up to 5 instances, in steps 300 seconds apart, once it receives more than 5.0 invocations.
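With boto3, this configuration maps onto two Application Auto Scaling calls: register_scalable_target for the 0 to 5 instance range and put_scaling_policy for the target-tracking rule. The sketch below is one plausible setup, not necessarily the exact notebook code: the tracked metric ApproximateBacklogSizePerInstance, the policy name, and the default variant name AllTraffic are assumptions, and the helper names are hypothetical.

```python
def variant_resource_id(endpoint_name, variant_name="AllTraffic"):
    """Resource ID format that Application Auto Scaling expects for SageMaker variants."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_kwargs(endpoint_name, min_capacity=0, max_capacity=5):
    """Arguments for register_scalable_target: allow 0 to 5 instances."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def scaling_policy_kwargs(endpoint_name, target_value=5.0,
                          scale_in_cooldown=600, scale_out_cooldown=300):
    """Arguments for put_scaling_policy: target tracking on the request backlog."""
    return {
        "PolicyName": "Invocations-ScalingPolicy",  # assumed name
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",  # assumed metric
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": scale_in_cooldown,    # seconds before scaling in
            "ScaleOutCooldown": scale_out_cooldown,  # seconds between scale-out steps
        },
    }


def apply_autoscaling(autoscaling_client, endpoint_name):
    """Register the scalable target first, then attach the scaling policy."""
    autoscaling_client.register_scalable_target(**scalable_target_kwargs(endpoint_name))
    autoscaling_client.put_scaling_policy(**scaling_policy_kwargs(endpoint_name))


# Usage:
#   import boto3
#   apply_autoscaling(boto3.client("application-autoscaling"),
#                     async_predictor.endpoint_name)
```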
The endpoint will now scale to zero after 600 seconds. Let's wait until the endpoint has scaled to zero, then send requests and measure how long it takes to start an instance and process them. We are using the predict_async() method to send the requests.
IMPORTANT: Since we set TargetValue to 5.0, the async endpoint will only start to scale out from 0 to 1 if you send more than 5 requests within 300 seconds.
It took about 7-9 minutes to start an instance and process the requests. This is a good fit for applications that are not real-time critical but where you want to save money.
Delete the async inference endpoint & Autoscaling policy
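Cleaning up takes two steps: deregister the scalable target (which also removes the scaling policies attached to it) and delete the endpoint. The sketch below assumes the default variant name AllTraffic; the helper name cleanup is hypothetical.

```python
def cleanup(autoscaling_client, predictor, variant_name="AllTraffic"):
    """Remove the autoscaling registration, then delete the endpoint itself."""
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{predictor.endpoint_name}/variant/{variant_name}",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    # Deletes the endpoint so it no longer incurs cost.
    predictor.delete_endpoint()


# Usage:
#   import boto3
#   cleanup(boto3.client("application-autoscaling"), async_predictor)
```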
Conclusion
We successfully deployed an Asynchronous Inference Endpoint to Amazon SageMaker using the SageMaker Python SDK. The SageMaker SDK provides great tooling for deploying and especially for running inference against the Asynchronous Inference Endpoint. It creates an AsyncPredictor object that can be used to send requests to the endpoint; it handles all of the boilerplate for asynchronous inference behind the scenes and gives us simple APIs.
In addition to this, we were able to add autoscaling to the Asynchronous Inference Endpoint with boto3 for scaling our endpoint in and out. Asynchronous Inference Endpoints can even scale down to zero, which is a great way to save cost for applications that are not real-time critical.
You should definitely try out Asynchronous Inference Endpoints for your own applications if neither batch transform nor real-time inference was the right option for you.
You can find the code here.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.