An Amazon SageMaker Inference comparison with Hugging Face Transformers
"Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment." - AWS Documentation
As of today, Amazon SageMaker offers 4 different inference options:
- Real-Time Inference
- Batch Transform
- Asynchronous Inference
- Serverless Inference
Each of these inference options has different characteristics and use cases. We have therefore created a table comparing the existing SageMaker inference options in terms of latency, execution period, maximum payload size, and pricing, along with getting-started examples showing how to use each of them.
Comparison table
Option | Latency budget | Execution period | Max payload size | Real-world example | Accelerators (GPU) | Pricing |
---|---|---|---|---|---|---|
real-time | milliseconds | constant | 6 MB | route estimation | Yes | uptime of the endpoint |
batch transform | hours | once a day/week | Unlimited | nightly embedding jobs | Yes | prediction (transform) time |
async inference | minutes | every few minutes/hours | 1 GB | post-call transcription | Yes | uptime of the endpoint; can scale to 0 when there is no load |
serverless | seconds | every few minutes | 6 MB | PoC for classification | No | compute time (serverless) |
Examples
You will learn how to:
- Deploy a Hugging Face Transformers model for Real-Time Inference.
- Deploy a Hugging Face Transformers model for Batch Transform Inference.
- Deploy a Hugging Face Transformers model for Asynchronous Inference.
- Deploy a Hugging Face Transformers model for Serverless Inference.
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
SageMaker Hugging Face Inference Toolkit
The SageMaker Hugging Face Inference Toolkit is an open-source library for serving 🤗 Transformers models on Amazon SageMaker. This library provides default pre-processing, prediction, and post-processing for certain 🤗 Transformers models and tasks using the transformers pipelines.

The Inference Toolkit accepts inputs in the `inputs` key, and supports additional pipelines parameters in the `parameters` key. You can provide any of the supported kwargs from pipelines as `parameters`.
Tasks supported by the Inference Toolkit API include:
text-classification
sentiment-analysis
token-classification
feature-extraction
fill-mask
summarization
translation_xx_to_yy
text2text-generation
text-generation
audio-classification
automatic-speech-recognition
conversational
image-classification
image-segmentation
object-detection
table-question-answering
zero-shot-classification
zero-shot-image-classification
See the following request examples for some of the tasks:
text-classification
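A minimal `text-classification` request payload, as it would be passed to `predictor.predict()`; the input sentence is purely illustrative:

```python
# text-classification request: the Inference Toolkit expects the text under the "inputs" key
data = {
    "inputs": "Using Hugging Face Transformers on SageMaker is straightforward.",
}
```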
text-generation parameterized
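And a parameterized `text-generation` request, passing generation kwargs through the `parameters` key; the prompt and the specific kwargs shown are illustrative pipeline parameters:

```python
# text-generation request with additional pipeline parameters under "parameters"
data = {
    "inputs": "Can you please let us know more details about your ",
    "parameters": {
        "max_new_tokens": 50,   # illustrative generation kwargs
        "temperature": 0.7,
        "do_sample": True,
    },
}
```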
More documentation and a list of supported tasks can be found in the documentation.
1. Deploy a Hugging Face Transformers model for Real-Time Inference.
What are Amazon SageMaker Real-Time Endpoints?
Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling.
Deploying a model using SageMaker hosting services is a three-step process:
- Create a model in SageMaker: by creating a model, you tell SageMaker where it can find the model components.
- Create an endpoint configuration for an HTTPS endpoint: you specify the name of one or more models in production variants and the ML compute instances that you want SageMaker to launch to host each production variant.
- Create an HTTPS endpoint: provide the endpoint configuration to SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration.
Deploy a Hugging Face Transformer from the Hub

Detailed Notebook: deploy_model_from_hf_hub
To deploy a model directly from the Hub to SageMaker, we need to define 2 environment variables when creating the `HuggingFaceModel`:

- `HF_MODEL_ID`: defines the model id, which will be automatically loaded from huggingface.co/models when creating the SageMaker endpoint. The 🤗 Hub provides over 14,000 models, all available through this environment variable.
- `HF_TASK`: defines the task for the used 🤗 Transformers pipeline. A full list of tasks can be found here.
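A minimal sketch of what this looks like with the SageMaker Python SDK; the model id, task, container versions, and instance type are illustrative assumptions, so substitute the ones you need:

```python
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hub model and task loaded at endpoint creation time (illustrative choices)
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.17",  # assumed container versions; check the
    pytorch_version="1.10",       # available Hugging Face DLC releases
    py_version="py38",
)

# launch a managed, autoscaling-capable real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```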
After the model is deployed, we can use the `predictor` to send requests.
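For example, using the `text-classification` payload format from above (the input text and the sample output are illustrative):

```python
data = {"inputs": "I love using SageMaker with Hugging Face Transformers!"}

res = predictor.predict(data=data)
print(res)
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] for a sentiment model
```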
We can easily delete the endpoint again with the following command:
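```python
# clean up: remove the model and the endpoint
predictor.delete_model()
predictor.delete_endpoint()
```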
Deploy a Hugging Face Transformer from Amazon S3

Detailed Notebook: deploy_model_from_s3
To deploy a model from Amazon S3 to SageMaker, we point the `HuggingFaceModel` to the S3 URI of a `model.tar.gz` archive containing the model weights via `model_data`, and define the `HF_TASK` environment variable for the 🤗 Transformers pipeline that should be used.
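A minimal sketch, assuming a `model.tar.gz` has already been uploaded to S3; the bucket path, task, container versions, and instance type are placeholders:

```python
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/my-model/model.tar.gz",  # hypothetical S3 URI
    env={"HF_TASK": "text-classification"},             # illustrative task
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```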
After the model is deployed, we can again use the `predictor` to send requests, exactly as in the Hub example above.
We can delete the endpoint again with the same `delete_model()` / `delete_endpoint()` calls shown above.
2. Deploy a Hugging Face Transformers model for Batch Transform Inference.
Detailed Notebook: batch_transform_inference
What is Amazon SageMaker Batch Transform?
A batch transform job uses a trained model to get inferences on a dataset and saves these results to an Amazon S3 location that you specify. Similar to real-time hosting, it creates a web server that accepts HTTP POST requests, but it additionally runs an agent. The agent reads the data from Amazon S3, sends it to the web server, and stores the predictions back to Amazon S3 at the end. The benefit of Batch Transform is that the instances are only used during the job and stopped afterwards.
Use batch transform when you:
- Want to get inferences for an entire dataset and index them to serve inferences in real time
- Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get inferences
- Don't need the subsecond latency that SageMaker hosted endpoints provide
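A sketch of how this looks with the SDK, reusing a `HuggingFaceModel` like the one defined above; the S3 input path and instance choice are placeholders, and the input is assumed to be a JSON-lines file with one `{"inputs": ...}` record per line:

```python
# create a transformer from the (previously defined) HuggingFaceModel
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    strategy="SingleRecord",
)

# run the batch transform job; instances only run for the duration of the job
batch_job.transform(
    data="s3://my-bucket/batch-input/input.jsonl",  # hypothetical input location
    content_type="application/json",
    split_type="Line",
)
```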
3. Deploy a Hugging Face Transformers model for Asynchronous Inference.
Detailed Notebook: async_inference_hf_hub
What is Amazon SageMaker Asynchronous Inference?
Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. Compared to Batch Transform, Asynchronous Inference provides immediate access to the results of the inference job rather than waiting for the job to complete.
How does Asynchronous Inference differ from batch transform & real-time inference?

- requests are uploaded to Amazon S3, and the Amazon S3 URI is passed in the request
- endpoints are always up and running, but can scale to zero to save costs
- responses are also uploaded back to Amazon S3
- you can create an Amazon SNS topic to receive notifications when predictions are finished
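A sketch of deploying an asynchronous endpoint with the SDK's `AsyncInferenceConfig`, again reusing a `HuggingFaceModel` as defined above; the S3 output path and instance type are placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig

# responses will be written to this S3 location (hypothetical bucket/path)
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-outputs/",
)

async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=async_config,
)
```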
Calling `predict()` will upload our `data` to Amazon S3 and run inference against it. Since we are using `predict`, it will block until the inference is complete.
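For example (the payload is illustrative):

```python
data = {"inputs": "Asynchronous endpoints queue requests and write results to S3."}

# uploads the payload to S3, invokes the endpoint, and waits for the result
res = async_predictor.predict(data=data)
print(res)
```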
We can delete the endpoint again with the same `delete_model()` / `delete_endpoint()` calls shown above.
4. Deploy a Hugging Face Transformers model for Serverless Inference.
Detailed Notebook: serverless_inference
What is Amazon SageMaker Serverless Inference?
Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling.
Use Serverless Inference when you:
- Want to get started quickly in a cost-effective way
- Don't need the subsecond latency that SageMaker hosted endpoints provide
- Are building proofs of concept where cold starts or scalability are not mission-critical
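A sketch of deploying a serverless endpoint via the SDK's `ServerlessInferenceConfig`, reusing a `HuggingFaceModel` as defined above; the memory size and concurrency values are illustrative, and no instance type is specified since serverless endpoints run on CPU:

```python
from sagemaker.serverless import ServerlessInferenceConfig

# memory size and max concurrency are illustrative; tune to your model
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,
)

predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config,
)
```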
We can delete the endpoint again with the same `delete_model()` / `delete_endpoint()` calls shown above.
Conclusion
Every currently available inference option has a good use case and allows companies to optimize their machine learning workloads in the best possible way. Not only that, with the addition of SageMaker Serverless, companies can now quickly build cost-effective proofs of concept and, after success, move them to real-time endpoints by changing a single line of code.
Furthermore, this article has shown how easy it is to get started with Hugging Face Transformers on Amazon SageMaker and how you can integrate state-of-the-art machine learning into existing applications.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.