Welcome to this getting started guide. We will use the new Hugging Face Inference DLCs and the Amazon SageMaker Python SDK to deploy a transformer model for real-time inference.
In this example, we are going to deploy a trained Hugging Face Transformer model to SageMaker for inference.
Deploy a Hugging Face Transformer model to Amazon SageMaker for Inference
To deploy a model directly from the Hub to SageMaker, we need to define two environment variables when creating the HuggingFaceModel, as shown in the sketch after the list:
HF_MODEL_ID: defines the model id, which will be automatically loaded from huggingface.co/models when creating our SageMaker endpoint. The 🤗 Hub provides more than 10,000 models, all available through this environment variable.
HF_TASK: defines the task for the 🤗 Transformers pipeline used for inference. A full list of tasks can be found here.
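A minimal sketch of what this looks like with the SageMaker Python SDK; the model id, task, and container versions below are example choices, not requirements:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hub configuration, passed to the container as environment variables
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # example model id
    "HF_TASK": "text-classification",                                  # example task
}

# create the Hugging Face model class pointing at the Inference DLC
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",  # example DLC versions
    pytorch_version="1.13",
    py_version="py39",
)
```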
The next step is to deploy our endpoint.
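Deployment then comes down to a single call; the instance type and count below are illustrative:

```python
# deploy the model to a real-time SageMaker endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # example count
    instance_type="ml.m5.xlarge",  # example instance type
)
```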
Architecture
The Hugging Face Inference Toolkit for SageMaker is an open-source library for serving Hugging Face transformer models on SageMaker. It utilizes the SageMaker Inference Toolkit for starting up the model server, which is responsible for handling inference requests. The SageMaker Inference Toolkit uses Multi Model Server (MMS) for serving ML models. It bootstraps MMS with a configuration and settings that make it compatible with SageMaker and allow you to adjust important performance parameters, such as the number of workers per model, depending on the needs of your scenario.
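For example, the number of workers per model can be set through an environment variable on the model. The sketch below assumes the SAGEMAKER_MODEL_SERVER_WORKERS setting from the SageMaker Inference Toolkit and extends the hub configuration from above:

```python
# model server settings are passed alongside the Hub configuration
# (SAGEMAKER_MODEL_SERVER_WORKERS is assumed here as the worker-count setting)
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
    "SAGEMAKER_MODEL_SERVER_WORKERS": "2",  # number of workers per model
}
```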
Deploying a model using SageMaker hosting services is a three-step process (a boto3 sketch follows the list):
Create a model in SageMaker: by creating a model, you tell SageMaker where it can find the model components.
Create an endpoint configuration for an HTTPS endpoint: you specify the name of one or more models in production variants and the ML compute instances that you want SageMaker to launch to host each production variant.
Create an HTTPS endpoint: provide the endpoint configuration to SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration.
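The SageMaker Python SDK's deploy() call performs these three steps for us under the hood. As a sketch, they can also be done explicitly with boto3; all names, the image URI, and the role ARN below are placeholders:

```python
import boto3

sm_client = boto3.client("sagemaker")

# 1. Create a model: tell SageMaker which container image and settings to use
sm_client.create_model(
    ModelName="my-huggingface-model",                      # placeholder
    PrimaryContainer={
        "Image": "<huggingface-inference-dlc-image-uri>",  # placeholder
        "Environment": {
            "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
            "HF_TASK": "text-classification",
        },
    },
    ExecutionRoleArn="<sagemaker-execution-role-arn>",     # placeholder
)

# 2. Create an endpoint configuration with one production variant
sm_client.create_endpoint_config(
    EndpointConfigName="my-huggingface-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-huggingface-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Create the HTTPS endpoint from the endpoint configuration
sm_client.create_endpoint(
    EndpointName="my-huggingface-endpoint",
    EndpointConfigName="my-huggingface-endpoint-config",
)
```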
After the endpoint is deployed, we can use the predictor to send requests.
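For the text-classification example above, a request could look like this; the input text is arbitrary and the exact output depends on the model:

```python
# the Hugging Face Inference Toolkit expects a JSON payload with an "inputs" key
data = {"inputs": "I love using the new Inference DLC to deploy models."}

result = predictor.predict(data)
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```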
Model Monitoring
To properly monitor our endpoint, let's send a few hundred requests.
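A simple loop over the predictor is enough for that; the request count and payload are arbitrary:

```python
# send a few hundred requests to generate traffic for the monitoring charts
for _ in range(300):
    predictor.predict({"inputs": "I love using the new Inference DLC."})
```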
After that, we can go to the CloudWatch dashboard to take a look.
Auto Scaling your Model
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale.
Autoscaling is an out-of-the-box feature that monitors your workloads and dynamically adjusts capacity to maintain steady and predictable performance at the lowest possible cost.
The following diagram is a sample architecture that showcases how a model is served as an endpoint with autoscaling enabled.
Configure Autoscaling for our Endpoint
You can define the minimum, desired, and maximum number of instances per endpoint, and instances are managed dynamically based on the autoscaling configuration. The following diagram illustrates this architecture.
AWS offers many different ways to auto-scale your endpoints. One of them is Simple Scaling, where you scale the instance capacity based on the CPUUtilization of the instances or SageMakerVariantInvocationsPerInstance.
In this example, we are going to use CPUUtilization to auto-scale our endpoint.
Create a scaling policy with the configuration details, e.g. the TargetValue at which the instances should be scaled.
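A sketch of this with the Application Auto Scaling API via boto3; the capacity limits, TargetValue, and cooldowns are example values:

```python
import boto3

asg_client = boto3.client("application-autoscaling")

# the scalable target is the production variant of our endpoint
resource_id = f"endpoint/{predictor.endpoint_name}/variant/AllTraffic"

# register the variant with min/max capacity (example limits)
asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# target-tracking policy on the endpoint's CPUUtilization metric
asg_client.put_scaling_policy(
    PolicyName="cpu-scaling-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # example: scale out above ~50% average CPU
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": predictor.endpoint_name},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,  # example cooldowns in seconds
        "ScaleOutCooldown": 100,
    },
)
```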
Stress test the endpoint with threaded requests.
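A minimal sketch of such a load test using Python threads; the thread and request counts are arbitrary:

```python
import threading

# each thread sends a burst of requests to drive up CPUUtilization
def send_requests(n):
    for _ in range(n):
        predictor.predict({"inputs": "I love using the new Inference DLC."})

threads = [threading.Thread(target=send_requests, args=(250,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```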
Monitor the CPUUtilization in CloudWatch.
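Besides the console, the metric can also be queried with boto3; the time window and period below are example values:

```python
from datetime import datetime, timedelta

import boto3

cw_client = boto3.client("cloudwatch")

response = cw_client.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",  # per-instance endpoint metrics
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": predictor.endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=15),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
print(response["Datapoints"])
```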
Now we check the endpoint's instance count and see that SageMaker has scaled out.
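A quick way to verify this is to describe the endpoint and print the current instance count of its production variant:

```python
import boto3

sm_client = boto3.client("sagemaker")

description = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
for variant in description["ProductionVariants"]:
    print(variant["VariantName"], variant["CurrentInstanceCount"])
```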
Clean up
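To avoid unnecessary costs, we delete the model and the endpoint again; a minimal sketch using the predictor from above:

```python
# remove the SageMaker model and endpoint created in this example
predictor.delete_model()
predictor.delete_endpoint()
```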
Conclusion
With the help of autoscaling, we were able to apply elasticity without heavy lifting. The endpoint now adapts to the incoming load and scales in and out as required.
Thanks to the simplicity of SageMaker, you no longer need huge ops teams to manage and scale your machine learning models. You can do it yourself.
You can find the code here, and feel free to open a thread on the forum.
Thanks for reading. If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.