Programmatically manage 🤗 Inference Endpoints
Infrastructure as Code (IaC) allows us to manage and provision infrastructure through programmatic interfaces. This approach helps companies manage, update, and use Generative AI in more automated setups.
By leveraging IaC, we can streamline the deployment and maintenance of models and unlock new capabilities like batch processing and automatic end-to-end model evaluation.
I am happy to share that the huggingface_hub Python library now supports Hugging Face Inference Endpoints. This means you can programmatically manage Inference Endpoints: you can now create, update, pause, delete, or send requests to them using the huggingface_hub library.
Hugging Face Inference Endpoints offers an easy and secure way to deploy Generative AI models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
The huggingface_hub library allows you to interact with the Hugging Face Hub, a machine learning platform for creators and collaborators.
End-to-End Example
This tutorial will guide you through an example of managing Inference Endpoints using the huggingface_hub library. We'll focus on the Zephyr model for this example. Support for managing Inference Endpoints was added in version 0.20.
Before we can create and manage our endpoints, we need to log in using a Hugging Face token. We also want to define our namespace: the namespace is the organization or account we use. It is the identifier in the URL when you go to your profile, e.g. huggingface.
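A minimal setup sketch might look like this; the namespace value is a placeholder for your own account or organization:

```python
from huggingface_hub import login

# Authenticate with a Hugging Face token that has sufficient permissions
login()  # prompts for a token; alternatively pass token="hf_..."

# The namespace is the account or organization that owns the endpoint,
# i.e. the identifier in your profile URL, e.g. "huggingface"
namespace = "your-username-or-org"  # placeholder, replace with your own
```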
The huggingface_hub library provides a create_inference_endpoint method, which accepts the same parameters as the HTTP API of Inference Endpoints. This means we need to define:
- endpoint_name: The name for the endpoint
- repository: The model repository to use
- framework: The framework to use, most likely pytorch
- task: The task to use; for LLMs this is text-generation
- vendor: The cloud provider to use, e.g. aws
- region: The region to use, e.g. us-east-1
- type: The security type to use, e.g. protected
- instance_size: The instance size to use; can be found in the UI
- instance_type: The instance type to use; can be found in the UI
- accelerator: The accelerator to use, e.g. gpu
- namespace: The namespace to use, e.g. huggingface
- custom_image: An optional custom container image to use
We are going to use custom_image to deploy with Text Generation Inference (TGI). This is the same image you get when deploying an LLM through the UI.
In our example we want to use the Zephyr model. First, let's define our custom image. The custom_image also allows us to define TGI-specific parameters like MAX_BATCH_PREFILL_TOKENS, MAX_INPUT_LENGTH, and MAX_TOTAL_TOKENS. Below is an example of how to set them; make sure to adjust them to your needs, e.g. the input length.
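A minimal sketch of such a custom image configuration; the image tag and token limits below are assumptions and should be adapted to your model and hardware:

```python
# Custom container image running Text Generation Inference (TGI)
custom_image = {
    "health_route": "/health",
    "env": {
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "1512",
        "MODEL_ID": "/repository",  # TGI loads the model from the mounted repository
    },
    # Pin a specific TGI version tag in production instead of "latest"
    "url": "ghcr.io/huggingface/text-generation-inference:latest",
}
```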
After defining our custom image, we can create our Inference Endpoint.
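A sketch of the creation call, assuming the HuggingFaceH4/zephyr-7b-beta repository and reusing the namespace and custom_image values from the sketches above; the instance_size and instance_type values are examples, so check the UI for the options available to your account:

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "zephyr-7b-beta",                           # endpoint name
    repository="HuggingFaceH4/zephyr-7b-beta",  # model repository
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",                     # example value, check the UI
    instance_type="g5.2xlarge",                 # example value, check the UI
    namespace=namespace,
    custom_image=custom_image,
)
```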
The huggingface_hub library returns an InferenceEndpoint object. This object allows us to interact with the endpoint: we can directly send requests, pause, delete, or update it. We can also call the wait method to wait for the endpoint to be ready for inference. This is very handy when you want to run inference right after creating the endpoint, e.g. for batch processing or automatic model evaluation.
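For example:

```python
# Block until the endpoint is deployed and ready to serve requests
endpoint.wait()
```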
Note: This may take a few minutes.
After the endpoint is ready, the .wait() method will return. This means we can test our endpoint and send requests.
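A quick sketch of sending a request through the endpoint's built-in InferenceClient; the prompt and generation parameters are just examples:

```python
# The endpoint exposes an InferenceClient via its .client attribute
response = endpoint.client.text_generation(
    "Why is open-source software important?",
    max_new_tokens=256,
)
print(response)
```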
For more details about how to use the InferenceClient, check out the Inference guide.
If we want to temporarily pause the endpoint, we can call the pause method.
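For example:

```python
# Pause the endpoint; it stops serving (and billing for) compute until resumed
endpoint.pause()

# Later, bring it back up again
# endpoint.resume()
```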
To delete the endpoint we call the delete
method.
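For example:

```python
# Permanently delete the endpoint and its configuration
endpoint.delete()
```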
Conclusion
In this tutorial, you've learned how to use the huggingface_hub
library to create, send requests to, pause, and delete Hugging Face Inference Endpoints. This allows for efficient and scalable management of Generative AI models in production environments.
We have more in-depth documentation about managing Inference Endpoints in the huggingface_hub documentation. The library offers more capabilities, like listing endpoints, updating scaling settings, and more.
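As a small taste, listing the endpoints in a namespace is roughly the following sketch; verify the details against the current documentation:

```python
from huggingface_hub import list_inference_endpoints

# Print the name and current status of every endpoint in the namespace
for ep in list_inference_endpoints(namespace=namespace):
    print(ep.name, ep.status)
```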
If you are missing a feature or have feedback, please let us know.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.