Deploy open LLMs with vLLM on Hugging Face Inference Endpoints
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models for use in production. Inference Endpoints empowers developers and data scientists to create AI applications without managing infrastructure: it simplifies deployment to a few clicks, handles large volumes of requests with autoscaling, reduces infrastructure costs with scale-to-zero, and offers advanced security.
In this blog post, we will show you how to deploy open LLMs with vLLM on Hugging Face Inference Endpoints. vLLM is one of the most popular open-source projects for serving large language models, built for high-throughput and memory-efficient LLM inference. Concretely, we will walk through deploying Llama 3 on Hugging Face Inference Endpoints using vLLM.
Create Inference Endpoints
Inference Endpoints allows you to customize the container image used for deployment. We created a vLLM container that loads models from `/repository`, since Inference Endpoints mounts the model artifacts there. We also made the model's maximum sequence length configurable via the `MAX_MODEL_LEN` environment variable.
We can deploy the model in just a few clicks from the UI, or take advantage of the `huggingface_hub` Python library to programmatically create and manage Inference Endpoints. If you are using the UI, you have to select the `custom` container type in the advanced configuration and add `philschmi/vllm-hf-inference-endpoints` as the container image. The UI is a great way to get started, but customizing environment variables is currently not supported there, so we will show you how to do this using the `huggingface_hub` Python library.
We are going to use the `huggingface_hub` library to programmatically create and manage Inference Endpoints. First, we need to install the library:
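For example (assuming a recent `pip`):

```bash
pip install --upgrade huggingface_hub
```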
Then we need to make sure we are logged in and have access to the Llama 3 model.
Note: Llama 3 is a gated model. Please visit the Model Card and accept the license terms and acceptable use policy before deploying it.
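One way to log in is with the `login` helper from `huggingface_hub` (you can also run `huggingface-cli login` in a terminal):

```python
from huggingface_hub import login

# Prompts for a Hugging Face access token that has access to the gated Llama 3 repo
login()
```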
Now we can create the Inference Endpoint using the `create_inference_endpoint` function. We will deploy `meta-llama/Meta-Llama-3-8B-Instruct` on a single A10G instance with `MAX_MODEL_LEN` set to 8192 to avoid out-of-memory errors.
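A sketch of the call is shown below. The endpoint name, vendor, region, instance identifiers, and container port are assumptions you should adapt to your account and setup; the `custom_image` dictionary is what wires up the vLLM container and the `MAX_MODEL_LEN` environment variable.

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    # endpoint name is an arbitrary choice
    "vllm-meta-llama-3-8b-instruct",
    repository="meta-llama/Meta-Llama-3-8B-Instruct",
    framework="pytorch",
    task="custom",
    # adjust vendor, region, and instance identifiers to what your account offers
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    type="protected",
    custom_image={
        "health_route": "/health",
        "port": 8000,  # assumed port the vLLM server listens on inside the container
        "url": "philschmi/vllm-hf-inference-endpoints",  # container image from above
        "env": {"MAX_MODEL_LEN": "8192"},
    },
)

# Block until the endpoint is ready, then inspect its status
endpoint.wait()
print(endpoint.status)
```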
Running this code will create an Inference Endpoint with the specified configuration. The `wait` method will block until the Inference Endpoint is ready. You can check the status of the Inference Endpoint by printing the `status` attribute, or in the Hugging Face Inference Endpoints UI.
Note: The UI widgets currently do not work for custom container images (they send requests to "/" instead of "/v1/chat/completions"), so you will receive a 404 if you try them. We are working on fixing this; in the meantime, you need to use curl or the OpenAI SDK to interact with the model.
Test Endpoint using OpenAI SDK
We deployed vLLM with its OpenAI-compatible API, so we can test the Inference Endpoint using the OpenAI SDK. First, we need to install it:
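For example:

```bash
pip install --upgrade openai
```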
Then we can test the Inference Endpoint using the OpenAI SDK:
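A minimal sketch is shown below. It assumes `endpoint` is the object returned by `create_inference_endpoint` above and reuses your Hugging Face token for authentication; the prompt is just an example.

```python
from huggingface_hub import get_token
from openai import OpenAI

# Point the OpenAI client at the endpoint's OpenAI-compatible route
client = OpenAI(
    base_url=endpoint.url + "/v1/",
    api_key=get_token(),
)

chat_completion = client.chat.completions.create(
    model="/repository",  # Inference Endpoints mounts the model at /repository
    messages=[{"role": "user", "content": "Why is open-source software important?"}],
    max_tokens=256,
)
print(chat_completion.choices[0].message.content)
```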
Note: We have to use `/repository` as the model name, since that's where Inference Endpoints mounts the model.
Nice work! You have successfully deployed Llama 3 on Hugging Face Inference Endpoints using vLLM. You can now use the OpenAI SDK to interact with the model. You can also use the Hugging Face Inference Endpoints UI to monitor the model's performance and scale it up or down as needed.
You can delete the Inference Endpoint using the `delete` method:
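Assuming the same `endpoint` object as above:

```python
# Shut down and remove the Inference Endpoint
endpoint.delete()
```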
Conclusion
In this blog post, we showed you how to deploy open LLMs with vLLM on Hugging Face Inference Endpoints using a custom container image. We used the `huggingface_hub` Python library to programmatically create and manage Inference Endpoints, and we showed how to test the endpoint using the OpenAI SDK. We hope this blog post helps you deploy your LLMs on Hugging Face Inference Endpoints.
Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.