Deploy Mixtral 8x7B on Amazon SageMaker
Mixtral 8x7B is the open LLM from Mistral AI. Mixtral-8x7B is a Sparse Mixture of Experts (MoE) model with an architecture similar to Mistral 7B, but it comes with a twist: it's actually 8 “expert” models in one. If you want to learn more about MoEs, check out Mixture of Experts Explained.
In this blog you will learn how to deploy the mistralai/Mixtral-8x7B-Instruct-v0.1 model to Amazon SageMaker. We are going to use the Hugging Face LLM DLC, a purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also includes hardware requirements for the different model sizes.
In this blog we will cover how to:
- Setup development environment
- Retrieve the new Hugging Face LLM DLC
- Hardware requirements
- Deploy Mixtral 8x7B to Amazon SageMaker
- Run inference and chat with the model
- Clean up
Let's get started!
1. Setup development environment
We are going to use the `sagemaker` Python SDK to deploy Mixtral to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
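A minimal sketch of that setup, assuming you run it in a SageMaker notebook or a local environment; the SDK version pin and the fallback role name `sagemaker_execution_role` are assumptions:

```python
# Install the SageMaker Python SDK (version pin is illustrative)
# !pip install "sagemaker>=2.199.0" --upgrade --quiet

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    # works inside SageMaker (notebook instances, Studio)
    role = sagemaker.get_execution_role()
except ValueError:
    # fall back to a named role when running locally (role name is an assumption)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```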
2. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions here.
Note: At the time of writing this blog post, the latest version of the Hugging Face LLM DLC is not yet available via the `get_huggingface_llm_image_uri` method. We are going to use the raw container URI instead.
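As a sketch, this is how the image could be resolved; the exact image tag below is an assumption and should be verified against the list of available DLC images:

```python
# Option 1: resolve the image via the SDK once the new version is available
# from sagemaker.huggingface import get_huggingface_llm_image_uri
# llm_image = get_huggingface_llm_image_uri("huggingface", version="1.3.1")

# Option 2: use the raw container URI (image tag is illustrative, verify before use)
llm_image = (
    f"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.1.1-tgi1.3.1-gpu-py310-cu121-ubuntu20.04"
)

print(f"llm image uri: {llm_image}")
```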
3. Hardware requirements
Mixtral 8x7B is a ~47B parameter Mixture-of-Experts (MoE) model. This means we'll need enough VRAM to hold a dense 47B parameter model. Why 47B parameters and not 8 x 7B = 56B? That's because in MoE models, only the FFN layers are treated as individual experts, while the remaining layers are shared. Since only ~13B parameters are active per token, Mixtral 8x7B achieves the inference speed of a ~12B model while being 47B parameters in total.
Note: The Hugging Face LLM DLC version 1.3.1 does not yet support quantization with AWQ or GPTQ. We expect these techniques to be supported in upcoming releases and will update this blog post once a new version is available.
| Model | Instance Type | Quantization | NUM_GPUS |
|---|---|---|---|
| Mixtral 8x7B | (ml.)p4d.24xlarge | - | 8 |
| Mixtral 8x7B | (ml.)g5.48xlarge | - | 8 |
The parameters tested for `MAX_INPUT_LENGTH`, `MAX_BATCH_PREFILL_TOKENS`, `MAX_TOTAL_TOKENS`, and `MAX_BATCH_TOTAL_TOKENS` are the same for both instance types. You can tweak them for your needs.
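As a rough sketch, an 8-GPU configuration could look like the following; all token limits are illustrative values, not benchmarked settings:

```python
# TGI / endpoint configuration (values are illustrative and should be tuned to your workload)
config = {
    "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # model id from hf.co/models
    "SM_NUM_GPUS": "8",                   # number of GPUs used per replica
    "MAX_INPUT_LENGTH": "24000",          # max length of the input text
    "MAX_TOTAL_TOKENS": "32000",          # max length of the generation (including input)
    "MAX_BATCH_PREFILL_TOKENS": "32000",  # limits the number of tokens in the prefill step
    "MAX_BATCH_TOTAL_TOKENS": "512000",   # limits the number of tokens processed in parallel
}
```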
4. Deploy Mixtral 8x7B to Amazon SageMaker
To deploy Mixtral 8x7B to Amazon SageMaker, we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type`, etc. We will use a `g5.48xlarge` instance type, which has 8 NVIDIA A10G GPUs and 192GB of GPU memory. You need at least 100GB of GPU memory to run Mixtral 8x7B in float16 with a decent input length.
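A sketch of the model creation, reusing the `llm_image` and `config` from the previous sections:

```python
from sagemaker.huggingface import HuggingFaceModel

# create the HuggingFaceModel with the container image and the TGI configuration
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)
```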
After we have created the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.48xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs.
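A sketch of the deployment call; the startup health-check timeout is an assumption to give SageMaker enough time to download and load the weights:

```python
# deploy the model to an endpoint; TGI shards the model across all 8 GPUs automatically
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    container_startup_health_check_timeout=600,  # 10 minutes, illustrative value
)
```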
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
5. Run inference and chat with the model
After our endpoint is deployed, we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can run inference with different parameters to impact the generation. Parameters can be defined in the `parameters` attribute of the payload. You can find the supported parameters here or in the open API specification of TGI in the swagger documentation.
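A minimal sketch of such a request; the prompt and parameter values are illustrative:

```python
# simple request with generation parameters (values are illustrative)
payload = {
    "inputs": "What is Amazon SageMaker?",
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 256,
        "stop": ["</s>"],
    },
}

response = llm.predict(payload)
print(response[0]["generated_text"])
```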
The `mistralai/Mixtral-8x7B-Instruct-v0.1` is a conversational chat model, meaning we can chat with it using the following prompt:
<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]
Let's see if Mixtral can come up with some cool ideas for the summer. Okay, let's test it.
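A sketch of that test, with a small hypothetical helper that wraps a single user instruction in the Mixtral instruct format:

```python
# hypothetical helper for a single-turn conversation in the Mixtral instruct format
def build_mixtral_prompt(instruction: str) -> str:
    return f"<s> [INST] {instruction} [/INST]"

prompt = build_mixtral_prompt("What are cool ideas for activities to do this summer?")

chat = llm.predict({
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 512,
        "stop": ["</s>"],
    },
})
print(chat[0]["generated_text"])
```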
Awesome, we tested inference. Now let's build a cool demo that supports streaming responses. Amazon SageMaker supports streaming responses from your model. We can leverage this to create a streaming Gradio application with a better user experience.
We created a sample application that you can use to test your model. You can find the code in gradio-app.py. The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs.
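Under the hood, streaming from the endpoint can look roughly like the minimal sketch below. It uses the boto3 `invoke_endpoint_with_response_stream` API and omits proper parsing of the server-sent-event chunks, so treat it as an illustration rather than the gradio-app.py implementation:

```python
import json
import boto3

# minimal streaming sketch (illustration only; gradio-app.py contains the full implementation)
smr = boto3.client("sagemaker-runtime")

body = {
    "inputs": "<s> [INST] What are cool ideas for activities to do this summer? [/INST]",
    "parameters": {"max_new_tokens": 256},
    "stream": True,  # ask TGI to stream tokens back
}

resp = smr.invoke_endpoint_with_response_stream(
    EndpointName=llm.endpoint_name,  # name of the endpoint we deployed above
    Body=json.dumps(body),
    ContentType="application/json",
)

# each event carries raw bytes of the streamed response; proper SSE parsing is omitted here
for event in resp["Body"]:
    if "PayloadPart" in event:
        print(event["PayloadPart"]["Bytes"].decode("utf-8"), end="")
```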
6. Clean up
To clean up, we can delete the model and endpoint.
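A sketch of the cleanup, using the predictor returned by `deploy`:

```python
# delete the deployed model and the endpoint to avoid further costs
llm.delete_model()
llm.delete_endpoint()
```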
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.