Deploy Idefics 9B & 80B on Amazon SageMaker
IDEFICS is an open-access visual language model, available in 9 billion and 80 billion parameter variants, that can generate text based on sequences of images and text. It was created to reproduce capabilities similar to DeepMind's closed-source Flamingo or OpenAI's GPT-4V model using only publicly available data and models.
In this blog you will learn how to deploy the Idefics model to Amazon SageMaker. We are going to use the Hugging Face LLM DLC, a purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also includes hardware requirements for the different model sizes.
In this blog we will cover how to:
- Setup development environment
- Retrieve the new Hugging Face LLM DLC
- Hardware requirements
- Deploy Idefics 80B to Amazon SageMaker
- Run inference and chat with the model
- Clean up
Before we get started, let's take a quick look at Idefics.
What is Hugging Face IDEFICS?
IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) is an open-access visual language model developed by Hugging Face. It is a reproduction of Flamingo, a closed-source model created by DeepMind.
Like GPT-4V, IDEFICS is a multimodal model that accepts sequences of images and text as input and generates text as output. The model can answer questions about images, describe visual contents, and create stories grounded in multiple images.
IDEFICS comes in two main variants, a 9B and 80B parameter version. Hugging Face also provides fine-tuned versions called idefics-80b-instruct and idefics-9b-instruct that have been adapted for conversational use cases.
IDEFICS is trained solely on publicly available data like Wikipedia, LAION, and a new 115B token dataset called OBELICS. It is built on top of existing models like CLIP-ViT-H-14-laion2B-s32B-b79K and Llama.
You can learn more about Idefics in the Hugging Face blog post.
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Idefics to Amazon SageMaker. We need to make sure to have an AWS account configured and the sagemaker Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
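As a rough sketch, the setup could look like the following; the SDK version pin and the fallback role name are assumptions you should adapt to your account:

```python
# Install/upgrade the sagemaker SDK first, e.g.:
#   pip install "sagemaker>=2.175.0" --upgrade --quiet   (version pin is an assumption)

import sagemaker
import boto3

sess = sagemaker.Session()
try:
    # Works when running inside SageMaker (Studio / notebook instance)
    role = sagemaker.get_execution_role()
except ValueError:
    # Fallback for a local environment: look the role up by name (role name is an assumption)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```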
2. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
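A minimal sketch of retrieving the image URI; the version value is an assumption based on the TGI version referenced later in this post:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Retrieve the Hugging Face LLM DLC (TGI) image uri for the current session/region
llm_image = get_huggingface_llm_image_uri(
    "huggingface",        # backend: the TGI-powered LLM container
    session=sess,
    version="1.1.0",      # assumption, pick an available DLC version
)

print(f"llm image uri: {llm_image}")
```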
3. Hardware requirements
Idefics comes in 2 different sizes - 9B & 80B parameters. The hardware requirements will vary based on the model size deployed to SageMaker. Below is a set of minimum requirements for each model size we tested.
Note: We haven't tested GPTQ models yet.
| Model | Instance Type | Quantization | # of GPUs per replica |
|---|---|---|---|
| Idefics 9B | (ml.)g5.12xlarge | - | 4 |
| Idefics 80B | (ml.)g5.48xlarge | bitsandbytes | 8 |
| Idefics 80B | (ml.)p4d.24xlarge | - | 8 |
Note: Amazon SageMaker currently doesn't support instance slicing, meaning that, e.g., for Idefics 80B you cannot run multiple replicas on a single instance.
These are the setups we have validated for Idefics instruct 9B and 80B models to work on SageMaker.
4. Deploy Idefics 80B to Amazon SageMaker
To deploy HuggingFaceM4/idefics-80b-instruct to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration including the hf_model_id, instance_type, etc. We will use a g5.48xlarge instance type, which has 8 NVIDIA A10G GPUs and 192GB of GPU memory, and quantize the model with bitsandbytes to fit it into GPU memory.
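A sketch of the model definition could look like this; the token limits, timeout, and quantization setting are assumptions you should adapt to your use case:

```python
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config (values are assumptions)
instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 600  # give the container time to load the 80B weights

# TGI / endpoint configuration passed as environment variables
config = {
    "HF_MODEL_ID": "HuggingFaceM4/idefics-80b-instruct",  # model id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),             # number of GPUs per replica
    "MAX_INPUT_LENGTH": json.dumps(1024),                 # max length of the input text
    "MAX_TOTAL_TOKENS": json.dumps(2048),                 # max length of generation (incl. input)
    "HF_MODEL_QUANTIZE": "bitsandbytes",                  # quantize to fit on 8x A10G
}

# create HuggingFaceModel with the image uri retrieved above
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)
```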
After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.48xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.
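The deployment call itself could look like this sketch:

```python
# Deploy the model to an endpoint; large models need a longer startup health check timeout
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
```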
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
5. Run inference and chat with the model
After our endpoint is deployed we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We can use different parameters to influence the generation; they are defined in the parameters attribute of the payload. You can find a full list of parameters in the documentation at the bottom, in the GenerateParameters object.
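As a sketch, a request payload with generation parameters might look like this, where llm is the predictor returned by deploy above; the prompt string and all parameter values are assumptions:

```python
# Minimal payload sketch for the TGI endpoint (text-only prompt; image prompts are covered below)
payload = {
    "inputs": "User:What can you help me with?<end_of_utterance>\nAssistant:",
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 256,
        "stop": ["User:", "<end_of_utterance>"],
    },
}

response = llm.predict(payload)
print(response[0]["generated_text"])
```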
The HuggingFaceM4/idefics-80b-instruct is an instruction-tuned model, meaning we can instruct it using the following prompt:
User:<fake_token_around_image><image><fake_token_around_image>{in_context_prompt}<end_of_utterance>\n
Assistant: {in_context_answer}<end_of_utterance>\n
User:<fake_token_around_image><image><fake_token_around_image>{prompt}<end_of_utterance>\n
Assistant:
Note that for TGI we currently need to provide the <image> as a "markdown" image url, e.g. ![](https://t4.ftcdn.net/jpg/00/97/58/97/360_F_97589769_t45CqXyzjz0KXwoBZT9PRaWGHRk5hQqQ.jpg). In version 1.1.0 of TGI we can only provide URLs to images. To run secure inference and use local images, we will upload them to S3, make them available via signed URLs, and delete them again after inference. Therefore we create a helper method run_inference, which accepts our prompt and a path to an image. The image will be uploaded to S3 and a signed URL will be created. We will then run inference on our endpoint and delete the image again.
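A sketch of such a helper could look like the following, reusing the sess session and llm predictor from the earlier steps; the bucket key prefix, URL expiry, simplified single-turn prompt, and generation parameters are all assumptions:

```python
import os
import boto3
from botocore.client import Config

# S3 client with sigv4 so presigned URLs work in all regions
s3 = boto3.client("s3", config=Config(signature_version="s3v4"))
bucket = sess.default_bucket()  # SageMaker default bucket from the session created earlier


def run_inference(prompt, image_path):
    """Upload a local image to S3, run inference via a signed URL, then delete the image again."""
    key = f"idefics-inference/{os.path.basename(image_path)}"  # key prefix is an assumption
    s3.upload_file(image_path, bucket, key)
    try:
        # Short-lived signed URL so the endpoint can fetch the image
        image_url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=300,
        )
        # Simplified single-turn version of the prompt template above,
        # with <image> replaced by the markdown image url
        parsed_prompt = f"User:![]({image_url}){prompt}<end_of_utterance>\nAssistant:"
        payload = {
            "inputs": parsed_prompt,
            "parameters": {
                "do_sample": True,
                "top_p": 0.9,
                "temperature": 0.8,
                "max_new_tokens": 256,
                "stop": ["User:", "<end_of_utterance>"],
            },
        }
        response = llm.predict(payload)
        return response[0]["generated_text"]
    finally:
        # Remove the uploaded image again after inference
        s3.delete_object(Bucket=bucket, Key=key)
```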
Let's get an image from the internet we can use to run a request. We will use the following image:
Let's ask if we can use this power cord in the U.S.
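A hypothetical end-to-end call could look like this, reusing the power cord image URL from the markdown example above; the local filename is an assumption:

```python
import requests

# Download the example image locally (URL from the markdown example above)
image_url = "https://t4.ftcdn.net/jpg/00/97/58/97/360_F_97589769_t45CqXyzjz0KXwoBZT9PRaWGHRk5hQqQ.jpg"
with open("powercord.jpeg", "wb") as f:
    f.write(requests.get(image_url).content)

# Ask the model about the power cord using the run_inference helper sketched above
print(run_inference("Can I use this cable in the U.S.?", "powercord.jpeg"))
```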
That's correct, the cable is a European cable and not suitable for the U.S.
6. Clean up
To clean up, we can delete the model and endpoint.
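A minimal sketch, using the llm predictor from the deployment step:

```python
# Delete the model and the endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```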
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.