Deploy Falcon 180B on Amazon SageMaker
Falcon 180B is the newest model in the Falcon LLM family and the largest openly released model to date, with 180 billion parameters trained on 3.5 trillion tokens and a context window of up to 4K tokens. Falcon 180B was created by the Technology Innovation Institute in Abu Dhabi. It is one of the most powerful open LLMs available today, competing with Google's PaLM 2 Large and outperforming OpenAI's GPT-3.5.
In this blog you will learn how to deploy Falcon 180B to Amazon SageMaker. We are going to use the Hugging Face LLM DLC, a purpose-built Inference Container for easily deploying LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The post also covers the hardware requirements for the different model sizes.
The blog will cover how to:
- Setup development environment
- Retrieve the new Hugging Face LLM DLC
- Hardware requirements
- Deploy Falcon 180B to Amazon SageMaker
- Run inference and chat with the model
- Create a streaming Gradio Chatbot Demo
Let's get started!
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Falcon 180B to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
2. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
3. Hardware requirements
Falcon 180B is the biggest version of the Falcon family, and running it requires powerful hardware. Below is a set of minimum requirements. We plan to run more tests with gptq in the coming weeks, when quantized models are available.
| Model | Instance Type | Quantization | Tested |
|---|---|---|---|
| Falcon 180B | (ml.)p4de.24xlarge | - | ✅ |
| Falcon 180B | (ml.)p4d.24xlarge | gptq | 🛑 |
| Falcon 180B | (ml.)p5.48xlarge | - | 🛑 |
4. Deploy Falcon 180B to Amazon SageMaker
To deploy tiiuae/falcon-180B-chat to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the hf_model_id, instance_type, etc. We will use a p4de.24xlarge instance type, which has 8 NVIDIA A100 GPUs and 640GB of GPU memory. Since the latest version of the Hugging Face DLC does not yet support the new Falcon config, we created a separate sagemaker revision in the repository, which we will use to deploy Falcon 180B to Amazon SageMaker.
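A sketch of the endpoint configuration could look like this. It assumes `role` and `llm_image` from the earlier steps; the token limits and the health-check timeout are illustrative assumptions, so the AWS-dependent calls are left commented:

```python
import json

# Model and endpoint configuration; HF_MODEL_REVISION pins the separate
# `sagemaker` revision mentioned above.
config = {
    "HF_MODEL_ID": "tiiuae/falcon-180B-chat",  # model id from hf.co/models
    "HF_MODEL_REVISION": "sagemaker",          # revision with the SageMaker-compatible config
    "SM_NUM_GPUS": json.dumps(8),              # number of GPUs on p4de.24xlarge
    "MAX_INPUT_LENGTH": json.dumps(1024),      # max length of input text (assumption)
    "MAX_TOTAL_TOKENS": json.dumps(2048),      # max length of input + generation (assumption)
}

# Model creation and deployment require AWS credentials, hence commented out:
# from sagemaker.huggingface import HuggingFaceModel
# llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)
# llm = llm_model.deploy(
#     initial_instance_count=1,
#     instance_type="ml.p4de.24xlarge",
#     container_startup_health_check_timeout=1800,  # give the 180B weights time to load
# )
```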
After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.p4de.24xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.
SageMaker will now create our endpoint and deploy the model to it. This can take 15-20 minutes.
5. Run inference and chat with the model
After our endpoint is deployed, we can run inference on it using the predict method from the predictor. We can pass different parameters to influence the generation, defined in the parameters attribute of the payload. As of today, TGI supports the following parameters:
- `temperature`: Controls randomness in the model. Lower values make the model more deterministic, higher values make it more random. Default value is 1.0.
- `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
- `repetition_penalty`: Controls the likelihood of repetition. Defaults to `null`.
- `seed`: The seed to use for random generation. Defaults to `null`.
- `stop`: A list of tokens at which to stop the generation. The generation stops when one of the tokens is generated.
- `top_k`: The number of highest-probability vocabulary tokens to keep for top-k filtering. Defaults to `null`, which disables top-k filtering.
- `top_p`: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Defaults to `null`.
- `do_sample`: Whether or not to use sampling; uses greedy decoding otherwise. Default value is `false`.
- `best_of`: Generate `best_of` sequences and return the one with the highest token logprobs. Defaults to `null`.
- `details`: Whether or not to return details about the generation. Default value is `false`.
- `return_full_text`: Whether to return the full text or only the generated part. Default value is `false`.
- `truncate`: Whether or not to truncate the input to the maximum length of the model. Default value is `true`.
- `typical_p`: The typical probability of a token. Defaults to `null`.
- `watermark`: Whether to apply watermarking to the generation. Default value is `false`.
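Putting a few of these parameters together, a request payload could look like the following sketch (the parameter values are illustrative, not tuned, and the stop sequences are an assumption; `predictor` is the object returned by `deploy`, so the call itself is commented out):

```python
payload = {
    "inputs": "What is Amazon SageMaker?",
    "parameters": {
        "do_sample": True,           # sample instead of greedy decoding
        "top_p": 0.9,                # nucleus sampling
        "temperature": 0.8,
        "max_new_tokens": 256,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>"],  # stop sequences (assumption)
    },
}

# Requires the deployed endpoint, hence commented out:
# response = predictor.predict(payload)
# print(response[0]["generated_text"])
```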
You can find the Open API specification of TGI in the swagger documentation.
The `tiiuae/falcon-180B-chat` model is a conversational chat model, meaning we can chat with it using the following prompt format:
System: You are a helpful assistant
User: What is Amazon SageMaker?
Falcon: Amazon....
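A small helper to assemble that prompt format might look like this (the helper name and signature are ours; only the `System:`/`User:`/`Falcon:` layout comes from the format above):

```python
def build_falcon_prompt(system_prompt, messages):
    """Build the chat prompt; messages is a list of (role, text) tuples,
    with role being "User" or "Falcon"."""
    lines = [f"System: {system_prompt}"]
    for role, text in messages:
        lines.append(f"{role}: {text}")
    lines.append("Falcon:")  # cue the model to answer as the assistant
    return "\n".join(lines)

prompt = build_falcon_prompt(
    "You are a helpful assistant",
    [("User", "What is Amazon SageMaker?")],
)
print(prompt)
```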
Let's test it and ask the model what Amazon SageMaker is. As a response, you should get something like this:
Sure, Amazon SageMaker is a fully-managed machine learning service provided by Amazon Web Services (AWS) that enables developers and data scientists to build, train, and deploy machine learning models at scale. It offers a range of tools and capabilities for data preprocessing, feature engineering, algorithm selection, hyperparameter tuning, model training, and deployment. SageMaker also supports popular open-source frameworks like TensorFlow and PyTorch, and provides built-in algorithms for common use cases like classification, regression, and clustering. Overall, SageMaker makes it easier for organizations to adopt and implement machine learning in their applications and workflows.
6. Create a streaming Gradio Chatbot Demo
We can also create a Gradio application to chat with our model. Gradio is a Python library that allows you to quickly create customizable UI components around your machine learning models. You can find more about Gradio here.
Amazon SageMaker now supports streaming through Server-Sent Events, which we can leverage to create a streaming Gradio application with a better user experience. To do so, we use the boto3 sagemaker-runtime client with the new invoke_endpoint_with_response_stream method. To keep this example clean, we move the code to a separate app.py file, which includes a Gradio example of how to build a chatbot with the Falcon 180B model and SageMaker streaming. Make sure the app.py file is in the same directory as this notebook.
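To sketch the core of the streaming logic: the sagemaker-runtime response stream yields events shaped like `{"PayloadPart": {"Bytes": b"..."}}`, and TGI emits server-sent events (`data:{json}` lines). Chunks can split mid-line, so we buffer bytes until a newline before decoding. The buffering scheme and helper name below are ours, and the actual endpoint call is commented out:

```python
import json

# Real invocation (requires AWS credentials and the deployed endpoint):
# import boto3
# smr = boto3.client("sagemaker-runtime")
# resp = smr.invoke_endpoint_with_response_stream(
#     EndpointName=endpoint_name, Body=json.dumps(payload),
#     ContentType="application/json")
# event_stream = resp["Body"]

def iter_tokens(event_stream):
    """Yield generated token texts from a SageMaker response stream."""
    buffer = b""
    for event in event_stream:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line.startswith(b"data:"):
                data = json.loads(line[len(b"data:"):])
                yield data["token"]["text"]

# usage with a fake stream standing in for resp["Body"]:
fake = [
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": "Hello"}}\n'}},
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": " world"}}\n'}},
]
print("".join(iter_tokens(fake)))  # -> "Hello world"
```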
Awesome! 🚀 We have successfully deployed Falcon 180B to Amazon SageMaker and run inference on it. Additionally, we have built a quick Gradio application to chat with our model.
Now, it's time for you to try it out yourself and build Generative AI applications with the Hugging Face LLM DLC on Amazon SageMaker.
To clean up, we can delete the model and endpoint.
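As a sketch, the cleanup could look like this (the `cleanup` wrapper is ours; `delete_model` and `delete_endpoint` are standard sagemaker predictor methods):

```python
def cleanup(predictor):
    # delete the model first, then the endpoint
    predictor.delete_model()
    predictor.delete_endpoint()

# usage, with `llm` being the predictor returned by `deploy`:
# cleanup(llm)
```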
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.