In our example, we are going to leverage Hugging Face Transformers, Accelerate, and PEFT. We will also make use of new and efficient features and methods, including Flash Attention, dataset packing, and mixed precision training.
If you are going to use Llama 2, you need to log in to your Hugging Face account so your token can be used to access the gated repository. We can do this by running the following command:
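A minimal way to do this from Python with the huggingface_hub client (YOUR_TOKEN is a placeholder for your access token):

```python
from huggingface_hub import login

# authenticate against the Hugging Face Hub to access gated repositories
login(token="YOUR_TOKEN")  # YOUR_TOKEN is a placeholder; use your own token
```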
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
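A common way to set up the session and role looks like this (the RoleName below is an assumption; use the name of the execution role you created):

```python
import sagemaker
import boto3

sess = sagemaker.Session()

try:
    # works when running inside SageMaker (notebook instance / Studio)
    role = sagemaker.get_execution_role()
except ValueError:
    # running locally: look up the role via IAM (role name is an assumption)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```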
2. Load and prepare the dataset
We will use Dolly, an open-source dataset of instruction-following records on categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
To load the Dolly dataset, we use the load_dataset() method from the 🤗 Datasets library.
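For example:

```python
from datasets import load_dataset

# load the Dolly dataset from the Hugging Face Hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
```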
To instruction-tune our model we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting_function that takes a sample and returns a string in our instruction format.
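A sketch of such a formatting function for Dolly's instruction/context/response fields could look like this (the exact prompt template is a design choice):

```python
def format_dolly(sample):
    # build an instruction-style prompt from the Dolly fields
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # drop the context block when the sample has none
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt
```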
Let's test our formatting function on a random example.
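```python
from random import randrange

# print the formatted prompt for a randomly picked sample
print(format_dolly(dataset[randrange(len(dataset))]))
```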
In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training. This means we stack multiple samples into one sequence, separated by an EOS token. Packing/stacking samples can be done during or before training; we will do it before training to save time. We created a utility method pack_dataset that takes a dataset and a packing configuration and returns a packed dataset.
To pack/stack our dataset we first need to tokenize it; then we can pack it with the pack_dataset method. To prepare our dataset we will now (see the sketch after this list):
Format our samples using the template method and add an EOS token at the end of each sample
Tokenize our dataset to convert it from text to tokens
Pack our dataset to 2048 tokens
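Putting these steps together might look like the sketch below, assuming pack_dataset accepts the tokenized dataset and a target chunk length, and that it is importable from the repository's scripts:

```python
from transformers import AutoTokenizer

# pack_dataset is the utility from the repository; the import path is an assumption
from pack_dataset import pack_dataset

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1. apply the prompt template and add an EOS token to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# 2. tokenize the formatted text
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=list(dataset.features),
)

# 3. pack the tokenized samples into constant-length chunks of 2048 tokens
lm_dataset = pack_dataset(lm_dataset, chunk_length=2048)
```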
After we have processed the dataset, we are going to use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
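For example (the S3 prefix is illustrative; saving to an s3:// path requires the s3fs/fsspec extras for 🤗 Datasets):

```python
# save the packed dataset directly to S3
training_input_path = f"s3://{sess.default_bucket()}/processed/mistral/dolly/train"
lm_dataset.save_to_disk(training_input_path)

print(f"training dataset uploaded to: {training_input_path}")
```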
3. Fine-Tune Mistral 7B with QLoRA on Amazon SageMaker
We are going to use the method recently introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during finetuning without sacrificing performance. The TL;DR of how QLoRA works is:
Quantize the pretrained model to 4 bits and freeze it.
Attach small, trainable adapter layers (LoRA).
Finetune only the adapter layers, while using the frozen quantized model for context.
We prepared run_qlora.py, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training, so you can use the model like a normal model without any additional code. If the model is too large to fit into memory, it will be temporarily offloaded to disk.
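At its core, run_qlora.py boils down to the following pattern (a sketch, not the full script; the hyperparameter values are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. quantize the pretrained model to 4 bits (NF4) and freeze it
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    # flag name depends on the transformers version
    # (newer versions use attn_implementation="flash_attention_2")
    use_flash_attention_2=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# 2. attach small, trainable LoRA adapter layers
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# 3. only the adapter parameters are trained; the 4-bit base model stays frozen
model.print_trainable_parameters()
```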
In addition to QLoRA, we will leverage the new Flash Attention 2 integration with Transformers to speed up training. Flash Attention 2 is a new, efficient attention mechanism that is up to 3x faster than the standard attention mechanism.
In order to create a SageMaker training job we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. Amazon SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job.
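A sketch of such an estimator for this example (hyperparameter values and DLC versions are illustrative; check the available Hugging Face DLC releases):

```python
import time
from sagemaker.huggingface import HuggingFace

# hyperparameters passed to run_qlora.py (values are illustrative)
hyperparameters = {
    "model_id": "mistralai/Mistral-7B-v0.1",         # pretrained model id
    "dataset_path": "/opt/ml/input/data/training",   # where SageMaker mounts the data
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "lr": 2e-4,
    "merge_weights": True,  # merge LoRA weights into the base model after training
}

job_name = f'mistral-7b-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

huggingface_estimator = HuggingFace(
    entry_point="run_qlora.py",
    source_dir="./scripts",          # directory with the training script and requirements.txt
    instance_type="ml.g5.4xlarge",
    instance_count=1,
    base_job_name=job_name,
    role=role,
    transformers_version="4.28.1",   # DLC versions are illustrative
    pytorch_version="2.0.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    disable_output_compression=True,  # store the model as raw files, not a tar archive
)
```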
Note: Make sure to include the requirements.txt in the source_dir if you are using a custom training script. We recommend just cloning the whole repository.
You can also use the g5.2xlarge instead of the g5.4xlarge instance type, but then it is not possible to use the merge_weights parameter, since merging the LoRA weights into the model weights requires the model to fit into memory. You could, however, save the adapter weights and merge them after training using merge_adapter_weights.py.
We can now start our training job, with the .fit() method passing our S3 path to the training script.
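For example:

```python
# define a data input dict with our uploaded S3 path
data = {"training": training_input_path}

# start the training job
huggingface_estimator.fit(data, wait=True)
```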
In our example for Mistral 7B, the SageMaker training job took 13968 seconds, which is about 3.9 hours. The ml.g5.4xlarge instance we used costs $2.03 per hour for on-demand usage. As a result, the total cost for training our fine-tuned Mistral model was only ~$8.
Now let's make sure SageMaker has successfully uploaded the model to S3. We can use the model_data property of the estimator to get the S3 path to the model. Since we used merge_weights=True and disable_output_compression=True, the model is stored as raw files in the S3 bucket.
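A quick check (with disable_output_compression=True, model_data describes an S3 prefix rather than a single model.tar.gz):

```python
# S3 location where the model artifacts (raw files) are stored
print(huggingface_estimator.model_data)
```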
You should see a similar folder structure and files in your S3 bucket:
Now, let's deploy our model to an endpoint. 🚀
4. Deploy Fine-tuned Mistral 7B on Amazon SageMaker
We are going to use the Hugging Face LLM Inference DLC, a purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a solution for deploying and serving Large Language Models (LLMs).
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel class with an image_uri pointing to the image. To retrieve the Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
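For example (the version is illustrative; check the available releases):

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the LLM DLC image uri for the "huggingface" (TGI) backend
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

print(f"llm image uri: {llm_image}")
```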
We can now create a HuggingFaceModel using the container URI and the S3 path to our model. We also need to set our TGI configuration, including the number of GPUs and the maximum input tokens. You can find a full list of configuration options here.
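A sketch of this, assuming the estimator from the training step is still available (values such as MAX_INPUT_LENGTH are illustrative):

```python
import json
from sagemaker.huggingface import HuggingFaceModel

# S3 path to the uncompressed model files produced by the training job
s3_model_uri = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# TGI configuration
config = {
    "HF_MODEL_ID": "/opt/ml/model",        # load the model from the mounted S3 files
    "SM_NUM_GPUS": json.dumps(1),          # number of GPUs per replica
    "MAX_INPUT_LENGTH": json.dumps(1024),  # max input tokens
    "MAX_TOTAL_TOKENS": json.dumps(2048),  # max input + output tokens
}

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    # reference the raw model files on S3 (no tar archive)
    model_data={
        "S3DataSource": {
            "S3Uri": s3_model_uri,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    },
    env=config,
)
```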
After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method.
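For example (instance type and timeout are illustrative):

```python
# deploy the model to a SageMaker endpoint
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,  # give TGI time to load the model
)
```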
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
5. Stream Inference Requests from the Deployed Model
Amazon SageMaker supports streaming responses from your model. We can leverage this to create a streaming Gradio application with a better user experience.
We created a sample application that you can use to test your model. You can find the code in gradio-app.py. The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs.
Don't forget to delete the endpoint after you are done with the example.
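For example:

```python
# clean up: delete the model and the endpoint
llm.delete_model()
llm.delete_endpoint()
```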
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.