Fine-tune Llama 3 with PyTorch FSDP and Q-Lora on Amazon SageMaker
This blog post walks you through how to fine-tune Llama 3 using PyTorch FSDP and Q-Lora with the help of Hugging Face TRL, Transformers, peft & datasets on Amazon SageMaker. In addition to FSDP we will use the Flash Attention v2 implementation.
This blog is an extension and dedicated version of my Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora post, specifically tailored to run on Amazon SageMaker.
- Setup development environment
- Create and prepare the dataset
- Fine-tune Llama 3 on Amazon SageMaker
- Deploy & Test fine-tuned Llama 3 on Amazon SageMaker
Note: This blog was created and validated on `ml.p4d.24xlarge` and `ml.g5.48xlarge` instances. The configurations and code are optimized for `ml.p4d.24xlarge` with 8x A100 GPUs, each with 40GB of memory. We tried `ml.g5.12xlarge`, but Amazon SageMaker reserves more memory than EC2. We plan to add support for `trn1` in the coming weeks.
FSDP + Q-Lora Background
In a collaboration between Answer.AI, Tim Dettmers (creator of Q-Lora), and Hugging Face, we are proud to share the support of Q-Lora and PyTorch FSDP (Fully Sharded Data Parallel). FSDP and Q-Lora now allow you to fine-tune Llama 2 70B or Mixtral 8x7B on 2x consumer GPUs (24GB). If you want to learn more about the background of this collaboration, take a look at You can now train a 70b language model at home. Hugging Face PEFT is where the magic happens; read more about it in the PEFT documentation.
- PyTorch FSDP is a data/model parallelism technique that shards the model across GPUs, reducing memory requirements and enabling the training of larger models more efficiently.
- Q-LoRA is a fine-tuning method that leverages quantization and Low-Rank Adapters to efficiently reduce computational requirements and memory footprint.
1. Setup Development Environment
Our first step is to install the Hugging Face libraries we need on the client to correctly prepare our dataset and start our training/evaluation jobs.
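A minimal install sketch is shown below; the exact package versions are assumptions, so pin them to whatever your environment requires:

```python
# install the client-side libraries (versions are assumptions, adjust as needed)
%pip install --upgrade --quiet "sagemaker>=2.190.0" "transformers" "datasets[s3]" "huggingface_hub[cli]"
```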
Next, we need to log in to Hugging Face to access the Llama 3 70B model and store our trained model on Hugging Face. If you don't have an account yet, you can create one here; make sure you have also accepted the Llama 3 license terms.
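A quick sketch of logging in from a notebook, assuming you have a Hugging Face access token at hand (the token string below is a placeholder):

```python
from huggingface_hub import login

# log in with your Hugging Face access token (placeholder value)
login(token="hf_...", add_to_git_credential=True)
```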
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
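Below is a sketch of the typical session and role setup; the fallback role name `sagemaker_execution_role` is an assumption for local environments, so replace it with your own role:

```python
import boto3
import sagemaker

sess = sagemaker.Session()
# bucket used for uploading data, models and logs
sagemaker_session_bucket = sess.default_bucket()

try:
    # works inside SageMaker Studio / notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # in a local environment, look the role up by name (name is an assumption)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```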
2. Create and prepare the dataset
After our environment is set up, we can start creating and preparing our dataset. A fine-tuning dataset should have a diverse set of demonstrations of the task you want to solve. If you want to learn more about how to create a dataset, take a look at How to Fine-Tune LLMs in 2024 with Hugging Face.
We will use the HuggingFaceH4/no_robots dataset, a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. This data can be used for supervised fine-tuning (SFT) to make language models follow instructions better. No Robots was modelled after the instruction dataset described in OpenAI's InstructGPT paper, and consists mostly of single-turn instructions.
The no_robots dataset has 10,000 examples, split into 9,500 training and 500 test examples. Some samples do not include a system message. We will load the dataset with the `datasets` library, add the missing system message, and save the splits to separate JSON files.
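The snippet below is a sketch of that preparation step, assuming the conversational `messages` format of no_robots; the wording of the system message is an assumption, so adapt it to your use case:

```python
from datasets import load_dataset

# system message added to samples that are missing one (the wording is an assumption)
system_message = "You are Llama, an AI assistant. Be helpful and honest."

def add_system_message(sample):
    # prepend a system message if the conversation does not start with one
    if sample["messages"][0]["role"] != "system":
        sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
    return sample

# load the dataset from the Hugging Face Hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# keep only the "messages" column and add the missing system messages
columns_to_remove = [c for c in dataset["train"].features if c != "messages"]
dataset = dataset.map(add_system_message, remove_columns=columns_to_remove)

# save the splits to separate json files
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)
```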
After we have processed the datasets, we are going to use the FileSystem integration to upload our dataset to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
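A sketch of the upload, assuming the S3 prefix layout below (the paths are assumptions) and that the `datasets` S3 filesystem integration (`s3fs`) is installed:

```python
# s3 prefix for our training data (path layout is an assumption)
input_path = f"s3://{sess.default_bucket()}/datasets/llama3"

# write the splits directly to s3 via the filesystem integration
dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records", force_ascii=False)

train_dataset_s3_path = f"{input_path}/train/dataset.json"
test_dataset_s3_path = f"{input_path}/test/dataset.json"
print(f"train data: {train_dataset_s3_path}")
print(f"test data:  {test_dataset_s3_path}")
```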
3. Fine-tune Llama 3 on Amazon SageMaker
We are now ready to fine-tune our model. We will use the `SFTTrainer` from `trl`, which makes it straightforward to supervised fine-tune open LLMs; it is a subclass of the `Trainer` from `transformers`. We prepared a script, run_fsdp_qlora.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training. It uses the `SFTTrainer` from `trl`, which supports:
- Dataset formatting, including conversational and instruction format (✅ used)
- Training on completions only, ignoring prompts (❌ not used)
- Packing datasets for more efficient training (✅ used)
- PEFT (parameter-efficient fine-tuning) support including Q-LoRA (✅ used)
- Preparing the model and tokenizer for conversational fine-tuning (❌ not used, see below)
For configuration we use the new `TrlParser`, which allows us to provide hyperparameters in a YAML file. This YAML file will be uploaded and provided to Amazon SageMaker in the same way as our datasets. Below is the config file for fine-tuning Llama 3 70B on 8x A100 GPUs or 4x 24GB GPUs. We save the config file as `fsdp_qlora_llama3_70b.yaml` and upload it to S3.
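The following is an illustrative sketch of what `fsdp_qlora_llama3_70b.yaml` could look like; the hyperparameter values and dataset paths are assumptions and must be aligned with `run_fsdp_qlora.py`:

```yaml
# script parameters (values are assumptions)
model_id: "meta-llama/Meta-Llama-3-70B"          # Hugging Face model id
max_seq_length: 3072                             # max sequence length for model and dataset packing
train_dataset_path: "/opt/ml/input/data/train/"  # where SageMaker mounts the train dataset
test_dataset_path: "/opt/ml/input/data/test/"    # where SageMaker mounts the test dataset
# training parameters
output_dir: "/tmp/llama3"                        # temporary output directory inside the container
learning_rate: 0.0002
lr_scheduler_type: "constant"
num_train_epochs: 2
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
gradient_checkpointing: true
bf16: true
tf32: true
report_to: "tensorboard"
# FSDP parameters
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
```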
For the chat template we use the Anthropic/Vicuna template, not the official Llama 3 one, since we would otherwise need to train and save the embedding layer as well, leading to higher memory requirements. If you want to use the official Llama 3 template, comment in the `LLAMA_3_CHAT_TEMPLATE` in the `run_fsdp_qlora.py` script and make sure to add `modules_to_save`. The template used will look like this:
You are a helpful Assistant.
Human: What is 2+2?
Assistant: 2+2 equals 4.
Let's upload the config file to S3.
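A sketch of the upload using the SageMaker SDK's `S3Uploader`; the target prefix is an assumption:

```python
from sagemaker.s3 import S3Uploader

# upload the yaml config next to our datasets (prefix is an assumption)
train_config_s3_path = S3Uploader.upload(
    local_path="fsdp_qlora_llama3_70b.yaml",
    desired_s3_uri=f"{input_path}/config",
)
print(f"config uploaded to: {train_config_s3_path}")
```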
In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. Amazon SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job.
Note: Make sure that you include the `requirements.txt` in the `source_dir` if you are using a custom training script. We recommend just cloning the whole repository.
To use `torchrun` to execute our scripts, we only have to define the `distribution` parameter in our Estimator and set it to `{"torch_distributed": {"enabled": True}}`. This tells SageMaker to launch our training job with `torchrun`.
The `HuggingFace` configuration below will start a training job on 1x `p4d.24xlarge` using 8x A100 GPUs. The amazing part about SageMaker is that you can easily scale up to 2x `p4d.24xlarge` by modifying the `instance_count`. SageMaker will take care of the rest for you.
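Here is a sketch of the Estimator configuration; the framework versions, `source_dir`, job name, and environment values are assumptions and should be aligned with your repository and region:

```python
from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFace

# create the Estimator (versions and paths are assumptions)
huggingface_estimator = HuggingFace(
    entry_point="run_fsdp_qlora.py",        # training script
    source_dir="scripts",                   # directory with the script and requirements.txt
    instance_type="ml.p4d.24xlarge",        # instance type used for the training job
    instance_count=1,                       # scale up by increasing this value
    base_job_name="llama3-70b-fsdp-qlora",  # name prefix of the training job
    role=role,                              # IAM role used to access AWS resources, e.g. S3
    volume_size=500,                        # size of the EBS volume in GB
    transformers_version="4.36.0",          # transformers version in the container
    pytorch_version="2.1.0",                # pytorch version in the container
    py_version="py310",                     # python version in the container
    hyperparameters={
        # path where SageMaker mounts the uploaded config inside the container
        "config": "/opt/ml/input/data/config/fsdp_qlora_llama3_70b.yaml"
    },
    disable_output_compression=True,        # keep the model artifact uncompressed
    distribution={"torch_distributed": {"enabled": True}},  # launch with torchrun
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),        # token to access the gated Llama 3 weights
    },
)
```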
Note: When using QLoRA, we only train adapters, not the full model. The `run_fsdp_qlora.py` script merges the `base_model` with the `adapter` at the end of the training so that the model can be deployed directly to Amazon SageMaker.
We can now start our training job with the `.fit()` method, passing our S3 paths to the training script.
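A sketch of starting the job; the channel names (`train`, `test`, `config`) are assumptions that must match what the training script expects:

```python
# define a data input dictionary with our uploaded S3 URIs
data = {
    "train": train_dataset_s3_path,
    "test": test_dataset_s3_path,
    "config": train_config_s3_path,
}

# start the training job with our uploaded datasets and config as input
huggingface_estimator.fit(data, wait=True)
```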
In our example, training Llama 3 70B with Flash Attention for 2 epochs on a dataset of 10k samples takes 5,052 seconds (~84 minutes) on an `ml.p4d.24xlarge`, or roughly $50.
4. Deploy & Test fine-tuned Llama 3 on Amazon SageMaker
Evaluating LLMs is crucial for understanding their capabilities and limitations, yet it poses significant challenges due to their complex and opaque nature. There are multiple ways to evaluate a fine-tuned model. You can either use an additional training job to evaluate the model, as we demonstrated in Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker, or you can deploy the model to an endpoint and interactively test it. We are going to use the latter approach in this example. We will load our eval dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric.
Note: Evaluating generative AI models is not a trivial task since one input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out the Evaluate LLMs and RAG a practical example using Langchain and Hugging Face blog post.
We are going to use the Hugging Face LLM Inference DLC, a purpose-built inference container that makes it easy to deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a solution for deploying and serving Large Language Models (LLMs).
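The container URI can be retrieved through the SageMaker SDK; the TGI version pinned below is an assumption:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the Hugging Face LLM (TGI) container uri (version is an assumption)
llm_image = get_huggingface_llm_image_uri("huggingface", version="2.0.0")
print(f"llm image uri: {llm_image}")
```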
We can now create a `HuggingFaceModel` using the container URI and the S3 path to our model. We also need to set our TGI configuration, including the number of GPUs and the maximum input tokens. You can find a full list of configuration options here.
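A sketch of the model definition, assuming we reuse the training job's model artifact; the TGI values below are assumptions sized for an `ml.p4d.24xlarge` endpoint:

```python
from sagemaker.huggingface import HuggingFaceModel

# TGI configuration (values are assumptions)
config = {
    "HF_MODEL_ID": "/opt/ml/model",  # load the model from the mounted SageMaker model artifact
    "SM_NUM_GPUS": "8",              # number of GPUs used per replica
    "MAX_INPUT_LENGTH": "2048",      # max number of input tokens
    "MAX_TOTAL_TOKENS": "4096",      # max number of total tokens (input + output)
}

# create the HuggingFaceModel from the TGI image and the trained model artifact in S3
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=huggingface_estimator.model_data,  # S3 location of the merged model
    env=config,
)
```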
After we have created the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method.
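A sketch of the deployment call; the instance type and startup timeout are assumptions:

```python
# deploy the model to a real-time endpoint (instance type and timeout are assumptions)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=1200,  # give TGI time to load the 70B model
)
```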
SageMaker will now create our endpoint and deploy the model to it. This can take 15-20 minutes. Afterwards, we can test our model by sending some example inputs to the endpoint. We will use the `predict` method of the predictor to send the input to the model and get the output.
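A sketch of a request, assuming the Anthropic/Vicuna-style prompt format we trained with; the generation parameters and stop sequence are assumptions:

```python
# prompt following the Anthropic/Vicuna-style template used during training
prompt = "You are a helpful Assistant.\n\nHuman: What is 2+2?\n\nAssistant:"

payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 256,
        "stop": ["\nHuman:"],  # stop sequence is an assumption
    },
}

response = llm.predict(payload)
print(response[0]["generated_text"])
```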
Note: If you want to learn more about streaming responses or benchmarking your endpoint, check out Deploy Llama 3 on Amazon SageMaker and Deploy Mixtral 8x7B on Amazon SageMaker.
To clean up, we can delete the model and endpoint.
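A sketch of the cleanup, using the predictor returned by `deploy`:

```python
# delete the model and the endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```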
Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.