Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker
In this SageMaker example, we are going to learn how to fine-tune LLaMA 2 using QLoRA, the method introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs". LLaMA 2 is the next version of LLaMA. Compared to the V1 model, it is trained on more data (2T tokens) and supports a context length of up to 4K tokens. Learn more about LLaMA 2 in the "" blog post.
QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
In our example, we are going to leverage Hugging Face Transformers, Accelerate, and PEFT.
In detail, you will learn how to:
- Set up the development environment
- Load and prepare the dataset
- Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker
- Deploy Fine-tuned LLM on Amazon SageMaker
Quick intro: PEFT or Parameter Efficient Fine-tuning
PEFT, or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:
- (Q)LoRA: LoRA: Low-Rank Adaptation of Large Language Models
- Prefix Tuning: P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- P-Tuning: GPT Understands, Too
- Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
- IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
Access LLaMA 2
Before we can start training, we have to make sure that we have accepted the LLaMA 2 license. You can accept the license by clicking on the "Agree and access repository" button on the model page at:
1. Set up Development Environment
To access any LLaMA 2 asset, we need to log in to our Hugging Face account. We can do this by running the following command:
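As a minimal sketch of the login step from Python, assuming the `huggingface_hub` package is installed and that the access token is stored in the `HF_TOKEN` environment variable (the helper name `hf_login` is ours, chosen for illustration):

```python
import os


def hf_login(token=None):
    """Log in to the Hugging Face Hub so gated LLaMA 2 files can be downloaded."""
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise ValueError("Set HF_TOKEN or pass a token explicitly")
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
```

Calling `hf_login()` once per environment should be enough, since the token is cached locally after a successful login.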
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
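A sketch of how the session and role could be resolved, assuming the `sagemaker` and `boto3` packages are installed and AWS credentials are configured. The role name `sagemaker_execution_role` is a placeholder, not a name taken from this post; replace it with a role that exists in your account.

```python
def get_session_and_role(role_name="sagemaker_execution_role"):
    """Return a SageMaker session and an IAM role ARN.

    Inside SageMaker (Studio or notebook instances) the execution role is
    discovered automatically; in a local environment we look it up by name.
    """
    # Imported lazily so this module loads without the AWS SDKs installed.
    import boto3
    import sagemaker

    sess = sagemaker.Session()
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        # Local environment: fetch the ARN of a role we are allowed to use.
        iam = boto3.client("iam")
        role = iam.get_role(RoleName=role_name)["Role"]["Arn"]
    return sess, role
```

The returned session also gives access to `sess.default_bucket()`, which is used further down for the dataset upload.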
2. Load and prepare the dataset
We will use Dolly, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.
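A minimal sketch of that call, assuming the `datasets` package is installed and the Hub is reachable. The wrapper name `load_dolly` and the choice of the `train` split are our assumptions.

```python
DATASET_ID = "databricks/databricks-dolly-15k"


def load_dolly(split="train"):
    """Download the Dolly dataset from the Hugging Face Hub."""
    # Imported lazily so this module loads without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

Each record is a dictionary, so `load_dolly()[0]` can be inspected directly to see which fields are available.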
To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting_function that takes a sample and returns a string with our format instruction.
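One possible shape for such a function is sketched below. It assumes each sample has `instruction`, `context`, and `response` fields; the function name `format_dolly` and the `### Instruction` / `### Context` / `### Answer` headers are illustrative choices, and any consistent template would work.

```python
def format_dolly(sample):
    """Turn one structured record into a single instruction-style prompt."""
    instruction = f"### Instruction\n{sample['instruction']}"
    # The context is optional; skip the section when it is empty.
    context = f"### Context\n{sample['context']}" if sample.get("context") else None
    response = f"### Answer\n{sample['response']}"
    parts = [p for p in (instruction, context, response) if p is not None]
    return "\n\n".join(parts)


example = {
    "instruction": "Name a primary color.",
    "context": "",
    "response": "Blue",
}
print(format_dolly(example))
```

Whatever template is chosen here should be reused at inference time, so that prompts match what the model saw during training.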
Let's test our formatting function on a random example.
In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.
We define some helper functions to pack our samples into sequences of a given length and then tokenize them.
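The packing idea can be sketched in plain Python as follows. This is a simplified illustration that works on lists of token ids rather than on a real tokenizer output, and the name `pack_sequences` is ours. In practice one would typically append an end-of-sequence token to every sample before packing so the model can tell samples apart.

```python
from itertools import chain


def pack_sequences(token_lists, chunk_length=2048, carry=None):
    """Concatenate tokenized samples and cut them into fixed-length chunks.

    Returns (chunks, remainder); the remainder holds the trailing tokens that
    did not fill a whole chunk and can be passed as `carry` to the next call.
    """
    stream = list(carry or []) + list(chain.from_iterable(token_lists))
    usable = (len(stream) // chunk_length) * chunk_length
    chunks = [stream[i:i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, stream[usable:]


chunks, rest = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(rest)    # [9]
```

Because every chunk has exactly `chunk_length` tokens, no padding is needed and no compute is wasted on padding tokens.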
After we have processed the dataset, we are going to use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
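A sketch of the upload step, assuming a 🤗 Datasets object whose `save_to_disk()` accepts an `s3://` URI (this needs a recent `datasets` version with `s3fs` installed). The key prefix `processed/llama/dolly/train` and both helper names are hypothetical.

```python
def training_input_path(bucket, prefix="processed/llama/dolly/train"):
    """Build the S3 URI under which the processed dataset is stored."""
    return f"s3://{bucket}/{prefix}"


def upload_dataset(dataset, bucket, prefix="processed/llama/dolly/train"):
    """Save the dataset to S3 and return the URI for the training job."""
    path = training_input_path(bucket, prefix)
    dataset.save_to_disk(path)
    return path
```

With a SageMaker session at hand, the call would look like `upload_dataset(lm_dataset, sess.default_bucket())`, where `lm_dataset` stands for the packed dataset.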
3. Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker
We are going to use the method recently introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during fine-tuning, without sacrificing performance. The TL;DR of how QLoRA works is:
- Quantize the pretrained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (LoRA).
- Finetune only the adapter layers, while using the frozen quantized model for context.
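To see why training only the adapters is cheap, here is a back-of-the-envelope parameter count for a single weight matrix. A LoRA adapter of rank `r` on a `d_in x d_out` matrix adds two small matrices of shapes `d_in x r` and `r x d_out`. The hidden size of 5120 and the rank of 64 are assumptions picked for illustration, not values taken from the training script.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable parameters a LoRA adapter adds to one matrix."""
    return rank * (d_in + d_out)


d = 5120   # assumed hidden size of a 13B-class model
rank = 64  # assumed LoRA rank

full = d * d                                # frozen base weights in the matrix
lora = lora_trainable_params(d, d, rank)    # trainable adapter weights

print(full)         # 26214400
print(lora)         # 655360
print(lora / full)  # 0.025
```

Under these assumptions the adapter holds 2.5% of the parameters of the matrix it is attached to, and only that part receives gradients and optimizer state.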
We prepared a run_clm.py script, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code. Make sure to also add the requirements.txt to your source_dir folder; that way SageMaker will install the needed libraries, including peft.
In order to create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us.
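A sketch of how the Estimator could be configured is shown below. Apart from `run_clm.py`, `merge_weights`, and the `ml.g5.4xlarge` instance type, everything here is an assumption: the `scripts` source directory, the hyperparameter names and values, the volume size, and the container versions would all need to match your own script and the currently available Hugging Face containers.

```python
def build_estimator_kwargs(role, model_id, instance_type="ml.g5.4xlarge"):
    """Collect the arguments for the HuggingFace Estimator in one place."""
    hyperparameters = {
        "model_id": model_id,                           # hypothetical script argument
        "dataset_path": "/opt/ml/input/data/training",  # where the channel is mounted
        "epochs": 3,                                    # assumed value
        "per_device_train_batch_size": 2,               # assumed value
        "lr": 2e-4,                                     # assumed value
        "merge_weights": True,                          # merge LoRA weights after training
    }
    return dict(
        entry_point="run_clm.py",
        source_dir="scripts",            # assumed folder with run_clm.py and requirements.txt
        instance_type=instance_type,
        instance_count=1,
        role=role,
        volume_size=300,                 # assumed EBS volume size in GB
        transformers_version="4.28",     # assumed container versions
        pytorch_version="2.0",
        py_version="py310",
        hyperparameters=hyperparameters,
    )


def create_estimator(**kwargs):
    """Instantiate the Estimator (requires the `sagemaker` package)."""
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(**kwargs)
```

Usage would then be `create_estimator(**build_estimator_kwargs(role, "meta-llama/Llama-2-13b-hf"))`, with `role` being the IAM role ARN from the setup step.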
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running.
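As a rough illustration of that last step: in script mode, the hyperparameters passed to the Estimator usually reach the entry point as `--key value` command-line arguments. The helper below only mimics that convention; the exact command line, including argument order, is assembled by the container, and the hyperparameter names are the illustrative ones, not confirmed arguments of run_clm.py.

```python
def hyperparameters_to_args(hyperparameters):
    """Render a hyperparameter dict as a flat list of CLI arguments."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


args = hyperparameters_to_args({"epochs": 3, "lr": 2e-4, "merge_weights": True})
print(args)  # ['--epochs', '3', '--lr', '0.0002', '--merge_weights', 'True']
```

This is why the training script typically parses its settings with `argparse` or a similar argument parser.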
Hardware requirements
We also ran several experiments to determine which instance type can be used for the different model sizes. The following table shows the results of our experiments: the model size, instance type, max batch size, and context length.
Model | Instance Type | Max Batch Size | Context Length |
---|---|---|---|
LLaMa 7B | (ml.)g5.4xlarge | 3 | 2048 |
LLaMa 13B | (ml.)g5.4xlarge | 2 | 2048 |
LLaMa 70B | (ml.)p4d.24xlarge | 1++ (need to test more configs) | 2048 |
You can also use g5.2xlarge instead of the g5.4xlarge instance type, but then it is not possible to use the merge_weights parameter, since to merge the LoRA weights into the model weights, the model needs to fit into memory. But you could save the adapter weights and merge them using merge_adapter_weights.py after training.
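To get a feel for why merging needs more memory than training, here is a rough estimate of the memory taken by the weights alone. It assumes nominal parameter counts (7, 13, and 70 billion), 4 bits per weight for the quantized model, and 16 bits per weight for the merged model. The 16-bit figure is our assumption about how the merge is performed, and activations, adapters, and optimizer state are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{name}: {four_bit:.1f} GB at 4-bit, {sixteen_bit:.1f} GB at 16-bit")
```

Under these assumptions the 16-bit weights need four times the memory of the 4-bit weights, which is why an instance that is large enough for training can still be too small for the merge.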
Note: We plan to extend this list in the future. Feel free to contribute your setup!
We can now start our training job with the .fit() method, passing our S3 path to the training script.
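A sketch of that call, assuming the dataset was uploaded to S3 and that the input channel is named `training` (our choice; SageMaker makes each channel available under /opt/ml/input/data/<channel name> inside the container). The wrapper name `start_training` is hypothetical.

```python
def start_training(estimator, training_input_path, wait=True):
    """Start the SageMaker training job with the S3 dataset as input."""
    # The dictionary key is the channel name, the value is the S3 URI.
    data = {"training": training_input_path}
    estimator.fit(data, wait=wait)
    return data
```

Passing `wait=False` returns immediately instead of streaming the logs until the job has finished.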
In our example for LLaMA 13B, the SageMaker training job took 31728 seconds, which is about 8.8 hours. The ml.g5.4xlarge instance we used costs $2.03 per hour for on-demand usage. As a result, the total cost for training our fine-tuned LLaMA 2 model was only ~$18.
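The cost figure follows directly from the job duration and the hourly price quoted above:

```python
training_seconds = 31728  # duration of the training job
price_per_hour = 2.03     # on-demand price of ml.g5.4xlarge in USD

hours = training_seconds / 3600
cost = hours * price_per_hour

print(round(hours, 1))  # 8.8
print(round(cost, 2))   # 17.89
```

The same two lines can be reused to estimate the cost for other instance types or longer runs by swapping in their price and duration.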
4. Deploy Fine-tuned LLM on Amazon SageMaker
Note: We are currently working on releasing a new LLM container to support GQA for the 70B model.
You can deploy your fine-tuned LLaMA model to a SageMaker endpoint and use it for inference. Check out the posts "Deploy Falcon 7B & 40B on Amazon SageMaker" and "Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker" for more details.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.