In this Amazon SageMaker example, we are going to learn how to fine-tune tiiuae/falcon-180B using QLoRA: Efficient Finetuning of Quantized LLMs with Flash Attention. Falcon 180B is the newest member of the Falcon LLM family. It is the biggest open source model, with 180B parameters, and was trained on more data than its predecessors - 3.5T tokens - with a context window of up to 4K tokens.
QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
Before we can start training, we have to make sure that we accepted the license of tiiuae/falcon-180B to be able to use it. You can accept the license by clicking on the Agree and access repository button on the model page at:
To access any Falcon 180B asset, we need to log in to our Hugging Face account. We can do this by running the following command:
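A minimal sketch of this step using the login helper from the huggingface_hub library (huggingface-cli login on the command line works just as well; the token value is a placeholder for your own access token):

```python
from huggingface_hub import login

# log in with a Hugging Face access token whose account has accepted the Falcon 180B license
login(token="hf_...")  # placeholder, replace with your own token
```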
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
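A typical session setup looks like the sketch below; the fallback role name sagemaker_execution_role is an assumption and has to match a role that actually exists in your AWS account:

```python
import sagemaker
import boto3

sess = sagemaker.Session()

try:
    # works inside SageMaker notebooks / Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # in a local environment, look up the role by name (adjust to your setup)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```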
2. Load and prepare the dataset
We will use Dolly, an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
To load the dolly dataset, we use the load_dataset() method from the 🤗 Datasets library.
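A minimal sketch, assuming the databricks/databricks-dolly-15k dataset id on the Hugging Face Hub:

```python
from datasets import load_dataset

# load the full Dolly training split (~15k instruction/response records)
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
```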
To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string with our format instruction.
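One possible prompt template, here called format_dolly and assuming the Dolly column names instruction, context, and response (the section headers are just an example format, not a fixed convention):

```python
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # drop the context block when the sample has none
    prompt = "\n\n".join([part for part in [instruction, context, response] if part is not None])
    return prompt
```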
Let's test our formatting function on a random example.
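For example, using the format_dolly sketch from above:

```python
from random import randrange

# print a randomly selected, formatted sample
print(format_dolly(dataset[randrange(len(dataset))]))
```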
In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.
We define some helper functions to tokenize our samples and pack them into sequences of a given length.
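A sketch of what this preprocessing can look like, reusing the format_dolly function from above and an illustrative block size of 2048 tokens (downloading the tokenizer for the gated model requires the Hugging Face login from step 1):

```python
from itertools import chain
from transformers import AutoTokenizer

model_id = "tiiuae/falcon-180B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# apply the prompt template and append the EOS token
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

templated_dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# tokenize the formatted text
tokenized_dataset = templated_dataset.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=["text"],
)

block_size = 2048  # illustrative; adjust to your context length and memory budget

def chunk(examples):
    # concatenate all samples, then split the token stream into block_size chunks
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(chunk, batched=True)
print(f"Total number of packed samples: {len(lm_dataset)}")
```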
After we have processed the dataset, we are going to use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
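For example (assuming s3fs is installed so that 🤗 Datasets can write directly to S3; the path prefix is just a suggestion):

```python
# upload the processed dataset to S3 so the training job can access it
training_input_path = f"s3://{sess.default_bucket()}/processed/falcon-180b/dolly/train"
lm_dataset.save_to_disk(training_input_path)

print(f"Training dataset uploaded to: {training_input_path}")
```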
3. Fine-Tune Falcon 180B with QLoRA on Amazon SageMaker
We are going to use the recently introduced method from the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. The TL;DR of how QLoRA works is (a short code sketch follows the list below):
Quantize the pretrained model to 4 bits and freeze it.
Attach small, trainable adapter layers (LoRA).
Finetune only the adapter layers, while using the frozen quantized model for context.
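A minimal sketch of these three steps with transformers, bitsandbytes, and peft; the LoRA values and target modules shown here are illustrative, not the exact configuration used in the training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. quantize the pretrained model to 4 bits and freeze it
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# 2. attach small, trainable LoRA adapter layers
lora_config = LoraConfig(
    r=64,                                   # illustrative values; tune for your setup
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_key_value"],     # attention projection used by Falcon models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 3. only the adapter parameters are trainable, the quantized base model stays frozen
model.print_trainable_parameters()
```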
We prepared a run_clm.py, which implements QLoRA using PEFT and Flash Attention 2 for efficient training.
The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code.
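Conceptually, the merge step at the end of training looks roughly like the sketch below (not the exact code from run_clm.py; it assumes a recent peft version that provides AutoPeftModelForCausalLM, and uses the default SageMaker output directory as a placeholder):

```python
import torch
from peft import AutoPeftModelForCausalLM

output_dir = "/opt/ml/model"  # placeholder: directory where the adapter was saved

# load the trained adapter together with its quantized base model
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# merge the LoRA weights into the base weights and save a plain model
merged_model = model.merge_and_unload()
merged_model.save_pretrained(output_dir, safe_serialization=True)
```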
Make sure that you copy the whole scripts folder, which includes the requirements.txt to install the additional packages needed for QLoRA and Flash Attention.
Hardware requirements
So far we have only run experiments on the ml.p4d.24xlarge instance, but based on heuristics it should be possible to run on a ml.g5.48xlarge as well; it will just be slower.
We can now start our training job by calling the .fit() method and passing our S3 path to the training script.
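A sketch of the HuggingFace estimator and the .fit() call, reusing role and training_input_path from the snippets above. The hyperparameter names must match the arguments that run_clm.py actually parses, and the hyperparameter values, DLC versions, and token placeholder are assumptions to adapt to your setup:

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters passed to the training script (names and values are illustrative)
hyperparameters = {
    "model_id": "tiiuae/falcon-180B",
    "dataset_path": "/opt/ml/input/data/training",
    "epochs": 1,
    "per_device_train_batch_size": 2,
    "lr": 2e-4,
    "hf_token": "<REPLACE_WITH_YOUR_TOKEN>",  # needed to download the gated model
    "merge_weights": True,
}

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",
    source_dir="scripts",                 # includes requirements.txt for QLoRA + Flash Attention
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    role=role,
    base_job_name="falcon-180b-qlora",
    transformers_version="4.28",          # pick a DLC combination available in your region
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},
)

# start the training job, mounting the processed dataset from S3 under the "training" channel
huggingface_estimator.fit({"training": training_input_path}, wait=True)
```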
In our example for Falcon 180B, the SageMaker training job took 348 minutes, or 5.8 hours, for 1 epoch including merging the weights. The ml.p4d.24xlarge instance we used costs $37.688 per hour for on-demand usage. As a result, the total cost for training was ~$219 (5.8 hours x $37.688/hour).
For comparison, the pretraining of Falcon 180B took ~7,000,000 GPU hours, which is roughly 300,000 times more than fine-tuning for 3 epochs.