Fine-tune Falcon 180B with QLoRA and Flash Attention on Amazon SageMaker
In this Amazon SageMaker example, we are going to learn how to fine-tune tiiuae/falcon-180B using QLoRA: Efficient Finetuning of Quantized LLMs with Flash Attention. Falcon 180B is the newest member of the Falcon LLM family. It is the biggest openly available model, with 180B parameters, and was trained on more data than its predecessors - 3.5T tokens - with a context window of up to 4K tokens.
QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
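To make this concrete, here is a minimal sketch of what a QLoRA setup looks like with 🤗 Transformers, bitsandbytes, and PEFT. The LoRA hyperparameters and target modules below are illustrative assumptions, not necessarily what the training script used later in this example sets:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config (NF4 + double quantization, as proposed in the QLoRA paper)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the frozen, quantized base model (for Falcon 180B this still requires a multi-GPU machine)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# attach small, trainable low-rank adapters
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # attention projection modules in Falcon
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only a small fraction of the parameters is trainable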
In our example, we are going to leverage Hugging Face Transformers, Accelerate, and PEFT.
In detail, you will learn how to:
- Setup Development Environment
- Load and prepare the dataset
- Fine-Tune Falcon 180B with QLoRA on Amazon SageMaker
Access Falcon 180B
Before we can start training, we have to make sure that we accepted the license of tiiuae/falcon-180B to be able to use it. You can accept the license by clicking on the Agree and access repository button on the model page at https://huggingface.co/tiiuae/falcon-180B.
1. Setup Development Environment
!pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet
To access any Falcon 180B asset we need to log in to our Hugging Face account. We can do this by running the following command:
!huggingface-cli login --token YOUR_TOKEN
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Load and prepare the dataset
We will use the Dolly dataset (databricks/databricks-dolly-15k), an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"
}
To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.
from datasets import load_dataset
from random import randrange
# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011
To instruction-tune our model we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function format_dolly that takes a sample and returns a string with our format instruction.
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt
Let's test our formatting function on a random example.
from random import randrange
print(format_dolly(dataset[randrange(len(dataset))]))
In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.
from transformers import AutoTokenizer
model_id = "tiiuae/falcon-180B" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
We define some helper functions to pack our samples into sequences of a given length and then tokenize them.
from random import randint
from itertools import chain
from functools import partial
# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample
# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])
    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length
    else:
        # not enough tokens for a full chunk yet; keep everything in the remainder
        batch_chunk_length = 0
    # Split by chunks of chunk_length.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)
# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
After we have processed the dataset, we are going to use the FileSystem integration of 🤗 Datasets to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/falcon/dolly/train'
lm_dataset.save_to_disk(training_input_path)
print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
3. Fine-Tune Falcon 180B with QLoRA on Amazon SageMaker
We are going to use the recently introduced method from the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:
- Quantize the pretrained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (LoRA).
- Fine-tune only the adapter layers, while using the frozen quantized model for context.
We prepared a run_clm.py, which implements QLoRA using PEFT and Flash Attention 2 for efficient training. The script also merges the LoRA weights into the model weights after training, so you can use the resulting model like a regular model without any additional code.
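For reference, the merge step at the end of training roughly follows the PEFT pattern below. This is a sketch, not the exact code of run_clm.py, and the output path is only illustrative (SageMaker's model directory):
import torch
from peft import AutoPeftModelForCausalLM

# load the base model together with the trained LoRA adapter (assumes the adapter was saved with save_pretrained)
model = AutoPeftModelForCausalLM.from_pretrained(
    "/opt/ml/model",            # illustrative output directory of the training job
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
# merge the LoRA weights into the base weights and drop the adapter wrappers
merged_model = model.merge_and_unload()
# save a plain transformers checkpoint that can be used without PEFT
merged_model.save_pretrained("/opt/ml/model", safe_serialization=True)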
Make sure that you copy the whole scripts folder, which includes the requirements.txt, to install the additional packages needed for QLoRA and Flash Attention.
Hardware requirements
We have only run experiments on a p4d.24xlarge so far, but based on heuristics it should be possible to run on a g5.48xlarge as well, although it will be slower.
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                           # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save training dataset
    'epochs': 1,                                    # number of training epochs
    'per_device_train_batch_size': 4,               # batch size for training
    'lr': 2e-4,                                     # learning rate used during training
    'hf_token': HfFolder.get_token(),               # huggingface token to access Falcon 180b
}
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # train script
    source_dir           = 'scripts',          # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge',  # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    max_run              = 2*24*60*60,         # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,           # the name of the training job
    role                 = role,               # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,                # the size of the EBS volume in GB
    transformers_version = '4.28',             # the transformers version used in the training job
    pytorch_version      = '2.0',              # the pytorch version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    disable_output_compression = True          # do not compress output to save training time and cost
)
We can now start our training job with the .fit() method, passing our S3 path to the training script.
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
In our example for Falcon 180B, the SageMaker training job took 348 minutes, or ~5.8 hours, for 1 epoch including merging the weights. The ml.p4d.24xlarge instance we used costs $37.688 per hour for on-demand usage. As a result, the total cost for training was ~$256.
For comparison, the pretraining of Falcon 180B took ~7,000,000 GPU hours, which is roughly 300,000x more than fine-tuning for 3 epochs.
Next Steps
You can deploy your fine-tuned Falcon 180B model to a SageMaker endpoint and use it for inference. Check out the Deploy Falcon 180B on Amazon SageMaker and Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker for more details.
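As a starting point, a deployment with the Hugging Face LLM Deep Learning Container could look roughly like the sketch below. The container version, environment values, and instance type are assumptions, so check the linked posts for a tested configuration:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the Hugging Face LLM DLC (TGI) image; the version is an assumption
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=huggingface_estimator.model_data,  # s3 location of the merged model from the training job
    env={
        "HF_MODEL_ID": "/opt/ml/model",   # load the model from the local path inside the container
        "SM_NUM_GPUS": json.dumps(8),     # number of GPUs on the instance
        "MAX_INPUT_LENGTH": json.dumps(1024),
        "MAX_TOTAL_TOKENS": json.dumps(2048),
    },
)

# deploy the model to a SageMaker endpoint
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=900,  # give the container time to load the 180B weights
)

# run a test request using our prompt format
print(llm.predict({"inputs": "### Instruction\nWhat is Amazon SageMaker?\n\n### Answer\n"}))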
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.