Quick intro: PEFT or Parameter Efficient Fine-tuning
PEFT, or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. PEFT currently includes techniques such as LoRA, prefix tuning, P-tuning, and prompt tuning.
Note: This tutorial was created and run on a g5.2xlarge AWS EC2 instance, which includes a single NVIDIA A10G GPU.
1. Setup Development Environment
In our example, we use the PyTorch Deep Learning AMI, which comes with CUDA drivers and PyTorch already installed. We still have to install the Hugging Face libraries, including transformers and datasets. Running the following cell will install all the required packages.
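A minimal install cell might look like the sketch below. The package list mirrors what this tutorial needs (peft, transformers, datasets, accelerate, evaluate, bitsandbytes, loralib), and the extra packages rouge-score, tensorboard, and py7zr are assumptions for evaluation, logging, and unpacking samsum; pin versions as needed for your environment.

```python
# Install the Hugging Face libraries (run inside a notebook cell; "!" executes a shell command).
# Versions are intentionally left unpinned here; pin them to match your environment.
!pip install peft transformers datasets accelerate evaluate bitsandbytes loralib --upgrade --quiet
# Additional dependencies for evaluation, logging, and unpacking the samsum dataset
!pip install rouge-score tensorboard py7zr --quiet
```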
2. Load and prepare the dataset
We will use the samsum dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.
To load the samsum dataset, we use the load_dataset() method from the 🤗 Datasets library.
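Loading the dataset is a single call; printing the split sizes is just a sanity check.

```python
from datasets import load_dataset

# Load the samsum dataset from the Hugging Face Hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
```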
To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.
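A sketch of loading the tokenizer; we assume the tokenizer of the base google/flan-t5-xxl checkpoint.

```python
from transformers import AutoTokenizer

model_id = "google/flan-t5-xxl"

# Load the FLAN-T5 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
```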
Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: our model takes a text as input and generates a summary as output. To batch our data efficiently, we also want to understand how long our inputs and outputs typically are.
We preprocess our dataset before training and save it to disk. You could run this step on your local machine or a CPU and upload it to the Hugging Face Hub.
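The sketch below shows one way to do this: tokenize the dialogues and summaries once to pick sensible maximum lengths (the 85th/90th percentile cut-offs and the data/ output paths are assumptions), then map a preprocessing function over the dataset and save it to disk.

```python
import numpy as np
from datasets import concatenate_datasets

# Estimate maximum source/target lengths from the token length distribution
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"]
)
max_source_length = int(np.percentile([len(x) for x in tokenized_inputs["input_ids"]], 85))

tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"]
)
max_target_length = int(np.percentile([len(x) for x in tokenized_targets["input_ids"]], 90))

def preprocess_function(sample, padding="max_length"):
    # T5 expects a task prefix on the input
    inputs = ["summarize: " + item for item in sample["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    # Tokenize the summaries as labels
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)
    # Replace padding token ids in the labels with -100 so they are ignored in the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])

# Save the processed splits to disk so training can simply load them later
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
```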
3. Fine-Tune T5 with LoRA and bnb int-8
In addition to the LoRA technique, we will use bitsandbytes LLM.int8() to quantize our frozen LLM to int8. This allows us to reduce the memory needed for FLAN-T5 XXL by roughly 4x.
The first step of our training is to load the model. We are going to use philschmid/flan-t5-xxl-sharded-fp16, which is a sharded version of google/flan-t5-xxl. The sharding helps us avoid running out of memory when loading the model.
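A sketch of the model loading step; the load_in_8bit and device_map arguments follow the bitsandbytes integration in the transformers version used at the time (newer releases expose the same behaviour through a BitsAndBytesConfig).

```python
from transformers import AutoModelForSeq2SeqLM

# Hugging Face Hub model id of the sharded fp16 checkpoint
model_id = "philschmid/flan-t5-xxl-sharded-fp16"

# Load the model in 8-bit and let accelerate place it on the available devices
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
```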
Now, we can prepare our model for the LoRA int-8 training using peft.
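A sketch of the PEFT setup, assuming typical LoRA hyperparameters for T5 (r=16, alpha=32, adapters on the q and v attention projections); prepare_model_for_int8_training comes from the peft version used here, and print_trainable_parameters reports the percentage referenced below.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define the LoRA config; r, alpha, dropout and target modules are typical values, not fixed requirements
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

# Prepare the int-8 model for training (casts layer norms, enables gradient checkpointing, etc.)
model = prepare_model_for_int8_training(model)

# Add the LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```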
As you can see, we are only training 0.16% of the model's parameters! This huge reduction in trainable parameters lets us fine-tune the model without running into memory issues.
The next step is to create a DataCollator that takes care of padding our inputs and labels. We will use the DataCollatorForSeq2Seq from the 🤗 Transformers library.
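A minimal collator setup; the label pad token id of -100 matches the preprocessing above, and pad_to_multiple_of=8 is an optional efficiency tweak.

```python
from transformers import DataCollatorForSeq2Seq

# Ignore padding tokens in the loss
label_pad_token_id = -100

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
)
```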
The last step is to define the hyperparameters (TrainingArguments) we want to use for our training.
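A sketch of the training arguments; the learning rate, number of epochs, logging settings, and output directory are illustrative choices, not prescriptions.

```python
from transformers import Seq2SeqTrainingArguments

output_dir = "lora-flan-t5-xxl"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,   # let transformers pick a batch size that fits in memory
    learning_rate=1e-3,          # LoRA tolerates a higher learning rate than full fine-tuning
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)
```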
Let's now train our model and run the cells below. Note that for T5, some layers are kept in float32 for stability purposes.
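Creating the trainer and launching training could then look like this; disabling use_cache is only to silence warnings during training and should be re-enabled for inference.

```python
from transformers import Seq2SeqTrainer

# Only the LoRA parameters will receive gradient updates
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence warnings during training; re-enable for inference

# Start training
trainer.train()
```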
The training took ~10:36:00 and cost ~$13.22 for the 10h of training. For comparison, a full fine-tuning of FLAN-T5-XXL for the same duration (10h) requires 8x A100 40GB GPUs and costs ~$322.
We can save our model to use it for inference and evaluate it. We will save it to disk for now, but you could also upload it to the Hugging Face Hub using the model.push_to_hub method.
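A sketch of saving the adapter; the "results" directory name is a placeholder, and push_to_hub would upload the adapter instead of saving it locally.

```python
# Save the LoRA adapter weights and the tokenizer locally
peft_model_id = "results"  # placeholder output directory
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

# Alternatively, push the adapter to the Hugging Face Hub (repository name is hypothetical):
# trainer.model.push_to_hub("your-username/flan-t5-xxl-samsum-lora")
```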
Our LoRA checkpoint is only 84MB in size and contains all of the learned knowledge for samsum.
4. Evaluate & run Inference with LoRA FLAN-T5
We are going to use the evaluate library to compute the ROUGE score. We can run inference using PEFT and transformers. For our FLAN-T5 XXL model, we need at least 18GB of GPU memory.
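A sketch of loading the fine-tuned model for inference: the adapter config stores the base model id, so we load the base model in 8-bit and attach the LoRA weights on top; the "results" path matches the save step above.

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Path of the saved LoRA adapter (matches the save step above)
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base LLM in 8-bit together with its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"": 0})
model.eval()
```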
Let's load the dataset again and try the summarization on a random sample.
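For example, something along these lines; the generation parameters are illustrative.

```python
from random import randrange
from datasets import load_dataset

# Reload the raw samsum dataset and pick a random test sample
dataset = load_dataset("samsum")
sample = dataset["test"][randrange(len(dataset["test"]))]

input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# Generation settings are illustrative; adjust max_new_tokens and sampling as needed
outputs = model.generate(input_ids=input_ids, max_new_tokens=50, do_sample=True, top_p=0.9)

print(f"dialogue:\n{sample['dialogue']}\n{'-' * 40}")
print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")
```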
Nice! Our model works! Now, let's take a closer look and evaluate it against the test set of the processed samsum dataset. For that, we need to create some utilities to generate the summaries and group them together with their references. The most commonly used metric for evaluating summarization tasks is the ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation). This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries.
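A sketch of such an evaluation loop, assuming the preprocessed test split was saved to data/eval as above; it generates a summary per sample, decodes the reference labels, and computes ROUGE over the whole test set (this can take a while for the full split).

```python
import evaluate
import numpy as np
import torch
from datasets import load_from_disk
from tqdm import tqdm

# Load the ROUGE metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample, max_target_length=50):
    # Generate a summary for one tokenized test sample
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=sample["input_ids"].unsqueeze(0).cuda(),
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_target_length,
        )
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # Replace -100 in the labels, since the tokenizer cannot decode them
    labels = np.where(sample["labels"].numpy() != -100, sample["labels"].numpy(), tokenizer.pad_token_id)
    reference = tokenizer.decode(labels, skip_special_tokens=True)
    return prediction, reference

# Load the preprocessed test split saved earlier (path is an assumption)
test_dataset = load_from_disk("data/eval/").with_format("torch")

predictions, references = [], []
for sample in tqdm(test_dataset):
    p, r = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(r)

# Compute ROUGE scores over the whole test set
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)
print(f"rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"rougeL: {rouge['rougeL'] * 100:.2f}%")
```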