In this blog, you will learn how to fine-tune google/flan-t5-base for chat and dialogue summarization using Hugging Face Transformers. If you already know T5, FLAN-T5 is just better at everything: for the same number of parameters, the FLAN-T5 models have been fine-tuned on more than 1,000 additional tasks and cover more languages.
In this example we will use the samsum dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.
Before we can start, make sure you have a Hugging Face account to save artifacts and experiments.
Quick intro: FLAN-T5, just a better T5
FLAN-T5, released with the paper Scaling Instruction-Finetuned Language Models, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that, overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Note: This tutorial was created and run on a p3.2xlarge AWS EC2 instance with a single NVIDIA V100 GPU.
1. Setup Development Environment
Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.
This example will use the Hugging Face Hub as a remote model versioning service. To be able to push our model to the Hub, you need to register on the Hugging Face Hub.
If you already have an account, you can skip this step.
After you have an account, we will use the notebook_login util from the huggingface_hub package to log into our account and store our token (access key) on disk.
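A minimal sketch of the login step; in a plain Python script (rather than a notebook) you could call huggingface_hub.login() instead of the notebook utility:

```python
from huggingface_hub import notebook_login

# Prompts for an access token (created under your Hugging Face account
# settings) and stores it on disk so the Trainer can push to the Hub later.
# Run this inside the notebook:
# notebook_login()
```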
2. Load and prepare samsum dataset
To load the samsum dataset, we use the load_dataset() method from the 🤗 Datasets library.
Let's check out an example from the dataset.
To train our model we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.
Before we can start training, we need to preprocess our data. Abstractive summarization is a text-to-text generation task: our model takes a text as input and generates a summary as output. To batch our data efficiently, we first want to understand how long our inputs and outputs will be.
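A sketch of the preprocessing step. The lengths 255 and 50 are illustrative assumptions; in practice you would derive them from the token-length distribution of the tokenized dialogues and summaries (e.g. a high percentile rather than the maximum):

```python
from transformers import AutoTokenizer

model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative values; derive yours from the tokenized length distribution
max_source_length = 255
max_target_length = 50

def preprocess_function(sample, padding="max_length"):
    # Prefix each dialogue with the task instruction
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    model_inputs = tokenizer(inputs, max_length=max_source_length,
                             padding=padding, truncation=True)
    labels = tokenizer(text_target=sample["summary"],
                       max_length=max_target_length,
                       padding=padding, truncation=True)

    # Replace pad token ids in the labels by -100 so the loss ignores them
    if padding == "max_length":
        labels["input_ids"] = [
            [(t if t != tokenizer.pad_token_id else -100) for t in label]
            for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Applied to the dataset from the previous step:
# tokenized_dataset = dataset.map(preprocess_function, batched=True,
#                                 remove_columns=["dialogue", "summary", "id"])
```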
3. Fine-tune and evaluate FLAN-T5
After we have processed our dataset, we can start training our model. To do so, we first need to load our FLAN-T5 checkpoint from the Hugging Face Hub. In this example we are using an instance with an NVIDIA V100, which means we will fine-tune the base version of the model.
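Loading the checkpoint is a one-liner with AutoModelForSeq2SeqLM:

```python
from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```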
I plan to do a follow-up post on how to fine-tune the xxl version of the model using Deepspeed.
We want to evaluate our model during training. The Trainer supports evaluation during training if we provide a compute_metrics function.
The most commonly used metric for summarization tasks is ROUGE (short for Recall-Oriented Understudy for Gisting Evaluation). This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries. We are going to use the evaluate library to compute the ROUGE score.
Before we can start training, we need to create a DataCollator that will take care of padding our inputs and labels. We will use the DataCollatorForSeq2Seq from the 🤗 Transformers library.
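A minimal sketch of the collator, demonstrated on two hand-made features; passing model=model would additionally let the collator prepare decoder_input_ids, which we omit here to keep the example light:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Pad labels with -100 so padded positions are ignored by the loss
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,  # helps tensor-core utilization on modern GPUs
)

features = [
    {"input_ids": [100, 19, 1], "labels": [31, 1]},
    {"input_ids": [100, 1], "labels": [31, 55, 19, 1]},
]
batch = data_collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```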
The last step is to define the hyperparameters (TrainingArguments) we want to use for training. We are leveraging the Hugging Face Hub integration of the Trainer to automatically push our checkpoints, logs, and metrics into a repository during training.
We can start our training by using the train method of the Trainer.
Nice, we have trained our model. 🎉 Let's evaluate the best model again on the test set.
The best score we achieved is a ROUGE-1 score of 47.23.
Let's save our results and tokenizer to the Hugging Face Hub and create a model card.
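Saving locally is enough to reload the model later; with the Hub integration enabled, the Trainer can additionally create the model card and upload everything. A sketch, using the base checkpoint as a stand-in for your fine-tuned weights:

```python
import os
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Stand-in for the fine-tuned model from the training step
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

save_dir = "flan-t5-base-samsum"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# With a Trainer and push_to_hub=True in the TrainingArguments:
# trainer.create_model_card()
# trainer.push_to_hub()
```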
4. Run Inference and summarize ChatGPT dialogues
Now that we have a trained model, we can use it to run inference. We will use the pipeline API from transformers and a test example from our dataset.
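A sketch of inference with the pipeline API. The repository id below is an assumption (a checkpoint of this recipe published on the Hub); replace it with your own repository id once you have pushed your model:

```python
from transformers import pipeline

# Assumed repository id; swap in your own after pushing your model
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum")

dialogue = (
    "Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker?\n"
    "Philipp: Sure, you can use the Hugging Face Deep Learning Containers.\n"
    "Jeff: Great, thanks!"
)
print(summarizer(dialogue)[0]["summary_text"])
```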
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.