Getting started with PyTorch 2.0 and Hugging Face Transformers
On December 2, 2022, the PyTorch Team announced PyTorch 2.0 at the PyTorch Conference. The release focuses on better performance: it is faster, more Pythonic, and stays as dynamic as before.
This blog post explains how to get started with PyTorch 2.0 and Hugging Face Transformers today. It will cover how to fine-tune a BERT model for Text Classification using the newest PyTorch 2.0 features.
You will learn how to:
- Setup environment & install PyTorch 2.0
- Load and prepare the dataset
- Fine-tune & evaluate BERT model with the Hugging Face `Trainer`
- Run inference & test model
Before we can start, make sure you have a Hugging Face account to save artifacts and experiments.
Quick intro: PyTorch 2.0
PyTorch 2.0 or, better, 1.14 is entirely backward compatible. PyTorch 2.0 will not require any modification to existing PyTorch code; it can optimize your code by adding a single line: `model = torch.compile(model)`.
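In code, that looks roughly like this (the toy model and input shapes below are made up for illustration):

```python
import torch
import torch.nn as nn

# A toy model, only for illustration
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# The single extra line PyTorch 2.0 asks for: compile the model.
# The first forward pass triggers compilation; later passes reuse the optimized kernels.
model = torch.compile(model)

x = torch.randn(32, 128)
out = model(x)  # same results as the eager model, just (potentially) faster
```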
If you are asking yourself why there is a new major version with no breaking changes, the PyTorch team answered this question in their FAQ: “We were releasing substantial new features that we believe change how you meaningfully use PyTorch, so we are calling it 2.0 instead.”
Those new features include top-level support for TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor.
This allows PyTorch 2.0 to achieve 1.3x-2x training-time speedups across 46 model architectures from Hugging Face Transformers.
If you want to learn more about PyTorch 2.0, check out the official “GET STARTED” guide.
Now that we know how PyTorch 2.0 works, let's get started. 🚀
Note: This tutorial was created and run on a g5.xlarge AWS EC2 Instance, including an NVIDIA A10G GPU.
1. Setup environment & install PyTorch 2.0
Our first step is to install PyTorch 2.0 and the Hugging Face libraries, including `transformers` and `datasets`. Additionally, we are installing the latest version of `transformers` from the `main` git branch, which includes the native integration of PyTorch 2.0 into the `Trainer`.
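A possible install sketch (the extra packages and version pins are assumptions; adjust the PyTorch install command to your CUDA setup):

```bash
# Install PyTorch 2.0
pip install "torch>=2.0" --upgrade

# Install transformers from the main branch plus the libraries used in this tutorial
pip install git+https://github.com/huggingface/transformers.git --upgrade
pip install datasets evaluate accelerate tensorboard scikit-learn --upgrade
```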
This example will use the Hugging Face Hub as a remote model versioning service. To push our model to the Hub, you must register on Hugging Face. If you already have an account, you can skip this step. After you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.
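For example (the token placeholder below is obviously not a real key):

```python
from huggingface_hub import login

# Paste a Hugging Face access token (Settings -> Access Tokens); it is stored on disk
# so that later pushes to the Hub are authenticated.
login(token="hf_xxx", add_to_git_credential=True)
```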
2. Load and prepare the dataset
To keep the example straightforward, we are training a Text Classification model on the BANKING77 dataset. The BANKING77 dataset provides a fine-grained set of intents (classes) in a banking/finance domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection.
We will use the `load_dataset()` method from the 🤗 Datasets library to load the `banking77` dataset. Let’s check out an example of the dataset.
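A minimal loading sketch:

```python
from datasets import load_dataset

# Load the BANKING77 dataset (train and test splits) from the Hugging Face Hub
dataset = load_dataset("banking77")

print(dataset)               # split sizes and column names ("text", "label")
print(dataset["train"][0])   # one raw example: a customer query and its intent id
```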
To train our model, we need to convert our "natural language" to token IDs. This is done by a tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). If you want to learn more about this, check out chapter 6 of the Hugging Face Course.
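A tokenization sketch, reusing the `dataset` loaded above and assuming `bert-base-uncased` as the checkpoint (introduced in the next section):

```python
from transformers import AutoTokenizer

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    # Map raw text to input_ids/attention_mask using BERT's pre-trained vocabulary;
    # padding is handled later, per batch, by the Trainer's default data collator.
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```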
3. Fine-tune & evaluate BERT model with the Hugging Face Trainer
After we have processed our dataset, we can start training our model. We will use the `bert-base-uncased` model. The first step is to load our model with the `AutoModelForSequenceClassification` class from the Hugging Face Hub. This will initialize the pre-trained BERT weights with a classification head on top. Here we pass the number of classes (77) from our dataset and the label names to have readable outputs for inference.
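A sketch of that step, building the label mappings from the tokenized dataset of the previous section:

```python
from transformers import AutoModelForSequenceClassification

# Readable label names taken from the dataset's ClassLabel feature
labels = tokenized_dataset["train"].features["label"].names
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),  # 77 intents
    id2label=id2label,
    label2id=label2id,
)
```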
We evaluate our model during training. The `Trainer` supports evaluation during training by providing a `compute_metrics` method. We use the `evaluate` library to calculate the f1 metric during training on our test split.
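A sketch of the metric function (the macro average over the 77 classes is an assumption):

```python
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Macro-averaged f1 over the 77 intents
    return f1_metric.compute(predictions=predictions, references=labels, average="macro")
```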
The last step is to define the hyperparameters (`TrainingArguments`) we use for our training. Here we add the features introduced in PyTorch 2.0 for faster training times. To use the latest improvements of PyTorch 2.0, we only need to pass the `torch_compile` option in the `TrainingArguments`. We also leverage the Hugging Face Hub integration of the `Trainer` to push our checkpoints, logs, and metrics during training into a repository.
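A sketch of the arguments; apart from `torch_compile`, `optim`, and the Hub options, the values below are plausible defaults rather than the exact configuration used:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-banking77-pt2",  # also used as the Hub repository name (an assumption)
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,                             # assumes an Ampere-or-newer GPU such as the A10G
    # PyTorch 2.0 features
    torch_compile=True,                    # wraps the model with torch.compile
    optim="adamw_torch_fused",             # fused AdamW implementation
    # logging & evaluation
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    # Hugging Face Hub integration
    push_to_hub=True,
    hub_strategy="every_save",
    report_to="tensorboard",
)
```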
We can start our training by using the `train` method of the `Trainer`.
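Putting it together (a sketch continuing from the objects defined above):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,              # enables dynamic padding via the default data collator
    compute_metrics=compute_metrics,
)

# Kick off training; with torch_compile=True the first steps are slower while the model compiles
trainer.train()
```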
Using PyTorch 2.0 and the supported features in `transformers` allows us to train our BERT model on 10,000 samples within 457.7964 seconds.
We also ran the training without the `torch_compile` option to compare the training times. That run took 696 seconds, had a `train_samples_per_second` value of around 43, and reached an f1 score of 0.931.
By using the `torch_compile` option and the `adamw_torch_fused` optimizer, training is roughly 52.5% faster than without PyTorch 2.0. Our absolute training time went down from 696 seconds to 457 seconds, the `train_samples_per_second` value increased from around 43 to 65.55, and the f1 score is the same or slightly better than in the training without `torch_compile`.
PyTorch 2.0 is incredibly powerful! 🚀
Let’s save our results and tokenizer to the Hugging Face Hub and create a model card.
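A sketch of that final push, continuing from the trainer above:

```python
# Save the tokenizer next to the model and push everything to the Hub,
# generating a model card from the training metadata.
tokenizer.save_pretrained(training_args.output_dir)
trainer.create_model_card()
trainer.push_to_hub()
```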
4. Run Inference & test model
To wrap up this tutorial, we will run inference on a few examples to test our model. We will use the `pipeline` method from the `transformers` library for this.
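A sketch of the inference step (the repository id and the sample query are made up):

```python
from transformers import pipeline

# Load the fine-tuned model from the local output directory or from the Hub
classifier = pipeline("text-classification", model="bert-base-banking77-pt2", device=0)  # device=0 assumes a GPU

sample = "I ordered a new card a week ago but it still has not arrived."
print(classifier(sample))
# -> a list with the predicted intent label and its confidence score
```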
Conclusion
In this tutorial, we learned how to use PyTorch 2.0 to train a text classification model on the BANKING77 dataset. We saw that PyTorch 2.0 is a powerful tool to speed up your training times. In our example, running on an NVIDIA A10G, we managed to achieve roughly 52.5% better performance. The Hugging Face `Trainer` allows you to easily integrate PyTorch 2.0 into your training pipeline by simply adding the `torch_compile` option to the `TrainingArguments`. We can further benefit from PyTorch 2.0 by using the new fused AdamW optimizer when bf16 is available.
Additionally, I want to mention that training got roughly 52% faster, which could be interpreted as a corresponding cost saving for training or as 52% faster iteration cycles and a shorter time to production. You should be able to see even better improvements by using A100 GPUs or by reducing the `Trainer` overhead, e.g., removing evaluation and logging.
PyTorch 2.0 is now officially launched and we are excited to see what the future brings. 🚀
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.