Getting started with Transformers and TPU using PyTorch
Tensor Processing Units (TPU) are specialized accelerators developed by Google to speed up machine learning tasks. They are built from the ground up with a focus on machine & deep learning workloads.
TPUs are available on the Google Cloud and can be used with popular deep learning frameworks, including TensorFlow, JAX, and PyTorch.
This blog post will cover how to get started with Hugging Face Transformers and TPUs using PyTorch and accelerate. You will learn how to fine-tune a BERT model for Text Classification using the newest Google Cloud TPUs.
You will learn how to:
- Launch TPU VM on Google Cloud
- Setup Jupyter environment & install Transformers
- Load and prepare the dataset
- Fine-tune BERT on the TPU with Hugging Face accelerate
Before we can start, make sure you have a Hugging Face Account to save artifacts and experiments.
1. Launch TPU VM on Google Cloud
The first step is to create a TPU development environment. We are going to use the Google Cloud CLI gcloud to create a Cloud TPU VM using the PyTorch 1.13 image.
If you don't have the gcloud CLI installed, check out the documentation or run the command below.
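A quick way to get it is the interactive installer described in the Google Cloud documentation; a minimal sketch:

```bash
# download and run the interactive installer for the Google Cloud CLI
curl https://sdk.cloud.google.com | bash
# restart your shell so gcloud is on the PATH, then initialize it
exec -l $SHELL
gcloud init
```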
We can now create our cloud TPU VM with our preferred region, project and version.
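A sketch of the create command; the TPU name (bert-example) and zone (europe-west4-a) are placeholders to adapt to your setup, and tpu-vm-pt-1.13 is the PyTorch 1.13 runtime image at the time of writing:

```bash
# create a v3-8 Cloud TPU VM with the PyTorch 1.13 runtime
# (uses the currently configured gcloud project unless --project is passed)
gcloud compute tpus tpu-vm create bert-example \
  --zone=europe-west4-a \
  --accelerator-type=v3-8 \
  --version=tpu-vm-pt-1.13
```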
Note: Make sure to have the Cloud TPU API enabled to create your Cloud TPU VM
2. Setup Jupyter environment & install Transformers
Our Cloud TPU VM is now running, and we can ssh into it, but who likes to develop inside a terminal? We want to set up a Jupyter environment, which we can access through our local browser. For this, we need to add port forwarding to the gcloud ssh command, which will tunnel our localhost traffic to the Cloud TPU.
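The command below is a sketch, reusing the placeholder name and zone from above and forwarding port 8080 for Jupyter:

```bash
# ssh into the TPU VM and tunnel port 8080 back to our local machine
gcloud compute tpus tpu-vm ssh bert-example \
  --zone=europe-west4-a \
  -- -L 8080:localhost:8080
```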
Before we can access our environment, we need to install jupyter and the Hugging Face libraries, including transformers and datasets. Running the following command will install all the required packages.
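Something along these lines should cover what we need (the exact package set, e.g. scikit-learn for the f1 metric, is an assumption):

```bash
# install jupyter, the Hugging Face libraries and scikit-learn (used by the f1 metric)
pip install jupyter transformers datasets evaluate accelerate scikit-learn
```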
We can now start our jupyter server.
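For example, on the port we forwarded in the previous step:

```bash
# start jupyter on the forwarded port, without trying to open a browser on the VM
jupyter notebook --port 8080 --no-browser
```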
You should see a familiar jupyter output with a URL to the notebook.
http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9
We can click on it, and a jupyter environment opens in our local browser.
We can now create a new notebook and test to see if we have access to the TPUs.
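A small sanity check could look like this, using the torch_xla package that ships with the PyTorch TPU runtime:

```python
import torch
import torch_xla.core.xla_model as xm

# grab the XLA device backed by the TPU and allocate a small tensor on it
device = xm.xla_device()
t = torch.randn(2, 2, device=device)
print(t.device)  # should report an xla device, e.g. xla:1
```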
Awesome! 🎉 We can use our TPU with PyTorch. Let's get to our example.
Note: Make sure to restart your notebook so the tensor we created no longer keeps the TPU allocated!
3. Load and prepare the dataset
We are training a Text Classification model on the BANKING77 dataset to keep the example straightforward. The BANKING77 dataset provides a fine-grained set of intents (classes) in a banking/finance domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection.
This is the same dataset we used in the “Getting started with PyTorch 2.0 and Hugging Face Transformers” blog post, which will help us compare the performance later.
We will use the load_dataset() method from the 🤗 Datasets library to load the banking77 dataset.
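A minimal sketch of loading the dataset:

```python
from datasets import load_dataset

dataset_id = "banking77"
raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")
```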
Let’s check out an example of the dataset.
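For instance, by picking a random sample from the training split:

```python
from random import randrange

random_id = randrange(len(raw_dataset["train"]))
print(raw_dataset["train"][random_id])
# prints a dict with a 'text' query and its integer 'label' (one of the 77 intents)
```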
To train our model, we need to convert our "Natural Language" to token IDs. This is done by a Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). If you want to learn more about this, check out chapter 6 of the Hugging Face Course.
Since TPUs expect a fixed shape of inputs, we need to make sure to truncate or pad all samples to the same length.
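A sketch of the tokenization step; the checkpoint (bert-base-uncased) and the max_length of 128 are assumptions you can adjust:

```python
from transformers import AutoTokenizer

model_id = "bert-base-uncased"  # assumption: plain BERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    # pad/truncate every sample to the same fixed length, as the TPU expects static shapes
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")  # name expected by the model
tokenized_dataset.set_format("torch")
```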
We are using Hugging Face accelerate to train our model in this example. Accelerate is a library for easily writing hardware-agnostic PyTorch training loops, which makes it super easy to write TPU training code without needing to know any XLA details.
4. Fine-tune BERT on the TPU with Hugging Face accelerate
Accelerate enables PyTorch users to run PyTorch training across any distributed configuration by adding just four lines of code! Built on torch_xla and torch.distributed, 🤗 Accelerate takes care of the heavy lifting, so you don’t have to write any custom code to adapt to these platforms.
Accelerate implements a notebook launcher, which allows you to easily start your training jobs from a notebook cell rather than needing to use torchrun or another launcher. This makes experimenting much easier, since we can write all the code in the notebook instead of having to create long and complex Python scripts. We are going to use the notebook_launcher, which allows us to skip the accelerate config command, since we define our environment inside the notebook.
The two most important things to remember for training on TPUs are that the accelerator object has to be defined inside the training_function, and that your model should be created outside the training function.
We will load our model with the AutoModelForSequenceClassification class from the Hugging Face Hub. This will initialize the pre-trained BERT weights with a classification head on top. Here we pass the number of classes (77) from our dataset and the label names to have readable outputs for inference.
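A sketch, reusing the model_id from the tokenization step:

```python
from transformers import AutoModelForSequenceClassification

# build readable label mappings from the dataset's ClassLabel feature
labels = tokenized_dataset["train"].features["labels"].names
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# note: the model is created here, outside the training function
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```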
We evaluate our model during training, using the evaluate library to calculate the f1 metric on our test split.
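Loading the metric is a one-liner; for 77 classes we also have to pick an averaging strategy (weighted here, as an assumption):

```python
import evaluate

metric = evaluate.load("f1")
# quick sanity check with dummy predictions
print(metric.compute(predictions=[0, 1, 2], references=[0, 1, 1], average="weighted"))
```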
We can now write our training_function. If you want to learn more about how to adjust a basic PyTorch training loop for accelerate, you can take a look at the Migrating your code to 🤗 Accelerate guide.
We are using the %%writefile magic cell to write the training_function to an external train.py module so we can properly use it in ipython. The train.py module also includes a create_dataloaders method, which will be used to create our DataLoaders for training using the tokenized dataset.
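A sketch of what train.py could look like; the function names (training_function, create_dataloaders) follow the text above, while the optimizer, scheduler and the hyperparameter keys are assumptions:

```python
%%writefile train.py
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import get_linear_schedule_with_warmup
import evaluate


def create_dataloaders(tokenized_dataset, train_batch_size=32, eval_batch_size=32):
    # build the train/eval DataLoaders from the tokenized dataset
    train_dataloader = DataLoader(
        tokenized_dataset["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_dataset["test"], shuffle=False, batch_size=eval_batch_size
    )
    return train_dataloader, eval_dataloader


def training_function(model, hyperparameters, tokenized_dataset):
    # the Accelerator object must be created inside the training function
    accelerator = Accelerator()
    metric = evaluate.load("f1")

    train_dataloader, eval_dataloader = create_dataloaders(
        tokenized_dataset,
        train_batch_size=hyperparameters["per_device_train_batch_size"],
        eval_batch_size=hyperparameters["per_device_eval_batch_size"],
    )

    optimizer = AdamW(model.parameters(), lr=hyperparameters["learning_rate"])
    num_epochs = hyperparameters["num_epochs"]
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(train_dataloader) * num_epochs,
    )

    # let accelerate move everything to the TPU and wrap it for distributed training
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    for epoch in range(num_epochs):
        model.train()
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # evaluate on the test split at the end of every epoch
        model.eval()
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            predictions, references = accelerator.gather_for_metrics(
                (predictions, batch["labels"])
            )
            metric.add_batch(predictions=predictions, references=references)

        eval_metric = metric.compute(average="weighted")
        accelerator.print(f"epoch {epoch}: f1 = {eval_metric['f1']:.4f}")
```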
The last step is to define the hyperparameters we use for our training.
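For example (the per-device batch size of 32 is chosen so that 8 TPU cores give the global batch size of 256 mentioned below; the learning rate and number of epochs are assumptions):

```python
hyperparameters = {
    "learning_rate": 2e-5,
    "num_epochs": 3,
    "per_device_train_batch_size": 32,  # 32 x 8 TPU cores = global batch size of 256
    "per_device_eval_batch_size": 32,
}
```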
And we're ready for launch! It's super easy with the notebook_launcher from the Accelerate library.
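A sketch of the launch cell, importing the training_function we wrote to train.py and starting one process per TPU core of our v3-8:

```python
from accelerate import notebook_launcher
from train import training_function

args = (model, hyperparameters, tokenized_dataset)
# one process per TPU core (v3-8 = 8 cores)
notebook_launcher(training_function, args, num_processes=8)
```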
Note: You may notice that training seems exceptionally slow at first. This is because TPUs first run through a few batches of data to see how much memory to allocate before utilizing this configured memory allocation extremely efficiently.
We are using 8x v3 TPUs with a global batch size of 256, achieving 481 train_samples_per_second. The training with compilation and evaluation took 220 seconds and achieved an f1 score of 0.915.
Conclusion
In this tutorial, we learned how to train a BERT model for text classification on the BANKING77 dataset using Google Cloud TPUs. Hugging Face accelerate allows you to easily run any PyTorch training loop on TPUs with minimal code changes.
We compared our training with the results of the “Getting started with PyTorch 2.0 and Hugging Face Transformers” blog post, which uses the Hugging Face Trainer and PyTorch 2.0 on an NVIDIA A10G GPU. The TPU accelerate version cuts training time to roughly a third, letting us fine-tune BERT in about 3.5 minutes for less than $0.50.
Moving your training to TPUs can help increase the iteration speed of your models and data science teams.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.