Distributed training on multilingual BERT with Hugging Face Transformers & Amazon SageMaker
Welcome to this end-to-end multilingual text-classification example using PyTorch. In this demo, we will use the Hugging Face transformers and datasets libraries together with PyTorch to fine-tune a multilingual transformer for text-classification. This example is a derived version of the text-classificiaton.ipynb notebook and uses Amazon SageMaker for distributed training. In text-classificiaton.ipynb we showed how to fine-tune distilbert-base-multilingual-cased on the amazon_reviews_multi dataset for sentiment-analysis. This dataset has over 1.2 million data points, which is huge. Running the training on a single NVIDIA V100 takes around 6.5 hours for a batch_size of 16, which is quite long.
To scale and accelerate our training we will use Amazon SageMaker, which provides two strategies for distributed training: data parallelism and model parallelism. Data parallelism splits a training set across several GPUs, while model parallelism splits a model across several GPUs. We are going to use SageMaker Data Parallelism, which has been built into the Trainer API. To be able to use data parallelism we only have to define the distribution parameter in our HuggingFace estimator.
I moved the "training" part of the text-classificiaton.ipynb notebook into a separate training script, train.py, which accepts the same hyperparameters and can be run on Amazon SageMaker using the HuggingFace estimator.
Our goal is to decrease the training duration by scaling our global/effective batch size from 16 up to 128, which is 8x bigger than before. For monitoring our training we will use the new Training Metrics support by the Hugging Face Hub.
Installation
This example will use the Hugging Face Hub as a remote model versioning service. To be able to push our model to the Hub, you need to register on Hugging Face.
If you already have an account you can skip this step.
After you have an account, we will use the notebook_login utility from the huggingface_hub package to log into our account and store our token (access key) on disk.
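A minimal sketch of that login step (the commented-out pip line lists the typical dependencies for this example and is an assumption; adjust it to your environment):

```python
# !pip install "sagemaker" "transformers" "datasets" "huggingface_hub" --upgrade

from huggingface_hub import notebook_login

notebook_login()  # prompts for your Hugging Face access token and stores it on disk
```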
Setup & Configuration
In this step we will define global configurations and parameters, which are used across the whole end-to-end fine-tuning process, e.g. the tokenizer and model we will use.
Note: The execution role is only available when running a notebook within SageMaker (SageMaker Notebook Instances or Studio). If you run get_execution_role in a notebook that is not running on SageMaker, expect a region error.
You can uncomment the cell below and provide an IAM role name with SageMaker permissions to set up your environment outside of SageMaker.
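A minimal sketch of that setup; the role name sagemaker_execution_role is a placeholder for an IAM role with SageMaker permissions in your account:

```python
import sagemaker
import boto3

sess = sagemaker.Session()

try:
    # works inside SageMaker Notebook Instances / Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # outside of SageMaker: look up a role with SageMaker permissions by name
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
```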
In this example we are going to fine-tune distilbert-base-multilingual-cased, a multilingual DistilBERT model.
Dataset & Pre-processing
As dataset we will use amazon_reviews_multi, a multilingual text-classification dataset. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.). The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
To load the amazon_reviews_multi dataset, we use the load_dataset() method from the 🤗 Datasets library.
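A minimal sketch of that step; the "all_languages" configuration name is an assumption for loading every language of the corpus at once:

```python
from datasets import load_dataset

# "all_languages" is assumed to be the configuration covering all six languages
dataset = load_dataset("amazon_reviews_multi", "all_languages")
print(dataset)
```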
Pre-processing & Tokenization
The amazon_reviews_multi dataset has 5 classes (stars). To match those to a sentiment-analysis task, we will map the star ratings to the following labels:
[1-2]: Negative
[3]: Neutral
[4-5]: Positive
Those labels can later be used to create a user-friendly output after we have fine-tuned our model.
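A possible implementation of that mapping, as a sketch assuming the star rating lives in a stars column and the integer labels 0/1/2 stand for Negative/Neutral/Positive:

```python
def map_stars_to_label(example):
    # collapse the 1-5 star rating into three sentiment classes
    stars = example["stars"]
    if stars <= 2:
        example["labels"] = 0  # Negative
    elif stars == 3:
        example["labels"] = 1  # Neutral
    else:
        example["labels"] = 2  # Positive
    return example

dataset = dataset.map(map_stars_to_label)
```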
Before we prepare the dataset for training, let's take a quick look at the class distribution of the dataset.
The distribution is not perfect, but let's give it a try and improve on it later.
To train our model we need to convert our "Natural Language" to token IDs. This is done by a 🤗 Transformers Tokenizer, which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary). If you are not sure what this means, check out chapter 6 of the Hugging Face Course.
Additionally we add truncation=True and max_length=512 to align the lengths and truncate texts that are longer than the maximum size allowed by the model.
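A minimal sketch of the tokenization step; the review_body column name and the use of AutoTokenizer are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

def tokenize(batch):
    # truncate reviews that exceed the model's maximum input length
    return tokenizer(batch["review_body"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True)
```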
Before we can start our distributed training, we need to upload our already pre-processed dataset to Amazon S3. For this we will use the built-in utilities of datasets.
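A sketch of that upload, reusing sess from the setup above; it assumes s3fs is installed so save_to_disk can write directly to s3:// URIs, and the bucket prefix is a placeholder:

```python
# placeholder S3 locations inside the default SageMaker bucket
training_input_path = f"s3://{sess.default_bucket()}/amazon_reviews_multi/train"
test_input_path = f"s3://{sess.default_bucket()}/amazon_reviews_multi/test"

# write the tokenized splits to S3 (requires s3fs)
tokenized_dataset["train"].save_to_disk(training_input_path)
tokenized_dataset["test"].save_to_disk(test_input_path)
```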
Creating an Estimator and starting a training job
The last step before we can start our managed training is to define our hyperparameters, create our SageMaker HuggingFace estimator and configure distributed training.
Since we are using SageMaker Data Parallelism, our total_batch_size will be per_device_train_batch_size * n_gpus.
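A sketch of what that estimator configuration could look like; the hyperparameter names, source_dir and framework versions are assumptions that need to match your train.py and a container combination supported by SageMaker:

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters forwarded to train.py as command-line arguments (names are illustrative)
hyperparameters = {
    "model_id": "distilbert-base-multilingual-cased",
    "epochs": 3,
    "per_device_train_batch_size": 16,  # 16 per GPU * 8 GPUs = 128 effective batch size
    "learning_rate": 3e-5,
}

# enable SageMaker Data Parallelism
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",           # assumption: train.py lives in a scripts/ folder
    instance_type="ml.p3.16xlarge",   # 8x NVIDIA V100
    instance_count=1,
    role=role,
    transformers_version="4.12",      # example versions; pick a combination SageMaker supports
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
    distribution=distribution,
)

# start the managed training job, pointing it at the datasets we uploaded to S3
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```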
Since we are using the Hugging Face Hub integration with TensorBoard, we can inspect our progress directly on the Hub, as well as test checkpoints during the training.
Conclusion
We managed to scale our training from 1x GPU to 8x GPU without any issues or code changes required. We used the Python SageMaker SDK to create our managed training job and only needed to provide some information about the environment our training should run in, our training script and our hyperparameters.
With this we were able to reduce the training time from 6.5 hours to ~1.5 hours, which is huge! It means we can evaluate and test ~4x more models than before.
You can find the code here and feel free to open a thread on the forum.
Thanks for reading. If you have any questions, feel free to contact me, through Github, or on the forum. You can also connect with me on Twitter or LinkedIn.