Pre-Training BERT with Hugging Face Transformers and Habana Gaudi
In this tutorial, you will learn how to pre-train BERT-base from scratch using a Habana Gaudi-based DL1 instance on AWS to take advantage of the cost-performance benefits of Gaudi. We will use the Hugging Face Transformers, Optimum Habana and Datasets libraries to pre-train a BERT-base model using masked-language modeling, one of the two original BERT pre-training tasks. Before we get started, we need to set up the deep learning environment.
You will learn how to:
1. Prepare the dataset
2. Train a Tokenizer
3. Preprocess the dataset
4. Pre-train BERT on Habana Gaudi
Note: Steps 1 to 3 can/should be run on a different instance size since those are CPU-intensive tasks.
Requirements
Before we start, make sure you have met the following requirements:
- AWS Account with quota for DL1 instance type
- AWS CLI installed
- AWS IAM user configured in CLI with permission to create and manage ec2 instances
Helpful Resources
- Setup Deep Learning environment for Hugging Face Transformers with Habana Gaudi on AWS
- Deep Learning setup made easy with EC2 Remote Runner and Habana Gaudi
- Optimum Habana Documentation
- Pre-training script
- Code: pre-training-bert.ipynb
What is BERT?
BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine Learning (ML) model for natural language processing. It was developed in 2018 by researchers at Google AI Language and serves as a Swiss Army knife solution to 11+ of the most common language tasks, such as sentiment analysis and named entity recognition.
Read more about BERT in our BERT 101 🤗 State Of The Art NLP Model Explained blog.
What is Masked Language Modeling (MLM)?
MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to bidirectionally use the words on either side of the covered word to predict the masked word.
Masked Language Modeling example: given the input "Paris is the [MASK] of France.", the model should predict "capital" for the masked token.
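As a quick illustration of the task (this uses the already pre-trained bert-base-uncased checkpoint, not the model we will train), a fill-mask pipeline predicts the hidden token:

```python
from transformers import pipeline

# illustrative only: MLM inference with a pre-trained BERT checkpoint
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
prediction = fill_mask("Paris is the [MASK] of France.")
print(prediction[0]["token_str"])  # expected top prediction: "capital"
```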
Read more about Masked Language Modeling here.
Let's get started! 🚀
Note: Steps 1 to 3 were run on an AWS c6i.12xlarge instance.
1. Prepare the dataset
The tutorial is split into two parts. The first part (steps 1-3) covers preparing the dataset and tokenizer. The second part (step 4) covers pre-training BERT on the prepared dataset. Before we can start with the dataset preparation, we need to set up our development environment. As mentioned in the introduction, you don't need to prepare the dataset on the DL1 instance and can use your notebook or desktop computer instead.
First, we install transformers, datasets and git-lfs to push our tokenizer and dataset to the Hugging Face Hub for later use.
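A minimal setup cell could look like the following (notebook cell syntax; installing the latest versions and using apt-get for git-lfs are assumptions about your environment):

```python
# install the libraries used in this tutorial
!pip install transformers datasets huggingface_hub --upgrade
# git-lfs is needed to push large files (tokenizer/dataset) to the Hub (Debian/Ubuntu)
!sudo apt-get install git-lfs -y
```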
To finish our setup, let's log into the Hugging Face Hub so we can push our dataset, tokenizer, model artifacts, logs and metrics during and after training. To be able to push to the Hub, you need to register an account on the Hugging Face Hub.
We will use the notebook_login utility from the huggingface_hub package to log into our account. You can get your token in your settings under Access Tokens. Once we are logged in, we retrieve the user_id, which will be used to push the artifacts.
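A sketch of the login step, assuming you run it in a Jupyter notebook:

```python
from huggingface_hub import notebook_login, whoami

# opens a widget where you paste your Access Token
# (create one at https://huggingface.co/settings/tokens)
notebook_login()

# retrieve the user_id of the logged-in account; used later to name repositories
user_id = whoami()["name"]
print(f"user id '{user_id}' will be used during the tutorial")
```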
The original BERT was pre-trained on the Wikipedia and BookCorpus datasets. Both datasets are available on the Hugging Face Hub and can be loaded with datasets.

Note: For Wikipedia we will use the 20220301 dump, which is different from the original split.
As a first step, we load the datasets and merge them together to create one big dataset.
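A minimal sketch of this step, assuming the bookcorpus and wikipedia datasets on the Hub and the 20220301.en Wikipedia configuration:

```python
from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
# keep only the "text" column so both datasets share the same schema
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])

raw_datasets = concatenate_datasets([bookcorpus, wiki])
print(f"the merged dataset contains {len(raw_datasets)} documents")
```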
We are not going to do any advanced dataset preparation, such as de-duplication, filtering or other pre-processing. If you plan to apply this notebook to train your own BERT model from scratch, I highly recommend including those data preparation steps in your workflow. They will help you improve your language model.
2. Train a Tokenizer
To be able to train our model, we need to convert our text into a tokenized format. Most Transformer models come with a pre-trained tokenizer, but since we are pre-training our model from scratch we also need to train a tokenizer on our data. We can do this with transformers and the BertTokenizerFast class.
More information about training a new tokenizer can be found in our Hugging Face Course.
We can start training the tokenizer with train_new_from_iterator().
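A sketch of the tokenizer training, where the vocabulary size of 32,000, the iterator batch size and the tokenizer_id name are assumptions:

```python
from tqdm import tqdm
from transformers import BertTokenizerFast

tokenizer_id = "bert-base-uncased-2022-habana"  # hypothetical repository name

# python generator that yields batches of raw text from our merged dataset
def batch_iterator(batch_size=10_000):
    for i in tqdm(range(0, len(raw_datasets), batch_size)):
        yield raw_datasets[i : i + batch_size]["text"]

# use the original BERT tokenizer as a template and train a new vocabulary on our data
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000)
```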
We push the tokenizer to the Hugging Face Hub so we can load it later when training our model.
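For example, reusing the hypothetical tokenizer_id from above:

```python
# create the tokenizer repository under our account and upload the tokenizer files
bert_tokenizer.push_to_hub(tokenizer_id)
```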
3. Preprocess the dataset
Before we can start training our model, the last step is to pre-process/tokenize our dataset. We will use our trained tokenizer to tokenize our dataset and then push it to the Hub so we can easily load it later for our training. The tokenization process is kept pretty simple: documents longer than 512 tokens are truncated and not split into several documents.
As the data processing function, we concatenate all texts from our dataset and generate chunks of tokenizer.model_max_length (512).
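A sketch of the tokenization and chunking step; num_proc and the repository ids are assumptions:

```python
from itertools import chain
from transformers import AutoTokenizer

# load the tokenizer we trained and pushed earlier (hypothetical repo id)
tokenizer = AutoTokenizer.from_pretrained(f"{user_id}/{tokenizer_id}")
num_proc = 4  # number of CPU processes used by datasets.map

def tokenize(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=["text"], num_proc=num_proc
)

# concatenate all texts and split them into chunks of tokenizer.model_max_length (512)
def group_texts(examples):
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the last partial chunk so every example has exactly model_max_length tokens
    total_length = (total_length // tokenizer.model_max_length) * tokenizer.model_max_length
    return {
        k: [t[i : i + tokenizer.model_max_length] for i in range(0, total_length, tokenizer.model_max_length)]
        for k, t in concatenated.items()
    }

tokenized_datasets = tokenized_datasets.map(group_texts, batched=True, num_proc=num_proc)
```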
The last step before we can start our training is to push our prepared dataset to the Hub.
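For example (the dataset repository name is a placeholder):

```python
# upload the processed dataset to the Hub so the training job can load it later
tokenized_datasets.push_to_hub("processed_bert_dataset")
dataset_id = f"{user_id}/processed_bert_dataset"
```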
4. Pre-train BERT on Habana Gaudi
In this example, we are going to use Habana Gaudi on AWS using the DL1 instance to run the pre-training. We will use the Remote Runner toolkit to easily launch our pre-training on a remote DL1 instance from our local setup. You can check out Deep Learning setup made easy with EC2 Remote Runner and Habana Gaudi if you want to know more about how this works.
When using GPUs you would use the Trainer and TrainingArguments. Since we are going to run our training on Habana Gaudi, we leverage the optimum-habana library and use the GaudiTrainer and GaudiTrainingArguments instead. The GaudiTrainer is a wrapper around the Trainer that allows you to pre-train or fine-tune a transformer model on Habana Gaudi instances.
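To illustrate the migration, here is a hedged sketch of what the relevant part of such a training script could look like; the repository ids are placeholders, and the gaudi_config_name value and Habana-specific arguments are assumptions based on the optimum-habana documentation:

```python
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# placeholders: replace with your own repository ids
tokenizer = AutoTokenizer.from_pretrained("my-user/bert-base-uncased-2022-habana")
dataset = load_dataset("my-user/processed_bert_dataset", split="train")

# a fresh BERT-base model with randomly initialized weights
config = AutoConfig.from_pretrained("bert-base-uncased", vocab_size=len(tokenizer))
model = AutoModelForMaskedLM.from_config(config)

# randomly mask 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# same API surface as TrainingArguments, plus Habana-specific flags
training_args = GaudiTrainingArguments(
    output_dir="bert-base-uncased-habana",
    use_habana=True,                               # run on HPU devices
    use_lazy_mode=True,                            # HPU lazy execution mode
    gaudi_config_name="Habana/bert-base-uncased",  # assumed Gaudi config repo on the Hub
    per_device_train_batch_size=32,
    max_steps=100_000,
)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```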
The DL1 instance we use has 8 available HPU cores, meaning we can leverage distributed data-parallel training for our model.
To run our training as distributed training, we need to create a training script that can be used with multiprocessing to run on all HPUs. We have created a run_mlm.py script implementing masked-language modeling using the GaudiTrainer. To execute the distributed training, we use the DistributedRunner from optimum-habana and pass our arguments. Alternatively, you could check out the gaudi_spawn.py script in the optimum-habana repository.
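A rough sketch of launching the script through the DistributedRunner; the constructor arguments follow the optimum-habana examples from that time and may differ between versions, and the CLI flags shown depend entirely on what your run_mlm.py expects:

```python
from optimum.habana.distributed import DistributedRunner

# launch run_mlm.py on all 8 HPU cores of the DL1 instance via MPI
distributed_runner = DistributedRunner(
    command_list=["run_mlm.py --max_steps 100000"],  # script plus its CLI arguments
    world_size=8,                                     # number of HPU cores to use
    use_mpi=True,
)
ret_code = distributed_runner.run()
```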
Before we can start our training, we need to define the hyperparameters we want to use. We leverage the Hugging Face Hub integration of the GaudiTrainer to automatically push our checkpoints, logs and metrics during training into a repository.
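The flag names must match what run_mlm.py expects; the ones below, along with the values, are assumptions for illustration (a per-device batch size of 32 on 8 HPUs gives the global batch size of 256 discussed below):

```python
hyperparameters = {
    "model_config_id": "bert-base-uncased",             # architecture to instantiate from scratch
    "dataset_id": f"{user_id}/processed_bert_dataset",  # the dataset we pushed in step 3
    "tokenizer_id": f"{user_id}/{tokenizer_id}",         # the tokenizer we pushed in step 2
    "repository_id": "bert-base-uncased-habana",         # hypothetical Hub repo for checkpoints/logs
    "hub_token": "hf_xxx",                               # your Hugging Face write token
    "max_steps": 100_000,
    "per_device_train_batch_size": 32,                   # 32 * 8 HPUs = global batch size 256
    "learning_rate": 1e-4,                               # value from the original BERT paper
}
# convert the dictionary into CLI arguments for the training script
hyperparameters_string = " ".join(f"--{key} {value}" for key, value in hyperparameters.items())
```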
We can start our training by creating an EC2RemoteRunner and then launching it. This will start our AWS EC2 DL1 instance and run our run_mlm.py script on it using the huggingface/optimum-habana:latest container.
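A sketch of the launch step; the EC2RemoteRunner arguments (profile, region, source_dir) are assumptions based on the Remote Runner blog post linked above:

```python
from rm_runner import EC2RemoteRunner

# create the remote runner for a DL1 instance using the Optimum Habana container
runner = EC2RemoteRunner(
    instance_type="dl1.24xlarge",
    profile="hf-sm",          # assumed AWS CLI profile
    region="us-east-1",
    container="huggingface/optimum-habana:latest",
)

# start the instance, copy the scripts directory and execute the training command
runner.launch(
    command=f"python3 run_mlm.py {hyperparameters_string}",
    source_dir="scripts",
)
```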
This experiment ran for 60k steps.
In our hyperparameters we defined a max_steps property, which limited the pre-training to only 100,000 steps. The 100,000 steps with a global batch size of 256 took around 12.5 hours.
BERT was originally pre-trained on 1 million steps with a global batch size of 256:
We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
This means a full pre-training would take around 125 hours (12.5 hours * 10) and cost around ~$1,650 using Habana Gaudi on AWS, which is extremely cheap.
For comparison, the DeepSpeed team, who holds the record for the fastest BERT pre-training, reported that pre-training BERT on 1 DGX-2 (powered by 16 NVIDIA V100 GPUs with 32GB of memory each) takes around 33.25 hours.
To compare the cost, we can use the p3dn.24xlarge as a reference, which comes with 8x NVIDIA V100 32GB GPUs and costs $31.22/h. We would need two of these instances to match the setup DeepSpeed reported; for now we ignore any overhead created by the multi-node setup (I/O, network, etc.). This brings the cost of the DeepSpeed GPU-based training on AWS to around ~$2,075, which is 25% more than what Habana Gaudi currently costs.
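For transparency, here is the back-of-the-envelope calculation behind those numbers; the on-demand prices are assumptions based on us-east-1 pricing at the time of writing:

```python
# rough cost comparison (USD, on-demand, us-east-1; prices are assumptions)
dl1_price_per_hour = 13.11       # dl1.24xlarge (8 Gaudi HPUs)
p3dn_price_per_hour = 31.22      # p3dn.24xlarge (8x V100 32GB)

gaudi_cost = dl1_price_per_hour * 125              # ~125 h full pre-training -> ~$1,639
deepspeed_cost = 2 * p3dn_price_per_hour * 33.25   # 2 instances for ~33.25 h -> ~$2,076

print(f"Habana Gaudi: ~${gaudi_cost:,.0f} | DeepSpeed on V100: ~${deepspeed_cost:,.0f}")
```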
Something to note here is that using DeepSpeed generally improves performance by a factor of ~1.5-2x. This means the same pre-training job without DeepSpeed would likely take about twice as long and cost about twice as much, or ~$3-4k.
We are looking forward to running the experiment again once the Gaudi DeepSpeed integration is more widely available.
Conclusion
That's it for this tutorial. Now you know the basics of how to pre-train BERT from scratch using Hugging Face Transformers and Habana Gaudi. You also saw how easy it is to migrate from the Trainer to the GaudiTrainer.
We compared our implementation with the fastest reported BERT pre-training results and saw that Habana Gaudi still delivers a 25% cost reduction and allows us to pre-train BERT for ~$1,650.
Those results are incredible, since they allow companies to adapt their pre-trained models to their language and domain, improving accuracy by up to 10% compared to general BERT models.
If you are interested in training your own BERT or other Transformers models from scratch to reduce cost and improve accuracy, contact our experts to learn about our Expert Acceleration Program. To learn more about Habana solutions, read about our partnership and how to contact them.
Code: pre-training-bert.ipynb
Thanks for reading! If you have any questions, feel free to contact me, through Github, or on the forum. You can also connect with me on Twitter or LinkedIn.