philschmid blog

Hugging Face Transformers with Keras: Fine-tune a non-English BERT for Named Entity Recognition

#HuggingFace #Keras #BERT #Tensorflow
December 21, 2021 · 10 min read

Photo by Monika Grabkowska on Unsplash

Welcome to this end-to-end Named Entity Recognition example using Keras. In this tutorial, we will use the Hugging Face Transformers and Datasets libraries together with TensorFlow and Keras to fine-tune a pre-trained non-English transformer for token classification (NER).

If you want a more detailed example of token classification, check out this notebook or chapter 7 of the Hugging Face Course.

Installation

#!pip install "tensorflow==2.6.0"
!pip install transformers datasets seqeval tensorboard --upgrade

!sudo apt-get install git-lfs

This example uses the Hugging Face Hub as a remote model versioning service. To push our model to the Hub, you need to register on Hugging Face. If you already have an account, you can skip this step. Once you have an account, we use the notebook_login util from the huggingface_hub package to log in and store our token (access key) on disk.

from huggingface_hub import notebook_login

notebook_login()

Setup & Configuration

In this step we define global configurations and parameters that are used across the whole end-to-end fine-tuning process, e.g. the tokenizer and model we will use.

In this example we are going to fine-tune deepset/gbert-base, a German BERT model.

model_id = "deepset/gbert-base"

You can change the model_id to another BERT-like model for a different language, e.g. Italian or French, and use this script to train an Italian or French Named Entity Recognition model. But don’t forget to also adjust the dataset in the next step.

Dataset & Pre-processing

As our dataset we will use GermaNER, a German named entity recognition dataset from the paper “GermaNER: Free Open German Named Entity Recognition Tool”. The dataset contains the four default coarse named entity classes LOCation, PERson, ORGanisation, and OTHer from the GermEval 2014 task. If you are fine-tuning in a language other than German, you can search the Hub for a dataset in your language or take a look at Datasets for Entity Recognition.

dataset_id = "germaner"

seed = 33

To load the germaner dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset

dataset = load_dataset(dataset_id)

We can display all our NER classes by inspecting the features of our dataset. These ner_labels will be used later to create user-friendly output once we have fine-tuned our model.

# accessing the "train" split for the "ner_tags" feature
ner_labels = dataset["train"].features["ner_tags"].feature.names
# ['B-LOC', 'B-ORG', 'B-OTH', 'B-PER', 'I-LOC', 'I-ORG', 'I-OTH', 'I-PER', 'O']
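To make the label list concrete, here is a small sketch of how integer ner_tags map back to label strings. The sentence and its tag ids are made up for illustration and are not taken from the dataset; only the label order comes from the comment above.

```python
# Label names in the order shown in the comment above.
ner_labels = ["B-LOC", "B-ORG", "B-OTH", "B-PER", "I-LOC", "I-ORG", "I-OTH", "I-PER", "O"]

# Hypothetical example row (not a real row from the dataset).
tokens = ["Angela", "Merkel", "besucht", "Berlin"]
ner_tags = [3, 7, 8, 0]

# Each integer tag is simply an index into ner_labels.
decoded = [ner_labels[tag] for tag in ner_tags]
for token, label in zip(tokens, decoded):
    print(f"{token}\t{label}")
```

The B- prefix marks the first word of an entity and the I- prefix marks a continuation, so "Angela Merkel" forms a single PER entity here.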

Pre-processing & Tokenization

To train our model we need to convert our “Natural Language” to token IDs. This is done by a 🤗 Transformers Tokenizer which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary). If you are not sure what this means check out chapter 6 of the Hugging Face Course.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

Unlike a text-classification or question-answering dataset, the “text” of germaner is already split into a list of words (tokens). So we cannot simply call tokenizer(text); we need to pass is_split_into_words=True to the tokenizer. Additionally, we add truncation=True to truncate texts that are longer than the maximum size allowed by the model.

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # map each token to the index of the word it came from (words can be split into multiple tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to the current word's label as well.
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
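The alignment logic can be tried without a tokenizer. The following is a minimal sketch that works on a plain list of word ids; the helper name align_labels, the label_all_tokens flag, and the example word ids are assumptions for illustration, not part of the original notebook.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level labels to token-level labels.

    word_ids holds, for every token, the index of the word it belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token: ignored by the loss.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx or label_all_tokens:
            # First token of a word, or any token when every token is labeled.
            aligned.append(word_labels[word_idx])
        else:
            # Continuation token that should not contribute to the loss.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical word ids for "[CLS] Ber ##lin ist schön [SEP]".
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [0, 8, 8]  # B-LOC, O, O

print(align_labels(word_ids, word_labels))
# [-100, 0, 0, 8, 8, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 0, -100, 8, 8, -100]
```

With label_all_tokens=True the sketch matches the function above, where every sub-token inherits its word's label. Setting it to False shows a common alternative in which only the first sub-token of each word is scored.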

We process our dataset using the .map method with batched=True.

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

Since we later only need the tokenized inputs and the labels to train the model, we select the columns that were added by processing the dataset. The tokenizer_columns are the dataset columns to load into the tf.data.Dataset.

pre_tokenizer_columns = set(dataset["train"].features)
tokenizer_columns = list(set(tokenized_datasets["train"].features) - pre_tokenizer_columns)
# ['attention_mask', 'labels', 'token_type_ids', 'input_ids']
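The column selection is a plain set difference. Here is a tiny standalone sketch; the "before" column names are assumed for illustration and may differ from the real dataset.

```python
# Hypothetical column names; the real ones depend on the dataset and tokenizer.
columns_before = {"id", "tokens", "ner_tags"}
columns_after = columns_before | {"input_ids", "token_type_ids", "attention_mask", "labels"}

# Columns present after .map() but not before it were added during tokenization.
added_columns = sorted(columns_after - columns_before)
print(added_columns)
# ['attention_mask', 'input_ids', 'labels', 'token_type_ids']
```

Sets have no guaranteed order, which is why the sketch sorts the result before printing it.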

Since our dataset only includes one split (train), we need to call train_test_split ourselves to get an evaluation/test dataset for evaluating the results during and after training.

# test size will be 15% of train dataset
test_size = 0.15

processed_dataset = tokenized_datasets["train"].shuffle(seed=seed).train_test_split(test_size=test_size)
processed_dataset
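Conceptually, the split shuffles the rows with a fixed seed and then cuts off a fraction for testing. The sketch below shows that idea in plain Python; shuffle_and_split is a hypothetical helper and is not how the Datasets library implements it internally.

```python
import random


def shuffle_and_split(rows, test_size, seed):
    """Shuffle rows reproducibly, then hold out a fraction as the test set."""
    rng = random.Random(seed)  # a local RNG keeps the global random state untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


splits = shuffle_and_split(range(100), test_size=0.15, seed=33)
print(len(splits["train"]), len(splits["test"]))
# 85 15
```

Using the same seed always produces the same split, which makes evaluation results comparable between runs.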

Fine-tuning the model using Keras

Now that our dataset is processed, we can download the pretrained model and fine-tune it. But before we can do this, we need to convert our Hugging Face datasets Dataset into a tf.data.Dataset. For this we will use the .to_tf_dataset method and a data collator for token classification (data collators are objects that form a batch from a list of dataset elements).

Hyperparameters

from huggingface_hub import HfFolder
import tensorflow as tf

id2label = {str(i): label for i, label in enumerate(ner_labels)}
label2id = {v: k for k, v in id2label.items()}

num_train_epochs = 5
train_batch_size = 16
eval_batch_size = 32
learning_rate = 2e-5
weight_decay_rate = 0.01
num_warmup_steps = 0
output_dir = model_id.split("/")[1]
hub_token = HfFolder.get_token()  # or your token directly "hf_xxx"
hub_model_id = f'{model_id.split("/")[1]}-{dataset_id}'
fp16 = True

# Train in mixed-precision float16
# Comment this line out if you're using a GPU that will not benefit from this
if fp16:
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

Converting the dataset to a tf.data.Dataset

from transformers import DataCollatorForTokenClassification

# Data collator that will dynamically pad the inputs received, as well as the labels.
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer, return_tensors="tf"
)

# converting our train dataset to tf.data.Dataset
tf_train_dataset = processed_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    shuffle=False,
    batch_size=train_batch_size,
    collate_fn=data_collator,
)

# converting our test dataset to tf.data.Dataset
tf_eval_dataset = processed_dataset["test"].to_tf_dataset(
    columns=tokenizer_columns,
    shuffle=False,
    batch_size=eval_batch_size,
    collate_fn=data_collator,
)
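To illustrate what "dynamically pad the inputs, as well as the labels" means, here is a minimal pure-Python sketch of the idea. The pad_batch helper, the token ids, and the pad token id of 0 are assumptions for illustration; the real collator reads the pad token from the tokenizer and returns tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the length of the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so the loss ignores them.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Hypothetical token ids and labels for two sentences of different length.
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 3, 7, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 8, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [101, 9, 102, 0]
print(batch["labels"][1])     # [-100, 8, -100, -100]
```

Padding per batch instead of to a global maximum length keeps batches of short sentences small, which saves computation.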

Download the pretrained transformer model and fine-tune it.

from transformers import TFAutoModelForTokenClassification, create_optimizer


num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate,
    num_train_steps=num_train_steps,
    weight_decay_rate=weight_decay_rate,
    num_warmup_steps=num_warmup_steps,
)

model = TFAutoModelForTokenClassification.from_pretrained(
    model_id,
    id2label=id2label,
    label2id=label2id,
)

model.compile(optimizer=optimizer)
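The optimizer needs the total number of training steps because the learning rate changes over the course of training. The sketch below assumes a linear decay toward zero after an optional linear warmup, which I believe matches the default behavior of create_optimizer; treat it as an approximation and check the library documentation for the exact schedule. The dataset size of 1000 examples is made up.

```python
import math


def linear_schedule(step, init_lr, num_train_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


# Hypothetical dataset size; the last partial batch is assumed to be kept.
num_examples = 1000
steps_per_epoch = math.ceil(num_examples / 16)  # batches per epoch
num_train_steps = steps_per_epoch * 5           # 5 epochs

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_schedule(step, 2e-5, num_train_steps))
```

With num_warmup_steps=0, as in the hyperparameters above, training starts at the full learning rate and decays from the first step.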

Callbacks

As mentioned in the beginning, we want to use the Hugging Face Hub for model versioning and monitoring. Therefore we push our model's weights to the Hub during and after training to version them. Additionally, we want to track the performance during training, so we push the TensorBoard logs along with the weights to the Hub and use the “Training Metrics” feature to monitor our training in real time.

import os
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard as TensorboardCallback

callbacks = []

callbacks.append(TensorboardCallback(log_dir=os.path.join(output_dir, "logs")))
if hub_token:
    callbacks.append(PushToHubCallback(output_dir=output_dir,
                                       tokenizer=tokenizer,
                                       hub_model_id=hub_model_id,
                                       hub_token=hub_token))

[Image: tensorboard]

Training

Start training by calling model.fit.

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=callbacks,
    epochs=num_train_epochs,
)

Evaluation

The traditional framework used to evaluate token classification prediction is seqeval. This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric.

from datasets import load_metric
import numpy as np


metric = load_metric("seqeval")


def evaluate(model, dataset, ner_labels):
    all_predictions = []
    all_labels = []
    for batch in dataset:
        logits = model.predict(batch)["logits"]
        labels = batch["labels"]
        predictions = np.argmax(logits, axis=-1)
        for prediction, label in zip(predictions, labels):
            for predicted_idx, label_idx in zip(prediction, label):
                if label_idx == -100:
                    continue
                all_predictions.append(ner_labels[predicted_idx])
                all_labels.append(ner_labels[label_idx])
    return metric.compute(predictions=[all_predictions], references=[all_labels])

results = evaluate(model, tf_eval_dataset, ner_labels=list(model.config.id2label.values()))

{'LOC': {'precision': 0.8931558935361217,
  'recall': 0.9115250291036089,
  'f1': 0.9022469752256578,
  'number': 2577},
 'ORG': {'precision': 0.7752112676056339,
  'recall': 0.8075117370892019,
  'f1': 0.7910319057200345,
  'number': 1704},
 'OTH': {'precision': 0.6788389513108615,
  'recall': 0.7308467741935484,
  'f1': 0.703883495145631,
  'number': 992},
 'PER': {'precision': 0.9384366140137708,
  'recall': 0.9430199430199431,
  'f1': 0.9407226958993098,
  'number': 2457},
 'overall_precision': 0.8520523797532108,
 'overall_recall': 0.8754204398447607,
 'overall_f1': 0.8635783563042368,
 'overall_accuracy': 0.976147969774973}
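The gap between overall_accuracy and overall_f1 comes from scoring whole entities rather than single tokens. The following is a simplified sketch of entity-level scoring on BIO tags; it is not seqeval's implementation, and the helper names and the example tags are assumptions for illustration.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            start, current = None, None
        # A B- tag always opens a span; a stray I- tag is treated leniently as an opener.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, label
    return entities


def entity_scores(reference, prediction):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold, pred = extract_entities(reference), extract_entities(prediction)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical tags: the prediction cuts the PER entity short by one token.
reference = ["B-PER", "I-PER", "O", "B-LOC"]
prediction = ["B-PER", "O", "O", "B-LOC"]

scores = entity_scores(reference, prediction)
token_accuracy = sum(r == p for r, p in zip(reference, prediction)) / len(reference)
print(scores)          # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
print(token_accuracy)  # 0.75
```

Three of four tokens are correct, yet only one of two entities matches exactly, so the entity-level F1 is much lower than the token accuracy.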

Create Model Card with evaluation results

To complete our Hugging Face Hub repository we will create a model card with the used hyperparameters and the evaluation results.

from transformers.modelcard import TrainingSummary


eval_results = {
    "precision": float(results["overall_precision"]),
    "recall": float(results["overall_recall"]),
    "f1": float(results["overall_f1"]),
    "accuracy": float(results["overall_accuracy"]),
}

training_summary = TrainingSummary(
    model_name=hub_model_id,
    language="de",
    tags=[],
    finetuned_from=model_id,
    tasks="token-classification",
    dataset=dataset_id,
    dataset_tags=dataset_id,
    dataset_args="default",
    eval_results=eval_results,
    hyperparameters={
        "num_train_epochs": num_train_epochs,
        "train_batch_size": train_batch_size,
        "eval_batch_size": eval_batch_size,
        "learning_rate": learning_rate,
        "weight_decay_rate": weight_decay_rate,
        "num_warmup_steps": num_warmup_steps,
        "fp16": fp16
    }
)
model_card = training_summary.to_model_card()

model_card_path = os.path.join(output_dir, "README.md")

with open(model_card_path, "w") as f:
    f.write(model_card)

Push the model card to the repository.

from huggingface_hub import HfApi

api = HfApi()

user = api.whoami(hub_token)

api.upload_file(
    token=hub_token,
    repo_id=f"{user['name']}/{hub_model_id}",
    path_or_fileobj=model_card_path,
    path_in_repo="README.md",
)

[Image: model-card]


Run Managed Training using Amazon SageMaker

If you want to run this example on Amazon SageMaker to benefit from the Training Platform, follow the cells below. I converted the notebook into a Python script, train.py, which accepts the same hyperparameters and can be run on SageMaker using the HuggingFace estimator.

#!pip install sagemaker

import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_id': 'deepset/gbert-base',
    'dataset_id': 'germaner',
    'num_train_epochs': 5,
    'train_batch_size': 16,
    'eval_batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay_rate': 0.01,
    'num_warmup_steps': 0,
    'hub_token': HfFolder.get_token(),
    'hub_model_id': 'sagemaker-gbert-base-germaner',
    'fp16': True
}


# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.12.3',
    tensorflow_version='2.5.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit()
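SageMaker hands the hyperparameters dictionary to the entry point as command-line arguments, so train.py has to parse them. The actual script is not shown in this post; the following is a hypothetical sketch of how such parsing could look with argparse, covering only a few of the hyperparameters above.

```python
import argparse


def str2bool(value):
    """Interpret strings such as 'True' or 'false' as booleans."""
    return str(value).lower() in ("1", "true", "yes")


def parse_args(argv=None):
    # Hypothetical argument parser; names mirror the hyperparameters dictionary.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="deepset/gbert-base")
    parser.add_argument("--dataset_id", type=str, default="germaner")
    parser.add_argument("--num_train_epochs", type=int, default=5)
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--fp16", type=str2bool, default=True)
    # parse_known_args ignores extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulates the arguments a training job might pass to the script.
args = parse_args(["--num_train_epochs", "5", "--learning_rate", "2e-05", "--fp16", "True"])
print(args.model_id, args.num_train_epochs, args.learning_rate, args.fp16)
```

Values arrive as strings on the command line, which is why the boolean flag needs an explicit converter instead of type=bool.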

Conclusion

We successfully fine-tuned a German BERT model using Transformers and Keras, without any heavy lifting or complex and unnecessary boilerplate code. New utilities like .to_tf_dataset improve the developer experience of the Hugging Face ecosystem and make it more Keras and TensorFlow friendly. Combining those new features with the Hugging Face Hub, we get a fully managed MLOps pipeline for model versioning and experiment management using the Keras callback API.

Big Thanks to Matt for all the work he is doing to improve the experience using Transformers and Keras.

Now it's your turn! Adjust the notebook to train a BERT for another language like French, Spanish or Italian. 🇫🇷 🇪🇸 🇮🇹


You can find the code here and feel free to open a thread on the forum.

Thanks for reading. If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.