BERT Text Classification in a different language
There are currently about 7.5 billion people living in the world, spread across roughly 200 nations, and only around 1.2 billion of them are native English speakers. This leads to a lot of unstructured non-English textual data.
Most tutorials and blog posts demonstrate how to build text classification, sentiment analysis, question-answering, or text generation models with BERT-based architectures in English. To help close this gap, I am going to show you how to build a non-English multi-class text classification model.
Since you opened this article, it is probably safe to assume that you have heard of BERT. If you haven't, or if you would like a refresher, I recommend reading this paper.
In deep learning, there are currently two options for how to build language models. You can build either monolingual models or multilingual models.
"multilingual, or not multilingual, that is the question" - as Shakespeare would have said
Multilingual models describe machine learning models that can understand different languages. An example of a multilingual model is mBERT from Google research. This model supports and understands 104 languages. Monolingual models, as the name suggests, understand only one language.
Multilingual models already achieve good results on certain tasks. But these models are bigger, require more data, and take more time to train. These properties lead to higher costs due to the larger amount of data and time they require.
Due to this fact, I am going to show you how to train a monolingual non-English BERT-based multi-class text classification model. Wow, that was a long sentence!
Tutorial
We are going to use Simple Transformers - an NLP library based on the Transformers library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.
As the dataset, we are going to use Germeval 2019, which consists of German tweets. We are going to detect and classify abusive language tweets. These tweets are categorized into four classes: PROFANITY, INSULT, ABUSE, and OTHER. The highest score achieved on this dataset is 0.7361.
We are going to:
- install Simple Transformers library
- select a pre-trained monolingual model
- load the dataset
- train/fine-tune our model
- evaluate the results of training
- save the trained model
- load the model and predict a real example
I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.
Install Simple Transformers library
First, we install simpletransformers with pip. If you are not using Google Colab, you can check out the installation guide here.
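The install itself is a single command (in a Colab cell, prefix it with an exclamation mark to run it as a shell command):

```
pip install simpletransformers
```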
Select a pre-trained monolingual model
Next, we select the pre-trained model. As mentioned above the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the Transformers library and all community-uploaded models. For a list that includes all community-uploaded models, I refer to https://huggingface.co/models.
We are going to use the distilbert-base-german-cased model, a smaller, faster, cheaper version of BERT. It uses 40% fewer parameters than bert-base-uncased and runs 60% faster while still preserving over 95% of BERT's performance.
Load the dataset
The dataset is stored in two text files we can retrieve from the competition page. One option to download them is using two simple wget CLI commands.
Afterward, we use some pandas magic to create a dataframe. Since we don't have a separate test dataset, we split our dataset into train_df and test_df. We use 90% of the data for training (train_df) and 10% for testing (test_df).
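A minimal sketch of this step could look like the following. The file names, the tab-separated three-column layout, and the label mapping are assumptions based on the Germeval data description; adjust them to the files you actually downloaded.

```python
import csv

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file names and layout: tab-separated, no header, tweet text
# in the first column and the fine-grained label in the last column.
files = ["germeval_training_1.txt", "germeval_training_2.txt"]
frames = [
    pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=["text", "coarse", "fine"],
        quoting=csv.QUOTE_NONE,
    )
    for f in files
]
df = pd.concat(frames, ignore_index=True)

# Simple Transformers expects a dataframe with a "text" column and an
# integer "labels" column.
label_map = {"PROFANITY": 0, "INSULT": 1, "ABUSE": 2, "OTHER": 3}
df["labels"] = df["fine"].map(label_map)
df = df[["text", "labels"]].dropna()
df["labels"] = df["labels"].astype(int)

# 90% of the data for training, 10% for testing.
train_df, test_df = train_test_split(df, test_size=0.10, random_state=42)
```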
Load pre-trained model
The next step is to load the pre-trained model. We do this by creating a ClassificationModel instance called model.
This instance takes the following parameters:
- the architecture (in our case "bert")
- the pre-trained model ("distilbert-base-german-cased")
- the number of class labels (4)
- and our hyperparameters for training (train_args)
You can configure the hyperparameters within a wide range of values. For a detailed description of each attribute, please refer to the documentation.
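Putting this together could look roughly like the following. The exact hyperparameter values are not given in the text, so the ones in train_args are illustrative placeholders:

```python
from simpletransformers.classification import ClassificationModel

# Illustrative training arguments -- tune these for your own runs.
train_args = {
    "output_dir": "outputs/",
    "overwrite_output_dir": True,
    "num_train_epochs": 1,
    "train_batch_size": 32,
}

# Architecture, pre-trained model, number of class labels, and our
# training arguments. Note: depending on your Simple Transformers
# version, you may need to pass "distilbert" as the model type for a
# DistilBERT checkpoint.
model = ClassificationModel(
    "bert",
    "distilbert-base-german-cased",
    num_labels=4,
    args=train_args,
)
```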
Train/fine-tune our model
To train our model we only need to run model.train_model() and specify which dataset to train on.
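In code, that is a single call on our training dataframe:

```python
# Fine-tune the pre-trained model on our training split.
model.train_model(train_df)
```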
Evaluate the results of training
After we have successfully trained our model, we can evaluate it. To do so, we create a simple helper function, f1_multiclass(), which is used to calculate the f1_score. The f1_score is a common measure of classification performance. More on that here.
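A sketch of the evaluation step is shown below. The averaging strategy inside f1_multiclass() is an assumption; pick the one that matches the benchmark you compare against:

```python
from sklearn.metrics import f1_score

def f1_multiclass(labels, preds):
    # Averaging strategy is a choice; "macro" weights all four classes equally.
    return f1_score(labels, preds, average="macro")

# Extra keyword arguments to eval_model() are treated as additional
# metrics and appear in the returned result dictionary.
result, model_outputs, wrong_predictions = model.eval_model(
    test_df, f1=f1_multiclass
)
print(result)
```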
We achieved an f1_score of 0.6895. Initially, this seems rather low, but keep in mind: the highest submission at Germeval 2019 was 0.7361. We would have achieved a top 20 rank without tuning the hyperparameters. This is pretty impressive!
In a future post, I am going to show you how to achieve a higher f1_score by tuning the hyperparameters.
Save the trained model
Simple Transformers saves the model automatically every 2000 steps and at the end of the training process. The default directory is outputs/, but the output_dir is a hyperparameter and can be overwritten. I created a helper function pack_model(), which we use to pack all required model files into a tar.gz file for deployment.
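pack_model() is a small custom helper; a possible sketch using Python's tarfile module (not necessarily the original implementation) could look like this:

```python
import os
import tarfile

def pack_model(model_path="outputs/", file_name="model.tar.gz"):
    # Bundle every file in the output directory into one tar.gz archive.
    with tarfile.open(file_name, "w:gz") as tar:
        for name in os.listdir(model_path):
            tar.add(os.path.join(model_path, name), arcname=name)

pack_model("outputs/", "germeval_distilbert.tar.gz")
```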
Load the model and predict a real example
As a final step, we load and predict a real example. Since we packed our files a step earlier with pack_model(), we have to unpack them first. Therefore I wrote another helper function, unpack_model(), to unpack our model files. To load a saved model, we only need to provide the path to our saved files and initialize the model the same way as we did in the training step. Note: you will need to specify the correct args (usually the same ones used during training) when loading the model.
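A sketch of the unpacking and reloading step, assuming the helper is a thin wrapper around tarfile and that train_args from above is still available:

```python
import tarfile

from simpletransformers.classification import ClassificationModel

def unpack_model(file_name="model.tar.gz", extract_path="unpacked_model/"):
    # Extract the packed model files so they can be loaded again.
    with tarfile.open(file_name, "r:gz") as tar:
        tar.extractall(path=extract_path)

unpack_model("germeval_distilbert.tar.gz", "unpacked_model/")

# Re-initialize the model from the unpacked files with the same
# architecture and (usually) the same args as during training.
model = ClassificationModel(
    "bert",
    "unpacked_model/",
    num_labels=4,
    args=train_args,
)
```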
After initializing it, we can use the model.predict() function to classify a given input. In this example, we take a tweet from the Germeval 2018 dataset.
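The prediction call itself looks roughly like this; the tweets below are made-up placeholders rather than the original examples from the dataset:

```python
# Placeholder inputs -- substitute real tweets from the dataset.
texts = [
    "Das Wetter in Berlin ist heute wirklich schön.",
    "Du bist echt ein Idiot.",
]

# predict() returns the predicted class ids and the raw model outputs.
predictions, raw_outputs = model.predict(texts)
print(predictions)  # class ids according to the label mapping used in training
```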
For the original example tweets, our model predicted the correct classes, OTHER and INSULT.
Conclusion
In conclusion, we can say that we achieved our goal of creating a non-English BERT-based text classification model.
Our example used German, but the approach can easily be transferred to other languages. HuggingFace offers many pre-trained models for languages like French, Spanish, Italian, Russian, Chinese, ...
Thanks for reading. You can find the colab notebook with the complete code here.
If you have any questions, feel free to contact me.