Fine-tune a non-English GPT-2 Model with Huggingface
Unless you’re living under a rock, you probably have heard about OpenAI's GPT-3 language model.
You might also have seen all the crazy demos, where the model writes JSX or HTML code, or shows its capabilities in the area of zero-shot / few-shot learning. Simon O'Regan wrote an article with excellent demos and projects built on top of GPT-3.
A downside of GPT-3 is its 175 billion parameters, which result in a model size of around 350 GB. For comparison, the biggest version of GPT-2 has 1.5 billion parameters, which makes it less than 1/116 the size of GPT-3.
In fact, with close to 175B trainable parameters, GPT-3 is much bigger than any other model out there. In a comparison of the parameter counts of recent popular NLP models, GPT-3 clearly stands out.
This is all magnificent, but you do not need 175 billion parameters to get good results in text generation.
There are already tutorials on how to fine-tune GPT-2, but many of them are outdated. In this tutorial, we are going to use the transformers library by Huggingface in its newest version (3.1.0). We will use the new Trainer class and fine-tune our GPT-2 model with German recipes from chefkoch.de.
You can find everything we are doing in this colab notebook.
Transformers Library by Huggingface
The Transformers library provides state-of-the-art machine learning architectures like BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, and T5 for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It also provides thousands of pre-trained models in 100+ different languages and is deeply interoperable between PyTorch & TensorFlow 2.0. It enables developers to fine-tune machine learning models for different NLP tasks like text classification, sentiment analysis, question answering, or text generation.
Tutorial
In the tutorial, we fine-tune a German GPT-2 from the Huggingface model hub. As data, we use the German Recipes Dataset, which consists of 12190 German recipes with metadata crawled from chefkoch.de.
We will use the recipe Instructions to fine-tune our GPT-2 model and afterwards let it write recipes that we can cook.
We use a Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.
What are we going to do:
- load the dataset from Kaggle
- prepare the dataset and build a TextDataset
- initialize Trainer with TrainingArguments and GPT-2 model
- train and save the model
- test the model
You can find everything we do in this colab notebook.
Load the dataset from Kaggle
As already mentioned in the introduction of the tutorial, we use the "German Recipes Dataset" from Kaggle. The dataset consists of 12190 German recipes with metadata crawled from chefkoch.de. In this example, we only use the Instructions of the recipes. We download the dataset by using the "Download" button and upload it to our colab notebook, since it only has a zipped size of 4.7 MB.
After uploading the file, we use unzip to extract recipes.json.
You could also use the kaggle CLI to download the dataset, but be aware that you need your Kaggle credentials in the colab notebook.
Here is an example of a recipe.
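If you want to peek at the data yourself, here is a minimal sketch of the extraction and loading step. The archive name, the use of Python's zipfile module instead of the unzip command, and the assumption that recipes.json is a single JSON array whose entries carry an Instructions field are all mine, not taken from the original notebook.

```python
import json
import zipfile

# Extract the uploaded Kaggle archive (file name assumed; adjust to your upload).
with zipfile.ZipFile("german-recipes-dataset.zip") as archive:
    archive.extractall(".")

# Load the recipes and look at a single example.
with open("recipes.json", "r", encoding="utf-8") as f:
    recipes = json.load(f)

print(len(recipes))                 # number of recipes in the dataset
print(recipes[0]["Instructions"])   # the field we fine-tune on
```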
Prepare the dataset and build a TextDataset
The next step is to extract the instructions from all recipes and build a TextDataset. The TextDataset is a custom implementation of the PyTorch Dataset class implemented by the transformers library. If you want to know more about Dataset in PyTorch, you can check out this YouTube video.
First, we split the recipes.json into a train and test section. Then we extract the Instructions from the recipes and write them into a train_dataset.txt and test_dataset.txt.
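A sketch of this preprocessing step, assuming the recipes list loaded above; the 80/20 split ratio and the helper name build_text_file are my own choices and may differ from the original notebook.

```python
from sklearn.model_selection import train_test_split

# Split the recipes into a train and a test section (80/20 split assumed).
train, test = train_test_split(recipes, test_size=0.2, random_state=42)

def build_text_file(data, dest_path):
    # Write the Instructions of each recipe as one line of plain text.
    with open(dest_path, "w", encoding="utf-8") as f:
        for recipe in data:
            f.write(recipe["Instructions"].replace("\n", " ") + "\n")

build_text_file(train, "train_dataset.txt")
build_text_file(test, "test_dataset.txt")
```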
The next step is to download the tokenizer. We use the tokenizer from the german-gpt2 model.
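For example (the exact model id on the model hub is an assumption here; use whichever german-gpt2 checkpoint you want to fine-tune):

```python
from transformers import AutoTokenizer

# Model id assumed; replace it with the german-gpt2 checkpoint you use.
tokenizer = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")
```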
Now we can build our TextDataset. To do so, we create a TextDataset instance with the tokenizer and the path to our datasets. We also create our data_collator, which is used in training to form a batch from our dataset.
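A sketch of this step with the transformers 3.1.0 classes; the block_size of 128 is an assumption you can tune to your GPU memory.

```python
from transformers import TextDataset, DataCollatorForLanguageModeling

# Chunk the text files into blocks of token ids.
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train_dataset.txt",
    block_size=128,  # block size assumed; tune to your GPU memory
)
test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="test_dataset.txt",
    block_size=128,
)

# We train a causal language model, so no masked-LM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```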
Initialize Trainer with TrainingArguments and GPT-2 model
The Trainer class provides an API for feature-complete training. It is used in most of the example scripts from Huggingface. Before we can instantiate our Trainer, we need to download our GPT-2 model and create the TrainingArguments. The TrainingArguments are used to define the hyperparameters we use in the training process, like the learning_rate, num_train_epochs, or per_device_train_batch_size. You can find a complete list here.
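A sketch of the initialization; the model id, output_dir, and hyperparameter values below are illustrative assumptions rather than the article's exact settings.

```python
from transformers import AutoModelWithLMHead, Trainer, TrainingArguments

# Download the pre-trained German GPT-2 model (model id assumed as above).
model = AutoModelWithLMHead.from_pretrained("anonymous-german-nlp/german-gpt2")

# Hyperparameter values are illustrative, not the article's exact ones.
training_args = TrainingArguments(
    output_dir="./gpt2-gerchef",       # where checkpoints and the final model go
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
```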
Train and save the model
To train the model, we can simply run trainer.train(). After training is done, you can save the model by calling save_model(). This will save the trained model to the output_dir from our TrainingArguments.
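Put together, this step is just:

```python
# Start fine-tuning and persist the final model to the output_dir.
trainer.train()
trainer.save_model()
```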
Test the model
To test the model, we use another highlight of the transformers library called pipeline. Pipelines are objects that offer a simple API dedicated to several tasks, text generation amongst others.
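A sketch of the generation step, assuming the output_dir from above; the German prompt is arbitrary.

```python
from transformers import pipeline

# Load the fine-tuned model from the output_dir (path assumed as above).
chef = pipeline("text-generation", model="./gpt2-gerchef", tokenizer=tokenizer)

# Give the model a German prompt and let it continue the recipe.
print(chef("Zuerst Tomaten")[0]["generated_text"])
```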
result:
"Zuerst Tomaten dazu geben und 2 Minuten kochen lassen. Die Linsen ebenfalls in der Brühe anbrühen.Die Tomaten auspressen. Mit der Butter verrühren. Den Kohl sowie die Kartoffeln andünsten, bis sie weich sind. "
(Roughly: "First add the tomatoes and let them cook for 2 minutes. Also parboil the lentils in the broth. Squeeze out the tomatoes. Stir together with the butter. Sauté the cabbage and the potatoes until they are soft.")
Well, that's it. We've done it 👨🏻🍳. We have successfully fine-tuned our GPT-2 model to write us recipes.
To improve our results, we could train it longer and adjust our TrainingArguments, or enlarge the dataset.
You can find everything in this colab notebook.
Thanks for reading. If you have any questions, feel free to contact me or comment on this article. You can also connect with me on Twitter or LinkedIn.