In this blog, you will learn how to fine-tune LayoutLM (v1) for document understanding using Hugging Face Transformers. LayoutLM is a Transformer model for document image understanding and information extraction. LayoutLM (v1) is the only model in the LayoutLM family with an MIT license, which allows it to be used for commercial purposes, unlike LayoutLMv2 and LayoutLMv3.
We will use the FUNSD dataset, a collection of 199 fully annotated forms. More information about the dataset can be found on the dataset page.
Before we can start, make sure you have a Hugging Face Account to save artifacts and experiments.
Quick intro: LayoutLM by Microsoft Research
LayoutLM is a multimodal Transformer model for document image understanding and information extraction, and can be used for tasks such as form understanding and receipt understanding.
Now that we know how LayoutLM works, let's get started. 🚀
Note: This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4 GPU.
1. Setup Development Environment
Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.
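The exact version pins below are an assumption (newer releases should work as well):

```python
# install the Hugging Face libraries from a notebook cell
# note: the version pins are an assumption, newer releases should also work
%pip install "transformers>=4.26.0" "datasets>=2.9.0" "evaluate>=0.4.0" seqeval tensorboard
```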
Additionally, we need to install an OCR library to extract text from images. We will use pytesseract.
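On a Debian/Ubuntu machine (an assumption, adjust the command for your OS), the Tesseract engine and the Python bindings can be installed like this:

```python
# pytesseract is only a wrapper, the Tesseract OCR engine itself must be installed on the system
!sudo apt-get install -y tesseract-ocr
%pip install pytesseract pillow
```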
This example will use the Hugging Face Hub as a remote model versioning service. To be able to push our model to the Hub, you need to register on Hugging Face.
If you already have an account, you can skip this step.
After you have an account, we will use the notebook_login util from the huggingface_hub package to log into our account and store our token (access key) on disk.
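A minimal login cell looks like this:

```python
from huggingface_hub import notebook_login

# opens a prompt for a Hugging Face access token and caches it on disk
notebook_login()
```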
2. Load and prepare FUNSD dataset
We will use the FUNSD dataset, a collection of 199 fully annotated forms. The dataset is available on Hugging Face at nielsr/funsd.
Note: The LayoutLM model doesn't have an AutoProcessor to conveniently create the model inputs from our documents, but we can use the LayoutLMv2Processor instead.
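A sketch of how the processor could be loaded, assuming we reuse the microsoft/layoutlmv2-base-uncased checkpoint (its tokenizer shares the bert-base-uncased vocabulary with LayoutLM v1):

```python
from transformers import LayoutLMv2Processor

# apply_ocr=False is an assumption here: the FUNSD annotations already contain words
# and boxes, so pytesseract-based OCR is only needed later for inference on raw images
processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", apply_ocr=False
)
```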
To load the FUNSD dataset, we use the load_dataset() method from the 🤗 Datasets library.
Let's check out an example from the dataset.
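Loading the dataset and peeking at the first training sample could look like this (split and column names follow the nielsr/funsd dataset card):

```python
from datasets import load_dataset

# downloads the FUNSD images and annotations from the Hub
dataset = load_dataset("nielsr/funsd")

print(dataset)                     # train/test splits and their columns
print(dataset["train"][0].keys())  # e.g. words, bboxes, ner_tags, image_path
```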
We can display all our classes by inspecting the features of our dataset. Those ner_tags will later be used to create a user-friendly output after we have fine-tuned our model.
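We can read the label names directly from the ner_tags feature and build the id/label mappings we will need for the model, for example:

```python
# ner_tags is a Sequence of ClassLabel, so the class names are stored in the feature itself
labels = dataset["train"].features["ner_tags"].feature.names
print(labels)

# mappings used later to configure the model and to produce readable predictions
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}
```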
To train our model, we need to convert our inputs (text/image) to token IDs. This is done by a 🤗 Transformers tokenizer and pytesseract. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.
Before we can process our dataset, we need to define the features of the processed inputs, which are later passed into the model. Features are a special dictionary that defines the internal structure of a dataset.
Compared to traditional NLP datasets, we need to add the bbox feature, which is a 2D array containing the bounding box for each token.
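A sketch of the feature definition and the processing step, assuming the nielsr/funsd column names (id, words, bboxes, ner_tags, image_path) and a maximum sequence length of 512:

```python
from functools import partial

from PIL import Image
from datasets import Array2D, Features, Sequence, Value

# schema of the processed dataset; bbox is a 2D array with one bounding box per token
features = Features(
    {
        "input_ids": Sequence(Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "token_type_ids": Sequence(Value(dtype="int64")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
        "labels": Sequence(Value(dtype="int64")),
    }
)

def process(sample, processor=None):
    # the processor tokenizes the words and aligns labels and boxes to the sub-word tokens
    encoding = processor(
        Image.open(sample["image_path"]).convert("RGB"),
        sample["words"],
        boxes=sample["bboxes"],
        word_labels=sample["ner_tags"],
        padding="max_length",
        truncation=True,
    )
    # LayoutLM v1 has no visual backbone, so we drop the pixel values the processor adds
    encoding.pop("image", None)
    return encoding

# process both splits and keep only the model inputs
proc_dataset = dataset.map(
    partial(process, processor=processor),
    remove_columns=["id", "words", "bboxes", "ner_tags", "image_path"],
    features=features,
).with_format("torch")
```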
3. Fine-tune and evaluate LayoutLM
After we have processed our dataset, we can start training our model. To do so, we first load the microsoft/layoutlm-base-uncased model with the LayoutLMForTokenClassification class and the label mapping of our dataset.
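For example, using the id2label/label2id mappings we built earlier:

```python
from transformers import LayoutLMForTokenClassification

# load the pre-trained encoder with a freshly initialized token-classification head
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```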
We want to evaluate our model during training. The Trainer supports evaluation during training by providing a compute_metrics function.
We are going to use seqeval and the evaluate library to evaluate the overall f1 score for all tokens.
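A common compute_metrics implementation for token classification looks roughly like this (it drops the -100 ignore index before handing the predictions to seqeval):

```python
import numpy as np
import evaluate

# seqeval computes entity-level precision, recall and f1 from BIO-tagged sequences
metric = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    predictions, label_ids = eval_pred
    predictions = np.argmax(predictions, axis=2)

    # keep only positions that carry a real label (padding/special tokens are -100)
    true_predictions = [
        [labels[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, label_ids)
    ]
    true_labels = [
        [labels[l] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, label_ids)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```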
The last step is to define the hyperparameters (TrainingArguments) we want to use for our training. We are leveraging the Hugging Face Hub integration of the Trainer to automatically push our checkpoints, logs and metrics during training into a repository.
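A possible configuration is shown below; the hyperparameters are a reasonable starting point rather than tuned values, and the output directory doubles as the Hub repository name:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlm-funsd",   # also used as the Hub repository name
    num_train_epochs=15,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="tensorboard",
    # Hugging Face Hub integration
    push_to_hub=True,
    hub_strategy="every_save",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=proc_dataset["train"],
    eval_dataset=proc_dataset["test"],
    compute_metrics=compute_metrics,
)
```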
We can start our training by using the train method of the Trainer.
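```python
# runs the training loop; checkpoints, logs and metrics are pushed to the Hub along the way
trainer.train()
```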
Nice, we have trained our model. 🎉 The best score we achieved is an overall f1 score of 0.787.
After our training is done, we also want to save our processor to the Hugging Face Hub and create a model card.
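Something along these lines should work:

```python
# store the processor next to the model weights so inference only needs the Hub repository,
# then write a model card and push everything that has not been uploaded yet
processor.save_pretrained(training_args.output_dir)
trainer.create_model_card()
trainer.push_to_hub()
```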
4. Run inference and parse form
Now that we have a trained model, we can use it to run inference. We will create a function that takes a document image and returns the extracted text and the bounding boxes.
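A sketch of such a helper is shown below. The repository name is a placeholder for the model you pushed earlier, and this time the processor keeps its default apply_ocr=True so that pytesseract extracts words and boxes from the raw image:

```python
import torch
from PIL import Image
from transformers import LayoutLMForTokenClassification, LayoutLMv2Processor

# placeholder: replace with the repository you pushed during training
model_id = "your-username/layoutlm-funsd"

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(model_id)

def unnormalize_box(bbox, width, height):
    # the model works on boxes normalized to a 0-1000 grid; map them back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

def predict(image: Image.Image):
    # run OCR + tokenization on the raw document image
    encoding = processor(image.convert("RGB"), return_tensors="pt")
    encoding.pop("image", None)  # LayoutLM v1 takes no pixel values

    with torch.no_grad():
        outputs = model(**encoding)

    predictions = outputs.logits.argmax(-1).squeeze().tolist()
    boxes = encoding["bbox"].squeeze().tolist()
    tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze().tolist())

    width, height = image.size
    results = []
    for token, pred, box in zip(tokens, predictions, boxes):
        label = model.config.id2label[pred]
        # skip padding/special tokens and everything predicted as "no entity"
        if token in processor.tokenizer.all_special_tokens or label == "O":
            continue
        results.append((token, label, unnormalize_box(box, width, height)))
    return results
```

You can then call something like predict(Image.open("form.png")) on any document image to get the labeled tokens and their pixel-space bounding boxes.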
Conclusion
We managed to successfully fine-tune our LayoutLM model to extract information from forms. With only 149 training examples we achieved an overall f1 score of 0.787, which is impressive and further proof of the power of transfer learning.
Now it's your turn to integrate LayoutLM into your own projects. 🚀
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.