Fine-tuning LayoutLM for document-understanding using Keras & Hugging Face Transformers


In this blog, you will learn how to fine-tune LayoutLM (v1) for document understanding using TensorFlow, Keras & Hugging Face Transformers. LayoutLM is a transformer model for document image understanding and information extraction. It was originally published by Microsoft Research as a PyTorch model and later converted to Keras by the Hugging Face team. LayoutLM (v1) is the only model in the LayoutLM family with an MIT license, which allows commercial use, unlike LayoutLMv2 and LayoutLMv3.

We will use the FUNSD dataset, a collection of 199 fully annotated forms. More information about the dataset can be found on the dataset page.

You will learn how to:

  1. Setup Development Environment
  2. Load and prepare FUNSD dataset
  3. Fine-tune and evaluate LayoutLM
  4. Run inference and parse form

Before we can start, make sure you have a Hugging Face Account to save artifacts and experiments.

Quick intro: LayoutLM by Microsoft Research

LayoutLM is a multimodal Transformer model for document image understanding and information extraction. It can be used for form understanding and receipt understanding.


Now we know how LayoutLM works, so let's get started. 🚀

Note: This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4 GPU.

1. Setup Development Environment

Our first step is to install the Hugging Face libraries, including transformers and datasets. Running the following cell will install all the required packages. Additionally, we need to install an OCR library to extract text from images. We will use pytesseract.

If you haven't set up a TensorFlow environment yet, you can use the conda snippet below.

conda create --channel=conda-forge --name tf \
  python=3.9 \
  nvidia::cudatoolkit=11.2 \
  tensorflow  # assumption: the package spec was cut off here; pin a build that matches CUDA 11.2

# ubuntu
!sudo apt install -y tesseract-ocr
# python
!pip install pytesseract transformers datasets evaluate seqeval tensorboard

# install git-lfs for pushing model and logs to the hugging face hub
!sudo apt-get install git-lfs --yes

This example will use the Hugging Face Hub as a remote model versioning service. To be able to push our model to the Hub, you need to register on Hugging Face. If you already have an account, you can skip this step. Once you have an account, we will use the notebook_login util from the huggingface_hub package to log in to our account and store our token (access key) on disk.

from huggingface_hub import notebook_login

# opens a login prompt and stores the access token on disk
notebook_login()


2. Load and prepare FUNSD dataset

We will use the FUNSD dataset, a collection of 199 fully annotated forms. The dataset is available on Hugging Face at nielsr/funsd.

Note: The LayoutLM model doesn't have an AutoProcessor to create our input documents, but we can use the LayoutLMv2Processor instead.

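# assumption: the processor checkpoint is not named in the text; this is the public LayoutLMv2 base checkpoint
processor_id = "microsoft/layoutlmv2-base-uncased"
dataset_id = "nielsr/funsd"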

To load the FUNSD dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset

dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
# Train dataset size: 149
# Test dataset size: 50

Let's check out an example from the dataset.

from PIL import Image, ImageDraw, ImageFont

# open one training example from its file path and convert it to RGB
image = Image.open(dataset['train'][40]['image_path'])
image = image.convert("RGB")

We can display all our classes by inspecting the features of our dataset. These ner_tags will be used later to create user-friendly output once we have fine-tuned our model.

labels = dataset['train'].features['ner_tags'].feature.names
print(f"Available labels: {labels}")
#     Available labels: ['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER']

# map class index -> label name and label name -> class index
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

To train our model we need to convert our inputs (text/image) to token IDs. This is done by a 🤗 Transformers Tokenizer and PyTesseract. If you are not sure what this means check out chapter 6 of the Hugging Face Course.

from transformers import LayoutLMv2Processor

# use LayoutLMv2 processor without ocr since the dataset already includes the ocr text
processor = LayoutLMv2Processor.from_pretrained(processor_id, apply_ocr=False)

Before we can process our dataset, we need to define the features of the processed inputs, which are later passed into the model. Features are a special dictionary that defines the internal structure of a dataset. Compared to traditional NLP datasets, we need to add the bbox feature, which is a 2D array of the bounding boxes for each token.

from PIL import Image
from functools import partial
from datasets import Features, Sequence, ClassLabel, Value, Array2D

# we need to define custom features
features = Features(
    {
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "token_type_ids": Sequence(Value(dtype="int64")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
        "labels": Sequence(ClassLabel(names=labels)),
    }
)

# preprocess function to prepare a sample in the correct format for the model
def process(sample, processor=None):
    encoding = processor(
        Image.open(sample["image_path"]).convert("RGB"),
        sample["words"],
        boxes=sample["bboxes"],
        word_labels=sample["ner_tags"],
        padding="max_length",  # pad to 512 tokens to match the bbox shape above
        truncation=True,
    )
    # LayoutLM (v1) does not use the image as model input
    del encoding["image"]
    return encoding

# process the dataset and drop the raw columns
proc_dataset = dataset.map(
    partial(process, processor=processor),
    remove_columns=["image_path", "words", "ner_tags", "id", "bboxes"],
    features=features,
)

print(proc_dataset["train"].features.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels'])
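
LayoutLM expects every bounding box on a 0-1000 scale that is independent of the page size, which is why the inference section later converts boxes back to pixels before drawing. If you bring your own documents with pixel coordinates, a conversion along the following lines would be needed first. This is a minimal sketch: normalize_box is a hypothetical helper that is not part of the tutorial code, and it assumes boxes are given as (x0, y0, x1, y1) in pixels.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 range.

    Hypothetical helper for illustration; `width` and `height` are the
    page dimensions in pixels.
    """
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# made-up box on a 200x400 pixel page
print(normalize_box([10, 20, 110, 220], width=200, height=400))
# [50, 50, 550, 550]
```

It is the inverse of the unnormalize_box helper used for drawing in the inference section.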

3. Fine-tune and evaluate LayoutLM

After we have processed our dataset, we can start training our model. To do so, we first load the microsoft/layoutlm-base-uncased model with the TFLayoutLMForTokenClassification class and pass in the label mapping of our dataset.

from transformers import TFLayoutLMForTokenClassification

# huggingface hub model id
model_id = "microsoft/layoutlm-base-uncased"

# load model with correct number of labels and mapping
model = TFLayoutLMForTokenClassification.from_pretrained(
    model_id, num_labels=len(labels), label2id=label2id, id2label=id2label
)

Before we can train our model, we have to define the hyperparameters we want to use for training. For that, we will create a dataclass.

from dataclasses import dataclass
from huggingface_hub import HfFolder
import tensorflow as tf

@dataclass
class Hyperparameters:
    num_train_epochs: int = 8
    train_batch_size: int = 8
    eval_batch_size: int = 8
    learning_rate: float = 3e-5
    weight_decay_rate: float = 0.01
    output_dir: str = 'layoutlm-funsd-tf'
    hub_model_id: str = 'layoutlm-funsd-tf'
    hub_token: str = HfFolder.get_token()  # or your token directly "hf_xxx"
    fp16: bool = True

    # Train in mixed-precision float16
    def __post_init__(self):
        if self.fp16:
            tf.keras.mixed_precision.set_global_policy("mixed_float16")

hp = Hyperparameters()

The next step is to convert our dataset to a tf.data.Dataset. This can be done with the model.prepare_tf_dataset method.

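# prepare train dataset
tf_train_dataset = model.prepare_tf_dataset(
    proc_dataset["train"],
    batch_size=hp.train_batch_size,
    shuffle=True,
)

# prepare test dataset
tf_test_dataset = model.prepare_tf_dataset(
    proc_dataset["test"],
    batch_size=hp.eval_batch_size,
    shuffle=False,
)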
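
The batch size determines how many steps Keras reports per epoch. The sketch below works through that arithmetic for the 149 training and 50 test forms. steps_per_epoch is a hypothetical helper, and it assumes that the final partial batch is dropped for a shuffled dataset and kept otherwise, which is my understanding of the prepare_tf_dataset defaults rather than something stated in this post.

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_remainder):
    """Number of batches produced from `num_examples` samples.

    Hypothetical helper for illustration only.
    """
    if drop_remainder:
        # the last, incomplete batch is discarded
        return num_examples // batch_size
    # the last, incomplete batch is kept
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(149, 8, drop_remainder=True))   # 18
print(steps_per_epoch(50, 8, drop_remainder=False))   # 7
```

Under that assumption, 18 training steps per epoch is consistent with the 18/18 progress counter in the training log further down.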

As mentioned at the beginning, we want to use the Hugging Face Hub for model versioning and monitoring. We therefore push the model weights to the Hub both during and after training. We also want to track performance during training, so we push the TensorBoard logs along with the weights and use the Hub's "Training Metrics" feature to monitor training in real time.

Additionally, we are going to use the KerasMetricCallback from the transformers library to evaluate our model during training using seqeval and evaluate. The KerasMetricCallback allows us to compute metrics at the end of every epoch. It is particularly useful for common NLP metrics, like BLEU and ROUGE as well as for class specific f1 scores, like seqeval provides.

import evaluate
import numpy as np

# load seqeval metric
metric = evaluate.load("seqeval")

# labels of the model
ner_labels = list(model.config.id2label.values())

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=-1)

    all_predictions = []
    all_labels = []
    for prediction, label in zip(predictions, labels):
        for predicted_idx, label_idx in zip(prediction, label):
            # skip positions marked with the ignore index
            if label_idx == -100:
                continue
            all_predictions.append(ner_labels[predicted_idx])
            all_labels.append(ner_labels[label_idx])
    res = metric.compute(predictions=[all_predictions], references=[all_labels])
    return {
        "overall_precision": res["overall_precision"],
        "overall_recall": res["overall_recall"],
        "overall_f1": res["overall_f1"],
        "overall_accuracy": res["overall_accuracy"],
    }
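
The check against -100 in compute_metrics matters because not every token position carries a real label. As far as I know, the processor assigns -100 to padding, special tokens and non-first sub-word tokens so that they are excluded from scoring. The following sketch isolates that filtering step in plain Python. drop_ignored is a hypothetical helper and the id sequences are made up for illustration.

```python
IGNORE_INDEX = -100  # label id of positions that should not be scored


def drop_ignored(predicted_ids, label_ids, id2label):
    """Return (predicted_tags, true_tags) without the ignored positions."""
    predicted_tags, true_tags = [], []
    for predicted_id, label_id in zip(predicted_ids, label_ids):
        if label_id == IGNORE_INDEX:
            continue
        predicted_tags.append(id2label[predicted_id])
        true_tags.append(id2label[label_id])
    return predicted_tags, true_tags


id2label = {
    0: "O", 1: "B-HEADER", 2: "I-HEADER", 3: "B-QUESTION",
    4: "I-QUESTION", 5: "B-ANSWER", 6: "I-ANSWER",
}

# made-up token sequence: three positions are ignored
predicted = [0, 3, 4, 5, 5, 0]
gold = [-100, 3, -100, 5, 6, -100]

pred_tags, gold_tags = drop_ignored(predicted, gold, id2label)
print(pred_tags)  # ['B-QUESTION', 'B-ANSWER', 'B-ANSWER']
print(gold_tags)  # ['B-QUESTION', 'B-ANSWER', 'I-ANSWER']
```

Only the remaining tag sequences would then be handed to seqeval.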

We can add all our callbacks to a list, which we then pass to the fit method. We are using the following callbacks:

  • TensorBoard: To log our training metrics to Tensorboard
  • PushToHubCallback: To push our model weights and Tensorboard logs to the Hub
  • KerasMetricCallback: To evaluate our model during training

import os
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard as TensorboardCallback

callbacks = []

callbacks.append(TensorboardCallback(log_dir=os.path.join(hp.output_dir, "logs")))
callbacks.append(KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_dataset))

if hp.hub_token:
    callbacks.append(PushToHubCallback(output_dir=hp.output_dir, hub_model_id=hp.hub_model_id, hub_token=hp.hub_token))

Before we can start our training we have to create the optimizer and compile our model.

import tensorflow as tf
from transformers import AdamWeightDecay

# create optimizer with weight decay
optimizer = AdamWeightDecay(
    learning_rate = hp.learning_rate,
    weight_decay_rate = hp.weight_decay_rate,
)

# compile model; no loss is passed, so the model's built-in loss is used
model.compile(optimizer=optimizer)

We can start training by calling fit and providing the training and validation datasets, along with the hyperparameters and callbacks we defined before.

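# train model
train_results = model.fit(
    tf_train_dataset,
    validation_data=tf_test_dataset,
    callbacks=callbacks,
    epochs=hp.num_train_epochs,
)

    Epoch 1/8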
    18/18 [==============================] - 47s 2s/step - loss: 1.6940 - val_loss: 1.4151 - overall_precision: 0.2686 - overall_recall: 0.2785 - overall_f1: 0.2735 - overall_accuracy: 0.5128
    Epoch 2/8
    18/18 [==============================] - 20s 1s/step - loss: 1.1731 - val_loss: 0.8665 - overall_precision: 0.5771 - overall_recall: 0.6101 - overall_f1: 0.5932 - overall_accuracy: 0.7267
    Epoch 3/8
    18/18 [==============================] - 40s 2s/step - loss: 0.7612 - val_loss: 0.6849 - overall_precision: 0.6362 - overall_recall: 0.7336 - overall_f1: 0.6814 - overall_accuracy: 0.7784
    Epoch 4/8
    18/18 [==============================] - 20s 1s/step - loss: 0.5630 - val_loss: 0.6265 - overall_precision: 0.6748 - overall_recall: 0.7592 - overall_f1: 0.7145 - overall_accuracy: 0.8017
    Epoch 5/8
    18/18 [==============================] - 40s 2s/step - loss: 0.4441 - val_loss: 0.6256 - overall_precision: 0.6935 - overall_recall: 0.7767 - overall_f1: 0.7328 - overall_accuracy: 0.8036
    Epoch 6/8
    18/18 [==============================] - 20s 1s/step - loss: 0.3641 - val_loss: 0.6402 - overall_precision: 0.7115 - overall_recall: 0.7772 - overall_f1: 0.7429 - overall_accuracy: 0.7940
    Epoch 7/8
    18/18 [==============================] - 40s 2s/step - loss: 0.2781 - val_loss: 0.6248 - overall_precision: 0.7176 - overall_recall: 0.7868 - overall_f1: 0.7506 - overall_accuracy: 0.8141
    Epoch 8/8
    18/18 [==============================] - 20s 1s/step - loss: 0.2280 - val_loss: 0.6532 - overall_precision: 0.7218 - overall_recall: 0.7878 - overall_f1: 0.7534 - overall_accuracy: 0.8144

Nice, we have trained our model. 🎉 The best score we achieved is an overall f1 score of 0.7534.
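
The reported F1 score is the harmonic mean of precision and recall, so the numbers in the log can be cross-checked by hand. A small sketch using the precision and recall from the last epoch (f1_score is a hypothetical helper, not a seqeval function):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# precision and recall from epoch 8 of the training log
print(round(f1_score(0.7218, 0.7878), 4))  # 0.7534
```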


After our training is done we also want to save our processor to the Hugging Face Hub and create a model card.

# change apply_ocr to True to use the ocr text for inference
processor.feature_extractor.apply_ocr = True
# Save processor and create model card
model.push_to_hub(hp.hub_model_id, use_auth_token=hp.hub_token)
processor.push_to_hub(hp.hub_model_id, use_auth_token=hp.hub_token)

4. Run Inference

Now that we have a trained model, we can use it to run inference. We will create a function that takes the path to a document image and returns either the image with the predicted labels drawn onto it or the list of predicted labels.

from transformers import TFLayoutLMForTokenClassification, LayoutLMv2Processor
from PIL import Image, ImageDraw, ImageFont
import tensorflow as tf

# load model and processor from huggingface hub
model = TFLayoutLMForTokenClassification.from_pretrained("philschmid/layoutlm-funsd-tf")
processor = LayoutLMv2Processor.from_pretrained("philschmid/layoutlm-funsd-tf")

# helper function to unnormalize bboxes for drawing onto the image
def unnormalize_box(bbox, width, height):
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

label2color = {
    "B-HEADER": "blue",
    "B-QUESTION": "red",
    "B-ANSWER": "green",
    "I-HEADER": "blue",
    "I-QUESTION": "red",
    "I-ANSWER": "green",
}

# draw results onto the image
def draw_boxes(image, boxes, predictions):
    width, height = image.size
    normalizes_boxes = [unnormalize_box(box, width, height) for box in boxes]

    # draw predictions over the image
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    for prediction, box in zip(predictions, normalizes_boxes):
        # tokens outside any entity are not drawn
        if prediction == "O":
            continue
        draw.rectangle(box, outline="black")
        draw.rectangle(box, outline=label2color[prediction])
        draw.text((box[0] + 10, box[1] - 10), text=prediction, fill=label2color[prediction], font=font)
    return image

# run inference
def run_inference(path, model=model, processor=processor, output_image=True):
    # create model input
    image = Image.open(path).convert("RGB")
    encoding = processor(image, return_tensors="tf")
    del encoding["image"]
    # run inference
    outputs = model(**encoding)
    predictions = tf.squeeze(tf.argmax(outputs.logits, axis=-1)).numpy()
    # get labels
    labels = [model.config.id2label[prediction] for prediction in predictions]
    if output_image:
        return draw_boxes(image, encoding["bbox"][0], labels)
    else:
        return labels
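
To turn the predicted labels into a parsed form, the B-/I- tags can be merged into entity spans. The function below is a minimal sketch of that idea, not part of the tutorial code: group_entities is a hypothetical helper, it assumes one label per word (the model itself predicts one label per token), and the words and labels in the example are made up.

```python
def group_entities(words, labels):
    """Merge BIO-tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # store the span collected so far, if any
        if current_words:
            entities.append((current_type, " ".join(current_words)))

    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # continuation of the current span
            current_words.append(word)
        elif prefix in ("B", "I"):
            # a new span starts (an I- tag without a matching B- is treated as a start)
            flush()
            current_type, current_words = entity_type, [word]
        else:
            # "O": close the current span
            flush()
            current_type, current_words = None, []
    flush()
    return entities


# made-up example
words = ["Date:", "12", "May", "Name:", "Jane"]
labels = ["B-QUESTION", "B-ANSWER", "I-ANSWER", "B-QUESTION", "B-ANSWER"]
print(group_entities(words, labels))
# [('QUESTION', 'Date:'), ('ANSWER', '12 May'), ('QUESTION', 'Name:'), ('ANSWER', 'Jane')]
```

Pairing each QUESTION span with the ANSWER span that follows it would be one simple way to build key-value output from a form.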




We managed to successfully fine-tune our LayoutLM to extract information from forms. With only 149 training examples we achieved an overall f1 score of 0.7534, which is impressive and further proof of the power of transfer learning.

Now it's your turn to integrate Transformers into your own projects. 🚀

Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.