In this tutorial, you will learn how to fine-tune and deploy Donut-base for document understanding/document parsing using Hugging Face Transformers and Amazon SageMaker. Donut is a new document-understanding model that achieves state-of-the-art performance and is released under the MIT license, which, unlike the licenses of other models such as LayoutLMv2/LayoutLMv3, allows commercial use.
Quick intro: Document Understanding Transformer (Donut) by ClovaAI
Document Understanding Transformer (Donut) is a new Transformer model for OCR-free document understanding. It doesn't require an OCR engine to process scanned documents, yet it achieves state-of-the-art performance on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing).
Donut is a multimodal sequence-to-sequence model with a vision encoder (Swin Transformer) and a text decoder (BART). The encoder receives the image and encodes it into an embedding, which is then passed to the decoder, which generates a sequence of tokens.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
2. Load SROIE dataset
We will use the SROIE dataset, a collection of 1000 scanned receipts including their OCR text. More specifically, we will use the dataset from task 2, "Scanned Receipt OCR". The dataset available on Hugging Face (darentang/sroie) is not compatible with Donut. That's why we will use the original dataset together with the imagefolder feature of datasets to load our dataset. Learn more about loading image data here.
Note: The test data for task 2 is unfortunately not available, which means we end up with only 626 images.
First, we will clone the repository, extract the dataset into a separate folder and remove the unnecessary files.
Now we have two folders inside the data/ directory. One contains the images of the receipts and the other contains the OCR text. The next step is to create a metadata.jsonl file (one JSON object per line) that contains the information about the images, including the OCR text. This is necessary for the imagefolder feature of datasets.
In the end, the metadata.jsonl should look similar to the example below.
In our example, the "text" column will contain the OCR text of the image, which will later be used for creating the Donut-specific format.
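The metadata step can be sketched as follows. This is a minimal sketch and not the original notebook code: the folder names (img, key), the helper name build_metadata, and the sample record are assumptions made for illustration. It relies on the fact that the imagefolder loader reads its metadata from a file named metadata.jsonl (or metadata.csv) with a file_name column, and it stores the annotation as a JSON string in the "text" column.

```python
import json
import tempfile
from pathlib import Path


def build_metadata(image_dir: Path, annotation_dir: Path) -> Path:
    """Write image_dir/metadata.jsonl with one {"file_name", "text"} object per image."""
    metadata_file = image_dir / "metadata.jsonl"
    with metadata_file.open("w", encoding="utf-8") as out:
        for annotation in sorted(annotation_dir.glob("*.json")):
            image = image_dir / f"{annotation.stem}.jpg"
            if not image.is_file():
                continue  # skip annotations without a matching image
            data = json.loads(annotation.read_text(encoding="utf-8"))
            # Keep the annotation as a JSON string in the "text" column.
            entry = {"file_name": image.name, "text": json.dumps(data)}
            out.write(json.dumps(entry) + "\n")
    return metadata_file


# Tiny demo on a throwaway directory with one made-up receipt (hypothetical data).
root = Path(tempfile.mkdtemp())
image_dir, annotation_dir = root / "img", root / "key"
image_dir.mkdir()
annotation_dir.mkdir()
(image_dir / "receipt_000.jpg").write_bytes(b"")  # placeholder, not a real image
(annotation_dir / "receipt_000.json").write_text(
    json.dumps({"company": "EXAMPLE STORE", "total": "7.00"}), encoding="utf-8"
)
(annotation_dir / "orphan.json").write_text("{}", encoding="utf-8")  # no image, skipped

metadata_lines = build_metadata(image_dir, annotation_dir).read_text(encoding="utf-8").splitlines()
print(metadata_lines)
```

Writing the file next to the images keeps the relative file_name values valid when the folder is loaded later.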
Good Job! Now we can load the dataset using the imagefolder feature of datasets.
Now, let's take a closer look at our dataset.
3. Preprocess and upload dataset for Donut
As we learned in the introduction, Donut is a sequence-to-sequence model with a vision encoder and text decoder. When fine-tuning the model we want it to generate the "text" based on the image we pass it. Similar to NLP tasks, we have to tokenize and preprocess the text.
Before we can tokenize the text, we need to transform the JSON string into a Donut-compatible document.
current JSON string
Donut document
To easily create those documents, the ClovaAI team has created a json2token method, which we extract and then apply.
The next step is to tokenize our text and encode the images into tensors. To do so, we need to load the DonutProcessor, add our new special tokens, and reduce the image size used during processing from [1920, 2560] to [720, 960], which lowers memory usage and speeds up training.
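To make the transformation concrete, here is a simplified sketch of the idea behind json2token. It is not the ClovaAI implementation: the tag scheme (an opening `<s_key>` and closing `</s_key>` tag per JSON key, `<sep/>` between list items), the choice to keep keys in insertion order, and the sample record are assumptions for illustration.

```python
import json


def json2token(obj, special_tokens):
    """Flatten a JSON-like object into a tagged string and collect the tags used."""
    if isinstance(obj, dict):
        parts = []
        for key, value in obj.items():
            open_tag, close_tag = f"<s_{key}>", f"</s_{key}>"
            for tag in (open_tag, close_tag):
                if tag not in special_tokens:
                    special_tokens.append(tag)  # remember new tags for the tokenizer
            parts.append(open_tag + json2token(value, special_tokens) + close_tag)
        return "".join(parts)
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item, special_tokens) for item in obj)
    return str(obj)  # leaf value


special_tokens = []
raw = '{"company": "EXAMPLE STORE", "total": "7.00"}'  # hypothetical record
document = json2token(json.loads(raw), special_tokens)
print(document)
# <s_company>EXAMPLE STORE</s_company><s_total>7.00</s_total>
```

Collecting the tags while converting is convenient because the same list can later be registered with the tokenizer as additional special tokens.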
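A quick back-of-the-envelope check shows why the smaller image size helps. The sketch below only counts pixels per image; it assumes that encoder cost grows with the number of pixels and does not measure actual GPU memory or training time.

```python
# Sizes taken from the text above; pixel_count is an illustrative helper.
original_size = (1920, 2560)
reduced_size = (720, 960)


def pixel_count(size):
    a, b = size
    return a * b


ratio = pixel_count(original_size) / pixel_count(reduced_size)
# Cross-multiplication avoids floating point when comparing aspect ratios.
same_aspect = original_size[0] * reduced_size[1] == original_size[1] * reduced_size[0]
print(f"{ratio:.2f}x fewer pixels per image, aspect ratio preserved: {same_aspect}")
# 7.11x fewer pixels per image, aspect ratio preserved: True
```

Because both sizes share the same 3:4 aspect ratio, the receipts are scaled down without distortion.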
Now, we can prepare our dataset, which we will use for the training later.
Before we can upload our dataset to S3 for training, we want to split the dataset into train and test sets.
After that is done, we use the new FileSystem integration to upload our dataset to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
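The split itself is a simple shuffled hold-out. The sketch below shows the idea with plain Python; the 10% test fraction, the seed, and the file names are assumptions, and only the total of 626 samples comes from the note above. With the datasets library, Dataset.train_test_split(test_size=...) does the same in one call.

```python
import random


def train_test_split(items, test_fraction=0.1, seed=42):
    """Return (train, test) lists using a reproducible shuffled hold-out."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(items) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [item for i, item in enumerate(items) if i not in test_idx]
    test = [item for i, item in enumerate(items) if i in test_idx]
    return train, test


samples = [f"receipt_{i:03d}.jpg" for i in range(626)]  # hypothetical file names
train_set, test_set = train_test_split(samples)
print(len(train_set), len(test_set))  # 563 63
```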
4. Fine-tune Donut model on Amazon SageMaker
After we have processed our dataset, we can start training our model with an Amazon SageMaker training job using the HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us.
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running the training script.
One important detail is that we extended the DonutProcessor earlier and added special tokens, which we need to pass through to our training script. We also need to pass the image_size and max_length to our training script.
In addition to loading our model, we are resizing the embedding layer to match the newly added tokens and adjusting the image_size of our encoder to match our dataset. We are also adding tokens for inference later.
Let's start the training job and wait until it is finished. This will take around 30 minutes.
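One way to pass these values through is to serialize them as JSON strings in the hyperparameters and parse them again in the training script, since hyperparameters reach the script as command-line arguments. The sketch below shows that round trip with the standard library only; the argument names, the token list, and the max_length value are assumptions for illustration, not the original training script.

```python
import argparse
import json

# Notebook side: values collected during preprocessing (hypothetical examples).
special_tokens = ["<s_company>", "</s_company>", "<s_total>", "</s_total>"]
image_size = [720, 960]
max_length = 512  # assumed value, not taken from the tutorial

hyperparameters = {
    "special_tokens": json.dumps(special_tokens),  # lists travel as JSON strings
    "image_size": json.dumps(image_size),
    "max_length": max_length,
}


# Training-script side: parse the same values back from the command line.
def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--special_tokens", type=json.loads, default=[])
    parser.add_argument("--image_size", type=json.loads, default=[720, 960])
    parser.add_argument("--max_length", type=int, default=512)
    return parser.parse_args(argv)


# Simulate how each hyperparameter becomes a "--name value" pair.
argv = []
for name, value in hyperparameters.items():
    argv += [f"--{name}", str(value)]

args = parse_args(argv)
print(args.special_tokens, args.image_size, args.max_length)
```

Using JSON for the list-valued parameters avoids ad hoc splitting on commas, which would break if a token ever contained one.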
5. Deploy Donut model on Amazon SageMaker
During the training, we copied an inference.py into our model.tar.gz, which now allows us to easily deploy our model to SageMaker for inference.
The inference.py implements a custom model_fn and predict_fn for our Donut model. The model_fn loads the model and processor and the predict_fn tokenizes the input and returns the prediction.
Before we can deploy the model with the HuggingFaceModel class, we need to create a new serializer that supports our image data. Serializers are used in the Predictor and in the predict method to serialize our data to a specific MIME type. The default serializer for the HuggingFacePredictor is a JSON serializer, but since we are not going to send text data to the endpoint, we will use the DataSerializer.
SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.
Let's test it using an example from the test split.
Awesome! Our fine-tuned model parsed the document correctly and extracted the right values. The next step is to evaluate our model on the test set. Since the model is a seq2seq model, evaluating it is not that straightforward.
To keep things simple, we will use ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation. This metric does not behave like standard accuracy: it compares a generated text against a set of reference texts. ROUGE is mostly used for summarization or machine translation tasks.
The higher the score, the closer the generated text is to the reference text.
Our model achieves a ROUGE-1 score of 81.7% on the test set. ROUGE-1 refers to the overlap of unigrams (individual words) between the prediction and the reference.
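To illustrate what unigram overlap means, here is a minimal sketch of a ROUGE-1 style score. It splits on whitespace and skips the normalization and stemming that evaluation libraries typically apply, so it will not reproduce the 81.7% reported above; the two example strings are made up.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 using whitespace tokens."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("total is 7.00 today", "the total is 7.00")
print(scores)  # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Three of the four tokens match in each direction, which gives 0.75 for precision, recall, and F1.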
Note: The evaluation we did was very simple.
In an inference test, the model predicted the address 'NO. 31G&33G, JALAN SETIA INDAH X ,U13/X 40170 SETIA ALAM', while the ground truth was 'NO. 31G&33G, JALAN SETIA INDAH X,U13/X 40170 SETIA ALAM'. The only difference is the whitespace between X and ,U13/X.
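A light normalization step before comparing strings would make the evaluation less sensitive to this kind of difference. The sketch below is one possible approach, not part of the original evaluation: the normalize helper and its rules (collapse whitespace, drop whitespace before punctuation) are assumptions.

```python
import re


def normalize(text: str) -> str:
    """Collapse repeated whitespace and remove whitespace before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,.;:])", r"\1", text)


prediction = "NO. 31G&33G, JALAN SETIA INDAH X ,U13/X 40170 SETIA ALAM"
reference = "NO. 31G&33G, JALAN SETIA INDAH X,U13/X 40170 SETIA ALAM"

exact_before = prediction == reference
exact_after = normalize(prediction) == normalize(reference)
print(exact_before, exact_after)  # False True
```

Whether such normalization is appropriate depends on the use case, since in some fields whitespace may carry meaning.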
Clean up
To avoid unnecessary costs, we should delete the endpoint and the model.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.