philschmid blog

Image Classification with Hugging Face Transformers and `Keras`

#HuggingFace #Keras #ViT #Tensorflow
January 04, 2022 · 10 min read


Welcome to this end-to-end image classification example using Keras and Hugging Face Transformers. In this demo, we will use the Hugging Face transformers and datasets libraries together with TensorFlow & Keras to fine-tune a pre-trained Vision Transformer for image classification.

We are going to use the EuroSAT dataset for land use and land cover classification. The dataset is based on Sentinel-2 satellite images covering 13 spectral bands and consists of 10 classes with a total of 27,000 labeled and geo-referenced images.

More information about the dataset can be found in its repository.

We are going to use the features of the Hugging Face ecosystem, like model versioning and experiment tracking, as well as Keras features like Early Stopping and TensorBoard.

Quick intro: Vision Transformer (ViT) by Google Brain

The Vision Transformer (ViT) is basically BERT, but applied to images. It attains excellent results compared to state-of-the-art convolutional networks. In order to provide images to the model, each image is split into a sequence of fixed-size patches (typically of resolution 16x16 or 32x32), which are linearly embedded. One also adds a [CLS] token at the beginning of the sequence in order to classify images. Next, one adds absolute position embeddings and provides this sequence to the Transformer encoder.
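To make the patch arithmetic concrete, here is a minimal sketch that computes how many tokens the encoder sees for a given input resolution and patch size. The helper names `vit_sequence_length` and `flattened_patch_dim` are hypothetical; the sketch assumes square images, square non-overlapping patches, and a single [CLS] token, as described above.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def flattened_patch_dim(patch_size: int, num_channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch_size * patch_size * num_channels


print(vit_sequence_length(224, 16))  # 14 * 14 patches + [CLS] = 197
print(flattened_patch_dim(16))       # 16 * 16 * 3 = 768
```

For a 224x224 image with 16x16 patches, the Transformer encoder therefore processes a sequence of 197 tokens.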

vision-transformer-architecture

Installation

#!pip install "tensorflow==2.6.0"
!pip install transformers "datasets>=1.17.0" tensorboard --upgrade

!sudo apt-get install git-lfs

This example uses the Hugging Face Hub as a remote model versioning service. To be able to push our model to the Hub, you need to register on Hugging Face. If you already have an account, you can skip this step. Once you have an account, we use the notebook_login util from the huggingface_hub package to log in and store our token (access key) on disk.

from huggingface_hub import notebook_login

notebook_login()

Setup & Configuration

In this step, we define the global configuration and parameters that are used across the whole end-to-end fine-tuning process, e.g. the feature extractor and the model we will use.

In this example, we are going to fine-tune google/vit-base-patch16-224-in21k, a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224.

model_id = "google/vit-base-patch16-224-in21k"

You can easily adjust the model_id to another Vision Transformer model, e.g. google/vit-base-patch32-384

Dataset & Pre-processing

As our dataset we will use EuroSAT, an image classification dataset based on satellite images captured by Sentinel-2. The dataset consists of 10 classes (Forest, River, Highway, AnnualCrop, SeaLake, HerbaceousVegetation, Industrial, Residential, PermanentCrop, Pasture) with a total of 27,000 labeled and geo-referenced images.

eurosat-sample
Source: EuroSAT

EuroSAT is not yet available as a dataset in the datasets library. To create a Dataset instance, we need to write a small helper function that loads the dataset from the filesystem and creates the instance we use later for training.

As a first step, we need to download the dataset to our filesystem and unzip it.

!wget https://madm.dfki.de/files/sentinel/EuroSAT.zip
!unzip EuroSAT.zip -d EuroSAT

We should now have a directory structure that looks like this:

EuroSAT/2750/
├── AnnualCrop/
│   └── AnnualCrop_1.jpg
├── Forest/
│   └── Forest_1.jpg
├── HerbaceousVegetation/
│   └── HerbaceousVegetation_1.jpg
├── Highway/
│   └── Highway_1.jpg
├── Industrial/
│   └── Industrial_1.jpg
├── Pasture/
│   └── Pasture_1.jpg
├── PermanentCrop/
│   └── PermanentCrop_1.jpg
├── Residential/
│   └── Residential_1.jpg
├── River/
│   └── River_1.jpg
└── SeaLake/
    └── SeaLake_1.jpg

At the time of writing, datasets does not yet support loading image datasets from the filesystem. Therefore, we create a create_image_folder_dataset helper function to load the dataset from the filesystem. This function creates our _CLASS_NAMES and our datasets.Features. After that, it iterates through the filesystem and creates a Dataset instance.

import os
import datasets

def create_image_folder_dataset(root_path):
    """creates `Dataset` from image folder structure"""

    # get class names from the folder names
    _CLASS_NAMES = os.listdir(root_path)
    # define the `datasets` features
    features = datasets.Features({
        "img": datasets.Image(),
        "label": datasets.features.ClassLabel(names=_CLASS_NAMES),
    })
    # temp lists holding the datapoints for creation
    img_data_files = []
    label_data_files = []
    # collect image paths and their class labels
    for img_class in os.listdir(root_path):
        for img in os.listdir(os.path.join(root_path, img_class)):
            path_ = os.path.join(root_path, img_class, img)
            img_data_files.append(path_)
            label_data_files.append(img_class)
    # create dataset
    ds = datasets.Dataset.from_dict({"img": img_data_files, "label": label_data_files}, features=features)
    return ds

eurosat_ds = create_image_folder_dataset("EuroSAT/2750")
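Before building the Dataset, it can help to sanity-check the extracted folder. The following is a minimal standard-library sketch, not part of the original notebook; the helper name `count_images_per_class` is hypothetical, and it assumes every image is a .jpg file directly inside its class folder.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(root_path: str) -> Counter:
    """Count the .jpg files in each class sub-folder of root_path."""
    counts = Counter()
    for class_dir in sorted(Path(root_path).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() == ".jpg"
            )
    return counts


# Only runs when the dataset has already been downloaded and unzipped
if Path("EuroSAT/2750").is_dir():
    counts = count_images_per_class("EuroSAT/2750")
    print(len(counts), "classes,", sum(counts.values()), "images")
```

If the download and unzip worked, the totals should match the 10 classes and 27,000 images mentioned above.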

We can display all our classes by inspecting the features of our dataset. These labels can later be used to create user-friendly output when predicting.

img_class_labels = eurosat_ds.features["label"].names
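As a small illustration of how these labels produce readable output at prediction time, the sketch below maps a vector of model scores to a class name. The helper `predict_label`, the shortened class list, and the logits are all made up for the example; in the notebook the names would come from img_class_labels and the scores from the model.

```python
def predict_label(logits, class_names):
    """Return the class name belonging to the highest score."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return class_names[best_index]


# Hypothetical class order and scores, for illustration only
example_class_names = ["AnnualCrop", "Forest", "Highway"]
example_logits = [0.3, 2.1, -1.0]

print(predict_label(example_logits, example_class_names))  # Forest
```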

Pre-processing

To train our model, we need to convert our images to pixel_values. This is done by a 🤗 Transformers feature extractor, which resizes, normalizes, and converts each image into a 3D array to be fed into our model.

from transformers import ViTFeatureExtractor
from tensorflow import keras
from tensorflow.keras import layers


feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)

# learn more about data augmentation here: https://www.tensorflow.org/tutorials/images/data_augmentation
data_augmentation = keras.Sequential(
    [
        layers.Resizing(feature_extractor.size, feature_extractor.size),
        layers.Rescaling(1./255),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(
            height_factor=0.2, width_factor=0.2
        ),
    ],
    name="data_augmentation",
)

# use keras image data augmentation processing
def augmentation(examples):
    # print(examples["img"])
    examples["pixel_values"] = [data_augmentation(image) for image in examples["img"]]
    return examples


# basic processing with the feature extractor (no augmentation)
def process(examples):
    examples.update(feature_extractor(examples["img"]))
    return examples

# we are also renaming our label col to labels to use `.to_tf_dataset` later
eurosat_ds = eurosat_ds.rename_column("label", "labels")
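To give a feel for what the feature extractor does to each pixel, here is a minimal sketch of the rescale-then-normalize step. The mean and standard deviation of 0.5 are an assumption about this checkpoint's preprocessing config; check feature_extractor.image_mean and feature_extractor.image_std for the actual values. The helper name `normalize_pixel` is hypothetical.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit value to [0, 1], then standardize with mean and std."""
    return (value / 255.0 - mean) / std


print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # 1.0
```

Under that assumption, every channel value ends up in the range [-1, 1].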

Process our dataset using the .map method with batched=True.

processed_dataset = eurosat_ds.map(process, batched=True)
processed_dataset

# # augmenting dataset takes a lot of time
# processed_dataset = eurosat_ds.map(augmentation, batched=True)

Since our dataset doesn’t include any splits, we need to call train_test_split ourselves to get an evaluation/test dataset for evaluating the results during and after training.

# test size will be 15% of the dataset
test_size = .15

processed_dataset = processed_dataset.shuffle().train_test_split(test_size=test_size)
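The split sizes that result from a 15% test share can be checked with a small standalone sketch. This is plain Python written for illustration, not the datasets implementation; the helper `split_indices` and the seed value are hypothetical. Both shuffle and train_test_split in datasets accept a seed argument if you want a reproducible split in the notebook itself.

```python
import random


def split_indices(num_examples: int, test_size: float, seed: int = 42):
    """Shuffle example indices with a fixed seed and cut off a test share."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = round(num_examples * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(27_000, 0.15)
print(len(train_idx), len(test_idx))  # 22950 4050
```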

Fine-tuning the model using Keras

Now that our dataset is processed, we can download the pretrained model and fine-tune it. But before we can do this we need to convert our Hugging Face datasets Dataset into a tf.data.Dataset. For this, we will use the .to_tf_dataset method and a data collator (Data collators are objects that will form a batch by using a list of dataset elements as input).

Hyperparameters

from huggingface_hub import HfFolder
import tensorflow as tf

id2label = {str(i): label for i, label in enumerate(img_class_labels)}
label2id = {v: k for k, v in id2label.items()}

num_train_epochs = 5
train_batch_size = 32
eval_batch_size = 32
learning_rate = 3e-5
weight_decay_rate = 0.01
num_warmup_steps = 0
output_dir = model_id.split("/")[1]
hub_token = HfFolder.get_token()  # or your token directly "hf_xxx"
hub_model_id = f'{model_id.split("/")[1]}-euroSat'
fp16 = True

# Train in mixed-precision float16
# Set fp16 to False if you're using a GPU that will not benefit from this
if fp16:
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

Converting the dataset to a tf.data.Dataset

from transformers import DefaultDataCollator

# Data collator that simply stacks the (already equally sized) examples into batches; no padding is needed
data_collator = DefaultDataCollator(return_tensors="tf")

# converting our train dataset to tf.data.Dataset
tf_train_dataset = processed_dataset["train"].to_tf_dataset(
    columns=["pixel_values"],
    label_cols=["labels"],
    shuffle=True,
    batch_size=train_batch_size,
    collate_fn=data_collator)

# converting our test dataset to tf.data.Dataset
tf_eval_dataset = processed_dataset["test"].to_tf_dataset(
    columns=["pixel_values"],
    label_cols=["labels"],
    shuffle=True,
    batch_size=eval_batch_size,
    collate_fn=data_collator)
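Conceptually, a data collator turns a list of per-example dictionaries into one dictionary of batched columns. The sketch below shows that idea in plain Python; the function `collate` is a hypothetical stand-in, and the real DefaultDataCollator additionally converts the columns to tensors.

```python
def collate(examples):
    """Turn a list of per-example dicts into one dict of per-column lists."""
    if not examples:
        return {}
    return {key: [example[key] for example in examples] for key in examples[0]}


batch = collate([
    {"pixel_values": [0.1, 0.2], "labels": 3},
    {"pixel_values": [0.3, 0.4], "labels": 7},
])
print(batch["labels"])  # [3, 7]
```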

Download the pre-trained transformer model and fine-tune it.

from transformers import TFViTForImageClassification, create_optimizer
import tensorflow as tf

# create optimizer with weight decay
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate,
    num_train_steps=num_train_steps,
    weight_decay_rate=weight_decay_rate,
    num_warmup_steps=num_warmup_steps,
)

# load pre-trained ViT model
model = TFViTForImageClassification.from_pretrained(
    model_id,
    num_labels=len(img_class_labels),
    id2label=id2label,
    label2id=label2id,
)

# define loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# define metrics
metrics = [
    tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
    tf.keras.metrics.SparseTopKCategoricalAccuracy(3, name="top-3-accuracy"),
]

# compile model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metrics
              )
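To see where num_train_steps comes from and how the learning rate might evolve, here is a rough standalone sketch. It assumes 22,950 training images (85% of 27,000) and a schedule that decays linearly to zero after warmup, which is my understanding of the create_optimizer defaults rather than something shown in this post. Whether the last incomplete batch counts depends on how to_tf_dataset handles the remainder, so both cases are shown. Both helper names are hypothetical.

```python
def steps_per_epoch(num_examples: int, batch_size: int, drop_remainder: bool) -> int:
    """Number of batches per epoch, with or without the last partial batch."""
    if drop_remainder:
        return num_examples // batch_size
    return -(-num_examples // batch_size)  # ceiling division


def linear_decay_lr(step: int, init_lr: float, total_steps: int, warmup_steps: int = 0) -> float:
    """Linear warmup to init_lr, then linear decay to zero (assumed schedule)."""
    if warmup_steps and step < warmup_steps:
        return init_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return init_lr * remaining / max(total_steps - warmup_steps, 1)


print(steps_per_epoch(22_950, 32, drop_remainder=True))   # 717
print(steps_per_epoch(22_950, 32, drop_remainder=False))  # 718

total_steps = steps_per_epoch(22_950, 32, drop_remainder=False) * 5
print(total_steps)  # 3590
print(linear_decay_lr(total_steps // 2, 3e-5, total_steps))  # roughly half of 3e-5
```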

If you want to create your own classification head, or if you want to add the augmentation/processing layers to your model, you can use the functional Keras API directly. Below is an example of how you would create a classification head.

# alternatively create Image Classification model using Keras Layer and ViTModel
# here you can also add the processing layers of keras

import tensorflow as tf
from transformers import TFViTModel

base_model = TFViTModel.from_pretrained('google/vit-base-patch16-224-in21k')


# inputs
pixel_values = tf.keras.layers.Input(shape=(3,224,224), name='pixel_values', dtype='float32')

# model layer
vit = base_model.vit(pixel_values)[0]
classifier = tf.keras.layers.Dense(10, activation='softmax', name='outputs')(vit[:, 0, :])

# model
keras_model = tf.keras.Model(inputs=pixel_values, outputs=classifier)
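One detail worth keeping in mind with this custom head: because the Dense layer ends in a softmax, keras_model outputs probabilities rather than logits. If you compile it, a loss configured with from_logits=False would match, unlike the from_logits=True setting used for TFViTForImageClassification above. The plain-Python sketch below, with made-up numbers and hypothetical helper names, shows how the loss is distorted when softmax is applied twice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_probs(probs, label):
    """Sparse categorical cross-entropy for one example, given probabilities."""
    return -math.log(probs[label])


def cross_entropy_from_logits(logits, label):
    """Same loss, but applies softmax first (the from_logits=True behavior)."""
    return cross_entropy_from_probs(softmax(logits), label)


logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

correct = cross_entropy_from_probs(probs, 0)
double_softmax = cross_entropy_from_logits(probs, 0)  # softmax applied twice by mistake
print(correct, double_softmax)
```

The second value is larger than the first, which illustrates why the loss configuration should match what the last layer of the model emits.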

Callbacks

As mentioned in the beginning, we want to use the Hugging Face Hub for model versioning and monitoring. Therefore, we push our model weights to the Hub during and after training to version them. Additionally, we want to track performance during training, so we push the TensorBoard logs along with the weights to the Hub and use the “Training Metrics” feature to monitor our training in real time.

import os
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard as TensorboardCallback, EarlyStopping

callbacks = []

callbacks.append(TensorboardCallback(log_dir=os.path.join(output_dir, "logs")))
callbacks.append(EarlyStopping(monitor="val_accuracy", patience=1))
if hub_token:
    callbacks.append(PushToHubCallback(output_dir=output_dir,
                                       hub_model_id=hub_model_id,
                                       hub_token=hub_token))
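With patience=1, training stops at the first epoch whose validation accuracy does not beat the best value seen so far. The following sketch simulates that rule on a made-up accuracy history; it is a simplified illustration of the idea, not the Keras implementation, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=1):
    """Return the 0-based epoch at which training would stop, or None if it runs to the end."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(val_accuracies):
        if value > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies for four epochs
print(early_stopping_epoch([0.95, 0.97, 0.969, 0.98]))  # 2
```

In this made-up history the third epoch is slightly worse than the second, so training would end there even though a later epoch might have improved again; a larger patience trades training time for tolerance to such dips.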

tensorboard

Training

Start training by calling model.fit.

train_results = model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=callbacks,
    epochs=num_train_epochs,
)

At the time of writing, feature_extractor doesn’t yet support push_to_hub, which is why we are pushing it manually.

from huggingface_hub import HfApi

api = HfApi()

user = api.whoami(hub_token)


feature_extractor.save_pretrained(output_dir)

api.upload_file(
    token=hub_token,
    repo_id=f"{user['name']}/{hub_model_id}",
    path_or_fileobj=os.path.join(output_dir, "preprocessor_config.json"),
    path_in_repo="preprocessor_config.json",
)

model-card

Run Managed Training using Amazon SageMaker

If you want to run this example on Amazon SageMaker to benefit from the managed training platform, follow the cells below. I converted the notebook into a Python script, train.py, which accepts the same hyperparameters and can be run on SageMaker using the HuggingFace estimator.

#!pip install sagemaker

import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_id': 'google/vit-base-patch16-224-in21k',
    'num_train_epochs': 5,
    'train_batch_size': 32,
    'eval_batch_size': 32,
    'learning_rate': 3e-5,
    'weight_decay_rate': 0.01,
    'num_warmup_steps': 0,
    'hub_token': HfFolder.get_token(),
    'hub_model_id': 'sagemaker-vit-base-patch16-224-in21k-eurosat',
    'fp16': True
}


# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.12.3',
    tensorflow_version='2.5.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

Upload our raw dataset to S3.

from sagemaker.s3 import S3Uploader

# local_path must match the directory created by `unzip EuroSAT.zip -d EuroSAT` (case-sensitive)
dataset_uri = S3Uploader.upload(local_path="EuroSAT", desired_s3_uri=f"s3://{sess.default_bucket()}/EuroSat")

After the dataset is uploaded, we can start the training and pass our s3_uri as an argument.

# starting the train job
huggingface_estimator.fit({"dataset": dataset_uri})
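The train.py script itself is not shown in this post. As a rough idea of how such a script could receive its configuration, here is a hypothetical argument-parsing sketch. It relies on two SageMaker script-mode behaviors: hyperparameters arrive as command-line arguments, and the local path of the "dataset" channel is exposed through the SM_CHANNEL_DATASET environment variable. The argument names mirror the hyperparameters dictionary above; the defaults, the `--dataset_dir` argument, and the `str_to_bool` helper are assumptions made for the example.

```python
import argparse
import os


def str_to_bool(value: str) -> bool:
    """Booleans arrive as strings such as 'True' on the command line."""
    return value.lower() in ("true", "1", "yes")


def parse_args(argv=None):
    """Parse the hyperparameters and the dataset location for a training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="google/vit-base-patch16-224-in21k")
    parser.add_argument("--num_train_epochs", type=int, default=5)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--weight_decay_rate", type=float, default=0.01)
    parser.add_argument("--num_warmup_steps", type=int, default=0)
    parser.add_argument("--hub_token", type=str, default=None)
    parser.add_argument("--hub_model_id", type=str, default=None)
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    # Falls back to the local notebook path when not running on SageMaker
    parser.add_argument(
        "--dataset_dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_DATASET", "EuroSAT/2750"),
    )
    return parser.parse_args(argv)


args = parse_args(["--num_train_epochs", "3", "--fp16", "False"])
print(args.num_train_epochs, args.fp16)  # 3 False
```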

Conclusion

We managed to successfully fine-tune a Vision Transformer using Transformers and Keras, without any heavy lifting or complex and unnecessary boilerplate code. New utilities like .to_tf_dataset improve the developer experience of the Hugging Face ecosystem and make it more Keras and TensorFlow friendly. Combining those new features with the Hugging Face Hub, we get a fully-managed MLOps pipeline for model versioning and experiment management using the Keras callback API.

Additionally, people can now leverage the Keras vision ecosystem together with Transformers to create their own custom models, including preprocessing layers or custom classification heads.


You can find the code here and feel free to open a thread on the forum.

Thanks for reading. If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.