
Getting started with CNNs by calculating LeNet-Layer manually


The idea of CNNs is to intelligently adapt to the properties of images by reducing their dimensions. To achieve this, convolutional layers and pooling layers are used. Convolutional layers reduce the dimensions by applying filters (kernel windows) to the input and computing new outputs. Assuming the input shape is $n_h \times n_w$ and the kernel window is $k_h \times k_w$, the output shape will be

$$(n_h - k_h + 1)\ \times\ (n_w - k_w + 1)$$

Pooling layers reduce the dimensions by aggregating the input elements. Assuming the input shape is $n_h \times n_w$ and the pooling method is average with a kernel window of $k_h \times k_w$, the output shape will be

$$(n - k + p + s)/s$$

The explanation for $p$ and $s$ follows in the Stride and Padding section.

Example CNN Architecture: LeNet-5

[Figure: LeNet-5 architecture]

To understand what is happening in each layer, we have to clarify a few basics. Let's start with stride and padding.

Stride and Padding

As described in the introduction, the goal of a CNN is to reduce the dimensions by applying layers. A tricky part of reducing dimensions is not to erase pieces of information from the original input. For example, if you have an input of 100 x 100 and apply 5 layers of 5 x 5 kernels, you reduce the dimensions to 80 x 80, erasing 20% of the input in only 5 layers. This is where stride and padding can be helpful.

$$(100_h - 5_h + 1)\ \times\ (100_w - 5_w + 1) = 96 \times 96, \quad \text{repeated 5 times}$$
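
As a quick sanity check, here is the same calculation as a plain Python loop:

```python
# Apply the output-shape formula (n - k + 1) five times,
# simulating 5 convolutional layers with 5 x 5 kernels.
n, k = 100, 5
for layer in range(1, 6):
    n = n - k + 1
    print(f"after layer {layer}: {n} x {n}")
# after layer 1: 96 x 96
# ...
# after layer 5: 80 x 80
```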

Padding

You can define padding as adding extra pixels as filler around the original input to reduce the loss of information.

[Figure: example of 1x1 padding]
$$(n_h - k_h + p_h + 1)\ \times\ (n_w - k_w + p_w + 1)$$

If we now add a 1x1 padding ($p_h = p_w = 1$) to our 100 x 100 input example, each layer outputs $100 - 5 + 1 + 1 = 97$, and after 5 layers the dimensions are only reduced to 85 x 85.

Stride

When convolving an input with a kernel window, you start at the top-left corner of the input and slide the window over all locations, from left to right and top to bottom. The default behavior is to slide by one element at a time. Sliding by one can be computationally inefficient; for example, with a 4K input image you don't want to compute an output at every single position. To optimize this, we can slide by more than one element at a time and thereby downsample the output. This step size is called stride.

[Figure: example of 1x1 stride]
[Figure: example of 2x2 stride]
$$(n_h - k_h + p_h + s_h)/s_h\ \times\ (n_w - k_w + p_w + s_w)/s_w$$

If we now add a 2x2 stride to our 100 x 100 input example with padding and apply only 1 layer, the dimension changes to $(100 - 5 + 1 + 2)/2 = 49$, i.e. 49 x 49. A stride of 0 or None simply means a stride of 1.
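
To make the formula concrete, here is a small Python helper (a sketch; the function name `output_size` is my own) that reproduces all three numbers from this section:

```python
def output_size(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output dimension for input size n, kernel k, padding p, stride s,
    following the formula (n - k + p + s) / s."""
    return (n - k + p + s) // s

print(output_size(100, 5))            # 96 (no padding, stride 1)
print(output_size(100, 5, p=1))       # 97 (-> 85 x 85 after 5 layers)
print(output_size(100, 5, p=1, s=2))  # 49 (padding plus a 2x2 stride)
```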

Pooling Layer

[Figure: pooling window]

Average 2D Pooling Layer

[Figure: average 2D pooling layer]

Max 2D Pooling Layer

[Figure: max 2D pooling layer]
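
The difference between the two methods: average pooling takes the mean of each kernel window, while max pooling takes its largest value. A minimal sketch, assuming PyTorch, that shows both on the same input:

```python
import torch
import torch.nn as nn

# One 4 x 4 input with shape (batch, channels, height, width)
x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # mean of each 2x2 window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # max of each 2x2 window

print(avg_pool(x))  # [[ 3.5,  5.5], [11.5, 13.5]]
print(max_pool(x))  # [[ 6.,  8.], [14., 16.]]
```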

Fully-Connected / Dense Layer

A fully-connected / dense layer represents a matrix-vector multiplication, where each input neuron is connected to each output neuron by a weight. A dense layer is used to change the dimensions of your input. Mathematically speaking, it applies a rotation, scaling, and translation transform to your vector.

Dense layers are calculated the same way as linear layers, $wx + b$, but the result is passed through an activation function.

$$(\text{current layer } n \cdot \text{previous layer } n\,(X \times X \times X)) + b$$
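
As a sketch in PyTorch (assuming PyTorch; `nn.Linear` computes $wx + b$), here is a 16 x 5 x 5 input, the shape LeNet-5 produces before its first dense layer, flattened and connected to 120 output neurons:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)        # (batch, channels, height, width)
fc = nn.Linear(16 * 5 * 5, 120)     # wx + b over the flattened input
out = torch.tanh(fc(x.flatten(1)))  # pass the result through an activation
print(out.shape)                    # torch.Size([1, 120])
```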

Calculating CNN-Layers in LeNet-5

For calculating the CNN layers we use the formula from Yann LeCun's LeNet-5 paper:

$$(n_h + 2p_h - f_h)/s_h + 1\ \times\ (n_w + 2p_w - f_w)/s_w + 1\ \times\ N_c$$

Variable definition

- $n$ = dimension of the input tensor
- $p$ = padding (32x32 with $p=1$ → 34x34)
- $f$ = filter size
- $N_c$ = number of filters
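
As a minimal sketch, this formula can be written as a small Python helper (the function name `conv_output` is my own):

```python
def conv_output(n: int, p: int, f: int, s: int, nc: int):
    """Output shape (n + 2p - f)/s + 1 per spatial dimension, times Nc filters."""
    out = (n + 2 * p - f) // s + 1
    return (out, out, nc)

# Sanity check against the earlier 100 x 100 example (single filter):
print(conv_output(n=100, p=0, f=5, s=1, nc=1))  # (96, 96, 1)
```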

LeNet-5 was trained with images of size 32x32x1. The first layer consists of 6 5x5 filters applied with a stride of 1. This results in the following variables:

Calculating the first layer

Variables are defined as:

- $n = 32$
- $p = 0$
- $f = 5$
- $s = 1$
- $N_c = 6$

Inserting the variables into the formula:

$$(32 + 2 \cdot 0 - 5)/1 + 1\ \times\ (32 + 2 \cdot 0 - 5)/1 + 1\ \times\ 6 = 28 \times 28 \times 6$$
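
The same result can be verified with a quick PyTorch sketch (assuming PyTorch):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
x = torch.randn(1, 1, 32, 32)  # one 32 x 32 x 1 image
print(conv1(x).shape)          # torch.Size([1, 6, 28, 28])
```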

Calculating Pooling-Layers in LeNet-5

LeNet-5 uses average pooling. Back when the paper was published, people used average pooling much more than max pooling.

$$(n - k + p + s)/s\ \times\ (n - k + p + s)/s\ \times\ N_c$$

Variable definition

- $n$ = dimension of the input tensor
- $k$ = pooling window size
- $p$ = padding (32x32 with $p=1$ → 34x34)
- $s$ = stride

Calculating the first pooling layer

Variables are defined as:

- $n = 28$
- $k = 2$
- $p = 0$
- $s = 2$
- $N_c = 6$

Inserting the variables into the formula:

$$(28 - 2 + 0 + 2)/2\ \times\ (28 - 2 + 0 + 2)/2\ \times\ 6 = 14 \times 14 \times 6$$
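
Verified again in PyTorch (assuming PyTorch):

```python
import torch
import torch.nn as nn

pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 6, 28, 28)  # output shape of the first conv layer
print(pool1(x).shape)          # torch.Size([1, 6, 14, 14])
```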

The calculation can now be done analogously for the remaining layers until you reach the 1x1xX output layer. Afterward, you use fully-connected layers and softmax for the classification.
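
Putting everything together, here is a minimal LeNet-5 sketch in PyTorch that prints the shape after every layer (a sketch following the original architecture; modern variants usually swap tanh for ReLU and average pooling for max pooling):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32x1  -> 28x28x6
    nn.AvgPool2d(2, stride=2),                   # 28x28x6  -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14x6  -> 10x10x16
    nn.AvgPool2d(2, stride=2),                   # 10x10x16 -> 5x5x16
    nn.Flatten(),                                # 5x5x16   -> 400
    nn.Linear(400, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                           # 10 class scores -> softmax
)

x = torch.randn(1, 1, 32, 32)
for layer in lenet5:
    x = layer(x)
    print(f"{layer.__class__.__name__:<9} {tuple(x.shape)}")
```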


Reference:
[vdumoulin/conv_arithmetic · GitHub](https://github.com/vdumoulin/conv_arithmetic)