class: center, middle

# Ch4: Convolutional Neural Networks and image classification

### Clément Dombry

.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[ Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on Github](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM! ]

---
## Used everywhere for Vision

.center[
] --- ## Many other applications
### Speech recognition & speech synthesis

### Natural Language Processing

### Protein/DNA binding prediction

### Any problem with a spatial (or sequential) structure

---
## ConvNets for image classification

CNN = Convolutional Neural Network = ConvNet
.center[
] .footnote.small[ LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. ] --- # Outline
### Convolutions

### CNN Architectures

### Transfer learning and data augmentation

---
class: middle, center

# Convolutions

---
## Motivations

Standard Dense Layer for an image input:

```python
from tensorflow.keras.layers import Input, Flatten, Dense

x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
x = Flatten()(x)
# shape of x is: (None, 640 x 480 x 3)
z = Dense(1000)(x)
```

How many parameters in the Dense layer?
$640 \times 480 \times 3 \times 1000 + 1000 = 922M!$

Spatial organization of the input is destroyed by `Flatten`

We never use Dense layers directly on large images. The standard solution is to use **convolution** layers

---
### Fully Connected Network: MLP

```python
input_image = Input(shape=(28, 28, 1))
x = Flatten()(input_image)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
mlp = Model(inputs=input_image, outputs=x)
```

### Convolutional Network

```python
input_image = Input(shape=(28, 28, 1))
*x = Conv2D(32, 5, activation='relu')(input_image)
*x = MaxPool2D(2, strides=2)(x)
*x = Conv2D(64, 3, activation='relu')(x)
*x = MaxPool2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
```

2D spatial organization of features is preserved until `Flatten`.

---
## Convolution in a neural network

.center[
.small[ Visualisation by V. Dumoulin available at https://github.com/vdumoulin/conv_arithmetic ]
]

Output obtained by sliding the $3 \times 3$ window and computing

$$
z(x) = relu(\mathbf{w}^T x + b)
$$

- $x$ is a **sliding** $3 \times 3$ chunk of the image (dark area)
- $\mathbf{w}$ is a **common** $3 \times 3$ weight matrix (small numbers)
- $b$ is a **common** bias (equal to $0$ here)

---
## Why Convolution?

### Local connectivity
- A neuron depends only on a few local input neurons
- Translation invariance

### Comparison to Fully connected
- Parameter sharing: reduces overfitting
- Makes use of spatial structure: a **strong prior** for vision!

### Animal Vision Analogy
- Hubel & Wiesel, Receptive Fields Of Single Neurons In The Cat's Striate Cortex (1959)

---
## Mathematical convolution

Convolution is a mathematical operator between two functions

### Discrete 1d convolution
$$
(f \star g) (x) = \sum\_{a+b=x} f(a) \cdot g(b) = \sum\_{a} f(a) \cdot g(x - a)
$$

### Discrete 2d convolution
$$
(f \star g) (x, y) = \sum_n \sum_m f(n, m) \cdot g(x - n, y - m)
$$

- $g$ is a 2d map representing the image
- $f$ is a convolution **kernel** or **filter** applied to $g$

---
## Image convolution

.center[
]

In practice, convolution takes the form

$$
(k \star im) (x, y) = \sum\limits\_{n=0}^2 \sum\limits\_{m=0}^2 k(n, m) \cdot im(x + n, y + m)
$$

where $im$ is a $5 \times 5$ image and $k$ a $3 \times 3$ kernel.
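A minimal NumPy sketch of this formula (the image and kernel values are toy examples):

```python
import numpy as np

def conv2d_valid(im, k):
    """'Valid' convolution (no padding) of a 2d image with a square kernel."""
    H, W = im.shape
    K = k.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for x in range(H - K + 1):
        for y in range(W - K + 1):
            out[x, y] = np.sum(k * im[x:x + K, y:y + K])
    return out

im = np.arange(25.0).reshape(5, 5)   # toy 5x5 image
k = np.ones((3, 3)) / 9              # 3x3 averaging kernel
print(conv2d_valid(im, k).shape)     # (3, 3)
```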
.small[Note the use of $(x+n,y+m)$ instead of $(x-n,y-m)$: this is in fact a cross-correlation, which is what deep learning libraries implement]

---
## Dimensions

.center[
] - Input dimension: `height x width` - Kernel size: `K x K`
  (usually K = 3, 5, 7 or 11)
- Output dimension:
  `(height - K + 1) x (width - K + 1)`
- Number of parameters: `K x K + 1`

---
## Colored images and channels

Colors encoded by three numbers (R, G, B)
Colored image = tensor of shape `(height, width, channels)`
Convolutions are usually computed for each channel and summed: .center[
]

$$
(k \star im^{color}) = \sum\limits\_{c=0}^2 k^c \star im^c
$$
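In NumPy, reusing `conv2d_valid` from the sketch above, the color case just sums the per-channel results:

```python
def conv2d_color(im, k):
    """im: (H, W, 3) image, k: (K, K, 3) kernel; returns a single 2d map."""
    return sum(conv2d_valid(im[:, :, c], k[:, :, c]) for c in range(3))
```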
---
## Multiple convolutions

.center[
]

---
## Multiple convolutions

.center[
] --- ## Multiple convolutions .center[
] --- ## Multiple convolutions .center[
] --- ## Multiple convolutions .center[
] -- Output dimension:
.small[`(height - K + 1) x (width - K + 1) x nb_conv`] Number of parameters:
.small[`(K x K x nb_input_channels + 1) x nb_conv`]

---
## Strides

- Strides: step size when sliding the convolution window
- Reduces the size of the output map

.center[
] .center.small[ Example with kernel size $3 \times 3$ and a stride of $2$
(image in blue, output in green)
]

---
## Padding

- Padding: artificially fill the borders of the image
- Usually: fill with zero values (0-padding)
- Useful to keep the spatial dimension constant across convolution layers

.center[
] .center.small[ Example with padding to keep constant image dimension
(image in blue, output in green) ] --- ## Dealing with shapes **Kernel** or **Filter** shape $(K, K, C^i, C^o)$ .left-column[ - $K \times K$ kernel size, - $C^i$ input channels - $C^o$ output channels ] .right-column[ .center[
] ] -- .reset-column[ ] **Number of parameters**: $(K \times K \times C^i + 1) \times C^o$ -- **Input and Output dimensions**: - Input $(W^i, H^i, C^i)$ - Output $(W^o, H^o, C^o)$
.small[
$W^o = (W^i - K + 2P) / S + 1$

$H^o = (H^i - K + 2P) / S + 1$

with $P$ the padding and $S$ the stride
]

---
## Pooling

- Spatial dimension reduction
- Local invariance
- No parameters: max or average of 2x2 units
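A quick Keras check of the shape and parameter formulas above, including the effect of pooling (the layer sizes here are arbitrary):

```python
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D
from tensorflow.keras.models import Model

x = Input((64, 64, 3))                            # W = H = 64, C = 3
h = Conv2D(32, 5, strides=2, padding='same')(x)   # (64 - 5 + 2*2)//2 + 1 = 32
h = MaxPool2D(2, strides=2)(h)                    # halves spatial dims: 16x16
Model(x, h).summary()                             # Conv2D params: (5*5*3 + 1)*32 = 2,432
```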
.center[
] .footnote.small[ Schematic from Stanford http://cs231n.github.io/convolutional-networks ] --- ## Pooling - Spatial dimension reduction - Local invariance - No parameters: max or average of 2x2 units
.center[
]

---
class: middle, center

# Architectures

---
## Classic ConvNet Architecture

### Input

### Conv blocks
- Convolution + activation (relu)
- Convolution + activation (relu)
- ...
- Maxpooling 2x2

### Output
- Fully connected layers
- Softmax

---
## AlexNet

.center[
]

.small[ Simplified version of Krizhevsky, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012 ]
Not a `Sequential` model in Keras (the diagram shows two parallel branches)

---
## AlexNet

.center[
]

First convolutional layer
- Input shape: `227x227x3`
- Kernel shape: `(11,11,3,96)`, stride 4
- Output shape: `(55,55,96)`
- Number of parameters: `34,944`
- Equivalent MLP parameters: `44 x 1e9`
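These numbers can be checked directly in Keras (a sketch of this first layer only):

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

x = Input((227, 227, 3))
y = Conv2D(96, 11, strides=4)(x)   # no padding: (227 - 11)/4 + 1 = 55
Model(x, y).summary()              # output (55, 55, 96); (11*11*3 + 1)*96 = 34,944 params
```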
---
## Hierarchical representation

.center[
]

---
## VGG-16

.center[
] .small[
Simonyan, Karen, and Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
]

A `Sequential` model in Keras (a single branch)

---
## VGG in Keras

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# 'same' padding keeps the spatial dimensions constant within each block
model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                 input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
```

---
## Memory and Parameters

```md
           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

---
## Memory and Parameters

```md
           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
*CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
*CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
*FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

---
.left-column[
## ResNet
]

.footnote.small[
.left-column[
He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016.
]
]

.right-column[
.center[
] ] Even deeper models: 34, 50, 101, 152 layers --- .left-column[ ## ResNet ] .footnote.small[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ] .right-column[ .center[
] ] #### Residual block learns residual w.r.t. identity .center[
]
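A minimal Keras sketch of a residual block (illustrative only: the actual ResNet blocks also use batch normalization, and 1x1 bottleneck convolutions in the deeper variants; `filters` must match the channel count of `x` for the addition):

```python
from tensorflow.keras.layers import Conv2D, Activation, Add

def residual_block(x, filters):
    """Returns relu(F(x) + x): the convolutions only have to learn the residual F."""
    h = Conv2D(filters, 3, padding='same', activation='relu')(x)
    h = Conv2D(filters, 3, padding='same')(h)
    h = Add()([h, x])                # skip connection
    return Activation('relu')(h)
```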
#### Good optimization properties --- .left-column[ ## ResNet ] .footnote.small[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ] .right-column[ .center[
]

ResNet-50 compared to VGG:

#### Superior accuracy in all vision tasks

**5.25%** top-5 error vs 7.1%

#### Fewer parameters
**25M** vs 138M #### Fully Convolutional until the last layer --- ## Deeper is better .center[
]

.footnote.small[ from Kaiming He's slides "Deep residual learning for image recognition." ICML. 2016. ]

---
## State of the art

- Finding the right architecture: an active area of research

.center[
.small[ He, Deep residual learning for image recognition, ICML, 2016. ]
]

- Engineering of modular building blocks
- See also: DenseNets, Wide ResNets, Fractal ResNets, ResNeXts, Pyramidal ResNets ...

---
## State of the art

#### Top-1 accuracy, performance and size on ImageNet

.center[
] See also: https://paperswithcode.com/sota/image-classification-on-imagenet .footnote.small[ Canziani, Paszke, and Culurciello. "An Analysis of Deep Neural Network Models for Practical Applications." (May 2016). ] --- ## State of the art .center[
]

.footnote.small[ Meta Pseudo Labels, Hieu Pham et al. (Jan 2021) ]

---
class: middle, center

# Transfer learning
# and
# Data augmentation

---
## Pre-trained models

Training a model on ImageNet from scratch takes **days or weeks**.

Many models trained on ImageNet are publicly available, along with their weights!

More generally, Keras provides many pre-trained models
[documentation](https://keras.io/api/applications/)

---
## Transfer learning

- Use pre-trained weights, remove the last layers to compute representations of images
- Train a classification model from these features on a new classification task
- The network is used as a generic feature extractor
- Better than handcrafted feature extraction on natural images

---
## Fine-tuning

Retraining some parameters of the network (requires enough data)

- Truncate the last layer of the pre-trained network
- Freeze the layer weights
- Add a (linear) classifier on top and train it for a few epochs
- Unfreeze all the layer weights (or only the deepest few)
- Fine-tune the whole network
- Use a smaller learning rate when fine-tuning

Uses the **trainable** attribute of layers in Keras, as in the sketch below
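A minimal sketch of this recipe with a pre-trained ResNet50 base (`n_classes` and `train_ds` are placeholders for your own task and data):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base = ResNet50(weights='imagenet', include_top=False)  # pre-trained, headless
x = GlobalAveragePooling2D()(base.output)
out = Dense(n_classes, activation='softmax')(x)         # new classifier head
model = Model(base.input, out)

base.trainable = False                                  # 1) train only the head
model.compile(optimizer=Adam(1e-3), loss='categorical_crossentropy')
model.fit(train_ds, epochs=3)

base.trainable = True                                   # 2) unfreeze and fine-tune
model.compile(optimizer=Adam(1e-5), loss='categorical_crossentropy')
model.fit(train_ds, epochs=3)
```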
[documentation](https://keras.io/api/layers/base_layer). --- ## Data Augmentation .center[
] .center[
]

See also: [RandAugment](https://arxiv.org/abs/1909.13719) and [Unsupervised Data Augmentation](https://arxiv.org/abs/1904.12848).

???
- Use prior knowledge on label-invariant transformations
- A rotated picture of a cat is still a picture of a cat.
- Effective way to reduce overfitting on small labeled sets.
- Can be very efficient in a semi-supervised setting when combined with a consistency loss.

---
## Data Augmentation

With Keras:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_gen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    channel_shift_range=9,
    fill_mode='nearest'
)

# train_folder: directory with one sub-folder of images per class
train_flow = image_gen.flow_from_directory(train_folder)
model.fit(train_flow, steps_per_epoch=train_flow.n // train_flow.batch_size)
```
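Recent Keras versions can also express augmentation as preprocessing layers placed at the start of the model (a sketch, assuming TensorFlow >= 2.6; the ranges are illustrative):

```python
from tensorflow.keras import Sequential, layers

# Only active during training; acts as the identity at inference time
augment = Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),   # rotations up to +/- 10% of a full turn
    layers.RandomZoom(0.2),
])
```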