class: center, middle

# Ch4: Convolutional Neural Networks and image classification

### Clément Dombry

.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[ Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on Github](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM! ]

---
## Used everywhere for Vision

.center[
] --- ## Many other applications
### Speech recognition & speech synthesis

### Natural Language Processing

### Protein/DNA binding prediction

### Any problem with a spatial (or sequential) structure

---
## ConvNets for image classification

CNN = Convolutional Neural Network = ConvNet
.center[
] .footnote.small[ LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. ] --- # Outline
### Convolutions

### CNN Architectures

### Transfer learning and data augmentation

---
class: middle, center

# Convolutions

---
## Motivations

Standard Dense Layer for an image input:

```python
from tensorflow.keras.layers import Input, Flatten, Dense

x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
x = Flatten()(x)
# shape of x is: (None, 640 x 480 x 3)
z = Dense(1000)(x)
```

How many parameters in the Dense layer?
$640 \times 480 \times 3 \times 1000 + 1000 = 922M!$

Spatial organization of the input is destroyed by `Flatten`

We never use Dense layers directly on large images. The standard solution is to use **convolution** layers

---
### Fully Connected Network: MLP

```python
input_image = Input(shape=(28, 28, 1))
x = Flatten()(input_image)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
mlp = Model(inputs=input_image, outputs=x)
```

### Convolutional Network

```python
input_image = Input(shape=(28, 28, 1))
*x = Conv2D(32, 5, activation='relu')(input_image)
*x = MaxPool2D(2, strides=2)(x)
*x = Conv2D(64, 3, activation='relu')(x)
*x = MaxPool2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
```

2D spatial organization of features is preserved until `Flatten`.

---
## Convolution in a neural network

.center[
.small[ Visualisation by V. Dumoulin available at https://github.com/vdumoulin/conv_arithmetic ]
]

Output obtained by sliding the $3 \times 3$ window and computing

$$
z(x) = relu(\mathbf{w}^T x + b)
$$

- $x$ is a **sliding** $3 \times 3$ chunk of the image (dark area)
- $\mathbf{w}$ is a **common** $3 \times 3$ weight matrix (small numbers)
- $b$ is a **common** bias (equal to $0$ here)

---
## Why Convolution?

### Local connectivity
- A neuron depends only on a few local input neurons
- Translation invariance

### Comparison to Fully connected
- Parameter sharing: reduces overfitting
- Makes use of spatial structure: a **strong prior** for vision!

### Animal Vision Analogy
- Hubel & Wiesel, Receptive Fields Of Single Neurons In The Cat's Striate Cortex (1959)

---
## Mathematical convolution

Convolution is a mathematical operator between two functions

### Discrete 1d convolution
$$
(f \star g) (x) = \sum\_{a+b=x} f(a) \cdot g(b) = \sum\_{a} f(a) \cdot g(x - a)
$$

### Discrete 2d convolution
$$
(f \star g) (x, y) = \sum_n \sum_m f(n, m) \cdot g(x - n, y - m)
$$

- $g$ is a 2d map representing the image
- $f$ is a convolution **kernel** or **filter** applied to $g$

---
## Image convolution

.center[
]

In practice, convolution takes the form

$$
(k \star im) (x, y) = \sum\limits\_{n=0}^2 \sum\limits\_{m=0}^2 k(n, m) \cdot im(x + n, y + m)
$$

where $im$ is a $5 \times 5$ image and $k$ a $3 \times 3$ kernel.
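A minimal NumPy sketch of this formula (the image and kernel values are toy examples):

```python
import numpy as np

def conv2d_valid(im, k):
    """'Valid' convolution (no padding) of a 2d image with a square kernel."""
    H, W = im.shape
    K = k.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for x in range(H - K + 1):
        for y in range(W - K + 1):
            out[x, y] = np.sum(k * im[x:x + K, y:y + K])
    return out

im = np.arange(25.0).reshape(5, 5)   # toy 5x5 image
k = np.ones((3, 3)) / 9              # 3x3 averaging kernel
print(conv2d_valid(im, k).shape)     # (3, 3)
```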
.small[Note the use of $(x+n,y+m)$ instead of $(x-n,y-m)$: this is in fact a cross-correlation, which is what deep learning libraries implement]

---
## Dimensions

.center[
] - Input dimension: `height x width` - Kernel size: `K x K`
  (usually K = 3, 5, 7 or 11)
- Output dimension:
  `(height - K + 1) x (width - K + 1)`
- Number of parameters: `K x K + 1`

---
## Colored images and channels

Colors encoded by three numbers (R, G, B)
Colored image = tensor of shape `(height, width, channels)`
Convolutions are usually computed for each channel and summed: .center[
]

$$
(k \star im^{color}) = \sum\limits\_{c=0}^2 k^c \star im^c
$$
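In NumPy, reusing `conv2d_valid` from the sketch above, the color case just sums the per-channel results:

```python
def conv2d_color(im, k):
    """im: (H, W, 3) image, k: (K, K, 3) kernel; returns a single 2d map."""
    return sum(conv2d_valid(im[:, :, c], k[:, :, c]) for c in range(3))
```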
---
## Multiple convolutions

.center[
]

---
## Multiple convolutions

.center[
] --- ## Multiple convolutions .center[
] --- ## Multiple convolutions .center[
] --- ## Multiple convolutions .center[
] -- Output dimension:
.small[`(height - K + 1) x (width - K + 1) x nb_conv`] Number of parameters:
.small[`(K x K x nb_input_channels + 1) x nb_conv`]

---
## Strides

- Strides: step size when sliding the convolution window
- Reduces the size of the output map

.center[
] .center.small[ Example with kernel size $3 \times 3$ and a stride of $2$
(image in blue, output in green)
]

---
## Padding

- Padding: artificially fill the borders of the image
- Usually: fill with zero values (0-padding)
- Useful to keep the spatial dimension constant across convolution layers

.center[
] .center.small[ Example with padding to keep constant image dimension
(image in blue, output in green) ] --- ## Dealing with shapes **Kernel** or **Filter** shape $(K, K, C^i, C^o)$ .left-column[ - $K \times K$ kernel size, - $C^i$ input channels - $C^o$ output channels ] .right-column[ .center[
] ] -- .reset-column[ ] **Number of parameters**: $(K \times K \times C^i + 1) \times C^o$ -- **Input and Output dimensions**: - Input $(W^i, H^i, C^i)$ - Output $(W^o, H^o, C^o)$
.small[
$W^o = (W^i - K + 2P) / S + 1$

$H^o = (H^i - K + 2P) / S + 1$

with $P$ the padding and $S$ the stride
]

---
## Pooling

- Spatial dimension reduction
- Local invariance
- No parameters: max or average of 2x2 units
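A quick Keras check of the shape and parameter formulas above, including the effect of pooling (the layer sizes here are arbitrary):

```python
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D
from tensorflow.keras.models import Model

x = Input((64, 64, 3))                            # W = H = 64, C = 3
h = Conv2D(32, 5, strides=2, padding='same')(x)   # (64 - 5 + 2*2)//2 + 1 = 32
h = MaxPool2D(2, strides=2)(h)                    # halves spatial dims: 16x16
Model(x, h).summary()                             # Conv2D params: (5*5*3 + 1)*32 = 2,432
```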
.center[
] .footnote.small[ Schematic from Stanford http://cs231n.github.io/convolutional-networks ] --- ## Pooling - Spatial dimension reduction - Local invariance - No parameters: max or average of 2x2 units
.center[
]

---
class: middle, center

# Architectures

---
## Classic ConvNet Architecture

### Input

### Conv blocks
- Convolution + activation (relu)
- Convolution + activation (relu)
- ...
- Maxpooling 2x2

### Output
- Fully connected layers
- Softmax

---
## AlexNet

.center[
]

.small[ Simplified version of Krizhevsky, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012 ]
Not a `Sequential` model in Keras (the diagram shows two parallel branches)

---
## AlexNet

.center[
]

First convolutional layer
- Input shape: `227x227x3`
- Kernel shape: `(11,11,3,96)`, stride 4
- Output shape: `(55,55,96)`
- Number of parameters: `34,944`
- Equivalent MLP parameters: `44 x 1e9`
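These numbers can be checked directly in Keras (a sketch of this first layer only):

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

x = Input((227, 227, 3))
y = Conv2D(96, 11, strides=4)(x)   # no padding: (227 - 11)/4 + 1 = 55
Model(x, y).summary()              # output (55, 55, 96); (11*11*3 + 1)*96 = 34,944 params
```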
---
## Hierarchical representation

.center[
]

---
## VGG-16

.center[
] .small[
Simonyan, Karen, and Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
]

A `Sequential` model in Keras (a single branch)

---
## VGG in Keras

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# 'same' padding keeps the spatial dimensions constant within each block
model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                 input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
```

---
## Memory and Parameters

```md
           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

---
## Memory and Parameters

```md
           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
*CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
*CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
*FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

---
.left-column[
## ResNet
]

.footnote.small[
.left-column[
He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016.
]
]

.right-column[
.center[
] ] Even deeper models: 34, 50, 101, 152 layers --- .left-column[ ## ResNet ] .footnote.small[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ] .right-column[ .center[
] ] #### Residual block learns residual w.r.t. identity .center[
]
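A minimal Keras sketch of a residual block (illustrative only: the actual ResNet blocks also use batch normalization, and 1x1 bottleneck convolutions in the deeper variants; `filters` must match the channel count of `x` for the addition):

```python
from tensorflow.keras.layers import Conv2D, Activation, Add

def residual_block(x, filters):
    """Returns relu(F(x) + x): the convolutions only have to learn the residual F."""
    h = Conv2D(filters, 3, padding='same', activation='relu')(x)
    h = Conv2D(filters, 3, padding='same')(h)
    h = Add()([h, x])                # skip connection
    return Activation('relu')(h)
```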
#### Good optimization properties --- .left-column[ ## ResNet ] .footnote.small[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ] .right-column[ .center[
]

ResNet-50 compared to VGG:

#### Superior accuracy in all vision tasks

**5.25%** top-5 error vs 7.1%

#### Fewer parameters
**25M** vs 138M #### Fully Convolutional until the last layer --- ## Deeper is better .center[
]

.footnote.small[ from Kaiming He's slides "Deep residual learning for image recognition." ICML. 2016. ]

---
## State of the art

- Finding the right architecture: an active area of research

.center[
.small[ He, Deep residual learning for image recognition, ICML, 2016. ]
]

- Engineering of modular building blocks
- See also: DenseNets, Wide ResNets, Fractal ResNets, ResNeXts, Pyramidal ResNets ...

---
## State of the art

#### Top-1 accuracy, performance and size on ImageNet

.center[
] See also: https://paperswithcode.com/sota/image-classification-on-imagenet .footnote.small[ Canziani, Paszke, and Culurciello. "An Analysis of Deep Neural Network Models for Practical Applications." (May 2016). ] --- ## State of the art .center[
]

.footnote.small[ Meta Pseudo Labels, Hieu Pham et al. (Jan 2021) ]

---
class: middle, center

# Transfer learning
# and
# Data augmentation

---
## Pre-trained models

Training a model on ImageNet from scratch takes **days or weeks**.

Many models trained on ImageNet are publicly available, along with their weights!

More generally, Keras provides many pre-trained models
[documentation](https://keras.io/api/applications/)

---
## Transfer learning

- Use pre-trained weights, remove the last layers to compute representations of images
- Train a classification model from these features on a new classification task
- The network is used as a generic feature extractor
- Better than handcrafted feature extraction on natural images

---
## Fine-tuning

Retraining some parameters of the network (requires enough data)

- Truncate the last layer of the pre-trained network
- Freeze the layer weights
- Add a (linear) classifier on top and train it for a few epochs
- Unfreeze all the layer weights (or only the deepest few)
- Fine-tune the whole network
- Use a smaller learning rate when fine-tuning

Uses the **trainable** attribute of layers in Keras, as in the sketch below
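A minimal sketch of this recipe with a pre-trained ResNet50 base (`n_classes` and `train_ds` are placeholders for your own task and data):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base = ResNet50(weights='imagenet', include_top=False)  # pre-trained, headless
x = GlobalAveragePooling2D()(base.output)
out = Dense(n_classes, activation='softmax')(x)         # new classifier head
model = Model(base.input, out)

base.trainable = False                                  # 1) train only the head
model.compile(optimizer=Adam(1e-3), loss='categorical_crossentropy')
model.fit(train_ds, epochs=3)

base.trainable = True                                   # 2) unfreeze and fine-tune
model.compile(optimizer=Adam(1e-5), loss='categorical_crossentropy')
model.fit(train_ds, epochs=3)
```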
[documentation](https://keras.io/api/layers/base_layer). --- ## Data Augmentation .center[
] .center[
]

See also: [RandAugment](https://arxiv.org/abs/1909.13719) and [Unsupervised Data Augmentation](https://arxiv.org/abs/1904.12848).

???
- Use prior knowledge on label-invariant transformations
- A rotated picture of a cat is still a picture of a cat.
- Effective way to reduce overfitting on small labeled sets.
- Can be very efficient in a semi-supervised setting when combined with a consistency loss.

---
## Data Augmentation

With Keras:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_gen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    channel_shift_range=9,
    fill_mode='nearest'
)

# train_folder: directory with one sub-folder of images per class
train_flow = image_gen.flow_from_directory(train_folder)
model.fit(train_flow, steps_per_epoch=train_flow.n // train_flow.batch_size)
```
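Recent Keras versions can also express augmentation as preprocessing layers placed at the start of the model (a sketch, assuming TensorFlow >= 2.6; the ranges are illustrative):

```python
from tensorflow.keras import Sequential, layers

# Only active during training; acts as the identity at inference time
augment = Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),   # rotations up to +/- 10% of a full turn
    layers.RandomZoom(0.2),
])
```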