class: center, middle # Ch3: Optimization, Initialization, Regularization ### Clément Dombry .affiliations[ ![UBFC](images/logo-UBFC.jpg) ![M2S](images/logo-m2s.png) ![LmB](images/logo-lmb.jpg) ] .credits[ Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on Github](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM! ] --- ## Goal of the chapter - Learn about advanced optimization algorithms
momentum, Nesterov acceleration, RMSProp, Adadelta, Adam...
- Discuss parameter initialization
random initialization, Glorot and Bengio heuristic
- Learn about regularization techniques
overfitting diagnostic, early stopping, penalization, dropout.
--- ## Stochastic gradient descent .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- crucial choice of learning rate
low $\rightsquigarrow$ slow convergence, high $\rightsquigarrow$ early plateau or even divergence - requires learning rate scheduling $\epsilon \rightsquigarrow (\epsilon_k)$
--- ## Learning rate scheduling - Common strategies
  - try a large value first: $\epsilon = 0.1$ or even $\epsilon = 1$
  - divide by 10 and retry in case of divergence
- Large constant LR prevents final convergence
  - multiply $\epsilon$ by $\beta < 1$ after each update
  - monitor the loss and divide $\epsilon$ by 2 or 10 when no progress is made
  - see [ReduceLROnPlateau](https://keras.io/callbacks/#reducelronplateau) in Keras (sketch below)
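Both strategies map to standard Keras callbacks. A minimal sketch, assuming a compiled `model` and placeholder arrays `X_train`, `y_train`:

```py
from tensorflow import keras

# Exponential decay: multiply the learning rate by beta < 1
# (the slide's per-update rule, applied here at epoch granularity).
decay = keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr * 0.95)

# Monitor the validation loss and divide the learning rate by 10 on a plateau.
plateau = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

model.fit(X_train, y_train, validation_split=0.1, epochs=100,
          callbacks=[decay, plateau])
```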
??? Overview of recent research and empirical tricks: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10 Increase LR linearly when increasing batch-size. --- ## Momentum .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Polyak, 1964.
- more stable than plain SGD
- typical choices include $\alpha=0.5, 0.9, 0.99$ (see the sketch below).
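In Keras, the momentum parameter $\alpha$ corresponds to the `momentum` argument of the SGD optimizer. A minimal sketch, assuming a `model` already defined:

```py
from tensorflow import keras

# SGD with heavy-ball momentum; 0.9 is a common default choice.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt)
```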
--- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- ## Nesterov acceleration .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Nesterov, 1983; Sutskever *et al.*, 2013.
- theoretically faster for batch convex optimization ($m=n$)
convergence in $O(1/k^2)$ instead of $O(1/k)$, where $k$ is the number of iterations (Keras usage sketched below).
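In Keras, Nesterov acceleration is a simple flag on the same optimizer (a sketch, assuming a `model` already defined):

```py
from tensorflow import keras

# Nesterov momentum: the gradient is evaluated after the momentum step.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=opt)
```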
--- ## AdaGrad .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Duchi *et al.*, 2011
- adaptive learning rate
- inversely proportional to the square root of the sum of the squared past gradients (see the sketch below)
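A minimal NumPy sketch of the AdaGrad update (variable names are illustrative, not from the slide); the built-in `keras.optimizers.Adagrad` implements the same rule:

```py
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step on the parameter vector theta."""
    accum = accum + grad ** 2                            # running sum of squared gradients
    theta = theta - lr * grad / (np.sqrt(accum) + eps)   # per-coordinate scaled step
    return theta, accum
```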
--- ## RMSProp .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Hinton, 2012
- adaptive learning rate
- exponentially weighted moving average of the squared gradient (see the sketch below)
- variant: RMSProp with momentum
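A sketch of the RMSProp update, which replaces AdaGrad's sum by an exponentially weighted moving average (illustrative names). In Keras, `keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0)` exposes the same parameters, including the momentum variant.

```py
import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp step; avg_sq is the moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2        # exponentially weighted average
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)  # per-coordinate scaled step
    return theta, avg_sq
```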
--- ## ADAM = ADAptive Moments .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Kingma and Ba, 2014.
- often considered the state-of-the-art optimizer (a good default choice); the update is sketched below
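A sketch of the Adam update, combining RMSProp-style scaling with momentum-style first moments and bias correction (illustrative names):

```py
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections for the
    v_hat = v / (1 - beta2 ** t)                 # zero initialization of m and v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```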
--- ## Optimizers around a saddle point .center[
Credits: Alec Radford ] --- ## Optimizers in Keras
- In Keras, the optimizer is declared during the compilation step
- model definition (architecture, layers, activation functions)
- model compilation (loss, metrics, optimizer)
- model training (with a validation set to monitor progress; see the sketch below)
- model evaluation and prediction (test set, production)
Ex: SGD optimizer in multiclass classification
```py
*opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=opt)
# equivalently, pass the optimizer by name (default parameters will be used)
model.compile(loss='categorical_crossentropy', optimizer='SGD')
```
Ex: Adam optimizer for regression
```py
*opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='mean_squared_error', optimizer=opt)
# equivalently, pass the optimizer by name (default parameters will be used)
model.compile(loss='mean_squared_error', optimizer='adam')
```
.center[[optimizers in Keras](https://keras.io/api/optimizers)]
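To complete the workflow listed above, a sketch of the remaining steps (the arrays `X_train`, `y_train`, `X_test`, `y_test` are placeholders):

```py
history = model.fit(X_train, y_train,
                    validation_split=0.1,      # fraction held out as validation set
                    epochs=50, batch_size=32)  # model training

test_loss = model.evaluate(X_test, y_test)     # model evaluation on the test set
predictions = model.predict(X_test)            # predictions (production)
```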
--- class: center,middle # Initialization --- ## Initialization - Normalisation of input is important so that all variables have approx. the same range
  - standardization (transformed input with zero mean and unit variance)
  - quantile normalization (transformed input uniform on $[0,1]$)
- Optimization for convex problems: theoretical guarantees
stochastic gradient descent converges for all initialization
- For non-convex problems: no guarantees, and initialization is crucial in practice --- ## Initialization strategies
- Bias vectors $b^h$, $b^o$ can (should) be initialized to $0$.
- Weight vectors shouldn't be initialized to $0$.
- zero is a saddle point with $0$ gradient
$\rightsquigarrow$ no gradient, no learning.
- non-zero constant initialization suffers from symmetry issues.
- e.g. random initialization with small weights, $W\_{i,j}\sim \mathcal{N}(0, 0.01)$
--- ## Initialization strategies State of the art: initialization depends on layer dimension
$m$ = input dim, $n$ = output dim, weight $W\in\mathbb{R}^{n\times m}$.
- Glorot and Bengio (2010): uniform distribution $$W\_{ij}\sim U\Big(-\sqrt{\frac{6}{m+n}},\sqrt{\frac{6}{m+n}}\Big)$$
- He et al. (2015): normal distribution $$W\_{ij}\sim \mathcal{N}\Big(0,\ \mathrm{sd}=\sqrt{\tfrac{2}{m}}\Big)$$
- Saxe et al. (2014): random orthogonal matrices for $W$
More on initialization strategies in Goodfellow, section 8.4. --- ## Initialization in Keras
In Keras, parameter initialization is declared when creating the layer.
```py
layer = layers.Dense(
    units=64,
*   kernel_initializer='random_normal',
*   bias_initializer='zeros'
)
```
```py
layer = layers.Dense(
    units=64,
*   kernel_initializer=initializers.RandomNormal(stddev=0.01),
*   bias_initializer=initializers.Zeros()
)
```
.center[[initializers in Keras](https://keras.io/api/layers/initializers/)]
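For instance, the schemes of the previous slide are available as built-in initializers (a sketch using their Keras string names):

```py
from tensorflow.keras import layers

# Glorot (Xavier) uniform initialization
layer_glorot = layers.Dense(units=64, kernel_initializer='glorot_uniform')

# He normal initialization, well suited to ReLU activations
layer_he = layers.Dense(units=64, kernel_initializer='he_normal')

# Random orthogonal initialization (Saxe et al., 2014)
layer_orth = layers.Dense(units=64, kernel_initializer='orthogonal')
```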
Question: find in the Keras documentation the default initialization for a dense layer. --- class: center,middle # Regularization --- ## Architecture
- Overfitting is more likely to occur with highly complex networks.
$\rightsquigarrow$ large width and depth imply high dimensional parameters.
$\rightsquigarrow$ very flexible model prone to overfitting.
- Very deep neural networks need huge datasets to be trained.
$\rightsquigarrow$ data augmentation techniques artificially enlarge the training set (see the sketch below).
- Good practice: dimension your network with respect to the amount of training data at your disposal.
- A smaller network with similar performance should be preferred.
- Compare the loss/metrics on training and validation sets to diagnose overfitting.
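As an illustration of data augmentation for images, recent Keras versions provide preprocessing layers that apply random transformations during training only (a sketch; the exact layer names assume a recent TensorFlow/Keras release):

```py
from tensorflow import keras
from tensorflow.keras import layers

# Random image transformations, active in training mode only.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # rotations up to +/- 10% of a full circle
    layers.RandomZoom(0.1),
])
```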
--- ## Learning curves
.center[
] .credits[Shorten & Khoshgoftaar (2016). A survey on Image Data Augmentation for Deep Learning. Journal of Big Data. ] --- ## Early stopping
- During training, the following often happens:
  - during the first epochs, the loss decreases on both training and validation sets;
  - at some point, the loss still decreases on the training set, but reaches a plateau or even increases on the validation set.
$\rightsquigarrow$ this indicates overfitting
- We are not interested in models that truly minimize the training loss, but in models with good generalization capacity.
- Early stopping consists in stopping training when the validation loss reaches a plateau (as in gradient boosting); see the Keras sketch below.
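A minimal sketch with the Keras `EarlyStopping` callback (assuming a compiled `model` and placeholder training arrays):

```py
from tensorflow import keras

# Stop when the validation loss has not improved for 10 epochs
# and restore the weights of the best epoch seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                           restore_best_weights=True)

model.fit(X_train, y_train, validation_split=0.1, epochs=200,
          callbacks=[early_stop])
```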
--- ## Learning curves .center[
] .center[$\rightsquigarrow$ stop after 50 epochs] --- ## Penalisation
- As in linear regression, penalisation of the coefficients may be used for regularization
l2-penalty (ridge), l1-penalty (lasso), l1+l2-penalty (elastic-net)
- In deep learning, one can similarly penalize large values in the weights $W$, but the bias $b$ is usually not penalized.
- In Keras, this is simply declared when defining the layer
```py
from tensorflow.keras import layers
from tensorflow.keras import regularizers

layer = layers.Dense(
    units=64,
*   kernel_regularizer=regularizers.l2(1e-4),
)
```
.center[ [layer regularizers in Keras](https://keras.io/api/layers/regularizers/) ] --- ## Beware of interpretation! Sometimes the validation loss is smaller than the training loss. .center[
] The most common reason is regularization, since it applies during training but not during validation. --- ## Dropout
- Randomly set activations to $0$ with probability $p$
- Bernoulli mask sampled for each forward pass / backward pass pair
- Typically only enabled at training time
] .credits[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., _Journal of Machine Learning Research_ 2014] --- ## Dropout .center[
] .credits[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., _Journal of Machine Learning Research_ 2014]
- At test time, multiply the weights by the keep probability $1-p$ to preserve the expected level of activation
- Equivalently, Keras rather scales activations by a factor $1/(1-p)$ during training (inverted dropout). --- ## Dropout ### Interpretation
- Reduces the network's dependency on individual neurons
- More redundant representation of the data
### Ensemble interpretation
- Equivalent to training a large ensemble of shared-parameter, binary-masked models
- Each model is only trained on a single data point --- ## Numerical experiment .center[
]
This dataset has few samples and ~10% noisy labels.
The model is seriously overparametrized (3 wide hidden layers).
The training loss goes to zero while the validation loss stops decreasing after a few epochs and starts increasing again: this is a serious case of overfitting.
--- ## A bit of dropout .center[
]
With dropout, training is much slower and the training loss shows many random bumps caused by the additional variance in the SGD updates with dropout.
The validation loss on the other hand stays closer to the training loss and can reach a slightly lower level than the model without dropout: overfitting is reduced but not completely solved.
--- ## Too much dropout: underfitting .center[
] --- ## Implementation in Keras
```py
model = Sequential()
model.add(Dense(hidden_size, input_shape=input_shape, activation='relu'))
*model.add(Dropout(rate=0.5))
model.add(Dense(hidden_size, activation='relu'))
*model.add(Dropout(rate=0.5))
model.add(Dense(output_size, activation='softmax'))
```