class: center, middle # Ch3: Optimization, Initialization, Regularization ### Clément Dombry .affiliations[ ![UBFC](images/logo-UBFC.jpg) ![M2S](images/logo-m2s.png) ![LmB](images/logo-lmb.jpg) ] .credits[ Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on Github](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM! ] --- ## Goal of the chapter - Learn about advanced optimization algorithms
momentum, Nesterov acceleration, RMSProp, Adadelta, Adam...
- Discuss parameter initialization
random initialization, Glorot and Bengio heuristic
- Learn about regularization techniques
overfitting diagnostic, early stopping, penalization, dropout.
--- ## Stochastic gradient descent .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- crucial choice of learning rate
low $\rightsquigarrow$ slow convergence, high $\rightsquigarrow$ early plateau or even divergence - requires learning rate scheduling $\epsilon \rightsquigarrow (\epsilon_k)$
--- ## Learning rate scheduling - Common strategies
  - try a large value first: $\epsilon = 0.1$ or even $\epsilon = 1$
  - divide by 10 and retry in case of divergence
- Large constant LR prevents final convergence
  - multiply $\epsilon$ by $\beta < 1$ after each update
  - monitor the loss and divide $\epsilon$ by 2 or 10 when no progress is made
  - see [ReduceLROnPlateau](https://keras.io/callbacks/#reducelronplateau) in Keras (sketch below)
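Both strategies map to standard Keras callbacks. A minimal sketch, assuming a compiled `model` and placeholder arrays `X_train`, `y_train`:

```py
from tensorflow import keras

# Exponential decay: multiply the learning rate by beta < 1
# (the slide's per-update rule, applied here at epoch granularity).
decay = keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr * 0.95)

# Monitor the validation loss and divide the learning rate by 10 on a plateau.
plateau = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

model.fit(X_train, y_train, validation_split=0.1, epochs=100,
          callbacks=[decay, plateau])
```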
??? Overview of recent research and empirical tricks: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10 Increase LR linearly when increasing batch-size. --- ## Momentum .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Polyak, 1964.
- more stable than plain SGD
- typical choices include $\alpha=0.5, 0.9, 0.99$ (see the sketch below).
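In Keras, the momentum parameter $\alpha$ corresponds to the `momentum` argument of the SGD optimizer. A minimal sketch, assuming a `model` already defined:

```py
from tensorflow import keras

# SGD with heavy-ball momentum; 0.9 is a common default choice.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt)
```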
--- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] --- ## Nesterov acceleration .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Nesterov, 1983; Sutskever *et al.*, 2013.
- theoretically faster for batch convex optimization ($m=n$)
convergence in $O(1/k^2)$ instead of $O(1/k)$, where $k$ is the number of iterations (Keras usage sketched below).
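In Keras, Nesterov acceleration is a simple flag on the same optimizer (a sketch, assuming a `model` already defined):

```py
from tensorflow import keras

# Nesterov momentum: the gradient is evaluated after the momentum step.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=opt)
```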
--- ## AdaGrad .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Duchi *et al.*, 2011
- adaptive learning rate
- inversely proportional to the square root of the sum of the squared past gradients (see the sketch below)
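A minimal NumPy sketch of the AdaGrad update (variable names are illustrative, not from the slide); the built-in `keras.optimizers.Adagrad` implements the same rule:

```py
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step on the parameter vector theta."""
    accum = accum + grad ** 2                            # running sum of squared gradients
    theta = theta - lr * grad / (np.sqrt(accum) + eps)   # per-coordinate scaled step
    return theta, accum
```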
--- ## RMSProp .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Hinton, 2012
- adaptive learning rate
- exponentially weighted moving average of the squared gradient (see the sketch below)
- variant: RMSProp with momentum
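A sketch of the RMSProp update, which replaces AdaGrad's sum by an exponentially weighted moving average (illustrative names). In Keras, `keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0)` exposes the same parameters, including the momentum variant.

```py
import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp step; avg_sq is the moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2        # exponentially weighted average
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)  # per-coordinate scaled step
    return theta, avg_sq
```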
--- ## ADAM = ADAptive Moments .center[
] .credits[Goodfellow, Bengio, Courville (2016). Deep Learning. ]
- Kingma and Ba, 2014.
- often considered the state-of-the-art optimizer (a good default choice); the update is sketched below
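A sketch of the Adam update, combining RMSProp-style scaling with momentum-style first moments and bias correction (illustrative names):

```py
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections for the
    v_hat = v / (1 - beta2 ** t)                 # zero initialization of m and v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```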
--- ## Optimizers around a saddle point .center[
Credits: Alec Radford ] --- ## Optimizers in Keras
- In Keras, the optimizer is declared during the compilation step
- model definition (architecture, layers, activation functions)
- model compilation (loss, metrics, optimizer)
- model training (with a validation set to monitor progress; see the sketch below)
- model evaluation and prediction (test set, production)
Ex: SGD optimizer in multiclass classification
```py
*opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=opt)
# equivalently, pass the optimizer by name (default parameters will be used)
model.compile(loss='categorical_crossentropy', optimizer='SGD')
```
Ex: Adam optimizer for regression
```py
*opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='mean_squared_error', optimizer=opt)
# equivalently, pass the optimizer by name (default parameters will be used)
model.compile(loss='mean_squared_error', optimizer='adam')
```
.center[[optimizers in Keras](https://keras.io/api/optimizers)]
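To complete the workflow listed above, a sketch of the remaining steps (the arrays `X_train`, `y_train`, `X_test`, `y_test` are placeholders):

```py
history = model.fit(X_train, y_train,
                    validation_split=0.1,      # fraction held out as validation set
                    epochs=50, batch_size=32)  # model training

test_loss = model.evaluate(X_test, y_test)     # model evaluation on the test set
predictions = model.predict(X_test)            # predictions (production)
```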
--- class: center,middle # Initialization --- ## Initialization - Normalisation of input is important so that all variables have approx. the same range
  - standardization (transformed input with zero mean and unit variance)
  - quantile normalization (transformed input uniform on $[0,1]$)
- Optimization for convex problems: theoretical guarantees
stochastic gradient descent converges for all initialization
- For non-convex problems: no guarantees, and initialization is crucial in practice --- ## Initialization strategies
- Bias vectors $b^h$, $b^o$ can (should) be initialized to $0$.
- Weight vectors shouldn't be initialized to $0$.
- zero is a saddle point with $0$ gradient
$\rightsquigarrow$ no gradient, no learning.
- non-zero constant initialization suffers from symmetry issues.
- e.g. random initialization with small weights, $W\_{i,j}\sim \mathcal{N}(0, 0.01)$
--- ## Initialization strategies State of the art: initialization depends on layer dimension
$m$ = input dim, $n$ = output dim, weight $W\in\mathbb{R}^{n\times m}$.
- Glorot and Bengio (2010): uniform distribution $$W\_{ij}\sim U\Big(-\sqrt{\frac{6}{m+n}},\sqrt{\frac{6}{m+n}}\Big)$$
- He et al. (2015): normal distribution $$W\_{ij}\sim \mathcal{N}\Big(0,\ \mathrm{sd}=\sqrt{\tfrac{2}{m}}\Big)$$
- Saxe et al. (2014): random orthogonal matrices for $W$
More on initialization strategies in Goodfellow, section 8.4. --- ## Initialization in Keras
In Keras, parameter initialization is declared when creating the layer.
```py
layer = layers.Dense(
    units=64,
*   kernel_initializer='random_normal',
*   bias_initializer='zeros'
)
```
```py
layer = layers.Dense(
    units=64,
*   kernel_initializer=initializers.RandomNormal(stddev=0.01),
*   bias_initializer=initializers.Zeros()
)
```
.center[[initializers in Keras](https://keras.io/api/layers/initializers/)]
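For instance, the schemes of the previous slide are available as built-in initializers (a sketch using their Keras string names):

```py
from tensorflow.keras import layers

# Glorot (Xavier) uniform initialization
layer_glorot = layers.Dense(units=64, kernel_initializer='glorot_uniform')

# He normal initialization, well suited to ReLU activations
layer_he = layers.Dense(units=64, kernel_initializer='he_normal')

# Random orthogonal initialization (Saxe et al., 2014)
layer_orth = layers.Dense(units=64, kernel_initializer='orthogonal')
```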
Question: find in the Keras documentation the default initialization for a dense layer. --- class: center,middle # Regularization --- ## Architecture
- Overfitting is more likely to occur with highly complex networks.
$\rightsquigarrow$ large width and depth imply high dimensional parameters.
$\rightsquigarrow$ very flexible model prone to overfitting.
- Very deep neural networks need huge datasets to be trained.
$\rightsquigarrow$ data augmentation techniques artificially enlarge the training set (see the sketch below).
- Good practice: dimension your network with respect to the amount of training data at your disposal.
- A smaller network with similar performance should be preferred.
- Compare the loss/metrics on training and validation sets to diagnose overfitting.
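As an illustration of data augmentation for images, recent Keras versions provide preprocessing layers that apply random transformations during training only (a sketch; the exact layer names assume a recent TensorFlow/Keras release):

```py
from tensorflow import keras
from tensorflow.keras import layers

# Random image transformations, active in training mode only.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # rotations up to +/- 10% of a full circle
    layers.RandomZoom(0.1),
])
```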
--- ## Learning curves
.center[
] .credits[Shorten & Khoshgoftaar (2016). A survey on Image Data Augmentation for Deep Learning. Journal of Big Data. ] --- ## Early stopping
- During training, the following often happens:
  - during the first epochs, the loss decreases on both training and validation sets;
  - at some point, the loss still decreases on the training set, but reaches a plateau or even increases on the validation set.
$\rightsquigarrow$ this indicates overfitting
- We are not interested in models that truly minimize the training loss, but in models with good generalization capacity.
- Early stopping consists in stopping training when the validation loss reaches a plateau (as in gradient boosting); see the Keras sketch below.
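A minimal sketch with the Keras `EarlyStopping` callback (assuming a compiled `model` and placeholder training arrays):

```py
from tensorflow import keras

# Stop when the validation loss has not improved for 10 epochs
# and restore the weights of the best epoch seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                           restore_best_weights=True)

model.fit(X_train, y_train, validation_split=0.1, epochs=200,
          callbacks=[early_stop])
```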
--- ## Learning curves .center[
] .center[$\rightsquigarrow$ stop after 50 epochs] --- ## Penalisation
- As in linear regression, penalisation of the coefficients may be used for regularization
l2-penalty (ridge), l1-penalty (lasso), l1+l2-penalty (elastic-net)
- In deep learning, one can similarly penalize large values in the weights $W$, but the bias $b$ is usually not penalized.
- In Keras, this is simply declared when defining the layer
```py
from tensorflow.keras import layers
from tensorflow.keras import regularizers

layer = layers.Dense(
    units=64,
*   kernel_regularizer=regularizers.l2(1e-4),
)
```
.center[ [layer regularizers in Keras](https://keras.io/api/layers/regularizers/) ] --- ## Beware of interpretation! Sometimes the validation loss is smaller than the training loss. .center[
] The most common reason is regularization, since it applies during training but not during validation. --- ## Dropout
- Randomly set activations to $0$ with probability $p$
- Bernoulli mask sampled for each forward pass / backward pass pair
- Typically only enabled at training time
] .credits[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., _Journal of Machine Learning Research_ 2014] --- ## Dropout .center[
] .credits[Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., _Journal of Machine Learning Research_ 2014]
- At test time, multiply the weights by the keep probability $1-p$ to preserve the expected level of activation
- Equivalently, Keras rather scales activations by a factor $1/(1-p)$ during training (inverted dropout). --- ## Dropout ### Interpretation
- Reduces the network's dependency on individual neurons
- More redundant representation of the data
### Ensemble interpretation
- Equivalent to training a large ensemble of shared-parameter, binary-masked models
- Each model is only trained on a single data point --- ## Numerical experiment .center[
]
This dataset has few samples and ~10% noisy labels.
The model is seriously overparametrized (3 wide hidden layers).
The training loss goes to zero while the validation loss stops decreasing after a few epochs and starts increasing again: this is a serious case of overfitting.
--- ## A bit of dropout .center[
]
With dropout, training is much slower and the training loss shows many random bumps caused by the additional variance in the SGD updates with dropout.
The validation loss on the other hand stays closer to the training loss and can reach a slightly lower level than the model without dropout: overfitting is reduced but not completely solved.
--- ## Too much dropout: underfitting .center[
] --- ## Implementation in Keras
```py
model = Sequential()
model.add(Dense(hidden_size, input_shape=input_shape, activation='relu'))
*model.add(Dropout(rate=0.5))
model.add(Dense(hidden_size, activation='relu'))
*model.add(Dropout(rate=0.5))
model.add(Dense(output_size, activation='softmax'))
```