Supervised learning: predict the label y using the covariates x
e.g. regression, classification (binary or multiclass)
A neural network defines a parametric function x \mapsto f_\theta(x).
The link between the output f_\theta(x) and the label y depends on the task and is introduced via the loss function.
z(x) = w^T x + b
f(x) = g(w^T x + b)
z(x) = w^T x + b
f(x) = w^T x + b
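For instance, when g is the identity (real-valued regression output), the model can be evaluated directly; a minimal NumPy sketch with made-up weights and input:

```python
import numpy as np

# Made-up weights, bias and input (illustrative values only)
w = np.array([0.5, -1.2, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 3.0])

z = w @ x + b   # z(x) = w^T x + b
f = z           # identity activation: f(x) = z(x)
print(f)        # 6.6
```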
z(x) = w^T x + b \in\mathbb{R}
f(x) = \sigma(w^T x + b)\in (0,1)
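For binary classification, the sigmoid \sigma maps the score to a probability in (0,1); a minimal NumPy sketch (parameters and input are again made up):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input (illustrative values only)
w = np.array([0.5, -1.2, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 3.0])

p = sigmoid(w @ x + b)   # f(x) = sigma(w^T x + b), read as P(Y=1 | X=x)
print(p)                 # ~0.9986
```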
z(x) = W x + b \in\mathbb{R}^K
f(x) = g(W x + b)\in\mathbb{R}^K
output y\in\{1,\ldots,K\} with K classes.
re-encoding with the class indicator y\in \{0,1\}^K, also called one-hot encoding.
example: with K=3, the class y=2 is encoded as (0,1,0).
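A small NumPy sketch of this one-hot encoding with K=3 (the label values below are made up):

```python
import numpy as np

K = 3
y = np.array([2, 1, 3, 1])       # made-up labels in {1, ..., K}
Y = np.eye(K)[y - 1]             # one-hot encoding: row y-1 of the identity matrix I_K
print(Y)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```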
Goal: predict the class probabilities given X=x: P(Y=k\mid X=x),\quad k=1,\ldots,K
\mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K}{e^{z_k}}} \cdot \begin{bmatrix} e^{z_1}\\ \vdots\\ e^{z_K} \end{bmatrix}
z(x) = W x + b \in\mathbb{R}^K
f(x) = \mathrm{softmax}(W x + b)\in (0,1)^K
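A minimal NumPy sketch of the softmax model f(x) = \mathrm{softmax}(Wx + b), with made-up parameters for K=3 classes and x \in \mathbb{R}^2; the max is subtracted inside the softmax purely for numerical stability:

```python
import numpy as np

def softmax(z):
    # subtracting max(z) avoids overflow and leaves the result unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up parameters for K=3 classes and an input x in R^2
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([0.5, 1.5])

p = softmax(W @ x + b)     # f(x) in (0,1)^K
print(p, p.sum())          # ~[0.17 0.42 0.40]  1.0
```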
Given a sample S (train or test set) with instances (x_i,y_i)\in S, the empirical risk on S is L_S(\theta) =\frac{1}{|S|} \sum_{i\in S} L( f(x_i;\theta),y_i)
Goal: find the parameter \theta = (W, b) minimizing the empirical risk.
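As an illustration, here is a sketch of the empirical risk for the softmax model with the cross-entropy loss L(f(x), y) = -\log f_y(x) (a standard choice, assumed here; data and parameters are made up, and labels are 0-based):

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def empirical_risk(W, b, X, y):
    # L_S(theta) = (1/|S|) sum_i L(f(x_i; theta), y_i)
    # with the cross-entropy loss L(f(x), y) = -log f_y(x)
    P = softmax(X @ W.T + b)                      # shape (|S|, K)
    return -np.mean(np.log(P[np.arange(len(y)), y]))

# Made-up sample S with 4 points in R^2, K=3 classes, 0-based labels
X = np.array([[0.5, 1.5], [1.0, -0.2], [-0.3, 0.8], [2.0, 0.1]])
y = np.array([1, 0, 2, 0])
W, b = np.zeros((3, 2)), np.zeros(3)
print(empirical_risk(W, b, X, y))                 # log(3) ~ 1.0986 at theta = 0
```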
Strategy: use gradient descent to minimize L_S(\theta) on training set S.
Remark: regularisation can be introduced during training by minimizing L_S(\theta) + \lambda \| W \|_2^2 (\ell_2-penalty, similar to ridge regression; an \ell_1-penalty, similar to the lasso, is also possible)
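A minimal full-batch gradient-descent sketch for the softmax model with an optional \ell_2 penalty on W; the gradient expressions are the standard ones for the cross-entropy loss, and the toy data are made up:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gradient_descent(X, y, K, lr=0.5, lam=0.0, n_iters=200):
    n, d = X.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y]                              # one-hot targets, shape (n, K)
    for _ in range(n_iters):
        P = softmax(X @ W.T + b)                  # forward pass on the full sample
        G = (P - Y) / n                           # gradient of the mean cross-entropy w.r.t. z
        W -= lr * (G.T @ X + 2 * lam * W)         # gradient step, l2-penalty adds 2*lam*W
        b -= lr * G.sum(axis=0)
    return W, b

# Made-up toy sample (same as above)
X = np.array([[0.5, 1.5], [1.0, -0.2], [-0.3, 0.8], [2.0, 0.1]])
y = np.array([1, 0, 2, 0])
W, b = gradient_descent(X, y, K=3, lr=0.5, lam=1e-3)
```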
Each gradient computation uses the full training sample, which may be infeasible for large datasets
\rightsquigarrow Stochastic Gradient Descent (SGD)
Choice of learning rate is crucial
\rightsquigarrow low LR = slow algorithm, large LR = unstable algorithm
\rightsquigarrow LR scheduling, adaptive LR
Stopping criterion can be a fixed number of iterations or based on performance (e.g. relative loss decrease smaller than \epsilon)
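The sketch below illustrates both points with hypothetical helper functions: a simple step-decay learning-rate schedule and a stopping test based on the relative loss decrease:

```python
def step_decay(lr0, t, drop=0.5, every=100):
    # hypothetical step-decay schedule: halve the learning rate every `every` iterations
    return lr0 * drop ** (t // every)

def relative_decrease(prev_loss, loss):
    # stop when this falls below epsilon, or after a fixed number of iterations
    return (prev_loss - loss) / max(abs(prev_loss), 1e-12)

# inside a training loop (illustrative):
#   lr = step_decay(0.1, t)
#   if relative_decrease(prev_loss, loss) < 1e-5: break
```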
Second-order methods (like Newton-Raphson) are usually not adapted to the complexity of deep learning problems (non-convexity, extremely high dimensionality).
Typical batch sizes are b = 8, 16, 32, or 64, with N = n/b updates so that one epoch goes through the whole sample.
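A mini-batch SGD sketch (one epoch): the sample is shuffled once, then split into mini-batches of size b, with one gradient update per mini-batch; the softmax/cross-entropy gradient matches the earlier gradient-descent sketch:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def sgd_epoch(W, b, X, y, lr=0.1, batch_size=32, rng=None):
    # one epoch = N = n/b updates that together go through the whole (shuffled) sample
    if rng is None:
        rng = np.random.default_rng(0)
    n, K = X.shape[0], W.shape[0]
    Y = np.eye(K)[y]                              # one-hot targets
    idx = rng.permutation(n)                      # shuffle once per epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, Yb = X[batch], Y[batch]
        P = softmax(Xb @ W.T + b)                 # forward pass on the mini-batch only
        G = (P - Yb) / len(batch)                 # mini-batch estimate of the gradient
        W = W - lr * (G.T @ Xb)                   # SGD update
        b = b - lr * G.sum(axis=0)
    return W, b
```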
Same comments on the learning rate and stopping criterion as before.
During the training process, the loss is monitored on a validation set.
Overfitting occurs when the loss continues to decrease on the training set but not on the validation (or test) set.
In Keras, the evolution of different metrics can be monitored on a validation set.
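A hedged Keras sketch, assuming the TensorFlow Keras API and made-up data shapes: passing validation_data to model.fit reports val_loss and val_accuracy after each epoch, and an EarlyStopping callback stops training once the validation loss stops improving:

```python
import numpy as np
import tensorflow as tf

# Made-up data: inputs in R^20, K=3 classes (labels 0, 1, 2)
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(800, 20)), rng.integers(0, 3, size=800)
x_val, y_val = rng.normal(size=(200, 20)), rng.integers(0, 3, size=200)

# Softmax regression: a single dense layer z = Wx + b followed by a softmax
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# validation_data makes Keras report val_loss / val_accuracy after each epoch;
# EarlyStopping halts training when val_loss stops improving (an overfitting signal)
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=32, epochs=50,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)],
)
```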