class: center, middle # Ch1: Linear regression and classification revisited ### Clément Dombry .affiliations[ ![UBFC](images/logo-UBFC.jpg) ![M2S](images/logo-m2s.png) ![LmB](images/logo-lmb.jpg) ] .credits[ Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on GitHub](https://github.com/m2dsupsdlclass/lectures-labs)
Thanks to them! ] --- # Goal of the chapter - Consider the simplest models from statistics from a neural network perspective.
linear regression, logistic regression (binary or multiclass)
- Introduce some important concepts.
neural unit, neural layer, output function, loss function
- Introduce the training method.
minimisation with gradient descent and its mini-batch version
- Exercise:
derive all the formulas used in the algorithm
- Lab:
provide a numpy implementation (no "black box")
--- # Supervised learning Supervised learning: predict the label $y$ using the covariates $x$
e.g. regression, classification (binary or multiclass)
A neural network defines a parametric function $x\mapsto f\_\theta(x)$.
- input $x\in\mathbb{R}^p$ corresponds to explanatory variables - output $f\_\theta(x)\in\mathbb{R}^K$ is used to predict $y$
choice of output dimension $K$ depends on the task - parameter $\theta\in\Theta$ is often extremely high-dimensional
its dimension depends on the network architecture and can reach billions
Link between output $f\_\theta (x)$ and label $y$ depends on the task and is introduced via the loss function. --- class: middle, center # Neurons and neural layers --- # Artificial Neuron .center[
] .center[ $z(x) = w^T x + b$ $f(x) = g(w^T x + b)$ ] - $x, f(x) \,\,$ input and output - $z(x)\,\,$ pre-activation - $w, b\,\,$ weights and bias $\rightsquigarrow \theta=(w,b)\in\mathbb{R}^{p}\times\mathbb{R}$ - $g$ activation function --- # Linear regression .center[
] .center[ $z(x) = w^T x + b$ $f(x) = w^T x + b$ ] - label $y\in\mathbb{R}$ - output $f(x)$ interpreted as $E[Y\mid X=x]$ - activation: $g(z)=z$ --- # Logistic regression .center[
] .center[ $z(x) = w^T x + b \in\mathbb{R}$ $f(x) = \sigma(w^T x + b)\in (0,1)$ ] - label $y\in\\{0,1\\}$ - output $f(x)$ interpreted as $P(Y=1|X=x)$ - activation: $\textit{logistic}$ function
.center[$\sigma(z)=\frac{e^z}{1+e^z} \in(0,1)\quad,\quad \sigma'(z)=\sigma(z)(1-\sigma(z))$]
- pre-activation $z(x)$ is the log-odds $\ln\frac{f(x)}{1-f(x)}$ --- # Layer of Neurons .center[
] .center[ $z(x) = W x + b \in\mathbb{R}^K$ $f(x) = g(W x + b)\in\mathbb{R}^K$ ] - dense layer of neurons with $K$ units. - $W$ and $b$ are now a matrix and a vector $\rightsquigarrow \theta=(W,b)\in\mathbb{R}^{K\times p}\times\mathbb{R}^K$. - activation may be applied componentwise (as in the plot) or globally. --- # Multiclass classification
- label $y\in\\{1,\ldots,K\\}$ with $K$ classes. - re-encoding with class indicator $y\in \\{0,1\\}^K$, also called *one-hot encoding*. - example: with $K=3$ .center[$y=1\rightsquigarrow y=\begin{bmatrix}1\\\\ 0\\\\ 0 \end{bmatrix},\\ y=2\rightsquigarrow y=\begin{bmatrix}0\\\\ 1\\\\ 0 \end{bmatrix},\\ y=3\rightsquigarrow y=\begin{bmatrix}0\\\\ 0\\\\ 1 \end{bmatrix}$] - **Goal:** predict the class probabilities given $X=x$: .center[$P(Y=k\mid X=x),\quad k=1,\ldots,K$]
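The re-encoding above can be sketched in NumPy as follows. This is a minimal illustration, assuming labels are stored as integers in $\\{1,\ldots,K\\}$; the helper name `one_hot` is hypothetical, and each label is stored as a row rather than as a column vector.

```python
import numpy as np

def one_hot(y, K):
    """Encode integer labels in {1, ..., K} as rows of an (n, K) indicator matrix."""
    y = np.asarray(y)
    Y = np.zeros((y.size, K))
    # Labels start at 1 on the slide, so class k is stored in column k - 1.
    Y[np.arange(y.size), y - 1] = 1.0
    return Y

# The K = 3 example from the slide: one row per label.
Y = one_hot([1, 2, 3], K=3)
```

Each row of `Y` contains a single one, in the position of the observed class.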
--- # Softmax function
$$ \mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K}{e^{z_k}}} \cdot \begin{bmatrix} e^{z_1}\\\\ \vdots\\\\ e^{z_K} \end{bmatrix} $$
- gives a probability vector in $(0, 1)^K$ with sum $1$. - output $f(x)$ interpreted as $(P(Y = k|X = x))_{1\leq k\leq K}$. - pre-activation vector $z(x)$ called *logits*. --
- gradient computation (useful later on)
$$ \frac{\partial \mathrm{softmax}(z)_k}{\partial z_l} = \begin{cases} \mathrm{softmax}(z)_k \cdot (1 - \mathrm{softmax}(z)_k) & k = l\\\\ -\mathrm{softmax}(z)_k \cdot \mathrm{softmax}(z)_l & k \neq l \end{cases} $$
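As a hedged sketch, the softmax and its partial derivatives can be written in NumPy as below. The function names are illustrative, and subtracting $\max_k z_k$ before exponentiating is a common numerical safeguard that is not part of the slide; it leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector z of shape (K,)."""
    # Shifting by max(z) does not change the result but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Matrix J with J[k, l] = d softmax(z)_k / d z_l."""
    s = softmax(z)
    # diag(s) - s s^T gives s_k (1 - s_k) on the diagonal and -s_k s_l elsewhere.
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)           # probability vector
J = softmax_jacobian(z)  # K x K matrix of partial derivatives
```

Since the components of the softmax always sum to $1$, each column of `J` sums to $0$, which gives a quick sanity check.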
--- # Multiclass logistic regression .center[
] .center[ $z(x) = W x + b \in\mathbb{R}^K$ $f(x) = \mathrm{softmax}(W x + b)\in (0,1)^K$ ] - label $y\in \\{0,1\\}^K$ (one-hot encoding) - activation: $\textit{softmax}$ function - output $f(x)\in (0,1)^K$ (probabilities of classes) --- class: middle, center # Loss functions --- # Loss function - $L(f(x),y)\geq 0$ compares output (prediction) and label (observation). - Low values correspond to good predictions. - Choice of loss function depends on network and task. --
- Regression: $\textit{squared error}$ .center[$L(f(x),y)=(y-f(x))^2$] - Binary classification: $\textit{binary cross-entropy}$ .center[$L(f(x),y)=-y \ln f(x)-(1-y)\ln (1-f(x))$] - Multiclass classification: $\textit{categorical cross-entropy}$ .center[$L(f(x),y)=-\sum_{k=1}^K y_k \ln f(x)_k$]
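The three losses above translate directly into NumPy. This is a minimal sketch with illustrative function names; it assumes the predicted probabilities lie strictly inside $(0,1)$, so no clipping is applied before taking logarithms.

```python
import numpy as np

def squared_error(f, y):
    """Squared error between a real prediction f and a real label y."""
    return (y - f) ** 2

def binary_cross_entropy(f, y):
    """Binary cross-entropy for a probability f in (0, 1) and a label y in {0, 1}."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

def categorical_cross_entropy(f, y):
    """Categorical cross-entropy for a probability vector f and a one-hot label y."""
    return -np.sum(y * np.log(f))

# Small examples (values chosen for illustration only).
l_reg = squared_error(2.5, 3.0)
l_bin = binary_cross_entropy(0.5, 1.0)
l_cat = categorical_cross_entropy(np.array([0.2, 0.5, 0.3]), np.array([0.0, 1.0, 0.0]))
```

With a one-hot label, the categorical cross-entropy reduces to minus the log-probability assigned to the observed class.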
--- # Loss function - Output $f(x)$ often seen as the parameter of the conditional distribution of $Y$ given $X=x$ within a parametric family. - Loss function $L(f(x),y)$ then corresponds to NLL (negative log likelihood). --
- regression: $Y|X=x\sim \mathcal{N}(f(x),\sigma^2)$
$\rightsquigarrow f(x)=m$ is the mean - binary classification: $Y|X=x\sim \mathcal{B}(f(x))$
$\rightsquigarrow f(x)=p$ is the success probability - multiclass: $Y\mid X=x\sim \mathcal{M}(f(x))$ (multinomial distribution)
$\rightsquigarrow f(x)=p$ is the probability vector
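As a worked check of this correspondence, here are the three negative log-likelihoods written out. This is a sketch that treats $\sigma^2$ as a fixed constant in the regression case (an assumption, since the slide does not say whether it is learned) and uses the one-hot label $y$ in the multiclass case.

$$ -\ln p(y\mid x) = \frac{(y-f(x))^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2) \quad \text{(regression)} $$

$$ -\ln\left[ f(x)^{y}\,(1-f(x))^{1-y} \right] = -y\ln f(x) - (1-y)\ln(1-f(x)) \quad \text{(binary)} $$

$$ -\ln \prod_{k=1}^K f(x)_k^{y_k} = -\sum_{k=1}^K y_k \ln f(x)_k \quad \text{(multiclass)} $$

In the regression case the NLL equals the squared error up to a positive factor and an additive constant, so both have the same minimizer.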
--- # Exercises
Logit parametrization: in a binary classification problem, assume the network output $f(x)$ represents the log-odds. Which loss function would you recommend?
Same question in the multiclass case.
Heteroscedastic regression: design a simple network for heteroscedastic regression where $Y$ given $X=x$ has a normal distribution .center[$Y\mid X=x \sim \mathcal{N}(m(x),\sigma^2(x))$.]
Poisson regression (count data): design a simple network for Poisson regression where $Y$ given $X=x$ has a Poisson distribution .center[$Y\mid X=x \sim \mathcal{P}(\lambda(x))$.]
--- class: middle, center # Training the network --- # Empirical Risk Minimisation Given a sample $S$ (train or test set) with instances $(x\_i,y\_i)\in S$, the *empirical risk* on $S$ is $$ L\_S(\theta) =\frac{1}{|S|} \sum_{i\in S} L( f(x\_i;\theta),y\_i) $$ -- **Goal**: find the parameter $\theta = ( W,b)$ minimizing the empirical risk. **Strategy**: use gradient descent to minimize $L\_S(\theta)$ on the training set $S$. **Remark**: a regularisation term can be added during training, in which case one minimizes $$ L_S(\theta) + \lambda \| W \|_2^2 $$
($\ell_2$ penalty with $\lambda>0$, as in ridge regression; an $\ell_1$ penalty, as in the lasso, is also possible)
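The empirical risk with an optional $\ell_2$ penalty can be sketched for linear regression as below. The function name `empirical_risk`, the argument `lam` (standing for $\lambda$) and the toy data are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(w, b, X, y, lam=0.0):
    """Mean squared-error loss of x -> w^T x + b on the sample (X, y),
    plus the penalty lam * ||w||_2^2 (the bias b is not penalised)."""
    pred = X @ w + b
    return np.mean((y - pred) ** 2) + lam * np.sum(w ** 2)

# Toy sample with n = 3 instances and p = 2 covariates (illustrative values).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.5, 2.5, 3.0])
w, b = np.array([1.0, 2.0]), 0.5

risk_plain = empirical_risk(w, b, X, y)           # average loss only
risk_ridge = empirical_risk(w, b, X, y, lam=0.1)  # average loss + 0.1 * ||w||^2
```

Only the weights are penalised here, which matches the formula above where the penalty involves $W$ but not $b$.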
--- # Gradient descent ### Initialize $\mathbf{\theta}$ randomly ### For epochs $E=1,2,\ldots$ perform:
Compute gradients: $\Delta= \nabla_\theta L_S(\theta)$
Update parameters: $\theta \leftarrow \theta - \eta \Delta$
$\eta > 0$ is called the learning rate
### Stop when reaching criterion --- # Comments
- Each gradient computation uses the full training sample, which may be infeasible for large datasets
$\rightsquigarrow$ Stochastic Gradient Descent (SGD) - Choice of learning rate is crucial
$\rightsquigarrow$ low LR = slow algorithm, large LR = unstable algorithm
$\rightsquigarrow$ LR scheduling, adaptive LR - Stopping criterion can be a fixed number of iterations or based on performance (e.g. relative loss decrease smaller than $\epsilon$) - Second-order methods (such as Newton-Raphson) are usually not suited to the complexity of deep learning problems (non-convexity, extremely high dimensionality).
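The gradient descent loop discussed above can be sketched for linear regression with the squared error loss. This is a minimal illustration: the synthetic data, the learning rate `eta = 0.1` and the fixed budget of 500 epochs are all assumptions, and the stopping criterion is simply the fixed number of iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): y = 2*x1 - x2 + 0.5 + small noise.
n, p = 200, 2
X = rng.normal(size=(n, p))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = X @ w_true + b_true + 0.01 * rng.normal(size=n)

def risk(w, b):
    """Empirical risk L_S(theta) with the squared error loss."""
    return np.mean((y - (X @ w + b)) ** 2)

def gradients(w, b):
    """Gradient of the empirical risk with respect to w and b."""
    r = X @ w + b - y              # residuals f(x_i) - y_i
    return 2 * X.T @ r / n, 2 * r.mean()

w, b = rng.normal(size=p), rng.normal()   # random initialisation of theta = (w, b)
eta = 0.1                                  # learning rate
losses = []
for epoch in range(500):
    grad_w, grad_b = gradients(w, b)       # uses the full training sample
    w = w - eta * grad_w
    b = b - eta * grad_b
    losses.append(risk(w, b))
```

Because every update uses the whole sample, one epoch corresponds to exactly one parameter update here.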
--- # Stochastic Gradient Descent ### Initialize $\mathbf{\theta}$ randomly ### For epochs $E=1,2,\ldots$ perform: - Randomly select small batches $B\_1,\ldots,B\_N$ in $S$ - For $B\in\\{B\_1,\ldots, B\_N\\}$
Compute gradient: $\Delta = \nabla_\theta L\_B(\theta)$
Update parameters: $\theta \leftarrow \theta - \eta \Delta$
### Stop when reaching criterion --- # Comments
- Typical batch sizes are $b=8,16,32,64$, with $N=n/b$ batches (where $n=|S|$) so that one epoch goes through the whole sample. - Same comments on the learning rate and stopping criterion as before. - During the training process, the loss is monitored on a validation set.
Overfitting happens when the loss continues to decrease on the training set but no longer on the validation set.
In Keras, the evolution of different metrics can be monitored on a validation set.
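Mini-batch SGD with validation monitoring can be sketched for logistic regression as follows. The synthetic data, the train/validation split, the batch size of 32, the learning rate and the 20 epochs are assumptions for illustration; the update uses the fact that the derivative of the binary cross-entropy with respect to the pre-activation $z$ is $\sigma(z)-y$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data (assumption for illustration).
n, p = 640, 2
X = rng.normal(size=(n, p))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = (rng.random(n) < sigmoid(X @ w_true + b_true)).astype(float)

# Hold out the last 128 instances as a validation set.
X_train, y_train = X[:512], y[:512]
X_val, y_val = X[512:], y[512:]

def mean_bce(w, b, X, y):
    """Average binary cross-entropy of the model on the sample (X, y)."""
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

w, b = 0.01 * rng.normal(size=p), 0.0   # small random initialisation
eta, batch_size = 0.1, 32
train_loss, val_loss = [], []
for epoch in range(20):
    perm = rng.permutation(len(y_train))           # reshuffle at every epoch
    for start in range(0, len(y_train), batch_size):
        idx = perm[start:start + batch_size]       # indices of the batch B
        err = sigmoid(X_train[idx] @ w + b) - y_train[idx]
        w = w - eta * X_train[idx].T @ err / len(idx)
        b = b - eta * err.mean()
    train_loss.append(mean_bce(w, b, X_train, y_train))
    val_loss.append(mean_bce(w, b, X_val, y_val))
```

Comparing `train_loss` and `val_loss` epoch by epoch is the kind of monitoring described above: a validation loss that stops decreasing while the training loss keeps going down would suggest overfitting.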
--- # Impact of the Learning Rate .center[
[Why Momentum Really Works](https://distill.pub/2017/momentum/) ] ### $\rightsquigarrow$ More on optimizers in chapter 3! --- # Exercises
For linear regression, recall the closed-form formula for the minimizer $\theta$ and then write the gradient descent algorithm with $E=100$ epochs.
For logistic regression, write the stochastic gradient algorithm with batches of size $8$ and with a regularisation term $0.1\|w\|_2^2$. Stop when the relative loss decrease is smaller than $0.5\%$.
For multiclass logistic regression, write the stochastic gradient algorithm with batches of size $16$ and $E=50$ epochs. Multiply the learning rate by $1/2$ every $10$ epochs.