Deep Learning Lectures

# Ch1: Linear regression and classification revisited
### Clément Dombry
.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[
 Based on the lecture notes and slides by Charles Ollion and Olivier Grisel,
 [available on GitHub](https://github.com/m2dsupsdlclass/lectures-labs).
 Many thanks to them!
 ]

---
# Goal of the chapter

- Revisit the simplest statistical models from a neural network perspective: linear regression and logistic regression (binary or multiclass).

- Introduce some important concepts: neural unit, neural layer, output function, loss function.

- Introduce the training method: minimisation by gradient descent and its mini-batch version.

- Exercise: derive all the formulas used in the algorithm.

- Lab: write a NumPy implementation (no "black box").
 
 
---
 # Supervised learning
 
- Supervised learning: predict the label $y$ from the covariates $x$, e.g. regression or classification (binary or multiclass).

- A neural network defines a parametric function $x\mapsto f\_\theta(x)$.
  - The input $x\in\mathbb{R}^p$ corresponds to the explanatory variables.
  - The output $f\_\theta(x)\in\mathbb{R}^K$ is used to predict $y$; the choice of the output dimension $K$ depends on the task.
  - The parameter $\theta\in\Theta$ is often extremely high dimensional; its dimension depends on the network architecture and can reach billions.

- The link between the output $f\_\theta (x)$ and the label $y$ depends on the task and is introduced via the loss function.

---

# Neurons and neural layers

---
 # Artificial Neuron
 .center[
 <img src="images/artificial_neuron.png" style="width: 400px;" />
 ]

.center[$f(x) = g(w^T x + b)$]

- $x$, $f(x)$: input and output
- $z(x) = w^T x + b$: pre-activation
- $w$, $b$: weights and bias $\rightsquigarrow \theta=(w,b)\in\mathbb{R}^{p}\times\mathbb{R}$
- $g$: activation function
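
A minimal sketch of the neuron above in plain Python (standard library only; a NumPy version would replace the explicit sum by a dot product). The names `neuron`, `identity` and `sigmoid` are illustrative choices, not part of the course material; the identity and logistic activations are the two cases treated in the next slides.

```python
import math

def neuron(x, w, b, g):
    """Single artificial neuron: f(x) = g(w^T x + b)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b  # pre-activation z(x)
    return g(z)

def identity(z):
    return z

def sigmoid(z):
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

x = [1.0, 2.0]      # input with p = 2 covariates
w = [0.5, -0.25]    # weights
b = 0.1             # bias

linear_output = neuron(x, w, b, identity)    # linear regression unit
logistic_output = neuron(x, w, b, sigmoid)   # logistic regression unit
```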

---
 # Linear regression 
 .center[
 <img src="images/artificial_neuron.png" style="width: 400px;" />
 ]

.center[$f(x) = w^T x + b$]

- label $y\in\mathbb{R}$
- output $f(x)$ interpreted as $E[Y\mid X=x]$
- activation: identity, $g(z)=z$

---
 # Logistic regression 
 .center[
 <img src="images/artificial_neuron.png" style="width: 400px;" />
 ]

.center[$f(x) = \sigma(w^T x + b)\in (0,1)$]

- label $y\in\\{0,1\\}$
- output $f(x)$ interpreted as $P(Y=1\mid X=x)$
- activation: $\textit{logistic}$ function
 .center[$\sigma(z)=\frac{e^z}{1+e^z} \in(0,1)\quad,\quad \sigma'(z)=\sigma(z)(1-\sigma(z))$]
- pre-activation $z(x)$ is the log-odds $\ln\frac{f(x)}{1-f(x)}$
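
The two facts on this slide (the pre-activation is the log-odds, and the derivative formula for the logistic function) can be checked numerically. The sketch below uses only the standard library; the helper name `log_odds` and the test point `z = 0.7` are arbitrary choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    # Inverse of the logistic function: ln(p / (1 - p)).
    return math.log(p / (1.0 - p))

z = 0.7
p = sigmoid(z)
recovered_z = log_odds(p)   # should give back the pre-activation z

# Compare sigma'(z) = sigma(z) (1 - sigma(z)) with a central finite difference.
h = 1e-6
numerical_derivative = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic_derivative = p * (1.0 - p)
```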

---
 # Layer of Neurons
 .center[
 <img src="images/neural_network.png" style="width: 400px;" />
 ]

.center[$f(x) = g(W x + b)\in\mathbb{R}^K$]

- Dense layer of neurons with $K$ units.
- $W$ and $b$ are now a matrix and a vector $\rightsquigarrow \theta=(W,b)\in\mathbb{R}^{K\times p}\times\mathbb{R}^K$.
- The activation may apply componentwise (as in the figure) or to the whole vector.
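
As a rough illustration, a dense layer can be written with nested lists, where `W` holds $K$ rows of length $p$. The names `dense_layer` and `componentwise` are hypothetical helpers; the activation `g` receives the whole pre-activation vector, so it can act componentwise or globally.

```python
def dense_layer(x, W, b, g):
    """Dense layer: f(x) = g(W x + b), with W given as K rows of length p."""
    z = [sum(w_kj * x_j for w_kj, x_j in zip(row, x)) + b_k
         for row, b_k in zip(W, b)]
    return g(z)  # g acts on the whole pre-activation vector

def componentwise(scalar_activation):
    """Lift a scalar activation to a vector activation applied entry by entry."""
    return lambda z: [scalar_activation(z_k) for z_k in z]

W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]        # K = 3 units, p = 2 inputs
b = [0.0, 0.0, 1.0]
x = [2.0, 3.0]

output = dense_layer(x, W, b, componentwise(lambda z_k: z_k))  # identity activation
```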

---
 # Multiclass classification
 
 - label $y\in\\{1,\ldots,K\\}$ with $K$ classes.

- re-encoding as a class indicator vector $y\in \\{0,1\\}^K$, also called *one-hot encoding*.

- example: with $K=3$
    .center[$y=1\rightsquigarrow y=\begin{bmatrix}1\\\\  0\\\\   0 \end{bmatrix},\\  
    y=2\rightsquigarrow y=\begin{bmatrix}0\\\\  1\\\\   0 \end{bmatrix},\\
    y=3\rightsquigarrow y=\begin{bmatrix}0\\\\  0\\\\   1 \end{bmatrix}$]

- **Goal:** predict the class probabilities given $X=x$:
 .center[$P(Y=k\mid X=x),\quad k=1,\ldots,K$]
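
The one-hot re-encoding can be sketched in a few lines. The function name `one_hot` is an illustrative choice, and classes are numbered from $1$ to $K$ as on this slide.

```python
def one_hot(y, K):
    """Encode a class label y in {1, ..., K} as an indicator vector in {0, 1}^K."""
    if not 1 <= y <= K:
        raise ValueError("label must lie in {1, ..., K}")
    return [1 if k == y else 0 for k in range(1, K + 1)]

encoded_labels = [one_hot(y, 3) for y in (1, 2, 3)]
```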

---
 # Softmax function
 
 $$
 \mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K}{e^{z_k}}}
 \cdot
 \begin{bmatrix}
 e^{z_1}\\\\
 \vdots\\\\
 e^{z_K}
 \end{bmatrix}
 $$
 - gives a probability vector in $(0, 1)^K$ with sum $1$.
 - output $f(x)$ interpreted as $(P(Y = k|X = x))_{1\leq k\leq K}$.
 - pre-activation vector $z(x)$ called *logits*.
 - gradient computation (useful later on)
 $$
 \frac{\partial \mathrm{softmax}(z)_k}{\partial z_l} =
 \begin{cases}
 \mathrm{softmax}(z)_k \cdot (1 - \mathrm{softmax}(z)_k) & k = l\\\\
 -\mathrm{softmax}(z)_k \cdot \mathrm{softmax}(z)_l & k \neq l
 \end{cases}
 $$
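
A possible plain-Python sketch of the softmax and of its Jacobian matrix, following the formulas above. Subtracting the maximum logit before exponentiating is a standard trick that leaves the result unchanged and avoids overflow; the name `softmax_jacobian` is an assumption made for this example.

```python
import math

def softmax(z):
    m = max(z)  # shifting all logits by a constant does not change the softmax
    exps = [math.exp(z_k - m) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Matrix J with J[k][l] = d softmax(z)_k / d z_l."""
    s = softmax(z)
    K = len(z)
    # s_k (1 - s_k) on the diagonal, -s_k s_l off the diagonal
    return [[s[k] * ((1.0 if k == l else 0.0) - s[l]) for l in range(K)]
            for k in range(K)]

logits = [1.0, 2.0, 0.5]
probabilities = softmax(logits)
jacobian = softmax_jacobian(logits)
```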

---
# Multiclass logistic regression

.center[$f(x) = \mathrm{softmax}(W x + b)\in (0,1)^K$]

- label $y\in \\{0,1\\}^K$ (one-hot encoding)
- activation: $\textit{softmax}$ function
- output $f(x)\in (0,1)^K$ (class probabilities)

---

# Loss functions

---
# Loss function

- $L(f(x),y)\geq 0$ compares the output (prediction $f(x)$) with the label (observation $y$).
- Low values correspond to good predictions.
- The choice of loss function depends on the network and the task.

- Regression: $\textit{squared error}$
 .center[$L(f(x),y)=(y-f(x))^2$]
- Binary classification: $\textit{binary cross-entropy}$
 .center[$L(f(x),y)=-y \ln f(x)-(1-y)\ln (1-f(x))$]
- Multiclass classification: $\textit{categorical cross-entropy}$
 .center[$L(f(x),y)=-\sum_{k=1}^K y_k \ln f(x)_k$]
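
The three losses translate directly into code. This is only a sketch with illustrative function names; it assumes the predicted probabilities lie strictly inside $(0,1)$, so that the logarithms are finite.

```python
import math

def squared_error(f, y):
    return (y - f) ** 2

def binary_cross_entropy(f, y):
    # f in (0, 1) is the predicted probability of class 1, y is 0 or 1
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

def categorical_cross_entropy(f, y):
    # f is a probability vector, y a one-hot vector of the same length
    return -sum(y_k * math.log(f_k) for f_k, y_k in zip(f, y))

regression_loss = squared_error(2.5, 3.0)
binary_loss = binary_cross_entropy(0.9, 1)
multiclass_loss = categorical_cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])
```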

---
# Loss function
- The output $f(x)$ is often seen as the parameter of the conditional distribution of $Y$ given $X=x$ within a parametric family.
- The loss function $L(f(x),y)$ then corresponds to the NLL (negative log-likelihood), up to an additive constant and a positive factor that do not depend on $f(x)$.

- regression: $Y\mid X=x\sim \mathcal{N}(f(x),\sigma^2)$
 $\rightsquigarrow f(x)=m$ is the mean
- binary classification: $Y\mid X=x\sim \mathcal{B}(f(x))$
 $\rightsquigarrow f(x)=p$ is the success probability
- multiclass: $Y\mid X=x\sim \mathcal{M}(f(x))$ (multinomial distribution with a single trial)
 $\rightsquigarrow f(x)=p$ is the probability vector
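
These correspondences can be checked by writing out the negative log-likelihood $-\ln p(y\mid x)$ in each model. In the Gaussian case, $\sigma^2$ is assumed to be a fixed constant, so the squared error appears up to the factor $1/(2\sigma^2)$ and an additive constant.

Gaussian model:
$$
-\ln p(y\mid x) = \frac{(y-f(x))^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)
$$
Bernoulli model, with $y\in\\{0,1\\}$:
$$
-\ln p(y\mid x) = -\ln\left(f(x)^{y}\,(1-f(x))^{1-y}\right) = -y\ln f(x) - (1-y)\ln(1-f(x))
$$
Multinomial model, with one-hot $y$:
$$
-\ln p(y\mid x) = -\ln \prod_{k=1}^{K} f(x)_k^{y_k} = -\sum_{k=1}^{K} y_k \ln f(x)_k
$$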

---
 # Exercises 
 
 <ol>
 <li> Logit parametrization: in a binary classification problem, assume the network output $f(x)$ represents the log-odds.
 Which loss function do you recommend?</li>
 <li> Same question in the multiclass case.</li>
 <li> Heteroscedastic regression: design a simple network for heteroscedastic regression where $Y$ given $X=x$ has a normal distribution
 .center[$Y\mid X=x \sim \mathcal{N}(m(x),\sigma^2(x))$.]</li>
 <li> Poisson regression (count data): design a simple network for Poisson regression where $Y$ given $X=x$ has a Poisson distribution
 .center[$Y\mid X=x \sim \mathcal{P}(\lambda(x))$.]</li>
 </ol>

---

# Training the network

---
 # Empirical Risk Minimisation
 
 - Given a sample $S$ (training or test set) with instances $(x\_i,y\_i)$, $i\in S$,
 the *empirical risk* on $S$ is
 $$
 L\_S(\theta) =\frac{1}{|S|} \sum_{i\in S} L(f\_\theta(x\_i),y\_i)
 $$

 - **Goal**: find the parameter $\theta = (W,b)$ minimizing the empirical risk.
 - **Strategy**: use gradient descent to minimize $L\_S(\theta)$ on the training set $S$.
 - **Remark**: a regularisation term can be added during training, e.g.
 $$ L_S(\theta) + \lambda \| W \|_2^2 $$
 (an L2 penalty, as in ridge regression; an L1 penalty, as in the lasso, is also possible).
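
A small sketch of the empirical risk and its L2-penalised version. The sample is represented as a list of `(x, y)` pairs; the function names, the toy predictor and the toy data are all assumptions made for illustration.

```python
def empirical_risk(predict, loss, sample):
    """Average loss of the predictor over a list of (x, y) pairs."""
    return sum(loss(predict(x), y) for x, y in sample) / len(sample)

def l2_penalised_risk(predict, loss, sample, weights, lam):
    """Empirical risk plus lam * ||W||_2^2 (the bias is not penalised here)."""
    penalty = lam * sum(w * w for w in weights)
    return empirical_risk(predict, loss, sample) + penalty

def squared_error(f, y):
    return (y - f) ** 2

# Toy one-dimensional linear model f(x) = 2x + 1 and three observations.
weights, bias = [2.0], 1.0
predict = lambda x: weights[0] * x + bias
sample = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0)]

risk = empirical_risk(predict, squared_error, sample)
penalised_risk = l2_penalised_risk(predict, squared_error, sample, weights, lam=0.1)
```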

---
 # Gradient descent
 The following algorithm is called *batch gradient descent with learning rate $\eta>0$*.

- Initialize $\mathbf{\theta}$ randomly
- For epochs $E=1,2,\ldots,$ do:
  - Compute gradients: $\Delta\_E= \nabla_\theta L_S(\theta)$
  - Update parameters: $\theta \leftarrow \theta - \eta \Delta_E$
  - Stop when some stopping criterion is reached

The gradient computation uses the full training sample.
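
The algorithm can be sketched for linear regression with squared error, where the gradient of $(y - w^T x - b)^2$ is $-2(y - w^T x - b)\,x$ with respect to $w$ and $-2(y - w^T x - b)$ with respect to $b$. Assumptions of this sketch: plain Python lists, a zero initialisation instead of a random one (to keep the run deterministic), a fixed number of epochs as stopping criterion, and noiseless toy data generated from $y = 2x + 1$.

```python
def batch_gradient_descent(sample, eta=0.1, epochs=500):
    """Batch gradient descent for linear regression with squared error."""
    p = len(sample[0][0])
    n = len(sample)
    w, b = [0.0] * p, 0.0  # zero initialisation (the slide uses a random one)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for x, y in sample:  # the full training sample is used at every epoch
            residual = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b - y
            for j in range(p):
                grad_w[j] += 2.0 * residual * x[j] / n
            grad_b += 2.0 * residual / n
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b

sample = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w, b = batch_gradient_descent(sample)
```

On this toy sample the iterates approach $w = 2$ and $b = 1$; a noticeably larger learning rate makes them diverge.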

---
 # Comments
 
 - Each gradient computation uses the full training sample, which may be infeasible for large datasets
 $\rightsquigarrow$ Stochastic Gradient Descent (SGD)

- The choice of learning rate (LR) is crucial
 $\rightsquigarrow$ small LR = slow convergence, large LR = unstable algorithm
 $\rightsquigarrow$ LR scheduling, adaptive LR

- The stopping criterion can be a fixed number of iterations or based on performance (e.g. relative loss decrease smaller than $\epsilon$).

- Second-order methods (such as Newton-Raphson) are usually ill-suited to deep learning problems (non-convexity, extremely high dimensionality).
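
One way to combine the two stopping criteria mentioned above is sketched below. The function name `should_stop`, its default values and the convention of passing the list of losses recorded after each epoch are assumptions of this example.

```python
def should_stop(loss_history, epsilon=1e-3, max_epochs=100):
    """Stop after max_epochs, or when the relative loss decrease falls below epsilon."""
    if len(loss_history) >= max_epochs:
        return True
    if len(loss_history) < 2:
        return False  # not enough epochs to measure a decrease
    previous_loss, current_loss = loss_history[-2], loss_history[-1]
    if previous_loss == 0.0:
        return True  # the loss cannot decrease any further
    relative_decrease = (previous_loss - current_loss) / abs(previous_loss)
    # An increasing loss gives a negative value and therefore also stops training.
    return relative_decrease < epsilon

still_improving = should_stop([1.0, 0.5])
has_converged = should_stop([1.0, 0.9999])
```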

---
# Impact of the Learning Rate

.center[[Why Momentum Really Works](https://distill.pub/2017/momentum/)]

### $\rightsquigarrow$ More on optimizers in chapter 3!

---
# Stochastic Gradient Descent

The following algorithm is called *stochastic gradient descent with learning rate $\eta>0$*.

- Initialize $\mathbf{\theta}$ randomly
- For epochs $E=1,2,\ldots$, do:
  - For $i\in S$ (in random order) do:
    - Compute gradient: $\Delta\_i = \frac{1}{|S|}\nabla\_\theta L(f_\theta(x\_i),y\_i)$
    - Update parameters: $\theta \leftarrow \theta - \eta \Delta\_i$
  - Stop when some stopping criterion is reached

The gradient computation uses only one observation.

---
# Mini-Batch Gradient Descent

The following algorithm is called *mini-batch gradient descent with learning rate $\eta>0$ and batch size $b\geq 1$*.

- Initialize $\mathbf{\theta}$ randomly
- For epochs $E=1,2,\ldots$, do:
  - Randomly partition $S$ into mini-batches $B\_1,\ldots,B\_k$ of size $b$
  - For mini-batch $B=B\_1,\ldots,B\_k$ do:
    - Compute gradient: $\Delta\_B = \frac{|B|}{|S|}\nabla\_\theta L_B(\theta)$
    - Update parameters: $\theta \leftarrow \theta - \eta \Delta\_B$
  - Stop when some stopping criterion is reached

The gradient computation uses only one mini-batch.
One epoch goes through the whole sample ($k=|S|/b$).
We recover the batch algorithm when $b=|S|$ and the stochastic algorithm when $b=1$.
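
A minimal sketch of the mini-batch algorithm for one-dimensional linear regression with squared error, using the scaling of this slide: $\Delta\_B$ equals the sum of the per-observation gradients over $B$ divided by $|S|$. Many references drop the factor $|B|/|S|$ and absorb it into the learning rate. The zero initialisation, the fixed seed and the noiseless toy data are assumptions chosen to keep the run reproducible.

```python
import random

def minibatch_gradient_descent(sample, batch_size, eta=0.1, epochs=500, seed=0):
    """Mini-batch gradient descent for f(x) = w x + b with squared error."""
    rng = random.Random(seed)
    n = len(sample)
    w, b = 0.0, 0.0  # zero initialisation (the slide uses a random one)
    for _ in range(epochs):
        indices = list(range(n))
        rng.shuffle(indices)  # random partition of S into mini-batches
        for start in range(0, n, batch_size):
            batch = [sample[i] for i in indices[start:start + batch_size]]
            # Delta_B = (|B| / |S|) grad L_B = (1 / |S|) * sum of gradients over B
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

sample = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1
stochastic_fit = minibatch_gradient_descent(sample, batch_size=1)
minibatch_fit = minibatch_gradient_descent(sample, batch_size=2)
batch_fit = minibatch_gradient_descent(sample, batch_size=len(sample))
```

Setting `batch_size=1` gives the stochastic algorithm and `batch_size=len(sample)` gives the batch algorithm, as stated above.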

---
# Comments

- Typical batch sizes are $b=16, 32, 64, 128$.

- The same comments on the learning rate and the stopping criterion apply as before.

- During training, the loss is monitored on a validation set.
 Overfitting happens when the loss keeps decreasing on the training set but not on the validation set.
 In Keras, the evolution of various metrics can be monitored on a validation set.

---
 # Exercises
 
 <ol>
 <li> For linear regression, recall the exact formula for the minimizer $\theta$, then write the gradient descent algorithm with $E=100$ epochs.</li>
 
 <li> For logistic regression, write the mini-batch gradient descent algorithm with batch size $b=8$ and a regularisation term $0.1\|w\|_2^2$.
 Stop when the relative loss decrease is smaller than $0.5\%$.</li>
 
 <li> For multiclass logistic regression, write the mini-batch gradient descent algorithm with batch size $b=16$ and $E=50$ epochs.
 Halve the learning rate every $10$ epochs.</li>
 </ol>