
Ch1: Linear regression and classification revisited

Clément Dombry

UBFC M2S LmB
Based on the lecture notes and slides by Charles Ollion and Olivier Grisel, available on Github
Thanks to them!
1 / 30

Goal of the chapter

  • Consider the simplest statistical models from a neural network perspective.
    linear regression, logistic regression (binary or multiclass)
  • Introduce some important concepts.
    neural unit, neural layer, output function, loss function
  • Introduce the training method.
    minimisation with gradient descent and its mini-batch version
  • Exercise: compute all the formulas in the algorithm
  • Lab: provide a numpy implementation (no "black box")
2 / 30

Supervised learning

Supervised learning: predict the label y using the covariates x
e.g. regression, classification (binary or multiclass)

A neural network defines a parametric function x\mapsto f_\theta(x).

  • input x\in\mathbb{R}^p corresponds to explanatory variables
  • output f_\theta(x)\in\mathbb{R}^K is used to predict y
    choice of output dimension K depends on the task
  • parameter \theta\in\Theta is often extremely high dimensional
    depends on the network architecture, up to billions

Link between output f_\theta(x) and label y depends on the task and is introduced via the loss function.

3 / 30

Neurons and neural layers

4 / 30

Artificial Neuron

z(x) = w^T x + b

f(x) = g(w^T x + b)

  • x, f(x) input and output
  • z(x) pre-activation
  • w, b weights and bias \rightsquigarrow \theta=(w,b)\in\mathbb{R}^{p}\times\mathbb{R}
  • g activation function
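
A minimal numpy sketch of the neuron defined above. The function name `neuron` and the numerical values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def neuron(x, w, b, g=lambda z: z):
    """Single artificial neuron: pre-activation z = w^T x + b, output g(z)."""
    z = w @ x + b          # pre-activation (a scalar)
    return g(z)            # activation applied to the pre-activation

# Illustrative values (assumed), with p = 3 inputs
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.1
out = neuron(x, w, b)      # default activation is the identity
```

Passing another function as `g` (for example `np.tanh`) changes the activation without touching the pre-activation.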
5 / 30

Linear regression

z(x) = w^T x + b

f(x) = w^T x + b

  • label y\in\mathbb{R}
  • output f(x) interpreted as E[Y\mid X=x]
  • activation: g(z)=z
6 / 30

Logistic regression

z(x) = w^T x + b \in\mathbb{R}

f(x) = \sigma(w^T x + b)\in (0,1)

  • label y\in\{0,1\}
  • output f(x) interpreted as P(Y=1|X=x)
  • activation: \textit{logistic} function \sigma(z)=\frac{e^z}{1+e^z} \in(0,1)\quad,\quad \sigma'(z)=\sigma(z)(1-\sigma(z))
  • pre-activation z(x) is the log-odds
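
The properties listed above can be checked numerically. The sketch below is one possible numpy version; the names `sigmoid` and `logistic_regression` and the test point z = 0.3 are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = e^z / (1 + e^z), written as 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, w, b):
    """Output f(x) = sigma(w^T x + b), read as P(Y=1 | X=x)."""
    return sigmoid(w @ x + b)

# Finite-difference check of sigma'(z) = sigma(z) (1 - sigma(z))
z, h = 0.3, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))

# The pre-activation is the log-odds: log(p / (1 - p)) recovers z
p = sigmoid(z)
log_odds = np.log(p / (1 - p))
```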
7 / 30

Layer of Neurons

z(x) = W x + b \in\mathbb{R}^K

f(x) = g(W x + b)\in\mathbb{R}^K

  • dense layer of neurons with K units.
  • W, b now matrix and vector \rightsquigarrow \theta=(W,b)\in\mathbb{R}^{K\times p}\times\mathbb{R}^K.
  • activation may apply componentwise (e.g. the logistic function) or globally (e.g. softmax).
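
A dense layer is a matrix-vector product followed by an activation. The sketch below assumes p = 4 inputs, K = 3 units and a tanh activation, all chosen only for illustration.

```python
import numpy as np

def dense_layer(x, W, b, g=lambda z: z):
    """Dense layer with K units: z = W x + b in R^K, output g(z)."""
    z = W @ x + b
    return g(z)

rng = np.random.default_rng(0)   # fixed seed, illustrative only
p, K = 4, 3
W = rng.normal(size=(K, p))      # weight matrix in R^{K x p}
b = np.zeros(K)                  # bias vector in R^K
x = rng.normal(size=p)

# tanh is applied componentwise; it is used here only as an example activation
out = dense_layer(x, W, b, g=np.tanh)
```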
8 / 30

Multiclass classification

  • label y\in\{1,\ldots,K\} with K classes.

  • re-encoding with class indicator y\in \{0,1\}^K also called one-hot-encoding.

  • example: with K=3

    y=1\rightsquigarrow y=\begin{bmatrix}1\\ 0\\ 0 \end{bmatrix},\ y=2\rightsquigarrow y=\begin{bmatrix}0\\ 1\\ 0 \end{bmatrix},\ y=3\rightsquigarrow y=\begin{bmatrix}0\\ 0\\ 1 \end{bmatrix}
  • Goal: predict the class probabilities given X=x: P(Y=k\mid X=x),\quad k=1,\ldots,K
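
One-hot encoding can be sketched in numpy by indexing the identity matrix. The helper name `one_hot` is an assumption, and labels are taken in {1, ..., K} as on the slide.

```python
import numpy as np

def one_hot(y, K):
    """Map labels y in {1, ..., K} to rows of the K x K identity matrix."""
    y = np.asarray(y)
    return np.eye(K)[y - 1]      # shift to 0-based indices

# Each row is the class indicator vector of the corresponding label
Y = one_hot([1, 2, 3, 2], K=3)
```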

9 / 30

Softmax function

\mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K}{e^{z_k}}} \cdot \begin{bmatrix} e^{z_1}\\ \vdots\\ e^{z_K} \end{bmatrix}

  • gives a probability vector in (0, 1)^K with sum 1.
  • output f(x) interpreted as (P(Y = k|X = x))_{1\leq k\leq K}.
  • pre-activation vector z(x) called logits.
  • gradient computation (useful later on) \frac{\partial \mathrm{softmax}(z)_k}{\partial z_l} = \begin{cases} \mathrm{softmax}(z)_k \cdot (1 - \mathrm{softmax}(z)_k) & k = l\\ -\mathrm{softmax}(z)_k \cdot \mathrm{softmax}(z)_l & k \neq l \end{cases}
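
The softmax and its Jacobian can be checked against finite differences. This is a sketch under assumptions: the function names and the logit vector are illustrative, and subtracting max(z) is a standard numerical safeguard that is not mentioned on the slide.

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector; subtracting max(z) leaves the result
    unchanged but avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[k, l] = d softmax(z)_k / d z_l = s_k (1{k=l} - s_l)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, -0.5, 2.0])
s = softmax(z)
J = softmax_jacobian(z)

# Finite-difference check of the Jacobian, one column per coordinate z_l
h = 1e-6
J_num = np.column_stack([
    (softmax(z + h * e_l) - softmax(z - h * e_l)) / (2 * h)
    for e_l in np.eye(3)
])
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax output always sums to 1.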
11 / 30

Multiclass logistic regression

z(x) = W x + b \in\mathbb{R}^K

f(x) = \mathrm{softmax}(W x + b)\in (0,1)^K

  • label y\in \{0,1\}^K (one-hot encoding)
  • activation : \textit{softmax} function
  • output f(x)\in (0,1)^K (probabilities of classes)
12 / 30

Loss functions

13 / 30

Loss function

  • L(f(x),y)\geq 0 compares output (prediction) and label (observation).
  • Low values correspond to good predictions.
  • Choice of loss function depends on network and task.
  • Regression: \textit{squared error} L(f(x),y)=(y-f(x))^2
  • Binary classification: \textit{binary cross entropy} L(f(x),y)=-y \ln f(x)-(1-y)\ln (1-f(x))
  • Multiclass classification : \textit{categorical cross-entropy} L(f(x),y)=-\sum_{k=1}^K y_k \ln f(x)_k
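
The three losses above translate directly into numpy. The function names are assumptions; the last lines illustrate that, for K = 2, categorical cross-entropy on the vector (1-p, p) coincides with binary cross-entropy.

```python
import numpy as np

def squared_error(f, y):
    return (y - f) ** 2

def binary_cross_entropy(f, y):
    # f in (0, 1) is the predicted P(Y=1 | X=x), y in {0, 1}
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

def categorical_cross_entropy(f, y):
    # f is a probability vector, y a one-hot vector
    return -np.sum(y * np.log(f))

# With K = 2, both cross-entropies give the same value
p = 0.8
bce = binary_cross_entropy(p, 1)
cce = categorical_cross_entropy(np.array([1 - p, p]), np.array([0, 1]))
```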
15 / 30

Loss function

  • Output f(x) is often seen as the parameter of the conditional distribution of Y given X=x within a parametric family.
  • Loss function L(f(x),y) then corresponds to NLL (negative log likelihood).
  • regression: Y|X=x\sim \mathcal{N}(f(x),\sigma^2)
    \rightsquigarrow f(x)=m is the mean
  • binary classification: Y|X=x\sim \mathcal{B}(f(x))
    \rightsquigarrow f(x)=p is the success probability
  • multiclass: Y\mid X=x\sim \mathcal{M}(f(x)) (multinomial distribution)
    \rightsquigarrow f(x)=p is the probability vector
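
For the three models above, writing out the negative log likelihood gives the following expressions. In the Gaussian case the variance \sigma^2 is assumed fixed, so the NLL equals the squared error up to a positive factor and an additive constant that do not depend on f(x).

```latex
% Regression: Y | X=x ~ N(f(x), \sigma^2), with \sigma^2 fixed
-\ln p(y \mid x) = \frac{(y - f(x))^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)

% Binary classification: p(y | x) = f(x)^y (1 - f(x))^{1-y}, y in {0,1}
-\ln p(y \mid x) = -y \ln f(x) - (1-y)\ln(1 - f(x))

% Multiclass: p(y | x) = \prod_{k=1}^K f(x)_k^{y_k}, y one-hot
-\ln p(y \mid x) = -\sum_{k=1}^K y_k \ln f(x)_k
```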
17 / 30

Exercises

  1. Logit parametrization: in a binary classification problem, assume the network output f(x) represents the log-odds. Which loss function do you recommend?
  2. Same question in the multiclass case.
  3. Heteroscedastic regression: design a simple network for heteroscedastic regression where Y given X=x has a normal distribution Y\mid X=x \sim \mathcal{N}(m(x),\sigma^2(x)).
  4. Poisson regression (count data): design a simple network for Poisson regression where Y given X=x has a Poisson distribution Y\mid X=x \sim \mathcal{P}(\lambda(x)).
18 / 30

Training the network

19 / 30

Empirical Risk Minimisation

Given a sample S (train or test set) with instances (x_i,y_i)\in S, the empirical risk on S is L_S(\theta) =\frac{1}{|S|} \sum_{i\in S} L( f(x_i;\theta),y_i)

Goal: find parameter \theta = ( W,b) minimizing the empirical risk.

Strategy: use gradient descent to minimize L_S(\theta) on training set S.

Remark: regularisation can be introduced during training by minimizing L_S(\theta) + \lambda \| W \|_2^2 (an l2 penalty, as in ridge regression; an l1 penalty, as in the lasso, is also possible).
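
As an illustration, the empirical risk with an optional l2 penalty can be sketched for linear regression with squared-error loss. The function name `empirical_risk`, the toy data and the choice not to penalise the bias are assumptions.

```python
import numpy as np

def empirical_risk(X, y, w, b, lam=0.0):
    """Mean squared-error loss over the sample plus an optional l2 penalty on w."""
    pred = X @ w + b                      # f(x_i; theta) for all i at once
    data_term = np.mean((y - pred) ** 2)  # (1/|S|) sum_i L(f(x_i), y_i)
    penalty = lam * np.sum(w ** 2)        # lambda * ||w||_2^2, bias not penalised
    return data_term + penalty

# Toy sample (assumed) that w = (1, 2), b = 0 fits exactly
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
risk_fit = empirical_risk(X, y, w, b=0.0)            # data term only
risk_pen = empirical_risk(X, y, w, b=0.0, lam=0.1)   # adds 0.1 * (1 + 4)
```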

21 / 30

Gradient descent

Initialize \mathbf{\theta} randomly

For epochs E=1,2,\ldots perform:

  1. Compute gradients: \Delta= \nabla_\theta L_S(\theta)
  2. Update parameters: \theta \leftarrow \theta - \eta \Delta
    \eta > 0 is called the learning rate

Stop when reaching criterion
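
The algorithm above can be sketched for linear regression with squared-error loss. This is one possible version, not the lab solution: the learning rate, the fixed number of epochs used as stopping criterion and the synthetic data are all assumptions.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, epochs=500):
    """Full-batch gradient descent for linear regression with squared-error loss."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=p), 0.0        # random initialisation of theta = (w, b)
    for _ in range(epochs):
        residual = X @ w + b - y          # f(x_i) - y_i for the whole sample
        grad_w = 2 * X.T @ residual / n   # gradient of L_S with respect to w
        grad_b = 2 * residual.mean()      # gradient of L_S with respect to b
        w -= eta * grad_w                 # theta <- theta - eta * Delta
        b -= eta * grad_b
    return w, b

# Synthetic noiseless data (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
w_hat, b_hat = gradient_descent(X, y)
```

On this noiseless sample the iterates should approach the generating parameters (2, -1) and 0.5.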

22 / 30

Comments

  • Each gradient computation uses the full training sample, which may be infeasible for large datasets
    \rightsquigarrow Stochastic Gradient Descent (SGD)

  • Choice of learning rate is crucial
    \rightsquigarrow low LR= slow algorithm, large LR = unstable algorithm
    \rightsquigarrow LR scheduling, adaptive LR

  • Stopping criterion can be a fixed number of iterations or based on performance (e.g. relative loss decrease smaller than \epsilon)

  • Second-order methods (such as Newton-Raphson) are usually not suited to deep learning problems (non-convexity, extremely high dimensionality).
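
The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below assumes the loss L(\theta) = \theta^2, for which each update multiplies \theta by (1 - 2\eta); the three learning rates are illustrative choices.

```python
def run_gd(eta, steps=20, theta0=1.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta
    return theta

slow = run_gd(eta=0.01)      # factor 0.98 per step: converges slowly
good = run_gd(eta=0.4)       # factor 0.2 per step: converges quickly
unstable = run_gd(eta=1.1)   # factor -1.2 per step: oscillates and diverges
```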

23 / 30

Stochastic Gradient Descent

Initialize \mathbf{\theta} randomly

For epochs E=1,2,\ldots perform:

  • Randomly split S into small batches B_1,\ldots,B_N
  • For B\in\{B_1,\ldots, B_N\}
    1. Compute gradient: \Delta = \nabla_\theta L_B(\theta)
    2. Update parameters: \theta \leftarrow \theta - \eta \Delta

Stop when reaching criterion
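
A possible numpy sketch of this loop for logistic regression with binary cross-entropy loss. The batch size, learning rate, number of epochs and synthetic data are assumptions; the gradient uses the identity dL/dz = f(x) - y, which holds for the logistic activation combined with binary cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, epochs=50, batch_size=16, seed=0):
    """Mini-batch SGD for logistic regression with binary cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = rng.normal(scale=0.01, size=p), 0.0
    for _ in range(epochs):
        perm = rng.permutation(n)                     # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]      # indices of the batch B
            err = sigmoid(X[idx] @ w + b) - y[idx]    # dL/dz = f(x) - y
            w -= eta * X[idx].T @ err / len(idx)      # gradient of L_B w.r.t. w
            b -= eta * err.mean()                     # gradient of L_B w.r.t. b
    return w, b

# Synthetic data: labels drawn from a logistic model (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (rng.random(400) < sigmoid(X @ np.array([2.0, -1.0]))).astype(float)
w_hat, b_hat = sgd_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w_hat + b_hat) > 0.5) == (y == 1))
```

Shuffling the indices once per epoch and slicing them into consecutive blocks gives batches that cover the whole sample exactly once per epoch.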

24 / 30

Comments

  • Typical batch sizes are b=8,16,32,64, with N=n/b batches so that one epoch goes through the whole sample.

  • Same comments on the learning rate and stopping criterion as before.

  • During the training process, the loss is monitored on a validation set.
    Overfitting happens when the loss continues to decrease on the training set but not on the validation set.
    In Keras, the evolution of different metrics can be monitored on a validation set.

25 / 30

Impact of the Learning Rate


\rightsquigarrow More on optimizers in chapter 3 !

29 / 30

Exercises

  1. For linear regression, recall the exact formula for computing the minimizer \theta and then write the gradient descent algorithm with E=100 epochs.

  2. For logistic regression, write the stochastic gradient algorithm with batch size b=8 and with a regularisation term 0.1\|w\|_2^2. Stop when the relative loss decrease is smaller than 0.5%.

  3. For multiclass logistic regression, write the stochastic gradient algorithm with batch size b=16 and E=50 epochs. Reduce the learning rate by a factor 1/2 every 10 epochs.
30 / 30
