class: center, middle

# Ch2: Multilayer perceptron and backpropagation

### Clément Dombry

.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[
Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on GitHub](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM!
]

---
## Goal of the chapter

- Define the multilayer perceptron
  - dense hidden layers, activation function
- Introduce backpropagation for gradient computation
  - detailed equations on a small network
- Discuss computation graphs and automatic differentiation

---
class: center,middle

## MLP: MultiLayer Perceptron

---
## High level representation

.center[
]
- Hidden layers for feature extraction.
$\rightsquigarrow$ nonlinear features learned during training
- Output layer performs the task (regression, classification).
$\rightsquigarrow$ linear regression / logistic regression on the extracted features
- Multilayer perceptron: dense hidden layers with a nonlinear activation function.
---
## Layer of Neurons

.center[
]

.center[
$z(x) = Wx + b$

$f(x) = g(Wx + b)$
]
- in hidden layers, the activation function usually acts component-wise, i.e. $g(z)\_k=g(z\_k)$

---
## Elementwise activation functions

.center[
]

- most common: *sigmoid*, *hyperbolic tangent*, **rectified linear unit (relu)**
- blue: activation function; green: derivative
- *relu* is widely used: its numerical simplicity yields faster algorithms
- [activation functions in Keras](https://keras.io/api/layers/activations/)

---
## One Hidden Layer Network

.center[
]
Hidden layer
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$

Output layer
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$
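---
## One Hidden Layer Network

### Forward pass: a NumPy sketch

A minimal sketch of the two layers above, not part of the original slides; the sizes and the *tanh* hidden activation are illustrative choices.

```py
import numpy as np

p, H, K = 4, 16, 3                       # illustrative dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=p)
W_h, b_h = rng.normal(size=(H, p)), np.zeros(H)
W_o, b_o = rng.normal(size=(K, H)), np.zeros(K)

z_h = W_h @ x + b_h                      # hidden pre-activation z^h(x)
h = np.tanh(z_h)                         # hidden layer output g(z^h(x))
z_o = W_o @ h + b_o                      # output pre-activation z^o(x)
f = np.exp(z_o - z_o.max())              # softmax (shifted for numerical stability)
f /= f.sum()                             # f is a probability vector of size K
```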
---
## One Hidden Layer Network

.center[
]
**Architecture**
- input dimension $p$
- **depth** 2 (total number of layers), **widths** $H$ and $K$ (numbers of neurons)
- hidden layer with $H$ units and *tanh* activation; $(W^h,b^h)\in \mathbb{R}^{H\times p}\times\mathbb{R}^{H}$
- output layer with $K$ units and *softmax* activation; $(W^o,b^o)\in \mathbb{R}^{K\times H}\times\mathbb{R}^{K}$
- parameter $\theta=(W^h,b^h,W^o,b^o)$ of dimension $(p+1)H+(H+1)K$
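---
## One Hidden Layer Network

### Checking the parameter count

A quick sanity check of the formula $(p+1)H+(H+1)K$, not in the original slides; the sizes are illustrative and the `Dense` layers are the same as in the Keras implementation a couple of slides below.

```py
from tensorflow import keras

p, H, K = 4, 16, 3                                    # illustrative sizes
model = keras.Sequential([
    keras.Input(shape=(p,)),
    keras.layers.Dense(H, activation="tanh"),
    keras.layers.Dense(K, activation="softmax"),
])
assert model.count_params() == (p + 1) * H + (H + 1) * K
```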
---
## One Hidden Layer Network

### Compact representation

.center[
]
### Keras implementation

```py
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(H, input_dim=p, activation="tanh"))   # (p+1)H params
model.add(Dense(K, activation="softmax"))             # (H+1)K params
```

---
## Two Hidden Layer Network
- depth 3, widths $H\_1$, $H\_2$, $K=1$
- hidden layers with $H\_1$ and $H\_2$ units and *relu* activation
- output layer with 1 unit and *sigmoid* activation
.center[
]
### Keras implementation

```py
model = Sequential()
model.add(Dense(H1, input_dim=p, activation="relu"))   # (p+1)H1 params
model.add(Dense(H2, activation="relu"))                # (H1+1)H2 params
model.add(Dense(1, activation="sigmoid"))              # (H2+1) params
```

---
## Why MLP?

**Universal Approximation Theorem** (Cybenko 1989)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be continuous and non-polynomial. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $H\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_H(x)|\le \epsilon $$
where $f\_H:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP of depth 2, with a hidden layer of $H$ units and activation $\sigma$, and an output layer of $K$ units with identity activation.
--
- compute the number of parameters in $f\_H$
- what happens for a polynomial activation $\sigma$?
- larger $H$ is needed for complex functions or better approximation
- partially explains the performance of MLPs
- role of depth?
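---
## Why MLP?

### Numerical illustration (sketch)

A possible experiment, not from the original slides: approximate a 1-D function with a depth-2 MLP and watch the error shrink as the width $H$ grows. The target function, the widths and the training setup are illustrative choices.

```py
import numpy as np
from tensorflow import keras

x = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
y = np.sin(2 * np.pi * x)                       # target function to approximate

for H in [2, 10, 50]:
    model = keras.Sequential([
        keras.Input(shape=(1,)),
        keras.layers.Dense(H, activation="tanh"),
        keras.layers.Dense(1),                  # identity activation
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x, y, epochs=500, batch_size=100, verbose=0)
    sup_error = np.max(np.abs(model.predict(x, verbose=0) - y))
    print(H, sup_error)                         # sup error on the training grid
```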
---
## Why MLP?

**Universal Approximation Theorem 2** (Kidger and Lyons 2020)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be piecewise continuously differentiable and non-affine. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $d\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_d(x)|\le \epsilon $$
where $f\_d:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP of depth $d$, with hidden layers of $p+K+2$ units and activation $\sigma$, and an output layer of $K$ units with linear activation.
--
- compute the number of parameters in $f\_d$
- what happens for an affine activation $\sigma$?
- larger $d$ is needed for complex functions or better approximation
- approximation theory does not explain why MLPs are quite robust to overfitting
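---
## Why MLP?

### The narrow-but-deep architecture (sketch)

A hedged Keras sketch of the architecture appearing in the theorem, not from the original slides: $d$ hidden layers of width $p+K+2$ followed by a linear output layer. The values of $p$, $K$, $d$ and the *relu* activation are illustrative.

```py
from tensorflow import keras

p, K, d = 4, 3, 6                       # illustrative dimensions and depth
model = keras.Sequential([keras.Input(shape=(p,))])
for _ in range(d):
    model.add(keras.layers.Dense(p + K + 2, activation="relu"))
model.add(keras.layers.Dense(K))        # linear output layer
print(model.count_params())             # number of parameters of f_d
```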
---
class: center,middle

## Gradient computation and backpropagation

---
## Computing Gradients

### Exercise: toy example

This exercise focuses on the scalar case and avoids vector calculus!

- Consider an MLP for regression with one hidden layer:
  - $p=H=K=1$
  - *relu* activation on the hidden layer and identity activation on the output layer
- Compute the gradient (w.r.t. the parameters) of the squared loss
$$L(f(x),y)=(y-f(x))^2$$
- Represent the computations on a graph
- *Hint*: recall the chain rule $(u\circ v)'(x)= u'(v(x))\times v'(x)$
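---
## Computing Gradients

### Checking the toy example numerically

One possible way to check the hand computation (not part of the exercise): evaluate the same scalar network with the gradient tape presented at the end of the chapter. The numerical values of the input, target and parameters are arbitrary.

```py
import tensorflow as tf

x, y = tf.constant(2.0), tf.constant(1.5)          # arbitrary input and target
w_h, b_h = tf.Variable(0.3), tf.Variable(-0.1)     # hidden layer parameters
w_o, b_o = tf.Variable(0.8), tf.Variable(0.2)      # output layer parameters

with tf.GradientTape() as tape:
    h = tf.nn.relu(w_h * x + b_h)                  # scalar hidden unit (relu)
    f = w_o * h + b_o                              # identity output activation
    loss = (y - f) ** 2                            # squared loss

print(tape.gradient(loss, [w_h, b_h, w_o, b_o]))   # compare with the hand computation
```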
]
]

- label $y\in \\{1,\ldots,K\\}$ or $y\in\\{0,1\\}^K$ (one-hot encoding)
- hidden layer $h(x)=\sigma(W^h x+b^h) \in\mathbb{R}^H$
- output $f(x)=\mathrm{softmax}(W^o h(x)+ b^o)\in (0,1)^K$
- loss $L(f(x),y)=-\log (f(x))\_y = -\sum\_{k=1}^K y\_k \log (f(x))\_k$
---
## Computing Gradients

### MLP with one hidden layer for classification

.center[
]
- Derivatives of $L(f(x),y)$ with respect to $W^o$, $b^o$, $W^h$, $b^h$?
- The network is a composition of differentiable modules, so we can apply the chain rule
---
# Chain rule

For functions of multiple variables:

.center[
]

---
# Backpropagation

.center[
]

Compute partial derivatives of the loss

- $\frac{\partial l}{\partial \mathbf{f(x)}\_i}=\frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}\_i} = \frac{\partial -\log \mathbf{f(x)}\_y}{\partial \mathbf{f(x)}\_i} = \frac{-1\_{y=i}}{\mathbf{f(x)}\_y}$
- $\frac{\partial l}{\partial \mathbf{z}^o(\mathbf{x})\_i} = \ ??$ $\quad \rightsquigarrow$ chain rule

---

- We compute

.center[
]

- Denoting by $\mathbf{e}(y)$ the one-hot encoding of $y$, we have
$$\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$$

---

- The gradient with respect to the activation is
$$\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$$
and we recall
$$\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$$
- We deduce the gradient with respect to $\mathbf{b}^o$
$$\nabla\_{\mathbf{b}^o} l = \mathbf{f(x)} - \mathbf{e}(y)$$
and the gradient with respect to $\mathbf{W}^o$
$$\frac{\partial l}{\partial W^o\_{i,j}} = \sum\_{k} \frac{\partial l}{\partial \mathbf{z}^o(\mathbf{x})\_k} \frac{\partial \mathbf{z}^o(\mathbf{x})\_k}{\partial W^o\_{i,j}}$$
$$\nabla\_{\mathbf{W}^o} l = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$$

---
# Backprop gradients

#### Compute activation gradients
- $\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$

#### Compute layer params gradients
- $\nabla\_{\mathbf{b}^o} l = \nabla\_{\mathbf{z}^o(\mathbf{x})} l$
- $\nabla\_{\mathbf{W}^o} l = \nabla\_{\mathbf{z}^o(\mathbf{x})} l \cdot \mathbf{h(x)}^\top$

#### Compute prev layer gradients
- $\nabla\_{\mathbf{h(x)}} l = \mathbf{W}^{o\top} \nabla\_{\mathbf{z}^o(\mathbf{x})} l$
- $\nabla\_{\mathbf{z}^h(\mathbf{x})} l = \nabla\_{\mathbf{h(x)}} l \odot \sigma^\prime(\mathbf{z}^h(\mathbf{x}))\quad$
(pointwise multiplication)
- $\nabla\_{\mathbf{b}^h} l = \nabla\_{\mathbf{z}^h(\mathbf{x})} l$
- $\nabla\_{\mathbf{W}^h} l = \nabla\_{\mathbf{z}^h(\mathbf{x})} l \cdot x^\top$

---
# Backpropagation

.center[
]

### In short:

- forward computation for output and loss
- backward computation for gradients
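---
# Backpropagation

### A NumPy sketch

A possible NumPy implementation of the backprop formulas above, for a single sample; the function and variable names are illustrative, and *relu* is used as an example hidden activation.

```py
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop(x, y_onehot, W_h, b_h, W_o, b_o, sigma, sigma_prime):
    # forward pass
    z_h = W_h @ x + b_h
    h = sigma(z_h)
    z_o = W_o @ h + b_o
    f = softmax(z_o)
    # backward pass, following the backprop formulas
    grad_z_o = f - y_onehot                   # gradient w.r.t. z^o(x)
    grad_b_o = grad_z_o
    grad_W_o = np.outer(grad_z_o, h)          # grad_z_o . h(x)^T
    grad_h = W_o.T @ grad_z_o                 # gradient w.r.t. h(x)
    grad_z_h = grad_h * sigma_prime(z_h)      # elementwise product
    grad_b_h = grad_z_h
    grad_W_h = np.outer(grad_z_h, x)          # grad_z_h . x^T
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o

relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z > 0).astype(float)
```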
### Fortunately, software provides automatic differentiation!

---
class: center,middle

## Computation Graphs

---
## Libraries & Frameworks

.center[
]
- **Automatic differentiation**: TensorFlow, PyTorch, MXNet...
- Computations on **CPU** and also on accelerators (**GPU**, **TPU**)
- **TensorFlow 1** requires declaring a static computation graph
- **TensorFlow 2** and **PyTorch** construct the graph dynamically
- **Keras**: high-level frontend with TensorFlow 2 as backend
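---
## Libraries & Frameworks

### Dynamic graphs in PyTorch (sketch)

Since PyTorch is mentioned above, here is a hedged sketch (not in the original slides, and assuming `torch` is installed) of its dynamic graph construction; it parallels the TensorFlow gradient tape shown at the end of the chapter.

```py
import torch

x = torch.rand(10, 3)
W = torch.rand(3, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)

f = torch.softmax(x @ W + b, dim=1)   # the graph is recorded while computing
f.sum().backward()                    # backward pass through the recorded graph
print(W.grad.shape, b.grad.shape)     # gradients w.r.t. the parameters
```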
---
## Computation Graph

.center[
]
- Computation graph: a directed graph of functions, depending on parameters
- Combination of linear and nonlinear functions
- All functions must be differentiable
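---
## Computation Graph

### A toy sequential graph

A toy sketch, not from the original slides, of a purely sequential graph: each node stores its function and its derivative, the forward pass chains the functions, and the backward pass accumulates the chain rule. The node functions are illustrative.

```py
import numpy as np

# each node: (forward function, derivative), applied in sequence to a scalar
nodes = [
    (lambda v: 3.0 * v + 1.0, lambda v: 3.0),                   # affine node
    (np.tanh,                 lambda v: 1.0 - np.tanh(v) ** 2), # nonlinear node
    (lambda v: v ** 2,        lambda v: 2.0 * v),               # loss-like node
]

x = 0.5
values = [x]
for forward, _ in nodes:                  # forward pass: store intermediate values
    values.append(forward(values[-1]))

grad = 1.0
for (_, derivative), v in zip(reversed(nodes), reversed(values[:-1])):
    grad *= derivative(v)                 # backward pass: chain rule

print(values[-1], grad)                   # graph output and d(output)/dx
```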
---
## Computation Graph

Functions are computed in the *forward pass*.

.center[
]

Gradients are computed in a *backward pass* (backpropagation).

.center[
]

---
## Computation Graph

Computation graphs can also handle complex structures (not only sequential).

.center[
]

---
## The gradient tape in TensorFlow

Neural network with one layer

```py
import tensorflow as tf

# variable initialisation
x = tf.random.uniform((10, 3))
W = tf.Variable(tf.random.uniform((3, 3)))
b = tf.Variable(tf.zeros((3,)))

# gradient computation with gradient tape
with tf.GradientTape() as tape:
    z = tf.matmul(x, W) + b
    f = tf.nn.softmax(z)

grad_wrt_params = tape.gradient(f, [W, b])
```

---
## The gradient tape in TensorFlow

Neural network with two layers

```py
# variable initialisation
x = tf.random.uniform((10, 3))
W1 = tf.Variable(tf.random.uniform((3, 3)))
b1 = tf.Variable(tf.zeros((3,)))
W2 = tf.Variable(tf.random.uniform((3, 3)))
b2 = tf.Variable(tf.zeros((3,)))

# gradient computation with gradient tape
with tf.GradientTape() as tape:
    z1 = tf.matmul(x, W1) + b1
    h = tf.nn.relu(z1)
    z2 = tf.matmul(h, W2) + b2
    f = tf.nn.softmax(z2)

grad_wrt_params = tape.gradient(f, [W1, b1, W2, b2])
```
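---
## The gradient tape in TensorFlow

### Checking the backpropagation formula

A possible sanity check, not in the original slides: with the cross-entropy loss, the gradient of the loss with respect to the pre-activation $\mathbf{z}$ should recover the formula $\mathbf{f(x)} - \mathbf{e}(y)$ derived earlier. The sizes and the class label are illustrative.

```py
import tensorflow as tf

x = tf.random.uniform((1, 3))
y = tf.constant([2])                                # class label
W = tf.Variable(tf.random.uniform((3, 3)))
b = tf.Variable(tf.zeros((3,)))

with tf.GradientTape() as tape:
    z = tf.matmul(x, W) + b
    f = tf.nn.softmax(z)
    loss = tf.keras.losses.sparse_categorical_crossentropy(y, f)

# gradient w.r.t. the intermediate tensor z, compared with f(x) - e(y)
print(tape.gradient(loss, z) - (f - tf.one_hot(y, 3)))   # close to zero
```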