class: center, middle

# Ch2: Multilayer perceptron and backpropagation

### Clément Dombry

.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[
Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on Github](https://github.com/m2dsupsdlclass/lectures-labs). THANKS TO THEM!
]

---
## Goal of the chapter

- Define the multilayer perceptron
dense hidden layers, activation function
- Introduce backpropagation for gradient computation
detailed equations on a small network
- Discuss computation graphs and automatic differentiation --- class: center,middle ## MLP: MultiLayer Perceptron --- ## High level representation .center[
]
- Hidden layers for feature extraction.
$\rightsquigarrow$ nonlinear features learned during training - Output layer performs the task (regression, classification).
$\rightsquigarrow$ linear regression / logistic regression on extracted features - Multilayer perceptron: dense hidden layers with nonlinear activation function.
--- ## Layer of Neurons .center[
] .center[ $ z(x) = Wx + b $ $f(x) = g(Wx + b)$ ]
- in a hidden layer, the activation function usually acts component-wise, i.e. $g(z)\_k=g(z\_k)$

---
## Elementwise activation functions

.center[
]

- most common: *sigmoid*, *hyperbolic tangent*, **rectified linear unit**.
- blue: activation function; green: derivative
- *relu* is widely used: its numerical simplicity yields faster computations
- [activation functions in Keras](https://keras.io/api/layers/activations/)
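Below is a minimal NumPy sketch (not part of the original slides) of these activations and their derivatives, applied componentwise to a layer pre-activation $z = Wx + b$; the dimensions are arbitrary.

```py
# Minimal sketch (illustrative, arbitrary dimensions): common activations,
# their derivatives, and their componentwise application to z = W x + b.
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))
def tanh(z):      return np.tanh(z)
def d_tanh(z):    return 1.0 - np.tanh(z) ** 2
def relu(z):      return np.maximum(z, 0.0)
def d_relu(z):    return (z > 0).astype(float)

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=2)
z = W @ x + b          # pre-activation, one value per neuron
f = relu(z)            # g acts componentwise: f_k = g(z_k)
df = d_relu(z)         # derivative, used later by backpropagation
```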
---
## One Hidden Layer Network

.center[
]
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$
--- ## One Hidden Layer Network .center[
]
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$
--- ## One Hidden Layer Network .center[
]
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$
--- ## One Hidden Layer Network .center[
]
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$

---
## One Hidden Layer Network

.center[
]
### Architecture

- input dimension $p$
- depth 2 (total number of layers), widths $H$ and $K$ (numbers of neurons)
- hidden layer with $H$ units and *tanh* activation; $(W^h,b^h)\in \mathbb{R}^{H\times p}\times\mathbb{R}^{H}$
- output layer with $K$ units and *softmax* activation; $(W^o,b^o)\in \mathbb{R}^{K\times H}\times\mathbb{R}^{K}$
- parameter $\theta=(W^h,b^h,W^o,b^o)$ of dimension $(p+1)H+(H+1)K$
--- ## One Hidden Layer Network ### Compact representation .center[
]
### Keras implementation

```py
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(H, input_dim=p, activation="tanh"))    # (p+1)H params
model.add(Dense(K, activation="softmax"))              # (H+1)K params
```
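As a quick sanity check (not in the original slides), the parameter count $(p+1)H+(H+1)K$ can be verified with `model.count_params()`, here with hypothetical values $p=10$, $H=32$, $K=3$:

```py
# Hypothetical dimensions, chosen only for illustration
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

p, H, K = 10, 32, 3
model = Sequential()
model.add(Dense(H, input_dim=p, activation="tanh"))
model.add(Dense(K, activation="softmax"))
assert model.count_params() == (p + 1) * H + (H + 1) * K   # 352 + 99 = 451
```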
---
## Two Hidden Layer Network

- depth 3, widths H1, H2, 1
- hidden layers with H1 and H2 units and *relu* activation
- output layer with 1 unit and *sigmoid* activation
.center[
]
### Keras implementation

```py
model = Sequential()
model.add(Dense(H1, input_dim=p, activation="relu"))    # (p+1)H1 params
model.add(Dense(H2, activation="relu"))                 # (H1+1)H2 params
model.add(Dense(1, activation="sigmoid"))               # (H2+1) params
```

---
## Why MLP?

**Universal Approximation Theorem** (Cybenko 1989)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be continuous and non-polynomial. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $H\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_H(x)|\le \epsilon $$
where $f\_H:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP with depth 2, a hidden layer with $H$ units and activation $\sigma$, and an output layer with $K$ units and identity activation.
--
- compute the number of parameters in $f\_H$.
- what happens for a polynomial activation $\sigma$?
- larger $H$ is needed for complex functions or a good approximation (see the sketch below).
- this partially explains the performance of MLPs
- role of depth?
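To make this concrete, here is an illustrative sketch (not from the original slides; the target function, widths, and training setup are arbitrary choices): on a smooth target, the approximation error on a grid typically decreases as $H$ grows.

```py
# Illustrative only: approximate f(x) = sin(3x) on [-1, 1] with depth-2 MLPs
# of increasing width H, as in the Universal Approximation Theorem.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

x = np.linspace(-1, 1, 500).reshape(-1, 1)
y = np.sin(3 * x)

for H in (2, 8, 32):                                   # larger H -> better fit
    f_H = Sequential([
        Dense(H, input_dim=1, activation="tanh"),      # hidden layer, H units
        Dense(1, activation="linear"),                 # identity output layer
    ])
    f_H.compile(optimizer="adam", loss="mse")
    f_H.fit(x, y, epochs=300, verbose=0)
    print(H, np.abs(f_H.predict(x, verbose=0) - y).max())   # sup error on the grid
```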
---
## Why MLP?

**Universal Approximation Theorem 2** (Kidger and Lyons 2020)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be piecewise continuously differentiable and non-affine. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $d\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_d(x)|\le \epsilon $$
where $f\_d:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP of depth $d$, hidden layers with $p+K+2$ units each and activation $\sigma$, and an output layer with $K$ units and linear activation.
--
- compute the number of parameters in $f\_d$.
- what happens for an affine activation $\sigma$?
- larger $d$ is needed for complex functions or a good approximation.
- approximation theory does not explain why MLPs are quite robust to overfitting.
---
class: center,middle

## Gradient computation and backpropagation

---
## Computing Gradients

### Exercise: toy example

This exercise focuses on the scalar case and avoids vector calculus!

- Consider an MLP with depth 2 and:
  - $p=H=K=1$
  - activations $\sigma$ and $g$ on the hidden and output layers.
- Compute the derivatives of $L(f(x),y)$ w.r.t. the parameters (a solution sketch is given below).
- Represent the computations on a graph and use the chain rule
$$(u\circ v)'(x)= u'(v(x))\times v'(x)$$
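A possible solution sketch (added here; not part of the original slides), writing the forward pass as a chain of scalar operations with parameters $w^h, b^h, w^o, b^o$:

$$z^h = w^h x + b^h, \quad h = \sigma(z^h), \quad z^o = w^o h + b^o, \quad f = g(z^o), \quad l = L(f,y)$$

Applying the chain rule backwards through this chain:

$$\frac{\partial l}{\partial w^o} = \frac{\partial L}{\partial f}(f,y)\, g'(z^o)\, h, \qquad \frac{\partial l}{\partial b^o} = \frac{\partial L}{\partial f}(f,y)\, g'(z^o)$$

$$\frac{\partial l}{\partial w^h} = \frac{\partial L}{\partial f}(f,y)\, g'(z^o)\, w^o\, \sigma'(z^h)\, x, \qquad \frac{\partial l}{\partial b^h} = \frac{\partial L}{\partial f}(f,y)\, g'(z^o)\, w^o\, \sigma'(z^h)$$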
---
## Computing Gradients

.center[
]
- label $y\in \\{1,\ldots,K\\}$ or $y\in\\{0,1\\}^K$ (one-hot encoding)
- hidden layer $h(x)=g(W^h x+b^h) \in\mathbb{R}^H$
- output $f(x)=\mathrm{softmax}(W^o h(x)+ b^o)\in (0,1)^K$
- loss $L(f(x),y)=-\ln (f(x))\_y = -\sum\_{k=1}^K y\_k \ln (f(x))\_k$
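A minimal NumPy sketch (not from the original slides) of this forward pass and loss, with hypothetical dimensions $p=4$, $H=5$, $K=3$:

```py
# Forward pass and negative log-likelihood loss, illustrative dimensions only
import numpy as np

rng = np.random.default_rng(0)
p, H, K = 4, 5, 3
Wh, bh = rng.normal(size=(H, p)), np.zeros(H)
Wo, bo = rng.normal(size=(K, H)), np.zeros(K)

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()

x, y = rng.normal(size=p), 1             # one sample, true class y = 1
h = np.tanh(Wh @ x + bh)                 # hidden layer h(x)
f = softmax(Wo @ h + bo)                 # output f(x)
loss = -np.log(f[y])                     # L(f(x), y) = -ln f(x)_y
```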
--- ## Computing Gradients .center[
]
- derivatives of $L(f(x), y)$ with respect to $W^o$, $b^o$, $W^h$, $b^h$?
$$\frac{\partial L(f(x), y)}{\partial W^o\_{i,j}}, \frac{\partial L(f(x), y)}{\partial b^o\_{i}}, \frac{\partial L(f(x), y)}{\partial W^h\_{i,j}}, \frac{\partial L(f(x), y)}{\partial b^h\_{i}} $$
- The network is a composition of differentiable modules
- We can apply the "chain rule"
--- # Chain rule
.center[
] --- # Chain rule
.center[
] --- # Chain rule
.center[
] --- # Chain rule
.center[
] --- name: bprop # Backpropagation .center[
]

---
template: bprop

Compute partial derivatives of the loss

- $\frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}\_i} = \frac{\partial -\log \mathbf{f(x)}\_y}{\partial \mathbf{f(x)}\_i} = \frac{-1\_{y=i}}{\mathbf{f(x)}\_y} = \frac{\partial \boldsymbol{l}}{\partial \mathbf{f(x)}\_i}$

--

- $\frac{\partial \boldsymbol{l}}{\partial \mathbf{z}^o(\mathbf{x})\_i} = ?$

---
.center[
] --- .center[
] --- .center[
] --- .center[
] --- .center[
]

---
template: bprop

Gradients

- $\nabla\_{\mathbf{z}^o(\mathbf{x})} \boldsymbol{l} = \mathbf{f(x)} - \mathbf{e}(y)$, where $\mathbf{e}(y)$ is the one-hot encoding of $y$
- $\nabla\_{\mathbf{b}^o} \boldsymbol{l} = \mathbf{f(x)} - \mathbf{e}(y)$ because $\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$ and then $\frac{\partial \mathbf{z}^o(\mathbf{x})\_i}{\partial \mathbf{b}^o\_j} = 1\_{i=j}$

---
template: bprop

Partial derivatives related to $\mathbf{W}^o$

- $\frac{\partial \boldsymbol{l}}{\partial W^o\_{i,j}} = \sum\_{k} \frac{\partial \boldsymbol{l}}{\partial \mathbf{z}^o(\mathbf{x})\_k} \frac{\partial \mathbf{z}^o(\mathbf{x})\_k}{\partial W^o\_{i,j}}$
- $\nabla\_{\mathbf{W}^o} \boldsymbol{l} = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$

---
# Backprop gradients

### Compute activation gradients

- $\nabla\_{\mathbf{z}^o(\mathbf{x})} \boldsymbol{l} = \mathbf{f(x)} - \mathbf{e}(y)$

--

### Compute layer params gradients

- $\nabla\_{\mathbf{W}^o} \boldsymbol{l} = \nabla\_{\mathbf{z}^o(\mathbf{x})} \boldsymbol{l} \cdot \mathbf{h(x)}^\top$
- $\nabla\_{\mathbf{b}^o} \boldsymbol{l} = \nabla\_{\mathbf{z}^o(\mathbf{x})} \boldsymbol{l}$

--

### Compute prev layer activation gradients

- $\nabla\_{\mathbf{h(x)}} \boldsymbol{l} = \mathbf{W}^{o\top} \nabla\_{\mathbf{z}^o(\mathbf{x})} \boldsymbol{l}$
- $\nabla\_{\mathbf{z}^h(\mathbf{x})} \boldsymbol{l} = \nabla\_{\mathbf{h(x)}} \boldsymbol{l} \odot \sigma^\prime(\mathbf{z}^h(\mathbf{x}))$
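A numerical check of the first formula (not in the original slides): for $\boldsymbol{l} = -\log \mathrm{softmax}(z)\_y$, the closed-form gradient $\mathrm{softmax}(z) - \mathbf{e}(y)$ agrees with central finite differences; the sizes below are arbitrary.

```py
# Finite-difference check of grad_z(-log softmax(z)_y) = softmax(z) - e(y)
import numpy as np

rng = np.random.default_rng(0)
K, y = 4, 2                                   # hypothetical sizes
z = rng.normal(size=K)                        # plays the role of z^o(x)

def loss(z):
    f = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    return -np.log(f[y])

f = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
closed_form = f - np.eye(K)[y]                # f(x) - e(y)

eps = 1e-6                                    # central finite differences
numeric = np.array([(loss(z + eps * np.eye(K)[i]) - loss(z - eps * np.eye(K)[i])) / (2 * eps)
                    for i in range(K)])
assert np.allclose(closed_form, numeric, atol=1e-5)
```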
---
class: center,middle

## Computation Graphs

---
## Libraries & Frameworks

.center[
]

- **Automatic differentiation**: Theano, TensorFlow, MXNet, CNTK
- Computations on **CPU** and also on accelerators **GPU**, **TPU**
- **TensorFlow 1** needs declarations for a static computation graph
- **TensorFlow 2** and **PyTorch** construct the graph dynamically
- **Keras**: high-level frontend with TensorFlow 2 as backend

---
## Computation Graph

.center[
]
- Computation graph: directed graph of functions, depending on parameters - Combination of linear and non-linear functions. - All functions must be differentiable.
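A minimal TensorFlow 2 sketch (not from the original slides): the forward pass is recorded on a dynamically built graph, and automatic differentiation returns the gradients in the backward pass; the dimensions and the class $y=1$ are arbitrary.

```py
# Automatic differentiation with a dynamically built computation graph
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])                  # one input, p = 2 (hypothetical)
W = tf.Variable(tf.random.normal((2, 3)))      # dense layer parameters, K = 3
b = tf.Variable(tf.zeros(3))

with tf.GradientTape() as tape:
    z = tf.matmul(x, W) + b                    # forward pass: z^o(x)
    f = tf.nn.softmax(z)                       # f(x)
    loss = -tf.math.log(f[0, 1])               # -log f(x)_y with y = 1

grad_W, grad_b = tape.gradient(loss, [W, b])   # backward pass (backpropagation)
```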
--- ## Computation Graph Functions are computed in the forward pass. .center[
] -- Gradients are computed in a *backward pass* (backpropagation). .center[
] --- ## Computation Graph .center[
] Computation graphs also handle complex structures (not only sequential).