class: center, middle

# Ch2: Multilayer perceptron and backpropagation

### Clément Dombry

.affiliations[
  ![UBFC](images/logo-UBFC.jpg)
  ![M2S](images/logo-m2s.png)
  ![LmB](images/logo-lmb.jpg)
]

.credits[
Based on the lecture notes and slides by Charles Ollion and Olivier Grisel [available on GitHub](https://github.com/m2dsupsdlclass/lectures-labs)
THANKS TO THEM!
]

---
## Goal of the chapter

- Define the multilayer perceptron
  - dense hidden layers, activation function
- Introduce backpropagation for gradient computation
  - detailed equations on a small network
- Discuss computation graphs and automatic differentiation

---
class: center,middle

## MLP: MultiLayer Perceptron

---
## High level representation

.center[
]
- Hidden layers for feature extraction.
$\rightsquigarrow$ nonlinear features learned during training
- Output layer performs the task (regression, classification).
$\rightsquigarrow$ linear regression / logistic regression on the extracted features
- Multilayer perceptron: dense hidden layers with a nonlinear activation function.
---
## Layer of Neurons

.center[
]

.center[
$z(x) = Wx + b$

$f(x) = g(Wx + b)$
]
- in hidden layers, the activation function usually acts component-wise, i.e. $g(z)\_k=g(z\_k)$

---
## Elementwise activation functions

.center[
]

- most common: *sigmoid*, *hyperbolic tangent*, **rectified linear unit (relu)**
- blue: activation function; green: derivative
- *relu* is widely used: its numerical simplicity yields faster algorithms
- [activation functions in Keras](https://keras.io/api/layers/activations/)

---
## One Hidden Layer Network

.center[
]
Hidden layer
- $z^h(x) = W^h x + b^h$
- $h(x) = g(z^h(x)) = g(W^h x + b^h)$

Output layer
- $z^o(x) = W^o h(x) + b^o$
- $f(x) = \mathrm{softmax}(z^o(x)) = \mathrm{softmax}(W^o h(x) + b^o)$
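---
## One Hidden Layer Network

### Forward pass: a NumPy sketch

A minimal sketch of the two layers above, not part of the original slides; the sizes and the *tanh* hidden activation are illustrative choices.

```py
import numpy as np

p, H, K = 4, 16, 3                       # illustrative dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=p)
W_h, b_h = rng.normal(size=(H, p)), np.zeros(H)
W_o, b_o = rng.normal(size=(K, H)), np.zeros(K)

z_h = W_h @ x + b_h                      # hidden pre-activation z^h(x)
h = np.tanh(z_h)                         # hidden layer output g(z^h(x))
z_o = W_o @ h + b_o                      # output pre-activation z^o(x)
f = np.exp(z_o - z_o.max())              # softmax (shifted for numerical stability)
f /= f.sum()                             # f is a probability vector of size K
```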
---
## One Hidden Layer Network

.center[
]
**Architecture**
- input dimension $p$
- **depth** 2 (total number of layers), **widths** $H$ and $K$ (numbers of neurons)
- hidden layer with $H$ units and *tanh* activation; $(W^h,b^h)\in \mathbb{R}^{H\times p}\times\mathbb{R}^{H}$
- output layer with $K$ units and *softmax* activation; $(W^o,b^o)\in \mathbb{R}^{K\times H}\times\mathbb{R}^{K}$
- parameter $\theta=(W^h,b^h,W^o,b^o)$ of dimension $(p+1)H+(H+1)K$
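---
## One Hidden Layer Network

### Checking the parameter count

A quick sanity check of the formula $(p+1)H+(H+1)K$, not in the original slides; the sizes are illustrative and the `Dense` layers are the same as in the Keras implementation a couple of slides below.

```py
from tensorflow import keras

p, H, K = 4, 16, 3                                    # illustrative sizes
model = keras.Sequential([
    keras.Input(shape=(p,)),
    keras.layers.Dense(H, activation="tanh"),
    keras.layers.Dense(K, activation="softmax"),
])
assert model.count_params() == (p + 1) * H + (H + 1) * K
```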
---
## One Hidden Layer Network

### Compact representation

.center[
]
### Keras implementation

```py
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(H, input_dim=p, activation="tanh"))   # (p+1)H params
model.add(Dense(K, activation="softmax"))             # (H+1)K params
```

---
## Two Hidden Layer Network
- depth 3, widths $H\_1$, $H\_2$, $K=1$
- hidden layers with $H\_1$ and $H\_2$ units and *relu* activation
- output layer with 1 unit and *sigmoid* activation
.center[
]
### Keras implementation

```py
model = Sequential()
model.add(Dense(H1, input_dim=p, activation="relu"))   # (p+1)H1 params
model.add(Dense(H2, activation="relu"))                # (H1+1)H2 params
model.add(Dense(1, activation="sigmoid"))              # (H2+1) params
```

---
## Why MLP?

**Universal Approximation Theorem** (Cybenko 1989)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be continuous and non-polynomial. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $H\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_H(x)|\le \epsilon $$
where $f\_H:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP of depth 2, with a hidden layer of $H$ units and activation $\sigma$, and an output layer of $K$ units with identity activation.
--
- compute the number of parameters in $f\_H$
- what happens for a polynomial activation $\sigma$?
- larger $H$ is needed for complex functions or better approximation
- partially explains the performance of MLPs
- role of depth?
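---
## Why MLP?

### Numerical illustration (sketch)

A possible experiment, not from the original slides: approximate a 1-D function with a depth-2 MLP and watch the error shrink as the width $H$ grows. The target function, the widths and the training setup are illustrative choices.

```py
import numpy as np
from tensorflow import keras

x = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
y = np.sin(2 * np.pi * x)                       # target function to approximate

for H in [2, 10, 50]:
    model = keras.Sequential([
        keras.Input(shape=(1,)),
        keras.layers.Dense(H, activation="tanh"),
        keras.layers.Dense(1),                  # identity activation
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x, y, epochs=500, batch_size=100, verbose=0)
    sup_error = np.max(np.abs(model.predict(x, verbose=0) - y))
    print(H, sup_error)                         # sup error on the training grid
```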
---
## Why MLP?

**Universal Approximation Theorem 2** (Kidger and Lyons 2020)
Let $f:\mathbb{R}^p\to\mathbb{R}^K$ be continuous and $\sigma:\mathbb{R}\to\mathbb{R}$ be piecewise continuously differentiable and non-affine. Then, for all compact $A\subset \mathbb{R}^p$ and $\epsilon>0$, there is $d\geq 1$ such that
$$ \sup\_{x\in A}|f(x)-f\_d(x)|\le \epsilon $$
where $f\_d:\mathbb{R}^p\to\mathbb{R}^K$ is an MLP of depth $d$, with hidden layers of $p+K+2$ units and activation $\sigma$, and an output layer of $K$ units with linear activation.
--
- compute the number of parameters in $f\_d$
- what happens for an affine activation $\sigma$?
- larger $d$ is needed for complex functions or better approximation
- approximation theory does not explain why MLPs are quite robust to overfitting
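---
## Why MLP?

### The narrow-but-deep architecture (sketch)

A hedged Keras sketch of the architecture appearing in the theorem, not from the original slides: $d$ hidden layers of width $p+K+2$ followed by a linear output layer. The values of $p$, $K$, $d$ and the *relu* activation are illustrative.

```py
from tensorflow import keras

p, K, d = 4, 3, 6                       # illustrative dimensions and depth
model = keras.Sequential([keras.Input(shape=(p,))])
for _ in range(d):
    model.add(keras.layers.Dense(p + K + 2, activation="relu"))
model.add(keras.layers.Dense(K))        # linear output layer
print(model.count_params())             # number of parameters of f_d
```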
---
class: center,middle

## Gradient computation and backpropagation

---
## Computing Gradients

### Exercise: toy example

This exercise focuses on the scalar case and avoids vector calculus!

- Consider an MLP for regression with one hidden layer:
  - $p=H=K=1$
  - *relu* activation on the hidden layer and identity activation on the output layer
- Compute the gradient (w.r.t. the parameters) of the squared loss
$$L(f(x),y)=(y-f(x))^2$$
- Represent the computations on a graph
- *Hint*: recall the chain rule $(u\circ v)'(x)= u'(v(x))\times v'(x)$
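---
## Computing Gradients

### Checking the toy example numerically

One possible way to check the hand computation (not part of the exercise): evaluate the same scalar network with the gradient tape presented at the end of the chapter. The numerical values of the input, target and parameters are arbitrary.

```py
import tensorflow as tf

x, y = tf.constant(2.0), tf.constant(1.5)          # arbitrary input and target
w_h, b_h = tf.Variable(0.3), tf.Variable(-0.1)     # hidden layer parameters
w_o, b_o = tf.Variable(0.8), tf.Variable(0.2)      # output layer parameters

with tf.GradientTape() as tape:
    h = tf.nn.relu(w_h * x + b_h)                  # scalar hidden unit (relu)
    f = w_o * h + b_o                              # identity output activation
    loss = (y - f) ** 2                            # squared loss

print(tape.gradient(loss, [w_h, b_h, w_o, b_o]))   # compare with the hand computation
```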
]
]

- label $y\in \\{1,\ldots,K\\}$ or $y\in\\{0,1\\}^K$ (one-hot encoding)
- hidden layer $h(x)=\sigma(W^h x+b^h) \in\mathbb{R}^H$
- output $f(x)=\mathrm{softmax}(W^o h(x)+ b^o)\in (0,1)^K$
- loss $L(f(x),y)=-\log (f(x))\_y = -\sum\_{k=1}^K y\_k \log (f(x))\_k$
---
## Computing Gradients

### MLP with one hidden layer for classification

.center[
]
- Derivatives of $L(f(x),y)$ with respect to $W^o$, $b^o$, $W^h$, $b^h$?
- The network is a composition of differentiable modules, so we can apply the chain rule
---
# Chain rule

For functions of multiple variables:

.center[
]

---
# Backpropagation

.center[
]

Compute partial derivatives of the loss

- $\frac{\partial l}{\partial \mathbf{f(x)}\_i}=\frac{\partial l(\mathbf{f(x)}, y)}{\partial \mathbf{f(x)}\_i} = \frac{\partial -\log \mathbf{f(x)}\_y}{\partial \mathbf{f(x)}\_i} = \frac{-1\_{y=i}}{\mathbf{f(x)}\_y}$
- $\frac{\partial l}{\partial \mathbf{z}^o(\mathbf{x})\_i} = \ ??$ $\quad \rightsquigarrow$ chain rule

---

- We compute

.center[
]

- Denoting by $\mathbf{e}(y)$ the one-hot encoding of $y$, we have
$$\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$$

---

- The gradient with respect to the activation is
$$\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$$
and we recall
$$\mathbf{z}^o(\mathbf{x}) = \mathbf{W}^o \mathbf{h}(\mathbf{x}) + \mathbf{b}^o$$
- We deduce the gradient with respect to $\mathbf{b}^o$
$$\nabla\_{\mathbf{b}^o} l = \mathbf{f(x)} - \mathbf{e}(y)$$
and the gradient with respect to $\mathbf{W}^o$
$$\frac{\partial l}{\partial W^o\_{i,j}} = \sum\_{k} \frac{\partial l}{\partial \mathbf{z}^o(\mathbf{x})\_k} \frac{\partial \mathbf{z}^o(\mathbf{x})\_k}{\partial W^o\_{i,j}}$$
$$\nabla\_{\mathbf{W}^o} l = (\mathbf{f(x)} - \mathbf{e}(y)) \cdot \mathbf{h(x)}^\top$$

---
# Backprop gradients

#### Compute activation gradients
- $\nabla\_{\mathbf{z}^o(\mathbf{x})} l = \mathbf{f(x)} - \mathbf{e}(y)$

#### Compute layer params gradients
- $\nabla\_{\mathbf{b}^o} l = \nabla\_{\mathbf{z}^o(\mathbf{x})} l$
- $\nabla\_{\mathbf{W}^o} l = \nabla\_{\mathbf{z}^o(\mathbf{x})} l \cdot \mathbf{h(x)}^\top$

#### Compute prev layer gradients
- $\nabla\_{\mathbf{h(x)}} l = \mathbf{W}^{o\top} \nabla\_{\mathbf{z}^o(\mathbf{x})} l$
- $\nabla\_{\mathbf{z}^h(\mathbf{x})} l = \nabla\_{\mathbf{h(x)}} l \odot \sigma^\prime(\mathbf{z}^h(\mathbf{x}))\quad$
(pointwise multiplication)
- $\nabla\_{\mathbf{b}^h} l = \nabla\_{\mathbf{z}^h(\mathbf{x})} l$
- $\nabla\_{\mathbf{W}^h} l = \nabla\_{\mathbf{z}^h(\mathbf{x})} l \cdot x^\top$

---
# Backpropagation

.center[
]

### In short:

- forward computation for output and loss
- backward computation for gradients
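---
# Backpropagation

### A NumPy sketch

A possible NumPy implementation of the backprop formulas above, for a single sample; the function and variable names are illustrative, and *relu* is used as an example hidden activation.

```py
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop(x, y_onehot, W_h, b_h, W_o, b_o, sigma, sigma_prime):
    # forward pass
    z_h = W_h @ x + b_h
    h = sigma(z_h)
    z_o = W_o @ h + b_o
    f = softmax(z_o)
    # backward pass, following the backprop formulas
    grad_z_o = f - y_onehot                   # gradient w.r.t. z^o(x)
    grad_b_o = grad_z_o
    grad_W_o = np.outer(grad_z_o, h)          # grad_z_o . h(x)^T
    grad_h = W_o.T @ grad_z_o                 # gradient w.r.t. h(x)
    grad_z_h = grad_h * sigma_prime(z_h)      # elementwise product
    grad_b_h = grad_z_h
    grad_W_h = np.outer(grad_z_h, x)          # grad_z_h . x^T
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o

relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z > 0).astype(float)
```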
### Fortunately, software provides automatic differentiation!

---
class: center,middle

## Computation Graphs

---
## Libraries & Frameworks

.center[
]
- **Automatic differentiation**: TensorFlow, PyTorch, MXNet...
- Computations on **CPU** and also on accelerators (**GPU**, **TPU**)
- **TensorFlow 1** requires declaring a static computation graph
- **TensorFlow 2** and **PyTorch** construct the graph dynamically
- **Keras**: high-level frontend with TensorFlow 2 as backend
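---
## Libraries & Frameworks

### Dynamic graphs in PyTorch (sketch)

Since PyTorch is mentioned above, here is a hedged sketch (not in the original slides, and assuming `torch` is installed) of its dynamic graph construction; it parallels the TensorFlow gradient tape shown at the end of the chapter.

```py
import torch

x = torch.rand(10, 3)
W = torch.rand(3, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)

f = torch.softmax(x @ W + b, dim=1)   # the graph is recorded while computing
f.sum().backward()                    # backward pass through the recorded graph
print(W.grad.shape, b.grad.shape)     # gradients w.r.t. the parameters
```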
---
## Computation Graph

.center[
]
- Computation graph: a directed graph of functions, depending on parameters
- Combination of linear and nonlinear functions
- All functions must be differentiable
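---
## Computation Graph

### A toy sequential graph

A toy sketch, not from the original slides, of a purely sequential graph: each node stores its function and its derivative, the forward pass chains the functions, and the backward pass accumulates the chain rule. The node functions are illustrative.

```py
import numpy as np

# each node: (forward function, derivative), applied in sequence to a scalar
nodes = [
    (lambda v: 3.0 * v + 1.0, lambda v: 3.0),                   # affine node
    (np.tanh,                 lambda v: 1.0 - np.tanh(v) ** 2), # nonlinear node
    (lambda v: v ** 2,        lambda v: 2.0 * v),               # loss-like node
]

x = 0.5
values = [x]
for forward, _ in nodes:                  # forward pass: store intermediate values
    values.append(forward(values[-1]))

grad = 1.0
for (_, derivative), v in zip(reversed(nodes), reversed(values[:-1])):
    grad *= derivative(v)                 # backward pass: chain rule

print(values[-1], grad)                   # graph output and d(output)/dx
```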
---
## Computation Graph

Functions are computed in the *forward pass*.

.center[
]

Gradients are computed in a *backward pass* (backpropagation).

.center[
]

---
## Computation Graph

Computation graphs can also handle complex structures (not only sequential).

.center[
]

---
## The gradient tape in TensorFlow

Neural network with one layer

```py
import tensorflow as tf

# variable initialisation
x = tf.random.uniform((10, 3))
W = tf.Variable(tf.random.uniform((3, 3)))
b = tf.Variable(tf.zeros((3,)))

# gradient computation with gradient tape
with tf.GradientTape() as tape:
    z = tf.matmul(x, W) + b
    f = tf.nn.softmax(z)

grad_wrt_params = tape.gradient(f, [W, b])
```

---
## The gradient tape in TensorFlow

Neural network with two layers

```py
# variable initialisation
x = tf.random.uniform((10, 3))
W1 = tf.Variable(tf.random.uniform((3, 3)))
b1 = tf.Variable(tf.zeros((3,)))
W2 = tf.Variable(tf.random.uniform((3, 3)))
b2 = tf.Variable(tf.zeros((3,)))

# gradient computation with gradient tape
with tf.GradientTape() as tape:
    z1 = tf.matmul(x, W1) + b1
    h = tf.nn.relu(z1)
    z2 = tf.matmul(h, W2) + b2
    f = tf.nn.softmax(z2)

grad_wrt_params = tape.gradient(f, [W1, b1, W2, b2])
```
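---
## The gradient tape in TensorFlow

### Checking the backpropagation formula

A possible sanity check, not in the original slides: with the cross-entropy loss, the gradient of the loss with respect to the pre-activation $\mathbf{z}$ should recover the formula $\mathbf{f(x)} - \mathbf{e}(y)$ derived earlier. The sizes and the class label are illustrative.

```py
import tensorflow as tf

x = tf.random.uniform((1, 3))
y = tf.constant([2])                                # class label
W = tf.Variable(tf.random.uniform((3, 3)))
b = tf.Variable(tf.zeros((3,)))

with tf.GradientTape() as tape:
    z = tf.matmul(x, W) + b
    f = tf.nn.softmax(z)
    loss = tf.keras.losses.sparse_categorical_crossentropy(y, f)

# gradient w.r.t. the intermediate tensor z, compared with f(x) - e(y)
print(tape.gradient(loss, z) - (f - tf.one_hot(y, 3)))   # close to zero
```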