Typically, these networks use different types of activation functions, such as the ones shown in the following table (a short code sketch of these operations follows the table):

| Argmax | Softmax | Maxpool |
| --- | --- | --- |
| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
Minimization Problem
--------------------
In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but may be more complicated in some aspects.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
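For illustration, here is a minimal NumPy sketch (not part of the original text) of how an element $`\Psi_\vartheta \in \mathcal{M}_{d,\varphi}`$ can be evaluated and how the number of degrees of freedom $`M`$ comes about; the topology $`d`$ and the activation functions are arbitrary example choices.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Evaluate Psi_theta(x): alternate affine maps W_l x + b_l and activations phi_l."""
    for W, b, phi in zip(weights, biases, activations):
        x = phi(W @ x + b)
    return x

# example topology d = (d_0, ..., d_L) = (3, 5, 5, 1) with ReLU activations
d = (3, 5, 5, 1)
relu = lambda t: np.maximum(t, 0.0)
identity = lambda t: t
rng = np.random.default_rng(0)
weights = [rng.standard_normal((d[l], d[l - 1])) for l in range(1, len(d))]
biases = [rng.standard_normal(d[l]) for l in range(1, len(d))]
activations = [relu, relu, identity]

# M = sum_l d_l * (d_{l-1} + 1) degrees of freedom collected in theta
M = sum(d[l] * (d[l - 1] + 1) for l in range(1, len(d)))
print(M, forward(rng.standard_normal(d[0]), weights, biases, activations))
```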
If we want to use the neural network to approximate a function $`f`$, the easiest approach is to conduct a least-squares regression in an appropriate norm.
To keep the explanation simple, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a finite second moment, we can use the standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integral above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
> **Definition** (training data):
> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
The empirical regression problem then reads
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
```
> **Definition** (loss function):
> A _loss function_ is any function that measures how well a neural network approximates the target values.
Typical loss functions for regression and classification tasks are
- mean-square error (MSE, standard $`L^2`$-error; see the code sketch after this list)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)
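To connect the list with the formula above, here is a minimal NumPy sketch of the empirical mean-square error $`\mathcal{L}_N`$; the target function $`f`$ and the distribution $`\pi`$ below are arbitrary example choices, not from the text.

```python
import numpy as np

def empirical_mse(model, xs, fs):
    """L_N(model) = 1/N * sum_i (f_i - model(x_i))^2 for labeled data (x_i, f_i)."""
    residuals = np.array([f_i - model(x_i) for x_i, f_i in zip(xs, fs)])
    return np.mean(residuals ** 2)

# hypothetical setup: K = 2, N = 100 samples x^(i) ~ pi (here: uniform on [0, 1]^2)
rng = np.random.default_rng(1)
xs = rng.uniform(size=(100, 2))
f = lambda x: np.sin(x[0]) + x[1] ** 2
fs = np.array([f(x) for x in xs])

# any candidate network Psi_theta can be plugged in as `model`
print(empirical_mse(lambda x: 0.0, xs, fs))
```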
To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
```math
0
= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
```
Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (possibly millions of degrees of freedom), computing the gradient with respect to all parameters for each training data point is infeasible.
Optimization (Training)
-----------------------
Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
The simplest and best-known approach is gradient descent (Euler's method), i.e.,
```math
\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
\qquad j=0, 1, 2, \dots
```
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
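A minimal sketch of this iteration (illustration only; `grad_loss` stands for $`\operatorname{\nabla}_\vartheta\mathcal{L}_N`$ and is assumed to be provided):

```python
import numpy as np

def gradient_descent(theta0, grad_loss, eta=1e-2, steps=1000):
    """theta^(j+1) = theta^(j) - eta * grad L_N(theta^(j)) for j = 0, 1, 2, ..."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_loss(theta)
    return theta

# toy check: L(theta) = ||theta||^2 has gradient 2 * theta and minimizer theta = 0
print(gradient_descent(np.array([1.0, -2.0]), lambda th: 2 * th))
```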
The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown by means of convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
**(see ?? for more information)**
Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
In principle, SGD trades the gradient computation over a large number of terms against the convergence rate of the algorithm; a short code sketch of SGD follows the metaphor below.
The best metaphor I know of to remember the difference is the following:
> **Metaphor (SGD):**
> Assume you and a friend of yours have had a party on the top of a mountain.
> As the party has come to an end, you both want to get back home somewhere in the valley.
> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent, planning each step carefully as the terrain is very rough.
> Your friend, however, drank a little too much and is not capable of planning anymore.
> So they stagger down the mountain in a more or less random direction.
> Each of their steps takes little thought, but overall it takes them a long time to get back home (or at least close to it).
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
Luckily, we know that $`\Psi_\vartheta`$ is a simple composition of activation functions $`\varphi_\ell`$ and affine maps $`A^{(\ell)}(x^{(\ell-1)}) = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}`$ with derivatives
```math
\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)}(x^{(\ell-1)}) =
\begin{cases}
x^{(\ell-1)}_{\beta}\, e_{\alpha} & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
\qquad\text{and}\qquad
\partial_{b^{(m)}_{\alpha}} A^{(\ell)}(x^{(\ell-1)}) =
\begin{cases}
e_{\alpha} & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
```
where $`e_\alpha`$ denotes the $`\alpha`$-th canonical unit vector.
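A quick numerical sanity check of these partial derivatives via finite differences (a sketch; the layer sizes, indices, and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
W, b, x = rng.standard_normal((4, 3)), rng.standard_normal(4), rng.standard_normal(3)
A = lambda W, b: W @ x + b  # the affine map A evaluated at a fixed input x

alpha, beta, eps = 1, 2, 1e-6
E = np.zeros_like(W)
E[alpha, beta] = eps
# derivative w.r.t. W_{alpha,beta}: x_beta times the alpha-th unit vector
print((A(W + E, b) - A(W, b)) / eps)  # ~ x[beta] * e_alpha
# derivative w.r.t. b_alpha: the alpha-th unit vector
e = np.zeros_like(b)
e[alpha] = eps
print((A(W, b + e) - A(W, b)) / eps)  # ~ e_alpha
```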
The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ can then be computed using the chain rule due to the compositional structure of the neural network.
Computing the gradient through the chain rule in a naive fashion, however, is still very inefficient and most probably infeasible.
The so-called _backpropagation_ is essentially a way to compute the partial derivatives layer-wise, storing only the necessary information to prevent repetitive computations, which renders the computation manageable.
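To make the layer-wise bookkeeping concrete, here is a compact backpropagation sketch for the fully connected setting above (an illustration only, assuming component-wise activation functions with known derivatives `dphis`; it computes the gradient of the squared error of a single training pair):

```python
import numpy as np

def forward_backward(x, f_value, weights, biases, phis, dphis):
    """Gradients of (f_value - Psi(x))^2 w.r.t. all W_l and b_l for one sample."""
    # forward pass: store the layer outputs x^(l) and pre-activations z^(l)
    xs, zs = [x], []
    for W, b, phi in zip(weights, biases, phis):
        z = W @ xs[-1] + b
        zs.append(z)
        xs.append(phi(z))
    # backward pass: propagate delta^(l) = dLoss/dz^(l) from the last layer backwards
    delta = -2.0 * (f_value - xs[-1]) * dphis[-1](zs[-1])
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, xs[l]))  # dLoss/dW for layer l (code index)
        grads_b.insert(0, delta)                   # dLoss/db for layer l (code index)
        if l > 0:
            delta = (weights[l].T @ delta) * dphis[l - 1](zs[l - 1])
    return grads_W, grads_b
```

Averaging these per-sample gradients over a mini-batch $`\Gamma_j`$ yields exactly the update direction used in the (stochastic) gradient descent sketches above.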
Types of Neural Networks
------------------------
| Name | Graph |
| --- | --- |
| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250/> |
| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250/> |
| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |
Further Reading
---------------