diff --git a/doc/basics.md b/doc/basics.md
index 0b9e2f66db5797343bb4b62979266917fa9cf91b..80b57d30db5a7ca7d665afb145235deaf50d3424 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -87,11 +87,11 @@ Typically, these networks use different types of activation functions, such as:
 | --- | --- | --- |
 | <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
 
-Training
---------
+Minimization Problem
+--------------------
 
 In this section we focus on training a fully-connected network for a regression task.
-The principles stay the same of any other objective, such as classification, but may be more complicated at different points.
+The principles stay the same for any other objective, such as classification, but may be more complicated in some aspects.
 
-Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom encorporated in $`\vartheta`$.
+Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
 For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
@@ -100,31 +100,114 @@ For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a c
 ```math
 \mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
 ```
 
-If we want to use the neural network to approximate a function $f$ in some appropriate norm, we can use Least-Squares.
-The problem then reads:
+If we want to use the neural network to approximate a function $`f`$, the easiest approach is a Least-Squares regression in an appropriate norm.
+To keep the exposition simple, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
+Assuming the function $`f`$ has a finite second moment, we can use a standard $`L^2`$-norm for our Least-Squares problem:
 
 ```math
 \text{Find}\qquad \Psi_\vartheta
-= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert^2
-= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \Vert f - \Psi_\theta \Vert^2.
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
+= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
+```
+
+where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
+As computing the integral above is not feasible for $`K\gg1`$, we consider an empirical version.
+Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
+
+> **Definition** (training data):
+> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
+
+The empirical regression problem then reads
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
+=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
+```
+
+> **Definition** (loss function):
+> A _loss function_ is any function that measures how well a neural network approximates the target values.
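+
+For intuition, here is a minimal (hypothetical) NumPy sketch of the empirical MSE loss $`\mathcal{L}_N`$; the toy network `psi`, the target $`f`$, and all dimensions are placeholder assumptions, not part of the repository:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Toy stand-ins: a single-layer "network" psi with tanh activation and
+# parameters theta = (W, b); the target f and the distribution pi = N(0, I)
+# are chosen arbitrarily for illustration.
+def psi(theta, x):
+    W, b = theta
+    return np.tanh(W @ x + b).item()
+
+K, N = 3, 100
+theta = (rng.normal(size=(1, K)), rng.normal(size=1))
+xs = rng.normal(size=(N, K))                  # samples x^(i) ~ pi
+fs = np.array([np.sin(x.sum()) for x in xs])  # labels f^(i) = f(x^(i))
+
+def empirical_loss(theta, xs, fs):
+    """L_N: mean squared error of the network over the labeled training data."""
+    preds = np.array([psi(theta, x) for x in xs])
+    return np.mean((fs - preds) ** 2)
+
+print(empirical_loss(theta, xs, fs))
+```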
+
+Typical loss functions for regression and classification tasks are
+- mean squared error (MSE, standard $`L^2`$-error)
+- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
+- cross-entropy (difference between distributions)
+- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
+- hinge loss (SVMs)
+
+To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
+
+```math
+0
+= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
+= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
 ```
 
-The first order optimality criterion then gives us the linear system
-$$
-\langle f-\Psi_\vartheta,\, \operatorname{\nabla_\vartheta} \Psi_\vartheta \rangle = 0
-$$
+Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
+As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (possibly millions of degrees of freedom), computing the gradient with respect to all parameters for every training data point is infeasible if done naively.
+
+Optimization (Training)
+-----------------------
+
+Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
+The simplest and most well-known approach is gradient descent (an explicit Euler discretization of the gradient flow), i.e.
+
+```math
+\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
+\qquad j=0, 1, 2, \dots
+```
+
+where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
+
+The key reason gradient descent is more promising than the first-order optimality criterion is its iterative character.
+In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD); a toy sketch of the resulting update loop is given after the metaphor below.
+Convergence of SGD can be shown via convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
+**(see ?? for more information)**
+
+Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
+In principle, SGD trades gradient computations of a large number of terms against the convergence rate of the algorithm.
+The best metaphor I know of to remember the difference is the following:
+
+> **Metaphor (SGD):**
+> Assume you and a friend of yours have had a party on the top of a mountain.
+> As the party has come to an end, you both want to get back home somewhere in the valley.
+> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent, planning each step carefully as the terrain is very rough.
+> Your friend, however, drank a little too much and is not capable of planning anymore.
+> So they stagger down the mountain in a more or less random direction.
+> Each step they take requires little thought, but it takes them a long time overall to get back home (or at least close to it).
+> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
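+
+To make the update rule concrete, here is a minimal, hypothetical SGD sketch in NumPy; the linear toy model, its hand-coded gradient, and all sizes are assumptions made only for this illustration (a real network would obtain the gradient via the backpropagation discussed next):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(1)
+
+# Toy model: linear in theta, so grad_theta Psi(x) = x and the sketch
+# stays self-contained; any differentiable network works the same way.
+def model(theta, x):
+    return theta @ x
+
+K, N = 3, 1000
+theta = rng.normal(size=K)             # random initialization theta^(0)
+xs = rng.normal(size=(N, K))           # samples x^(i) ~ pi
+fs = xs @ np.array([1.0, -2.0, 0.5])   # labels from an (unknown) linear f
+
+eta, batch_size = 0.1, 32              # learning rate and |Gamma_j|
+for j in range(200):
+    Gamma = rng.choice(N, size=batch_size, replace=False)  # random subset
+    # Batch-loss gradient: -2/|Gamma| * sum_i (f^(i) - Psi(x^(i))) * grad Psi
+    residuals = fs[Gamma] - np.array([model(theta, x) for x in xs[Gamma]])
+    grad = -2.0 / batch_size * residuals @ xs[Gamma]
+    theta -= eta * grad                # theta^(j+1) = theta^(j) - eta * grad
+
+print(theta)  # should be close to [1.0, -2.0, 0.5]
+```
+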
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
+Lucky for us, we know that $`\Psi_\vartheta`$ is a simple composition of activation functions $`\varphi_\ell`$ and affine maps $`A^{(\ell)}(x^{(\ell-1)}) = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}`$ with derivatives
+
+```math
+\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)}(x)_\gamma =
+\begin{cases}
+\delta_{\alpha\gamma}\, x_\beta & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell,
+\end{cases}
+\qquad\text{and}\qquad
+\partial_{b^{(m)}_{\alpha}} A^{(\ell)}(x)_\gamma =
+\begin{cases}
+\delta_{\alpha\gamma} & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell.
+\end{cases}
+```
 
-- LS regression (and differentiation -> derivative w.r.t. $\vartheta$)
-- loss function
-- back-prob (computing gradient w.r.t. $\vartheta$ by chain-rule)
-- examples of loss functions
+The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}`$ can then be computed using the chain rule due to the compositional structure of the neural network.
+Computing the gradient through the chain rule is still very inefficient and most probably infeasible if done in a naive fashion.
+The so-called _backpropagation_ is essentially a way to compute the partial derivatives layer by layer, storing only the necessary intermediate results to prevent repetitive computations, rendering the computation manageable.
 
 Types of Neural Networks
 ------------------------
 
-| Fully Connected Neural Network | Convolutional Neural Network |
+| Name | Graph |
 | --- | --- |
-| <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250/> |
+| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
+| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250/> |
+| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250/> |
+| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
+| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |
 
 Further Reading
 ---------------
diff --git a/doc/inn.png b/doc/inn.png
new file mode 100644
index 0000000000000000000000000000000000000000..d5b32fc57ac11d68c10f6bde9e2f5b601c9d3561
Binary files /dev/null and b/doc/inn.png differ
diff --git a/doc/res_net.png b/doc/res_net.png
new file mode 100644
index 0000000000000000000000000000000000000000..467340a6c89006386ea8ffb04496317600ab63d3
Binary files /dev/null and b/doc/res_net.png differ
diff --git a/doc/sgd.png b/doc/sgd.png
new file mode 100644
index 0000000000000000000000000000000000000000..2138a2cc7c9f0f27dfd4571d33538226493e6692
Binary files /dev/null and b/doc/sgd.png differ
diff --git a/doc/u_net.png b/doc/u_net.png
new file mode 100644
index 0000000000000000000000000000000000000000..312c59f077a6810df4f08ef1c61e24317aac4ca1
Binary files /dev/null and b/doc/u_net.png differ