diff --git a/doc/tmp.md b/doc/tmp.md
new file mode 100644
index 0000000000000000000000000000000000000000..3152672ef97aea2a2461139ba8a15572c1009888
--- /dev/null
+++ b/doc/tmp.md
@@ -0,0 +1,137 @@
+Neural Networks 101
+-------------------
+
+<div style="text-align: center;">
+ <img src="machine_learning.png" title="ml" alt="ml" height=500 />
+</div>
+
+Table of Contents
+-----------------
+[[_TOC_]]
+
+Nomenclature and Definitions
+----------------------------
+
+First, we need to clarify a few terms: **artificial intelligence**, **machine learning** and **neural network**.
+Everybody categorizes them differently, but we look at it this way:
+
+<br/>
+<div style="text-align: center;">
+ <img src="venn.png" title="venn" alt="venn" height=300 />
+</div>
+<br/>
+
+Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
+To be more precise, we rely on the following definition.
+
+> **Definition** (Neural Network):
+> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$, a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
+> ```math
+> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
+> ```
+> is called a _fully connected feed-forward neural network_.
+
+Typically, we use the following nomenclature:
+- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
+- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
+- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
+- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of layer $`\ell`$.
+- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
+  Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
+- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
+  Note that the $`\varphi_\ell`$ have to be non-linear (otherwise $`\Psi`$ collapses to an affine map); common choices are additionally monotone increasing.
+
+Additionally, the following conventions are used:
+- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
+- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
+- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
+- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
+
+**Example:**
+Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
+Then the neural network is given by the composition
+```math
+\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
+\qquad
+\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
+```
+A typical graphical representation of this neural network looks like this:
+
+<br/>
+<div style="text-align: center;">
+ <img src="nn_fc_example.png" title="nn" alt="nn" width=400 />
+</div>
+<br/>
+
+The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting the nodes of one layer to those of the subsequent one.
+The color indicates the sign of an entry (blue = "+", magenta = "-") and the opacity represents its absolute value (magnitude).
+Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
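+
+To make the definition and the example concrete, here is a minimal NumPy sketch (an illustration only, not taken from any library) that evaluates a fully connected feed-forward network with the widths $`d=(6, 10, 10, 3)`$ and activation $`\tanh`$ from the example above.
+The randomly initialized weights and biases are placeholders for trained parameters.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# widths d = (d_0, ..., d_L) and placeholder parameters theta = (W_1, b_1, ..., W_L, b_L)
+d = (6, 10, 10, 3)
+params = [(rng.standard_normal((d[l], d[l - 1])), rng.standard_normal(d[l]))
+          for l in range(1, len(d))]
+
+def psi(x, params, phi=np.tanh):
+    """Evaluate Psi(x) = [phi_L o (W_L . + b_L) o ... o phi_1 o (W_1 . + b_1)](x)."""
+    for W, b in params:         # W_l has shape (d_l, d_{l-1}), b_l has shape (d_l,)
+        x = phi(W @ x + b)      # x^(l) = phi_l(W_l x^(l-1) + b_l)
+    return x
+
+x = rng.standard_normal(d[0])   # input x^(0) in R^6
+print(psi(x, params).shape)     # output x^(L) in R^3, i.e. (3,)
+```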
+
+Activation Functions
+--------------------
+
+Activation functions can, in principle, be arbitrary non-linear maps.
+The important part is the non-linearity, as otherwise the neural network would simply be an affine function.
+
+Typical examples of continuous activation functions applied in the context of function approximation or regression are:
+
+| ReLU | Leaky ReLU | Sigmoid / Tanh |
+| --- | --- | --- |
+| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
+
+For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are typically employed.
+These networks often use different types of activation functions, such as:
+
+**Examples of discrete activation functions:**
+| Argmax | Softmax | Max-Pooling |
+| --- | --- | --- |
+| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
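+
+For reference, the functions from the two tables can be written down in a few lines of NumPy.
+This is a sketch of the usual textbook formulas, not an excerpt from any particular library; in particular, the max-pooling shown here is a simple 1D variant over non-overlapping windows.
+
+```python
+import numpy as np
+
+def relu(x):
+    """Rectified linear unit: max(x, 0), applied entry-wise."""
+    return np.maximum(x, 0.0)
+
+def leaky_relu(x, alpha=0.01):
+    """Like ReLU, but with a small slope alpha for negative inputs."""
+    return np.where(x >= 0.0, x, alpha * x)
+
+def sigmoid(x):
+    """Logistic sigmoid, mapping R to (0, 1); tanh is the related map onto (-1, 1)."""
+    return 1.0 / (1.0 + np.exp(-x))
+
+def softmax(x):
+    """Smooth, normalized counterpart of argmax; returns a probability vector."""
+    z = np.exp(x - np.max(x))   # shift by the maximum for numerical stability
+    return z / z.sum()
+
+def argmax_onehot(x):
+    """'Hard' counterpart of softmax: one-hot vector indicating the largest entry."""
+    return np.eye(len(x))[np.argmax(x)]
+
+def max_pool_1d(x, size=2):
+    """Max-pooling over non-overlapping windows (len(x) must be divisible by size)."""
+    return x.reshape(-1, size).max(axis=1)
+```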
+
+Minimization Problem
+--------------------
+
+In this section we focus on training a fully connected network for a regression task.
+The principles stay the same for any other objective, such as classification, although some aspects may become more involved.
+
+Let $`M = \sum_{\ell=1}^{L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
+For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
+
+```math
+\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ with activation functions } \varphi\}.
+```
+
+If we want to use a neural network to approximate a function $`f`$, the easiest approach is a least-squares regression in an appropriate norm.
+To simplify the exposition, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
+Assuming that $`f`$ has a finite second moment, we can use the standard $`L^2`$-norm for our least-squares problem:
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
+= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
+```
+
+where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
+As computing the integral above is not feasible for $`K\gg1`$, we consider an empirical version.
+Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples (the superscript now enumerates samples, not layers) and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
+
+> **Definition** (training data):
+> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
+
+The empirical regression problem then reads
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
+=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
+```
+
+> **Definition** (loss function):
+> A _loss function_ is any function that measures how well a neural network approximates the target values.
+
+Typical loss functions for regression and classification tasks are:
+ - mean-square error (MSE, the standard $`L^2`$-error)
+ - weighted $`L^p`$- or $`H^k`$-norms (e.g. for solutions of PDEs)
+ - cross-entropy (difference between distributions)
+ - Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
+ - hinge loss (support vector machines)
+
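+As a small illustration of the first and third items in the list above, the following NumPy sketch implements the mean-square error and the (softmax) cross-entropy.
+The function names and the integer-label convention for the classes are choices made for this example only.
+
+```python
+import numpy as np
+
+def mse(y_pred, y_true):
+    """Mean-square error, i.e. the empirical L^2 loss L_N used above."""
+    return np.mean((y_pred - y_true) ** 2)
+
+def cross_entropy(logits, labels):
+    """Cross-entropy between softmax(logits) and the true classes (integer labels)."""
+    z = logits - logits.max(axis=-1, keepdims=True)                # stable log-softmax
+    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
+    return -np.mean(log_probs[np.arange(len(labels)), labels])
+```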
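+
+To put the pieces together, the following sketch solves the empirical regression problem for a small shallow network with plain gradient descent and manually derived gradients.
+The target function, the sampling distribution $`\pi`$ (standard normal), the network width, the learning rate and the number of steps are assumptions made purely for illustration; moreover, the output layer is kept linear, a common choice for regression.
+In practice one would typically rely on a framework with automatic differentiation and a stochastic optimizer instead.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# hypothetical target function f: R^K -> R (illustration only)
+K, N = 4, 256
+f = lambda x: np.sin(x.sum(axis=-1))
+
+# labeled training data (x^(i), f^(i)) with x^(i) ~ pi = N(0, I)
+X = rng.standard_normal((N, K))
+F = f(X)
+
+# shallow network d = (K, 16, 1) with phi_1 = tanh and a linear output layer
+d1 = 16
+W1 = rng.standard_normal((d1, K)) / np.sqrt(K)
+b1 = np.zeros(d1)
+W2 = rng.standard_normal((1, d1)) / np.sqrt(d1)
+b2 = np.zeros(1)
+
+lr = 1e-2                                  # gradient-descent step size
+for step in range(5000):
+    # forward pass
+    a1 = X @ W1.T + b1                     # pre-activations of layer 1, shape (N, d1)
+    x1 = np.tanh(a1)                       # hidden layer x^(1)
+    y = (x1 @ W2.T + b2).ravel()           # outputs Psi_theta(x^(i)), shape (N,)
+
+    # empirical loss L_N(Psi_theta) = 1/N sum_i (f^(i) - Psi_theta(x^(i)))^2
+    r = y - F
+    loss = np.mean(r ** 2)
+
+    # backward pass: gradients of L_N with respect to theta = (W1, b1, W2, b2)
+    dy = 2.0 * r / N                       # dL/dy, shape (N,)
+    dW2 = (dy @ x1)[None, :]               # shape (1, d1)
+    db2 = np.array([dy.sum()])
+    dx1 = np.outer(dy, W2.ravel())         # dL/dx^(1), shape (N, d1)
+    da1 = dx1 * (1.0 - x1 ** 2)            # chain rule through tanh
+    dW1 = da1.T @ X                        # shape (d1, K)
+    db1 = da1.sum(axis=0)
+
+    # plain gradient-descent update
+    W1 -= lr * dW1; b1 -= lr * db1
+    W2 -= lr * dW2; b2 -= lr * db2
+
+print(f"empirical loss after training: {loss:.4f}")
+```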