Neural Networks 101
-------------------
<div style="text-align: center;">
<img src="machine_learning.png" title="ml" alt="ml" height=500 />
</div>
Table of Contents
-----------------
[[_TOC_]]
Nomenclature and Definitions
----------------------------
First, we need to clarify a few terms: **artificial intelligence**, **machine learning** and **neural network**.
Everybody categorizes them differently, but we look at it this way:
<br/>
<div style="text-align: center;">
<img src="venn.png" title="venn" alt="venn" height=300 />
</div>
<br/>
Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
To be more precise, we will rely on the following definition.
> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
> ```math
> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
> ```
> is called a _fully connected feed-forward neural network_.
Typically, we use the following nomenclature:
- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
- $`b_\ell\in\mathbb{R}^{d_\ell}`$ is the _bias_ of layer $`\ell`$.
- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
Note that $`\varphi_\ell`$ has to be non-linear; typically it is also chosen to be monotone increasing.
Additionally, there exist the following conventions:
- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the composition
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
```
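As a rough illustration, a forward pass of this example network could look as follows in NumPy; the random initialization of $`W_\ell`$ and $`b_\ell`$ is only a placeholder to make the sketch runnable, not part of the definition.

```python
import numpy as np

d = (6, 10, 10, 3)                          # widths d_0, ..., d_L
rng = np.random.default_rng(0)

# placeholder parameters: W_l has shape (d_l, d_{l-1}), b_l has shape (d_l,)
W = [rng.standard_normal((d[l], d[l - 1])) for l in range(1, len(d))]
b = [rng.standard_normal(d[l]) for l in range(1, len(d))]

def psi(x):
    """Evaluate x^(l) = tanh(W_l x^(l-1) + b_l) layer by layer."""
    for W_l, b_l in zip(W, b):
        x = np.tanh(W_l @ x + b_l)
    return x

x = rng.standard_normal(6)                  # some input in R^6
print(psi(x).shape)                         # -> (3,)
```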
A typical graphical representation of the neural network looks like this:
<br/>
<div style="text-align: center;">
<img src="nn_fc_example.png" title="ml" alt="ml" width=400 />
</div>
<br/>
The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents their magnitude (absolute value).
Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
Activation Functions
--------------------
Activation functions can, in principle, be arbitrary non-linear maps.
The important part is the non-linearity, as otherwise the neural network would simply be an affine function.
Typical examples of continuous activation functions applied in the context of function approximation or regression are:
| ReLU | Leaky ReLU | Tanh |
| --- | --- | --- |
| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are employed.
Typically, these networks use additional, discrete types of activation functions.
**Examples of discrete activation functions:**
| Argmax | Softmax | Max-Pooling |
| --- | --- | --- |
| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
Minimization Problem
--------------------
In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but some aspects may be more involved.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$, i.e., the total number of weights and biases.
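For instance, for the example network above with $`d=(6,10,10,3)`$, counting the weights and biases layer by layer gives
```math
M = 10\,(6+1) + 10\,(10+1) + 3\,(10+1) = 70 + 110 + 33 = 213.
```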
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
If we want to use the neural network to approximate a function $`f`$, the easiest approach would be to conduct a least-squares regression in an appropriate norm.
To make things easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integral above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x_1,\dots,x_N\sim\pi`$ be independent (random) samples and assume we have access to $`f_i:=f(x_i)`$ for $`i=1,\dots,N`$ (we use subscripts for the samples to avoid a clash with the layer notation $`x^{(\ell)}`$).
> **Definition** (training data):
> Tuples of the form $`(x_i, f_i)_{i=1}^N`$ are called _labeled training data_.
The empirical regression problem then reads
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f_i - \Psi_\theta(x_i)\bigr)^2
=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
```
> **Definition** (loss function):
> A _loss function_ is any function that measures how well a neural network approximates the target values.
Typical loss functions for regression and classification tasks are
- mean-square error (MSE, standard $`L^2`$-error)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)
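To make the empirical minimization concrete, below is a minimal NumPy sketch that fits a shallow network with $`d=(1,10,1)`$, $`\varphi_1=\tanh`$ and an identity output activation (a common choice for regression) to the toy target $`f(x)=\sin(x)`$ by plain gradient descent on the MSE loss. The target function, sample size, step size and iteration count are arbitrary illustrative choices; in practice one would use a framework with automatic differentiation instead of the hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# labeled training data (x_i, f_i) for the toy target f(x) = sin(x)
N = 200
x = rng.uniform(-np.pi, np.pi, size=N)
f = np.sin(x)

# shallow network d = (1, 10, 1): parameters theta = (W1, b1, W2, b2)
W1 = rng.standard_normal((10, 1)) * 0.5
b1 = np.zeros(10)
W2 = rng.standard_normal((1, 10)) * 0.5
b2 = np.zeros(1)

lr = 0.05                                    # gradient descent step size
X = x[None, :]                               # inputs as a (1, N) array

for step in range(5000):
    # forward pass: Psi(x) = W2 tanh(W1 x + b1) + b2
    h = np.tanh(W1 @ X + b1[:, None])        # hidden layer, shape (10, N)
    pred = (W2 @ h + b2).ravel()             # network output, shape (N,)

    # empirical MSE loss L_N
    loss = np.mean((pred - f) ** 2)

    # backward pass: gradients of L_N w.r.t. the parameters, by hand
    r = 2.0 * (pred - f) / N                 # dL/dpred, shape (N,)
    dW2 = r[None, :] @ h.T                   # shape (1, 10)
    db2 = np.array([r.sum()])
    dh = W2.T @ r[None, :]                   # shape (10, N)
    dz = dh * (1.0 - h ** 2)                 # tanh'(z) = 1 - tanh(z)^2
    dW1 = dz @ X.T                           # shape (10, 1)
    db1 = dz.sum(axis=1)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 1000 == 0:
        print(f"step {step:4d}   MSE {loss:.4f}")
```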