Neural Networks 101
-------------------
<div style="text-align: center;">
<img src="machine_learning.png" title="ml" alt="ml" height=500 />
</div>
Table of Contents
-----------------
[[_TOC_]]
Nomenclature and Definitions
----------------------------
First, we need to clarify a few terms: **artificial intelligence**, **machine learning** and **neural network**.
Everybody categorizes them differently, but we look at it this way:
<br/>
<div style="text-align: center;">
<img src="venn.png" title="venn" alt="venn" height=300 />
</div>
<br/>
Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
To be more precise, we will rely on the following definition.
> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
> ```math
> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
> ```
> is called a _fully connected feed-forward neural network_.
Typically, we use the following nomenclature:
- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
- $`b_\ell\in\mathbb{R}^{d_\ell}`$ is the _bias_ of layer $`\ell`$.
- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
Note that $`\varphi_\ell`$ has to be non-linear; typically it is also chosen to be monotone increasing.
Additionally, there exist the following conventions:
- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the composition
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
```
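As a rough illustration, a forward pass of this example network could look as follows in NumPy; the random initialization of $`W_\ell`$ and $`b_\ell`$ is only a placeholder to make the sketch runnable, not part of the definition.

```python
import numpy as np

d = (6, 10, 10, 3)                          # widths d_0, ..., d_L
rng = np.random.default_rng(0)

# placeholder parameters: W_l has shape (d_l, d_{l-1}), b_l has shape (d_l,)
W = [rng.standard_normal((d[l], d[l - 1])) for l in range(1, len(d))]
b = [rng.standard_normal(d[l]) for l in range(1, len(d))]

def psi(x):
    """Evaluate x^(l) = tanh(W_l x^(l-1) + b_l) layer by layer."""
    for W_l, b_l in zip(W, b):
        x = np.tanh(W_l @ x + b_l)
    return x

x = rng.standard_normal(6)                  # some input in R^6
print(psi(x).shape)                         # -> (3,)
```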
A typical graphical representation of the neural network looks like this:
<br/>
<div style="text-align: center;">
<img src="nn_fc_example.png" title="ml" alt="ml" width=400 />
</div>
<br/>
The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents their magnitude (absolute value).
Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
Activation Functions
--------------------
Activation functions can, in principle, be arbitrary non-linear maps.
The important part is the non-linearity, as otherwise the neural network would simply be an affine function.
Typical examples of continuous activation functions applied in the context of function approximation or regression are:
| ReLU | Leaky ReLU | Tanh |
| --- | --- | --- |
| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are employed.
Typically, these networks use additional, discrete types of activation functions.
**Examples of discrete activation functions:**
| Argmax | Softmax | Max-Pooling |
| --- | --- | --- |
| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
Minimization Problem
--------------------
In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but some aspects may be more involved.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$, i.e., the total number of weights and biases.
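For instance, for the example network above with $`d=(6,10,10,3)`$, counting the weights and biases layer by layer gives
```math
M = 10\,(6+1) + 10\,(10+1) + 3\,(10+1) = 70 + 110 + 33 = 213.
```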
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
If we want to use the neural network to approximate a function $`f`$, the easiest approach would be to conduct a least-squares regression in an appropriate norm.
To make things easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integral above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x_1,\dots,x_N\sim\pi`$ be independent (random) samples and assume we have access to $`f_i:=f(x_i)`$ for $`i=1,\dots,N`$ (we use subscripts for the samples to avoid a clash with the layer notation $`x^{(\ell)}`$).
> **Definition** (training data):
> Tuples of the form $`(x_i, f_i)_{i=1}^N`$ are called _labeled training data_.
The empirical regression problem then reads
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f_i - \Psi_\theta(x_i)\bigr)^2
=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
```
> **Definition** (loss function):
> A _loss function_ is any function that measures how well a neural network approximates the target values.
Typical loss functions for regression and classification tasks are
- mean-square error (MSE, standard $`L^2`$-error)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)
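To make the empirical minimization concrete, below is a minimal NumPy sketch that fits a shallow network with $`d=(1,10,1)`$, $`\varphi_1=\tanh`$ and an identity output activation (a common choice for regression) to the toy target $`f(x)=\sin(x)`$ by plain gradient descent on the MSE loss. The target function, sample size, step size and iteration count are arbitrary illustrative choices; in practice one would use a framework with automatic differentiation instead of the hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# labeled training data (x_i, f_i) for the toy target f(x) = sin(x)
N = 200
x = rng.uniform(-np.pi, np.pi, size=N)
f = np.sin(x)

# shallow network d = (1, 10, 1): parameters theta = (W1, b1, W2, b2)
W1 = rng.standard_normal((10, 1)) * 0.5
b1 = np.zeros(10)
W2 = rng.standard_normal((1, 10)) * 0.5
b2 = np.zeros(1)

lr = 0.05                                    # gradient descent step size
X = x[None, :]                               # inputs as a (1, N) array

for step in range(5000):
    # forward pass: Psi(x) = W2 tanh(W1 x + b1) + b2
    h = np.tanh(W1 @ X + b1[:, None])        # hidden layer, shape (10, N)
    pred = (W2 @ h + b2).ravel()             # network output, shape (N,)

    # empirical MSE loss L_N
    loss = np.mean((pred - f) ** 2)

    # backward pass: gradients of L_N w.r.t. the parameters, by hand
    r = 2.0 * (pred - f) / N                 # dL/dpred, shape (N,)
    dW2 = r[None, :] @ h.T                   # shape (1, 10)
    db2 = np.array([r.sum()])
    dh = W2.T @ r[None, :]                   # shape (10, N)
    dz = dh * (1.0 - h ** 2)                 # tanh'(z) = 1 - tanh(z)^2
    dW1 = dz @ X.T                           # shape (10, 1)
    db1 = dz.sum(axis=1)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 1000 == 0:
        print(f"step {step:4d}   MSE {loss:.4f}")
```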