Neural Networks 101
-------------------
<div style="text-align: center;">
<img src="machine_learning.png" title="ml" alt="ml" height=500 />
</div>
Table of Contents
-----------------
[[_TOC_]]
Nomenclature and Definitions
----------------------------
First, we need to clarify a few terms: **artificial intelligence**, **machine learning** and **neural network**.
Everybody categorizes them differently, but we look at it this way:
<br/>
<div style="text-align: center;">
<img src="venn.png" title="venn" alt="venn" height=300 />
</div>
<br/>
Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
To be more precise, we will rely on the following definition.
> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
> ```math
> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
> ```
> is called a _fully connected feed-forward neural network_.
Typically, we use the following nomenclature:
- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of layer $`\ell`$.
- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
Note that $`\varphi_\ell`$ has to be non-linear and is typically chosen to be monotone increasing.
Additionally, there exist the following conventions:
- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the concatenation
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
```
A typical graphical representation of the neural network looks like this:
<br/>
<div style="text-align: center;">
<img src="nn_fc_example.png" title="ml" alt="ml" width=400 />
</div>
<br/>
The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the values.
Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
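
To make the definition concrete, here is a minimal NumPy sketch of the example network above ($`L=3`$, $`d=(6,10,10,3)`$, $`\tanh`$ activations). The randomly drawn weights and biases are purely illustrative, i.e. this is an untrained network.

```python
import numpy as np

# Example network from above: depth L = 3, widths d = (6, 10, 10, 3), tanh activations.
rng = np.random.default_rng(0)
d = (6, 10, 10, 3)
weights = [rng.standard_normal((d[l + 1], d[l])) for l in range(len(d) - 1)]
biases = [rng.standard_normal(d[l + 1]) for l in range(len(d) - 1)]

def psi(x):
    """Forward pass: x^(l) = phi_l(W_l x^(l-1) + b_l) for l = 1, ..., L."""
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

x = rng.standard_normal(6)   # some input x in R^6
print(psi(x).shape)          # (3,) -- the output lives in R^3
```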
Activation Functions
--------------------
Activation functions can, in principle, be arbitrary non-linear maps.
The important part is the non-linearity, as otherwise the neural network would simply be an affine function.
Typical examples of continuous activation functions applied in the context of function approximation or regression are:
| ReLU | Leaky ReLU | Sigmoid |
| --- | --- | --- |
| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are employed.
Typically, these networks use different types of activation functions, such as:
**Examples of discrete activation functions:**
| Argmax | Softmax | Max-Pooling |
| --- | --- | --- |
| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |

Minimization Problem
--------------------

In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but may be more complicated in some aspects.

Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
If we want to use the neural network to approximate a function $`f`$, the easiest approach is to conduct a least-squares regression in an appropriate norm.
To keep the explanation simple, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.

> **Definition** (training data):
> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.

The empirical regression problem then reads
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta).
```
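
As a small numerical illustration, the empirical loss $`\mathcal{L}_N`$ can be evaluated for an (untrained) toy network; the target function $`f`$, the sample size $`N`$ and the network widths below are arbitrary choices for this sketch.

```python
import numpy as np

# Empirical loss L_N(Psi) = (1/N) * sum_i (f^(i) - Psi(x^(i)))^2 for a toy setup.
rng = np.random.default_rng(1)
d = (6, 10, 10, 1)                                  # K = 6 inputs, scalar output
weights = [rng.standard_normal((d[l + 1], d[l])) for l in range(len(d) - 1)]
biases = [rng.standard_normal(d[l + 1]) for l in range(len(d) - 1)]

def psi(x):
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x[0]                                     # scalar output

def f(x):                                           # assumed target function (illustrative)
    return np.sin(x.sum())

# labeled training data (x^(i), f^(i)), i = 1, ..., N, with x^(i) ~ pi = N(0, I)
N = 100
xs = rng.standard_normal((N, 6))
fs = np.array([f(x) for x in xs])

L_N = np.mean([(fi - psi(xi)) ** 2 for xi, fi in zip(xs, fs)])
print(f"empirical loss L_N = {L_N:.4f}")
```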
> **Definition** (loss function):
> A _loss function_ is any function which measures how well a neural network approximates the target values.

Typical loss functions for regression and classification tasks are

- mean-square error (MSE, standard $`L^2`$-error)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)

To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
```math
0
= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
```
Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (up to millions of degrees of freedom), computing the gradient with respect to all parameters for each training data point and solving the resulting system directly is infeasible.

Optimization (Training)
-----------------------

Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
The easiest and most well-known approach is gradient descent (Euler's method), i.e.
```math
\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
\qquad j=0, 1, 2, \dots
```
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.

The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown using convex minimization and stochastic approximation theory and essentially only requires that the learning rate $`\eta`$ decays at an appropriate rate
**(see ?? for more information)**.

Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
In principle, SGD trades gradient computations of a large number of terms against the convergence rate of the algorithm.
The best metaphor to remember the difference (I know of) is the following:

> **Metaphor (SGD):**
> Assume you and a friend of yours have had a party on the top of a mountain.
> As the party has come to an end, you both want to get back home somewhere in the valley.
> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent and planning each step carefully, as the terrain is very rough.
> Your friend, however, drank a little too much and is not capable of planning anymore.
> So they stagger down the mountain in a more or less random direction.
> Each step they take requires little thought, but it takes them a long time overall to get back home (or at least close to it).
>
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />

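The following PyTorch sketch implements the SGD update for the regression setup above. The architecture, target function, learning rate and batch size are arbitrary choices for this example; the gradient of the loss is computed by automatic differentiation, i.e. the backpropagation discussed below.

```python
import torch

torch.manual_seed(0)

# Fully connected network with d = (6, 10, 10, 1) and tanh activations.
model = torch.nn.Sequential(
    torch.nn.Linear(6, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
)

def f(x):                                   # assumed target function (illustrative)
    return torch.sin(x.sum(dim=-1, keepdim=True))

# labeled training data (x^(i), f^(i)), i = 1, ..., N
N = 1000
xs = torch.randn(N, 6)
fs = f(xs)

eta, batch_size = 1e-2, 32                  # learning rate and size of the random subset Gamma_j
optimizer = torch.optim.SGD(model.parameters(), lr=eta)

for j in range(500):                        # SGD iterations
    idx = torch.randint(0, N, (batch_size,))              # random subset Gamma_j of the samples
    loss = torch.mean((fs[idx] - model(xs[idx])) ** 2)     # empirical loss on the batch
    optimizer.zero_grad()
    loss.backward()                         # gradients via backpropagation
    optimizer.step()                        # theta^(j+1) = theta^(j) - eta * grad

print(float(loss))                          # batch loss after the last iteration
```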
What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for the samples $`i\in\Gamma_j\subset\{1,\dots,N\}`$ of the random subset chosen in step $`j`$.
Lucky for us, we know that $`\Psi_\vartheta`$ is a simple concatenation of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivatives
```math
\partial_{(W_m)_{\alpha,\beta}} A_\ell(x^{(\ell-1)}) =
\begin{cases}
x^{(\ell-1)}_{\beta}\, e_\alpha & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
\qquad\text{and}\qquad
\partial_{(b_m)_{\alpha}} A_\ell(x^{(\ell-1)}) =
\begin{cases}
e_\alpha & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
```
where $`e_\alpha\in\mathbb{R}^{d_\ell}`$ denotes the $`\alpha`$-th canonical unit vector.
The gradient $`\operatorname{\nabla}_\vartheta\Psi_\vartheta`$ can then be computed using the chain rule due to the compositional structure of the neural network.
Computing the gradient through the chain rule naively is, however, still very inefficient and most probably infeasible.
The so-called _backpropagation_ is essentially a way to compute the partial derivatives layer-wise, storing only the necessary information to prevent repetitive computations, rendering the computation manageable.

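In practice, backpropagation is rarely implemented by hand; reverse-mode automatic differentiation, e.g. in PyTorch, performs exactly this layer-wise bookkeeping. A minimal sketch for the illustrative architecture used above:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(                    # same illustrative architecture as above
    torch.nn.Linear(6, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
)

x = torch.randn(6)
output = model(x)[0]                            # Psi_theta(x), a scalar here
output.backward()                               # backpropagation: fills p.grad for every parameter

# M = sum_l d_l (d_{l-1} + 1) = 10*7 + 10*11 + 1*11 = 191 free parameters for this network
M = sum(p.numel() for p in model.parameters())
grad = torch.cat([p.grad.flatten() for p in model.parameters()])
print(M, grad.shape)                            # gradient of Psi w.r.t. all M parameters
```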
Types of Neural Networks
------------------------

| Name | Graph |
| --- | --- |
| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250/> |
| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250/> |
| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |

Further Reading
---------------
- Python: PyTorch, TensorFlow, scikit-learn
- Matlab: Deep Learning Toolbox