Here we focus on neural networks as a special model class used for function approximation.
To be more precise, we will rely on the following definition.

> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
> ```math
> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
> ```
> is called a _fully connected feed-forward neural network_.

Typically, we use the following nomenclature:

- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of layer $`\ell`$.
- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
  Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
  Note that $`\varphi_\ell`$ has to be non-linear (otherwise the composition reduces to an affine map) and is typically chosen monotone increasing.

Additionally, there exist the following conventions:

- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.

**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the concatenation
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\bigl(W_3\,\varphi_2(W_2\,\varphi_1(W_1 x + b_1) + b_2) + b_3\bigr)
= \tanh\bigl(W_3\tanh(W_2\tanh(W_1 x + b_1) + b_2) + b_3\bigr).
```
A typical graphical representation of the neural network looks like this:

The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the values.
Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
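
To make the definition above concrete, here is a minimal sketch (in Python with `numpy`; the random initialization of the weights and biases is an assumption for illustration only) that evaluates the example network with $`d=(6, 10, 10, 3)`$ and $`\tanh`$ activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# widths d = (d_0, ..., d_L) of the example network
d = (6, 10, 10, 3)

# randomly initialized weights W_l (shape d_l x d_{l-1}) and biases b_l (length d_l)
weights = [rng.standard_normal((d[l], d[l - 1])) for l in range(1, len(d))]
biases = [rng.standard_normal(d[l]) for l in range(1, len(d))]

def psi(x):
    """Evaluate the fully connected feed-forward network Psi(x)."""
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)  # x^(l) = phi_l(W_l x^(l-1) + b_l)
    return x

x = rng.standard_normal(d[0])  # input x^(0) in R^6
print(psi(x).shape)            # output x^(L) in R^3 -> (3,)
```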
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a corresponding network architecture as
```math
\mathcal{M}_{d,\varphi} := \bigl\{\, \Psi_\vartheta \colon \mathbb{R}^{d_0}\to\mathbb{R}^{d_L} \;\big\vert\; \vartheta = (W_1, b_1, \dots, W_L, b_L),\ W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}},\ b_\ell\in\mathbb{R}^{d_\ell} \,\bigr\}.
```
If we want to use the neural network to approximate a function $`f`$, the easiest approach would be to conduct a least-squares regression in an appropriate norm.
To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
\in \operatorname*{arg\,min}_{\Psi\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi \Vert_{L^2}^2
= \operatorname*{arg\,min}_{\vartheta\in\mathbb{R}^M} \int \bigl(f(x) - \Psi(x;\vartheta)\bigr)^2 \,\mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.

> **Definition** (training data):
> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.

The empirical regression problem then reads: find $`\Psi_\vartheta`$ minimizing the empirical loss $`\mathcal{L}_N(\vartheta) := \frac{1}{N}\sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr)^2`$.

> **Definition** (loss function):
> A _loss function_ is any function which measures how well a neural network approximates the target values.

Typical loss functions for regression and classification tasks are

- mean-square error (MSE, standard $`L^2`$-error; see the sketch after this list)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)
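
As a concrete instance of the first item, a minimal sketch of the empirical mean-square error $`\mathcal{L}_N`$ (assuming a scalar-valued network `psi` as in the regression setup above and labeled training data `xs`, `fs`; the function names are illustrative) could look like this:

```python
import numpy as np

def mse_loss(psi, xs, fs):
    """Empirical MSE loss L_N = 1/N * sum_i (f^(i) - Psi(x^(i)))^2."""
    residuals = np.array([f_i - psi(x_i) for x_i, f_i in zip(xs, fs)])
    return np.mean(residuals ** 2)
```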
To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
```math
0 = \operatorname{\nabla}_\vartheta \mathcal{L}_N(\vartheta)
= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
```
Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.

Optimization (Training)
-----------------------

The easiest and most well-known approach is gradient descent (Euler's method), i.e., the iteration $`\vartheta^{(j+1)} = \vartheta^{(j)} - \eta\operatorname{\nabla}_\vartheta \mathcal{L}_N(\vartheta^{(j)})`$,
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.

The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown by convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.

Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
In principle, SGD trades gradient computations of a large number of terms against the convergence rate of the algorithm.
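
To illustrate this trade-off in code, here is a minimal sketch of an SGD loop (the helper `grad_loss(theta, batch)`, returning the gradient of the batch loss w.r.t. a flattened parameter vector `theta`, is a hypothetical placeholder and not defined in the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(theta, grad_loss, n_samples, batch_size=32, eta=1e-2, n_steps=1000):
    """Stochastic gradient descent: each step uses only a random subset
    Gamma_j of the N summands in the empirical loss."""
    for _ in range(n_steps):
        batch = rng.choice(n_samples, size=batch_size, replace=False)  # Gamma_j
        theta = theta - eta * grad_loss(theta, batch)                  # gradient step on the batch
    return theta
```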
The best metaphor to remember the difference (I know of) is the following:

>
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />

What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step $`j`$, where $`\Gamma_j`$ denotes the random subset (mini-batch) of samples used in that step.
Luckily for us, we know that $`\Psi_\vartheta`$ is a simple concatenation of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
```math
\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)}(x^{(\ell-1)}) =
\begin{cases}
x^{(\ell-1)}_\beta\, e_\alpha & \text{if } m = \ell,\\
0 & \text{else},
\end{cases}
\qquad\qquad
\partial_{b^{(m)}_{\alpha}} A^{(\ell)}(x^{(\ell-1)}) =
\begin{cases}
e_\alpha & \text{if } m = \ell,\\
0 & \text{else},
\end{cases}
```
where $`e_\alpha\in\mathbb{R}^{d_\ell}`$ denotes the $`\alpha`$-th canonical unit vector.
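
Chaining these derivatives of the affine maps with the derivatives of the activation functions (the chain rule, commonly called backpropagation) yields the full gradient $`\operatorname{\nabla}_\vartheta\Psi_\vartheta`$. A minimal sketch for a scalar-valued network with $`\tanh`$ activations (reusing `weights` and `biases` lists as in the earlier snippet, here with $`d_L = 1`$; this is an illustration under these assumptions, not the exact derivation from the text) could look like this:

```python
import numpy as np

def forward_backward(x, weights, biases):
    """Return Psi(x) and the gradients of the scalar output w.r.t. all
    weights W_l and biases b_l, computed via the chain rule (backpropagation)."""
    # forward pass: store pre-activations a_l and layer outputs x^(l)
    xs, pre = [x], []
    for W, b in zip(weights, biases):
        a = W @ xs[-1] + b
        pre.append(a)
        xs.append(np.tanh(a))
    # backward pass: delta = d Psi / d a_l, propagated from the last layer
    grads_W, grads_b = [], []
    delta = 1.0 - np.tanh(pre[-1]) ** 2  # tanh'(a_L), scalar output (d_L = 1)
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, xs[l]))  # d Psi / d W_l
        grads_b.insert(0, delta)                   # d Psi / d b_l
        if l > 0:
            delta = (weights[l].T @ delta) * (1.0 - np.tanh(pre[l - 1]) ** 2)
    return xs[-1], grads_W, grads_b
```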

Types of Neural Networks
------------------------

| Type | Graphical Representation |
| --- | --- |
| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |

Deep Learning Libraries
-----------------------

| Library | Language Support | Remark |
| --- | --- | --- |
| [PyTorch](https://pytorch.org/) | `Python`, `C++`, `Java` | developed by Facebook |
| [TensorFlow](https://www.tensorflow.org/) | `Python`, `JavaScript`, `Java`, `C`, `Go` | developed by Google |
| [Keras](https://keras.io/) | `Python` | runs on top of [TensorFlow](https://www.tensorflow.org/) |
| [scikit-learn](https://scikit-learn.org/) | `Python` | open source, built on `numpy`, `scipy` and `matplotlib` |
| [Deeplearning Toolbox](https://de.mathworks.com/products/deep-learning.html) | `Matlab` | not free to use |
| [deeplearning4j](https://deeplearning4j.konduit.ai/) | `Java` | Java hook into Python |
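
As a rough impression of how such a library is used in practice (a sketch only: the architecture matches the earlier example, while batch size, learning rate and the synthetic data are assumptions for illustration), defining the network and performing one SGD step in PyTorch could look like this:

```python
import torch

# Psi_theta: R^6 -> R^3 with tanh activations, as in the example above
model = torch.nn.Sequential(
    torch.nn.Linear(6, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 3), torch.nn.Tanh(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # learning rate eta
loss_fn = torch.nn.MSELoss()                              # mean-square error loss

# one SGD step on a random mini-batch (synthetic data, for illustration only)
x_batch = torch.randn(32, 6)
f_batch = torch.randn(32, 3)

optimizer.zero_grad()
loss = loss_fn(model(x_batch), f_batch)
loss.backward()    # backpropagation computes the gradients w.r.t. all parameters
optimizer.step()   # theta <- theta - eta * grad
```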