From 4c21e66fcfd0948144de11c94d72cccba2ac9f59 Mon Sep 17 00:00:00 2001
From: Nando Farchmin <nando.farchmin@gmail.com>
Date: Mon, 4 Jul 2022 12:27:51 +0200
Subject: [PATCH] Test markdown math display

---
 doc/basics.md | 63 +++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

diff --git a/doc/basics.md b/doc/basics.md
index 4c9c977..fa90819 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -25,30 +25,30 @@ Here we focus on neural networks as a special model class used for function appr
 To be more precise, we will rely on the following definition.
 
 > **Definition** (Neural Network):
-> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
+> For any $`L\in\mathbb{N} \text{ and } d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
 > ```math
 > \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet  + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet  + b_2)\circ\varphi_1\circ (W_1\bullet  + b_1)\bigr](x)
 > ```
 > is called a _fully connected feed-forward neural network_.
 
 Typically, we use the following nomenclature:
-- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
-- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the widths of the layers $`\ell`$.
-- $`W_\ell\in\mathbb{R}^{d_{\ell-1}\times d_\ell}`$ are the _weights_ of layer $`\ell`$.
-- $`b_\ell\in\mathbb{R}^{d_\ell}`$ is the _biases_ of layer $`\ell`$.
+- $`L`$ is called the _depth_ of the network.
+- $`d`$ is called the _width(s)_ of the network.
+- $`W_\ell\in\mathbb{R}^{d_{\ell-1}\times d_\ell}`$ are the _weights_ of each layer.
+- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of each layer.
 - $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
-  Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
-- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
-  Note that $`\varphi_\ell`$ has to be non-linear and monotone increasing.
+  Sometimes we write $`\Psi_\vartheta \text{ or } \Psi(x; \vartheta)`$ to indicate the dependence of the neural network on the parameters.
+- $`\varphi_\ell`$ are the _activation functions_ of each layer.
+  Note that the activation functions have to be non-linear and monotone increasing.
 
 Additionally, there exist the following conventions:
-- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
-- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
+- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network.
+- $`x^{(L)}:=\Psi_\vartheta(x)`$ is called the _output (layer)_ of the neural network.
 - Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
-- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and deep otherwise.
+- (debatable) A neural network is called _shallow_ if it has only one hidden layer and _deep_ otherwise.
 
 **Example:**
-Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
+Let $`L=3,\ d=(6, 10, 10, 3) \text{ and } \varphi_1=\varphi_2=\varphi_3=\tanh`$.
 Then the neural network is given by the concatenation
 ```math
 \Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
@@ -63,7 +63,7 @@ A typical graphical representation of the neural network looks like this:
 </div>
 <br/>
 
-The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
+The entries of $`W_\ell,\ \ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
 The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents their magnitude (absolute value).
 Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
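+
+For illustration, a minimal PyTorch sketch of this example network (the variable names here are hypothetical) might look like this:
+
+```python
+import torch
+
+# Example network: L = 3, d = (6, 10, 10, 3), tanh activation in every layer.
+model = torch.nn.Sequential(
+    torch.nn.Linear(6, 10),   # x -> W_1 x + b_1
+    torch.nn.Tanh(),          # phi_1
+    torch.nn.Linear(10, 10),  # x -> W_2 x + b_2
+    torch.nn.Tanh(),          # phi_2
+    torch.nn.Linear(10, 3),   # x -> W_3 x + b_3
+    torch.nn.Tanh(),          # phi_3
+)
+
+x = torch.rand(6)  # input x^(0) in R^6
+y = model(x)       # output x^(3) = Psi(x) in R^3
+```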
 
@@ -101,8 +101,8 @@ For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a c
 ```
 
 If we want to use the neural network to approximate a function $`f`$, the easiest approach would be to conduct a Least-Squares regression in an appropriate norm.
-To make things even easier for the explaination, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
-Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our Least-Square problem:
+To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}, \text{ i.e., }\operatorname{dim}(x^{(0)})=K \text{ and } \operatorname{dim}(x^{(L)})=1`$.
+Assuming the function has a second moment, we can use a standard $`L^2`$-norm for our Least-Squares problem:
 
 ```math
 \text{Find}\qquad \Psi_\vartheta
@@ -112,7 +112,7 @@ Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-n
 
 where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
 As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
-Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
+Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)}),\ i=1,\dots,N`$.
 
 > **Definition** (training data):
 > Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
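+
+For illustration, such labeled training data could be generated as in the following sketch (assuming $`\pi`$ is uniform on $`[0,1]^K`$ and $`f`$ is a hypothetical, analytically known function):
+
+```python
+import torch
+
+K, N = 6, 1000
+
+def f(x):
+    # hypothetical target function f: R^K -> R
+    return torch.sin(x.sum(dim=-1))
+
+x_train = torch.rand(N, K)  # samples x^(1), ..., x^(N) ~ pi (uniform on [0, 1]^K)
+f_train = f(x_train)        # labels f^(i) := f(x^(i)), i = 1, ..., N
+```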
@@ -128,11 +128,9 @@ The empirical regression problem then reads
 > **Definition** (loss function):
 > A _loss function_ is any function that measures how well a neural network approximates the target values.
 
-**TODO: Is there a maximum number of inline math?**
-
 Typical loss functions for regression and classification tasks are
   - mean-square error (MSE, standard $`L^2`$-error)
-  - weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
+  - weighted $`L^p \text{- or } H^k\text{-}`$norms (solutions of PDEs)
   - cross-entropy (difference between distributions)
   - Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
   - Hinge loss (SVM)
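+
+Many of these are available out of the box, e.g. in PyTorch; a short sketch (the tensors are hypothetical placeholders):
+
+```python
+import torch
+
+prediction = torch.randn(10, 1)  # network outputs Psi(x^(i))
+target = torch.randn(10, 1)      # target values f^(i)
+
+mse = torch.nn.MSELoss()         # mean-square error
+loss = mse(prediction, target)
+
+# further built-in losses include torch.nn.CrossEntropyLoss(),
+# torch.nn.KLDivLoss() and torch.nn.HingeEmbeddingLoss()
+```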
@@ -145,8 +143,8 @@ To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the f
 = -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
 ```
 
-Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
-As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.
+Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network with respect to the network parameters $`\vartheta`$.
+As $`\vartheta\in\mathbb{R}^M \text{ with } M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.
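+
+To get a feeling for the objects involved, here is a sketch of how this gradient can be obtained via automatic differentiation in PyTorch (model and data are hypothetical placeholders):
+
+```python
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(6, 10), torch.nn.Tanh(), torch.nn.Linear(10, 1))
+M = sum(p.numel() for p in model.parameters())  # number of free parameters (small here, in practice M >> 1)
+
+x_train = torch.rand(1000, 6)                           # samples x^(i)
+f_train = torch.sin(x_train.sum(dim=-1, keepdim=True))  # hypothetical labels f^(i)
+
+loss = ((f_train - model(x_train)) ** 2).mean()  # empirical L^2 loss
+loss.backward()                                  # fills p.grad for every parameter p
+```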
  
 Optimization (Training)
 -----------------------
@@ -162,9 +160,8 @@ The easiest and most well known approach is gradient descent (Euler's method), i
 where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
 
 The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
-In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
-Convergence of SGD can be shown by convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ with an appropriate rate.
-**(see ?? for mor information)**
+In particular, we can use the law of large numbers and restrict the sum in our loss to a random subset of summands of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
+Convergence of SGD can be shown using convex minimization and stochastic approximation theory and only requires that the learning rate decays at an appropriate rate.
 
 Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
 In principle, SGD trades gradient computations of a large number of terms against the convergence rate of the algorithm.
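+
+A rough sketch of such an SGD loop with random mini-batches (PyTorch; model, data and hyperparameters are hypothetical placeholders):
+
+```python
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(6, 10), torch.nn.Tanh(), torch.nn.Linear(10, 1))
+x_train = torch.rand(1000, 6)
+f_train = torch.sin(x_train.sum(dim=-1, keepdim=True))
+
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate eta = 0.1
+batch_size = 32
+
+for step in range(500):
+    idx = torch.randint(0, x_train.shape[0], (batch_size,))  # random subset of the training data
+    loss = ((f_train[idx] - model(x_train[idx])) ** 2).mean()
+    optimizer.zero_grad()
+    loss.backward()   # gradient of the mini-batch loss only
+    optimizer.step()  # theta <- theta - eta * gradient
+```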
@@ -180,8 +177,8 @@ The best metaphor to remember the difference (I know of) is the following:
 >
 > <img src="sgd.png" title="sgd" alt="sgd" height=400 />
 
-What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
-Lucky for us, we know that $`\Psi_\vartheta`$ is a simple concatenation of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}} \text{ for } i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
+Lucky for us, we know that our neural network is a simple concatenation of activation functions and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
 
 ```math
 \partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)} = 
@@ -212,8 +209,14 @@ Types of Neural Networks
 | Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
 | Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |
 
-Further Reading
----------------
+Deep Learning Libraries
+-----------------------
 
-- Python: PyTorch, TensorFlow, Scikit learn
-- Matlab: Deeplearning Toolbox
+| Library | Language Support | Remark |
+| --- | --- | --- |
+| [PyTorch](https://pytorch.org/) | `Python`, `C++`, `Java` | developed by Facebook |
+| [TensorFlow](https://www.tensorflow.org/) | `Python`, `JavaScript`, `Java`, `C`, `Go` | developed by Google |
+| [Keras](https://keras.io/) | `Python` | Runs on top of [TensorFlow](https://www.tensorflow.org/) |
+| [scikit-learn](https://scikit-learn.org/) | `Python` | open source, built on `numpy`, `scipy` and `matplotlib` |
+| [Deep Learning Toolbox](https://de.mathworks.com/products/deep-learning.html) | `Matlab` | not free to use |
+| [deeplearning4j](https://deeplearning4j.konduit.ai/)| `Java` | Java hook into Python |
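+
+To give a feeling for how little code such a library requires, here is a small sketch using scikit-learn (the data are hypothetical placeholders):
+
+```python
+import numpy as np
+from sklearn.neural_network import MLPRegressor
+
+# hypothetical training data for a target f: R^6 -> R
+x_train = np.random.rand(1000, 6)
+f_train = np.sin(x_train.sum(axis=1))
+
+# two hidden layers of width 10 with tanh activations
+model = MLPRegressor(hidden_layer_sizes=(10, 10), activation="tanh", max_iter=2000)
+model.fit(x_train, f_train)
+prediction = model.predict(np.random.rand(5, 6))
+```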
-- 
GitLab