diff --git a/README.md b/README.md
index 923b874817734a8159fb49fa0f55b758323d789d..0bdc797b26437d44f19f1999ace750c6408238d5 100644
--- a/README.md
+++ b/README.md
@@ -26,5 +26,16 @@ where `/home/.../` has to be changed to the path this repository is located in.
 
 ## Usage
 
+The basic theory behind neural networks is explained in [doc/basics.md](doc/basics.md).
+There you can also find additional references and links to get you going on your in-depth neural network adventure.
+
+Besides the theory, I added some scripts to give you a basic coding structure and show you how to employ neural networks with [PyTorch](https://pytorch.org/).
+The following table gives you an overview of the scripts and what they do.
+
+| File | Description |
+| --- | --- |
+| [app/function_approximation.py](app/function_approximation.py) | A simple benchmark on how to approximate a 2D sine function with a neural network, broken down to a few simple steps. The code used for the steps can be found in [src/approximation.py](src/approximation.py). |
+| [app/mnist_image_classification.py](app/mnist_image_classification.py) | A simple benchmark for image classification using the MNIST data set. The code used for the steps can be found in [src/mnist.py](src/mnist.py). |
+
 ## License
 This software runs under the GNU General Public License 3.0.
diff --git a/doc/basics.md b/doc/basics.md
index e6688f48a747c03cc79065db34dd7c6268b1826a..fa9081986259d8acebd2be5de4c3e3dc2e16562b 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -1,10 +1,14 @@
 Neural Networks 101
-===================
+-------------------
 
 <div style="text-align: center;">
-    <img src="machine_learning.png" title="ml" alt="ml" height=400 />
+    <img src="machine_learning.png" title="ml" alt="ml" height=500 />
 </div>
 
+Table of Contents
+-----------------
+[[_TOC_]]
+
 Nomenclature and Definitions
 ----------------------------
 
@@ -21,36 +25,36 @@ Here we focus on neural networks as a special model class used for function appr
 To be more precise, we will rely on the following definition.
 
 > **Definition** (Neural Network):
-> For any $L\in\mathbb{N}$ and $d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}$ a non-linear map $\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}$ of the form
-> $$
+> For any $`L\in\mathbb{N} \text{ and } d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
+> ```math
 > \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet  + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet  + b_2)\circ\varphi_1\circ (W_1\bullet  + b_1)\bigr](x)
-> $$
+> ```
 > is called a _fully connected feed-forward neural network_.
 
 Typically, we use the following nomenclature:
-- $L$ is called the _depth_ of the network with layers $\ell=0,\dots,L$.
-- $d$ is called the _width_ of the network, where $d_\ell$ is the widths of the layers $\ell$.
-- $W_\ell\in\mathbb{R}^{d_{\ell-1}\times d_\ell}$ are the _weights_ of layer $\ell$.
-- $b_\ell\in\mathbb{R}^{d_\ell}$ is the _biases_ of layer $\ell$.
-- $\vartheta=(W_1,b_1,\dots,W_L,b_L)$ are the _free parameters_ of the neural network.
-  Sometimes we write $\Psi_\vartheta$ or $\Psi(x; \vartheta)$ to indicate the dependence of $\Psi$ on the parameters $\vartheta$.
-- $\varphi_\ell$ is the _activation function_ of layer $\ell$.
-  Note that $\varphi_\ell$ has to be non-linear and monotone increasing.
+- $`L`$ is called the _depth_ of the network.
+- $`d`$ is called the _width(s)_ of the network.
+- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of each layer.
+- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of each layer.
+- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
+  Sometimes we write $`\Psi_\vartheta \text{ or } \Psi(x; \vartheta)`$ to indicate the dependence of the neural network on the parameters.
+- $`\varphi_\ell`$ are the _activation functions_ of each layer.
+  Note that the activation functions have to be non-linear and monotone increasing.
 
 Additionally, there exist the following conventions:
-- $x^{(0)}:=x$ is called the _input (layer)_ of the neural network $\Psi$.
-- $x^{(L)}:=\Psi(x)$ is called the _output (layer)_ of the neural network $\Psi$.
-- Intermediate results $x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)$ are called _hidden layers_.
-- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($L=2$) and deep otherwise.
+- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network.
+- $`x^{(L)}:=\Psi_\vartheta(x)`$ is called the _output (layer)_ of the neural network.
+- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
+- (debatable) A neural network is called _shallow_ if it has only one hidden layer and deep otherwise.
 
 **Example:**
-Let $L=3$, $d=(6, 10, 10, 3)$ and $\varphi_1=\varphi_2=\varphi_3=\mathrm{ReLU}$.
+Let $`L=3,\ d=(6, 10, 10, 3) \text{ and } \varphi_1=\varphi_2=\varphi_3=\tanh`$.
 Then the neural network is given by the concatenation
-$$
+```math
 \Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
 \qquad
 \Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
-$$
+```
 A typical graphical representation of the neural network looks like this:
 
 <br/>
@@ -59,9 +63,9 @@ A typical graphical representation of the neural network looks like this:
 </div>
 <br/>
 
-The entries of $W_\ell$, $\ell=1,2,3$, are depicted as lines connecting nodes in one layer to the subsequent one.
-the color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the values.
-Note that neither the employed actication functions $\varphi_\ell$ nor the biases $b_\ell$ are represented in this graph.
+The entries of $`W_\ell,\ \ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
+The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the values.
+Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
 
 Activation Functions
 --------------------
@@ -71,32 +75,148 @@ The important part is the non-linearity, as otherwise the neural network would b
 
 Typical examples of continuous activation functions applied in the context of function approximation or regression are:
 
-ReLU | Leaky ReLU | Sigmoid
-- | - | -
-<img src="relu.png" title="ReLU" alt="ReLU" width=300 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" width=300 /> | <img src="tanh.png" title="tanh" alt="tanh" width=300 />
+| ReLU | Leaky ReLU | Tanh |
+| --- | --- | --- |
+| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
 
 For classification tasks, such as image recognition, so called convolutional neural networks (CNNs) are employed.
 Typically, these networks use different types of activation functions, such as:
 
 **Examples for discrete activation functions:**
-Argmax | Softmax | Max-Pooling
-- | - | -
-<img src="argmax.png" title="argmax" alt="argmax" width=300 /> | <img src="softmax.png" title="softmax" alt="softmax" width=300 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" width=300 />
+| Argmax | Softmax | Max-Pooling |
+| --- | --- | --- |
+| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
 
-More infos on CNNs follow below.
+Minimization Problem
+--------------------
 
-Training
---------
+In this section we focus on training a fully-connected network for a regression task.
+The principles stay the same for any other objective, such as classification, but may be more complicated in some aspects.
+
+Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
+For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
+
+```math
+\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
+```
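+
+As a quick sanity check of the formula for $`M`$, consider the example network from above with $`L=3 \text{ and } d=(6, 10, 10, 3)`$.
+Each layer contributes $`d_\ell\, d_{\ell-1}`$ weights and $`d_\ell`$ biases, so the count works out as follows:
+
+```math
+M = d_1(d_0+1) + d_2(d_1+1) + d_3(d_2+1)
+= 10\cdot 7 + 10\cdot 11 + 3\cdot 11
+= 70 + 110 + 33
+= 213.
+```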
+
+If we want to use the neural network to approximate a function $`f`$, the easiest approach is to conduct a least-squares regression in an appropriate norm.
+To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}, \text{ i.e., }\operatorname{dim}(x^{(0)})=K \text{ and } \operatorname{dim}(x^{(L)})=1`$.
+Assuming the function has a finite second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
+= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
+```
+
+where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
+As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
+Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)}),\ i=1,\dots,N`$.
+
+> **Definition** (training data):
+> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
+
+The empirical regression problem then reads
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
+=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta)
+```
+
+> **Definition** (loss function):
+> A _loss function_ is any function that measures how well a neural network approximates the target values.
+
+Typical loss functions for regression and classification tasks are
+  - mean-square error (MSE, standard $`L^2`$-error)
+  - weighted $`L^p \text{- or } H^k\text{-}`$norms (solutions of PDEs)
+  - cross-entropy (difference between distributions)
+  - Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
+  - Hinge loss (SVM)
+
+To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
+
+```math
+0
+= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
+= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
+```
+
+Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network with respect to the network parameters $`\vartheta`$.
+As $`\vartheta\in\mathbb{R}^M \text{ with } M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.
+ 
+Optimization (Training)
+-----------------------
+
+Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
+The easiest and most well known approach is gradient descent (Euler's method), i.e.
+
+```math
+\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
+\qquad j=0, 1, 2, \dots
+```
+
+where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
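+
+To see how the learning rate affects the iteration, here is a small worked example.
+The scalar toy loss $`\mathcal{L}(\vartheta) = (\vartheta - 1)^2`$ is a hypothetical stand-in chosen only for illustration and is not one of the losses used in this repository.
+
+```math
+\vartheta^{(j+1)} = \vartheta^{(j)} - 2\eta\,(\vartheta^{(j)} - 1)
+\qquad\Longrightarrow\qquad
+\vartheta^{(j)} - 1 = (1 - 2\eta)^j\,(\vartheta^{(0)} - 1).
+```
+
+For this toy problem the iteration converges to the minimizer $`\vartheta = 1`$ exactly when $`0 < \eta < 1`$.
+With $`\eta = 1/4 \text{ and } \vartheta^{(0)} = 0`$ the error halves in every step, i.e. $`\vartheta^{(1)} = 0.5 \text{ and } \vartheta^{(2)} = 0.75`$.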
+
+The key reason why gradient descent is more promising than solving the first-order optimality criterion directly is its iterative character.
+In particular, we can use the law of large numbers and restrict the number of summands in our loss to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
+Convergence of SGD can be shown by convex minimization and stochastic approximation theory and only requires that the learning rate decays at an appropriate rate.
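+
+Written out, one SGD step could look as follows, where $`\Gamma_j\subset\{1,\dots,N\}`$ is the random index subset drawn in step $`j`$.
+The symbol $`\mathcal{L}_{\Gamma_j}`$ for the mini-batch loss and the step-dependent learning rate $`\eta_j`$ are notation assumed for this sketch.
+
+```math
+\mathcal{L}_{\Gamma_j}(\Psi_\vartheta) := \frac{1}{\vert\Gamma_j\vert} \sum_{i\in\Gamma_j} \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr)^2,
+\qquad
+\vartheta^{(j+1)} = \vartheta^{(j)} - \eta_j \operatorname{\nabla}_{\vartheta}\mathcal{L}_{\Gamma_j}(\Psi_{\vartheta^{(j)}}).
+```
+
+If $`\Gamma_j`$ is drawn uniformly at random, the mini-batch gradient equals the full gradient $`\operatorname{\nabla}_\vartheta\mathcal{L}_N`$ in expectation, while each step only needs $`\vert\Gamma_j\vert`$ instead of $`N`$ gradient evaluations.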
+
+Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
+In principle, SGD trades the cost of computing the gradients of a large number of terms against the convergence rate of the algorithm.
+The best metaphor I know of to remember the difference is the following:
+
+> **Metaphor (SGD):**
+> Assume you and a friend of yours have had a party on the top of a mountain.
+> As the party has come to an end, you both want to get back home somewhere in the valley.
+> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent, planning each step carefully as the terrain is very rough.
+> Your friend, however, drank a little too much and is not capable of planning anymore.
+> So they stagger down the mountain in a more or less random direction.
+> They take each step with little thought, but overall it takes them a long time to get back home (or at least close to it).
+>
+> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
+
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)}) \text{ for } i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step $`j`$, where $`\Gamma_j`$ denotes the random index subset used in that step.
+Lucky for us, we know that our neural network is a simple concatenation of activation functions and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$, whose components have, for a fixed input $`x^{(\ell-1)}`$ and with the Kronecker delta $`\delta_{\gamma\alpha}`$, the partial derivatives
+
+```math
+\partial_{(W_m)_{\alpha,\beta}} \bigl(A_\ell(x^{(\ell-1)})\bigr)_\gamma =
+\begin{cases}
+\delta_{\gamma\alpha}\, x^{(\ell-1)}_\beta & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell,
+\end{cases}
+\qquad\text{and}\qquad
+\partial_{(b_m)_{\alpha}} \bigl(A_\ell(x^{(\ell-1)})\bigr)_\gamma =
+\begin{cases}
+\delta_{\gamma\alpha} & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell.
+\end{cases}
+```
+
+The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}`$ can then be computed using the chain rule due to the compositional structure of the neural network.
+Computing the gradient through the chain rule is still very inefficient and most probably infeasible if done in a naive fashion.
+The so-called _backpropagation_ algorithm is essentially a way to compute the partial derivatives layer-wise, storing only the necessary information to prevent repetitive computations, which renders the computation manageable.
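+
+To make the layer-wise idea concrete, here is a sketch of the backpropagation recursion for a single training pair $`(x, f)`$ with scalar output and squared error $`e = (f - x^{(L)})^2`$.
+The pre-activations $`z^{(\ell)}`$ and the auxiliary vectors $`g^{(\ell)}`$ are notation introduced only for this sketch, and the weights are assumed to act on column vectors, i.e. $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$.
+
+```math
+\begin{aligned}
+&\text{forward:} & z^{(\ell)} &= W_\ell\, x^{(\ell-1)} + b_\ell, \quad x^{(\ell)} = \varphi_\ell(z^{(\ell)}), & \ell &= 1,\dots,L,\\
+&\text{backward:} & g^{(L)} &= -2\,\bigl(f - x^{(L)}\bigr)\, \varphi_L'(z^{(L)}), \\
+& & g^{(\ell)} &= \bigl(W_{\ell+1}^\top g^{(\ell+1)}\bigr) \odot \varphi_\ell'(z^{(\ell)}), & \ell &= L-1,\dots,1,\\
+&\text{gradients:} & \partial_{W_\ell}\, e &= g^{(\ell)} \bigl(x^{(\ell-1)}\bigr)^\top, \quad \partial_{b_\ell}\, e = g^{(\ell)}, & \ell &= 1,\dots,L.
+\end{aligned}
+```
+
+Here $`\odot`$ denotes the entry-wise product.
+Each $`g^{(\ell)}`$ is computed once from $`g^{(\ell+1)}`$ and the stored forward values, which is the saving compared to expanding the chain rule separately for every parameter.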
 
 Types of Neural Networks
 ------------------------
 
-Fully Connected Neural Network| Convolutional Neural Network
-- | -
-![bla](nn_fc.png) | ![conv](nn_conv.png)
-
-Further Reading
----------------
-
-- Python: PyTorch, TensorFlow, Scikit learn
-- Matlab: Deeplearning Toolbox
+| Name | Graph |
+| --- | --- |
+| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
+| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250/> |
+| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250/> |
+| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250/> |
+| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250/> |
+
+Deep Learning Libraries
+-----------------------
+
+| Library | Language Support | Remark |
+| --- | --- | --- |
+| [PyTorch](https://pytorch.org/) | `Python`, `C++`, `Java` | developed by Facebook |
+| [TensorFlow](https://www.tensorflow.org/) | `Python`, `JavaScript`, `Java`, `C`, `Go` | developed by Google |
+| [Keras](https://keras.io/) | `Python` | Runs on top of [TensorFlow](https://www.tensorflow.org/) |
+| [scikit-learn](https://scikit-learn.org/) | `Python` | open source, built on `numpy`, `scipy` and `matplotlib` |
+| [Deeplearning Toolbox](https://de.mathworks.com/products/deep-learning.html) | `Matlab` | not free to use |
+| [deeplearning4j](https://deeplearning4j.konduit.ai/)| `Java` | Java hook into Python |
diff --git a/doc/inn.png b/doc/inn.png
new file mode 100644
index 0000000000000000000000000000000000000000..d5b32fc57ac11d68c10f6bde9e2f5b601c9d3561
Binary files /dev/null and b/doc/inn.png differ
diff --git a/doc/res_net.png b/doc/res_net.png
new file mode 100644
index 0000000000000000000000000000000000000000..467340a6c89006386ea8ffb04496317600ab63d3
Binary files /dev/null and b/doc/res_net.png differ
diff --git a/doc/sgd.png b/doc/sgd.png
new file mode 100644
index 0000000000000000000000000000000000000000..2138a2cc7c9f0f27dfd4571d33538226493e6692
Binary files /dev/null and b/doc/sgd.png differ
diff --git a/doc/u_net.png b/doc/u_net.png
new file mode 100644
index 0000000000000000000000000000000000000000..312c59f077a6810df4f08ef1c61e24317aac4ca1
Binary files /dev/null and b/doc/u_net.png differ