where `/home/.../` has to be changed to the path this repository is located in.
## Usage
The basic theory behind neural networks is explained in [doc/basics.md](doc/basics.md).
There you can also find additional references and links to get you going on your in-depth neural network adventure.
Besides the theory, I added some scripts to give you a basic coding structure and show you how to employ neural networks with [PyTorch](https://pytorch.org/).
The following table gives you an overview of the scripts and what they do.
| File | Description |
| --- | --- |
| [app/function_approximation.py](app/function_approximation.py) | A simple benchmark on how to approximate a 2D sine function with a neural network, broken down to a few simple steps. The code used for the steps can be found in [src/approximation.py](src/approximation.py). |
| [app/mnist_image_classification.py](app/mnist_image_classification.py) | A simple benchmark for image classification using the MNIST data set. The code used for the steps can be found in [src/mnist.py](src/mnist.py). |
## License
This software is released under the GNU General Public License 3.0.
Neural Networks 101
-------------------
<div style="text-align: center;">
<img src="machine_learning.png" title="ml" alt="ml" height=500 />
</div>
Table of Contents
-----------------
[[_TOC_]]
Nomenclature and Definitions
----------------------------
Here we focus on neural networks as a special model class used for function approximation.
To be more precise, we will rely on the following definition.
> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$, a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
> ```math
> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
> ```
> is called a _fully connected feed-forward neural network_.
Typically, we use the following nomenclature:
- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of layer $`\ell`$.
- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
  Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x;\vartheta)`$ to indicate the dependence of the neural network on its parameters.
- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
  Note that the activation functions have to be non-linear and monotone increasing.
Additionally, there exist the following conventions:
- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network.
- $`x^{(L)}:=\Psi_\vartheta(x)`$ is called the _output (layer)_ of the neural network.
- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
- (debatable) A neural network is called _shallow_ if it has only one hidden layer and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the concatenation
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
```
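
For illustration, this example network might be written in [PyTorch](https://pytorch.org/) roughly as follows (a minimal sketch using `torch.nn.Sequential`, not code taken from the repository's scripts):

```python
import torch
import torch.nn as nn

# The example network: L = 3, d = (6, 10, 10, 3), tanh in every layer
# (including the output layer, exactly as in the formula above).
model = nn.Sequential(
    nn.Linear(6, 10),   # x -> W_1 x + b_1
    nn.Tanh(),          # phi_1
    nn.Linear(10, 10),  # x^(1) -> W_2 x^(1) + b_2
    nn.Tanh(),          # phi_2
    nn.Linear(10, 3),   # x^(2) -> W_3 x^(2) + b_3
    nn.Tanh(),          # phi_3
)

x = torch.randn(6)       # a single input x in R^6
print(model(x).shape)    # torch.Size([3]), i.e. the output Psi(x) in R^3
```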
A typical graphical representation of the neural network looks like this:
<br/>
</div>
<br/>
The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting the nodes in one layer to those in the subsequent one.
The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the entries.
Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
Activation Functions
--------------------
The important part is the non-linearity, as otherwise the neural network would simply be a composition of affine maps, i.e., affine itself.
Typical examples of continuous activation functions applied in the context of function approximation or regression are:
ReLU | Leaky ReLU | Sigmoid |
--- | --- | --- |
<img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are employed.
Typically, these networks use different types of activation functions, such as the following.
**Examples of discrete activation functions:**
Argmax | Softmax | Max-Pooling |
--- | --- | --- |
<img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
Minimization Problem
--------------------
In this section we focus on training a fully connected network for a regression task.
The principles stay the same for any other objective, such as classification, although some aspects may become more involved.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
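
As a quick sanity check, the degree-of-freedom count $`M`$ can be compared against PyTorch's own parameter count; the following sketch reuses the example topology $`d=(6,10,10,3)`$ from above (an illustration, not part of the repository code):

```python
import torch.nn as nn

d = (6, 10, 10, 3)                                         # example topology from above
M = sum(d[l] * (d[l - 1] + 1) for l in range(1, len(d)))   # 70 + 110 + 33 = 213

model = nn.Sequential(
    nn.Linear(6, 10), nn.Tanh(),
    nn.Linear(10, 10), nn.Tanh(),
    nn.Linear(10, 3), nn.Tanh(),
)
M_torch = sum(p.numel() for p in model.parameters())       # counts all W_l and b_l entries

assert M == M_torch == 213
```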
If we want to use a neural network to approximate a function $`f`$, the easiest approach would be to conduct a least-squares regression in an appropriate norm.
To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function has a finite second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)}),\ i=1,\dots,N`$.
> **Definition** (training data):
> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
The empirical regression problem then reads
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta)
```
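
A hedged sketch of how the empirical loss $`\mathcal{L}_N`$ might be evaluated in PyTorch; the target $`f(x)=\sin(x_1)\sin(x_2)`$, the sampling distribution and the network width are made-up choices for illustration only:

```python
import math
import torch
import torch.nn as nn

def f(x):                                        # hypothetical target f: R^2 -> R
    return torch.sin(x[:, 0]) * torch.sin(x[:, 1])

K, N = 2, 1000
x = torch.rand(N, K) * 2 * math.pi               # samples x^(i) ~ pi (uniform on [0, 2*pi]^2)
y = f(x)                                         # labels f^(i) = f(x^(i))

model = nn.Sequential(nn.Linear(K, 32), nn.Tanh(), nn.Linear(32, 1))

loss_N = torch.mean((y - model(x).squeeze(-1)) ** 2)   # empirical loss L_N(Psi_theta)
print(loss_N.item())
```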
> **Definition** (loss function):
> A _loss function_ is any function that measures how well a neural network approximates the target values.
Typical loss functions for regression and classification tasks are listed below (a PyTorch sketch of a few of them follows the list):
- mean-square error (MSE, standard $`L^2`$-error)
- weighted $`L^p \text{- or } H^k\text{-}`$norms (solutions of PDEs)
- cross-entropy (difference between distributions)
- Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
- Hinge loss (SVM)
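
Most of these are available off the shelf in PyTorch; a small sketch with made-up tensors (not tied to the repository's benchmarks):

```python
import torch
import torch.nn as nn

pred = torch.randn(8, 3)                      # raw network outputs (logits) for 8 samples
target_reg = torch.randn(8, 3)                # regression targets
target_cls = torch.randint(0, 3, (8,))        # class labels in {0, 1, 2}

mse = nn.MSELoss()(pred, target_reg)          # mean-square error
ce = nn.CrossEntropyLoss()(pred, target_cls)  # cross-entropy, expects raw logits
kl = nn.KLDivLoss(reduction="batchmean")(     # Kullback-Leibler divergence ...
    torch.log_softmax(pred, dim=1),           # ... expects log-probabilities
    torch.softmax(torch.randn(8, 3), dim=1),  # ... and a probability distribution
)
# Hinge-type losses (e.g. nn.HingeEmbeddingLoss) exist as well but follow
# different input conventions.
print(mse.item(), ce.item(), kl.item())
```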
To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
```math
0
= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
```
Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network with respect to the network parameters $`\vartheta`$.
As $`\vartheta\in\mathbb{R}^M \text{ with } M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.
Optimization (Training)
-----------------------
Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
The easiest and most well-known approach is gradient descent (Euler's method), i.e.
```math
\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
\qquad j=0, 1, 2, \dots
```
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
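
Spelled out in PyTorch, a single plain gradient-descent step might look like this (a hedged sketch with a toy model and random data, not the repository's training code):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
x, y = torch.randn(100, 2), torch.randn(100, 1)   # toy training data
eta = 1e-2                                        # learning rate

loss = torch.mean((y - model(x)) ** 2)            # L_N evaluated at theta^(j)
loss.backward()                                   # fills p.grad for every parameter p
with torch.no_grad():
    for p in model.parameters():
        p -= eta * p.grad                         # theta^(j+1) = theta^(j) - eta * grad
        p.grad.zero_()                            # reset gradients for the next step
```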
The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can use the law of large numbers and restrict the number of summands in our loss to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown using convex minimization and stochastic approximation theory and only requires that the learning rate decays at an appropriate rate.
Here, however, I want to focus more on the difference between "normal" GD and SGD on an intuitive level.
In principle, SGD trades the cost of computing gradients for a large number of terms against the convergence rate of the algorithm.
The best metaphor I know of to remember the difference is the following:
> **Metaphor (SGD):**
> Assume you and a friend of yours have had a party on the top of a mountain.
> As the party has come to an end, you both want to get back home somewhere in the valley.
> You, scientist that you are, plot the most direct way down the mountain, following the steepest descent and planning each step carefully, as the terrain is very rough.
> Your friend, however, drank a little too much and is not capable of planning anymore.
> So they stagger down the mountain in a more or less random direction.
> Each step they take requires little thought, but overall it takes them a long time to get back home (or at least close to it).
>
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
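
Translated into code, the minibatch idea might look roughly like the following sketch (toy data and architecture, chosen only to illustrate the random subset $`\Gamma_j`$; not the repository's benchmarks):

```python
import torch
import torch.nn as nn

N, K, batch_size, eta = 1000, 2, 32, 1e-2
x = torch.rand(N, K)                                   # samples x^(i)
y = torch.sin(x[:, 0:1]) * torch.sin(x[:, 1:2])        # labels f^(i) (made-up target)

model = nn.Sequential(nn.Linear(K, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=eta)

for j in range(2000):
    idx = torch.randint(0, N, (batch_size,))           # random subset Gamma_j
    loss = torch.mean((y[idx] - model(x[idx])) ** 2)   # loss on the subset only
    optimizer.zero_grad()
    loss.backward()                                    # gradients via backpropagation
    optimizer.step()                                   # theta <- theta - eta * grad
```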
What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each iteration $`j`$, where $`\Gamma_j`$ is the random subset (batch) chosen in that iteration.
Luckily for us, we know that our neural network is a simple concatenation of activation functions and affine maps $`A_\ell(x) = W_\ell\, x + b_\ell`$ with partial derivatives
```math
\partial_{(W_m)_{\alpha,\beta}} A_\ell(x) =
\begin{cases}
x_\beta\, e_\alpha & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
\qquad\text{and}\qquad
\partial_{(b_m)_{\alpha}} A_\ell(x) =
\begin{cases}
e_\alpha & \text{if } m=\ell,\\
0 & \text{if } m\neq\ell,
\end{cases}
```
where $`e_\alpha\in\mathbb{R}^{d_\ell}`$ denotes the $`\alpha`$-th canonical unit vector.
The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ can then be computed using the chain rule due to the compositional structure of the neural network.
Computing the gradient through the chain rule is still very inefficient, and most probably infeasible, if done in a naive fashion.
The so-called _backpropagation_ is essentially a way to compute the partial derivatives layer by layer, storing only the necessary intermediate results to prevent repeated computations, which renders the overall computation manageable.
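
In PyTorch this bookkeeping is done by automatic differentiation: calling `.backward()` on a scalar quantity derived from the network output fills in the partial derivatives with respect to all weights and biases. A minimal sketch (toy network, for illustration only):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(6, 10), nn.Tanh(), nn.Linear(10, 3), nn.Tanh())
x = torch.randn(6)

out = model(x).sum()        # some scalar quantity derived from the network output
out.backward()              # backpropagation through all layers

W1 = model[0].weight        # the weight matrix W_1 of the first linear layer
print(W1.grad.shape)        # torch.Size([10, 6]): d(out)/dW_1, entry by entry
```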
Types of Neural Networks
------------------------
| Name | Graph |
| --- | --- |
| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250 /> |
| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250 /> |
| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250 /> |
| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250 /> |

Deep Learning Libraries
-----------------------
| Library | Language Support | Remark |
| --- | --- | --- |
| [PyTorch](https://pytorch.org/) | `Python`, `C++`, `Java` | developed by Facebook |
| [TensorFlow](https://www.tensorflow.org/) | `Python`, `JavaScript`, `Java`, `C`, `Go` | developed by Google |
| [Keras](https://keras.io/) | `Python` | runs on top of [TensorFlow](https://www.tensorflow.org/) |
| [scikit-learn](https://scikit-learn.org/) | `Python` | open source, built on `numpy`, `scipy` and `matplotlib` |
| [Deeplearning Toolbox](https://de.mathworks.com/products/deep-learning.html) | `Matlab` | not free to use |
| [deeplearning4j](https://deeplearning4j.konduit.ai/) | `Java` | Java hook into Python |