Commit 596b3c75 authored by Nando Farchmin

Add section on training of NN

parent 0139f7a1
Merge request !1: Update math to conform with gitlab markdown
@@ -48,7 +48,7 @@ Additionally, there exist the following conventions:
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the concatenation
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
@@ -93,7 +93,27 @@ Training
In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but individual steps may become more involved.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom (weights and biases) collected in $`\vartheta`$.
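For the example topology above with $`d=(6, 10, 10, 3)`$ this amounts to
```math
M = 10\cdot(6+1) + 10\cdot(10+1) + 3\cdot(10+1) = 70 + 110 + 33 = 213.
```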
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
If we want to use the neural network to approximate a function $`f`$ in some appropriate norm $`\Vert\cdot\Vert`$, we can formulate this as a least-squares problem.
The problem then reads:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \Vert f - \Psi_\theta \Vert^2.
```
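In practice $`f`$ is usually only accessible through samples $`(x_i, f(x_i))`$, so the norm above is replaced by an empirical mean over data points. Below is a minimal NumPy sketch of this discretized least-squares loss for the example topology $`d=(6, 10, 10, 3)`$ with $`\tanh`$ activations; the variable names and the stand-in target are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# example topology from above: d = (6, 10, 10, 3) with tanh activations
d = (6, 10, 10, 3)
# theta collects all weight matrices W_l (d_l x d_{l-1}) and bias vectors b_l (d_l)
theta = [(0.1 * rng.standard_normal((d[l], d[l - 1])), np.zeros(d[l]))
         for l in range(1, len(d))]

def psi(theta, x):
    """Evaluate the fully-connected network Psi_theta at inputs x of shape (n, d_0)."""
    a = x
    for W, b in theta:
        a = np.tanh(a @ W.T + b)
    return a

def ls_loss(theta, x, fx):
    """Empirical least-squares loss approximating ||f - Psi_theta||^2."""
    r = fx - psi(theta, x)
    return np.mean(np.sum(r ** 2, axis=1))

# hypothetical data: n samples of some target f: R^6 -> R^3 (purely illustrative)
x = rng.standard_normal((128, 6))
fx = np.sin(x[:, :3]) * np.cos(x[:, 3:])
print(ls_loss(theta, x, fx))
```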
The first-order optimality criterion then yields the (nonlinear) system of equations
```math
\langle f-\Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle = 0.
```
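This follows from differentiating the squared-norm objective, assuming the norm is induced by an inner product $`\langle\cdot,\cdot\rangle`$:
```math
\nabla_\vartheta \Vert f - \Psi_\vartheta \Vert^2
= \nabla_\vartheta \langle f-\Psi_\vartheta,\, f-\Psi_\vartheta \rangle
= -2\,\langle f-\Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle
\overset{!}{=} 0.
```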
- LS regression (and differentiation -> derivative w.r.t. $`\vartheta`$)
- loss function
- back-prop (computing the gradient w.r.t. $`\vartheta`$ by the chain rule); see the sketch after this list
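The gradient needed for the back-propagation step can be written out with the chain rule. Continuing the NumPy sketch above (reusing `np`, `theta`, `x`, `fx` and `ls_loss` from there), the following illustrative snippet computes this gradient manually and runs plain gradient descent on the empirical least-squares loss; it is a sketch, not a prescribed implementation.

```python
def ls_grad(theta, x, fx):
    """Gradient of ls_loss w.r.t. every (W_l, b_l) in theta, via the chain rule."""
    # forward pass, storing layer inputs and pre-activations
    activations, pre = [x], []
    a = x
    for W, b in theta:
        z = a @ W.T + b
        pre.append(z)
        a = np.tanh(z)
        activations.append(a)
    n = x.shape[0]
    # derivative of (1/n) * sum_i ||f(x_i) - Psi(x_i)||^2 w.r.t. the network output
    delta = 2.0 * (a - fx) / n
    grads = [None] * len(theta)
    for l in reversed(range(len(theta))):
        delta = delta * (1.0 - np.tanh(pre[l]) ** 2)               # through tanh
        grads[l] = (delta.T @ activations[l], delta.sum(axis=0))   # (dW_l, db_l)
        delta = delta @ theta[l][0]                                # to previous layer
    return grads

# plain gradient descent on the empirical least-squares loss
lr = 1e-2
for step in range(2000):
    grads = ls_grad(theta, x, fx)
    theta = [(W - lr * dW, b - lr * db)
             for (W, b), (dW, db) in zip(theta, grads)]
print(ls_loss(theta, x, fx))  # should have decreased compared to the initial loss
```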