Commit 596b3c75 authored by Nando Farchmin

Add section on training of NN

parent 0139f7a1
Merge request !1: Update math to conform with gitlab markdown
@@ -48,7 +48,7 @@ Additionally, there exist the following conventions:
- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
**Example:**
Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
Then the neural network is given by the concatenation
```math
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
@@ -93,7 +93,27 @@ Training
In this section we focus on training a fully-connected network for a regression task.
The principles stay the same for any other objective, such as classification, but individual steps may become more involved.
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom (weights and biases) collected in $`\vartheta`$.
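For the example topology above with $`d=(6, 10, 10, 3)`$ this amounts to
```math
M = 10\cdot(6+1) + 10\cdot(10+1) + 3\cdot(10+1) = 70 + 110 + 33 = 213.
```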
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
```math
\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
```
If we want to use the neural network to approximate a function $`f`$ in some appropriate norm $`\Vert\cdot\Vert`$, we can formulate this as a least-squares problem.
The problem then reads:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert^2
= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \Vert f - \Psi_\theta \Vert^2.
```
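In practice $`f`$ is usually only accessible through samples $`(x_i, f(x_i))`$, so the norm above is replaced by an empirical mean over data points. Below is a minimal NumPy sketch of this discretized least-squares loss for the example topology $`d=(6, 10, 10, 3)`$ with $`\tanh`$ activations; the variable names and the stand-in target are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# example topology from above: d = (6, 10, 10, 3) with tanh activations
d = (6, 10, 10, 3)
# theta collects all weight matrices W_l (d_l x d_{l-1}) and bias vectors b_l (d_l)
theta = [(0.1 * rng.standard_normal((d[l], d[l - 1])), np.zeros(d[l]))
         for l in range(1, len(d))]

def psi(theta, x):
    """Evaluate the fully-connected network Psi_theta at inputs x of shape (n, d_0)."""
    a = x
    for W, b in theta:
        a = np.tanh(a @ W.T + b)
    return a

def ls_loss(theta, x, fx):
    """Empirical least-squares loss approximating ||f - Psi_theta||^2."""
    r = fx - psi(theta, x)
    return np.mean(np.sum(r ** 2, axis=1))

# hypothetical data: n samples of some target f: R^6 -> R^3 (purely illustrative)
x = rng.standard_normal((128, 6))
fx = np.sin(x[:, :3]) * np.cos(x[:, 3:])
print(ls_loss(theta, x, fx))
```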
The first-order optimality criterion then yields the (nonlinear) system of equations
```math
\langle f-\Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle = 0.
```
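This follows from differentiating the squared-norm objective, assuming the norm is induced by an inner product $`\langle\cdot,\cdot\rangle`$:
```math
\nabla_\vartheta \Vert f - \Psi_\vartheta \Vert^2
= \nabla_\vartheta \langle f-\Psi_\vartheta,\, f-\Psi_\vartheta \rangle
= -2\,\langle f-\Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle
\overset{!}{=} 0.
```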
- LS regression (and differentiation -> derivative w.r.t. $`\vartheta`$)
- loss function
- back-prop (computing the gradient w.r.t. $`\vartheta`$ by the chain rule); see the sketch after this list
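The gradient needed for the back-propagation step can be written out with the chain rule. Continuing the NumPy sketch above (reusing `np`, `theta`, `x`, `fx` and `ls_loss` from there), the following illustrative snippet computes this gradient manually and runs plain gradient descent on the empirical least-squares loss; it is a sketch, not a prescribed implementation.

```python
def ls_grad(theta, x, fx):
    """Gradient of ls_loss w.r.t. every (W_l, b_l) in theta, via the chain rule."""
    # forward pass, storing layer inputs and pre-activations
    activations, pre = [x], []
    a = x
    for W, b in theta:
        z = a @ W.T + b
        pre.append(z)
        a = np.tanh(z)
        activations.append(a)
    n = x.shape[0]
    # derivative of (1/n) * sum_i ||f(x_i) - Psi(x_i)||^2 w.r.t. the network output
    delta = 2.0 * (a - fx) / n
    grads = [None] * len(theta)
    for l in reversed(range(len(theta))):
        delta = delta * (1.0 - np.tanh(pre[l]) ** 2)               # through tanh
        grads[l] = (delta.T @ activations[l], delta.sum(axis=0))   # (dW_l, db_l)
        delta = delta @ theta[l][0]                                # to previous layer
    return grads

# plain gradient descent on the empirical least-squares loss
lr = 1e-2
for step in range(2000):
    grads = ls_grad(theta, x, fx)
    theta = [(W - lr * dW, b - lr * db)
             for (W, b), (dW, db) in zip(theta, grads)]
print(ls_loss(theta, x, fx))  # should have decreased compared to the initial loss
```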