From 596b3c75a44e412eccccf15797f45ffc5090779d Mon Sep 17 00:00:00 2001
From: Nando Farchmin <nando.farchmin@gmail.com>
Date: Wed, 29 Jun 2022 18:09:30 +0200
Subject: [PATCH] Add section on training of NN

---
 doc/basics.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/doc/basics.md b/doc/basics.md
index 3aa740d..0b9e2f6 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -48,7 +48,7 @@ Additionally, there exist the following conventions:
 - (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
 
 **Example:**
-Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\mathrm{ReLU}`$.
+Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
 Then the neural network is given by the concatenation
 ```math
 \Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
@@ -93,7 +93,27 @@ Training
 In this section we focus on training a fully-connected network for a regression task.
 The principles stay the same for any other objective, such as classification, but individual steps may become more involved.
 
-- approximation space and DoFs ($\vartheta$)
+Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
+For activation functions $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully-connected) network topology by
+
+```math
+\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
+```
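+
+For the example above ($`L = 3`$, $`d = (6, 10, 10, 3)`$) this gives
+
+```math
+M = 10 \cdot (6+1) + 10 \cdot (10+1) + 3 \cdot (10+1) = 70 + 110 + 33 = 213.
+```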
+
+If we want to use the neural network to approximate a function $`f`$ in some appropriate norm, we can use a least-squares approach.
+The problem then reads:
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert^2
+= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \Vert f - \Psi_\theta \Vert^2.
+```
+Formally, the first-order optimality condition of the least-squares problem yields the (in general nonlinear) system of equations
+
+```math
+\langle f - \Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle = 0.
+```
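+
+Since $`\Psi_\vartheta`$ depends nonlinearly on $`\vartheta`$, this system cannot be solved directly in general, which is why one resorts to iterative, gradient-based methods as above.
+The identity itself can be checked numerically with automatic differentiation; the following sketch (again assuming PyTorch, a sample-based inner product, and a single linear layer to keep the check short) compares the autograd gradient of the least-squares loss with the manually assembled term $`-2\langle f - \Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle`$:
+
+```python
+import torch
+
+# A single linear layer keeps the check short; any model works the same way.
+model = torch.nn.Linear(6, 3)
+x = torch.rand(100, 6)
+y = torch.rand(100, 3)                # stand-in for f evaluated at the samples
+
+pred = model(x)
+residual = y - pred                   # f - Psi_theta at the sample points
+
+# Gradient of the (summed) least-squares loss via autograd.
+loss = (residual ** 2).sum()
+grad_loss = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
+
+# Manually assembled -2 <f - Psi, grad Psi> via a vector-Jacobian product.
+grad_manual = torch.autograd.grad(
+    pred, model.parameters(), grad_outputs=(-2 * residual).detach()
+)
+
+for g_l, g_m in zip(grad_loss, grad_manual):
+    assert torch.allclose(g_l, g_m)
+```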
+
 - LS regression (and differentiation -> derivative w.r.t. $\vartheta$)
 - loss function
 - back-prop (computing gradient w.r.t. $\vartheta$ by chain-rule)
-- 
GitLab