From 596b3c75a44e412eccccf15797f45ffc5090779d Mon Sep 17 00:00:00 2001
From: Nando Farchmin <nando.farchmin@gmail.com>
Date: Wed, 29 Jun 2022 18:09:30 +0200
Subject: [PATCH] Add section on training of NN

---
 doc/basics.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/doc/basics.md b/doc/basics.md
index 3aa740d..0b9e2f6 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -48,7 +48,7 @@ Additionally, there exist the following conventions:
 - (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and deep otherwise.
 
 **Example:**
-Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\mathrm{ReLU}`$.
+Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
 Then the neural network is given by the concatenation
 ```math
 \Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
@@ -93,7 +93,27 @@ Training
 In this section we focus on training a fully connected network for a regression task.
 The principles stay the same for any other objective, such as classification, but the details may differ at certain points.
 
-- approximation space and DoFs ($\vartheta$)
+Let $`M = \sum_{\ell=1}^{L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
+For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
+
+```math
+\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
+```
+
+If we want to use the neural network to approximate a function $`f`$ in some appropriate norm, we can use least squares.
+The problem then reads:
+
+```math
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\,min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert^2
+= \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{M}} \Vert f - \Psi_\theta \Vert^2.
+```
+
+The first-order optimality criterion then yields the (in general nonlinear) system of equations
+
+```math
+\langle f-\Psi_\vartheta,\, \nabla_\vartheta \Psi_\vartheta \rangle = 0.
+```
 - LS regression (and differentiation -> derivative w.r.t. $\vartheta$)
 - loss function
 - back-prop (computing gradient w.r.t. $\vartheta$ by chain-rule)
--
GitLab
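
As a note on the patch above: the least-squares problem it introduces, together with the back-prop bullet (gradient w.r.t. $`\vartheta`$ by chain rule), can be sketched numerically. The snippet below is an illustrative assumption, not part of the patch: it fits a small fully connected tanh network $`\Psi_\vartheta`$ to a target $`f = \sin`$ by gradient descent on the empirical squared loss, with the backward pass written out by hand via the chain rule. The widths, step size, and iteration count are arbitrary choices.

```python
# Sketch: least-squares training of a tiny fully connected tanh network.
# All hyperparameters (widths d = (1, 8, 1), lr, steps) are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out = 1, 8, 1
W1 = rng.normal(size=(d_hidden, d_in))   # theta = (W1, b1, W2, b2), M = 25
b1 = np.zeros((d_hidden, 1))
W2 = rng.normal(size=(d_out, d_hidden))
b2 = np.zeros((d_out, 1))

f = np.sin                               # function to approximate
x = np.linspace(-np.pi, np.pi, 128).reshape(1, -1)
y = f(x)
n = x.shape[1]

mse0 = float(np.mean((W2 @ np.tanh(W1 @ x + b1) + b2 - y) ** 2))

lr = 1e-2
for step in range(20000):
    # forward pass: Psi_theta(x)
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    out = W2 @ a1 + b2
    r = out - y                          # residual Psi_theta(x) - f(x)

    # backward pass: gradient of mean(r^2) w.r.t. theta via chain rule
    g_out = 2.0 * r / n
    gW2 = g_out @ a1.T
    gb2 = g_out.sum(axis=1, keepdims=True)
    g_a1 = W2.T @ g_out
    g_z1 = g_a1 * (1.0 - a1**2)          # tanh'(z) = 1 - tanh(z)^2
    gW1 = g_z1 @ x.T
    gb1 = g_z1.sum(axis=1, keepdims=True)

    # gradient descent step on all degrees of freedom
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse = float(np.mean((W2 @ np.tanh(W1 @ x + b1) + b2 - y) ** 2))
print(f"MSE before: {mse0:.4f}, after: {mse:.4f}")
```

Note that this solves the discretized problem (the norm is approximated by a mean over sample points) with plain full-batch gradient descent; the first-order optimality condition above is exactly what a vanishing gradient of the squared loss expresses.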