Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
To be more precise, we will rely on the following definition.
> **Definition** (Neural Network):
> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
In this section we focus on training a fully-connected network for a regression task.
Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
The principles stay the same for any other objective, such as classification, although some aspects may be more complicated.
As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.
Optimization (Training)
-----------------------
Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
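
As a quick sanity check of the formula for $`M`$, the following minimal sketch counts the weights and biases of a fully connected network. The function name `count_parameters` and the example widths are illustrative assumptions, not part of the text.

```python
def count_parameters(d):
    """Return M = sum_{l=1..L} d_l * (d_{l-1} + 1) for layer widths d = (d_0, ..., d_L)."""
    # Layer l has a d_l x d_{l-1} weight matrix and a bias vector of length d_l.
    return sum(d[l] * (d[l - 1] + 1) for l in range(1, len(d)))


# Hypothetical topology: 3 inputs, two hidden layers of width 5, 1 output.
M = count_parameters((3, 5, 5, 1))
print(M)  # 5*4 + 5*6 + 1*6 = 56
```
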
Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
The simplest and best-known approach is gradient descent (Euler's method), i.e.
If we want to use the neural network to approximate a function $`f`$, the easiest approach is to conduct a least-squares regression in an appropriate norm.
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
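
The iteration can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the helper `gradient_descent`, the toy quadratic loss and all numerical values are hypothetical and stand in for the actual network loss.

```python
import numpy as np


def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Repeat theta <- theta - eta * grad(theta), starting from theta0."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta


# Toy loss (assumption): L(theta) = 0.5 * ||theta - target||^2,
# whose gradient is simply theta - target.
target = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
theta0 = rng.normal(size=2)  # random initialization
theta = gradient_descent(lambda th: th - target, theta0, eta=0.1, n_steps=200)
```

For this convex toy problem the iterates contract towards `target` by a factor of `1 - eta` per step; for a neural network the loss is non-convex and no such guarantee holds in general.
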
To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our least-squares problem:
```math
\text{Find}\qquad \Psi_\vartheta
= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
```
where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).

The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can use the law of large numbers and restrict the summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown by convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
In principle, SGD trades the gradient computations for a large number of terms against the convergence rate of the algorithm: each step is cheaper, but more steps are needed.
Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
The best metaphor I know of for remembering the difference is the following:
> **Definition** (training data):
> **Metaphor (SGD):**
> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
> Assume you and a friend of yours have had a party on the top of a mountain.
> As the party has come to an end, you both want to get back home somewhere in the valley.
> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent, planning each step carefully as the terrain is very rough.
> Your friend, however, drank a little too much and is not capable of planning anymore.
> So they stagger down the mountain in a more or less random direction.
> Each step they take requires little thought, but it takes them a long time overall to get back home (or at least close to it).
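
To make the trade-off concrete, here is a minimal sketch of mini-batch SGD for a least-squares problem. It deliberately uses a linear model instead of a neural network so that the gradient is available in closed form; the names `grad_mse` and `sgd`, the synthetic data, the batch size and the constant learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free training data (assumption): f(x) = <w_true, x>.
N, K = 1000, 3
X = rng.normal(size=(N, K))
w_true = np.array([1.0, -2.0, 0.5])
f = X @ w_true


def grad_mse(w, X_batch, f_batch):
    """Gradient of the mean-square error over the given batch."""
    residual = X_batch @ w - f_batch
    return 2.0 * X_batch.T @ residual / len(f_batch)


def sgd(w0, eta=0.05, batch_size=32, n_steps=2000):
    """Each step uses only a random subset of the N summands."""
    w = w0.copy()
    for _ in range(n_steps):
        idx = rng.choice(N, size=batch_size, replace=False)
        w -= eta * grad_mse(w, X[idx], f[idx])
    return w


w_sgd = sgd(np.zeros(K))
```

Each step touches only 32 of the 1000 samples, so it is much cheaper than a full gradient step, but its direction is noisy. A constant learning rate suffices in this sketch only because the labels are noise-free; with noisy data one would typically let it decay.
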
What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
Luckily, we know that $`\Psi_\vartheta`$ is a simple composition of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ can then be computed using the chain rule due to the compositional structure of the neural network.
> A _loss function_ is any function that measures how well a neural network approximates the target values.
Computing the gradient through the chain rule is still very inefficient and most probably infeasible if done in a naive fashion.
So-called _backpropagation_ is essentially a way to compute the partial derivatives layer-wise, storing only the necessary information to prevent repeated computations, which renders the computation manageable.
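
The layer-wise idea can be sketched for a small fully connected network with scalar output. This is an illustrative sketch, not a reference implementation: it assumes `tanh` activations on the hidden layers and an identity output layer, and the names `forward` and `backward` as well as the layer widths are hypothetical.

```python
import numpy as np


def forward(params, x):
    """Evaluate the network and cache layer inputs and pre-activations."""
    activations, pre_activations = [x], []
    for ell, (W, b) in enumerate(params):
        z = W @ activations[-1] + b  # affine map of layer ell
        pre_activations.append(z)
        is_last = ell == len(params) - 1
        activations.append(z if is_last else np.tanh(z))
    return activations[-1], activations, pre_activations


def backward(params, activations, pre_activations):
    """Gradients of the scalar output w.r.t. every weight matrix and bias."""
    grads = [None] * len(params)
    # Derivative of the output w.r.t. the last pre-activation (identity layer).
    delta = np.ones_like(pre_activations[-1])
    for ell in reversed(range(len(params))):
        W, _ = params[ell]
        # Reuse the cached layer input instead of recomputing it.
        grads[ell] = (np.outer(delta, activations[ell]), delta.copy())
        if ell > 0:
            # Chain rule: pass delta through W and the tanh of the layer below.
            delta = (W.T @ delta) * (1.0 - np.tanh(pre_activations[ell - 1]) ** 2)
    return grads


rng = np.random.default_rng(0)
d = (3, 4, 1)  # hypothetical widths d_0, d_1, d_2
params = [(rng.normal(size=(d[l], d[l - 1])), rng.normal(size=d[l]))
          for l in range(1, len(d))]
x = rng.normal(size=d[0])
out, acts, pres = forward(params, x)
grads = backward(params, acts, pres)

# Finite-difference check for a single entry of the first weight matrix.
eps = 1e-6
W_shift = params[0][0].copy()
W_shift[0, 0] += eps
out_shift, _, _ = forward([(W_shift, params[0][1])] + params[1:], x)
fd = (out_shift[0] - out[0]) / eps
```

One forward pass caches everything the backward pass needs, so the gradient with respect to all parameters costs roughly as much as a second network evaluation. The finite-difference quotient `fd` should agree with `grads[0][0][0, 0]` up to discretization error.
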
Typical loss functions for regression and classification tasks are
- mean-square error (MSE, standard $`L^2`$-error)
- weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
- cross-entropy (difference between distributions)

Types of Neural Networks
------------------------