Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (millions of degrees of freedom), computation of the gradient w.r.t. all parameters for each training data point is infeasible.

Optimization (Training)
-----------------------
Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
The easiest and most well-known approach is gradient descent (Euler's method), i.e. $`\vartheta^{(j+1)} = \vartheta^{(j)} - \eta\operatorname{\nabla}_\vartheta\mathcal{L}_N(\vartheta^{(j)})`$ for $`j=0,1,2,\dots`$, where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
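To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a toy least-squares loss; the quadratic loss, the data, and all variable names are illustrative assumptions rather than part of the setting above.

```python
import numpy as np

# Toy loss (assumption for illustration): L_N(theta) = 1/N * sum_i (x_i . theta - y_i)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # N = 100 data points, M = 5 parameters
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=100)

def grad_L(theta):
    """Full gradient of the mean squared error with respect to theta."""
    return 2.0 / len(y) * X.T @ (X @ theta - y)

eta = 0.05                                         # learning rate
theta = rng.normal(size=5)                         # random initialization theta^(0)
for j in range(500):                               # gradient descent iterations
    theta = theta - eta * grad_L(theta)
```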
The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
In particular, we can appeal to the law of large numbers and, in each iteration step, restrict the sum in $`\mathcal{L}_N`$ to a random subset of summands of fixed size, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown using convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
**(see ?? for more information)**
Here, however, I want to focus more on the difference between "normal" GD and SGD on an intuitive level.
In principle, SGD trades the cost of computing gradients of a large number of terms against the convergence rate of the algorithm.
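As a comparison, a stochastic gradient descent sketch in the same toy least-squares setting (again purely illustrative; the batch size, the decay schedule for the learning rate, and all names are assumptions) replaces the full gradient by the gradient over a random subset $`\Gamma_j`$ of fixed size in each step:

```python
import numpy as np

# Same toy least-squares setup as in the gradient descent sketch above (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=100)

def grad_L_batch(theta, idx):
    """Gradient of the loss restricted to the summands with indices in `idx`."""
    Xb, yb = X[idx], y[idx]
    return 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)

theta = rng.normal(size=5)                         # random initialization theta^(0)
batch_size = 10                                    # fixed size of the random subset Gamma_j
for j in range(2000):                              # many cheap iterations instead of few expensive ones
    eta_j = 0.05 / (1.0 + 0.01 * j)                # decaying learning rate
    Gamma_j = rng.choice(len(y), size=batch_size, replace=False)
    theta = theta - eta_j * grad_L_batch(theta, Gamma_j)
```

Each iteration touches only `batch_size` of the $`N`$ summands, so a single step is much cheaper than a full gradient step, but more iterations are typically needed to reach a comparable accuracy.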
The best metaphor I know of to remember the difference is the following:
> **Metaphor (SGD):**
> Assume you and a friend of yours have had a party on the top of a mountain.
> As the party has come to an end, you both want to get back home somewhere in the valley.
> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent and choosing each step carefully, as the terrain is very rough.
> Your friend, however, drank a little too much and is not capable of planning anymore.
> So they stagger down the mountain in a more or less random direction.
> Each step they take requires little thought, but it takes them longer overall to get back home (or at least close to it).

What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}`$ at the data points with indices $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
Lucky for us, we know that $`\Psi_\vartheta`$ is a simple composition of activation functions $`\varphi_\ell`$ and affine maps $`A^{(\ell)}(x^{(\ell-1)}) = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}`$, whose derivatives with respect to the parameters read (with $`e_\alpha`$ denoting the $`\alpha`$-th canonical unit vector)

```math
\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)} =
\begin{cases}
x^{(\ell-1)}_{\beta}\, e_{\alpha} & \text{if }m=\ell,\\
0 & \text{if }m\neq\ell,
\end{cases}
\qquad\text{and}\qquad
\partial_{b^{(m)}_{\alpha}} A^{(\ell)} =
\begin{cases}
e_{\alpha} & \text{if }m=\ell,\\
0 & \text{if }m\neq\ell.
\end{cases}
```
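As a quick sanity check of this formula, the following NumPy snippet compares a finite-difference approximation of the derivative of an affine map with respect to a single weight entry against $`x_\beta\, e_\alpha`$; the dimensions and variable names are arbitrary choices for the illustration.

```python
import numpy as np

# Finite-difference check for the derivative of A(x) = W x + b w.r.t. W[alpha, beta].
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

alpha, beta, eps = 2, 1, 1e-6
W_pert = W.copy()
W_pert[alpha, beta] += eps

fd = ((W_pert @ x + b) - (W @ x + b)) / eps        # finite-difference approximation

exact = np.zeros(4)
exact[alpha] = x[beta]                             # x_beta * e_alpha, as in the formula above

print(np.allclose(fd, exact, atol=1e-5))           # True
```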
The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}`$ can then be computed using the chain rule, thanks to the compositional structure of the neural network.
Computing the gradient through the chain rule is still very inefficient and most probably infeasible if done in a naive fashion.
The so-called _backpropagation_ is essentially a way to compute the partial derivatives layer-wise, storing only the necessary information to prevent repetitive computations, which renders the computation manageable.
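To illustrate this storing-and-reusing idea, here is a minimal backpropagation sketch for a small fully connected network with sigmoid activations; the architecture, the choice of activation, and the hypothetical `grad_output` (standing in for the derivative of an outer loss with respect to the network output) are assumptions made for the illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Forward pass; store all intermediate activations for the backward pass."""
    activations = [x]
    for W, b in zip(Ws, bs):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def backward(activations, Ws, grad_output):
    """Backward pass: apply the chain rule layer by layer, reusing the stored activations."""
    grads_W, grads_b = [], []
    delta = grad_output                                         # gradient w.r.t. the network output
    for ell in reversed(range(len(Ws))):
        a_out = activations[ell + 1]
        delta = delta * a_out * (1.0 - a_out)                   # through the sigmoid: sigma' = a * (1 - a)
        grads_W.insert(0, np.outer(delta, activations[ell]))    # gradient w.r.t. W^(ell)
        grads_b.insert(0, delta.copy())                         # gradient w.r.t. b^(ell)
        delta = Ws[ell].T @ delta                               # propagate to the previous layer
    return grads_W, grads_b

# Tiny 3-2-1 network with random parameters (shapes are illustrative).
rng = np.random.default_rng(2)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
bs = [rng.normal(size=2), rng.normal(size=1)]
x = rng.normal(size=3)

activations = forward(x, Ws, bs)
grads_W, grads_b = backward(activations, Ws, grad_output=np.ones(1))
```

The forward pass stores each intermediate activation exactly once, and the backward pass reuses them layer by layer, which is precisely what avoids the repeated evaluations a naive application of the chain rule would incur.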