Commit b6db7c05 authored by Nando Farchmin

Test markdown math display

parent dacaf710
1 merge request: !1 Update math to conform with gitlab markdown
@@ -162,7 +162,7 @@ The easiest and most well known approach is gradient descent (Euler's method), i
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
-In particular, we can use the law of large numbers and restrict the number of summands in $\mathcal{L}_N$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
+In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown using convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
**(see ?? for more information)**
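For reference, the resulting mini-batch update can be written compactly in the notation above; this is an illustrative sketch, and the restricted loss $`\mathcal{L}_{\Gamma_j}`$ (the summands of $`\mathcal{L}_N`$ indexed by a random subset $`\Gamma_j\subset\{1,\dots,N\}`$ of fixed size) is notation introduced here rather than taken from the text:

```math
\vartheta^{(j+1)} = \vartheta^{(j)} - \eta\,\nabla_\vartheta \mathcal{L}_{\Gamma_j}\bigl(\vartheta^{(j)}\bigr).
```

Plain gradient descent corresponds to the special case $`\Gamma_j = \{1,\dots,N\}`$ in every step.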
@@ -180,10 +180,10 @@ The best metaphor to remember the difference (I know of) is the following:
>
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
-What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i`\in\Gamma_j\subset\{1,\dots,N\}$ in each step.
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
Lucky for us, we know that $`\Psi_\vartheta`$ is a simple concatenation of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
-```
+```math
\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)} =
\begin{cases}
W^{(\ell)}_{\alpha,\beta} & \text{if }m=\ell,\\
......
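The reason this layer-wise derivative suffices is the chain rule: $`\Psi_\vartheta`$ is the composition of the affine maps $`A^{(\ell)}`$ and the activations $`\varphi_\ell`$, so $`\nabla_\vartheta\Psi_\vartheta`$ factors into the per-layer derivatives. As an illustrative sketch (the layer count $`L`$ is a symbol assumed here, not taken from the excerpt):

```math
\Psi_\vartheta = \varphi_L \circ A^{(L)} \circ \dots \circ \varphi_1 \circ A^{(1)},
\qquad
\partial_{W^{(m)}_{\alpha,\beta}} \Psi_\vartheta
= \partial\varphi_L \cdot \partial A^{(L)} \cdots \partial\varphi_m \cdot \partial_{W^{(m)}_{\alpha,\beta}} A^{(m)},
```

where each Jacobian is evaluated at the corresponding intermediate value $`x^{(\ell)}`$ of the forward pass; reusing these intermediate values instead of recomputing them is precisely what backpropagation does.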
@@ -43,7 +43,7 @@ The easiest and most well known approach is gradient descent (Euler's method), i
where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
-In particular, we can use the law of large numbers and restrict the number of summands in $\mathcal{L}_N$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
+In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
Convergence of SGD can be shown using convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
**(see ?? for more information)**
@@ -61,10 +61,10 @@ The best metaphor to remember the difference (I know of) is the following:
>
> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
-What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i`\in\Gamma_j\subset\{1,\dots,N\}$ in each step.
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(i)}}`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
Lucky for us, we know that $`\Psi_\vartheta`$ is a simple concatenation of activation functions $`\varphi_\ell`$ and affine maps $`A_\ell(x^{(\ell-1)}) = W_\ell x^{(\ell-1)} + b_\ell`$ with derivative
-```
+```math
\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)} =
\begin{cases}
W^{(\ell)}_{\alpha,\beta} & \text{if }m=\ell,\\
......