From dacaf7100b422c8d7f8effd75704d649bec19cb6 Mon Sep 17 00:00:00 2001
From: Nando Farchmin <nando.farchmin@gmail.com>
Date: Mon, 4 Jul 2022 11:33:14 +0200
Subject: [PATCH] Test markdown math display

---
 doc/tmp.md | 193 ++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 115 deletions(-)

diff --git a/doc/tmp.md b/doc/tmp.md
index 3152672..74aac62 100644
--- a/doc/tmp.md
+++ b/doc/tmp.md
@@ -1,137 +1,100 @@
-Neural Networks 101
--------------------
-
-<div style="text-align: center;">
-  <img src="machine_learning.png" title="ml" alt="ml" height=500 />
-</div>
-
-Table of Contents
------------------
-[[_TOC_]]
-
-Nomenclature and Definitions
-----------------------------
-
-First, we need to clarify a few terms: **artificial intelligence**, **machine learning** and **neural network**.
-Everybody categorizes them differently, but we look at it this way:
-
-<br/>
-<div style="text-align: center;">
-  <img src="venn.png" title="venn" alt="venn" height=300 />
-</div>
-<br/>
-
-Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks.
-To be more precise, we will rely on the following definition.
-
-> **Definition** (Neural Network):
-> For any $`L\in\mathbb{N}`$ and $`d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}`$ a non-linear map $`\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}`$ of the form
-> ```math
-> \Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
-> ```
-> is called a _fully connected feed-forward neural network_.
-
-Typically, we use the following nomenclature:
-- $`L`$ is called the _depth_ of the network with layers $`\ell=0,\dots,L`$.
-- $`d`$ is called the _width_ of the network, where $`d_\ell`$ is the width of layer $`\ell`$.
-- $`W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}`$ are the _weights_ of layer $`\ell`$.
-- $`b_\ell\in\mathbb{R}^{d_\ell}`$ are the _biases_ of layer $`\ell`$.
-- $`\vartheta=(W_1,b_1,\dots,W_L,b_L)`$ are the _free parameters_ of the neural network.
-  Sometimes we write $`\Psi_\vartheta`$ or $`\Psi(x; \vartheta)`$ to indicate the dependence of $`\Psi`$ on the parameters $`\vartheta`$.
-- $`\varphi_\ell`$ is the _activation function_ of layer $`\ell`$.
-  Note that $`\varphi_\ell`$ has to be non-linear and monotone increasing.
-
-Additionally, there exist the following conventions:
-- $`x^{(0)}:=x`$ is called the _input (layer)_ of the neural network $`\Psi`$.
-- $`x^{(L)}:=\Psi(x)`$ is called the _output (layer)_ of the neural network $`\Psi`$.
-- Intermediate results $`x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)`$ are called _hidden layers_.
-- (debatable) A neural network is called _shallow_ if it has only one hidden layer ($`L=2`$) and _deep_ otherwise.
-
-**Example:**
-Let $`L=3`$, $`d=(6, 10, 10, 3)`$ and $`\varphi_1=\varphi_2=\varphi_3=\tanh`$.
-Then the neural network is given by the composition
+The empirical regression problem then reads
+
 ```math
-\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
-\qquad
-\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
+\text{Find}\qquad \Psi_\vartheta
+= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \frac{1}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\theta(x^{(i)})\bigr)^2
+=: \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \mathcal{L}_N(\Psi_\theta)
 ```
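+
+To make this concrete, here is a minimal sketch of evaluating the empirical loss $`\mathcal{L}_N`$ for one candidate network (assuming PyTorch, which is listed under Further Reading below; the architecture, the distribution $`\pi`$ and the target function $`f`$ are illustrative stand-ins, not prescribed by the text):
+
+```python
+import torch
+
+# A small fully connected feed-forward network Psi_theta with d = (6, 10, 10, 1),
+# tanh activations and a linear output layer.
+model = torch.nn.Sequential(
+    torch.nn.Linear(6, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 1),
+)
+
+# Labeled training data (x^(i), f^(i)), here with x ~ pi = uniform on [0, 1]^6
+# and a stand-in target function f.
+N = 1000
+x = torch.rand(N, 6)                        # samples x^(1), ..., x^(N)
+f = torch.sin(x.sum(dim=1, keepdim=True))   # f^(i) := f(x^(i))
+
+# Empirical loss L_N(Psi_theta) = 1/N * sum_i (f^(i) - Psi_theta(x^(i)))^2
+loss = torch.mean((f - model(x)) ** 2)
+print(float(loss))
+```
+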
-A typical graphical representation of the neural network looks like this:
-
-<br/>
-<div style="text-align: center;">
-  <img src="nn_fc_example.png" title="ml" alt="ml" width=400 />
-</div>
-<br/>
-
-The entries of $`W_\ell`$, $`\ell=1,2,3`$, are depicted as lines connecting nodes in one layer to the subsequent one.
-The color indicates the sign of the entries (blue = "+", magenta = "-") and the opacity represents the absolute value (magnitude) of the values.
-Note that neither the employed activation functions $`\varphi_\ell`$ nor the biases $`b_\ell`$ are represented in this graph.
-
-Activation Functions
---------------------
-
-Activation functions can, in principle, be arbitrary non-linear maps.
-The important part is the non-linearity, as otherwise the neural network would simply be an affine function.
-
-Typical examples of continuous activation functions applied in the context of function approximation or regression are:
+> **Definition** (loss function):
+> A _loss function_ is any function that measures how well a neural network approximates the target values.
-
-| ReLU | Leaky ReLU | Tanh |
-| --- | --- | --- |
-| <img src="relu.png" title="ReLU" alt="ReLU" height=200 /> | <img src="leaky_relu.png" title="leaky ReLU" alt="leaky ReLU" height=200 /> | <img src="tanh.png" title="tanh" alt="tanh" height=200 /> |
-
-For classification tasks, such as image recognition, so-called convolutional neural networks (CNNs) are employed.
-Typically, these networks use different types of activation functions, such as:
+
+Typical loss functions for regression and classification tasks are
+ - mean-square error (MSE, standard $`L^2`$-error)
+ - weighted $`L^p`$- or $`H^k`$-norms (solutions of PDEs)
+ - cross-entropy (difference between distributions)
+ - Kullback-Leibler divergence, Hellinger distance, Wasserstein metrics
+ - hinge loss (SVM)
-
-**Examples of discrete activation functions:**
-
-| Argmax | Softmax | Max-Pooling |
-| --- | --- | --- |
-| <img src="argmax.png" title="argmax" alt="argmax" height=200 /> | <img src="softmax.png" title="softmax" alt="softmax" height=200 /> | <img src="maxpool.png" title="maxpool" alt="maxpool" height=200 /> |
+
+To find a minimizer of our loss function $`\mathcal{L}_N`$, we want to use the first-order optimality criterion
-
-Minimization Problem
---------------------
+
+```math
+0
+= \operatorname{\nabla}_\vartheta \mathcal{L}_N(\Psi_\vartheta)
+= -\frac{2}{N} \sum_{i=1}^N \bigl(f^{(i)} - \Psi_\vartheta(x^{(i)})\bigr) \operatorname{\nabla}_\vartheta \Psi_\vartheta(x^{(i)}).
+```
-
-In this section we focus on training a fully-connected network for a regression task.
-The principles stay the same for any other objective, such as classification, but may be more complicated in some aspects.
+
+Solving this equation requires the evaluation of the Jacobian (gradient) of the neural network $`\Psi_\vartheta`$ with respect to the network parameters $`\vartheta`$.
+As $`\vartheta\in\mathbb{R}^M`$ with $`M\gg1`$ (up to millions of degrees of freedom), computing the gradient w.r.t. all parameters explicitly for each training data point is infeasible.
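+
+For a sense of scale, the following sketch counts the free parameters $`M`$ of a small network and evaluates $`\operatorname{\nabla}_\vartheta \mathcal{L}_N`$ once by automatic differentiation (PyTorch assumed; it rebuilds the same toy `model` and data as the sketch above):
+
+```python
+import torch
+
+# Same toy setup as in the previous sketch.
+model = torch.nn.Sequential(
+    torch.nn.Linear(6, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 1),
+)
+x, f = torch.rand(1000, 6), torch.rand(1000, 1)
+
+# Number of free parameters: here 10*(6+1) + 10*(10+1) + 1*(10+1) = 191,
+# while networks used in practice easily reach millions.
+M = sum(p.numel() for p in model.parameters())
+print(M)
+
+# One evaluation of grad_theta L_N via reverse-mode automatic differentiation:
+loss = torch.mean((f - model(x)) ** 2)
+loss.backward()   # fills p.grad for every weight matrix and bias vector p
+```
+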
+
+Optimization (Training)
+-----------------------
-
-Let $`M = \sum_{\ell=1,\dots,L} d_\ell(d_{\ell-1}+1)`$ denote the number of degrees of freedom incorporated in $`\vartheta`$.
-For $`\varphi = (\varphi_1, \dots, \varphi_L)`$ we define the model class of a certain (fully connected) network topology by
+Instead of solving the minimization problem explicitly, we can use iterative schemes to approximate the solution.
+The easiest and best-known approach is gradient descent (Euler's method), i.e.
 ```math
-\mathcal{M}_{d, \varphi} = \{ \Psi_\vartheta \,\vert\, \vartheta \in \mathbb{R}^M \text{ and activation functions } \varphi\}.
+\vartheta^{(j+1)} = \vartheta^{(j)} - \eta \operatorname{\nabla}_{\vartheta}\mathcal{L}_N(\Psi_{\vartheta^{(j)}}),
+\qquad j=0, 1, 2, \dots
 ```
-If we want to use the neural network to approximate a function $`f`$, the easiest approach would be to conduct a Least-Squares regression in an appropriate norm.
-To make things even easier for the explanation, we assume $`f\colon \mathbb{R}^K \to \mathbb{R}`$, i.e., $`\operatorname{dim}(x^{(0)})=K`$ and $`\operatorname{dim}(x^{(L)})=1`$.
-Assuming the function $`f`$ has a second moment, we can use a standard $`L^2`$-norm for our Least-Squares problem:
+where the step size $`\eta>0`$ is typically called the _learning rate_ and $`\vartheta^{(0)}`$ is a random initialization of the weights and biases.
-
-```math
-\text{Find}\qquad \Psi_\vartheta
-= \operatorname*{arg\, min}_{\Psi_\theta\in\mathcal{M}_{d,\varphi}} \Vert f - \Psi_\theta \Vert_{L^2(\pi)}^2
-= \operatorname*{arg\, min}_{\theta\in\mathbb{R}^{M}} \int_{\mathbb{R}^K} \bigl(f(x) - \Psi_\theta(x)\bigr)^2 \ \mathrm{d}\pi(x),
-```
+
+The key reason why gradient descent is more promising than the first-order optimality criterion is its iterative character.
+In particular, we can use the law of large numbers and restrict the number of summands in $`\mathcal{L}_N`$ to a random subset of fixed size in each iteration step, which is called _stochastic gradient descent_ (SGD).
+Convergence of SGD can be shown via convex minimization and stochastic approximation theory and only requires that the learning rate $`\eta`$ decays at an appropriate rate.
+**(see ?? for more information)**
-
-where we assume $`x\sim\pi`$ for some appropriate probability distribution $`\pi`$ (e.g. uniform or normal).
-As computing the integrals above is not feasible for $`K\gg1`$, we consider an empirical version.
-Let $`x^{(1)},\dots,x^{(N)}\sim\pi`$ be independent (random) samples and assume we have access to $`f^{(i)}:=f(x^{(i)})`$, $`i=1,\dots,N`$.
+
+Here, however, I want to focus more on the difference between "normal" GD and SGD (on an intuitive level).
+In principle, SGD trades the gradient computation over a large number of terms against the convergence rate of the algorithm.
+The best metaphor I know of to remember the difference is the following:
-
-> **Definition** (training data):
-> Tuples of the form $`(x^{(i)}, f^{(i)})_{i=1}^N`$ are called _labeled training data_.
+
+> **Metaphor (SGD):**
+> Assume you and a friend of yours have had a party on the top of a mountain.
+> As the party has come to an end, you both want to get back home somewhere in the valley.
+> You, scientist that you are, plan the most direct way down the mountain, following the steepest descent and planning each step carefully, as the terrain is very rough.
+> Your friend, however, drank a little too much and is not capable of planning anymore.
+> So they stagger down the mountain in a more or less random direction.
+> Each step they take requires little thought, but it takes them a long time overall to get back home (or at least close to it).
+>
+> <img src="sgd.png" title="sgd" alt="sgd" height=400 />
+
+What remains is the computation of $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta^{(j)}}(x^{(i)})`$ for $`i\in\Gamma_j\subset\{1,\dots,N\}`$ in each step.
+Luckily for us, we know that $`\Psi_\vartheta`$ is a simple composition of activation functions $`\varphi_\ell`$ and affine maps $`A^{(\ell)}(x^{(\ell-1)}) = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}`$ with derivatives
+
+```math
+\partial_{W^{(m)}_{\alpha,\beta}} A^{(\ell)}(x) =
+\begin{cases}
+x_{\beta}\, e_{\alpha} & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell,
+\end{cases}
+\qquad\text{and}\qquad
+\partial_{b^{(m)}_{\alpha}} A^{(\ell)}(x) =
+\begin{cases}
+e_{\alpha} & \text{if }m=\ell,\\
+0 & \text{if }m\neq\ell,
+\end{cases}
+```
+
+where $`e_\alpha`$ denotes the $`\alpha`$-th canonical unit vector.
+The gradient $`\operatorname{\nabla}_\vartheta\Psi_{\vartheta}`$ can then be computed using the chain rule due to the compositional structure of the neural network.
+Computing the gradient through the chain rule is still very inefficient, and most probably infeasible, if done in a naive fashion.
+The so-called _backpropagation_ algorithm is essentially a way to compute the partial derivatives layer by layer, storing only the necessary information to prevent repetitive computations, which renders the computation manageable.
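+
+To tie the pieces together, here is a minimal sketch of an SGD training loop (PyTorch assumed; the toy data, batch size and learning rate are illustrative choices).
+Each call to `loss.backward()` performs exactly the layer-wise chain-rule evaluation (backpropagation) described above, and `opt.step()` applies the gradient descent update with learning rate $`\eta`$ restricted to the current batch $`\Gamma_j`$:
+
+```python
+import torch
+
+# Toy model and labeled training data, as in the sketches above.
+model = torch.nn.Sequential(
+    torch.nn.Linear(6, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 10), torch.nn.Tanh(),
+    torch.nn.Linear(10, 1),
+)
+N = 1000
+x = torch.rand(N, 6)
+f = torch.sin(x.sum(dim=1, keepdim=True))
+
+eta, batch_size = 1e-2, 32                         # learning rate and |Gamma_j|
+opt = torch.optim.SGD(model.parameters(), lr=eta)
+
+for j in range(500):                               # iteration steps j = 0, 1, 2, ...
+    Gamma_j = torch.randint(0, N, (batch_size,))   # random batch of indices in {1, ..., N}
+    loss = torch.mean((f[Gamma_j] - model(x[Gamma_j])) ** 2)
+    opt.zero_grad()
+    loss.backward()                                # backpropagation (chain rule, layer by layer)
+    opt.step()                                     # theta^(j+1) = theta^(j) - eta * gradient
+```
+
+(The batch here is drawn with replacement for brevity; shuffling the data and sweeping over disjoint batches per epoch is the more common implementation.)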
+
+Types of Neural Networks
+------------------------
+
+| Name | Graph |
+| --- | --- |
+| Fully Connected Neural Network | <img src="nn_fc.png" title="nn_fc" alt="nn_fc" height=250 /> |
+| Convolutional Neural Network | <img src="nn_conv.png" title="nn_conv" alt="nn_conv" height=250 /> |
+| U-Net | <img src="u_net.png" title="u_net" alt="u_net" height=250 /> |
+| Residual Neural Network | <img src="res_net.png" title="res_net" alt="res_net" height=250 /> |
+| Invertible Neural Network | <img src="inn.png" title="inn" alt="inn" height=250 /> |
+
+Further Reading
+---------------
+- Python: PyTorch, TensorFlow, scikit-learn
+- MATLAB: Deep Learning Toolbox
-- 
GitLab