# Neural Networks 101

## Table of Contents
- Neural Networks 101
- Table of Contents
- Nomenclature and Definitions
- Activation Functions
- Minimization Problem
- Optimization (Training)
- Types of Neural Networks
- Further Reading
## Nomenclature and Definitions

First, we need to clarify a few terms: artificial intelligence, machine learning, and neural networks. Everybody categorizes them differently, but we look at it as follows: neural networks form a subset of machine learning methods, which in turn are a subset of artificial intelligence.
Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks. To be more precise, we will rely on the following definition.
**Definition (Neural Network):** For any $L\in\mathbb{N}$ and $d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}$, a non-linear map $\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}$ of the form

$$
\Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
$$

is called a fully connected feed-forward neural network.
Typically, we use the following nomenclature:
- $L$ is called the depth of the network, with layers $\ell=0,\dots,L$.
- $d$ is called the width of the network, where $d_\ell$ is the width of layer $\ell$.
- $W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}$ are the weights of layer $\ell$.
- $b_\ell\in\mathbb{R}^{d_\ell}$ are the biases of layer $\ell$.
- $\vartheta=(W_1,b_1,\dots,W_L,b_L)$ are the free parameters of the neural network. Sometimes we write $\Psi_\vartheta$ or $\Psi(x;\vartheta)$ to indicate the dependence of $\Psi$ on the parameters $\vartheta$.
- $\varphi_\ell$ is the activation function of layer $\ell$. Note that $\varphi_\ell$ has to be non-linear and is typically chosen to be monotone increasing.
Additionally, the following conventions are used:
- $x^{(0)}:=x$ is called the input (layer) of the neural network $\Psi$.
- $x^{(L)}:=\Psi(x)$ is called the output (layer) of the neural network $\Psi$.
- Intermediate results $x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)$ are called hidden layers.
- (debatable) A neural network is called shallow if it has only one hidden layer ($L=2$) and deep otherwise.
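These conventions translate directly into an iterative evaluation of $\Psi$: starting from $x^{(0)}=x$, each layer is computed from the previous one. A minimal sketch building on `init_parameters` above (again, `forward` is our own illustrative helper, not a library function) that also records the hidden layers:

```python
def forward(x, theta, activations):
    """Evaluate Psi(x; theta) and return all layers x^(0), ..., x^(L).

    `theta` is a list of (W_l, b_l) pairs and `activations` a list of the
    activation functions phi_1, ..., phi_L (one per layer).
    """
    layers = [x]                  # x^(0) := x is the input layer
    for (W, b), phi in zip(theta, activations):
        x = phi(W @ x + b)        # x^(l) = phi_l(W_l x^(l-1) + b_l)
        layers.append(x)
    return x, layers              # x^(L) = Psi(x) is the output layer
```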
**Example:** Let $L=3$, $d=(6, 10, 10, 3)$ and $\varphi_1=\varphi_2=\varphi_3=\tanh$. Then the neural network is given by the composition

$$
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
$$
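Using the illustrative helpers sketched above, this example network could be instantiated and its dimensions checked as follows ($\tanh$ is applied element-wise via `np.tanh`):

```python
import numpy as np

d = (6, 10, 10, 3)                 # d_0 = 6 inputs, two hidden layers of width 10, d_L = 3 outputs
theta = init_parameters(d)         # (W_1, b_1), (W_2, b_2), (W_3, b_3)
activations = [np.tanh, np.tanh, np.tanh]

x = np.ones(6)                     # some input x in R^6
y, layers = forward(x, theta, activations)

print(y.shape)                     # (3,)  -> Psi maps R^6 to R^3
print([h.shape for h in layers])   # [(6,), (10,), (10,), (3,)]
```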
A typical graphical representation of the neural network looks like this: