# Neural Networks 101

## Table of Contents
- Neural Networks 101
- Table of Contents
- Nomenclature and Definitions
- Activation Functions
- Minimization Problem
- Optimization (Training)
- Types of Neural Networks
- Further Reading
## Nomenclature and Definitions

First, we need to clarify a few terms: artificial intelligence, machine learning, and neural networks. Everybody categorizes them differently, but we look at it as follows: neural networks form a subset of machine learning methods, which in turn are a subset of artificial intelligence.
Here we focus on neural networks as a special model class used for function approximation in regression or classification tasks. To be more precise, we will rely on the following definition.
**Definition (Neural Network):** For any $L\in\mathbb{N}$ and $d=(d_0,\dots,d_L)\in\mathbb{N}^{L+1}$, a non-linear map $\Psi\colon\mathbb{R}^{d_0}\to\mathbb{R}^{d_L}$ of the form

$$
\Psi(x) = \bigl[\varphi_L\circ (W_L\bullet + b_L)\circ\varphi_{L-1}\circ\dots\circ(W_2\bullet + b_2)\circ\varphi_1\circ (W_1\bullet + b_1)\bigr](x)
$$

is called a fully connected feed-forward neural network.
Typically, we use the following nomenclature:
- $L$ is called the depth of the network, with layers $\ell=0,\dots,L$.
- $d$ is called the width of the network, where $d_\ell$ is the width of layer $\ell$.
- $W_\ell\in\mathbb{R}^{d_\ell\times d_{\ell-1}}$ are the weights of layer $\ell$.
- $b_\ell\in\mathbb{R}^{d_\ell}$ are the biases of layer $\ell$.
- $\vartheta=(W_1,b_1,\dots,W_L,b_L)$ are the free parameters of the neural network. Sometimes we write $\Psi_\vartheta$ or $\Psi(x;\vartheta)$ to indicate the dependence of $\Psi$ on the parameters $\vartheta$.
- $\varphi_\ell$ is the activation function of layer $\ell$. Note that $\varphi_\ell$ has to be non-linear and is typically chosen to be monotone increasing.
Additionally, the following conventions are used:
- $x^{(0)}:=x$ is called the input (layer) of the neural network $\Psi$.
- $x^{(L)}:=\Psi(x)$ is called the output (layer) of the neural network $\Psi$.
- Intermediate results $x^{(\ell)} = \varphi_\ell(W_\ell\, x^{(\ell-1)} + b_\ell)$ are called hidden layers.
- (debatable) A neural network is called shallow if it has only one hidden layer ($L=2$) and deep otherwise.
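These conventions translate directly into an iterative evaluation of $\Psi$: starting from $x^{(0)}=x$, each layer is computed from the previous one. A minimal sketch building on `init_parameters` above (again, `forward` is our own illustrative helper, not a library function) that also records the hidden layers:

```python
def forward(x, theta, activations):
    """Evaluate Psi(x; theta) and return all layers x^(0), ..., x^(L).

    `theta` is a list of (W_l, b_l) pairs and `activations` a list of the
    activation functions phi_1, ..., phi_L (one per layer).
    """
    layers = [x]                  # x^(0) := x is the input layer
    for (W, b), phi in zip(theta, activations):
        x = phi(W @ x + b)        # x^(l) = phi_l(W_l x^(l-1) + b_l)
        layers.append(x)
    return x, layers              # x^(L) = Psi(x) is the output layer
```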
**Example:** Let $L=3$, $d=(6, 10, 10, 3)$ and $\varphi_1=\varphi_2=\varphi_3=\tanh$. Then the neural network is given by the composition

$$
\Psi\colon \mathbb{R}^6\to\mathbb{R}^3,
\qquad
\Psi(x) = \varphi_3\Bigl(W_3 \Bigl(\underbrace{\varphi_2\bigl(W_2 \bigl(\underbrace{\varphi_1(W_1 x + b_1)}_{x^{(1)}}\bigr) + b_2\bigr)}_{x^{(2)}}\Bigr) + b_3\Bigr).
$$
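Using the illustrative helpers sketched above, this example network could be instantiated and its dimensions checked as follows ($\tanh$ is applied element-wise via `np.tanh`):

```python
import numpy as np

d = (6, 10, 10, 3)                 # d_0 = 6 inputs, two hidden layers of width 10, d_L = 3 outputs
theta = init_parameters(d)         # (W_1, b_1), (W_2, b_2), (W_3, b_3)
activations = [np.tanh, np.tanh, np.tanh]

x = np.ones(6)                     # some input x in R^6
y, layers = forward(x, theta, activations)

print(y.shape)                     # (3,)  -> Psi maps R^6 to R^3
print([h.shape for h in layers])   # [(6,), (10,), (10,), (3,)]
```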
A typical graphical representation of the neural network looks like this: