Sensitivity of the outputs to an external input, averaged over all training patterns and all output neurons:

\[ \frac{1}{|L_{fixed}|} \sum_{l \in L_{fixed}} \sum_{v \in U_{out}} \frac{\partial out_v^{(l)}}{\partial ex_u^{(l)}}. \]

Formal derivation: Apply the chain rule.

\[ \frac{\partial out_v}{\partial ex_u}
 = \frac{\partial out_v}{\partial out_u} \frac{\partial out_u}{\partial ex_u}
 = \frac{\partial out_v}{\partial net_v} \frac{\partial net_v}{\partial out_u} \frac{\partial out_u}{\partial ex_u}. \]

Simplification: Assume that the output function is the identity.

\[ \frac{\partial out_u}{\partial ex_u} = 1. \]
Christian Borgelt Introduction to Neural Networks 113
Sensitivity Analysis
For the second factor we get the general result:

\[ \frac{\partial net_v}{\partial out_u}
 = \frac{\partial}{\partial out_u} \sum_{p \in pred(v)} w_{vp}\, out_p
 = \sum_{p \in pred(v)} w_{vp} \frac{\partial out_p}{\partial out_u}. \]

This leads to the recursion formula

\[ \frac{\partial out_v}{\partial out_u}
 = \frac{\partial out_v}{\partial net_v} \frac{\partial net_v}{\partial out_u}
 = \frac{\partial out_v}{\partial net_v} \sum_{p \in pred(v)} w_{vp} \frac{\partial out_p}{\partial out_u}. \]

However, for the first hidden layer we get

\[ \frac{\partial net_v}{\partial out_u} = w_{vu}, \qquad\text{therefore}\qquad
   \frac{\partial out_v}{\partial out_u} = \frac{\partial out_v}{\partial net_v}\, w_{vu}. \]

This formula marks the start of the recursion.
Sensitivity Analysis
Consider as usual the special case with

  output function is the identity,
  activation function is logistic.

The recursion formula is in this case

\[ \frac{\partial out_v}{\partial out_u}
 = out_v (1 - out_v) \sum_{p \in pred(v)} w_{vp} \frac{\partial out_p}{\partial out_u} \]

and the anchor of the recursion is

\[ \frac{\partial out_v}{\partial out_u} = out_v (1 - out_v)\, w_{vu}. \]
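The recursion can be turned into a small program. Below is a minimal sketch (not from the slides): a fully connected feed-forward network with logistic activations and identity output function; the function name `sensitivity` and the layer-list weight format are my own choices.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sensitivity(weights, x):
    """Derivatives of each final output w.r.t. each input for a layered
    logistic network (output function = identity).
    weights[k][j][i] = weight from neuron i of layer k to neuron j of
    layer k+1."""
    # forward pass: out[k] holds the outputs of layer k
    out = [list(x)]
    for W in weights:
        nets = [sum(w * o for w, o in zip(row, out[-1])) for row in W]
        out.append([logistic(n) for n in nets])
    n_in = len(x)
    # anchor of the recursion: d out_v / d out_u = out_v (1 - out_v) w_vu
    deriv = [[out[1][v] * (1 - out[1][v]) * weights[0][v][u]
              for u in range(n_in)] for v in range(len(out[1]))]
    # recursion: d out_v/d out_u = out_v(1-out_v) sum_p w_vp d out_p/d out_u
    for k in range(1, len(weights)):
        W, o = weights[k], out[k + 1]
        deriv = [[o[v] * (1 - o[v]) * sum(W[v][p] * deriv[p][u]
                                          for p in range(len(deriv)))
                  for u in range(n_in)] for v in range(len(o))]
    return deriv  # deriv[v][u] = d out_v / d ex_u
```

The returned matrix can be averaged over patterns and output neurons to obtain the sensitivity measure.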
Demonstration Software: xmlp/wmlp

Demonstration of multilayer perceptron training:

  Visualization of the training process
  Biimplication and Exclusive Or, two continuous functions

http://www.borgelt.net/mlpd.html
Multilayer Perceptron Software: mlp/mlpgui

Software for training general multilayer perceptrons:

  Command line version written in C, fast training
  Graphical user interface in Java, easy to use

http://www.borgelt.net/mlp.html, http://www.borgelt.net/mlpgui.html
Radial Basis Function Networks
Radial Basis Function Networks
A radial basis function network (RBFN) is a neural network
with a graph G = (U;C) that satises the following conditions
(i) U
in
\U
out
=;;
(ii) C = (U
in
U
hidden
) [C
0
;C
0
(U
hidden
U
out
)
The network input function of each hidden neuron is a distance function
of the input vector and the weight vector,i.e.
8u 2 U
hidden
:f
(u)
net
( ~w
u
;
~
in
u
) = d( ~w
u
;
~
in
u
);
where d:IR
n
IR
n
!IR
+
0
is a function satisfying 8~x;~y;~z 2 IR
n
:
(i) d(~x;~y) = 0,~x = ~y;
(ii) d(~x;~y) = d(~y;~x) (symmetry);
(iii) d(~x;~z) d(~x;~y) +d(~y;~z) (triangle inequality):
Christian Borgelt Introduction to Neural Networks 119
Distance Functions

Illustration of distance functions:

\[ d_k(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{\frac{1}{k}} \]

Well-known special cases from this family are:

  k = 1:  Manhattan or city block distance,
  k = 2:  Euclidean distance,
  k → ∞:  maximum distance, i.e. $d_\infty(\vec{x}, \vec{y}) = \max_{i=1}^{n} |x_i - y_i|$.

(Figure: the circles, i.e. the sets of points at constant distance from a center, for k = 1, k = 2, and k → ∞.)
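The distance family above is straightforward to implement; a small sketch (function names are my own):

```python
def minkowski(x, y, k):
    """Family of distance functions d_k; k=1 Manhattan, k=2 Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def d_max(x, y):
    """Limit k -> infinity: maximum distance."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For growing k, `minkowski` approaches `d_max`.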
Radial Basis Function Networks
The network input function of the output neurons is the weighted sum of their
inputs,i.e.
8u 2 U
out
:f
(u)
net
( ~w
u
;
~
in
u
) = ~w
u
~
in
u
=
X
v2pred(u)
w
uv
out
v
:
The activation function of each hidden neuron is a socalled radial function,i.e.
a monotonously decreasing function
f:IR
+
0
![0;1] with f(0) = 1 and lim
x!1
f(x) = 0:
The activation function of each output neuron is a linear function,namely
f
(u)
act
(net
u
;
u
) = net
u
u
:
(The linear activation function is important for the initialization.)
Christian Borgelt Introduction to Neural Networks 121
Radial Activation Functions

rectangle function:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > \sigma, \\ 1, & \text{otherwise.} \end{cases} \]

triangle function:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > \sigma, \\ 1 - \frac{net}{\sigma}, & \text{otherwise.} \end{cases} \]

cosine until zero:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > 2\sigma, \\ \frac{\cos\left(\frac{\pi}{2\sigma} net\right) + 1}{2}, & \text{otherwise.} \end{cases} \]

Gaussian function:
\[ f_{act}(net, \sigma) = e^{-\frac{net^2}{2\sigma^2}} \]

(Figures: the graph of each function over net, with the radius parameter σ marked on the axis.)
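A minimal sketch of the four radial activation functions (function names are my own; `net` and `sigma` follow the slide's notation):

```python
import math

def rectangle(net, sigma):
    """1 inside the radius, 0 outside."""
    return 0.0 if net > sigma else 1.0

def triangle(net, sigma):
    """Linear decay from 1 at net = 0 to 0 at net = sigma."""
    return 0.0 if net > sigma else 1.0 - net / sigma

def cosine_until_zero(net, sigma):
    """Cosine half-wave, reaching 0 at net = 2 sigma."""
    if net > 2 * sigma:
        return 0.0
    return (math.cos(math.pi / (2 * sigma) * net) + 1) / 2

def gaussian(net, sigma):
    """exp(-net^2 / (2 sigma^2)); never exactly 0."""
    return math.exp(-net ** 2 / (2 * sigma ** 2))
```

All four satisfy f(0) = 1 and tend to 0 for growing net, as required of radial functions.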
Radial Basis Function Networks: Examples

Radial basis function networks for the conjunction x_1 ∧ x_2:

(Figure: two alternative radial basis function networks computing the conjunction, one with the basis function centered at (1, 1) and one with it centered at (0, 0), together with the resulting classifications of the input space.)
Radial Basis Function Networks: Examples

Radial basis function network for the biimplication x_1 ↔ x_2

Idea: logical decomposition

\[ x_1 \leftrightarrow x_2 \;\equiv\; (x_1 \wedge x_2) \vee \neg(x_1 \vee x_2) \]

(Figure: a network with two hidden neurons, centered at (1, 1) and at (0, 0), both with radius 1/2, connected to the output neuron with weight 1 each and threshold 0, together with the resulting classification of the input space.)
Radial Basis Function Networks: Function Approximation

(Figure: approximating a function given by sampling points (x_1, y_1), ..., (x_4, y_4) with a step function; each sampling point contributes one rectangular basis function of height y_i.)

(Figure: the corresponding radial basis function network: one hidden neuron per sampling point x_i with rectangular activation function and radius σ = 1/2 Δx = 1/2 (x_{i+1} − x_i); the output weights are the values y_i.)

(Figure: with triangular activation functions the same construction yields a piecewise linear interpolation of the sampling points.)

(Figure: approximation of a function as a weighted sum of Gaussian basis functions with weights w_1, w_2, w_3.)

Radial basis function network for a sum of three Gaussian functions:

(Figure: a network with three hidden neurons with centers 2, 5, 6 and radius 1, feeding an output neuron with threshold 0.)
Training Radial Basis Function Networks
Radial Basis Function Networks: Initialization

Let $L_{fixed} = \{l_1, \dots, l_m\}$ be a fixed learning task,
consisting of m training patterns $l = (\vec{\imath}^{\,(l)}, \vec{o}^{\,(l)})$.

Simple radial basis function network:
One hidden neuron $v_k$, $k = 1, \dots, m$, for each training pattern:

\[ \forall k \in \{1, \dots, m\}: \quad \vec{w}_{v_k} = \vec{\imath}^{\,(l_k)}. \]

If the activation function is the Gaussian function,
the radii $\sigma_k$ are chosen heuristically as

\[ \forall k \in \{1, \dots, m\}: \quad \sigma_k = \frac{d_{max}}{\sqrt{2m}}, \]

where

\[ d_{max} = \max_{l_j, l_k \in L_{fixed}} d\!\left( \vec{\imath}^{\,(l_j)}, \vec{\imath}^{\,(l_k)} \right). \]
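A sketch of this simple initialization (assuming Euclidean distance via Python's `math.dist`; the function name is my own):

```python
import math

def init_simple_rbf(patterns):
    """Simple RBFN initialization: one hidden neuron per training
    pattern (x, o); its center is the input vector x, and all radii are
    set to sigma = d_max / sqrt(2 m)."""
    centers = [x for x, _ in patterns]
    m = len(patterns)
    d_max = max(math.dist(a, b)
                for i, a in enumerate(centers) for b in centers[i + 1:])
    sigma = d_max / math.sqrt(2 * m)
    return centers, sigma
```

For the biimplication task (four patterns on the unit square, d_max = √2, m = 4) this yields the radius 1/2 used in the following example.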
Radial Basis Function Networks: Initialization

Initializing the connections from the hidden to the output neurons:

\[ \forall u: \quad \sum_{k=1}^{m} w_{u v_k} out_{v_k}^{(l)} - \theta_u = o_u^{(l)}
   \qquad\text{(one equation for each training pattern $l$), or abbreviated}\qquad
   \mathbf{A} \vec{w}_u = \vec{o}_u, \]

where $\vec{o}_u = (o_u^{(l_1)}, \dots, o_u^{(l_m)})^\top$ is the vector of desired outputs, $\theta_u = 0$, and

\[ \mathbf{A} = \begin{pmatrix}
out_{v_1}^{(l_1)} & out_{v_2}^{(l_1)} & \cdots & out_{v_m}^{(l_1)} \\
out_{v_1}^{(l_2)} & out_{v_2}^{(l_2)} & \cdots & out_{v_m}^{(l_2)} \\
\vdots & \vdots & & \vdots \\
out_{v_1}^{(l_m)} & out_{v_2}^{(l_m)} & \cdots & out_{v_m}^{(l_m)}
\end{pmatrix}. \]

This is a linear equation system that can be solved by inverting the matrix $\mathbf{A}$:

\[ \vec{w}_u = \mathbf{A}^{-1} \vec{o}_u. \]
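A sketch of this weight computation. Instead of explicitly inverting A, the linear system A w_u = o_u is solved by Gaussian elimination, which is numerically equivalent here; all names are my own:

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting
    (a pure-Python stand-in for w_u = A^{-1} o_u)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rbf_output_weights(act_matrix, targets):
    """act_matrix[l][k] = output of hidden neuron v_k on pattern l."""
    return solve([row[:] for row in act_matrix], list(targets))
```

Applied to the biimplication example on the following slides, this reproduces the weights (1.0567, −0.2809, −0.2809, 1.0567).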
RBFN Initialization: Example

Simple radial basis function network for the biimplication x_1 ↔ x_2:

  x_1  x_2  y
   0    0   1
   1    0   0
   0    1   0
   1    1   1

(Figure: a network with four hidden neurons, centered at the four training inputs (0, 0), (1, 0), (0, 1), (1, 1), each with radius 1/2; the connections to the output neuron carry the weights w_1, ..., w_4.)
RBFN Initialization: Example

Simple radial basis function network for the biimplication x_1 ↔ x_2:

\[ \mathbf{A} = \begin{pmatrix}
1 & e^{-2} & e^{-2} & e^{-4} \\
e^{-2} & 1 & e^{-4} & e^{-2} \\
e^{-2} & e^{-4} & 1 & e^{-2} \\
e^{-4} & e^{-2} & e^{-2} & 1
\end{pmatrix} \qquad
\mathbf{A}^{-1} = \frac{1}{D} \begin{pmatrix}
a & -b & -b & c \\
-b & a & c & -b \\
-b & c & a & -b \\
c & -b & -b & a
\end{pmatrix} \]

where

\[ D = 1 - 4e^{-4} + 6e^{-8} - 4e^{-12} + e^{-16} \approx 0.9287, \]
\[ a = 1 - 2e^{-4} + e^{-8} \approx 0.9637, \qquad
   b = e^{-2} - 2e^{-6} + e^{-10} \approx 0.1304, \qquad
   c = e^{-4} - 2e^{-8} + e^{-12} \approx 0.0177. \]

\[ \vec{w}_u = \mathbf{A}^{-1} \vec{o}_u
 = \frac{1}{D} \begin{pmatrix} a + c \\ -2b \\ -2b \\ a + c \end{pmatrix}
 \approx \begin{pmatrix} 1.0567 \\ -0.2809 \\ -0.2809 \\ 1.0567 \end{pmatrix}. \]
RBFN Initialization: Example

Simple radial basis function network for the biimplication x_1 ↔ x_2:

(Figure: a single basis function, all four basis functions, and the output of the initialized network over the input space; the output is 1 at (0, 0) and (1, 1) and 0 at (1, 0) and (0, 1).)

Initialization already provides a perfect solution of the learning task.
Subsequent training is not necessary.
Radial Basis Function Networks: Initialization

Normal radial basis function networks:
Select a subset of k training patterns as centers.

\[ \mathbf{A} = \begin{pmatrix}
1 & out_{v_1}^{(l_1)} & out_{v_2}^{(l_1)} & \cdots & out_{v_k}^{(l_1)} \\
1 & out_{v_1}^{(l_2)} & out_{v_2}^{(l_2)} & \cdots & out_{v_k}^{(l_2)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & out_{v_1}^{(l_m)} & out_{v_2}^{(l_m)} & \cdots & out_{v_k}^{(l_m)}
\end{pmatrix} \qquad \mathbf{A} \vec{w}_u = \vec{o}_u \]

Compute the (Moore-Penrose) pseudo inverse:

\[ \mathbf{A}^{+} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top. \]

The weights can then be computed by

\[ \vec{w}_u = \mathbf{A}^{+} \vec{o}_u = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \vec{o}_u. \]
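A sketch of the pseudo-inverse computation, assuming NumPy is available (names are my own):

```python
import numpy as np

def rbf_weights_pseudo_inverse(A, o):
    """w_u = A^+ o_u with A^+ = (A^T A)^{-1} A^T.
    A: m x (k+1) activation matrix (first column all ones),
    o: vector of m desired outputs."""
    A = np.asarray(A, dtype=float)
    o = np.asarray(o, dtype=float)
    A_plus = np.linalg.inv(A.T @ A) @ A.T   # Moore-Penrose pseudo inverse
    return A_plus @ o
```

Applied to the biimplication example on the following slides this reproduces the weight vector (−0.3620, 1.3375, 1.3375).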
RBFN Initialization: Example

Normal radial basis function network for the biimplication x_1 ↔ x_2

Select two training patterns:

\[ l_1 = (\vec{\imath}^{\,(l_1)}, \vec{o}^{\,(l_1)}) = ((0, 0), (1)), \qquad
   l_4 = (\vec{\imath}^{\,(l_4)}, \vec{o}^{\,(l_4)}) = ((1, 1), (1)). \]

(Figure: a network with two hidden neurons, centered at (0, 0) and (1, 1), each with radius 1/2, connected to the output neuron with weights w_1 and w_2 and threshold θ.)
RBFN Initialization: Example

Normal radial basis function network for the biimplication x_1 ↔ x_2:

\[ \mathbf{A} = \begin{pmatrix}
1 & 1 & e^{-4} \\
1 & e^{-2} & e^{-2} \\
1 & e^{-2} & e^{-2} \\
1 & e^{-4} & 1
\end{pmatrix} \qquad
\mathbf{A}^{+} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top = \begin{pmatrix}
a & b & b & a \\
c & d & d & e \\
e & d & d & c
\end{pmatrix} \]

where

\[ a \approx -0.1810, \quad b \approx 0.6810, \quad
   c \approx 1.1781, \quad d \approx -0.6688, \quad e \approx 0.1594. \]

Resulting weights:

\[ \vec{w}_u = \begin{pmatrix} -\theta_u \\ w_1 \\ w_2 \end{pmatrix} = \mathbf{A}^{+} \vec{o}_u
\approx \begin{pmatrix} -0.3620 \\ 1.3375 \\ 1.3375 \end{pmatrix}. \]
RBFN Initialization: Example

Normal radial basis function network for the biimplication x_1 ↔ x_2:

(Figure: the two basis functions centered at (0, 0) and (1, 1) and the output of the initialized network; the output is 1 at (0, 0) and (1, 1) and 0 at (1, 0) and (0, 1), with baseline −0.36.)

Initialization already provides a perfect solution of the learning task.
This is an accident, because the linear equation system is not overdetermined,
due to linearly dependent equations.
Radial Basis Function Networks: Initialization

Finding appropriate centers for the radial basis functions

One approach: k-means clustering

  Select randomly k training patterns as centers.
  Assign to each center those training patterns that are closest to it.
  Compute new centers as the center of gravity of the assigned training patterns.
  Repeat the previous two steps until convergence,
  i.e., until the centers do not change anymore.
  Use the resulting centers for the weight vectors of the hidden neurons.

Alternative approach: learning vector quantization
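The k-means steps above can be sketched as follows (function name and the fixed random seed are my own choices):

```python
import math, random

def k_means(patterns, k, rng=random.Random(0)):
    """k-means clustering: random initial centers, then alternate the
    assignment step and the center-of-gravity step until the centers
    no longer change."""
    centers = rng.sample(list(patterns), k)
    while True:
        # assign each pattern to its closest center
        clusters = [[] for _ in range(k)]
        for x in patterns:
            i = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[i].append(x)
        # recompute centers as centers of gravity of the assigned patterns
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:       # convergence
            return centers
        centers = new_centers
```

The resulting centers can then be used as the weight vectors of the hidden neurons.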
Radial Basis Function Networks: Training

Training radial basis function networks:
Derivation of the update rules is analogous to that of multilayer perceptrons.

Weights from the hidden to the output neurons:

Gradient:

\[ \vec{\nabla}_{\vec{w}_u} e_u^{(l)} = \frac{\partial e_u^{(l)}}{\partial \vec{w}_u}
 = -2 \left( o_u^{(l)} - out_u^{(l)} \right) \vec{in}_u^{(l)}, \]

Weight update rule:

\[ \Delta \vec{w}_u^{(l)} = -\frac{\eta_3}{2} \vec{\nabla}_{\vec{w}_u} e_u^{(l)}
 = \eta_3 \left( o_u^{(l)} - out_u^{(l)} \right) \vec{in}_u^{(l)}. \]

(Two more learning rates are needed for the center coordinates and the radii.)
Radial Basis Function Networks: Training

Training radial basis function networks:
Center coordinates (weights from the input to the hidden neurons).

Gradient:

\[ \vec{\nabla}_{\vec{w}_v} e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_v}
 = -2 \sum_{s \in succ(v)} \left( o_s^{(l)} - out_s^{(l)} \right) w_{sv}
   \frac{\partial out_v^{(l)}}{\partial net_v^{(l)}}
   \frac{\partial net_v^{(l)}}{\partial \vec{w}_v} \]

Weight update rule:

\[ \Delta \vec{w}_v^{(l)} = -\frac{\eta_1}{2} \vec{\nabla}_{\vec{w}_v} e^{(l)}
 = \eta_1 \sum_{s \in succ(v)} \left( o_s^{(l)} - out_s^{(l)} \right) w_{sv}
   \frac{\partial out_v^{(l)}}{\partial net_v^{(l)}}
   \frac{\partial net_v^{(l)}}{\partial \vec{w}_v} \]
Radial Basis Function Networks: Training

Training radial basis function networks:
Center coordinates (weights from the input to the hidden neurons).

Special case: Euclidean distance

\[ \frac{\partial net_v^{(l)}}{\partial \vec{w}_v}
 = \left( \sum_{i=1}^{n} \left( w_{v p_i} - out_{p_i}^{(l)} \right)^2 \right)^{-\frac{1}{2}}
   \left( \vec{w}_v - \vec{in}_v^{(l)} \right). \]

Special case: Gaussian activation function

\[ \frac{\partial out_v^{(l)}}{\partial net_v^{(l)}}
 = \frac{\partial f_{act}\!\left( net_v^{(l)}, \sigma_v \right)}{\partial net_v^{(l)}}
 = \frac{\partial}{\partial net_v^{(l)}} e^{-\frac{\left( net_v^{(l)} \right)^2}{2\sigma_v^2}}
 = -\frac{net_v^{(l)}}{\sigma_v^2}\, e^{-\frac{\left( net_v^{(l)} \right)^2}{2\sigma_v^2}}. \]
Radial Basis Function Networks: Training

Training radial basis function networks:
Radii of the radial basis functions.

Gradient:

\[ \frac{\partial e^{(l)}}{\partial \sigma_v}
 = -2 \sum_{s \in succ(v)} \left( o_s^{(l)} - out_s^{(l)} \right) w_{sv}
   \frac{\partial out_v^{(l)}}{\partial \sigma_v}. \]

Weight update rule:

\[ \Delta \sigma_v^{(l)} = -\frac{\eta_2}{2} \frac{\partial e^{(l)}}{\partial \sigma_v}
 = \eta_2 \sum_{s \in succ(v)} \left( o_s^{(l)} - out_s^{(l)} \right) w_{sv}
   \frac{\partial out_v^{(l)}}{\partial \sigma_v}. \]

Special case: Gaussian activation function

\[ \frac{\partial out_v^{(l)}}{\partial \sigma_v}
 = \frac{\partial}{\partial \sigma_v} e^{-\frac{\left( net_v^{(l)} \right)^2}{2\sigma_v^2}}
 = \frac{\left( net_v^{(l)} \right)^2}{\sigma_v^3}\, e^{-\frac{\left( net_v^{(l)} \right)^2}{2\sigma_v^2}}. \]
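The update rules of the last three slides can be combined into one online training step. Below is a minimal sketch for a Gaussian RBFN with Euclidean distance and a single linear output neuron; `eta1`, `eta2`, `eta3` correspond to the learning rates η1, η2, η3, all other names are mine:

```python
import math

def rbf_train_step(centers, sigmas, w, theta, x, target,
                   eta1=0.1, eta2=0.1, eta3=0.1):
    """One online gradient step for a Gaussian RBF network with output
    out = sum_k w_k act_k - theta."""
    nets = [math.dist(c, x) for c in centers]
    acts = [math.exp(-n * n / (2 * s * s)) for n, s in zip(nets, sigmas)]
    out = sum(wk * a for wk, a in zip(w, acts)) - theta
    err = target - out                    # o^(l) - out^(l)
    # all gradients are evaluated at the old parameters
    new_w = [wk + eta3 * err * a for wk, a in zip(w, acts)]
    new_theta = theta - eta3 * err        # threshold input is fixed at -1
    new_centers, new_sigmas = [], []
    for c, n, a, s, wk in zip(centers, nets, acts, sigmas, w):
        d_out_d_net = -n / (s * s) * a    # derivative of the Gaussian
        unit = [(ci - xi) / n if n > 0 else 0.0 for ci, xi in zip(c, x)]
        new_centers.append(tuple(ci + eta1 * err * wk * d_out_d_net * ui
                                 for ci, ui in zip(c, unit)))
        new_sigmas.append(s + eta2 * err * wk * n * n / s ** 3 * a)
    return new_centers, new_sigmas, new_w, new_theta
```

Repeating the step on a pattern drives the network output toward the target value.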
Radial Basis Function Networks: Generalization

Generalization of the distance function

Idea: Use an anisotropic distance function.

Example: Mahalanobis distance

\[ d(\vec{x}, \vec{y}) = \sqrt{ (\vec{x} - \vec{y})^\top \mathbf{\Sigma}^{-1} (\vec{x} - \vec{y}) }. \]

Example: biimplication

(Figure: a network with a single hidden neuron with center (1/2, 1/2) and radius 1/3, whose anisotropic basis function, built from the matrix with rows (9, 8) and (8, 9), solves the biimplication on its own.)
Learning Vector Quantization
Vector Quantization
Voronoi diagram of a vector quantization
Dots represent vectors that are used for quantizing the area.
Lines are the boundaries of the regions of points
that are closest to the enclosed vector.
Learning Vector Quantization

Finding clusters in a given set of data points

  Data points are represented by empty circles (◦).
  Cluster centers are represented by full circles (•).
Learning Vector Quantization Networks

A learning vector quantization network (LVQ) is a neural network
with a graph G = (U, C) that satisfies the following conditions:

(i)  $U_{in} \cap U_{out} = \emptyset$, $U_{hidden} = \emptyset$,
(ii) $C = U_{in} \times U_{out}$.

The network input function of each output neuron is a distance function
of the input vector and the weight vector, i.e.

\[ \forall u \in U_{out}: \quad f_{net}^{(u)}(\vec{w}_u, \vec{in}_u) = d(\vec{w}_u, \vec{in}_u), \]

where $d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_0^+$ is a function satisfying $\forall \vec{x}, \vec{y}, \vec{z} \in \mathbb{R}^n$:

(i)   $d(\vec{x}, \vec{y}) = 0 \Leftrightarrow \vec{x} = \vec{y}$,
(ii)  $d(\vec{x}, \vec{y}) = d(\vec{y}, \vec{x})$   (symmetry),
(iii) $d(\vec{x}, \vec{z}) \le d(\vec{x}, \vec{y}) + d(\vec{y}, \vec{z})$   (triangle inequality).
Distance Functions

Illustration of distance functions:

\[ d_k(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{\frac{1}{k}} \]

Well-known special cases from this family are:

  k = 1:  Manhattan or city block distance,
  k = 2:  Euclidean distance,
  k → ∞:  maximum distance, i.e. $d_\infty(\vec{x}, \vec{y}) = \max_{i=1}^{n} |x_i - y_i|$.

(Figure: the circles, i.e. the sets of points at constant distance from a center, for k = 1, k = 2, and k → ∞.)
Learning Vector Quantization

The activation function of each output neuron is a so-called radial function, i.e.
a monotonically decreasing function

\[ f: \mathbb{R}_0^+ \to [0, 1] \quad\text{with}\quad f(0) = 1 \quad\text{and}\quad \lim_{x \to \infty} f(x) = 0. \]

Sometimes the range of values is restricted to the interval [0, 1].
However, due to the special output function this restriction is irrelevant.

The output function of each output neuron is not a simple function of the activation
of the neuron. Rather it takes into account the activations of all output neurons:

\[ f_{out}^{(u)}(act_u) = \begin{cases}
1, & \text{if } act_u = \max_{v \in U_{out}} act_v, \\
0, & \text{otherwise.}
\end{cases} \]

If more than one unit has the maximal activation, one is selected at random to have
an output of 1, all others are set to output 0: winner-takes-all principle.
Radial Activation Functions

rectangle function:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > \sigma, \\ 1, & \text{otherwise.} \end{cases} \]

triangle function:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > \sigma, \\ 1 - \frac{net}{\sigma}, & \text{otherwise.} \end{cases} \]

cosine until zero:
\[ f_{act}(net, \sigma) = \begin{cases} 0, & \text{if } net > 2\sigma, \\ \frac{\cos\left(\frac{\pi}{2\sigma} net\right) + 1}{2}, & \text{otherwise.} \end{cases} \]

Gaussian function:
\[ f_{act}(net, \sigma) = e^{-\frac{net^2}{2\sigma^2}} \]

(Figures: the graph of each function over net, with the radius parameter σ marked on the axis.)
Learning Vector Quantization

Adaptation of reference vectors / codebook vectors:

  For each training pattern find the closest reference vector.
  Adapt only this reference vector (winner neuron).
  For classified data the class may be taken into account:
  each reference vector is assigned to a class.

Attraction rule (data point and reference vector have the same class):

\[ \vec{r}^{\,(new)} = \vec{r}^{\,(old)} + \eta \left( \vec{x} - \vec{r}^{\,(old)} \right), \]

Repulsion rule (data point and reference vector have different classes):

\[ \vec{r}^{\,(new)} = \vec{r}^{\,(old)} - \eta \left( \vec{x} - \vec{r}^{\,(old)} \right). \]
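The attraction and repulsion rules can be sketched as follows (function name is my own):

```python
import math

def lvq_update(refs, classes, x, c, eta):
    """One LVQ step on data point x with class c:
    attraction rule  r_new = r_old + eta (x - r_old)  if classes match,
    repulsion rule   r_new = r_old - eta (x - r_old)  otherwise.
    Only the closest reference vector (winner) is adapted."""
    j = min(range(len(refs)), key=lambda i: math.dist(refs[i], x))
    sign = 1.0 if classes[j] == c else -1.0
    refs[j] = tuple(r + sign * eta * (xc - r) for r, xc in zip(refs[j], x))
    return refs
```

The list of reference vectors is modified in place; all non-winner vectors remain unchanged.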
Learning Vector Quantization

Adaptation of reference vectors / codebook vectors

(Figure: effect of the attraction rule and the repulsion rule on the winner among three reference vectors r_1, r_2, r_3 for a data point x; learning rate η = 0.4.)
Learning Vector Quantization: Example

Adaptation of reference vectors / codebook vectors

  Left:  online training with learning rate η = 0.1,
  Right: batch training with learning rate η = 0.05.
Learning Vector Quantization: Learning Rate Decay

Problem: a fixed learning rate can lead to oscillations.

Solution: time-dependent learning rate

\[ \eta(t) = \eta_0 \alpha^t, \quad 0 < \alpha < 1,
   \qquad\text{or}\qquad
   \eta(t) = \eta_0 t^{-\kappa}, \quad \kappa > 0. \]
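Both decay schedules as code (function names and default parameter values are my own):

```python
def eta_exponential(t, eta0=0.1, alpha=0.99):
    """Exponential decay: eta(t) = eta0 * alpha**t, 0 < alpha < 1."""
    return eta0 * alpha ** t

def eta_power(t, eta0=0.1, kappa=0.5):
    """Power-law decay: eta(t) = eta0 * t**(-kappa), kappa > 0, t >= 1."""
    return eta0 * t ** -kappa
```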
Learning Vector Quantization: Classified Data

Improved update rule for classified data

Idea: Update not only the one reference vector that is closest to the data point
(the winner neuron), but update the two closest reference vectors.

Let $\vec{x}$ be the currently processed data point and $c$ its class.
Let $\vec{r}_j$ and $\vec{r}_k$ be the two closest reference vectors and $z_j$ and $z_k$ their classes.

Reference vectors are updated only if $z_j \ne z_k$ and either $c = z_j$ or $c = z_k$.
(Without loss of generality we assume $c = z_j$.)

The update rules for the two closest reference vectors are:

\[ \vec{r}_j^{\,(new)} = \vec{r}_j^{\,(old)} + \eta \left( \vec{x} - \vec{r}_j^{\,(old)} \right)
   \qquad\text{and}\qquad
   \vec{r}_k^{\,(new)} = \vec{r}_k^{\,(old)} - \eta \left( \vec{x} - \vec{r}_k^{\,(old)} \right), \]

while all other reference vectors remain unchanged.
Learning Vector Quantization: Window Rule

It was observed in practical tests that standard learning vector quantization
may drive the reference vectors further and further apart.

To counteract this undesired behavior a window rule was introduced:
update only if the data point $\vec{x}$ is close to the classification boundary.

"Close to the boundary" is made formally precise by requiring

\[ \min\left( \frac{d(\vec{x}, \vec{r}_j)}{d(\vec{x}, \vec{r}_k)},
              \frac{d(\vec{x}, \vec{r}_k)}{d(\vec{x}, \vec{r}_j)} \right) > \theta,
   \qquad\text{where}\qquad \theta = \frac{1 - \xi}{1 + \xi}. \]

$\xi$ is a parameter that has to be specified by the user.

Intuitively, $\xi$ describes the "width" of the window around the classification
boundary, in which the data point has to lie in order to lead to an update.

Using it prevents divergence, because the update ceases for a data point once
the classification boundary has been moved far enough away.
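A sketch combining the two-winner update of the previous slide with the window rule (function name and the default value of ξ are my own choices):

```python
import math

def lvq2_update(refs, classes, x, c, eta=0.1, xi=0.3):
    """Two-winner LVQ update with the window rule: the two closest
    reference vectors r_j, r_k are adapted only if their classes differ,
    one of them has the class c of the data point, and x lies inside the
    window min(d_j/d_k, d_k/d_j) > theta = (1 - xi)/(1 + xi)."""
    order = sorted(range(len(refs)), key=lambda i: math.dist(refs[i], x))
    j, k = order[0], order[1]
    if classes[j] == classes[k] or c not in (classes[j], classes[k]):
        return refs
    d_j, d_k = math.dist(refs[j], x), math.dist(refs[k], x)
    if min(d_j / d_k, d_k / d_j) <= (1 - xi) / (1 + xi):
        return refs                      # outside the window: no update
    if c == classes[k]:                  # w.l.o.g. let r_j have class c
        j, k = k, j
    refs[j] = tuple(r + eta * (xc - r) for r, xc in zip(refs[j], x))
    refs[k] = tuple(r - eta * (xc - r) for r, xc in zip(refs[k], x))
    return refs
```

Points far from the classification boundary leave all reference vectors unchanged.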
Soft Learning Vector Quantization

Idea: Use soft assignments instead of winner-takes-all.

Assumption: Given data was sampled from a mixture of normal distributions.
Each reference vector describes one normal distribution.

Objective: Maximize the log-likelihood ratio of the data, that is, maximize

\[ \ln L_{ratio} =
   \sum_{j=1}^{n} \ln \sum_{\vec{r} \in R(c_j)}
     \exp\left( -\frac{(\vec{x}_j - \vec{r})^\top (\vec{x}_j - \vec{r})}{2\sigma^2} \right)
 - \sum_{j=1}^{n} \ln \sum_{\vec{r} \in Q(c_j)}
     \exp\left( -\frac{(\vec{x}_j - \vec{r})^\top (\vec{x}_j - \vec{r})}{2\sigma^2} \right). \]

Here $\sigma$ is a parameter specifying the "size" of each normal distribution.
$R(c)$ is the set of reference vectors assigned to class $c$ and $Q(c)$ its complement.

Intuitively: at each data point the probability density for its class should be as large
as possible while the density for all other classes should be as small as possible.
Soft Learning Vector Quantization

Update rule derived from a maximum log-likelihood approach:

\[ \vec{r}_i^{\,(new)} = \vec{r}_i^{\,(old)} + \eta \cdot
   \begin{cases}
   u_{ij}^{\oplus}  \left( \vec{x}_j - \vec{r}_i^{\,(old)} \right), & \text{if } c_j = z_i, \\[4pt]
   -u_{ij}^{\ominus} \left( \vec{x}_j - \vec{r}_i^{\,(old)} \right), & \text{if } c_j \ne z_i,
   \end{cases} \]

where $z_i$ is the class associated with the reference vector $\vec{r}_i$ and

\[ u_{ij}^{\oplus} =
   \frac{\exp\left( -\frac{1}{2\sigma^2} (\vec{x}_j - \vec{r}_i^{\,(old)})^\top (\vec{x}_j - \vec{r}_i^{\,(old)}) \right)}
        {\sum_{\vec{r} \in R(c_j)} \exp\left( -\frac{1}{2\sigma^2} (\vec{x}_j - \vec{r}^{\,(old)})^\top (\vec{x}_j - \vec{r}^{\,(old)}) \right)}
   \qquad\text{and}\qquad
   u_{ij}^{\ominus} =
   \frac{\exp\left( -\frac{1}{2\sigma^2} (\vec{x}_j - \vec{r}_i^{\,(old)})^\top (\vec{x}_j - \vec{r}_i^{\,(old)}) \right)}
        {\sum_{\vec{r} \in Q(c_j)} \exp\left( -\frac{1}{2\sigma^2} (\vec{x}_j - \vec{r}^{\,(old)})^\top (\vec{x}_j - \vec{r}^{\,(old)}) \right)}. \]

$R(c)$ is the set of reference vectors assigned to class $c$ and $Q(c)$ its complement.
Hard Learning Vector Quantization

Idea: Derive a scheme with hard assignments from the soft version.

Approach: Let the size parameter $\sigma$ of the Gaussian function go to zero.

The resulting update rule is in this case:

\[ \vec{r}_i^{\,(new)} = \vec{r}_i^{\,(old)} + \eta \cdot
   \begin{cases}
   u_{ij}^{\oplus}  \left( \vec{x}_j - \vec{r}_i^{\,(old)} \right), & \text{if } c_j = z_i, \\[4pt]
   -u_{ij}^{\ominus} \left( \vec{x}_j - \vec{r}_i^{\,(old)} \right), & \text{if } c_j \ne z_i,
   \end{cases} \]

where

\[ u_{ij}^{\oplus} = \begin{cases}
   1, & \text{if } \vec{r}_i = \operatorname{argmin}_{\vec{r} \in R(c_j)} d(\vec{x}_j, \vec{r}), \\
   0, & \text{otherwise,}
   \end{cases}
   \qquad
   u_{ij}^{\ominus} = \begin{cases}
   1, & \text{if } \vec{r}_i = \operatorname{argmin}_{\vec{r} \in Q(c_j)} d(\vec{x}_j, \vec{r}), \\
   0, & \text{otherwise.}
   \end{cases} \]

($u_{ij}^{\oplus}$ selects the closest vector of the same class, $u_{ij}^{\ominus}$ the closest vector of a different class.)

This update rule is stable without a window rule restricting the update.
Learning Vector Quantization: Extensions

Frequency Sensitive Competitive Learning:
  The distance to a reference vector is modified according to
  the number of data points that are assigned to this reference vector.

Fuzzy Learning Vector Quantization:
  Exploits the close relationship to fuzzy clustering.
  Can be seen as an online version of fuzzy clustering.
  Leads to faster clustering.

Size and Shape Parameters:
  Associate each reference vector with a cluster radius.
  Update this radius depending on how close the data points are.
  Associate each reference vector with a covariance matrix.
  Update this matrix depending on the distribution of the data points.
Demonstration Software: xlvq/wlvq

Demonstration of learning vector quantization:

  Visualization of the training process
  Arbitrary datasets, but training only in two dimensions

http://www.borgelt.net/lvqd.html
Self-Organizing Maps
Self-Organizing Maps

A self-organizing map or Kohonen feature map is a neural network with
a graph G = (U, C) that satisfies the following conditions:

(i)  $U_{hidden} = \emptyset$, $U_{in} \cap U_{out} = \emptyset$,
(ii) $C = U_{in} \times U_{out}$.

The network input function of each output neuron is a distance function of
input and weight vector. The activation function of each output neuron is a radial
function, i.e. a monotonically decreasing function

\[ f: \mathbb{R}_0^+ \to [0, 1] \quad\text{with}\quad f(0) = 1 \quad\text{and}\quad \lim_{x \to \infty} f(x) = 0. \]

The output function of each output neuron is the identity.
The output is often discretized according to the "winner takes all" principle.

On the output neurons a neighborhood relationship is defined:

\[ d_{neurons}: U_{out} \times U_{out} \to \mathbb{R}_0^+. \]
Self-Organizing Maps: Neighborhood

Neighborhood of the output neurons: neurons form a grid

  quadratic grid
  hexagonal grid

Thin black lines: indicate nearest neighbors of a neuron.
Thick gray lines: indicate regions assigned to a neuron for visualization.
Topology Preserving Mapping

Images of points close to each other in the original space
should be close to each other in the image space.

Example: Robinson projection of the surface of a sphere

The Robinson projection is frequently used for world maps.
Self-Organizing Maps: Neighborhood

Find a topology preserving mapping by respecting the neighborhood.

Reference vector update rule:

\[ \vec{r}_u^{\,(new)} = \vec{r}_u^{\,(old)}
 + \eta(t) \cdot f_{nb}\!\left( d_{neurons}(u, u_*), \varrho(t) \right)
 \cdot \left( \vec{x} - \vec{r}_u^{\,(old)} \right), \]

where $u_*$ is the winner neuron (reference vector closest to the data point)
and the function $f_{nb}$ is a radial function.

Time-dependent learning rate:

\[ \eta(t) = \eta_0 \alpha_\eta^t, \quad 0 < \alpha_\eta < 1,
   \qquad\text{or}\qquad
   \eta(t) = \eta_0 t^{-\kappa_\eta}, \quad \kappa_\eta > 0. \]

Time-dependent neighborhood radius:

\[ \varrho(t) = \varrho_0 \alpha_\varrho^t, \quad 0 < \alpha_\varrho < 1,
   \qquad\text{or}\qquad
   \varrho(t) = \varrho_0 t^{-\kappa_\varrho}, \quad \kappa_\varrho > 0. \]
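One SOM training step as a minimal sketch, assuming a Gaussian neighborhood function and exponential decay of η and ρ (function name and default parameters are mine):

```python
import math

def som_step(grid, refs, x, t, eta0=0.5, a_eta=0.99, rho0=2.0, a_rho=0.99):
    """One training step of a self-organizing map. grid[i] is the grid
    position of output neuron i, refs[i] its reference vector."""
    eta, rho = eta0 * a_eta ** t, rho0 * a_rho ** t
    # winner neuron: reference vector closest to the data point
    w = min(range(len(refs)), key=lambda i: math.dist(refs[i], x))
    for i in range(len(refs)):
        d_neurons = math.dist(grid[i], grid[w])           # grid distance
        f_nb = math.exp(-d_neurons ** 2 / (2 * rho ** 2)) # neighborhood
        refs[i] = tuple(r + eta * f_nb * (xc - r)
                        for r, xc in zip(refs[i], x))
    return refs
```

Every neuron moves toward the data point, the winner most strongly, its grid neighbors less so.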
Self-Organizing Maps: Examples

Example: Unfolding of a two-dimensional self-organizing map.

Training a self-organizing map may fail if

  the (initial) learning rate is chosen too small, or
  the (initial) neighborhood radius is chosen too small.
Self-Organizing Maps: Examples

(Figures (a), (b), (c): self-organizing maps that have been trained with random points from
(a) a rotation parabola, (b) a simple cubic function, (c) the surface of a sphere.)

In this case the original space and the image space have different dimensionality.
Self-organizing maps can be used for dimensionality reduction.
Demonstration Software: xsom/wsom

Demonstration of self-organizing map training:

  Visualization of the training process
  Two-dimensional areas and three-dimensional surfaces

http://www.borgelt.net/somd.html
Hopfield Networks
Hopfield Networks

A Hopfield network is a neural network with a graph G = (U, C) that satisfies
the following conditions:

(i)  $U_{hidden} = \emptyset$, $U_{in} = U_{out} = U$,
(ii) $C = U \times U - \{(u, u) \mid u \in U\}$.

  In a Hopfield network all neurons are input as well as output neurons.
  There are no hidden neurons.
  Each neuron receives input from all other neurons.
  A neuron is not connected to itself.

The connection weights are symmetric, i.e.

\[ \forall u, v \in U, u \ne v: \quad w_{uv} = w_{vu}. \]
Hopfield Networks

The network input function of each neuron is the weighted sum of the outputs of
all other neurons, i.e.

\[ \forall u \in U: \quad f_{net}^{(u)}(\vec{w}_u, \vec{in}_u) = \vec{w}_u \cdot \vec{in}_u
 = \sum_{v \in U - \{u\}} w_{uv}\, out_v. \]

The activation function of each neuron is a threshold function, i.e.

\[ \forall u \in U: \quad f_{act}^{(u)}(net_u, \theta_u) = \begin{cases}
1, & \text{if } net_u \ge \theta_u, \\
-1, & \text{otherwise.}
\end{cases} \]

The output function of each neuron is the identity, i.e.

\[ \forall u \in U: \quad f_{out}^{(u)}(act_u) = act_u. \]
Hopfield Networks

Alternative activation function:

\[ \forall u \in U: \quad f_{act}^{(u)}(net_u, \theta_u, act_u) = \begin{cases}
1, & \text{if } net_u > \theta_u, \\
-1, & \text{if } net_u < \theta_u, \\
act_u, & \text{if } net_u = \theta_u.
\end{cases} \]

This activation function has advantages w.r.t. the physical interpretation
of a Hopfield network.

General weight matrix of a Hopfield network:

\[ \mathbf{W} = \begin{pmatrix}
0 & w_{u_1 u_2} & \cdots & w_{u_1 u_n} \\
w_{u_1 u_2} & 0 & \cdots & w_{u_2 u_n} \\
\vdots & \vdots & & \vdots \\
w_{u_1 u_n} & w_{u_2 u_n} & \cdots & 0
\end{pmatrix} \]
Hopfield Networks: Examples

Very simple Hopfield network:

(Figure: two neurons u_1 and u_2, both with threshold 0, connected in both directions with weight 1.)

\[ \mathbf{W} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \]

The behavior of a Hopfield network can depend on the update order.

  Computations can oscillate if neurons are updated in parallel.
  Computations always converge if neurons are updated sequentially.
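Sequential updating can be sketched as follows (function names are my own; `hopfield_converge` performs cyclic sweeps until no activation changes, which for sequential updates is guaranteed to happen by the convergence theorem stated later):

```python
def hopfield_step(W, theta, act, u):
    """Update neuron u: act_u <- 1 if the weighted sum of the other
    neurons' activations reaches theta_u, else -1."""
    net = sum(W[u][v] * act[v] for v in range(len(act)) if v != u)
    act[u] = 1 if net >= theta[u] else -1
    return act

def hopfield_converge(W, theta, act):
    """Sequentially (asynchronously) update the neurons in cyclic order
    until a full sweep changes no activation (a stable state)."""
    while True:
        old = list(act)
        for u in range(len(act)):
            hopfield_step(W, theta, act, u)
        if act == old:
            return act
```

For the two-neuron example above, the reached stable state depends on the start state, matching the discussion on the next slides.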
Hopfield Networks: Examples

Parallel update of the neuron activations:

In the input phase the two neurons are initialized, e.g. to (−1, 1).
In the work phase the parallel update then alternates between (1, −1) and (−1, 1) indefinitely.

The computations oscillate, no stable state is reached.
The output depends on when the computations are terminated.
Hopfield Networks: Examples

Sequential update of the neuron activations:

Starting from a mixed input state (one neuron at −1, the other at +1),
the network reaches the stable state (1, 1) or the stable state (−1, −1),
depending on which neuron is updated first.

Regardless of the update order a stable state is reached.
Which state is reached depends on the update order.
Hopfield Networks: Examples

Simplified representation of a Hopfield network:

(Figure: a three-neuron network, all thresholds 0, shown once with separate input and output connections and once in simplified form; the connection weights are w_{u_1 u_2} = 1, w_{u_1 u_3} = 2, w_{u_2 u_3} = 1.)

\[ \mathbf{W} = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix} \]

  Symmetric connections between neurons are combined.
  Inputs and outputs are not explicitly represented.
Hopfield Networks: State Graph

Graph of the activation states and transitions:

(Figure: the eight states +++, ++−, +−+, −++, +−−, −+−, −−+, −−− of the example network, with arrows labeled u_1, u_2, u_3 indicating which state is reached by updating the respective neuron; "+" stands for activation +1, "−" for activation −1.)
Hopfield Networks: Convergence

Convergence Theorem: If the activations of the neurons of a Hopfield network
are updated sequentially (asynchronously), then a stable state is reached in a finite
number of steps.

If the neurons are traversed cyclically in an arbitrary, but fixed order, at most $n \cdot 2^n$
steps (updates of individual neurons) are needed, where $n$ is the number of neurons
of the Hopfield network.

The proof is carried out with the help of an energy function.
The energy function of a Hopfield network with $n$ neurons $u_1, \dots, u_n$ is

\[ E = -\frac{1}{2} \vec{act}^{\,\top} \mathbf{W} \vec{act} + \vec{\theta}^{\,\top} \vec{act}
 = -\frac{1}{2} \sum_{\substack{u, v \in U \\ u \ne v}} w_{uv}\, act_u\, act_v
 + \sum_{u \in U} \theta_u\, act_u. \]
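The energy function as code (function name is mine):

```python
def hopfield_energy(W, theta, act):
    """E = -1/2 sum_{u != v} w_uv act_u act_v + sum_u theta_u act_u."""
    n = len(act)
    pair = sum(W[u][v] * act[u] * act[v]
               for u in range(n) for v in range(n) if u != v)
    return -0.5 * pair + sum(t + 0.0 for t in []) + sum(
        t * a for t, a in zip(theta, act))
```

For the three-neuron example network, this yields the energies −4, 0, and 2 shown in the state graph two slides further on.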
Hopfield Networks: Convergence

Consider the energy change resulting from an update that changes an activation:

\[ \Delta E = E^{(new)} - E^{(old)}
 = \left( -\sum_{v \in U - \{u\}} w_{uv}\, act_u^{(new)} act_v + \theta_u\, act_u^{(new)} \right)
 - \left( -\sum_{v \in U - \{u\}} w_{uv}\, act_u^{(old)} act_v + \theta_u\, act_u^{(old)} \right) \]
\[ = \left( act_u^{(old)} - act_u^{(new)} \right)
   \Bigg( \underbrace{\sum_{v \in U - \{u\}} w_{uv}\, act_v}_{=\, net_u} - \;\theta_u \Bigg). \]

  If $net_u < \theta_u$: the second factor is less than 0.
  Moreover, $act_u^{(new)} = -1$ and $act_u^{(old)} = 1$, therefore the first factor is greater than 0.
  Result: $\Delta E < 0$.

  If $net_u \ge \theta_u$: the second factor is greater than or equal to 0.
  Moreover, $act_u^{(new)} = 1$ and $act_u^{(old)} = -1$, therefore the first factor is less than 0.
  Result: $\Delta E \le 0$.
Hopfield Networks: Examples

Arrange the states in the state graph according to their energy:

(Figure: the state graph of the example network with the states placed at their energy levels; the stable states +++ and −−− lie at energy −4, the states +−+ and −+− at 0, and the remaining four states at 2.)

Energy function for the example Hopfield network:

\[ E = -act_{u_1} act_{u_2} - 2\, act_{u_1} act_{u_3} - act_{u_2} act_{u_3}. \]
Hopfield Networks: Examples

The state graph need not be symmetric:

(Figure: a three-neuron Hopfield network with nonzero thresholds and its state graph, with the states arranged by their energies, which range from −7 to 5.)
Hopfield Networks: Physical Interpretation

Physical interpretation: Magnetism

A Hopfield network can be seen as a (microscopic) model of magnetism
(so-called Ising model, [Ising 1925]).

  physical                                  neural
  ----------------------------------------  -----------------
  atom                                      neuron
  magnetic moment (spin)                    activation state
  strength of outer magnetic field          threshold value
  magnetic coupling of the atoms            connection weights
  Hamilton operator of the magnetic field   energy function
Hopfield Networks: Associative Memory

Idea: Use stable states to store patterns

First: Store only one pattern $\vec{x} = (act_{u_1}^{(l)}, \dots, act_{u_n}^{(l)})^\top \in \{-1, 1\}^n$, $n \ge 2$,
i.e., find weights, so that the pattern is a stable state.

Necessary and sufficient condition:

\[ S(\mathbf{W}\vec{x} - \vec{\theta}\,) = \vec{x}, \]

where

\[ S: \mathbb{R}^n \to \{-1, 1\}^n, \quad \vec{x} \mapsto \vec{y} \]

with

\[ \forall i \in \{1, \dots, n\}: \quad y_i = \begin{cases}
1, & \text{if } x_i \ge 0, \\
-1, & \text{otherwise.}
\end{cases} \]
Hopfield Networks: Associative Memory

If $\vec{\theta} = \vec{0}$, an appropriate matrix $\mathbf{W}$ can easily be found. It suffices that

\[ \mathbf{W}\vec{x} = c\vec{x} \quad\text{with}\quad c \in \mathbb{R}^+. \]

Algebraically: Find a matrix $\mathbf{W}$ that has a positive eigenvalue w.r.t. $\vec{x}$.

Choose

\[ \mathbf{W} = \vec{x}\vec{x}^\top - \mathbf{E}, \]

where $\vec{x}\vec{x}^\top$ is the so-called outer product and $\mathbf{E}$ is the unit matrix.

With this matrix we have

\[ \mathbf{W}\vec{x} = (\vec{x}\vec{x}^\top)\vec{x} - \underbrace{\mathbf{E}\vec{x}}_{=\,\vec{x}}
 \;=\; \vec{x} \underbrace{(\vec{x}^\top \vec{x})}_{=\,|\vec{x}|^2\,=\,n} - \;\vec{x}
 \;=\; n\vec{x} - \vec{x} \;=\; (n - 1)\vec{x}. \]
Hopfield Networks: Associative Memory

Hebbian learning rule [Hebb 1949]

Written in individual weights the computation of the weight matrix reads:

\[ w_{uv} = \begin{cases}
0, & \text{if } u = v, \\
1, & \text{if } u \ne v,\; act_u^{(p)} = act_v^{(p)}, \\
-1, & \text{otherwise.}
\end{cases} \]

  Originally derived from a biological analogy.
  Strengthens the connection between neurons that are active at the same time.

Note that this learning rule also stores the complement of the pattern:

With $\mathbf{W}\vec{x} = (n - 1)\vec{x}$ it is also $\mathbf{W}(-\vec{x}\,) = (n - 1)(-\vec{x}\,)$.
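The Hebbian rule, extended to several patterns as on the next slides, can be sketched as follows (function name is mine); for patterns in {−1, 1}^n it computes W = Σ_i x_i x_iᵀ − mE:

```python
def hebbian_weights(patterns):
    """Hebbian weight matrix for a list of patterns in {-1, 1}^n:
    w_uv = 0 on the diagonal, otherwise the sum over all patterns of
    +1 (equal activations) / -1 (different activations)."""
    n = len(patterns[0])
    return [[0 if u == v else sum(x[u] * x[v] for x in patterns)
             for v in range(n)] for u in range(n)]
```

With a single pattern this reduces exactly to the case rule above.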
Hopfield Networks: Associative Memory

Storing several patterns:

Choose

\[ \mathbf{W}\vec{x}_j = \sum_{i=1}^{m} \mathbf{W}_i \vec{x}_j
 = \left( \sum_{i=1}^{m} (\vec{x}_i \vec{x}_i^\top) \vec{x}_j \right) - m \underbrace{\mathbf{E}\vec{x}_j}_{=\,\vec{x}_j}
 = \left( \sum_{i=1}^{m} \vec{x}_i (\vec{x}_i^\top \vec{x}_j) \right) - m\vec{x}_j \]

If the patterns are orthogonal, we have

\[ \vec{x}_i^\top \vec{x}_j = \begin{cases}
0, & \text{if } i \ne j, \\
n, & \text{if } i = j,
\end{cases} \]

and therefore

\[ \mathbf{W}\vec{x}_j = (n - m)\vec{x}_j. \]
Hopfield Networks: Associative Memory

Storing several patterns:

Result: As long as $m < n$, $\vec{x}_j$ is a stable state of the Hopfield network.

Note that the complements of the patterns are also stored:

With $\mathbf{W}\vec{x}_j = (n - m)\vec{x}_j$ it is also $\mathbf{W}(-\vec{x}_j) = (n - m)(-\vec{x}_j)$.

But: the capacity is very small compared to the number of possible states ($2^n$).

Non-orthogonal patterns:

\[ \mathbf{W}\vec{x}_j = (n - m)\vec{x}_j
 + \underbrace{\sum_{\substack{i=1 \\ i \ne j}}^{m} \vec{x}_i (\vec{x}_i^\top \vec{x}_j)}_{\text{"disturbance term"}}. \]
Associative Memory: Example

Example: store the patterns $\vec{x}_1 = (+1, +1, -1, -1)^\top$ and $\vec{x}_2 = (-1, +1, -1, +1)^\top$.
$$W = W_1 + W_2 = \vec{x}_1\vec{x}_1^\top + \vec{x}_2\vec{x}_2^\top - 2E,$$
where
$$W_1 = \begin{pmatrix} 0 & 1 & -1 & -1\\ 1 & 0 & -1 & -1\\ -1 & -1 & 0 & 1\\ -1 & -1 & 1 & 0 \end{pmatrix}, \qquad
W_2 = \begin{pmatrix} 0 & -1 & 1 & -1\\ -1 & 0 & -1 & 1\\ 1 & -1 & 0 & -1\\ -1 & 1 & -1 & 0 \end{pmatrix}.$$

The full weight matrix is
$$W = \begin{pmatrix} 0 & 0 & 0 & -2\\ 0 & 0 & -2 & 0\\ 0 & -2 & 0 & 0\\ -2 & 0 & 0 & 0 \end{pmatrix}.$$

Therefore it is
$$W\vec{x}_1 = (+2, +2, -2, -2)^\top \qquad\text{and}\qquad W\vec{x}_2 = (-2, +2, -2, +2)^\top.$$
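The example can be replayed in a few lines of NumPy:

```python
import numpy as np

x1 = np.array([+1.0, +1.0, -1.0, -1.0])
x2 = np.array([-1.0, +1.0, -1.0, +1.0])

W = np.outer(x1, x1) + np.outer(x2, x2) - 2 * np.eye(4)

expected = np.array([[ 0.0,  0.0,  0.0, -2.0],
                     [ 0.0,  0.0, -2.0,  0.0],
                     [ 0.0, -2.0,  0.0,  0.0],
                     [-2.0,  0.0,  0.0,  0.0]])
assert np.array_equal(W, expected)
assert np.array_equal(W @ x1, 2 * x1)   # (+2, +2, -2, -2)
assert np.array_equal(W @ x2, 2 * x2)   # (-2, +2, -2, +2)
```

Both patterns are thus stable states, in accordance with $W\vec{x}_j = (n-m)\,\vec{x}_j = 2\vec{x}_j$.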
Christian Borgelt Introduction to Neural Networks 193
Associative Memory: Examples

Example: storing bit maps of numbers

Left: bit maps stored in a Hopfield network.
Right: reconstruction of a pattern from a random input.
Christian Borgelt Introduction to Neural Networks 194
Hopfield Networks: Associative Memory

Training a Hopfield network with the Delta rule

Necessary condition for a pattern $\vec{x}$ being a stable state:
$$\begin{array}{rcl}
s(0 + w_{u_1 u_2}\,\mathrm{act}^{(p)}_{u_2} + \dots + w_{u_1 u_n}\,\mathrm{act}^{(p)}_{u_n} - \theta_{u_1}) &=& \mathrm{act}^{(p)}_{u_1},\\[1ex]
s(w_{u_2 u_1}\,\mathrm{act}^{(p)}_{u_1} + 0 + \dots + w_{u_2 u_n}\,\mathrm{act}^{(p)}_{u_n} - \theta_{u_2}) &=& \mathrm{act}^{(p)}_{u_2},\\[1ex]
&\vdots&\\[1ex]
s(w_{u_n u_1}\,\mathrm{act}^{(p)}_{u_1} + w_{u_n u_2}\,\mathrm{act}^{(p)}_{u_2} + \dots + 0 - \theta_{u_n}) &=& \mathrm{act}^{(p)}_{u_n},
\end{array}$$
with the standard threshold function
$$s(x) = \begin{cases} 1, & \text{if } x \geq 0,\\ -1, & \text{otherwise.} \end{cases}$$
Christian Borgelt Introduction to Neural Networks 195
Hopfield Networks: Associative Memory

Training a Hopfield network with the Delta rule

Turn the weight matrix into a weight vector:
$$\vec{w} = (\, w_{u_1 u_2},\ w_{u_1 u_3},\ \dots,\ w_{u_1 u_n},\ w_{u_2 u_3},\ \dots,\ w_{u_2 u_n},\ \dots,\ w_{u_{n-1} u_n},\ -\theta_{u_1},\ -\theta_{u_2},\ \dots,\ -\theta_{u_n}\,).$$

Construct input vectors for a threshold logic unit (shown here for the second neuron):
$$\vec{z}_2 = (\,\mathrm{act}^{(p)}_{u_1},\ \underbrace{0, \dots, 0,}_{n-2 \text{ zeros}}\ \mathrm{act}^{(p)}_{u_3},\ \dots,\ \mathrm{act}^{(p)}_{u_n},\ \dots,\ 0,\ 1,\ \underbrace{0, \dots, 0}_{n-2 \text{ zeros}}\,).$$

Apply Delta rule training until convergence.
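The same training scheme can be sketched directly on the weight matrix instead of the flattened weight vector; since $w_{uv}$ and $w_{vu}$ are one shared weight, each update is applied symmetrically. The learning rate and the on-line update order are illustrative choices:

```python
import numpy as np

def train_hopfield_delta(patterns, eta=0.1, max_epochs=100):
    """Delta-rule training of a Hopfield network (a sketch).

    Each neuron u is treated as a threshold logic unit whose inputs are the
    activations of all other neurons; every weight update is applied
    symmetrically, because w_uv and w_vu are the same shared weight.
    """
    n = len(patterns[0])
    W = np.zeros((n, n))                  # zero diagonal: no self-connections
    theta = np.zeros(n)
    for _ in range(max_epochs):
        errors = 0
        for x in patterns:
            for u in range(n):
                out = 1.0 if W[u] @ x - theta[u] >= 0.0 else -1.0
                if out != x[u]:           # delta rule on weights and threshold
                    errors += 1
                    d = eta * (x[u] - out)
                    for v in range(n):
                        if v != u:
                            W[u, v] += d * x[v]
                            W[v, u] = W[u, v]
                    theta[u] -= d
        if errors == 0:                   # all patterns are stable states
            break
    return W, theta

# the two patterns of the example above become stable states
x1 = np.array([+1.0, +1.0, -1.0, -1.0])
x2 = np.array([-1.0, +1.0, -1.0, +1.0])
W, theta = train_hopfield_delta([x1, x2])
for x in (x1, x2):
    assert np.array_equal(np.where(W @ x - theta >= 0.0, 1.0, -1.0), x)
```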
Christian Borgelt Introduction to Neural Networks 196
Demonstration Software: xhfn/whfn

Demonstration of Hopfield networks as associative memory:
Visualization of the association/recognition process
Two-dimensional networks of arbitrary size
http://www.borgelt.net/hfnd.html
Christian Borgelt Introduction to Neural Networks 197
Hopfield Networks: Solving Optimization Problems

Use energy minimization to solve optimization problems.

General procedure:
Transform the function to optimize into a function to minimize.
Transform this function into the form of an energy function of a Hopfield network.
Read the weights and threshold values from the energy function.
Construct the corresponding Hopfield network.
Initialize the Hopfield network randomly and update until convergence.
Read the solution from the stable state reached.
Repeat several times and use the best solution found.
Christian Borgelt Introduction to Neural Networks 198
Hopfield Networks: Activation Transformation

A Hopfield network may be defined either with activations $-1$ and $1$ or with activations $0$ and $1$. The networks can be transformed into each other (quantities with superscript 0 refer to the network with activations 0 and 1).

From $\mathrm{act}_u \in \{-1, 1\}$ to $\mathrm{act}_u \in \{0, 1\}$:
$$w^0_{uv} = 2w_{uv} \qquad\text{and}\qquad \theta^0_u = \theta_u + \sum_{v \in U - \{u\}} w_{uv}.$$

From $\mathrm{act}_u \in \{0, 1\}$ to $\mathrm{act}_u \in \{-1, 1\}$:
$$w_{uv} = \frac{1}{2}\, w^0_{uv} \qquad\text{and}\qquad \theta_u = \theta^0_u - \frac{1}{2} \sum_{v \in U - \{u\}} w^0_{uv}.$$
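That the two conventions define the same dynamics can be checked numerically: corresponding states yield identical net inputs, hence identical threshold decisions. A sketch with a randomly generated example network:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# a random -1/+1 Hopfield network (symmetric weights, zero diagonal)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

# transformation to the 0/1 convention
W0 = 2.0 * W
theta0 = theta + W.sum(axis=1)           # sum over v != u (diagonal is zero)

# corresponding states yield identical net inputs, hence identical updates
for _ in range(20):
    s = rng.choice([-1.0, 1.0], size=n)  # a -1/+1 state
    s0 = (s + 1.0) / 2.0                 # the corresponding 0/1 state
    assert np.allclose(W @ s - theta, W0 @ s0 - theta0)
```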
Christian Borgelt Introduction to Neural Networks 199
Hopfield Networks: Solving Optimization Problems

Combination lemma: Let two Hopfield networks on the same set $U$ of neurons with weights $w^{(i)}_{uv}$, threshold values $\theta^{(i)}_u$, and energy functions
$$E_i = -\frac{1}{2} \sum_{u \in U} \sum_{v \in U - \{u\}} w^{(i)}_{uv}\,\mathrm{act}_u\,\mathrm{act}_v + \sum_{u \in U} \theta^{(i)}_u\,\mathrm{act}_u,$$
$i = 1, 2$, be given. Furthermore let $a, b \in \mathbb{R}$. Then $E = aE_1 + bE_2$ is the energy function of the Hopfield network on the neurons in $U$ that has the weights $w_{uv} = a\,w^{(1)}_{uv} + b\,w^{(2)}_{uv}$ and the threshold values $\theta_u = a\,\theta^{(1)}_u + b\,\theta^{(2)}_u$.

Proof: Just do the computations.

Idea: additional conditions can be formalized separately and incorporated later.
Christian Borgelt Introduction to Neural Networks 200
Hopfield Networks: Solving Optimization Problems

Example: traveling salesman problem

Idea: represent a tour by a matrix, e.g. the tour $1 \to 3 \to 4 \to 2$ of four cities:

          city
        1  2  3  4
      ( 1  0  0  0 )  1.
      ( 0  0  1  0 )  2.
      ( 0  0  0  1 )  3.   step
      ( 0  1  0  0 )  4.

An element $m_{ij}$ of the matrix is 1 if the $j$-th city is visited in the $i$-th step and 0 otherwise.

Each matrix element will be represented by a neuron.
Christian Borgelt Introduction to Neural Networks 201
Hopfield Networks: Solving Optimization Problems

Minimization of the tour length:
$$E_1 = \sum_{j_1=1}^{n} \sum_{j_2=1}^{n} \sum_{i=1}^{n} d_{j_1 j_2} \cdot m_{i j_1} \cdot m_{((i \bmod n)+1)\, j_2}.$$

Double summation over the steps (index $i$) needed:
$$E_1 = \sum_{(i_1,j_1) \in \{1,\dots,n\}^2}\ \sum_{(i_2,j_2) \in \{1,\dots,n\}^2} d_{j_1 j_2} \cdot \delta_{(i_1 \bmod n)+1,\, i_2} \cdot m_{i_1 j_1} \cdot m_{i_2 j_2},$$
where
$$\delta_{ab} = \begin{cases} 1, & \text{if } a = b,\\ 0, & \text{otherwise.} \end{cases}$$

Symmetric version of the energy function:
$$E_1 = \frac{1}{2} \sum_{\substack{(i_1,j_1) \in \{1,\dots,n\}^2\\ (i_2,j_2) \in \{1,\dots,n\}^2}} d_{j_1 j_2} \cdot \bigl(\delta_{(i_1 \bmod n)+1,\, i_2} + \delta_{i_1,\, (i_2 \bmod n)+1}\bigr) \cdot m_{i_1 j_1} \cdot m_{i_2 j_2}.$$
Christian Borgelt Introduction to Neural Networks 202
Hopfield Networks: Solving Optimization Problems

Additional conditions that have to be satisfied:

Each city is visited on exactly one step of the tour:
$$\forall j \in \{1,\dots,n\}: \quad \sum_{i=1}^{n} m_{ij} = 1,$$
i.e., each column of the matrix contains exactly one 1.

On each step of the tour exactly one city is visited:
$$\forall i \in \{1,\dots,n\}: \quad \sum_{j=1}^{n} m_{ij} = 1,$$
i.e., each row of the matrix contains exactly one 1.

These conditions are incorporated by finding additional functions to optimize.
Christian Borgelt Introduction to Neural Networks 203
Hopfield Networks: Solving Optimization Problems

Formalization of the first condition as a minimization problem:
$$E_2 = \sum_{j=1}^{n} \left( \sum_{i=1}^{n} m_{ij} - 1 \right)^2
= \sum_{j=1}^{n} \left( \left( \sum_{i=1}^{n} m_{ij} \right)^2 - 2\sum_{i=1}^{n} m_{ij} + 1 \right)$$
$$= \sum_{j=1}^{n} \left( \left( \sum_{i_1=1}^{n} m_{i_1 j} \right) \left( \sum_{i_2=1}^{n} m_{i_2 j} \right) - 2\sum_{i=1}^{n} m_{ij} + 1 \right)
= \sum_{j=1}^{n} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} m_{i_1 j}\, m_{i_2 j} - 2\sum_{j=1}^{n} \sum_{i=1}^{n} m_{ij} + n.$$

Double summation over the cities (index $i$) needed (the constant $n$ can be dropped, since it does not affect the location of the minimum):
$$E_2 = \sum_{(i_1,j_1) \in \{1,\dots,n\}^2}\ \sum_{(i_2,j_2) \in \{1,\dots,n\}^2} \delta_{j_1 j_2}\, m_{i_1 j_1}\, m_{i_2 j_2} - 2 \sum_{(i,j) \in \{1,\dots,n\}^2} m_{ij}.$$
Christian Borgelt Introduction to Neural Networks 204
Hopfield Networks: Solving Optimization Problems

Resulting energy function:
$$E_2 = -\frac{1}{2} \sum_{\substack{(i_1,j_1) \in \{1,\dots,n\}^2\\ (i_2,j_2) \in \{1,\dots,n\}^2}} -2\,\delta_{j_1 j_2}\; m_{i_1 j_1}\, m_{i_2 j_2} \;+ \sum_{(i,j) \in \{1,\dots,n\}^2} -2\, m_{ij}.$$

The second additional condition is handled in a completely analogous way:
$$E_3 = -\frac{1}{2} \sum_{\substack{(i_1,j_1) \in \{1,\dots,n\}^2\\ (i_2,j_2) \in \{1,\dots,n\}^2}} -2\,\delta_{i_1 i_2}\; m_{i_1 j_1}\, m_{i_2 j_2} \;+ \sum_{(i,j) \in \{1,\dots,n\}^2} -2\, m_{ij}.$$

Combining the energy functions:
$$E = aE_1 + bE_2 + cE_3 \qquad\text{where}\qquad \frac{b}{a} = \frac{c}{a} > 2 \max_{(j_1,j_2) \in \{1,\dots,n\}^2} d_{j_1 j_2}.$$
Christian Borgelt Introduction to Neural Networks 205
Hopfield Networks: Solving Optimization Problems

From the resulting energy function we can read the weights
$$w_{(i_1,j_1)(i_2,j_2)} = \underbrace{-a\, d_{j_1 j_2} \bigl(\delta_{(i_1 \bmod n)+1,\, i_2} + \delta_{i_1,\, (i_2 \bmod n)+1}\bigr)}_{\text{from } E_1} \;\underbrace{-\,2b\,\delta_{j_1 j_2}}_{\text{from } E_2} \;\underbrace{-\,2c\,\delta_{i_1 i_2}}_{\text{from } E_3}$$
and the threshold values:
$$\theta_{(i,j)} = \underbrace{0 \cdot a}_{\text{from } E_1} \;\underbrace{-\,2b}_{\text{from } E_2} \;\underbrace{-\,2c}_{\text{from } E_3} = -2(b + c).$$

Problem: random initialization and updating until convergence do not always lead to a matrix that represents a tour, let alone an optimal one.
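The whole procedure can be sketched in code. The distance matrix, the coefficients $a$, $b$, $c$, and the update loop below are illustrative choices (the network uses the 0/1 activation convention), and, as just noted, the stable state reached need not represent a tour, let alone an optimal one:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                    # number of cities
# symmetric example distance matrix
d = rng.uniform(1.0, 10.0, size=(n, n))
d = (d + d.T) / 2.0
np.fill_diagonal(d, 0.0)

a = 1.0
b = c = a * (2.0 * d.max() + 1.0)        # ensures b/a = c/a > 2 max d

def delta(p, q):
    return 1.0 if p == q else 0.0

# read weights and thresholds off the combined energy function;
# neuron (i, j) ("city j visited in step i") gets index i * n + j
W = np.zeros((n * n, n * n))
theta = np.full(n * n, -2.0 * (b + c))
for i1 in range(n):
    for j1 in range(n):
        for i2 in range(n):
            for j2 in range(n):
                if (i1, j1) == (i2, j2):
                    continue             # no self-connections
                W[i1 * n + j1, i2 * n + j2] = (
                    -a * d[j1, j2] * (delta((i1 + 1) % n, i2)
                                      + delta(i1, (i2 + 1) % n))
                    - 2.0 * b * delta(j1, j2)
                    - 2.0 * c * delta(i1, i2))

# random initialization, asynchronous 0/1 updates until convergence
m = rng.integers(0, 2, size=n * n).astype(float)
for _ in range(1000):
    changed = False
    for u in rng.permutation(n * n):
        new = 1.0 if W[u] @ m - theta[u] >= 0.0 else 0.0
        if new != m[u]:
            m[u], changed = new, True
    if not changed:
        break
```

Reading the solution means interpreting the stable state as the matrix $(m_{ij})$; in practice several restarts with different random initializations are used and the best valid tour is kept.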
Christian Borgelt Introduction to Neural Networks 206
Recurrent Neural Networks
Christian Borgelt Introduction to Neural Networks 207
Recurrent Networks: Cooling Law

A body of temperature $\vartheta_0$ is placed into an environment with temperature $\vartheta_A$.

The cooling/heating of the body can be described by Newton's cooling law:
$$\frac{d\vartheta}{dt} = \dot{\vartheta} = -k(\vartheta - \vartheta_A).$$

Exact analytical solution:
$$\vartheta(t) = \vartheta_A + (\vartheta_0 - \vartheta_A)\, e^{-k(t - t_0)}.$$

Approximate solution with Euler-Cauchy polygon courses:
$$\vartheta_1 = \vartheta(t_1) = \vartheta(t_0) + \dot{\vartheta}(t_0)\,\Delta t = \vartheta_0 - k(\vartheta_0 - \vartheta_A)\,\Delta t,$$
$$\vartheta_2 = \vartheta(t_2) = \vartheta(t_1) + \dot{\vartheta}(t_1)\,\Delta t = \vartheta_1 - k(\vartheta_1 - \vartheta_A)\,\Delta t.$$

General recursive formula:
$$\vartheta_i = \vartheta(t_i) = \vartheta(t_{i-1}) + \dot{\vartheta}(t_{i-1})\,\Delta t = \vartheta_{i-1} - k(\vartheta_{i-1} - \vartheta_A)\,\Delta t.$$
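The recursion is straightforward to run; the constants below are arbitrary example values:

```python
import math

k, theta_A, theta0 = 0.5, 20.0, 100.0    # example cooling constant and temperatures

def euler_cooling(dt, t_end):
    """Euler-Cauchy iteration: theta_i = theta_{i-1} - k (theta_{i-1} - theta_A) dt."""
    theta = theta0
    for _ in range(round(t_end / dt)):
        theta -= k * (theta - theta_A) * dt
    return theta

exact = theta_A + (theta0 - theta_A) * math.exp(-k * 20.0)
# a smaller step width gives a better polygon-course approximation
assert abs(euler_cooling(0.1, 20.0) - exact) < abs(euler_cooling(1.0, 20.0) - exact)
```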
Christian Borgelt Introduction to Neural Networks 208
Recurrent Networks: Cooling Law

Euler-Cauchy polygon courses for different step widths:

[three plots of $\vartheta$ over $t \in [0, 20]$, for $\Delta t = 4$, $\Delta t = 2$, and $\Delta t = 1$; the thin curve in each plot is the exact analytical solution]

Recurrent neural network:

[diagram: a single neuron with input $\vartheta(t_0)$ and output $\vartheta(t)$, self-feedback weight $-k\Delta t$, and threshold $-k\vartheta_A\Delta t$]
Christian Borgelt Introduction to Neural Networks 209
Recurrent Networks: Cooling Law

More formal derivation of the recursive formula:

Replace the differential quotient by a forward difference
$$\frac{d\vartheta(t)}{dt} \approx \frac{\Delta\vartheta(t)}{\Delta t} = \frac{\vartheta(t + \Delta t) - \vartheta(t)}{\Delta t}$$
with sufficiently small $\Delta t$. Then it is
$$\vartheta(t + \Delta t) - \vartheta(t) = \Delta\vartheta(t) \approx -k(\vartheta(t) - \vartheta_A)\,\Delta t,$$
$$\vartheta(t + \Delta t) - \vartheta(t) = \Delta\vartheta(t) \approx -k\,\Delta t\,\vartheta(t) + k\,\vartheta_A\,\Delta t$$
and therefore
$$\vartheta_i \approx \vartheta_{i-1} - k\,\Delta t\,\vartheta_{i-1} + k\,\vartheta_A\,\Delta t.$$
Christian Borgelt Introduction to Neural Networks 210
Recurrent Networks: Mass on a Spring

[figure: a mass $m$ on a spring; $x$ measures the deviation from the rest position 0]

Governing physical laws:
Hooke's law: $F = c\,\Delta l = -cx$ ($c$ is a spring-dependent constant)
Newton's second law: $F = ma = m\ddot{x}$ (force causes an acceleration)

Resulting differential equation:
$$m\ddot{x} = -cx \qquad\text{or}\qquad \ddot{x} = -\frac{c}{m}\,x.$$
Christian Borgelt Introduction to Neural Networks 211
Recurrent Networks: Mass on a Spring

General analytical solution of the differential equation:
$$x(t) = a \sin(\omega t) + b \cos(\omega t)$$
with the parameters
$$\omega = \sqrt{\frac{c}{m}},$$
$$a = x(t_0)\sin(\omega t_0) + \frac{v(t_0)}{\omega}\cos(\omega t_0),$$
$$b = x(t_0)\cos(\omega t_0) - \frac{v(t_0)}{\omega}\sin(\omega t_0).$$

With the given initial values $x(t_0) = x_0$ and $v(t_0) = 0$ and the additional assumption $t_0 = 0$ we get the simple expression
$$x(t) = x_0 \cos\left(\sqrt{\frac{c}{m}}\; t\right).$$
Christian Borgelt Introduction to Neural Networks 212
Recurrent Networks: Mass on a Spring

Turn the differential equation into two coupled equations:
$$\dot{x} = v \qquad\text{and}\qquad \dot{v} = -\frac{c}{m}\,x.$$

Approximate the differential quotients by forward differences:
$$\frac{\Delta x}{\Delta t} = \frac{x(t + \Delta t) - x(t)}{\Delta t} = v \qquad\text{and}\qquad \frac{\Delta v}{\Delta t} = \frac{v(t + \Delta t) - v(t)}{\Delta t} = -\frac{c}{m}\,x.$$

Resulting recursive equations:
$$x(t_i) = x(t_{i-1}) + \Delta x(t_{i-1}) = x(t_{i-1}) + \Delta t \cdot v(t_{i-1}) \qquad\text{and}$$
$$v(t_i) = v(t_{i-1}) + \Delta v(t_{i-1}) = v(t_{i-1}) - \frac{c}{m}\,\Delta t \cdot x(t_{i-1}).$$
Christian Borgelt Introduction to Neural Networks 213
Recurrent Networks: Mass on a Spring

[diagram: two neurons $u_1$ and $u_2$ with external inputs $v(t_0)$ and $x(t_0)$, thresholds 0, connection weights $-\frac{c}{m}\Delta t$ and $\Delta t$, and outputs $v(t)$ and $x(t)$]

Neuron $u_1$:
$$f^{(u_1)}_{\mathrm{net}}(v, w_{u_1 u_2}) = w_{u_1 u_2}\, v = -\frac{c}{m}\,\Delta t\; v \qquad\text{and}\qquad
f^{(u_1)}_{\mathrm{act}}(\mathrm{act}_{u_1}, \mathrm{net}_{u_1}, \theta_{u_1}) = \mathrm{act}_{u_1} + \mathrm{net}_{u_1} - \theta_{u_1},$$

Neuron $u_2$:
$$f^{(u_2)}_{\mathrm{net}}(x, w_{u_2 u_1}) = w_{u_2 u_1}\, x = \Delta t\; x \qquad\text{and}\qquad
f^{(u_2)}_{\mathrm{act}}(\mathrm{act}_{u_2}, \mathrm{net}_{u_2}, \theta_{u_2}) = \mathrm{act}_{u_2} + \mathrm{net}_{u_2} - \theta_{u_2}.$$
Christian Borgelt Introduction to Neural Networks 214
Recurrent Networks: Mass on a Spring

Some computation steps of the neural network:

   t       v         x
  0.0    0.0000    1.0000
  0.1   -0.5000    0.9500
  0.2   -0.9750    0.8525
  0.3   -1.4012    0.7124
  0.4   -1.7574    0.5366
  0.5   -2.0258    0.3341
  0.6   -2.1928    0.1148

[plot of $x$ over $t \in [0, 4]$]

The resulting curve is close to the analytical solution.
The approximation gets better with a smaller step width.
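The table can be reproduced in a few lines; the step width $\Delta t = 0.1$ and the ratio $c/m = 5$ are inferred from the tabulated values, and the network computes $v$ first and then $x$ from the new $v$ (neuron $u_1$ fires before neuron $u_2$):

```python
dt, c_over_m = 0.1, 5.0        # step width and c/m, inferred from the table
x, v = 1.0, 0.0                # initial conditions x(t0) = 1, v(t0) = 0

rows = [(0.0, v, x)]
for i in range(1, 7):
    v = v - c_over_m * dt * x  # neuron u1: v(t_i) = v(t_{i-1}) - (c/m) dt x(t_{i-1})
    x = x + dt * v             # neuron u2 then uses the new v
    rows.append((i * dt, v, x))

for t, vi, xi in rows:
    print(f"{t:3.1f}  {vi:8.4f}  {xi:7.4f}")
```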
Christian Borgelt Introduction to Neural Networks 215
Recurrent Networks: Differential Equations

General representation of an explicit $n$-th order differential equation:
$$x^{(n)} = f(t, x, \dot{x}, \ddot{x}, \dots, x^{(n-1)})$$

Introduce the $n - 1$ intermediary quantities
$$y_1 = \dot{x}, \quad y_2 = \ddot{x}, \quad \dots, \quad y_{n-1} = x^{(n-1)}$$
to obtain the system
$$\dot{x} = y_1, \qquad \dot{y}_1 = y_2, \qquad \dots, \qquad \dot{y}_{n-2} = y_{n-1}, \qquad \dot{y}_{n-1} = f(t, x, y_1, y_2, \dots, y_{n-1})$$
of $n$ coupled first-order differential equations.
Christian Borgelt Introduction to Neural Networks 216
Recurrent Networks: Differential Equations

Replace each differential quotient by a forward difference to obtain the recursive equations
$$x(t_i) = x(t_{i-1}) + \Delta t \cdot y_1(t_{i-1}),$$
$$y_1(t_i) = y_1(t_{i-1}) + \Delta t \cdot y_2(t_{i-1}),$$
$$\vdots$$
$$y_{n-2}(t_i) = y_{n-2}(t_{i-1}) + \Delta t \cdot y_{n-1}(t_{i-1}),$$
$$y_{n-1}(t_i) = y_{n-1}(t_{i-1}) + \Delta t \cdot f(t_{i-1}, x(t_{i-1}), y_1(t_{i-1}), \dots, y_{n-1}(t_{i-1})).$$

Each of these equations describes the update of one neuron.
The last neuron needs a special activation function.
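A generic Euler integrator for such a system can be sketched in a few lines; each loop iteration does the work of the chain of neurons, and the spring example at the end is an illustration with $c/m = 5$:

```python
def euler_nth_order(f, x0, derivs0, dt, steps):
    """Euler integration of x^(n) = f(t, x, y1, ..., y_{n-1}).

    x0      -- initial value x(t0)
    derivs0 -- initial values of the intermediary quantities y1, ..., y_{n-1}
    """
    state = [x0, *derivs0]                       # (x, y1, ..., y_{n-1})
    t, trajectory = 0.0, [x0]
    for _ in range(steps):
        new = state[:]
        for j in range(len(state) - 1):          # x' = y1, y1' = y2, ...
            new[j] = state[j] + dt * state[j + 1]
        new[-1] = state[-1] + dt * f(t, *state)  # y_{n-1}' = f(t, x, y1, ...)
        state, t = new, t + dt
        trajectory.append(state[0])
    return trajectory

# example: the mass on a spring, x'' = -(c/m) x, as a 2nd-order equation
traj = euler_nth_order(lambda t, x, v: -5.0 * x, 1.0, [0.0], 0.1, 6)
```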
Christian Borgelt Introduction to Neural Networks 217
Recurrent Networks: Differential Equations

[diagram: a chain of neurons implementing the recursive equations; external inputs $x_0, \dot{x}_0, \ddot{x}_0, \dots, x_0^{(n-1)}$ and $t_0$, thresholds 0, connections carrying the step width $\Delta t$, output $x(t)$]
Christian Borgelt Introduction to Neural Networks 218
Recurrent Networks: Diagonal Throw

[figure: a body thrown diagonally from position $(x_0, y_0)$ with initial speed $v_0$ at angle $\varphi$; the initial velocity components are $v_0\cos\varphi$ and $v_0\sin\varphi$]

Diagonal throw of a body.

Two differential equations (one for each coordinate):
$$\ddot{x} = 0 \qquad\text{and}\qquad \ddot{y} = -g,$$
where $g = 9.81\,\mathrm{m\,s^{-2}}$.

Initial conditions $x(t_0) = x_0$, $y(t_0) = y_0$, $\dot{x}(t_0) = v_0\cos\varphi$ and $\dot{y}(t_0) = v_0\sin\varphi$.
Christian Borgelt Introduction to Neural Networks 219
Recurrent Networks: Diagonal Throw

Introduce the intermediary quantities
$$v_x = \dot{x} \qquad\text{and}\qquad v_y = \dot{y}$$
to reach the system of differential equations
$$\dot{x} = v_x, \qquad \dot{v}_x = 0,$$
$$\dot{y} = v_y, \qquad \dot{v}_y = -g,$$
from which we get the system of recursive update formulae
$$x(t_i) = x(t_{i-1}) + \Delta t\, v_x(t_{i-1}), \qquad v_x(t_i) = v_x(t_{i-1}),$$
$$y(t_i) = y(t_{i-1}) + \Delta t\, v_y(t_{i-1}), \qquad v_y(t_i) = v_y(t_{i-1}) - \Delta t\, g.$$
Christian Borgelt Introduction to Neural Networks 220
Recurrent Networks: Diagonal Throw

Better description: use vectors as inputs and outputs
$$\ddot{\vec{r}} = -g\vec{e}_y,$$
where $\vec{e}_y = (0, 1)$.

Initial conditions are $\vec{r}(t_0) = \vec{r}_0 = (x_0, y_0)$ and $\dot{\vec{r}}(t_0) = \vec{v}_0 = (v_0\cos\varphi, v_0\sin\varphi)$.

Introduce one vector-valued intermediary quantity $\vec{v} = \dot{\vec{r}}$ to obtain
$$\dot{\vec{r}} = \vec{v}, \qquad \dot{\vec{v}} = -g\vec{e}_y.$$

This leads to the recursive update rules
$$\vec{r}(t_i) = \vec{r}(t_{i-1}) + \Delta t\, \vec{v}(t_{i-1}),$$
$$\vec{v}(t_i) = \vec{v}(t_{i-1}) - \Delta t\, g\vec{e}_y.$$
Christian Borgelt Introduction to Neural Networks 221
Recurrent Networks: Diagonal Throw

The advantage of vector networks becomes obvious if friction is taken into account:
$$\vec{a} = -\beta\vec{v} = -\beta\dot{\vec{r}},$$
where $\beta$ is a constant that depends on the size and the shape of the body.

This leads to the differential equation
$$\ddot{\vec{r}} = -\beta\dot{\vec{r}} - g\vec{e}_y.$$

Introduce the intermediary quantity $\vec{v} = \dot{\vec{r}}$ to obtain
$$\dot{\vec{r}} = \vec{v}, \qquad \dot{\vec{v}} = -\beta\vec{v} - g\vec{e}_y,$$
from which we obtain the recursive update formulae
$$\vec{r}(t_i) = \vec{r}(t_{i-1}) + \Delta t\, \vec{v}(t_{i-1}),$$
$$\vec{v}(t_i) = \vec{v}(t_{i-1}) - \Delta t\, \beta\, \vec{v}(t_{i-1}) - \Delta t\, g\vec{e}_y.$$
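These update formulae are easy to run; the friction constant, initial speed, and angle below are arbitrary example values:

```python
import math

def throw(beta, v0=20.0, phi_deg=45.0, g=9.81, dt=0.01):
    """Simulate a diagonal throw with friction constant beta (Euler method).

    Returns the horizontal range, i.e. the x position where y drops below 0.
    """
    phi = math.radians(phi_deg)
    x, y = 0.0, 0.0
    vx, vy = v0 * math.cos(phi), v0 * math.sin(phi)
    while y >= 0.0:
        x, y = x + dt * vx, y + dt * vy       # r(t_i) = r(t_{i-1}) + dt v(t_{i-1})
        vx = vx - dt * beta * vx              # v(t_i) = v(t_{i-1}) - dt beta v(t_{i-1})
        vy = vy - dt * beta * vy - dt * g     #          - dt g e_y
    return x

# friction shortens the throw, i.e. the curve deviates from a parabola
assert throw(beta=0.2) < throw(beta=0.0)
```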
Christian Borgelt Introduction to Neural Networks 222
Recurrent Networks: Diagonal Throw

Resulting recurrent neural network:

[diagram: two vector-valued neurons with external inputs $\vec{r}_0$ and $\vec{v}_0$, thresholds $\vec{0}$ and $\Delta t\, g\vec{e}_y$, connection weight $\Delta t$, and output $\vec{r}(t)$; plot of the trajectory in the $x$-$y$ plane]

There are no strange couplings as there would be in a non-vector network.
Note the deviation from a parabola that is due to the friction.
Christian Borgelt Introduction to Neural Networks 223
Recurrent Networks: Planet Orbit

$$\ddot{\vec{r}} = -\gamma m\, \frac{\vec{r}}{|\vec{r}\,|^3} \qquad\Rightarrow\qquad \dot{\vec{r}} = \vec{v}, \quad \dot{\vec{v}} = -\gamma m\, \frac{\vec{r}}{|\vec{r}\,|^3}.$$

Recursive update rules:
$$\vec{r}(t_i) = \vec{r}(t_{i-1}) + \Delta t\; \vec{v}(t_{i-1}),$$
$$\vec{v}(t_i) = \vec{v}(t_{i-1}) - \Delta t\; \gamma m\, \frac{\vec{r}(t_{i-1})}{|\vec{r}(t_{i-1})|^3}.$$

[diagram: two vector-valued neurons with external inputs $\vec{r}_0$ and $\vec{v}_0$, thresholds $\vec{0}$, connection weights $\Delta t$ and $-\gamma m\Delta t$, and outputs $\vec{r}(t)$ and $\vec{v}(t)$; plot of a computed orbit in the $x$-$y$ plane]
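The update rules translate directly into code; the initial position and velocity below (and the choice $\gamma m = 1$) are arbitrary example values:

```python
def orbit(steps, dt=0.01, gm=1.0):
    """Euler integration of the planet-orbit equations (gm stands for gamma * m)."""
    rx, ry = 1.0, 0.0                      # example initial position
    vx, vy = 0.0, 0.9                      # example initial velocity
    path = [(rx, ry)]
    for _ in range(steps):
        d3 = (rx * rx + ry * ry) ** 1.5    # |r|^3 at the old position
        ax, ay = -gm * rx / d3, -gm * ry / d3
        rx, ry = rx + dt * vx, ry + dt * vy    # r += dt * v
        vx, vy = vx + dt * ax, vy + dt * ay    # v += -dt * gm * r / |r|^3
        path.append((rx, ry))
    return path

path = orbit(1000)
```

Plotting `path` shows the orbit; as with all Euler-Cauchy polygon courses, the result drifts from the exact orbit unless the step width is small.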
Christian Borgelt Introduction to Neural Networks 224
Recurrent Networks: Backpropagation through Time

Idea: unfold the network between training patterns, i.e., create one neuron for each point in time.

Example: Newton's cooling law

[diagram: the unfolded network, a chain of neurons from $\vartheta(t_0)$ to $\vartheta(t)$; each connection carries the weight $1 - k\Delta t$]

Unfolding into four steps. It is $\theta = -k\vartheta_A\Delta t$.

Training is standard backpropagation on the unfolded network.
All updates refer to the same weight.
Updates are carried out after the first neuron is reached.
Christian Borgelt Introduction to Neural Networks 225