Chapter 6. Artificial neural networks
In this chapter, we consider how our brains work and how to build and train artificial neural networks.
6.1 Introduction, or how the brain works
'The computer hasn't proved anything yet,' said an angry Garry Kasparov, the world chess champion, after his defeat by IBM's Deep Blue in New York in May 1997. Deep Blue was capable of analyzing 200 million positions a second. Chess-playing programs must be able to improve their performance with experience or, in other words, a machine must be capable of learning.
What is machine learning?
In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy. Learning capabilities can improve the performance of an intelligent system over time. The most popular approaches to machine learning are artificial neural networks and genetic algorithms.
What is a neural network?
A neural network can be defined as a model of reasoning based on the human brain. The human brain incorporates nearly 10 billion neurons and 60 trillion connections, synapses, between them. A neuron consists of a cell body, soma, a number of fibres called dendrites, and a single long fibre called the axon. The axon stretches out to the dendrites and somas of other neurons. Figure 6.1 is a schematic drawing of a neural network.
Even entire collections of neurons may sometimes migrate from one place to another. These mechanisms form the basis for learning in the brain. Our brain can be considered as a highly complex, nonlinear and parallel information-processing system. Both data and its processing are global rather than local.
Learning is a fundamental and essential characteristic of biological neural networks. The ease and naturalness with which they can learn led to attempts to emulate a biological neural network in a computer.
How do artificial neural networks model the brain?
An artificial neural network consists of a number of very simple and highly interconnected processors, also called neurons. The neurons are connected by weighted links passing signals from one neuron to another. Figure 6.2 represents connections of a typical ANN, and Table 6.1 shows the analogy between biological and artificial neural networks.
How does an artificial neural network 'learn'?
The neurons are connected by links, and each link has a numerical weight associated with it. Weights are the basic means of long-term memory in ANNs. A neural network 'learns' through repeated adjustments of these weights.
How does the neural network adjust the weights?
We train the neural network as follows: we initialize the weights of the network and then update the weights from a set of training examples.
6.2 The neuron as a simple computing element
Figure 6.3 shows a typical neuron. The neuron computes the weighted sum of the input signals and compares the result with a threshold value, θ. If the net input is less than the threshold, the neuron output is -1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains a value of +1. The typical activation function is called the sign function.
Is the sign function the only activation function used by neurons?
Four common choices - the step, sign, linear and sigmoid functions - are illustrated in Figure 6.4.
The step and sign activation functions, also called hard limit functions, are often used in decision-making neurons for classification and pattern recognition.
The sigmoid function transforms the input, which can have any value between plus and minus infinity, into a reasonable value in the range between 0 and 1.
The linear activation function provides an output equal to the neuron weighted input.
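As a concrete illustration (not from the text), the four activation functions can be written in a few lines of Python; the function names are my own:

```python
import numpy as np

def step(x):
    """Step function: 1 if x >= 0, else 0."""
    return np.where(x >= 0, 1, 0)

def sign(x):
    """Sign function: +1 if x >= 0, else -1."""
    return np.where(x >= 0, 1, -1)

def sigmoid(x):
    """Sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):
    """Linear: output equals the net weighted input."""
    return x
```

Note how the step and sign functions hard-limit the output to two values, while the sigmoid varies smoothly.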
Can a single neuron learn a task?
In 1958, Frank Rosenblatt introduced a simple algorithm for training a simple ANN: the perceptron. A single-layer two-input perceptron is shown in Figure 6.5.
6.3 The perceptron
The aim of the perceptron is to classify inputs, or in other words externally applied stimuli x1, x2, ..., xn, into one of two classes, say A1 and A2. The two classes are separated by a hyperplane defined by the linearly separable function:
x1w1 + x2w2 + ... + xnwn - θ = 0
For the case of two inputs, x1 and x2, the decision boundary takes the form of a straight line shown in bold in Figure 6.6(a). With three inputs the hyperplane can still be visualized. Figure 6.6(b) shows three dimensions for the three-input perceptron. The separating plane here is defined by the equation
x1w1 + x2w2 + x3w3 - θ = 0
But how does the perceptron learn its classification tasks?
The initial weights are randomly assigned, usually in the range [-0.5, 0.5]. The process of weight updating is particularly simple. If at iteration p the actual output is Y(p) and the desired output is Yd(p), then the error is given by
e(p) = Yd(p) - Y(p), where p = 1, 2, 3, ...
Iteration p here refers to the p-th training example presented to the perceptron.
If the error, e(p), is positive, we need to increase the perceptron output Y(p), but if it is negative, we need to decrease Y(p). Thus, the following perceptron learning rule can be established:
wi(p+1) = wi(p) + α · xi(p) · e(p)
where α is the learning rate, a positive constant less than unity.
The training algorithm:
Step 1: Initialization
Set initial weights w1, w2, ..., wn and threshold θ to random numbers in the range [-0.5, 0.5].
Step 2: Activation
Activate the perceptron by applying inputs x1(p), x2(p), ..., xn(p) and desired output Yd(p). Calculate the actual output at iteration p:
Y(p) = step[ x1(p)w1(p) + x2(p)w2(p) + ... + xn(p)wn(p) - θ ]
where n is the number of the perceptron inputs, and step is a step activation function.
Step 3: Weight training
Update the weights of the perceptron:
wi(p+1) = wi(p) + Δwi(p)
where Δwi(p) is the weight correction at iteration p.
The weight correction is computed by the delta rule:
Δwi(p) = α · xi(p) · e(p)
Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until convergence.
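The four steps above can be sketched in Python; this is a minimal illustration, not the book's code. The threshold is updated by the delta rule as if it were a weight on an input fixed at -1, and all variable names are my own:

```python
import random

def train_perceptron(examples, alpha=0.1, epochs=50, seed=1):
    """Train a single neuron with the perceptron learning rule (Steps 1-4)."""
    random.seed(seed)
    n = len(examples[0][0])
    # Step 1: initialize weights and threshold in [-0.5, 0.5]
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    theta = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, yd in examples:
            # Step 2: activation with a step function
            net = sum(xi * wi for xi, wi in zip(x, w)) - theta
            y = 1 if net >= 0 else 0
            # Step 3: delta rule, delta_wi = alpha * xi * e
            e = yd - y
            for i in range(n):
                w[i] += alpha * x[i] * e
            theta += alpha * (-1) * e   # threshold treated as a weight on input -1
    return w, theta

# Example: learn the logical AND operation (linearly separable)
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND)
predict = lambda x: 1 if sum(xi * wi for xi, wi in zip(x, w)) - theta >= 0 else 0
```

After training, `predict` reproduces the AND truth table.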
Can we train a perceptron to perform basic logical operations such as AND, OR or Exclusive-OR?
The truth tables for the operations AND, OR and Exclusive-OR are shown in Table 6.2.
The perceptron is activated by the sequence of four input patterns representing an epoch. The result is shown in Table 6.3.
In a similar manner, the perceptron can learn the operation OR. However, a single-layer perceptron cannot be trained to perform the operation Exclusive-OR.
Figure 6.7 represents the AND, OR and Exclusive-OR functions as two-dimensional plots based on the values of the two inputs. The fact that a perceptron can learn only linearly separable functions is rather bad news, because there are not many such functions.
Can we do better by using a sigmoidal or linear element in place of the hard limiter?
The computational limitations of a perceptron were mathematically analysed in Minsky and Papert's famous book Perceptrons. They concluded that the limitations of a single-layer perceptron would also hold true for multilayer neural networks. This conclusion certainly did not encourage further research on artificial neural networks.
6.4 Multilayer neural networks
A multilayer perceptron is a feedforward neural network with one or more hidden layers. A multilayer perceptron with two hidden layers is shown in Figure 6.8.
Neurons in the hidden layer detect the features; the weights of the neurons represent the features hidden in the input pattern.
With one hidden layer, we can represent any continuous function of the input signals, and with two hidden layers even discontinuous functions can be represented.
Can a neural network include more than two hidden layers?
Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilize millions of neurons, but most practical applications use only three layers, because each additional layer increases the computational burden exponentially.
How do multilayer neural networks learn?
The most popular method is back-propagation. Only in the mid-1980s was the back-propagation learning algorithm rediscovered.
How can we assess the blame for an error and divide it among the contributing weights?
In a back-propagation neural network, the learning algorithm has two phases. First, a training input pattern is presented to the network input layer. The network propagates the pattern from layer to layer until the output layer generates an output pattern. If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. As the signal propagates forward, each neuron first computes the net weighted input as before:
X = x1w1 + x2w2 + ... + xnwn - θ
Next, this input value is passed through the sigmoid activation function:
Y = 1 / (1 + e^(-X))
The output is bounded between 0 and 1.
What about the learning law used in the back-propagation networks?
Figure 6.9 shows the three-layer back-propagation neural network. The indices i, j and k here refer to neurons in the input, hidden and output layers, respectively.
To propagate error signals, we start at the output layer and work backward to the hidden layer. The error signal at the output of neuron k at iteration p is defined by
ek(p) = yd,k(p) - yk(p)
where yd,k(p) is the desired output of neuron k at iteration p.
In fact, the rule for updating weights at the output layer is similar to the perceptron learning rule of Eq. (6.7):
wjk(p+1) = wjk(p) + Δwjk(p)
where Δwjk(p) is the weight correction.
As we cannot apply input signal xi, what should we use instead?
We use the output of neuron j in the hidden layer, yj, instead of input xi:
Δwjk(p) = α · yj(p) · δk(p)
where δk(p) is the error gradient at neuron k in the output layer at iteration p.
What is the error gradient?
The error gradient is determined as the derivative of the activation function multiplied by the error at the neuron output. For neuron k in the output layer:
δk(p) = [∂yk(p) / ∂Xk(p)] · ek(p)
where yk(p) is the output of neuron k at iteration p, and Xk(p) is the net weighted input to neuron k.
For a sigmoid function, δk(p) is equal to:
δk(p) = yk(p) · [1 - yk(p)] · ek(p)
where ek(p) = yd,k(p) - yk(p).
How can we determine the weight correction for a neuron in the hidden layer?
We can apply the same equation as for the output layer:
Δwij(p) = α · xi(p) · δj(p)
where δj(p) represents the error gradient at neuron j in the hidden layer:
δj(p) = yj(p) · [1 - yj(p)] · Σ δk(p) · wjk(p), sum over k = 1 to l
where l is the number of neurons in the output layer;
yj(p) = 1 / (1 + e^(-Xj(p))), Xj(p) = Σ xi(p) · wij(p) - θj, sum over i = 1 to n
and n is the number of neurons in the input layer.
Now we can derive the back-propagation training algorithm.
Step 1: Initialization
Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range:
(-2.4 / Fi, +2.4 / Fi)
where Fi is the total number of inputs of neuron i in the network.
Step 2: Activation
Apply the inputs x1(p), x2(p), ..., xn(p) and the desired outputs yd,1(p), yd,2(p), ..., yd,n(p).
(a) Calculate the actual outputs of the neurons in the hidden layer:
yj(p) = sigmoid[ Σ xi(p) · wij(p) - θj ], sum over i = 1 to n
where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.
(b) Calculate the actual outputs of the neurons in the output layer:
yk(p) = sigmoid[ Σ yj(p) · wjk(p) - θk ], sum over j = 1 to m
where m is the number of inputs of neuron k in the output layer.
Step 3: Weight training
(a) Calculate the error gradient for the neurons in the output layer:
δk(p) = yk(p) · [1 - yk(p)] · ek(p)
where ek(p) = yd,k(p) - yk(p).
Calculate the weight corrections:
Δwjk(p) = α · yj(p) · δk(p)
Update the weights at the output neurons:
wjk(p+1) = wjk(p) + Δwjk(p)
(b) Calculate the error gradient for the neurons in the hidden layer:
δj(p) = yj(p) · [1 - yj(p)] · Σ δk(p) · wjk(p), sum over k = 1 to l
Calculate the weight corrections:
Δwij(p) = α · xi(p) · δj(p)
Update the weights at the hidden neurons:
wij(p+1) = wij(p) + Δwij(p)
Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
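The whole algorithm can be sketched in Python for a 2-2-1 network on the Exclusive-OR problem. This is an illustrative sketch, not the book's code: it uses batch (whole-epoch) weight updates rather than the pattern-by-pattern updates described above, and the network size and hyperparameters are my own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_xor(alpha=0.5, epochs=5000, seed=42):
    """Three-layer back-propagation network (2 inputs, 2 hidden, 1 output)."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Yd = np.array([[0], [1], [1], [0]], dtype=float)
    # Step 1: small random weights and thresholds
    W1 = rng.uniform(-0.5, 0.5, (2, 2)); th1 = rng.uniform(-0.5, 0.5, 2)
    W2 = rng.uniform(-0.5, 0.5, (2, 1)); th2 = rng.uniform(-0.5, 0.5, 1)
    errors = []
    for _ in range(epochs):
        # Step 2: forward pass through the hidden and output layers
        Yh = sigmoid(X @ W1 - th1)           # hidden outputs y_j
        Yo = sigmoid(Yh @ W2 - th2)          # actual outputs y_k
        e = Yd - Yo
        errors.append(float((e ** 2).sum())) # sum of squared errors
        # Step 3: error gradients for sigmoid neurons
        dk = Yo * (1 - Yo) * e               # output layer
        dj = Yh * (1 - Yh) * (dk @ W2.T)     # hidden layer
        # weight and threshold corrections (threshold as a weight on input -1)
        W2 += alpha * Yh.T @ dk; th2 += alpha * (-1) * dk.sum(axis=0)
        W1 += alpha * X.T @ dj;  th1 += alpha * (-1) * dj.sum(axis=0)
    return errors

errors = train_xor()
```

The `errors` list is the learning curve of Figure 6.11: the sum of squared errors per epoch.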
The example shown in Figure 6.10 illustrates these steps (pp. 180-182).
Why do we need to sum the squared errors?
The sum of the squared errors is a useful indicator of the network's performance. The back-propagation training algorithm attempts to minimize this criterion.
Figure 6.11 represents a learning curve: the sum of squared errors plotted versus the number of epochs used in training. The Exclusive-OR problem has been solved and the results are shown in Table 6.4.
Can we now draw decision boundaries constructed by the multilayer network for operation Exclusive-OR?
It may be rather difficult to draw decision boundaries constructed by neurons with a sigmoid activation function. However, we can represent each neuron in the hidden and output layers by a McCulloch and Pitts model, using a sign function. The network in Figure 6.12 is also trained to perform the Exclusive-OR operation.
The decision boundaries constructed by neurons 3 and 4 in the hidden layer are shown in Figure 6.13(a) and (b), respectively. Neuron 5 in the output layer performs a linear combination of the decision boundaries formed by the two hidden neurons, as shown in Figure 6.13(c).
Is back-propagation learning a good method for machine learning?
Although widely used, back-propagation learning is not immune from problems. For example, the back-propagation learning algorithm does not seem to function in the biological world.
Another apparent problem is that the calculations are extensive and, as a result, training is slow. To improve the computational efficiency, several accelerated-learning techniques have been proposed, as described below.
6.5 Accelerated learning in multilayer neural networks
A multilayer network, in general, learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent:
Y = 2a / (1 + e^(-bX)) - a
where a and b are constants. Suitable values for a and b are: a = 1.716 and b = 0.667.
We can also accelerate training by including a momentum term in the delta rule:
Δwjk(p) = β · Δwjk(p-1) + α · yj(p) · δk(p)
where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95.
Figure 6.14 represents learning with momentum for operation Exclusive-OR.
In the delta and generalized delta rules, we use a constant and rather small value for the learning rate parameter, α. Can we increase this value to speed up training?
One of the most effective means to accelerate the convergence of back-propagation learning is to adjust the learning rate parameter during training. If the learning rate parameter, α, is made larger to speed up the training process, the resulting larger change in the weights may cause instability and, as a result, the network may become oscillatory. To accelerate the convergence and yet avoid the danger of instability, we apply two heuristics:
1. If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.
2. If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.
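The two heuristics can be sketched as a small helper that inspects the recent learning curve; the adjustment factors 1.05 and 0.7 and the window size are illustrative choices of mine, not values from the text:

```python
def adapt_learning_rate(alpha, sse_history, grow=1.05, shrink=0.7, k=3):
    """Adjust the learning rate from the signs of the last k changes in
    the sum of squared errors."""
    if len(sse_history) < k + 1:
        return alpha            # not enough epochs observed yet
    changes = [sse_history[i + 1] - sse_history[i]
               for i in range(len(sse_history) - k - 1, len(sse_history) - 1)]
    signs = [c > 0 for c in changes]
    if all(signs) or not any(signs):
        return alpha * grow     # heuristic 1: same sign for k epochs
    if all(signs[i] != signs[i + 1] for i in range(k - 1)):
        return alpha * shrink   # heuristic 2: alternating sign
    return alpha
```

For example, a steadily decreasing error history yields a larger α, while an oscillating history yields a smaller one.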
Figure 6.15 represents an example of back-propagation training with adaptive learning rate.
Learning rate adaptation can be used together with learning with momentum (as shown in Figure 6.16).
Can a neural network simulate associative characteristics of the human memory?
To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
6.6 The Hopfield network
After applying a new input, the network output is calculated and fed back to adjust the input. Then the output is calculated again, and the process is repeated until the output becomes constant.
In the 1960s and 1970s, no one was able to predict which network would be stable, and some researchers were pessimistic about finding a solution at all. The problem was solved in 1982, when John Hopfield formulated the physical principle of storing information in a dynamically stable network.
Figure 6.17 shows a single-layer Hopfield network consisting of n neurons. The output of each neuron is fed back to the inputs of all other neurons (there is no self-feedback in the Hopfield network).
How does the activation function work here?
It works in a similar way to the sign function in Figure 6.4. The activation function used is the saturated linear function (as shown in Figure 6.18).
For a single-layer n-neuron network, the state can be defined by the state vector as
Y = [y1, y2, ..., yn]^T
The synaptic weights between neurons are usually represented in matrix form as follows:
W = Σ Ym Ym^T - M·I, sum over m = 1 to M
where M is the number of states to be memorized by the network, Ym is the n-dimensional binary vector, I is the n × n identity matrix, and superscript T denotes matrix transposition.
Figure 6.19 shows a three-neuron network represented as a cube in the three-dimensional space. In Figure 6.19, each state is represented by a vertex. When a new input vector is applied, the network moves from one state-vertex to another until it becomes stable.
What determines a stable state-vertex?
The stable state-vertex is determined by the weight matrix W, the current input vector X, and the threshold matrix θ. If the input vector is partially incorrect or incomplete, the initial state will converge into the stable state-vertex after a few iterations. (For example, two opposite states and the converging process are shown on pp. 191-192; Table 6.5 shows all possible inputs and the corresponding stable states.)
The training algorithm of the Hopfield network
Step 1: Storage
The n-neuron Hopfield network is required to store a set of M fundamental memories, Y1, Y2, ..., YM. In matrix form, the synaptic weights are determined as
W = Σ Ym Ym^T - M·I, sum over m = 1 to M
where wij = wji.
Step 2: Testing
We need to confirm that the Hopfield network is capable of recalling all fundamental memories. That is, when presented with a fundamental memory Ym as input, the network must produce Ym itself as output:
yi = sign( Σ wij · xj - θi ), where xj = ym,j and the sum is over j = 1 to n
If all fundamental memories are recalled perfectly we may proceed to the next step.
Step 3: Retrieval
Present an unknown n-dimensional vector (probe), X, to the network and retrieve a stable state. Typically, the probe is a corrupted or incomplete version of a fundamental memory:
X ≠ Ym, m = 1, 2, ..., M.
(a) Initialize the retrieval algorithm of the Hopfield network by setting:
X(0) = X
(b) Update the elements of the state vector, Y(p), according to the following rule:
yi(p+1) = sign( Σ wij · yj(p) - θi ), sum over j = 1 to n
Neurons for updating are selected asynchronously, i.e. randomly and one at a time. Repeat the iteration until the state vector becomes unchanged (stability is achieved).
The Hopfield network will always converge to a stable state if the retrieval is done asynchronously. However, this stable state does not necessarily represent one of the fundamental memories, and if it is a fundamental memory it is not necessarily the closest one.
An example of the storage problem is presented on p. 195.
Another problem is the storage capacity, or the largest number of fundamental memories that can be stored and retrieved correctly. Hopfield showed experimentally that the maximum number of fundamental memories that can be stored in the n-neuron recurrent network is limited by
Mmax = 0.15n
where n is the number of neurons. If most of the fundamental memories are to be retrieved perfectly, the storage capacity is limited by
Mmax = n / (2 ln n)
It can be shown that to retrieve all the fundamental memories perfectly, this number must be halved:
Mmax = n / (4 ln n)
This is a major limitation of the Hopfield network.
Why can't the Hopfield network associate one memory with a different one?
The Hopfield network is a single-layer network, and thus the output pattern appears on the same set of neurons to which the input pattern was applied. To associate one memory with another, we need a recurrent neural network capable of accepting an input pattern on one set of neurons and producing a related, but different, output pattern on another set of neurons: a bidirectional associative memory.
6.7 Bidirectional associative memory
Bidirectional associative memory (BAM), first proposed by Bart Kosko, is a heteroassociative network. The basic BAM architecture is shown in Figure 6.20. It consists of two fully connected layers: an input layer and an output layer.
How does the BAM work?
The basic idea behind the BAM is to store pattern pairs so that when n-dimensional vector X from set A is presented as input, the BAM recalls m-dimensional vector Y from set B, but when Y is presented as input, the BAM recalls X.
To develop the BAM, we need to create a correlation matrix for each pattern pair we want to store. The correlation matrix is the matrix product of the input vector X and the transpose of the output vector Y^T. The BAM weight matrix is the sum of all correlation matrices:
W = Σ Xm Ym^T, sum over m = 1 to M
where M is the number of pattern pairs to be stored in the BAM.
Like a Hopfield network, the BAM usually uses McCulloch and Pitts neurons with the sign activation function.
The training algorithm is presented as follows.
Step 1: Storage
The BAM is required to store M pairs of patterns. The weight matrix is determined as
W = Σ Xm Ym^T, sum over m = 1 to M
(Example: see text.)
Step 2: Testing
The BAM should be able to receive any vector from set A and retrieve the associated vector from set B, and receive any vector from set B and retrieve the associated vector from set A:
Ym = sign(W^T Xm), and Xm = sign(W Ym), m = 1, 2, ..., M.
(Example: see text.)
Step 3: Retrieval
Present an unknown vector (probe), X, to the BAM and retrieve a stored association. The probe may be a corrupted version of a pattern from set A. That is,
X ≠ Xm, m = 1, 2, ..., M.
(a) Initialize the BAM retrieval algorithm by setting
X(0) = X, p = 0
and calculate the BAM output at iteration p:
Y(p) = sign[W^T X(p)]
(b) Update the input vector X(p):
X(p+1) = sign[W Y(p)]
and repeat the iteration until equilibrium, when the input and output vectors remain unchanged with further iterations.
The BAM is unconditionally stable. This means that any set of associations can be learned without risk of instability.
There is also a close relationship between the BAM and the Hopfield network. If the BAM weight matrix is square and symmetrical, then W = W^T. In this case, the BAM can be reduced to the autoassociative Hopfield network. Thus, the Hopfield network can be considered as a BAM special case.
The maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer. Another, even more serious problem is incorrect convergence. The BAM may not always produce the closest association. In fact, a stable association may be only slightly related to the initial input vector.
Can a neural network learn without a 'teacher'?
So far we have considered supervised or active learning - learning with an external 'teacher' or a supervisor who presents a training set to the network.
In contrast, unsupervised or self-organised learning does not require an external teacher. Unsupervised learning tends to follow the neuro-biological organization of the brain.
Unsupervised learning algorithms aim to learn rapidly. In fact, self-organising neural networks learn much faster than back-propagation networks.
6.8 Self-organising neural networks
In this section, we consider Hebbian and competitive learning, which form the basis of self-organising networks.
6.8.1 Hebbian learning
In 1949, neuropsychologist Donald Hebb proposed one of the key ideas in biological learning (Hebb's Law). Hebb's Law states that if neuron i is near enough to excite neuron j and repeatedly participates in its activation, the synaptic connection between these two neurons is strengthened and neuron j becomes more sensitive to stimuli from neuron i. Hebb's Law can be stated as two rules:
1. If two neurons on either side of a connection are activated synchronously, then the weight of that connection is increased.
2. If two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased.
Hebb's Law provides the basis for learning without a teacher. Figure 6.21 shows Hebbian learning in a neural network. Using Hebb's Law, the adjustment applied to the weight wij at iteration p is
Δwij(p) = α · yj(p) · xi(p)
where α is the learning rate parameter.
When we use the step function (or other functions) with output range [0, 1], Hebbian learning implies that weights can only increase. In other words, Hebb's Law allows the strength of a connection to increase, but it does not provide a means to decrease the strength. We might impose a limit on the growth of synaptic weights. It can be done by introducing a non-linear forgetting factor into Hebb's Law:
Δwij(p) = α · yj(p) · xi(p) - φ · yj(p) · wij(p)
where φ is the forgetting factor.
What does a forgetting factor mean?
If the forgetting factor is 0, the neural network is capable only of strengthening its synaptic weights, and as a result, these weights grow towards infinity. On the other hand, if the forgetting factor is close to 1, the network remembers very little of what it learns. Therefore, a rather small forgetting factor should be chosen, typically between 0.01 and 0.1, to allow only a little 'forgetting' while limiting the weight growth.
Factoring out φ gives the generalized activity product rule:
Δwij(p) = φ · yj(p) · [λ · xi(p) - wij(p)]
where λ = α / φ.
If the presynaptic activity at iteration p, xi(p), is less than wij(p)/λ, then the weight wij(p+1) will decrease. On the other hand, if the presynaptic activity xi(p) is greater than wij(p)/λ, then the weight wij(p+1) will increase. So, the generalized Hebbian learning algorithm is as follows:
Step 1: Initialization
Set the initial synaptic weights and thresholds to small random values in the interval [0, 1]. Also assign small positive values to the learning rate parameter α and the forgetting factor φ.
Step 2: Activation
Compute the neuron output at iteration p:
yj(p) = Σ xi(p) · wij(p) - θj, sum over i = 1 to n
where n is the number of neuron inputs and θj is the threshold value of neuron j.
Step 3: Learning
Update the weights in the network:
wij(p+1) = wij(p) + Δwij(p)
where Δwij(p) is the weight correction given by the generalized activity product rule:
Δwij(p) = φ · yj(p) · [λ · xi(p) - wij(p)]
Step 4: Iteration
Increase iteration p by one. Go back to Step 2 and continue until the synaptic weights reach their steady-state values.
Figure 6.22 illustrates Hebbian learning (with input vector (probe) X = [1 0 0 0 1]^T and final output Y = [0 1 0 0 1]^T).
6.8.2 Competitive learning
Another popular type of unsupervised learning is competitive learning. In competitive learning, neurons compete among themselves to be activated. The output neuron that wins the 'competition' is called the winner-take-all neuron. Teuvo Kohonen introduced a special class of artificial neural networks called self-organizing feature maps.
What is a self-organizing feature map?
Our brain is dominated by the cerebral cortex. The cortex is a self-organizing computational map in the human brain.
Can we model the self-organizing map?
Kohonen formulated the principle of topographic map formation. This principle states that the spatial location of an output neuron in the topographic map corresponds to a particular feature of the input pattern. Kohonen also proposed the feature-mapping model shown in Figure 6.23.
How close is 'close physical proximity'?
Generally, training in the Kohonen network begins with the winner's neighborhood of a fairly large size. Then, as training proceeds, the neighborhood size gradually decreases. The Kohonen network consists of a single layer of computation neurons, but it has two different types of connections. There are forward connections from the neurons in the input layer to the neurons in the output layer, and also lateral connections between neurons in the output layer, as shown in Figure 6.24. The neuron with the largest activation level among all neurons in the output layer becomes the winner.
What is the Mexican hat function?
The Mexican hat function shown in Figure 6.25 represents the relationship between the distance from the winner-take-all neuron and the strength of the connections within the Kohonen layer. According to this function, the near neighborhood has a strong excitatory effect, while the remote neighborhood has a mild inhibitory effect. Only the winning neuron and its neighborhood are allowed to learn.
What is the Euclidean distance?
The Euclidean distance between a pair of n-by-1 vectors X and Wj is defined by
d = ||X - Wj|| = [ Σ (xi - wij)² ]^(1/2), sum over i = 1 to n
where xi and wij are the i-th elements of the vectors X and Wj, respectively.
Figure 6.26 clearly demonstrates that the smaller the Euclidean distance is, the greater will be the similarity between the vectors X and Wj.
Figure 6.26 Euclidean distance
(Example: see pp. 208-209)
The Kohonen competitive learning algorithm:
Step 1: Initialization
Set the initial synaptic weights to small random values in the interval [0, 1], and assign a small positive value to the learning rate parameter α.
Step 2: Activation and similarity matching
Apply the input vector X, and find the winner-take-all (best matching) neuron jX at iteration p, using the minimum-distance Euclidean criterion:
jX(p) = min over j of ||X - Wj(p)||, j = 1, 2, ..., m
where n is the number of neurons in the input layer, and m is the number of neurons in the output or Kohonen layer.
Step 3: Learning
Update the synaptic weights:
wij(p+1) = wij(p) + Δwij(p)
where Δwij(p) is the weight correction at iteration p:
Δwij(p) = α · [xi - wij(p)] if neuron j belongs to the neighborhood Λj(p) of the winner, and 0 otherwise
where α is the learning rate, and Λj(p) is the neighborhood function centred around the winner-take-all neuron jX at iteration p. The simple form of a neighborhood function is shown in Figure 6.27.
Step 4: Iteration
Increase iteration p by one, go back to Step 2 and continue until the minimum-distance Euclidean criterion is satisfied.
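The competitive learning steps can be sketched in Python. As a simplifying assumption of this sketch, the neighborhood is reduced to the winner alone (pure winner-take-all), and the lattice size, epoch count and learning rate are my own choices:

```python
import numpy as np

def train_som(data, n_neurons=9, alpha=0.1, epochs=20, seed=3):
    """Kohonen-style competitive learning with a winner-only neighborhood."""
    rng = np.random.default_rng(seed)
    # Step 1: small random weights in [0, 1]
    W = rng.uniform(0, 1, (n_neurons, data.shape[1]))
    for _ in range(epochs):
        for x in data:
            # Step 2: minimum-distance Euclidean criterion picks the winner
            j = int(np.argmin(np.linalg.norm(W - x, axis=1)))
            # Step 3: move the winner's weights toward the input
            W[j] += alpha * (x - W[j])
    return W

# Training set similar in spirit to Figure 6.28: random 2-D vectors
# drawn from the square between -1 and +1.
rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, (500, 2))
W = train_som(data)
```

After training, each weight vector has drifted into the region of input space it wins, which is the mechanism behind the unfolding lattice of Figure 6.28.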
Figure 6.28 demonstrates the Kohonen network with 100 neurons arranged in the form of a two-dimensional lattice with 10 rows and 10 columns. The network is required to classify two-dimensional input vectors. The network is trained with 1000 two-dimensional input vectors generated randomly in a square region in the interval between -1 and +1. Each black dot represents the location of a neuron's two weights, w1j and w2j.
Figure 6.29 illustrates the inputs X1, X2 and X3.
6.9 Summary
(pp. 212-215)