Chap 6. Artificial neural networks


In this chapter, we consider how our brains work and how to build and train artificial neural networks.


6.1 Introduction, or how the brain works


"The computer hasn't proved anything yet," an angry Garry Kasparov, the world chess champion, said after his defeat by IBM Deep Blue in New York in May 1997. Deep Blue was capable of analyzing 200 million positions a second. Chess-playing programs must be able to improve their performance with experience or, in other words, a machine must be capable of learning.


What is machine learning?


In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy. Learning capabilities can improve the performance of an intelligent system over time. The most popular approaches to machine learning are artificial neural networks and genetic algorithms.


What is a neural network?

A neural network can be defined as a model of reasoning based on the human brain. The human brain incorporates nearly 10 billion neurons and 60 trillion connections, synapses, between them. A neuron consists of a cell body, soma, a number of fibres called dendrites, and a single long fibre called the axon. The axon stretches out to the dendrites and somas of other neurons. Figure 6.1 is a schematic drawing of a neural network.




Even entire collections of neurons may sometimes migrate from one place to another. These mechanisms form the basis for learning in the brain. Our brain can be considered as a highly complex, nonlinear and parallel information-processing system. Both data and its processing are global rather than local.


Learning is a fundamental and essential characteristic of biological neural networks. The ease and naturalness with which they learn led to attempts to emulate a biological neural network in a computer.


How do artificial neural nets model the brain?


An artificial neural network consists of a number of very simple and highly interconnected processors, also called neurons. The neurons are connected by weighted links passing signals from one neuron to another. Figure 6.2 represents connections of a typical ANN, and Table 6.1 shows the analogy between biological and artificial neural networks.



How does an artificial neural network learn?


The neurons are connected by links, and each link has a numerical weight associated with it. Weights are the basic means of long-term memory in ANNs. A neural network 'learns' through repeated adjustments of these weights.


How does the neural net adjust the weights?

We train the neural network as follows: we initialize the weights of the network and then update the weights from a set of training examples.


6.2 The neuron as a simple computing element

Figure 6.3 shows a typical neuron. The neuron computes the weighted sum of the input signals and compares the result with a threshold value, θ. If the net input is less than the threshold, the neuron output is −1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains the value +1.

The typical activation function is called the sign function:

$X = \sum_{i=1}^{n} x_i w_i, \qquad Y = \begin{cases} +1 & \text{if } X \geq \theta \\ -1 & \text{if } X < \theta \end{cases}$

Is the sign function the only activation function used by neurons?



Four common choices, the step, sign, linear and sigmoid functions, are illustrated in Figure 6.4.

The step and sign activation functions, also called hard limit functions, are often used in decision-making neurons for classification and pattern recognition.

The sigmoid function transforms the input, which can have any value between plus and minus infinity, into a reasonable value in the range between 0 and 1.

The linear activation function provides an output equal to the neuron weighted input.
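To make the four functions concrete, here is a minimal Python sketch (the function names and the NumPy dependency are illustrative choices, not part of the text):

```python
import numpy as np

def step(x, theta=0.0):
    """Step (hard limit): 1 if X >= theta, else 0."""
    return np.where(x >= theta, 1, 0)

def sign_fn(x, theta=0.0):
    """Sign (hard limit): +1 if X >= theta, else -1."""
    return np.where(x >= theta, 1, -1)

def linear(x):
    """Linear: output equals the net weighted input."""
    return x

def sigmoid(x):
    """Sigmoid: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A single neuron: weighted sum of inputs compared with a threshold
x = np.array([0.7, -0.2])       # input signals
w = np.array([0.5, 0.3])        # synaptic weights
net = np.dot(x, w)              # net weighted input X
print(sign_fn(net, theta=0.2))  # output of a sign-activated neuron
```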


Can a single neuron learn a task?

In 1958, Frank Rosenblatt introduced a simple algorithm for training a simple ANN: the perceptron. A single-layer two-input perceptron is shown in Figure 6.5.


6.3 The perceptron

The aim of the perceptron is to classify inputs, or in other words externally applied stimuli x1, x2, …, xn, into one of two classes, say A1 and A2. The hyperplane is defined by the linearly separable function

$\sum_{i=1}^{n} x_i w_i - \theta = 0$

For the case of two inputs, x1 and x2, the decision boundary takes the form of a straight line, shown in bold in Figure 6.6(a). With three inputs the hyperplane can still be visualized. Figure 6.6(b) shows three dimensions for the three-input perceptron. The separating plane here is defined by the equation

$x_1 w_1 + x_2 w_2 + x_3 w_3 - \theta = 0$

But how does the perceptron learn its classification tasks?


The initial weights are randomly assigned, usually in the range [−0.5, 0.5]. The process of weight updating is particularly simple. If at iteration p the actual output is Y(p) and the desired output is Yd(p), then the error is given by

$e(p) = Y_d(p) - Y(p), \qquad p = 1, 2, 3, \ldots$

Iteration p here refers to the p-th training example presented to the perceptron.


If the error, e(p), is positive, we need to increase the perceptron output Y(p), but if it is negative, we need to decrease Y(p). Thus, the following perceptron learning rule can be established:

$w_i(p+1) = w_i(p) + \alpha \times x_i(p) \times e(p)$

where α is the learning rate, a positive constant less than unity.

The training algorithm:

Step 1: Initialization

Set initial weights w1, w2, …, wn and threshold θ to random numbers in the range [−0.5, 0.5].

Step 2: Activation

Activate the perceptron by applying the inputs x1(p), x2(p), …, xn(p) and the desired output Yd(p). Calculate the actual output at iteration p:

$Y(p) = \operatorname{step}\left[\sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta\right]$

where n is the number of the perceptron inputs, and step is a step activation function.

Step 3: Weight training

Update the weights of the perceptron:

$w_i(p+1) = w_i(p) + \Delta w_i(p)$

where Δwi(p) is the weight correction at iteration p. The weight correction is computed by the delta rule:

$\Delta w_i(p) = \alpha \times x_i(p) \times e(p)$

Step 4: Iteration

Increase iteration p by one, go back to Step 2 and repeat the process until convergence.
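A minimal Python sketch of this training algorithm, applied to the logical AND operation (the learning rate value and the treatment of the threshold as a weight with a fixed input of −1 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Training set for the logical AND operation
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Yd = np.array([0, 0, 0, 1])              # desired outputs

# Step 1: Initialization - weights and threshold in [-0.5, 0.5]
w = rng.uniform(-0.5, 0.5, size=2)
theta = rng.uniform(-0.5, 0.5)
alpha = 0.1                              # learning rate

for epoch in range(100):
    errors = 0
    for x, yd in zip(X, Yd):
        # Step 2: Activation - step function on the net input
        y = 1 if x @ w - theta >= 0 else 0
        # Step 3: Weight training - the delta rule
        e = yd - y
        w = w + alpha * x * e
        theta = theta + alpha * (-1) * e  # threshold acts as a weight with input -1
        errors += abs(e)
    # Step 4: Iteration - stop when a whole epoch produces no errors
    if errors == 0:
        break

print(w, theta)
```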


Can we train a perceptron to perform basic logical operations such as AND, OR or Exclusive-OR?

The truth tables for the operations AND, OR and Exclusive-OR are shown in Table 6.2.



The perceptron is activated by the sequence of four input patterns representing an epoch. The result is shown in Table 6.3.



In a similar manner, the perceptron can learn the operation OR. However, a single-layer perceptron cannot be trained to perform the operation Exclusive-OR. Figure 6.7 represents the AND, OR and Exclusive-OR functions as two-dimensional plots based on the values of the two inputs. The fact that a perceptron can learn only linearly separable functions is rather bad news, because there are not many such functions.






Can we do better by using a sigmoidal or linear element in place of the hard limiter?

The computational limitations of a perceptron were mathematically analysed in Minsky and Papert's famous book Perceptrons. They concluded that the limitations of a single-layer perceptron would also hold true for multilayer neural networks. This conclusion certainly did not encourage further research on artificial neural networks.


6.4 Multilayer neural networks

A multilayer perceptron is a feedforward neural network with one or more hidden layers. A multilayer perceptron with two hidden layers is shown in Figure 6.8.

Neurons in the hidden layer detect the features; the weights of the neurons represent the features hidden in the input pattern.

With one hidden layer, we can represent any continuous function of the input signals, and with two hidden layers even discontinuous functions can be represented.


Can a neural network include more than two hidden layers?

Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilize millions of neurons, but most practical applications use only three layers, because each additional layer increases the computational burden exponentially.


How do multilayer neural networks learn?

The most popular method is back-propagation. Only in the mid-1980s was the back-propagation learning algorithm rediscovered.


How can we assess the blame for an error and divide it among the contributing weights?

In a back-propagation neural network, the learning algorithm has two phases. First, a training input pattern is presented to the network input layer, and the network propagates it from layer to layer until the output layer generates the output pattern. If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer.

Each neuron computes the net weighted input as before:

$X = \sum_{i=1}^{n} x_i w_i - \theta$

Next, this input value is passed through the activation function:

$Y^{sigmoid} = \frac{1}{1 + e^{-X}}$

The output is bounded between 0 and 1.


What about the learning law used in the back-propagation networks?

Figure 6.9 shows the three-layer back-propagation neural network. The indices i, j and k here refer to neurons in the input, hidden and output layers, respectively.

To propagate error signals, we start at the output layer and work backward to the hidden layer. The error signal at the output of neuron k at iteration p is

$e_k(p) = y_{d,k}(p) - y_k(p)$

where yd,k(p) is the desired output of neuron k at iteration p.


In fact, the rule for updating weights at the output layer is similar to the perceptron learning rule of Eq. (6.7):

$w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$

where Δwjk(p) is the weight correction.


As we cannot apply input signal xi, what should we use instead?

We use the output of neuron j in the hidden layer, yj, instead of input xi:

$\Delta w_{jk}(p) = \alpha \times y_j(p) \times \delta_k(p)$

where δk(p) is the error gradient at neuron k in the output layer at iteration p.


What is the error gradient?

The error gradient is determined as the derivative of the activation function multiplied by the error at the neuron output. For neuron k in the output layer:

$\delta_k(p) = \frac{\partial y_k(p)}{\partial X_k(p)} \times e_k(p)$

where yk(p) is the output of neuron k at iteration p, and Xk(p) is the net weighted input to neuron k.


For a sigmoid function, δk(p) is equal to:

$\delta_k(p) = y_k(p) \times [1 - y_k(p)] \times e_k(p)$

where $e_k(p) = y_{d,k}(p) - y_k(p)$.

How can we determine the weight correction for a neuron in the hidden layer?

We can apply the same equation as for the output layer:

$\Delta w_{ij}(p) = \alpha \times x_i(p) \times \delta_j(p)$

where δj(p) represents the error gradient at neuron j in the hidden layer:

$\delta_j(p) = y_j(p) \times [1 - y_j(p)] \times \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$

where l is the number of neurons in the output layer;

$y_j(p) = \frac{1}{1 + e^{-X_j(p)}}, \qquad X_j(p) = \sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j$

and n is the number of neurons in the input layer.


Now we can derive the back-propagation training algorithm.

Step 1: Initialization

Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range:

$\left(-\frac{2.4}{F_i},\; +\frac{2.4}{F_i}\right)$

where Fi is the total number of inputs of neuron i in the network.

Step 2: Activation

Apply the input vector X(p) and the desired output vector Yd(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:

$y_j(p) = \operatorname{sigmoid}\left[\sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j\right]$

where n is the number of inputs of neuron j in the hidden layer.

(b) Calculate the actual outputs of the neurons in the output layer:

$y_k(p) = \operatorname{sigmoid}\left[\sum_{j=1}^{m} y_j(p)\, w_{jk}(p) - \theta_k\right]$

where m is the number of inputs of neuron k in the output layer.

Step 3: Weight training

(a) Calculate the error gradient for the neurons in the output layer:

$\delta_k(p) = y_k(p) \times [1 - y_k(p)] \times e_k(p)$

where $e_k(p) = y_{d,k}(p) - y_k(p)$

Calculate the weight corrections:

$\Delta w_{jk}(p) = \alpha \times y_j(p) \times \delta_k(p)$

Update the weights at the output neurons:

$w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$

(b) Calculate the error gradient for the neurons in the hidden layer:

$\delta_j(p) = y_j(p) \times [1 - y_j(p)] \times \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$

Calculate the weight corrections:

$\Delta w_{ij}(p) = \alpha \times x_i(p) \times \delta_j(p)$

Update the weights at the hidden neurons:

$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$

Step 4: Iteration

Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
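A compact Python sketch of these four steps for the Exclusive-OR network (the layer sizes, learning rate and stopping criterion are illustrative; training may occasionally stall in a local minimum, in which case a different random seed helps):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Exclusive-OR training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Yd = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, n_out, alpha = 2, 2, 1, 0.5

# Step 1: Initialization - uniform in (-2.4/Fi, +2.4/Fi)
W1 = rng.uniform(-2.4 / n_in, 2.4 / n_in, (n_in, n_hid))
W2 = rng.uniform(-2.4 / n_hid, 2.4 / n_hid, (n_hid, n_out))
th1 = rng.uniform(-2.4 / n_in, 2.4 / n_in, n_hid)
th2 = rng.uniform(-2.4 / n_hid, 2.4 / n_hid, n_out)

for epoch in range(20000):
    sse = 0.0
    for x, yd in zip(X, Yd):
        # Step 2: Activation - forward pass through both layers
        yh = sigmoid(x @ W1 - th1)       # hidden-layer outputs
        yo = sigmoid(yh @ W2 - th2)      # output-layer outputs
        e = yd - yo
        sse += float(e @ e)
        # Step 3: Weight training - back-propagate the error gradients
        dk = yo * (1 - yo) * e           # output-layer gradient
        dj = yh * (1 - yh) * (W2 @ dk)   # hidden-layer gradient
        W2 += alpha * np.outer(yh, dk)
        th2 += alpha * (-1) * dk
        W1 += alpha * np.outer(x, dj)
        th1 += alpha * (-1) * dj
    # Step 4: Iteration - stop when the sum of squared errors is small
    if sse < 0.001:
        break

print(epoch, sse)
```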


The example shown in Figure 6.10 illustrates the steps (pp. 180-182).


Why do we need to sum the squared errors?

The sum of the squared errors is a useful indicator of the network's performance. The back-propagation training algorithm attempts to minimize this criterion.


Figure 6.11 represents a learning curve: the sum of squared errors plotted versus the number of epochs used in training. The Exclusive-OR problem has been solved, and the results are shown in Table 6.4.



Can we now draw decision boundaries constructed by the multilayer network for operation Exclusive-OR?

It may be rather difficult to draw decision boundaries constructed by neurons with a sigmoid activation function. However, we can represent each neuron in the hidden and output layers by a McCulloch and Pitts model, using a sign function. The network in Figure 6.12 is also trained to perform the Exclusive-OR operation.


The operations of the decision boundaries constructed by neurons 3 and 4 in the
hidden layer are shown in
Figure 6.13(a)

and

(b)
, respectively. Neuron 5 in the output
l
ayer performs a linear combination of the decision boundaries formed by the two
hidden neurons, as shown in
Figure 6.13(c)
.


Is back-propagation learning a good method for machine learning?

Although widely used, back-propagation learning is not immune from problems. For example, the back-propagation learning algorithm does not seem to function in the biological world.

Another apparent problem is that the calculations are extensive and, as a result, training is slow. To improve computational efficiency, several accelerated algorithms have been proposed, as described below.


6.5 Accelerated learning in multilayer neural networks

A multilayer network, in general, learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent:

$Y^{tanh} = \frac{2a}{1 + e^{-bX}} - a$

where a and b are constants. Suitable values for a and b are: a = 1.716 and b = 0.667.


We can also accelerate training by including a momentum term in the delta rule:

$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \alpha \times y_j(p) \times \delta_k(p)$

where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95. Figure 6.14 represents learning with momentum for operation Exclusive-OR.
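In code, momentum only changes how each weight correction is computed; the toy sketch below (illustrative values; `correction` stands for the α·yj(p)·δk(p) term) shows how corrections pointing in the same direction accumulate:

```python
def momentum_update(w, dw_prev, correction, beta=0.95):
    """Generalized delta rule: dw(p) = beta * dw(p-1) + correction."""
    dw = beta * dw_prev + correction
    return w + dw, dw

# Repeated corrections of the same sign build up speed, so momentum
# effectively magnifies the learning rate on flat stretches of the error surface.
w, dw = 0.0, 0.0
for p in range(10):
    w, dw = momentum_update(w, dw, correction=0.01)
print(w, dw)
```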


In the delta and generalized delta rules, we use a constant and rather small value for the learning rate parameter, α. Can we increase this value to speed up training?

One of the most effective means to accelerate the convergence of back-propagation learning is to adjust the learning rate parameter during training. If the learning rate parameter, α, is made larger to speed up the training process, the resulting larger change in the weights may cause instability and, as a result, the network may become oscillatory. To accelerate the convergence, we apply two heuristics (sketched in code after the list):

1. If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.

2. If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.
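A minimal sketch of these two heuristics (the growth and shrink factors, 1.05 and 0.7, are illustrative values, not fixed by the text):

```python
def adapt_learning_rate(alpha, sse_history, grow=1.05, shrink=0.7):
    """Adjust alpha from the last two changes of the sum of squared errors."""
    if len(sse_history) < 3:
        return alpha
    d1 = sse_history[-1] - sse_history[-2]
    d2 = sse_history[-2] - sse_history[-3]
    if d1 * d2 > 0:            # same algebraic sign: steady progress, speed up
        return alpha * grow
    if d1 * d2 < 0:            # alternating sign: oscillation, slow down
        return alpha * shrink
    return alpha

# After each training epoch:
#   sse_history.append(sse)
#   alpha = adapt_learning_rate(alpha, sse_history)
```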

Figure 6.15 represents an example of back-propagation training with adaptive learning rate.


Learning rate adaptation can be used together with learning with momentum (as shown in Figure 6.16).




Can a neural network simulate associative characteristics of the human memory?

To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.


6.6 The Hopfield network

After applying a new input, the network output is calculated and fed back to adjust the input. Then the output is calculated again, and the process is repeated until the output becomes constant.

In the 1960s and 1970s, no one was able to predict which network would be stable, and some researchers were pessimistic about finding a solution at all. The problem was solved in 1982, when John Hopfield formulated the physical principle of storing information in a dynamically stable network.

Figure 6.17 shows a single-layer Hopfield network consisting of n neurons. The output of each neuron is fed back to the inputs of all other neurons (there is no self-feedback in the Hopfield network).



How does the activation function work here?

It works in a similar way to the sign function in Figure 6.4. The activation function used here is the saturated linear function (as shown in Figure 6.18).




For a single-layer n-neuron network, the state can be defined by the state vector

$Y = [y_1, y_2, \ldots, y_n]^T$

The synaptic weights between neurons are usually represented in matrix form as follows:

$W = \sum_{m=1}^{M} Y_m Y_m^T - M\,I$

where M is the number of states to be memorized by the network, Ym is the n-dimensional binary vector, I is an n × n identity matrix, and superscript T denotes matrix transposition.


Figure 6.19 shows a three-neuron network represented as a cube in three-dimensional space. In Figure 6.19, each state is represented by a vertex. When a new input vector is applied, the network moves from one state-vertex to another until it becomes stable.


What determines a stable state-vertex?

The stable state-vertex is determined by the weight matrix W, the current input vector X, and the threshold matrix θ. If the input vector is partially incorrect or incomplete, the initial state will converge into the stable state-vertex after a few iterations. (For example, two opposite states and the converging process are shown on pp. 191-192; Table 6.5 shows all possible inputs and the corresponding stable states.)




The training algorithm of the Hopfield network

Step 1: Storage

The n-neuron Hopfield network is required to store a set of M fundamental memories, Y1, Y2, …, YM. The synaptic weight from neuron i to neuron j is calculated as

$w_{ij} = \begin{cases} \sum_{m=1}^{M} y_{m,i}\, y_{m,j}, & i \neq j \\ 0, & i = j \end{cases}$

where ym,i and ym,j are the i-th and j-th elements of the fundamental memory Ym. The matrix form:

$W = \sum_{m=1}^{M} Y_m Y_m^T - M\,I$

where wij = wji.

Step 2: Testing

We need to confirm that the Hopfield network is capable of recalling all fundamental memories. That is, when a fundamental memory Ym is presented as input, the network must recall it:

$y_{m,i} = \operatorname{sign}\left(\sum_{j=1}^{n} w_{ij}\, y_{m,j} - \theta_i\right), \qquad i = 1, 2, \ldots, n;\; m = 1, 2, \ldots, M$

If all fundamental memories are recalled perfectly, we may proceed to the next step.

Step 3: Retrieval

Present an unknown n-dimensional vector (probe), X, to the network and retrieve a stable state. Typically, the probe is a corrupted or incomplete version of a fundamental memory, that is, X ≠ Ym, m = 1, 2, …, M.

(a) Initialize the retrieval algorithm of the Hopfield network by setting X(0) = X, and calculate the initial state for each neuron:

$y_i(0) = \operatorname{sign}\left(\sum_{j=1}^{n} w_{ij}\, x_j(0) - \theta_i\right), \qquad i = 1, 2, \ldots, n$

(b) Update the elements of the state vector, Y(p), according to the following rule:

$y_i(p+1) = \operatorname{sign}\left(\sum_{j=1}^{n} w_{ij}\, y_j(p) - \theta_i\right)$

Neurons for updating are selected asynchronously, that is, randomly and one at a time. Repeat the iteration until the state vector becomes unchanged, in other words, until a stable state is achieved.
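A compact Python sketch of storage and asynchronous retrieval, assuming bipolar ±1 states and zero thresholds (both common simplifications, not requirements of the text):

```python
import numpy as np

def hopfield_store(memories):
    """Step 1: W = sum over m of Ym Ym^T minus M*I (zero diagonal)."""
    n = memories.shape[1]
    return sum(np.outer(y, y) for y in memories) - len(memories) * np.eye(n)

def hopfield_retrieve(W, probe, rng, max_sweeps=100):
    """Step 3: asynchronous updates until the state stops changing."""
    y = probe.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):   # one neuron at a time, random order
            s = 1 if W[i] @ y >= 0 else -1
            if s != y[i]:
                y[i], changed = s, True
        if not changed:                     # stable state reached
            return y
    return y

rng = np.random.default_rng(1)
Y = np.array([[1, 1, 1, -1, -1], [-1, -1, -1, 1, 1]])  # fundamental memories
W = hopfield_store(Y)
probe = np.array([1, -1, 1, -1, -1])                   # corrupted first memory
print(hopfield_retrieve(W, probe, rng))                # recovers [1 1 1 -1 -1]
```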



The Hopfield network will always converge to a stable state if the retrieval is done asynchronously. However, this stable state does not necessarily represent one of the fundamental memories, and if it is a fundamental memory, it is not necessarily the closest one. An example of this storage problem is presented on p. 195.

Another problem is the storage capacity, or the largest number of fundamental memories that can be stored and retrieved correctly. Hopfield showed experimentally that the maximum number of fundamental memories that can be stored in the n-neuron recurrent network is limited by

$M_{max} = 0.15\,n$

where n is the number of neurons. If most of the fundamental memories are to be retrieved perfectly, the maximum number is

$M_{max} = \frac{n}{2 \ln n}$


It can be shown that to retrieve all the fundamental memories perfectly, their number must be halved:

$M_{max} = \frac{n}{4 \ln n}$

This is a major limitation of the Hopfield network.
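As a rough illustration of these bounds, take a network of n = 100 neurons: the experimental limit gives $M_{max} = 0.15 \times 100 = 15$; near-perfect retrieval allows $100/(2 \ln 100) \approx 100/9.2 \approx 11$ memories; and perfect retrieval of all memories allows only $100/(4 \ln 100) \approx 5$.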


Why can't the Hopfield network do this job?

The Hopfield network is a single-layer network, and thus the output pattern appears on the same set of neurons to which the input pattern was applied. To associate one memory with another, we need a recurrent neural network capable of accepting an input pattern on one set of neurons and producing a related, but different, output pattern on another set of neurons (a bidirectional associative memory).


6.7 Bidirectional associative memory

Bidirectional associative memory (BAM), first proposed by Bart Kosko, is a heteroassociative network. The basic BAM architecture is shown in Figure 6.20. It consists of two fully connected layers: an input layer and an output layer.


How does the BAM work?

The basic idea behind the BAM is to store pattern pairs so that when the n-dimensional vector X from set A is presented as input, the BAM recalls the m-dimensional vector Y from set B, but when Y is presented as input, the BAM recalls X.

To develop the BAM, we need to create a correlation matrix for each pattern pair we want to store. The correlation matrix is the matrix product of the input vector X and the transpose of the output vector Y, that is, X Y^T. The BAM weight matrix is the sum of all correlation matrices:

$W = \sum_{m=1}^{M} X_m Y_m^T$

where M is the number of pattern pairs to be stored in the BAM.


Like a Hopfield network, the BAM usually uses McCulloch and Pitts neurons with the sign activation function.

The training algorithm is presented as follows.

Step 1: Storage

The BAM is required to store M pairs of patterns. The weight matrix is

$W = \sum_{m=1}^{M} X_m Y_m^T$


Example





Step 2: Testing

The BAM should be able to receive any vector from set A and retrieve the associated vector from set B, and receive any vector from set B and retrieve the associated vector from set A. That is,

$Y_m = \operatorname{sign}(W^T X_m), \qquad m = 1, 2, \ldots, M$

and

$X_m = \operatorname{sign}(W\, Y_m), \qquad m = 1, 2, \ldots, M$

Example






Step 3: Retrieval

Present an unknown vector (probe) X to the BAM and retrieve a stored association. The probe may be a corrupted or incomplete version of a pattern from set A, that is,

$X \neq X_m, \qquad m = 1, 2, \ldots, M$

(a) Initialize the BAM retrieval algorithm by setting

X(0) = X, p = 0

and calculate the BAM output at iteration p:

$Y(p) = \operatorname{sign}[W^T X(p)]$

(b) Update the input vector X(p):

$X(p+1) = \operatorname{sign}[W\, Y(p)]$

and repeat the iteration until equilibrium, when the input and output vectors remain unchanged with further iterations.
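A minimal Python sketch of BAM storage and retrieval under the same bipolar sign convention (the pattern pairs here are illustrative):

```python
import numpy as np

sign = lambda v: np.where(v >= 0, 1, -1)

def bam_store(pairs):
    """Step 1: W = sum over pattern pairs of Xm Ym^T."""
    return sum(np.outer(x, y) for x, y in pairs)

def bam_retrieve(W, x, max_iter=100):
    """Step 3: bounce between the layers until equilibrium."""
    for _ in range(max_iter):
        y = sign(W.T @ x)            # Y(p) = sign[W^T X(p)]
        x_new = sign(W @ y)          # X(p+1) = sign[W Y(p)]
        if np.array_equal(x_new, x): # input and output no longer change
            return x, y
        x = x_new
    return x, y

pairs = [(np.array([1, -1, 1, -1]), np.array([1, 1, -1])),
         (np.array([-1, 1, -1, 1]), np.array([-1, -1, 1]))]
W = bam_store(pairs)
probe = np.array([1, -1, 1, 1])      # noisy version of the first X pattern
print(bam_retrieve(W, probe))        # recalls the stored (X, Y) pair
```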


The BAM is unconditionally stable. This means that any set of associations can be learned without risk of instability.

There is also a close relationship between the BAM and the Hopfield network. If the BAM weight matrix is square and symmetrical, then W = W^T. In this case, the BAM can be reduced to the autoassociative Hopfield network. Thus, the Hopfield network can be considered as a BAM special case.



The maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer. Another, even more serious, problem is incorrect convergence. The BAM may not always produce the closest association. In fact, a stable association may be only slightly related to the initial input vector.


Can a neural network learn without a 'teacher'?

So far we have considered supervised or active learning: learning with an external 'teacher', or a supervisor who presents a training set to the network.


In contrast to supervised learning, unsupervised or self-organised learning does not require an external teacher. Unsupervised learning tends to follow the neuro-biological organization of the brain.


Unsupervised learning algorithms aim to learn rapidly. In fact, self-organising neural networks learn much faster than back-propagation networks.


6.8 Self-organising neural networks

In this section, we consider Hebbian and competitive learning, which are based on self-organising networks.


6.8.1 Hebbian learning

In 1949, neuropsychologist Donald Hebb proposed one of the key ideas in biological learning (Hebb's Law). Hebb's Law states that if neuron i is near enough to excite neuron j and repeatedly participates in its activation, the synaptic connection between these two neurons is strengthened and neuron j becomes more sensitive to stimuli from neuron i. Hebb's Law can be stated as two rules:

1. If two neurons on either side of a connection are activated synchronously, then the weight of that connection is increased.

2. If two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased.

Hebb's Law provides the basis for learning without a teacher. Figure 6.21 shows Hebbian learning in a neural network. Using Hebb's Law, the adjustment applied to the weight wij at iteration p can be written in the form of the activity product rule:

$\Delta w_{ij}(p) = \alpha \times y_j(p) \times x_i(p)$

where α is the learning rate parameter.


When we use the step function (or other functions) with output range [0, 1], Hebbian learning implies that weights can only increase. In other words, Hebb's Law allows the strength of a connection to increase, but it does not provide a means to decrease the strength. We might impose a limit on the growth of synaptic weights. It can be done by introducing a non-linear forgetting factor into Hebb's Law:

$\Delta w_{ij}(p) = \alpha\, y_j(p)\, x_i(p) - \varphi\, y_j(p)\, w_{ij}(p)$

where φ is the forgetting factor.


What does a forgetting factor mean?

If the forgetting factor is 0, the neural network is capable only of strengthening its synaptic weights, and as a result, these weights grow towards infinity. On the other hand, if the forgetting factor is close to 1, the network remembers very little of what it learns. Therefore, a rather small forgetting factor should be chosen, typically between 0.01 and 0.1, to allow only a little 'forgetting' while limiting the weight growth.

Generalized activity product rule:

$\Delta w_{ij}(p) = \varphi\, y_j(p)\,[\lambda\, x_i(p) - w_{ij}(p)]$

where λ = α/φ.

If the presynaptic activity at iteration p, xi(p), is less than wij(p)/λ, then the weight wij(p+1) will decrease. On the other hand, if the presynaptic activity at iteration p, xi(p), is greater than wij(p)/λ, then the weight wij(p+1) will increase. So, the generalized Hebbian learning algorithm is as follows.

Step 1: Initialization

Set the initial synaptic weights and thresholds to small random values in the interval [0, 1]. Also assign small positive values to the learning rate parameter α and the forgetting factor φ.

Step 2: Activation

Compute the neuron output at iteration p:

$y_j(p) = \sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j$

where n is the number of neuron inputs and θj is the threshold value of neuron j.

Step 3: Learning

Update the weights in the network:

$w_{ij}(p+1) = w_{ij}(p) + \varphi\, y_j(p)\,[\lambda\, x_i(p) - w_{ij}(p)]$

Step 4: Iteration

Increase iteration p by one, go back to Step 2 and continue until the synaptic weights reach their steady-state values.
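A short Python sketch of this algorithm, assuming a step activation with outputs in {0, 1} and illustrative values for α and φ:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5                                 # five inputs and five outputs
alpha, phi = 0.1, 0.05                # learning rate and forgetting factor
lam = alpha / phi                     # lambda in the activity product rule

# Step 1: Initialization - small random weights and thresholds in [0, 1]
W = rng.uniform(0, 1, (n, n))
theta = rng.uniform(0, 1, n)

X = np.array([1, 0, 0, 0, 1], dtype=float)   # input vector (probe)

for p in range(200):
    # Step 2: Activation (step output in {0, 1})
    Y = np.where(X @ W - theta >= 0, 1.0, 0.0)
    # Step 3: Learning - generalized activity product rule
    # dw_ij = phi * y_j * (lam * x_i - w_ij)
    W += phi * (lam * np.outer(X, Y) - W * Y)
    # Step 4: Iteration - a fixed sweep count stands in for a steady-state test

print(np.round(W, 2))   # weights between co-active neurons grow towards lam
```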

Figure 6.22 illustrates Hebbian learning with an input vector (probe) X = [1 0 0 0 1]^T and final output Y = [0 1 0 0 1]^T.








Example:



6.8.2 Competitive learning

Another popular type of unsupervised learning is competitive learning. In competitive learning, neurons compete among themselves to be activated. The output neuron that wins the 'competition' is called the winner-take-all neuron. Teuvo Kohonen introduced a special class of artificial neural networks called self-organizing feature maps.


What is a self-organizing feature map?

Our brain is dominated by the cerebral cortex. The cortex is a self-organizing computational map in the human brain.


Can we model the self-organizing map?

Kohonen formulated the principle of topographic map formation. This principle states that the spatial location of an output neuron in the topographic map corresponds to a particular feature of the input pattern. Kohonen also proposed the feature-mapping model shown in Figure 6.23.










How close is 'close physical proximity'?

Generally, training in the Kohonen network begins with the winner's neighborhood of a fairly large size. Then, as training proceeds, the neighborhood size gradually decreases. The Kohonen network consists of a single layer of computation neurons, but it has two different types of connections. There are forward connections from the neurons in the input layer to the neurons in the output layer, and also lateral connections between neurons in the output layer, as shown in Figure 6.24. The neuron with the largest activation level among all neurons in the output layer becomes the winner.





What is the Mexican hat function?

The Mexican hat function shown in Figure 6.25 represents the relationship between the distance from the winner-take-all neuron and the strength of the connections within the Kohonen layer. According to this function, the near neighborhood has a strong excitatory effect, while the remote neighborhood has a mild inhibitory effect. Only the winning neuron and its neighborhood are allowed to learn.



What is the Euclidean distance?

The Euclidean distance between a pair of n-by-1 vectors X and Wj is defined by

$d = \|X - W_j\| = \left[\sum_{i=1}^{n} (x_i - w_{ij})^2\right]^{1/2}$

where xi and wij are the i-th elements of the vectors X and Wj, respectively.

Figure 6.26 clearly demonstrates that the smaller the Euclidean distance is, the greater will be the similarity between the vectors X and Wj.






Figure 6.26 Euclidean distance


(Example: see pp. 208-209)


The Kohonen competitive learning algorithm:

Step 1: Initialization

Set the initial synaptic weights to small random values, say in the interval [0, 1], and assign a small positive value to the learning rate parameter α.


Step 2: Activation and similarity matching

Activate the Kohonen network by applying the input vector X, and find the winner-take-all (best matching) neuron jX at iteration p, using the minimum-distance Euclidean criterion:

$j_X(p) = \min_j \|X - W_j(p)\| = \min_j \left[\sum_{i=1}^{n} \big(x_i - w_{ij}(p)\big)^2\right]^{1/2}, \qquad j = 1, 2, \ldots, m$

where n is the number of neurons in the input layer, and m is the number of neurons in the output or Kohonen layer.

Step 3: Learning

Update the synaptic weights:

$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$

where Δwij(p) is the weight correction determined by the competitive learning rule:

$\Delta w_{ij}(p) = \begin{cases} \alpha\,[x_i - w_{ij}(p)], & j \in \Lambda_j(p) \\ 0, & j \notin \Lambda_j(p) \end{cases}$

where α is the learning rate, and Λj(p) is the neighborhood function centred around the winner-take-all neuron jX at iteration p. The simple form of a neighborhood function is shown in Figure 6.27.



Step 4: Iteration

Increase iteration p by one, go back to Step 2 and continue until the minimum-distance Euclidean criterion is satisfied.
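A condensed Python sketch of the algorithm for a 10-by-10 lattice (the shrinking square neighborhood and the learning rate schedule are illustrative choices, not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(7)

rows = cols = 10
n_inputs = 2
W = rng.uniform(0, 1, (rows * cols, n_inputs))    # Step 1: small random weights
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])

X = rng.uniform(-1, 1, (1000, n_inputs))          # 1000 random 2-D input vectors

for p, x in enumerate(X):
    # Step 2: winner-take-all neuron by minimum Euclidean distance
    jx = np.argmin(np.linalg.norm(W - x, axis=1))
    # Shrinking neighborhood radius and decaying learning rate (illustrative)
    radius = max(1, int(5 * (1 - p / len(X))))
    alpha = 0.5 * (1 - p / len(X)) + 0.01
    # Step 3: update only the winner's neighborhood
    in_hood = np.abs(grid - grid[jx]).max(axis=1) <= radius
    W[in_hood] += alpha * (x - W[in_hood])
    # Step 4: iterate over the remaining input vectors

print(W[:5])   # the weight vectors spread out to cover the input square
```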

Figure 6.28 demonstrates the Kohonen network with 100 neurons arranged in the form of a two-dimensional lattice with 10 rows and 10 columns. The network is required to classify two-dimensional input vectors. The network is trained with 1000 two-dimensional input vectors generated randomly in a square region in the interval between −1 and +1. Each black dot represents the location of a neuron's two weights, w1j and w2j.
























Figure 6.29 illustrates the inputs X1, X2 and X3.







6.9 Summary

(pp. 212-215)