Neural Networks: A Pattern Recognition Perspective

Christopher M. Bishop
Neural Computing Research Group
Aston University, Birmingham, UK

January

Technical Report NCRG
Available from http://www.ncrg.aston.ac.uk/

To be published in: Fiesler, E. and Beale, R. (eds), Handbook of Neural Computation. New York: Oxford University Press; Bristol: IOP Publishing Ltd.
1 Introduction

Neural networks have been exploited in a wide variety of applications, the majority of which are concerned with pattern recognition in one form or another. However, it has become widely acknowledged that the effective solution of all but the simplest of such problems requires a principled treatment, in other words one based on a sound theoretical framework.

From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Lack of understanding of the basic principles of statistical pattern recognition lies at the heart of many of the common mistakes in the application of neural networks. In this chapter we aim to show that the 'black box' stigma of neural networks is largely unjustified, and that there is actually considerable insight available into the way in which neural networks operate, and how to use them effectively.
Some of the key points which are discussed in this chapter are as follows:

- Neural networks can be viewed as a general framework for representing non-linear mappings between multi-dimensional spaces, in which the form of the mapping is governed by a number of adjustable parameters. They therefore belong to a much larger class of such mappings, many of which have been studied extensively in other fields.

- Simple techniques for representing multivariate non-linear mappings in one or two dimensions (e.g. polynomials) rely on linear combinations of fixed basis functions (or 'hidden functions'). Such methods have severe limitations when extended to spaces of many dimensions, a phenomenon known as the curse of dimensionality. The key contribution of neural networks in this respect is that they employ basis functions which are themselves adapted to the data, leading to efficient techniques for multi-dimensional problems.

- The formalism of statistical pattern recognition, introduced briefly in Section 3, lies at the heart of a principled treatment of neural networks. Many of these topics are treated in standard texts on statistical pattern recognition, including Duda and Hart, Hand, Devijver and Kittler, and Fukunaga.

- Network training is usually based on the minimization of an error function. We show how error functions arise naturally from the principle of maximum likelihood, and how different choices of error function correspond to different assumptions about the statistical properties of the data. This allows the appropriate error function to be selected for a particular application.

- The statistical view of neural networks motivates specific forms for the activation functions which arise in network models. In particular, we see that the logistic sigmoid, often introduced by analogy with the mean firing rate of a biological neuron, is precisely the function which allows the activation of a unit to be given a particular probabilistic interpretation.

- Provided the error function and activation functions are correctly chosen, the outputs of a trained network can be given precise interpretations. For regression problems they approximate the conditional averages of the distribution of target data, while for classification problems they approximate the posterior probabilities of class membership. This demonstrates why neural networks can approximate the optimal solution to a regression or classification problem.

- Error back-propagation is introduced as a general framework for evaluating derivatives for feed-forward networks. The key feature of back-propagation is that it is computationally very efficient compared with a simple direct evaluation of derivatives. For network training algorithms this efficiency is crucial.

- The original learning algorithm for multi-layer feed-forward networks (Rumelhart et al.) was based on gradient descent. In fact the problem of optimizing the weights in a network corresponds to unconstrained non-linear optimization, for which many substantially more powerful algorithms have been developed.

- Network complexity, governed for example by the number of hidden units, plays a central role in determining the generalization performance of a trained network. This is illustrated using a simple curve-fitting example in one dimension.

These and many related issues are discussed at greater length in Bishop.
2 Classification and Regression

In this chapter we concentrate on the two most common kinds of pattern recognition problem. The first of these we shall refer to as regression, and is concerned with predicting the values of one or more continuous output variables, given the values of a number of input variables. Examples include the prediction of the temperature of a plasma given values for the intensity of light emitted at various wavelengths, or the estimation of the fraction of oil in a multi-phase pipeline given measurements of the absorption of gamma beams along various cross-sectional paths through the pipe. If we denote the input variables by a vector x with components x_i, where i = 1, ..., d, and the output variables by a vector y with components y_k, where k = 1, ..., c, then the goal of the regression problem is to find a suitable set of functions which map the x_i to the y_k.

The second kind of task we shall consider is called classification, and involves assigning input patterns to one of a set of discrete classes C_k, where k = 1, ..., c. An important example involves the automatic interpretation of hand-written digits (Le Cun et al.). Again, we can formulate a classification problem in terms of a set of functions which map inputs x_i to outputs y_k, where now the outputs specify which of the classes the input pattern belongs to. For instance, the input may be assigned to the class whose output value y_k is largest.

In general it will not be possible to determine a suitable form for the required mapping, except with the help of a data set of examples. The mapping is therefore modelled in terms of some mathematical function which contains a number of adjustable parameters, whose values are determined with the help of the data. We can write such functions in the form

    y_k = y_k(x; w)        (1)

where w denotes the vector of parameters (w_1, ..., w_W). A neural network model can be regarded simply as a particular choice for the set of functions y_k(x; w). In this case, the parameters comprising w are often called weights.

The importance of neural networks in this context is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables. The process of determining the values for these parameters on the basis of the data set is called learning or training, and for this reason the data set of examples is generally referred to as a training set. Neural network models, as well as many conventional approaches to statistical pattern recognition, can be viewed as specific choices for the functional forms used to represent the mapping, together with particular procedures for optimizing the parameters in the mapping. In fact, neural network models often contain conventional approaches (such as linear or logistic regression) as special cases.
2.1 Polynomial curve fitting

Many of the important issues concerning the application of neural networks can be introduced in the simpler context of curve fitting using polynomial functions. Here the problem is to fit a polynomial to a set of N data points by minimizing an error function. Consider the M-th order polynomial given by

    y(x) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j        (2)

This can be regarded as a non-linear mapping which takes x as input and produces y as output. The precise form of the function y(x) is determined by the values of the parameters w_0, ..., w_M, which are analogous to the weights in a neural network. It is convenient to denote the set of parameters (w_0, ..., w_M) by the vector w, in which case the polynomial can be written as a functional mapping of the form (1). Values for the coefficients can be found by minimization of an error function, as will be discussed in detail in Section 4. We shall give some examples of polynomial curve fitting in Section 6.
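As a concrete (and purely illustrative) sketch of this idea, an M-th order polynomial of the form (2) can be fitted by minimizing a sum-of-squares error with a standard linear least-squares solver; the helper names and the synthetic data below are assumptions made for the example, not part of the original text.

    import numpy as np

    def fit_polynomial(x, t, M):
        """Fit y(x) = sum_j w_j x^j by least squares.

        The polynomial is a linear function of the coefficients w, so
        minimizing the sum-of-squares error is a linear problem.
        """
        # Design matrix whose columns are the fixed basis functions x^0, ..., x^M.
        Phi = np.vander(x, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w

    def eval_polynomial(w, x):
        """Evaluate the fitted polynomial at the points x."""
        Phi = np.vander(x, len(w), increasing=True)
        return Phi @ w

    # Example: noisy samples of a smooth function, fitted with a cubic (M = 3).
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 11)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
    w = fit_polynomial(x, t, M=3)
    print(eval_polynomial(w, x))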
2.2 Why neural networks?

Pattern recognition problems, as we have already indicated, can be represented in terms of general parametrized non-linear mappings between a set of input variables and a set of output variables. A polynomial represents a particular class of mapping, for the case of one input and one output. Provided we have a sufficiently large number of terms in the polynomial, we can approximate a wide class of functions to arbitrary accuracy. This suggests that we could simply extend the concept of a polynomial to higher dimensions. Thus, for d input variables, and again one output variable, we could for instance consider a third-order polynomial of the form

    y = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1} + \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2} + \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} \sum_{i_3=1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3}        (3)

For an M-th order polynomial of this kind, the number of independent adjustable parameters would grow like d^M, which represents a dramatic growth in the number of degrees of freedom in the model as the dimensionality of the input space increases. This is an example of the 'curse of dimensionality' (Bellman). The presence of a large number of adaptive parameters in a model can cause major problems, as we shall discuss in Section 6. In order that the model make good predictions for new inputs it is necessary that the number of data points in the training set be much greater than the number of adaptive parameters. For medium to large applications, such a model would need huge quantities of training data in order to ensure that the parameters (in this case the coefficients in the polynomial) were well determined.
There are in fact many different ways in which to represent general non-linear mappings between multi-dimensional spaces. The importance of neural networks, and similar techniques, lies in the way in which they deal with the problem of scaling with dimensionality. In order to motivate neural network models it is convenient to represent the non-linear mapping function in terms of a linear combination of basis functions, sometimes also called 'hidden functions' or hidden units, z_j(x), so that

    y_k(x) = \sum_{j=0}^{M} w_{kj} z_j(x)        (4)

Here the basis function z_0 takes the fixed value 1 and allows a constant term in the expansion. The corresponding weight parameter w_{k0} is generally called a bias. Both the one-dimensional polynomial (2) and the multi-dimensional polynomial (3) can be cast in this form, in which the basis functions are fixed functions of the input variables.

We have seen from the example of the higher-order polynomial that, to represent general functions of many input variables, we have to consider a large number of basis functions, which in turn implies a large number of adaptive parameters. In most practical applications there will be significant correlations between the input variables, so that the effective dimensionality of the space occupied by the data (known as the intrinsic dimensionality) is significantly less than the number of inputs. The key to constructing a model which can take advantage of this phenomenon is to allow the basis functions themselves to be adapted to the data as part of the training process. In this case the number of such functions only needs to grow as the complexity of the problem itself grows, and not simply as the number of input variables grows. The number of free parameters in such models, for a given number of hidden functions, typically only grows linearly (or quadratically) with the dimensionality of the input space, as compared with the d^M growth for a general M-th order polynomial.

One of the simplest, and most commonly encountered, models with adaptive basis functions is given by the two-layer feed-forward network, sometimes called a multi-layer perceptron, which can be expressed in the form (4) in which the basis functions themselves contain adaptive parameters and are given by

    z_j(x) = g\left( \sum_{i=0}^{d} w_{ji} x_i \right)        (5)

where the w_{j0} are bias parameters, and we have introduced an extra 'input' variable x_0 = 1 in order to allow the biases to be treated on the same footing as the other parameters and hence be absorbed into the summation in (5). The function g(·) is called an activation function and must be a non-linear function of its argument in order that the network model can have general approximation capabilities. If g(·) were linear, then (4) would reduce to the composition of two linear mappings, which would itself be linear. The activation function is also chosen to be a differentiable function of its argument in order that the network parameters can be optimized using gradient-based methods, as discussed in Section 5. Many different forms of activation function can be considered. However, the most common are sigmoidal (meaning 'S'-shaped), and include the logistic sigmoid

    g(a) = \frac{1}{1 + \exp(-a)}        (6)

which is plotted in Figure 1. The motivation for this form of activation function is considered in Section 4.2. We can combine (4) and (5) to obtain a complete expression for the function represented by a two-layer feed-forward network in the form

    y_k(x) = \sum_{j=0}^{M} w_{kj} \, g\left( \sum_{i=0}^{d} w_{ji} x_i \right)        (7)

The form of network mapping given by (7) is appropriate for regression problems, but needs some modification for classification applications, as will also be discussed in Section 4.2.
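The two-layer network mapping (7) translates directly into code. The following is a minimal, illustrative forward pass only; the weight layout (biases stored in the first column) and the use of NumPy are assumptions made for this sketch rather than part of the original report.

    import numpy as np

    def sigmoid(a):
        # Logistic sigmoid activation, equation (6).
        return 1.0 / (1.0 + np.exp(-a))

    def two_layer_forward(x, W1, W2):
        """Forward pass of a two-layer feed-forward network, equation (7).

        x  : input vector of length d
        W1 : first-layer weights, shape (M, d + 1); column 0 holds the biases w_j0
        W2 : second-layer weights, shape (c, M + 1); column 0 holds the biases w_k0
        """
        x_ext = np.concatenate(([1.0], x))   # extra input x_0 = 1 absorbs the biases
        z = sigmoid(W1 @ x_ext)              # hidden-unit activations z_j, equation (5)
        z_ext = np.concatenate(([1.0], z))   # extra hidden unit z_0 = 1
        y = W2 @ z_ext                       # linear outputs, suitable for regression
        return y

    # Example with d = 3 inputs, M = 4 hidden units and c = 2 outputs.
    rng = np.random.default_rng(1)
    W1 = rng.normal(scale=0.5, size=(4, 4))
    W2 = rng.normal(scale=0.5, size=(2, 5))
    print(two_layer_forward(np.array([0.2, -1.0, 0.5]), W1, W2))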
Figure 1: Plot of the logistic sigmoid activation function given by (6).

Figure 2: An example of a feed-forward network having two layers of adaptive weights.

It should be noted that models of this kind, with basis functions which are adapted to the data, are not unique to neural networks. Such models have been considered for many years in the statistics literature, and include, for example, projection pursuit regression (Friedman and Stuetzle; Huber), which has a form remarkably similar to that of the feed-forward network discussed above. The procedures for determining the parameters in projection pursuit regression are, however, quite different from those generally used for feed-forward networks.
It is often useful to represent the network mapping function in terms of a network diagram, as shown in Figure 2. Each element of the diagram represents one of the terms in the corresponding mathematical expression. The bias parameters in the first layer are shown as weights from an extra input having a fixed value of x_0 = 1. Similarly, the bias parameters in the second layer are shown as weights from an extra hidden unit, with activation again fixed at z_0 = 1.

More complex forms of feed-forward network function can be considered, corresponding to more complex topologies of network diagram. However, the simple structure of Figure 2 has the property that it can approximate any continuous mapping to arbitrary accuracy, provided the number M of hidden units is sufficiently large. This property has been discussed by many authors, including Funahashi, Hecht-Nielsen, Cybenko, Hornik et al., Stinchcombe and White, Cotter, Ito, Hornik, and Kreinovich. A proof that two-layer networks having sigmoidal hidden units can simultaneously approximate both a function and its derivatives was given by Hornik et al.
The other major class of network model, which also possesses universal approximation capabilities, is the radial basis function network (Broomhead and Lowe; Moody and Darken). Such networks again take the form (4), but the basis functions now depend on some measure of distance between the input vector x and a prototype vector μ_j. A typical example would be a Gaussian basis function of the form

    z_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2 \sigma_j^2} \right)        (8)

where the parameter σ_j controls the width of the basis function. Training of radial basis function networks usually involves a two-stage procedure in which the basis functions are first optimized using input data alone, and then the parameters w_{kj} in (4) are optimized by error function minimization. Such procedures are described in detail in Bishop.
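For comparison with the sigmoidal hidden units above, a Gaussian basis-function layer of the form (8) can be evaluated as follows. This is a minimal sketch only (the prototype vectors and widths are assumed to be given), not a complete two-stage RBF training procedure.

    import numpy as np

    def rbf_design_matrix(X, centres, widths):
        """Gaussian basis functions z_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)).

        X       : data matrix, shape (N, d)
        centres : prototype vectors mu_j, shape (M, d)
        widths  : widths sigma_j, shape (M,)
        Returns an (N, M) matrix of basis-function activations.
        """
        # Squared distances between every input and every prototype vector.
        sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2.0 * widths ** 2))

    # The second-layer weights could then be fitted to these fixed activations
    # by linear least squares, in the same way as the polynomial example above.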
3 Statistical pattern recognition

We turn now to some of the formalism of statistical pattern recognition, which we regard as essential for a clear understanding of neural networks. For convenience we introduce many of the central concepts in the context of classification problems, although much the same ideas apply also to regression. The goal is to assign an input pattern x to one of c classes C_k, where k = 1, ..., c. In the case of hand-written digit recognition, for example, we might have ten classes corresponding to the ten digits. One of the powerful results of the theory of statistical pattern recognition is a formalism which describes the theoretically best achievable performance, corresponding to the smallest probability of misclassifying a new input pattern. This provides a principled context within which we can develop neural networks, and other techniques, for classification.

For any but the simplest of classification problems it will not be possible to devise a system which is able to give perfect classification of all possible input patterns. The problem arises because many input patterns cannot be assigned unambiguously to one particular class. Instead, the most general description we can give is in terms of the probabilities of belonging to each of the classes C_k given an input vector x. These probabilities are written as P(C_k | x), and are called the posterior probabilities of class membership, since they correspond to the probabilities after we have observed the input pattern x. If we consider a large set of patterns all from a particular class C_k, then we can consider the probability distribution of the corresponding input patterns, which we write as p(x | C_k). These are called the class-conditional distributions and, since the vector x is a continuous variable, they correspond to probability density functions rather than probabilities. The distribution of input vectors, irrespective of their class labels, is written as p(x) and is called the unconditional distribution of inputs. Finally, we can consider the probabilities of occurrence of the different classes, irrespective of the input pattern, which we write as P(C_k). These correspond to the relative frequencies of patterns within the complete data set, and are called prior probabilities since they correspond to the probabilities of membership of each of the classes before we observe a particular input vector.
These various probabilities can be related using two standard results from probability theory. The first is the product rule, which takes the form

    P(C_k, x) = P(C_k | x) \, p(x)        (9)

and the second is the sum rule, given by

    \sum_k P(C_k, x) = p(x)        (10)

From these rules we obtain the following relation

    P(C_k | x) = \frac{p(x | C_k) \, P(C_k)}{p(x)}        (11)

which is known as Bayes' theorem. The denominator in (11) is given by

    p(x) = \sum_k p(x | C_k) \, P(C_k)        (12)

and plays the role of a normalizing factor, ensuring that the posterior probabilities in (11) sum to one: \sum_k P(C_k | x) = 1. As we shall see shortly, knowledge of the posterior probabilities allows us to find the optimal solution to a classification problem. A key result, discussed in Section 4, is that under suitable circumstances the outputs of a correctly trained neural network can be interpreted as (approximations to) the posterior probabilities P(C_k | x) when the vector x is presented to the inputs of the network.
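A minimal numerical sketch of Bayes' theorem (11) for two classes is given below; the particular class-conditional densities and prior probabilities are invented purely for illustration.

    import numpy as np
    from scipy.stats import norm

    # Assumed (illustrative) one-dimensional class-conditional densities and priors.
    priors = np.array([0.6, 0.4])                    # P(C_1), P(C_2)
    likelihoods = [norm(loc=0.0, scale=1.0).pdf,     # p(x | C_1)
                   norm(loc=2.0, scale=1.0).pdf]     # p(x | C_2)

    def posteriors(x):
        """Posterior probabilities P(C_k | x) from Bayes' theorem (11)."""
        joint = np.array([lik(x) * prior for lik, prior in zip(likelihoods, priors)])
        return joint / joint.sum()   # normalize by p(x) = sum_k p(x|C_k) P(C_k)

    print(posteriors(1.0))   # the two posteriors sum to one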
As we have already noted, perfect classification of all possible input vectors will, in general, be impossible. The best we can do is to minimize the probability that an input will be misclassified. This is achieved by assigning each new input vector x to that class for which the posterior probability P(C_k | x) is largest. Thus an input vector x is assigned to class C_k if

    P(C_k | x) > P(C_j | x)    for all j ≠ k        (13)

We shall see the justification for this rule shortly. Since the denominator in Bayes' theorem (11) is independent of the class, we see that this is equivalent to assigning input patterns to class C_k provided

    p(x | C_k) P(C_k) > p(x | C_j) P(C_j)    for all j ≠ k        (14)

A pattern classifier provides a rule for assigning each point of feature space to one of c classes. We can therefore regard the feature space as being divided up into c decision regions R_1, ..., R_c, such that a point falling in region R_k is assigned to class C_k. Note that each of these regions need not be contiguous, but may itself be divided into several disjoint regions, all of which are associated with the same class. The boundaries between these regions are known as decision surfaces or decision boundaries.
In order to find the optimal criterion for placement of decision boundaries, consider the case of a one-dimensional feature space x and two classes C_1 and C_2. We seek a decision boundary which minimizes the probability of misclassification, as illustrated in Figure 3. A misclassification error will occur if we assign a new pattern to class C_1 when in fact it belongs to class C_2, or vice versa. We can calculate the total probability of an error of either kind by writing (Duda and Hart)

    P(error) = P(x ∈ R_2, C_1) + P(x ∈ R_1, C_2)
             = P(x ∈ R_2 | C_1) P(C_1) + P(x ∈ R_1 | C_2) P(C_2)
             = \int_{R_2} p(x | C_1) P(C_1) \, dx + \int_{R_1} p(x | C_2) P(C_2) \, dx        (15)

where P(x ∈ R_1, C_2) is the joint probability of x being assigned to class C_1 (i.e. of x falling in region R_1) while the true class is C_2. From (15) we see that, if p(x | C_1) P(C_1) > p(x | C_2) P(C_2) for a given x, we should choose the regions R_1 and R_2 such that x is in R_1, since this gives a smaller contribution to the error. We recognise this as the decision rule (14) for minimizing the probability of misclassification. The same result can be seen graphically in Figure 3, in which misclassification errors arise from the shaded region. By choosing the decision boundary to coincide with the value of x at which the two distributions cross (shown by the arrow), we minimize the area of the shaded region and hence minimize the probability of misclassification. This corresponds to classifying each new pattern x using (13), which is equivalent to assigning each pattern to the class having the largest posterior probability. A similar justification for this decision rule may be given for the general case of c classes and d-dimensional feature vectors (Duda and Hart).
It is important to distinguish between two separate stages in the classification process. The first is inference, whereby data is used to determine values for the posterior probabilities. These are then used in the second stage, which is decision making, in which those probabilities are used to make decisions such as assigning a new data point to one of the possible classes.

Figure 3: Schematic illustration of the joint probability densities, given by p(x, C_k) = p(x | C_k) P(C_k), as a function of a feature value x, for two classes C_1 and C_2. If the vertical line is used as the decision boundary, then the classification errors arise from the shaded region. By placing the decision boundary at the point where the two probability density curves cross (shown by the arrow), the probability of misclassification is minimized.
So far we have based classification decisions on the goal of minimizing the probability of misclassification. In many applications this may not be the most appropriate criterion. Consider, for instance, the task of classifying images used in medical screening into two classes corresponding to 'normal' and 'tumour'. There may be much more serious consequences if we classify an image of a tumour as normal than if we classify a normal image as that of a tumour. Such effects may easily be taken into account by the introduction of a loss matrix with elements L_{kj} specifying the penalty associated with assigning a pattern to class C_j when in fact it belongs to class C_k. The overall expected loss is minimized if, for each input x, the decision regions R_j are chosen such that x ∈ R_j when

    \sum_{k=1}^{c} L_{kj} \, p(x | C_k) P(C_k) < \sum_{k=1}^{c} L_{ki} \, p(x | C_k) P(C_k)    for all i ≠ j        (16)

which represents a generalization of the usual decision rule for minimizing the probability of misclassification. Note that, if we assign a loss of 1 if the pattern is placed in the wrong class, and a loss of 0 if it is placed in the correct class, so that L_{kj} = 1 - δ_{kj} (where δ_{kj} is the Kronecker delta symbol), then (16) reduces to the decision rule (14) for minimizing the probability of misclassification.
Another powerful consequence of knowing posterior probabilities is that it becomes possible to introduce a reject criterion. In general we expect most of the misclassification errors to occur in those regions of x-space where the largest of the posterior probabilities is relatively low, since there is then a strong overlap between different classes. In some applications it may be better not to make a classification decision in such cases. This leads to the following procedure

    if  \max_k P(C_k | x) ≥ θ,   then classify x;
    otherwise,                   reject x        (17)

where θ is a threshold in the range (0, 1). The larger the value of θ, the fewer points will be classified. For the medical classification problem, for example, it may be better not to rely on an automatic classification system in doubtful cases, but to have these classified instead by a human expert.
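The reject rule (17) amounts to only a few lines of code; the threshold value used here is an arbitrary illustration.

    import numpy as np

    def classify_with_reject(posterior, theta=0.8):
        """Assign to the class with the largest posterior probability, or reject
        if that posterior falls below the threshold theta (0 <= theta <= 1)."""
        k = int(np.argmax(posterior))
        if posterior[k] >= theta:
            return k        # classify as class k
        return None         # reject: defer the decision (e.g. to a human expert)

    print(classify_with_reject(np.array([0.55, 0.45])))   # rejected -> None
    print(classify_with_reject(np.array([0.95, 0.05])))   # classified -> 0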
Yet another application for the posterior probabilities arises when the distributions of patterns between the classes, corresponding to the prior probabilities P(C_k), are strongly mismatched between the training set and the operating environment. If we know the posterior probabilities corresponding to the data in the training set, it is then a simple matter to use Bayes' theorem (11) to make the necessary corrections. This is achieved by dividing the posterior probabilities by the prior probabilities corresponding to the training set, multiplying them by the new prior probabilities, and then normalizing the results. Changes in the prior probabilities can therefore be accommodated without re-training the network. The prior probabilities for the training set may be estimated simply by evaluating the fraction of the training set data points in each class. Prior probabilities corresponding to the operating environment can often be obtained very straightforwardly, since only the class labels are needed and no input data is required. As an example, consider again the problem of classifying medical images into 'normal' and 'tumour'. When used for screening purposes, we would expect a very small prior probability of 'tumour'. To obtain a good variety of tumour images in the training set would therefore require huge numbers of training examples. An alternative is to increase artificially the proportion of tumour images in the training set, and then to compensate for the different priors on the test data, as described above. The prior probabilities for tumours in the general population can be obtained from medical statistics, without having to collect the corresponding images. Correction of the network outputs is then a simple matter of multiplication and division.
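The compensation for mismatched priors described above reduces to a few array operations, sketched here under the assumption that the network outputs already approximate the training-set posterior probabilities; the numbers are illustrative only.

    import numpy as np

    def adjust_for_new_priors(net_outputs, train_priors, new_priors):
        """Correct posterior estimates for a change in class priors.

        Divide by the training-set priors, multiply by the new priors,
        then renormalize so the corrected posteriors sum to one.
        """
        corrected = net_outputs * (new_priors / train_priors)
        return corrected / corrected.sum()

    # Example: a screening application where 'tumour' is rare in the field
    # but was deliberately over-represented in the training set.
    print(adjust_for_new_priors(np.array([0.7, 0.3]),      # network outputs (normal, tumour)
                                np.array([0.5, 0.5]),      # priors used in training
                                np.array([0.99, 0.01])))   # priors in the operating environment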
The most common approach to the use of neural networks for classification involves having the network itself directly produce the classification decision. As we have seen, knowledge of the posterior probabilities is substantially more powerful.
4 Error Functions

We turn next to the problem of determining suitable values for the weight parameters w in a network. Training data is provided in the form of N pairs of input vectors x^n and corresponding desired output vectors t^n, where n = 1, ..., N labels the patterns. These desired outputs are called target values in the neural network context, and the components t_k^n of t^n represent the targets for the corresponding network outputs y_k. For associative prediction problems of the kind we are considering, the most general and complete description of the statistical properties of the data is given in terms of the conditional density of the target data, p(t | x), conditioned on the input data.

A principled way to devise an error function is to use the concept of maximum likelihood. For a set of training data {x^n, t^n}, the likelihood can be written as

    L = \prod_n p(t^n | x^n)        (18)

where we have assumed that each data point (x^n, t^n) is drawn independently from the same distribution, so that the likelihood for the complete data set is given by the product of the probabilities for each data point separately. Instead of maximizing the likelihood, it is generally more convenient to minimize the negative logarithm of the likelihood. These are equivalent procedures, since the negative logarithm is a monotonic function. We therefore minimize

    E = -\ln L = -\sum_n \ln p(t^n | x^n)        (19)

where E is called an error function. We shall further assume that the distributions of the individual target variables t_k, where k = 1, ..., c, are independent, so that we can write

    p(t | x) = \prod_{k=1}^{c} p(t_k | x)        (20)

As we shall see, a feed-forward neural network can be regarded as a framework for modelling the conditional probability density p(t | x). Different choices of error function then arise from different assumptions about the form of the conditional distribution p(t | x). It is convenient to discuss error functions for regression and classification problems separately.
4.1 Error functions for regression

For regression problems, the output variables are continuous. To define a specific error function we must make some choice for the model of the distribution of target data. The simplest assumption is to take this distribution to be Gaussian. More specifically, we assume that the target variable t_k is given by some deterministic function of x with added Gaussian noise ε_k, so that

    t_k = h_k(x) + ε_k        (21)

We then assume that the errors ε_k have a normal distribution with zero mean and a standard deviation σ which does not depend on x or on k. Thus, the distribution of ε_k is given by

    p(ε_k) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{\varepsilon_k^2}{2\sigma^2} \right)        (22)

We now model the functions h_k(x) by a neural network with outputs y_k(x; w), where w is the set of weight parameters governing the neural network mapping. Using (21) and (22), we see that the probability distribution of target variables is given by

    p(t_k | x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{\{ y_k(x; w) - t_k \}^2}{2\sigma^2} \right)        (23)

where we have replaced the unknown function h_k(x) by our model y_k(x; w). Together with (19) and (20), this leads to the following expression for the error function

    E = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{c} \{ y_k(x^n; w) - t_k^n \}^2 + N c \ln \sigma + \frac{N c}{2} \ln 2\pi        (24)

We note that, for the purposes of error minimization, the second and third terms on the right-hand side of (24) are independent of the weights w and so can be omitted. Similarly, the overall factor of 1/σ² in the first term can also be omitted. We then finally obtain the familiar expression for the sum-of-squares error function

    E = \frac{1}{2} \sum_{n=1}^{N} \| y(x^n; w) - t^n \|^2        (25)
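As a direct (illustrative) rendering, the sum-of-squares error (25) over a whole training set can be computed as follows; the array shapes are assumptions for the sketch.

    import numpy as np

    def sum_of_squares_error(Y, T):
        """Sum-of-squares error E = 1/2 * sum_n ||y(x^n; w) - t^n||^2, equation (25).

        Y : network outputs for all patterns, shape (N, c)
        T : target values,                    shape (N, c)
        """
        return 0.5 * np.sum((Y - T) ** 2)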
Note that models of the form (4), with fixed basis functions, are linear functions of the parameters w, and so (25) is a quadratic function of w. This means that the minimum of E can be found in terms of the solution of a set of linear algebraic equations. For this reason, the process of determining the parameters in such models is extremely fast. Functions which depend linearly on the adaptive parameters are called linear models, even though they may be non-linear functions of the input variables. If the basis functions themselves contain adaptive parameters, we have to address the problem of minimizing an error function which is generally highly non-linear.

The sum-of-squares error function was derived from the requirement that the network output vector should represent the conditional mean of the target data, as a function of the input vector. It is easily shown (Bishop) that minimization of this error, for an infinitely large data set and a highly flexible network model, does indeed lead to a network satisfying this property.

We have derived the sum-of-squares error function on the assumption that the distribution of the target data is Gaussian. For some applications such an assumption may be far from valid (if the distribution is multi-modal, for instance), in which case the use of a sum-of-squares error function can lead to extremely poor results. Examples of such distributions arise frequently in inverse problems, such as robot kinematics, the determination of spectral line parameters from the spectrum itself, or the reconstruction of spatial data from line-of-sight information. One general approach in such cases is to combine a feed-forward network with a Gaussian mixture model (i.e. a linear combination of Gaussian functions), thereby allowing general conditional distributions p(t | x) to be modelled (Bishop).
4.2 Error functions for classification

In the case of classification problems, the goal, as we have seen, is to approximate the posterior probabilities of class membership P(C_k | x) given the input pattern x. We now show how to arrange for the outputs of a network to approximate these probabilities.

First we consider the case of two classes C_1 and C_2. In this case we can consider a network having a single output y, which we shall take to represent the posterior probability P(C_1 | x) for class C_1. The posterior probability of class C_2 will then be given by P(C_2 | x) = 1 - y. To achieve this we consider a target coding scheme for which t = 1 if the input vector belongs to class C_1, and t = 0 if it belongs to class C_2. We can combine these into a single expression, so that the probability of observing either target value is

    p(t | x) = y^t (1 - y)^{1-t}        (26)

which is a particular case of the binomial distribution called the Bernoulli distribution. With this interpretation of the output unit activation, the likelihood of observing the training data set, assuming the data points are drawn independently from this distribution, is then given by

    \prod_n (y^n)^{t^n} (1 - y^n)^{1 - t^n}        (27)

As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield; Baum and Wilczek; Solla et al.; Hinton; Hampshire and Pearlmutter) in the form

    E = -\sum_n \left\{ t^n \ln y^n + (1 - t^n) \ln(1 - y^n) \right\}        (28)
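A direct, illustrative implementation of the two-class cross-entropy error (28) follows; the small clipping constant is a numerical safeguard assumed for the sketch rather than part of the derivation.

    import numpy as np

    def cross_entropy_error(y, t, eps=1e-12):
        """Two-class cross-entropy E = -sum_n { t^n ln y^n + (1 - t^n) ln(1 - y^n) }.

        y : network outputs in (0, 1), shape (N,)
        t : binary targets (0 or 1),   shape (N,)
        eps guards the logarithms against outputs that are exactly 0 or 1.
        """
        y = np.clip(y, eps, 1.0 - eps)
        return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))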
For the network model introduced in Section 2.2, the outputs were linear functions of the activations of the hidden units. While this is appropriate for regression problems, we need to consider the correct choice of output unit activation function for the case of classification problems. We shall assume (Rumelhart et al.) that the class-conditional distributions of the outputs of the hidden units, represented here by the vector z, are described by

    p(z | C_k) = \exp\left\{ A(\theta_k) + B(z, \phi) + \theta_k^{\mathrm T} z \right\}        (29)

which is a member of the exponential family of distributions (which includes many of the common distributions as special cases, such as Gaussian, binomial, Bernoulli, Poisson, and so on). The parameters θ_k and φ control the form of the distribution. In writing (29) we are implicitly assuming that the distributions differ only in the parameters θ_k and not in φ. An example would be two Gaussian distributions with different means, but with common covariance matrices. (Note that the decision boundaries will then be linear functions of z, but will of course be non-linear functions of the input variables, as a consequence of the non-linear transformation by the hidden units.)

Using Bayes' theorem, we can write the posterior probability for class C_1 in the form

    P(C_1 | z) = \frac{p(z | C_1) P(C_1)}{p(z | C_1) P(C_1) + p(z | C_2) P(C_2)} = \frac{1}{1 + \exp(-a)}        (30)

which is a logistic sigmoid function, in which

    a = \ln \frac{p(z | C_1) P(C_1)}{p(z | C_2) P(C_2)}        (31)

Using (29) we can write this in the form

    a = w^{\mathrm T} z + w_0        (32)
where we have defined

    w = \theta_1 - \theta_2,    w_0 = A(\theta_1) - A(\theta_2) + \ln \frac{P(C_1)}{P(C_2)}        (33)

Figure 4: Plots of the class-conditional densities used to generate a data set to demonstrate the interpretation of network outputs as posterior probabilities. The training data set was generated from these densities, using equal prior probabilities.
Thus the network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit. (Incidentally, it is clear that we can also apply the above arguments to the activations of hidden units in a network. Provided such units use logistic sigmoid activation functions, we can interpret their outputs as probabilities of the presence of the corresponding 'features' conditioned on the inputs to the units.)

As a simple illustration of the interpretation of network outputs as probabilities, we consider a two-class problem with one input variable, in which the class-conditional densities are given by the Gaussian mixture functions shown in Figure 4. A feed-forward network with sigmoidal hidden units, and one output unit having a logistic sigmoid activation function, was trained by minimizing a cross-entropy error using the BFGS quasi-Newton algorithm (Section 5). The resulting network mapping function is shown, along with the true posterior probability calculated using Bayes' theorem, in Figure 5.
For the case of more than two classes, we consider a network with one output for each class, so that each output represents the corresponding posterior probability. First of all we choose the target values for network training according to a 1-of-c coding scheme, so that t_k^n = δ_{kl} for a pattern n from class C_l. We wish to arrange for the probability of observing the set of target values t_k^n, given an input vector x^n, to be given by the corresponding network output, so that p(C_l | x) = y_l. The value of the conditional distribution for this pattern can therefore be written as

    p(t^n | x^n) = \prod_{k=1}^{c} (y_k^n)^{t_k^n}        (34)

If we form the likelihood function and take the negative logarithm as before, we obtain an error function of the form

    E = -\sum_n \sum_{k=1}^{c} t_k^n \ln y_k^n        (35)
Figure 5: The result of training a multi-layer perceptron on data generated from the density functions in Figure 4. The solid curve shows the output of the trained network as a function of the input variable x, while the dashed curve shows the true posterior probability P(C_1 | x) calculated from the class-conditional densities using Bayes' theorem.
Again we must seek the appropriate output-unit activation function to match this choice of error function. As before, we shall assume that the activations of the hidden units are distributed according to (29). From Bayes' theorem, the posterior probability of class C_k is given by

    p(C_k | z) = \frac{p(z | C_k) P(C_k)}{\sum_{k'} p(z | C_{k'}) P(C_{k'})}        (36)

Substituting (29) into (36) and re-arranging, we obtain

    p(C_k | z) = y_k = \frac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})}        (37)

where

    a_k = w_k^{\mathrm T} z + w_{k0}        (38)

and we have defined

    w_k = \theta_k,    w_{k0} = A(\theta_k) + \ln P(C_k)        (39)

The activation function (37) is called a softmax function or normalized exponential. It has the properties that 0 ≤ y_k ≤ 1 and \sum_k y_k = 1, as required for probabilities.
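The softmax (normalized exponential) of (37) is commonly implemented with the maximum activation subtracted before exponentiation; that stabilization is an implementation detail assumed here, not part of the derivation above.

    import numpy as np

    def softmax(a):
        """Softmax outputs y_k = exp(a_k) / sum_k' exp(a_k'), equation (37).

        Subtracting max(a) leaves the result unchanged but avoids overflow.
        """
        e = np.exp(a - np.max(a))
        return e / e.sum()

    y = softmax(np.array([2.0, 1.0, -0.5]))
    print(y, y.sum())   # components lie in [0, 1] and sum to one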
It is easily verified (Bishop) that the minimization of the error function (35), for an infinite data set and a highly flexible network function, indeed leads to network outputs which represent the posterior probabilities, for any input vector x.

Note that the outputs of the trained network need not be close to 0 or 1 if the class-conditional density functions are overlapping. Heuristic procedures, such as applying extra training using those patterns which fail to generate outputs close to the target values, will be counter-productive, since this alters the distributions and makes it less likely that the network will generate the correct Bayesian probabilities.
5 Error back-propagation

Using the principle of maximum likelihood, we have formulated the problem of learning in neural networks in terms of the minimization of an error function E(w). This error depends on the vector w of weight and bias parameters in the network, and the goal is therefore to find a weight vector w* which minimizes E. For models of the form (4), in which the basis functions are fixed, and for an error function given by the sum-of-squares form (25), the error is a quadratic function of the weights. Its minimization then corresponds to the solution of a set of coupled linear equations and can be performed rapidly in fixed time. We have seen, however, that models with fixed basis functions suffer from very poor scaling with input dimensionality. In order to avoid this difficulty, we need to consider models with adaptive basis functions. The error function now becomes a highly non-linear function of the weight vector, and its minimization requires sophisticated optimization techniques.

We have considered error functions of the form (25), (28) and (35), which are differentiable functions of the network outputs. Similarly, we have considered network mappings which are differentiable functions of the weights. It therefore follows that the error function itself will be a differentiable function of the weights, and so we can use gradient-based methods to find its minima. We now show that there is a computationally efficient procedure, called back-propagation, which allows the required derivatives to be evaluated for arbitrary feed-forward network topologies.

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

    z_j = g(a_j),    a_j = \sum_i w_{ji} z_i        (40)

where z_i is the activation of a unit, or input, which sends a connection to unit j, and w_{ji} is the weight associated with that connection. The summation runs over all units which send connections to unit j. Biases can be included in this sum by introducing an extra unit, or input, with activation fixed at +1; we therefore do not need to deal with biases explicitly. The error functions which we are considering can be written as a sum over patterns of the error for each pattern separately, so that E = \sum_n E^n. This follows from the assumed independence of the data points under the given distribution. We can therefore consider one pattern at a time, and then find the derivatives of E by summing over patterns.

For each pattern, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network by successive application of (40). This process is often called forward propagation, since it can be regarded as a forward flow of information through the network.
Now consider the evaluation of the derivative of E^n with respect to some weight w_{ji}. First we note that E^n depends on the weight w_{ji} only via the summed input a_j to unit j. We can therefore apply the chain rule for partial derivatives to give

    \frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}        (41)

We now introduce a useful notation

    \delta_j \equiv \frac{\partial E^n}{\partial a_j}        (42)

where the δ's are often referred to as errors, for reasons which will become clear shortly. Using (40) we can write

    \frac{\partial a_j}{\partial w_{ji}} = z_i        (43)

Substituting (42) and (43) into (41), we then obtain

    \frac{\partial E^n}{\partial w_{ji}} = \delta_j z_i        (44)

Equation (44) tells us that the required derivative is obtained simply by multiplying the value of δ for the unit at the output end of the weight by the value of z for the unit at the input end of the weight (where z = +1 in the case of a bias). Thus, in order to evaluate the derivatives, we need only to calculate the value of δ_j for each hidden and output unit in the network, and then apply (44).

Figure 6: Illustration of the calculation of δ_j for hidden unit j by back-propagation of the δ's from those units k to which unit j sends connections.
For the output units, the evaluation of δ_k is straightforward. From the definition (42) we have

    \delta_k = \frac{\partial E^n}{\partial a_k} = g'(a_k) \frac{\partial E^n}{\partial y_k}        (45)

where we have used (40) with z_k denoted by y_k. In order to evaluate (45) we substitute appropriate expressions for g'(a) and ∂E^n/∂y_k. If, for example, we consider the sum-of-squares error function (25) together with a network having linear outputs, as in (7) for instance, we obtain

    \delta_k = y_k^n - t_k^n        (46)

and so δ_k represents the error between the actual and the desired values for output k. The same form (46) is also obtained if we consider the cross-entropy error function (28) together with a network with a logistic sigmoid output, or if we consider the error function (35) together with the softmax activation function (37).
To evaluate the δ's for hidden units, we again make use of the chain rule for partial derivatives, to give

    \delta_j \equiv \frac{\partial E^n}{\partial a_j} = \sum_k \frac{\partial E^n}{\partial a_k} \frac{\partial a_k}{\partial a_j}        (47)

where the sum runs over all units k to which unit j sends connections. The arrangement of units and weights is illustrated in Figure 6. Note that the units labelled k could include other hidden units and/or output units. In writing down (47) we are making use of the fact that variations in a_j give rise to variations in the error function only through variations in the variables a_k. If we now substitute the definition of δ given by (42) into (47), and make use of (40), we obtain the following back-propagation formula

    \delta_j = g'(a_j) \sum_k w_{kj} \delta_k        (48)

which tells us that the value of δ for a particular hidden unit can be obtained by propagating the δ's backwards from units higher up in the network, as illustrated in Figure 6. Since we already know the values of the δ's for the output units, it follows that, by recursively applying (48), we can evaluate the δ's for all of the hidden units in a feed-forward network, regardless of its topology. Having found the gradient of the error function for this particular pattern, the process of forward and backward propagation is repeated for each pattern in the data set, and the resulting derivatives summed to give the gradient ∇E(w) of the total error function.
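The forward and backward passes just described, specialized to the two-layer network (7) with a sum-of-squares error and linear outputs, might be sketched as follows. This is a minimal illustration only, and it reuses the (assumed) weight layout of the earlier forward-pass example.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_gradients(x, t, W1, W2):
        """Error gradients for one pattern by back-propagation (sum-of-squares error,
        linear output units).  Weight layout as in the earlier forward-pass sketch:
        W1 has shape (M, d+1), W2 has shape (c, M+1), column 0 holding the biases."""
        # Forward propagation.
        x_ext = np.concatenate(([1.0], x))
        a_hidden = W1 @ x_ext
        z = sigmoid(a_hidden)
        z_ext = np.concatenate(([1.0], z))
        y = W2 @ z_ext

        # Backward propagation of the deltas.
        delta_out = y - t                     # delta_k = y_k - t_k, equation (46)
        # Hidden deltas: delta_j = g'(a_j) * sum_k w_kj delta_k, equation (48);
        # for the logistic sigmoid g'(a) = g(a) (1 - g(a)) = z (1 - z).
        delta_hidden = z * (1.0 - z) * (W2[:, 1:].T @ delta_out)

        # Derivatives dE/dw = delta * z for each weight, equation (44).
        grad_W2 = np.outer(delta_out, z_ext)
        grad_W1 = np.outer(delta_hidden, x_ext)
        return grad_W1, grad_W2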
The back-propagation algorithm allows the error function gradient ∇E(w) to be evaluated efficiently. We now seek a way of using this gradient information to find a weight vector which minimizes the error. This is a standard problem in unconstrained non-linear optimization, which has been widely studied, and a number of powerful algorithms have been developed. Such algorithms begin by choosing an initial weight vector w^(0) (which might be selected at random) and then making a series of steps through weight space of the form

    w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}        (49)

where τ labels the iteration step. The simplest choice for the weight update is given by the gradient descent expression

    \Delta w^{(\tau)} = -\eta \, \nabla E \big|_{w^{(\tau)}}        (50)

where the gradient vector ∇E must be re-evaluated at each step. It should be noted that gradient descent is a very inefficient algorithm for highly non-linear problems such as neural network optimization. Numerous ad hoc modifications have been proposed to try to improve its efficiency. One of the most common is the addition of a momentum term in (50), to give

    \Delta w^{(\tau)} = -\eta \, \nabla E \big|_{w^{(\tau)}} + \mu \, \Delta w^{(\tau-1)}        (51)

where μ is called the momentum parameter. While this can often lead to improvements in the performance of gradient descent, there are now two arbitrary parameters, η and μ, whose values must be adjusted to give best performance. Furthermore, the optimal values for these parameters will often vary during the optimization process. In fact, much more powerful techniques have been developed for solving non-linear optimization problems (Polak; Gill et al.; Dennis and Schnabel; Luenberger; Fletcher; Bishop). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique.
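A bare-bones version of the gradient descent update with momentum, equation (51), is shown below; the learning rate and momentum values are illustrative assumptions, and in practice the more powerful optimizers mentioned above would normally be preferred.

    def gradient_descent_with_momentum(w, grad_fn, eta=0.1, mu=0.9, n_steps=100):
        """Iterate w <- w + dw with dw = -eta * grad E(w) + mu * dw_previous.

        w       : initial weight vector (NumPy array)
        grad_fn : function returning grad E evaluated at a given w
        """
        dw = 0.0 * w                      # previous update, initially zero
        for _ in range(n_steps):
            dw = -eta * grad_fn(w) + mu * dw
            w = w + dw
        return w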
It should be noted that the term back-propagation is used in the neural computing literature to mean a variety of different things. For instance, the multi-layer perceptron architecture is sometimes called a back-propagation network. The term back-propagation is also used to describe the training of a multi-layer perceptron using gradient descent applied to a sum-of-squares error function. In order to clarify the terminology it is useful to consider the nature of the training process more carefully. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. As we shall see, the important contribution of the back-propagation technique is in providing a computationally efficient method for evaluating such derivatives. Since it is at this stage that errors are propagated backwards through the network, we use the term back-propagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are then used to compute the adjustments to be made to the weights. The simplest such technique, and the one originally considered by Rumelhart et al., involves gradient descent. It is important to recognize that the two stages are distinct. Thus, the first stage process, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network, and not just the multi-layer perceptron. It can also be applied to error functions other than the simple sum-of-squares, and to the evaluation of other quantities such as the Hessian matrix, whose elements comprise the second derivatives of the error function with respect to the weights (Bishop). Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes (discussed above), many of which are substantially more effective than simple gradient descent.

One of the most important aspects of back-propagation is its computational efficiency. To understand this, let us examine how the number of computer operations required to evaluate the derivatives of the error function scales with the size of the network. A single evaluation of the error function (for a given input pattern) would require O(W) operations, where W is the total number of weights in the network. For W weights in total there are W such derivatives to evaluate. A direct evaluation of these derivatives individually would therefore require O(W^2) operations. By comparison, back-propagation allows all of the derivatives to be evaluated using a single forward propagation and a single backward propagation, together with the use of (44). Since each of these requires O(W) steps, the overall computational cost is reduced from O(W^2) to O(W). The training of multi-layer perceptron networks, even using back-propagation coupled with efficient optimization algorithms, can be very time consuming, and so this gain in efficiency is crucial.
6 Generalization

The goal of network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data. This is important if the network is to exhibit good generalization, that is, to make good predictions for new inputs.

In order for the network to provide a good representation of the generator of the data it is important that the effective complexity of the model be matched to the data set. This is most easily illustrated by returning to the analogy with polynomial curve fitting introduced in Section 2.1. In this case the model complexity is governed by the order of the polynomial, which in turn governs the number of adjustable coefficients. Consider a data set generated by sampling a smooth function h(x) (here a sinusoidal function of x) at equal intervals of x and then adding random noise drawn from a Gaussian distribution. This reflects a basic property of most data sets of interest in pattern recognition, in that the data exhibits an underlying systematic component, represented in this case by the function h(x), but is corrupted with random noise. Figure 7 shows the training data, as well as the function h(x), together with the result of fitting a linear polynomial, given by (2) with M = 1. As can be seen, this polynomial gives a poor representation of h(x), as a consequence of its limited flexibility. We can obtain a better fit by increasing the order of the polynomial, since this increases the number of degrees of freedom (i.e. the number of free parameters) in the function, which gives it greater flexibility.

Figure 8 shows the result of fitting a cubic polynomial (M = 3), which gives a much better approximation to h(x). If, however, we increase the order of the polynomial too far, then the approximation to the underlying function actually gets worse. Figure 9 shows the result of fitting a 10th-order polynomial (M = 10). This is now able to achieve a perfect fit to the training data, since a 10th-order polynomial has 11 free parameters and there are 11 data points. However, the polynomial has fitted the data by developing some dramatic oscillations, and consequently gives a poor representation of h(x). Functions of this kind are said to be over-fitted to the data.

In order to determine the generalization performance of the different polynomials, we generate a second independent test set, and measure the root-mean-square error E^RMS with respect to both training and test sets. Figure 10 shows a plot of E^RMS for both the training data set and the test data set, as a function of the order M of the polynomial. We see that the training set error decreases steadily as the order of the polynomial increases. However, the test set error reaches a minimum at M = 3, and thereafter increases as the order of the polynomial is increased. The smallest error is achieved by that polynomial (M = 3) which most closely matches the function h(x) from which the data was generated.
In the case of neural networks, the weights and biases are analogous to the polynomial coefficients. These parameters can be optimized by minimization of an error function defined with respect to a training data set. The model complexity is governed by the number of such parameters, and so is determined by the network architecture and in particular by the number of hidden units. We have seen that the complexity cannot be optimized by minimization of training set error, since the smallest training error corresponds to an over-fitted model which has poor generalization. Instead, we see that the optimum complexity can be chosen by comparing the performance of a range of trained models using an independent test set. A more elaborate version of this procedure is cross-validation (Stone; Wahba and Wold).

Figure 7: An example of a set of data points obtained by sampling the function h(x) at equal intervals of x and adding random noise. The dashed curve shows the function h(x), while the solid curve shows the rather poor approximation obtained with a linear polynomial, corresponding to M = 1 in (2).

Figure 8: The same data set as in Figure 7, but this time fitted by a cubic (M = 3) polynomial, showing the significantly improved approximation to h(x) achieved by this more flexible function.

Figure 9: The result of fitting the same data set as in Figure 7 using a 10th-order (M = 10) polynomial. This gives a perfect fit to the training data, but at the expense of a function which has large oscillations and which therefore gives a poorer representation of the generator function h(x) than did the cubic polynomial of Figure 8.

Figure 10: Plots of the RMS error E^RMS as a function of the order of the polynomial, for both training and test sets, for the example problem considered in the previous three figures. The error with respect to the training set decreases monotonically with M, while the error in making predictions for new data (as measured by the test set) shows a minimum at M = 3.
Instead of directly varying the number of adaptive parameters in a network, the effective complexity of the model may be controlled through the technique of regularization. This involves the use of a model with a relatively large number of parameters, together with the addition of a penalty term Ω to the usual error function E, to give a total error function of the form

    Ẽ = E + νΩ

where ν is called a regularization coefficient. The penalty term Ω is chosen so as to encourage smoother network mapping functions since, by analogy with the polynomial results shown in the figures above, we expect that good generalization is achieved when the rapid variations in the mapping associated with over-fitting are smoothed out. There will be an optimum value for ν, which can again be found by comparing the performance of models trained using different values of ν on an independent test set. Regularization is usually the preferred choice for model complexity control, for a number of reasons: it allows prior knowledge to be incorporated into network training; it has a natural interpretation in the Bayesian framework (discussed below); and it can be extended to provide more complex forms of regularization involving several different regularization coefficients, which can be used, for example, to determine the relative importance of different inputs.
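As a concrete (and deliberately simple) example of such a penalty, the sketch below uses a weight-decay term Ω = ½‖w‖², so that the total error is Ẽ = E + νΩ with E a sum-of-squares data term. The linear-in-parameters model, the data, and the value ν = 0.01 are assumptions chosen only to keep the example short; in a network the same penalty would simply be added to whatever error function is being minimized.

```python
import numpy as np

# Weight-decay regularization sketch: total error  E_tilde = E + nu * Omega,
# with Omega = 0.5*||w||^2.  The linear model and nu = 0.01 are assumptions.

def total_error_and_grad(w, X, t, nu):
    """Regularized sum-of-squares error and its gradient for the model y = X @ w."""
    err = X @ w - t
    E = 0.5 * np.sum(err ** 2)        # data term
    Omega = 0.5 * np.sum(w ** 2)      # penalty encouraging small (smooth) weights
    return E + nu * Omega, X.T @ err + nu * w

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

w = np.zeros(5)
for _ in range(500):                  # plain gradient descent on the total error
    _, grad = total_error_and_grad(w, X, t, nu=0.01)
    w -= 0.01 * grad
```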
Discussion
In this chapter we have presented a brief overview of neural networks from the viewpoint of statistical pattern recognition. Due to lack of space, there are many important issues which we have not discussed or have only touched upon. Here we mention two further topics of considerable significance for neural computing.
In practical applications of neural networks, one of the most important factors determining the overall performance of the final system is that of data pre-processing. Since a neural network mapping has universal approximation capabilities, as discussed earlier, it would in principle be possible to use the original data directly as the input to a network. In practice, however, there is generally considerable advantage in processing the data in various ways before it is used for network training. One important reason why pre-processing can lead to improved performance is that it can offset some of the effects of the 'curse of dimensionality' discussed earlier, by reducing the number of input variables. Inputs can be combined in linear or non-linear ways to give a smaller number of new inputs, which are then presented to the network. This is sometimes called feature extraction. Although information is often lost in the process, this can be more than compensated for by the benefits of a lower input dimensionality. Another significant aspect of pre-processing is that it allows the use of prior knowledge, in other words information which is relevant to the solution of a problem and which is additional to that contained in the training data. A simple example would be the prior knowledge that the classification of a handwritten digit should not depend on the location of the digit within the input image. By extracting features which are independent of position, this translation invariance can be incorporated into the network structure, and this will generally give substantially improved performance compared with using the original image directly as the input to the network. Another use for pre-processing is to clean up deficiencies in the data. For example, real data sets often suffer from the problem of missing values in many of the patterns, and these must be accounted for before network training can proceed.
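The sketch below illustrates the kinds of pre-processing just described: missing values are filled in with per-variable means, each input is standardized, and a linear projection onto the leading principal components provides a simple form of feature extraction. The particular choices (mean imputation, PCA, two retained components) are assumptions made for illustration rather than procedures prescribed in the text.

```python
import numpy as np

# Illustrative pre-processing pipeline (assumptions: mean imputation for missing
# values, standardization, and PCA to 2 components for dimensionality reduction).

def preprocess(X, n_components=2):
    X = X.astype(float).copy()
    # 1. Replace missing values (NaNs) with the mean of each input variable.
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    # 2. Standardize each input to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 3. Linear feature extraction: project onto the leading principal components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Example: 6 patterns with 4 inputs, one value missing.
X = np.array([[1.0, 2.0, 0.5, 3.0],
              [2.0, np.nan, 0.7, 2.9],
              [1.5, 2.2, 0.4, 3.1],
              [0.9, 1.8, 0.6, 2.8],
              [2.1, 2.5, 0.5, 3.2],
              [1.7, 2.1, 0.8, 3.0]])
features = preprocess(X)   # shape (6, 2): reduced inputs for the network
```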
The discussion of learning in neural networks given above was based on the principle of maximum likelihood, which itself stems from the frequentist school of statistics. A more fundamental, and potentially more powerful, approach is given by the Bayesian viewpoint (Jaynes). Instead of describing a trained network by a single weight vector w, the Bayesian approach expresses our uncertainty in the values of the weights through a probability distribution p(w). The effect of observing the training data is to cause this distribution to become much more concentrated in particular regions of weight space, reflecting the fact that some weight vectors are more consistent with the data than others. Predictions for new data points require the evaluation of integrals over weight space, weighted by the distribution p(w). The maximum likelihood approach considered earlier then represents a particular approximation in which we consider only the most probable weight vector, corresponding to a peak in the distribution. Aside from offering a more fundamental view of learning in neural networks, the Bayesian approach allows error bars to be assigned to network predictions, and regularization arises in a natural way in the Bayesian setting. Furthermore, a Bayesian treatment allows the model complexity (as determined by regularization coefficients, for instance) to be treated without the need for independent data as in cross-validation.
Although the Bayesian approach is very appealing, a full implementation is intractable for neural networks. Two principal approximation schemes have therefore been considered. In the first of these (MacKay), the distribution over weights is approximated by a Gaussian centred on the most probable weight vector. Integrations over weight space can then be performed analytically, and this leads to a practical scheme which involves relatively small modifications to conventional algorithms. An alternative approach to the Bayesian treatment of neural networks is to use Monte Carlo techniques (Neal) to perform the required integrations numerically, without making analytical approximations. Again this leads to a practical scheme, which has been applied to some real-world problems.
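To make the Monte Carlo alternative concrete, the sketch below forms the Bayesian predictive mean for a toy one-hidden-unit network by averaging its output over weight vectors sampled from the posterior with a random-walk Metropolis step. The network, the Gaussian prior and noise model, the sampler settings, and the sinusoidal data generator are all illustrative assumptions; a practical treatment of a real network posterior requires the more sophisticated schemes cited above.

```python
import numpy as np

# Monte Carlo sketch of Bayesian prediction: average the network output over
# weight samples from p(w | data) rather than using a single weight vector.
# The toy network, prior, noise model, and sampler settings are assumptions.
rng = np.random.default_rng(3)

def net(x, w):
    """Toy network with one tanh hidden unit; w = (w1, b1, w2, b2)."""
    return w[2] * np.tanh(w[0] * x + w[1]) + w[3]

def log_posterior(w, x, t, beta=100.0, alpha=1.0):
    """Log posterior up to a constant: Gaussian noise (precision beta) + Gaussian prior."""
    return -0.5 * beta * np.sum((net(x, w) - t) ** 2) - 0.5 * alpha * np.sum(w ** 2)

x = np.linspace(0.0, 1.0, 10)                      # toy training data
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=10)

w, samples = np.zeros(4), []
for i in range(20000):                             # random-walk Metropolis sampling
    w_new = w + 0.05 * rng.normal(size=4)
    if np.log(rng.uniform()) < log_posterior(w_new, x, t) - log_posterior(w, x, t):
        w = w_new
    if i > 5000 and i % 10 == 0:
        samples.append(w.copy())

y_pred = np.mean([net(0.25, s) for s in samples])  # predictive mean at a new input
```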
An interesting aspect of the Bayesian viewpoint is that it is not, in principle, necessary to limit network complexity (Neal), and that over-fitting should not arise if the Bayesian approach is implemented correctly.
A more comprehensive discussion of these and other topics can be found in Bishop.
References
Anderson, J. A. and E. Rosenfeld (Eds.). Neurocomputing: Foundations of Research. Cambridge, MA: MIT Press.
Baum, E. B. and F. Wilczek. Supervised learning of probability distributions by neural networks. In D. Z. Anderson (Ed.), Neural Information Processing Systems. New York: American Institute of Physics.
Bellman, R. Adaptive Control Processes: A Guided Tour. New Jersey: Princeton University Press.
Bishop, C. M. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Computation.
Bishop, C. M. Mixture density networks. Technical Report, Neural Computing Research Group, Aston University, Birmingham, UK.
Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press.
Broomhead, D. S. and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems.
Cotter, N. E. The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
Dennis, J. E. and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall.
Devijver, P. A. and J. Kittler. Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ: Prentice-Hall.
Duda, R. O. and P. E. Hart. Pattern Classification and Scene Analysis. New York: John Wiley.
Fletcher, R. Practical Methods of Optimization (second ed.). New York: John Wiley.
Friedman, J. H. and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association.
Fukunaga, K. Introduction to Statistical Pattern Recognition (second ed.). San Diego: Academic Press.
Funahashi, K. On the approximate realization of continuous mappings by neural networks. Neural Networks.
Gill, P. E., W. Murray, and M. H. Wright. Practical Optimization. London: Academic Press.
Hampshire, J. B. and B. Pearlmutter. Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (Eds.), Proceedings of the Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.
Hand, D. J. Discrimination and Classification. New York: John Wiley.
Hecht-Nielsen, R. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks. San Diego, CA: IEEE.
Hinton, G. E. Connectionist learning procedures. Artificial Intelligence.
Hopfield, J. J. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proceedings of the National Academy of Sciences.
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks.
Hornik, K., M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks.
Hornik, K., M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks.
Huber, P. J. Projection pursuit. Annals of Statistics.
Ito, Y. Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks.
Jaynes, E. T. Bayesian methods: general background. In J. H. Justice (Ed.), Maximum Entropy and Bayesian Methods in Applied Statistics. Cambridge University Press.
Kreinovich, V. Y. Arbitrary nonlinearity is sufficient to represent all functions by neural networks: a theorem. Neural Networks.
Le Cun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation.
Luenberger, D. G. Linear and Nonlinear Programming (second ed.). Reading, MA: Addison-Wesley.
MacKay, D. J. C. Bayesian interpolation. Neural Computation.
MacKay, D. J. C. The evidence framework applied to classification networks. Neural Computation.
MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation.
Moody, J. and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation.
Neal, R. M. Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto, Canada.
Polak, E. Computational Methods in Optimization: A Unified Approach. New York: Academic Press.
Rumelhart, D. E., R. Durbin, R. Golden, and Y. Chauvin. Backpropagation: the basic theory. In Y. Chauvin and D. E. Rumelhart (Eds.), Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Lawrence Erlbaum.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. Cambridge, MA: MIT Press. Reprinted in Anderson and Rosenfeld.
Solla, S. A., E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems.
Stinchcombe, M. and H. White. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks. San Diego, CA: IEEE.
Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B.
Stone, M. Cross-validation: a review. Math. Operationsforsch. Statist., Ser. Statistics.
Wahba, G. and S. Wold. A completely automatic French curve: fitting spline functions by cross-validation. Communications in Statistics, Series A.