Neural Networks: A Pattern Recognition Perspective

Christopher M. Bishop

Neural Computing Research Group
Aston University, Birmingham, UK

January 1996

NCRG Technical Report

Available from: http://www.ncrg.aston.ac.uk/

Introduction

Neural networks have been exploited in a wide variety of applications, the majority of which are concerned with pattern recognition in one form or another. However, it has become widely acknowledged that the effective solution of all but the simplest of such problems requires a principled treatment, in other words one based on a sound theoretical framework.

From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Lack of understanding of the basic principles of statistical pattern recognition lies at the heart of many of the common mistakes in the application of neural networks. In this chapter we aim to show that the 'black box' stigma of neural networks is largely unjustified, and that there is actually considerable insight available into the way in which neural networks operate, and how to use them effectively.

Some of the key points which are discussed in this chapter are as follows:

- Neural networks can be viewed as a general framework for representing non-linear mappings between multi-dimensional spaces, in which the form of the mapping is governed by a number of adjustable parameters. They therefore belong to a much larger class of such mappings, many of which have been studied extensively in other fields.

- Simple techniques for representing multivariate non-linear mappings in one or two dimensions (polynomials, for example) rely on linear combinations of fixed basis functions, or 'hidden functions'. Such methods have severe limitations when extended to spaces of many dimensions, a phenomenon known as the curse of dimensionality. The key contribution of neural networks in this respect is that they employ basis functions which are themselves adapted to the data, leading to efficient techniques for multi-dimensional problems.

- The formalism of statistical pattern recognition, introduced briefly later in this chapter, lies at the heart of a principled treatment of neural networks. Many of these topics are treated in standard texts on statistical pattern recognition, including those by Duda and Hart, Hand, Devijver and Kittler, and Fukunaga.

To be published in: Fiesler, E. and Beale, R. (eds), Handbook of Neural Computation, New York: Oxford University Press and Bristol: IOP Publishing Ltd.

- Network training is usually based on the minimization of an error function. We show how error functions arise naturally from the principle of maximum likelihood, and how different choices of error function correspond to different assumptions about the statistical properties of the data. This allows the appropriate error function to be selected for a particular application.

- The statistical view of neural networks motivates specific forms for the activation functions which arise in network models. In particular, we see that the logistic sigmoid, often introduced by analogy with the mean firing rate of a biological neuron, is precisely the function which allows the activation of a unit to be given a particular probabilistic interpretation.

- Provided the error function and activation functions are correctly chosen, the outputs of a trained network can be given precise interpretations. For regression problems they approximate the conditional averages of the distribution of target data, while for classification problems they approximate the posterior probabilities of class membership. This demonstrates why neural networks can approximate the optimal solution to a regression or classification problem.

- Error back-propagation is introduced as a general framework for evaluating derivatives for feed-forward networks. The key feature of back-propagation is that it is computationally very efficient compared with a simple direct evaluation of derivatives. For network training algorithms, this efficiency is crucial.

- The original learning algorithm for multi-layer feed-forward networks (Rumelhart et al.) was based on gradient descent. In fact, the problem of optimizing the weights in a network corresponds to unconstrained non-linear optimization, for which many substantially more powerful algorithms have been developed.

- Network complexity, governed for example by the number of hidden units, plays a central role in determining the generalization performance of a trained network. This is illustrated using a simple curve-fitting example in one dimension.

These and many related issues are discussed at greater length in the book by Bishop.

Classification and Regression

In this chapter we concentrate on the two most common kinds of pattern recognition problem. The first of these we shall refer to as regression, and is concerned with predicting the values of one or more continuous output variables, given the values of a number of input variables. Examples include the prediction of the temperature of a plasma, given values for the intensity of light emitted at various wavelengths, or the estimation of the fraction of oil in a multi-phase pipeline, given measurements of the absorption of gamma beams along various cross-sectional paths through the pipe. If we denote the input variables by a vector x with components x_i, where i = 1, ..., d, and the output variables by a vector y with components y_k, where k = 1, ..., c, then the goal of the regression problem is to find a suitable set of functions which map the x_i to the y_k.

The second kind of task we shall consider is called classification, and involves assigning input patterns to one of a set of discrete classes C_k, where k = 1, ..., c. An important example involves the automatic interpretation of handwritten digits (Le Cun et al.). Again, we can formulate a classification problem in terms of a set of functions which map inputs x_i to outputs y_k, where now the outputs specify which of the classes the input pattern belongs to. For instance, the input may be assigned to the class whose output value y_k is largest.

In general it will not be possible to determine a suitable form for the required mapping, except with the help of a data set of examples. The mapping is therefore modelled in terms of some mathematical function which contains a number of adjustable parameters, whose values are determined with the help of the data. We can write such functions in the form
\[ y_k = y_k(\mathbf{x}; \mathbf{w}) \]
where w denotes the vector of parameters (w_1, ..., w_W). A neural network model can be regarded simply as a particular choice for the set of functions y_k(x; w). In this case, the parameters comprising w are often called weights.

The importance of neural networks in this context is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables. The process of determining the values for these parameters on the basis of the data set is called learning or training, and for this reason the data set of examples is generally referred to as a training set. Neural network models, as well as many conventional approaches to statistical pattern recognition, can be viewed as specific choices for the functional forms used to represent the mapping, together with particular procedures for optimizing the parameters in the mapping. In fact, neural network models often contain conventional approaches (such as linear or logistic regression) as special cases.

Polynomial curve fitting

Many of the important issues concerning the application of neural networks can be introduced in the simpler context of curve fitting using polynomial functions. Here the problem is to fit a polynomial to a set of N data points by minimizing an error function. Consider the M th-order polynomial given by
\[ y(x) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j. \]

This can be regarded as a non-linear mapping which takes x as input and produces y as output. The precise form of the function y(x) is determined by the values of the parameters w_0, ..., w_M, which are analogous to the weights in a neural network. It is convenient to denote the set of parameters (w_0, ..., w_M) by the vector w, in which case the polynomial can be written as a functional mapping of the form y = y(x; w). Values for the coefficients can be found by minimization of an error function, as will be discussed in detail in the section on error functions. We shall give some examples of polynomial curve fitting later in the chapter.
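As an illustration (an addition to the text, not Bishop's own example), the following sketch fits an M th-order polynomial to synthetic noisy data by minimizing the sum-of-squares error with numpy; the data-generating function and all variable names here are invented for the example.

```python
import numpy as np

# Synthetic noisy data from a smooth function (purely illustrative).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

M = 3  # order of the polynomial

# Columns of the design matrix are the fixed basis functions 1, x, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing the sum-of-squares error is a linear least-squares problem.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Fitted polynomial evaluated at the training inputs.
y_fit = Phi @ w
```

Because the basis functions are fixed, the fit reduces to solving a linear system, a point taken up again in the discussion of linear models below.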

Why neural networks?

Pattern recognition problems, as we have already indicated, can be represented in terms of general parametrized non-linear mappings between a set of input variables and a set of output variables. A polynomial represents a particular class of mapping, for the case of one input and one output. Provided we have a sufficiently large number of terms in the polynomial, we can approximate a wide class of functions to arbitrary accuracy. This suggests that we could simply extend the concept of a polynomial to higher dimensions. Thus, for d input variables, and again one output variable, we could, for instance, consider a third-order polynomial of the form
\[ y = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1} + \sum_{i_1=1}^{d}\sum_{i_2=1}^{d} w_{i_1 i_2}\, x_{i_1} x_{i_2} + \sum_{i_1=1}^{d}\sum_{i_2=1}^{d}\sum_{i_3=1}^{d} w_{i_1 i_2 i_3}\, x_{i_1} x_{i_2} x_{i_3}. \]

For an M th-order polynomial of this kind, the number of independent adjustable parameters would grow like d^M, which represents a dramatic growth in the number of degrees of freedom in the model as the dimensionality of the input space increases. This is an example of the curse of dimensionality (Bellman). The presence of a large number of adaptive parameters in a model can cause major problems, as we shall discuss later. In order that the model make good predictions for new inputs, it is necessary that the number of data points in the training set be much greater than the number of adaptive parameters. For medium to large applications, such a model would need huge quantities of training data in order to ensure that the parameters (in this case the coefficients in the polynomial) were well determined.

There are in fact many different ways in which to represent general non-linear mappings between multi-dimensional spaces. The importance of neural networks, and similar techniques, lies in the way in which they deal with the problem of scaling with dimensionality. In order to motivate neural network models, it is convenient to represent the non-linear mapping function in terms of a linear combination of basis functions, sometimes also called 'hidden functions' or hidden units, z_j(x), so that
\[ y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\, z_j(\mathbf{x}). \]
Here the basis function z_0 takes the fixed value 1 and allows a constant term in the expansion; the corresponding weight parameter w_{k0} is generally called a bias. Both the one-dimensional polynomial and the multi-dimensional polynomial can be cast in this form, in which the basis functions are fixed functions of the input variables.

We have seen from the example of the higher-order polynomial that, to represent general functions of many input variables, we have to consider a large number of basis functions, which in turn implies a large number of adaptive parameters. In most practical applications there will be significant correlations between the input variables, so that the effective dimensionality of the space occupied by the data (known as the intrinsic dimensionality) is significantly less than the number of inputs. The key to constructing a model which can take advantage of this phenomenon is to allow the basis functions themselves to be adapted to the data as part of the training process. In this case, the number of such functions only needs to grow as the complexity of the problem itself grows, and not simply as the number of input variables grows. The number of free parameters in such models, for a given number of hidden functions, typically only grows linearly (or quadratically) with the dimensionality of the input space, as compared with the d^M growth for a general M th-order polynomial.

One of the simplest, and most commonly encountered, models with adaptive basis functions is given by the two-layer feed-forward network, sometimes called a multi-layer perceptron, which can be expressed in the form of the linear expansion above, in which the basis functions themselves contain adaptive parameters and are given by
\[ z_j(\mathbf{x}) = g\!\left( \sum_{i=0}^{d} w_{ji}\, x_i \right) \]

where the w_{j0} are bias parameters, and we have introduced an extra 'input' variable x_0 = 1 in order to allow the biases to be treated on the same footing as the other parameters and hence be absorbed into the summation. The function g(·) is called an activation function, and must be a non-linear function of its argument in order that the network model can have general approximation capabilities. If g(·) were linear, the network would reduce to the composition of two linear mappings, which would itself be linear. The activation function is also chosen to be a differentiable function of its argument, in order that the network parameters can be optimized using gradient-based methods, as discussed later. Many different forms of activation function can be considered. However, the most common are sigmoidal (meaning 'S-shaped') and include the logistic sigmoid
\[ g(a) = \frac{1}{1 + \exp(-a)} \]

which is plotted in Figure 1. The motivation for this form of activation function is considered later. We can combine the two expressions above to obtain a complete expression for the function represented by a two-layer feed-forward network, in the form
\[ y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\, g\!\left( \sum_{i=0}^{d} w_{ji}\, x_i \right). \]
The form of network mapping given by this expression is appropriate for regression problems, but needs some modification for classification applications, as will also be discussed later.
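A minimal forward-pass implementation of such a two-layer network, written in numpy as an illustrative sketch (the weight shapes and helper names are choices made here, not from the text); the outputs are linear, as is appropriate for regression:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward pass of a two-layer feed-forward network.

    x  : input vector, shape (d,)
    W1 : first-layer weights, shape (M, d + 1); column 0 holds the biases
    W2 : second-layer weights, shape (c, M + 1); column 0 holds the biases
    """
    x_ext = np.concatenate(([1.0], x))   # extra input x_0 = 1
    z = sigmoid(W1 @ x_ext)              # hidden-unit activations z_j(x)
    z_ext = np.concatenate(([1.0], z))   # extra hidden unit z_0 = 1
    return W2 @ z_ext                    # linear outputs for regression

rng = np.random.default_rng(1)
d, M, c = 3, 4, 2
W1 = rng.standard_normal((M, d + 1))
W2 = rng.standard_normal((c, M + 1))
y_out = forward(rng.standard_normal(d), W1, W2)
```

Treating the biases as weights from fixed units x_0 = 1 and z_0 = 1, as in the text, keeps the implementation to two matrix multiplications.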

It should be noted that models of this kind, with basis functions which are adapted to the data, are not unique to neural networks. Such models have been considered for many years in the statistics literature, and include, for example, projection pursuit regression (Friedman and Stuetzle; Huber), which has a form remarkably similar to that of the feed-forward network discussed above. The procedures for determining the parameters in projection pursuit regression are, however, quite different from those generally used for feed-forward networks.

Figure 1: Plot of the logistic sigmoid activation function g(a) = 1/(1 + exp(-a)).

Figure 2: An example of a feed-forward network having two layers of adaptive weights.

It is often useful to represent the network mapping function in terms of a network diagram, as shown in Figure 2. Each element of the diagram represents one of the terms of the corresponding mathematical expression. The bias parameters in the first layer are shown as weights from an extra input having a fixed value of x_0 = 1. Similarly, the bias parameters in the second layer are shown as weights from an extra hidden unit, with activation again fixed at z_0 = 1.

More complex forms of feed-forward network function can be considered, corresponding to more complex topologies of network diagram. However, the simple structure of Figure 2 has the property that it can approximate any continuous mapping to arbitrary accuracy, provided the number M of hidden units is sufficiently large. This property has been discussed by many authors, including Funahashi, Hecht-Nielsen, Cybenko, Hornik et al., Stinchcombe and White, Cotter, Ito, Hornik, and Kreinovich. A proof that two-layer networks having sigmoidal hidden units can simultaneously approximate both a function and its derivatives was given by Hornik et al.

The other major class of network model which also possesses universal approximation capabilities is the radial basis function network (Broomhead and Lowe; Moody and Darken). Such networks again take the form of a linear combination of basis functions, but the basis functions now depend on some measure of distance between the input vector x and a prototype vector μ_j. A typical example would be a Gaussian basis function of the form
\[ z_j(\mathbf{x}) = \exp\!\left( -\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2} \right) \]
where the parameter σ_j controls the width of the basis function. Training of radial basis function networks usually involves a two-stage procedure, in which the basis functions are first optimized using the input data alone, and then the parameters w_{kj} in the linear expansion are optimized by error function minimization. Such procedures are described in detail in Bishop.
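The two-stage idea can be sketched in numpy as follows. This is an illustrative toy, not the detailed procedure the text refers to: here the prototypes are simply a random subset of the inputs (stage one would more typically use a clustering method), the widths are fixed by hand, and all names and data are invented for the example.

```python
import numpy as np

def rbf_design(X, centres, sigma):
    """Gaussian basis functions z_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)),
    plus a constant column z_0 = 1 for the bias."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Z = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Z])

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
t = np.sin(np.pi * X[:, 0]) * X[:, 1]      # synthetic targets

# Stage 1: choose basis-function parameters from the inputs alone
# (here, crudely, a random subset of data points as prototypes).
centres = X[rng.choice(len(X), size=8, replace=False)]
sigma = 0.5

# Stage 2: with the basis fixed, the output weights follow from
# linear least squares on the sum-of-squares error.
Z = rbf_design(X, centres, sigma)
w_rbf, *_ = np.linalg.lstsq(Z, t, rcond=None)
pred = Z @ w_rbf
```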

Statistical pattern recognition

We turn now to some of the formalism of statistical pattern recognition, which we regard as essential for a clear understanding of neural networks. For convenience we introduce many of the central concepts in the context of classification problems, although much the same ideas apply also to regression. The goal is to assign an input pattern x to one of c classes C_k, where k = 1, ..., c. In the case of handwritten digit recognition, for example, we might have ten classes corresponding to the ten digits. One of the powerful results of the theory of statistical pattern recognition is a formalism which describes the theoretically best achievable performance, corresponding to the smallest probability of misclassifying a new input pattern. This provides a principled context within which we can develop neural networks, and other techniques, for classification.

For any but the simplest of classification problems it will not be possible to devise a system which is able to give perfect classification of all possible input patterns. The problem arises because many input patterns cannot be assigned unambiguously to one particular class. Instead, the most general description we can give is in terms of the probabilities of belonging to each of the classes C_k, given an input vector x. These probabilities are written as P(C_k|x), and are called the posterior probabilities of class membership, since they correspond to the probabilities after we have observed the input pattern x. If we consider a large set of patterns all from a particular class C_k, then we can consider the probability distribution of the corresponding input patterns, which we write as p(x|C_k). These are called the class-conditional distributions and, since the vector x is a continuous variable, they correspond to probability density functions rather than probabilities. The distribution of input vectors, irrespective of their class labels, is written as p(x) and is called the unconditional distribution of inputs. Finally, we can consider the probabilities of occurrence of the different classes, irrespective of the input pattern, which we write as P(C_k). These correspond to the relative frequencies of patterns within the complete data set, and are called prior probabilities, since they correspond to the probabilities of membership of each of the classes before we observe a particular input vector.

These various probabilities can be related using two standard results from probability theory. The first is the product rule, which takes the form
\[ P(C_k, \mathbf{x}) = P(C_k|\mathbf{x})\, p(\mathbf{x}) \]
and the second is the sum rule, given by
\[ \sum_k P(C_k, \mathbf{x}) = p(\mathbf{x}). \]
From these rules we obtain the following relation
\[ P(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\, P(C_k)}{p(\mathbf{x})} \]
which is known as Bayes' theorem. The denominator in Bayes' theorem is given by
\[ p(\mathbf{x}) = \sum_k p(\mathbf{x}|C_k)\, P(C_k) \]

and plays the role of a normalizing factor, ensuring that the posterior probabilities sum to one: Σ_k P(C_k|x) = 1. As we shall see shortly, knowledge of the posterior probabilities allows us to find the optimal solution to a classification problem. A key result, discussed in the section on error functions, is that under suitable circumstances the outputs of a correctly trained neural network can be interpreted as (approximations to) the posterior probabilities P(C_k|x) when the vector x is presented to the inputs of the network.
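Bayes' theorem in this form is straightforward to evaluate numerically. The following sketch (an addition; the density and prior values are hypothetical) computes posterior probabilities from class-conditional densities and priors for a single input:

```python
import numpy as np

def posteriors(class_conditionals, priors):
    """Bayes' theorem: P(C_k|x) = p(x|C_k) P(C_k) / p(x),
    with p(x) = sum_k p(x|C_k) P(C_k) as the normalizing factor."""
    joint = np.asarray(class_conditionals) * np.asarray(priors)
    return joint / joint.sum()

# Hypothetical densities p(x|C_k) evaluated at one input x, and priors P(C_k).
p_x_given_C = [0.3, 1.2]
P_C = [0.7, 0.3]
P_C_given_x = posteriors(p_x_given_C, P_C)
```

Note that, although the prior favours class 1, the larger class-conditional density for class 2 makes its posterior the larger of the two here.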

As we have already noted, perfect classification of all possible input vectors will, in general, be impossible. The best we can do is to minimize the probability that an input will be misclassified. This is achieved by assigning each new input vector x to that class for which the posterior probability P(C_k|x) is largest. Thus an input vector x is assigned to class C_k if
\[ P(C_k|\mathbf{x}) > P(C_j|\mathbf{x}) \quad \text{for all } j \neq k. \]
We shall see the justification for this rule shortly. Since the denominator in Bayes' theorem is independent of the class, we see that this is equivalent to assigning input patterns to class C_k provided
\[ p(\mathbf{x}|C_k)\, P(C_k) > p(\mathbf{x}|C_j)\, P(C_j) \quad \text{for all } j \neq k. \]

A pattern classifier provides a rule for assigning each point of feature space to one of c classes. We can therefore regard the feature space as being divided up into c decision regions R_1, ..., R_c, such that a point falling in region R_k is assigned to class C_k. Note that each of these regions need not be contiguous, but may itself be divided into several disjoint regions, all of which are associated with the same class. The boundaries between these regions are known as decision surfaces or decision boundaries.

In order to find the optimal criterion for placement of decision boundaries, consider the case of a one-dimensional feature space x and two classes C_1 and C_2. We seek a decision boundary which minimizes the probability of misclassification, as illustrated in Figure 3. A misclassification error will occur if we assign a new pattern to class C_1 when in fact it belongs to class C_2, or vice versa. We can calculate the total probability of an error of either kind by writing (Duda and Hart)
\[
\begin{aligned}
P(\text{error}) &= P(x \in R_1, C_2) + P(x \in R_2, C_1) \\
&= P(x \in R_1|C_2)\,P(C_2) + P(x \in R_2|C_1)\,P(C_1) \\
&= \int_{R_1} p(x|C_2)\,P(C_2)\, dx + \int_{R_2} p(x|C_1)\,P(C_1)\, dx
\end{aligned}
\]
where P(x ∈ R_1, C_2) is the joint probability of x being assigned to class C_1 (that is, falling in region R_1) and the true class being C_2. From this expression we see that, if p(x|C_1)P(C_1) > p(x|C_2)P(C_2) for a given x, we should choose the regions R_1 and R_2 such that x is in R_1, since this gives the smaller contribution to the error. We recognise this as the earlier decision rule for minimizing the probability of misclassification. The same result can be seen graphically in Figure 3, in which misclassification errors arise from the shaded region. By choosing the decision boundary to coincide with the value of x at which the two distributions cross (shown by the arrow), we minimize the area of the shaded region and hence minimize the probability of misclassification. This corresponds to classifying each new pattern x by comparing the quantities p(x|C_k)P(C_k), which is equivalent to assigning each pattern to the class having the largest posterior probability. A similar justification for this decision rule may be given for the general case of c classes and d-dimensional feature vectors (Duda and Hart).

It is important to distinguish between two separate stages in the classification process. The first is inference, whereby data is used to determine values for the posterior probabilities. These are then used in the second stage, which is decision making, in which those probabilities are used to make decisions, such as assigning a new data point to one of the possible classes.

Figure 3: Schematic illustration of the joint probability densities, given by p(x, C_k) = p(x|C_k) P(C_k), as a function of the feature value x, for two classes C_1 and C_2. If the vertical line is used as the decision boundary, then the classification errors arise from the shaded region. By placing the decision boundary at the point where the two probability density curves cross (shown by the arrow), the probability of misclassification is minimized.

So far we have based classification decisions on the goal of minimizing the probability of misclassification. In many applications this may not be the most appropriate criterion. Consider, for instance, the task of classifying images used in medical screening into two classes corresponding to 'normal' and 'tumour'. There may be much more serious consequences if we classify an image of a tumour as normal than if we classify a normal image as that of a tumour. Such effects may easily be taken into account by the introduction of a loss matrix with elements L_{kj} specifying the penalty associated with assigning a pattern to class C_j when in fact it belongs to class C_k. The overall expected loss is minimized if, for each input x, the decision regions R_j are chosen such that x ∈ R_j when
\[ \sum_{k=1}^{c} L_{kj}\, p(\mathbf{x}|C_k)\, P(C_k) < \sum_{k=1}^{c} L_{ki}\, p(\mathbf{x}|C_k)\, P(C_k) \quad \text{for all } i \neq j \]
which represents a generalization of the usual decision rule for minimizing the probability of misclassification. Note that, if we assign a loss of 1 if the pattern is placed in the wrong class, and a loss of 0 if it is placed in the correct class, so that L_{kj} = 1 − δ_{kj} (where δ_{kj} is the Kronecker delta symbol), then this criterion reduces to the decision rule for minimizing the probability of misclassification given earlier.
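A sketch of this minimum-expected-loss rule, written here in terms of posterior probabilities (equivalent to the rule above, since dividing both sides by p(x) leaves the inequalities unchanged); the loss values and posteriors are hypothetical:

```python
import numpy as np

def decide_min_loss(posterior, L):
    """Choose the class j minimizing the expected loss sum_k L[k, j] P(C_k|x);
    L[k, j] is the penalty for assigning a pattern from class C_k to C_j."""
    risks = posterior @ L          # expected loss of each possible decision j
    return int(np.argmin(risks))

# Hypothetical two-class screening example: calling a tumour "normal"
# (L[1, 0]) is penalized far more heavily than the reverse.
L = np.array([[0.0, 1.0],
              [100.0, 0.0]])
posterior = np.array([0.9, 0.1])   # P(normal|x), P(tumour|x)

decision = decide_min_loss(posterior, L)   # "tumour", despite lower posterior
```

With the 0/1 loss L_{kj} = 1 − δ_{kj}, the rule reduces to picking the largest posterior, as the text notes.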

Another powerful consequence of knowing posterior probabilities is that it becomes possible to introduce a reject criterion. In general, we expect most of the misclassification errors to occur in those regions of x-space where the largest of the posterior probabilities is relatively low, since there is then a strong overlap between different classes. In some applications it may be better not to make a classification decision in such cases. This leads to the following procedure
\[ \text{if } \max_k P(C_k|\mathbf{x}) \;\; \begin{cases} \geq \theta & \text{then classify } \mathbf{x} \\ < \theta & \text{then reject } \mathbf{x} \end{cases} \]
where θ is a threshold in the range (0, 1). The larger the value of θ, the fewer points will be classified. For the medical classification problem, for example, it may be better not to rely on an automatic classification system in doubtful cases, but to have these classified instead by a human expert.
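The reject rule is simple to implement; a sketch with hypothetical posteriors and threshold:

```python
def classify_with_reject(posterior, theta):
    """Return the index of the most probable class, or None to reject
    when the largest posterior falls below the threshold theta."""
    k = max(range(len(posterior)), key=lambda i: posterior[i])
    return k if posterior[k] >= theta else None

print(classify_with_reject([0.05, 0.9, 0.05], 0.8))  # confident: class 1
print(classify_with_reject([0.4, 0.35, 0.25], 0.8))  # ambiguous: rejected (None)
```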

Yet another application for the posterior probabilities arises when the distributions of patterns between the classes, corresponding to the prior probabilities P(C_k), are strongly mismatched between the training set and the operating environment. If we know the posterior probabilities corresponding to the data in the training set, it is then a simple matter to use Bayes' theorem to make the necessary corrections. This is achieved by dividing the posterior probabilities by the prior probabilities corresponding to the training set, multiplying them by the new prior probabilities, and then normalizing the results. Changes in the prior probabilities can therefore be accommodated without retraining the network. The prior probabilities for the training set may be estimated simply by evaluating the fraction of the training set data points in each class. Prior probabilities corresponding to the operating environment can often be obtained very straightforwardly, since only the class labels are needed and no input data is required. As an example, consider again the problem of classifying medical images into 'normal' and 'tumour'. When used for screening purposes, we would expect a very small prior probability of 'tumour'. To obtain a good variety of tumour images in the training set would therefore require huge numbers of training examples. An alternative is to increase artificially the proportion of tumour images in the training set, and then to compensate for the different priors on the test data, as described above. The prior probabilities for tumours in the general population can be obtained from medical statistics, without having to collect the corresponding images. Correction of the network outputs is then a simple matter of multiplication and division.
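The correction described above (divide by the training-set priors, multiply by the operating priors, renormalize) can be sketched as follows; the numbers are hypothetical:

```python
import numpy as np

def correct_priors(outputs, train_priors, new_priors):
    """Adjust network outputs (approximate posteriors under the training-set
    priors) for different operating priors: divide by the old priors,
    multiply by the new, and renormalize so the result sums to one."""
    adjusted = np.asarray(outputs) / np.asarray(train_priors) * np.asarray(new_priors)
    return adjusted / adjusted.sum()

# Hypothetical: the network was trained with balanced classes, but
# "tumour" is rare in the screening population.
outputs = np.array([0.4, 0.6])          # network outputs for one image
train_priors = np.array([0.5, 0.5])
new_priors = np.array([0.99, 0.01])
corrected = correct_priors(outputs, train_priors, new_priors)
```

Here an image that looked mildly tumour-like under balanced priors is, after correction, judged far more likely to be normal, reflecting the rarity of tumours in the operating environment.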

The most common approach to the use of neural networks for classification involves having the network itself directly produce the classification decision. As we have seen, knowledge of the posterior probabilities is substantially more powerful.

Error Functions

We turn next to the problem of determining suitable values for the weight parameters w in a network. Training data is provided in the form of N pairs of input vectors x^n and corresponding desired output vectors t^n, where n = 1, ..., N labels the patterns. These desired outputs are called target values in the neural network context, and the components t_k^n of t^n represent the targets for the corresponding network outputs y_k. For associative prediction problems of the kind we are considering, the most general and complete description of the statistical properties of the data is given in terms of the conditional density of the target data, p(t|x), conditioned on the input data.

A principled way to devise an error function is to use the concept of maximum likelihood. For a set of training data {x^n, t^n}, the likelihood can be written as
\[ L = \prod_n p(\mathbf{t}^n|\mathbf{x}^n) \]
where we have assumed that each data point (x^n, t^n) is drawn independently from the same distribution, so that the likelihood for the complete data set is given by the product of the probabilities for each data point separately. Instead of maximizing the likelihood, it is generally more convenient to minimize the negative logarithm of the likelihood. These are equivalent procedures, since the negative logarithm is a monotonic function. We therefore minimize
\[ E = -\ln L = -\sum_n \ln p(\mathbf{t}^n|\mathbf{x}^n) \]

where E is called an error function. We shall further assume that the distributions of the individual target variables t_k, where k = 1, ..., c, are independent, so that we can write
\[ p(\mathbf{t}|\mathbf{x}) = \prod_{k=1}^{c} p(t_k|\mathbf{x}). \]
As we shall see, a feed-forward neural network can be regarded as a framework for modelling the conditional probability density p(t|x). Different choices of error function then arise from different assumptions about the form of the conditional distribution p(t|x). It is convenient to discuss error functions for regression and classification problems separately.

Error functions for regression

For regression problems, the output variables are continuous. To define a specific error function we must make some choice for the model of the distribution of target data. The simplest assumption is to take this distribution to be Gaussian. More specifically, we assume that the target variable t_k is given by some deterministic function of x with added Gaussian noise ε_k, so that
\[ t_k = h_k(\mathbf{x}) + \epsilon_k. \]
We then assume that the errors ε_k have a normal distribution with zero mean and a standard deviation σ which does not depend on x or on k. Thus, the distribution of ε_k is given by
\[ p(\epsilon_k) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{\epsilon_k^2}{2\sigma^2} \right). \]

We now model the functions h_k(x) by a neural network with outputs y_k(x; w), where w is the set of weight parameters governing the neural network mapping. Using the two expressions above, we see that the probability distribution of target variables is given by
\[ p(t_k|\mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{\{ y_k(\mathbf{x}; \mathbf{w}) - t_k \}^2}{2\sigma^2} \right) \]
where we have replaced the unknown function h_k(x) by our model y_k(x; w). Together with the independence assumption and the negative log-likelihood above, this leads to the following expression for the error function
\[ E = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{c} \{ y_k(\mathbf{x}^n; \mathbf{w}) - t_k^n \}^2 + N c \ln \sigma + \frac{N c}{2} \ln(2\pi). \]

We note that, for the purposes of error minimization, the second and third terms on the right-hand side are independent of the weights w and so can be omitted. Similarly, the overall factor of 1/σ² in the first term can also be omitted. We then finally obtain the familiar expression for the sum-of-squares error function
\[ E = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}^n; \mathbf{w}) - \mathbf{t}^n \|^2. \]
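As a small numerical check (an illustration added here, with random synthetic outputs and targets), the negative log-likelihood under the Gaussian model differs from the sum-of-squares error only by the 1/σ² factor and the weight-independent constant terms:

```python
import numpy as np

rng = np.random.default_rng(4)
N, c, sigma = 10, 2, 0.7
y = rng.standard_normal((N, c))          # stand-ins for network outputs y_k(x^n; w)
t = rng.standard_normal((N, c))          # stand-ins for targets t_k^n

# Negative log-likelihood under the Gaussian noise model.
nll = (np.sum((y - t) ** 2) / (2.0 * sigma ** 2)
       + N * c * np.log(sigma)
       + 0.5 * N * c * np.log(2.0 * np.pi))

# Dropping the weight-independent terms and the 1/sigma^2 factor
# leaves the sum-of-squares error.
E_sse = 0.5 * np.sum((y - t) ** 2)
```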

Note that models expressed as linear combinations of fixed basis functions are linear functions of the parameters w, and so the sum-of-squares error is then a quadratic function of w. This means that the minimum of E can be found in terms of the solution of a set of linear algebraic equations. For this reason, the process of determining the parameters in such models is extremely fast. Functions which depend linearly on the adaptive parameters are called linear models, even though they may be non-linear functions of the input variables. If the basis functions themselves contain adaptive parameters, we have to address the problem of minimizing an error function which is generally highly non-linear.

The sum-of-squares error function was derived from the requirement that the network output vector should represent the conditional mean of the target data, as a function of the input vector. It is easily shown (Bishop, 1995) that minimization of this error, for an infinitely large data set and a highly flexible network model, does indeed lead to a network satisfying this property.
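This conditional-mean property can be illustrated with a short numerical sketch (NumPy; the function and variable names here are our own, not from the text): for a "network" constrained to output a single constant, the sum-of-squares error is minimized by the sample mean of the targets.

```python
import numpy as np

def sum_of_squares_error(y, t):
    """E = (1/2) * sum_n ||y(x^n; w) - t^n||^2 for arrays of shape (N, c)."""
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(0)
t = rng.normal(loc=0.7, scale=0.1, size=(200, 1))   # noisy targets around 0.7

# For a model constrained to output a single constant, E is minimized
# at the sample mean of the targets -- the conditional-mean property.
candidates = np.linspace(0.0, 1.5, 301)
errors = [sum_of_squares_error(np.full_like(t, c), t) for c in candidates]
best = candidates[int(np.argmin(errors))]
print(best, t.mean())   # the best constant lies at (close to) the target mean
```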

We have derived the sum-of-squares error function on the assumption that the distribution of the target data is Gaussian. For some applications such an assumption may be far from valid (if the distribution is multi-modal, for instance), in which case the use of a sum-of-squares error function can lead to extremely poor results. Examples of such distributions arise frequently in inverse problems, such as robot kinematics, the determination of spectral line parameters from the spectrum itself, or the reconstruction of spatial data from line-of-sight information. One general approach in such cases is to combine a feed-forward network with a Gaussian mixture model (a linear combination of Gaussian functions), thereby allowing general conditional distributions $p(\mathbf{t}|\mathbf{x})$ to be modelled (Bishop, 1994).

Error functions for classification

In the case of classification problems, the goal, as we have seen, is to approximate the posterior probabilities of class membership $P(C_k|\mathbf{x})$ given the input pattern $\mathbf{x}$. We now show how to arrange for the outputs of a network to approximate these probabilities.

First we consider the case of two classes $C_1$ and $C_2$. In this case we can consider a network having a single output $y$, which we take to represent the posterior probability $P(C_1|\mathbf{x})$ for class $C_1$. The posterior probability of class $C_2$ will then be given by $P(C_2|\mathbf{x}) = 1 - y$. To achieve this we consider a target coding scheme for which $t = 1$ if the input vector belongs to class $C_1$, and $t = 0$ if it belongs to class $C_2$. We can combine these into a single expression, so that the probability of observing either target value is

$$p(t|\mathbf{x}) = y^t (1 - y)^{1 - t}$$

which is a particular case of the binomial distribution called the Bernoulli distribution. With this interpretation of the output unit activation, the likelihood of observing the training data set, assuming the data points are drawn independently from this distribution, is then given by

$$\prod_n (y^n)^{t^n} (1 - y^n)^{1 - t^n}.$$

As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form

$$E = -\sum_n \left\{ t^n \ln y^n + (1 - t^n) \ln(1 - y^n) \right\}.$$
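As a minimal sketch (NumPy; the clipping constant is our own numerical safeguard, not part of the derivation), the cross-entropy error can be computed directly from this formula, and it behaves as expected: confidently correct outputs give a small error, confidently wrong ones a large error.

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """E = -sum_n { t^n ln y^n + (1 - t^n) ln(1 - y^n) } for 0/1 targets."""
    y = np.clip(y, eps, 1.0 - eps)   # guard the logarithms against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.99, 0.01, 0.99, 0.01])
confident_wrong = np.array([0.01, 0.99, 0.01, 0.99])
print(cross_entropy(confident_right, t))  # small error
print(cross_entropy(confident_wrong, t))  # large error
```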

For the network model introduced earlier, the outputs were linear functions of the activations of the hidden units. While this is appropriate for regression problems, we need to consider the correct choice of output-unit activation function for the case of classification problems. We shall assume (Rumelhart et al., 1995) that the class-conditional distributions of the outputs of the hidden units, represented here by the vector $\mathbf{z}$, are described by

$$p(\mathbf{z}|C_k) = \exp\left\{ A(\boldsymbol{\theta}_k) + B(\mathbf{z}, \boldsymbol{\phi}) + \boldsymbol{\theta}_k^{\mathrm{T}} \mathbf{z} \right\}$$

which is a member of the exponential family of distributions (which includes many of the common distributions as special cases, such as Gaussian, binomial, Bernoulli, Poisson, and so on). The parameters $\boldsymbol{\theta}_k$ and $\boldsymbol{\phi}$ control the form of the distribution. In writing this we are implicitly assuming that the distributions differ only in the parameters $\boldsymbol{\theta}_k$ and not in $\boldsymbol{\phi}$. An example would be two Gaussian distributions with different means, but with common covariance matrices. (Note that the decision boundaries will then be linear functions of $\mathbf{z}$, but will of course be nonlinear functions of the input variables, as a consequence of the nonlinear transformation by the hidden units.)

Using Bayes' theorem, we can write the posterior probability for class $C_1$ in the form

$$P(C_1|\mathbf{z}) = \frac{p(\mathbf{z}|C_1) P(C_1)}{p(\mathbf{z}|C_1) P(C_1) + p(\mathbf{z}|C_2) P(C_2)} = \frac{1}{1 + \exp(-a)}$$

which is a logistic sigmoid function, in which

$$a = \ln \frac{p(\mathbf{z}|C_1) P(C_1)}{p(\mathbf{z}|C_2) P(C_2)}.$$

Using the exponential-family form above, we can write this in the form

$$a = \mathbf{w}^{\mathrm{T}} \mathbf{z} + w_0$$

[Figure: Plots of the class-conditional densities $p(x|C_1)$ and $p(x|C_2)$ used to generate a data set to demonstrate the interpretation of network outputs as posterior probabilities. The training data set was generated from these densities using equal prior probabilities.]

where we have defined

$$\mathbf{w} = \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2, \qquad w_0 = A(\boldsymbol{\theta}_1) - A(\boldsymbol{\theta}_2) + \ln \frac{P(C_1)}{P(C_2)}.$$
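This result can be checked numerically (a sketch with made-up means, variance, and priors of our own choosing): for two spherical Gaussians with common covariance, the posterior computed from Bayes' theorem coincides exactly with a logistic sigmoid of a linear function of $\mathbf{z}$.

```python
import numpy as np

def gauss(z, mean, var):
    """Spherical Gaussian density with variance `var` common to both classes."""
    d = z - mean
    return np.exp(-0.5 * np.dot(d, d) / var) / (2 * np.pi * var) ** (len(z) / 2)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

m1, m2, var = np.array([1.0, 0.0]), np.array([-1.0, 0.5]), 0.8
P1, P2 = 0.6, 0.4                       # illustrative prior probabilities

# Logistic-sigmoid form: a = w^T z + w0, with w and w0 worked out from
# the two Gaussian class-conditional densities above.
w = (m1 - m2) / var
w0 = (np.dot(m2, m2) - np.dot(m1, m1)) / (2 * var) + np.log(P1 / P2)

z = np.array([0.3, -0.2])
posterior_bayes = gauss(z, m1, var) * P1 / (
    gauss(z, m1, var) * P1 + gauss(z, m2, var) * P2)
posterior_sigmoid = sigmoid(np.dot(w, z) + w0)
print(posterior_bayes, posterior_sigmoid)   # the two agree
```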

Thus the network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit.

Incidentally, it is clear that we can also apply the above arguments to the activations of hidden units in a network. Provided such units use logistic sigmoid activation functions, we can interpret their outputs as probabilities of the presence of corresponding 'features', conditioned on the inputs to the units.

As a simple illustration of the interpretation of network outputs as probabilities, we consider a two-class problem with one input variable, in which the class-conditional densities are given by the Gaussian mixture functions shown in the figure above. A feed-forward network with sigmoidal hidden units, and one output unit having a logistic sigmoid activation function, was trained by minimizing a cross-entropy error using the BFGS quasi-Newton algorithm (discussed earlier). The resulting network mapping function is shown, along with the true posterior probability calculated using Bayes' theorem, in the figure below.

For the case of more than two classes, we consider a network with one output for each class, so that each output represents the corresponding posterior probability. First of all we choose the target values for network training according to a 1-of-$c$ coding scheme, so that $t_k^n = \delta_{kl}$ for a pattern $n$ from class $C_l$. We wish to arrange for the probability of observing the set of target values $t_k^n$, given an input vector $\mathbf{x}^n$, to be given by the corresponding network output, so that $p(C_l|\mathbf{x}) = y_l$. The value of the conditional distribution for this pattern can therefore be written as

$$p(\mathbf{t}^n|\mathbf{x}^n) = \prod_{k=1}^{c} (y_k^n)^{t_k^n}.$$

If we form the likelihood function, and take the negative logarithm as before, we obtain an error function of the form

$$E = -\sum_n \sum_{k=1}^{c} t_k^n \ln y_k^n.$$

[Figure: The result of training a multi-layer perceptron on data generated from the density functions in the previous figure. The solid curve shows the output of the trained network as a function of the input variable $x$, while the dashed curve shows the true posterior probability $P(C_1|x)$ calculated from the class-conditional densities using Bayes' theorem.]

Again we must seek the appropriate output-unit activation function to match this choice of error function. As before, we shall assume that the activations of the hidden units are distributed according to the exponential-family form above. From Bayes' theorem, the posterior probability of class $C_k$ is given by

$$p(C_k|\mathbf{z}) = \frac{p(\mathbf{z}|C_k) P(C_k)}{\sum_{k'} p(\mathbf{z}|C_{k'}) P(C_{k'})}.$$

Substituting the exponential-family form and rearranging, we obtain

$$p(C_k|\mathbf{z}) = y_k = \frac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})}$$

where

$$a_k = \mathbf{w}_k^{\mathrm{T}} \mathbf{z} + w_{k0}$$

and we have defined

$$\mathbf{w}_k = \boldsymbol{\theta}_k, \qquad w_{k0} = A(\boldsymbol{\theta}_k) + \ln P(C_k).$$

This activation function is called a softmax function, or normalized exponential. It has the properties that $0 \le y_k \le 1$ and $\sum_k y_k = 1$, as required for probabilities.

It is easily verified (Bishop, 1995) that the minimization of this error function, for an infinite data set and a highly flexible network function, indeed leads to network outputs which represent the posterior probabilities for any input vector $\mathbf{x}$.
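A brief sketch (NumPy; the max-subtraction is a standard numerical-stability device, not part of the derivation) showing the softmax outputs behaving as probabilities, together with the multi-class cross-entropy error:

```python
import numpy as np

def softmax(a):
    """Normalized exponential: y_k = exp(a_k) / sum_k' exp(a_k')."""
    a = a - np.max(a, axis=-1, keepdims=True)   # subtract the max for stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

def multiclass_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n sum_k t_k^n ln y_k^n, with 1-of-c target coding."""
    return -np.sum(t * np.log(np.clip(y, eps, 1.0)))

a = np.array([[2.0, 1.0, 0.1],
              [0.2, 3.0, 0.2]])
y = softmax(a)
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(y.sum(axis=1))                  # each row sums to 1
print(multiclass_cross_entropy(y, t))
```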

Note that the outputs of the trained network need not be close to 0 or 1 if the class-conditional density functions are overlapping. Heuristic procedures, such as applying extra training using those patterns which fail to generate outputs close to the target values, will be counterproductive, since this alters the distributions and makes it less likely that the network will generate the correct Bayesian probabilities.

Error back-propagation

Using the principle of maximum likelihood, we have formulated the problem of learning in neural networks in terms of the minimization of an error function $E(\mathbf{w})$. This error depends on the vector $\mathbf{w}$ of weight and bias parameters in the network, and the goal is therefore to find a weight vector $\mathbf{w}^*$ which minimizes $E$. For models in which the basis functions are fixed, and for an error function given by the sum-of-squares form, the error is a quadratic function of the weights. Its minimization then corresponds to the solution of a set of coupled linear equations, and can be performed rapidly in fixed time. We have seen, however, that models with fixed basis functions suffer from very poor scaling with input dimensionality. In order to avoid this difficulty, we need to consider models with adaptive basis functions. The error function now becomes a highly nonlinear function of the weight vector, and its minimization requires sophisticated optimization techniques.

We have considered error functions of the sum-of-squares and cross-entropy forms, which are differentiable functions of the network outputs. Similarly, we have considered network mappings which are differentiable functions of the weights. It therefore follows that the error function itself will be a differentiable function of the weights, and so we can use gradient-based methods to find its minima. We now show that there is a computationally efficient procedure, called back-propagation, which allows the required derivatives to be evaluated for arbitrary feed-forward network topologies.

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

$$z_j = g(a_j), \qquad a_j = \sum_i w_{ji} z_i$$

where $z_i$ is the activation of a unit, or input, which sends a connection to unit $j$, and $w_{ji}$ is the weight associated with that connection. The summation runs over all units which send connections to unit $j$. Biases can be included in this sum by introducing an extra unit, or input, with activation fixed at $+1$; we therefore do not need to deal with biases explicitly. The error functions which we are considering can be written as a sum over patterns of the error for each pattern separately, so that $E = \sum_n E^n$. This follows from the assumed independence of the data points under the given distribution. We can therefore consider one pattern at a time, and then find the derivatives of $E$ by summing over patterns.

For each pattern, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network by successive application of the forward equation above. This process is often called forward propagation, since it can be regarded as a forward flow of information through the network.

Now consider the evaluation of the derivative of $E^n$ with respect to some weight $w_{ji}$. First we note that $E^n$ depends on the weight $w_{ji}$ only via the summed input $a_j$ to unit $j$. We can therefore apply the chain rule for partial derivatives, to give

$$\frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}.$$

We now introduce a useful notation

$$\delta_j \equiv \frac{\partial E^n}{\partial a_j}$$

where the $\delta$'s are often referred to as errors, for reasons which will become clear shortly. From the forward-propagation equation we can write

$$\frac{\partial a_j}{\partial w_{ji}} = z_i.$$

Substituting the two previous results into the chain-rule expression, we then obtain

$$\frac{\partial E^n}{\partial w_{ji}} = \delta_j z_i.$$

This tells us that the required derivative is obtained simply by multiplying the value of $\delta$ for the unit at the output end of the weight by the value of $z$ for the unit at the input end of the weight (where $z = 1$ in the case of a bias). Thus, in order to evaluate the derivatives, we need

[Figure: Illustration of the calculation of $\delta_j$ for hidden unit $j$ by back-propagation of the $\delta$'s from those units $k$ to which unit $j$ sends connections via weights $w_{kj}$; unit $j$ receives activation $z_i$ through weight $w_{ji}$.]

only to calculate the value of $\delta_j$ for each hidden and output unit in the network, and then apply this result.

For the output units, the evaluation of $\delta_k$ is straightforward. From the definition we have

$$\delta_k \equiv \frac{\partial E^n}{\partial a_k} = g'(a_k) \frac{\partial E^n}{\partial y_k}$$

where we have used $y_k = g(a_k)$, with $z_k$ denoted by $y_k$. In order to evaluate this, we substitute appropriate expressions for $g'(a_k)$ and $\partial E^n / \partial y_k$. If, for example, we consider the sum-of-squares error function together with a network having linear outputs, we obtain

$$\delta_k = y_k^n - t_k^n$$

and so $\delta_k$ represents the error between the actual and the desired values for output $k$. The same form is also obtained if we consider the cross-entropy error function together with a network with a logistic sigmoid output, or if we consider the multi-class error function together with the softmax activation function.

To evaluate the $\delta$'s for hidden units, we again make use of the chain rule for partial derivatives, to give

$$\delta_j \equiv \frac{\partial E^n}{\partial a_j} = \sum_k \frac{\partial E^n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$

where the sum runs over all units $k$ to which unit $j$ sends connections. The arrangement of units and weights is illustrated in the figure above. Note that the units labelled $k$ could include other hidden units and/or output units. In writing this expression, we are making use of the fact that variations in $a_j$ give rise to variations in the error function only through variations in the variables $a_k$. If we now substitute the definition of the $\delta$'s, and make use of the forward-propagation equation, we obtain the following back-propagation formula

$$\delta_j = g'(a_j) \sum_k w_{kj} \delta_k$$

which tells us that the value of $\delta$ for a particular hidden unit can be obtained by propagating the $\delta$'s backwards from units higher up in the network, as illustrated in the figure. Since we already know the values of the $\delta$'s for the output units, it follows that, by recursively applying this formula, we can evaluate the $\delta$'s for all of the hidden units in a feed-forward network, regardless of its topology. Having found the gradient of the error function for this particular pattern, the process of forward and backward propagation is repeated for each pattern in the data set, and the resulting derivatives summed to give the gradient $\nabla E(\mathbf{w})$ of the total error function.
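The two-pass procedure can be sketched for a single-hidden-layer network with tanh hidden units, linear outputs, and a sum-of-squares error (a minimal NumPy illustration under those assumptions; biases are omitted for brevity):

```python
import numpy as np

def forward(x, W1, W2):
    """Forward propagation: a_j = sum_i w_ji z_i, z_j = g(a_j)."""
    a1 = W1 @ x          # summed inputs to the hidden units
    z = np.tanh(a1)      # hidden activations, g = tanh
    y = W2 @ z           # linear output units
    return a1, z, y

def backprop(x, t, W1, W2):
    """One forward and one backward pass for a single pattern (x, t).

    Returns dE/dW1 and dE/dW2 for E^n = 0.5 * ||y - t||^2.
    """
    a1, z, y = forward(x, W1, W2)
    delta_out = y - t                                          # delta_k = y_k - t_k
    # delta_j = g'(a_j) * sum_k w_kj delta_k, with g'(a) = 1 - tanh(a)^2
    delta_hid = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta_out)
    return np.outer(delta_hid, x), np.outer(delta_out, z)      # delta_j z_i

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))   # 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))   # 3 hidden units -> 1 output
x, t = np.array([0.5, -0.2]), np.array([0.3])
g1, g2 = backprop(x, t, W1, W2)
print(g1.shape, g2.shape)
```

A finite-difference check of one weight confirms that the backward pass returns the same derivative as direct numerical differentiation.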

The back-propagation algorithm allows the error function gradient $\nabla E(\mathbf{w})$ to be evaluated efficiently. We now seek a way of using this gradient information to find a weight vector which minimizes the error. This is a standard problem in unconstrained nonlinear optimization; it has been widely studied, and a number of powerful algorithms have been developed. Such algorithms begin by choosing an initial weight vector $\mathbf{w}^{(0)}$ (which might be selected at random), and then making a series of steps through weight space of the form

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$$

where $\tau$ labels the iteration step. The simplest choice for the weight update is given by the gradient descent expression

$$\Delta\mathbf{w}^{(\tau)} = -\eta \, \nabla E \big|_{\mathbf{w}^{(\tau)}}$$

where the gradient vector $\nabla E$ must be re-evaluated at each step. It should be noted that gradient descent is a very inefficient algorithm for highly nonlinear problems such as neural network optimization. Numerous ad hoc modifications have been proposed to try to improve its efficiency. One of the most common is the addition of a momentum term, to give

$$\Delta\mathbf{w}^{(\tau)} = -\eta \, \nabla E \big|_{\mathbf{w}^{(\tau)}} + \mu \, \Delta\mathbf{w}^{(\tau-1)}$$

where $\mu$ is called the momentum parameter. While this can often lead to improvements in the performance of gradient descent, there are now two arbitrary parameters, $\eta$ and $\mu$, whose values must be adjusted to give best performance. Furthermore, the optimal values for these parameters will often vary during the optimization process. In fact, much more powerful techniques have been developed for solving nonlinear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique.
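Gradient descent with momentum can be written in a few lines (a sketch with illustrative step sizes; the quadratic test error is our own toy example, chosen because its minimum is known exactly):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.05, mu=0.9, steps=200):
    """w^(tau+1) = w^(tau) + dw,  where dw = -eta * grad(w) + mu * dw_prev."""
    w, dw = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        dw = -eta * grad(w) + mu * dw   # momentum re-uses the previous update
        w = w + dw
    return w

# A simple quadratic error E(w) = 0.5 * w^T A w, with its minimum at w = 0.
A = np.diag([1.0, 10.0])               # ill-conditioned: momentum helps here
grad = lambda w: A @ w
w_final = gradient_descent(grad, np.array([1.0, 1.0]))
print(w_final)   # close to the minimum at the origin
```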

It should be noted that the term back-propagation is used in the neural computing literature to mean a variety of different things. For instance, the multi-layer perceptron architecture is sometimes called a back-propagation network. The term back-propagation is also used to describe the training of a multi-layer perceptron using gradient descent applied to a sum-of-squares error function. In order to clarify the terminology, it is useful to consider the nature of the training process more carefully. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. As we shall see, the important contribution of the back-propagation technique is in providing a computationally efficient method for evaluating such derivatives. Since it is at this stage that errors are propagated backwards through the network, we use the term back-propagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are then used to compute the adjustments to be made to the weights. The simplest such technique, and the one originally considered by Rumelhart et al. (1986), involves gradient descent. It is important to recognize that the two stages are distinct. Thus the first-stage process, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network, and not just the multi-layer perceptron. It can also be applied to error functions other than the simple sum-of-squares, and to the evaluation of other quantities, such as the Hessian matrix, whose elements comprise the second derivatives of the error function with respect to the weights (Bishop, 1992). Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes (discussed above), many of which are substantially more effective than simple gradient descent.

One of the most important aspects of back-propagation is its computational efficiency. To understand this, let us examine how the number of computer operations required to evaluate the derivatives of the error function scales with the size of the network. A single evaluation of the error function, for a given input pattern, requires $O(W)$ operations, where $W$ is the total number of weights in the network. For $W$ weights in total there are $W$ such derivatives to evaluate. A direct evaluation of these derivatives individually would therefore require $O(W^2)$ operations. By comparison, back-propagation allows all of the derivatives to be evaluated using a single forward propagation and a single backward propagation, together with the relation $\partial E^n / \partial w_{ji} = \delta_j z_i$. Since each of these requires $O(W)$ steps, the overall computational cost is reduced from $O(W^2)$ to $O(W)$. The training of multi-layer perceptron networks, even using back-propagation coupled with efficient optimization algorithms, can be very time consuming, and so this gain in efficiency is crucial.
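The scaling argument can be checked on a toy model (our own example, using a single-layer linear "network"): the finite-difference routine below performs two $O(W)$ error evaluations per weight, i.e. $O(W^2)$ work in total, yet merely reproduces what one forward and one backward pass deliver in $O(W)$.

```python
import numpy as np

def error(w, X, t):
    """Sum-of-squares error for a linear 'network' y = Xw (one weight layer)."""
    return 0.5 * np.sum((X @ w - t) ** 2)

def finite_difference_grad(w, X, t, eps=1e-6):
    """O(W^2): each of the W derivatives costs an O(W) error evaluation."""
    g = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (error(wp, X, t) - error(wm, X, t)) / (2 * eps)  # 2 evaluations per weight
    return g

def analytic_grad(w, X, t):
    """O(W): the closed-form gradient, analogous to one backward pass."""
    return X.T @ (X @ w - t)

rng = np.random.default_rng(2)
X, w = rng.normal(size=(20, 5)), rng.normal(size=5)
t = rng.normal(size=20)
print(np.max(np.abs(finite_difference_grad(w, X, t) - analytic_grad(w, X, t))))
```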

Generalization

The goal of network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data. This is important if the network is to exhibit good generalization, that is, to make good predictions for new inputs.

In order for the network to provide a good representation of the generator of the data, it is important that the effective complexity of the model be matched to the data set. This is most easily illustrated by returning to the analogy with polynomial curve fitting introduced earlier. In this case the model complexity is governed by the order of the polynomial, which in turn governs the number of adjustable coefficients. Consider a data set of points generated by sampling the function

$$h(x) = 0.5 + 0.4 \sin(2\pi x)$$

at equal intervals of $x$, and then adding random noise with a Gaussian distribution having fixed standard deviation. This reflects a basic property of most data sets of interest in pattern recognition, in that the data exhibits an underlying systematic component (represented in this case by the function $h(x)$) but is corrupted with random noise. The first of the figures below shows the training data, as well as the function $h(x)$, together with the result of fitting a linear polynomial ($M = 1$). As can be seen, this polynomial gives a poor representation of $h(x)$, as a consequence of its limited flexibility. We can obtain a better fit by increasing the order of the polynomial, since this increases the number of degrees of freedom (the number of free parameters) in the function, which gives it greater flexibility.

The next figure shows the result of fitting a cubic polynomial ($M = 3$), which gives a much better approximation to $h(x)$. If, however, we increase the order of the polynomial too far, then the approximation to the underlying function actually gets worse. The third figure shows the result of fitting a tenth-order polynomial ($M = 10$). This is now able to achieve a perfect fit to the training data, since a tenth-order polynomial has 11 free parameters, matching the number of data points. However, the polynomial has fitted the data by developing some dramatic oscillations, and consequently gives a poor representation of $h(x)$. Functions of this kind are said to be over-fitted to the data.

In order to determine the generalization performance of the different polynomials, we generate a second, independent, test set and measure the root-mean-square error $E^{\mathrm{RMS}}$ with respect to both training and test sets. The final figure below shows a plot of $E^{\mathrm{RMS}}$ for both the training data set and the test data set, as a function of the order $M$ of the polynomial. We see that the training set error decreases steadily as the order of the polynomial increases. However, the test set error reaches a minimum at $M = 3$, and thereafter increases as the order of the polynomial is increased. The smallest error is achieved by that polynomial ($M = 3$) which most closely matches the function $h(x)$ from which the data was generated.
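The experiment can be reproduced in outline (a hedged sketch: the generator function, noise level, and the 11-point sample size are assumptions chosen to match the qualitative behaviour described, not constants taken from the text). The training error falls as $M$ grows, while the test error typically rises again at $M = 10$:

```python
import numpy as np

rng = np.random.default_rng(3)
h = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)   # assumed form of the generator

def make_set(n, sigma=0.05):                       # noise level is illustrative
    x = np.linspace(0.0, 1.0, n)
    return x, h(x) + rng.normal(scale=sigma, size=n)

def rms_error(coef, x, t):
    return np.sqrt(np.mean((np.polyval(coef, x) - t) ** 2))

x_train, t_train = make_set(11)    # 11 points: a 10th-order fit is exact
x_test, t_test = make_set(101)

for M in (1, 3, 10):
    coef = np.polyfit(x_train, t_train, M)         # least-squares polynomial fit
    print(M, rms_error(coef, x_train, t_train), rms_error(coef, x_test, t_test))
```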

In the case of neural networks, the weights and biases are analogous to the polynomial coefficients. These parameters can be optimized by minimization of an error function defined with respect to a training data set. The model complexity is governed by the number of such parameters, and so is determined by the network architecture and, in particular, by the number of hidden

[Figure: An example of a set of data points obtained by sampling the function $h(x)$ at equal intervals of $x$ and adding random noise. The dashed curve shows $h(x)$, while the solid curve shows the rather poor approximation obtained with a linear polynomial, corresponding to $M = 1$.]

[Figure: The same data set as in the previous figure, this time fitted by a cubic ($M = 3$) polynomial, showing the significantly improved approximation to $h(x)$ achieved by this more flexible function.]

[Figure: The result of fitting the same data set using a tenth-order ($M = 10$) polynomial. This gives a perfect fit to the training data, but at the expense of a function which has large oscillations, and which therefore gives a poorer representation of the generator function $h(x)$ than did the cubic polynomial.]

[Figure: Plots of the RMS error $E^{\mathrm{RMS}}$ as a function of the order of the polynomial, for both training and test sets. The error with respect to the training set decreases monotonically with $M$, while the error in making predictions for new data (as measured by the test set) shows a minimum at $M = 3$.]

units. We have seen that the complexity cannot be optimized by minimization of training set error, since the smallest training error corresponds to an over-fitted model which has poor generalization. Instead, we see that the optimum complexity can be chosen by comparing the performance of a range of trained models using an independent test set. A more elaborate version of this procedure is cross-validation (Stone, 1974; Wahba and Wold, 1975).

Instead of directly varying the number of adaptive parameters in a network, the effective complexity of the model may be controlled through the technique of regularization. This involves the use of a model with a relatively large number of parameters, together with the addition of a penalty term $\Omega$ to the usual error function $E$, to give a total error function of the form

$$\widetilde{E} = E + \nu \Omega$$

where $\nu$ is called a regularization coefficient. The penalty term $\Omega$ is chosen so as to encourage smoother network mapping functions, since, by analogy with the polynomial results shown above, we expect that good generalization is achieved when the rapid variations in the mapping associated with over-fitting are smoothed out. There will be an optimum value for $\nu$, which can again be found by comparing the performance of models trained using different values of $\nu$ on an independent test set. Regularization is usually the preferred choice for model complexity control, for a number of reasons: it allows prior knowledge to be incorporated into network training; it has a natural interpretation in the Bayesian framework (discussed below); and it can be extended to provide more complex forms of regularization involving several different regularization parameters, which can be used, for example, to determine the relative importance of different inputs.
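A minimal sketch of the regularized total error (the quadratic data error is our own toy example; the weight-decay penalty $\Omega = \frac{1}{2}\|\mathbf{w}\|^2$ is one common choice, not the only one). Increasing $\nu$ shrinks the minimizing weights towards zero:

```python
import numpy as np

def total_error(w, data_error, nu):
    """E~ = E + nu * Omega, with weight-decay penalty Omega = 0.5 * ||w||^2."""
    return data_error(w) + nu * 0.5 * np.sum(w ** 2)

# Illustrative data error with its (unregularized) minimum at w = (3, 4).
data_error = lambda w: 0.5 * np.sum((w - np.array([3.0, 4.0])) ** 2)

# For this quadratic case the regularized minimum can be found exactly:
# grad E~ = (w - w*) + nu * w = 0  =>  w = w* / (1 + nu).
for nu in (0.0, 1.0, 10.0):
    w_min = np.array([3.0, 4.0]) / (1.0 + nu)
    print(nu, w_min, total_error(w_min, data_error, nu))
```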

Discussion

In this chapter we have presented a brief overview of neural networks from the viewpoint of statistical pattern recognition. Due to lack of space, there are many important issues which we have not discussed or have only touched upon. Here we mention two further topics of considerable significance for neural computing.

In practical applications of neural networks, one of the most important factors determining the overall performance of the final system is that of data pre-processing. Since a neural network mapping has universal approximation capabilities, as discussed earlier, it would in principle be possible to use the original data directly as the input to a network. In practice, however, there is generally considerable advantage in processing the data in various ways before it is used for network training. One important reason why pre-processing can lead to improved performance is that it can offset some of the effects of the 'curse of dimensionality' by reducing the number of input variables. Inputs can be combined in linear or nonlinear ways to give a smaller number of new inputs which are then presented to the network. This is sometimes called feature extraction. Although information is often lost in the process, this can be more than compensated for by the benefits of a lower input dimensionality. Another significant aspect of pre-processing is that it allows the use of prior knowledge, in other words information which is relevant to the solution of a problem and which is additional to that contained in the training data. A simple example would be the prior knowledge that the classification of a handwritten digit should not depend on the location of the digit within the input image. By extracting features which are independent of position, this translation invariance can be incorporated into the network structure, and this will generally give substantially improved performance compared with using the original image directly as the input to the network. Another use for pre-processing is to clean up deficiencies in the data. For example, real data sets often suffer from the problem of missing values in many of the patterns, and these must be accounted for before network training can proceed.
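As a hedged illustration of linear feature extraction (a generic principal-components sketch of our own, not a procedure specified in the text), inputs can be combined linearly to give a smaller number of new inputs:

```python
import numpy as np

def pca_features(X, n_components):
    """Project centred inputs onto the leading principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(4)
# 200 ten-dimensional inputs that really vary along only ~2 directions.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))

Z = pca_features(X, 2)        # new, lower-dimensional network inputs
print(X.shape, "->", Z.shape)
```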

The discussion of learning in neural networks given above was based on the principle of maximum likelihood, which itself stems from the frequentist school of statistics. A more fundamental, and potentially more powerful, approach is given by the Bayesian viewpoint (Jaynes, 1986). Instead of describing a trained network by a single weight vector $\mathbf{w}^*$, the Bayesian approach expresses our uncertainty in the values of the weights through a probability distribution $p(\mathbf{w})$. The effect of observing the training data is to cause this distribution to become much more concentrated in particular regions of weight space, reflecting the fact that some weight vectors are more consistent with the data than others. Predictions for new data points require the evaluation of integrals over weight space, weighted by the distribution $p(\mathbf{w})$. The maximum likelihood approach considered above then represents a particular approximation in which we consider only the most probable weight vector, corresponding to a peak in the distribution. Aside from offering a more fundamental view of learning in neural networks, the Bayesian approach allows error bars to be assigned to network predictions, and regularization arises in a natural way in the Bayesian setting. Furthermore, a Bayesian treatment allows the model complexity (as determined by regularization coefficients, for instance) to be treated without the need for independent data, as in cross-validation.

Although the Bayesian approach is very appealing, a full implementation is intractable for neural networks. Two principal approximation schemes have therefore been considered. In the first of these (MacKay, 1992a; 1992b; 1992c), the distribution over weights is approximated by a Gaussian centred on the most probable weight vector. Integrations over weight space can then be performed analytically, and this leads to a practical scheme which involves relatively small modifications to conventional algorithms. An alternative approach to the Bayesian treatment of neural networks is to use Monte Carlo techniques (Neal, 1996) to perform the required integrations numerically, without making analytical approximations. Again this leads to a practical scheme, which has been applied to some real-world problems.

An interesting aspect of the Bayesian viewpoint is that it is not, in principle, necessary to limit network complexity (Neal, 1996), and that over-fitting should not arise if the Bayesian approach is implemented correctly.

A more comprehensive discussion of these and other topics can be found in Bishop (1995).

References

Anderson J A and E Rosenfeld ds Neur o c omputing F oundations of R ese ar ch

Cam bridge MA MIT Press

Baum E B and F Wilczek Sup ervised learning of probabilit y distributions b y neural

net w orks In D Z Anderson d Neur al Information Pr o c essing Systems pp New

Y ork American Institute of Ph ysics

Bellman R A daptive Contr ol Pr o c esses A Guide d T our New Jersey Princeton Uni

v ersit y Press

Bishop C M Exact calculation of the Hessian matrix for the m ultila y er p erceptron

Neur al Computation

Bishop C M Mixture densit y net w orks T ec hnical Rep ort NCR G Neural Com

puting Researc h Group Aston Univ ersit y Birmingham UK

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Broomhead, D. S. and D. Lowe (1988). Multivariable functional interpolation and adaptive networks. Complex Systems.

Cotter, N. E. (1990). The Stone–Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.

Dennis, J. E. and R. B. Schnabel (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall.

Devijver, P. A. and J. Kittler (1982). Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ: Prentice-Hall.

Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York: John Wiley.

Fletcher, R. (1987). Practical Methods of Optimization (Second ed.). New York: John Wiley.

Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association.

Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (Second ed.). San Diego: Academic Press.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks.

Gill, P. E., W. Murray, and M. H. Wright (1981). Practical Optimization. London: Academic Press.

Hampshire, J. B. and B. Pearlmutter (1990). Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (Eds.), Proceedings of the Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann.

Hand, D. J. (1981). Discrimination and Classification. New York: John Wiley.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, San Diego, CA: IEEE.

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence.

Hopfield, J. J. (1987). Learning algorithms and probability distributions in feed-forward and feed-back networks. Proceedings of the National Academy of Sciences.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks.

Hornik, K., M. Stinchcombe, and H. White (1989). Multilayer feedforward networks are universal approximators. Neural Networks.

Hornik, K., M. Stinchcombe, and H. White (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks.

Huber, P. J. (1985). Projection pursuit. Annals of Statistics.

Ito, Y. (1991). Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks.

Jaynes, E. T. (1986). Bayesian methods: general background. In J. H. Justice (Ed.), Maximum Entropy and Bayesian Methods in Applied Statistics, Cambridge University Press.

Kreinovich, V. Y. (1991). Arbitrary nonlinearity is sufficient to represent all functions by neural networks: a theorem. Neural Networks.

Le Cun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation.

Luenberger, D. G. (1984). Linear and Nonlinear Programming (Second ed.). Reading, MA: Addison-Wesley.

MacKay, D. J. C. (1992a). Bayesian interpolation. Neural Computation.

MacKay, D. J. C. (1992b). The evidence framework applied to classification networks. Neural Computation.

MacKay, D. J. C. (1992c). A practical Bayesian framework for backpropagation networks. Neural Computation.

Moody, J. and C. J. Darken (1989). Fast learning in networks of locally-tuned processing units. Neural Computation.

Neal, R. M. (1994). Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto, Canada.

Polak, E. (1971). Computational Methods in Optimization: A Unified Approach. New York: Academic Press.

Rumelhart, D. E., R. Durbin, R. Golden, and Y. Chauvin (1995). Backpropagation: the basic theory. In Y. Chauvin and D. E. Rumelhart (Eds.), Backpropagation: Theory, Architectures, and Applications, Hillsdale, NJ: Lawrence Erlbaum.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, Cambridge, MA: MIT Press. Reprinted in Anderson and Rosenfeld (1988).

Solla, S. A., E. Levin, and M. Fleisher (1988). Accelerated learning in layered neural networks. Complex Systems.

Stinchcombe, M. and H. White (1989). Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, San Diego: IEEE.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, B.

Stone, M. (1978). Cross-validation: a review. Math. Operationsforsch. Statist. Ser. Statistics.

Wahba, G. and S. Wold (1975). A completely automatic French curve: fitting spline functions by cross-validation. Communications in Statistics, Series A.
