An introduction to Neural Networks

Ben Kröse    Patrick van der Smagt

Eighth edition
November 1996

© 1996 The University of Amsterdam. Permission is granted to distribute single copies of this book for non-commercial use, as long as it is distributed as a whole in its original form, and the names of the authors and the University of Amsterdam are mentioned. Permission is also granted to use this book for non-commercial courses, provided the authors are notified of this beforehand.

The authors can be reached at:

Ben Kröse
Faculty of Mathematics & Computer Science
University of Amsterdam
Kruislaan 403, NL-1098 SJ Amsterdam
THE NETHERLANDS
email: krose@fwi.uva.nl
URL: http://www.fwi.uva.nl/research/neuro/

Patrick van der Smagt
Institute of Robotics and System Dynamics
German Aerospace Research Establishment
P.O. Box 1116, D-82230 Wessling
GERMANY
email: smagt@dlr.de
URL: http://www.op.dlr.de/FF-DR-RS/

Contents

Preface

I FUNDAMENTALS

1 Introduction

2 Fundamentals
  2.1 A framework for distributed representation
    2.1.1 Processing units
    2.1.2 Connections between units
    2.1.3 Activation and output rules
  2.2 Network topologies
  2.3 Training of artificial neural networks
    2.3.1 Paradigms of learning
    2.3.2 Modifying patterns of connectivity
  2.4 Notation and terminology
    2.4.1 Notation
    2.4.2 Terminology

II THEORY

3 Perceptron and Adaline
  3.1 Networks with threshold activation functions
  3.2 Perceptron learning rule and convergence theorem
    3.2.1 Example of the Perceptron learning rule
    3.2.2 Convergence theorem
    3.2.3 The original Perceptron
  3.3 The adaptive linear element (Adaline)
  3.4 Networks with linear activation functions: the delta rule
  3.5 Exclusive-OR problem
  3.6 Multi-layer perceptrons can do everything
  3.7 Conclusions

4 Back-propagation
  4.1 Multi-layer feed-forward networks
  4.2 The generalised delta rule
    4.2.1 Understanding back-propagation
  4.3 Working with back-propagation
  4.4 An example
  4.5 Other activation functions
  4.6 Deficiencies of back-propagation
  4.7 Advanced algorithms
  4.8 How good are multi-layer feed-forward networks?
    4.8.1 The effect of the number of learning samples
    4.8.2 The effect of the number of hidden units
  4.9 Applications

5 Recurrent Networks
  5.1 The generalised delta rule in recurrent networks
    5.1.1 The Jordan network
    5.1.2 The Elman network
    5.1.3 Back-propagation in fully recurrent networks
  5.2 The Hopfield network
    5.2.1 Description
    5.2.2 Hopfield network as associative memory
    5.2.3 Neurons with graded response
  5.3 Boltzmann machines

6 Self-Organising Networks
  6.1 Competitive learning
    6.1.1 Clustering
    6.1.2 Vector quantisation
  6.2 Kohonen network
  6.3 Principal component networks
    6.3.1 Introduction
    6.3.2 Normalised Hebbian rule
    6.3.3 Principal component extractor
    6.3.4 More eigenvectors
  6.4 Adaptive resonance theory
    6.4.1 Background: Adaptive resonance theory
    6.4.2 ART1: The simplified neural network model
    6.4.3 ART1: The original model

7 Reinforcement learning
  7.1 The critic
  7.2 The controller network
  7.3 Barto's approach: the ASE-ACE combination
    7.3.1 Associative search
    7.3.2 Adaptive critic
    7.3.3 The cart-pole system
  7.4 Reinforcement learning versus optimal control

III APPLICATIONS

8 Robot Control
  8.1 End-effector positioning
    8.1.1 Camera-robot coordination is function approximation
  8.2 Robot arm dynamics
  8.3 Mobile robots
    8.3.1 Model based navigation
    8.3.2 Sensor based control

9 Vision
  9.1 Introduction
  9.2 Feed-forward types of networks
  9.3 Self-organising networks for image compression
    9.3.1 Back-propagation
    9.3.2 Linear networks
    9.3.3 Principal components as features
  9.4 The cognitron and neocognitron
    9.4.1 Description of the cells
    9.4.2 Structure of the cognitron
    9.4.3 Simulation results
  9.5 Relaxation types of networks
    9.5.1 Depth from stereo
    9.5.2 Image restoration and image segmentation
    9.5.3 Silicon retina

IV IMPLEMENTATIONS

10 General Purpose Hardware
  10.1 The Connection Machine
    10.1.1 Architecture
    10.1.2 Applicability to neural networks
  10.2 Systolic arrays

11 Dedicated Neuro-Hardware
  11.1 General issues
    11.1.1 Connectivity constraints
    11.1.2 Analogue vs. digital
    11.1.3 Optics
    11.1.4 Learning vs. non-learning
  11.2 Implementation examples
    11.2.1 Carver Mead's silicon retina
    11.2.2 LEP's LNeuro chip

References

Index

List of Figures

2.1 The basic components of an artificial neural network
2.2 Various activation functions for a unit
3.1 Single layer network with one output and two inputs
3.2 Geometric representation of the discriminant function and the weights
3.3 Discriminant function before and after weight update
3.4 The Perceptron
3.5 The Adaline
3.6 Geometric representation of input space
3.7 Solution of the XOR problem
4.1 A multi-layer network with l layers of units
4.2 The descent in weight space
4.3 Example of function approximation with a feed-forward network
4.4 The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions
4.5 The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions
4.6 Slow decrease with conjugate gradient in non-quadratic systems
4.7 Effect of the learning set size on the generalization
4.8 Effect of the learning set size on the error rate
4.9 Effect of the number of hidden units on the network performance
4.10 Effect of the number of hidden units on the error rate
5.1 The Jordan network
5.2 The Elman network
5.3 Training an Elman network to control an object
5.4 Training a feed-forward network to control an object
5.5 The auto-associator network
6.1 A simple competitive learning network
6.2 Example of clustering in 2D with normalised vectors
6.3 Determining the winner in a competitive learning network
6.4 Competitive learning for clustering data
6.5 Vector quantisation tracks input density
6.6 A network combining a vector quantisation layer with a feed-forward layer; this network can be used to approximate functions, the input space being discretised in disjoint subspaces
6.7 Gaussian neuron distance function
6.8 A topology-conserving map converging
6.9 The mapping of a two-dimensional input space on a one-dimensional Kohonen network
6.10 Mexican hat
6.11 Distribution of input samples
6.12 The ART architecture
6.13 The ART1 neural network
6.14 An example ART run
7.1 Reinforcement learning scheme
7.2 Architecture of a reinforcement learning scheme with critic element
7.3 The cart-pole system
8.1 An exemplar robot manipulator
8.2 Indirect learning system for robotics
8.3 The system used for specialised learning
8.4 A Kohonen network merging the output of two cameras
8.5 The neural model proposed by Kawato et al.
8.6 The neural network used by Kawato et al.
8.7 The desired joint patterns; several joints have similar time patterns
8.8 Schematic representation of the stored rooms, and the partial information which is available from a single sonar scan
8.9 The structure of the network for the autonomous land vehicle
9.1 Input image for the network
9.2 Weights of the PCA network
9.3 The basic structure of the cognitron
9.4 Cognitron receptive regions
9.5 Two learning iterations in the cognitron
9.6 Feeding back activation values in the cognitron
10.1 The Connection Machine system organisation
10.2 Typical use of a systolic array
10.3 The Warp system architecture
11.1 Connections between M input and N output neurons
11.2 Optical implementation of matrix multiplication
11.3 The photo-receptor used by Mead
11.4 The resistive layer (a) and, enlarged, a single node (b)
11.5 The LNeuro chip

Preface

This manuscript attempts to provide the reader with insight in artificial neural networks. Back then, the absence of any state-of-the-art textbook forced us into writing our own. However, in the mean time a number of worthwhile textbooks have been published which can be used for background and in-depth information. We are aware of the fact that, at times, this manuscript may prove to be too thorough or not thorough enough for a complete understanding of the material; therefore, further reading material can be found in some excellent text books such as (Hertz, Krogh, & Palmer, 1991; Ritter, Martinetz, & Schulten, 1990; Kohonen, 1995; Anderson & Rosenfeld, 1988; DARPA, 1988; McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986).

Some of the material in this book, especially parts III and IV, contains timely material and thus may heavily change throughout the ages. The choice of describing robotics and vision as neural network applications coincides with the neural network research interests of the authors.

Much of the material presented in chapter 8 has been written by Joris van Dam and Anuj Dev at the University of Amsterdam. Also, Anuj contributed to material in chapter 9. The basis of chapter 7 was formed by a report of Gerard Schram at the University of Amsterdam. Furthermore, we express our gratitude to those people out there in Net-land who gave us feedback on this manuscript, especially Michiel van der Korst and Nicolas Maudit who pointed out quite a few of our goof-ups. We owe them many kwartjes for their help.

The seventh edition is not drastically different from the sixth one; we corrected some typing errors, added some examples, and deleted some obscure parts of the text. In the eighth edition, the symbols used in the text have been globally changed. Also, the chapter on recurrent networks has been (albeit marginally) updated. The index still requires an update, though.

Amsterdam / Oberpfaffenhofen, November 1996

Patrick van der Smagt
Ben Kröse

Part I

FUNDAMENTALS

1 Introduction

A wave of interest in neural networks (known as 'connectionist models' or 'parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts (McCulloch & Pitts, 1943). These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks.

When Minsky and Papert published their book Perceptrons in 1969 (Minsky & Papert, 1969), in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued their efforts, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, and Kunihiko Fukushima.

The interest in neural networks re-emerged only after some important theoretical results were attained in the early eighties (most notably the discovery of error back-propagation), and new hardware developments increased the processing capacities. This renewed interest is reflected in the number of scientists, the amounts of funding, the number of large conferences, and the number of journals associated with neural networks. Nowadays most universities have a neural networks group, within their psychology, physics, computer science, or biology departments.

Artificial neural networks can be most adequately characterised as 'computational models' with particular properties such as the ability to adapt or learn, to generalise, or to cluster or organise data, and whose operation is based on parallel processing. However, many of the above-mentioned properties can be attributed to existing (non-neural) models; the intriguing question is to which extent the neural approach proves to be better suited for certain applications than existing models. To date an unequivocal answer to this question has not been found.

Often parallels with biological systems are described. However, there is still so little known (even at the lowest cell level) about biological systems, that the models we are using for our artificial neural systems seem to introduce an oversimplification of the 'biological' models.

In this course we give an introduction to artificial neural networks. The point of view we take is that of a computer scientist. We are not concerned with the psychological implication of the networks, and we will at most occasionally refer to biological neural models. We consider neural networks as an alternative computational scheme rather than anything else.

These lecture notes start with a chapter in which a number of fundamental properties are discussed. In chapter 3 a number of 'classical' approaches are described, as well as the discussion on their limitations which took place in the early sixties. Chapter 4 continues with the description of attempts to overcome these limitations and introduces the back-propagation learning algorithm. Chapter 5 discusses recurrent networks; in these networks, the restraint that there are no cycles in the network graph is removed. Self-organising networks, which require no external teacher, are discussed in chapter 6. Then, in chapter 7, reinforcement learning is introduced. Chapters 8 and 9 focus on applications of neural networks in the fields of robotics and image processing respectively. The final chapters discuss implementational aspects.

2 Fundamentals

The artificial neural networks which we describe in this course are all variations on the parallel distributed processing (PDP) idea. The architecture of each network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and discuss different network topologies. Learning strategies, as a basis for an adaptive system, will be presented in the last section.

2.1 A framework for distributed representation

An artificial network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.

A set of major aspects of a parallel distributed model can be distinguished (cf. Rumelhart and McClelland (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986)):

- a set of processing units ('neurons,' 'cells');
- a state of activation $y_k$ for every unit, which is equivalent to the output of the unit;
- connections between the units; generally, each connection is defined by a weight $w_{jk}$ which determines the effect which the signal of unit $j$ has on unit $k$;
- a propagation rule, which determines the effective input $s_k$ of a unit from its external inputs;
- an activation function $F_k$, which determines the new level of activation based on the effective input $s_k(t)$ and the current activation $y_k(t)$ (i.e., the update);
- an external input (aka bias, offset) $\theta_k$ for each unit;
- a method for information gathering (the learning rule);
- an environment within which the system must operate, providing input signals and, if necessary, error signals.

Figure 2.1 illustrates these basics, some of which will be discussed in the next sections.

2.1.1 Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources, and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time.

Within neural systems it is useful to distinguish three types of units: input units (indicated by an index $i$) which receive data from outside the neural network, output units (indicated by an index $o$) which send data out of the neural network, and hidden units (indicated by an index $h$) whose input and output signals remain within the neural network.

Figure 2.1: The basic components of an artificial neural network. The propagation rule used here is the 'standard' weighted summation.

During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time $t$, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.

2.1.2 Connections between units

In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit $k$ is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term $\theta_k$:

$$s_k(t) = \sum_j w_{jk}(t)\, y_j(t) + \theta_k(t).$$

The contribution for positive $w_{jk}$ is considered as an excitation and for negative $w_{jk}$ as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with this propagation rule sigma units.

A different propagation rule, introduced by Feldman and Ballard (Feldman & Ballard, 1982), is known as the propagation rule for the sigma-pi unit:

$$s_k(t) = \sum_j w_{jk}(t) \prod_m y_{j_m}(t) + \theta_k(t).$$

Often, the $y_{j_m}$ are weighted before multiplication. Although these units are not frequently used, they have their value for gating of input, as well as implementation of lookup tables (Mel, 1990).
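As an illustration of the two propagation rules, consider the following minimal Python sketch (an addition for clarity, not part of the original text; the grouping of the sigma-pi conjuncts and all numerical values are arbitrary):

```python
import numpy as np

def sigma_input(w_k, y, theta_k):
    """Standard weighted summation: s_k = sum_j w_jk * y_j + theta_k."""
    return np.dot(w_k, y) + theta_k

def sigma_pi_input(w_k, conjuncts, theta_k):
    """Sigma-pi rule: s_k = sum_j w_jk * prod_m y_jm + theta_k,
    where each connection j gates on a product of unit outputs."""
    return sum(w * np.prod(ys) for w, ys in zip(w_k, conjuncts)) + theta_k

y = np.array([0.2, 0.9, 0.4])                  # outputs of connected units
print(sigma_input(np.array([1.0, -0.5, 2.0]), y, theta_k=0.1))
# two conjuncts: {y_1 * y_2} and {y_3}
print(sigma_pi_input([1.0, 2.0], [(y[0], y[1]), (y[2],)], theta_k=0.1))
```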

2.1.3 Activation and output rules

We also need a rule which gives the effect of the total input on the activation of the unit. We need a function $F_k$ which takes the total input $s_k(t)$ and the current activation $y_k(t)$ and produces a new value of the activation of the unit $k$:

$$y_k(t+1) = F_k\big(y_k(t),\, s_k(t)\big).$$

Often, the activation function is a nondecreasing function of the total input of the unit:

$$y_k(t+1) = F_k\big(s_k(t)\big) = F_k\Big(\sum_j w_{jk}(t)\, y_j(t) + \theta_k(t)\Big),$$

although activation functions are not restricted to nondecreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sgn function), a linear or semi-linear function, or a smoothly limiting threshold (see figure 2.2). For this smoothly limiting function often a sigmoid (S-shaped) function like

$$y_k = F_k(s_k) = \frac{1}{1 + e^{-s_k}}$$

is used. In some applications a hyperbolic tangent is used, yielding output values in the range $[-1, +1]$.

Figure 2.2: Various activation functions for a unit (hard limiting sgn function, semi-linear function, sigmoid).

In some cases, the output of a unit can be a stochastic function of the total input of the unit. In that case the activation is not deterministically determined by the neuron input, but the neuron input determines the probability $p$ that a neuron gets a high activation value:

$$p(y_k \leftarrow 1) = \frac{1}{1 + e^{-s_k / T}},$$

in which $T$ (cf. temperature) is a parameter which determines the slope of the probability function. This type of unit will be discussed more extensively in chapter 5.

In all networks we describe, we consider the output of a neuron to be identical to its activation level.
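The following small Python sketch (an illustration of ours, with arbitrary test inputs) implements the deterministic activation functions of figure 2.2 together with the stochastic unit just described:

```python
import math, random

def sgn(s):
    """Hard limiting threshold."""
    return 1.0 if s >= 0 else -1.0

def semi_linear(s, lo=0.0, hi=1.0):
    """Linear between lo and hi, clipped outside that interval."""
    return max(lo, min(hi, s))

def sigmoid(s):
    """Smoothly limiting threshold: y = 1 / (1 + e^-s)."""
    return 1.0 / (1.0 + math.exp(-s))

def stochastic_activation(s, T=1.0):
    """Unit takes the high value with probability p = 1 / (1 + e^(-s/T))."""
    p = 1.0 / (1.0 + math.exp(-s / T))
    return 1.0 if random.random() < p else 0.0

for s in (-2.0, 0.0, 2.0):
    print(sgn(s), semi_linear(s), round(sigmoid(s), 3), stochastic_activation(s))
```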

2.2 Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data.

As for this pattern of connections, the main distinction we can make is between:

- Feed-forward networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers.

- Recurrent networks that do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the change of the activation values of the output neurons is significant, such that the dynamical behaviour constitutes the output of the network (Pearlmutter, 1990).

Classical examples of feed-forward networks are the Perceptron and the Adaline, which will be discussed in the next chapter. Examples of recurrent networks have been presented by Anderson (Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982), and will be discussed in chapter 5.

2.3 Training of artificial neural networks

A neural network has to be configured such that the application of a set of inputs produces (either 'direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.

2.3.1 Paradigms of learning

We can categorise the learning situations in two distinct sorts. These are:

- Supervised learning or Associative learning, in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the network (self-supervised).

- Unsupervised learning or Self-organisation, in which an (output) unit is trained to respond to clusters of pattern within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.

2.3.2 Modifying patterns of connectivity

Both learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (Hebb, 1949). The basic idea is that if two units $j$ and $k$ are active simultaneously, their interconnection must be strengthened. If $j$ receives input from $k$, the simplest version of Hebbian learning prescribes to modify the weight $w_{jk}$ with

$$\Delta w_{jk} = \gamma\, y_j\, y_k,$$

where $\gamma$ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit $k$ but the difference between the actual and desired activation for adjusting the weights:

$$\Delta w_{jk} = \gamma\, y_j\, (d_k - y_k),$$

in which $d_k$ is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter.

Many variants (often very exotic ones) have been published the last few years. In the next chapters some of these update rules will be discussed.
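In code, the two rules differ only in the factor that multiplies the learning rate; the sketch below (an illustration of ours, with arbitrary values for $\gamma$ and the activations) makes the contrast explicit:

```python
def hebbian_update(w_jk, y_j, y_k, gamma=0.1):
    """Hebbian rule: strengthen w_jk when units j and k are active together."""
    return w_jk + gamma * y_j * y_k

def delta_update(w_jk, y_j, y_k, d_k, gamma=0.1):
    """Widrow-Hoff (delta) rule: move w_jk along the error d_k - y_k."""
    return w_jk + gamma * y_j * (d_k - y_k)

w = 0.5
w = hebbian_update(w, y_j=1.0, y_k=1.0)          # both active: weight grows
w = delta_update(w, y_j=1.0, y_k=0.8, d_k=1.0)   # small error: small correction
print(w)
```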

2.4 Notation and terminology

Throughout the years, researchers from different disciplines have come up with a vast number of terms applicable in the field of neural networks. Our computer-scientist point of view enables us to adhere to a subset of the terminology which is less biologically inspired, yet conflicts still arise. Our conventions are discussed below.

2.4.1 Notation

We use the following notation in our formulae. Note that not all symbols are meaningful for all networks, and that in some cases subscripts or superscripts may be left out where they are not necessary, or added where necessary. Vectors are indicated with a bold non-slanted font:

$j$, $k$, ...  the unit $j$, $k$, ...;
$i$  an input unit;
$h$  a hidden unit;
$o$  an output unit;
$x^p$  the $p$th input pattern vector;
$x^p_j$  the $j$th element of the $p$th input pattern vector;
$s^p$  the input to a set of neurons when input pattern vector $p$ is clamped (i.e., presented to the network); often: the input of the network by clamping input pattern vector $p$;
$d^p$  the desired output of the network when input pattern vector $p$ was input to the network;
$d^p_j$  the $j$th element of the desired output of the network when input pattern vector $p$ was input to the network;
$y^p$  the activation values of the network when input pattern vector $p$ was input to the network;
$y^p_j$  the activation value of element $j$ of the network when input pattern vector $p$ was input to the network;
$W$  the matrix of connection weights;
$w_j$  the weights of the connections which feed into unit $j$;
$w_{jk}$  the weight of the connection from unit $j$ to unit $k$;
$F_j$  the activation function associated with unit $j$;
$\gamma_{jk}$  the learning rate associated with weight $w_{jk}$;
$\theta$  the biases to the units;
$\theta_j$  the bias input to unit $j$;
$U_j$  the threshold of unit $j$ in $F_j$;
$E^p$  the error in the output of the network when input pattern vector $p$ is input;
$E$  the energy of the network.

2.4.2 Terminology

Output vs. activation of a unit. Since there is no need to do otherwise, we consider the output and the activation value of a unit to be one and the same thing. That is, the output of each neuron equals its activation value.

Bias, offset, threshold. These terms all refer to a constant (i.e., independent of the network input but adapted by the learning rule) term which is input to a unit. They may be used interchangeably, although the latter two terms are often envisaged as a property of the activation function. Furthermore, this external input is usually implemented (and can be written) as a weight from a unit with activation value 1.

Number of layers. In a feed-forward network, the inputs perform no computation and their layer is therefore not counted. Thus a network with one input layer, one hidden layer, and one output layer is referred to as a network with two layers. This convention is widely though not yet universally used.

Representation vs. learning. When using a neural network one has to distinguish two issues which influence the performance of the system. The first one is the representational power of the network, the second one is the learning algorithm.

The representational power of a neural network refers to the ability of a neural network to represent a desired function. Because a neural network is built from a set of standard functions, in most cases the network will only approximate the desired function, and even for an optimal set of weights the approximation error is not zero.

The second issue is the learning algorithm. Given that there exists a set of optimal weights in the network, is there a procedure to (iteratively) find this set of weights?

Part II

THEORY

3 Perceptron and Adaline

This chapter describes single-layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of the single-layer networks and their learning algorithms, and will give some examples of using the networks. In the second part we will discuss the representational limitations of single-layer networks.

Two 'classical' models will be described in the first part of the chapter: the Perceptron, proposed by Rosenblatt (Rosenblatt, 1959) in the late 50s, and the Adaline, presented in the early 60s by Widrow and Hoff (Widrow & Hoff, 1960).

3.1 Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons $o$, each of which is connected with a weighting factor $w_{io}$ to all of the inputs $i$. In the simplest case the network has only two inputs and a single output, as sketched in figure 3.1; we leave the output index $o$ out. The input of the neuron is the weighted sum of the inputs plus the bias term.

Figure 3.1: Single layer network with one output and two inputs.

The output of the network is formed by the activation of the output neuron, which is some function of the input:

$$y = F\Big(\sum_{i=1}^{2} w_i x_i + \theta\Big).$$

The activation function $F$ can be linear, so that we have a linear network, or nonlinear. In this section we consider the threshold (or Heaviside or sgn) function:

$$F(s) = \begin{cases} 1 & \text{if } s > 0, \\ -1 & \text{otherwise.} \end{cases}$$

The output of the network thus is either $+1$ or $-1$, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class $+1$; if the

total input is negative, the sample will be assigned to class $-1$. The separation between the two classes in this case is a straight line, given by the equation

$$w_1 x_1 + w_2 x_2 + \theta = 0.$$

Thus the single layer network represents a linear discriminant function.

A geometrical representation of the linear threshold neural network is given in figure 3.2. The equation above can be rewritten as

$$x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{\theta}{w_2},$$

and we see that the weights determine the slope of the line and the bias determines the 'offset', i.e., how far the line is from the origin. Note that also the weights can be plotted in the input space: the weight vector is always perpendicular to the discriminant function.

Figure 3.2: Geometric representation of the discriminant function and the weights.

Now that we have shown the representational power of the single layer network with linear threshold units, we come to the second issue: how do we learn the weights and biases in the network? We will describe two learning methods for these types of networks: the 'perceptron' learning rule and the 'delta' or 'LMS' rule. Both methods are iterative procedures that adjust the weights. A learning sample is presented to the network; for each weight, the new value is computed by adding a correction to the old value. The threshold is updated in the same way:

$$w_i(t+1) = w_i(t) + \Delta w_i(t),$$
$$\theta(t+1) = \theta(t) + \Delta\theta(t).$$

The learning problem can now be formulated as: how do we compute $\Delta w_i(t)$ and $\Delta\theta(t)$ in order to classify the learning patterns correctly?

3.2 Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector $x$ and a desired output $d(x)$. For a classification task the $d(x)$ is usually $+1$ or $-1$. The perceptron learning rule is very simple and can be stated as follows:

1. Start with random weights for the connections;
2. Select an input vector $x$ from the set of training samples;
3. If $y \neq d(x)$ (the perceptron gives an incorrect response), modify all connections $w_i$ according to $\Delta w_i = d(x)\, x_i$;
4. Go back to 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that, when the network responds correctly, no connection weights are modified. Besides modifying the weights, we must also modify the threshold $\theta$. This $\theta$ is considered as a connection $w_0$ between the output neuron and a 'dummy' predicate unit which is always on: $x_0 = 1$. Given the perceptron learning rule as stated above, this threshold is modified according to:

$$\Delta\theta = \begin{cases} 0 & \text{if the perceptron responds correctly;} \\ d(x) & \text{otherwise.} \end{cases}$$
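The four steps above translate directly into code. The sketch below (our illustration; the AND-like sample set is an arbitrary, linearly separable toy problem) treats the threshold as a weight $w_0$ from the dummy unit $x_0 = 1$, exactly as described:

```python
import random

def perceptron_train(samples, epochs=100):
    """samples: list of (x, d) with x a tuple of inputs and d in {-1, +1}."""
    n = len(samples[0][0])
    w = [random.uniform(-1, 1) for _ in range(n + 1)]  # w[0] is the threshold
    for _ in range(epochs):
        for x, d in samples:
            xs = (1.0,) + x                      # dummy unit x_0 = 1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if y != d:                           # only learn on mistakes
                w = [wi + d * xi for wi, xi in zip(w, xs)]
    return w

# logical AND on {-1,+1} inputs: a linearly separable problem
samples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
print(perceptron_train(samples))
```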

3.2.1 Example of the Perceptron learning rule

A perceptron is initialized with the following weights: $w_1 = 1$, $w_2 = 2$, $\theta = -2$. The perceptron learning rule is used to learn a correct discriminant function for a number of samples, sketched in figure 3.3. The first sample A, with values $x = (0.5, 1.5)$ and target value $d(x) = +1$, is presented to the network. From the output equation above it can be calculated that the network output is $+1$, so no weights are adjusted. The same is the case for point B, with values $x = (-0.5, 0.5)$ and target value $d(x) = -1$; the network output is negative, so no change. When presenting point C, with values $x = (0.5, 0.5)$, the network output will be $-1$, while the target value $d(x) = +1$. According to the perceptron learning rule, the weight changes are $\Delta w_1 = 0.5$, $\Delta w_2 = 0.5$, $\Delta\theta = 1$. The new weights are now $w_1 = 1.5$, $w_2 = 2.5$, $\theta = -1$, and sample C is classified correctly.

In figure 3.3 the discriminant function before and after this weight update is shown.

Figure 3.3: Discriminant function before and after weight update (samples A, B, and C; the original discriminant function and the function after the weight update).

3.2.2 Convergence theorem

For the perceptron learning rule there exists a convergence theorem, which states the following:

Theorem 1. If there exists a set of connection weights $w^*$ which is able to perform the transformation $y = d(x)$, the perceptron learning rule will converge to some solution (which may or may not be the same as $w^*$) in a finite number of steps for any initial choice of the weights.

Proof. Given the fact that the length of the vector $w^*$ does not play a role (because of the sgn operation), we take $\|w^*\| = 1$. Because $w^*$ is a correct solution, the value $|w^* \cdot x|$, where $\cdot$ denotes dot or inner product, will be greater than 0, or: there exists a $\delta > 0$ such that $|w^* \cdot x| > \delta$ for all inputs $x$. (Technically this need not be true for any $w^*$; $w^* \cdot x$ could in fact be equal to 0 for a $w^*$ which yields no misclassifications (look at the definition of $F$). However, another $w^*$ can be found for which the quantity will never be 0. Thanks to: Terry Regier, Computer Science, UC Berkeley.) Now define $\cos\alpha \equiv w \cdot w^* / \|w\|$. When according to the perceptron learning

rule connection weights are modified at a given input $x$, we know that $\Delta w = d(x)\, x$, and the weight after modification is $w' = w + \Delta w$. From this it follows that:

$$w' \cdot w^* = w \cdot w^* + d(x)\, w^* \cdot x = w \cdot w^* + \mathrm{sgn}\big(w^* \cdot x\big)\, w^* \cdot x > w \cdot w^* + \delta,$$

$$\|w'\|^2 = \|w + d(x)\, x\|^2 = w^2 + 2\, d(x)\, w \cdot x + x^2 < w^2 + x^2 \quad (\text{because } d(x) = -\,\mathrm{sgn}[w \cdot x]\,!) \;<\; w^2 + M,$$

where $M$ is an upper bound on $x^2$. After $t$ modifications we have:

$$w(t) \cdot w^* > w \cdot w^* + t\,\delta,$$
$$\|w(t)\|^2 < w^2 + t\,M,$$

such that

$$\cos\alpha(t) = \frac{w^* \cdot w(t)}{\|w(t)\|} > \frac{w^* \cdot w + t\,\delta}{\sqrt{w^2 + t\,M}}.$$

From this it follows that

$$\lim_{t\to\infty} \cos\alpha(t) = \lim_{t\to\infty} \frac{\delta}{\sqrt{M}}\,\sqrt{t} = \infty,$$

while by definition $\cos\alpha \le 1$! The conclusion is that there must be an upper limit $t_{\max}$ for $t$. The system modifies its connections only a limited number of times. In other words: after maximally $t_{\max}$ modifications of the weights the perceptron is correctly performing the mapping. $t_{\max}$ will be reached when $\cos\alpha = 1$. If we start with connections $w = 0$,

$$t_{\max} = \frac{M}{\delta^2}.$$

3.2.3 The original Perceptron

The Perceptron, proposed by Rosenblatt (Rosenblatt, 1959), is somewhat more complex than a single layer network with threshold activation functions. In its simplest form it consists of an $N$-element input layer ('retina') which feeds into a layer of $M$ 'association,' 'mask,' or 'predicate' units $\varphi_h$, and a single output unit $\psi$. The goal of the operation of the perceptron is to learn a given transformation $d : \{-1, 1\}^N \to \{-1, 1\}$ using learning samples with input $x$ and corresponding output $y = d(x)$. In the original definition, the activity of the predicate units can be any function $\varphi_h$ of the input layer $x$, but the learning procedure only adjusts the connections to the output unit. The reason for this is that no recipe had been found to adjust the connections between $x$ and $\varphi_h$. Depending on the functions $\varphi_h$, perceptrons can be grouped into different families. In (Minsky & Papert, 1969) a number of these families are described, and properties of these families have been analysed. The output unit of a perceptron is a linear threshold element. Rosenblatt (1959) proved the remarkable convergence theorem about perceptron learning, and in the early 60s perceptrons created a great deal of interest and optimism. The initial euphoria was replaced by disillusion after the publication of Minsky and Papert's Perceptrons in 1969 (Minsky & Papert, 1969). In this book they analysed the perceptron thoroughly and proved that there are severe restrictions on what perceptrons can represent.

Figure 3.4: The Perceptron.

3.3 The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and Hoff as the 'least mean square' (LMS) learning procedure, also known as the delta rule. The main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either $-1$ or $+1$) for learning. The delta rule uses the net output without further mapping into output values $-1$ or $+1$.

The learning rule was applied to the 'adaptive linear element,' also named Adaline, developed by Widrow and Hoff (Widrow & Hoff, 1960). (ADALINE first stood for ADAptive LInear NEuron, but when artificial neurons became less and less popular this acronym was changed to ADAptive LINear Element.) In a simple physical implementation (figure 3.5) this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals. Usually the central block, the summer, is also followed by a quantiser which outputs either $+1$ or $-1$, depending on the polarity of the sum.

Figure 3.5: The Adaline (input pattern switches, gains, summer, quantizer, and reference switch for the error).

Although the adaptive process is here exemplified in a case when there is only one output, it may be clear that a system with many parallel outputs is directly implementable by multiple units of the above kind.

If the input conductances are denoted by $w_i$, $i = 0, 1, \ldots, n$, and the input and output signals

by $x_i$ and $y$, respectively, then the output of the central block is defined to be

$$y = \sum_{i=1}^{n} w_i x_i + \theta,$$

where $\theta = w_0$. The purpose of this device is to yield a given value $y = d^p$ at its output when the set of values $x_i^p$, $i = 1, 2, \ldots, n$, is applied at the inputs. The problem is to determine the coefficients $w_i$, $i = 0, 1, \ldots, n$, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance in the sense of least squares. An adaptive operation means that there exists a mechanism by which the $w_i$ can be adjusted, usually iteratively, to attain the correct values. For the Adaline, Widrow introduced the delta rule to adjust the weights. This rule will be discussed in section 3.4.

3.4 Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function, the output is simply given by

$$y = \sum_j w_j x_j + \theta.$$

Such a simple network is able to represent a linear relationship between the value of the output unit and the value of the input units. By thresholding the output value, a classifier can be constructed (such as Widrow's Adaline), but here we focus on the linear relationship and use the network for a function approximation task. In high dimensional input spaces the network represents a (hyper)plane, and it will be clear that also multiple output units may be defined.

Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of training samples consisting of input values $x^p$ and desired (or target) output values $d^p$. For every given input sample, the output of the network differs from the target value $d^p$ by $(d^p - y^p)$, where $y^p$ is the actual output for this pattern. The delta rule now uses a cost- or error-function based on these differences to adjust the weights.

The error function, as indicated by the name least mean square, is the summed squared error. That is, the total error $E$ is defined to be

$$E = \sum_p E^p = \frac{1}{2} \sum_p \big(d^p - y^p\big)^2,$$

where the index $p$ ranges over the set of input patterns and $E^p$ represents the error on pattern $p$. The LMS procedure finds the values of all the weights that minimise the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

$$\Delta_p w_j = -\gamma\, \frac{\partial E^p}{\partial w_j},$$

where $\gamma$ is a constant of proportionality. The derivative is

$$\frac{\partial E^p}{\partial w_j} = \frac{\partial E^p}{\partial y^p}\, \frac{\partial y^p}{\partial w_j}.$$

Because of the linear units,

$$\frac{\partial y^p}{\partial w_j} = x_j$$

and

$$\frac{\partial E^p}{\partial y^p} = -\big(d^p - y^p\big),$$

such that

$$\Delta_p w_j = \gamma\, \delta^p\, x_j,$$

where $\delta^p = d^p - y^p$ is the difference between the target output and the actual output for pattern $p$.

The delta rule modifies weights appropriately for target and actual outputs of either polarity and for both continuous and binary input and output units. These characteristics have opened up a wealth of new applications.
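A minimal sketch of the LMS procedure for a linear unit (our illustration, with arbitrary synthetic training data): each pattern presentation changes the weights by $\gamma\,\delta^p\,x_j$, with the bias treated as a weight from a unit whose value is always 1.

```python
import numpy as np

def lms_train(X, d, gamma=0.05, epochs=200):
    """Gradient descent on the summed squared error of a linear unit."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for x_p, d_p in zip(X, d):
            y_p = np.dot(w, x_p) + theta
            delta = d_p - y_p           # error for this pattern
            w += gamma * delta * x_p    # Delta_p w_j = gamma * delta * x_j
            theta += gamma * delta      # bias: weight from an always-on unit
    return w, theta

# fit the plane d = 2*x1 - x2 + 0.5 from random samples
X = np.random.uniform(-1, 1, size=(50, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.5
print(lms_train(X, d))
```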

3.5 Exclusive-OR problem

In the previous sections we have discussed two learning algorithms for single layer networks, but we have not discussed the limitations on the representation of these networks.

 x1   x2 |  d
---------+----
 -1   -1 | -1
 -1    1 |  1
  1   -1 |  1
  1    1 | -1

Table 3.1: Exclusive-or truth table.

One of Minsky and Papert's most discouraging results shows that a single layer perceptron cannot represent a simple exclusive-or function. Table 3.1 shows the desired relationships between inputs and output units for this function.

In a simple network with two inputs and one output, as depicted in figure 3.1, the net input is equal to

$$s = w_1 x_1 + w_2 x_2 + \theta.$$

According to the threshold function above, the output of the perceptron is $-1$ when $s$ is negative and $+1$ when $s$ is positive. In figure 3.6 a geometrical representation of the input domain is given. For a constant $\theta$, the output of the perceptron is equal to $+1$ on one side of the dividing line, which is defined by

$$w_1 x_1 + w_2 x_2 = -\theta,$$

and equal to $-1$ on the other side of this line.

Figure 3.6: Geometric representation of input space (left to right: AND, OR, XOR).

To see that such a solution cannot be found, take a look at figure 3.6. The input space consists of four points, and the two solid circles at $(1, -1)$ and $(-1, 1)$ cannot be separated by a straight line from the two open circles at $(-1, -1)$ and $(1, 1)$. The obvious question to ask is: how can this problem be overcome? Minsky and Papert prove in their book that for binary inputs, any transformation can be carried out by adding a layer of predicates which are connected to all inputs. The proof is given in the next section.

For the specific XOR problem, we geometrically show that by introducing hidden units, thereby extending the network to a multi-layer perceptron, the problem can be solved. Figure 3.7a demonstrates that the four input points are now embedded in a three-dimensional space defined by the two inputs plus the single hidden unit. These four points are now easily separated by a linear manifold (plane) into two groups, as desired. This simple example demonstrates that adding hidden units increases the class of problems that are soluble by feed-forward, perceptron-like networks. However, by this generalisation of the basic architecture we have also incurred a serious loss: we no longer have a learning rule to determine the optimal weights!

Figure 3.7: Solution of the XOR problem. a) The perceptron of figure 3.1 with an extra hidden unit. With the indicated values of the weights $w_{ij}$ (next to the connecting lines) and the thresholds $\theta_i$ (in the circles) this perceptron solves the XOR problem. b) This is accomplished by mapping the four points of figure 3.6 onto the four points indicated here; clearly, separation (by a linear manifold) into the required groups is now possible.
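Reading the weights off figure 3.7a (weights 1 and 1 from the inputs to both the hidden and the output unit, $-2$ from hidden to output, and thresholds $-0.5$ on both units; this is our reading of the damaged figure, so treat the values as illustrative), a few lines of Python suffice to check that the extended perceptron indeed computes XOR:

```python
def sgn(s):
    return 1 if s > 0 else -1

def xor_perceptron(x1, x2):
    """Perceptron of fig. 3.7a: one hidden unit lifts the inputs into 3-D."""
    h = sgn(x1 + x2 - 0.5)              # hidden unit, threshold -0.5
    return sgn(x1 + x2 - 2 * h - 0.5)   # output unit, threshold -0.5

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print((x1, x2), '->', xor_perceptron(x1, x2))
# prints -1 for (-1,-1) and (1,1), +1 for the mixed inputs: XOR
```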

3.6 Multi-layer perceptrons can do everything

In the previous section we showed that by adding an extra hidden unit, the XOR problem can be solved. For binary units, one can prove that this architecture is able to perform any transformation, given the correct connections and weights. The most primitive proof is the following. For a given transformation $y = d(x)$, we can divide the set of all possible input vectors into two classes:

$$X^{+} = \{\, x \mid d(x) = 1 \,\} \qquad \text{and} \qquad X^{-} = \{\, x \mid d(x) = -1 \,\}.$$

Since there are $N$ input units, the total number of possible input vectors $x$ is $2^N$. For every $x^p \in X^{+}$ a hidden unit $h$ can be reserved of which the activation $y_h$ is 1 if and only if the specific pattern $p$ is present at the input: we can choose its weights $w_{ih}$ equal to the specific pattern $x^p$ and the bias $\theta_h$ equal to $1 - N$, such that

$$y_h^p = \mathrm{sgn}\Big(\sum_i w_{ih}\, x_i^p + 1 - N\Big)$$

p

is equal to for x w only Similarly the w eigh to the output neuron can b e c hosen suc h

h

o is o M predicate neurons is o ne

M

X

p

y gn y M

o h

h

p erceptron will e y only if x X it p e rforms desired mapping The

o

problem s i the large n um b er o f p redicate units whic h is e qual to the n um b r e o f patterns in X

N

whic h is maximally Of course w e can do the same tric k X and w e will alw a ys tak e

N

minimal n um ber mask units h maximally A elegan t of is giv en

in insky P the poin t is that for complex transformations the n um ber of

required units n i he t hidden la y er is exp onen tial in N

3.7 Conclusions

In this chapter we presented single layer feed-forward networks for classification tasks and for function approximation tasks. The representational power of single layer feed-forward networks was discussed, and two learning algorithms for finding the optimal weights were presented. The simple networks presented here have their advantages and disadvantages. The disadvantage is the limited representational power: only linear classifiers can be constructed, or, in case of function approximation, only linear functions can be represented. The advantage, however, is that, because of the linearity of the system, the training algorithm will converge to the optimal solution. This is not the case anymore for nonlinear systems such as multiple layer networks, as we will see in the next chapter.

4 Back-propagation

As we have seen in the previous chapter, a single-layer network has severe restrictions: the class of tasks that can be accomplished is very limited. In this chapter we will focus on feed-forward networks with layers of processing units.

Minsky and Papert (Minsky & Papert, 1969) showed in 1969 that a two layer feed-forward network can overcome many restrictions, but did not present a solution to the problem of how to adjust the weights from input to hidden units. An answer to this question was presented by Rumelhart, Hinton, and Williams in 1986 (Rumelhart, Hinton, & Williams, 1986), and similar solutions appeared to have been published earlier (Werbos, 1974; Parker, 1985; Le Cun, 1985).

The central idea behind this solution is that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. For this reason the method is often called the back-propagation learning rule. Back-propagation can also be considered as a generalisation of the delta rule for non-linear activation functions and multi-layer networks.

4.1 Multi-layer feed-forward networks

A feed-forward network has a layered structure. Each layer consists of units which receive their input from units from a layer directly below and send their output to units in a layer directly above the unit. There are no connections within a layer. The $N_i$ inputs are fed into the first layer of hidden units. The input units are merely 'fan-out' units; no processing takes place in these units. The activation of a hidden unit is a function $F$ of the weighted inputs plus a bias, as given by the activation rule in chapter 2. The output of the hidden units is distributed over the next layer of hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of $N_o$ output units (see figure 4.1).

Although back-propagation can be applied to networks with any number of layers, just as for networks with binary units (section 3.6) it has been shown (Hornik, Stinchcombe, & White, 1989; Funahashi, 1989; Cybenko, 1989; Hartman, Keeler, & Kowalski, 1990) that only one layer of hidden units suffices to approximate any function with finitely many discontinuities to arbitrary precision, provided the activation functions of the hidden units are non-linear (the universal approximation theorem). In most applications a feed-forward network with a single layer of hidden units is used, with a sigmoid activation function for the units.

4.2 The generalised delta rule

Since we are now using units with nonlinear activation functions, we have to generalise the delta rule, which was presented in chapter 3 for linear functions, to the set of non-linear activation functions. (Of course, when linear activation functions are used, a multi-layer network is not more powerful than a single-layer network.)

Figure 4.1: A multi-layer network with $l$ layers of units.

The activation is a differentiable function of the total input, given by

$$y_k^p = F(s_k^p),$$

in which

$$s_k^p = \sum_j w_{jk}\, y_j^p + \theta_k.$$

To get the correct generalisation of the delta rule as presented in the previous chapter, we must set

$$\Delta_p w_{jk} = -\gamma\, \frac{\partial E^p}{\partial w_{jk}}.$$

The error measure $E^p$ is defined as the total quadratic error for pattern $p$ at the output units:

$$E^p = \frac{1}{2} \sum_{o=1}^{N_o} \big(d_o^p - y_o^p\big)^2,$$

where $d_o^p$ is the desired output for unit $o$ when pattern $p$ is clamped. We further set $E = \sum_p E^p$ as the summed squared error. We can write

$$\frac{\partial E^p}{\partial w_{jk}} = \frac{\partial E^p}{\partial s_k^p}\, \frac{\partial s_k^p}{\partial w_{jk}}.$$

From the expression for $s_k^p$ we see that the second factor is

$$\frac{\partial s_k^p}{\partial w_{jk}} = y_j^p.$$

When we define

$$\delta_k^p = -\frac{\partial E^p}{\partial s_k^p},$$

we will get an update rule which is equivalent to the delta rule as described in the previous chapter, resulting in a gradient descent on the error surface if we make the weight changes according to:

$$\Delta_p w_{jk} = \gamma\, \delta_k^p\, y_j^p.$$

The trick is to figure out what $\delta_k^p$ should be for each unit $k$ in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these $\delta$'s which can be implemented by propagating error signals backward through the network.

To compute $\delta_k^p$ we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit, and one reflecting the change in the output as a function of changes in the input. Thus, we have

$$\delta_k^p = -\frac{\partial E^p}{\partial s_k^p} = -\frac{\partial E^p}{\partial y_k^p}\, \frac{\partial y_k^p}{\partial s_k^p}.$$

Let us compute the second factor. By the activation rule $y_k^p = F(s_k^p)$ we see that

$$\frac{\partial y_k^p}{\partial s_k^p} = F'(s_k^p),$$

which is simply the derivative of the squashing function $F$ for the $k$th unit, evaluated at the net input $s_k^p$ to that unit. To compute the first factor, we consider two cases. First, assume that unit $k$ is an output unit $k = o$ of the network. In this case, it follows from the definition of $E^p$ that

$$\frac{\partial E^p}{\partial y_o^p} = -\big(d_o^p - y_o^p\big),$$

which is the same result as we obtained with the standard delta rule. Substituting this and the expression for $F'$ in the equation for $\delta_k^p$, we get

$$\delta_o^p = \big(d_o^p - y_o^p\big)\, F'(s_o^p)$$

for any output unit $o$. Secondly, if $k$ is not an output unit but a hidden unit $k = h$, we do not readily know the contribution of the unit to the output error of the network. However, the error measure can be written as a function of the net inputs from hidden to output layer, $E^p = E^p(s_1^p, s_2^p, \ldots, s_o^p, \ldots)$, and we use the chain rule to write

$$\frac{\partial E^p}{\partial y_h^p} = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, \frac{\partial s_o^p}{\partial y_h^p} = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, \frac{\partial}{\partial y_h^p} \sum_{j=1}^{N_h} w_{jo}\, y_j^p = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, w_{ho} = -\sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$$

Substituting this in the equation for $\delta_k^p$ yields

$$\delta_h^p = F'(s_h^p) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$$

The two expressions for $\delta_o^p$ and $\delta_h^p$ give a recursive procedure for computing the $\delta$'s for all units in the network, which are then used to compute the weight changes. This procedure constitutes the generalised delta rule for a feed-forward network of non-linear units.
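To make the recursion concrete, here is a compact sketch (ours, not from the original text) of the generalised delta rule for a network with one sigmoid hidden layer and sigmoid output units, using the sigmoid derivative $F'(s) = y\,(1 - y)$ derived in the next section; network sizes, learning rate, and the XOR training set are arbitrary example choices.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, d, W1, b1, W2, b2, gamma=0.5):
    """One forward/backward pass of the generalised delta rule."""
    # forward pass
    y_h = sigmoid(W1 @ x + b1)                 # hidden activations
    y_o = sigmoid(W2 @ y_h + b2)               # output activations
    # delta for output units: (d - y) * F'(s), with F'(s) = y (1 - y)
    delta_o = (d - y_o) * y_o * (1.0 - y_o)
    # delta for hidden units: F'(s_h) * sum_o delta_o * w_ho
    delta_h = y_h * (1.0 - y_h) * (W2.T @ delta_o)
    # weight changes: Delta w_jk = gamma * delta_k * y_j (in-place updates)
    W2 += gamma * np.outer(delta_o, y_h); b2 += gamma * delta_o
    W1 += gamma * np.outer(delta_h, x);   b1 += gamma * delta_h
    return 0.5 * np.sum((d - y_o) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (4, 2)), np.zeros(4)   # 2 inputs, 4 hidden units
W2, b2 = rng.normal(0, 0.5, (1, 4)), np.zeros(1)   # 1 output unit
for epoch in range(5000):                          # learn XOR on {0,1} inputs
    for x, d in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
        e = backprop_step(np.array(x, float), np.array([d], float),
                          W1, b1, W2, b2)
print('final pattern error:', e)
```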

4.2.1 Understanding back-propagation

The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations?

The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear. What happens in the above equations is the following. When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units. Let's call this error $e_o$ for a particular output unit $o$. We have to bring $e_o$ to zero.

The simplest method to do this is the greedy method: we strive to change the connections in the neural network in such a way that, next time around, the error $e_o$ will be zero for this particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt its incoming weights according to

$$\Delta w_{ho} = (d_o - y_o)\, y_h.$$

That is step one. But it alone is not enough: when we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem. In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a value for $\delta$ for the hidden units. This is solved by the chain rule, which does the following: distribute the error of an output unit $o$ to all the hidden units that it is connected to, weighted by this connection. Differently put, a hidden unit $h$ receives a delta from each output unit $o$ equal to the delta of that output unit weighted with (i.e., multiplied by) the weight of the connection between those units. In symbols: $\delta_h = \sum_o \delta_o\, w_{ho}$. Well, not exactly: we forgot the activation function of the hidden unit; $F'$ has to be applied to the delta before the back-propagation process can continue.

4.3 Working with back-propagation

The application of the generalised delta rule thus involves two phases. During the first phase the input $x$ is presented and propagated forward through the network to compute the output values $y_o^p$ for each output unit. This output is compared with its desired value $d_o$, resulting in an error signal $\delta_o^p$ for each output unit. The second phase involves a backward pass through the network, during which the error signal is passed to each unit in the network and appropriate weight changes are calculated.

Weight adjustments with sigmoid activation function. The results from the previous section can be summarised in three equations:

- The weight of a connection is adjusted by an amount proportional to the product of an error signal $\delta$, on the unit $k$ receiving the input, and the output of the unit $j$ sending this signal along the connection:

$$\Delta_p w_{jk} = \gamma\, \delta_k^p\, y_j^p.$$

- If the unit is an output unit, the error signal is given by

$$\delta_o^p = \big(d_o^p - y_o^p\big)\, F'(s_o^p).$$

Take as the activation function $F$ the 'sigmoid' function as defined in chapter 2:

$$y^p = F(s^p) = \frac{1}{1 + e^{-s^p}}.$$

In this case the derivative is equal to

$$F'(s^p) = \frac{\partial}{\partial s^p}\, \frac{1}{1 + e^{-s^p}} = \frac{1}{(1 + e^{-s^p})^2}\, e^{-s^p} = \frac{1}{1 + e^{-s^p}} \cdot \frac{e^{-s^p}}{1 + e^{-s^p}} = y^p\,\big(1 - y^p\big),$$

such that the error signal for an output unit can be written as:

$$\delta_o^p = \big(d_o^p - y_o^p\big)\, y_o^p\, \big(1 - y_o^p\big).$$

- The error signal for a hidden unit is determined recursively in terms of error signals of the units to which it directly connects and the weights of those connections. For the sigmoid activation function:

$$\delta_h^p = F'(s_h^p) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho} = y_h^p\,\big(1 - y_h^p\big) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$$

Learning rate and momentum. The learning procedure requires that the change in weight is proportional to $\partial E^p / \partial w$. True gradient descent requires that infinitesimal steps are taken. The constant of proportionality is the learning rate $\gamma$. For practical purposes we choose a learning rate that is as large as possible without leading to oscillation. One way to avoid oscillation at large $\gamma$ is to make the change in weight dependent of the past weight change, by adding a momentum term:

$$\Delta w_{jk}(t+1) = \gamma\, \delta_k^p\, y_j^p + \alpha\, \Delta w_{jk}(t),$$

where $t$ indexes the presentation number and $\alpha$ is a constant which determines the effect of the previous weight change.

The role of the momentum term is shown in figure 4.2. When no momentum term is used, it takes a long time before the minimum has been reached with a low learning rate, whereas for high learning rates the minimum is never reached because of the oscillations. When adding the momentum term, the minimum will be reached faster.

Figure 4.2: The descent in weight space: a) for small learning rate; b) for large learning rate (note the oscillations); c) with large learning rate and momentum term added.
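A sketch of the momentum update (our illustration; the values of $\gamma$, $\alpha$, and the repeated gradient direction are arbitrary): the weight change keeps a fraction $\alpha$ of the previous change, so consistent gradients accumulate speed.

```python
def momentum_update(w, dw_prev, delta_k, y_j, gamma=0.1, alpha=0.9):
    """Delta w(t+1) = gamma * delta_k * y_j + alpha * Delta w(t)."""
    dw = gamma * delta_k * y_j + alpha * dw_prev
    return w + dw, dw

w, dw = 0.0, 0.0
for _ in range(5):                 # repeated, consistent gradient direction
    w, dw = momentum_update(w, dw, delta_k=0.2, y_j=1.0)
print(w, dw)                       # steps grow as momentum accumulates
```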

Learning per pattern. Although, theoretically, the back-propagation algorithm performs gradient descent on the total error only if the weights are adjusted after the full set of learning patterns has been presented, more often than not the learning rule is applied to each pattern separately, i.e., a pattern $p$ is applied, $E^p$ is calculated, and the weights are adapted ($p = 1, 2, \ldots, P$). There exists empirical indication that this results in faster convergence. Care has to be taken, however, with the order in which the patterns are taught. For example, when using the same sequence over and over again the network may become focused on the first few patterns. This problem can be overcome by using a permuted training method, as sketched below.
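A minimal sketch of such per-pattern training with a freshly permuted presentation order in each epoch, reusing the backprop_step fragment above (the epoch count and random seed are arbitrary):

```python
rng = np.random.default_rng(0)

def train(xs, ds, W_ih, W_ho, epochs=100):
    # Apply the learning rule after every pattern, permuting the order
    # each epoch so the network is not focused on the first few patterns.
    for _ in range(epochs):
        for p in rng.permutation(len(xs)):
            backprop_step(xs[p], ds[p], W_ih, W_ho)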

An example

A feed-forward network can be used to approximate a function from examples. Suppose we have a system (for example a chemical process or a financial market) of which we want to know the

the whic tak

has Care this

eigh

ted

eigh

bac

rate rning rning small

reac een has um

When sho is term tum ole The

hic

tum

eigh past eigh the

to

ose or of

ho

CHAPTER BA CKR OP A

characteristics. The input of the system is given by the two-dimensional vector $x$ and the output is given by the one-dimensional vector $d$. We want to estimate the relationship $d = f(x)$ from examples $\{x^p, d^p\}$, as depicted in the figure below (top left). A feed-forward network was

[Figure: Example of function approximation with a feed-forward network. Top left: the original learning samples. Top right: the approximation by the network. Bottom left: the function which generated the learning samples. Bottom right: the error in the approximation.]

programmed with two inputs, a number of hidden units with a sigmoid activation function, and an output unit with a linear activation function. (Check for yourself how the equations of the previous section should be adapted for the linear instead of sigmoid activation function.) The network weights are initialized to small values and the network is trained for a number of learning iterations with the back-propagation training rule described in the previous section. The relationship between $x$ and $d$ as represented by the network is shown in the figure (top right), while the function which generated the learning samples is given in the figure (bottom left). The approximation error is depicted in the figure (bottom right). We see that the error is higher at the edges of the region within which the learning samples were generated: the network is considerably better at interpolation than extrapolation.
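The adaptation asked for above is small: for a linear output unit $F(s) = s$, so $F'(s) = 1$ and the output error signal loses its $y(1-y)$ factor. A sketch of the changed step, again with our own naming and continuing the earlier fragments:

```python
def backprop_step_linear_output(x, d, W_ih, W_ho, gamma=0.1):
    y_h = sigmoid(W_ih @ x)
    y_o = W_ho @ y_h                                   # linear output unit
    delta_o = d - y_o                                  # F'(s_o) = 1
    delta_h = y_h * (1.0 - y_h) * (W_ho.T @ delta_o)   # hidden units unchanged
    W_ho += gamma * np.outer(delta_o, y_h)
    W_ih += gamma * np.outer(delta_h, x)
    return y_o
```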

Other activation functions

Although sigmoid functions are quite often used as activation functions, other functions can be used as well. In some cases this leads to a formula which is known from traditional function approximation theories.

For example, from Fourier analysis it is known that any periodic function can be written as an infinite sum of sine and cosine terms (a Fourier series):

$$f(x) = \sum_{n=0}^{\infty} \big( a_n \cos nx + b_n \sin nx \big).$$


We can rewrite this as a summation of sine terms:

$$f(x) = a_0 + \sum_{n=1}^{\infty} c_n \sin(nx + \theta_n),$$

with $c_n = \sqrt{a_n^2 + b_n^2}$ and $\theta_n = \arctan(b_n / a_n)$. This can be seen as a feed-forward network with a single input unit for $x$, a single output unit for $f(x)$, and hidden units with an activation function $F = \sin(s)$. The factor $a_0$ corresponds with the bias of the output unit, the factors $c_n$ correspond with the weights from hidden to output unit, the phase factor $\theta_n$ corresponds with the bias term of the hidden units, and the factor $n$ corresponds with the weights between the input and hidden layer.

The basic difference between the Fourier approach and the back-propagation approach is that in the Fourier approach the 'weights' between the input and the hidden units (these factors $n$) are fixed integer numbers which are analytically determined, whereas in the back-propagation approach these weights can take any value and are typically learned using a learning heuristic.

To illustrate the use of other activation functions, we have trained a feed-forward network with one output unit, four hidden units, and one input with ten patterns drawn from the function $f(x) = \sin(2x)\sin(x)$. The result is depicted in the figure below. The same function (albeit with other learning points) was learned with a network with eight sigmoid hidden units (see the next figure). From the figures it is clear that it pays off to use as much knowledge of the problem at hand as possible.
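A minimal sketch of such a sine-activation network, with our own parameterisation: hidden unit $n$ computes $\sin(w_n x + \theta_n)$, and training again follows the generalised delta rule, now with $F'(s) = \cos(s)$.

```python
# w, theta, c are arrays of one entry per hidden unit; x, d, a0 are scalars.
def sine_net_step(x, d, w, theta, c, a0, gamma=0.05):
    s = w * x + theta                    # net inputs of the hidden units
    y = a0 + c @ np.sin(s)               # linear output unit with bias a0
    delta_o = d - y                      # output error signal (F' = 1)
    delta_h = np.cos(s) * c * delta_o    # hidden error signals (F' = cos)
    c     += gamma * delta_o * np.sin(s) # hidden-to-output weights
    theta += gamma * delta_h             # hidden biases (phases)
    w     += gamma * delta_h * x         # input-to-hidden weights
    return y, a0 + gamma * delta_o       # caller reassigns the scalar bias
```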

[Figure: The periodic function $f(x) = \sin(2x)\sin(x)$ approximated with sine activation functions. (Adapted from Dastani.)]

Deficiencies of back-propagation

Despite the apparent success of the back-propagation learning algorithm, there are some aspects which make the algorithm not guaranteed to be universally useful. Most troublesome is the long training process. This can be a result of a non-optimum learning rate and momentum. A lot of advanced algorithms based on back-propagation learning have some optimised method to adapt this learning rate, as will be discussed in the next section. Outright training failures generally arise from two sources: network paralysis and local minima.

Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden unit or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one. As is clear from the equations above, the weight adjustments, which are proportional to $y_k^p (1 - y_k^p)$, will then be close to zero, and the training process can come to a virtual standstill.

[Figure: The periodic function $f(x) = \sin(2x)\sin(x)$ approximated with sigmoid activation functions. (Adapted from Dastani.)]

Local minima. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby. Probabilistic methods can help to avoid this trap, but they tend to be slow. Another suggested possibility is to increase the number of hidden units. Although this will work because of the higher dimensionality of the error space, and the chance to get trapped is smaller, it appears that there is some upper limit of the number of hidden units which, when exceeded, again results in the system being trapped in local minima.

Advanced algorithms

Many researchers have devised improvements of and extensions to the basic back-propagation algorithm described above. It is too early for a full evaluation: some of these techniques may prove to be fundamental, others may simply fade away. A few methods are discussed in this section.

Maybe the most obvious improvement is to replace the rather primitive steepest descent method with a direction set minimisation method, e.g., conjugate gradient minimisation. Note that minimisation along a direction $u$ brings the function $f$ to a place where its gradient is perpendicular to $u$ (otherwise minimisation along $u$ is not complete). Instead of following the gradient at every step, a set of $n$ directions is constructed which are all conjugate to each other, such that minimisation along one of these directions $u_j$ does not spoil the minimisation along one of the earlier directions $u_i$, i.e., the directions are non-interfering. Thus one minimisation in the direction of $u_i$ suffices, such that $n$ minimisations in a system with $n$ degrees of freedom bring this system to a minimum (provided the system is quadratic). This is different from gradient descent, which directly minimises in the direction of the steepest descent (Press, Flannery, Teukolsky, & Vetterling).

Suppose the function to be minimised is approximated by its Taylor series:

$$f(x) = f(p) + \sum_i \left.\frac{\partial f}{\partial x_i}\right|_p x_i + \frac{1}{2} \sum_{i,j} \left.\frac{\partial^2 f}{\partial x_i \partial x_j}\right|_p x_i x_j + \cdots \approx c - b^T x + \tfrac{1}{2}\, x^T A x,$$


where $T$ denotes transpose, and

$$c \equiv f(p), \qquad b \equiv -\nabla f\big|_p, \qquad [A]_{ij} \equiv \left.\frac{\partial^2 f}{\partial x_i \partial x_j}\right|_p.$$

$A$ is a symmetric positive definite $n \times n$ matrix: the Hessian of $f$ at $p$. The gradient of $f$ is

$$\nabla f = A x - b,$$

such that a change of $x$ results in a change of the gradient as

$$\delta(\nabla f) = A\, (\delta x).$$

Now suppose $f$ was minimised along a direction $u_i$ to a point where the gradient $g_{i+1}$ of $f$ is perpendicular to $u_i$, i.e.,

$$u_i^T g_{i+1} = 0,$$

and a new direction $u_{i+1}$ is sought. In order to make sure that moving along $u_{i+1}$ does not spoil the minimisation along $u_i$, we require that the gradient of $f$ remain perpendicular to $u_i$, i.e.,

$$u_i^T g_{i+2} = 0,$$

otherwise we would once more have to minimise in a direction which has a component of $u_i$. Combining these two conditions, we get

$$0 = u_i^T (g_{i+1} - g_{i+2}) = u_i^T\, \delta(\nabla f) = u_i^T A\, u_{i+1}.$$

When this holds for two vectors $u_i$ and $u_{i+1}$, they are said to be conjugate.

Now, starting at some point $p_0$, the first minimisation direction $u_0$ is taken equal to $g_0 = -\nabla f(p_0)$, resulting in a new point $p_1$. For $i \geq 0$, calculate the directions

$$u_{i+1} = g_{i+1} + \gamma_i u_i,$$

where $\gamma_i$ is chosen to make $u_i^T A u_{i+1} = 0$ and the successive gradients perpendicular, i.e.,

$$\gamma_i = \frac{g_{i+1}^T\, g_{i+1}}{g_i^T\, g_i}, \qquad \text{with } g_k = -\nabla f\big|_{p_k} \text{ for all } k \geq 0.$$

Next, calculate $p_{i+2} = p_{i+1} + \lambda_{i+1} u_{i+1}$, where $\lambda_{i+1}$ is chosen so as to minimise $f(p_{i+2})$.

It can be shown that the $u_i$ thus constructed are all mutually conjugate (e.g., see Stoer & Bulirsch). The process described above is known as the Fletcher-Reeves method, but there are many variants which work more or less the same (Hestenes & Stiefel; Polak; Powell).

Although only $n$ iterations are needed for a quadratic system with $n$ degrees of freedom, due to the fact that we are not minimising quadratic systems, as well as a result of round-off errors, the $n$ directions have to be followed several times (see the figure below). Powell introduced some improvements to correct for behaviour in non-quadratic systems. The resulting cost is $O(n)$, which is significantly better than the linear convergence of steepest descent.
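A minimal sketch of the Fletcher-Reeves iteration on a quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$, where the line minimisation has a closed form (for a general $f$, a numerical line search would take its place); the example matrix is arbitrary:

```python
import numpy as np

def fletcher_reeves(A, b, x, iters=None):
    g = b - A @ x                          # g = -grad f, since grad f = A x - b
    u = g.copy()                           # first direction: steepest descent
    for _ in range(iters or len(b)):
        lam = (g @ u) / (u @ A @ u)        # exact line minimisation on a quadratic
        x = x + lam * u
        g_new = b - A @ x
        gamma = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves coefficient
        u = g_new + gamma * u
        g = g_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(fletcher_reeves(A, b, np.zeros(2)))  # ~[0.6, -0.8], the solution of A x = b
```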

(Footnotes: A matrix $A$ is called positive definite if $y^T A y > 0$ for all $y \neq 0$. Finding the minimising $\lambda$ is not a trivial problem (see Press et al.); however, line minimisation methods with super-linear convergence exist. A method is said to converge linearly if $E_{i+1} = c E_i$ with $c < 1$; methods which converge with a higher power, i.e., $E_{i+1} = c (E_i)^m$ with $m > 1$, are called super-linear.)


[Figure: Slow decrease with conjugate gradient in non-quadratic systems. The hills on the left are very steep, resulting in a large search vector $u_i$. When the quadratic portion is entered, the new search direction is constructed from the previous direction and the gradient, resulting in a spiraling minimisation. This problem can be overcome by detecting such spiraling minimisations and restarting the algorithm with $u_0 = -\nabla f$.]

Some improvements on back-propagation have been presented based on an independent adaptive learning rate parameter for each weight.

Van den Boomgaard and Smeulders (Boomgaard & Smeulders) show that for a feed-forward network without hidden units an incremental procedure to find the optimal weight matrix $W$ needs an adjustment of the weights with

$$\Delta W(t+1) = \gamma(t+1)\, \big( d(t+1) - W(t)\, x(t+1) \big)\, x(t+1),$$

in which $\gamma$ is not a constant but a variable $N_i \times N_i$ matrix which depends on the input vector. By using a priori knowledge about the input signal, the storage requirements for $\gamma$ can be reduced.

Silva and Almeida (Silva & Almeida) also show the advantages of an independent step size for each weight in the network. In their algorithm the learning rate $\gamma_{jk}$ is adapted after every learning pattern:

$$\gamma_{jk}(t+1) = \begin{cases}
u\, \gamma_{jk}(t) & \text{if } \dfrac{\partial E(t+1)}{\partial w_{jk}} \text{ and } \dfrac{\partial E(t)}{\partial w_{jk}} \text{ have the same signs;} \\[2ex]
d\, \gamma_{jk}(t) & \text{if } \dfrac{\partial E(t+1)}{\partial w_{jk}} \text{ and } \dfrac{\partial E(t)}{\partial w_{jk}} \text{ have opposite signs,}
\end{cases}$$

where $u$ and $d$ are positive constants with values slightly above and below unity, respectively. The idea is to decrease the learning rate in case of oscillations.
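A minimal sketch of this sign-based adaptation (our own naming; the values of u and d are merely illustrative):

```python
# Per-weight adaptive learning rates: grow the rate while the gradient
# keeps its sign, shrink it when the sign flips (oscillation).
def adapt_rates(gamma, grad, grad_prev, u=1.2, d=0.8):
    same_sign = np.sign(grad) == np.sign(grad_prev)
    return np.where(same_sign, u * gamma, d * gamma)

# Usage inside a training loop (gamma, grad, grad_prev: arrays shaped like W):
#   gamma = adapt_rates(gamma, grad, grad_prev)
#   W -= gamma * grad
#   grad_prev = grad
```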

How good are multi-layer feed-forward networks?

From the example shown above it is clear that the approximation of the network is not perfect. The resulting approximation error is influenced by:

1. The learning algorithm and number of iterations. This determines how well the error on the training set is minimised.


2. The number of learning samples. This determines how well the training samples represent the actual function.

3. The number of hidden units. This determines the expressive power of the network. For smooth functions only a few hidden units are needed; for wildly fluctuating functions more hidden units will be needed.

In the previous sections we discussed the learning rules, such as back-propagation and the other gradient-based learning algorithms, and the problem of finding the minimum error. In this section we particularly address the effect of the number of learning samples and the effect of the number of hidden units.

We first have to define an adequate error measure. All neural network training algorithms try to minimise the error of the set of learning samples which are available for training the network. The average error per learning sample is defined as the learning error rate:

$$E_{\text{learning}} = \frac{1}{P_{\text{learning}}} \sum_{p=1}^{P_{\text{learning}}} E^p,$$

in which $E^p$ is the difference between the desired output value and the actual network output for the learning samples:

$$E^p = \frac{1}{2} \sum_{o=1}^{N_o} \left( d_o^p - y_o^p \right)^2.$$

This is the error which is measurable during the training process.

It is obvious that the actual error of the network will differ from the error at the locations of the training samples. The difference between the desired output value and the actual network output should be integrated over the entire input domain to give a more realistic error measure. This integral can be estimated if we have a large set of samples: the test set. We now define the test error rate as the average error of the test set:

$$E_{\text{test}} = \frac{1}{P_{\text{test}}} \sum_{p=1}^{P_{\text{test}}} E^p.$$
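Both measures are the same computation on different sample sets; a small sketch, where net stands for any function mapping an input to the network outputs:

```python
# Average error per sample, used both as E_learning (on the training set)
# and as E_test (on a held-out test set).
def error_rate(net, xs, ds):
    return np.mean([0.5 * np.sum((d - net(x)) ** 2) for x, d in zip(xs, ds)])

# E_learning = error_rate(net, x_train, d_train)
# E_test     = error_rate(net, x_test, d_test)
```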

In the following subsections we will see how these error measures depend on the learning set size and the number of hidden units.

The effect of the number of learning samples

A simple problem is used as example: a function $y = f(x)$ has to be approximated with a feed-forward neural network. A neural network is created with an input, a number of hidden units with sigmoid activation function, and a linear output unit. Suppose we have only a small number of learning samples and the network is trained with these samples. Training is stopped when the error does not decrease anymore. The original (desired) function is shown in the figure below (a) as a dashed line. The learning samples and the approximation of the network are shown in the same figure. We see that in this case $E_{\text{learning}}$ is small (the network output goes perfectly through the learning samples) but $E_{\text{test}}$ is large: the test error of the network is large. The approximation obtained with a larger learning set is shown in the figure (b): the $E_{\text{learning}}$ is larger than in the case of the small learning set, but the $E_{\text{test}}$ is smaller.

This experiment was carried out with other learning set sizes, where for each learning set size the experiment was repeated several times. The resulting average learning and test error rates as a function of the learning set size are given in the second figure below. Note that the learning error increases with an increasing learning set size, and the test error decreases with increasing learning set size.


[Figure: Effect of the size of the learning set on the generalization. The dashed line gives the desired function, the learning samples are depicted as circles, and the approximation by the network is shown by the drawn line; a fixed number of hidden units is used. a) small learning set; b) larger learning set.]

A low learning error on the small learning set is no guarantee for a good network performance! With an increasing number of learning samples the two error rates converge to the same value. This value depends on the representational power of the network: given the optimal weights, how good is the approximation? This error depends on the number of hidden units and the activation function. If the learning error rate does not converge to the test error rate, the learning procedure has not found a global minimum.

[Figure: Effect of the learning set size on the error rate. The average learning error rate and the average test error rate as a function of the number of learning samples (one curve for the learning set, one for the test set).]

The effect of the number of hidden units

The same function as in the previous subsection is used, but now the number of hidden units is varied. The original (desired) function, the learning samples, and the network approximation are shown in the figure below, for a small number of hidden units (a) and for a large number of hidden units (b). The effect visible in the latter case is called overtraining. The network fits exactly with the learning samples, but because of the large number of hidden units the function which is actually represented by the network is far more wild than the original one. Particularly in the case of learning samples which contain a certain amount of noise (which all real-world data have), the network will 'fit the noise' of the learning samples instead of making a smooth approximation.
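This behaviour can be reproduced with a simple sweep over network sizes. The sketch below reuses the sigmoid, backprop_step_linear_output, and error_rate fragments from earlier; biases and the target function are our own illustrative choices, so the absolute numbers are only indicative.

```python
rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 20);  d_tr = np.sin(2 * np.pi * x_tr)
x_te = rng.uniform(0, 1, 200); d_te = np.sin(2 * np.pi * x_te)

# E_learning keeps dropping as hidden units are added, while E_test
# first drops and then rises again (the peaking effect).
for n_hidden in (1, 2, 5, 10, 20):
    W_ih = rng.normal(0.0, 0.5, (n_hidden, 1))   # one input unit
    W_ho = rng.normal(0.0, 0.5, (1, n_hidden))   # one linear output unit
    for _ in range(2000):                        # per-pattern training
        for p in rng.permutation(len(x_tr)):
            backprop_step_linear_output(x_tr[p:p+1], d_tr[p:p+1], W_ih, W_ho)
    net = lambda x: W_ho @ sigmoid(W_ih @ x)
    print(n_hidden,
          error_rate(net, x_tr[:, None], d_tr[:, None]),   # E_learning
          error_rate(net, x_te[:, None], d_te[:, None]))   # E_test
```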


[Figure: Effect of the number of hidden units on the network performance. The dashed line gives the desired function, the circles denote the learning samples, and the drawn line gives the approximation by the network. a) few hidden units; b) many hidden units.]

This example shows that a large number of hidden units leads to a small error on the training set, but not necessarily to a small error on the test set. Adding hidden units will always lead to a reduction of the $E_{\text{learning}}$. However, adding hidden units will first lead to a reduction of the $E_{\text{test}}$, but then lead to an increase of $E_{\text{test}}$. This effect is called the peaking effect. The average learning and test error rates as a function of the number of hidden units are given in the figure below.

[Figure: The average learning error rate and the average test error rate as a function of the number of hidden units.]

Applications

Back-propagation has been applied to a wide variety of research applications. Sejnowski and Rosenberg (1987) produced a spectacular success with NETtalk, a system that converts printed English text into highly intelligible speech.

A feed-forward network with one layer of hidden units has been described by Gorman and Sejnowski (1988) as a classification machine for sonar signals.

Another application of a multi-layer feed-forward network with a back-propagation training algorithm is to learn an unknown function between input and output signals from the presentation of examples.


It is hoped that the network is able to generalise correctly, so that input values which are not presented as learning patterns will result in correct output values. An example is the work of Josin, who used a two-layer feed-forward network with back-propagation learning to perform the inverse kinematic transform which is needed by a robot arm controller (see the chapter on robot control).

Recurrent Networks

The learning algorithms discussed in the previous chapter were applied to feed-forward networks: all data flows in a network in which no cycles are present.

But what happens when we introduce a cycle? For instance, we can connect a hidden unit with itself over a weighted connection, connect hidden units to input units, or even connect all units with each other. Although, as we know from the previous chapter, the approximational capabilities of such networks do not increase, we may obtain a decreased complexity or network size to solve the same problem.

An important question we have to consider is the following: what do we want to learn in a recurrent network? After all, when one is considering a recurrent network, it is possible to continue propagating activation values ad infinitum, or until a stable point (attractor) is reached. As we will see in the sequel, there exist recurrent networks which are attractor based, i.e., the activation values in the network are repeatedly updated until a stable point is reached, after which the weights are adapted, but there are also recurrent networks where the learning rule is used after each propagation (where an activation value is transversed over each weight only once), while external inputs are included in each propagation. In such networks, the recurrent connections can be regarded as extra inputs to the network, the values of which are computed by the network itself.

In this chapter, recurrent extensions to the feed-forward network introduced in the previous chapters will be discussed, yet not to exhaustion. The theory of the dynamics of recurrent networks extends beyond the scope of a one-semester course on neural networks. Yet the basics of these networks will be discussed.

Subsequently some special recurrent networks will be discussed: the Hopfield network, which can be used for the representation of binary patterns; subsequently we touch upon Boltzmann machines, therewith introducing stochasticity in neural computation.

The generalised delta-rule in recurrent networks

The back-propagation learning rule, introduced in the previous chapter, can be easily used for training patterns in recurrent networks. Before we consider this general case, however, we will first describe networks where some of the hidden unit activation values are fed back to an extra set of input units (the Elman network), or where output values are fed back into hidden units (the Jordan network).

A typical application of such a network is the following. Suppose we have to construct a network that must generate a control command depending on an external input, which is a time series $x(t), x(t-1), x(t-2), \ldots$. With a feed-forward network there are two possible approaches:

1. create inputs $x_1, x_2, \ldots, x_n$ which constitute the last $n$ values of the input vector. Thus a 'time window' of the input vector is input to the network (see the sketch below);

2. create inputs $x, x', x'', \ldots$. Besides only inputting $x(t)$, we also input its first, second, etc. derivatives.
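A minimal sketch of the first, time-window approach (our own naming): each training input is simply the last $n$ samples of the series.

```python
import numpy as np

# Turn a scalar time series into (input, target) pairs where the input
# is a sliding window of the last n values and the target is the next
# value the network should produce.
def time_windows(series, n):
    xs, ds = [], []
    for t in range(n, len(series)):
        xs.append(series[t - n:t])   # x(t-n), ..., x(t-1)
        ds.append(series[t])         # target at time t
    return np.array(xs), np.array(ds)

series = np.sin(np.linspace(0.0, 10.0, 200))
X, D = time_windows(series, n=5)     # X: (195, 5), D: (195,)
```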
