An Introduction to Neural Networks

Ben Kröse          Patrick van der Smagt

Eighth edition
November 1996

© The University of Amsterdam. Permission is granted to distribute single copies of this book for non-commercial use, as long as it is distributed as a whole in its original form and the names of the authors and the University of Amsterdam are mentioned. Permission is also granted to use this book for non-commercial courses, provided the authors are notified of this beforehand.

The authors can be reached at:

Ben Kröse
Faculty of Mathematics & Computer Science
University of Amsterdam
Kruislaan, Amsterdam
THE NETHERLANDS

Patrick van der Smagt
Institute of Robotics and System Dynamics
German Aerospace Research Establishment
P.O. Box, Wessling
GERMANY
Contents

Preface

I   FUNDAMENTALS

Introduction

Fundamentals
    A framework for distributed representation
        Processing units
        Connections between units
        Activation and output rules
    Network topologies
    Training of artificial neural networks
        Paradigms of learning
        Modifying patterns of connectivity
    Notation and terminology
        Notation
        Terminology

II  THEORY

Perceptron and Adaline
    Networks with threshold activation functions
    Perceptron learning rule and convergence theorem
        Example of the Perceptron learning rule
        Convergence theorem
        The original Perceptron
    The adaptive linear element (Adaline)
    Networks with linear activation functions: the delta rule
    Exclusive-OR problem
    Multi-layer perceptrons can do everything
    Conclusions

Back-propagation
    Multi-layer feed-forward networks
    The generalised delta rule
        Understanding back-propagation
    Working with back-propagation
    An example
    Other activation functions
    Deficiencies of back-propagation
    Advanced algorithms
    How good are multi-layer feed-forward networks?
        The effect of the number of learning samples
        The effect of the number of hidden units
    Applications

Recurrent Networks
    The generalised delta-rule in recurrent networks
        The Jordan network
        The Elman network
        Back-propagation in fully recurrent networks
    The Hopfield network
        Description
        Hopfield network as associative memory
        Neurons with graded response
    Boltzmann machines

Self-organising Networks
    Competitive learning
        Clustering
        Vector quantisation
    Kohonen network
    Principal component networks
        Introduction
        Normalised Hebbian rule
        Principal component extractor
        More eigenvectors
    Adaptive resonance theory
        Background: Adaptive resonance theory
        ART1: The simplified neural network model
        ART1: The original model

Reinforcement learning
    The critic
    The controller network
    Barto's approach: the ASE-ACE combination
        Associative search
        Adaptive critic
        The cart-pole system
    Reinforcement learning versus optimal control

III APPLICATIONS

Robot Control
    End-effector positioning
        Camera-robot coordination is function approximation
    Robot arm dynamics
    Mobile robots
        Model based navigation
        Sensor based control

Vision
    Introduction
    Feed-forward types of networks
    Self-organising networks for image compression
        Back-propagation
        Linear networks
        Principal components as features
    The cognitron and neocognitron
        Description of the cells
        Structure of the cognitron
        Simulation results
    Relaxation types of networks
        Depth from stereo
        Image restoration and image segmentation
        Silicon retina

IV  IMPLEMENTATIONS

General Purpose Hardware
    The Connection Machine
        Architecture
        Applicability to neural networks
    Systolic arrays

Dedicated Neuro-Hardware
    General issues
        Connectivity constraints
        Analogue vs. digital
        Optics
        Learning vs. non-learning
    Implementation examples
        Carver Mead's silicon retina
        LEP's LNeuro chip

References

Index
List of Figures

The basic components of an artificial neural network
Various activation functions for a unit
Single layer network with one output and two inputs
Geometric representation of the discriminant function and the weights
Discriminant function before and after weight update
The Perceptron
The Adaline
Geometric representation of input space
Solution of the XOR problem
A multi-layer network with l layers of units
The descent in weight space
Example of function approximation with a feedforward network
The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions
The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions
Slow decrease with conjugate gradient in non-quadratic systems
Effect of the learning set size on the generalization
Effect of the learning set size on the error rate
Effect of the number of hidden units on the network performance
Effect of the number of hidden units on the error rate
The Jordan network
The Elman network
Training an Elman network to control an object
Training a feed-forward network to control an object
The auto-associator network
A simple competitive learning network
Example of clustering in 3D with normalised vectors
Determining the winner in a competitive learning network
Competitive learning for clustering data
Vector quantisation tracks input density
A network combining a vector quantisation layer with a feed-forward neural network layer. This network can be used to approximate functions; the input space is discretised in a number of disjoint subspaces
Gaussian neuron distance function
A topology-conserving map converging
The mapping of a two-dimensional input space on a one-dimensional Kohonen network
Mexican hat
Distribution of input samples
The ART architecture
The ART1 neural network
An example ART run
Reinforcement learning scheme
Architecture of a reinforcement learning scheme with critic element
The cart-pole system
An exemplar robot manipulator
Indirect learning system for robotics
The system used for specialised learning
A Kohonen network merging the output of two cameras
The neural model proposed by Kawato et al.
The neural network used by Kawato et al.
The desired joint pattern for one joint; the other joints have similar time patterns
Schematic representation of the stored rooms, and the partial information which is available from a single sonar scan
The structure of the network for the autonomous land vehicle
Input image for the network
Weights of the PCA network
The basic structure of the cognitron
Cognitron receptive regions
Two learning iterations in the cognitron
Feeding back activation values in the cognitron
The Connection Machine system organisation
Typical use of a systolic array
The Warp system architecture
Connections between M input and N output neurons
Optical implementation of matrix multiplication
The photo-receptor used by Mead
The resistive layer (a) and, enlarged, a single node (b)
The LNeuro chip
Preface

This manuscript attempts to provide the reader with insight in artificial neural networks. Back then, the absence of any state-of-the-art textbook forced us into writing our own. However, in the meantime a number of worthwhile textbooks have been published which can be used for background and in-depth information. We are aware of the fact that, at times, this manuscript may prove to be too thorough or not thorough enough for a complete understanding of the material; therefore, further reading material can be found in some excellent text books such as (Hertz, Krogh, & Palmer; Ritter, Martinetz, & Schulten; Kohonen; Anderson & Rosenfeld; DARPA; McClelland & Rumelhart; Rumelhart & McClelland).

Some of the material in this book, especially parts III and IV, contains timely material and thus may heavily change throughout the ages. The choice of describing robotics and vision as neural network applications coincides with the neural network research interests of the authors.

Much of the material presented in chapter … has been written by Joris van Dam and Anuj Dev at the University of Amsterdam. Also, Anuj contributed to material in chapter …. The basis of chapter … was formed by a report of Gerard Schram at the University of Amsterdam. Furthermore, we express our gratitude to those people out there in Net-land who gave us feedback on this manuscript, especially Michiel van der Korst and Nicolas Maudit, who pointed out quite a few of our goof-ups. We owe them many kwartjes for their help.

The seventh edition is not drastically different from the sixth one; we corrected some typing errors, added some examples, and deleted some obscure parts of the text. In the eighth edition, symbols used in the text have been globally changed. Also, the chapter on recurrent networks has been (albeit marginally) updated. The index still requires an update, though.

Amsterdam / Oberpfaffenhofen, November 1996
Patrick van der Smagt
Ben Kröse
Part I

FUNDAMENTALS

Introduction

A first wave of interest in neural networks (also known as 'connectionist models' or 'parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts in 1943 (McCulloch & Pitts). These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks.

When Minsky and Papert published their book Perceptrons in 1969 (Minsky & Papert), in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued their efforts, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, and Kunihiko Fukushima.

The interest in neural networks re-emerged only after some important theoretical results were attained in the early eighties (most notably the discovery of error back-propagation) and new hardware developments increased the processing capacities. This renewed interest is reflected in the number of scientists, the amounts of funding, the number of large conferences, and the number of journals associated with neural networks. Nowadays most universities have a neural networks group, within their psychology, physics, computer science, or biology departments.

Artificial neural networks can be most adequately characterised as 'computational models' with particular properties such as the ability to adapt or learn, to generalise, or to cluster or organise data, and whose operation is based on parallel processing. However, many of the above-mentioned properties can be attributed to existing (non-neural) models; the intriguing question is to which extent the neural approach proves to be better suited for certain applications than existing models. To date an unequivocal answer to this question is not found.

Often parallels with biological systems are described. However, there is still so little known (even at the lowest cell level) about biological systems that the models we are using for our artificial neural systems seem to introduce an oversimplification of the 'biological' models.

In this course we give an introduction to artificial neural networks. The point of view we take is that of a computer scientist. We are not concerned with the psychological implications of the networks, and we will at most occasionally refer to biological neural models. We consider neural networks as an alternative computational scheme rather than anything else.

These lecture notes start with a chapter in which a number of fundamental properties are discussed, followed by a chapter in which a number of 'classical' approaches are described, as well as the discussion on their limitations which took place in the early sixties. The chapter after that continues with the description of attempts to overcome these limitations and introduces the back-propagation learning algorithm. Next, recurrent networks are discussed; in these networks, the restraint that there are no cycles in the network graph is removed. Self-organising networks, which require no external teacher, are discussed in the chapter that follows; then reinforcement learning is introduced. Two further chapters focus on applications of neural networks in the fields of robotics and image processing, respectively. The final chapters discuss implementational aspects.
Fundamentals

The artificial neural networks which we describe in this course are all variations on the parallel distributed processing (PDP) idea. The architecture of each network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and discuss different network topologies. Learning strategies, as a basis for an adaptive system, will be presented in the last section.

A framework for distributed representation

An artificial network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.

A set of major aspects of a parallel distributed model can be distinguished (cf. Rumelhart and McClelland (McClelland & Rumelhart; Rumelhart & McClelland)):

- a set of processing units ('neurons,' 'cells');
- a state of activation $y_k$ for every unit, which is equivalent to the output of the unit;
- connections between the units. Generally each connection is defined by a weight $w_{jk}$, which determines the effect which the signal of unit $j$ has on unit $k$;
- a propagation rule, which determines the effective input $s_k$ of a unit from its external inputs;
- an activation function $F_k$, which determines the new level of activation based on the effective input $s_k(t)$ and the current activation $y_k(t)$ (i.e., the update);
- an external input (aka bias, offset) $\theta_k$ for each unit;
- a method for information gathering (the learning rule);
- an environment within which the system must operate, providing input signals and, if necessary, error signals.

The figure below illustrates these basics, some of which will be discussed in the next sections.
Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time.

Within neural systems it is useful to distinguish three types of units: input units (indicated by an index $i$), which receive data from outside the neural network; output units (indicated by an index $o$), which send data out of the neural network; and hidden units (indicated by an index $h$), whose input and output signals remain within the neural network.

[Figure: The basic components of an artificial neural network. The propagation rule used here is the standard weighted summation.]

During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time $t$, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.
Connections between units

In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit $k$ is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term $\theta_k$:

    $s_k(t) = \sum_j w_{jk}(t)\, y_j(t) + \theta_k(t).$

The contribution for positive $w_{jk}$ is considered as an excitation and for negative $w_{jk}$ as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with this propagation rule sigma units.

A different propagation rule, introduced by Feldman and Ballard (Feldman & Ballard), is known as the propagation rule for the sigma-pi unit:

    $s_k(t) = \sum_j w_{jk}(t) \prod_m y_{j_m}(t) + \theta_k(t).$

Often, the $y_{j_m}$ are weighted before multiplication. Although these units are not frequently used, they have their value for gating of input, as well as implementation of lookup tables (Mel).
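As an illustration (not part of the original text), the following minimal Python sketch implements the two propagation rules above; the function names and example values are our own.

    # Minimal sketch of the sigma and sigma-pi propagation rules (illustrative only).
    def sigma_input(weights, outputs, bias):
        """Standard weighted summation: s_k = sum_j w_jk * y_j + theta_k."""
        return sum(w * y for w, y in zip(weights, outputs)) + bias

    def sigma_pi_input(weights, output_groups, bias):
        """Sigma-pi rule: s_k = sum_j w_jk * prod_m y_jm + theta_k.
        Each element of output_groups is the tuple of outputs multiplied together."""
        s = bias
        for w, group in zip(weights, output_groups):
            prod = 1.0
            for y in group:
                prod *= y
            s += w * prod
        return s

    # Example: three incoming connections with a small bias term.
    print(sigma_input([0.5, -1.0, 2.0], [1.0, 0.2, 0.1], bias=0.1))
    print(sigma_pi_input([1.0, 0.5], [(1.0, 0.2), (0.3,)], bias=0.0))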
Activation and output rules

We also need a rule which gives the effect of the total input on the activation of the unit. We need a function $F_k$ which takes the total input $s_k(t)$ and the current activation $y_k(t)$ and produces a new value of the activation of the unit $k$:

    $y_k(t+1) = F_k\bigl(y_k(t), s_k(t)\bigr).$

Often, the activation function is a nondecreasing function of the total input of the unit:

    $y_k(t+1) = F_k\bigl(s_k(t)\bigr) = F_k\Bigl(\sum_j w_{jk}(t)\, y_j(t) + \theta_k(t)\Bigr),$

although activation functions are not restricted to nondecreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sgn function), a linear or semi-linear function, or a smoothly limiting threshold (see the figure below). For this smoothly limiting function often a sigmoid (S-shaped) function like

    $y_k = F(s_k) = \frac{1}{1 + e^{-s_k}}$

is used. In some applications a hyperbolic tangent is used, yielding output values in the range $[-1, +1]$.

[Figure: Various activation functions for a unit: sgn, semi-linear, sigmoid.]

In some cases, the output of a unit can be a stochastic function of the total input of the unit. In that case the activation is not deterministically determined by the neuron input, but the neuron input determines the probability $p$ that a neuron gets a high activation value:

    $p(y_k \leftarrow 1) = \frac{1}{1 + e^{-s_k / T}},$

in which $T$ (the 'temperature') is a parameter which determines the slope of the probability function. This type of unit will be discussed more extensively in a later chapter.

In all networks we describe, we consider the output of a neuron to be identical to its activation level.
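As an illustration (not from the original text), the activation functions mentioned above can be sketched in a few lines of Python; all names here are our own.

    # Illustrative sketch of the activation functions discussed above.
    import math

    def sgn(s):
        """Hard limiting threshold: +1 for positive input, -1 otherwise."""
        return 1.0 if s > 0 else -1.0

    def semi_linear(s):
        """Linear between 0 and 1, clipped at both ends ('semi-linear')."""
        return min(1.0, max(0.0, s))

    def sigmoid(s):
        """Smoothly limiting threshold: y = 1 / (1 + exp(-s))."""
        return 1.0 / (1.0 + math.exp(-s))

    def stochastic_high(s, T=1.0):
        """Probability that a stochastic unit takes the high activation value;
        the temperature T controls the slope of the probability function."""
        return 1.0 / (1.0 + math.exp(-s / T))

    for s in (-2.0, 0.0, 2.0):
        print(s, sgn(s), semi_linear(s), round(sigmoid(s), 3), round(stochastic_high(s, T=0.5), 3))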
Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data.

As for this pattern of connections, the main distinction we can make is between:

- Feed-forward networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple layers of units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers.

- Recurrent networks, which do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behaviour constitutes the output of the network (Pearlmutter).

Classical examples of feed-forward networks are the Perceptron and Adaline, which will be discussed in the next chapter. Examples of recurrent networks have been presented by Anderson (Anderson), Kohonen (Kohonen), and Hopfield (Hopfield) and will be discussed in a later chapter.
Training of artificial neural networks

A neural network has to be configured such that the application of a set of inputs produces (either 'direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.

Paradigms of learning

We can categorise the learning situations in two distinct sorts. These are:

- Supervised learning or Associative learning, in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the network (self-supervised).

- Unsupervised learning or Self-organisation, in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.
Modifying patterns of connectivity

Both learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (Hebb). The basic idea is that if two units $j$ and $k$ are active simultaneously, their interconnection must be strengthened. If $j$ receives input from $k$, the simplest version of Hebbian learning prescribes to modify the weight $w_{jk}$ with

    $\Delta w_{jk} = \gamma\, y_j\, y_k,$

where $\gamma$ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit $k$ but the difference between the actual and desired activation for adjusting the weights:

    $\Delta w_{jk} = \gamma\, y_j\, (d_k - y_k),$

in which $d_k$ is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter.

Many variants (often very exotic ones) have been published in the last few years. In the next chapters some of these update rules will be discussed.
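The two update rules above translate directly into code. The following sketch (our own, not part of the book) shows both updates on a single weight; the constant names are assumptions.

    # Illustrative sketch of the Hebbian and Widrow-Hoff (delta) weight updates.
    LEARNING_RATE = 0.1  # gamma

    def hebbian_update(w_jk, y_j, y_k, gamma=LEARNING_RATE):
        """Hebbian rule: strengthen w_jk when units j and k are active together."""
        return w_jk + gamma * y_j * y_k

    def delta_update(w_jk, y_j, y_k, d_k, gamma=LEARNING_RATE):
        """Delta rule: move w_jk in proportion to the error d_k - y_k."""
        return w_jk + gamma * y_j * (d_k - y_k)

    w = 0.0
    w = hebbian_update(w, y_j=1.0, y_k=1.0)          # both units active, weight grows
    w = delta_update(w, y_j=1.0, y_k=0.2, d_k=1.0)   # output too low, weight grows further
    print(w)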
Notation and terminology

Throughout the years, researchers from different disciplines have come up with a vast number of terms applicable in the field of neural networks. Our computer scientist point of view enables us to adhere to a subset of the terminology which is less biologically inspired, yet conflicts still arise. Our conventions are discussed below.
Notation

We use the following notation in our formulae. Note that not all symbols are meaningful for all networks, and that in some cases subscripts or superscripts may be left out (e.g., $p$ is often not necessary) or added where necessary. Vectors are indicated with a bold non-slanted font:

$j$, $k$, ...   the unit $j$, $k$, ...;
$i$             an input unit;
$h$             a hidden unit;
$o$             an output unit;
$x^p$           the $p$th input pattern vector;
$x_j^p$         the $j$th element of the $p$th input pattern vector;
$s^p$           the input to a set of neurons when input pattern vector $p$ is clamped (i.e., presented to the network); often: the input of the network by clamping input pattern vector $p$;
$d^p$           the desired output of the network when input pattern vector $p$ was input to the network;
$d_j^p$         the $j$th element of the desired output of the network when input pattern vector $p$ was input to the network;
$y^p$           the activation values of the network when input pattern vector $p$ was input to the network;
$y_j^p$         the activation values of element $j$ of the network when input pattern vector $p$ was input to the network;
$W$             the matrix of connection weights;
$w_j$           the weights of the connections which feed into unit $j$;
$w_{jk}$        the weight of the connection from unit $j$ to unit $k$;
$F_j$           the activation function associated with unit $j$;
$\gamma_{jk}$   the learning rate associated with weight $w_{jk}$;
$\theta$        the biases to the units;
$\theta_j$      the bias input to unit $j$;
$U_j$           the threshold of unit $j$ in $F_j$;
$E^p$           the error in the output of the network when input pattern vector $p$ is input;
$E$             the energy of the network.
Terminology

Output vs. activation of a unit. Since there is no need to do otherwise, we consider the output and the activation value of a unit to be one and the same thing. That is, the output of each neuron equals its activation value.

Bias, offset, threshold. These terms all refer to a constant (i.e., independent of the network input but adapted by the learning rule) term which is input to a unit. They may be used interchangeably, although the latter two terms are often envisaged as a property of the activation function. Furthermore, this external input is usually implemented (and can be written) as a weight from a unit with activation value 1.

Number of layers. In a feed-forward network, the inputs perform no computation and their layer is therefore not counted. Thus a network with one input layer, one hidden layer, and one output layer is referred to as a network with two layers. This convention is widely though not yet universally used.

Representation vs. learning. When using a neural network one has to distinguish two issues which influence the performance of the system. The first one is the representational power of the network, the second one is the learning algorithm.

The representational power of a neural network refers to the ability of a neural network to represent a desired function. Because a neural network is built from a set of standard functions, in most cases the network will only approximate the desired function, and even for an optimal set of weights the approximation error is not zero.

The second issue is the learning algorithm. Given that there exists a set of optimal weights in the network, is there a procedure to (iteratively) find this set of weights?
Part II

THEORY

Perceptron and Adaline

This chapter describes single layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of the single layer networks and their learning algorithms, and will give some examples of using the networks. In the second part we will discuss the representational limitations of single layer networks.

Two 'classical' models will be described in the first part of the chapter: the Perceptron, proposed by Rosenblatt (Rosenblatt) in the late 50's, and the Adaline, presented in the early 60's by Widrow and Hoff (Widrow & Hoff).
Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons $o$, each of which is connected with a weighting factor $w_{io}$ to all of the inputs $i$. In the simplest case the network has only two inputs and a single output, as sketched in the figure below (we leave the output index $o$ out). The input of the neuron is the weighted sum of the inputs plus the bias term.

[Figure: Single layer network with one output and two inputs $x_1$, $x_2$, weights $w_1$, $w_2$, and bias $\theta$.]

The output of the network is formed by the activation of the output neuron, which is some function of the input:

    $y = F\Bigl(\sum_{i=1}^{2} w_i x_i + \theta\Bigr).$

The activation function $F$ can be linear, so that we have a linear network, or nonlinear. In this section we consider the threshold (or Heaviside or sgn) function:

    $F(s) = \begin{cases} 1 & \text{if } s > 0, \\ -1 & \text{otherwise.} \end{cases}$

The output of the network thus is either $+1$ or $-1$, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class $+1$; if the
total input is negative, the sample will be assigned to class $-1$. The separation between the two classes in this case is a straight line, given by the equation

    $w_1 x_1 + w_2 x_2 + \theta = 0.$

Thus the single layer network represents a linear discriminant function.

A geometrical representation of the linear threshold neural network is given in the figure below. The equation above can be written as

    $x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{\theta}{w_2},$

and we see that the weights determine the slope of the line and the bias determines the 'offset', i.e., how far the line is from the origin. Note that also the weights can be plotted in the input space: the weight vector is always perpendicular to the discriminant function.

[Figure: Geometric representation of the discriminant function and the weights.]

Now that we have shown the representational power of the single layer network with linear threshold units, we come to the second issue: how do we learn the weights and biases in the network? We will describe two learning methods for these types of networks: the 'perceptron' learning rule and the 'delta' or 'LMS' rule. Both methods are iterative procedures that adjust the weights. A learning sample is presented to the network. For each weight, the new value is computed by adding a correction to the old value. The threshold is updated in the same way:

    $w_i(t+1) = w_i(t) + \Delta w_i(t),$
    $\theta(t+1) = \theta(t) + \Delta\theta(t).$

The learning problem can now be formulated as: how do we compute $\Delta w_i(t)$ and $\Delta\theta(t)$ in order to classify the learning patterns correctly?
Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector $x$ and a desired output $d(x)$. For a classification task the $d(x)$ is usually $+1$ or $-1$. The perceptron learning rule is very simple and can be stated as follows:

1. Start with random weights for the connections;
2. Select an input vector $x$ from the set of training samples;
3. If $y \ne d(x)$ (the perceptron gives an incorrect response), modify all connections $w_i$ according to: $\Delta w_i = d(x)\, x_i$;
4. Go back to 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that, when the network responds correctly, no connection weights are modified. Besides modifying the weights, we must also modify the threshold $\theta$. This $\theta$ is considered as a connection $w_0$ between the output neuron and a 'dummy' predicate unit which is always on: $x_0 = 1$. Given the perceptron learning rule as stated above, this threshold is modified according to:

    $\Delta\theta = \begin{cases} 0 & \text{if the perceptron responds correctly;} \\ d(x) & \text{otherwise.} \end{cases}$
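A minimal sketch of this training loop follows (our own code, not from the book). The threshold is handled as a weight on a dummy input $x_0 = 1$, as described above; the data set and parameter names are assumptions.

    # Sketch of the perceptron learning rule (steps 1-4 above).
    import random

    def sgn(s):
        return 1 if s > 0 else -1

    def train_perceptron(samples, epochs=100):
        """samples: list of (x, d) with x a tuple of inputs and d the target (+1 or -1)."""
        n = len(samples[0][0])
        w = [random.uniform(-0.5, 0.5) for _ in range(n)]   # step 1: random weights
        theta = 0.0
        for _ in range(epochs):
            for x, d in samples:                            # step 2: select a sample
                y = sgn(sum(wi * xi for wi, xi in zip(w, x)) + theta)
                if y != d:                                  # step 3: wrong response, update
                    w = [wi + d * xi for wi, xi in zip(w, x)]
                    theta += d                              # dummy input x0 = 1
        return w, theta

    # A linearly separable (AND-like) problem with targets in {-1, +1}.
    data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
    print(train_perceptron(data))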
Example of the Perceptron learning rule

A perceptron is initialized with certain weights $w_1$, $w_2$ and bias $\theta$. The perceptron learning rule is used to learn a correct discriminant function for a number of samples, sketched in the figure below. The first sample A, with target value $d(x) = +1$, is presented to the network. From the network equation it can be calculated that the network output is $+1$, so no weights are adjusted. The same is the case for point B, with target value $d(x) = -1$: the network output is negative, so no change. When presenting point C, with target value $d(x) = +1$, the network output will be $-1$, while the target value is $+1$. According to the perceptron learning rule, the weight changes are $\Delta w_1 = x_1$, $\Delta w_2 = x_2$, $\Delta\theta = 1$. After adding these corrections to the old values, sample C is classified correctly.

In the figure below, the discriminant function before and after this weight update is shown.

[Figure: Discriminant function before and after weight update, with samples A, B, and C, the original discriminant function, and the function after the weight update.]
Convergence theorem

For the perceptron learning rule there exists a convergence theorem, which states the following:

Theorem 1  If there exists a set of connection weights $w^*$ which is able to perform the transformation $y = d(x)$, the perceptron learning rule will converge to some solution (which may or may not be the same as $w^*$) in a finite number of steps for any initial choice of the weights.

Proof  Given the fact that the length of the vector $w^*$ does not play a role (because of the sgn operation), we take $\|w^*\| = 1$. Because $w^*$ is a correct solution, the value $|w^* \cdot x|$, where $\cdot$ denotes dot or inner product, will be greater than $0$, or: there exists a $\delta > 0$ such that $|w^* \cdot x| > \delta$ for all inputs $x$.¹ Now define $\cos\alpha \equiv w \cdot w^* / \|w\|$. When, according to the perceptron learning rule, connection weights are modified at a given input $x$, we know that $\Delta w = d(x)\,x$, and the weight after modification is $w' = w + \Delta w$. From this it follows that

    $w' \cdot w^* = w \cdot w^* + d(x)\, w^* \cdot x = w \cdot w^* + \operatorname{sgn}(w^* \cdot x)\, w^* \cdot x > w \cdot w^* + \delta,$

    $\|w'\|^2 = \|w + d(x)\,x\|^2 = w^2 + 2\,d(x)\, w \cdot x + x^2 < w^2 + x^2 \quad (\text{because } d(x) = -\operatorname{sgn}(w \cdot x)\,!) \;=\; w^2 + M.$

After $t$ modifications we have

    $w(t) \cdot w^* > w \cdot w^* + t\,\delta,$
    $\|w(t)\|^2 < w^2 + t\,M,$

such that

    $\cos\alpha(t) = \frac{w^* \cdot w(t)}{\|w(t)\|} > \frac{w \cdot w^* + t\,\delta}{\sqrt{w^2 + t\,M}}.$

From this it follows that $\lim_{t\to\infty} \cos\alpha(t) = \lim_{t\to\infty} \frac{\delta\sqrt{t}}{\sqrt{M}} = \infty$, while by definition $\cos\alpha \le 1$! The conclusion is that there must be an upper limit $t_{\max}$ for $t$. The system modifies its connections only a limited number of times. In other words: after maximally $t_{\max}$ modifications of the weights the perceptron is correctly performing the mapping. $t_{\max}$ will be reached when $\cos\alpha = 1$. If we start with connections $w = 0$,

    $t_{\max} = \frac{M}{\delta^2}.$

¹ Technically this need not be true for any $w^*$; $w^* \cdot x$ could in fact be equal to $0$ for a $w^*$ which yields no misclassifications (look at the definition of $F$). However, another $w^*$ can be found for which the quantity will never be $0$. (Thanks to: Terry Regier, Computer Science, UC Berkeley.)
The original Perceptron

The Perceptron, proposed by Rosenblatt (Rosenblatt), is somewhat more complex than a single layer network with threshold activation functions. In its simplest form it consists of an N-element input layer ('retina') which feeds into a layer of M 'association,' 'mask,' or 'predicate' units $\varphi_h$, and a single output unit. The goal of the operation of the perceptron is to learn a given transformation $y = d(x)$ using learning samples with input $x$ and corresponding output $y = d(x)$. In the original definition, the activity of the predicate units can be any function $\varphi_h$ of the input layer $x$, but the learning procedure only adjusts the connections to the output unit. The reason for this is that no recipe had been found to adjust the connections between $x$ and $\varphi_h$. Depending on the functions $\varphi_h$, perceptrons can be grouped into different families. In (Minsky & Papert) a number of these families are described, and properties of these families have been described. The output unit of a perceptron is a linear threshold element.

Rosenblatt (Rosenblatt) proved the remarkable theorem about perceptron learning, and in the early 60's perceptrons created a great deal of interest and optimism. The initial euphoria was replaced by disillusion after the publication of Minsky and Papert's Perceptrons in 1969 (Minsky & Papert). In this book they analysed the perceptron thoroughly and proved that there are severe restrictions on what perceptrons can represent.

[Figure: The Perceptron: predicate units φ₁ ... φₙ feeding a single output unit Ψ.]
The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and Hoff as the 'least mean square' (LMS) learning procedure, also known as the delta rule. The main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either $-1$ or $+1$) for learning. The delta-rule uses the net output without further mapping into output values $-1$ or $+1$.

The learning rule was applied to the 'adaptive linear element,' also named Adaline², developed by Widrow and Hoff (Widrow & Hoff). In a simple physical implementation (see the figure below), this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals. Usually the central block, the summer, is also followed by a quantiser which outputs either $+1$ or $-1$, depending on the polarity of the sum.

[Figure: The Adaline. Input pattern switches feed, through gains (weights) $w_0 \ldots w_3$, a summer $\Sigma$; a quantiser maps the level to $+1$ or $-1$, and comparison with a reference switch provides the error signal.]

Although the adaptive process is here exemplified in a case when there is only one output, it may be clear that a system with many parallel outputs is directly implementable by multiple units of the above kind.

If the input conductances are denoted by $w_i$, $i = 0, 1, \ldots, n$, and the input and output signals by $x_i$ and $y$, respectively, then the output of the central block is defined to be

    $y = \sum_{i=1}^{n} w_i x_i + \theta,$

where $\theta \equiv w_0$. The purpose of this device is to yield a given value $y = d^p$ at its output when the set of values $x_i^p$, $i = 1, 2, \ldots, n$, is applied at the inputs. The problem is to determine the coefficients $w_i$, $i = 0, 1, \ldots, n$, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance in the sense of least squares. An adaptive operation means that there exists a mechanism by which the $w_i$ can be adjusted, usually iteratively, to attain the correct values. For the Adaline, Widrow introduced the delta rule to adjust the weights. This rule will be discussed in the next section.

² ADALINE first stood for ADAptive LInear NEuron, but when artificial neurons became less and less popular this acronym was changed to ADAptive LINear Element.
Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function, the output is simply given by

    $y = \sum_j w_j x_j.$

Such a simple network is able to represent a linear relationship between the value of the output unit and the value of the input units. By thresholding the output value, a classifier can be constructed (such as Widrow's Adaline), but here we focus on the linear relationship and use the network for a function approximation task. In high dimensional input spaces the network represents a (hyper)plane, and it will be clear that also multiple output units may be defined.

Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of training samples consisting of input values $x^p$ and desired (or target) output values $d^p$. For every given input sample, the output of the network differs from the target value $d^p$ by $d^p - y^p$, where $y^p$ is the actual output for this pattern. The delta-rule now uses a cost- or error-function based on these differences to adjust the weights.

The error function, as indicated by the name least mean square, is the summed squared error. That is, the total error $E$ is defined to be

    $E = \sum_p E^p = \frac{1}{2} \sum_p \bigl(d^p - y^p\bigr)^2,$

where the index $p$ ranges over the set of input patterns and $E^p$ represents the error on pattern $p$. The LMS procedure finds the values of all the weights that minimise the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

    $\Delta_p w_j = -\gamma\, \frac{\partial E^p}{\partial w_j},$

where $\gamma$ is a constant of proportionality. The derivative is

    $\frac{\partial E^p}{\partial w_j} = \frac{\partial E^p}{\partial y^p}\, \frac{\partial y^p}{\partial w_j}.$

Because of the linear units,

    $\frac{\partial y^p}{\partial w_j} = x_j$

and

    $\frac{\partial E^p}{\partial y^p} = -\bigl(d^p - y^p\bigr),$

such that

    $\Delta_p w_j = \gamma\, \delta^p\, x_j,$

where $\delta^p = d^p - y^p$ is the difference between the target output and the actual output for pattern $p$.

The delta rule modifies weights appropriately for target and actual outputs of either polarity and for both continuous and binary input and output units. These characteristics have opened up a wealth of new applications.
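The following short sketch (our own code, not from the book) applies the LMS/delta rule to a linear unit; the sample data, learning rate, and the treatment of the bias as a weight on a constant input are our own assumptions.

    # Sketch of LMS (delta-rule) training of a single linear unit by gradient descent.
    GAMMA = 0.05  # learning rate

    def predict(w, theta, x):
        """Linear unit: y = sum_j w_j x_j + theta."""
        return sum(wj * xj for wj, xj in zip(w, x)) + theta

    def lms_epoch(w, theta, samples, gamma=GAMMA):
        """One pass over the samples, adjusting each weight by gamma * (d - y) * x_j."""
        for x, d in samples:
            delta = d - predict(w, theta, x)
            w = [wj + gamma * delta * xj for wj, xj in zip(w, x)]
            theta += gamma * delta           # bias treated as a weight on a constant input 1
        return w, theta

    # Fit y = 2*x1 - x2 + 0.5 from a few noiseless samples.
    samples = [((1.0, 1.0), 1.5), ((2.0, 0.0), 4.5), ((0.0, 2.0), -1.5), ((1.0, -1.0), 3.5)]
    w, theta = [0.0, 0.0], 0.0
    for _ in range(200):
        w, theta = lms_epoch(w, theta, samples)
    print([round(v, 2) for v in w], round(theta, 2))   # approaches [2.0, -1.0] and 0.5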
Exclusive-OR problem

In the previous sections we have discussed two learning algorithms for single layer networks, but we have not discussed the limitations on the representation of these networks.

    x₁    x₂    d
    -1    -1    -1
    -1     1     1
     1    -1     1
     1     1    -1

Table: Exclusive-or truth table.

One of Minsky and Papert's most discouraging results shows that a single layer perceptron cannot represent a simple exclusive-or function. The table above shows the desired relationships between inputs and output units for this function.

In a simple network with two inputs and one output, as depicted earlier, the net input is equal to

    $s = w_1 x_1 + w_2 x_2 + \theta.$

According to the threshold function, the output of the perceptron is zero when $s$ is negative and equal to one when $s$ is positive. In the figure below a geometrical representation of the input domain is given. For a constant $\theta$, the output of the perceptron is equal to one on one side of the dividing line which is defined by

    $w_1 x_1 + w_2 x_2 = -\theta,$

and equal to zero on the other side of this line.

[Figure: Geometric representation of input space for the AND, OR, and XOR functions. The four input points (-1,-1), (-1,1), (1,-1), (1,1) are shown; for AND and OR a separating line exists, for XOR it does not.]
To see that such a solution cannot be found, take a look at the figure above. The input space consists of four points, and the two points of one class, at (-1,-1) and (1,1), cannot be separated by a straight line from the two points of the other class, at (1,-1) and (-1,1). The obvious question to ask is: how can this problem be overcome? Minsky and Papert prove in their book that for binary inputs, any transformation can be carried out by adding a layer of predicates which are connected to all inputs. The proof is given in the next section.

For the specific XOR problem we geometrically show that by introducing hidden units, thereby extending the network to a multi-layer perceptron, the problem can be solved. The figure below demonstrates that the four input points are now embedded in a three-dimensional space defined by the two inputs plus the single hidden unit. These four points are now easily separated by a linear manifold (plane) into two groups, as desired. This simple example demonstrates that adding hidden units increases the class of problems that are soluble by feed-forward, perceptron-like networks. However, by this generalisation of the basic architecture we have also incurred a serious loss: we no longer have a learning rule to determine the optimal weights!

[Figure: Solution of the XOR problem. (a) The perceptron of the earlier figure with an extra hidden unit. With the indicated values of the weights $w_{ij}$ (next to the connecting lines) and the thresholds $\theta_i$ (in the circles), this perceptron solves the XOR problem. (b) This is accomplished by mapping the four points of the earlier figure onto the four points indicated here; clearly, separation (by a linear manifold) into the required groups is now possible.]
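The weight values visible in the figure fragments (input weights 1, thresholds -0.5, hidden-to-output weight -2) suggest one consistent assignment; the short check below (our own code, and the weights should be treated as an assumption) verifies that it reproduces the XOR table.

    # Sketch: checking a 2-input perceptron with one extra hidden unit on XOR.
    def sgn(s):
        return 1 if s > 0 else -1

    def xor_net(x1, x2):
        h = sgn(1 * x1 + 1 * x2 - 0.5)              # hidden unit
        return sgn(1 * x1 + 1 * x2 - 2 * h - 0.5)   # output sees both inputs and h

    for x1 in (-1, 1):
        for x2 in (-1, 1):
            print((x1, x2), xor_net(x1, x2))        # +1 exactly when the inputs differ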
Multi-layer perceptrons can do everything

In the previous section we showed that by adding an extra hidden unit, the XOR problem can be solved. For binary units, one can prove that this architecture is able to perform any transformation given the correct connections and weights. The most primitive proof is the next one. For a given transformation $y = d(x)$, we can divide the set of all possible input vectors into two classes:

    $X^+ = \{\, x \mid d(x) = 1 \,\} \quad \text{and} \quad X^- = \{\, x \mid d(x) = -1 \,\}.$

Since there are $N$ input units, the total number of possible input vectors $x$ is $2^N$. For every $x^p \in X^+$ a hidden unit $h$ can be reserved of which the activation $y_h$ is $1$ if and only if the specific pattern $p$ is present at the input: we can choose its weights $w_{ih}$ equal to the specific pattern $x^p$ and the bias $\theta_h$ equal to $1 - N$, such that

    $y_h^p = \operatorname{sgn}\Bigl(\sum_i w_{ih}\, x_i^p + 1 - N\Bigr)$

is equal to $1$ for $x^p = w_h$ only. Similarly, the weights to the output neuron can be chosen such that the output is one as soon as one of the $M$ predicate neurons is one:

    $y_o^p = \operatorname{sgn}\Bigl(\sum_{h=1}^{M} y_h^p + M - 1\Bigr).$

This perceptron will give $y_o = 1$ only if $x \in X^+$: it performs the desired mapping. The problem is the large number of predicate units, which is equal to the number of patterns in $X^+$, which is maximally $2^N$. Of course we can do the same trick for $X^-$, and we will always take the minimal number of mask units, which is maximally $2^{N-1}$. A more elegant proof is given in (Minsky & Papert), but the point is that for complex transformations the number of required units in the hidden layer is exponential in $N$.
Conclusions

In this chapter we presented single layer feedforward networks for classification tasks and for function approximation tasks. The representational power of single layer feedforward networks was discussed and two learning algorithms for finding the optimal weights were presented. The simple networks presented here have their advantages and disadvantages. The disadvantage is the limited representational power: only linear classifiers can be constructed or, in case of function approximation, only linear functions can be represented. The advantage, however, is that because of the linearity of the system, the training algorithm will converge to the optimal solution. This is not the case anymore for nonlinear systems such as multiple layer networks, as we will see in the next chapter.
Back-propagation

As we have seen in the previous chapter, a single-layer network has severe restrictions: the class of tasks that can be accomplished is very limited. In this chapter we will focus on feed-forward networks with layers of processing units.

Minsky and Papert (Minsky & Papert) showed in 1969 that a two layer feed-forward network can overcome many restrictions, but did not present a solution to the problem of how to adjust the weights from input to hidden units. An answer to this question was presented by Rumelhart, Hinton and Williams (Rumelhart, Hinton, & Williams), and similar solutions appeared to have been published earlier (Werbos; Parker; Le Cun).

The central idea behind this solution is that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. For this reason the method is often called the back-propagation learning rule. Back-propagation can also be considered as a generalisation of the delta rule for non-linear activation functions¹ and multi-layer networks.

¹ Of course, when linear activation functions are used, a multi-layer network is not more powerful than a single-layer network.
Multi-layer feed-forward networks

A feed-forward network has a layered structure. Each layer consists of units which receive their input from units from a layer directly below and send their output to units in a layer directly above the unit. There are no connections within a layer. The $N_i$ inputs are fed into the first layer of hidden units. The input units are merely 'fan-out' units; no processing takes place in these units. The activation of a hidden unit is a function $F_i$ of the weighted inputs plus a bias, as given in the earlier equation. The output of the hidden units is distributed over the next layer of hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of $N_o$ output units (see the figure below).

[Figure: A multi-layer network with l layers of units: $N_i$ inputs, several layers of $N_h$ hidden units, and $N_o$ outputs.]

Although back-propagation can be applied to networks with any number of layers, just as for networks with binary units (previous chapter) it has been shown (Hornik, Stinchcombe, & White; Funahashi; Cybenko; Hartman, Keeler, & Kowalski) that only one layer of hidden units suffices to approximate any function with finitely many discontinuities to arbitrary precision, provided the activation functions of the hidden units are non-linear (the universal approximation theorem). In most applications a feed-forward network with a single layer of hidden units is used, with a sigmoid activation function for the units.
The generalised delta rule

Since we are now using units with nonlinear activation functions, we have to generalise the delta rule, which was presented in the previous chapter for linear functions, to the set of non-linear activation functions. The activation is a differentiable function of the total input, given by

    $y_k^p = F\bigl(s_k^p\bigr),$

in which

    $s_k^p = \sum_j w_{jk}\, y_j^p + \theta_k.$

To get the correct generalisation of the delta rule as presented in the previous chapter, we must set

    $\Delta_p w_{jk} = -\gamma\, \frac{\partial E^p}{\partial w_{jk}}.$

The error measure $E^p$ is defined as the total quadratic error for pattern $p$ at the output units:

    $E^p = \frac{1}{2} \sum_{o=1}^{N_o} \bigl(d_o^p - y_o^p\bigr)^2,$

where $d_o^p$ is the desired output for unit $o$ when pattern $p$ is clamped. We further set $E = \sum_p E^p$ as the summed squared error. We can write

    $\frac{\partial E^p}{\partial w_{jk}} = \frac{\partial E^p}{\partial s_k^p}\, \frac{\partial s_k^p}{\partial w_{jk}}.$

By the equation for $s_k^p$ we see that the second factor is

    $\frac{\partial s_k^p}{\partial w_{jk}} = y_j^p.$

When we define

    $\delta_k^p = -\frac{\partial E^p}{\partial s_k^p},$

we will get an update rule which is equivalent to the delta rule as described in the previous chapter, resulting in a gradient descent on the error surface if we make the weight changes according to

    $\Delta_p w_{jk} = \gamma\, \delta_k^p\, y_j^p.$

The trick is to figure out what $\delta_k^p$ should be for each unit $k$ in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these $\delta$'s which can be implemented by propagating error signals backward through the network.
To compute $\delta_k^p$ we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit and one reflecting the change in the output as a function of changes in the input. Thus, we have

    $\delta_k^p = -\frac{\partial E^p}{\partial s_k^p} = -\frac{\partial E^p}{\partial y_k^p}\, \frac{\partial y_k^p}{\partial s_k^p}.$

Let us compute the second factor. We see that

    $\frac{\partial y_k^p}{\partial s_k^p} = F'\bigl(s_k^p\bigr),$

which is simply the derivative of the squashing function $F$ for the $k$th unit, evaluated at the net input $s_k^p$ to that unit. To compute the first factor, we consider two cases. First, assume that unit $k$ is an output unit $k = o$ of the network. In this case, it follows from the definition of $E^p$ that

    $\frac{\partial E^p}{\partial y_o^p} = -\bigl(d_o^p - y_o^p\bigr),$

which is the same result as we obtained with the standard delta rule. Substituting this and the previous equation in the definition of $\delta$, we get

    $\delta_o^p = \bigl(d_o^p - y_o^p\bigr)\, F'\bigl(s_o^p\bigr)$

for any output unit $o$. Secondly, if $k$ is not an output unit but a hidden unit $k = h$, we do not readily know the contribution of the unit to the output error of the network. However, the error measure can be written as a function of the net inputs from the hidden to the output layer, $E^p = E^p(s_1^p, s_2^p, \ldots, s_j^p, \ldots)$, and we use the chain rule to write

    $\frac{\partial E^p}{\partial y_h^p} = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, \frac{\partial s_o^p}{\partial y_h^p} = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, \frac{\partial}{\partial y_h^p} \sum_{j=1}^{N_h} w_{jo}\, y_j^p = \sum_{o=1}^{N_o} \frac{\partial E^p}{\partial s_o^p}\, w_{ho} = -\sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$

Substituting this in the definition of $\delta$ yields

    $\delta_h^p = F'\bigl(s_h^p\bigr) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$

The two equations above give a recursive procedure for computing the $\delta$'s for all units in the network, which are then used to compute the weight changes according to the update rule. This procedure constitutes the generalised delta rule for a feed-forward network of non-linear units.
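The recursion can be made concrete with a small sketch (our own code, not from the book): a forward pass through one layer of sigmoid hidden units, output deltas $(d_o - y_o)F'(s_o)$, hidden deltas obtained by back-propagating the output deltas, and weight changes $\gamma\,\delta_k\,y_j$. Layer sizes, the learning rate, and the XOR training set are assumptions.

    # Sketch of the generalised delta rule for a 2-layer network of sigmoid units.
    import math, random

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    class SmallNet:
        def __init__(self, n_i, n_h, n_o):
            r = lambda: random.uniform(-0.5, 0.5)
            self.w_ih = [[r() for _ in range(n_h)] for _ in range(n_i)]
            self.b_h = [r() for _ in range(n_h)]
            self.w_ho = [[r() for _ in range(n_o)] for _ in range(n_h)]
            self.b_o = [r() for _ in range(n_o)]

        def forward(self, x):
            h = [sigmoid(sum(self.w_ih[i][j] * x[i] for i in range(len(x))) + self.b_h[j])
                 for j in range(len(self.b_h))]
            y = [sigmoid(sum(self.w_ho[j][o] * h[j] for j in range(len(h))) + self.b_o[o])
                 for o in range(len(self.b_o))]
            return h, y

        def train_step(self, x, d, gamma=0.5):
            h, y = self.forward(x)
            # Output deltas: (d_o - y_o) F'(s_o), with F'(s) = y(1 - y) for the sigmoid.
            d_o = [(d[o] - y[o]) * y[o] * (1 - y[o]) for o in range(len(y))]
            # Hidden deltas: F'(s_h) * sum_o delta_o * w_ho.
            d_h = [h[j] * (1 - h[j]) * sum(d_o[o] * self.w_ho[j][o] for o in range(len(y)))
                   for j in range(len(h))]
            # Weight changes: delta_w_jk = gamma * delta_k * y_j (biases as weights from 1).
            for j in range(len(h)):
                for o in range(len(y)):
                    self.w_ho[j][o] += gamma * d_o[o] * h[j]
            for o in range(len(y)):
                self.b_o[o] += gamma * d_o[o]
            for i in range(len(x)):
                for j in range(len(h)):
                    self.w_ih[i][j] += gamma * d_h[j] * x[i]
            for j in range(len(h)):
                self.b_h[j] += gamma * d_h[j]
            return sum((d[o] - y[o]) ** 2 for o in range(len(y)))

    random.seed(1)
    net = SmallNet(2, 3, 1)
    patterns = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]  # XOR
    for epoch in range(5000):
        err = sum(net.train_step(x, d) for x, d in patterns)
    # Summed squared error after training; typically close to 0, although back-propagation
    # can get stuck in a local minimum (see the deficiencies section later in this chapter).
    print(round(err, 4))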
Understanding back-propagation

The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations?

The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear. What happens in the above equations is the following. When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units. Let's call this error $e_o$ for a particular output unit $o$. We have to bring $e_o$ to zero.

The simplest method to do this is the greedy method: we strive to change the connections in the neural network in such a way that, next time around, the error $e_o$ will be zero for this particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt its incoming weights according to

    $\Delta w_{ho} = (d_o - y_o)\, y_h.$

That's step one. But it alone is not enough: when we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem. In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a value for $\delta$ for the hidden units. This is solved by the chain rule, which does the following: distribute the error of an output unit $o$ to all the hidden units that it is connected to, weighted by this connection. Differently put, a hidden unit $h$ receives a delta from each output unit $o$ equal to the delta of that output unit weighted with (= multiplied by) the weight of the connection between those units. In symbols: $\delta_h = \sum_o \delta_o w_{ho}$. Well, not exactly: we forgot the activation function of the hidden unit; $F'$ has to be applied to the delta before the back-propagation process can continue.
Working with back-propagation

The application of the generalised delta rule thus involves two phases. During the first phase the input $x$ is presented and propagated forward through the network to compute the output values $y_o^p$ for each output unit. This output is compared with its desired value $d_o$, resulting in an error signal $\delta_o^p$ for each output unit. The second phase involves a backward pass through the network, during which the error signal is passed to each unit in the network and appropriate weight changes are calculated.

Weight adjustments with sigmoid activation function. The results from the previous section can be summarised in three equations:

- The weight of a connection is adjusted by an amount proportional to the product of an error signal $\delta$ on the unit $k$ receiving the input and the output of the unit $j$ sending this signal along the connection:

    $\Delta_p w_{jk} = \gamma\, \delta_k^p\, y_j^p.$

- If the unit is an output unit, the error signal is given by

    $\delta_o^p = \bigl(d_o^p - y_o^p\bigr)\, F'\bigl(s_o^p\bigr).$

  Take as the activation function $F$ the 'sigmoid' function as defined in the previous chapter:

    $y^p = F\bigl(s^p\bigr) = \frac{1}{1 + e^{-s^p}}.$

  In this case the derivative is

    $F'\bigl(s^p\bigr) = \frac{\partial}{\partial s^p}\, \frac{1}{1 + e^{-s^p}} = \frac{e^{-s^p}}{\bigl(1 + e^{-s^p}\bigr)^2} = \frac{1}{1 + e^{-s^p}}\, \frac{e^{-s^p}}{1 + e^{-s^p}} = y^p \bigl(1 - y^p\bigr),$
  such that the error signal for an output unit can be written as

    $\delta_o^p = \bigl(d_o^p - y_o^p\bigr)\, y_o^p\, \bigl(1 - y_o^p\bigr).$

- The error signal for a hidden unit is determined recursively in terms of error signals of the units to which it directly connects and the weights of those connections. For the sigmoid activation function:

    $\delta_h^p = F'\bigl(s_h^p\bigr) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho} = y_h^p \bigl(1 - y_h^p\bigr) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho}.$
Learning rate and momentum. The learning procedure requires that the change in weight is proportional to $\partial E^p / \partial w$. True gradient descent requires that infinitesimal steps are taken. The constant of proportionality is the learning rate $\gamma$. For practical purposes we choose a learning rate that is as large as possible without leading to oscillation. One way to avoid oscillation at large $\gamma$ is to make the change in weight dependent of the past weight change by adding a momentum term:

    $\Delta w_{jk}(t+1) = \gamma\, \delta_k^p\, y_j^p + \alpha\, \Delta w_{jk}(t),$

where $t$ indexes the presentation number and $\alpha$ is a constant which determines the effect of the previous weight change.

The role of the momentum term is shown in the figure below. When no momentum term is used, it takes a long time before the minimum has been reached with a low learning rate, whereas for high learning rates the minimum is never reached because of the oscillations. When adding the momentum term, the minimum will be reached faster.

[Figure: The descent in weight space: (a) for small learning rate; (b) for large learning rate (note the oscillations); (c) with large learning rate and momentum term added.]
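The momentum update is only one line of code; the short sketch below (our own, with assumed values for $\gamma$ and $\alpha$) shows how remembering the previous weight change accelerates steps taken in a consistent direction.

    # Sketch of the momentum update: new change = gradient term + alpha * previous change.
    def momentum_update(w, prev_dw, delta_k, y_j, gamma=0.1, alpha=0.9):
        """Returns the new weight and the weight change to remember for the next step."""
        dw = gamma * delta_k * y_j + alpha * prev_dw
        return w + dw, dw

    w, prev_dw = 0.0, 0.0
    for step in range(5):
        # A constant 'gradient direction' (delta_k * y_j = 1) keeps growing the step size.
        w, prev_dw = momentum_update(w, prev_dw, delta_k=1.0, y_j=1.0)
        print(step, round(w, 3), round(prev_dw, 3))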
Learning per pattern. Although, theoretically, the back-propagation algorithm performs gradient descent on the total error only if the weights are adjusted after the full set of learning patterns has been presented, more often than not the learning rule is applied to each pattern separately, i.e., a pattern $p$ is applied, $E^p$ is calculated, and the weights are adapted ($p = 1, 2, \ldots, P$). There exists empirical indication that this results in faster convergence. Care has to be taken, however, with the order in which the patterns are taught. For example, when using the same sequence over and over again the network may become focused on the first few patterns. This problem can be overcome by using a permuted training method.
example
A feedorw ard net w ork can be used appro ximate a function from examples Supp ose w e
ha v e a system or example a c hemical pro cess or a ancial mark et of whic hw ew an t to kno w
to
An
ecome net the
the whic tak
has Care this
eigh
ted
eigh
bac
rate rning rning small
reac een has um
When sho is term tum ole The
hic

tum
eigh past eigh the
to
ose or of

ho
CHAPTER BA CKR OP A
c haracteristics The input of system is giv b y t w oimensional v ector x the
en b y he t oneimensional v ector d W ew an t t o estimate the r elationship d f x
p p
examples f x g as op left A feedorw ard w ork w as
1 1
0 0
−1 −1
1 1
1 1
0 0
0 0
−1 −1 −1 −1
1 1
0 0
−1 −1
1 1
1 1
0 0
0 0
−1 −1
−1 −1
Figure: Example of function approximation with a feedforward network. Top left: the original learning samples; Top right: the approximation with the network; Bottom left: the function which generated the learning samples; Bottom right: the error in the approximation.
A feed-forward network was programmed with two inputs, a layer of hidden units with a sigmoid activation function, and an output unit with a linear activation function. (Check for yourself how the equation for the output error signal should be adapted for a linear instead of a sigmoid activation function.) The network weights are initialized to small values and the network is trained with the back-propagation rule described in the previous section. The relationship between $\mathbf{x}$ and $d$ as represented by the network is shown in the figure (top right), while the function which generated the learning samples is given in the figure (bottom left). The approximation error is depicted in the figure (bottom right). We see that the error is higher at the edges of the region within which the learning samples were generated: the network is considerably better at interpolation than at extrapolation.
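The experiment can be sketched roughly as follows (an illustration only: the target function, network size, learning rate and iteration count are assumptions, and plain per-pattern back-propagation is used). Note how the linear output unit answers the question posed above: the output error signal loses its $y(1-y)$ factor.

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    # Made-up target function and learning samples (x is two-dimensional, d scalar).
    f = lambda x: np.sin(x[:, 0]) * np.cos(x[:, 1])
    X = rng.uniform(-1, 1, size=(80, 2))
    D = f(X)

    H, gamma = 10, 0.1                      # assumed hidden units and learning rate
    W1 = rng.normal(0, 0.1, (H, 2))         # input  -> hidden weights
    W2 = rng.normal(0, 0.1, (1, H))         # hidden -> output weights (linear output)

    for it in range(5000):                  # assumed number of iterations
        p = rng.integers(len(X))
        yh = sigmoid(W1 @ X[p])
        yo = W2 @ yh                        # linear output unit: F(s) = s, F'(s) = 1,
        delta_o = D[p] - yo                 # so the output delta has no y(1-y) factor
        delta_h = yh * (1 - yh) * (W2.T @ delta_o)
        W2 += gamma * np.outer(delta_o, yh)
        W1 += gamma * np.outer(delta_h, X[p])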
Other activation functions

Although sigmoid functions are quite often used as activation functions, other functions can be used as well. In some cases this leads to a formula which is known from traditional function approximation theories.

For example, from Fourier analysis it is known that any periodic function can be written as an infinite sum of sine and cosine terms (a Fourier series):

$$f(x) = \sum_{n=0}^{\infty} \left( a_n \cos nx + b_n \sin nx \right).$$
We can rewrite this as a summation of sine terms,

$$f(x) = a_0 + \sum_{n=1}^{\infty} c_n \sin(nx + \theta_n),$$

with $c_n = \sqrt{a_n^2 + b_n^2}$ and $\theta_n = \arctan(a_n / b_n)$. This can be seen as a feed-forward network with a single input unit for $x$, a single output unit for $f(x)$, and hidden units with an activation function $F = \sin(s)$. The factor $a_0$ corresponds with the bias of the output unit, the factors $c_n$ correspond with the weights from the hidden units to the output unit, the factor $\theta_n$ corresponds with the bias term of the hidden units, and the factor $n$ corresponds with the weights between the input and the hidden layer.
The basic difference between the Fourier approach and the back-propagation approach is that in the Fourier approach the 'weights' between the input and the hidden units (the factors $n$) are fixed integer numbers which are analytically determined, whereas in the back-propagation approach these weights can take any value and are typically learned using a learning heuristic.
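A minimal sketch (with an invented periodic target and invented training settings) of a network with sine activation functions in the hidden layer, trained with the same generalised delta rule; only the derivative of the activation changes, from $y(1-y)$ to $\cos(s)$:

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up one-dimensional periodic target and ten learning samples.
    target = lambda x: np.sin(2 * x) * np.sin(x)
    X = rng.uniform(-4, 4, size=10)
    D = target(X)

    H, gamma = 4, 0.05                      # four sine hidden units, assumed rate
    w_in = rng.normal(0, 1.0, H)            # input -> hidden weights ("frequencies")
    bias = rng.normal(0, 1.0, H)            # hidden biases ("phases")
    w_out = rng.normal(0, 0.1, H)           # hidden -> output weights ("amplitudes")
    a0 = 0.0                                # output bias

    for it in range(20000):
        p = rng.integers(len(X))
        s = w_in * X[p] + bias
        yh = np.sin(s)                      # F(s) = sin(s)
        yo = w_out @ yh + a0                # linear output unit
        delta_o = D[p] - yo
        delta_h = np.cos(s) * (w_out * delta_o)   # F'(s) = cos(s)
        w_out += gamma * delta_o * yh
        a0 += gamma * delta_o
        w_in += gamma * delta_h * X[p]
        bias += gamma * delta_h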
To illustrate the use of other activation functions, we have trained a feed-forward network with one output unit, four hidden units, and one input, on ten patterns drawn from a periodic function (a product of sine terms). The result is depicted in the figure below. The same function (albeit with other learning points) is learned with a network with eight sigmoid hidden units (see the next figure). From the figures it is clear that it pays off to use as much knowledge of the problem at hand as possible.
Figure: The periodic function $f(x)$, a product of sine terms, approximated with sine activation functions (adapted from Dastani).
Deficiencies of back-propagation

Despite the apparent success of the back-propagation learning algorithm, there are some aspects which make the algorithm not guaranteed to be universally useful. Most troublesome is the long training process. This can be a result of a non-optimum learning rate and momentum; many advanced algorithms based on back-propagation learning have some optimised method to adapt the learning rate, as will be discussed in the next section. Outright training failures generally arise from two sources: network paralysis and local minima.
Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden unit or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one.
Figure: The same periodic function $f(x)$ approximated with sigmoid activation functions (adapted from Dastani).
As is clear from the expressions for the error signals given above, the weight adjustments, which are proportional to $y_k^p\,(1 - y_k^p)$, will then be close to zero, and the training process can come to a virtual standstill.
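A small numerical check (illustrative only) of why this paralyses training: the factor $y(1-y)$ in the weight adjustments vanishes as soon as the sigmoid saturates.

    import numpy as np

    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    for s in [0.0, 2.0, 5.0, 10.0, 20.0]:
        y = sigmoid(s)
        print(f"s = {s:5.1f}   y = {y:.6f}   y*(1-y) = {y * (1 - y):.2e}")
    # Around s = 0 the factor is about 0.25; for s = 20 it is roughly 2e-9,
    # so the weight adjustments practically vanish and training stalls.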
Local minima. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby. Probabilistic methods can help to avoid this trap, but they tend to be slow. Another suggested possibility is to increase the number of hidden units. Although this will work, because the higher dimensionality of the error space makes the chance of getting trapped smaller, it appears that there is some upper limit on the number of hidden units which, when exceeded, again results in the system being trapped in local minima.
Advanced algorithms

Many researchers have devised improvements of and extensions to the basic back-propagation algorithm described above. It is too early for a full evaluation: some of these techniques may prove to be fundamental, others may simply fade away. A few methods are discussed in this section.

Maybe the most obvious improvement is to replace the rather primitive steepest descent method with a direction set minimisation method, e.g., conjugate gradient minimisation. Note that minimisation along a direction $\mathbf{u}$ brings the function $f$ to a place where its gradient is perpendicular to $\mathbf{u}$ (otherwise minimisation along $\mathbf{u}$ is not complete). Instead of following the gradient at every step, a set of $n$ directions is constructed which are all conjugate to each other, such that minimisation along one of these directions $\mathbf{u}_j$ does not spoil the minimisation along one of the earlier directions $\mathbf{u}_i$, i.e., the directions are non-interfering. Thus one minimisation in the direction of $\mathbf{u}_i$ suffices, such that $n$ minimisations in a system with $n$ degrees of freedom bring this system to a minimum (provided the system is quadratic). This is different from gradient descent, which directly minimises in the direction of the steepest descent (Press, Flannery, Teukolsky, & Vetterling).
Suppose the function to be minimised is approximated by its Taylor series

$$f(\mathbf{x}) = f(\mathbf{p}) + \sum_i \left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{p}} x_i
  + \frac{1}{2} \sum_{i,j} \left.\frac{\partial^2 f}{\partial x_i\, \partial x_j}\right|_{\mathbf{p}} x_i x_j + \cdots
  \approx \tfrac{1}{2}\, \mathbf{x}^T A\, \mathbf{x} - \mathbf{b}^T \mathbf{x} + c,$$
where $T$ denotes transpose, and

$$c \equiv f(\mathbf{p}), \qquad
  \mathbf{b} \equiv -\nabla f\big|_{\mathbf{p}}, \qquad
  [A]_{ij} \equiv \left.\frac{\partial^2 f}{\partial x_i\, \partial x_j}\right|_{\mathbf{p}}.$$

$A$ is a symmetric positive definite $n \times n$ matrix: the Hessian of $f$ at $\mathbf{p}$. The gradient of $f$ is

$$\nabla f = A\mathbf{x} - \mathbf{b},$$

such that a change of $\mathbf{x}$ results in a change of the gradient as

$$\delta(\nabla f) = A\,(\delta \mathbf{x}).$$

Now suppose $f$ was minimised along a direction $\mathbf{u}_i$ to a point where the gradient $-\mathbf{g}_{i+1}$ of $f$ is perpendicular to $\mathbf{u}_i$, i.e.,

$$\mathbf{u}_i^T\, \mathbf{g}_{i+1} = 0,$$

and a new direction $\mathbf{u}_{i+1}$ is sought. In order to make sure that moving along $\mathbf{u}_{i+1}$ does not spoil the minimisation along $\mathbf{u}_i$, we require that the gradient of $f$ remain perpendicular to $\mathbf{u}_i$, i.e.,

$$\mathbf{u}_i^T\, \mathbf{g}_{i+2} = 0,$$

otherwise we would once more have to minimise in a direction which has a component of $\mathbf{u}_i$. Combining these two conditions we get

$$0 = \mathbf{u}_i^T\, (\mathbf{g}_{i+1} - \mathbf{g}_{i+2}) = \mathbf{u}_i^T\, \delta(\nabla f) = \mathbf{u}_i^T A\, \mathbf{u}_{i+1}.$$

When this relation holds for two vectors $\mathbf{u}_i$ and $\mathbf{u}_{i+1}$ they are said to be conjugate.
Now, starting at some point $\mathbf{p}_0$, the first minimisation direction $\mathbf{u}_0$ is taken equal to $\mathbf{g}_0 = -\nabla f(\mathbf{p}_0)$, resulting in a new point $\mathbf{p}_1$. For $i \ge 0$, calculate the directions

$$\mathbf{u}_{i+1} = \mathbf{g}_{i+1} + \gamma_i\, \mathbf{u}_i,$$

where $\gamma_i$ is chosen to make $\mathbf{u}_i^T A\, \mathbf{u}_{i+1} = 0$ and the successive gradients perpendicular, i.e.,

$$\gamma_i = \frac{\mathbf{g}_{i+1}^T\, \mathbf{g}_{i+1}}{\mathbf{g}_i^T\, \mathbf{g}_i},
  \qquad \text{with } \mathbf{g}_k = -\nabla f\big|_{\mathbf{p}_k} \text{ for all } k \ge 0.$$

Next, calculate $\mathbf{p}_{i+2} = \mathbf{p}_{i+1} + \lambda_{i+1}\, \mathbf{u}_{i+1}$, where $\lambda_{i+1}$ is chosen so as to minimise $f(\mathbf{p}_{i+2})$.

It can be shown that the directions $\mathbf{u}$ thus constructed are all mutually conjugate (see, e.g., Stoer & Bulirsch). The process described above is known as the Fletcher-Reeves method, but there are many variants which work more or less the same (Hestenes & Stiefel; Polak; Powell).
Although only $n$ iterations are needed for a quadratic system with $n$ degrees of freedom, due to the fact that we are not minimising quadratic systems, as well as a result of round-off errors, the $n$ directions have to be followed several times (see the figure below). Powell introduced some improvements to correct for this behaviour in non-quadratic systems. The resulting cost is $O(n)$, which is significantly better than the linear convergence of steepest descent.
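The Fletcher-Reeves recipe can be written down compactly. The sketch below is not from the original text: it assumes SciPy's minimize_scalar is an acceptable stand-in for a dedicated line-minimisation routine, and the quadratic test function is invented.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def fletcher_reeves(f, grad, p0, n_iter=50):
        """Conjugate gradient minimisation (Fletcher-Reeves flavour)."""
        p = np.asarray(p0, dtype=float)
        g = -grad(p)                      # g_0 = -grad f(p_0)
        u = g.copy()                      # first search direction u_0 = g_0
        for _ in range(n_iter):
            # Line minimisation of f along u.
            lam = minimize_scalar(lambda t: f(p + t * u)).x
            p = p + lam * u
            g_new = -grad(p)
            if g_new @ g_new < 1e-12:     # gradient (almost) zero: done
                break
            gamma = (g_new @ g_new) / (g @ g)   # gamma_i = g_{i+1}^T g_{i+1} / g_i^T g_i
            u = g_new + gamma * u               # u_{i+1} = g_{i+1} + gamma_i u_i
            g = g_new
        return p

    # Example: a quadratic f(x) = 1/2 x^T A x - b^T x, minimised near A^{-1} b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad = lambda x: A @ x - b
    print(fletcher_reeves(f, grad, [5.0, 5.0], n_iter=5))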

Footnote: A matrix $A$ is called positive definite if for every $\mathbf{y} \neq \mathbf{0}$, $\mathbf{y}^T A\, \mathbf{y} > 0$.

Footnote: The line minimisation itself is not a trivial problem (see Press et al.); however, line minimisation methods with super-linear convergence exist.

Footnote: A method is said to converge linearly if $E_{i+1} = c\,E_i$ with $c < 1$. Methods which converge with a higher power, i.e., $E_{i+1} = c\,(E_i)^m$ with $m > 1$, are called super-linear.
Figure: Slow decrease with conjugate gradient in non-quadratic systems. The hills on the left are very steep, resulting in a large search vector $\mathbf{u}_i$. When the quadratic portion is entered the new search direction is constructed from the previous direction and the gradient, resulting in a spiraling minimisation. This problem can be overcome by detecting such spiraling minimisations and restarting the algorithm with $\mathbf{u}_0 = -\nabla f$.
Some improvements on back-propagation have been presented based on an independent adaptive learning rate parameter for each weight.

Van den Boomgaard and Smeulders (Boomgaard & Smeulders) show that for a feed-forward network without hidden units an incremental procedure to find the optimal weight matrix $W$ needs an adjustment of the weights with

$$\Delta W(t+1) = \gamma(t+1)\, \bigl(\mathbf{d}(t+1) - W(t)\, \mathbf{x}(t+1)\bigr)\, \mathbf{x}(t+1),$$

in which $\gamma$ is not a constant but a variable $N_i \times N_i$ matrix which depends on the input vector. By using a priori knowledge about the input signal, the storage requirements for $\gamma$ can be reduced.
Silva and Almeida (Silva & Almeida) also show the advantages of an independent step size for each weight in the network. In their algorithm the learning rate is adapted after every learning pattern:

$$\gamma_{jk}(t+1) = \begin{cases}
   u\, \gamma_{jk}(t) & \text{if } \dfrac{\partial E(t+1)}{\partial w_{jk}} \text{ and } \dfrac{\partial E(t)}{\partial w_{jk}} \text{ have the same signs;}\\[2mm]
   d\, \gamma_{jk}(t) & \text{if } \dfrac{\partial E(t+1)}{\partial w_{jk}} \text{ and } \dfrac{\partial E(t)}{\partial w_{jk}} \text{ have opposite signs,}
   \end{cases}$$

where $u$ and $d$ are positive constants with values slightly above and below unity, respectively. The idea is to decrease the learning rate in case of oscillations.
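A sign-based step-size adaptation in the spirit of this rule could be sketched as follows (the constants u and d, the array shapes and the gradient values are invented):

    import numpy as np

    u, d = 1.2, 0.8                     # assumed constants slightly above/below unity
    gammas = np.full((1, 3), 0.1)       # one learning rate per weight w_jk
    grad_prev = np.zeros_like(gammas)   # dE/dw_jk at the previous pattern
    weights = np.zeros((1, 3))

    def adapt_rates(grad, grad_prev, gammas):
        """Increase gamma_jk where the gradient keeps its sign, decrease otherwise."""
        same_sign = np.sign(grad) == np.sign(grad_prev)
        return np.where(same_sign, u * gammas, d * gammas)

    # One made-up step: the current gradient of E with respect to each weight.
    grad = np.array([[0.3, -0.1, 0.02]])
    gammas = adapt_rates(grad, grad_prev, gammas)
    weights -= gammas * grad            # per-weight gradient descent step
    grad_prev = grad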
How good are multi-layer feed-forward networks

From the example given earlier in this chapter it is clear that the approximation by the network is not perfect. The resulting approximation error is influenced by:

1. The learning algorithm and the number of iterations. This determines how well the error on the training set is minimised.
2. The number of learning samples. This determines how well the training samples represent the actual function.

3. The number of hidden units. This determines the 'expressive power' of the network. For 'smooth' functions only a few hidden units are needed; for wildly fluctuating functions more hidden units will be needed.

In the previous sections we discussed the learning rules, such as back-propagation and the other gradient based learning algorithms, and the problem of finding the minimum error. In this section we particularly address the effect of the number of learning samples and the effect of the number of hidden units.
We first have to define an adequate error measure. All neural network training algorithms try to minimise the error over the set of learning samples which are available for training the network. The average error per learning sample is defined as the learning error rate:

$$E_{\text{learning}} = \frac{1}{P_{\text{learning}}} \sum_{p=1}^{P_{\text{learning}}} E^p,$$

in which $E^p$ is the difference between the desired output value and the actual network output for the learning samples:

$$E^p = \frac{1}{2} \sum_{o=1}^{N_o} \left( d_o^p - y_o^p \right)^2 .$$
This is the error which is measurable during the training process.

It is obvious that the actual error of the network will differ from the error at the locations of the training samples. The difference between the desired output value and the actual network output should be integrated over the entire input domain to give a more realistic error measure. This integral can be estimated if we have a large set of samples: the test set. We now define the test error rate as the average error over the test set:

$$E_{\text{test}} = \frac{1}{P_{\text{test}}} \sum_{p=1}^{P_{\text{test}}} E^p.$$
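Both error rates can be computed directly from their definitions; in the sketch below the trained network net and the data sets are hypothetical stand-ins:

    import numpy as np

    def error_rate(net, X, D):
        """Average error per sample: (1/P) sum_p (1/2) sum_o (d_o^p - y_o^p)^2."""
        E = 0.0
        for x_p, d_p in zip(X, D):
            y_p = net(x_p)
            E += 0.5 * np.sum((d_p - y_p) ** 2)
        return E / len(X)

    # Hypothetical trained network and data sets.
    net = lambda x: np.tanh(x)           # stand-in for the trained feed-forward net
    rng = np.random.default_rng(3)
    X_learn, D_learn = rng.normal(size=(50, 1)), rng.normal(size=(50, 1))
    X_test, D_test = rng.normal(size=(200, 1)), rng.normal(size=(200, 1))

    E_learning = error_rate(net, X_learn, D_learn)
    E_test = error_rate(net, X_test, D_test)
    print(E_learning, E_test)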
In the following subsections we will see how these error measures depend on the learning set size and the number of hidden units.
The effect of the number of learning samples

A simple problem is used as an example: a function $y = f(x)$ has to be approximated with a feed-forward neural network. The network is created with one input, a number of hidden units with a sigmoid activation function, and a linear output unit. Suppose we have only a small number of learning samples and the network is trained with these samples. Training is stopped when the error does not decrease anymore. The original (desired) function is shown in the figure below (A) as a dashed line, together with the learning samples and the approximation by the network. We see that in this case $E_{\text{learning}}$ is small (the network output goes perfectly through the learning samples) but $E_{\text{test}}$ is large: the test error of the network is large. The approximation obtained from a larger set of learning samples is shown in (B); $E_{\text{learning}}$ is larger there, but $E_{\text{test}}$ is smaller.

This experiment was carried out with other learning set sizes, where for each learning set size the experiment was repeated a number of times. The average learning and test error rates as a function of the learning set size are given in the second figure below. Note that the learning error increases with an increasing learning set size, and the test error decreases with an increasing learning set size.
Figure: Effect of the learning set size on the generalization. The dashed line gives the desired function, the learning samples are depicted as circles, and the approximation by the network is shown by the drawn line. (A) a small learning set; (B) a larger learning set.
A low learning error on the (small) learning set is no guarantee for a good network performance! With an increasing number of learning samples the two error rates converge to the same value. This value depends on the representational power of the network: given the optimal weights, how good is the approximation? This error depends on the number of hidden units and the activation function. If the learning error rate does not converge to the test error rate, the learning procedure has not found a global minimum.
Figure: Effect of the learning set size on the error rate. The average learning error rate and the average test error rate as a function of the number of learning samples.
The effect of the number of hidden units

The same function as in the previous subsection is used, but now the number of hidden units is varied. The original (desired) function, the learning samples and the network approximation are shown in the figure below: (A) for a network with few hidden units and (B) for a network with many hidden units. The effect visible in case (B) is called overtraining. The network fits exactly on the learning samples, but because of the large number of hidden units the function which is actually represented by the network is far more wild than the original one. Particularly in the case of learning samples which contain a certain amount of noise (which all real-world data have), the network will 'fit the noise' of the learning samples instead of making a smooth approximation.
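The overtraining effect is easy to reproduce with any sufficiently flexible approximator. The sketch below is an illustration rather than the book's experiment: polynomials of increasing degree stand in for networks with increasing numbers of hidden units, fitted to noisy samples of an invented target function.

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(2 * np.pi * x)                 # made-up 'desired' function

    x_learn = rng.uniform(0, 1, 12)
    d_learn = f(x_learn) + rng.normal(0, 0.1, 12)       # noisy learning samples
    x_test = np.linspace(0, 1, 200)
    d_test = f(x_test)

    for degree in [1, 3, 5, 7, 9]:                      # stand-in for hidden units
        coeffs = np.polyfit(x_learn, d_learn, degree)
        e_learn = np.mean((np.polyval(coeffs, x_learn) - d_learn) ** 2)
        e_test = np.mean((np.polyval(coeffs, x_test) - d_test) ** 2)
        print(f"flexibility {degree:2d}: E_learning={e_learn:.4f}  E_test={e_test:.4f}")
    # Typically the training error keeps decreasing, while beyond some flexibility
    # the test error rises again: the model starts to fit the noise.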
Figure: Effect of the number of hidden units on the network performance. The dashed line gives the desired function, the circles denote the learning samples and the drawn line gives the approximation by the network. (A) a network with few hidden units; (B) a network with many hidden units.
This example shows that a large number of hidden units leads to a small error on the training set but does not necessarily lead to a small error on the test set. Adding hidden units will always lead to a reduction of $E_{\text{learning}}$. However, adding hidden units will first lead to a reduction of $E_{\text{test}}$, but then to an increase of $E_{\text{test}}$; this effect is called the peaking effect. The average learning and test error rates as a function of the number of hidden units are given in the figure below.
Figure: The average learning error rate and the average test error rate as a function of the number of hidden units.
Applications
Back-propagation has been applied to a wide variety of research applications. Sejnowski and Rosenberg (Sejnowski & Rosenberg) produced a spectacular success with NETtalk, a system that converts printed English text into highly intelligible speech.

A feed-forward network with one layer of hidden units has been described by Gorman and Sejnowski (Gorman & Sejnowski) as a classification machine for sonar signals.

Another application of a multi-layer feed-forward network with a back-propagation training algorithm is to learn an unknown function between input and output signals from the
presentation of examples. It is hoped that the network is able to generalise correctly, so that input values which are not presented as learning patterns will result in correct output values. An example is the work of Josin (Josin), who used a two-layer feed-forward network with back-propagation learning to perform the inverse kinematic transform which is needed by a robot arm controller (see the chapter on robot control).
Recurrent Networks
The learning algorithms discussed in the previous chapter were applied to feed-forward networks: all data flows in a network in which no cycles are present.

But what happens when we introduce a cycle? For instance, we can connect a hidden unit with itself over a weighted connection, connect hidden units to input units, or even connect all units with each other. Although, as we know from the previous chapter, the approximational capabilities of such networks do not increase, we may obtain a decreased complexity or network size to solve the same problem.
An important question we have to consider is the following: what do we want to learn in a recurrent network? After all, when one is considering a recurrent network, it is possible to continue propagating activation values ad infinitum, or until a stable point (attractor) is reached. As we will see in the sequel, there exist recurrent networks which are attractor based, i.e., the activation values in the network are repeatedly updated until a stable point is reached, after which the weights are adapted; but there are also recurrent networks where the learning rule is used after each propagation (where an activation value is traversed over each weight only once), while external inputs are included in each propagation. In such networks the recurrent connections can be regarded as extra inputs to the network, the values of which are computed by the network itself.
In this chapter, recurrent extensions to the feed-forward network introduced in the previous chapters will be discussed, yet not exhaustively: the theory of the dynamics of recurrent networks extends beyond the scope of a one-semester course on neural networks. Yet the basics of these networks will be discussed.

Subsequently some special recurrent networks will be discussed: the Hopfield network, which can be used for the representation of binary patterns; subsequently we touch upon Boltzmann machines, therewith introducing stochasticity in neural computation.
The generalised delta-rule in recurrent networks
The back-propagation learning rule, introduced in the previous chapter, can be easily used for training patterns in recurrent networks. Before we consider this general case, however, we first describe networks where some of the hidden unit activation values are fed back to an extra set of input units (the Elman network), or where output values are fed back into hidden units (the Jordan network).
A typical application of such a network is the following. Suppose we have to construct a network that must generate a control command depending on an external input, which is a time series $x(t), x(t-1), x(t-2), \ldots$. With a feed-forward network there are two possible approaches:

1. create inputs $x_1, x_2, \ldots, x_n$ which constitute the last $n$ values of the input vector. Thus a 'time window' of the input vector is input to the network;

2. create inputs $x, x', x'', \ldots$: besides only inputting $x(t)$, we also input its first, second, etc. derivatives.