SCALABILITY OF MACHINE LEARNING ALGORITHMS

A thesis submitted to the University of Manchester
for the degree of Master of Science
in the Faculty of Science

November

By
Georgios Paliouras
Department of Computer Science


Contents
Abstract
The Author
Acknowledgements

Introduction
  Definition of Learning
  The objectives of ML
  Approaches taken so far
  Motivation for the project
  The Structure of the Thesis

Theory of Inductive Learning
  Introduction
  Induction as a Search
    The Goal Hypothesis
    The Search Space (Hypothesis Space)
    The operators
  Approaches
    Statistical Classification
    Similarity Based Learning (SBL)
    Neural Networks
    Genetic Algorithms (GAs)
  Two Popular SBL Algorithms
    The AQ Algorithm
    The ID3 Algorithm
  Extensions
    Handling Numeric Data
    Overfitting and Pruning
    Incremental Learning
    Constructive Induction
  Computational Learning Theory (COLT)
    Valiant's Probably Approximately Correct (PAC) Learning Model
    Using the PAC Learning Model to Measure Inductive Bias
    Criticisms and Extensions to the PAC Learning Model
  Summary

The Algorithms
  Introduction
  NewID
    Introduction
    Description
    Analysis
    Conclusion
  C4.5
    Introduction
    Description
    Analysis
    Conclusion
  PLS
    Introduction
    Description
    Analysis
    Conclusion
  CN2
    Introduction
    Description
    Analysis
    Conclusion
  AQ
    Introduction
    Description
    Analysis
    Conclusion
  Summary

Scaling up on Real and Artificial Data
  Introduction
  Description of Data Sets
    Selection Criteria
    Recognising Typed Letters
    Classifying Chromosomes
    Learning Even Numbers
  Test Organisation
    Measuring Scalability
    Measuring Classification Accuracy
    Hardware Specification
  Experimental Results
    Letter Recognition
    Chromosome Classification
    Learning Even Numbers
  Summary
    Real Data Sets
    Artificial Data

Conclusions
  Summary of the Presented Work
  Contribution of the Project
  Further Work
  Summary

A Algorithms considered for the project

B Available Databases
  B.1 Classification according to Size
  B.2 Attributes and Classes
  B.3 Noise
  B.4 Special types of Learning
  B.5 Unusual Structure
  B.6 Real-world vs Toy cases
  B.7 Additional Databases

C Experimental Results

Bibliography
List of Tables

  Heuristics used in each algorithm
  Attribute types that each algorithm can handle
  Summary of complexity estimates
  Regression analysis for linear behaviour on the Letter Recognition Set
  Regression analysis for log-linear behaviour on the Letter Recognition Set
  Regression analysis for linear behaviour on the Chromosome Classification Set
  Regression analysis for log-linear behaviour on the Chromosome Classification Set
  Regression analysis of log-time on sample-size for the Even-Numbers Learning Task
  C.1 Scalability Results using the Letter Recognition data set
  C.2 Letter Recognition Set: the rate of increase of the CPU-time consumed at each size-step
  C.3 Classification Accuracy Results using the Letter Recognition data set
  C.4 Scalability Results using the Chromosome Classification data set
  C.5 Chromosome Classification Set: the rate of increase of the CPU-time consumed at each size-step
  C.6 Classification Accuracy Results using the Chromosome Classification data set
  C.7 Scalability Results using the Even-Numbers learning task
  C.8 Even-Numbers Learning Task: the rate of increase of the CPU-time consumed at each size-step
  C.9 Even-Numbers Learning Task: special examination of the NewID algorithm
List of Figures

  An example of a tree-structured attribute
  Linear discrimination for two classes in a two-dimensional space
  A situation where linear discrimination cannot provide an adequate solution
  Classification using the K-Nearest Neighbour method
  Orthogonal clustering in a binary-class problem with two attributes
  Elliptic clustering
  A two-layered fully-connected network
  The process performed by GAs
  Design of the basic AQ algorithm
  The STAR algorithm
  Design of the basic CLS algorithm
  The first step of the generation of a decision tree by CLS
  The second step of the generation of a decision tree by CLS
  The decision tree finally generated by CLS
  The Basic ID3 Algorithm
  The Main NewID Procedure
  The Evaluation Module
  The value-grouping algorithm
  Design of the basic PLS algorithm
  The PLS clustering Procedure
  The CN2 algorithm
  The Control Module (ordered)
  The Control Module (unordered)
  Find Best Complex (CN2)
  The Control Module
  The Search Procedure
  Scalability Results using the Letter Recognition data set (corresponds to table C.1)
  Letter Recognition Set: the rate of increase of the CPU-time consumed at each size-step (corresponds to table C.2)
  Classification Accuracy Results using the Letter Recognition data set (corresponds to table C.3)
  Scalability Results using the Chromosome Classification data set (corresponds to table C.4)
  Chromosome Classification Set: the rate of increase of the CPU-time consumed at each size-step (corresponds to table C.5)
  Classification Accuracy Results using the Chromosome Classification data set (corresponds to table C.6)
  Scalability Results using the Even-Numbers learning task (corresponds to table C.7)
  Even-Numbers Learning Task: the rate of increase of the CPU-time consumed at each size-step (corresponds to table C.8)
  Even-Numbers Learning Task: special examination of the NewID algorithm (corresponds to table C.9)
Abstract

During the last two decades there has been significant research activity in Machine Learning (ML), which has mainly concentrated on the task of empirical concept learning. This method of learning involves the acquisition of knowledge from a set of examples (the training set), using generalisation techniques.

The task of empirical concept learning can be thought of as being equivalent to the classification task previously performed by statistical techniques. Despite the existence of a large number of problems which can be considered classification tasks, ML techniques have not been widely applied to real-world problems. One of the possible reasons for this is that learning programs cannot handle the large-scale data used in real applications.

Considering that possibility, the presented thesis examined the scalability of five concept-learning algorithms, defining scalability by the effect that an increase in the size of the training set has on the computational performance of the algorithm. The programs that were considered are NewID (Niblett), C4.5 (Quinlan), PLS (Rendell), CN2 (Clark and Niblett) and AQ (Michalski et al.).

The first part of the project involved the theoretical analysis of the algorithms, concentrating on their worst-case computational complexity. The obtained results deviate substantially from those previously presented (O'Rorke; Rendell et al.), providing over-quadratic worst-case estimates.

The second part of the work is an experimental examination using real and artificial data sets. Two large real data sets have been selected for that purpose: one dealing with letter recognition and the other with chromosome classification. The experiments that were done using those two sets provide an indication of the average-case performance of the programs, which is significantly different from the worst-case one. The artificial data set, on the other hand, provides a near worst-case situation, which confirms the obtained theoretical results.

The results of the theoretical and experimental analyses show that, although their worst-case computational complexity is over-quadratic, most of the examined algorithms can handle large amounts of data. Those which had difficulties did so not because of their order of complexity, but because of the cost of their standard computational unit, which significantly affects their performance. The size of the training set is only one of the parameters affecting scalability; the examination of other factors, e.g. the complexity of the learning task, is equally interesting.

DECLARATION

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.
The Author

Georgios Paliouras graduated from the University of Central Lancashire, obtaining a first-class BSc (Hons) degree in Computing with Economics. His work experience includes a one-year Industrial Placement at Siemens AG, Karlsruhe, Germany, working on data compression and debugging tools. Having done his final-year undergraduate project on Machine Learning algorithms, he joined the Department of Computer Science at the University of Manchester in October to work as a research student in the field of Machine Learning. For this work he has been awarded a studentship from the Science and Engineering Research Council.


To my father

Acknowledgements

I am grateful to the following people for supplying me with, or helping me acquire, programs, documentation, data sets and other valuable information:

G. Blix, E. Bloedorn, R. Boswell, P. Errington, J. Graham, D. Haussler, Y. Makris, R. Michalski, C. Moore, S. Muggleton, R. Nakhaeizadeh, T. Niblett, P. O'Rorke, T. Parsons, R. Quinlan, L. Rendell, M. Rissakis, D. Slate, D. Sleeman, S. Wrobel and X. Wu.

This work was greatly facilitated by the exchange of materials available within the Concerted Action of Automated Cytogenetics Groups, supported by the European Community (Project No. II), and by the use of materials from the UCI Repository of machine learning databases (Irvine, CA: University of California, Department of Information and Computer Science).

I would also like to thank C. Casey, R. Lee, S. Nicklin, D. Rydeheard, R. Sakellariou and G. Theodoropoulos for suggesting corrections and proofreading the thesis.

This work was partly funded by a studentship received from the Science and Engineering Research Council, the support of which is kindly acknowledged.

Most of all, I would like to thank my supervisor, Prof. D. Brée, and my family, for supporting me throughout my work.
Chapter 1

Introduction
During the last two decades there has been a significant amount of research activity in ML, which has mainly concentrated on the task of empirical concept learning. This method of learning involves the acquisition of knowledge from a set of examples (the training set), using several generalisation techniques. The presented thesis examines the scalability of five concept-learning algorithms, where scalability is defined by the effect that an increase in the size of the training set has on the computational performance of the algorithm. This chapter briefly introduces some important aspects of Machine Learning (ML) and outlines the aims of the project.
Definition of Learning

One of the sources of difficulty when trying to set the objectives of Machine Learning (ML) is the definition of learning. The concept of learning is rather abstract, and those who have tried to define it (philosophers, psychologists, AI workers, etc.) have usually only managed to uncover one of the many faces of this complicated process.

However, there are some aspects of learning which have been agreed upon by most of the people who have dealt with the problem, and these provide, for many purposes, a good description of the process. Some of those aspects are the following:

- There is always a system that is able to improve itself, manipulating information provided by its environment.
- The information provided to the system can usually take more than one form, and the system has more than one way of changing its current state, i.e. there is more than one type of learning.
- The system is usually able to remember and recall things that it has experienced.

This general description of learning, however, does not contain much information about the way in which the process is achieved and the elements into which it can possibly be decomposed. It is on those aspects of the learning process that the opinions of different researchers diverge.
The objectives of ML

Bearing in mind the diversity of opinions as to what learning is and how it can be achieved, one can understand the difficulty in defining the purpose of ML and setting some clear-cut objectives for it. Thus, although the main idea is well defined, i.e. man-made systems that are able to learn, different groups of people have approached ML differently. In doing that, they have each set their own expectations about the outcome of ML.

Thus, one can distinguish between the following views of ML:

The Philosophical view
The main concerns of philosophers about ML are:
- whether artificial learning systems can be produced,
- what the purpose of learning in human beings is,
- and what the consequences of developing learning machines would be.

The psychological view
Psychologists (Cognitive Scientists) are interested in the mental processes involved in human learning. They would like to be able to model or mimic them using machines, in order to enhance their understanding of learning.

The neurophysiological view
Learning is one of the most complicated functions of the brain, and the physiologists who are dealing with it are very interested in implementing their ideas using machines, to produce brain models and observe their behaviour. A successful product of this research is Neural Networks (NNs), which are brain-model-based learning systems.

The Artificial Intelligence (AI) view
AI workers are interested in developing artificial learning systems, since learning is one of the most important intelligent processes. The way in which they are trying to achieve their target is not limited to models of human learning processes, although these provide good guidance for their research.

The engineering view
From an engineering and business point of view, learning machines might prove to be the solution for some of the problems of current information systems. One might expect that adaptive and self-improving systems would increase the efficiency and decrease the effort that has to be made by humans in several tasks, e.g. prediction, diagnosis, etc.

These views overlap in many ways. For example, work in AI incorporates philosophical, cognitive and physiological ideas, and at the same time its products are sometimes business/engineering-oriented. Another example is the use of NNs for practical applications.
Approaches taken so far

Due to the variety of objectives set for ML, a number of different approaches to learning systems have emerged. These approaches could be classified into the following three types:

Brain-modelling
This was one of the first approaches in ML. It was based on the theory of cybernetics and neurobiological brain models, producing highly-connected neural structures that interacted with each other in a near-random fashion, similar to the way that the brain was thought to work.
Originally this approach was not successful, but recently similar approaches have regained popularity, producing systems (NNs) that are much less ambitious than their predecessors, i.e. they are set to solve specific tasks rather than achieving general-purpose learning, and which have had some positive results.

Learning algorithms
The bulk of the work in what is called symbolic ML, which is the classical AI approach to learning, was done on individual algorithms, using many different methods in order to achieve learning. Most of those algorithms fall under one of the following categories:

Learning by deduction
This type of learning algorithm assumes a large amount of background knowledge about the problem, which it analyses, deducing rules and models that can be used later to solve specific problems.
Learning by analogy
Algorithms falling into this category also assume some general background knowledge, which is usually provided in the form of example situations, followed by corresponding explanations. The system structures this information in such a way as to be able to use it to explain new experiences. In other words, there is a mapping of new information onto what is already available in the system, causing a continuous extension of the system's knowledge.

Learning by induction
This is the type of learning that has been paid the most attention. Algorithms falling under this category learn by generalising on specific examples (the training set), using a number of different generalisation techniques. No initial knowledge is assumed, as the system extracts information from the training set in a near-statistical way. Empirical concept learning, which will be further discussed in chapter 2, belongs to this category.
Application-oriented learning systems
Some of the practical problems that have been attacked by ML are the following:

Knowledge Acquisition (KA) for Expert Systems (ESs)
KA is one of the most difficult tasks in building an ES, due to the fact that experts have large amounts of knowledge which they find difficult to transfer. On the other hand, in many cases large pools of past data are available, which can be used for the extraction of information about the problem. One way to acquire information in this case is to use inductive learning systems.

Adaptive and Self-Improving Systems
There is a number of situations where a system is required to adapt its knowledge according to new data that become available. This means either an improvement in the system's performance (i.e. self-improving ESs) or adaptability of the system to changing circumstances (i.e. adaptive control systems). The initial knowledge in this case is either directly provided to the system or induced from examples by the learning program itself.

Forecasting Systems
The forecasting systems that have been produced are mainly at an experimental stage; they make use of learning systems that induce forecasting rules based on past data. Examples of situations where such systems could be used are weather, economic and business forecasting.

Pattern Recognition Systems
Pattern Recognition systems are also at an experimental stage, and are mainly used in Machine Vision systems, where patterns and objects need to be recognised in different situations. NNs are the main method used for this purpose, replacing traditional statistical classification approaches.
Experience with ML applications has shown that in many cases a single learning algorithm cannot provide an adequate solution to the problem. As a result, some of the recent research in ML has been oriented towards systems that combine more than one learning method. This is what is called multistrategy learning, and it has become very popular lately.
Motivation for the project

The task of empirical concept learning can be thought of as being equivalent to the classification task (see chapter 2), previously performed by statistical techniques. Since there is a large number of problems which can be considered to be classification tasks, one would expect ML techniques to be widely applied to real-world problems. However, this is not the case. There are only a few real-world applications of ML, and most of them are not large-scale ones. There are a number of possible reasons why this happens:

- ML techniques do not provide adequate solutions to real-world problems.
- Statistical classifiers achieve a better performance than the classifiers generated by ML programs.
- ML algorithms make assumptions about the structure of the problem and the provided data that do not hold in real problems.
- ML algorithms cannot be applied to large-scale data.
- The existing ML programs have not been designed to handle large-scale data.

In the first few years of ML research, most of the above claims were true. However, subsequent research has led to the improvement of learning systems, overcoming most of these problems (see chapter 2). One problem to which little attention has been paid is the behaviour of learning algorithms on large-scale data. There have been analyses and comparisons of ML algorithms in the past (O'Rorke; Gams and Lavrac; Rendell et al.), but none has looked in detail at the scalability of the algorithms.

The aim of the project was to address this neglected issue, examining the truth of the last two of the above list of claims about ML algorithms and programs.
For this purpose, five ML programs were selected which perform similar types of learning, i.e. empirical concept-learning, and their behaviour was analysed both theoretically and experimentally. The theoretical analysis involved a thorough analysis of the computational complexity of the programs, providing a worst-case estimate of their performance on different scales of data. The experimental investigation examined the behaviour of the programs when applied to data sets of varying scale. Three data sets were used for this purpose: two real and one artificial. By combining the results of the theoretical and the experimental analyses, a complete picture of the scalability of the algorithms was formed.
The Structure of the Thesis

Following this brief introduction to ML and the objectives of the project, chapter 2 takes a closer look at inductive learning, concentrating on empirical concept learning. It first presents the supporting theory for this type of learning, linking it to the problem of classification. It then gives a brief account of the different approaches to classification, ranging from statistical methods to genetic algorithms. Following this account, two symbolic learning algorithms are examined, which have been the centre of most of the research activity in the field. Finally, a brief review of the work in Computational Learning Theory, a rapidly growing research area in empirical concept-learning, is given.

Chapter 3 presents the theoretical analysis of the five algorithms. For each of the algorithms, the following information is provided:

- A description of the algorithm, focusing on its peculiarities.
- The design of the algorithm.
- A detailed worst-case computational complexity analysis.

Chapter 4 describes the experiments and presents their results. The results are also statistically analysed, to allow the comparison of the algorithms' relative performance and the validation of the theoretical estimates presented in the previous chapter.

Finally, chapter 5 summarises the results presented in the thesis and draws conclusions about their importance in the context of the scalability problem.
Chapter 2

Theory of Inductive Learning
Introduction

One of the main research areas in ML, and the one that this thesis concentrates on, is inductive learning. Inductive learning involves the use of inductive inference for the acquisition of knowledge from experience. It has attracted most of the research done in ML (see Winston, Quinlan, Mitchell), resulting in the development of many interesting techniques. Some of those techniques will be described later in this chapter.

The task that is usually set in inductive learning is the acquisition of concepts from examples, i.e. empirical concept-learning. In empirical concept-learning the system is provided with a set of positive and negative examples of a concept, described by a set of features which can take a range of values. For example, one may describe the concept of a bird by the number of wings, the number of legs, the size, the flying ability, etc. In this case, some of the positive examples of the concept will be the following:

  no. of wings | no. of legs | size  | flying
  ...          | ...         | small | yes
  ...          | ...         | big   | yes
  ...          | ...         | big   | no

while some negative ones could be:

  no. of wings | no. of legs | size  | flying
  ...          | ...         | small | yes
  ...          | ...         | big   | yes
  ...          | ...         | small | no
The outcome of concept-learning is a concept description (a concept function), which is able to discriminate between objects that are instances of the concept and those which are not, based on their feature values. This task is also called object classification, and it has been an active area of research in statistics. Its fields of application include diagnosis (e.g. medical), forecasting, decision-making, etc.

Real-world applications of learning systems are more demanding than the toy problem described above. As a result, a number of extensions have been made to the basic model, in order to make it more widely applicable. The most common extension is the use of multiple concepts, or non-binary classes. In this case, more than one concept needs to be distinguished from the same data, while in some cases a continuously-valued class is used, i.e. an infinite number of concepts. Examples of such learning problems are the diagnosis of different types of diseases and the forecasting of the closing price of a currency. (Notice that the term classification suits such tasks better, since there is really one class that can take more than one value, rather than many different concepts.)

Another major extension to the basic concept-learning model is unsupervised learning. In this case the objects that are provided for training are not preclassified, and the learning system is required to cluster them into groups which share common features. This is a more difficult type of learning, also called discovery, because the selection of important classification features is not aided by the preclassification of the training instances. An application area where such problems are common is object identification in Machine Vision.

This thesis examines supervised learning techniques in multi-concept learning.
Induction as a Search

Provided with a set E of positive and negative examples of a concept c, an inductive learning system is required to form a hypothesis H, based on E, that will contain the main features of the concept and will correctly distinguish between instances and non-instances of it. This process can be thought of as a search through a state space, where the states are all the possible hypotheses that correspond to the given attribute (feature) set, and the goal is the hypothesis that best describes the concept. The operators that lead the search through this space are the inference rules incorporated in the learning algorithm.
The Goal Hypothesis

The learning system is asked to induce a hypothesis that will be complete and consistent with the example set. This means that it has to cover all the positive and exclude all the negative examples. Using Michalski's formulation (Michalski), the following conditions have to hold:

    \forall i \in \{1..I\} : E_i \subseteq D_i    (the completeness condition)
    \forall i, j \in \{1..I\},\ j \neq i : D_i \cap E_j = \emptyset    (the consistency condition)

where I corresponds to the number of class values, D_i is the induced hypothesis for the i-th class, and E_i is a description satisfied only by the positive events of the i-th class.

In some cases there will be clashes between instances in the training set, usually caused by noisy data. In that case, either the completeness or the consistency condition is relaxed, in order to resolve the contradiction. For most non-noisy training sets, however, more than one hypothesis is expected to satisfy the two conditions. The choice of the final hypothesis depends on the learning algorithm and the search operators that it uses. In this respect, there are two main types of learning algorithms, which correspond to the two extreme cases:

- characteristic learning algorithms
- discriminant learning algorithms

Algorithms falling under the former category search for a typical description of the concept, which will contain as much information as possible. This is called a maximal characteristic descriptor. Those of the latter type aim at a hypothesis that will correctly discriminate between positive and negative instances, and will be of minimal information content. This is called a minimal discriminant descriptor. Between these two extremes there are a number of concept-learning algorithms which also make use of other criteria in deciding on the final hypothesis. Such criteria might be the simplicity of the hypotheses or the preference of certain attributes over others.

The Search Space (Hypothesis Space)

The nature, size and complexity of the search space are determined by the following two factors:

- The Attribute Set
- The Description Language
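The two conditions can be paraphrased in code. The following minimal sketch models each E_i and D_i as a Python set of instances, an illustrative simplification of Michalski's formulation rather than anything from the thesis:

    # E maps each class i to the set of its training instances;
    # D maps each class i to the set of instances covered by the
    # hypothesis induced for that class.

    def complete(E, D):
        # Completeness: every class's examples are covered by its hypothesis.
        return all(E[i] <= D[i] for i in E)

    def consistent(E, D):
        # Consistency: no hypothesis covers examples of a different class.
        return all(not (D[i] & E[j]) for i in D for j in E if i != j)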
The Attribute Set

Firstly, the type and the domain of the attributes that describe the concept affect the size of the search space. One usually distinguishes between three types of attributes:

Nominal attributes
These are the ones that take nominal values, e.g.
    size = big or normal or small

Numeric attributes
These take numeric values, which will usually be either integer or real, e.g.
    length = ...

Tree-structured attributes
These are the ones whose values can be organised hierarchically, as in figure 2.1.

    Shape
      Polygon:   Triangle, Square, Pentagon
      Ellipsoid: Ellipse, Circle

Figure 2.1: An example of a tree-structured attribute. In this case shape can take the values polygon, triangle, square, etc.
On the other hand, the domain of the attribute gives some information about its meaning and can be used to restrict the search space. For example, a range can be specified for a numeric attribute, restricting the interval of values that it can take.

Moreover, the relevance of the attributes to the problem affects the complexity of the learning task. The existence of many irrelevant attributes is a type of noise, which will cause significant deterioration in the performance of most inductive learning algorithms. In most cases, however, the learning algorithm is able to discard irrelevant attributes by examining the training data. This type of induction is known as selective concept learning (examples can be found in Michalski and Quinlan), and it is the one most commonly met in the ML literature. A different type of learning, which requires a more complicated inductive procedure, is constructive concept learning (see Rendell; Hong et al.). In this type of learning, the attributes are assumed to contain less encoded knowledge, and better performance can be achieved by combining them in several ways. A simple example of this would be the concept of a right-angled triangle, for which the lengths of its sides are given. Although one cannot classify a shape as a right-angled triangle by considering the length of each side individually, one can calculate the squares of the given lengths and compare the sum of the two shorter ones with the third, to decide whether the triangle is right-angled.
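To make the triangle example concrete, here is a minimal sketch of the kind of attribute combination involved in constructive induction; the tolerance and the sample triangles are illustrative assumptions, not taken from the thesis:

    # The raw attributes (side lengths) are individually uninformative,
    # but the constructed feature a^2 + b^2 - c^2 (with c the longest
    # side) separates right-angled triangles perfectly.

    def construct_feature(sides):
        a, b, c = sorted(sides)          # ensure c is the longest side
        return a**2 + b**2 - c**2        # exactly 0 for right angles

    examples = [((3, 4, 5), True), ((6, 8, 10), True), ((2, 3, 4), False)]
    for sides, is_right in examples:
        prediction = abs(construct_feature(sides)) < 1e-9
        assert prediction == is_right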
The Description Language

The language that is used for building hypotheses, by combining attributes, also affects the size and the complexity of the space. The more expressive the language is, the larger the number of hypotheses that can be induced, and the larger the search space. There are various different representation schemes that have been examined in concept-learning and classification research, resulting from the different approaches to the problem (Statistical Classification, Neural Networks, Symbolic Learning, etc.; see the next section). The field of Computational Learning Theory examines the complexity of different types of concepts and the extent to which each of these is learnable (see the section on Computational Learning Theory, later in this chapter).

The operators

Inductive learning is mainly based on generalisation. In other words, the hypothesis that will be induced is a generalisation of the positive examples of a concept, which must at the same time be specific enough to exclude the negative examples. The inductive process starts with an initial hypothesis, which is modified by specialisation and generalisation operators, in order to achieve a better fit to the training data. This search for a good fit is guided by one or more heuristic functions, which are incorporated in the learning algorithm. The construction of the initial hypothesis, the search operators and the heuristics that are used differ between approaches. Some of those will be examined in the following sections.
Approaches

The problem of concept-learning/classification has been approached from different perspectives. This section gives a brief overview of the most common of those approaches:

- Statistical Classification
- Similarity Based Learning
- Neural Networks
- Genetic Algorithms

For a more detailed description of the methods that are discussed here, the reader is referred to Weiss and Kulikowski, and to Nakhaeizadeh et al.
Statistical Classification

This field of research in statistics is the predecessor of concept-learning, and it has contributed a number of interesting classification methods. Some of the common features of these methods are the following:

- Only numeric attributes are used.
- Each instance of a class must have a value associated with each of the attributes, i.e. no missing values are allowed.
- The methods are grouped into parametric and non-parametric. The former assume a specific type of discrimination function, which they try to fit to the data by adjusting its parameters. The latter do not make this assumption.

The following are some of the methods that are commonly used for statistical classification:
Linear Discriminant
This method attempts to generate hyperplanes which will achieve a good discrimination between the classes. The number of hyperplanes that are used depends on the number of classes, each separating one class from the rest. The calculation of the parameters specifying the location of each hyperplane is usually done by regression analysis. Figure 2.2 illustrates the way in which a hyperplane (a straight line in this case) is used to discriminate between two classes in a problem with two attributes. This is an ideal case, where the classes can be discriminated perfectly using a straight line.

Logistic Discriminant
This is an improved version of the linear discriminant method, which uses a different criterion for regression: maximising conditional likelihood, instead of optimising a quadratic cost function.

Quadratic Discriminant
Real classification problems are not always solvable by linear discriminants. Figure 2.3 presents a situation which falls into that category. Because of that, another parametric classification method was developed, which generates quadratic curves for discriminating between classes.

K-Nearest Neighbour
This is a simple non-parametric classification method, which makes use of the preclassified instances in order to classify new ones. It examines the K closest neighbours of the new instance, and classifies it according to the most common class amongst them. The most important feature of this method is the formula which is used for calculating the distance between instances. Some alternatives that have been used are the following:

- Absolute distance: the sum of the absolute differences between the attribute values of the two instances.
- Euclidean distance.
- Normalised distances: for example, the number of standard deviations from the mean of each feature.

Figure 2.4 illustrates the use of this method.

Conceptual Clustering
This is a method which is very close to the symbolic ML approach to the problem. According to it, instances which share common features and a common class are grouped together, forming clusters which can be used for classifying new instances. Clustering can also be used for unsupervised learning, in which case instances are grouped according to their feature values and a class-label is attached to the generated clusters. One of the algorithms examined in this thesis (PLS) is a conceptual clusterer; it will be described in detail in the following chapter.

[Figure 2.2: Linear discrimination for two classes in a two-dimensional space.]
[Figure 2.3: A situation where linear discrimination cannot provide an adequate solution.]
[Figure 2.4: Classification using the K-Nearest Neighbour method. In this case the new instance is assigned to the negative class, because most of the examined neighbours belong to that class.]
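By way of illustration, here is a minimal K-Nearest Neighbour sketch using the Euclidean distance; the data and the choice k = 3 are invented for the example:

    import math
    from collections import Counter

    def knn_classify(instance, training_set, k=3):
        # Classify by majority vote among the k nearest preclassified
        # neighbours, using Euclidean distance.
        def distance(x, y):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        neighbours = sorted(training_set,
                            key=lambda ex: distance(ex[0], instance))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    # Two numeric attributes, binary class.
    training = [((1.0, 1.0), '+'), ((1.2, 0.9), '+'),
                ((5.0, 5.2), '-'), ((4.8, 5.1), '-')]
    print(knn_classify((1.1, 1.0), training))  # '+'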
Similarity Based Learning (SBL)

This is a symbolic approach to inductive learning, which is also the approach taken in this thesis. Most of the methods developed under this paradigm produce classifiers which can be interpreted as a set of clusters, similar to the conceptual clustering approach. There are, however, a number of differences between the SBL and conceptual clustering methods:

- SBL methods can handle nominal and structured attributes. Early versions of SBL algorithms could not handle numeric attributes at all, while more recent ones usually discretise them and treat them in a similar way to nominal ones. This approach imposes a substantial overhead on the computational requirements of the algorithms (see chapter 3) and does not make effective use of the attributes (see de Merckt).

- The clusters that are generated in SBL are usually orthogonal hyper-rectangles, formed by the introduction of dichotomising hyperplanes which are parallel to the feature axes (figure 2.5). The conceptual clusterer that is examined in this project is of the same type, but others use different shapes, e.g. elliptic clusters (figure 2.6).

This restriction on the shapes of the clusters imposed by SBL methods has recently been recognised as an important problem, and some work has been done in order to overcome it (Murthy et al.). The restriction is imposed by the description languages that are used, which limit the ways in which attributes can be combined to simple conjunctions and disjunctions between attribute-tests. (An attribute-test is the association of a value, or a range of values, with an attribute.) This makes sense when nominal attributes are used, since their values cannot always be ordered. The situation, however, is different with numeric attributes, between which there may be a relationship defined by a linear or other numerical function.

A number of different conventions have been used in SBL research for describing the induced concepts, the most popular of which are decision trees and decision lists. These representation schemes, together with the heuristics that are used in some SBL algorithms, are described later in this chapter.

[Figure 2.5: Orthogonal clustering in a binary-class problem with two attributes, one nominal and the other numeric.]
[Figure 2.6: Elliptic clustering.]
Neural Networks

Neural Networks achieve a similar classification result to the statistical classifiers, using a different representation scheme. The scheme contains a number of neurons, arranged into several layers. Each of the neurons in one layer is usually connected to all the neurons in the previous layer, from which it receives input, and to all the neurons in the next layer, to which it feeds its output (figure 2.7). The first layer of neurons is used for input to the system. Neural Networks accept only numeric input, and for this reason nominal attributes are translated into the set of all possible attribute-tests, each of which is binary-valued. Following this preprocessing stage, each input node accepts values for one attribute or attribute-test. These values are fed to the next layer of nodes, which perform a weighted summation and generate an output value according to some function. The output values are passed to the next layer of neurons, if it exists, which process them in the same way and continue the forward-feeding process, until the output layer is reached. At that stage, the generated output is compared to the desired output, provided by the preclassified instance (this is not the case for unsupervised learning), and their difference is fed back to the previous layers, causing the adjustment of the connection weights and of other parameters that are used in the calculations taking place in each node.
[Figure 2.7: A two-layered fully-connected network. a_i stands for the input neuron which corresponds to the i-th attribute/attribute-test, and c_i for the output neuron which corresponds to the i-th class.]

The calculations involved in the forward-feeding and the weight-adjustment stages differ between different types of networks. The following is a very brief account of some commonly used networks:

Perceptron
This is a very simple type of network, invented by Rosenblatt (Rosenblatt). It consists of only two layers of neurons, an input and an output one, and its behaviour is identical to the linear discriminant, with the difference that it is a non-parametric method, i.e. it does not make any assumptions about the shape of the class probability distributions. The generated value at each neuron in the output layer is given by the following formula:

    v_j = \begin{cases} 1 & \text{if } \sum_i w_{ji} I_i > \theta_j \\ 0 & \text{otherwise} \end{cases}

where I_i is the i-th input value, w_{ji} is the weight associated with the connection between the i-th input and the j-th output neuron, and \theta_j is a threshold value associated with the j-th output neuron. There is a different threshold value for each output node, which gets updated at the weight-adjustment stage.
Another idea that is shared between the perceptron algorithm and the linear discriminant is the use of the squared difference for calculating the distance between the generated and the expected output value. This calculated value is used for updating the weights and the thresholds. With respect to the weight-adjustment process, there are mainly two approaches:

- Batch Learning: All the examples in the set are examined before any adjustment takes place. In that case, the mean of squared errors is calculated and used in the adjustment.
- Incremental Learning: The adjustment takes place after each example has been considered. The examples are either processed sequentially or randomly. In this method, the absolute difference is used in the adjustment.

The former method is expected to give more reliable results, since an overview of the whole training set is maintained, but it is computationally more expensive than the latter.
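To make the incremental procedure concrete, the following is a minimal sketch of perceptron training under the threshold formula above; the learning rate, epoch count and the AND task are illustrative assumptions, not taken from the thesis:

    def train_perceptron(examples, n_inputs, rate=0.1, epochs=20):
        weights = [0.0] * n_inputs
        theta = 0.0                       # threshold, updated like a weight
        for _ in range(epochs):
            for inputs, target in examples:    # incremental: adjust per example
                output = 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0
                error = target - output
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                theta -= rate * error
        return weights, theta

    # Example: learning the logical AND of two binary inputs.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    weights, theta = train_perceptron(data, n_inputs=2)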
Multi-layer Perceptron (MLP)
As mentioned in the statistical approaches, there is a number of problems which cannot be solved with linear discrimination. In order to overcome this problem in neural networks, several perceptrons are combined together. The result is a network with a number of hidden layers between the input and the output ones, which can approximate non-linear functions.

In parallel to the introduction of more than one layer, the calculation of the feed-forward values at each layer and the weight-adjustment method have been improved. The output of each node is now calculated by the logistic or sigmoidal function:

    v_j = \frac{1}{1 + e^{-n_j}}

where e is the base of the natural logarithm and n_j is the weighted sum (including the threshold) involved in the perceptron formula above.

The output of this function is within the range (0, 1) and can be used in the calculation of the difference between the actual and the expected output value, without being translated into a binary form, as in the perceptron formula. The result of this is a smoother adjustment of the weights and the other parameters.

The adjustment of the weights is now done with the use of the Back Propagation algorithm (Rumelhart et al.), which is similar to the one used in the simple perceptron, but incorporates more parameters. The aim is still to minimise the sum of squared errors over the training set, but the errors are now propagated more than one layer back, in order to adjust all the weights and the thresholds in the network. The new update function is given by the following equation:

    w_j \leftarrow w_j + \eta \, \delta_j + \alpha \, \Delta w_{j,t-1}
where \eta is the learning rate, or step-size, which controls the speed of learning, \delta_j is the proportion of error propagated to node j, \Delta w_{j,t-1} is the last change of the weight, and \alpha is the momentum parameter, which controls the effect of the previous weight change.

The Back Propagation of errors and the subsequent adjustment of the weights continue until no substantial difference exists between the amounts by which the weights change at step t and at step t-1. This iterative error-minimisation method is known as gradient descent. If the step-size is too large, this method will lead to an oscillation around the desired minimum error value. This problem is solved by the inclusion of the most recent weight change, i.e. the momentum term, in the calculation.
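A small sketch of the two formulas just given, the sigmoidal output and the momentum-based update; the numeric values of eta and alpha are illustrative choices:

    import math

    def sigmoid(n):
        return 1.0 / (1.0 + math.exp(-n))

    def update_weight(w, delta, prev_change, eta=0.5, alpha=0.9):
        # New change = learning-rate term plus momentum term.
        change = eta * delta + alpha * prev_change
        return w + change, change   # new weight, change remembered for next step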
Radial Basis Function Networks (RBFN)
This is a new type of network, the main characteristic of which is that it does not use an iterative learning method, as in Back Propagation. The RBFN is provided with a number of points in the feature space, each of which is used as the centre of an interpolating function. The set of all those functions can be used as a classifier. A major problem in RBFNs is the determination of the function centres. A number of methods have been developed for solving that problem, which vary from arbitrary and random approaches to unsupervised learning ones. For a detailed description of RBFNs, the reader is referred to Nakhaeizadeh et al.

Kohonen Networks
The Kohonen network (Kohonen) provides an unsupervised learning method using Neural Networks. Usually in unsupervised learning the network is provided with the number of desired clusters and associates a single cluster with each output neuron, by adjusting only that neuron's weights when it achieves the highest output value (a winner-takes-all network). The Kohonen network, however, also updates the weights of the output neurons which are architecturally close to the winner, achieving an interpolation effect which arranges the clusters according to their arrangement in the feature space. This effect is similar to that of a traditional statistical algorithm, called the k-means clustering algorithm, which generates a partition of the feature space into patches, called the Voronoi tessellation.
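For illustration, here is a minimal sketch of the k-means procedure mentioned above, with points represented as coordinate tuples; the initialisation and the iteration count are illustrative choices:

    import random

    def k_means(points, k, iterations=20):
        centres = random.sample(points, k)
        for _ in range(iterations):
            # Assign each point to its nearest centre.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centres[i])))
                clusters[nearest].append(p)
            # Move each centre to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centres[i] = tuple(sum(c) / len(cluster)
                                       for c in zip(*cluster))
        return centres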
Genetic Algorithms (GAs)

GAs are search methods which are inspired by Darwin's evolution theory about the survival of the fittest. Figure 2.8 summarises the main elements of a genetic algorithm.

Special interest has been shown in the following five elements of a GA:

[Figure 2.8: The process performed by GAs.]
Representation scheme
The main representation scheme that has been used (proposed by Holland, who is one of the main contributors in the field) is bit-strings, which are called chromosomes. Each chromosome represents a rule or an example, which is encoded in a bit-form. Each attribute of an example, or condition of a rule, is assigned a bit, or a group of bits, in the chromosome, which will hold the values that the attribute/condition takes in specific examples/rules.

Initialisation
The first stage of the evolutionary process involves the generation or acquisition of an initial set of rules. For research purposes, random generation of those rules by the algorithm itself, i.e. by random assignment of values to the bit-strings, is favoured. The reason is that it is a good test for the algorithm to start its evolution from a random population that may have nothing to do with the desired optimal one. In this case, if the algorithm manages to find its way to the optimal rule, it is considered to have performed well in searching through the space.
In real-world applications, however, where safety and processing time are important, the initial population is usually supplied to the algorithm by the user. In this case, the initial population may be derived from the user's personal experience or from a different learning algorithm.

Evaluation function
The evaluation function is one of the most important elements of a GA. Small changes to it can improve or worsen the algorithm substantially. The objective of the evaluation function is to assign a worth to each rule, according to its success in classifying examples from the training set. In order to achieve that, there is a number of factors that can be taken into account, for example:
- the classification correctness of the rule,
- the complexity of the rule,
- the performance of the rule in the past.

Genetic Operators
There are three main reproduction operators, proposed by Holland, that are used in GAs:

Crossover
This operator is inspired by the sexual reproduction of species in the real world. It involves the recombination of the genetic information of two rules, in order to produce new rules. The newly generated rules do not contain any new genetic material, but they could be more or less successful than their parents, because they contain different combinations of their genetic information. The way in which the operator works for binary strings is illustrated in the sketch at the end of this subsection.
Mutation
This operator is used for the introduction of new "ideas" into the search, thus avoiding the concentration of the search at local maxima. This is done by moving the search, in a random way, to different areas of the search space. The way it operates is by altering the genetic information of a rule at some random position, with a randomly generated value.

Inversion
This is a complementary operator to crossover, but it is applied rarely. What it does is rearrange the genetic information within a chromosome, by reversing a substring, so that pieces of genetic information that were far from each other are brought together and can be used in a substring selected by crossover (see the sketch below).

Parameters
There is a number of parameters used in various components of a GA, e.g. the size of the population, the probabilities of selecting each of the operators, the condition that will have to be satisfied in order for the search to stop, etc. Most of these parameters are highly dependent on the application, and they are likely to affect substantially the performance of the algorithm.
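As a concrete illustration of the three operators, here is a minimal sketch; the bit-strings and parameter values are invented for the example, since the original worked examples did not survive:

    import random

    # Chromosomes are modelled as lists of 0/1 bits.

    def crossover(rule1, rule2, point):
        # Swap the tails of the two rules after the given bit position.
        return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

    def mutate(rule):
        # Flip the bit at a randomly chosen position.
        i = random.randrange(len(rule))
        return rule[:i] + [1 - rule[i]] + rule[i + 1:]

    def invert(rule, start, end):
        # Reverse the substring between two positions (start inclusive,
        # end exclusive).
        return rule[:start] + rule[start:end][::-1] + rule[end:]

    r1, r2 = [1, 0, 1, 1, 0, 1], [0, 1, 0, 0, 1, 0]
    print(crossover(r1, r2, 3))  # ([1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1])
    print(invert(r1, 0, 4))      # [1, 1, 0, 1, 0, 1]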
Two Popular SBL Algorithms

This section describes two algorithms which have been the basis of most SBL learning systems. Most of the systems that are used in the project are also descendants of these algorithms.

The AQ Algorithm

AQ was one of the first successful algorithms in symbolic ML. It is based on the STAR method (Michalski), which was developed by Michalski in the late 1960s. One of the interesting features of AQ is the comprehensibility of its
concept-description language, Annotated Predicate Calculus (APC). (APC is an improvement of the Variable-valued Logic system VL, which made use of propositional-calculus expressions.) In terms of the APC formalism, examples and hypotheses are expressed using the following building blocks:

Selectors
These are the elementary blocks of the language. They correspond to attribute-tests, where an attribute is associated with one or a range of its values, by means of some relational predicates (e.g. =, >=, etc.). Examples of selectors are the following:

    [shape = sphere]
    [weight > ...]
    [colour = green]

Complexes
A conjunction of selectors forms a complex. Instances are represented by complexes. The following are examples of complexes:

    [shape = sphere][weight > ...][colour = green]
    [weight > ...][weight < ...][colour = red]

Covers
Covers are disjunctions of complexes, and they are used for representing the induced hypotheses. The following is a possible cover:

    [shape = sphere][weight > ...][colour = green] OR
    [weight > ...][weight < ...][colour = red] OR
    [shape = pyramid][colour = green]

The reason why these high-level structures are called covers has to do with the inductive method adopted in the algorithm. AQ considers each positive example in turn, attempting to generalise it as much as possible, while excluding all the negative examples. During this process it builds a cover, which is a description that includes all positive examples and excludes any negative ones. (Some versions of the algorithm allow for a misclassification error, in order to handle noise in the training set.) Each time the algorithm examines a positive example which is not covered by the cover constructed so far, a new complex is added to the cover. This complex is produced by the STAR algorithm, and it is potentially selected out of a number of candidate ones, according to a number of criteria, e.g. coverage (the number of positive examples covered) and simplicity (usually measured by the size, i.e. the number of selectors, of the complex). In the end, a disjunction of those best complexes, i.e. a cover, is induced, which covers all positive and excludes all negative examples of the concept. This is taken as the best approximation to the concept. Figures 2.9 and 2.10 describe the basic AQ and the STAR algorithms.
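Before the formal design, it may help to see the three building blocks rendered as data. The attribute values below are hypothetical stand-ins for the lost originals; the structure (selector, then complex, then cover) follows the text above:

    selector = ('shape', '=', 'sphere')          # an attribute-test
    complex_ = [('shape', '=', 'sphere'),        # conjunction of selectors
                ('colour', '=', 'green')]
    cover = [complex_,                           # disjunction of complexes
             [('shape', '=', 'pyramid'), ('colour', '=', 'green')]]

    def satisfies(instance, cplx):
        # True if the instance satisfies every selector of a complex.
        ops = {'=': lambda a, b: a == b,
               '>': lambda a, b: a > b,
               '<': lambda a, b: a < b}
        return all(ops[op](instance[attr], val) for attr, op, val in cplx)

    def covers(instance, cvr):
        # A cover matches if any of its complexes is satisfied.
        return any(satisfies(instance, c) for c in cvr)

    print(covers({'shape': 'sphere', 'colour': 'green'}, cover))  # True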
Input: A set of examples E, a set of attributes A, a set of class values C, user-defined criteria LEF for the selection of the best complexes, and a user-defined star size m.
Output: A set of covers, one for each class in C.

1. For each class c_i in C do:
2.   Split E into positive (POS) and negative (NEG) examples of c_i.
3.   Set COVER to the empty set.
4.   While POS is not empty do:
5.     Randomly select a seed example E_s from POS.
6.     Use the star-generating algorithm STAR(E_s, NEG) to generate a set of complexes of size m (called the STAR), which cover E_s and exclude all examples in NEG.
7.     Use LEF to select the best complex BEST-COMP in the STAR.
8.     Append BEST-COMP to COVER as a new disjunct.
9.     Subtract from POS all the examples covered by BEST-COMP.
10. Return COVER.

Note that the algorithm assumes that no contradictions exist in the training set.

Figure 2.9: Design of the basic AQ algorithm.
The main idea underlying the algorithm is the generation of stars and covers. Each of the positive examples acts potentially as a seed, expanded against the negative examples. Thus, there is an initial generalisation stage, which produces a star from the seed example. Following the selection of the best element of this star, the algorithm performs a specialisation of this best complex, making use of the subset of positive examples that is covered by it. In terms of search space, this could be described as a brief oscillation within the space existing between the seed and the negative examples. The resulting star is optimal, in the sense that it covers completely, and in a minimal way, all the positive examples which are covered by the initial, very general star. (Note that the existence of noise may render the generation of such a star impossible.) This type of search, which combines a generalisation and a specialisation stage, is called beam search.
STAR(E_s, NEG):
1. Set STAR to contain the empty complex. (At the beginning, the empty complex covers all the examples.)
2. Randomly select a negative example E_n from NEG that is covered by a complex in STAR.
3. Find the set EXTENSION of selectors which cover E_s and exclude E_n.
4. Specialise STAR by replacing it with its cartesian product with EXTENSION.
5. Trim STAR by removing:
   - nil complexes, i.e. complexes containing contradictions,
   - complexes subsumed by others in STAR,
   - all but the m best complexes.
6. If STAR excludes all examples in NEG, return STAR. Otherwise continue at step 2.

Figure 2.10: The STAR algorithm.
Finally, an interesting aspect of the algorithm is the way in which it handles
completeness and consistency. It attempts to achieve completeness, maintaining
consistency at all times. Consistency is the primary goal and, at least in this
standard version of the algorithm, there is no tolerance of inconsistent
descriptions, making the algorithm incapable of coping with noise. The second
goal is completeness, which must also be achieved in order for a valid concept
description to be generated. This too is an absolute requirement, and it must be
relaxed if noise in the training set is to be handled.
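As a concrete illustration, the following Python sketch implements the covering
loop under simplifying assumptions: complexes are plain conjunctions of
attribute = value selectors, the star is grown naively, and LEF is reduced to a
single criterion (maximal coverage of positive examples). It is a minimal sketch
of the idea, not the implementation used in the project, and all names in it are
invented for illustration.

# Minimal sketch of the AQ covering loop. A complex is a dict
# {attribute: value}; it covers an example if all its selectors match.
# Assumes no contradictions in the training set, as the figure notes.

def covers(complex_, example):
    return all(example[a] == v for a, v in complex_.items())

def star(seed, neg, m):
    """Generate up to m complexes covering `seed` and excluding all of `neg`."""
    star_set = [{}]                      # the empty complex covers everything
    while any(covers(c, e) for c in star_set for e in neg):
        e_neg = next(e for e in neg if any(covers(c, e) for c in star_set))
        # selectors that cover the seed and exclude the chosen negative example
        extension = [{a: seed[a]} for a in seed if seed[a] != e_neg[a]]
        star_set = [dict(c, **s) for c in star_set for s in extension]
        # trim: drop duplicates, keep the m most general (shortest) complexes
        unique = {tuple(sorted(c.items())): c for c in star_set}
        star_set = sorted(unique.values(), key=len)[:m]
    return star_set

def aq_cover(pos, neg, m=5):
    pos, cover = list(pos), []
    while pos:
        seed = pos[0]                                    # take a seed example
        candidates = star(seed, neg, m)
        best = max(candidates,                           # simplified LEF
                   key=lambda c: sum(covers(c, p) for p in pos))
        cover.append(best)                               # new disjunct
        pos = [p for p in pos if not covers(best, p)]    # remove covered
    return cover

if __name__ == "__main__":
    pos = [{"shape": "round", "size": "small"},
           {"shape": "round", "size": "big"}]
    neg = [{"shape": "square", "size": "small"}]
    print(aq_cover(pos, neg))            # -> [{'shape': 'round'}]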
The ID3 Algorithm
ID3 makes use of a simple learning method, called the Concept Learning System
(CLS) (Hunt et al.), that was initially designed to perform single-concept
learning. This method uses a decision tree to represent the acquired knowledge.
Each node of the tree represents an attribute of the concept, and each branch
one value of the attribute.
Input:  a set of examples E, a set of attributes A, and a set of class
        values C.
Output: a decision tree T.

1. Initially, S = E.
2. Dichotomise S:
   If all the members of S belong to the same class, make the current node
   a leaf node and stop the dichotomisation.
   Else, select the attribute a_i that does best in discriminating between
   positive and negative examples in S:
   (a) Create a new node in the tree T.
   (b) For each value v_ij of a_i do:
       Dichotomise S_ij, where S_ij is the subset of examples
       corresponding to v_ij.
3. Return T.

Figure: Design of the basic CLS algorithm.
Each attribute value is examined as to how well it does in discriminating
between positive and negative examples of the concept. The figure above
describes the algorithm briefly.

As an example of the process, assume the following training set, describing the
concept of a ball:

Positive examples:    Shape       Size    Bouncy
                      round       small   Y
                      round       big     Y
                      round       small   N

Negative examples:    Shape       Size    Bouncy
                      square      small   N
                      round       big     N
                      triangular  small   N

Footnote: The original version of ID3 allows only nominal attributes.
Figure: The second step of the generation of a decision tree by CLS.

The only case in which discrimination is incomplete is when shape = round. If
bouncy is selected as the next attribute, then the tree takes the form shown in
the figure above; finally, if size is used, perfect discrimination will be
achieved, giving the tree shown below.

Figure: The decision tree finally generated by CLS.

It is possible to fail to achieve perfect discrimination after having used all
the attributes. In this case, CLS cannot provide an accurate description of the
concept. This inefficiency has been overcome in some versions of ID3, by
allowing probabilistic answers (e.g. IF shape = round THEN ball, with an
attached probability).

Another source of inefficiency, which can be detected from the previous example,
is that although a fairly good discrimination has been achieved by the first
attribute that was selected, the quest for perfect discrimination has led the
algorithm to use all the attributes. This may not be a substantial problem in
such a simple situation, but it is bound to cause inefficiency in large learning
problems. Moreover, it forces the algorithm to make use of attributes that are
not characteristic of the concept (e.g. the size), making the algorithm very
prone to noise in the training set.
In order to choose the best discriminating attribute, an information-theoretic
criterion is used, namely the Shannon entropy, which is given by the following
formula:

    entropy(a) = \sum_{i=1}^{n} \left[ -\frac{v_i^+}{v_i^+ + v_i^-}
                 \log\frac{v_i^+}{v_i^+ + v_i^-}
                 - \frac{v_i^-}{v_i^+ + v_i^-}
                 \log\frac{v_i^-}{v_i^+ + v_i^-} \right]

where n is the number of possible values that the attribute a can take, and
v_i^+ and v_i^- are the numbers of positive and negative examples for the i-th
attribute value. A later chapter presents several variations of this measure.

Footnote: Usually the notation p_{ij} is used for the proportion of instances
of class j having the i-th value of the attribute.

Footnote: Note also that log is used for \log_2 in the entropy calculations.
To illustrate the use of entropy in calculating the discriminatory power of
different attributes, assume that one wants to find the most discriminatory of
the three attributes in the ball example used in the previous section, at the
beginning of the learning process. The entropy for each one of them is
calculated as follows (terms of the form 0 log 0 and 1 log 1 vanish):

    entropy(shape)  = [-(3/4) log(3/4) - (1/4) log(1/4)] + 0 + 0 ≈ 0.811
    entropy(size)   = [-(2/4) log(2/4) - (2/4) log(2/4)]
                      + [-(1/2) log(1/2) - (1/2) log(1/2)] = 2
    entropy(bouncy) = 0 + [-(1/4) log(1/4) - (3/4) log(3/4)] ≈ 0.811

The attribute that will be selected is the one with the lowest entropy. In this
case, shape and bouncy have the same entropy, and the algorithm will have to
choose between the two based on other criteria (e.g. the number of attribute
values).
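The worked calculation above can be checked with a few lines of Python, using
the entropy variant reconstructed earlier (an unweighted sum of per-value
entropies); the (positive, negative) counts are taken from the ball training
set.

from math import log2

def entropy(value_counts):
    """Sum of per-value entropies; value_counts is a list of (pos, neg) pairs."""
    total = 0.0
    for pos, neg in value_counts:
        for x in (pos, neg):
            if x:                        # 0 * log(0) is taken as 0
                total -= (x / (pos + neg)) * log2(x / (pos + neg))
    return total

# (pos, neg) counts per attribute value, from the ball training set
print(entropy([(3, 1), (0, 1), (0, 1)]))  # shape  -> ~0.811
print(entropy([(2, 2), (1, 1)]))          # size   -> 2.0
print(entropy([(2, 0), (1, 3)]))          # bouncy -> ~0.811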
ID3 is a greedy divide-and-conquer algorithm, performing a hill-climbing search
which uses the entropy criterion as a search heuristic. This has proved an
efficient method for inducing effective discrimination functions, i.e. the
decision trees (Quinlan). Since its birth, the algorithm has attracted much
research in ML and has been used in numerous learning systems. The original
version of the algorithm has a few problems (overspecialisation; only nominal
attributes and binary classes are handled; etc.). Some of these problems have
been investigated, resulting in several improved versions of the algorithm. A
few of those extensions are discussed in the following section.
Extensions

The first SBL algorithms, like the ones described in the previous section, could
only deal with simple situations and did not perform very well on real data. The
research which followed that initial stage addressed some of those problems,
improving the standard versions of the algorithms. Some of the most important
extensions to the basic algorithms are the following:

- Handling numeric data, by discretisation.
- Pruning, i.e. reducing the induced concept description, in order to minimise
  the effect of noise.
- Incremental learning, in the presence of new information.
- Constructive induction, based on primitive features.

This section discusses each of those extensions, describing the work done so
far.
Handling Numeric Data

One of the basic problems of ID3 and AQ was the fact that they could only deal
with nominal (and, in the case of AQ, tree-structured) attributes. Since most
real-world classification problems involve numeric data, one of the early
changes to the algorithms was the handling of numeric attributes. The way in
which this is done is by discretisation of the attributes into value-ranges,
which makes it possible to process them as if they were nominal ones. Usually
only two ranges are used, leading to binary attribute-tests. Some of the
discretisation methods which are used in the algorithms that are examined in the
project are described in a later chapter.

A similar problem exists with the handling of numeric classes. Most of the
learning algorithms cannot handle numeric classes, which can take an infinite
number of values. Exceptions to this are regression trees (Breiman et al.) and
the NewID algorithm, described in a later chapter.
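As a minimal sketch of the idea (not one of the specific methods examined later
in the project), a binary discretisation could pick the cut point that minimises
the entropy measure introduced earlier; the data values below are invented for
illustration.

from math import log2

def split_entropy(values, labels, threshold):
    """Entropy of the binary attribute-test `value <= threshold`."""
    total = 0.0
    for side in (True, False):
        group = [l for v, l in zip(values, labels) if (v <= threshold) == side]
        for cls in set(group):
            p = group.count(cls) / len(group)
            total -= p * log2(p)
    return total

def best_threshold(values, labels):
    """Try cut points between sorted distinct values; keep the lowest entropy."""
    cuts = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    return min(candidates, key=lambda t: split_entropy(values, labels, t))

sizes  = [1.2, 1.5, 3.1, 3.4, 3.9]
labels = ["+", "+", "-", "-", "-"]
print(best_threshold(sizes, labels))   # -> 2.3, cleanly separating the classes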
Overfitting and Pruning

One of the main deficiencies of ID3 is that it cannot handle noise in the
training set. It always tries to achieve perfect discrimination, resulting in
very complicated trees in cases where the examples do not provide clear-cut
discrimination points for the different classes. In order to avoid this, several
methods have been proposed which provide mechanisms for stopping the growth of
the tree, or for pruning some of its already grown branches, when these do not
seem to provide any substantial increase in discrimination.

One of the early attempts to enhance decision trees by stopping the growth,
rather than by pruning, was presented by Quinlan (Quinlan) and uses the
chi-square significance test at each node of the tree. The way in which he
approximates this is by means of the following formula:

    \chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}

where n_{ij} is the number of examples with class c_i and attribute value v_j
for the attribute handled at the node, and e_{ij} is the expected number of
examples under the null hypothesis that the subpopulations of examples created
when we split E are drawn from the same population as E, where E denotes the
subset of examples at the node. Under this hypothesis,

    e_{ij} = \frac{\sum_j n_{ij} \cdot \sum_i n_{ij}}{n}

The problem with this method is that the proposed statistic is not a good
approximation of \chi^2 for small training sets. (Niblett and Bratko) propose an
improvement to this measure, by examining the contingency tables of the examples
per attribute value and class at each node of the tree. The idea is to calculate
the sum of the probabilities of the possible distributions of examples, which
leads to an exact calculation of the significance probability.
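The statistic itself is easy to compute from the contingency table of
class-by-value counts at a node. The sketch below is illustrative only: the
table is invented, and the critical value is the usual tabulated 95% point for
one degree of freedom, not a figure from the thesis.

def chi_square(table):
    """table[i][j]: number of examples with class i and attribute value j."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, n_ij in enumerate(r):
            e_ij = row[i] * col[j] / n       # expected count under the null
            stat += (n_ij - e_ij) ** 2 / e_ij
    return stat

# Stop growing the tree at this node if the attribute is not significant.
table = [[8, 2],      # class 1: counts for attribute values v1, v2
         [3, 7]]      # class 2
critical = 3.84       # chi-square, 1 degree of freedom, 95% level
print(chi_square(table), chi_square(table) > critical)  # 5.05..., True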
Quinlan (Quinlan) has also presented some work in pruning, and proposed an
optimisation criterion that offsets the complexity of the tree against its
observed classificational accuracy on the training examples. This is a similar
approach to the one taken in (Niblett and Bratko), which is based on a measure
of the classification error e(N) at each node N of the tree. Pruning takes place
when

    e(N) \le \sum_i p(v_i)\, e(E_i)

where p(v_i) is the probability of the i-th value of the attribute tested at N,
e(E_i) is the classification error at the child node corresponding to v_i, and
the sum ranges over the possible values of that attribute. Note that the error
estimates are associated with the number of examples falling into a certain
class.

For the calculation of e(N), the following formula is devised, under the
assumptions that the attribute can take only two values and that the prior
distribution of the k possible classes is uniform:

    e(N) = \frac{n - n_c + k - 1}{n + k}

which takes into account the number of classes k, the total number of examples
n at the node, and the number of examples n_c corresponding to the (majority)
class c.

The devised formula is used in a pruning algorithm which can be split into two
parts. The first is a recursive calculation of the classification error at each
node, starting from the root and ending at the leaves. The second reconstructs
the decision tree, starting from the leaves and working its way to the root. At
each stage, the algorithm uses the pruning criterion to decide whether a subtree
should be pruned or retained. Experiments with pruning methods have shown that a
significant increase in classification accuracy is possible (Niblett and
Bratko).
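A sketch of the resulting bottom-up procedure follows, assuming a hypothetical
node structure (nested dictionaries holding per-class counts) and estimating
p(v_i) from the example counts; it illustrates the criterion above rather than
reproducing the published algorithm.

def expected_error(n, n_c, k):
    """Expected-error estimate: n examples, n_c in the majority class,
    k classes, uniform class prior assumed."""
    return (n - n_c + k - 1) / (n + k)

def prune(node, k):
    """node = {'counts': [per-class counts], 'children': [subnodes] or None}.
    Returns the (possibly pruned) node, working bottom-up."""
    n, n_c = sum(node["counts"]), max(node["counts"])
    node["error"] = expected_error(n, n_c, k)
    if node["children"]:
        node["children"] = [prune(ch, k) for ch in node["children"]]
        backed_up = sum(sum(ch["counts"]) / n * ch["error"]
                        for ch in node["children"])
        if node["error"] <= backed_up:   # the subtree does not pay its way
            node["children"] = None      # prune: make this node a leaf
        else:
            node["error"] = backed_up
    return node

tree = {"counts": [6, 2],
        "children": [{"counts": [4, 2], "children": None},
                     {"counts": [2, 0], "children": None}]}
print(prune(tree, k=2)["children"] is None)   # True: the split is pruned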
A similar overspecialisation problem exists with AQ, which strives for
completeness and consistency. One of the descendants of AQ, the CN2 algorithm,
which is used in this project and is described in a later chapter, overcomes
this problem by changing the basic algorithm, in order to achieve premature
search-stopping when no significant improvement is observed.

Footnote: (Niblett and Bratko) claim that the extension to multiple attribute
values is straightforward.

Footnote: The assumption of an initially uniform class distribution is not true
for internal nodes of the tree, and an extension to the basic calculation is
proposed in (Niblett and Bratko).
Incremental Learning

Another major problem with ID3 is the fact that the decision tree cannot be
easily updated in the light of new data. This is necessary in incremental
learning situations (e.g. control learning), where data is provided to the
system continuously. In such a situation, ID3 would need to recalculate a new
tree each time new examples were provided, reconsidering all the previously
examined examples. This process is computationally expensive, especially as the
training set increases.

The first attempt to develop an incremental version of ID3 was made by Schlimmer
and Fisher (Schlimmer and Fisher), who developed an algorithm named ID4. The new
algorithm practically rebuilds the decision tree each time, but it has the
advantage that it stores the class distributions at each node, which can be used
to calculate the new entropy measures efficiently, without needing to re-examine
the training instances. The way in which ID4 works is by updating the
probability distributions at each node, starting from the root, and discarding
the subtree of a node when a better discriminating attribute exists.

The main problem with ID4 is that it does not always generate the same tree as
the pure ID3 algorithm would generate, given the new and the old information.
This can happen when the scores of the attributes at some node are not
substantially different and the instances are presented in such a way as to
change the ordering of the attributes each time. This may result in a repetition
of deletion and regeneration of subtrees which does not converge to a stable
situation.

In order to solve that problem, Utgoff has developed another incremental version
of the ID3 algorithm, named ID5 (Utgoff). The first important feature of this
algorithm is that it will always produce the same decision tree as the original
algorithm would generate. ID5 maintains the same statistical information as ID4
at each node, and it also looks for changes in the orderings of the attributes.
However, instead of replacing the old subtrees with new ones, it restructures
them, moving the new best attribute to the root of the subtree. The main
advantage of ID5, though, is that it maintains information about the instances
that it examines at the leaf-nodes. This is done by storing those
attribute-tests that do not appear in the path to the node.

Recent versions of the AQ algorithm also perform incremental induction. For
example, the algorithm AQ15 that is used in the project stores the class
probability distributions associated with each complex in the cover, and accepts
the covers as input, together with the new instances. Starting with those covers
as the initial hypotheses, and using the stored statistical information, the
algorithm generates the new set of covers.
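The bookkeeping that both ID4 and ID5 rely on (class-distribution counts stored
at each node and updated per example, so that entropies can be recomputed
without revisiting old instances) can be sketched in a few lines. The class
below is purely illustrative; neither algorithm's restructuring logic is shown.

from collections import defaultdict

class NodeStats:
    """Class-distribution counts kept at a decision-tree node, updated
    incrementally as each new training example arrives."""
    def __init__(self):
        # counts[attribute][(value, cls)] -> number of examples seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, example, cls):
        for attribute, value in example.items():
            self.counts[attribute][(value, cls)] += 1

    def distribution(self, attribute, value, classes):
        """Per-class counts for one attribute value, e.g. to feed an
        entropy calculation without re-examining the training instances."""
        return [self.counts[attribute][(value, c)] for c in classes]

stats = NodeStats()
stats.add({"shape": "round", "size": "small"}, "+")
stats.add({"shape": "round", "size": "big"}, "-")
print(stats.distribution("shape", "round", ["+", "-"]))   # [1, 1]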
Constructive Induction

As mentioned above, constructive induction is the process by which new features
are generated, based on the initially specified feature set. There are two main
approaches to the problem:

- Knowledge-driven feature construction.
- Data-driven feature construction.

In knowledge-driven feature construction, background knowledge is provided to
the learning system, determining the way in which features can be combined to
form new ones. This method is used in AQ and is described in a later section.

Data-driven feature construction is more difficult, and it makes use of
statistical clustering techniques for the discovery of new attributes. An
example of this approach is Rendell's conceptual clustering system PLS,
described in (Rendell).

In PLS, the inductive process is subdivided into a number of stages, which lead
gradually to the inductive goal. At each stage, the system moves to a higher
level of abstraction, imposing constraints, reducing complexity, extracting
meaning and increasing regularity. All of these transformations are very
important for the inductive task, since they enable it to discover hidden
information in the training set and arrange it in a meaningful way. There are
thus three levels of abstraction involved in the feature construction that PLS
performs:

- Sub-object Relationships: In this first stage, the primitive space is
  arbitrarily partitioned into subspaces, for which the utility is calculated.

- Pattern Classes: At this level, similarities are identified between the
  calculated subspace or sub-object utilities, resulting in the construction of
  more general classes. This is achieved by the extraction of common patterns,
  using the clustering method.

- Pattern Groups: At this final stage, some transformation operators are applied
  (e.g. rotation of the pattern, in pattern recognition) on primitive members of
  the classes, in order for similar classes to be grouped together.

Footnote: The initial features are assumed to contain very little encoded
knowledge about the characteristics of the concept.

Footnote: The reader is referred to the description of PLS in a later section
for the definition of this measure.

Constructive induction has recently been recognised as an important research
area, which can provide a solution to the feature acquisition bottleneck of ML.
Computational Learning Theory (COLT)

Valiant's Probably Approximately Correct (PAC) Learning Model

In contrast with the large amount of work that has been done in formalising and
developing concept learning systems, very little has been done to measure the
complexity of the learning task. One of the first who attempted to do so was
Valiant, who introduced a deductive method for automatic program acquisition
(Valiant b; Valiant a). The most important element of this work is the analysis
of the learning task, in order to assess the performance of the method and
explore the potential for improvement. This resulted in a measure of the
complexity of the learning task, in terms of the number of examples that an
algorithm needs in order to induce a correct (i.e. complete and consistent)
hypothesis. This is called the sample complexity of the problem and, as Valiant
has proved in his worst-case analysis, it is exponential in nature. However, if
one assumes that there is a fixed, though unknown, relative frequency with which
positive examples of the concept to be learned occur, one can calculate the
number of examples from which the algorithm can probably generate a hypothesis
that is a good approximation to the concept.

On this basis, a definition is proposed for a learnable concept:

    A concept is learnable if and only if an algorithm can be developed which
    will be able, in polynomial time, to generate a hypothesis that will
    correctly classify most (the error allowed is also specified) of the
    instances of the concept.

What makes Valiant's model especially interesting is the calculation of an upper
bound for the number of examples that need to be considered by an algorithm in
order to arrive at a correct (in the above probabilistic sense) hypothesis of a
learnable concept. This upper bound depends linearly on the size of the
hypothesis space and on an independent parameter which specifies the acceptable
error limits.
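For reference, the now-standard finite-hypothesis-space form of the PAC sample
bound from the later literature (which depends on the logarithm of |H|, and is
not necessarily the exact expression Valiant derives) is
m >= (1/epsilon)(ln|H| + ln(1/delta)); a few lines suffice to evaluate it.

from math import ceil, log

def pac_sample_bound(h_size, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite hypothesis
    space of size h_size to be epsilon-accurate with probability 1 - delta.
    Standard later formulation, not Valiant's original bound."""
    return ceil((log(h_size) + log(1 / delta)) / epsilon)

# e.g. conjunctions over 10 boolean attributes: |H| = 3^10 + 1
print(pac_sample_bound(3 ** 10 + 1, epsilon=0.1, delta=0.05))  # -> 140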

Footnote: The assumption of a fixed probability distribution of examples is
likely to hold for most natural learning problems and, since it is not required
to be known, it is not expected to be a real limitation.

Footnote: Valiant considers concept learning as a subset of his more general
definition of learning as an automatic program acquisition process.
(Valiant a) uses this model to analyse the performance of the learning method on
the following standard types of concepts:

- disjunctive concepts (DNF expressions, containing only disjunctions of atoms);
- conjunctive concepts (CNF expressions, containing only conjunctions of atoms);
- k-DNF (made of disjunctively connected blocks, which contain up to k
  conjunctively connected atoms);
- k-CNF (equivalent to k-DNF); and
- internally disjunctive concepts (conjunctively connected atoms, each of which
  corresponds to one attribute, but allows the assignment of more than one value
  to it).

Assuming boolean attributes and a finite hypothesis space, Valiant proves that
the method is able to learn these concepts efficiently, i.e. in polynomial time.
This assumption, however, limits substantially the applicability of the
analysis, and it has been relaxed in later research.
Using the PAC Learning Model to Measure Inductive Bias

(Haussler) presents a method of using Valiant's probabilistic framework to show
that simple inductive learning algorithms can perform near-optimally. At the
same time, he relaxes Valiant's assumptions restricting the structure of the
hypothesis space, using inductive bias to overcome the problems caused by this
relaxation. Inductive bias in this context stands for the restrictions on the
hypothesis space imposed by the nature of the learning algorithm (heuristics,
initial starting state, etc.). In order to quantify the strength of this bias
and measure the performance of the algorithms, he makes use of the growth
function, introduced by Vapnik and Chervonenkis. The outcome is

    a measure that relates the strength of a bias to the performance of
    learning algorithms that use it, so that it will be useful in analysing
    and comparing learning algorithms. (Haussler)

In contrast with Valiant's analysis, Haussler allows for all three types of
attributes (nominal, numeric and tree-structured); he also assumes that nominal
attributes can be represented by corresponding tree-structured ones. As a
result, infinite hypothesis spaces are also allowed.

Footnote: 'Atom' is a term used in COLT for attribute-tests.

Footnote: For a more detailed examination of inductive bias, the reader is
referred to (Utgoff).
In order to expand the model to infinite hypothesis spaces, the growth function
of the hypothesis space is used, symbolised by \Phi_H(m), where m \le |X|, H is
the hypothesis space, |X| is the cardinality of the example set, and m is the
cardinality of the subset of X that is being examined, the minimum size of which
is used for the determination of the sample complexity. This function measures
the maximum number of dichotomies of a subset of the example set, using all
hypotheses from the hypothesis space. The notion of dichotomy here is similar to
that used in decision trees. Examples (or instances) are labelled as + or -
according to whether their attribute values adhere to a specific hypothesis. For
example, assume a numeric attribute size and three instances with increasing
values of it. A hypothesis size \le w, with w greater than all three values,
will label all three instances as +; in decision-tree terms, the three instances
would belong to the same node. A hypothesis with w between the first and the
second value, however, labels the first instance as + and the other two as -; in
a decision tree, the first instance would belong to a different node than the
other two.

With three or more distinct attribute values in the training subset, there are a
number of dichotomies which cannot be achieved by any hypothesis of the
hypothesis space. In the previous example, there is no way to get the second
instance labelled as - and the other two as +; in decision-tree terms, one
cannot have instances 1 and 3 in the same node without having the second one
too. Thus the growth function \Phi_H(m) < 2^m for m \ge 3. This leads to another
definition, that of the Vapnik-Chervonenkis dimension VCdim(H) of a hypothesis
space H, which is the cardinality of the largest subset of the example set that
can be shattered, i.e. for which all possible dichotomies can be achieved by H.
Thus, for single-attribute conjunctive concepts, VCdim(H) = 2, irrespective of
the type of the attribute, allowing for an infinite hypothesis space. Similarly,
Haussler proves that if the instance space is defined by n attributes and the
concept to be induced is conjunctive, then
    n \le \mathrm{VCdim}(H) \le 2n

and

    \Phi_H(m) \le \left(\frac{em}{n}\right)^{2n} \quad \text{for all } m \ge n

where e is the base of the natural logarithm.

Note: similar bounds hold for the other four types of concepts that are being
considered.

Based on this result, Haussler claims the following:

    Because it reflects limitations on the power of discrimination and
    expression inherent in the hypothesis space H, the growth function
    \Phi_H(m) is a natural way to quantify the bias of learning algorithms
    that use H. It is also a useful measure of the bias. (Haussler)
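The dichotomy counts discussed above can be verified by brute force. The sketch
below enumerates the labellings of m collinear points achievable by
single-attribute interval hypotheses (lo <= x <= hi), confirming that
\Phi_H(m) < 2^m once m >= 3, while all four dichotomies of two points are
achievable (so VCdim(H) = 2).

from itertools import combinations

def interval_dichotomies(points):
    """Labellings of `points` achievable by hypotheses lo <= x <= hi."""
    points = sorted(points)
    achieved = {tuple(False for _ in points)}          # the empty interval
    for i, j in combinations(range(len(points)), 2):
        lo, hi = points[i], points[j]
        achieved.add(tuple(lo <= p <= hi for p in points))
    for p in points:                                   # degenerate intervals
        achieved.add(tuple(q == p for q in points))
    return achieved

for m in (2, 3, 4):
    print(m, len(interval_dichotomies(range(m))), 2 ** m)
# -> 2 4 4    (all dichotomies achievable: VCdim >= 2)
# -> 3 7 8    (labelling only the first and third point + is impossible)
# -> 4 11 16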
The final outcome of this work is a new measure of the maximum probability of
learning a hypothesis with an approximation error greater than \epsilon:

    P(\mathrm{error}_L > \epsilon) \le 2\, \Phi_{H_C^L(m)}(2m)\, 2^{-\epsilon m / 2}

where C is a class of target concepts, L is the learning algorithm, m is the
number of examples in the random example set, and H_C^L(m) is the effective
hypothesis space of L for target concepts in C and sample size m, denoting the
set of hypotheses generated by L for C and m.

The importance of this measure, as well as of the ones derived from it, is that
it holds for any learning algorithm and it incorporates a measure of its
inductive bias. Thus it limits the complexity upper bounds to more realistic and
near-optimal levels.
Criticisms and Extensions to the PAC Learning Model

There are a few problems with the basic PAC model which make it unusable for
real problems. Some of these are the following:

- The estimate of the sample complexity provided by the model is a worst-case
  one.
- Noise-free data are assumed.
- The model cannot be readily applied to:
  - incremental learning systems;
  - learning problems with multi-valued functions;
  - systems that use background knowledge.

The realisation of these problems has led to a number of extensions to the basic
model and the development of new theoretical learning models. The following are
a few examples of such models:

- Distribution-specific model: the distribution on the instance space is taken
  into account for the calculation of the sample complexity.
- Probability-of-mistake model: this model refers especially to incremental
  algorithms, using as a performance measure the probability that the algorithm
  will misclassify the n-th randomly selected training example.
- Total-mistake-bound model: this is another incremental learning model, which
  uses as a performance measure the total number of misclassifications that the
  algorithm will make in the worst case (the worst-case mistake bound).

(Haussler) gives a more detailed account of recent research in the field,
including a number of interesting new models.
Summary

This chapter has provided an overview of the work that has been done in the
field of empirical concept-learning, which is the type of learning used in this
project. Concept-learning has been an attractive and very active research area
during the last decade, mainly due to its applicability to real-world problems.
In this type of learning, the task of acquiring the description of a concept is
seen as a search problem, where the search space is the set of all possible
descriptions and the goal is the identification of the one which best fits the
training data. In that context, the learning algorithm provides the operators
for moving between states in the search space.

The problems that are being attacked with concept-learning methods (e.g.
prediction, diagnosis, etc.) have been approached with statistical methods
before. The field of statistics which deals with this type of task is called
classification, and its goal is the derivation of a function which fits a set of
given data. This function provides a mapping between objects and the classes to
which they belong. Objects in statistical classification are represented in
terms of a set of features, which are also used in the definition of the
classification function.

Empirical concept-learning bears a number of similarities to statistical
classification, examples of which are the following:

- The use of features/attributes to describe objects and define concepts.
- The use of various statistical techniques (e.g. the heuristics used in the
  learning algorithms) in order to identify similarities between objects and
  generalise them into concept-descriptions.
- The correspondence between the search space used in learning and the feature
  space used for the definition of the classification function.

Thus the two approaches can be seen as equivalent, with the exception of the
types of attributes that they use. Statistical approaches have focused on
numeric attributes, while nominal ones are favoured in learning. This difference
is a result of the way in which classification functions and
concept-descriptions are defined.
The former are mathematical expressions, while the latter tend to be represented
as rules, using a logic-like notation.

Most of the research done in empirical concept-learning has concentrated on a
few algorithms, two of which have been described in this chapter. These two
algorithms have been very popular in ML research and form the basis of the
learning systems that will be described in the following chapter. In terms of
classification, they both can be seen as conceptual clusterers, which divide the
feature space into orthogonal hyper-rectangles. Each such rectangle contains
examples of one class and, together with the rest of the rectangles
corresponding to the class, is used as its definition.

Computational Learning Theory is an area which deals with the complexity of the
concept-acquisition task and has recently become very popular. It attempts to