Cached Sufficient Statistics for Efficient Machine Learning


Journal of Artificial Intelligence Research

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Andrew Moore (awm@cs.cmu.edu)
Mary Soon Lee (mslee@cs.cmu.edu)
School of Computer Science and Robotics Institute, Carnegie Mellon University, Pittsburgh, PA
Abstract

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table.

We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves.

We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.
Caching Sufficient Statistics

Computational efficiency is an important concern for machine learning algorithms, especially when applied to large datasets (Fayyad, Mannila, & Piatetsky-Shapiro, 1997; Fayyad & Uthurusamy, 1996) or in real-time scenarios. In earlier work we showed how kd-trees with multiresolution cached regression matrix statistics can enable very fast locally weighted and instance-based regression (Moore, Schneider, & Deng, 1997). In this paper we attempt to accelerate predictions for symbolic attributes using a kind of kd-tree that splits on all dimensions at all nodes.

Many machine learning algorithms operating on datasets of symbolic attributes need to do frequent counting. This work is also applicable to Online Analytical Processing (OLAP) applications in data mining, where operations on large datasets such as multidimensional database access, DataCube operations (Harinarayan, Rajaraman, & Ullman, 1996), and association rule learning (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996) could be accelerated by fast counting.
Let us begin by establishing some notation. We are given a dataset with R records and M attributes. The attributes are called a_1, a_2, ..., a_M. The value of attribute a_i in the kth record is a small integer lying in the range {1, 2, ..., n_i}, where n_i is called the arity of attribute i. Figure 1 gives an example.
Attributes:  a_1   a_2   a_3          (M = 3)
Arity:       n_1 = 2, n_2 = 4, n_3 = 2

Record 1:     1     1     1
Record 2:     2     3     1
Record 3:     2     4     2
Record 4:     1     3     1
Record 5:     2     3     1
Record 6:     1     3     1           (R = 6)

Figure 1: A simple dataset used as an example. It has R = 6 records and M = 3 attributes.
Queries

A query is a set of (attribute = value) pairs in which the left hand sides of the pairs form a subset of {a_1, ..., a_M}, arranged in increasing order of index. Examples of queries for our dataset include the empty query (), (a_1 = 1), (a_2 = 3, a_3 = 1), and (a_1 = 2, a_2 = 4, a_3 = 2).

Notice that the total number of possible queries is ∏_{i=1}^{M} (n_i + 1). This is because each attribute can either appear in the query with one of the n_i values it may take, or it may be omitted (which is equivalent to giving it a "don't care" value).
Counts

The count of a query, denoted C(Query), is simply the number of records in the dataset matching all the (attribute = value) pairs in Query. For our example dataset we find, for instance:

C(a_1 = 1) = 3
C(a_3 = 2) = 1
C() = 6
C(a_1 = 2, a_2 = 3) = 2
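As a concrete illustration, here is a minimal Python sketch (the dataset literal and the helper name `count` are ours, not from the paper) of direct counting over the Figure 1 dataset:

```python
# The Figure 1 dataset: 6 records, 3 attributes (1-based values).
DATA = [
    (1, 1, 1),
    (2, 3, 1),
    (2, 4, 2),
    (1, 3, 1),
    (2, 3, 1),
    (1, 3, 1),
]

def count(query):
    """Count records matching a query, given as {attribute_index: value} (1-based)."""
    return sum(
        all(record[i - 1] == v for i, v in query.items())
        for record in DATA
    )

print(count({}))            # C() = 6
print(count({1: 1}))        # C(a1 = 1) = 3
print(count({1: 2, 2: 3}))  # C(a1 = 2, a2 = 3) = 2
```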

Con tingency T ables
Eac h subset of attributes a a has an asso ciated c ontingency table denoted b y
i i n
ct a a This is a table with a r o w f h of t he p ssible o sets of v alues for
i i n
a a The ro w corresp nding o to a v a v records the coun t C a
n
i i n i i n i

v a v Our example dataset has attributes and so tingency tables
i n n
exist depicted i n Figure

con

eac or






ts



Ca ched Sufficient St a tistics f or Efficient Ma chine Learning
ct() ct(a ) ct(a ) ct(a ,a ,a )
1 3 1 2 3
# a # a # aaa #
1 3 123
613 1 5 1111
23 2 1 1120
ct(a )
2
1210
a #
2 1220
11 ct(a ,a ) ct(a ,a ) 1312
1223
20 1320
a a## a a
34 1 2 2 3 1410
41 111 111 1420
120 120 2110
ct(a , a )
1 3
132 210 2120
a a #
1 3 140 220 2210
113 210 314 2220
120 220 320 2312
212 232 410 2320
221 241 421 2410
2421
Figure The eigh t p ossible con tingency tables or f the dataset of Figure
A conditional contingency table, written

    ct(a_{i1}, ..., a_{in} | a_{j1} = u_1, ..., a_{jp} = u_p)

is the contingency table for the subset of records in the dataset that match the query to the right of the | symbol. For example, ct(a_3 | a_1 = 2) for our dataset has the rows (a_3 = 1, count 2) and (a_3 = 2, count 1).

Contingency tables are used in a variety of machine learning applications, including building the probability tables for Bayes nets and evaluating candidate conjunctive rules in rule learning algorithms (Quinlan, 1990; Clark & Niblett, 1989). It would thus be desirable to be able to perform such counting efficiently.

If we are prepared to pay a one-time cost for building a caching data structure, then it is easy to suggest a mechanism for doing counting in constant time: for each possible query we precompute the contingency table. The total number of numbers stored in memory for such a data structure would be ∏_{i=1}^{M} (n_i + 1), which even for our humble dataset of Figure 1 is 45, as revealed by Figure 2. For a real dataset with more than a handful of medium-arity attributes, or more than a dozen or so binary attributes, this is far too large to fit in main memory.

We would like to retain the speed of precomputed contingency tables without incurring an intractable memory demand. That is the subject of this paper.
Cache Reduction I: The Dense ADtree for Caching Sufficient Statistics

First we will describe the ADtree, the data structure we will use to represent the set of all possible counts.¹ Our initial, simplified description is an obvious tree representation that does not yield any immediate memory savings, but will later provide several opportunities for cutting off zero counts and redundant counts. This structure is shown in Figure 3. An ADtree node (shown as a rectangle) has child nodes called "Vary nodes" (shown as ovals). Each ADtree node represents a query and stores the number of records that match the query (in its "c =" field). The Vary a_j child of an ADtree node has one child for each of the n_j values of attribute a_j. The kth such child represents the same query as Vary a_j's parent, with the additional constraint that a_j = k.

¹ The SE-tree (Rymon, 1993) is a similar data structure.
[Figure 3 sketches the top levels of an ADtree: the root ADnode (a_1 = *, ..., a_M = *, c = #) has Vary a_1, ..., Vary a_M children; each Vary a_j node has one ADnode child per value of a_j, and each of those ADnodes in turn has Vary a_{j+1}, ..., Vary a_M children, and so on.]

Figure 3: The top of an ADtree, as described in the text.
Notes regarding this structure:

- Although drawn on the diagram, the description of the query (a_1 = *, ..., a_M = *) on the leftmost node of the second level is not explicitly recorded in the ADnode. The contents of an ADnode are simply a count and a set of pointers to its Vary a_j children.
- The contents of a Vary a_j node are a set of pointers to ADnodes.
- The cost of looking up a count is proportional to the number of instantiated variables in the query. For example, to look up a count such as C(a_1 = 1, a_3 = 2), we would follow the path Vary a_1 → (a_1 = 1) → Vary a_3 → (a_3 = 2) in the tree, and the count is then obtained from the resulting node.
- Notice that if a node ADN has Vary a_i as its parent, then ADN's children are Vary a_{i+1}, Vary a_{i+2}, ..., Vary a_M. It is not necessary to store Vary nodes with indices below i+1, because that information can be obtained from another path in the tree.
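To make the structure concrete, here is a minimal Python sketch (class and field names are ours) of the two node types described above:

```python
class ADNode:
    """Represents a query; stores its count and one Vary child per remaining attribute."""
    def __init__(self, count, vary_children):
        self.count = count                  # number of records matching this node's query
        self.vary_children = vary_children  # dict: attribute index j -> VaryNode for "Vary a_j"

class VaryNode:
    """Fans out on one attribute a_j; one ADNode child per value a_j = 1 .. n_j."""
    def __init__(self, attribute, children):
        self.attribute = attribute          # index j of the attribute this node varies
        self.children = children            # dict: value k -> ADNode for "... , a_j = k"
```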


Cutting off nodes with counts of zero

As described, the tree is not sparse and will contain exactly ∏_{i=1}^{M} (n_i + 1) ADnodes. Sparseness is easily achieved by storing a NULL instead of a node for any query that matches zero records. All of the specializations of such a query will also have a count of zero, and they will not appear anywhere in the tree. For some datasets this can reduce the number of numbers that need to be stored. For example, the dataset in Figure 1, which previously needed 45 numbers to represent all contingency tables, now needs only 22 (one per query with a non-zero count).
Cache Reduction II: The Sparse ADtree

It is easy to devise datasets for which there is no benefit in failing to store counts of zero. Suppose we have M binary attributes and 2^M records, in which the kth record is the bits of the binary representation of k. Then no query has a count of zero, and the tree contains 3^M nodes. To reduce the tree size despite this, we will take advantage of the observation that very many of the counts stored in the above tree are redundant.

Each Vary a_j node in the above ADtree stores n_j subtrees, one subtree for each value of a_j. Instead, we will find the most common of the values of a_j (call it MCV) and store a NULL in place of the MCVth subtree. The remaining n_j − 1 subtrees will be represented as before. An example for a simple dataset is given in Figure 4. Each Vary a_j node now records which of its values is most common (in an mcv field). Appendix B describes the straightforward algorithm for building such an ADtree.

As we will see below, it is still possible to build full, exact contingency tables (or to give counts for specific queries) in time that is only slightly longer than for the full ADtree described above. But first, let us examine the memory consequences of this representation. Appendix A shows that for binary attributes, given M attributes and R records, the number of nodes needed to store the tree is bounded above by 2^M in the worst case, and is much smaller when R is much less than 2^M. In contrast, the amount of memory needed by the dense tree described above is 3^M in the worst case.

Notice in Figure 4 that the MCV value is context dependent. Depending on constraints on parent nodes, a_2's MCV is sometimes 1 and sometimes 2. This context dependency can provide dramatic savings if, as is frequently the case, there are correlations among the attributes. This is discussed further in Appendix A.
Computing Contingency Tables from the Sparse ADtree

Given an ADtree, we wish to be able to quickly construct contingency tables for any arbitrary set of attributes {a_{i1}, ..., a_{in}}.

Notice that a conditional contingency table ct(a_{i1}, ..., a_{in} | Query) can be built recursively. We first build

    ct(a_{i2}, ..., a_{in} | a_{i1} = 1, Query)
    ct(a_{i2}, ..., a_{in} | a_{i1} = 2, Query)
    ...
    ct(a_{i2}, ..., a_{in} | a_{i1} = n_{i1}, Query)

and then concatenate them.








[Figure 4 shows a sparse ADtree built for an 8-record, 2-attribute dataset (shown in the bottom right of the figure: records (1,2), (2,1), (2,1), (2,2), (3,1), (3,2), (3,2), (3,2)). MCV subtrees and zero-count subtrees are replaced by NULL.]

Figure 4: A sparse ADtree built for the dataset shown in the bottom right. The most common value for a_1 is 3, and so the a_1 = 3 subtree of the Vary a_1 child of the root node is NULL. At each of the Vary a_2 nodes the most common child is also set to NULL. Which child is most common depends on the context.
For example, to build ct(a_1, a_3) using the dataset in Figure 1, we can build ct(a_3 | a_1 = 1) and ct(a_3 | a_1 = 2) and combine them, as in Figure 5.
[Figure 5 shows ct(a_3 | a_1 = 1) and ct(a_3 | a_1 = 2) being concatenated to form the four-row table ct(a_1, a_3).]

Figure 5: An example (using numbers from Figure 1) of how contingency tables can be combined recursively to form larger contingency tables.
When building a conditional contingency table from an ADtree, we will not need to explicitly specify the query condition. Instead, we will supply an ADnode of the ADtree, which implicitly provides equivalent information. The algorithm is:
MakeContab({a_{i1}, ..., a_{in}}, ADN):
    Let VN := the Vary a_{i1} subnode of ADN
    Let MCV := the most common value recorded in VN
    For k := 1, 2, ..., n_{i1}:
        If k ≠ MCV:
            ADN_k := the (a_{i1} = k) subnode of VN
            CT_k := MakeContab({a_{i2}, ..., a_{in}}, ADN_k)
    CT_MCV := calculated as explained below
    Return the concatenation of CT_1, ..., CT_{n_{i1}}

The base case of this recursion occurs when the first argument is empty, in which case we return a one-element contingency table containing the count associated with the current ADnode ADN.

There is an omission in the algorithm. In the iteration over k ∈ {1, ..., n_{i1}}, we are unable to compute the conditional contingency table CT_MCV, because the a_{i1} = MCV subtree is deliberately missing (as described in the previous section). What can we do instead?
We can take advantage of the following property of contingency tables:

    ct(a_{i2}, ..., a_{in} | Query) = Σ_{k=1}^{n_{i1}} ct(a_{i2}, ..., a_{in} | a_{i1} = k, Query)

The value ct(a_{i2}, ..., a_{in} | Query) can be computed from within our algorithm by calling

    MakeContab({a_{i2}, ..., a_{in}}, ADN)

and so the missing conditional contingency table in the algorithm can be computed by the following row-wise subtraction:

    CT_MCV = MakeContab({a_{i2}, ..., a_{in}}, ADN) − Σ_{k ≠ MCV} CT_k
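A minimal Python sketch of this recursion follows, assuming the ADNode/VaryNode classes sketched earlier, with the Vary nodes additionally carrying the mcv field introduced in this section; the representation of a contingency table as a flat list of counts in value order is our own choice:

```python
def table_size(attributes, arities):
    size = 1
    for a in attributes:
        size *= arities[a]
    return size

def make_contab(attributes, adnode, arities):
    """Return the contingency table over `attributes` for the records under `adnode`,
    as a flat list of counts ordered lexicographically by attribute value."""
    if not attributes:
        return [adnode.count if adnode is not None else 0]
    if adnode is None:                       # a pruned zero-count branch
        return [0] * table_size(attributes, arities)
    first, rest = attributes[0], attributes[1:]
    vary = adnode.vary_children[first]
    sub_tables = {}
    for k in range(1, arities[first] + 1):
        if k != vary.mcv:
            sub_tables[k] = make_contab(rest, vary.children.get(k), arities)
    # Recover the MCV slice by row-wise subtraction from the unconditioned table.
    whole = make_contab(rest, adnode, arities)
    sub_tables[vary.mcv] = [w - sum(t[i] for t in sub_tables.values())
                            for i, w in enumerate(whole)]
    return [c for k in range(1, arities[first] + 1) for c in sub_tables[k]]
```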
Frequent Sets (Agrawal et al., 1996), which are traditionally used for learning association rules, can also be used for computing counts. A recent paper (Mannila & Toivonen, 1996), which also employs a similar subtraction trick, calculates counts from Frequent Sets. Later, when we discuss alternative data structures, we will compare the strengths and weaknesses of Frequent Sets and ADtrees.
Complexity of building a contingency table

What is the cost of computing a contingency table? Let us consider the theoretical worst-case cost of computing a contingency table for n attributes, each of arity k. Note that this cost is unrealistically pessimistic except when k is small, because most contingency tables are sparse, as discussed later. The assumption that all attributes have the same arity k is made to simplify the calculation of the worst-case cost, but is not needed by the code.


A contingency table for n attributes has k^n entries. Write C(n) for the cost of computing such a contingency table. In the top-level call of MakeContab there are k calls to build contingency tables from n − 1 attributes: k − 1 of these calls build CT_j = ct(a_{i2}, ..., a_{in} | a_{i1} = j, Query) for every j ∈ {1, ..., k} except the MCV, and the final call builds ct(a_{i2}, ..., a_{in} | Query). Then there are k − 1 subtractions of contingency tables, each of which requires k^{n−1} numeric subtractions. So we have

    C(0) = 1
    C(n) = k C(n − 1) + (k − 1) k^{n−1}    if n ≥ 1.

The solution to this recurrence relation is C(n) = k^n + n (k − 1) k^{n−1}: this cost is loglinear in the size of the contingency table. By comparison, if we used no cached data structure but simply counted through the dataset in order to build a contingency table, we would need O(nR + k^n) operations, where R is the number of records in the dataset. We are thus cheaper than the standard counting method when k^n is much smaller than R. We are interested in large datasets, in which R may be very large; in such a case our method will provide a speedup of several orders of magnitude for, say, a contingency table of eight binary attributes. Notice that this cost is independent of M, the total number of attributes in the dataset, and depends only upon the (almost always much smaller) number of attributes n requested for the contingency table.
Sparse representation of contingency tables

In practice we do not represent contingency tables as multidimensional arrays, but rather as tree structures. This gives both the slow counting approach and the ADtree approach a substantial computational advantage in cases where the contingency table is sparse, i.e., has many zero entries. Figure 6 shows such a sparse contingency table representation. This can mean that average-case behavior is much faster than the worst case for contingency tables with large numbers of attributes or high-arity attributes.

Indeed, our experiments in the results section show costs rising much more slowly than O(n k^n) as n increases. Note too that when using a sparse representation, the worst-case cost for MakeContab is O(min(nR, n k^n)), because R is the maximum possible number of non-zero contingency table entries.
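A sketch of such a sparse, tree-structured contingency table, using nested dictionaries keyed only by values that actually occur (the representation and function name are ours; the example reuses the DATA list from the earlier counting sketch):

```python
def sparse_ct(records, attrs):
    """Nested-dict contingency table: one level per attribute in `attrs` (1-based indices);
    zero-count branches are simply absent. Leaves are counts."""
    if not attrs:
        return len(records)
    first, rest = attrs[0], attrs[1:]
    groups = {}
    for r in records:
        groups.setdefault(r[first - 1], []).append(r)
    return {v: sparse_ct(g, rest) for v, g in groups.items()}

# e.g. sparse_ct(DATA, [1, 3]) on the Figure 1 data -> {1: {1: 3}, 2: {1: 2, 2: 1}}
```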
Cache Reduction III: Leaf lists

We now introduce a scheme for further reducing memory use. It is not worth building the ADtree data structure for a small number of records. For example, suppose we have a dataset of only a few dozen records over many binary attributes. The analysis in Appendix A shows that in the worst case the ADtree might still require a large number of nodes, yet computing contingency tables using the resulting ADtree would, with so few records, be no faster than the conventional counting approach, which would merely require us to retain the dataset in memory.

Aside from concluding that ADtrees are not useful for very small datasets, this also leads to a final method for saving memory in large ADtrees: any ADtree node matching fewer than R_min records does not expand its subtree. Instead, it maintains a list of pointers into the original dataset, explicitly listing those records that match the current ADnode. Such a list of pointers is called a leaf list. Figure 7 gives an example.

[Figure 6 contrasts the full contingency table ct(a_1, a_2, a_3) for the Figure 1 dataset (left) with its sparse representation as a tree of nested value branches (right), in which branches whose counts are all zero are replaced by NULL.]

Figure 6: The right hand figure is the sparse representation of the contingency table on the left.
The use of leaf lists has one minor and two major consequences. The minor consequence is the need for a straightforward change in the contingency-table-generating algorithm to handle leaf list nodes; this minor alteration is not described here. The first major consequence is that the dataset itself must now be retained in main memory, so that algorithms that inspect leaf lists can access the rows of data pointed to by those leaf lists. The second major consequence is that the ADtree may require much less memory. This is documented in the experimental results, and worst-case bounds are provided in Appendix A.
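A minimal sketch (names ours) of how an unexpanded node with a leaf list might be represented:

```python
class LeafListNode:
    """An ADtree node that is not expanded further: it keeps the row numbers of the
    records matching its query, so counts below it are obtained by scanning those rows."""
    def __init__(self, row_numbers):
        self.rows = row_numbers        # pointers (indices) into the retained dataset
        self.count = len(row_numbers)
```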
Using ADtrees for Machine Learning

As we will see in the experimental results, the ADtree structure can substantially speed up the computation of contingency tables for large real datasets. How can machine learning and statistical algorithms take advantage of this? Here we provide three examples: feature selection, Bayes net scoring, and rule learning. But it seems likely that many other algorithms can also benefit, for example stepwise logistic regression, GMDH (Madala & Ivakhnenko, 1994), and text classification. Even decision tree learning (Quinlan, 1983; Breiman, Friedman, Olshen, & Stone, 1984) may benefit.² In future work we will also examine ways to speed up nearest neighbor and other memory-based queries using ADtrees.

² This depends on whether the cost of initially building the ADtree can be amortized over many runs of the decision tree algorithm. Repeated runs of decision tree building can occur if one is using the wrapper model of feature selection (John, Kohavi, & Pfleger, 1994), or if one is using a more intensive search over tree structures than the traditional greedy search (Quinlan, 1983; Breiman et al., 1984).
[Figure 7 shows an ADtree built with leaf lists for a 9-record, 3-attribute dataset (listed on the right of the figure). Nodes matching no more than R_min records store a list of matching row numbers (e.g., "See rows 1,2,3") instead of expanding their subtrees.]

Figure 7: An ADtree built using leaf lists. Any node matching R_min or fewer records is not expanded, but simply records a set of pointers into the dataset (shown on the right).
Datasets

The experiments used the datasets in Table 1. Each dataset was supplied to us with all continuous attributes already discretized into ranges.
Using ADtrees for Feature Selection

Given M attributes, of which one is an output that we wish to predict, it is often interesting to ask which subset of n attributes (n < M) is the best predictor of the output on the same distribution of datapoints that is reflected in this dataset (Kohavi, 1995). There are many ways of scoring a set of features, but a particularly simple one is information gain (Cover & Thomas, 1991).

Let a_out be the attribute we wish to predict, and let a_{i1}, ..., a_{in} be the set of attributes used as inputs. Let X be the set of possible assignments of values to (a_{i1}, ..., a_{in}), and write Assign_k for the kth such assignment. Then

    InfoGain = Σ_{v=1}^{n_out} f( C(a_out = v) / R )  −  Σ_{k=1}^{|X|} ( C(Assign_k) / R ) Σ_{v=1}^{n_out} f( C(a_out = v, Assign_k) / C(Assign_k) )

where R is the number of records in the entire dataset and

    f(x) = −x log x.

The counts needed in the above computation can be read directly from ct(a_out, a_{i1}, ..., a_{in}).

Searching for the best subset of attributes is simply a question of searching among all attribute sets of size n (with n specified by the user). This is a simple example, designed to test our counting methods: any practical feature selector would need to penalize the number of rows in the contingency table, else high-arity attributes would tend to win.
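A sketch of this score in Python (function and variable names are ours; the contingency table is assumed to be given as a dict mapping (output value, input assignment) to counts):

```python
import math

def f(x):
    """f(x) = -x log x, with f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def info_gain(ct, R):
    """ct: dict mapping (v_out, assign) -> count, where `assign` is a tuple of input values."""
    out_counts, assign_counts = {}, {}
    for (v, assign), c in ct.items():
        out_counts[v] = out_counts.get(v, 0) + c
        assign_counts[assign] = assign_counts.get(assign, 0) + c
    prior = sum(f(c / R) for c in out_counts.values())
    conditional = sum(
        (ca / R) * sum(f(ct.get((v, assign), 0) / ca) for v in out_counts)
        for assign, ca in assign_counts.items()
    )
    return prior - conditional
```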

Ca ched Sufficient St a tistics f or Efficient Ma chine Learning
Name R Num M Num
Records A ttributes
ADUL T The mall s dult Income dataset placed in the CI U
rep ository b y Ron Koha vi oha vi C on
census data related to job w ealth and nationalit y t
tribute arities range from to In the U CI rep os
itory this is called the T est Set Ro ws with missing
v alues w ere remo v ed
ADUL T The same kinds of r ecords as ab o v e ut b with diren t
data The T raining Set
ADUL T ADUL T and ADUL T concatenated
CENSUS A l arger dataset based on a d iren t c ensus also pro
vided b y Ron Koha vi
CENSUS The same data as CENSUS b ut with the a ddition of
t w o extra highrit y attributes
BIR TH Records oncerning c a v ery wide n um b e r of r eadings
and factors recorded at v arious stages during preg
nancy Most attributes are binary and of the at
tributes are v ery sparse with o v er of the v
begin F ALSE
SYNTH KK Syn thetic datasets of en tirely binary attributes gener
ated using the Ba y es net in Figure
T able Datasets used i n e xp erimen
Using ADtrees for Bayes Net Structure Discovery

There are many possible Bayes net learning tasks, all of which entail counting and hence might be speeded up by ADtrees. In this paper we present experimental results for the particular example of scoring the structure of a Bayes net to decide how well it matches the data.

We use maximum likelihood scoring with a penalty for the number of parameters. We first compute the probability table associated with each node. Write Parents_j for the parent attributes of node j, and write X_j for the set of possible assignments of values to Parents_j. The maximum likelihood estimate of

    P(a_j = v | Parents_j = Assign),    Assign ∈ X_j,

is

    C(a_j = v, Assign) / C(Assign),

and all such estimates for node j's probability tables can be read from ct(a_j, Parents_j).

The next step in scoring a structure is to compute the likelihood of the data given the probability tables we computed, and to penalize the number of parameters in our network. Without the penalty, the likelihood would increase every time a link was added to the network.
[Figure 8 depicts the Bayes net used to generate the SYNTHETIC datasets.]

Figure 8: A Bayes net that generated our SYNTHETIC datasets. There are three kinds of nodes. The nodes marked with triangles are generated randomly. The square nodes are deterministic: a square node takes one value if the sum of its four parents is even, and the other value otherwise. The circle nodes are probabilistic functions of their single parent. This provides a dataset with fairly sparse values and with many interdependencies.
The penalized log-likelihood score (Friedman & Yakhini, 1996) is

    Score = −N_params log R + R Σ_{j=1}^{M} Σ_{Assign ∈ X_j} Σ_{v=1}^{n_j} P(a_j = v, Assign) log P(a_j = v | Assign)

where N_params is the total number of probability table entries in the network.
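A sketch of this score computed from counts (function names and the sparse-dict table format are ours; N_params is counted here as the number of stored table entries):

```python
import math

def node_loglik(ct, R):
    """Log-likelihood term for one node: sum over (value, parent assignment) of
    P(a_j = v, Assign) * log P(a_j = v | Assign), estimated from counts.
    ct maps (v, parent_assign) -> count over the whole dataset."""
    parent_counts = {}
    for (v, assign), c in ct.items():
        parent_counts[assign] = parent_counts.get(assign, 0) + c
    total = 0.0
    for (v, assign), c in ct.items():
        if c > 0:
            total += (c / R) * math.log(c / parent_counts[assign])
    return total

def network_score(per_node_cts, R):
    """Penalized log-likelihood: -N_params * log R + R * sum_j (node j's term)."""
    n_params = sum(len(ct) for ct in per_node_cts)   # stored probability table entries
    return -n_params * math.log(R) + R * sum(node_loglik(ct, R) for ct in per_node_cts)
```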
We search among structures to find the best score. In these experiments we use random-restart stochastic hill climbing, in which the operations are random addition or removal of a network link, or randomly swapping a pair of nodes. The latter operation is necessary to allow the search algorithm to choose the best ordering of nodes in the Bayes net. Stochastic searches such as this are a popular method for finding Bayes net structures (Friedman & Yakhini, 1996). Only the probability tables of the affected nodes are recomputed on each step.

Figure 9 shows the Bayes net structure returned by our Bayes net structure finder after its hill-climbing iterations on an ADULT dataset.
Using ADtrees for Rule Finding

Given an output attribute a_out and a distinguished value v_out, rule finders search among conjunctive queries of the form

    Assign = (a_{i1} = v_1, ..., a_{in} = v_n).
attribute            score      np    pars
relationship 2.13834 2 pars = <no parents>
class 0.643388 36 pars = relationship
sex 0.511666 24 pars = relationship class
capital-gain 0.0357936 30 pars = class
hours-per-week 0.851964 48 pars = relationship class sex
marital-status 0.762479 72 pars = relationship sex
education-num 1.0941 26 pars = class
capital-loss 0.22767 10 pars = class
age 0.788001 28 pars = marital-status
race 0.740212 18 pars = relationship education-num
education 1.71784 36 pars = relationship education-num
workclass 1.33278 108 pars = relationship hours-per-week education-num
native-country 0.647258 30 pars = education-num race
fnlwgt 0.0410872 40 pars = <no parents>
occupation 2.66097 448 pars = class sex education workclass
Score is 435219
The search took 226 seconds.
Figure 9: Output from the Bayes net structure finder running on an ADULT dataset. score is the contribution to the sum in the score equation due to the specified attribute; np is the number of entries in the probability table for the specified attribute.
The goal is to find the query that maximizes the estimated value

    P(a_out = v_out | Assign) = C(a_out = v_out, Assign) / C(Assign).

To avoid rules without significant support, we also insist that C(Assign), the number of records matching the query, must be above some threshold S_min.
In these experiments we implement a brute force search that looks through all possible queries involving a user-specified number of attributes n. We build each ct(a_out, a_{i1}, ..., a_{in}) in turn (there are M-choose-n such tables) and then look through the rows of each table for all queries using a_{i1}, ..., a_{in} that have greater than the minimum support S_min. We return a priority queue of the highest scoring rules. For instance, on an ADULT dataset, the best rule found for predicting class from a small set of attributes was a conjunction constraining workclass to Private, education-num and capital-loss to lie above certain thresholds, and marital-status to Married-civ-spouse.
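A brute-force scan over one attribute set could be sketched as follows (names ours; the rows are assumed to come from a contingency table such as those returned by the MakeContab sketch above):

```python
def best_rules(ct_rows, v_out, s_min):
    """ct_rows: iterable of (assign, counts_by_output_value) pairs for one attribute set.
    Returns (score, assign) pairs with support above s_min, best first."""
    scored = []
    for assign, counts in ct_rows:
        support = sum(counts.values())               # C(Assign)
        if support > s_min:
            score = counts.get(v_out, 0) / support   # estimate of P(a_out = v_out | Assign)
            scored.append((score, assign))
    return sorted(scored, reverse=True)
```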
Experimental Results

Let us first examine the memory required by an ADtree on our datasets. Table 2 shows, for each dataset, the number of nodes in the resulting ADtree and the memory it required. Among the three ADULT datasets, the size of the tree varied approximately linearly with the number of records.

Unless otherwise specified, in all the experiments in this section the ADULT datasets used no leaf lists, while the BIRTH and SYNTHETIC datasets used leaf lists (with a default R_min). The BIRTH dataset, with its large number of sparse attributes, required only a modest number of megabytes to store the tree, many orders of magnitude below the worst-case bounds. Among the synthetic datasets the tree size increased sublinearly with the dataset size. This indicates that as the dataset gets larger, novel records, which may cause new nodes to appear in the tree, become less frequent.


[Table 2 lists, for each dataset (CENSUS1, CENSUS2, ADULT1, ADULT2, ADULT3, and the synthetic datasets), the number of attributes M, the number of records R, the number of ADtree nodes, the megabytes of memory used, and the build time in seconds.]

Table 2: The size of ADtrees for various datasets. M is the number of attributes. R is the number of records. Nodes is the number of nodes in the tree. Megabytes is the amount of memory needed to store the tree. Build Time is the number of seconds needed to build the tree (to the nearest second).
Table 3 shows the costs of performing the hill-climbing iterations of Bayes net structure searching. All experiments were performed on a Pentium Pro machine. Recall that each Bayes net iteration involves one random change to the network, and so requires recomputation of one contingency table (the exception is the first iteration, in which all nodes must be computed). This means that the time to run the iterations is essentially the time to compute the corresponding contingency tables. Among the ADULT datasets the advantage of the ADtree over conventional counting is a substantial factor. Unsurprisingly, the computational costs for ADULT increase sublinearly with dataset size for the ADtree, but linearly for conventional counting. The computational advantages and the sublinear behavior are much more pronounced for the synthetic data.

Next, Table 4 examines the effect of leaf lists on the ADULT and BIRTH datasets. For the ADULT dataset, the byte size of the tree decreases considerably as R_min is increased, while the computational cost of running the Bayes net search increases only modestly, indicating a worthwhile tradeoff if memory is scarce.

The Bayes net scoring results involved the average cost of computing contingency tables of many different sizes. The following results, in Tables 5 and 6, make the savings for fixed-size attribute sets easier to discern. These tables give results for the feature selection and rule finding algorithms respectively. The biggest savings come from small attribute sets. Computational savings for sets of size one or two are, however, not particularly interesting, since all such counts could be cached by straightforward methods without needing any tricks. In all cases, however, we do see large savings, especially for the BIRTH data. Datasets with larger numbers of rows would of course reveal larger savings.

[Table 3 lists, for each dataset, M, R, the ADtree time, the regular (conventional counting) time, and the resulting speedup factor.]

Table 3: The time in seconds to perform the hill-climbing iterations searching for the best Bayes net structure. ADtree Time is the time when using the ADtree, and Regular Time is the time taken when using the conventional probability table scoring method of counting through the dataset. Speedup Factor is the number of times by which the ADtree method is faster than the conventional method. The ADtree times do not include the time for building the ADtree in the first place (given in Table 2): a typical use of ADtrees will build the tree only once and then use it for many data analysis operations, so its building cost can be amortized. In any case, even including the tree building cost would have only a minor impact on the results.
[Table 4 lists, for a range of R_min values, the memory used by the ADtree, the number of nodes, the build time in seconds, and the Bayes net structure search time in seconds, for the ADULT and BIRTH datasets.]

Table 4: Investigating the effect of the R_min parameter on the ADULT dataset and the BIRTH dataset. bytes is the memory used by the ADtree. Nodes is the number of nodes in the ADtree. Build Secs is the time to build the ADtree. Search Secs is the time needed to perform the iterations of the Bayes net structure search.

[Table 5 lists, for each attribute set size, the number of attribute sets evaluated and the average ADtree time, regular time, and speedup factor, for the ADULT and BIRTH datasets.]

Table 5: The time taken to search among all attribute sets of a given size (Number Attributes) for the set that gives the best information gain in predicting the output attribute. The times, in seconds, are the average evaluation times per attribute set.
[Table 6 lists, for each rule size, the number of rules evaluated and the average ADtree time, regular time, and speedup factor, for the ADULT and BIRTH datasets.]

Table 6: The time taken to search among all rules of a given size (Number Attributes) for the highest scoring rules for predicting the output attribute. The times, in seconds, are the average evaluation time per rule.






Alternative Data Structures

Why not use a kd-tree?

kd-trees can be used for accelerating learning algorithms (Omohundro, 1987; Moore et al., 1997). The primary difference is that a kd-tree node splits on only one attribute instead of all attributes. This results in much less memory (linear in the number of records), but counting can be expensive. Suppose, for example, that level one of the tree splits on a_1, level two splits on a_2, etc. Then, in the case of binary variables, if we have a query involving only attributes a_k and higher, we have to explore all paths in the tree down to level k. With datasets of fewer than 2^k records this may be no cheaper than performing a linear search through the records. Another possibility, R-trees (Guttman, 1984; Roussopoulos & Leifker, 1985), store databases of M-dimensional geometric objects. However, in this context they offer no advantages over kd-trees.
Why not use a Frequent Set finder?

Frequent Set finders (Agrawal et al., 1996) are typically used with very large databases of millions of records containing very sparse binary attributes. Efficient algorithms exist for finding all subsets of attributes that co-occur with value TRUE in more than a fixed number of records (chosen by the user and called the support). Recent research (Mannila & Toivonen, 1996) suggests that such Frequent Sets can be used to perform efficient counting. If the support threshold is set to its minimum, all such Frequent Sets are gathered, and if counts of each Frequent Set are retained, this is equivalent to producing an ADtree in which, instead of performing a node cutoff for the most common value, the cutoff always occurs for value FALSE.

The use of Frequent Sets in this way would thus be very similar to the use of ADtrees, with one advantage and one disadvantage. The advantage is that efficient algorithms have been developed for building Frequent Sets from a small number of sequential passes through the data. The ADtree requires random access to the dataset while it is being built, and for its leaf lists; this is impractical if the dataset is too large to reside in main memory and is accessed through database queries.

The disadvantage of Frequent Sets in comparison with ADtrees is that under some circumstances the former may require much more memory. Assume the value TRUE is rarer than FALSE throughout all attributes in the dataset, and assume (reasonably) that we thus choose to find all Frequent Sets of TRUE values. Unnecessarily many sets will be produced if there are correlations. In the extreme case, imagine a dataset in which the attributes are perfectly correlated, so that all values in each record are identical. Then with M attributes there would be on the order of 2^M Frequent Sets, whereas the ADtree would contain only on the order of M nodes. This is an extreme example, but datasets with much weaker inter-attribute correlations can similarly benefit from using an ADtree.

Leaf lists are another technique for further reducing the size of ADtrees. They could also be used for the Frequent Set representation.

Why not use hash tables?

If we knew that only a small set of contingency tables would ever be requested, instead of all possible contingency tables, then an ADtree would be unnecessary: it would be better to remember this small set of contingency tables explicitly. Then some kind of tree structure could be used to index the contingency tables, but a hash table would be equally time-efficient and require less space. A hash table encoding of individual counts in the contingency tables would similarly allow us to use space proportional only to the number of non-zero entries in the stored tables. But for representing sufficient statistics that permit fast solution of any contingency table request, the ADtree structure remains more memory-efficient than the hash-table approach (or any method that stores all non-zero counts), because of the memory reductions gained when we exploit the ignoring of most common values.
Discussion

What about numeric attributes?

The ADtree representation is designed entirely for symbolic attributes. When faced with numeric attributes, the simplest solution is to discretize them into a fixed finite set of values, which are then treated as symbols; but this is of little help if the user requests counts for queries involving inequalities on numeric attributes. In future work we will evaluate the use of structures combining elements from multiresolution kd-trees over real attributes (Moore et al., 1997) with ADtrees.

Algorithm-specific counting tricks

Many algorithms that count using the conventional linear method have algorithm-specific ways of accelerating their performance. For example, a Bayes net structure finder may try to remember all the contingency tables it has tried previously in case it needs to re-evaluate them. When it deletes a link, it can deduce the new contingency table from the old one without needing a linear count.

In such cases the most appropriate use of the ADtree may be as a lazy caching mechanism. At birth, the ADtree consists only of the root node. Whenever the structure finder needs a contingency table that cannot be deduced from the current ADtree structure, the appropriate nodes of the ADtree are expanded. The ADtree then takes on the role of the algorithm-specific caching methods, while in general using much less memory than if all contingency tables were remembered.
Hard to update incrementally

Although the ADtree can be built cheaply (see the experimental results) and although it can be built lazily, the ADtree cannot be updated cheaply with a new record. This is because one new record may match up to 2^M nodes in the tree in the worst case.
Scaling up

The ADtree representation can be useful for datasets of the rough size and shape used in this paper. On the first datasets we have looked at (the ones described in this paper) we have shown empirically that the sizes of the ADtrees are tractable given real, noisy data; this included one dataset with a very large number of attributes. It is the extent to which the attributes are skewed in their values and correlated with each other that enables the ADtree to avoid approaching its worst-case bounds. The main technical contribution of this paper is the trick that allows us to prune off most-common values; without it, skewedness and correlation would hardly help at all. (Without pruning, on all of our datasets we ran out of memory before we had built more than a small fraction of the tree, and it is easy to show that the BIRTH dataset would have needed an enormous number of nodes.) The empirical contribution of this paper has been to show that the actual sizes of the ADtrees produced from real data are vastly smaller than the sizes we would get from the worst-case bounds in Appendix A.

But despite these savings, ADtrees cannot yet represent all the sufficient statistics for huge datasets with many hundreds of non-sparse and poorly correlated attributes. What should we do if our dataset or our tree cannot fit into main memory? In the latter case we could simply increase the size of leaf lists, trading decreased memory against increased time to build contingency tables. If that is inadequate, at least three possibilities remain. First, we could build approximate ADtrees that do not store any information for nodes that match fewer than a threshold number of records; approximate contingency tables, complete with error bounds, can then be produced (Mannila & Toivonen, 1996). A second possibility is to exploit secondary storage and store deep, rarely visited nodes of the ADtree on disk. This would doubtless best be achieved by integrating the machine learning algorithms with current database management tools, a topic of considerable interest in the data mining community (Fayyad et al., 1997). A third possibility, which restricts the size of contingency tables we may ask for, is to refuse to store counts for queries with more than some threshold number of attributes.
What about the cost of building the tree?

In practice, ADtrees could be used in two ways:

One-off: When a traditional algorithm is required, we build the ADtree, run the fast version of the algorithm, discard the ADtree, and return the results.

Amortized: When a new dataset becomes available, a new ADtree is built for it. The tree is then shipped and re-used by anyone who wishes to run real-time counting queries, multivariate graphs and charts, or any machine learning algorithms on any subset of the attributes. The cost of the initial tree building is then amortized over all the times it is used. In database terminology this process is known as materializing (Harinarayan et al., 1996), and it has been suggested as desirable for data mining by several researchers (John & Lent, 1997; Mannila & Toivonen, 1996).

The one-off option is only useful if the cost of building the ADtree plus the cost of running the ADtree-based algorithm is less than the cost of the original counting-based algorithm. For the intensive machine learning methods studied here this condition is safely satisfied. But what if we decided to use a less intensive, greedier Bayes net structure finder?
[Table 7 lists, for each dataset, the speedup ignoring build time and the speedups allowing for build time at two different numbers of search iterations.]

Table 7: Computational economics of building ADtrees and using them to search for Bayes net structures, using the experiments of the results section.
Table 7 shows that if we only run the search for a small number of iterations, and if we account for the ADtree building cost, then the relative speedup of using ADtrees declines greatly (and, unsurprisingly, the resulting Bayes nets have a highly inferior structure).

To conclude: if the data analysis is intense, then there is benefit to using ADtrees even if they are used in a one-off fashion. If the ADtree is used for multiple purposes, then its build time is amortized, and the resulting relative efficiency gains over traditional counting are the same for both exhaustive searches and non-exhaustive searches. Algorithms that use non-exhaustive searches include hill-climbing Bayes net learners, greedy rule learners such as CN2 (Clark & Niblett, 1989), and decision tree learners (Quinlan, 1983; Breiman et al., 1984).
Acknowledgements

This work was sponsored by a National Science Foundation Career Award to Andrew Moore. The authors thank Justin Boyan, Scott Davies, Nir Friedman, and Jeff Schneider for their suggestions, and Ron Kohavi for providing the census datasets.
Appendix A: Memory Costs

In this appendix we examine the size of the ADtree. For simplicity, we restrict attention to the case of binary attributes.
The worst-case number of nodes in an ADtree

Given a dataset with M attributes and R records, the worst case for the ADtree will occur if all 2^M possible records exist in the dataset. Then for every subset of attributes there exists exactly one node in the tree. For example, consider the attribute set {a_{i1}, ..., a_{in}}, where i_1 < i_2 < ... < i_n. Suppose there is a node in the tree corresponding to the query {a_{i1} = v_1, ..., a_{in} = v_n} for some values v_1, ..., v_n. From the definition of an ADtree (and remembering that we are only considering the case of binary attributes) we can state:

- v_1 is the least common value of a_{i1};
- v_2 is the least common value of a_{i2} among those records that match a_{i1} = v_1;
- ...
- v_k is the least common value of a_{ik} among those records that match a_{i1} = v_1, ..., a_{i(k−1)} = v_{k−1}.

So there is at most one such node. Moreover, since our worst-case assumption is that all possible records exist in the database, we see that the ADtree will indeed contain this node. Thus the worst-case number of nodes is the same as the number of possible subsets of attributes, namely 2^M.
The worst-case number of nodes in an ADtree with a reasonable number of rows

It is frequently the case that a dataset has R much smaller than 2^M. With fewer records there is a much lower worst-case bound on the ADtree size. A node at the kth level of the tree corresponds to a query involving k attributes (counting the root node as level 0). Such a node can match at most R / 2^k records, because each of the node's ancestors up the tree has pruned off at least half the records, by choosing to expand only the least common value of the attribute introduced by that ancestor. Thus there can be no tree nodes deeper than level ⌊log₂ R⌋ of the tree, because such nodes would have to match fewer than one record; they would thus match no records, making them NULL.

The nodes in an ADtree must therefore all exist at level ⌊log₂ R⌋ or higher. The number of nodes at level k is at most (M choose k), because every node at level k involves an attribute set of size k, and because (given binary attributes) for every attribute set there is at most one node in the ADtree. Thus the total number of nodes in the tree, summing over the levels, is less than

    Σ_{k=0}^{⌊log₂ R⌋} (M choose k),

bounded above by O(M^{⌊log₂ R⌋}).
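As a quick illustration (numbers ours, not from the paper): with M = 100 binary attributes and R = 1000 records, ⌊log₂ R⌋ = 9, so the bound is Σ_{k=0}^{9} (100 choose k) ≈ 2.1 × 10^12, which, although large, is vastly smaller than the 2^100 ≈ 1.3 × 10^30 nodes of the unconstrained worst case.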
The number of nodes if we assume skewed, independent attribute values

Imagine that all values of all attributes in the dataset are independent random binary variables, taking one value with probability p and the other with probability 1 − p. Then the further p is from 1/2, the smaller we can expect the ADtree to be. This is because, on average, the less common value of a Vary node will match a fraction min(p, 1 − p) of its parent's records, and so on average the number of records matched at the kth level of the tree will be R (min(p, 1 − p))^k. Thus the maximum level in the tree at which we may find a node matching one or more records is approximately ⌊log R / log(1/q)⌋, where q = min(p, 1 − p). And so the total number of nodes in the tree is approximately

    Σ_{k=0}^{⌊log R / log(1/q)⌋} (M choose k),

bounded above by O(M^{⌊log R / log(1/q)⌋}).

Since the exponent is reduced by a factor of log(1/q), skewedness among the attributes thus brings enormous savings in memory.
The number of nodes if we assume correlated attribute values

The ADtree benefits from correlations among attributes in much the same way that it benefits from skewedness. For example, suppose that each record was generated by the simple Bayes net in Figure 10, in which the random variable B is hidden (not included in the record). Then for i ≠ j, P(a_i = a_j) = p² + (1 − p)². If ADN is any node in the resulting tree, then the number of records matching any other node two levels below ADN in the tree will be at most a fraction p(1 − p) of the number of records matching ADN. From this we can see that the number of nodes in the tree is approximately

    Σ_{k=0}^{⌊log R / log(1/q)⌋} (M choose k),

bounded above by O(M^{⌊log R / log(1/q)⌋}),

where q = sqrt(p(1 − p)). Correlation among the attributes can thus also bring enormous savings in memory, even if, as is the case in our example, the marginal distribution of individual attributes is uniform.

[Figure 10 shows a naive-Bayes-style network: a hidden boolean node B with P(B) = 0.5 is the single parent of a_1, ..., a_M, with P(a_i | B) = 1 − p and P(a_i | ¬B) = p.]

Figure 10: A Bayes net that generates correlated boolean attributes a_1, ..., a_M.
The number of nodes for the dense ADtree

The dense ADtree described at the start of the paper does not cut off the tree for the most common value of a Vary node. Its worst case will also occur if all 2^M possible records exist in the dataset. Then the dense ADtree will require 3^M nodes, because every possible query (with each attribute either taking one of its two values or left unspecified) will have a count in the tree. The number of nodes at the kth level of the dense ADtree can be (M choose k) 2^k in the worst case.
The number of nodes when using leaf lists

Leaf lists were described earlier. If a tree is built using a maximum leaf list size of R_min, then any node in the ADtree matching fewer than R_min records is a leaf node. This means that the formulae above can be re-used, with R replaced by R / R_min. It is important to remember, however, that the leaf nodes must now contain room for up to R_min numbers instead of a single count.
Appendix B: Building the ADtree

We define the function MakeADTree(a_i, RecordNums), where RecordNums is a subset of {1, 2, ..., R}, R is the total number of records in the dataset, and 1 ≤ i ≤ M + 1. This makes an ADtree from the rows specified in RecordNums, in which all ADnodes represent queries in which only attributes a_i and higher are used.
MakeADTree(a_i, RecordNums):
    Make a new ADnode called ADN
    ADN.COUNT := |RecordNums|
    For j := i, i + 1, ..., M:
        ADN's Vary a_j child := MakeVaryNode(a_j, RecordNums)
MakeADTree uses the function MakeVaryNode, which we now define:
MakeVaryNode(a_i, RecordNums):
    Make a new Vary node called VN
    For k := 1, 2, ..., n_i:
        Let Childnums_k := {}
    For each j ∈ RecordNums:
        Let v_ij := the value of attribute a_i in record j
        Add j to the set Childnums_{v_ij}
    Let VN.MCV := argmax_k |Childnums_k|
    For k := 1, 2, ..., n_i:
        If |Childnums_k| = 0, or if k = MCV:
            Set the (a_i = k) subtree of VN to NULL
        Else:
            Set the (a_i = k) subtree of VN to MakeADTree(a_{i+1}, Childnums_k)
To build the entire tree, we call MakeADTree(a_1, {1, 2, ..., R}). Assuming binary attributes, since there are at most (M choose k) nodes at level k and each processes at most R / 2^k records, the cost of building the tree from R records and M attributes is bounded above by approximately

    Σ_{k=0}^{⌊log₂ R⌋} (M choose k) (R / 2^k).
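A runnable Python rendering of this construction (a sketch; it follows the pseudocode above with the sparse cutoffs for zero counts and most common values, reuses the ADNode and VaryNode classes sketched earlier, and assumes the dataset is held in the global DATA list with attribute arities in ARITIES):

```python
from collections import defaultdict

def make_adtree(i, record_nums):
    """Build an ADnode over attributes a_i .. a_M for the given 1-based record numbers."""
    adn = ADNode(count=len(record_nums), vary_children={})
    for j in range(i, len(ARITIES) + 1):
        adn.vary_children[j] = make_vary_node(j, record_nums)
    return adn

def make_vary_node(i, record_nums):
    child_nums = defaultdict(list)
    for r in record_nums:
        child_nums[DATA[r - 1][i - 1]].append(r)       # group records by their value of a_i
    mcv = max(range(1, ARITIES[i - 1] + 1), key=lambda k: len(child_nums[k]))
    vn = VaryNode(attribute=i, children={})
    vn.mcv = mcv
    for k in range(1, ARITIES[i - 1] + 1):
        if child_nums[k] and k != mcv:                  # NULL for zero counts and for the MCV
            vn.children[k] = make_adtree(i + 1, child_nums[k])
    return vn

# Build the whole tree:  root = make_adtree(1, list(range(1, len(DATA) + 1)))
```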
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons.

Fayyad, U., Mannila, H., & Piatetsky-Shapiro, G. (1997). Data Mining and Knowledge Discovery. Kluwer Academic Publishers. A new journal.

Fayyad, U., & Uthurusamy, R. (1996). Special issue on data mining. Communications of the ACM.

Friedman, N., & Yakhini, Z. (1996). On the sample complexity of learning Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. Assn. for Computing Machinery.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS). Assn. for Computing Machinery.

John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Cohen, W. W., & Hirsh, H. (Eds.), Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann.

John, G. H., & Lent, B. (1997). SIPping from the data firehose. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press.

Kohavi, R. (1995). The power of decision tables. In Lavrač, N., & Wrobel, S. (Eds.), Machine Learning: ECML-95 (Eighth European Conference on Machine Learning, Heraklion, Crete, Greece). Springer Verlag.

Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Simoudis, E., Han, J., & Fayyad, U. (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press.

Madala, H. R., & Ivakhnenko, A. G. (1994). Inductive Learning Algorithms for Complex Systems Modeling. CRC Press Inc., Boca Raton.

Mannila, H., & Toivonen, H. (1996). Multiple uses of frequent sets and condensed representations. In Simoudis, E., Han, J., & Fayyad, U. (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press.

Moore, A. W., Schneider, J., & Deng, K. (1997). Efficient locally weighted polynomial regression predictions. In Fisher, D. (Ed.), Proceedings of the 1997 International Machine Learning Conference. Morgan Kaufmann.

Omohundro, S. M. (1987). Efficient algorithms with neural network behaviour. Journal of Complex Systems.

Quinlan, J. R. (1983). Learning efficient classification procedures and their application to chess end games. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.), Machine Learning: An Artificial Intelligence Approach. Tioga Publishing Company, Palo Alto.

Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5.

Roussopoulos, N., & Leifker, D. (1985). Direct spatial search on pictorial databases using packed R-trees. In Navathe, S. (Ed.), Proceedings of the ACM-SIGMOD International Conference on Management of Data. Assn. for Computing Machinery.

Rymon, R. (1993). An SE-tree based characterization of the induction problem. In Utgoff, P. (Ed.), Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann.