Inductive Decision Trees



Inductive Decision Trees


with


Daniel L. Silver

Copyright (c), 1999

All Rights Reserved


Agenda


Inductive Modeling Overview


Inductive Decision Trees


See handouts and readings



Inductive Modeling Overview




Induction Vs. Deduction

Deduction: top-down verification, from the model or general rule to specific examples (A, B, C)

Induction: bottom-up construction, from specific examples (A, B, C) to a model or general rule


Inductive Modeling = Learning

Objective: Develop a general model or hypothesis from specific examples

Function approximation (curve fitting): fit f(x) over x

Classification (concept learning, pattern recognition): separate classes A and B in the input space (x1, x2)


Inductive Modeling = Learning

Basic Framework for Inductive Learning

The environment supplies training examples (x, f(x)) to the inductive learning system, which produces an induced model or classifier. Testing examples are then presented to the induced model, which produces output classifications (x, h(x)). Is h(x) ~= f(x)?

A problem of representation and search for the best hypothesis, h(x).


Inductive Modeling




Machine Learning

Ideally, a hypothesis is:

complete

consistent

valid

accurate .... able to generalize to previously unseen examples/cases

transparent .... provides an explanation


Machine Learning Methods


Automated Exploration/Discovery

e.g., discovering new market segments

distance and probabilistic clustering algorithms

Prediction/Classification

e.g., forecasting gross sales given current factors

statistics (regression, K-nearest neighbour)

artificial neural networks, genetic algorithms

Explanation/Description

e.g., characterizing customers by demographics

inductive decision trees/rules

rough sets, Bayesian belief nets

[Figures: curve fitting of f(x) over x; classification of A vs. B over (x1, x2); example rule: if age > 35 and income < $35k then ...]


Inductive Learning

Generalization


The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.

Generalization can be defined as a mathematical interpolation or regression over a set of training points.


Inductive Learning

Generalization


Generalization accuracy can be guaranteed for a specified confidence level given a sufficient number of examples

Models can be validated with a previously unseen test set or approximated by cross-validation methods


PAC Theory of the Learnable


Probably Approximately Correct (PAC) learning theory

Generalization
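
For reference (this bound is not on the slides), a standard PAC result for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data: with probability at least 1 - δ, the learned hypothesis has true error at most ε, provided the number of training examples m satisfies

    m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

This makes precise the earlier claim that generalization accuracy can be guaranteed for a specified confidence level given a sufficient number of examples.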


Inductive Decision Trees




Inductive Decision Trees

Decision Tree


A representational structure

An acyclic, directed graph

Nodes are either a:

Leaf - indicates a class or value (distribution)

Decision node - a test on a single attribute; will have one branch and subtree for each possible outcome of the test

Classification is made by traversing from the root to a leaf in accordance with the tests

[Figure: an example tree of decision nodes A?, B?, C?, D? from the root down to a leaf, reached here via a "Yes" branch]
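
To make this structure concrete, here is a minimal Python sketch (not from the slides; the tree and attribute names are hypothetical) of a leaf / decision-node representation and root-to-leaf classification:

    # A leaf holds a class label; a decision node tests one attribute and has
    # one branch (subtree) per possible outcome of that test.
    class Leaf:
        def __init__(self, label):
            self.label = label

    class DecisionNode:
        def __init__(self, attribute, branches):
            self.attribute = attribute   # the single attribute tested at this node
            self.branches = branches     # dict: outcome value -> subtree

    def classify(node, example):
        # Traverse from the root to a leaf in accordance with the tests.
        while isinstance(node, DecisionNode):
            node = node.branches[example[node.attribute]]
        return node.label

    # Hypothetical tree: test A? at the root, then B? on the "yes" branch.
    tree = DecisionNode("A", {"yes": DecisionNode("B", {"yes": Leaf("Pos"), "no": Leaf("Neg")}),
                              "no": Leaf("Neg")})
    print(classify(tree, {"A": "yes", "B": "no"}))   # -> Neg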


Inductive Decision Trees (IDTs)

A Long and Diverse History


Independently developed in the 60's and 70's by researchers in ...

Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees)

Pattern Recognition: U. of Michigan - AID, G.V. Kass - CHAID (Chi-squared Automated Interaction Detection)

AI and Info. Theory: R. Quinlan - ID3, C4.5 (Iterative Dichotomizer)



Inducing a Decision Tree

Given: A set of examples with Pos. & Neg. classes

Problem: Generate a Decision Tree model to classify a separate (validation) set of examples with minimal error

Approach: Occam's Razor - produce the simplest model that is consistent with the training examples -> a narrow, short tree. Every traversal should be as short as possible

Formally: Finding the absolute simplest tree is intractable, but we can at least try our best


Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree.

The most important variable:

the variable (predictor) that gains the most ground in classifying the set of training examples

the variable that has the most significant relationship to the response variable

the variable on which the response is most dependent, or least independent


Inducing a Decision Tree

Importance of a predictor variable


CHAID/CART

The Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable

The lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment)

C4.5

Information-theoretic Gain is computed for each predictor, and the one with the highest Gain is chosen (a sketch follows below)
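
A minimal Python sketch of the C4.5-style information gain calculation (assuming examples are given as (attribute-dict, class-label) pairs; these names are illustrative, not C4.5's own code):

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, attribute):
        # Gain = entropy of the whole set minus the weighted entropy of the
        # subsets induced by the attribute's values.
        labels = [y for _, y in examples]
        subsets = {}
        for x, y in examples:
            subsets.setdefault(x[attribute], []).append(y)
        remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    # The most important predictor is the one with the highest gain:
    # best = max(attributes, key=lambda a: information_gain(examples, a))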


Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with multiple values into similar groups or segments, which are then treated as separate variables (CART/CHAID only)

The p-values from the Chi-squared or F statistic are used to determine the variable/value combinations that are most similar in terms of their relationship to the response variable



Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 3: Prevent overfitting the tree to the training data, so that it generalizes well to a validation set, by:

Stopping: Prevent the split on a predictor variable if it is above a level of statistical significance (CHAID) or leaves too few examples per leaf (C4.5)

Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)



Inducing a Decision Tree

Stopping means a choice of level of significance ....

If the probability (p-value) of the statistic is less than the chosen level of significance, then a split is allowed

Typically the significance level is set to:

0.05, which provides 95% confidence

0.01, which provides 99% confidence
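
A minimal sketch of this stopping rule in Python (assuming a categorical predictor and response given as parallel lists, and using scipy's chi-squared test of independence; no Bonferroni correction is shown):

    import pandas as pd
    from scipy.stats import chi2_contingency

    def allow_split(predictor_values, response_values, alpha=0.05):
        # Allow the split only if the predictor and response are significantly
        # dependent at the chosen level (alpha = 0.05 -> 95%, 0.01 -> 99%).
        table = pd.crosstab(pd.Series(predictor_values), pd.Series(response_values))
        chi2, p_value, dof, expected = chi2_contingency(table)
        return p_value < alpha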


Estimating error rates


Pruning operation is performed if this does not increase the estimated error

Of course, error on the training data is not a useful estimator (would result in almost no pruning)

One possibility: using a hold-out set for pruning (reduced-error pruning)

C4.5's method: using the upper limit of a 25% confidence interval derived from the training data

Standard Bernoulli-process-based method


Post-pruning in C4.5


Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.

Method 1: compute accuracy using examples not seen by the algorithm.

Method 2: estimate accuracy using the training examples:

Consider classifying E examples incorrectly out of N examples as observing E events in N trials of the binomial distribution.

For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E,N), with CF% confidence.
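
A minimal Python sketch of bottom-up post-pruning (reusing the Leaf/DecisionNode classes sketched earlier; estimated_error and majority_label are hypothetical helpers, e.g. based on U_CF(E,N)):

    def prune(node, examples, estimated_error, majority_label):
        # Returns the (possibly pruned) subtree for the given training examples.
        if isinstance(node, Leaf):
            return node
        # Prune the children first (bottom-up), routing examples down each branch.
        for value, child in node.branches.items():
            subset = [(x, y) for x, y in examples if x[node.attribute] == value]
            node.branches[value] = prune(child, subset, estimated_error, majority_label)
        # Candidate leaf that would replace the subtree rooted at this node.
        leaf = Leaf(majority_label(examples))
        if estimated_error(leaf, examples) <= estimated_error(node, examples):
            return leaf   # merging does not hurt the estimated accuracy
        return node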




Usage in Statistics: Sampling error estimation

population: 1,000,000 people

population mean: percentage of left-handed people

sample: 100 people

sample mean: 6 left-handed

How do we estimate the REAL population mean, given a 25% confidence factor?

Pessimistic Estimate: [Figure: the 75% confidence interval for the probability of being left-handed runs from L_0.25(6,100) (about 2%) up to U_0.25(6,100) (about 10%), around the sample mean of 6%; the upper limit of the confidence interval is taken as the pessimistic estimate. CF = confidence factor = .25 or 25%]



Usage in IDTs: Node classification error estimation

population: set of all possible training examples

population mean: percentage of errors made by this node

sample: 100 examples from the training data set

sample mean: 6 errors on the training data set

How do we estimate the REAL average error rate, given a 25% confidence factor?

Pessimistic Estimate: [Figure: the 75% confidence interval for the probability of error at the node runs from L_0.25(6,100) (about 2) up to U_0.25(6,100) (about 10), around the sample mean of 6; the upper limit of the confidence interval is the pessimistic estimate. CF = confidence factor = .25 or 25%]



Usage in IDTs: Node classification error estimation

population: set of all possible training examples

population mean: percentage of errors made by this node

sample: 16 examples from the training data set

sample mean: 1 error on the training data set

How do we estimate the REAL average error rate, given a 25% confidence factor?

Pessimistic Estimate: [Figure: the 75% confidence interval for the probability of error at the node runs from L_0.25(1,16) (about 0.5 errors) up to U_0.25(1,16) (about 2.5 errors), around the sample mean of 1; the upper limit of the confidence interval is the pessimistic estimate. CF = confidence factor = .25 or 25%]



Usage in IDTs: Node classification error estimation

population: set of all possible training examples

population mean: percentage of errors made by this node

sample: 16 examples from the training data set

sample mean: 1 error on the training data set

How do we estimate the REAL average error rate, given a 10% confidence factor?

Pessimistic Estimate: [Figure: the 90% confidence interval for the probability of error at the node runs from L_0.10(1,16) (about -0.5) up to U_0.10(1,16) (about 3.5 errors), around the sample mean of 1; the upper limit of the confidence interval is the pessimistic estimate. CF = confidence factor = .10 or 10%]

A heuristic that works well.


C4.5's method


Error estimate for subtree is weighted sum of error estimates for all its leaves

Error estimate for a node:

e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}

If c = 25% then z = 0.69 (from normal distribution)

f is the error on the training data

N is the number of instances covered by the leaf
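
A minimal Python sketch that evaluates this formula (not C4.5's own code; note that scipy's normal quantile gives z of about 0.674 for c = 25%, close to the 0.69 quoted above):

    from math import sqrt
    from scipy.stats import norm

    def pessimistic_error(f, N, c=0.25):
        # Upper confidence limit on a leaf's true error rate, using the normal
        # approximation to the binomial: f = observed training error rate,
        # N = number of instances covered by the leaf, c = confidence level.
        z = norm.ppf(1 - c)   # about 0.674 for c = 0.25
        return (f + z * z / (2 * N)
                + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)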

Example for Estimating Error


Consider a subtree with 3 leaf nodes:

Sunny: Play = yes : (0 errors, 6 instances)

Overcast: Play = yes : (0 errors, 9 instances)

Cloudy: Play = no : (0 errors, 1 instance)

With U_0.25(0,6) = 0.206, U_0.25(0,9) = 0.143 and U_0.25(0,1) = 0.750, the estimated error for this subtree is

6*0.206 + 9*0.143 + 1*0.750 = 3.273

If the subtree is replaced with the leaf "yes", the estimated error is

16 * U_0.25(1,16) = 16 * 0.157 = 2.512

Since 2.512 < 3.273, the subtree is pruned (replaced with the leaf).
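
The U_CF(E,N) values quoted above can be checked against the exact binomial (Clopper-Pearson style) upper limit; a minimal Python sketch, assuming scipy is available. It reproduces the zero-error values (0.206, 0.143, 0.750) exactly; for U_0.25(1,16) it gives roughly 0.16, while C4.5's own approximation of this bound yields the 0.157 used above:

    from scipy.stats import beta

    def upper_error(cf, E, N):
        # Upper limit U_CF(E, N): the error rate p such that observing E or
        # fewer errors in N examples has probability CF (exact binomial bound).
        return beta.ppf(1 - cf, E + 1, N - E)

    for E, N in [(0, 6), (0, 9), (0, 1), (1, 16)]:
        print(E, N, round(upper_error(0.25, E, N), 3))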

IDT Training




IDT Training

How do you ensure that a decision tree has been well trained?

Objective: To achieve good generalization accuracy on new examples/cases

Establish a maximum acceptable error rate

Train the tree using a tuning/test set if needed

Validate the trained tree against a separate test set, which is usually referred to as a validation, hold-out or production set


IDT Training

Approach #1: Large Sample

When the amount of available data is large ...

Divide the available examples randomly: about 70% into a Training Set (with a hold-out (HO) set if needed), used to develop one IDT model, and about 30% into a Test Set, used to compute goodness of fit.

Generalization = goodness of fit (accuracy) on the test set


IDT Training

Approach #2: Cross-validation

When the amount of available data is small ...

Divide the available examples randomly into a Training Set (about 90%, with a hold-out (HO) set if needed) and a Test Set (about 10%); repeat 10 times, developing 10 different IDT models.

Tabulate the goodness-of-fit statistics over the 10 test sets.

Generalization = mean and std dev of goodness of fit (a sketch follows below)
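
A minimal Python sketch of this approach (assuming scikit-learn; X and y are placeholders for the prepared features and class labels, and scikit-learn's tree is CART-style rather than C4.5, so this illustrates the procedure rather than the exact algorithm):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data; substitute the prepared training examples.
    X = np.random.rand(200, 4)
    y = (X[:, 0] > 0.5).astype(int)

    # 10 folds: each model trains on ~90% of the data and is scored on the held-out ~10%.
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
    print("generalization = %.3f +/- %.3f" % (scores.mean(), scores.std()))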


IDT Training

How do you select between two induced decision trees?

A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of the two IDT models

If the Large Sample method has been used, then apply McNemar's test* or a difference-of-proportions test

If Cross-validation has been used, then use a paired t test for the difference of two proportions

*We assume a classification problem; if this is function approximation, then use a paired t test for the difference of means
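
A minimal Python sketch of McNemar's test for the Large Sample case (assuming boolean arrays marking which test examples each model classified correctly; the chi-squared approximation with continuity correction is used):

    import numpy as np
    from scipy.stats import chi2

    def mcnemar_p_value(correct_a, correct_b):
        # b = examples only model A got right, c = examples only model B got right.
        a = np.asarray(correct_a, dtype=bool)
        b_arr = np.asarray(correct_b, dtype=bool)
        b = int(np.sum(a & ~b_arr))
        c = int(np.sum(~a & b_arr))
        # Continuity-corrected McNemar statistic, 1 degree of freedom (assumes b + c > 0).
        stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
        return 1 - chi2.cdf(stat, df=1)

    # A small p-value (e.g. < 0.05) indicates a significant difference in fit.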


Data Preparation

Garbage in, garbage out


The quality of results relates directly to the quality of the data

50%-70% of IDT development time will be spent on data preparation

The three steps of data preparation:

Consolidation and Cleaning

Selection and Preprocessing

Transformation and Encoding


Data Preparation

Data Types and IDTs


Three basic data types:

nominal - discrete symbolic (A, yes, small)

ordinal - discrete numeric (-5, 3, 24)

continuous - numeric (0.23, -45.2, 500.43)

IDTs can generally accept all types of data, plus a 4th form called float = any of the above plus a special value for n/a or unknown



Pros and Cons of IDTs

Cons:

Only one response variable at a time

Different significance tests required for nominal and continuous responses

Can have difficulties with noisy data

Discriminant functions are often suboptimal due to orthogonal decision hyperplanes



Pros and Cons of IDTs

Pros:


Proven modeling method for 20 years


Provides explanation and prediction


Ability to learn arbitrary functions


Handles unknown values well


Rapid training and recognition speed


Has inspired many inductive learning
algorithms using statistical regression


The IDT Application Development Process

Guidelines for inducing decision trees

1. IDTs are a good method to start with

2. Get a suitable training set

3. Use a sensible coding for the input variables

4. Develop the simplest tree by adjusting the tuning parameters (significance level)

5. Use a test set to prevent over-fitting

6. Determine confidence in generalization through cross-validation


TUTORIAL



Complete the C4.5 IDT Tutorial Handout


Requires ssh access to euler.acadiau.ca




THE END


daniel.silver@dal.ca