CogNova
Technologies
Inductive Decision Trees
with
Daniel L. Silver
Copyright (c), 1999
All Rights Reserved
Agenda
Inductive Modeling Overview
Inductive Decision Trees
See handouts and readings
Inductive Modeling Overview
Induction Vs. Deduction
Model or General Rule
Deduction: top-down verification
Induction: bottom-up construction
Example A, Example B, Example C
Inductive Modeling = Learning
Objective: Develop a general model or hypothesis from specific examples
Function approximation (curve fitting)
Classification (concept learning, pattern recognition)
[Figures: a curve f(x) fitted against x; classes A and B separated in the (x1, x2) plane]
Inductive Modeling = Learning
Basic Framework for Inductive Learning
Environment -> Training Examples (x, f(x)) -> Inductive Learning System -> Induced Model of Classifier
Testing Examples -> Induced Model of Classifier -> Output Classification (x, h(x))
h(x) ≈ f(x)?
A problem of representation and search for the best hypothesis, h(x).
Inductive Modeling
Machine Learning
Ideally, a hypothesis is
– complete
– consistent
– valid
– accurate: able to generalize to previously unseen examples/cases
– transparent (provides explanation)
Machine Learning Methods
Automated Exploration/Discovery
– e.g. discovering new market segments
– distance and probabilistic clustering algorithms
Prediction/Classification
– e.g. forecasting gross sales given current factors
– statistics (regression, K-nearest neighbour)
– artificial neural networks, genetic algorithms
Explanation/Description
– e.g. characterizing customers by demographics
– inductive decision trees/rules
– rough sets, Bayesian belief nets
[Figures: groups A and B in the (x1, x2) plane; a curve f(x) against x; a rule of the form "if age > 35 and income < $35k then ..."]
Inductive Learning
Generalization
The objective of learning is to achieve good generalization to new cases; otherwise just use a look-up table.
Generalization can be defined as a mathematical interpolation or regression over a set of training points.
[Figure: f(x) plotted against x]
Inductive Learning
Generalization
Generalization accuracy can be guaranteed for a specified confidence level given a sufficient number of examples
Models can be validated with a previously unseen test set, or their accuracy can be approximated by cross-validation methods
[Figure: f(x) plotted against x]
PAC Theory of the Learnable
Probably Approximately Correct learning theory
Generalization
Inductive Decision Trees
Inductive Decision Trees
Decision Tree
A representational structure
An acyclic, directed graph
Nodes are either a:
– Leaf: indicates class or value (distribution)
– Decision node: a test on a single attribute; it will have one branch and subtree for each possible outcome of the test
Classification is made by traversing from the root to a leaf in accord with the tests
[Figure: tree with decision nodes A?, B?, C?, D?, labelled Root and Leaf, with a "Yes" branch]
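The structure described above can be sketched in a few lines of Python. This is only an illustration: the class names `Leaf` and `DecisionNode`, the `classify` helper, and the attributes "outlook" and "windy" are all made up for the example and do not come from any particular IDT package.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class (or value) reported when traversal ends here


@dataclass
class DecisionNode:
    attribute: str  # the single attribute tested at this node
    # one branch (and subtree) for each possible outcome of the test
    branches: Dict[str, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, DecisionNode]


def classify(tree, example):
    """Traverse from the root to a leaf, following the outcome of each test."""
    node = tree
    while isinstance(node, DecisionNode):
        node = node.branches[example[node.attribute]]
    return node.label


# Hypothetical two-level tree, for illustration only.
tree = DecisionNode("outlook", {
    "sunny": Leaf("yes"),
    "rain": DecisionNode("windy", {"true": Leaf("no"), "false": Leaf("yes")}),
})

print(classify(tree, {"outlook": "rain", "windy": "true"}))
```

Each example only needs values for the attributes that lie on its own path from root to leaf.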
Inductive Decision Trees (IDTs)
A Long and Diverse History
Independently developed in the 60’s and 70’s by researchers in ...
Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees)
Pattern Recognition: U. of Michigan - AID, G.V. Kass - CHAID (Chi-squared Automatic Interaction Detection)
AI and Info. Theory: R. Quinlan - ID3 (Iterative Dichotomizer), C4.5
Inducing a Decision Tree
Given: Set of examples with Pos. & Neg. classes
Problem: Generate a Decision Tree model to classify a separate (validation) set of examples with minimal error
Approach: Occam’s Razor - produce the simplest model that is consistent with the training examples -> narrow, short tree. Every traverse should be as short as possible
Formally: Finding the absolute simplest tree is intractable, but we can at least try our best
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree
The most important variable:
– the variable (predictor) that gains the most ground in classifying the set of training examples
– the variable that has the most significant relationship to the response variable
– the variable on which the response is most dependent (least independent)
Inducing a Decision Tree
Importance of a predictor variable
CHAID/CART
– The Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable
– The lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment)
C4.5
– Information-theoretic Gain is computed for each predictor, and the one with the highest Gain is chosen
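The gain-based choice of predictor can be sketched as follows. This is a minimal illustration of plain information gain (entropy before the split minus weighted entropy after it), not a reproduction of C4.5's own code; the function names and the four-row data set are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_predictor(rows, attributes, target):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Made-up training examples: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "true", "play": "yes"},
    {"outlook": "sunny", "windy": "false", "play": "yes"},
    {"outlook": "rain", "windy": "true", "play": "no"},
    {"outlook": "rain", "windy": "false", "play": "no"},
]

print(best_predictor(rows, ["outlook", "windy"], "play"))
```

Applied recursively to each subset of examples, this selection rule places the highest-gain test at the root of each successive subtree.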
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with multiple values into similar groups or segments, which are then treated as separate variables (CART/CHAID only)
The p-values from the Chi-squared or F statistic are used to determine the variable/value combinations which are most similar in terms of their relationship to the response variable
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 3: Prevent overfitting the tree to the training data, so that it generalizes well to a validation set, by:
Stopping: Prevent the split on a predictor variable if its p-value is above the chosen level of statistical significance (CHAID) or if the split leaves too few examples per leaf (C4.5)
Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)
Inducing a Decision Tree
Stopping means a choice of level of significance ....
If the probability (p-value) of the statistic is less than the chosen level of significance then a split is allowed
Typically the significance level is set to:
– 0.05 which provides 95% confidence
– 0.01 which provides 99% confidence
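As a rough sketch of the stopping rule, the code below tests a binary predictor against a binary response with a Pearson chi-squared statistic (1 degree of freedom) and allows the split only if the Bonferroni-adjusted p-value is below the significance level. The function names are hypothetical, and multiplying the p-value by a caller-supplied number of tests is the plain Bonferroni correction; how CHAID itself counts the tests is not reproduced here.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no evidence of dependence
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


def split_allowed(table, n_tests, alpha=0.05):
    """Allow the split if the Bonferroni-adjusted p-value is below alpha."""
    p = p_value_1df(chi_squared_2x2(*table))
    adjusted = min(1.0, p * n_tests)  # assumption: n_tests comparisons were made
    return adjusted < alpha


# Made-up counts: rows are predictor values, columns are response classes.
print(split_allowed((30, 10, 10, 30), n_tests=5))  # strong association
print(split_allowed((20, 20, 20, 20), n_tests=5))  # no association
```

Lowering `alpha` from 0.05 to 0.01 makes splits harder to justify and therefore yields a smaller tree.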
Estimating error rates
A pruning operation is performed if it does not increase the estimated error
Of course, error on the training data is not a useful estimator (it would result in almost no pruning)
One possibility: use a hold-out set for pruning (reduced-error pruning)
C4.5’s method: use the upper limit of the 25% confidence interval derived from the training data
– Standard Bernoulli-process-based method
Post-pruning in C4.5
Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
– Method 1: compute accuracy using examples not seen by the algorithm.
– Method 2: estimate accuracy using the training examples: consider classifying E examples incorrectly out of N examples as observing E events in N trials in the binomial distribution.
– For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E,N) with CF% confidence.
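One way to compute such an upper limit is sketched below, under the assumption that U_CF(E,N) is the error rate p at which observing E or fewer errors in N binomial trials has probability exactly CF. The function names are made up, and the root is found by simple bisection. For E = 0 this reproduces the values 0.206, 0.143 and 0.750 used in the worked example later in these slides; for E >= 1 an actual C4.5 implementation may use approximations, so its figures can differ slightly (the slides quote 0.157 for U_0.25(1,16)).

```python
from math import comb


def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(e + 1))


def upper_limit(e, n, cf=0.25):
    """Error rate p such that P(X <= e | n, p) = cf, found by bisection."""
    if e >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid  # the CDF decreases as p grows, so the root lies above mid
        else:
            hi = mid
    return (lo + hi) / 2


for errors, cases in [(0, 6), (0, 9), (0, 1), (1, 16)]:
    print(errors, cases, round(upper_limit(errors, cases), 3))
```

A smaller CF pushes the limit upward, giving a more pessimistic error estimate and hence heavier pruning.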
Usage in Statistics: Sampling error estimation
– population: 1,000,000 people
– population mean: percentage of left-handed people
– sample: 100 people
– sample mean: 6 left-handed
– How to estimate the REAL population mean given 25% confidence?
CF = confidence factor = .25 or 25%
Pessimistic estimate: U_0.25(6,100), the upper limit of the confidence interval
[Figure: distribution over the probability of being left-handed, centred on 6 with axis ticks at 2 and 10; L_0.25(6,100) and U_0.25(6,100) bound the 75% confidence interval]
Usage in IDTs: Node classification error estimation
– population: set of all possible training examples
– population mean: percentage of error made by this node
– sample: 100 examples from training data set
– sample mean: 6 errors for the training data set
– How to estimate the REAL average error rate given 25% confidence?
CF = confidence factor = .25 or 25%
Pessimistic estimate: U_0.25(6,100), the upper limit of the confidence interval
[Figure: distribution over the probability of error at the node, centred on 6 with axis ticks at 2 and 10; L_0.25(6,100) and U_0.25(6,100) bound the 75% confidence interval]
Usage in IDTs: Node classification error estimation
– population: set of all possible training examples
– population mean: percentage of error made by this node
– sample: 16 examples from training data set
– sample mean: 1 error for the training data set
– How to estimate the REAL average error rate given 25% confidence?
CF = confidence factor = .25 or 25%
Pessimistic estimate: U_0.25(1,16), the upper limit of the confidence interval
[Figure: distribution over the probability of error at the node, centred on 1 with axis ticks at 0.5 and 2.5; L_0.25(1,16) and U_0.25(1,16) bound the 75% confidence interval]
Usage in IDTs: Node classification error estimation
– population: set of all possible training examples
– population mean: percentage of error made by this node
– sample: 16 examples from training data set
– sample mean: 1 error for the training data set
– How to estimate the REAL average error rate given 10% confidence?
CF = confidence factor = .10 or 10%
Pessimistic estimate: U_0.10(1,16), the upper limit of the confidence interval
A heuristic that works well.
[Figure: distribution over the probability of error at the node, centred on 1 with axis ticks at 0.5 and 3.5; L_0.10(1,16) and U_0.10(1,16) bound the 90% confidence interval]
C4.5’s method
Error estimate for subtree is weighted sum of error estimates for all its leaves
Error estimate for a node:
$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
If c = 25% then z = 0.69 (from normal distribution)
f is the error on the training data
N is the number of instances covered by the leaf
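The node error estimate can be sketched in code, assuming it takes the usual upper-confidence-bound form e = (f + z²/2N + z·sqrt(f/N − f²/N + z²/4N²)) / (1 + z²/N) with the symbols f, N and z defined on this slide. The function name `pessimistic_error` is made up. Because this is a normal approximation, its values need not coincide with binomial-based U_CF(E,N) figures quoted elsewhere in the slides.

```python
from math import sqrt


def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound on the true error rate of a node.

    f: observed error rate on the training data (errors / n)
    n: number of training instances covered by the node
    z: normal deviate for the chosen confidence (0.69 for c = 25%)
    """
    numerator = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)


# The bound sits above the observed rate and tightens as n grows.
for n in (10, 50, 5000):
    print(n, round(pessimistic_error(0.2, n), 4))
```

The estimate for a subtree is then the sum of the leaf estimates weighted by the number of instances each leaf covers.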
Example for Estimating Error
Consider a subtree with 3 leaf nodes:
– Sunny: Play = yes (0 errors, 6 instances)
– Overcast: Play = yes (0 errors, 9 instances)
– Cloudy: Play = no (0 errors, 1 instance)
with U_0.25(0,6) = 0.206, U_0.25(0,9) = 0.143, U_0.25(0,1) = 0.750
The estimated error for this subtree is
– 6*0.206 + 9*0.143 + 1*0.750 = 3.273
If the subtree is replaced with the leaf “yes”, the estimated error is
– 16 * U_0.25(1,16) = 16 * 0.157 = 2.512
Since 2.512 < 3.273, the subtree is replaced by the leaf.
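The comparison in this example can be checked with a few lines of arithmetic. The U values (0.206, 0.143, 0.750 and 0.157) are taken from the slide as given rather than recomputed, and the variable names are made up for the illustration.

```python
# (instances covered, pessimistic error rate U_0.25) for each leaf of the subtree
leaves = [(6, 0.206), (9, 0.143), (1, 0.750)]

# Subtree estimate: sum of leaf estimates weighted by instances covered.
subtree_error = sum(n * u for n, u in leaves)

# Single-leaf estimate: all 16 instances labelled "yes", 1 of them wrongly.
leaf_error = 16 * 0.157

# Prune when the single leaf is predicted to do no worse than the subtree.
prune = leaf_error <= subtree_error

print(round(subtree_error, 3), round(leaf_error, 3), prune)
```

Note that the single "Cloudy" instance contributes 0.750 on its own, which is why so small a leaf makes the subtree look expensive.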
IDT Training
IDT Training
How do you ensure that a decision tree has been well trained?
Objective: To achieve good generalization accuracy on new examples/cases
Establish a maximum acceptable error rate
Train the tree, using a tuning/test set if needed
Validate the trained tree against a separate test set, which is usually referred to as a validation, hold-out or production set
IDT Training
Approach #1: Large Sample
When the amount of available data is large ...
Available Examples: divide randomly into
– Training Set (70%): used to develop one IDT model
– HO Set (30%): compute goodness of fit
Generalization = goodness of fit (accuracy)
[Figure also shows a Test Set box]
IDT Training
Approach #2: Cross-validation
When the amount of available data is small ...
Available Examples: divide randomly into
– Training Set (90%): used to develop 10 different IDT models
– HO Set (10%): tabulate goodness of fit stats
Repeat 10 times
Generalization = mean and stddev of goodness of fit
[Figure also shows a Test Set box]
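The cross-validation procedure can be sketched as follows. The function names are made up, and `majority_learner` is only a stand-in so that the example runs on its own; in practice a decision tree learner would be passed as `train_fn` instead. Each of the 10 rounds holds out a different 10% of the examples and trains on the remaining 90%.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, holdout_indices) for each of k random folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(examples, labels, train_fn, k=10):
    """Train k models; return mean and std. dev. of hold-out accuracy."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(examples), k):
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(model(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return mean(accuracies), stdev(accuracies)


def majority_learner(xs, ys):
    """Stand-in learner (not a decision tree): always predicts the commonest class."""
    top = Counter(ys).most_common(1)[0][0]
    return lambda x: top


# Made-up data: 70 "yes" and 30 "no" examples.
examples = list(range(100))
labels = ["yes"] * 70 + ["no"] * 30
avg, sd = cross_validate(examples, labels, majority_learner)
print(round(avg, 3), round(sd, 3))
```

Reporting the standard deviation alongside the mean shows how much the estimate depends on which examples happened to be held out.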
IDT Training
How do you select between two induced decision trees?
A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of two IDT models
If the Large Sample method has been used, then apply McNemar’s test* or a difference of proportions test
If Cross-validation has been used, then use a paired t test for difference of two proportions
*We assume a classification problem; if this is function approximation then use a paired t test for difference of means
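For the large-sample case, McNemar's test can be sketched as below. It compares two classifiers on the same hold-out examples using only the cases where exactly one of them is correct, with the usual continuity correction and a chi-squared reference distribution with 1 degree of freedom. The function name and the prediction lists are made up for the illustration.

```python
from math import erfc, sqrt


def mcnemar(preds_a, preds_b, truth):
    """Return (chi-squared statistic, p-value) for McNemar's test."""
    # b: model A right and model B wrong; c: model A wrong and model B right
    b = sum(pa == t and pb != t for pa, pb, t in zip(preds_a, preds_b, truth))
    c = sum(pa != t and pb == t for pa, pb, t in zip(preds_a, preds_b, truth))
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)  # continuity-corrected
    return chi2, erfc(sqrt(chi2 / 2.0))  # upper tail of chi-squared, 1 d.f.


# Made-up hold-out results: 40 cases, all with true class 1.
truth = [1] * 40
preds_a = [1] * 35 + [0] * 5            # tree A: 35 right, 5 wrong
preds_b = [1] * 20 + [0] * 15 + [1] * 5  # tree B: 25 right, 15 wrong

chi2, p = mcnemar(preds_a, preds_b, truth)
print(round(chi2, 2), round(p, 4))
```

A p-value below the chosen significance level suggests that the difference in fit between the two trees is unlikely to be due to chance alone.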
Data Preparation
Garbage in, garbage out
The quality of results relates directly to the quality of the data
50%-70% of IDT development time will be spent on data preparation
The three steps of data preparation:
– Consolidation and Cleaning
– Selection and Preprocessing
– Transformation and Encoding
Data Preparation
Data Types and IDTs
Three basic data types:
– nominal: discrete symbolic (A, yes, small)
– ordinal: discrete numeric (-5, 3, 24)
– continuous: numeric (0.23, -45.2, 500.43)
IDTs can generally accept all types of data plus a 4th form called float = any of the above plus a special value for n/a or unknown
Pros and Cons of IDTs
Cons:
Only one response variable at a time
Different significance tests required for nominal and continuous responses
Can have difficulties with noisy data
Discriminant functions are often suboptimal due to orthogonal decision hyperplanes
Pros and Cons of IDTs
Pros:
Proven modeling method for 20 years
Provides explanation and prediction
Ability to learn arbitrary functions
Handles unknown values well
Rapid training and recognition speed
Has inspired many inductive learning
algorithms using statistical regression
The IDT Application Development Process
Guidelines for inducing decision trees
1. IDTs are a good method to start with
2. Get a suitable training set
3. Use a sensible coding for input variables
4. Develop the simplest tree by adjusting tuning parameters (significance level)
5. Use a test set to prevent over-fitting
6. Determine confidence in generalization through cross-validation
TUTORIAL
Complete the C4.5 IDT Tutorial Handout
– Requires ssh access to euler.acadiau.ca
THE END
daniel.silver@dal.ca