Data Mining using Decision Trees
Professor J. F. Baldwin
Decision Trees from a Database

Ex Num | Att Size | Att Colour | Att Shape | Concept Satisfied
-------+----------+------------+-----------+------------------
1      | med      | blue       | brick     | yes
2      | small    | red        | wedge     | no
3      | small    | red        | sphere    | yes
4      | large    | red        | wedge     | no
5      | large    | green      | pillar    | yes
6      | large    | red        | pillar    | no
7      | large    | green      | sphere    | yes
Choose target: Concept Satisfied.
Use all attributes except Ex Num.
CLS - Concept Learning System (Hunt et al.)

Tree structure: a parent node tests an attribute V, and branches
labelled v1, v2, v3 lead to the children nodes. A node containing a
mixture of +ve and -ve examples must be expanded further.
CLS ALGORITHM
1. Initialise the tree T by setting it to consist of one node
   containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn.
   Partition T into subsets T1, ..., Tn according to the values of F.
   Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
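Steps 1-5 can be sketched in Python. This is a minimal sketch: the data encoding and function names are my own, and attributes are simply taken in the given order, as CLS allows.

```python
# Minimal sketch of CLS steps 1-5. Attributes are used in the order
# given; ID3's entropy-based ordering comes later in these notes.
DATA = [  # attribute dict -> target label, examples 1-7 from the table
    ({"Size": "med",   "Colour": "blue",  "Shape": "brick"},  "yes"),
    ({"Size": "small", "Colour": "red",   "Shape": "wedge"},  "no"),
    ({"Size": "small", "Colour": "red",   "Shape": "sphere"}, "yes"),
    ({"Size": "large", "Colour": "red",   "Shape": "wedge"},  "no"),
    ({"Size": "large", "Colour": "green", "Shape": "pillar"}, "yes"),
    ({"Size": "large", "Colour": "red",   "Shape": "pillar"}, "no"),
    ({"Size": "large", "Colour": "green", "Shape": "sphere"}, "yes"),
]

def cls(examples, attributes):
    labels = {label for _, label in examples}
    if labels == {"yes"}:                        # step 2: all +ve
        return "YES"
    if labels == {"no"}:                         # step 3: all -ve
        return "NO"
    attr, rest = attributes[0], attributes[1:]   # step 4: pick an attribute
    children = {}
    for value in sorted({atts[attr] for atts, _ in examples}):
        subset = [(a, l) for a, l in examples if a[attr] == value]
        children[value] = cls(subset, rest)      # step 5: recurse
    return (attr, children)

tree = cls(DATA, ["Size", "Shape", "Colour"])
```

With Size chosen first, this reproduces the expansion worked through on the next slides: med is immediately YES, while small and large are split further by Shape (and then Colour).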
Database Example

Using attribute SIZE on {1, 2, 3, 4, 5, 6, 7}:

  SIZE = med   -> {1}           YES
  SIZE = small -> {2, 3}        expand
  SIZE = large -> {4, 5, 6, 7}  expand

Expanding {2, 3} with SHAPE:

  SHAPE = wedge  -> {2}  NO
  SHAPE = sphere -> {3}  YES

Expanding {4, 5, 6, 7} with SHAPE:

  SHAPE = wedge  -> {4}     NO
  SHAPE = sphere -> {7}     YES
  SHAPE = pillar -> {5, 6}  expand with COLOUR:
      COLOUR = red   -> {6}  NO
      COLOUR = green -> {5}  YES
Rules from Tree

IF   (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
  OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF   (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
  OR (SIZE = small AND SHAPE = sphere)
  OR (SIZE = medium)
THEN YES
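Reading the NO-rule off the tree, it can be checked mechanically. A minimal sketch (the function name and argument order are my own):

```python
# Sketch: the NO-rule from the tree as a Python predicate.
# An example satisfies the concept exactly when this rule does not fire.
def is_no(size, shape, colour):
    return ((size == "large" and (shape == "wedge" or
                                  (shape == "pillar" and colour == "red")))
            or (size == "small" and shape == "wedge"))

print(is_no("large", "pillar", "red"))   # example 6 -> True (NO)
print(is_no("med", "brick", "blue"))     # example 1 -> False (YES)
```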
Disjunctive Normal Form - DNF

IF   (SIZE = medium)
  OR (SIZE = small AND SHAPE = sphere)
  OR (SIZE = large AND SHAPE = sphere)
  OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes.
Entropy is used to order the attributes.

In the CLS algorithm attributes can be chosen in any order, which can
result in large decision trees if the ordering is not optimal. An
optimal ordering would give the smallest decision tree, but no method
is known to determine it. We therefore use a heuristic that provides an
efficient, near-optimal ordering.
Entropy

For a random variable V which can take values {v1, v2, ..., vn} with
Pr(vi) = pi, for all i, the entropy of V is given by

    S(V) = - Σi pi log(pi)

Entropy for a fair die (natural logs):
    S = - Σ (1/6) ln(1/6) = ln 6 = 1.7917
Entropy for a fair die given an even score:
    S = - Σ (1/3) ln(1/3) = ln 3 = 1.0986

Information gain = 1.7917 - 1.0986 = 0.6931 (= ln 2)

Information gain is a difference between entropies. (This example uses
natural logarithms; the database example below uses logs to base 2.)
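The die calculation can be sketched in Python (the function name is my own; natural logs by default, matching the example above):

```python
from math import log

# Sketch: entropy of a discrete distribution, S(V) = -sum(p * log p).
def entropy(probs, base=None):
    logf = log if base is None else (lambda x: log(x, base))
    return -sum(p * logf(p) for p in probs if p > 0)

fair_die = [1/6] * 6
even_die = [1/3] * 3                          # fair die given an even score
gain = entropy(fair_die) - entropy(even_die)  # ln 6 - ln 3 = ln 2 ≈ 0.6931
```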
Attribute Expansion

Start with the joint distribution Pr(A1, ..., Ai, ..., An, T) over all
attributes and the target T - equally likely over the examples unless
otherwise specified.

Expanding attribute Ai with values ai1, ..., aim creates one child node
per value. The child for Ai = ai1 holds the conditional distribution

    Pr(A1, ..., A(i-1), A(i+1), ..., An, T | Ai = ai1)

over the other attributes and T: pass down the probabilities
corresponding to ai1 from above and re-normalise - equally likely again
if the parent distribution was equally likely.
Expected Entropy for an Attribute

Expand attribute Ai (values ai1, ..., aim) against the target T. For
each value aij, pass down the probabilities corresponding to each
target value tk for the rows with Ai = aij and re-normalise, giving
Pr(T | Ai = aij). Let S(aij) be the entropy of this conditional
distribution.

Expected entropy for Ai:

    S(Ai) = Σj Pr(Ai = aij) S(aij)
How to Choose an Attribute, and Information Gain

Determine the expected entropy for each attribute, i.e. S(Ai), for all i.
Choose s such that

    S(As) = MINi S(Ai)

and expand attribute As. By choosing attribute As the information gain is

    S - S(As)

where S is the entropy of the target T. Minimising expected entropy is
equivalent to maximising information gain.
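This selection step can be sketched on the 7-example database, using log base 2 as in the worked example that follows (data encoding and names are my own):

```python
from math import log2
from collections import Counter

# Sketch: expected entropy S(Ai) and information gain for each attribute.
ROWS = [  # (Size, Colour, Shape, Satisfied), examples 1-7
    ("med", "blue", "brick", "yes"),     ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"),   ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"), ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]
ATTRS = {"Size": 0, "Colour": 1, "Shape": 2}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def expected_entropy(attr):
    i, n = ATTRS[attr], len(ROWS)
    total = 0.0
    for v in {r[i] for r in ROWS}:
        subset = [r[3] for r in ROWS if r[i] == v]
        total += len(subset)/n * entropy(subset)  # S(Ai) = Σ Pr(aij) S(aij)
    return total

S = entropy([r[3] for r in ROWS])                 # entropy of the target
gains = {a: S - expected_entropy(a) for a in ATTRS}
best = max(gains, key=gains.get)                  # attribute to expand
```

This reproduces the gains in the "First Expansion" slide (Size 0.13, Colour 0.52, Shape 0.70), so Shape is expanded first.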
Previous Example

Ex Num | Att Size | Att Colour | Att Shape | Concept Satisfied | Pr
-------+----------+------------+-----------+-------------------+-----
1      | med      | blue       | brick     | yes               | 1/7
2      | small    | red        | wedge     | no                | 1/7
3      | small    | red        | sphere    | yes               | 1/7
4      | large    | red        | wedge     | no                | 1/7
5      | large    | green      | pillar    | yes               | 1/7
6      | large    | red        | pillar    | no                | 1/7
7      | large    | green      | sphere    | yes               | 1/7
Pr for the target Concept Satisfied: Pr(yes) = 4/7, Pr(no) = 3/7

S = -(4/7)log2(4/7) - (3/7)log2(3/7) = 0.99
Entropy for Attribute Size

Att Size | Concept Satisfied | Pr
---------+-------------------+-----
med      | yes               | 1/7
small    | no                | 1/7
small    | yes               | 1/7
large    | no                | 2/7
large    | yes               | 2/7

Conditional distributions of Concept Satisfied:

  Size = med:   Pr(yes) = 1                  -> S(med)   = 0
  Size = small: Pr(no) = 1/2, Pr(yes) = 1/2  -> S(small) = 1
  Size = large: Pr(no) = 1/2, Pr(yes) = 1/2  -> S(large) = 1

Pr(med) = 1/7, Pr(small) = 2/7, Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information gain for Size = 0.99 - 0.86 = 0.13
First Expansion

Attribute | Information Gain
----------+-----------------
SIZE      | 0.13
COLOUR    | 0.52
SHAPE     | 0.70   <- choose max

Expanding {1, 2, 3, 4, 5, 6, 7} with SHAPE:

  SHAPE = wedge  -> {2, 4}  NO
  SHAPE = brick  -> {1}     YES
  SHAPE = sphere -> {3, 7}  YES
  SHAPE = pillar -> {5, 6}  expand
Complete Decision Tree

Expanding {1, 2, 3, 4, 5, 6, 7} with SHAPE:

  SHAPE = wedge  -> {2, 4}  NO
  SHAPE = brick  -> {1}     YES
  SHAPE = sphere -> {3, 7}  YES
  SHAPE = pillar -> {5, 6}  expand with COLOUR:
      COLOUR = red   -> {6}  NO
      COLOUR = green -> {5}  YES

Rule:
IF   Shape is wedge
  OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES
A New Case

Att Size | Att Colour | Att Shape | Concept Satisfied
---------+------------+-----------+------------------
med      | red        | pillar    | ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO.
Post Pruning

Consider any node S with N examples, n of which belong to class C,
where C is the class with the most examples (the majority class) and is
one of {YES, NO}.

Suppose we terminate this node and make it a leaf with classification C.
What will be the expected error, E(S), if we use the tree for new cases
and we reach this node?

    E(S) = Pr(class of new case is a class ≠ C)
Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S.
We do not know p. Let f(p) be a prior probability distribution for p on
[0, 1]. We can update this prior using Bayes' updating with the
information at node S, namely "n C in S" (n of the N examples in S are
of class C):

    f(p | n C in S) = Pr(n C in S | p) f(p) / ∫[0,1] Pr(n C in S | p) f(p) dp
Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Since Pr(n C in S | p) is
proportional to p^n (1 - p)^(N-n), the posterior is

    f(p | n C in S) = p^n (1 - p)^(N-n) / ∫[0,1] p^n (1 - p)^(N-n) dp

The expected error is the expectation of (1 - p) under this posterior:

    E(S) = E[1 - p]
         = ∫[0,1] p^n (1 - p)^(N-n+1) dp / ∫[0,1] p^n (1 - p)^(N-n) dp
         = (N - n + 1) / (N + 2)

using Beta functions. The integrals are evaluated with

    ∫[0,1] x^a (1 - x)^b dx = a! b! / (a + b + 1)!

so that, for example, the numerator is n! (N - n + 1)! / (N + 2)!
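The closed form can be checked numerically against the factorial identity (a sketch; function names are my own):

```python
from math import factorial

# Beta-function identity: integral of x^a (1-x)^b over [0,1].
def beta_integral(a, b):
    return factorial(a) * factorial(b) / factorial(a + b + 1)

# E(S) as the ratio of the two integrals in the derivation above.
def expected_error(N, n):
    return beta_integral(n, N - n + 1) / beta_integral(n, N - n)

# Agrees with the closed form (N - n + 1)/(N + 2) for several (N, n).
for N, n in [(10, 6), (6, 4), (4, 2), (3, 2)]:
    assert abs(expected_error(N, n) - (N - n + 1) / (N + 2)) < 1e-12
```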
Post Pruning for the Binary Case

For any node S which is not a leaf node, with child nodes S1, S2, ..., Sm:

    Pi = (number of examples in Si) / (number of examples in S)

    BackUpError(S) = Σi Pi Error(Si)

    Error(S) = MIN{ E(S), BackUpError(S) }

For leaf nodes Si, Error(Si) = E(Si).

Decision: prune at S if BackUpError(S) ≥ E(S), so that Error(S) = E(S)
and the sub-tree below S is cut.
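The bottom-up calculation can be sketched in Python (the node encoding is my own; the tree is the one in the worked example that follows):

```python
# Sketch of the pruning rule. A node is (yes, no, children);
# children is [] for a leaf.
def static_error(yes, no):
    N, n = yes + no, max(yes, no)        # n = majority class count
    return (N - n + 1) / (N + 2)         # E(S)

def prune(node):
    """Return (Error(S), possibly pruned node)."""
    yes, no, children = node
    e = static_error(yes, no)
    if not children:
        return e, node                   # leaf: Error(S) = E(S)
    results = [prune(c) for c in children]
    N = yes + no
    backup = sum(((cy + cn) / N) * err   # BackUpError(S) = Σ Pi Error(Si)
                 for err, (cy, cn, _) in results)
    if backup >= e:                      # prune: cut the sub-tree below S
        return e, (yes, no, [])
    return backup, (yes, no, [r[1] for r in results])

# Example tree from the next slide: a -> b, c; c -> d and a leaf.
d = (1, 2, [(0, 1, []), (1, 1, [])])
c = (2, 2, [d, (1, 0, [])])
b = (4, 2, [(3, 2, []), (1, 0, [])])
a = (6, 4, [b, c])
error_a, pruned = prune(a)
print(round(error_a, 3))  # 0.378
```

Running this prunes at b and at d, exactly as in the worked example below.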
Example of Post Pruning

Before pruning. [x, y] means x YES cases and y NO cases; each node
shows its static error E(S), internal nodes also show BackUpError(S),
and Error(S) is the smaller of the two:

  a [6, 4]        E = 0.417   BackUpError = 0.378   Error = 0.378
    b [4, 2]      E = 0.375   BackUpError = 0.413   Error = 0.375   PRUNE
      [3, 2]      E = 0.429
      [1, 0]      E = 0.333
    c [2, 2]      E = 0.5     BackUpError = 0.383   Error = 0.383
      d [1, 2]    E = 0.4     BackUpError = 0.444   Error = 0.4     PRUNE
        [0, 1]    E = 0.333
        [1, 1]    E = 0.5
      [1, 0]      E = 0.333

PRUNE means cut the sub-tree below this point.
Result of Pruning

After pruning:

  a [6, 4]
    [4, 2]        (formerly b, now a leaf)
    c [2, 2]
      [1, 2]      (formerly d, now a leaf)
      [1, 0]
Generalisation

For the case in which we have k classes the generalisation for E(S) is

    E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
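As a quick check, the k-class formula reduces to the binary one when k = 2 (a sketch; the function name is my own):

```python
# Sketch: expected error for k classes, E(S) = (N - n + k - 1)/(N + k).
def expected_error_k(N, n, k):
    return (N - n + k - 1) / (N + k)

# k = 2 recovers the binary formula (N - n + 1)/(N + 2).
assert expected_error_k(10, 6, 2) == (10 - 6 + 1) / (10 + 2)
```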
Testing

Split the DataBase into a Training Set and a Test Set.

Learn rules using the Training Set and prune. Test the rules on the
Training Set and record the % correct, then test the rules on the Test
Set and record the % correct.

The % accuracy on the Test Set should be close to that on the Training
Set. This indicates good generalisation.

Over-fitting can occur if noisy data is used or too-specific attributes
are used. Pruning will overcome noise to some extent, but not
completely; too-specific attributes must be dropped.
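The split-and-compare protocol can be sketched in Python (all names and the toy data are my own; the lambda stands in for a rule set learned and pruned on the training set):

```python
import random

# Minimal sketch of the testing protocol: split the database, then
# compare % correct on the training and test sets.
def train_test_split(examples, test_fraction=0.3, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, examples):
    correct = sum(1 for atts, label in examples if classify(atts) == label)
    return correct / len(examples)

# Toy data: shape alone determines the label.
data = [({"shape": s}, "yes" if s == "sphere" else "no")
        for s in ["sphere", "wedge"] * 10]
train, test = train_test_split(data)
rule = lambda atts: "yes" if atts["shape"] == "sphere" else "no"
# Close training and test accuracy indicates good generalisation.
print(accuracy(rule, train), accuracy(rule, test))
```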