Data Mining using Decision Trees



Professor J. F. Baldwin

Decision Trees from Data Base

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied
1        med        blue         brick       yes
2        small      red          wedge       no
3        small      red          sphere      yes
4        large      red          wedge       no
5        large      green        pillar      yes
6        large      red          pillar      no
7        large      green        sphere      yes

Choose target: Concept Satisfied

Use all attributes except Ex Num

CLS - Concept Learning System (Hunt et al.)

Tree structure: a parent node holding a mixture of +ve and -ve examples is split on an attribute V; each value v1, v2, v3 of V gives a branch to a child node containing the examples that take that value.

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.

2. If all the examples in T are +ve, create a YES node and HALT.

3. If all the examples in T are -ve, create a NO node and HALT.

4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.

5. Apply the procedure recursively to each child node.
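Since the procedure is short, a minimal Python sketch is given below, assuming each example is a dict with a "Satisfied" target taking "yes"/"no"; the function name cls and the data layout are illustrative, not part of the original notes. Plain CLS does not specify how to pick the attribute, so this sketch simply takes the first unused one.

```python
# Minimal CLS sketch: examples are dicts, target key is "Satisfied" (yes/no).
# Attribute choice is arbitrary (first unused), as in plain CLS; this data set
# always separates before the attributes run out.

def cls(examples, attributes):
    labels = {e["Satisfied"] for e in examples}
    if labels == {"yes"}:
        return "YES"                        # step 2: all +ve
    if labels == {"no"}:
        return "NO"                         # step 3: all -ve
    attr = attributes[0]                    # step 4: select an attribute F
    rest = attributes[1:]
    tree = {}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        tree[value] = cls(subset, rest)     # step 5: recurse on each child
    return (attr, tree)

data = [
    {"Size": "med",   "Colour": "blue",  "Shape": "brick",  "Satisfied": "yes"},
    {"Size": "small", "Colour": "red",   "Shape": "wedge",  "Satisfied": "no"},
    {"Size": "small", "Colour": "red",   "Shape": "sphere", "Satisfied": "yes"},
    {"Size": "large", "Colour": "red",   "Shape": "wedge",  "Satisfied": "no"},
    {"Size": "large", "Colour": "green", "Shape": "pillar", "Satisfied": "yes"},
    {"Size": "large", "Colour": "red",   "Shape": "pillar", "Satisfied": "no"},
    {"Size": "large", "Colour": "green", "Shape": "sphere", "Satisfied": "yes"},
]
print(cls(data, ["Size", "Colour", "Shape"]))
```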

Data Base Example

Using attribute SIZE on {1, 2, 3, 4, 5, 6, 7}:

  med   -> {1}           all +ve: YES
  small -> {2, 3}        mixed: Expand
  large -> {4, 5, 6, 7}  mixed: Expand

Expanding

{1, 2, 3, 4, 5, 6, 7} split on SIZE:

  med   -> {1}: YES
  small -> {2, 3} split on SHAPE:
             wedge  -> {2}: no
             sphere -> {3}: yes
  large -> {4, 5, 6, 7} split on SHAPE:
             wedge  -> {4}: No
             sphere -> {7}: Yes
             pillar -> {5, 6} split on COLOUR:
                         red   -> {6}: No
                         green -> {5}: Yes

Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES

Disjunctive Normal Form - DNF

IF   (SIZE = medium)
OR   (SIZE = small AND SHAPE = sphere)
OR   (SIZE = large AND SHAPE = sphere)
OR   (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes.

Entropy is used to order the attributes. In the CLS algorithm the attributes are chosen in any order, which can result in large decision trees if the ordering is not optimal; the optimal ordering would give the smallest decision tree. No method is known for determining the optimal ordering, so we use a heuristic that provides an efficient ordering and a near-optimal tree.

Entropy

For a random variable V which can take values {v1, v2, ..., vn} with Pr(vi) = pi, all i, the entropy of V is given by

  S(V) = - Σi pi log pi

Entropy for a fair die (using natural logs) = log 6 = 1.7917

Entropy for a fair die known to show an even score = log 3 = 1.0986

Information gain = 1.7917 - 1.0986 = 0.6931, the difference between the two entropies.
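As a quick numerical check of the die example, here is a short Python sketch; the helper name entropy is mine, and natural logarithms are used to match the figures above.

```python
import math

def entropy(probs):
    """S(V) = -sum_i p_i * log(p_i), skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_die = [1/6] * 6              # six equally likely faces
even_score = [1/3] * 3            # told the score is even: 2, 4 or 6
print(entropy(fair_die))                        # 1.7917...
print(entropy(even_score))                      # 1.0986...
print(entropy(fair_die) - entropy(even_score))  # information gain 0.6931...
```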

Attribute Expansion

Start from the joint probability Pr(A1, ..., Ai, ..., An, T) over all the attributes and the target T. Rows are taken as equally likely unless specified otherwise.

Expanding attribute Ai, with values ai1, ..., aim, gives one branch per value. The branch for Ai = aik carries the probability over the other attributes and T,

  Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = aik)

obtained by passing down from above the probabilities of the rows with Ai = aik and re-normalising. If the rows were equally likely before the expansion they are equally likely again afterwards.


Expected Entropy for an Attribute

Take attribute Ai, with values ai1, ..., aim, and the target T. For each value aik, pass down from above the probabilities of the target values tk, re-normalise to obtain Pr(T | Ai = aik), and let S(aik) be the entropy of this distribution.

Expected entropy for Ai:

  S(Ai) = Σk Pr(Ai = aik) S(aik)

How to choose attribute and Information gain

Determine the expected entropy S(Ai) for every attribute Ai. Choose s such that

  S(As) = min_i S(Ai)

and expand attribute As. By choosing attribute As the information gain is

  S - S(As)

where S is the entropy of the target T at the current node. Minimising expected entropy is equivalent to maximising information gain.
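A minimal Python sketch of this choice, assuming each example is a dict and the target column name is passed in; entropy, expected_entropy and information_gain are my own helper names, and logs are taken to base 2 to match the worked example that follows.

```python
import math
from collections import Counter

def entropy(labels):
    """S = -sum_k p_k log2(p_k) over the values in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def expected_entropy(examples, attribute, target):
    """S(Ai) = sum_k Pr(Ai = a_ik) * S(a_ik)."""
    total = len(examples)
    return sum(
        (len(part) / total) * entropy([e[target] for e in part])
        for value in {e[attribute] for e in examples}
        for part in [[e for e in examples if e[attribute] == value]]
    )

def information_gain(examples, attribute, target):
    return entropy([e[target] for e in examples]) - expected_entropy(examples, attribute, target)
```

ID3 expands, at each node, the attribute with the smallest expected entropy, i.e. the largest information gain.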

Previous Example

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied   Pr
1        med        blue         brick       yes                 1/7
2        small      red          wedge       no                  1/7
3        small      red          sphere      yes                 1/7
4        large      red          wedge       no                  1/7
5        large      green        pillar      yes                 1/7
6        large      red          pillar      no                  1/7
7        large      green        sphere      yes                 1/7

Concept Satisfied   Pr
yes                 4/7
no                  3/7

S = - (4/7)Log(4/7) - (3/7)Log(3/7) = 0.99   (logs to base 2)



Entropy for attribute Size

Att Size   Concept Satisfied   Pr
med        yes                 1/7
small      no                  1/7
small      yes                 1/7
large      no                  2/7
large      yes                 2/7

Conditional distributions of Concept Satisfied:

  given small:  no 1/2, yes 1/2
  given med:    yes 1
  given large:  no 1/2, yes 1/2

S(small) = 1    S(med) = 0    S(large) = 1
Pr(small) = 2/7    Pr(med) = 1/7    Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information Gain for Size = 0.99 - 0.86 = 0.13

First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.70

Choose the maximum: SHAPE.
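These three gains can be reproduced with a short, self-contained Python script; the tuple layout and helper names are mine, and logs are to base 2.

```python
import math
from collections import Counter

data = [
    ("med", "blue", "brick", "yes"), ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"), ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"), ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]
COLS = {"SIZE": 0, "COLOUR": 1, "SHAPE": 2}   # column index of each attribute

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attribute):
    col = COLS[attribute]
    expected = sum(
        (len(part) / len(data)) * entropy([row[3] for row in part])
        for value in {row[col] for row in data}
        for part in [[row for row in data if row[col] == value]]
    )
    return entropy([row[3] for row in data]) - expected

for attribute in COLS:
    print(attribute, round(gain(attribute), 2))   # SIZE 0.13, COLOUR 0.52, SHAPE 0.7
```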

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:

  wedge  -> {2, 4}: NO
  brick  -> {1}: YES
  sphere -> {3, 7}: YES
  pillar -> {5, 6}: Expand

Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:

  wedge  -> {2, 4}: NO
  brick  -> {1}: YES
  sphere -> {3, 7}: YES
  pillar -> {5, 6} split on COLOUR:
              red   -> {6}: NO
              green -> {5}: YES

Rule:

IF   Shape is wedge
OR   (Shape is pillar AND Colour is red)
THEN NO
ELSE YES

(equivalently: YES for brick, sphere, and green pillars)

A new case

Att Size   Att Colour   Att Shape   Concept Satisfied
med        red          pillar      ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO.
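Classifying the new case can be sketched in a few lines of Python; the nested-dict representation of the tree is my own.

```python
# The complete decision tree as (attribute, {value: subtree}) pairs; leaves are YES/NO.
tree = ("Shape", {
    "wedge":  "NO",
    "brick":  "YES",
    "sphere": "YES",
    "pillar": ("Colour", {"red": "NO", "green": "YES"}),
})

def classify(node, case):
    while isinstance(node, tuple):          # descend until a YES/NO leaf is reached
        attribute, branches = node
        node = branches[case[attribute]]
    return node

print(classify(tree, {"Size": "med", "Colour": "red", "Shape": "pillar"}))   # NO
```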

Post Pruning

Consider any node S containing N examples, n of which belong to class C, where C is the class with most examples in the node (the majority class, one of {YES, NO}).

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

  E(S) = Pr(class of new case is a class ≠ C)

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes' updating with the information at node S, namely that n of the N examples in S are of class C:

  f(p | n C in S) = Pr(n C in S | p) f(p) / ∫[0,1] Pr(n C in S | p) f(p) dp

Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

  f(p | n C in S) = p^n (1-p)^(N-n) / ∫[0,1] p^n (1-p)^(N-n) dp

  E(S) = E[1 - p] with respect to f(p | n C in S)

       = ∫[0,1] p^n (1-p)^(N-n+1) dp / ∫[0,1] p^n (1-p)^(N-n) dp

       = (N - n + 1) / (N + 2)

using Beta functions: the numerator integral evaluates to n! (N - n + 1)! / (N + 2)!, from the general result

  ∫[0,1] x^a (1-x)^b dx = a! b! / (a + b + 1)!
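The closed form is easy to check numerically; the sketch below compares (N - n + 1)/(N + 2) with a direct Riemann-sum evaluation of the two integrals (helper names are mine).

```python
def expected_error(n, N):
    """Closed form E(S) = (N - n + 1) / (N + 2)."""
    return (N - n + 1) / (N + 2)

def expected_error_numeric(n, N, steps=100_000):
    """E[1 - p] under the posterior proportional to p^n (1-p)^(N-n)."""
    dp = 1.0 / steps
    ps = [(i + 0.5) * dp for i in range(steps)]
    numerator = sum(p**n * (1 - p)**(N - n + 1) for p in ps)
    denominator = sum(p**n * (1 - p)**(N - n) for p in ps)
    return numerator / denominator

print(expected_error(6, 10), expected_error_numeric(6, 10))   # both about 0.4167
```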

Post Pruning for Binary Case

Node S has children S1, S2, ..., Sm, reached with proportions P1, P2, ..., Pm, where

  Pi = (number of examples in Si) / (number of examples in S)

For any node S which is not a leaf node we can calculate

  BackUpError(S) = Σi Pi Error(Si)

  Error(S) = MIN{ E(S), BackUpError(S) }

For leaf nodes Si, Error(Si) = E(Si).

Decision: prune at S if BackUpError(S) ≥ E(S).

Example of Post Pruning

[x, y] means x YES cases and y NO cases. For each node E(S) is shown; for internal nodes BackUpError(S) is shown as well, and Error(S) is the smaller of the two (underlined on the original slide). PRUNE means cut the sub-tree below this point.

Before Pruning:

  a [6, 4]   E = 0.417   BackUpError = 0.378
  +-- b [4, 2]   E = 0.375   BackUpError = 0.413   PRUNE
  |   +-- [3, 2]   E = 0.429
  |   +-- [1, 0]   E = 0.333
  +-- c [2, 2]   E = 0.5     BackUpError = 0.383
      +-- d [1, 2]   E = 0.4   BackUpError = 0.444   PRUNE
      |   +-- [1, 1]   E = 0.5
      |   +-- [0, 1]   E = 0.333
      +-- [1, 0]   E = 0.333
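A compact Python sketch of this bottom-up pass over the example tree (the nested-tuple representation and function names are mine); it reproduces the errors and prune decisions above.

```python
def E(yes, no):
    """Expected leaf error (N - n + 1)/(N + 2), with n the majority count."""
    N, n = yes + no, max(yes, no)
    return (N - n + 1) / (N + 2)

def prune(node):
    """node = (yes, no, children); children == [] for a leaf.
    Returns (Error(node), possibly pruned node)."""
    yes, no, children = node
    static = E(yes, no)
    if not children:
        return static, node
    results = [prune(child) for child in children]
    total = yes + no
    backup = sum(((cy + cn) / total) * err for err, (cy, cn, _) in results)
    if backup >= static:                       # prune: make this node a leaf
        return static, (yes, no, [])
    return backup, (yes, no, [sub for _, sub in results])

# The example tree: a has children b and c; c has children d and a leaf.
d = (1, 2, [(1, 1, []), (0, 1, [])])
c = (2, 2, [d, (1, 0, [])])
b = (4, 2, [(3, 2, []), (1, 0, [])])
a = (6, 4, [b, c])
error, pruned = prune(a)
print(round(error, 3), pruned)   # 0.378, with b and d pruned back to leaves
```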

Result of Pruning

After Pruning:

  a [6, 4]
  +-- [4, 2]          (b pruned to a leaf)
  +-- c [2, 2]
      +-- [1, 2]      (d pruned to a leaf)
      +-- [1, 0]

Generalisation

For the case in which we have k classes the generalisation for E(S) is

  E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.

Testing

Split the DataBase into a Training Set and a Test Set.

Learn the rules using the Training Set and prune. Test the rules on this set and record the % correct. Then test the rules on the Test Set and record the % correct.

The % accuracy on the test set should be close to that of the training set; this indicates good generalisation.

Over-fitting can occur if noisy data is used or too specific attributes are used. Pruning will overcome noise to some extent but not completely. Too specific attributes must be dropped.
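A minimal sketch of this train/test protocol, assuming a learn function (e.g. ID3 plus pruning) and a classify function are available and that the target key is "Satisfied"; all names here are placeholders, not part of the original notes.

```python
import random

def evaluate(examples, learn, classify, test_fraction=0.3, seed=0):
    """Split the data, learn (and prune) on the training part,
    and report the accuracy on both parts; the two should be close."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    model = learn(train)

    def accuracy(rows):
        return sum(classify(model, row) == row["Satisfied"] for row in rows) / len(rows)

    return accuracy(train), accuracy(test)
```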