Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains


Copyright © 2004 by Jinyan Li and Limsoon Wong


Jinyan Li

Limsoon Wong


Part 2: Rule-Based Approaches


Outline

- Overview of Supervised Learning
- Decision Tree Ensembles
  - Bagging
  - Boosting
  - Random forest
  - Randomization trees
  - CS4

Overview of Supervised Learning

Computational Supervised Learning

- Also called classification
- Learn from past experience, and use the learned knowledge to classify new data
- The knowledge is learned by intelligent algorithms
- Examples:
  - Clinical diagnosis for patients
  - Cell type classification

Data

- A classification application involves more than one class of data, e.g., normal vs disease cells for a diagnosis problem
- Training data is a set of instances (samples, points) with known class labels
- Test data is a set of instances whose class labels are to be predicted

Notation

- Training data: $\{\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_m, y_m\rangle\}$, where the $x_j$ are $n$-dimensional vectors and the $y_j$ are drawn from a discrete space $Y$, e.g., $Y = \{\text{normal}, \text{disease}\}$
- Test data: $\{\langle u_1, ?\rangle, \langle u_2, ?\rangle, \ldots, \langle u_k, ?\rangle\}$
- A classifier is a mapping (a hypothesis) $f$ learned from the training data $X$ and its class labels $Y$; applied to the test data $U$, it yields the predicted class labels $f(U)$

Process

Relational Representation of Gene Expression Data

The training data form an $m \times n$ table: $m$ samples (rows), each described by $n$ features (gene$_1$, ..., gene$_n$; $n$ on the order of 1000) together with a class label (P or N):

$$
\begin{array}{c|cccccc|c}
 & \text{gene}_1 & \text{gene}_2 & \text{gene}_3 & \text{gene}_4 & \cdots & \text{gene}_n & \text{class} \\
\hline
1 & x_{11} & x_{12} & x_{13} & x_{14} & \cdots & x_{1n} & P \\
2 & x_{21} & x_{22} & x_{23} & x_{24} & \cdots & x_{2n} & N \\
3 & x_{31} & x_{32} & x_{33} & x_{34} & \cdots & x_{3n} & P \\
\vdots & & & & & & & \vdots \\
m & x_{m1} & x_{m2} & x_{m3} & x_{m4} & \cdots & x_{mn} & N \\
\end{array}
$$

Features

- Also called attributes
- Categorical features, e.g., feature color = {red, blue, green}
- Continuous or numerical features, e.g., gene expression, age, blood pressure
- Discretization

[Figure: an example of discretization.]

Overall Picture of Supervised Learning

[Figure: data from domains such as Biomedical, Financial, Government, and Scientific are fed to classifiers ("M-Doctors") such as decision trees, emerging patterns, SVM, and neural networks.]

Evaluation of a Classifier

- Performance on independent blind test data
- K-fold cross validation: given a dataset, divide it into k equal parts; k-1 of them are used for training, and the remaining part is treated as test data
- LOOCV (leave-one-out cross validation), a special case of k-fold CV in which k equals the number of instances
- Accuracy, error rate
- False positive rate, false negative rate, sensitivity, specificity, precision
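As a concrete illustration, here is a minimal sketch of 10-fold CV and LOOCV using scikit-learn; the dataset and the CART-style tree classifier are our stand-ins, not the authors' original setup.

```python
# Sketch: k-fold cross validation and LOOCV for a decision tree.
# scikit-learn trees are CART-style; the dataset is a stand-in biomedical set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold CV: split into 10 equal parts; train on 9, test on the remaining one
print("10-fold CV accuracy: %.3f" % cross_val_score(clf, X, y, cv=10).mean())

# LOOCV: the special case where k equals the number of instances
print("LOOCV accuracy: %.3f" % cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```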

Requirements of Biomedical Classification

- High accuracy
- High comprehensibility

Importance of Rule-Based Methods

- Systematic selection of a small number of features for decision making increases the comprehensibility of the knowledge patterns
- C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms

[Figure: a decision tree with a root node, internal nodes, and leaf nodes; internal nodes test features (e.g., x1 > a1, x2 > a2, and tests on x3, x4), and leaf nodes carry the class labels A or B.]

Structure of Decision Trees

- If x1 > a1 and x2 > a2, then it is class A
- C4.5 and CART are two of the most widely used
- Easy interpretation, but accuracy is generally unattractive

Elegance of Decision Trees

[Figure: the same tree; each leaf (A or B) corresponds to exactly one rule.]
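To see this rule structure concretely, a small sketch with scikit-learn (whose trees are CART-style rather than C4.5; the iris data is our example choice):

```python
# Sketch: fit a shallow tree and print its if-then rule structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Each root-to-leaf path prints as one nested if-then rule
print(export_text(tree, feature_names=data.feature_names))
```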

Brief History of Decision Trees

- CLS (Hunt et al., 1966): cost driven
- ID3 (Quinlan, 1986, MLJ): information driven
- C4.5 (Quinlan, 1993): gain ratio + pruning ideas
- CART (Breiman et al., 1984): Gini index

A Simple Dataset

- 9 Play samples
- 5 Don't
- A total of 14

A Decision Tree

[Figure: a decision tree for this dataset. Root: outlook. If outlook = sunny, test humidity: <= 75 gives Play (2 instances), > 75 gives Don't (3). If outlook = overcast, predict Play (4). If outlook = rain, test windy: false gives Play (3), true gives Don't (2).]

- Finding the smallest such tree is an NP-complete problem

Construction of a Decision Tree

- Determination of the root node of the tree and the root nodes of its sub-trees

Most Discriminatory Feature

- Every feature can be used to partition the training data
- If the partitions contain a pure class of training instances, then this feature is most discriminatory

Example of Partitions

- Categorical feature: the number of partitions of the training data equals the number of values of this feature
- Numerical feature: two partitions

Instance #   Outlook    Temp   Humidity   Windy   Class
 1           Sunny       75       70      true    Play
 2           Sunny       80       90      true    Don't
 3           Sunny       85       85      false   Don't
 4           Sunny       72       95      true    Don't
 5           Sunny       69       70      false   Play
 6           Overcast    72       90      true    Play
 7           Overcast    83       78      false   Play
 8           Overcast    64       65      true    Play
 9           Overcast    81       75      false   Play
10           Rain        71       80      true    Don't
11           Rain        65       70      true    Don't
12           Rain        75       80      false   Play
13           Rain        68       80      false   Play
14           Rain        70       96      false   Play

Partitioning the 14 training instances by Outlook:

- Outlook = sunny: instances 1,2,3,4,5 with classes P,D,D,D,P
- Outlook = overcast: instances 6,7,8,9 with classes P,P,P,P
- Outlook = rain: instances 10,11,12,13,14 with classes D,D,P,P,P

Partitioning the 14 training instances by Temperature:

- Temperature <= 70: instances 5,8,11,13,14 with classes P,P,D,P,P
- Temperature > 70: instances 1,2,3,4,6,7,9,10,12 with classes P,D,D,D,P,P,P,D,P
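The following sketch (our code, using the class counts above) computes the information gain of the Outlook partition:

```python
# Sketch: entropy and information gain for the Outlook partition above.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

base = entropy([9, 5])  # whole dataset: 9 Play, 5 Don't (~0.940 bits)

# Outlook partitions: sunny (2P, 3D), overcast (4P, 0D), rain (3P, 2D)
parts = [(2, 3), (4, 0), (3, 2)]
remainder = sum((p + d) / 14 * entropy([p, d]) for p, d in parts)
print("info gain of Outlook: %.3f" % (base - remainder))  # ~0.247
```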

Three Measures

- Gini index
- Information gain
- Gain ratio

(The standard definitions are sketched below.)
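For reference, the standard definitions (not spelled out on the slides), for a set $S$ with class proportions $p_1, \ldots, p_c$ that a feature $A$ splits into subsets $S_1, \ldots, S_v$:

$$
\begin{aligned}
\text{Gini}(S) &= 1 - \sum_{i=1}^{c} p_i^2 \\
\text{Entropy}(S) &= -\sum_{i=1}^{c} p_i \log_2 p_i \\
\text{Gain}(S, A) &= \text{Entropy}(S) - \sum_{j=1}^{v} \frac{|S_j|}{|S|}\, \text{Entropy}(S_j) \\
\text{GainRatio}(S, A) &= \text{Gain}(S, A) \Big/ \left( -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|} \right)
\end{aligned}
$$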

Steps of Decision Tree Construction

- Select the best feature as the root node of the whole tree
- After partitioning by this feature, select the best feature (with respect to the subset of training data) as the root node of each sub-tree
- Recurse until the partitions become pure or almost pure
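A minimal sketch of this recursive procedure, in our own notation (categorical features stored as dicts, information gain as the selection measure):

```python
# Sketch: recursive decision tree construction. A node is either a class
# label (leaf) or {"feature": f, "branches": {value: subtree}}.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, f):
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[f], []).append(y)
    rem = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - rem

def build(rows, labels, features):
    # stop when the partition is pure or no feature is left: majority leaf
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, labels, f))  # best root
    rest = [f for f in features if f != best]
    node = {"feature": best, "branches": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build([rows[i] for i in idx],
                                        [labels[i] for i in idx], rest)
    return node
```

Here rows would be dicts such as {"outlook": "sunny", "windy": "true"}, and the recursion mirrors the three steps above.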

Characteristics of C4.5 Trees

- Single coverage of training data (elegance)
- Divide-and-conquer splitting strategy
- Fragmentation problem
- Locally reliable but globally insignificant rules
- Missing many globally significant rules can mislead the system

Decision Tree Ensembles

- Bagging
- Boosting
- Random forest
- Randomization trees
- CS4


Motivating Example

- Suppose $h_1, h_2, h_3$ are independent classifiers, each with accuracy 60%
- $C_1, C_2$ are the only classes
- $t$ is a test instance in $C_1$
- Majority vote: $h(t) = \arg\max_{C \in \{C_1, C_2\}} |\{h_j \in \{h_1, h_2, h_3\} \mid h_j(t) = C\}|$
- Then

$$
\begin{aligned}
\text{prob}(h(t) = C_1)
&= \text{prob}(h_1(t){=}C_1 \wedge h_2(t){=}C_1 \wedge h_3(t){=}C_1) \\
&\;+\; \text{prob}(h_1(t){=}C_1 \wedge h_2(t){=}C_1 \wedge h_3(t){=}C_2) \\
&\;+\; \text{prob}(h_1(t){=}C_1 \wedge h_2(t){=}C_2 \wedge h_3(t){=}C_1) \\
&\;+\; \text{prob}(h_1(t){=}C_2 \wedge h_2(t){=}C_1 \wedge h_3(t){=}C_1) \\
&= 60\% \times 60\% \times 60\% \;+\; 3 \times (60\% \times 60\% \times 40\%) \\
&= 64.8\%
\end{aligned}
$$

- So the majority vote of three independent 60%-accurate classifiers is correct 64.8% of the time
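The same figure via the binomial distribution (a quick check in our code):

```python
# Check: probability that at least 2 of 3 independent 60%-accurate
# classifiers agree on the correct class.
from math import comb

p = 0.6
print(sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3)))  # 0.648
```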

Bagging

- Proposed by Breiman (1996)
- Also called Bootstrap aggregating
- Makes use of randomness injected into the training data

Main Ideas

[Figure: from the original training set (50 p + 50 n), bootstrap samples such as 48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n are drawn; a base inducer such as C4.5 is run on each sample, producing a committee H of classifiers h1, h2, ..., hk.]

Decision Making by Bagging

- Given a new test sample T, every classifier in the committee votes, and the class with the most votes is predicted (equal voting)
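A minimal bagging sketch with scikit-learn (our choice of data; the CART-style base tree stands in for C4.5):

```python
# Sketch: a bagging committee of 50 trees with majority (equal) voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50,  # the committee H: h1 ... h50
                        random_state=0).fit(X, y)
print(bag.predict(X[:1]))  # equal vote of the committee on a test sample
```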

Boosting

- AdaBoost by Freund & Schapire (1995)
- Also called Adaptive Boosting
- Makes use of weighted instances and weighted voting

Main Ideas

[Figure: start with 100 instances of equal weight and train a classifier h1, measuring its weighted error e1. If the error is 0 or > 0.5, stop. Otherwise re-weight by the factor e1/(1-e1) (applied, in AdaBoost.M1, to the correctly classified instances), renormalize the weights to sum to 1, and train the next classifier h2 on the 100 re-weighted instances; repeat.]

Decision Making by AdaBoost.M1

- Given a new test sample T, each classifier h_j votes with weight log((1 - e_j)/e_j), and the class receiving the largest total weight is predicted (weighted voting)
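One AdaBoost.M1 re-weighting round in miniature (toy numbers of our own, six instances instead of the slides' 100):

```python
# Sketch: AdaBoost.M1 re-weighting. Correctly classified instances are
# scaled by e1/(1 - e1); after renormalization, misclassified ones weigh more.
weights = [1 / 6] * 6
correct = [True, True, False, True, False, True]  # h1's results

e1 = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error: 1/3
beta = e1 / (1 - e1)                                      # 0.5, since 0 < e1 < 0.5
weights = [w * beta if ok else w for w, ok in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]
print(weights)  # correct instances: 0.125 each; misclassified: 0.25 each
```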

Bagging vs Boosting

- Bagging
  - Bagging classifiers are constructed independently of one another
  - Equal voting
- Boosting
  - Construction of a new boosting classifier depends on the performance of its previous classifier, i.e., sequential construction (a series of classifiers)
  - Weighted voting

Random Forest

- Proposed by Breiman (2001)
- Similar to Bagging, but the base inducer is not the standard C4.5
- Makes use of randomness twice

Main Ideas

[Figure: as in Bagging, bootstrap samples (48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n) are drawn from the original training set (50 p + 50 n); a revised base inducer (not standard C4.5) is run on each, producing a committee H of classifiers h1, h2, ..., hk.]

A Revised C4.5 as Base Classifier

- The root node is selected not from the original n features but from mtry randomly chosen features (and the same restriction applies at the nodes below)

Decision Making by Random Forest

- Given a new test sample T, every tree in the forest votes, and the class with the most votes is predicted (equal voting)
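A random forest sketch with an explicit mtry (our data and parameter choices; scikit-learn's max_features plays the role of mtry):

```python
# Sketch: 100 randomized trees; each split considers only mtry features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,
                            max_features="sqrt",  # mtry = sqrt(n) per split
                            random_state=0).fit(X, y)
print(rf.predict(X[:1]))  # equal vote over the 100 trees
```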

Randomization Trees

- Proposed by Dietterich (2000)
- Makes use of randomness in the selection of the best split point

Main Ideas

- At a node, with all n original features available, one split is selected at random from among the best candidate splits, e.g., from {feature 1: choices 1, 2, 3; feature 2: choices 1, 2; ...; feature 8: choices 1, 2, 3}, a total of 20 candidates
- Equal voting on the committee of such decision trees
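scikit-learn has no direct implementation of Dietterich's scheme; its Extra-Trees ensemble, which randomizes split thresholds rather than picking among the 20 best splits, is the closest stand-in for a sketch:

```python
# Sketch: Extra-Trees, a related randomized-split ensemble (not exactly
# Dietterich's pick-among-the-best-20 scheme).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(et.predict(X[:1]))  # equal vote over the randomized trees
```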

CS4

- Proposed by Li et al. (2003)
- CS4: Cascading-and-Sharing for decision trees
- Does not make use of randomness

Main Ideas

- Selection of root nodes is done in a cascading manner: the top-ranked features 1, 2, ..., k each serve in turn as the root node, giving tree-1, tree-2, ..., tree-k, a total of k trees (see the sketch after this slide)

Decision Making by CS4

- Given a new test sample T, the committee's votes are combined by weighted, not equal, voting
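A minimal sketch of the cascading-root idea, under assumptions of our own: features ranked by mutual information, a median threshold at each forced root, CART subtrees below it, and plain majority voting in place of CS4's weighted voting:

```python
# Sketch: tree i is forced to use the i-th ranked feature at its root;
# the subtrees below are grown greedily. Assumes both sides of each
# median split are non-empty.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 3
roots = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:k]

trees = []
for f in roots:
    thr = np.median(X[:, f])
    left = X[:, f] <= thr
    trees.append((f, thr,
                  DecisionTreeClassifier(random_state=0).fit(X[left], y[left]),
                  DecisionTreeClassifier(random_state=0).fit(X[~left], y[~left])))

def predict(x):
    votes = [(tl if x[f] <= thr else tr).predict(x.reshape(1, -1))[0]
             for f, thr, tl, tr in trees]
    return np.bincount(votes).argmax()  # majority vote; CS4 itself weights votes

print(predict(X[0]))
```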

Summary of Ensemble Classifiers

[Figure: a spectrum of the five methods. Bagging, Random Forest, AdaBoost.M1, and Randomization Trees produce rules that may not be correct when applied to the training data; CS4's rules are correct on the training data.]

Any Questions?