Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains



Jinyan Li

Limsoon Wong

Part 2: Rule-Based Approaches

Outline

Overview of Supervised Learning

Decision Tree Ensembles:

Bagging

Boosting

Random Forest

Randomization Trees

CS4

Overview of Supervised Learning

Computational Supervised Learning

Also called classification.

Learn from past experience, and use the learned knowledge to classify new data.

The knowledge is learned by intelligent algorithms.

Examples:

Clinical diagnosis for patients

Cell type classification

Data

A classification application involves more than one class of data, e.g., normal vs disease cells for a diagnosis problem.

Training data is a set of instances (samples, points) with known class labels.

Test data is a set of instances whose class labels are to be predicted.

Notation

Training data: { (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) }, where the x_j are n-dimensional vectors and the y_j are from a discrete space Y. E.g., Y = {normal, disease}.

Test data: { (u_1, ?), (u_2, ?), ..., (u_k, ?) }.

Training data: X, with class labels Y.

f(X): a classifier, a mapping, a hypothesis, learned from the training data.

Test data: U; predicted class labels: f(U).

Process

Relational Representation of Gene Expression Data

           gene_1  gene_2  gene_3  gene_4  ...  gene_n  | class
sample 1    x_11    x_12    x_13    x_14   ...   x_1n   |  P
sample 2    x_21    x_22    x_23    x_24   ...   x_2n   |  N
sample 3    x_31    x_32    x_33    x_34   ...   x_3n   |  P
...
sample m    x_m1    x_m2    x_m3    x_m4   ...   x_mn   |  N

n features (on the order of 1000), m samples.

Features

Also called attributes.

Categorical features, e.g., feature color = {red, blue, green}.

Continuous or numerical features, e.g., gene expression, age, blood pressure.

Discretization: converting continuous features into categorical ones.

An Example: Overall Picture of Supervised Learning

[Diagram: data from biomedical, financial, government, and scientific domains are fed to classifiers ("M-Doctors") such as decision trees, emerging patterns, SVMs, and neural networks.]

Evaluation of a Classifier

Performance on independent blind test data.

K-fold cross validation: given a dataset, divide it into k even parts; k-1 of them are used for training, and the remaining part is treated as test data.

LOOCV (leave-one-out cross validation) is a special case of k-fold CV, with k equal to the number of samples.

Metrics: accuracy, error rate; false positive rate, false negative rate, sensitivity, specificity, precision.
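A minimal k-fold cross-validation sketch in Python; the library (scikit-learn) and the base learner (DecisionTreeClassifier, standing in for C4.5/CART) are assumptions, as the slides do not prescribe an implementation:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def kfold_accuracy(X, y, k=10):
    # Divide the dataset into k even parts; each part serves once as test data.
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        clf = DecisionTreeClassifier()         # stand-in for C4.5/CART
        clf.fit(X[train_idx], y[train_idx])    # train on k-1 parts
        accuracies.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out part
    return float(np.mean(accuracies))

# LOOCV is the special case where k equals the number of samples.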

Requirements of Biomedical Classification

High accuracy

High comprehensibility

Importance of Rule-Based Methods

Systematic selection of a small number of features used for decision making increases the comprehensibility of the knowledge patterns.

C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms.

Structure of Decision Trees

[Diagram: a decision tree with a root node testing feature x1, internal nodes testing x2, x3, x4, and leaf nodes labeled with the classes A and B; branches carry threshold tests such as x1 > a1 and x2 > a2.]

Elegance of Decision Trees

Each root-to-leaf path reads as a rule, e.g.: if x1 > a1 and x2 > a2, then it's the A class.

C4.5 and CART are two of the most widely used.

Easy interpretation, but accuracy is generally unattractive.


Brief History of Decision Trees

CLS (Hunt et al., 1966): cost driven

ID3 (Quinlan, 1986, MLJ): information driven

C4.5 (Quinlan, 1993): gain ratio + pruning ideas

CART (Breiman et al., 1984): Gini index

A Simple Dataset

9 Play samples and 5 Don't, a total of 14.

A Decision Tree

[Diagram: the root node tests outlook. outlook = sunny: test humidity, <= 75 → Play (2), > 75 → Don't (3); outlook = overcast → Play (4); outlook = rain: test windy, false → Play (3), true → Don't (2).]

Construction of a Decision Tree

Finding an optimal decision tree is an NP-complete problem.

The key step is the determination of the root node of the tree and the root nodes of its sub-trees.

Most Discriminatory Feature

Every feature can be used to partition the training data.

If the partitions contain a pure class of training instances, then this feature is most discriminatory.

Example of Partitions

Categorical feature: the number of partitions of the training data equals the number of values of this feature.

Numerical feature: two partitions.

Instance #   Outlook    Temp   Humidity   Windy   Class
1            Sunny      75     70         true    Play
2            Sunny      80     90         true    Don't
3            Sunny      85     85         false   Don't
4            Sunny      72     95         true    Don't
5            Sunny      69     70         false   Play
6            Overcast   72     90         true    Play
7            Overcast   83     78         false   Play
8            Overcast   64     65         true    Play
9            Overcast   81     75         false   Play
10           Rain       71     80         true    Don't
11           Rain       65     70         true    Don't
12           Rain       75     80         false   Play
13           Rain       68     80         false   Play
14           Rain       70     96         false   Play

Total: 14 training instances.

Partition by Outlook (all 14 training instances):

Outlook = sunny: instances 1, 2, 3, 4, 5 with classes P, D, D, D, P
Outlook = overcast: instances 6, 7, 8, 9 with classes P, P, P, P
Outlook = rain: instances 10, 11, 12, 13, 14 with classes D, D, P, P, P

Partition by Temperature (all 14 training instances):

Temperature <= 70: instances 5, 8, 11, 13, 14 with classes P, P, D, P, P
Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12 with classes P, D, D, D, P, P, P, D, P

Three Measures

Gini index

Information gain

Gain ratio
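To make these measures concrete, here is a small, self-contained Python sketch (standard library only; an illustration, not the slides' code) that computes the entropy, the Gini index, and the information gain of the Outlook partition above:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini index of a list of class labels.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(values, labels):
    # Information gain of partitioning by a categorical feature,
    # given the feature value of each instance.
    # (Gain ratio divides this by the entropy of the partition sizes.)
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        part = [lab for val, lab in zip(values, labels) if val == v]
        gain -= len(part) / n * entropy(part)
    return gain

outlook = ['Sunny'] * 5 + ['Overcast'] * 4 + ['Rain'] * 5
labels  = ['P', 'D', 'D', 'D', 'P',  'P', 'P', 'P', 'P',  'D', 'D', 'P', 'P', 'P']
print(entropy(labels))             # ~0.940 bits for the 9 P / 5 D dataset
print(info_gain(outlook, labels))  # ~0.247 bits gained by splitting on Outlook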

Steps of Decision Tree Construction

Select the best feature as the root node of the whole tree.

After partitioning by this feature, select the best feature (w.r.t. the subset of training data) as the root node of each sub-tree.

Recurse until the partitions become pure or almost pure.
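A minimal sketch of this recursive loop in Python (an illustration under simplifying assumptions: categorical features only, no pruning; score can be the info_gain-style function sketched above):

from collections import Counter

def build_tree(rows, labels, features, score):
    # Stop when the partition is pure (or no features remain):
    # return a leaf labeled with the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Select the best feature as the root of this (sub-)tree.
    best = max(features, key=lambda f: score([r[f] for r in rows], labels))
    remaining = [f for f in features if f != best]
    tree = {}
    # Partition by the chosen feature and recurse on each block.
    for v in set(r[best] for r in rows):
        sub = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        srows = [r for r, _ in sub]
        slabels = [lab for _, lab in sub]
        tree[(best, v)] = build_tree(srows, slabels, remaining, score)
    return tree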

Characteristics of C4.5 Trees

Single coverage of training data (elegance).

Divide-and-conquer splitting strategy.

Fragmentation problem.

Rules are locally reliable but globally insignificant.

Many globally significant rules are missed, which can mislead the system.

Decision Tree Ensembles

Bagging

Boosting

Random Forest

Randomization Trees

CS4

Motivating Example

Suppose h1, h2, h3 are independent classifiers with accuracy = 60%, and C1, C2 are the only classes. Let t be a test instance in C1, and let the committee decide by majority vote:

h(t) = argmax_{C in {C1, C2}} |{ h_j in {h1, h2, h3} : h_j(t) = C }|

Then

prob(h(t) = C1)
  = prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1)
  + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2)
  + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1)
  + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
  = 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
  = 64.8%
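The same calculation extends to any odd committee size via the binomial tail; a quick self-contained Python check (illustrative only):

from math import comb

def majority_vote_accuracy(p, k):
    # Probability that a strict majority of k independent classifiers,
    # each correct with probability p, votes for the correct class.
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

print(majority_vote_accuracy(0.6, 3))    # 0.648, matching the example
print(majority_vote_accuracy(0.6, 101))  # ~0.98: larger committees help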

Bagging

Proposed by Breiman (1996).

Also called Bootstrap aggregating.

Makes use of randomness injected into the training data.

Main Ideas

[Diagram: from the original training set (50 p + 50 n), bootstrap samples are drawn, e.g., 48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n. A base inducer such as C4.5 is trained on each sample, yielding a committee H of classifiers h1, h2, ..., hk.]

Decision Making by Bagging

Given a new test sample T, every classifier in H predicts a class, and the class with the most votes wins (equal voting).
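A minimal bagging sketch in Python; scikit-learn's DecisionTreeClassifier stands in for C4.5 (an assumption; any base inducer works):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    # Train a committee H of k trees, each on a bootstrap sample
    # (drawn from the training set with replacement).
    rng = np.random.default_rng(seed)
    H = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))
        H.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return H

def bagging_predict(H, T):
    # Equal voting on a single test sample T (a 1-D numpy array):
    # the majority class among the committee wins.
    votes = [h.predict(T.reshape(1, -1))[0] for h in H]
    return Counter(votes).most_common(1)[0][0]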

Boosting

AdaBoost, by Freund & Schapire (1995).

Also called Adaptive Boosting.

Makes use of weighted instances and weighted voting.

Main Ideas

100 instances

with equal weight

A classifier h1

error

If error is 0

or >0.5 stop

Otherwise re
-
weight: e1/(1
-
e1)

Renormalize to 1

100 instances

with different weights

A classifier h2

error

Decision Making by Boosting

Given a new test sample T, the classifiers vote with weights determined by their training errors (weighted voting).
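A sketch of the loop above in Python (an AdaBoost.M1-style illustration; the stump depth, the use of scikit-learn, and the log(1/beta) vote weights are assumptions beyond the slide):

import numpy as np
from math import log
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=10):
    n = len(X)
    w = np.full(n, 1 / n)                  # start with equal weights
    H = []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = h.predict(X) != y
        e = w[wrong].sum()                 # weighted error of this round
        if e == 0 or e > 0.5:              # stop conditions from the slide
            break
        beta = e / (1 - e)
        w[~wrong] *= beta                  # shrink weights of correctly classified instances
        w /= w.sum()                       # renormalize to 1
        H.append((h, log(1 / beta)))       # this classifier's voting weight
    return H

def adaboost_predict(H, T):
    # Weighted voting over the committee for a single test sample T.
    scores = {}
    for h, alpha in H:
        c = h.predict(T.reshape(1, -1))[0]
        scores[c] = scores.get(c, 0.0) + alpha
    return max(scores, key=scores.get)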

Bagging vs Boosting

Bagging: construction of the classifiers is independent; equal voting.

Boosting: construction of a new classifier depends on the performance of its previous classifier, i.e. sequential construction (a series of classifiers); weighted voting.

Random Forest

Proposed by Breiman (2001).

Similar to Bagging, but the base inducer is not the standard C4.5.

Makes use of randomness twice: in the bootstrap samples and in the feature selection at each node.

Main Ideas

[Diagram: as in Bagging, bootstrap samples (e.g., 48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n) are drawn from the original training set (50 p + 50 n); a base inducer (not C4.5, but a revised version) is trained on each, yielding a committee H of classifiers h1, h2, ..., hk.]

A Revised C4.5 as Base Classifier

At each node (starting from the root), the split feature is selected not from all n original features but from mtry randomly chosen features.

Decision Making by Random Forest

Given a new test sample T, each tree in H votes, and the majority class wins (equal voting, as in Bagging).
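A sketch of the revised node-splitting step (illustrative; best_split_over is a hypothetical helper that scores candidate splits as C4.5 would):

import random

def random_forest_split(rows, labels, features, mtry, best_split_over):
    # At each node, only mtry randomly chosen features are considered
    # instead of all n features: the second use of randomness.
    candidates = random.sample(features, min(mtry, len(features)))
    return best_split_over(rows, labels, candidates)

In scikit-learn's RandomForestClassifier this parameter appears as max_features; a common choice is on the order of sqrt(n).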

Randomization Trees

Proposed by Dietterich (2000).

Makes use of randomness in the selection of the best split point.

Main Ideas

[Diagram: at each node (starting from the root, with all n original features available), the algorithm does not take the single best split; instead it selects one at random from a pool of candidate splits, e.g., feature 1: choices 1, 2, 3; feature 2: choices 1, 2; ...; feature 8: choices 1, 2, 3, a total of 20 candidates.]

Equal voting on the committee of such decision trees.

CS4

Proposed by Li et al. (2003).

CS4: Cascading-and-Sharing for decision trees.

Does not make use of randomness; the selection of root nodes is done in a cascading manner!

Main Ideas

[Diagram: the k top-ranked features serve as root nodes 1, 2, ..., k; tree-i is grown with the i-th ranked feature as its root, giving a total of k trees.]

Decision Making by CS4

Not equal voting.
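A sketch of the cascading root selection (illustrative; rank_features and grow_tree are hypothetical helpers, standing for a C4.5-style feature ranking and tree growing under a forced root):

def cs4_trees(rows, labels, features, k, rank_features, grow_tree):
    # Rank all features once (e.g., by gain ratio); no randomness involved.
    ranked = rank_features(rows, labels, features)
    # Tree i is forced to use the i-th best feature as its root;
    # below the root, each tree grows as usual.
    return [grow_tree(rows, labels, features, forced_root=ranked[i])
            for i in range(min(k, len(ranked)))]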

Comparison

Bagging, Random Forest, Randomization Trees: the rules may not be correct when applied to the training data.

CS4: the rules are correct.