Applications of Data Mining in
Microarray Data Analysis

Yen-Jen Oyang

Dept. of Computer Science and
Information Engineering

Observations and Challenges in
the Information Age



A huge volume of information has been
and is being digitized and stored in the
computer.


Due to the volume of digitized information,
effective exploitation of information is
beyond the capability of human beings
without the aid of intelligent computer
software.

An Example of Data Mining


Given the data set shown on the next slide, can
we figure out a set of rules that predict the
classes of objects?

Data Set

Data     Class    Data     Class    Data     Class
15,33    O        18,28    ×        16,31    O
9,23     ×        15,35    O        9,32     ×
8,15     ×        17,34    O        11,38    ×
11,31    O        18,39    ×        13,34    O
13,37    ×        14,32    O        19,36    ×
18,32    O        25,18    ×        10,34    ×
16,38    ×        23,33    ×        15,30    O
12,33    O        21,28    ×        13,22    ×

Distribution of the Data Set





[Figure: scatter plot of the data set; axis ticks at 10, 15, 20 and 30.]

Rule Based on Observation





If $(x - 15)^2 + (y - 30)^2 \le 25$ and $y \ge 30$, then class O; else class X.
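The observation-based rule can be checked against the 24 points of the data set. The sketch below assumes the comparison operators, which did not survive in the slide text, are "less than or equal" and "greater than or equal". The names `DATA` and `rule` are illustrative only.

```python
# Data set from the slides: ((x, y), class label).
DATA = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"), ((9, 23), "X"),
    ((15, 35), "O"), ((9, 32), "X"), ((8, 15), "X"), ((17, 34), "O"),
    ((11, 38), "X"), ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"), ((18, 32), "O"),
    ((25, 18), "X"), ((10, 34), "X"), ((16, 38), "X"), ((23, 33), "X"),
    ((15, 30), "O"), ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]

def rule(x, y):
    """Half-disc rule: inside the circle of radius 5 around (15, 30)
    and on or above the line y = 30 (operators are an assumption)."""
    inside_circle = (x - 15) ** 2 + (y - 30) ** 2 <= 25
    return "O" if inside_circle and y >= 30 else "X"

# Collect every point the rule gets wrong.
errors = [(p, label) for p, label in DATA if rule(*p) != label]
print(f"{len(DATA) - len(errors)} of {len(DATA)} points classified correctly")
```

Under these assumptions the rule separates the two classes on this data set. The condition on y is what excludes the point (18, 28), which lies inside the circle but belongs to class X.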









Rule Generated by an
RBF (Radial Basis Function)
Network-Based Learning
Algorithm





Let

$$f_o(v) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{oi}^{2}} \, e^{-\frac{\|v - c_{oi}\|^{2}}{2\sigma_{oi}^{2}}}$$

and

$$f_x(v) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{xj}^{2}} \, e^{-\frac{\|v - c_{xj}\|^{2}}{2\sigma_{xj}^{2}}}.$$

If $f_o(v) \ge f_x(v)$, then prediction = “O”.

Otherwise prediction = “X”.

$c_{oi}$     $\sigma_{oi}$
(15,33)    1.723
(11,31)    2.745
(18,32)    2.327
(12,33)    1.794
(15,35)    1.973
(17,34)    2.045
(14,32)    1.794
(16,31)    1.794
(13,34)    1.794
(15,30)    2.027

$c_{xj}$     $\sigma_{xj}$
(9,23)     6.458
(8,15)     10.08
(13,37)    2.939
(16,38)    2.745
(18,28)    5.451
(18,39)    3.287
(25,18)    10.86
(23,33)    5.322
(21,28)    5.070
(9,32)     4.562
(11,38)    3.463
(19,36)    3.587
(10,34)    3.232
(13,22)    6.260
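The RBF-based rule can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: each class score is taken to be a sum of two-dimensional Gaussian kernels with normalisation 1/(2πσ²), the centers and widths are paired in the order they are listed, and a tie goes to class O. The names `O_KERNELS`, `X_KERNELS`, `kernel_sum` and `predict` are hypothetical.

```python
import math

# (center, width) pairs as listed in the slides; pairing by order is assumed.
O_KERNELS = [
    ((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327), ((12, 33), 1.794),
    ((15, 35), 1.973), ((17, 34), 2.045), ((14, 32), 1.794), ((16, 31), 1.794),
    ((13, 34), 1.794), ((15, 30), 2.027),
]
X_KERNELS = [
    ((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939), ((16, 38), 2.745),
    ((18, 28), 5.451), ((18, 39), 3.287), ((25, 18), 10.86), ((23, 33), 5.322),
    ((21, 28), 5.070), ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
    ((10, 34), 3.232), ((13, 22), 6.260),
]

def kernel_sum(v, kernels):
    """Sum of normalised 2-D Gaussian kernels evaluated at point v."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        d2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total

def predict(v):
    """Predict "O" when the class-O score is at least the class-X score."""
    return "O" if kernel_sum(v, O_KERNELS) >= kernel_sum(v, X_KERNELS) else "X"

print(predict((15, 33)), predict((8, 15)))
```

Points deep inside the class-O region score much higher on the narrow class-O kernels, while outlying points are covered only by the wide class-X kernels.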

Identifying Boundary of Different
Classes of Objects

Boundary Identified

Data Mining / Knowledge Discovery



The main theme of data mining is to
discover unknown and implicit knowledge
in a large dataset.


There are three main categories of data
mining algorithms:


Classification;


Clustering;


Association rule mining / correlation analysis.

Data Classification


In a data classification problem, each object is
described by a set of attribute values and each
object belongs to one of the predefined classes.


The goal is to derive a set of rules that predicts
which class a new object should belong to, based
on a given set of training samples. Data
classification is also called supervised learning.


Instance-Based Learning


In instance-based learning, we take the k
nearest training samples of a new instance
$(v_1, v_2, \ldots, v_m)$ and assign the new instance
to the class that has the most instances among the
k nearest training samples.


Classifiers that adopt instance-based
learning are commonly called KNN
classifiers.
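A KNN classifier of this kind can be sketched as follows, using the 24-point data set from the earlier example as training samples. Euclidean distance and the names `TRAINING` and `knn_predict` are assumptions made for illustration; the slides do not specify a distance measure or a tie-breaking rule.

```python
import math
from collections import Counter

# Training samples from the example data set: ((x, y), class label).
TRAINING = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"), ((9, 23), "X"),
    ((15, 35), "O"), ((9, 32), "X"), ((8, 15), "X"), ((17, 34), "O"),
    ((11, 38), "X"), ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"), ((18, 32), "O"),
    ((25, 18), "X"), ((10, 34), "X"), ((16, 38), "X"), ((23, 33), "X"),
    ((15, 30), "O"), ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]

def knn_predict(query, training, k):
    """Return the majority class among the k training samples nearest to query."""
    nearest = sorted(training, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((15, 32), TRAINING, k=3))  # query surrounded by class-O samples
print(knn_predict((9, 22), TRAINING, k=3))   # query surrounded by class-X samples
```

The sketch sorts all training samples for every query, which is adequate for small data sets; larger ones usually call for a spatial index.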

Example of the KNN







If a 1NN classifier is employed, then the
prediction for the test instance is “X”.


If a 3NN classifier is employed, then the prediction
for the test instance is “O”.

Applications of Data
Classification in
Bioinformatics


In microarray data analysis, data
classification is employed to predict the
class of a new sample based on the existing
samples with known class.


For example, in the Leukemia data set, there
are 72 samples and 7129 genes.


25 Acute Myeloid Leukemia (AML) samples.


38 B-cell Acute Lymphoblastic Leukemia
samples.


9 T-cell Acute Lymphoblastic Leukemia
samples.



Model of Microarray Data Sets

            Gene 1    Gene 2    ...    Gene n
Sample 1
Sample 2
...
Sample m

Each entry of the matrix satisfies $M(i, j) \in \mathbb{R}$, where i indexes the samples and j indexes the genes.

Alternative Data Classification
Algorithms


Decision tree (C4.5 and C5.0);


Instance-based learning (KNN);


Naïve Bayesian classifier;


Support vector machine (SVM);


Novel approaches including the RBF
network based classifier that we have
recently proposed.

Accuracy of Different
Classification Algorithms


Classification accuracy (%)

Data set (training samples, test samples)    RBF      SVM      1NN      3NN
Satimage (4435, 2000)                        92.30    91.30    89.35    90.6
Letter (15000, 5000)                         97.12    97.98    95.26    95.46
Shuttle (43500, 14500)                       99.94    99.92    99.91    99.92
Average                                      96.45    96.40    94.84    95.33

Comparison of Execution
Time (in seconds)

                               RBF without       RBF with          SVM
                               data reduction    data reduction
Cross validation    Satimage   670               265               64622
                    Letter     2825              1724              386814
                    Shuttle    96795             59.9              467825
Make classifier     Satimage   5.91              0.85              21.66
                    Letter     17.05             6.48              282.05
                    Shuttle    1745              0.69              129.84
Test                Satimage   21.3              7.4               11.53
                    Letter     128.6             51.74             94.91
                    Shuttle    996.1             5.85              2.13

More Insights

                                                           Satimage    Letter    Shuttle
# of training samples in the original data set             4435        15000     43500
# of training samples after data reduction is applied      1815        7794      627
% of training samples remaining                            40.92%      51.96%    1.44%
Classification accuracy after data reduction is applied    92.15       96.18     99.32
# of support vectors identified by LIBSVM                  1689        8931      287

Data Clustering


Data clustering concerns how to group a set
of objects based on their similarity of
attributes and/or their proximity in the
vector space. Data clustering is also called
unsupervised learning.

The Agglomerative
Hierarchical Clustering
Algorithms


The agglomerative hierarchical clustering
algorithms operate by maintaining a sorted
list of inter-cluster distances.


Initially, each data instance forms a cluster.


The clustering algorithm repeatedly
merges the two clusters with the minimum
inter-cluster distance.


Upon merging two clusters, the clustering
algorithm computes the distances between
the newly formed cluster and the remaining
clusters and updates the sorted list of
inter-cluster distances accordingly.


There are a number of ways to define the
inter-cluster distance:


minimum distance (single-link);


maximum distance (complete-link);


average distance;


mean distance.
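The merging loop can be sketched as below. For brevity this sketch recomputes all inter-cluster distances in every round instead of maintaining a sorted list, so it illustrates the merge order rather than the efficient bookkeeping described above. The function names and the four one-dimensional points are hypothetical.

```python
import math
from itertools import combinations

def linkage_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # minimum pairwise distance
        return min(dists)
    if linkage == "complete":    # maximum pairwise distance
        return max(dists)
    if linkage == "average":     # average pairwise distance
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each instance starts alone
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]],
                                           points, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b is safe
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical 1-D points chosen so that the two linkages disagree.
pts = [(0,), (2,), (5,), (9,)]
print(agglomerative(pts, 2, "single"))    # [[0, 1, 2], [3]]
print(agglomerative(pts, 2, "complete"))  # [[0, 1], [2, 3]]
```

After the first merge of the points at 0 and 2, single-link sees the point at 5 as only 3 away from the merged cluster, whereas complete-link measures 5 and therefore prefers to merge the points at 5 and 9.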

An Example of the
Agglomerative Hierarchical
Clustering Algorithm


For the following data set, we will get
different clustering results with the single-link
and complete-link algorithms.

[Figure: six data points labelled 1 to 6.]

Result of the Single-Link
Algorithm

[Figure: clustering of points 1 to 6; labels in the second panel appear in the order 1, 3, 4, 5, 2, 6.]

Result of the Complete-Link
Algorithm

[Figure: clustering of points 1 to 6; labels in the second panel appear in the order 1, 3, 2, 4, 5, 6.]

Remarks


Single-link and complete-link are the
two most commonly used alternatives.


Single-link suffers from the so-called
chaining effect.


On the other hand, complete-link also
fails in some cases.

Example of the Chaining
Effect

Single-link (10 clusters)

Complete-link (2 clusters)

Effect of Bias towards
Spherical Clusters

Single-link (2 clusters)

Complete-link (2 clusters)

K-Means: A Partitional Data
Clustering Algorithm


The k-means algorithm is probably the most
commonly used partitional clustering
algorithm.


The k-means algorithm begins by
selecting k data instances as the means or
centers of k clusters.


The k-means algorithm then executes
the following loop iteratively until the
convergence criterion is met.


repeat {
    assign every data instance to the closest cluster,
    based on the distance between the data instance and
    the center of the cluster;

    compute the new centers of the k clusters;
} until (the convergence criterion is met);






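A commonly used convergence criterion is based on the squared error

$$E = \sum_{C_i} \sum_{p \in C_i} \|p - m_i\|^{2},$$

where $m_i$ is the center of cluster $C_i$. Typically, the loop stops when E no longer decreases.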
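The loop can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and caller-supplied initial centers. It stops when the cluster assignments no longer change, which implies that the squared error E has stopped changing as well. The names `kmeans` and `squared_error` and the sample points are hypothetical.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and center update until assignments stabilise."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to the closest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:   # nothing moved: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:            # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def squared_error(points, labels, centers):
    """E: sum of squared distances from each instance to its cluster center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, initial_centers=[(0, 0), (10, 10)])
print(labels, squared_error(pts, labels, centers))
```

Because the result depends on the initial centers, running the sketch from several different starting selections and keeping the run with the smallest E is a common safeguard against poor local optima.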





Illustration of the K-Means
Algorithm (I)

[Figure: data set with three initial centers.]

Illustration of the K-Means
Algorithm (II)

[Figure: the three new centers, marked “x”, after the 1st iteration.]

Illustration of the K-Means
Algorithm (III)

[Figure: the three new centers after the 2nd iteration.]

A Case in which the K
-
Means
Algorithm Fails


The K-means algorithm may converge to a
locally optimal state, as the following example
demonstrates:

[Figure: initial selection of centers.]

Remarks


As the examples demonstrate, no clustering
algorithm is definitely superior to other
clustering algorithms with respect to
clustering quality.


Applications of Data
Clustering in Microarray Data
Analysis



Data clustering has been employed in
microarray data analysis for


identifying the genes with similar expressions;


identifying the subtypes of samples.

Feature Selection in
Microarray Data Analysis


In microarray data analysis, it is highly
desirable to identify those genes that are
correlated to the classes of samples.


For example, in the Leukemia data set, there
are 7129 genes. We want to identify those
genes that lead to different disease types.




Furthermore, inclusion of features that are
not correlated to the classification decision
may result in lower classification accuracy
or poor clustering quality.


For example, in the data set shown on the
following page, inclusion of the feature
corresponding to the y-axis causes incorrect
prediction of the marked test instance
if a 3NN classifier is employed.


It is apparent that the “o”s and “x”s are separated by
x = 10. If only the attribute corresponding to the
x-axis were selected, then the 3NN classifier would
predict the class of the test instance correctly.

[Figure: scatter plot with axes x and y; the line x = 10 separates the two classes.]
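The effect of an irrelevant feature on a 3NN classifier can be reproduced with a small made-up data set. All coordinates below are hypothetical and chosen only to mimic the situation described: the classes are separated by x = 10, while the y values are unrelated to the class boundary. The `features` parameter and the function name are likewise assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical samples: class "o" has x < 10, class "x" has x > 10.
SAMPLES = [
    ((8.0, 10.0), "o"), ((9.0, 12.0), "o"), ((8.5, 11.0), "o"),
    ((11.0, 48.0), "x"), ((12.0, 52.0), "x"), ((11.0, 50.0), "x"),
]

def knn_predict(query, samples, k, features):
    """k-nearest-neighbour vote using only the listed feature indices."""
    def project(v):
        return tuple(v[i] for i in features)
    nearest = sorted(
        samples, key=lambda s: math.dist(project(query), project(s[0]))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

test_instance = (9.0, 50.0)   # x < 10, so the true class is "o"
print(knn_predict(test_instance, SAMPLES, k=3, features=[0, 1]))  # uses x and y
print(knn_predict(test_instance, SAMPLES, k=3, features=[0]))     # uses x only
```

With both features, the large differences in y dominate the distance and the three nearest neighbours are all from class "x". With only the x feature, the three nearest neighbours are all from class "o".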

Univariate Analysis in Feature
Selection


In the univariate analysis, the importance of each
feature is determined by how objects of different
classes are distributed in this particular axis.


Let $v_1, v_2, \ldots, v_m$ and $v'_1, v'_2, \ldots, v'_n$ denote the feature
values of class-1 and class-2 objects, respectively.


Assume that the feature values of both classes of
objects follow the normal distribution.


Then,

$$T = \frac{\bar{v} - \bar{v}'}{\sqrt{\dfrac{(m-1)s^{2} + (n-1)s'^{2}}{m+n-2}} \, \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

follows a t-distribution with $(m + n - 2)$ degrees of freedom, where

$$\bar{v} = \frac{1}{m}\sum_{i=1}^{m} v_i \quad \text{and} \quad \bar{v}' = \frac{1}{n}\sum_{i=1}^{n} v'_i;$$

$$s^{2} = \frac{1}{m-1}\sum_{i=1}^{m} (v_i - \bar{v})^{2} \quad \text{and} \quad s'^{2} = \frac{1}{n-1}\sum_{i=1}^{n} (v'_i - \bar{v}')^{2}.$$

If the t statistic of a feature is lower than a
threshold, then the feature is deleted.
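A pooled-variance t statistic and a threshold filter can be sketched as follows. The comparison on the absolute value of the statistic is an assumption (the slides only say "lower than a threshold"), and the function names and sample values are hypothetical.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, m + n - 2 degrees of freedom."""
    m, n = len(a), len(b)
    pooled = ((m - 1) * variance(a) + (n - 1) * variance(b)) / (m + n - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / m + 1 / n))

def select_features(class1, class2, threshold):
    """Keep the indices of features whose |t| reaches the threshold.

    class1 and class2 are lists of samples; each sample is a list of
    feature values (for microarray data, one value per gene).
    """
    kept = []
    for j in range(len(class1[0])):
        t = pooled_t_statistic([s[j] for s in class1], [s[j] for s in class2])
        if abs(t) >= threshold:   # assumption: compare the absolute value
            kept.append(j)
    return kept

# Hypothetical data: feature 0 differs between classes, feature 1 does not.
class1 = [[1, 10.0], [2, 11.0], [3, 12.0]]
class2 = [[4, 10.5], [5, 11.5], [6, 11.0]]
print(select_features(class1, class2, threshold=2.0))
```

The sketch does not guard against a pooled variance of zero, which would occur if a feature were constant within both classes.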















Multivariate Analysis


The univariate analysis is not able to
identify crucial features in the following
example.


Therefore, multivariate analysis has been
developed. However, most multivariate
analysis algorithms that have been proposed
suffer from high time complexity and may not be
applicable to real-world problems.

Summary


Data clustering and data classification have
been widely used in microarray data
analysis.


Feature selection is the most challenging
issue as of today.