Applications of Data Mining in Microarray Data Analysis
Yen-Jen Oyang
Dept. of Computer Science and Information Engineering
Observations and Challenges in the Information Age
•
A huge volume of information has been, and continues to be, digitized and stored in computers.
•
Because of this volume, effective exploitation of the information is beyond the capability of human beings without the aid of intelligent computer software.
An Example of Data Mining
•
Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of the objects?
Data Set

Data      Class     Data      Class     Data      Class
(15,33)   O         (18,28)   ×         (16,31)   O
(9,23)    ×         (15,35)   O         (9,32)    ×
(8,15)    ×         (17,34)   O         (11,38)   ×
(11,31)   O         (18,39)   ×         (13,34)   O
(13,37)   ×         (14,32)   O         (19,36)   ×
(18,32)   O         (25,18)   ×         (10,34)   ×
(16,38)   ×         (23,33)   ×         (15,30)   O
(12,33)   O         (21,28)   ×         (13,22)   ×
Distribution of the Data Set

[Scatter plot of the data set: the "O" instances cluster in a compact region, surrounded by the "×" instances.]
Rule Based on Observation

If 11 ≤ x ≤ 18 and 30 ≤ y ≤ 35, then class = "O"; else class = "×".
Rule Generated by an RBF (Radial Basis Function) Network Based Learning Algorithm

Let

  f_o(v) = Σ_{i=1}^{10} [1 / (2π(σ_i^o)²)] exp(−‖v − c_i^o‖² / (2(σ_i^o)²))  and
  f_x(v) = Σ_{j=1}^{14} [1 / (2π(σ_j^x)²)] exp(−‖v − c_j^x‖² / (2(σ_j^x)²)).

If f_o(v) > f_x(v), then prediction = "O". Otherwise prediction = "×".

  c_i^o     σ_i^o         c_j^x     σ_j^x
  (15,33)   1.723         (9,23)    6.458
  (11,31)   2.745         (8,15)    10.08
  (18,32)   2.327         (13,37)   2.939
  (12,33)   1.794         (16,38)   2.745
  (15,35)   1.973         (18,28)   5.451
  (17,34)   2.045         (18,39)   3.287
  (14,32)   1.794         (25,18)   10.86
  (16,31)   1.794         (23,33)   5.322
  (13,34)   1.794         (21,28)   5.070
  (15,30)   2.027         (9,32)    4.562
                          (11,38)   3.463
                          (19,36)   3.587
                          (10,34)   3.232
                          (13,22)   6.260
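The rule above can be evaluated mechanically. The sketch below is a minimal reconstruction: it assumes each class density is a sum of 2-D Gaussian kernels at the listed centers with the listed widths (the exact normalization constant was not stated on the slide; the standard 2-D Gaussian factor 1/(2πσ²) is assumed here):

```python
import math

# Centers and widths for the "O" class (10 kernels), as listed on the slide.
O_KERNELS = [((15,33),1.723), ((11,31),2.745), ((18,32),2.327), ((12,33),1.794),
             ((15,35),1.973), ((17,34),2.045), ((14,32),1.794), ((16,31),1.794),
             ((13,34),1.794), ((15,30),2.027)]
# Centers and widths for the "X" class (14 kernels).
X_KERNELS = [((9,23),6.458), ((8,15),10.08), ((13,37),2.939), ((16,38),2.745),
             ((18,28),5.451), ((18,39),3.287), ((25,18),10.86), ((23,33),5.322),
             ((21,28),5.070), ((9,32),4.562), ((11,38),3.463), ((19,36),3.587),
             ((10,34),3.232), ((13,22),6.260)]

def density(v, kernels):
    """Sum of 2-D Gaussian kernels evaluated at point v."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        d2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total

def classify(v):
    """Predict 'O' if the O-class density dominates at v, else 'X'."""
    return "O" if density(v, O_KERNELS) > density(v, X_KERNELS) else "X"
```

For instance, `classify((15,33))` yields "O" and `classify((8,15))` yields "X", agreeing with the training labels of those two points.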
Identifying Boundary of Different
Classes of Objects
Boundary Identified
Data Mining / Knowledge Discovery
•
The main theme of data mining is to discover unknown, implicit knowledge in a large data set.
•
There are three main categories of data mining algorithms:
•
Classification;
•
Clustering;
•
Mining association rules / correlation analysis.
Data Classification
•
In a data classification problem, each object is described by a set of attribute values, and each object belongs to one of a set of predefined classes.
•
The goal is to derive, from a given set of training samples, a set of rules that predicts which class a new object should belong to. Data classification is also called supervised learning.
Instance-Based Learning
•
In instance-based learning, we take the k nearest training samples of a new instance (v_1, v_2, …, v_m) and assign the new instance to the class that has the most instances among those k nearest training samples.
•
Classifiers that adopt instance-based learning are commonly called KNN (k-nearest-neighbor) classifiers.
Example of the KNN
•
If a 1NN classifier is employed, then the prediction for the test instance is "×".
•
If a 3NN classifier is employed, then the prediction for the test instance is "O".
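A KNN classifier for the 24-point example data set fits in a few lines of Python; this is a minimal sketch, and the test point below is my own illustration, not the instance marked on the slide:

```python
import math
from collections import Counter

# The example data set from the earlier slide: (x, y) -> class label.
TRAIN = [((15,33),"O"), ((18,28),"X"), ((16,31),"O"), ((9,23),"X"),
         ((15,35),"O"), ((9,32),"X"), ((8,15),"X"), ((17,34),"O"),
         ((11,38),"X"), ((11,31),"O"), ((18,39),"X"), ((13,34),"O"),
         ((13,37),"X"), ((14,32),"O"), ((19,36),"X"), ((18,32),"O"),
         ((25,18),"X"), ((10,34),"X"), ((16,38),"X"), ((23,33),"X"),
         ((15,30),"O"), ((12,33),"O"), ((21,28),"X"), ((13,22),"X")]

def knn_predict(point, k):
    """Return the majority class among the k nearest training samples."""
    ranked = sorted(TRAIN, key=lambda s: math.dist(point, s[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, `knn_predict((12, 32), 3)` returns "O", because the three nearest neighbors (12,33), (11,31), and (14,32) are all "O" instances.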
Applications of Data Classification in Bioinformatics
•
In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples whose classes are known.
•
For example, the Leukemia data set contains 72 samples and 7129 genes:
•
25 Acute Myeloid Leukemia (AML) samples;
•
38 B-cell Acute Lymphoblastic Leukemia samples;
•
9 T-cell Acute Lymphoblastic Leukemia samples.
Model of Microarray Data Sets

A microarray data set with m samples and n genes is modeled as an m × n matrix M, where entry M(i, j) ∈ R is the expression level of gene j in sample i:

           Gene 1   Gene 2   ……   Gene n
Sample 1
Sample 2
  ⋮
Sample m
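The matrix model maps directly onto a 2-D array. A minimal sketch follows; the tiny 3×4 expression values are invented purely for illustration:

```python
# Rows are samples, columns are genes: M[i][j] is the expression
# level of gene j in sample i (an arbitrary real number).
M = [
    [2.1, 0.3, 5.7, 1.2],   # Sample 1
    [1.9, 0.4, 5.1, 0.9],   # Sample 2
    [0.2, 3.8, 0.1, 4.4],   # Sample 3
]

m = len(M)       # number of samples
n = len(M[0])    # number of genes

# The expression profile of one gene (column index 2) across all samples:
gene3 = [row[2] for row in M]
```

Real microarray matrices have thousands of gene columns (e.g. 7129 in the Leukemia data set), so in practice a numerical array library is used instead of nested lists.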
Alternative Data Classification Algorithms
•
Decision trees (C4.5 and C5.0);
•
Instance-based learning (KNN);
•
Naïve Bayesian classifier;
•
Support vector machine (SVM);
•
Novel approaches, including the RBF network based classifier that we have recently proposed.
Accuracy of Different Classification Algorithms

Data set (train, test)     RBF      SVM      1NN      3NN
Satimage (4435, 2000)      92.30    91.30    89.35    90.6
Letter (15000, 5000)       97.12    97.98    95.26    95.46
Shuttle (43500, 14500)     99.94    99.92    99.91    99.92
Average                    96.45    96.40    94.84    95.33
Comparison of Execution Time (in seconds)

                     RBF without       RBF with
                     data reduction    data reduction    SVM
Cross validation
  Satimage           670               265               64622
  Letter             2825              1724              386814
  Shuttle            96795             59.9              467825
Make classifier
  Satimage           5.91              0.85              21.66
  Letter             17.05             6.48              282.05
  Shuttle            1745              0.69              129.84
Test
  Satimage           21.3              7.4               11.53
  Letter             128.6             51.74             94.91
  Shuttle            996.1             5.85              2.13
More Insights

                                        Satimage    Letter    Shuttle
# of training samples in the
original data set                       4435        15000     43500
# of training samples after
data reduction is applied               1815        7794      627
% of training samples remaining         40.92%      51.96%    1.44%
Classification accuracy after
data reduction is applied               92.15       96.18     99.32
# of support vectors identified
by LIBSVM                               1689        8931      287
Data Clustering
•
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.
The Agglomerative Hierarchical Clustering Algorithms
•
The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances.
•
Initially, each data instance forms its own cluster.
•
The clustering algorithm repeatedly merges the two clusters with the minimum inter-cluster distance.
•
Upon merging two clusters, the clustering algorithm computes the distances between the newly formed cluster and the remaining clusters, and updates the sorted list of inter-cluster distances accordingly.
•
There are a number of ways to define the inter-cluster distance:
•
minimum distance (single-link);
•
maximum distance (complete-link);
•
average distance;
•
mean distance.
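The steps above can be sketched in plain Python. This minimal version uses the single-link (minimum) inter-cluster distance and, for brevity, a linear scan over cluster pairs instead of the maintained sorted list the slide describes:

```python
import math

def single_link_distance(a, b):
    """Minimum pairwise distance between points of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]          # each instance starts as a cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the minimum inter-cluster distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters
```

On two well-separated groups of three points each, the algorithm merges within each group first and stops with the two natural clusters. Swapping `min` for `max` in `single_link_distance` turns this into the complete-link variant.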
An Example of the Agglomerative Hierarchical Clustering Algorithm
•
For the following data set of six points (1-6), we get different clustering results with the single-link and complete-link algorithms.

Result of the Single-Link Algorithm
[Dendrogram over points 1-6, with leaf order 1, 3, 4, 5, 2, 6.]

Result of the Complete-Link Algorithm
[Dendrogram over points 1-6, with leaf order 1, 3, 2, 4, 5, 6.]
Remarks
•
Single-link and complete-link are the two most commonly used alternatives.
•
Single-link suffers from the so-called chaining effect.
•
On the other hand, complete-link also fails in some cases.
Example of the Chaining Effect

Single-link (10 clusters)        Complete-link (2 clusters)

Effect of Bias towards Spherical Clusters

Single-link (2 clusters)         Complete-link (2 clusters)
K-Means: A Partitional Data Clustering Algorithm
•
The k-means algorithm is probably the most commonly used partitional clustering algorithm.
•
The k-means algorithm begins by selecting k data instances as the means, or centers, of k clusters.
•
The k-means algorithm then executes the following loop until the convergence criterion is met:

  repeat {
    assign every data instance to the closest cluster, based on the distance
    between the data instance and the center of the cluster;
    compute the new centers of the k clusters;
  } until (the convergence criterion is met);

•
A commonly used convergence criterion is that the sum of squared errors

  E = Σ_i Σ_{p ∈ C_i} ‖p − m_i‖²,

where m_i is the center of cluster C_i, no longer decreases.
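The loop above translates directly into code. A minimal 2-D sketch that stops as soon as the error E stops decreasing:

```python
import math

def kmeans(points, k, initial_centers):
    """Plain k-means: assign points to the nearest center, recompute the
    centers, and stop once E (sum of squared distances) stops improving."""
    centers = list(initial_centers)
    prev_error = float("inf")
    while True:
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        error = 0.0
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
            error += math.dist(p, centers[i]) ** 2
        # Convergence criterion: E no longer decreases.
        if error >= prev_error:
            return centers, clusters
        prev_error = error
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
```

As the next slides note, the result depends on the initial centers: a poor initial selection can leave the algorithm stuck in a locally optimal state.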
Illustration of the K-Means Algorithm (I)
[Figure: the data set with the three initial centers marked.]

Illustration of the K-Means Algorithm (II)
[Figure: the three new centers after the 1st iteration.]

Illustration of the K-Means Algorithm (III)
[Figure: the three new centers after the 2nd iteration.]
A Case in which the K-Means Algorithm Fails
•
The k-means algorithm may converge to a locally optimal state, as the following example demonstrates:
[Figure: an initial selection of centers that leads to a locally optimal clustering.]
Remarks
•
As the examples demonstrate, no clustering algorithm is universally superior to the others with respect to clustering quality.
Applications of Data
Clustering in Microarray Data
Analysis
•
Data clustering has been employed in
microarray data analysis for
•
identifying the genes with similar expressions;
•
identifying the subtypes of samples.
Feature Selection in Microarray Data Analysis
•
In microarray data analysis, it is highly desirable to identify the genes that are correlated with the classes of samples.
•
For example, the Leukemia data set contains 7129 genes. We want to identify the genes that lead to the different disease types.
•
Furthermore, inclusion of features that are not correlated with the classification decision may result in lower classification accuracy or poor clustering quality.
•
For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes a 3NN classifier to predict the class of the marked test instance incorrectly.
•
It is apparent that the "O"s and "×"s are separated by the line x = 10. If only the attribute corresponding to the x-axis were selected, the 3NN classifier would predict the class of the test instance correctly.
[Figure: two classes separated by the vertical line x = 10; the irrelevant y-axis feature misleads the 3NN classifier.]
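The effect described above is easy to reproduce with a toy data set. The six training points and the test point below are invented for illustration (they are not the slide's figure): the classes are fully separated by x = 10, but an irrelevant y feature drags in neighbors of the wrong class:

```python
import math
from collections import Counter

# Toy data: the class is determined by x alone (x < 10 -> "O", x > 10 -> "X");
# the y feature is irrelevant noise.
TRAIN = [((9.0, 0), "O"), ((9.5, 0), "O"), ((8.5, 1), "O"),
         ((11, 10), "X"), ((12, 10), "X"), ((11, 9), "X")]
TEST = (9.5, 9)   # true class "O", but its y value resembles the "X" points

def knn3(point, feature_indices):
    """3NN prediction using only the selected feature indices."""
    def dist(a, b):
        return math.dist([a[i] for i in feature_indices],
                         [b[i] for i in feature_indices])
    ranked = sorted(TRAIN, key=lambda s: dist(point, s[0]))
    return Counter(label for _, label in ranked[:3]).most_common(1)[0][0]
```

With both features, `knn3(TEST, (0, 1))` returns "X" (incorrect, because the three nearest neighbors are all "X" points with similar y values); with the x feature alone, `knn3(TEST, (0,))` returns the correct class "O".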
Univariate Analysis in Feature Selection
•
In univariate analysis, the importance of each feature is determined by how objects of different classes are distributed along that particular axis.
•
Let v_1, v_2, …, v_m and v′_1, v′_2, …, v′_n denote the feature values of the class-1 and class-2 objects, respectively.
•
Assume that the feature values of both classes of objects follow normal distributions.
•
Then

  T = (v̄ − v̄′) / sqrt( [((m−1)s_1² + (n−1)s_2²) / (m + n − 2)] · (1/m + 1/n) )

follows a t-distribution with (m + n − 2) degrees of freedom, where

  v̄ = (1/m) Σ_{i=1}^{m} v_i,   v̄′ = (1/n) Σ_{i=1}^{n} v′_i;
  s_1² = (1/(m−1)) Σ_{i=1}^{m} (v_i − v̄)²,   s_2² = (1/(n−1)) Σ_{i=1}^{n} (v′_i − v̄′)².

•
If the absolute value of the t statistic of a feature is lower than a threshold, then the feature is deleted.
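The t statistic above can be computed per feature and used as a filter; a minimal sketch:

```python
import math

def t_statistic(class1, class2):
    """Pooled two-sample t statistic for one feature."""
    m, n = len(class1), len(class2)
    mean1 = sum(class1) / m
    mean2 = sum(class2) / n
    s1 = sum((v - mean1) ** 2 for v in class1) / (m - 1)   # sample variance s_1^2
    s2 = sum((v - mean2) ** 2 for v in class2) / (n - 1)   # sample variance s_2^2
    pooled = ((m - 1) * s1 + (n - 1) * s2) / (m + n - 2)
    return (mean1 - mean2) / math.sqrt(pooled * (1 / m + 1 / n))

def select_features(matrix1, matrix2, threshold):
    """Keep the indices of features whose |t| reaches the threshold.
    matrix1 / matrix2: lists of samples (rows) for class 1 / class 2."""
    num_features = len(matrix1[0])
    return [j for j in range(num_features)
            if abs(t_statistic([row[j] for row in matrix1],
                               [row[j] for row in matrix2])) >= threshold]
```

For example, for class-1 values [1, 2, 3] and class-2 values [5, 6, 7], both variances are 1, the pooled variance is 1, and T = (2 − 6) / sqrt(2/3) ≈ −4.899.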
Multivariate Analysis
•
Univariate analysis is not able to identify the crucial features in the following example.
•
Therefore, multivariate analysis has been developed. However, most of the multivariate analysis algorithms that have been proposed suffer from high time complexity and may not be applicable to real-world problems.
Summary
•
Data clustering and data classification have been widely used in microarray data analysis.
•
Feature selection remains the most challenging issue as of today.