A Comparative Evaluation of Different Machine Learning Methods on Microarray Gene Expression Data
Song Li
CS573 Class Project
What Is a Microarray?
We want to know how genes are expressed in an individual (is this gene turned on?)
Microarray technology allows researchers to monitor the expression levels of thousands of genes simultaneously
Microarray Data
          Sample 1   Sample 2   ...   Sample n
Gene 1    X11        X12        ...   X1n
Gene 2    X21        X22        ...   X2n
Gene 3    X31        X32        ...   X3n
Gene p    Xp1        Xp2        ...   Xpn
Each row represents a specific gene.
Each column represents an experiment (sample).
Xij is the expression intensity of gene i observed in sample j.
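As a minimal sketch of this layout, the matrix can be held as a list of rows, one per gene. The numbers and the helper names `expression` and `sample_profile` are made up for illustration and do not come from any dataset in this project.

```python
# Toy expression matrix: rows are genes, columns are samples (values are made up).
X = [
    [2.1, 0.4],  # Gene 1
    [0.0, 1.7],  # Gene 2
    [3.3, 3.1],  # Gene 3
]

def expression(i, j):
    """Return Xij using the slide's 1-based gene index i and sample index j."""
    return X[i - 1][j - 1]

def sample_profile(j):
    """Return column j: the expression of every gene in sample j."""
    return [row[j - 1] for row in X]

p, n = len(X), len(X[0])  # p genes, n samples
```

A classifier treats each column (`sample_profile(j)`) as one input vector, so the matrix is usually transposed before learning.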
Classification of Microarray Data
A typical usage of microarray data is
tumor classification in cancer research.
[Diagram: labeled microarray data is used for training (learning); given the data of one microarray experiment, the question is "Is it cancer? If yes, what type?"]
Challenge
“Big p, small n” problem
Normally thousands of genes are tested in a single microarray experiment
The number of samples is relatively small (still an expensive experiment)
Many learning algorithms cannot handle this well
Dimension reduction is needed
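One way to see why p > n is a problem: with more genes than samples, different linear models can fit the training labels exactly and still disagree on a new sample. The sketch below uses made-up numbers (2 samples, 4 genes) and hand-picked weight vectors; it is an illustration only, not part of the project's code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Two training samples (n = 2) measured on four genes (p = 4); values are made up.
train = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
]
labels = [1.0, -1.0]

# Two different weight vectors. w_b equals w_a plus [2, 1, -1, 0],
# a vector orthogonal to both training samples.
w_a = [1.0, -1.0, 0.0, 0.0]
w_b = [3.0, 0.0, -1.0, 0.0]

fits_a = [dot(x, w_a) for x in train]  # reproduces the labels exactly
fits_b = [dot(x, w_b) for x in train]  # also reproduces the labels exactly

new_sample = [1.0, 1.0, 0.0, 0.0]
pred_a = dot(new_sample, w_a)
pred_b = dot(new_sample, w_b)  # differs from pred_a
```

Both models are perfect on the training data, yet they give different answers for `new_sample`. Reducing the number of inputs removes this freedom.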
Methods
Support Vector Machine
Linear regression based methods
Huang et al. [1] summarized several models as linear regression models
They also proposed PLS (Partial Least Squares) and PPLS (Penalized Partial Least Squares) based linear regression models
Clustering + Bayesian classification
Ji et al. [2] proposed a two-stage model using clustering and Bayesian classification
[1] X. Huang and W. Pan. Linear regression and two-class classification with gene expression data. Bioinformatics, 19: 2072-2078, 2003.
[2] X. Ji, K. Tsui, and K. Kim. A novel means of using gene clusters in a two-step empirical Bayes method for predicting classes of samples. Bioinformatics, 21: 1055-1061, 2005.
Evaluation of Existing Models
These newly proposed models have been tested on only a very limited number of datasets
Huang’s paper reports results on leukemia data and colon data. However, according to the paper, the colon data used is not publicly available
Ji’s paper reports results on an anonymous dataset and leukemia data
Helman et al. [3] studied the performance of a large number of microarray data classifiers, but the data is from the original papers
How about using WEKA as the evaluation platform?
[3] P. Helman, R. Veroff, S. Atlas, and C. Willman. A Bayesian network classification methodology for gene expression data. J. of Computational Biology, 11: 581-615, 2004.
This Project
Collects microarray datasets and converts them into WEKA’s .arff files
Implements:
PLS and PPLS based linear regression model described in Huang’s paper
Two-stage clustering + Bayesian method described in Ji’s paper
Compares the results of these new methods with WEKA’s SVM SMO implementation (2-class classification)
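The conversion to .arff could look roughly like the sketch below. The function name `to_arff`, the toy values, and the assumption that gene names contain no spaces or commas are all illustrative; the project's actual converter is not shown in these slides.

```python
def to_arff(relation, gene_names, X, labels):
    """Build ARFF text from a genes-by-samples matrix X and one class label per sample.

    WEKA expects one instance per line, so each sample (a column of X)
    becomes a row and each gene becomes a numeric attribute.
    """
    classes = sorted(set(labels))
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in gene_names]
    lines.append("@attribute class {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for j, label in enumerate(labels):
        values = [str(row[j]) for row in X]  # column j of X
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"

# Toy input: 2 genes, 2 samples (values are made up).
arff_text = to_arff(
    "toy",
    ["g1", "g2"],
    [[2.1, 0.4], [0.0, 1.7]],
    ["tumor", "normal"],
)
```

Writing `arff_text` to a file with an `.arff` extension should give something WEKA can load, under the naming assumption above.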
Datasets
Source: caGEDA at University of Pittsburgh (http://bioinformatics.upmc.edu/GE2/GEDA.html)

Dataset               # of samples   # of genes
colon1                61             7464
colon2                57             7464
lung1                 86             5377
lung2                 69             5377
lymphoma1             96             4027
lymphoma2             77             7129
leukemia (training)   38             7129
leukemia (testing)    34             7129
Model #1: Linear regression
PLS
[Equations not preserved in this text: the definitions and the predictor]
PPLS
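Since the slide formulas are not available here, the following is only a generic sketch of a one-component PLS regression used as a two-class classifier. It is not necessarily the formulation in Huang’s paper, and it does not cover the PPLS penalty. The function names, the 0/1 label coding, the 0.5 threshold, and the toy data are assumptions made for illustration.

```python
import math

def fit_pls1(X, y):
    """Fit a one-component PLS regression. X is samples-by-genes, y holds 0/1 labels."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[g] for row in X) / n for g in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[g] - x_mean[g] for g in range(p)] for row in X]  # centered data
    yc = [v - y_mean for v in y]                                # centered labels
    # Weight vector: covariance direction between each gene and the label.
    w = [sum(Xc[i][g] * yc[i] for i in range(n)) for g in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each sample onto w (the single latent component).
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Ordinary least squares of the label on the score.
    slope = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return x_mean, y_mean, w, slope

def predict_pls1(model, x):
    """Return the fitted regression value for one sample x."""
    x_mean, y_mean, w, slope = model
    score = sum((a - m) * b for a, m, b in zip(x, x_mean, w))
    return y_mean + slope * score

def classify(model, x, threshold=0.5):
    """Assign class 1 when the regression output exceeds the threshold."""
    return 1 if predict_pls1(model, x) > threshold else 0

# Toy data: 4 samples, 3 genes (values are made up).
X_train = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.1], [3.0, 1.0, 0.3], [3.2, 1.1, 0.2]]
y_train = [0, 0, 1, 1]
model = fit_pls1(X_train, y_train)
```

The p genes are compressed into a single score before regression, which is how PLS-style methods sidestep the "big p, small n" issue. Additional components would be extracted from the deflated data in the same way.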
Model #2: Clustering + Bayesian
K << p
My implementation uses K-means as the clustering method
[Diagram: original data (p genes) -> clustering -> k meta-genes -> Bayesian model]
Bayesian Model
[Equations not preserved in this text]
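The clustering stage can be sketched as a plain K-means over gene profiles, with each cluster centroid serving as one meta-gene. This is an illustrative version only: seeding with the first k genes, the fixed iteration count, and the toy data are assumptions, and the Bayesian stage from Ji’s paper is not reproduced here.

```python
def kmeans_genes(X, k, iterations=20):
    """Cluster the rows (genes) of X into k groups.

    Returns (assignment, centroids). Each centroid is the average profile of
    its member genes across samples, i.e. one meta-gene.
    """
    centroids = [list(row) for row in X[:k]]  # assumption: first k genes seed the clusters
    assignment = [0] * len(X)
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid (squared Euclidean distance).
        for i, row in enumerate(X):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            assignment[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data: 4 genes measured on 4 samples (values are made up).
genes = [
    [1.0, 1.1, 5.0, 5.2],
    [5.0, 4.9, 1.0, 1.1],
    [1.2, 0.9, 5.1, 4.9],
    [4.8, 5.1, 0.9, 1.0],
]
assignment, meta_genes = kmeans_genes(genes, k=2)
```

Here `meta_genes` is a k-by-n matrix that replaces the original p-by-n matrix as the classifier's input. Each iteration compares all p genes against all k centroids, so the cost grows with p, k, and the number of iterations.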
Result
Dataset      SMO      PLS      PPLS     Bayesian
colon1       95.08%   90.16%   91.80%   88.52%
colon2       92.98%   92.98%   78.94%   87.72%
lung1        76.74%   80.23%   73.26%   N/A
lung2        68.12%   73.91%   55.07%   N/A
lymphoma1    87.5%    89.58%   89.58%   93.75%
lymphoma2    93.51%   77.92%   75.32%   N/A
leukemia     94.12%   N/A      N/A      97.06%
Discussion
These newly proposed methods offer performance comparable to WEKA’s SVM-SMO
Parameter selection sometimes matters a lot, e.g. the K value in the Bayesian method and the q value in the PPLS method
Clustering thousands of genes is computationally intensive
Question: does the choice of clustering algorithm have an impact on performance?
Project Web Site
http://www.cs.iastate.edu/~lisong/cs573