
15 Oct 2013


A Comparative Evaluation of
Different Machine Learning Methods
on Microarray Gene Expression Data

Song Li

CS573 Class Project

What Is a Microarray?


We want to know how genes are expressed in an individual (is this gene turned on?)

Microarray technology allows researchers to monitor the expression levels of thousands of genes simultaneously

Microarray Data

          Sample 1   Sample 2   ...   Sample n
Gene 1    X11        X12        ...   X1n
Gene 2    X21        X22        ...   X2n
Gene 3    X31        X32        ...   X3n
...
Gene p    Xp1        Xp2        ...   Xpn

Each row represents a specific gene.

Each column represents an experiment.

Xij is the expression intensity of gene i observed in sample j.

Classification of Microarray Data


A typical use of microarray data is tumor classification in cancer research.

[Diagram: labeled microarray data → training → learning; data of one microarray experiment → learning → "Is it cancer? If yes, what type?"]

Challenge


“Big p, small n” problem

Normally, thousands of genes are tested in a single microarray experiment

The number of samples is relatively small (it is still an expensive experiment)

Many learning algorithms cannot handle this well

Dimension reduction is needed
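To illustrate why "big p, small n" is hard, here is a minimal sketch (not part of the original project) that generates pure-noise expression values and counts how many genes happen to separate two classes perfectly. The sizes `n = 6` and `p = 5000`, the random seed, and the function name `perfectly_separating_genes` are assumptions chosen for illustration only.

```python
import random


def perfectly_separating_genes(X, labels):
    """Return indices of genes whose values split the two classes perfectly.

    X is a list of samples (rows), each a list of gene values (columns);
    labels holds 0 or 1 for each sample.
    """
    hits = []
    for j in range(len(X[0])):
        pos = [row[j] for row, y in zip(X, labels) if y == 1]
        neg = [row[j] for row, y in zip(X, labels) if y == 0]
        # Perfect separation: one class lies entirely above the other.
        if min(pos) > max(neg) or max(pos) < min(neg):
            hits.append(j)
    return hits


# Hypothetical sizes: few samples, many genes, and no real signal at all.
rng = random.Random(0)
n, p = 6, 5000
labels = [0, 0, 0, 1, 1, 1]
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

hits = perfectly_separating_genes(X, labels)
print(f"{len(hits)} of {p} pure-noise genes separate the classes perfectly")
```

With so few samples, many noise genes look like perfect markers by chance alone, which is one reason a classifier trained on all p genes can overfit and why the gene dimension is usually reduced first.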

Methods


Support Vector Machine

Linear regression based methods

Huang et al. [1] summarized several models as linear regression models

They also proposed linear regression models based on PLS (Partial Least Squares) and PPLS (Penalized Partial Least Squares)

Clustering + Bayesian classification

Ji et al. [2] proposed a two-stage model using clustering and Bayesian classification


[1] X. Huang and W. Pan. Linear regression and two-class classification with gene expression data. Bioinformatics, 19: 2072-2078, 2003.

[2] X. Ji, K. Tsui, and K. Kim. A novel means of using gene clusters in a two-step empirical Bayes method for predicting classes of samples. Bioinformatics, 21: 1055-1061, 2005.




Evaluation of Existing Models


These newly proposed models have been tested on only a very limited number of datasets

Huang’s paper has results on the Leukemia data and colon data. However, according to the paper, the colon data used is not publicly available

Ji’s paper has results on an anonymous dataset and the Leukemia data

Helman et al. studied the performance of a large number of microarray data classifiers, but the data is from the original papers [3]


How about using WEKA as the evaluation platform?


[3] P. Helman, R. Veroff, S. Atlas, and C. Willman. A Bayesian network classification methodology for gene expression data. J. of Computational Biology, 11: 581-615, 2004.


This Project


Collects microarray datasets and converts them into WEKA’s .arff files

Implements:

PLS and PPLS based linear regression models described in Huang’s paper

Two-stage clustering + Bayesian method described in Ji’s paper

Compares the results of these new methods with WEKA’s SVM SMO implementation (2-class classification)
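The conversion to .arff could look roughly like the sketch below. It is not the project's converter: the function name `to_arff`, the gene names, the class labels `normal` and `tumor`, and the tiny expression matrix are all hypothetical. The one structural point it relies on is that WEKA treats each row of the `@data` section as one instance, so a genes-by-samples matrix has to be transposed so that each sample becomes a row and each gene becomes a numeric attribute.

```python
def to_arff(relation, gene_names, samples, labels):
    """Build ARFF text with one numeric attribute per gene and a nominal class.

    samples is a list of rows, one per sample, each with one value per gene.
    Gene names and labels are assumed to contain no spaces or commas.
    """
    if len(samples) != len(labels):
        raise ValueError("need exactly one label per sample")
    if any(len(row) != len(gene_names) for row in samples):
        raise ValueError("each sample needs one value per gene")

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in gene_names]
    classes = sorted(set(labels))
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for row, label in zip(samples, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical 3 genes x 2 samples matrix, laid out as in the microarray table.
expression = [
    [1.5, 0.2],  # gene1
    [3.0, 2.5],  # gene2
    [0.7, 4.1],  # gene3
]
# Transpose so that every sample becomes one ARFF instance.
samples = [list(col) for col in zip(*expression)]
arff_text = to_arff("demo", ["gene1", "gene2", "gene3"], samples, ["normal", "tumor"])
print(arff_text)
```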

Datasets


Source: caGEDA at University of Pittsburgh (http://bioinformatics.upmc.edu/GE2/GEDA.html)

Dataset               # of samples   # of genes
colon1                61             7464
colon2                57             7464
lung1                 86             5377
lung2                 69             5377
lymphoma1             96             4027
lymphoma2             77             7129
leukemia (training)   38             7129
leukemia (testing)    34             7129

Model #1 Linear regression


PLS


Define:

Cont’d

Cont’d


Predictor


PPLS
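Since the formulas on these slides did not survive, the following is only a generic sketch of single-response PLS regression (the textbook PLS1 procedure with deflation), used as a two-class classifier by coding the classes as 0 and 1 and thresholding the prediction at 0.5. It is an assumption that this matches the spirit of the model in Huang's paper; it is not the project's code, it does not cover the PPLS penalty, and the names `pls1_fit` and `pls1_predict` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_components):
    """Fit PLS1: X is samples x genes (lists), y is 0/1 class codes."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xk = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    yk = [v - y_mean for v in y]                                # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in gene space with maximal covariance with y.
        w = [sum(Xk[i][j] * yk[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xk]  # sample scores on this component
        tt = dot(t, t)
        if tt < 1e-12:
            break
        load = [sum(Xk[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        c = dot(yk, t) / tt              # regression of y on the scores
        # Deflate: remove what this component explains from X and y.
        Xk = [[Xk[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        yk = [yk[i] - c * t[i] for i in range(n)]
        components.append((w, load, c))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the continuous response for one sample x (list of gene values)."""
    x_mean, y_mean, components = model
    xk = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, c in components:
        t = dot(xk, w)
        y_hat += c * t
        xk = [a - t * b for a, b in zip(xk, load)]
    return y_hat


def pls1_classify(model, x):
    """Two-class rule (an assumption): class 1 if the prediction exceeds 0.5."""
    return 1 if pls1_predict(model, x) > 0.5 else 0
```

The number of components plays the role of the dimension reduction here: thousands of genes are compressed into a few score vectors before the regression step.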


Model #2: Clustering + Bayesian


K << p


My implementation uses K-means as the clustering method

[Diagram: original data (p genes) → clustering → K meta-genes → Bayesian model]

Bayesian Model


Cont’d
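The two-stage idea (cluster p genes into K meta-genes, then classify on the K meta-genes) can be sketched as follows. This is a simplified stand-in, not the method from Ji's paper and not the project's implementation: the meta-gene is assumed to be the per-sample average of the genes in a cluster, the K-means initialization is a deterministic choice made for reproducibility, and a plain Gaussian naive Bayes classifier is swapped in for the empirical Bayes model. All function names are hypothetical.

```python
import math


def kmeans_genes(genes, k, n_iter=20):
    """Cluster gene profiles (p lists of n sample values); return a cluster index per gene."""
    step = max(1, len(genes) // k)
    centroids = [list(genes[c * step]) for c in range(k)]  # evenly spaced genes as seeds
    assign = [0] * len(genes)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(g, centroids[c])))
            for g in genes
        ]
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [g for g, a in zip(genes, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def meta_genes(genes, assign, k):
    """Average the genes of each cluster; returns a samples x k matrix."""
    n = len(genes[0])
    profiles = []
    for c in range(k):
        members = [g for g, a in zip(genes, assign) if a == c]
        if members:
            profiles.append([sum(col) / len(members) for col in zip(*members)])
        else:
            profiles.append([0.0] * n)  # empty cluster: constant meta-gene
    return [list(row) for row in zip(*profiles)]


def fit_gaussian_nb(samples, labels, eps=1e-3):
    """Per class: log prior, feature means, and smoothed feature variances."""
    model = {}
    for cls in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + eps
                     for col, m in zip(cols, means)]
        model[cls] = (math.log(len(rows) / len(samples)), means, variances)
    return model


def predict_gaussian_nb(model, sample):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(cls):
        log_prior, means, variances = model[cls]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(sample, means, variances)
        )
    return max(model, key=log_posterior)
```

A typical flow would be `assign = kmeans_genes(genes, K)`, then `X = meta_genes(genes, assign, K)`, then `fit_gaussian_nb(X, labels)`; new samples must be reduced with the same `assign` before prediction.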


Result

Dataset     SMO      PLS      PPLS     Bayesian
colon1      95.08%   90.16%   91.80%   88.52%
colon2      92.98%   92.98%   78.94%   87.72%
lung1       76.74%   80.23%   73.26%   N/A
lung2       68.12%   73.91%   55.07%   N/A
lymphoma1   87.50%   89.58%   89.58%   93.75%
lymphoma2   93.51%   77.92%   75.32%   N/A
leukemia    94.12%   N/A      N/A      97.06%
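As a rough sanity check on how far apart the methods really are, the percentages can be turned back into sample counts. This sketch assumes each reported accuracy is simply the fraction of correctly classified samples in that dataset (the slides do not state the evaluation protocol); the helper name `correct_count` is hypothetical. The colon1 dataset has 61 samples.

```python
def correct_count(accuracy_percent, n_samples):
    """Number of correctly classified samples implied by an accuracy figure."""
    return round(accuracy_percent / 100 * n_samples)


# Reported colon1 accuracies (61 samples).
colon1 = {"SMO": 95.08, "PLS": 90.16, "PPLS": 91.80, "Bayesian": 88.52}
counts = {method: correct_count(acc, 61) for method, acc in colon1.items()}
print(counts)  # {'SMO': 58, 'PLS': 55, 'PPLS': 56, 'Bayesian': 54}
```

Under that assumption, the gap between the best and worst method on colon1 is four samples out of 61, so small differences in the table should be read with care.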

Discussion


These newly proposed methods offer performance comparable to WEKA’s SVM-SMO

Parameter selection sometimes matters a lot, e.g. the K value in the Bayesian method and the q value in the PPLS method

Clustering thousands of genes is computationally intensive

Question: does the choice of clustering algorithm have an impact on performance?

Project Web Site

http://www.cs.iastate.edu/~lisong/cs573