★★★(2005, Kai-Bo Duan) Multiple SVM-RFE for gene selection in cancer
classification with expression data

(Seems not applicable here)

(Treats the multiclass weighting vectors statistically)

(Note! The SVMs here are trained on subsamples rather than on all of the samples)

This paper proposes a new feature selection method that uses a backward
elimination procedure similar to the one implemented in support vector machine
recursive feature elimination (SVM-RFE). At each step, it computes the feature
ranking scores from a statistical analysis of the weight vectors of multiple linear
SVMs trained, by resampling, on subsamples of the original training data.
Correspondingly, a new feature ranking criterion is proposed.
The SVM-RFE feature selection method was proposed in (Guyon, Weston et al. 2002)
to conduct gene selection for cancer classification. Nested subsets of features are
selected in a sequential backward elimination manner, which starts with all the feature
variables and removes one feature at a time. At each step, the coefficients of the weight
vector $w$ of a linear SVM are used to compute the feature ranking scores, and the
feature with the smallest ranking score $c_i = (w_i)^2$ is eliminated, where $w_i$
represents the corresponding component of the weight vector $w$.

Using $c_i = (w_i)^2$ as the ranking criterion corresponds to removing the feature
whose removal changes the objective function the least. This objective function is
chosen to be $J = \frac{1}{2}\|w\|^2$ in SVM-RFE. This can be explained by the
Optimal Brain Damage (OBD) algorithm (LeCun, Denker et al. 1990), which approximates
the change in the objective function caused by removing a given feature by expanding
the objective function in a Taylor series to second order:
$$\Delta J(i) = \frac{\partial J}{\partial w_i}\,\Delta w_i + \frac{\partial^2 J}{\partial w_i^2}\,\frac{(\Delta w_i)^2}{2} + \cdots$$

At the optimum of $J$, the first-order term can be neglected, and with
$J = \frac{1}{2}\|w\|^2$ this becomes $\Delta J(i) = \frac{1}{2}(\Delta w_i)^2$.
Setting $\Delta w_i = w_i$ corresponds to removing the $i$-th feature.

Suppose that we have $t$ linear SVMs trained on different subsamples of the original
training samples. Let $w_j$ be the weight vector of the $j$-th linear SVM and $w_{ji}$
be the corresponding weight value associated with the $i$-th feature. Let
$v_{ji} = (w_{ji})^2$. We can compute the feature ranking score with the following
criterion:

$$c_i = \frac{\bar{v}_i}{\sigma_{v_i}}$$

where $\bar{v}_i$ and $\sigma_{v_i}$ are the mean and standard deviation of the
variable $v_i$:

$$\bar{v}_i = \frac{1}{t}\sum_{j=1}^{t} v_{ji}, \qquad
\sigma_{v_i} = \sqrt{\frac{1}{t-1}\sum_{j=1}^{t}\left(v_{ji} - \bar{v}_i\right)^2}$$

Before computing the ranking score for each feature, it is important to normalize the
weight vectors: $w_j \leftarrow w_j / \|w_j\|$.
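
A minimal NumPy/scikit-learn sketch of this ranking criterion, assuming binary class labels and bootstrap-style resampling for the subsamples; the function name `msvm_rfe_scores` and the parameter `t` are illustrative, not from the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def msvm_rfe_scores(X, y, t=20, random_state=0):
    """Score features by c_i = mean(v_i) / std(v_i), with v_ji = (w_ji)^2 taken
    from t linear SVMs trained on resampled subsets (a sketch of the criterion
    described above, not the authors' code; binary labels y are assumed)."""
    rng = np.random.RandomState(random_state)
    V = np.empty((t, X.shape[1]))
    for j in range(t):
        Xs, ys = resample(X, y, random_state=rng)      # subsample by resampling
        w = LinearSVC(dual=True).fit(Xs, ys).coef_.ravel()
        w = w / np.linalg.norm(w)                      # normalize each weight vector
        V[j] = w ** 2                                  # v_ji = (w_ji)^2
    # small constant guards against a zero standard deviation
    return V.mean(axis=0) / (V.std(axis=0, ddof=1) + 1e-12)

# usage sketch: scores = msvm_rfe_scores(X_train, y_train); worst = int(scores.argmin())
```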

(2004) Feature selection for multi-class problems using support vector machines

(Important)

(Explains Optimal Brain Damage and writes out the equation in SVM terms)

(The equation seems to be written incorrectly)

(Guyon, Weston et al. 2002) proposed to use optimal brain damage as the selection
criterion. Furthermore, optimal brain damage has been studied by Rakotomamonjy and
shown to be better than the other measures proposed before (Rakotomamonjy 2003).
Optimal brain damage (OBD), proposed by (LeCun, Denker et al. 1990), uses the change
of the objective function as the selection criterion, which is defined as the
second-order term in the Taylor series of the objective function:

$$S_i = \frac{1}{2}\,\frac{\partial^2 L}{\partial D_i^2}\,(D_i)^2$$

in which $L$ is the objective function of the learning machine and $D_i$ is the weight of
feature $i$. OBD has been used in feature selection for artificial neural networks and
obtained satisfactory results (Cibas, Soulie et al. 1996). In binary classification SVMs,
OBD has performed well in gene analysis problems (Guyon, Weston et al. 2002).
For binary classification SVMs, the measure of OBD is defined (Guyon, Weston
et al. 2002) as

$$S_i = \frac{1}{2}\,\alpha^T K(x_k, x_h)\,\alpha - \frac{1}{2}\,\alpha^T K^{(-i)}(x_k, x_h)\,\alpha$$

where $\alpha$ is the vector of Lagrange multipliers in SVMs, and the superscript
$(-i)$ in $K^{(-i)}(x_k, x_h)$ means that component $i$ has been removed. The feature
corresponding to the least $S_i$ will be removed.
This method is based on binary classification SVMs. If we want to extend it to
multiclass classification SVMs, we have to compute the measures of each individual
binary classification SVM. The simplest way is to compute, for each feature, the sum
of the measures over the individual SVMs, and remove the features with the smallest
sum of measures.
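
A rough NumPy illustration of this idea: compute the OBD-style measure S_i for one binary SVM from its kernel matrix and Lagrange multipliers, then sum the measures over several binary SVMs for the multiclass extension. The helper names, the linear kernel, and the assumption that all binary machines share the same training samples are mine, not the paper's:

```python
import numpy as np

def linear_kernel(X):
    # Gram matrix with entries K[k, h] = x_k . x_h
    return X @ X.T

def obd_measures(alpha, y, X):
    """S_i = 1/2 a^T K a - 1/2 a^T K^(-i) a for every feature i, where
    a = alpha * y and K^(-i) is the kernel recomputed with feature i removed."""
    a = alpha * y
    full = 0.5 * a @ linear_kernel(X) @ a
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        scores[i] = full - 0.5 * a @ linear_kernel(np.delete(X, i, axis=1)) @ a
    return scores

def multiclass_obd_scores(binary_machines, X):
    # Simplest multiclass extension described above: sum the per-feature measures
    # over all binary SVMs (each given as an (alpha, y) pair) and later remove
    # the feature with the smallest summed measure.
    return sum(obd_measures(alpha, y, X) for alpha, y in binary_machines)
```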

(2005) Multicategory classification using extended SVM-RFE and Markov
blanket on SELDI-TOF mass spectrometry data

(Does not explain why OVO and DAGSVM are better)

(The writing is a bit awkward, but the structure is decent)

Several methods to classify multiclass data samples using SVMs have been proposed,
including those constructing a single classifier by maximizing the margin between all
classes simultaneously, and those based on binary classifications, such as
one-versus-rest (OVR), one-versus-one (OVO), and the directed acyclic graph SVM
(DAGSVM). Since OVO and DAGSVM were shown to be more suitable for practical use,
the OVO model is used in this study. The OVO method constructs
$\binom{k}{2} = \frac{k(k-1)}{2}$ binary SVMs for a $k$-class problem, where each of
the $\frac{k(k-1)}{2}$ SVMs casts one vote.
Finally, an instance is assigned to the class which receives the largest number of votes.
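
As a quick illustration of the voting step (a sketch; `pairwise_svms` is a made-up mapping from a class pair (s, t) to a trained binary classifier, and class labels are assumed to be 0..k-1):

```python
from itertools import combinations
import numpy as np

def ovo_predict(pairwise_svms, x, k):
    """One-versus-one voting: each of the k(k-1)/2 binary SVMs casts one vote,
    and the instance is assigned to the class with the most votes."""
    votes = np.zeros(k, dtype=int)
    for s, t in combinations(range(k), 2):
        winner = pairwise_svms[(s, t)].predict(x.reshape(1, -1))[0]  # returns s or t
        votes[winner] += 1
    return int(votes.argmax())
```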
SVM-RFE is a sequential backward feature elimination method based on SVM, which was
proposed to select a relevant set of features for a classification problem. It starts
with all the features. At every iteration, feature weights are obtained by training on
the dataset with the remaining features, and then the feature with the minimum weight
is removed from the data. This procedure continues until all features are ranked
according to the order in which they were removed.

(Note the multiclass SVM-RFE here)
(Hard to follow: why are there indices i, j, and k?)
We want to use SVM-RFE in the task of classifying multi-category instances. Here we
introduce the formulation used to decide which feature to eliminate in the nonlinear
SVM, which is an extension of the method used in binary SVM-RFE:

$$W_f = \frac{1}{2}\left[\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{k-1} e_{il}^{*}\, e_{jl}^{*}\, K(x_i, x_j) \;-\; \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{k-1} e_{il}^{*(f)}\, e_{jl}^{*(f)}\, K\!\left(x_i^{(f)}, x_j^{(f)}\right)\right]$$

where $e^{*} = \alpha^{*} y$ and $x^{(f)}$ represents that feature $f$ has been removed
from sample $x$. The feature which is most likely to result in the smallest change
after removal is eliminated; that is, the feature with the smallest $W_f$ value is
removed.
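
A schematic NumPy version of this criterion under my reading of the (partly garbled) formula: each entry of `e_star` holds the vector e* = alpha* y for one binary machine, `K` is a kernel function, and removing feature f simply drops that column before re-evaluating the kernel. All names here are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def change_criterion(e_star, K, X, f):
    """W_f = 1/2 [ sum_l e_l^T K(X, X) e_l - sum_l e_l^T K(X^(f), X^(f)) e_l ],
    where X^(f) is the training matrix with feature f removed; the feature
    with the smallest W_f is eliminated."""
    X_f = np.delete(X, f, axis=1)
    K_full, K_reduced = K(X, X), K(X_f, X_f)
    return 0.5 * sum(e @ K_full @ e - e @ K_reduced @ e for e in e_star)

# usage sketch:
# scores = [change_criterion(e_star, rbf_kernel, X, f) for f in range(X.shape[1])]
# eliminate = int(np.argmin(scores))
```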

★★★(2005) Multiclass cancer classification by using fuzzy support vector machine
and binary decision tree with gene selection

(Does not go into many details of SVM-RFE)
(This paper has high reference value)
(Discusses multiclass classification)
In the study of (Guyon, Weston et al. 2002), feature selection using recursive
feature elimination based on SVM (SVM-RFE) is proposed. When used in two-class
circumstances, it is demonstrated that the features selected by this technique yield
better classification performance than the other methods mentioned in (Guyon,
Weston et al. 2002). However, its application to multiclass feature selection has not
been seen because of its expensive computational burden.
In recent years, some multiclass classification algorithms, such as the binary
classification tree and the fuzzy SVM, have been proposed and reported to have
excellent performance, which they owe to their roots in the binary SVM.
Experiments:
- F test + RFE
- Binary classification tree + F test
- Binary classification tree + RFE
- FSVM + RFE
Fuzzy-SVM based on RFE can find the most important genes affecting certain types of
cancer with high recognition accuracy.
But multiclass classification combined with gene selection has not been investigated
intensively.
Two differently constructed multiclass classifiers with gene selection are proposed:
a fuzzy support vector machine (FSVM) with gene selection, and a binary classification
tree based on SVM with gene selection.
Many variable selection methods based on two-class classification have been proposed.
However, gene selection in the multiclass setting has not been seen because of its
expensive computational burden.
SVMs' remarkably robust performance with respect to sparse and noisy data makes them
a first choice in a number of applications.
The binary SVM has been used as a component in many multiclass classification
algorithms, such as the binary classification tree and the fuzzy SVM.
SVMs can work in combination with the technique of “kernels”, which automatically
performs a nonlinear mapping to a feature space, so that SVMs can settle nonlinear
separation problems.
Fuzzy-SVM is a new method first proposed by Abe and Inoue in (Inoue and Abe 2001;
Abe and Inoue 2002). It was proposed to deal with the unclassifiable regions that arise
when using the one-versus-rest or pairwise classification method based on binary SVMs
for $n\,(>2)$-class problems.
Fuzzy-SVM is an improved pairwise classification method with SVM. A fuzzy membership
function is introduced into the decision function based on pairwise classification.
For data in the classifiable regions, fuzzy-SVM gives the same classification results
as the pairwise classification with SVM method. For data in the unclassifiable regions,
fuzzy-SVM generates better classification results than the pairwise classification
with SVM method.
Decision functions that are simple weighted sums of the training data points plus a
bias are called linear discriminant functions, where $w$ is the weight vector and $b$
is a bias value.
Support vector machine recursive feature elimination (SVM-RFE) method was
proposed in (Guyon, Weston et al. 2002) to do gene selection for cancer classification.
Problem statement
Assume there are $K$ classes. Let $w = [w_1, \ldots, w_m]$ denote the class labels of
the $m$ samples, where $w_i = k$ indicates that sample $i$ belongs to class $k$,
$k = 1, \ldots, K$.
Assume $x_1, \ldots, x_n$ are the $n$ features. Let $x_{ij}$ be the $j$-th feature of
the $i$-th sample, where $j = 1, \ldots, n$, and let $X = [x_{ij}]_{m \times n}$ denote
all features of the data samples, that is,

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \qquad (1)$$

with column $j$ of $X$ collecting feature-$j$ across all $m$ samples.

Every sample is partitioned by a series of optimal hyperplanes. An optimal hyperplane
means that the training data are maximally distant from the hyperplane itself, and the
lowest classification error rate will be achieved when using this hyperplane to
classify the current training set. These hyperplanes can be modeled as

$$w_{st}^{T} X_i + b_{st} = 0 \qquad (2)$$

and the classification functions are defined as
$f_{st}(X_i) = w_{st}^{T} X_i + b_{st}$, where $X_i$ denotes the $i$-th row of
matrix $X$; $s$ and $t$ denote two arbitrary classes among the $K$ classes separated
by an optimal hyperplane; $w_{st}$ is an $n$-dimensional weight vector, and $b_{st}$
is a bias term.
SVM-RFE is recursive feature elimination based on SVM. It is an iterative procedure
that eliminates features while training an SVM classifier and, for each elimination
operation, it consists of three steps:
(1) Train the SVM classifier;
(2) Compute the ranking criteria for all features;
(3) Remove the feature with the smallest ranking score, where all ranking criteria
are relative to the decision function of the SVM.
When a linear kernel SVM is used as the classifier between two specific classes $s$
and $t$, the square of every element of the weight vector $w_{st}$ is used as a score
to evaluate the contribution of the corresponding feature. The features with the
smallest scores are eliminated.
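
One way to carry out this step with off-the-shelf tools, assuming scikit-learn (whose SVC with a linear kernel trains the k(k-1)/2 pairwise machines internally and exposes their weight vectors through coef_). Summing the squared weights over the pairwise machines is a simplification on my part; the paper scores each pair of classes with its own w_st:

```python
import numpy as np
from sklearn.svm import SVC

def pairwise_rfe_step(X, y, active):
    """One elimination step: train pairwise linear SVMs on the active features,
    score each active feature by its squared weights summed over the pairwise
    machines, and return the active set with the worst feature removed."""
    clf = SVC(kernel="linear").fit(X[:, active], y)
    # coef_ has one row per pairwise (s, t) classifier: shape (k*(k-1)/2, len(active))
    scores = (clf.coef_ ** 2).sum(axis=0)
    worst = int(np.argmin(scores))
    return [f for idx, f in enumerate(active) if idx != worst]
```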

(2005, Kai-Bo Duan) SVM-RFE peak selection for cancer classification with
mass spectrometry data


(Linear version of SVM-RFE)
SVMs have been very popular in solving classification problems. An SVM constructs an
optimal hyperplane decision function in a feature space that is mapped from the
original input space. The mapping $\Phi$ is usually nonlinear, and the feature space
is usually a much higher dimensional space than the original input space. $x_i$
denotes the $i$-th example vector in the original input space and $z_i$ denotes the
corresponding vector in the feature space, $z_i = \Phi(x_i)$.
The kernel is one of the core concepts in SVMs and plays a very important role. The
kernel function $k(x_i, x_j)$ computes the inner product of two vectors in the feature
space and thus implicitly defines the mapping function:
$k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = z_i \cdot z_j$.
The following are three types of commonly used kernel functions (a short NumPy sketch
of them follows the list):

Linear kernel: $k(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel: $k(x_i, x_j) = (1 + x_i \cdot x_j)^p$
Gaussian kernel: $k(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$

where the order $p$ of the polynomial kernel and the spread width $\sigma$ of the
Gaussian kernel are adjustable kernel function parameters.
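
These three kernels can be written directly in NumPy; a minimal sketch (the parameter names p and sigma follow the text above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                        # k(xi, xj) = xi . xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p                           # k(xi, xj) = (1 + xi . xj)^p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
```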

The recursive elimination procedure used in (Guyon, Weston et al. 2002) is as follows
(a code sketch of the loop is given after the list):
(1) Start: ranked feature list $R = [\,]$; selected subset $S = [1, \ldots, d]$;
(2) Repeat until all features are ranked:
(a) Train a linear SVM with all the training data and the variables in $S$;
(b) Compute the weight vector using Eq. (5);
(c) Compute the ranking scores for the features in $S$: $c_i = (w_i)^2$;
(d) Find the feature with the smallest ranking score: $e = \arg\min_i c_i$;
(e) Update $R$: $R = [e, R]$;
(f) Update $S$: $S = S - [e]$;
(3) Output: the ranked feature list $R$.
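
A compact Python rendering of the procedure above, assuming scikit-learn's LinearSVC for step (a); since Eq. (5) is not reproduced in these notes, the weight vector is simply taken from the fitted linear SVM (an assumption on my part):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y):
    """Recursive feature elimination as outlined above: repeatedly train a linear
    SVM on the surviving features, rank them by c_i = (w_i)^2, and prepend the
    eliminated feature to the ranked list R (so R ends up best-feature-first)."""
    S = list(range(X.shape[1]))        # selected subset, initially all d features
    R = []                             # ranked feature list
    while S:
        w = LinearSVC(dual=True).fit(X[:, S], y).coef_.ravel()
        c = w ** 2                     # ranking scores for features in S
        e = S[int(np.argmin(c))]       # feature with the smallest score
        R.insert(0, e)                 # update R: R = [e, R]
        S.remove(e)                    # update S: S = S - [e]
    return R
```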

(2001) Multiclass cancer diagnosis using tumor gene expression signatures

This study used a one-versus-all (OVA) approach. Given $m$ classes and $m$ trained
classifiers, a new sample takes the class of the classifier with the largest
real-valued output, $class = \arg\max_{i=1,\ldots,m} f_i$, where $f_i$ is the
real-valued output of the $i$-th classifier. A positive prediction strength corresponds
to a test sample being assigned to a single class rather than to the “all other” class.
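
A minimal sketch of this decision rule, assuming each of the m trained OVA classifiers exposes a real-valued output such as scikit-learn's decision_function:

```python
import numpy as np

def ova_predict(classifiers, x):
    """Assign x to the class whose one-versus-all classifier gives the largest
    real-valued output f_i (class = argmax_i f_i)."""
    f = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
    return int(np.argmax(f))
```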

(Recursive Feature Elimination) This feature selection method recursively removes
features based on the absolute magnitude of each hyperplane element (Guyon, Weston
et al. 2002). Given data with $n$ features per sample, each OVA SVM classifier outputs
a hyperplane, $w$, that can be thought of as a vector with $n$ elements, each
corresponding to the expression of a particular feature. Assuming that the expression
values of each feature have similar ranges, the absolute magnitude of each element of
$w$ determines its importance in classifying a sample, because
$f(x) = \sum_{i=1}^{n} w_i x_i + b$ and the class label is $\mathrm{sign}[f(x)]$.
Each OVA SVM classifier is first trained with all features; then the features whose
$w_i$ fall in the bottom 10% by absolute magnitude are removed, and each classifier is
retrained with the smaller feature set. This procedure is
repeated iteratively to study prediction accuracy as a function of feature number.
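
A sketch of one such iteration for a single OVA classifier, dropping the 10% of features with the smallest |w_i| and returning the reduced feature set for retraining; the use of LinearSVC and the rounding of the 10% cut are my assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def prune_bottom_10_percent(X, y_binary, active):
    """Train a one-versus-all linear SVM on the active features, then remove the
    features whose |w_i| fall in the bottom 10% of the active set; y_binary is
    the one-vs-all label vector for this particular OVA classifier."""
    w = LinearSVC(dual=True).fit(X[:, active], y_binary).coef_.ravel()
    n_drop = max(1, int(0.10 * len(active)))
    drop = set(np.argsort(np.abs(w))[:n_drop].tolist())   # positions of smallest |w_i|
    return [f for idx, f in enumerate(active) if idx not in drop]
```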
(Results) The samples are trained to recognize the distinctions among $k$ classes.
The $k$-class problem is divided into a series of $k$ one-against-all pairwise
comparisons. Each training sample is presented sequentially to these $k$ pairwise
classifiers, each of which either claims or rejects that sample as belonging to a
single class. The method results in $k$ separate one-against-all classifications per
sample, each with an associated confidence. Each training sample is assigned to the
class with the highest one-against-all classifier confidence.

(2006) Reducing multiclass cancer classification to binary by output coding and
SVM

PLS and PCA have been proven to be effective for classification.