Nonlinear Feature Selection by Relevance Feature Vector Machine

Haibin Cheng¹, Haifeng Chen², Guofei Jiang², and Kenji Yoshihira²

¹ CSE Department, Michigan State University, East Lansing, MI 48824
chenghai@msu.edu
² NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540
{haifeng,gfj,kenji}@neclabs.com
Abstract. The support vector machine (SVM) has received much attention in feature selection recently because of its ability to incorporate kernels to discover nonlinear dependencies between features. However, it is known that the number of support vectors required in SVM typically grows linearly with the size of the training data set. This limitation of SVM becomes more critical when we need to select a small subset of relevant features from a very large number of candidates. To solve this issue, this paper proposes a novel algorithm, called the “relevance feature vector machine” (RFVM), for nonlinear feature selection. The RFVM algorithm utilizes a highly sparse learning algorithm, the relevance vector machine (RVM), and incorporates kernels to extract important features with both linear and nonlinear relationships. As a result, our proposed approach can reduce many false alarms, e.g. the inclusion of irrelevant features, while still maintaining good selection performance. We compare the performance of RFVM with other state-of-the-art nonlinear feature selection algorithms in our experiments. The results confirm our conclusions.
1 Introduction
Feature selection aims to identify a small subset of features that are most relevant to the response variable. It plays an important role in many data mining applications where the number of features is huge, such as text processing of web documents and gene expression array analysis. First of all, the selection of a small feature subset significantly reduces the computational cost of model building; e.g., redundant independent variables are filtered out by feature selection to obtain a simple regression model. Secondly, the selected features usually characterize the data better and hence help us to better understand the data. For instance, in the study of genomes in bioinformatics, the best feature (gene) subset can reveal the mechanisms of different diseases [6]. Finally, by eliminating irrelevant features, feature selection can avoid the problem of the “curse of dimensionality” when the number of data examples is small relative to the high-dimensional feature space [2].

* The work was performed when the first author worked as a summer intern at NEC Laboratories America, Inc.

P. Perner (Ed.): MLDM 2007, LNAI 4571, pp. 144–159, 2007. © Springer-Verlag Berlin Heidelberg 2007
The common approach to feature selection uses greedy local heuristic search, which incrementally adds and/or deletes features to obtain a subset of relevant features with respect to the response [21]. While those methods search in the combinatorial space of feature subsets, regularization or shrinkage methods [20][18] trim the feature space by constraining the magnitude of the parameters. For example, Tibshirani [18] proposed the Lasso regression technique, which relies on the polyhedral structure of the L1-norm regularization to force a subset of parameter values to be exactly zero at the optimum. However, both the combinatorial-search-based methods and the regularization-based methods assume linear dependencies between the features and the response, and cannot handle nonlinear relationships.
Due to the sparse property of the support vector machine (SVM), recent work [3][9] reformulated the feature selection problem into an SVM-based framework by switching the roles of features and data examples. The support vectors after optimization are then regarded as the relevant features. By doing so, we can apply nonlinear kernels on feature vectors to capture the nonlinear relationships between the features and the response variable. In this paper we utilize this promising characteristic of SVM to accomplish nonlinear feature selection. However, we also notice that in the past few years the data generated in a variety of applications tend to have thousands of features. For instance, in the gene selection problem, the number of features (the gene expression coefficients corresponding to the abundance of mRNA) in the raw data ranges from 6000 to 60000 [19]. This large number of features presents a significant challenge to SVM-based feature selection because it has been shown [7] that the number of support vectors required in SVM typically grows linearly with the size of the training data set. When the number of features is large, standard SVM-based feature selection may produce many false alarms, e.g. include irrelevant features in the final results.
To effectively select relevant features from a vast number of attributes, this paper proposes to use the “Relevance Vector Machine” (RVM) for feature selection. The relevance vector machine is a Bayesian treatment of SVM with the same decision function [1]. It produces highly sparse solutions by introducing prior probability distributions to constrain the model weights, governed by a set of hyperparameters. As a consequence, the features selected by RVM are much fewer than those learned by SVM, while maintaining comparable selection performance. In this paper we incorporate a nonlinear feature kernel into the relevance vector machine to achieve nonlinear feature selection from a large number of features. Experimental results show that our proposed algorithm, which we call the “Relevance Feature Vector Machine” (RFVM), can discover nonlinear relevant features with a good detection rate but a low rate of false alarms. Furthermore, compared with the SVM-based feature selection methods [3][9], our proposed RFVM algorithm offers other compelling benefits. For instance, the parameters in RFVM are automatically learned by maximum likelihood estimation rather than the time-consuming cross-validation procedure required by the SVM-based methods.
The rest of the paper is organized as follows. In Section 2, we summarize the related work on nonlinear feature selection using SVM. In Section 3, we extend the relevance vector machine for the task of nonlinear feature selection. The experimental results and conclusions are presented in Section 4 and Section 5, respectively.
2 Preliminaries
Given a data set D = {X_{n×m}, y_{n×1}}, where X_{n×m} represents the n input examples with m features and y_{n×1} represents the responses, we first describe the definitions of feature space and example space with respect to the data. In the feature space, each dimension is related to one specific feature, and the data set is regarded as a group of data examples D = [(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)]^T, where the x_i are the rows of X, i.e. X = [x_1^T, x_2^T, ..., x_n^T]^T. Sparse methods such as SVM in the feature space try to learn a sparse example weight vector α = [α_1, α_2, ..., α_n] associated with the n data examples. The examples with nonzero values α_i are regarded as support vectors, illustrated as solid circles in Figure 1(a).
Alternatively, each dimension in the example space is related to one data sample x_i, and the data is denoted as a collection of features X = [f_1, f_2, ..., f_m] and a response y. The sparse solution in the example space is then related to a weight vector w = [w_1, w_2, ..., w_m]^T associated with the m features. Only those features with nonzero elements in w are regarded as relevant ones, or “support features”. If we use SVM to obtain the sparse solution, those relevant features
Fig. 1. (a) The feature space, where each dimension is related to one feature (f) in the data. SVM learns the sparse solution (denoted as black points) of the weight vector α associated with the data examples x_i. (b) The example space, in which each dimension is a data example x_i. The sparse solution (denoted as black points) of the weight vector w is associated with the related features (f).
are derived from the support features, as shown in Figure 1(b). In this section, we first describe feature selection in the SVM framework; then we present nonlinear feature selection solutions.
2.1 Feature Selection by SVM
The Support Vector Machine [13] is a very popular machine learning technique for the tasks of classification and regression. Standard SVM regression [14] aims to find a predictive function f(x) = ⟨x, w⟩ + b that has at most ε deviation from the actual values y and is as flat as possible, where w is the feature weight vector described before and b is the offset of the function f. If the solution is further relaxed by allowing a certain degree of error, the optimization problem of SVM regression can be formulated as
$$\min \; \frac{1}{2}\|w\|^2 + C\,\mathbf{1}^T(\xi^+ + \xi^-) \tag{1}$$

$$\text{sub.}\quad \begin{cases} y - \langle X, w\rangle - b\mathbf{1} \le \varepsilon\mathbf{1} + \xi^+ \\ \langle X, w\rangle + b\mathbf{1} - y \le \varepsilon\mathbf{1} + \xi^- \\ \xi^+,\ \xi^- \ge 0 \end{cases}$$
where ξ⁺ and ξ⁻ represent the errors, C measures the trade-off between error relaxation and flatness of the function, and 1 denotes the vector whose elements are all 1s. Instead of solving this optimization problem directly, it is usually much easier to solve its dual form [14] by the SMO algorithm. The dual problem of the SVM regression can be derived from Lagrange optimization with KKT conditions and Lagrange multipliers α⁺, α⁻:
$$\min \; \frac{1}{2}(\alpha^+ - \alpha^-)^T\langle X, X^T\rangle(\alpha^+ - \alpha^-) - y^T(\alpha^+ - \alpha^-) + \varepsilon\mathbf{1}^T(\alpha^+ + \alpha^-) \tag{2}$$

$$\text{sub.}\quad \mathbf{1}^T(\alpha^+ - \alpha^-) = 0,\quad 0 \le \alpha^+ \le C\mathbf{1},\quad 0 \le \alpha^- \le C\mathbf{1}$$
The dual form also provides an easy way to model nonlinear dependencies by incorporating nonlinear kernels. That is, a kernel function K(x_i, x_j) defined over the examples x_i, x_j is used to replace the dot product ⟨x_i, x_j⟩ in equation (2). The term ε1^T(α⁺ + α⁻) in (2) works as the shrinkage factor and leads to the sparse solution of the example weight vector α = (α⁺ − α⁻), which is associated with the data examples in the feature space.
While the SVM algorithm is frequently used in the feature space to achieve a sparse solution α for classification and regression tasks, the paper [3] employed SVM in the example space to learn a sparse solution of the feature weight vector w for the purpose of feature selection, by switching the roles of features and data examples. After data normalization such that X^T 1 = 0 and thus X^T b1 = 0, the SVM-based feature selection described in [3] can be formulated as the following optimization problem.
$$\min \; \frac{1}{2}\|Xw\|^2 + C\,\mathbf{1}^T(\xi^+ + \xi^-) \tag{3}$$

$$\text{sub.}\quad \begin{cases} \langle X^T, y\rangle - \langle X^T, X\rangle w \le \varepsilon\mathbf{1} + \xi^+ \\ \langle X^T, X\rangle w - \langle X^T, y\rangle \le \varepsilon\mathbf{1} + \xi^- \\ \xi^+,\ \xi^- \ge 0 \end{cases}$$
The above formulation (3) makes it easy to model nonlinear dependencies between the features and the response, which has also been explored in [9]. Similarly, the dual problem of (3) can be obtained with Lagrange multipliers w⁺, w⁻ and KKT conditions:
$$\min \; \frac{1}{2}(w^+ - w^-)^T\langle X^T, X\rangle(w^+ - w^-) - \langle y^T, X\rangle(w^+ - w^-) + \varepsilon\mathbf{1}^T(w^+ + w^-) \tag{4}$$

$$\text{sub.}\quad 0 \le w^+ \le C\mathbf{1},\quad 0 \le w^- \le C\mathbf{1}$$
The intuition behind the dual optimization problem (4) is clear: it tries to minimize the mutual feature correlation ⟨X^T, X⟩ and to maximize the response–feature correlation ⟨y^T, X⟩. The parameter C in (4) controls the redundancy of the selected features; a small value of C reduces the importance of the mutual feature correlation ⟨X^T, X⟩ and thus allows more redundancy. The term ε1^T(w⁺ + w⁻) in the dual form (4) achieves the sparseness of the feature weight vector w = (w⁺ − w⁻). After optimization, the nonzero elements in w correspond to the relevant features in the example space. For a detailed explanation of the derivation of (3) and (4), please see [3].
2.2 Nonlinear Feature Selection
If we set ε = λ/2 and ignore the error relaxation in the primal problem (3), the optimization form (3) can be rewritten in the example space using the features X = [f_1, f_2, ..., f_m] and the response y:

$$\min \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} w_i w_j \langle f_i, f_j\rangle \tag{5}$$

$$\text{sub.}\quad \Big|\sum_{i=1}^{m} w_i\langle f_j, f_i\rangle - \langle f_j, y\rangle\Big| \le \frac{\lambda}{2},\quad \forall j$$
The optimization problem in (5) has been proved in [9] to be equivalent to the Lasso regression (6) [18], which has been widely used for linear feature selection:

$$\min \; \|Xw - y\|^2 + \lambda\|w\|_1. \tag{6}$$
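To make the shrinkage effect of (6) concrete, here is a minimal coordinate-descent sketch of the Lasso (our own illustration, not the solver used in [18]); the soft-thresholding step is what drives a subset of the weights exactly to zero:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||Xw - y||^2 + lam * ||w||_1 by cyclic coordinate descent.

    A sketch of the Lasso in equation (6): each coordinate update applies
    the soft-thresholding operator, so weakly correlated features are
    driven exactly to zero -- the L1 feature-selection effect.
    """
    n, m = X.shape
    w = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)              # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(m):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # soft-thresholding at lam/2 (objective is not halved here)
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return w
```

Features whose correlation with the residual never exceeds λ/2 receive weight exactly zero, which is the polyhedral L1 effect mentioned above.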
While the Lasso regression (6) is performed in the feature space of the data set to achieve feature selection, the optimization (5) formulates the feature selection problem in the example space. As a consequence, we can define nonlinear kernels over the feature vectors to model nonlinear interactions between features. For feature vectors f_i and f_j with nonlinear dependency, we assume that they can be projected to a high-dimensional space by a mapping function φ so that they interact linearly in the mapped space. Therefore the nonlinear dependency can be represented by introducing the feature kernel K(f_i, f_j) = φ(f_i)^T φ(f_j). If we replace the dot product ⟨·, ·⟩ in (5) with the feature kernel K, we obtain its nonlinear version:
$$\min \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} w_i w_j K(f_i, f_j) \tag{7}$$

$$\text{sub.}\quad \Big|\sum_{i=1}^{m} w_i K(f_j, f_i) - K(f_j, y)\Big| \le \frac{\lambda}{2},\quad \forall j.$$
In the same way, we can incorporate nonlinear feature kernels into the general expression (4) and obtain
$$\min \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(w_i^+ - w_i^-)\,K(f_i, f_j)\,(w_j^+ - w_j^-) - \sum_{i=1}^{m} K(y, f_i)(w_i^+ - w_i^-) + \varepsilon\sum_{i=1}^{m}(w_i^+ + w_i^-) \tag{8}$$

$$\text{sub.}\quad 0 \le w_i^+ \le C,\quad 0 \le w_i^- \le C,\quad \forall i$$
Both (7) and (8) can be used for nonlinear feature selection. However, they are both derived from the SVM framework and share the same weakness as the standard SVM algorithm: the number of support features grows linearly with the size of the feature set in the training data. As a result, the solution provided in the example space is not sparse enough, which leads to a serious problem of a high false alarm rate, e.g. including many irrelevant features, when the feature set is large. To solve this issue, this paper proposes an RVM-based solution for nonlinear feature selection, which is called the “Relevance Feature Vector Machine”. RFVM achieves a sparser solution in the example space by introducing priors over the feature weights. As a result, RFVM is able to select the most relevant features as well as decrease the number of false alarms significantly. Furthermore, we will also show that RFVM learns the hyperparameters automatically and hence avoids the effort of cross-validation to determine the trade-off parameter C in the SVM optimization (8).
3 Relevance Feature Vector Machine
In this section, we investigate the problem of using the Relevance Vector Machine for nonlinear feature selection. We first introduce the Bayesian framework of the standard Relevance Vector Machine algorithm [1]. Then we present our Relevance Feature Vector Machine algorithm, which utilizes RVM in the example space and exploits a mutual information kernel for nonlinear feature selection.
3.1 Relevance Vector Machine
The standard RVM [1] learns the vector α̃_{(n+1)×1} = [α_0, α], with α_0 = b denoting the “offset” and α = [α_1, α_2, ..., α_n] the “relevance weight vector” associated with the data examples in the feature space. It assumes that the response y_i is sampled from the model f(x_i) with noise ε, and the model function is expressed as
$$f(x) = \sum_{j=1}^{n} \alpha_j\langle x, x_j\rangle + \alpha_0 + \varepsilon \tag{9}$$
where ε is assumed to be sampled independently from a Gaussian noise distribution with mean zero and variance σ². If we use a kernel to model the dependencies between the examples in the feature space, we obtain the n × (n+1) ‘design’ matrix Φ:
$$\Phi = \begin{bmatrix} 1 & K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n)\\ 1 & K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n)\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{bmatrix}$$
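As an illustration, the design matrix Φ above can be assembled directly from any example kernel; the Gaussian kernel below is our own illustrative choice, not one prescribed by the paper:

```python
import numpy as np

def design_matrix(X, kernel):
    """Build the n x (n+1) RVM 'design' matrix: a leading column of ones
    for the offset alpha_0, then the kernel values K(x_i, x_j)."""
    n = X.shape[0]
    Phi = np.ones((n, n + 1))
    for i in range(n):
        for j in range(n):
            Phi[i, j + 1] = kernel(X[i], X[j])
    return Phi

# illustrative Gaussian (RBF) kernel with unit width -- an assumption,
# since the paper does not fix the example kernel here
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
```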
In order to estimate the coefficients α_0, ..., α_n in equation (9) from a set of training data, the likelihood of the given data set is written as

$$p(y\mid\tilde{\alpha}, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\|y - \Phi\tilde{\alpha}\|^2\Big) \tag{10}$$
In addition, RVM defines prior probability distributions on the parameters α̃ in order to obtain sparse solutions. The prior distribution is expressed with n+1 hyperparameters β̃_{(n+1)×1} = [β_0, β_1, ..., β_n]:

$$p(\tilde{\alpha}\mid\tilde{\beta}) = \prod_{i=0}^{n}\mathcal{N}(\alpha_i\mid 0, \beta_i^{-1}) \tag{11}$$
The unknowns α̃, β̃ and σ² can be estimated by maximizing the posterior distribution p(α̃, β̃, σ² | y), which can be decomposed as:

$$p(\tilde{\alpha}, \tilde{\beta}, \sigma^2\mid y) = p(\tilde{\alpha}\mid y, \tilde{\beta}, \sigma^2)\, p(\tilde{\beta}, \sigma^2\mid y). \tag{12}$$
Such a decomposition allows us to use two steps to find the solution α̃ together with the hyperparameters β̃ and σ². For details of the optimization procedure, please see [1]. Compared with SVM, RVM produces a sparser solution α̃ and determines the hyperparameters simultaneously.

To the best of our knowledge, the current RVM algorithm is always performed in the feature space, in which the relevance weight vector α is associated with data examples. This paper is the first to utilize these promising characteristics of RVM for feature selection. In the next section, we reformulate the Relevance Vector Machine in the example space and incorporate nonlinear feature kernels to learn nonlinear “relevant features”.
3.2 Nonlinear Feature Selection with Relevance Feature Vector Machine
This section presents the relevance feature vector machine (RFVM) algorithm, which utilizes RVM in the example space to select relevant features. We will also show how the kernel trick can be applied to accomplish nonlinear feature selection. Again, we assume the data (X, y) is standardized. We start by rewriting the function (9) into an equivalent form by incorporating the feature weight vector w:

$$y = \sum_{j=1}^{m} w_j f_j + \varepsilon \tag{13}$$
The above formula assumes a linear dependency between the features and the response. When the relationship is nonlinear, we project the features and responses into a high-dimensional space by a function φ so that the dependency in the mapped space becomes linear:

$$\phi(y) = \sum_{j=1}^{m} w_j \phi(f_j) + \varepsilon. \tag{14}$$
Accordingly, the likelihood function given the training data can be expressed as

$$p(\phi(y)\mid w, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big(-\frac{\|\phi(y) - \phi(X)w\|^2}{2\sigma^2}\Big) \tag{15}$$
where φ(X) = [φ(f_1), φ(f_2), ..., φ(f_m)]. We expand the squared error term in the above likelihood function and replace the dot products with a feature kernel K to model the nonlinear interaction between the feature vectors and the response, which results in

$$\|\phi(y) - \phi(X)w\|^2 = (\phi(y) - \phi(X)w)^T(\phi(y) - \phi(X)w)$$
$$= \phi(y)^T\phi(y) - 2w^T\phi(X)^T\phi(y) + w^T\phi(X)^T\phi(X)w$$
$$= K(y^T, y) - 2w^T K(X^T, y) + w^T K(X^T, X)w$$

where:

$$K(X^T, y) = \begin{bmatrix} K(y, f_1)\\ K(y, f_2)\\ \vdots\\ K(y, f_m) \end{bmatrix}
\quad\text{and}\quad
K(X^T, X) = \begin{bmatrix} K(f_1, f_1) & K(f_1, f_2) & \cdots & K(f_1, f_m)\\ K(f_2, f_1) & K(f_2, f_2) & \cdots & K(f_2, f_m)\\ \vdots & & & \vdots\\ K(f_m, f_1) & K(f_m, f_2) & \cdots & K(f_m, f_m) \end{bmatrix}$$
After some manipulations, the likelihood function (15) can be reformulated as

$$p(\phi(y)\mid w, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big(-\big(K(y^T, y) - 2w^T K(X^T, y) + w^T K(X^T, X)w\big)/(2\sigma^2)\Big) \tag{16}$$
Note that RFVM differs from the traditional RVM in that the prior β = [β_1, β_2, ..., β_m] is defined over the relevance feature vector weight w:

$$p(w\mid\beta) = \prod_{i=1}^{m}\mathcal{N}(w_i\mid 0, \beta_i^{-1}) \tag{17}$$
The sparse solution w corresponding to the relevant features can be obtained by maximizing

$$p(w, \beta, \sigma^2\mid\phi(y)) = p(w\mid\phi(y), \beta, \sigma^2)\,p(\beta, \sigma^2\mid\phi(y)) \tag{18}$$
Similar to RVM, we use two steps to find the maximizing solution. The first step is to maximize

$$p(w\mid\phi(y), \beta, \sigma^2) = \frac{p(\phi(y)\mid w, \sigma^2)\,p(w\mid\beta)}{p(\phi(y)\mid\beta, \sigma^2)} = (2\pi)^{-\frac{n+1}{2}}|\Sigma|^{-\frac{1}{2}}\exp\Big(-\frac{1}{2}(w - \mu)^T\Sigma^{-1}(w - \mu)\Big) \tag{19}$$
Given the current estimates of β and σ², the covariance Σ and mean μ of the feature weight vector w are

$$\Sigma = (\sigma^{-2}K(X^T, X) + B)^{-1} \tag{20}$$

$$\mu = \sigma^{-2}\,\Sigma\, K(X^T, y) \tag{21}$$

where B = diag(β_1, ..., β_m).
Once we obtain the current estimate of w, the second step is to learn the hyperparameters β and σ² by maximizing p(β, σ² | φ(y)) ∝ p(φ(y) | β, σ²) p(β) p(σ²). Since we assume the hyperparameters are uniformly distributed, i.e. p(β) and p(σ²) are constant, this is equivalent to maximizing the marginal likelihood p(φ(y) | β, σ²), which is computed by:
$$p(\phi(y)\mid\beta, \sigma^2) = \int p(\phi(y)\mid w, \sigma^2)\,p(w\mid\beta)\,dw = (2\pi)^{-\frac{n}{2}}\big|\sigma^2 I + \phi(X)B^{-1}\phi(X)^T\big|^{-\frac{1}{2}}\exp\Big(-\frac{1}{2}\,\phi(y)^T\big(\sigma^2 I + \phi(X)B^{-1}\phi(X)^T\big)^{-1}\phi(y)\Big) \tag{22}$$
By differentiating equation (22), we can update the hyperparameters β and σ² by:

$$\beta_i^{new} = \frac{1 - \beta_i N_{ii}}{\mu_i^2} \tag{23}$$

$$(\sigma^2)^{new} = \frac{\|\phi(y) - \phi(X)\mu\|^2}{n - \sum_i (1 - \beta_i N_{ii})} \tag{24}$$

where N_{ii} is the i-th diagonal element of the covariance from equation (20) and μ is computed from equation (21) with the current β and σ² values. The final optimal set of w, β and σ² is then learned by iterating between the first step, which updates the covariance Σ (20) and mean μ (21) of the feature weight vector w, and the second step, which updates the hyperparameters β (23) and σ² (24).
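The two-step procedure of equations (20), (21), (23) and (24) can be sketched as a small fixed-point loop over precomputed kernel matrices. This is our own minimal illustration: `Kxx`, `Kxy` and `Kyy` stand for K(X^T, X), K(X^T, y) and K(y^T, y), the squared error in (24) is evaluated through the kernel expansion of ‖φ(y) − φ(X)μ‖², and the pruning and convergence checks of a full implementation are omitted (the clipping of γ and β is a numerical safeguard of ours, not part of the paper):

```python
import numpy as np

def rfvm_two_step(Kxx, Kxy, Kyy, n, beta0=10.0, sigma2_0=1.0, n_iter=50):
    """Iterate eqs. (20)-(21) (posterior of the feature weights w) and
    eqs. (23)-(24) (hyperparameter re-estimation) on precomputed kernels."""
    m = Kxx.shape[0]
    beta = np.full(m, beta0)
    sigma2 = sigma2_0
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Kxx / sigma2 + np.diag(beta))       # eq. (20)
        mu = Sigma @ Kxy / sigma2                                 # eq. (21)
        gamma = np.clip(1.0 - beta * np.diag(Sigma), 0.0, 1.0)    # 1 - beta_i N_ii
        beta = np.minimum(gamma / (mu ** 2 + 1e-12), 1e10)        # eq. (23)
        # ||phi(y) - phi(X) mu||^2 expanded through the kernels:
        err = Kyy - 2.0 * mu @ Kxy + mu @ Kxx @ mu
        sigma2 = max(err, 1e-12) / max(n - gamma.sum(), 1e-12)    # eq. (24)
    return mu, beta, sigma2
```

With a linear kernel (plain dot products) the loop reduces to sparse Bayesian linear regression over the features, which gives a quick sanity check of the updates.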
RFVM learns a sparse feature weight vector w in which most of the elements are zeros. The zero elements in w indicate that the corresponding features are irrelevant and should be filtered out; conversely, large elements of w indicate high importance of the related features. In this paper we use mutual information as the kernel function K(·, ·), which is introduced in the following section. In that case, K(X^T, y) measures the relevance between the response y and the features in the data matrix X, and K(X^T, X) indicates the redundancy among the features in X. The likelihood maximization procedure of RFVM thus tends to maximize the relevance between the features and the response while minimizing the mutual redundancy within the features.
3.3 Mutual Information Feature Kernel
While kernels are usually defined over data examples in the feature space, the RFVM algorithm places the nonlinear kernel over the feature and response vectors for the purpose of feature selection. The mutual information [16] of two variables measures how much the uncertainty about one variable can be reduced given knowledge of the other. This property can be used as a metric to measure the relevance between features. Given two discrete variables U and V with observations denoted as u and v respectively, the mutual information I between them is formulated as

$$I(U, V) = \sum_{u\in U}\sum_{v\in V} p(u, v)\log_2\frac{p(u, v)}{p(u)\,p(v)} \tag{25}$$
where p(u, v) is the joint probability density function of U and V, and p(u) and p(v) are the marginal probability density functions of U and V, respectively.
Now given two feature vectors f_u and f_v, we calculate their mutual information kernel K(f_u, f_v) as follows. We regard all the elements in the vector f_u (or f_v) as multiple observations of a variable f_u (or f_v), and discretize those observations into bins for each variable. That is, we sort the values in the feature vectors f_u and f_v separately and discretize each vector into N bins, with the same interval for each bin. For example, if the maximal value of f_u is u_max and the minimal value is u_min, the interval for each bin of feature vector f_u is (u_max − u_min)/N. Now for each value u in feature vector f_u and v in feature vector f_v, we assign u = i and v = j if u falls into the i-th bin and v falls into the j-th bin of their discretized regions, respectively. The probability density functions p(f_u, f_v), p(f_u) and p(f_v) are calculated as the ratio of the number of elements within the corresponding bin to the length of the vector, n. As a result, we have

p(u = i) = counts(u = i)/n
p(v = j) = counts(v = j)/n
p(u = i, v = j) = counts(u = i and v = j)/n
and

$$K(f_u, f_v) = \sum_{i=1}^{N}\sum_{j=1}^{N} p(u = i, v = j)\log_2\frac{p(u = i, v = j)}{p(u = i)\,p(v = j)} \tag{26}$$
The mutual information kernel is symmetric and nonnegative, with K(f_u, f_v) = K(f_v, f_u) and K(f_u, f_v) ≥ 0. It also satisfies Mercer’s condition [13], which guarantees the convergence of the proposed RFVM algorithm. In this paper we set the number of bins for discretization to log₂(m), where m is the number of features.
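The binning scheme above can be sketched in a few lines (equal-width bins, plug-in probability estimates by counting; `n_bins` corresponds to the log₂(m) choice in the text):

```python
import numpy as np

def mi_kernel(fu, fv, n_bins):
    """Mutual information kernel K(f_u, f_v) of eq. (26): discretize each
    vector into equal-width bins, estimate the joint and marginal
    probabilities by counting, then sum p * log2(p / (p_u * p_v))."""
    def discretize(f):
        lo, hi = f.min(), f.max()
        width = (hi - lo) / n_bins
        # floor into bin indices; the maximum value goes into the last bin
        return np.clip(np.floor((f - lo) / width).astype(int), 0, n_bins - 1)
    u, v = discretize(np.asarray(fu)), discretize(np.asarray(fv))
    mi = 0.0
    for i in range(n_bins):
        for j in range(n_bins):
            p_uv = np.mean((u == i) & (v == j))
            if p_uv > 0.0:                     # empty cells contribute zero
                p_u, p_v = np.mean(u == i), np.mean(v == j)
                mi += p_uv * np.log2(p_uv / (p_u * p_v))
    return mi
```

For two copies of the same vector the kernel returns the entropy of the binned variable, and for independent vectors it is close to zero, matching the relevance/redundancy interpretation given above.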
4 Experimental Results
Experimental results are presented in this section to demonstrate the effectiveness of our proposed RFVM algorithm. We compare RFVM with two other state-of-the-art nonlinear feature selection algorithms, from [3] and [9]; for convenience, we refer to the algorithm proposed in [3] as P-SVM and to that in [9] as FVM. All experiments are conducted on a Pentium 4 machine with a 3 GHz CPU and 1 GB of RAM.
4.1 Nonlinear Feature Selection by RFVM
In order to verify that the proposed RFVM is able to capture the nonlinear dependency between the response and the feature vectors, we simulated 1000 data examples with 99 features and one response. The response y is generated as the summation of three base functions of f_1, f_2, f_3 respectively, together with Gaussian noise ε distributed as N(0, 0.005):

$$y = f(f_1, f_2, f_3, \ldots, f_{99}) = 9f_1 + 20(1 - f_2)^3 + 17\sin(80 f_3 - 7) + \varepsilon$$
The three base functions are shown in Figure 2(b), (c) and (d); the first is linear and the other two are nonlinear. Figure 2(a) also plots the distribution of y with respect to the two nonlinear features f_2 and f_3. The values of the features f_1, f_2, f_3 are generated by a uniform distribution in [0, 1]. The other 96 features f_4, f_5, ..., f_99 are generated uniformly in [0, 20] and are independent of the response y.
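The simulated data set just described can be reproduced in a few lines (a sketch; the random seed, and the reading of N(0, 0.005) as a variance, are our assumptions):

```python
import numpy as np

def simulate_section41(n=1000, m=99, seed=0):
    """Generate the Section 4.1 data: f1, f2, f3 ~ U[0, 1] drive the
    response, f4..f99 ~ U[0, 20] are irrelevant.  N(0, 0.005) is read
    as variance 0.005 (an assumption; the paper does not spell it out)."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, m))
    X[:, :3] = rng.uniform(0.0, 1.0, size=(n, 3))
    X[:, 3:] = rng.uniform(0.0, 20.0, size=(n, m - 3))
    eps = rng.normal(0.0, np.sqrt(0.005), size=n)
    y = 9*X[:, 0] + 20*(1 - X[:, 1])**3 + 17*np.sin(80*X[:, 2] - 7) + eps
    return X, y
```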
We modified the MATLAB code provided by Mike Tipping [17] to implement RFVM for the task of feature selection. The RFVM is solved by iteratively updating the posterior covariance Σ in equation (20) and the mean μ in equation (21), along with the hyperparameters β in equation (23) and σ² in equation (24), using the two-step procedure. The nonlinear dependencies between the response and the features are modeled by the mutual information kernel in RFVM. That is, we replace the dot products of the features and response, ⟨X^T, y⟩ and ⟨X^T, X⟩, by the precomputed mutual information kernels K(X^T, y) and K(X^T, X). The optimal set of the feature weight vector w, along with the hyperparameters β and σ², is automatically learned by the two-step updating procedure. The initial values of the hyperparameters are set as β = 10 and σ² = std(y)/10.

Fig. 2. (a) The distribution of the response y with respect to the two nonlinear features f_2 and f_3. The bottom three figures illustrate the three components of the simulated function f: (b) linear, (c) cubic and (d) sine.
Figures 3(a) and (b) plot the values of the feature weights computed by the linear RVM and the nonlinear RFVM over the simulated data. From Figure 3(a), we see that the linear RVM can detect the linearly dependent feature f_1, as well as the feature f_2, which has a cubic relationship; the reason f_2 is also detected by the linear RVM is that the cubic curve can be approximated by a straight line to a certain degree, as shown in Figure 2(c). However, RVM misses the feature f_3 completely, which is a highly nonlinear feature with a periodic sine wave. On the other hand, the nonlinear RFVM detects all three features successfully, as shown in Figure 3(b). Furthermore, the detected feature set is quite sparse compared with the results of the linear RVM.
4.2 Performance Comparison
This section compares the performance of the RFVM algorithm with the other nonlinear feature selection algorithms, FVM [9] and P-SVM [3]. To demonstrate that RFVM is able to select the most relevant features with a much lower false alarm rate, we simulate another data set with 2000 data examples and 100 features. The first 20 features are simulated uniformly in [−0.5, 0.5] and the rest are generated uniformly in [0, 20] with Gaussian noise. The response y is the summation of functions F_i(·) on the first 20 features:

$$y = \sum_{i=1}^{20} F_i(f_i). \tag{27}$$

Fig. 3. (a) The histogram of feature weights from the linear RVM. It detects f_1 and f_2 but misses the highly nonlinear relevant feature f_3. (b) The histogram of feature weights from the nonlinear RFVM. It detects all three features.
(27)
The basis function
F
i
(
·
) is randomly chosen from the pool of eight candidate
functions
F
i
(
f
i
)
∈ {
F
1
(
f
i
)
,F
2
(
f
i
)
,
· · ·
,F
8
(
f
i
)
}
(28)
where the expressions of those candidate functions are given in Table 1. As can be seen, our simulation covers almost all common kinds of nonlinear relationships.
Table 1. The 8 basis functions

  j          1             2               3            4
  F_j(f_i)   40 f_i        20(1 − f_i²)    23 f_i³      20 sin(40 f_i − 5)

  j          5             6               7            8
  F_j(f_i)   20 e^{f_i}    −log₂(|f_i|)    20 √(1 − f_i)   20 cos(20 f_i − 7)
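For completeness, the benchmark of (27)–(28) and Table 1 can be generated as follows (a sketch; the paper's exact random choices, and the Gaussian noise it adds to the 80 irrelevant features, are not reproduced here):

```python
import numpy as np

# Candidate basis functions of Table 1 (sqrt(1 - f) and log2|f| remain
# real-valued because the relevant features are drawn from [-0.5, 0.5]).
BASIS = [
    lambda f: 40 * f,
    lambda f: 20 * (1 - f ** 2),
    lambda f: 23 * f ** 3,
    lambda f: 20 * np.sin(40 * f - 5),
    lambda f: 20 * np.exp(f),
    lambda f: -np.log2(np.abs(f)),
    lambda f: 20 * np.sqrt(1 - f),
    lambda f: 20 * np.cos(20 * f - 7),
]

def simulate_section42(n=2000, m=100, seed=0):
    """Section 4.2 benchmark: the first 20 features ~ U[-0.5, 0.5] drive y
    through basis functions drawn from the pool (28); the rest ~ U[0, 20]."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, m))
    X[:, :20] = rng.uniform(-0.5, 0.5, size=(n, 20))
    X[:, 20:] = rng.uniform(0.0, 20.0, size=(n, m - 20))
    choice = rng.integers(0, 8, size=20)   # random F_i per relevant feature
    y = sum(BASIS[choice[i]](X[:, i]) for i in range(20))
    return X, y
```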
We divide the data into two parts: the first 1000 examples are used as training data to determine the parameter λ in FVM and ε, C in P-SVM by 10-fold cross-validation; the remaining 1000 examples are used for testing. The performances of the algorithms are compared in terms of detection rate and false alarm rate. We run 20 rounds of such simulations and present the results in Figure 4. Figure 4(a) plots the detection rate of RFVM together with those of FVM and P-SVM. It shows that RFVM maintains a detection rate comparable to the other two algorithms. Note that since the nonlinear relationships (27) generated in our simulation are very strong, the range of detection rates for those algorithms is reasonable. Figure 4(b) plots the false alarm rates of the FVM, P-SVM and RFVM algorithms. It demonstrates that RFVM generally has a lower false alarm rate, which is due to the sparseness of RFVM in the example space compared with FVM and P-SVM. Also note that in the experiment we do not need to predetermine any parameters in RFVM, since the parameters are automatically learned by the two-step maximum likelihood method, while FVM and P-SVM are both sensitive to their parameters and need the extra effort of cross-validation to determine those values.

Fig. 4. (a) The detection rates of FVM, P-SVM and RFVM. (b) The false alarm rates of FVM, P-SVM and RFVM.
5 Conclusions
This paper has proposed a new method, the “Relevance Feature Vector Machine” (RFVM), to detect features with nonlinear dependency. Compared with other state-of-the-art nonlinear feature selection algorithms, RFVM has two unique advantages based on our theoretical analysis and experimental results. First, by utilizing the highly sparse nature of RVM, the RFVM algorithm reduces the false alarms in feature selection significantly while still maintaining a desirable detection rate. Furthermore, unlike other SVM-based nonlinear feature selection algorithms, whose performance is sensitive to the selection of parameter values, RFVM learns the hyperparameters automatically by maximizing the “marginal likelihood” in the second step of the two-step updating procedure. In the future, we will apply RFVM to some real applications to further demonstrate the advantages of our algorithm.
References
1. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244 (2001)
2. Bellman, R.E.: Adaptive Control Processes. Princeton University Press, Princeton, NJ (1961)
3. Hochreiter, S., Obermayer, K.: Nonlinear feature selection with the potential support vector machine. In: Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) Feature Extraction, Foundations and Applications. Springer, Berlin (2005)
4. Figueiredo, M., Jain, A.K.: Bayesian learning of sparse classifiers. In: Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 35–41 (2001)
5. Figueiredo, M.A.T.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1150–1159 (2003)
6. Bø, T.H., Jonassen, I.: New feature subset selection procedures for classification of expression profiles. Genome Biology 3, research0017.1–0017.11 (2000)
7. Burges, C.: Simplified support vector decision rules. In: Proc. of the Thirteenth International Conf. on Machine Learning, pp. 71–77. Morgan Kaufmann, Seattle (1996)
8. Aizerman, M.A., Braverman, E.M., Rozonoer, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
9. Li, F., Yang, Y., Xing, E.P.: From Lasso regression to feature vector machine. In: Advances in Neural Information Processing Systems 18 (2005)
10. Faul, A., Tipping, M.E.: Analysis of sparse Bayesian learning. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, pp. 383–389. MIT Press, Cambridge, MA (2002)
11. Faul, A., Tipping, M.: A variational approach to robust regression. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) Artificial Neural Networks, pp. 95–202 (2001)
12. Roth, V.: The generalized LASSO. IEEE Transactions on Neural Networks 15(1) (2004)
13. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
14. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, London (1998)
15. Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1226–1238 (2005)
16. Guiasu, S.: Information Theory with Applications. McGraw-Hill, New York (1977)
17. Tipping, M.E.: Microsoft Corporation, http://research.microsoft.com/MLP/RVM/
18. Tibshirani, R.: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1), 267–288 (1996)
19. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
20. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
21. Reeves, S.J., Zhao, Z.: Sequential algorithms for observation selection. IEEE Transactions on Signal Processing 47, 123–132 (1999)