Nonlinear Feature Selection by Relevance Feature Vector Machine

Haibin Cheng¹, Haifeng Chen², Guofei Jiang², and Kenji Yoshihira²

¹ CSE Department, Michigan State University, East Lansing, MI 48824
chenghai@msu.edu
² NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540
{haifeng,gfj,kenji}@nec-labs.com
Abstract. Support vector machine (SVM) has received much attention in feature selection recently because of its ability to incorporate kernels to discover nonlinear dependencies between features. However, it is known that the number of support vectors required by SVM typically grows linearly with the size of the training data set. This limitation of SVM becomes more critical when we need to select a small subset of relevant features from a very large number of candidates. To solve this issue, this paper proposes a novel algorithm for nonlinear feature selection, called the "relevance feature vector machine" (RFVM). The RFVM algorithm utilizes a highly sparse learning algorithm, the relevance vector machine (RVM), and incorporates kernels to extract important features with both linear and nonlinear relationships. As a result, our proposed approach reduces many false alarms, i.e., the inclusion of irrelevant features, while still maintaining good selection performance. In our experiments we compare RFVM with other state-of-the-art nonlinear feature selection algorithms; the results confirm our conclusions.
1 Introduction
Feature selection aims to identify a small subset of features that are most relevant to the response variable. It plays an important role in many data mining applications where the number of features is huge, such as text processing of web documents and gene expression array analysis. First of all, the selection of a small feature subset significantly reduces the computation cost of model building; for example, redundant independent variables are filtered out by feature selection to obtain a simple regression model. Secondly, the selected features usually characterize the data better and hence help us to better understand the data. For instance, in the study of the genome in bioinformatics, the best feature (gene) subset can reveal the mechanisms of different diseases [6]. Finally, by eliminating the irrelevant features, feature selection can avoid the problem of the "curse of dimensionality" in the case when the number of data examples is small relative to the high-dimensional feature space [2].

* The work was performed while the first author was a summer intern at NEC Laboratories America, Inc.
The common approach to feature selection uses greedy local heuristic search, which incrementally adds and/or deletes features to obtain a subset of relevant features with respect to the response [21]. While those methods search the combinatorial space of feature subsets, regularization or shrinkage methods [20][18] trim the feature space by constraining the magnitude of the parameters. For example, Tibshirani [18] proposed the Lasso regression technique, which relies on the polyhedral structure of the L1 norm regularization to force a subset of parameter values to be exactly zero at the optimum. However, both the combinatorial-search-based methods and the regularization-based methods assume linear dependencies between the features and the response, and cannot handle their nonlinear relationships.
Due to the sparsity property of the support vector machine (SVM), recent work [3][9] reformulated the feature selection problem into an SVM-based framework by switching the roles of features and data examples. The support vectors after optimization are then regarded as the relevant features. By doing so, we can apply nonlinear kernels to feature vectors to capture the nonlinear relationships between the features and the response variable. In this paper we utilize this promising characteristic of SVM to accomplish nonlinear feature selection. However, we also notice that in the past few years the data generated in a variety of applications tend to have thousands of features. For instance, in the gene selection problem, the number of features, i.e., the gene expression coefficients corresponding to the abundance of mRNA, in the raw data ranges from 6000 to 60000 [19]. This large number of features presents a significant challenge to SVM-based feature selection because it has been shown [7] that the number of support vectors required by SVM typically grows linearly with the size of the training data set. When the number of features is large, standard SVM-based feature selection may produce many false alarms, i.e., include irrelevant features in the final results.

To effectively select relevant features from a vast number of attributes, this paper proposes to use the "Relevance Vector Machine" (RVM) for feature selection. The relevance vector machine is a Bayesian treatment of SVM with the same decision function [1]. It produces highly sparse solutions by introducing prior probability distributions to constrain the model weights, governed by a set of hyper-parameters. As a consequence, the features selected by RVM are much fewer than those learned by SVM while maintaining comparable selection performance. In this paper we incorporate a nonlinear feature kernel into the relevance vector machine to achieve nonlinear feature selection from a large number of features. Experimental results show that our proposed algorithm, which we call the "Relevance Feature Vector Machine" (RFVM), can discover nonlinear relevant features with a good detection rate and a low rate of false alarms. Furthermore, compared with the SVM-based feature selection methods [3][9], our proposed RFVM algorithm offers other compelling benefits. For instance, the parameters in RFVM are automatically learned by maximum likelihood estimation rather than by the time-consuming cross validation procedure used in the SVM-based methods.
The rest of the paper is organized as follows. In Section 2, we summarize the related work on nonlinear feature selection using SVM. In Section 3, we extend the relevance vector machine to the task of nonlinear feature selection. The experimental results and conclusions are presented in Section 4 and Section 5, respectively.
2 Preliminaries
Given a data set D = (X_{n×m}, y_{n×1}), where X_{n×m} represents the n input examples with m features and y_{n×1} represents the responses, we first describe the definitions of the feature space and the example space with respect to the data. In the feature space, each dimension is related to one specific feature, and the data set is regarded as a group of data examples D = [(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)]^T, where the x_i are the rows of X, i.e., X = [x_1^T, x_2^T, ..., x_n^T]^T. Sparse methods such as SVM in the feature space try to learn a sparse example weight vector α = [α_1, α_2, ..., α_n] associated with the n data examples. The examples with nonzero values α_i are regarded as support vectors, which are illustrated as solid circles in Figure 1(a). Alternatively, each dimension in the example space is related to one data sample x_i, and the data is denoted as a collection of features X = [f_1, f_2, ..., f_m] and the response y. The sparse solution in the example space is then related to a weight vector w = [w_1, w_2, ..., w_m]^T associated with the m features. Only those features with nonzero elements in w are regarded as relevant ones or "support features". If we use SVM to obtain the sparse solution, those relevant features are derived from the support features as shown in Figure 1(b). In this section, we first describe feature selection in the SVM framework and then present nonlinear feature selection solutions.

Fig. 1. (a) The feature space, where each dimension is related to one feature (f) in the data. SVM learns the sparse solution (denoted as black points) of the weight vector α associated with the data examples x_i. (b) The example space, in which each dimension is a data example x_i. The sparse solution (denoted as black points) of the weight vector w is associated with the related features (f).
2.1 Feature Selection by SVM
Support Vector Machine [13] is a very popular machine learning technique for the tasks of classification and regression. The standard SVM-regression [14] aims to find a predictive function f(x) = ⟨x, w⟩ + b that has at most ε deviation from the actual value y and is as flat as possible, where w is the feature weight vector as described before and b is the offset of the function f. If the solution is further relaxed by allowing a certain degree of error, the optimization problem of SVM-regression can be formulated as

\[
\begin{aligned}
\min\quad & \tfrac{1}{2}\|w\|^2 + C\,\mathbf{1}^T(\xi^+ + \xi^-) \\
\text{sub.}\quad & y - \langle X, w\rangle - b\mathbf{1} \le \varepsilon\mathbf{1} + \xi^+ \\
& \langle X, w\rangle + b\mathbf{1} - y \le \varepsilon\mathbf{1} + \xi^- \\
& \xi^+,\ \xi^- \ge 0
\end{aligned}
\tag{1}
\]

where ξ⁺ and ξ⁻ represent the errors, C measures the trade-off between error relaxation and flatness of the function, and 1 denotes the vector whose elements are all 1s. Instead of solving this optimization problem directly, it is usually much easier to solve its dual form [14] by the SMO algorithm. The dual problem of SVM-regression can be derived from Lagrange optimization with KKT conditions and Lagrange multipliers α⁺, α⁻:
\[
\begin{aligned}
\min\quad & \tfrac{1}{2}(\alpha^+ - \alpha^-)^T \langle X, X^T\rangle (\alpha^+ - \alpha^-) - y^T(\alpha^+ - \alpha^-) + \varepsilon\,\mathbf{1}^T(\alpha^+ + \alpha^-) \\
\text{sub.}\quad & \mathbf{1}^T(\alpha^+ - \alpha^-) = 0,\quad \mathbf{0} \le \alpha^+ \le C\mathbf{1},\quad \mathbf{0} \le \alpha^- \le C\mathbf{1}
\end{aligned}
\tag{2}
\]

The dual form also provides an easy way to model nonlinear dependencies by incorporating nonlinear kernels. That is, a kernel function K(x_i, x_j) defined over the examples x_i, x_j is used to replace the dot product ⟨x_i, x_j⟩ in equation (2). The term ε1^T(α⁺ + α⁻) in (2) works as a shrinkage factor and leads to a sparse solution of the example weight vector α = (α⁺ − α⁻), which is associated with the data examples in the feature space.
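For concreteness, this kernelized ε-SVR in the feature space can be fit with an off-the-shelf solver; the following is a minimal sketch (assuming scikit-learn is available; the RBF kernel and all numbers are illustrative, not from the paper):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # toy examples x_i
y = X[:, 0] + rng.normal(scale=0.1, size=50)       # toy responses

# Precomputed kernel matrix K(x_i, x_j) over the examples (here an RBF kernel)
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

svr = SVR(kernel="precomputed", C=1.0, epsilon=0.1).fit(K, y)
alpha = np.zeros(len(y))
alpha[svr.support_] = svr.dual_coef_.ravel()       # sparse example weights (alpha+ - alpha-)
print(np.count_nonzero(alpha), "support vectors")
```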
While the SVM algorithm is frequently used in the feature space to achieve a sparse solution α for classification and regression tasks, the paper [3] employed SVM in the example space to learn a sparse solution of the feature weight vector w for the purpose of feature selection, by switching the roles of features and data examples. After data normalization such that X^T 1 = 0 and thus X^T b1 = 0, the SVM-based feature selection described in [3] can be formulated as the following optimization problem.
\[
\begin{aligned}
\min\quad & \tfrac{1}{2}\|Xw\|^2 + C\,\mathbf{1}^T(\xi^+ + \xi^-) \\
\text{sub.}\quad & \langle X^T, y\rangle - \langle X^T, X\rangle w \le \varepsilon\mathbf{1} + \xi^+ \\
& \langle X^T, X\rangle w - \langle X^T, y\rangle \le \varepsilon\mathbf{1} + \xi^- \\
& \xi^+,\ \xi^- \ge 0
\end{aligned}
\tag{3}
\]
The above formulation (3) makes it easy to model nonlinear dependencies between features and response, which has also been explored in the work [9]. Similarly, the dual problem of (3) can be obtained with Lagrange multipliers w⁺, w⁻ and KKT conditions:

\[
\begin{aligned}
\min\quad & \tfrac{1}{2}(w^+ - w^-)^T \langle X^T, X\rangle (w^+ - w^-) - \langle y^T, X\rangle (w^+ - w^-) + \varepsilon\,\mathbf{1}^T(w^+ + w^-) \\
\text{sub.}\quad & \mathbf{0} \le w^+ \le C\mathbf{1},\quad \mathbf{0} \le w^- \le C\mathbf{1}
\end{aligned}
\tag{4}
\]
The intuition behind the dual optimization problem (4) is straightforward. It tries to minimize the mutual feature correlation, captured by ⟨X^T, X⟩, and to maximize the response-feature correlation ⟨y^T, X⟩. The parameter C in equation (4) controls the redundancy of the selected features: a small value of C reduces the importance of the mutual feature correlation ⟨X^T, X⟩ and thus allows more redundancy. The term ε1^T(w⁺ + w⁻) in the dual form (4) achieves the sparseness of the feature weight vector w = (w⁺ − w⁻). After optimization, the nonzero elements in w correspond to the relevant features in the example space. For a detailed explanation of the derivation of (3) and (4), please see [3].
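The box-constrained quadratic program (4) can be solved with any generic QP routine; as a rough illustration (not the solver used in [3]), a simple projected-gradient sketch over the split weights is:

```python
import numpy as np

def example_space_svm_weights(X, y, C=1.0, eps=0.1, n_iter=5000):
    """Projected-gradient sketch of the example-space dual (4) for the linear case.
    Returns the feature weight vector w = w_plus - w_minus (helper names are illustrative)."""
    n, m = X.shape
    G = X.T @ X                                          # mutual feature correlation <X^T, X>
    b = X.T @ y                                          # response-feature correlation <y^T, X>
    lr = 1.0 / (2.0 * np.linalg.norm(G, 2) + 1e-12)      # step size from the curvature of (4)
    wp, wm = np.zeros(m), np.zeros(m)
    for _ in range(n_iter):
        g = G @ (wp - wm) - b                            # gradient of the quadratic part w.r.t. (w+ - w-)
        wp = np.clip(wp - lr * (g + eps), 0.0, C)        # gradient step, then projection onto [0, C]
        wm = np.clip(wm - lr * (-g + eps), 0.0, C)
    return wp - wm                                       # nonzero entries flag the selected features
```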
2.2 Nonlinear Feature Selection
If we set ε = λ/2 and ignore the error relaxation in the primal problem (3), the optimization (3) can be rewritten in the example space using the features X = [f_1, f_2, ..., f_m] and the response y:

\[
\begin{aligned}
\min\quad & \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} w_i w_j \langle f_i, f_j\rangle \\
\text{sub.}\quad & \Big|\sum_{i=1}^{m} w_i \langle f_j, f_i\rangle - \langle f_j, y\rangle\Big| \le \frac{\lambda}{2},\quad \forall j
\end{aligned}
\tag{5}
\]

The optimization problem in (5) has been proved in [9] to be equivalent to the Lasso regression (6) [18], which has been widely used for linear feature selection:

\[
\min\ \|Xw - y\|^2 + \lambda\|w\|_1 .
\tag{6}
\]
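For reference, this linear special case is what off-the-shelf Lasso solvers compute; a minimal sketch (assuming scikit-learn, whose Lasso scales the squared-error term by 1/(2n), with toy data chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)   # toy linear response

w = Lasso(alpha=0.1).fit(X, y).coef_           # L1-regularized weights; most are driven to zero
selected = np.flatnonzero(np.abs(w) > 1e-8)    # linear feature selection in the spirit of (6)
print(selected)                                # should contain features 0 and 4 (possibly a few small spurious picks)
```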
While the Lasso regression (6) is performed in the feature space of the data set to achieve feature selection, the optimization (5) formulates the feature selection problem in the example space. As a consequence, we can define nonlinear kernels over the feature vectors to model nonlinear interactions between features. For feature vectors f_i and f_j with a nonlinear dependency, we assume that they can be projected into a high dimensional space by a mapping function φ so that they interact linearly in the mapped space. Therefore the nonlinear dependency can be represented by introducing the feature kernel K(f_i, f_j) = φ(f_i)^T φ(f_j). If we replace the dot product ⟨·,·⟩ in (5) with the feature kernel K, we obtain its nonlinear version:
\[
\begin{aligned}
\min\quad & \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} w_i w_j K(f_i, f_j) \\
\text{sub.}\quad & \Big|\sum_{i=1}^{m} w_i K(f_j, f_i) - K(f_j, y)\Big| \le \frac{\lambda}{2},\quad \forall j.
\end{aligned}
\tag{7}
\]
In the same way, we can incorporate nonlinear feature kernels into the more general expression (4) and obtain

\[
\begin{aligned}
\min\quad & \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} (w_i^+ - w_i^-)\, K(f_i, f_j)\, (w_j^+ - w_j^-) - \sum_{i=1}^{m} K(y, f_i)(w_i^+ - w_i^-) + \varepsilon\sum_{i=1}^{m} (w_i^+ + w_i^-) \\
\text{sub.}\quad & 0 \le w_i^+ \le C,\quad 0 \le w_i^- \le C,\quad \forall i
\end{aligned}
\tag{8}
\]
Both (7) and (8) can be used for nonlinear feature selection. However, they are both derived from the SVM framework and share the same weakness as the standard SVM algorithm: the number of support features grows linearly with the size of the feature set in the training data. As a result, the solution provided in the example space is not sparse enough. This leads to a serious problem of a high false alarm rate, i.e., the inclusion of many irrelevant features, when the feature set is large. To solve this issue, this paper proposes an RVM-based solution for nonlinear feature selection, which we call the "Relevance Feature Vector Machine". RFVM achieves a sparser solution in the example space by introducing priors over the feature weights. As a result, RFVM is able to select the most relevant features while decreasing the number of false alarms significantly. Furthermore, we will also show that RFVM learns the hyper-parameters automatically and hence avoids the cross validation effort needed to determine the trade-off parameter C in the SVM optimization (8).
3 Relevance Feature Vector Machine
In this section, we investigate the use of the Relevance Vector Machine for nonlinear feature selection. We first introduce the Bayesian framework of the standard Relevance Vector Machine algorithm [1]. Then we present our Relevance Feature Vector Machine algorithm, which utilizes RVM in the example space and exploits the mutual information kernel for nonlinear feature selection.
3.1 Relevance Vector Machine
The standard RVM [1] learns the vector α̃_{(n+1)×1} = [α_0, α], with α_0 = b denoting the "offset" and α = [α_1, α_2, ..., α_n] the "relevance feature weight vector" associated with the data examples in the feature space. It assumes that the response y_i is sampled from the model f(x_i) with noise ε, and the model function is expressed as

\[
f(x) = \sum_{j=1}^{n} \alpha_j \langle x, x_j\rangle + \alpha_0 + \varepsilon
\tag{9}
\]
where ε is assumed to be sampled independently from a Gaussian noise distribution with mean zero and variance σ². If we use a kernel to model the dependencies between the examples in the feature space, we get the n×(n+1) 'design' matrix Φ:

\[
\Phi =
\begin{bmatrix}
1 & K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\
1 & K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\
\vdots & & & & \vdots \\
1 & K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n)
\end{bmatrix}
\]
In order to estimate the coefficients α_0, ..., α_n in equation (9) from a set of training data, the likelihood of the given data set is written as

\[
p(y \mid \tilde{\alpha}, \sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\!\Big( -\frac{1}{2\sigma^2}\, \|y - \Phi\tilde{\alpha}\|^2 \Big)
\tag{10}
\]
In addition, RVM defines prior probability distributions on the parameters α̃ in order to obtain sparse solutions. This prior distribution is expressed with n+1 hyper-parameters β̃_{(n+1)×1} = [β_0, β_1, ..., β_n]:

\[
p(\tilde{\alpha} \mid \tilde{\beta}) = \prod_{i=0}^{n} \mathcal{N}(\alpha_i \mid 0, \beta_i^{-1})
\tag{11}
\]
The unknowns α̃, β̃ and σ² can be estimated by maximizing the posterior distribution p(α̃, β̃, σ² | y), which can be decomposed as:

\[
p(\tilde{\alpha}, \tilde{\beta}, \sigma^2 \mid y) = p(\tilde{\alpha} \mid y, \tilde{\beta}, \sigma^2)\, p(\tilde{\beta}, \sigma^2 \mid y).
\tag{12}
\]

This decomposition allows us to use two steps to find the solution α̃ together with the hyper-parameters β̃ and σ². For details of the optimization procedure, please see [1]. Compared with SVM, RVM produces a sparser solution α̃ and determines the hyper-parameters simultaneously.
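As a point of reference, the two-step re-estimation of [1] can be sketched as follows (a minimal version assuming a Gaussian likelihood over the design matrix Φ above; the initial values, iteration count, and the omission of weight pruning are our own simplifications):

```python
import numpy as np

def rvm_regression(Phi, y, n_iter=200):
    """Minimal sketch of the standard RVM two-step updates from [1] (feature-space version)."""
    n, d = Phi.shape
    beta = np.full(d, 10.0)                # hyper-parameters on the weights (arbitrary initialization)
    sigma2 = np.var(y) / 10.0
    for _ in range(n_iter):
        # Step 1: posterior covariance and mean of the weights given beta and sigma2
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(beta))
        mu = Sigma @ Phi.T @ y / sigma2
        # Step 2: re-estimate the hyper-parameters from the current posterior
        gamma = 1.0 - beta * np.diag(Sigma)
        beta = gamma / (mu ** 2 + 1e-12)
        sigma2 = np.sum((y - Phi @ mu) ** 2) / max(n - gamma.sum(), 1e-12)
    return mu                              # sparse weights; nonzero entries mark the relevance vectors
```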
To the best of our knowledge, the current RVM algorithm is always performed in the feature space, in which the relevance weight vector α of RVM is associated with the data examples. This paper is the first to utilize the promising characteristics of RVM for feature selection. In the next section, we reformulate the Relevance Vector Machine in the example space and incorporate nonlinear feature kernels to learn nonlinear "relevant features".
3.2 Nonlinear Feature Selection with Relevance Feature Vector Machine
This section presents the relevance feature vector machine (RFVM) algorithm, which utilizes RVM in the example space to select relevant features. We will also show how the kernel trick can be applied to accomplish nonlinear feature selection. Again, we assume the data (X, y) is standardized. We start by rewriting the function (9) into an equivalent form by incorporating the feature weight vector w:

\[
y = \sum_{j=1}^{m} w_j f_j + \varepsilon
\tag{13}
\]

The above formula assumes a linear dependency between the features and the response. When this relationship is nonlinear, we project the features and the response into a high dimensional space by a function φ so that the dependency in the mapped space becomes linear:

\[
\phi(y) = \sum_{j=1}^{m} w_j \phi(f_j) + \varepsilon.
\tag{14}
\]
Accordingly, the likelihood function given the training data can be expressed as

\[
p(\phi(y) \mid w, \sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\!\Big( -\frac{\|\phi(y) - \phi(X)w\|^2}{2\sigma^2} \Big)
\tag{15}
\]

where φ(X) = [φ(f_1), φ(f_2), ..., φ(f_m)]. We expand the squared error term in the above likelihood function and replace the dot products with a feature kernel K to model the nonlinear interaction between the feature vectors and the response, which results in
\[
\begin{aligned}
\|\phi(y) - \phi(X)w\|^2 &= (\phi(y) - \phi(X)w)^T(\phi(y) - \phi(X)w) \\
&= \phi(y)^T\phi(y) - 2w^T\phi(X)^T\phi(y) + w^T\phi(X)^T\phi(X)\,w \\
&= K(y^T, y) - 2w^T K(X^T, y) + w^T K(X^T, X)\,w
\end{aligned}
\]
where:

\[
K(X^T, y) =
\begin{bmatrix}
K(y, f_1) \\
K(y, f_2) \\
\vdots \\
K(y, f_m)
\end{bmatrix}
\quad\text{and}\quad
K(X^T, X) =
\begin{bmatrix}
K(f_1, f_1) & K(f_1, f_2) & \cdots & K(f_1, f_m) \\
K(f_2, f_1) & K(f_2, f_2) & \cdots & K(f_2, f_m) \\
\vdots & & & \vdots \\
K(f_m, f_1) & K(f_m, f_2) & \cdots & K(f_m, f_m)
\end{bmatrix}
\]
After some manipulation, the likelihood function (15) can be reformulated as

\[
p(\phi(y) \mid w, \sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\!\Big( -\frac{K(y^T, y) - 2w^T K(X^T, y) + w^T K(X^T, X)\,w}{2\sigma^2} \Big)
\tag{16}
\]
Note that RFVM differs from the traditional RVM in that the prior β = [β_1, β_2, ..., β_m] is defined over the relevance feature weight vector w:

\[
p(w \mid \beta) = \prod_{i=1}^{m} \mathcal{N}(w_i \mid 0, \beta_i^{-1})
\tag{17}
\]
The sparse solution w corresponding to the relevant features can be obtained by maximizing

\[
p(w, \beta, \sigma^2 \mid \phi(y)) = p(w \mid \phi(y), \beta, \sigma^2)\, p(\beta, \sigma^2 \mid \phi(y))
\tag{18}
\]
Similar to RVM, we use two steps to find the maximizing solution. The first step is to maximize

\[
p(w \mid \phi(y), \beta, \sigma^2) = \frac{p(\phi(y) \mid w, \sigma^2)\, p(w \mid \beta)}{p(\phi(y) \mid \beta, \sigma^2)}
= (2\pi)^{-\frac{n+1}{2}} |\Sigma|^{-\frac{1}{2}} \exp\!\Big( -\tfrac{1}{2}(w - \mu)^T \Sigma^{-1} (w - \mu) \Big)
\tag{19}
\]
Given the current estimates of β and σ², the covariance Σ and mean μ of the feature weight vector w are

\[
\Sigma = \big(\sigma^{-2} K(X^T, X) + B\big)^{-1}
\tag{20}
\]

\[
\mu = \sigma^{-2}\, \Sigma\, K(X^T, y)
\tag{21}
\]

where B = diag(β_1, ..., β_m).
Once we have the current estimate of w, the second step is to learn the hyper-parameters β and σ² by maximizing p(β, σ² | φ(y)) ∝ p(φ(y) | β, σ²) p(β) p(σ²). Since we assume the hyper-parameters are uniformly distributed, i.e., p(β) and p(σ²) are constant, this is equivalent to maximizing the marginal likelihood p(φ(y) | β, σ²), which is computed by:
\[
\begin{aligned}
p(\phi(y) \mid \beta, \sigma^2) &= \int p(\phi(y) \mid w, \sigma^2)\, p(w \mid \beta)\, dw \\
&= (2\pi)^{-\frac{n}{2}} \big|\sigma^2 I + \phi(X) B^{-1} \phi(X)^T\big|^{-\frac{1}{2}}
\exp\!\Big( -\tfrac{1}{2}\, \phi(y)^T \big(\sigma^2 I + \phi(X) B^{-1} \phi(X)^T\big)^{-1} \phi(y) \Big)
\end{aligned}
\tag{22}
\]
By differentiating equation (22), we can update the hyper-parameters β and σ² by:

\[
\beta_i^{\mathrm{new}} = \frac{1 - \beta_i N_{ii}}{\mu_i^2}
\tag{23}
\]

\[
\sigma^2_{\mathrm{new}} = \frac{\|\phi(y) - \phi(X)\mu\|^2}{n - \sum_i (1 - \beta_i N_{ii})}
\tag{24}
\]

where N_ii is the i-th diagonal element of the covariance from equation (20) and μ is computed from equation (21) with the current β and σ² values. The final optimal set of w, β and σ² is then learned by iterating between the first step, which updates the covariance Σ (20) and mean μ (21) of the feature weight vector w, and the second step, which updates the hyper-parameters β (23) and σ² (24).
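Putting the two steps together, the iteration can be sketched in a few lines (a minimal version under the kernelized likelihood (16); the function and argument names, the fixed iteration count, and the small numerical guards are illustrative choices, not from the paper):

```python
import numpy as np

def rfvm_fit(K_ff, K_fy, K_yy, n, beta0=10.0, sigma2_0=1.0, n_iter=200):
    """Sketch of the two-step RFVM updates, eqs. (20)-(21) and (23)-(24).
    K_ff: (m, m) feature kernel K(X^T, X); K_fy: (m,) relevance vector K(X^T, y);
    K_yy: scalar K(y^T, y); n: number of data examples."""
    m = K_ff.shape[0]
    beta = np.full(m, beta0)
    sigma2 = sigma2_0
    for _ in range(n_iter):
        # Step 1: posterior covariance (20) and mean (21) of the feature weights w
        Sigma = np.linalg.inv(K_ff / sigma2 + np.diag(beta))
        mu = Sigma @ K_fy / sigma2
        # Step 2: hyper-parameter updates (23) and (24)
        gamma = 1.0 - beta * np.diag(Sigma)                 # how well-determined each weight is
        beta = gamma / (mu ** 2 + 1e-12)
        resid = K_yy - 2.0 * mu @ K_fy + mu @ K_ff @ mu     # kernelized ||phi(y) - phi(X) mu||^2
        sigma2 = max(resid, 1e-12) / max(n - gamma.sum(), 1e-12)
    return mu, beta, sigma2            # large |mu_i| marks feature i as relevant
```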
RFVM learns a sparse feature weight vector w in which most of the elements are zero. The zero elements in w indicate that the corresponding features are irrelevant and should be filtered out. On the other hand, large values of elements in w indicate high importance of the related features. In this paper we use mutual information as the kernel function K(·, ·), which is introduced in the following section. In that case, K(X^T, y) measures the relevance between the response y and the features in the data matrix X, and K(X^T, X) indicates the redundancy between the features in the data matrix X. The likelihood maximization procedure of RFVM thus tends to maximize the relevance between the features and the response while minimizing the mutual redundancy among the features.
3.3 Mutual Information Feature Kernel
While kernels are usually defined over data examples in the feature space, the RFVM algorithm places the nonlinear kernel over the feature and response vectors for the purpose of feature selection. As we know, the mutual information [16] of two variables measures how much the uncertainty about one variable can be reduced given knowledge of the other variable. This property can be used as a metric to measure the relevance between features. Given two discrete variables U and V with their observations denoted as u and v respectively, the mutual information I between them is formulated as

\[
I(U, V) = \sum_{u \in U} \sum_{v \in V} p(u, v)\, \log_2 \frac{p(u, v)}{p(u)\,p(v)}
\tag{25}
\]

where p(u, v) is the joint probability density function of U and V, and p(u) and p(v) are the marginal probability density functions of U and V respectively.
Now given two feature vectors f_u and f_v, we calculate the value of their mutual information kernel K(f_u, f_v) as follows. We regard all the elements in the vector f_u (or f_v) as multiple observations of a variable f_u (or f_v), and discretize those observations into bins for each variable. That is, we sort the values in the feature vectors f_u and f_v separately and discretize each vector into N bins, with the same interval for each bin. For example, if the maximal value of f_u is u_max and the minimal value is u_min, the interval for each bin of feature vector f_u is (u_max − u_min)/N. Now for each value u in feature vector f_u and v in feature vector f_v, we assign u = i and v = j if u falls into the i-th bin and v falls into the j-th bin of their discretized regions respectively. The probability density functions p(f_u, f_v), p(f_u) and p(f_v) are calculated as the ratio of the number of elements within the corresponding bin to the length of the vector, n. As a result, we have

\[
\begin{aligned}
p(u = i) &= \mathrm{counts}(u = i)/n \\
p(v = j) &= \mathrm{counts}(v = j)/n \\
p(u = i, v = j) &= \mathrm{counts}(u = i \text{ and } v = j)/n
\end{aligned}
\]
and

\[
K(f_u, f_v) = \sum_{i=1}^{N} \sum_{j=1}^{N} p(u = i, v = j)\, \log_2 \frac{p(u = i, v = j)}{p(u = i)\,p(v = j)}
\tag{26}
\]

The mutual information kernel is symmetric and non-negative, with K(f_u, f_v) = K(f_v, f_u) and K(f_u, f_v) ≥ 0. It also satisfies Mercer's condition [13], which guarantees the convergence of the proposed RFVM algorithm. In this paper we set the number of bins for discretization to log_2(m), where m is the number of features.
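A minimal sketch of this binned mutual information kernel (using equal-width histograms as described above; the helper name and the treatment of empty bins are our own choices):

```python
import numpy as np

def mi_kernel(fu, fv, n_bins):
    """Mutual information kernel (26) between two feature (or response) vectors."""
    # Joint distribution from equal-width binning of the two vectors, as in Section 3.3
    counts, _, _ = np.histogram2d(fu, fv, bins=n_bins)
    p_uv = counts / len(fu)
    p_u = p_uv.sum(axis=1, keepdims=True)        # marginal of f_u
    p_v = p_uv.sum(axis=0, keepdims=True)        # marginal of f_v
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_uv * np.log2(p_uv / (p_u * p_v))
    return float(np.nansum(terms))               # empty bins contribute zero
```

The kernel matrices K(X^T, X) and K(X^T, y) used by RFVM are then obtained by evaluating this function over all pairs of feature columns and between each feature column and the response.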
4 Experimental Results
Experimental results are presented in this section to demonstrate the effectiveness of our proposed RFVM algorithm. We compare RFVM with two other state-of-the-art nonlinear feature selection algorithms from [3] and [9]. For convenience, we call the algorithm proposed in [3] P-SVM and that in [9] FVM. All the experiments are conducted on a Pentium 4 machine with a 3 GHz CPU and 1 GB of RAM.
4.1 Nonlinear Feature Selection by RFVM
In order to verify that the proposed RFVM is able to capture the nonlinear dependency between the response and the feature vectors, we simulated 1000 data examples with 99 features and one response. The response y is generated by the summation of three base functions of f_1, f_2, f_3 respectively, together with Gaussian noise ε distributed as N(0, 0.005):

\[
y = f(f_1, f_2, f_3, \ldots, f_{99}) = 9 f_1 + 20 (1 - f_2)^3 + 17 \sin(80 f_3 - 7) + \varepsilon
\]
The three base functions are shown in Figure 2(b)(c)(d), in which the first is a linear function and the other two are nonlinear. Figure 2(a) also plots the distribution of y with respect to the two nonlinear features f_2 and f_3. The values of the features f_1, f_2, f_3 are generated by a uniform distribution in [0, 1]. The other 96 features f_4, f_5, ..., f_99 are generated uniformly in [0, 20] and are independent of the response y.
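This simulated data set is straightforward to reproduce; a sketch follows (the random seed is arbitrary, and we read N(0, 0.005) as a noise variance of 0.005):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 99

# Three relevant features in [0, 1]; the remaining 96 are irrelevant, uniform in [0, 20]
F = np.empty((n, m))
F[:, :3] = rng.uniform(0.0, 1.0, size=(n, 3))
F[:, 3:] = rng.uniform(0.0, 20.0, size=(n, m - 3))

# Response: linear + cubic + sine components plus Gaussian noise
y = (9.0 * F[:, 0]
     + 20.0 * (1.0 - F[:, 1]) ** 3
     + 17.0 * np.sin(80.0 * F[:, 2] - 7.0)
     + rng.normal(0.0, np.sqrt(0.005), size=n))
```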
Fig. 2. (a) The distribution of the response y with respect to the two nonlinear features f_2 and f_3. The bottom three figures illustrate the three components of the simulated function f: (b) linear, (c) cubic and (d) sine.

We modified the MATLAB code provided by Mike Tipping [17] to implement RFVM for the task of feature selection. The RFVM is solved by iteratively updating the posterior covariance Σ in equation (20) and the mean μ in equation (21), along with the hyper-parameters β in equation (23) and σ² in equation (24), using the two-step procedure. The nonlinear dependencies between the response and the features are captured by the mutual information kernel in RFVM; that is, we replace the dot products of the features and the response, ⟨X^T, y⟩ and ⟨X^T, X⟩, with the precomputed mutual information kernels K(X^T, y) and K(X^T, X). The optimal set of feature weights w, along with the hyper-parameters β and σ², is learned automatically by the two-step updating procedure. The initial values of the hyper-parameters are set to β = 10 and σ² = std(y)/10.
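Tying the earlier sketches together, a Python analogue of this setup would precompute the mutual information kernels over the simulated data and run the two-step updates (the selection threshold and the rounding of the bin count are illustrative choices):

```python
# Reusing mi_kernel, rfvm_fit and the simulated (F, y) from the sketches above
n_bins = max(int(np.log2(m)), 2)                     # the paper's choice of log2(m) bins, rounded
K_ff = np.array([[mi_kernel(F[:, i], F[:, j], n_bins) for j in range(m)] for i in range(m)])
K_fy = np.array([mi_kernel(F[:, i], y, n_bins) for i in range(m)])
K_yy = mi_kernel(y, y, n_bins)

mu, beta, sigma2 = rfvm_fit(K_ff, K_fy, K_yy, n, beta0=10.0, sigma2_0=np.std(y) / 10.0)
selected = np.flatnonzero(np.abs(mu) > 1e-3)         # illustrative threshold on the learned weights
print(selected)                                      # ideally close to the relevant set {0, 1, 2}
```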
Figure 3(a) and (b) plot the values of the feature weights computed by linear RVM and nonlinear RFVM over the simulated data. From Figure 3(a), we see that the linear RVM can detect the linearly dependent feature f_1, as well as the feature f_2, which has a cubic relationship. The reason that f_2 is also detected by linear RVM is that the cubic curve can be approximated by a straight line to a certain degree, as shown in Figure 2(c). However, RVM misses the feature f_3 completely, which is a highly nonlinear feature with a periodic sine wave. On the other hand, the nonlinear RFVM detects all three features successfully, as shown in Figure 3(b). Furthermore, the detected feature set is quite sparse compared with the results of linear RVM.
Fig. 3. (a) The histogram of feature weights from linear RVM. It detects f_1 and f_2 but misses the highly nonlinear relevant feature f_3. (b) The histogram of feature weights from nonlinear RFVM. It detects all three features.

4.2 Performance Comparison

This section compares the performance of the RFVM algorithm with the other nonlinear feature selection algorithms, FVM [9] and P-SVM [3]. To demonstrate that RFVM is able to select the most relevant features with a much lower false alarm rate, we simulate another data set with 2000 data examples and 100 features.
The first 20 features are simulated uniformly in [−0.5, 0.5] and the rest are generated uniformly in [0, 20] with Gaussian noise. The response y is the summation of functions F_i(·) on the first 20 features:

\[
y = \sum_{i=1}^{20} F_i(f_i).
\tag{27}
\]

Each basis function F_i(·) is randomly chosen from a pool of eight candidate functions,

\[
F_i(f_i) \in \{F_1(f_i), F_2(f_i), \ldots, F_8(f_i)\}
\tag{28}
\]

where the expressions of the candidate functions are given in Table 1. As can be seen, our simulation covers almost all common kinds of nonlinear relationships.
Table 1. The eight basis functions

j            1              2                3            4
F_j(f_i)     40 f_i         20(1 − f_i²)     23 f_i³      20 sin(40 f_i − 5)

j            5              6                7            8
F_j(f_i)     20 e^{f_i}     −log_2(|f_i|)    20 √(1 − f_i)   20 cos(20 f_i − 7)
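A sketch of this second simulation, assuming the Table 1 expressions as reconstructed above and an arbitrary noise level for the irrelevant features (both the seed and the noise scale are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 2000, 100, 20

# Candidate basis functions from Table 1 (the log term is guarded against f = 0)
basis = [lambda f: 40 * f,
         lambda f: 20 * (1 - f ** 2),
         lambda f: 23 * f ** 3,
         lambda f: 20 * np.sin(40 * f - 5),
         lambda f: 20 * np.exp(f),
         lambda f: -np.log2(np.abs(f) + 1e-12),
         lambda f: 20 * np.sqrt(1 - f),
         lambda f: 20 * np.cos(20 * f - 7)]

F = np.empty((n, m))
F[:, :k] = rng.uniform(-0.5, 0.5, size=(n, k))                      # 20 relevant features
F[:, k:] = rng.uniform(0.0, 20.0, size=(n, m - k)) + rng.normal(0.0, 1.0, size=(n, m - k))

choice = rng.integers(0, len(basis), size=k)                        # one basis function per relevant feature
y = sum(basis[choice[i]](F[:, i]) for i in range(k))                # response (27)
```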
We divide the data into two parts: the first 1000 examples are used as training data to determine the parameter λ in FVM and ε, C in P-SVM by 10-fold cross validation, and the remaining 1000 examples are used for testing. The performances of the algorithms are compared in terms of detection rate and false alarm rate. We run 20 rounds of such simulations and present the results in Figure 4.
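For clarity, the two rates can be computed from a selected feature set as follows (an illustrative helper, not taken from the paper):

```python
def detection_and_false_alarm(selected, relevant, m):
    """Detection rate: fraction of truly relevant features that are selected.
    False alarm rate: fraction of irrelevant features that are selected."""
    selected, relevant = set(selected), set(relevant)
    detection = len(selected & relevant) / len(relevant)
    false_alarm = len(selected - relevant) / (m - len(relevant))
    return detection, false_alarm
```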
Fig. 4. (a) The detection rates of FVM, P-SVM and RFVM. (b) The false alarm rates of FVM, P-SVM and RFVM.
Figure 4(a) plots the detection rate of RFVM together with those of FVM and P-SVM. It shows that RFVM maintains a detection rate comparable to the other two algorithms. Note that since the nonlinear relationship (27) generated in our simulation is very strong, the range of detection rates for these algorithms is reasonable. Figure 4(b) plots the false alarm rates of the FVM, P-SVM and RFVM algorithms. It demonstrates that RFVM generally has a lower false alarm rate, which is due to the sparseness of RFVM in the example space compared with FVM and P-SVM. Also note that in this experiment we do not need to predetermine any parameters in RFVM, since the parameters are learned automatically by the two-step maximum likelihood method, while FVM and P-SVM are both sensitive to their parameters and need the extra effort of cross validation to determine those values.
5 Conclusions
This paper has proposed a new method, the "Relevance Feature Vector Machine" (RFVM), to detect features with nonlinear dependency. Compared with other state-of-the-art nonlinear feature selection algorithms, RFVM has two unique advantages based on our theoretical analysis and experimental results. First, by utilizing the highly sparse nature of RVM, the RFVM algorithm reduces the false alarms in feature selection significantly while still maintaining a desirable detection rate. Furthermore, unlike other SVM-based nonlinear feature selection algorithms whose performance is sensitive to the selection of parameter values, RFVM learns the hyper-parameters automatically by maximizing the "marginal likelihood" in the second step of the two-step updating procedure. In the future, we will apply RFVM to real applications to further demonstrate the advantages of our algorithm.
References
1. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244 (2001)
2. Bellman, R.E.: Adaptive Control Processes. Princeton University Press, Princeton, NJ (1961)
3. Hochreiter, S., Obermayer, K.: Nonlinear feature selection with the potential support vector machine. In: Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) Feature Extraction, Foundations and Applications. Springer, Berlin (2005)
4. Figueiredo, M., Jain, A.K.: Bayesian Learning of Sparse Classifiers. In: Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 35–41 (2001)
5. Figueiredo, M.A.T.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1150–1159 (2003)
6. Bø, T.H., Jonassen, I.: New feature subset selection procedures for classification of expression profiles. Genome Biology 3, research 0017.1–0017.11 (2000)
7. Burges, C.: Simplified support vector decision rules. In: Proc. of the Thirteenth International Conf. on Machine Learning, pp. 71–77. Morgan Kaufmann, Seattle (1996)
8. Aizerman, M.E., Braverman, Rozonoer, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
9. Li, F., Yang, Y., Xing, E.P.: From Lasso regression to Feature Vector Machine. In: Advances in Neural Information Processing Systems 18 (2005)
10. Faul, A., Tipping, M.E.: Analysis of sparse Bayesian learning. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, pp. 383–389. MIT Press, Cambridge, MA (2002)
11. Faul, A., Tipping, M.: A variational approach to robust regression. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) Artificial Neural Networks, pp. 95–202 (2001)
12. Roth, V.: The Generalized LASSO. IEEE Transactions on Neural Networks 15(1) (2004)
13. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
14. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. NEUROCOLT Technical Report NC-TR-98-030, Royal Holloway College, London (1998)
15. Long, F., Ding, C.: Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1226–1238 (2005)
16. Guiasu, S.: Information Theory with Applications. McGraw-Hill, New York (1977)
17. Tipping, M.E.: Microsoft Corporation, http://research.microsoft.com/MLP/RVM/
18. Tibshirani, R.: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1), 267–288 (1999)
19. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
20. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
21. Reeves, S.J., Zhao, Z.: Sequential algorithms for observation selection. IEEE Transactions on Signal Processing 47, 123–132 (1999)