Feature Selection using Sparse Priors:
A Regularization Approach
Martin Brown and Nick Costen
m.brown@mmu.ac.uk
n.costen@mmu.ac.uk
Feature selection is a fundamental process within many classification algorithms. A large dictionary of potential, flexible features often exists, from which it is necessary to select a relevant subset. The aims of feature selection include:
- improved, more robust parameter estimation,
- improved insight into the decision-making process.
Feature selection is generally an empirical process that is performed prior to, or jointly with, the parameter estimation process. However, it should also be recognized that feature selection often introduces problems because:
- optimal feature selection is an NP-complete problem,
- sensitivity analysis of the trained classifier is often neglected,
- parametric prediction uncertainty with respect to the unselected features is often not represented in the final classifier.
This talk is concerned with a classification regularization approach of the form:

    min_{w, b}  Σ_i L(t_i, wᵀx_i + b) + λ ||w||_1

where w and b are the classification parameters and bias term respectively, λ is the regularization parameter and the empirical data set is given by {x_i, t_i}, where the target class labels lie in {–1, +1}. This is similar to the regularization functions proposed by Vapnik and used to develop Support Vector Machines [1]. The aim is to jointly minimize the loss function, which measures how close the predictions are to the class labels, and the model complexity term, which orders potential classifiers according to their complexity. For classifiers that are linear in their parameters, this is a piecewise Quadratic Programming problem.
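A minimal sketch of this formulation can be obtained with proximal-gradient (ISTA) iterations, soft-thresholding the weights at each step. This is an illustrative stand-in, not the talk's exact algorithm: the squared hinge loss, step size, iteration count and the unpenalized bias are all assumptions.

```python
import numpy as np

def soft_threshold(v, thr):
    """Proximal operator of the 1-norm: shrink towards zero, clamp at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def fit_l1_classifier(X, t, lam, lr=0.1, n_iter=2000):
    """ISTA sketch (illustrative, not the talk's algorithm) for
       min_{w,b}  (1/n) * sum_i max(0, 1 - t_i*(w.x_i + b))**2 + lam*||w||_1
    with labels t_i in {-1, +1}; the bias b is left unpenalized."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        slack = np.maximum(0.0, 1.0 - t * (X @ w + b))   # hinge slack per sample
        grad_w = -2.0 / n * (X.T @ (t * slack))          # gradient of smooth loss
        grad_b = -2.0 / n * np.sum(t * slack)
        w = soft_threshold(w - lr * grad_w, lr * lam)    # gradient step + L1 prox
        b -= lr * grad_b
    return w, b
```

On synthetic data where only one feature carries the label, a moderate λ drives the irrelevant weights exactly to zero, which is the sparsity behaviour discussed below.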
An interesting observation is that the aim of classification is to minimize the number of classification errors, which is, in general, an NP-complete problem. To circumvent this, “soft” measures, such as a 1-norm or 2-norm on the prediction errors, are used (the loss function), and this is minimized instead. This is the basis of the sparse regularization approach to classification. Instead of performing a discrete search through the space of potential classifiers, which is an NP-complete problem, the 1-norm of the parameter vector can be used to provide a “soft” measure. Any convex optimization criterion with convex constraints will have a global minimum, and it can be shown that the 1-norm complexity measure is convex. The talk will concentrate on the properties of the 1-norm prior, showing why it produces sparse classification models, and compare it with other complexity measures.
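The contrast between the 1-norm and a 2-norm prior is already visible in one dimension, where both penalized problems have closed-form solutions. The scalar set-up below is a hypothetical illustration, not taken from the talk:

```python
def l1_shrink(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*|w|  (soft threshold).
    The derivative discontinuity of |w| at 0 yields an exact zero
    whenever |a| <= lam."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_shrink(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*w**2  (proportional shrinkage).
    Smooth at 0, so w is scaled down but never exactly zero for a != 0."""
    return a / (1.0 + 2.0 * lam)
```

A weak coefficient (a = 0.3, λ = 0.5) is snapped exactly to zero by the 1-norm prior but only scaled down to 0.15 by the 2-norm prior; this is the mechanism behind the sparsity results in the talk.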
The 1-norm’s derivative discontinuity at 0 is fundamental to producing sparse models, and it will be shown that other p-norms that do not possess this property will not produce truly sparse models. In addition, because the 1-norm is linear in each region where the parameters’ signs do not change, an efficient algorithm can be generated that calculates every globally optimal sparse classifier as a function of the regularization parameter. This is because the local Hessian matrix is constant in that region. The complete algorithm for generating every sparse classifier, and thus every sparse feature set, will be briefly described [2], and the visualization of the information provided by the parameter trajectories will be discussed. This is an important aspect of providing a qualitative understanding of the classification problem’s sensitivity, and gives the designer some indication of the conditional independence of the different features as the classifier’s complexity changes.
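The talk's algorithm exploits the constant local Hessian to track exact trajectories; as a rough stand-in, the sketch below traces approximate parameter trajectories for a 1-norm-penalized least-squares surrogate by warm-starting coordinate descent over a decreasing grid of λ values. The surrogate loss, the grid and the sweep count are assumptions, not the talk's method.

```python
import numpy as np

def lasso_cd(X, y, lam, w=None, n_sweeps=200):
    """Coordinate descent for min_w 0.5/n*||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    if w is None:
        w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n            # per-coordinate curvature
    for _ in range(n_sweeps):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]       # residual excluding coord j
            rho = X[:, j] @ r / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

def trajectories(X, y, lams):
    """Warm-started approximation of the parameter trajectories:
    one row of coefficients per lam, from largest lam (all-zero model)
    down to the smallest."""
    w = np.zeros(X.shape[1])
    traj = []
    for lam in sorted(lams, reverse=True):
        w = lasso_cd(X, y, lam, w)
        traj.append(w.copy())
    return np.array(traj)
```

Plotting the rows of `trajectories` against λ gives the kind of trajectory picture discussed above: features enter the model one by one as λ decreases.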
The algorithm has been tested with a variety of datasets, including Australian Credit, a real-world set consisting of 690 observations about whether or not credit should be granted, using 14 potential attributes. The error curve and parameter trajectory for a linear model are shown in Figures 1 and 2 respectively, using a transformed regularization parameter log(λ+1) to emphasize small-λ behaviour.
As can be seen, for a large range of sparse models, the error rate is approximately constant and there is a single dominant parameter. When λ is small, a number of parameters start to reverse their sign as the regularization parameter is further reduced, demonstrating the presence of Simpson’s paradox. The advantage of this approach is that the properties of all the sparse models can be directly visualized and the behaviour of the parameters discussed. In addition, the approach is efficient, compared to re-training a single classifier, even though every sparse classifier is generated.
Other aspects, such as the generalization of the problem to non-linear classifier structures, the development of parametric uncertainty/error bars for the inactive features, the inclusion of prior weighting parameters on the features [3] and the development of globally optimal, sparse, adaptive classifiers, will also be briefly discussed.
References
1. Vapnik, V. N., Statistical Learning Theory. Wiley, New York, 1998.
2. Brown, M., Exploring the set of sparse, optimal classifiers. Proceedings of the International Workshop on Artificial Neural Networks in Pattern Recognition, pages ** - **, 2003.
3. Costen, N. P. and Brown, M., Exploratory sparse models for face classification. Proceedings of the British Machine Vision Conference, vol 1, pages 13-22, 2003.
Figure 1. Classification errors versus the regularization parameter, log(λ+1).
Figure 2. Parameter trajectories versus the regularization parameter, log(λ+1).