Feature Selection using Sparse Priors: A Regularization Approach

Oct 16, 2013



Martin Brown and Nick Costen



Feature selection is a fundamental process within many classification algorithms. A large
dictionary of potential, flexible features often exists, from which it is necessary to select a relevant
subset. The aims of feature selection include:

improved, more robust parameter estimation

improved insight into the decision making process.

Feature selection is generally an empirical process that is performed prior to, or jointly with, the
parameter estimation process. However, it should also be recognized that feature selection often
introduces problems because:

optimal feature selection is an NP-complete problem,

sensitivity analysis of the trained classifier is often neglected,

parametric prediction uncertainty with respect to the unselected features is often not
represented in the final classifier.

This talk is concerned with a classification regularization approach of the form:

minimize over (θ, b):   Σᵢ L(yᵢ, ŷ(xᵢ; θ, b)) + λ ‖θ‖₁

where ŷ(xᵢ; θ, b) is the classifier's prediction, θ and b are the classification parameters and bias term respectively, λ is the regularization
parameter and the empirical data set is given by {xᵢ, yᵢ}, where the target class labels lie in
{−1, +1}. This is similar to the regularization functions proposed by Vapnik and used to develop
Support Vector Machines [1]. The aim is to jointly minimize the loss function that measures how
close the predictions are to the class labels and the model complexity term that orders potential
classifiers according to their complexity. For classifiers that are linear in their parameters, this is
a piecewise Quadratic Programming problem.
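As a concrete illustration of this kind of objective, the following sketch minimizes a squared (2-norm) loss on {−1, +1} labels plus a 1-norm penalty using proximal gradient descent (ISTA). The toy data, step size and value of λ are illustrative assumptions, not values from the talk:

```python
# Minimise  sum_i (y_i - (theta.x_i + b))^2 + lam * ||theta||_1
# by proximal gradient descent (ISTA); the bias b is not penalised.
# Pure Python; data, step size and lam are illustrative only.

def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1(X, y, lam, step=0.01, iters=2000):
    n, d = len(X), len(X[0])
    theta, b = [0.0] * d, 0.0
    for _ in range(iters):
        # residuals r_i = prediction - label
        r = [sum(theta[j] * X[i][j] for j in range(d)) + b - y[i]
             for i in range(n)]
        grad = [2 * sum(r[i] * X[i][j] for i in range(n)) for j in range(d)]
        b -= step * 2 * sum(r)
        # gradient step on the loss, then the 1-norm's proximal shrinkage
        theta = [soft_threshold(theta[j] - step * grad[j], step * lam)
                 for j in range(d)]
    return theta, b

# Toy problem: the labels depend only on the first feature; the second
# feature is noise, and the 1-norm penalty drives it exactly to zero.
X = [[1.0, 0.3], [0.9, -0.2], [-1.1, 0.1], [-0.8, -0.4]]
y = [1.0, 1.0, -1.0, -1.0]
theta, b = fit_l1(X, y, lam=1.0)
```

Note that the irrelevant coefficient comes out exactly zero, not merely small; this is the sparsity behaviour of the 1-norm discussed below.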

An interesting observation is that the aim of classification is to minimize the number of
classification errors, which is, in general, an NP-complete problem. To circumvent this, “soft”
measures, such as a 1-norm or 2-norm on the prediction errors, are used (the loss function), and
this is minimized instead. This is the basis of the sparse regularization approach to classification.
Instead of performing a discrete search through the space of potential classifiers, which is an NP-complete
problem, the 1-norm of the parameter vector can be used to provide a “soft” measure.
Any convex optimization criterion with convex constraints will have a global minimum and it can
be shown that the 1-norm complexity measure is convex. The talk will concentrate on the
properties of the 1-norm prior, showing why it produces sparse classification models, and
compare it with other complexity measures.

The 1-norm’s derivative discontinuity at 0 is fundamental to producing sparse models, and it will
be shown that other norms that do not possess this property will not produce truly sparse
models. In addition, because the 1-norm’s gradient is constant in each region where the parameters’ signs
do not change, an efficient algorithm can be generated that calculates every globally optimal
sparse classifier as a function of the regularization parameter. This is because the local Hessian
matrix is a constant in that region. The complete algorithm for generating every sparse classifier,
and thus every sparse feature set, will be briefly described [2], and a discussion will be given
about visualizing the information provided by the parameter trajectories. This is an important
aspect of providing a qualitative understanding of the classification problem’s sensitivity and gives
the designer some indication of the conditional independence of the different features as
the classifier’s complexity changes.
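The effect of the derivative discontinuity can be seen in one dimension: minimizing (θ − a)² plus a 1-norm penalty has a closed-form solution that is exactly zero whenever |a| is small, whereas the corresponding 2-norm penalty only shrinks θ and never zeroes it. A minimal sketch (the values of a and λ below are arbitrary):

```python
# Closed-form minimisers of (theta - a)^2 plus each penalty, illustrating
# why the 1-norm's kink at zero produces exact sparsity and the smooth
# 2-norm does not. The inputs are illustrative values only.

def argmin_l1(a, lam):
    """argmin over theta of (theta - a)^2 + lam * |theta| (soft threshold)."""
    if a > lam / 2:
        return a - lam / 2
    if a < -lam / 2:
        return a + lam / 2
    return 0.0   # the kink at zero absorbs small coefficients exactly

def argmin_l2(a, lam):
    """argmin over theta of (theta - a)^2 + lam * theta^2 (pure shrinkage)."""
    return a / (1 + lam)   # never exactly zero unless a is

lam = 1.0
small, large = 0.3, 2.0
```

Here `argmin_l1(small, lam)` is exactly 0.0 while `argmin_l2(small, lam)` is merely scaled down, which is the distinction between truly sparse and merely shrunken models.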

The algorithm has been tested with a variety of datasets, including Australian Credit, a real
dataset consisting of 690 observations about whether or not credit should be granted, using 14
potential attributes. The error curve and parameter trajectory for a linear model are shown in
Figures 1 and 2 respectively, using a transformed regularization parameter log(λ+1) to emphasize
the sparse models.

As can be seen, for a large range of sparse models, the error rate is approximately constant and
there is a single dominant parameter. When λ is small, a number of parameters start to reverse
their sign as the regularization parameter is further reduced, demonstrating the presence of
Simpson’s paradox. The advantage of this approach is that the properties of all the sparse models
can be directly visualized and the behaviour of the parameters discussed. In addition, the
approach is efficient, compared to re-training a single classifier, even though every sparse
classifier is generated.
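The flavour of tracing every sparse solution against the regularization parameter can be sketched in the special case of orthonormal features, where the globally optimal 1-norm-penalized solution is simply a soft threshold of the unpenalized coefficients, so the whole path is piecewise linear. This toy path is not the talk's algorithm, only an illustration of how the active feature set shrinks as λ grows; the coefficients below are made up, not taken from the Australian Credit data:

```python
# Trace the 1-norm-penalised solution over a grid of lambda values,
# assuming orthonormal features so each coefficient decouples and the
# optimum is soft_threshold(ols_coef, lam / 2) for the objective
# sum_j (theta_j - c_j)^2 + lam * ||theta||_1.

def soft_threshold(v, t):
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def path(ols_coefs, lams):
    """Return (lambda, solution, number of active features) per grid point."""
    out = []
    for lam in lams:
        theta = [soft_threshold(c, lam / 2) for c in ols_coefs]
        out.append((lam, theta, sum(1 for t in theta if t != 0.0)))
    return out

ols = [2.0, -0.8, 0.1]       # unpenalised (least-squares) coefficients
grid = [0.0, 0.5, 2.0, 5.0]  # increasing regularization
trace = path(ols, grid)
```

Plotting the solutions in `trace` against λ gives exactly the kind of parameter-trajectory picture described above: features leave the model one by one as λ increases.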

Other aspects, such as the generalization of the problem to non-linear classifier structures, the
development of parametric uncertainty/error bars for the inactive features, the inclusion of prior
weighting parameters on the features [3] and the development of globally optimal, sparse,
adaptive classifiers, will also be briefly discussed.



[1] Vapnik, V. N., Statistical Learning Theory. Wiley, New York, 1998.

[2] Brown, M., Exploring the set of sparse, optimal classifiers. Proceedings of the International
Workshop on Artificial Neural Networks in Pattern Recognition, pages **–**, 2003.

[3] Costen, N. P. and Brown, M., Exploratory sparse models for face classification. Proceedings of the
British Machine Vision Conference, vol 1, pages 13–22, 2003.

Figure 1. Classification errors versus the regularization parameter, log(λ+1).

Figure 2. Parameter trajectories versus the regularization parameter, log(λ+1).