Selection of Relevant Features and Examples in Machine Learning
Paper by: Avrim L. Blum, Pat Langley
Presented by:
Arindam Bhattacharya (10305002)
Akshat Malu (10305012)
Yogesh Kakde (10305039)
Tanmay Haldankar (10305911)
Overview
Introduction
Selecting Relevant Features
  Embedded Approaches
  Filter Approaches
  Wrapper Approaches
  Feature Weighting Approaches
Selecting Relevant Examples
  Selecting Labeled Data
  Selecting Unlabeled Data
Challenges and Future Work
Introduction
Machine learning systems are addressing larger and more complex tasks.
The Internet holds a huge volume of low-quality information.
We focus on:
  Selecting the most relevant features
  Selecting the most relevant examples
[1] Cover and Hart, 1967 [2] Langley and Iba, 1993

Problems of Irrelevant Features
Not helpful in classification
Slow down the learning process [1]
The number of training examples required grows exponentially with the number of irrelevant features [2]
Blum et al., 1997
Definitions of Relevance
Definition 1: Relevant to Target:
A feature xi is relevant to a target concept C if there exists a pair of examples A and B such that A and B differ only in feature xi and c(A) ≠ c(B).
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 2: Strongly Relevant to the Sample:
A feature xi is said to be strongly relevant to the sample S if there exist examples A and B in S that differ only in feature xi and have different labels.
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 3: Weakly Relevant to the Sample:
A feature xi is said to be weakly relevant to the sample S if it is possible to remove a subset of the features so that xi becomes strongly relevant.
Blum et al, 1997
Definitions of Relevance
Definition 4: Relevance as a Complexity Measure:
Given a sample S and a set of concepts C, let r(S,C) be the number of features relevant (using Definition 1) to a concept in C that, out of all those whose error over S is least, has the fewest relevant features.
Caruana and Freitag, 1994
Definitions of Relevance
Definition 5: Incremental Usefulness:
Given a sample S, a learning algorithm L, and a feature set A, feature xi is incrementally useful to L if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ A is better than the accuracy achieved using just the feature set A.
Example
Consider concepts that can be expressed as disjunctions, and suppose the algorithm sees the following examples:

x1 x2 x3 x4 x5 | label
 1  0  0  0  0 |   +
 1  1  1  0  0 |   +
 0  0  0  1  0 |   +
 0  0  0  0  1 |   +
 0  0  0  0  0 |   −
Example
Using Definitions 2 and 3, we can say that x1 is strongly relevant while x2 is weakly relevant.
Using Definition 4, we can say that there are three relevant features (r(S,C) = 3).
Using Definition 5, given the feature set {x1, x2}, the third feature may not be useful, but features x4 and x5 would be useful.
Feature Selection as Heuristic Search
Heuristic search is an ideal paradigm for feature selection algorithms.

Feature Selection as Heuristic Search
[Figure: the partial order over the search space of feature subsets]
Four Basic Issues
Where to start?
Forward Selection
Backward Elimination
Four Basic Issues
How to organize the search?
Exhaustive search: 2^n possibilities for n attributes
Greedy search: hill climbing, "first is best"
Four Basic Issues
Which is better? A strategy to evaluate alternatives:
Accuracy on the training set or on a separate evaluation set
Interaction between feature selection and the basic induction algorithm
Four Basic Issues
When to stop?
Stop when nothing improves
Go on until things worsen
Reach the end and select the best
Stop when each combination of selected features maps to a single class
Order features by relevance and determine a break point
An Example – Set Cover Algorithm
Begin with a disjunction of 0 features (the left of the figure)
From the safe features, select the one that maximizes the number of correctly classified +ive examples
Repeat while any safe feature that improves performance is left, then output the selected features
Incrementally moves right through the search space
Evaluates based on performance on the training set, with an ∞ penalty for misclassifying a −ve example
Halts when no further step improves performance
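The loop above can be sketched as greedy set cover. We read "safe" as a feature that is false on every negative example, so adding it can never trigger the ∞ penalty; names and this reading are ours, not the paper's code:

```python
def set_cover_disjunction(pos, neg, n):
    """Greedily build a disjunction of features that covers the positive
    examples while never firing on a negative example."""
    # A feature is "safe" if it is 0 on all negative examples.
    safe = {i for i in range(n) if all(x[i] == 0 for x in neg)}
    uncovered = list(pos)
    selected = []
    while uncovered:
        # Pick the safe feature covering the most uncovered positives.
        best = max(safe - set(selected),
                   key=lambda i: sum(x[i] == 1 for x in uncovered),
                   default=None)
        if best is None or not any(x[best] == 1 for x in uncovered):
            break  # no remaining safe feature improves performance
        selected.append(best)
        uncovered = [x for x in uncovered if x[best] == 0]
    return selected

# The sample from the earlier example slide.
pos = [(1, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 1)]
neg = [(0, 0, 0, 0, 0)]
print(set_cover_disjunction(pos, neg, 5))  # features of x1 v x4 v x5
```

On the earlier sample this recovers the three-feature disjunction x1 ∨ x4 ∨ x5, matching r(S,C) = 3 from Definition 4.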
Feature Selection Methods
•
Feature selection methods are grouped into three classes:
•
Those that embed the selection within the induction algorithm
•
Those that use a feature selection algorithm to filter the attributes passed to the induction algorithm
•
Those that treat feature selection as a wrapper around the induction process
Embedded Approaches to Feature Selection
•
For this class of algorithms, feature selection is embedded within the basic induction algorithm.
•
Most algorithms for inducing logical concepts (e.g. the set-cover algorithm) add or remove features from the concept description based on prediction errors.
•
For these algorithms, the feature space is also the concept space.
Embedded Approaches in Binary Feature Space
•
Give attractive results for systems learning pure conjunctive (or pure disjunctive) rules.
•
At most a logarithmic factor more than the smallest possible hypothesis!
•
Also applies in settings where the target hypothesis is characterized by a conjunction (or disjunction) of functions produced by induction algorithms
•
e.g.: algorithms for learning DNF in O(n^log n) time [1]
[1] (Verbeurgt, 1990)
Embedded Approaches for Complex Logical Concepts
•
In this approach, the core method adds/removes features to induce complex logical concepts.
•
e.g. ID3 [1] and C4.5 [2]
•
Greedy search through the space of decision trees
•
Each stage selects the attribute that best discriminates among the classes, using an evaluation function (usually based on information theory)
[1] (Quinlan, 1983) [2] (Quinlan, 1993)
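The per-stage selection step can be sketched with the usual information-gain criterion; this is a toy illustration of the idea, not Quinlan's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Reduction in label entropy from splitting on attribute attr."""
    gain = entropy(labels)
    for value in set(x[attr] for x in examples):
        subset = [l for x, l in zip(examples, labels) if x[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def best_attribute(examples, labels, attrs):
    """The greedy choice ID3 makes at each node of the tree."""
    return max(attrs, key=lambda a: information_gain(examples, labels, a))

X = [(1, 0), (1, 1), (0, 0), (0, 1)]
y = ['+', '+', '-', '-']
print(best_attribute(X, y, [0, 1]))  # -> 0: attribute 0 splits the labels perfectly
```

Attribute 0 has gain 1 bit while attribute 1 has gain 0, so greedy search embeds the relevant feature into the tree and never touches the irrelevant one.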
Embedded Approaches: Scalability Issues
•
Experimental studies [1] suggest decision-list learners scale linearly with an increase in irrelevant features
•
For other target concepts, they exhibit exponential growth.
•
(Kira and Rendell, 1992) show that there is a substantial decrease in accuracy on inserting irrelevant features into a Boolean target concept.
[1] (Langley and Sage, 1997)
Embedded Approaches: Remedies
•
Problems are caused by reliance on greedy selection of attributes to discriminate among classes.
•
Some researchers [1] attempted to replace the greedy approach with look-ahead techniques.
•
Others let greedy search take larger steps [2].
•
None has been able to handle scaling effectively.
[1] (Norton, 1989) [2] (Matheus and Rendell, 1989; Pagallo and Haussler, 1990)
Filter Approaches
•
Feature selection is done based on some general characteristics of the training set.
•
Independent of the induction algorithm used, and thus can be combined with any such method.
John et al, 1994.
A Simple Filtering Scheme
•
Evaluate each feature individually based on its correlation with the target function.
•
Select the 'k' features with the highest value.
•
The best choice of 'k' can be determined by testing on a holdout set.
Blum et al, 1997.
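The scheme above translates directly into code. The slides do not fix a correlation measure, so as one plausible choice this sketch uses the absolute Pearson correlation with the target:

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def top_k_features(X, y, k):
    """Score each feature by |correlation| with the target, keep the best k."""
    n_features = len(X[0])
    scores = [(abs(pearson([row[i] for row in X], y)), i)
              for i in range(n_features)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

X = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
y = [1, 1, 0, 0]
print(top_k_features(X, y, 1))  # -> [0]: feature 0 tracks the labels exactly
```

Because each feature is scored independently of the others and of any induction algorithm, the selected subset can be handed to any learner, which is exactly the filter property noted above.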
FOCUS Algorithm
•
Looks for the minimal combination of attributes that perfectly discriminates among the classes
•
Halts only when a pure partition of the training set is generated
•
Performance: Under similar conditions, FOCUS was almost unaffected by the introduction of irrelevant attributes, whereas decision-tree accuracy degraded significantly.
[Figure: breadth-first search through the lattice of feature subsets {f1, f2, f3, …, fn}, with different features highlighted at each node]
Almuallim et al., 1991.
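FOCUS's breadth-first search over the subset lattice can be sketched as: try all subsets of size 0, then size 1, and so on, returning the first subset that perfectly discriminates the classes. A minimal sketch under our own naming:

```python
from itertools import combinations

def consistent(X, y, subset):
    """A subset is consistent if no two examples agree on all its
    features yet carry different labels (a pure partition)."""
    seen = {}
    for x, label in zip(X, y):
        key = tuple(x[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True

def focus(X, y):
    """Breadth-first search: smallest consistent subset wins."""
    n = len(X[0])
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if consistent(X, y, subset):
                return subset  # minimal perfectly discriminating subset
    return tuple(range(n))

# The sample from the earlier example slide.
X = [(1, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 0, 1, 0),
     (0, 0, 0, 0, 1), (0, 0, 0, 0, 0)]
y = ['+', '+', '+', '+', '-']
print(focus(X, y))  # -> (0, 3, 4): the features of x1 v x4 v x5
```

The exhaustive sweep over subsets is what makes FOCUS immune to irrelevant attributes, and also why its worst case is exponential in the subset size.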
Comparing Various Filter Approaches

AUTHORS (SYSTEM)   | STARTING POINT | SEARCH CONTROL | HALTING CRITERION | INDUCTION ALGORITHM
ALMUALLIM (FOCUS)  | NONE           | BREADTH FIRST  | CONSISTENCY       | DECISION TREE
CARDIE             | NONE           | GREEDY         | CONSISTENCY       | NEAR. NEIGH.
KOLLER/SAHAMI      | ALL            | GREEDY         | THRESHOLD         | TREE/BAYES
KUBAT et al.       | NONE           | GREEDY         | CONSISTENCY       | NAÏVE BAYES
SINGH/PROVAN       | NONE           | GREEDY         | NO INFO. GAIN     | BAYES NET

Blum et al, 1997.
Wrapper Approaches (1/2)
•
Motivation: The features selected should depend not only on the relevance of the data, but also on the learning algorithm.
John et al, 1994.
Wrapper Approaches (2/2)
•
Advantage: The inductive method that uses the feature subset provides a better estimate of accuracy than a separate measure that may have an entirely different inductive bias.
•
Disadvantage: Computational cost, which results from calling the induction algorithm for each feature set considered.
•
Modifications:
– Caching decision trees
– Reducing the percentage of training cases
OBLIVION Algorithm
•
It carries out a backward elimination search through the space of feature sets.
•
Start with all features and iteratively remove the one whose removal leads to the tree with the greatest improvement in estimated accuracy.
•
Continue this process as long as the estimated accuracy keeps improving.
Langley et al, 1994.
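The backward elimination loop can be sketched generically. Here `estimate_accuracy` is our stand-in for running the induction algorithm on a candidate subset (e.g. nearest neighbour under cross-validation); it is the expensive call that gives wrappers their computational cost:

```python
def backward_elimination(features, estimate_accuracy):
    """Wrapper-style backward elimination: start with all features and
    repeatedly drop the one whose removal most improves estimated accuracy."""
    current = set(features)
    best_acc = estimate_accuracy(current)
    while len(current) > 1:
        # Evaluate removing each feature in turn (one induction run each).
        candidates = [(estimate_accuracy(current - {f}), f) for f in current]
        acc, worst = max(candidates)
        if acc < best_acc:
            break  # halting criterion: accuracy would get worse
        best_acc, current = acc, current - {worst}
    return current

# Toy accuracy estimate (ours): features 0 and 1 help, the rest add noise.
def toy_accuracy(subset):
    return (0.4 * (0 in subset) + 0.4 * (1 in subset)
            - 0.05 * len(subset - {0, 1}))

print(backward_elimination(range(5), toy_accuracy))  # -> {0, 1}
```

Each pass costs one induction run per remaining feature, which is why the comparison table below lists greedy control for nearly every wrapper system.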
Comparing Various Wrapper Approaches

AUTHORS (SYSTEM)          | STARTING POINT | SEARCH CONTROL | HALTING CRITERION | INDUCTION ALGORITHM
CARUANA/FREITAG (CAP)     | COMPARISON     | GREEDY         | ALL USED          | DEC. TREE
JOHN/KOHAVI/PFLEGER       | COMPARISON     | GREEDY         | NO BETTER         | DEC. TREE
LANGLEY/SAGE (OBLIVION)   | ALL            | GREEDY         | WORSE             | NEAR. NEIGH.
LANGLEY/SAGE (SEL. BAYES) | NONE           | GREEDY         | WORSE             | NAÏVE BAYES
MOORE/LEE (RACE)          | COMPARISON     | GREEDY         | NO BETTER         | NEAR. NEIGH.
SINGH/PROVAN (K2-AS)      | NONE           | GREEDY         | WORSE             | BAYES NET
SKALAK                    | RANDOM         | MUTATION       | ENOUGH TIMES      | NEAR. NEIGH.

Blum et al, 1997.
Feature Selection v/s Feature Weighting

FEATURE SELECTION | FEATURE WEIGHTING
Explicitly attempts to select a 'most relevant' subset of features. | Assigns degrees of perceived relevance to features via a weighting function.
Most natural when the result is to be understood by humans, or fed into another algorithm. | Easier to implement in on-line, incremental settings.
Most commonly characterized in terms of heuristic search. | Most common techniques involve some form of gradient descent, updating weights in successive passes through the training instances.
Winnow Algorithm
1. Initialize the weights w1, w2, …, wn of the features to 1.
2. Given an example (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n, and output 0 otherwise.
3. If the algorithm predicts negative on a positive example: for each xi equal to 1, double the value of wi.
4. If the algorithm predicts positive on a negative example: for each xi equal to 1, halve the value of wi.
Littlestone, 1988.
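The four steps above transcribe almost line for line into code (the training set below is our own toy illustration):

```python
def winnow_train(examples, n, passes=10):
    """Winnow: multiplicative weight updates keep the mistake bound
    logarithmic in the number of irrelevant features."""
    w = [1.0] * n
    for _ in range(passes):
        for x, label in examples:
            prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
            if prediction == 0 and label == 1:    # false negative: promote
                w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
            elif prediction == 1 and label == 0:  # false positive: demote
                w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w

# Toy target concept: x1 OR x2 over 8 features; the rest are irrelevant.
examples = [((1, 0, 1, 0, 1, 0, 1, 0), 1),
            ((0, 1, 0, 1, 0, 1, 0, 1), 1),
            ((0, 0, 1, 1, 1, 1, 1, 1), 0)]
w = winnow_train(examples, 8)
print(w[0] > w[2], w[1] > w[3])  # relevant features end up weighted higher
```

After a few passes the weights of the two relevant features dominate, illustrating feature weighting: no feature is ever discarded outright, but the irrelevant ones are driven toward negligible weight.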
References
•
Avrim L. Blum and Pat Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97 (1–2), pp. 245–271 (1997).
•
D. Aha, A study of instance-based algorithms for supervised learning tasks: mathematical, empirical and psychological evaluations. University of California, Irvine, CA (1990).
•
K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time. In: Proceedings 3rd Annual Workshop on Computational Learning Theory, San Francisco, CA, Morgan Kaufmann, San Mateo, CA, pp. 314–325 (1990).
•
T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, pp. 21–27 (1967).
•
P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm. In: Proceedings IJCAI-93, pp. 889–894 (1993).
References (contd…)
•
G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem. In: Proceedings 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp. 121–129 (1994).
•
J.R. Quinlan, Learning efficient classification procedures and their application to chess end games. In: R.S. Michalski, J.G. Carbonell and T.M. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA (1983).
•
J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA (1993).
•
C.J. Matheus and L.A. Rendell, Constructive induction on decision trees. In: Proceedings IJCAI-89, Detroit, MI, Morgan Kaufmann, San Mateo, CA, pp. 645–650 (1989).
•
N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, pp. 285–318 (1988).