Selection of Relevant Features and Examples in Machine Learning

Paper by: Avrim L. Blum and Pat Langley

Presented by:
Arindam Bhattacharya (10305002)
Akshat Malu (10305012)
Yogesh Kakde (10305039)
Tanmay Haldankar (10305911)

Overview

Introduction
Selecting Relevant Features
  Embedded Approaches
  Filter Approaches
  Wrapper Approaches
  Feature Weighting Approaches
Selecting Relevant Examples
  Selecting Labeled Data
  Selecting Unlabeled Data
Challenges and Future Work

Introduction

Machine learning is addressing larger and more complex tasks.

The Internet contains a huge volume of low-quality information.

We focus on:
  Selecting the most relevant features
  Selecting the most relevant examples

Problems of Irrelevant Features

Not helpful in classification.

Slow down the learning process [1].

The number of training examples required grows exponentially with the number of irrelevant features [2].

[1] Cover and Hart, 1967   [2] Langley and Iba, 1993
Blum et al., 1997

Definitions of Relevance

Definition 1: Relevant to the Target

A feature x_i is relevant to a target concept C if there exists a pair of examples A and B such that A and B differ only in feature x_i and c(A) ≠ c(B).

John, Kohavi and Pfleger, 1994

Definitions of Relevance

Definition 2: Strongly Relevant to the Sample

A feature x_i is said to be strongly relevant to a sample S if there exist examples A and B in S that differ only in feature x_i and have different labels.

John, Kohavi and Pfleger, 1994

Definitions of Relevance

Definition 3: Weakly Relevant to the Sample

A feature x_i is said to be weakly relevant to the sample S if it is possible to remove a subset of the features so that x_i becomes strongly relevant.

Blum et al., 1997

Definitions of Relevance

Definition 4: Relevance as a Complexity Measure

Given a sample S and a set of concepts C, let r(S,C) be the number of features relevant (using Definition 1) to that concept in C which, out of all those whose error over S is least, has the fewest relevant features.

Caruana and Freitag, 1994

Definitions of Relevance

Definition 5: Incremental Usefulness

Given a sample S, a learning algorithm L, and a feature set A, feature x_i is incrementally useful to L if the accuracy of the hypothesis that L produces using the feature set {x_i} ∪ A is better than the accuracy achieved using just the feature set A.

Example

Suppose concepts can be expressed as disjunctions and the algorithm sees the following examples:

  x1  x2  x3  x4  x5  | label
   1   0   0   0   0  |   +
   1   1   1   0   0  |   +
   0   0   0   1   0  |   +
   0   0   0   0   1  |   +
   0   0   0   0   0  |   -

Example

Using Definitions 2 and 3, we can say that x1 is strongly relevant while x2 is weakly relevant.

Using Definition 4, we can say that there are three relevant features (r(S,C) = 3).

Using Definition 5, given the feature set {x1, x2}, the third feature may not be useful, but features x4 and x5 would be useful.

(These relevance checks can be automated; a short sketch follows below.)
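The sketch below is our own illustration (not code from the paper): it tests Definitions 2 and 3 by brute force on the sample S from the previous slide, with the labels + / - encoded as 1 / 0.

```python
from itertools import combinations

# The sample from the previous slide; labels + / - encoded as 1 / 0.
S = [((1, 0, 0, 0, 0), 1),
     ((1, 1, 1, 0, 0), 1),
     ((0, 0, 0, 1, 0), 1),
     ((0, 0, 0, 0, 1), 1),
     ((0, 0, 0, 0, 0), 0)]

def strongly_relevant(sample, i):
    """Definition 2: some pair of examples in the sample differs only in
    feature i and carries different labels."""
    for (a, ca), (b, cb) in combinations(sample, 2):
        differing = [j for j in range(len(a)) if a[j] != b[j]]
        if differing == [i] and ca != cb:
            return True
    return False

def weakly_relevant(sample, i):
    """Definition 3: feature i becomes strongly relevant after removing some
    subset of the other features (checked by brute force over all subsets)."""
    n = len(sample[0][0])
    others = [j for j in range(n) if j != i]
    for r in range(len(others) + 1):
        for removed in combinations(others, r):
            kept = [j for j in range(n) if j not in removed]
            reduced = [(tuple(x[j] for j in kept), c) for x, c in sample]
            if strongly_relevant(reduced, kept.index(i)):
                return True
    return False

print(strongly_relevant(S, 0))  # x1 -> True
print(strongly_relevant(S, 1))  # x2 -> False
print(weakly_relevant(S, 1))    # x2 -> True (e.g. after removing x1 and x3)
```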

Feature Selection as Heuristic Search

Heuristic search is an ideal paradigm for feature selection algorithms.

[Figure: the search space as a partial order over feature subsets]

Four Basic Issues

Where to start?
  Forward Selection (start with no features, add one at a time)
  Backward Elimination (start with all features, remove one at a time)


Four Basic Issues

How to organize the search?
  Exhaustive search: 2^n possibilities for n attributes
  Greedy search:
    Hill climbing
    Best-first search


Four Basic Issues

Which alternative is better? (strategy for evaluating alternatives)
  Accuracy on the training set or on a separate evaluation set
  Degree of interaction between feature selection and the basic induction algorithm

Four Basic Issues

When to stop?
  Stop as soon as nothing improves
  Keep going until things worsen
  Reach the end of the search and select the best subset seen
  Stop when each combination of selected features maps to a single class
  Order features by relevance and determine a break point

An Example: The Set-Cover Algorithm

  1. Begin with a disjunction of 0 features.
  2. While any safe feature that improves performance is left:
       from the safe features, select the one that maximizes the number of
       correctly classified positive examples.
  3. Output the selected features.

In terms of the search-space figure:
  Begins at the left of the figure (the empty feature set)
  Incrementally moves right (adding one feature at a time)
  Evaluates each step by performance on the training set, with a penalty for misclassifying negative examples
  Halts when no further step improves performance

A sketch of this procedure follows below.
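This is a minimal sketch, not the paper's implementation, under the assumption that the target is a monotone disjunction and that a "safe" feature is one that is 0 in every negative example (so adding it can never misclassify a negative).

```python
def set_cover_disjunction(examples):
    """Greedy set cover for a monotone disjunction hypothesis.
    examples: list of (feature_tuple, label) with Boolean features, label in {0, 1}."""
    n = len(examples[0][0])
    positives = [x for x, c in examples if c == 1]
    negatives = [x for x, c in examples if c == 0]

    # "Safe" features are never 1 on a negative example, so adding them to the
    # disjunction cannot cause a negative example to be misclassified.
    safe = [i for i in range(n) if all(x[i] == 0 for x in negatives)]

    chosen, uncovered = [], list(positives)
    while uncovered and safe:
        # Pick the safe feature covering the most still-uncovered positive examples.
        best = max(safe, key=lambda i: sum(x[i] for x in uncovered))
        if sum(x[best] for x in uncovered) == 0:
            break                        # no remaining safe feature improves anything
        chosen.append(best)
        uncovered = [x for x in uncovered if x[best] == 0]
    return chosen

sample = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
          ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
print([i + 1 for i in set_cover_disjunction(sample)])   # -> [1, 4, 5]
```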

Feature Selection Methods

Feature selection methods are grouped into three classes:
  Those that embed the selection within the induction algorithm
  Those that use a feature selection algorithm to filter the attributes passed to the induction algorithm
  Those that treat feature selection as a wrapper around the induction process

Embedded Approaches to Feature Selection

For this class of algorithms, feature selection is embedded within the basic induction algorithm.

Most algorithms for inducing logical concepts (e.g. the set-cover algorithm) add or remove features from the concept description based on prediction errors.

For these algorithms, the feature space is also the concept space.

Embedded Approaches in a Binary Feature Space

Give attractive results for systems learning pure conjunctive (or pure disjunctive) rules:
  the hypothesis found uses at most a logarithmic factor more features than the smallest possible hypothesis!

Also apply in settings where the target hypothesis is characterized by a conjunction (or disjunction) of functions produced by induction algorithms:
  e.g. algorithms for learning DNF in O(n^log n) time [1]

[1] Verbeurgt, 1990

Embedded Approaches for Complex Logical Concepts

In this approach, the core method adds/removes features while inducing complex logical concepts.
  e.g. ID3 [1] and C4.5 [2]

Greedy search through the space of decision trees:
  at each stage, select the attribute that best discriminates among the classes using an evaluation function, usually based on information theory (a sketch of such an evaluation function follows below).

[1] Quinlan, 1983
[2] Quinlan, 1993
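To make the evaluation-function idea concrete, here is a generic information-gain scorer of the kind such tree inducers use at each stage. It is our sketch of the standard calculation, not Quinlan's code, applied to the earlier five-feature sample.

```python
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, i):
    """Reduction in label entropy obtained by splitting the examples on attribute i."""
    labels = [c for _, c in examples]
    gain = entropy(labels)
    for value in set(x[i] for x, _ in examples):
        subset = [c for x, c in examples if x[i] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

sample = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
          ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
scores = {i + 1: round(information_gain(sample, i), 3) for i in range(5)}
print(scores)   # x1 gets the highest gain on this sample
```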


Embedded Approaches: Scalability Issues

Experimental studies [1] suggest that, for some target concepts, decision-list learners scale linearly with an increase in irrelevant features.

For other target concepts, they exhibit exponential growth.

Kira and Rendell (1992) show that there is a substantial decrease in accuracy on inserting irrelevant features into a Boolean target concept.

[1] Langley and Sage, 1997

Embedded Approaches: Remedies

The problems are caused by the reliance on greedy selection of attributes to discriminate among the classes.

Some researchers [1] attempted to replace the greedy approach with look-ahead techniques.

Others let the greedy search take larger steps [2].

None has been able to handle scaling effectively.

[1] Norton, 1989
[2] Matheus and Rendell, 1989; Pagallo and Haussler, 1990

Filter Approaches

Feature selection is done based on some general characteristics of the training set.

The selection is independent of the induction algorithm used, and thus can be combined with any such method.

John et al., 1994

A Simple Filtering Scheme

Evaluate each feature individually based on its correlation with the target function.

Select the ‘k’ features with the highest value.

The best choice of ‘k’ can be determined by testing on a holdout set. (A minimal sketch follows below.)

Blum et al., 1997
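A minimal sketch of this filtering scheme, assuming absolute Pearson correlation as the per-feature score (any correlation-style measure would do):

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Score each feature by |Pearson correlation| with the target and keep the k best.
    X: (n_samples, n_features) 0/1 array; y: (n_samples,) array of 0/1 labels."""
    scores = []
    for j in range(X.shape[1]):
        column = X[:, j]
        if column.std() == 0:                 # a constant feature carries no information
            scores.append(0.0)
        else:
            scores.append(abs(np.corrcoef(column, y)[0, 1]))
    return sorted(int(j) for j in np.argsort(scores)[-k:])

X = np.array([[1, 0, 0, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]])
y = np.array([1, 1, 1, 1, 0])
print(top_k_by_correlation(X, y, k=2))   # feature 0 (x1) scores highest on this sample
```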

FOCUS Algorithm

Looks for the minimal combination of attributes that perfectly discriminates among the classes (a compact sketch follows after the figure).

Halts only when a pure partition of the training set is generated.

Performance: under similar conditions, FOCUS was almost unaffected by the introduction of irrelevant attributes, whereas decision-tree accuracy degraded significantly.

[Figure: breadth-first search over feature subsets of {f1, f2, f3, ..., fn}, highlighting one, then two, then three candidate features]

Almuallim et al., 1991
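A compact sketch of the FOCUS idea (ours, not the original implementation): breadth-first search over subset sizes, returning the first subset found, hence a minimal one, under which no two training examples agree on the chosen features yet carry different labels.

```python
from itertools import combinations

def focus(examples):
    """Return a minimal feature subset that perfectly discriminates the classes.
    examples: list of (feature_tuple, label)."""
    n = len(examples[0][0])
    for size in range(n + 1):                       # breadth-first over subset sizes
        for subset in combinations(range(n), size):
            projections = {}
            consistent = True
            for x, c in examples:
                key = tuple(x[i] for i in subset)
                if projections.setdefault(key, c) != c:
                    consistent = False              # two examples collide with different labels
                    break
            if consistent:
                return list(subset)                 # first consistent subset is minimal
    return list(range(n))

sample = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
          ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
print([i + 1 for i in focus(sample)])               # -> [1, 4, 5]
```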

Comparing Various Filter Approaches

AUTHORS (SYSTEM)    | STARTING POINT | SEARCH CONTROL | HALTING CRITERION | INDUCTION ALGORITHM
ALMUALLIM (FOCUS)   | NONE           | BREADTH FIRST  | CONSISTENCY       | DECISION TREE
CARDIE              | NONE           | GREEDY         | CONSISTENCY       | NEAR. NEIGH.
KOLLER/SAHAMI       | ALL            | GREEDY         | THRESHOLD         | TREE/BAYES
KUBAT et al.        | NONE           | GREEDY         | CONSISTENCY       | NAÏVE BAYES
SINGH/PROVAN        | NONE           | GREEDY         | NO INFO. GAIN     | BAYES NET

Blum et al., 1997

Wrapper Approaches (1/2)

Motivation: the features selected should depend not only on the relevance of the data, but also on the learning algorithm.

John et al., 1994

Wrapper Approaches (2/2)

Advantage: the inductive method that uses the feature subset provides a better estimate of accuracy than a separate measure that may have an entirely different inductive bias.

Disadvantage: computational cost, which results from calling the induction algorithm for each feature set considered.

Modifications:
  Caching decision trees
  Reducing the percentage of training cases


OBLIVION Algorithm

Carries out a backward elimination search through the space of feature sets.

Starts with all features and iteratively removes the one whose removal leads to the tree with the greatest improvement in estimated accuracy.

Continues this process as long as the estimated accuracy does not get worse. (A generic sketch of this wrapper loop follows below.)

Langley et al., 1994
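The sketch below shows the generic wrapper loop with backward elimination that OBLIVION follows. It is our simplification, with `estimate_accuracy` standing in (hypothetically) for "run the induction algorithm with this feature subset and estimate its accuracy", e.g. by leave-one-out cross-validation.

```python
def backward_elimination(features, estimate_accuracy):
    """Generic wrapper loop: repeatedly drop the feature whose removal gives the best
    estimated accuracy, stopping once every removal makes the estimate worse.
    estimate_accuracy(frozenset) -> float wraps the chosen induction algorithm."""
    current = frozenset(features)
    current_score = estimate_accuracy(current)
    while len(current) > 1:
        candidates = [(estimate_accuracy(current - {f}), current - {f}) for f in current]
        best_score, best_subset = max(candidates, key=lambda pair: pair[0])
        if best_score < current_score:        # every removal hurts: stop
            break
        current, current_score = best_subset, best_score
    return current, current_score

# Toy stand-in for "train the inducer and estimate accuracy": pretend only
# features {0, 3, 4} matter.  (Purely illustrative.)
toy_estimate = lambda subset: 0.9 if {0, 3, 4} <= subset else 0.6
print(backward_elimination(range(5), toy_estimate))   # keeps {0, 3, 4} with score 0.9
```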

Comparing Various Wrapper Approaches

AUTHORS (SYSTEM)          | STARTING POINT | SEARCH CONTROL | HALTING CRITERION | INDUCTION ALGORITHM
CARUANA/FREITAG (CAP)     | COMPARISON     | GREEDY         | ALL USED          | DEC. TREE
JOHN/KOHAVI/PFLEGER       | COMPARISON     | GREEDY         | NO BETTER         | DEC. TREE
LANGLEY/SAGE (OBLIVION)   | ALL            | GREEDY         | WORSE             | NEAR. NEIGH.
LANGLEY/SAGE (SEL. BAYES) | NONE           | GREEDY         | WORSE             | NAÏVE BAYES
MOORE/LEE (RACE)          | COMPARISON     | GREEDY         | NO BETTER         | NEAR. NEIGH.
SINGH/PROVAN (K2-AS)      | NONE           | GREEDY         | WORSE             | BAYES NET
SKALAK                    | RANDOM         | MUTATION       | ENOUGH TIMES      | NEAR. NEIGH.

Blum et al., 1997

Feature Selection v/s Feature Weighting

FEATURE SELECTION:
  Explicitly attempts to select a ‘most relevant’ subset of features.
  Most natural when the result is to be understood by humans, or fed into another algorithm.
  Most commonly characterized in terms of heuristic search.

FEATURE WEIGHTING:
  Assigns degrees of perceived relevance to features via a weighting function.
  Easier to implement in on-line, incremental settings.
  Most common techniques involve some form of gradient descent, updating weights in successive passes through the training instances.

Winnow Algorithm

1. Initialize the weights w_1, w_2, ..., w_n of the features to 1.
2. Given an example (x_1, ..., x_n), output 1 if w_1 x_1 + ... + w_n x_n ≥ n, and output 0 otherwise.
3. If the algorithm predicts negative on a positive example: for each x_i equal to 1, double the value of w_i.
4. If the algorithm predicts positive on a negative example: for each x_i equal to 1, cut the value of w_i in half.

(A short implementation sketch follows below.)

Littlestone, 1988
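A minimal sketch of Winnow exactly as stated above (threshold n, multiplicative factor 2), run for a few passes over the earlier sample:

```python
def winnow_train(examples, passes=10):
    """Winnow with threshold n and multiplicative factor 2.
    examples: list of (x, y) where x is a tuple of 0/1 features and y is 0/1."""
    n = len(examples[0][0])
    w = [1.0] * n                                        # step 1: all weights start at 1
    for _ in range(passes):
        for x, y in examples:
            predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
            if predicted == 0 and y == 1:                # missed a positive: promote
                w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
            elif predicted == 1 and y == 0:              # false alarm on a negative: demote
                w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w

sample = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
          ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
print(winnow_train(sample))   # weights for x1, x4, x5 grow; x2, x3 stay small
```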

References

Avrim L. Blum and Pat Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97 (1-2) (1997) 245-271.

D. Aha, A study of instance-based algorithms for supervised learning tasks: mathematical, empirical and psychological evaluations, University of California, Irvine, CA (1990).

K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time, in: Proceedings 3rd Annual Workshop on Computational Learning Theory, San Francisco, CA, Morgan Kaufmann, San Mateo, CA (1990) 314-325.

T.M. Cover and P.E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inform. Theory 13 (1967) 21-27.

P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm, in: Proceedings IJCAI-93 (1993) 889-894.

References (contd…)

G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA (1994) 121-129.

J.R. Quinlan, Learning efficient classification procedures and their application to chess end games, in: R.S. Michalski, J.G. Carbonell and T.M. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA (1983).

J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA (1993).

C.J. Matheus and L.A. Rendell, Constructive induction on decision trees, in: Proceedings IJCAI-89, Detroit, MI, Morgan Kaufmann, San Mateo, CA (1989) 645-650.

N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear threshold algorithm, Machine Learning 2 (1988) 285-318.