1

Feature Selection:

Algorithms and Challenges

Joint Work with Yanglan Gang, Hao Wang & Xuegang Hu


Xindong Wu


University of Vermont, USA;

Hefei University of Technology, China

Changjiang Scholar Chair Professor in Computer Applications, Hefei University of Technology

2

Deduction → Induction: My Research Background

[Timeline figure spanning 1988–2004, including work on expert systems]

3

Outline

1. Why feature selection
2. What is feature selection
3. Components of feature selection
4. Some of my own research efforts
5. Challenges in feature selection

4

1. Why Feature Selection?


High-dimensional data often contain irrelevant or redundant features, which:

- reduce the accuracy of data mining algorithms
- slow down the mining process
- cause problems in storage and retrieval
- make the results hard to interpret

5

2. What Is Feature Selection?



- Select the most “relevant” subset of attributes according to some selection criteria.



6

Outline

1. Why feature selection
2. What is feature selection
3. Components of feature selection
4. Some of my own research efforts
5. Challenges in feature selection

7

Traditional Taxonomy


- Wrapper approach

  Features are selected as part of the mining algorithm, which itself evaluates the merit of candidate feature subsets.

- Filter approach

  Features are selected before the mining algorithm runs, using heuristics based on general characteristics of the data rather than a learning algorithm to evaluate the merit of feature subsets.

The wrapper approach is generally more accurate but also more computationally expensive; a minimal sketch contrasting the two follows.
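To make the contrast concrete, here is a minimal Python/numpy sketch (an illustration, not from the original slides): the filter ranks features by a simple data statistic, while the wrapper scores candidate subsets by running the learner itself. The `score_model` evaluator is a hypothetical stand-in, e.g., cross-validated accuracy.

```python
import numpy as np
from itertools import combinations

def filter_select(X, y, k):
    """Filter: rank features by a data characteristic (here, |corr| with y)."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(int(j) for j in np.argsort(scores)[-k:])

def wrapper_select(X, y, k, score_model):
    """Wrapper: evaluate every k-subset with the learner itself (expensive)."""
    best, best_score = None, float("-inf")
    for subset in combinations(range(X.shape[1]), k):
        s = score_model(X[:, list(subset)], y)  # e.g., cross-validated accuracy
        if s > best_score:
            best, best_score = list(subset), s
    return best
```

The filter never touches the learner, which is why it scales better; the wrapper's loop over subsets is what makes it more accurate but more expensive.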

8

Components of Feature Selection


Feature selection is actually a search problem, including four basic components (a skeleton combining them is sketched below):

1. an initial subset
2. one or more selection criteria
3. a search strategy
4. some given stopping conditions
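As an illustration only, a minimal Python skeleton wiring the four components together; `criterion`, `neighbors`, and `should_stop` are hypothetical stand-ins for whatever concrete choices a given method makes.

```python
def feature_search(initial, criterion, neighbors, should_stop):
    """Generic feature-selection search loop built from the four components.

    initial     -- the initial subset (e.g., empty for forward search)
    criterion   -- selection criterion: scores a candidate subset
    neighbors   -- search strategy: subsets reachable from the current one
    should_stop -- stopping condition (threshold, budget, no improvement, ...)
    """
    current, score = initial, criterion(initial)
    while not should_stop(current, score):
        candidates = [(criterion(s), s) for s in neighbors(current)]
        if not candidates:
            break
        best_score, best = max(candidates, key=lambda t: t[0])
        if best_score <= score:  # no neighbor improves the criterion
            break
        current, score = best, best_score
    return current, score
```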

9

Feature Selection Criteria


Selection criteria generally use “relevance” to estimate the goodness of a selected feature subset in one way or another:

- Distance measure
- Information measure
- Inconsistency measure
- Relevance estimation
- Selection criteria tied to learning algorithms (the wrapper approach)

Some unified frameworks for relevance have been proposed recently.
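As one concrete instance, the information measure is commonly instantiated as information gain. A minimal sketch for discrete features, assuming numpy arrays of category labels:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(x, y):
    """Reduction in class entropy obtained by splitting on feature x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])
    return gain
```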

10

Search Strategy


- Exhaustive search
  - Every possible subset is evaluated and the best one is chosen (a brute-force sketch follows below)
  - Guarantees the optimal solution
  - Low efficiency: 2^M subsets for M features
- A modified approach: branch and bound (B&B)
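A brute-force sketch of exhaustive search; `criterion` is a hypothetical scoring function over tuples of feature indices (branch and bound would prune this enumeration instead of completing it).

```python
from itertools import chain, combinations

def exhaustive_select(n_features, criterion):
    """Evaluate every non-empty feature subset and return the best one.

    Cost is O(2^M) criterion calls for M features, hence the low efficiency.
    """
    subsets = chain.from_iterable(
        combinations(range(n_features), k)
        for k in range(1, n_features + 1))
    return max(subsets, key=criterion)
```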

11

Search Strategy (2)


- Heuristic search
  - Sequential search, including sequential forward selection (SFS), sequential floating forward selection (SFFS), sequential backward selection (SBS), and sequential floating backward selection (SBFS)
  - SFS: start with an empty attribute set
    - add the “best” of the attributes
    - add the “best” of the remaining attributes
    - repeat until the maximum performance is reached (an SFS sketch follows below)
  - SBS: start with the entire attribute set
    - remove the “worst” of the attributes
    - repeat until the maximum performance is reached.
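A minimal SFS sketch; `criterion` is again a hypothetical subset-scoring function (in a wrapper setting it would be the learner's validation accuracy).

```python
def sfs(n_features, criterion):
    """Sequential forward selection: greedily grow the subset while it helps."""
    selected, best_score = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Add the "best" of the remaining attributes.
        score, feat = max(
            (criterion(tuple(selected + [f])), f) for f in remaining)
        if score <= best_score:  # maximum performance reached
            break
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score
```

SBS is the mirror image: start from the full attribute set and repeatedly remove the “worst” feature while performance does not drop.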

12

Search Strategy (3)


- Random search
  - It proceeds in two different ways:
    - inject randomness into classical sequential approaches (simulated annealing, beam search, genetic algorithms, and random-start hill climbing)
    - generate the next subset randomly
  - The use of randomness can help escape local optima in the search space; the optimality of the selected subset then depends on the available resources (a sketch follows below).
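The second way (generating the next subset at random) admits a very small sketch; the quality of the result grows with the trial budget, i.e., with the available resources.

```python
import random

def random_select(n_features, criterion, n_trials=1000, seed=0):
    """Pure random search: sample subsets and keep the best one seen."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        size = rng.randint(1, n_features)
        subset = tuple(sorted(rng.sample(range(n_features), size)))
        score = criterion(subset)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score
```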

13

Outline

1. Why feature selection
2. What is feature selection
3. Components of feature selection
4. Some of my own research efforts
5. Challenges in feature selection

14

RITIO: Rule Induction Two In One


- Feature selection using information gain in reverse order
- Delete the features that are least informative
- Results are significant compared with forward selection
- [Wu et al., 1999, TKDE] (a sketch of the reverse-elimination idea follows below).
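A minimal sketch of the reverse-elimination idea only (an illustrative assumption, not the published RITIO algorithm): score discrete features by information gain and repeatedly delete the least informative one.

```python
import numpy as np

def info_gain(x, y):
    """Information gain of discrete feature x with respect to labels y."""
    def h(labels):
        _, c = np.unique(labels, return_counts=True)
        p = c / c.sum()
        return float(-(p * np.log2(p)).sum())
    return h(y) - sum((x == v).mean() * h(y[x == v]) for v in np.unique(x))

def reverse_eliminate(X, y, n_drop):
    """Repeatedly delete the least informative remaining feature."""
    keep = list(range(X.shape[1]))
    for _ in range(n_drop):
        gains = [info_gain(X[:, j], y) for j in keep]
        keep.pop(int(np.argmin(gains)))  # drop the lowest-gain feature
    return keep
```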

15

Induction as Pre-processing

- Use one induction algorithm to select attributes for another induction algorithm
  - can be a decision-tree method for rule induction, or vice versa
- Accuracy results are not as good as expected
- Reason: feature selection normally causes information loss
- Details: [Wu 1999, PAKDD].

16

Subspacing with Asymmetric Bagging

- When the number of examples is less than the number of attributes
- When the number of positive examples is smaller than the number of negative examples
- An example: content-based information retrieval
- Details: [Tao et al., 2006, TPAMI] (a sketch of the general idea follows below).
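A minimal sketch of the general idea (asymmetric bagging combined with random subspaces), not the TPAMI implementation: each base learner sees all the positives, an equal-sized bootstrap of the negatives, and a random feature subspace; predictions are aggregated by voting. The `fit_base` learner is a hypothetical stand-in.

```python
import numpy as np

def asymmetric_bagging_subspace(X, y, fit_base, n_models=25,
                                subspace_frac=0.5, seed=0):
    """Train base learners on balanced bootstraps and random feature subspaces.

    X: (n, m) data; y: 0/1 labels with few positives; fit_base(X, y) must
    return a model with a predict(X) -> 0/1 method (a stand-in assumption).
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    k = max(1, int(subspace_frac * X.shape[1]))
    models = []
    for _ in range(n_models):
        # All positives plus an equal-sized bootstrap of negatives.
        rows = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=True)])
        feats = rng.choice(X.shape[1], size=k, replace=False)
        models.append((feats, fit_base(X[np.ix_(rows, feats)], y[rows])))

    def predict(X_new):
        votes = np.mean([m.predict(X_new[:, f]) for f, m in models], axis=0)
        return (votes >= 0.5).astype(int)
    return predict
```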

17

Outline

1. Why feature selection
2. What is feature selection
3. Components of feature selection
4. Some of my own research efforts
5. Challenges in feature selection

18

Challenges in Feature Selection (1)


- Dealing with ultra-high-dimensional data and feature interactions
  - Traditional feature selection encounters two major problems when the dimensionality runs into tens or hundreds of thousands:
    1. the curse of dimensionality
    2. the relative shortage of instances.




19

Challenges in Feature Selection (2)


Dealing with active instances (Liu et al., 2005)


When the dataset is huge, feature selection performed on the
whole dataset is inefficient,


so instance selection is necessary:


Random sampling (pure random sampling without
exploiting any data characteristics)


Active feature selection (selective sampling using data
characteristics achieves better or equally good results with
a significantly smaller number of instances).

20

Challenges in Feature Selection (3)


Dealing with new data types (Liu et al., 2005)


traditional data type: an N*M data matrix


Due to the growth of computer and Internet/Web techniques, new
data types are emerging:


text
-
based data (e.g., e
-
mails, online news, newsgroups)


semistructure data (e.g., HTML, XML)


data streams.



21

Challenges in Feature Selection (4)


- Unsupervised feature selection
  - Feature selection vs. classification: some form of feature selection appears in almost every classification algorithm
  - The subspace method and the curse of dimensionality in classification
  - Subspace clustering.

22

Challenges in Feature Selection (5)


Dealing with predictive
-
but
-
unpredictable attributes
in noisy data


Attribute noise is difficult to process, and removing noisy
instances is dangerous


Predictive attributes: essential to classification


Unpredictable attributes: cannot be predicted by the class
and other attributes


Noise identification, cleansing, and measurement
need special attention [Yang et al., 2004]

23

Challenges in Feature Selection (6)


Deal with inconsistent and redundant features


Redundancy can indicate reliability


Inconsistency can also indicate a problem for handling


Researchers in Rough Set Theory: What is the purpose of
feature selection?


Can you really demonstrate the usefulness of reduction, in data
mining accuracy, or what?


Removing attributes can well result in information loss


When the data is very noisy, removals can cause a very different data
distribution


Discretization can possibly bring new issues.

24

Concluding Remarks


Feature selection is and will remain an important
issue in data mining, machine learning, and related
disciplines


Feature selection has a price in accuracy for
efficiency


Researchers need to have the bigger picture in mind,
not just doing selection for the purpose of feature
selection.