Presenter : YU-TING LU


Oct 29, 2013


Intelligent Database Systems Lab


Authors: Harun Uğuz
2011. KBS

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm
Outlines
- Motivation
- Objectives
- Methodology
- Experiments
- Conclusions
- Comments

Motivation
- A major problem of text categorization is its large number of features.
- Most of those are irrelevant noise that can mislead the classifier.

Objectives
- A two-stage approach that combines feature selection and feature extraction is used to improve the performance of text categorization.



Methodology


- Pre-processing
  - Removal of stop-words (e.g. a, an, and, because, can, do, every, the)
  - Stemming (e.g. computer, computing, computation, computes → comput)
  - Term weighting
  - Pruning of the words: prune the words that appear fewer than two times in the documents

[Figure: document-term matrix, with the terms of the document collection on one axis and the documents on the other]
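The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list contains only the slide's examples, `toy_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm, pruning is assumed to count total occurrences across the whole collection, and raw term frequency is assumed as the term weight.

```python
from collections import Counter

# Stop words taken from the slide's examples; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}

# Hypothetical toy stemmer: strips a few suffixes. It is not a real stemming algorithm.
SUFFIXES = ("ation", "ing", "er", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, min_count=2):
    """Return (vocabulary, document-term matrix of raw term counts)."""
    tokenized = []
    for text in documents:
        tokens = [w for w in text.lower().split() if w.isalpha()]
        tokens = [toy_stem(w) for w in tokens if w not in STOP_WORDS]
        tokenized.append(tokens)

    # Prune terms that occur fewer than min_count times in the collection
    # (assumption: occurrences are counted over all documents together).
    totals = Counter(t for tokens in tokenized for t in tokens)
    vocabulary = sorted(t for t, n in totals.items() if n >= min_count)

    # Term weighting: raw term frequency (an assumption for this sketch).
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in vocabulary])
    return vocabulary, matrix
```

Each row of the returned matrix is one document and each column is one surviving term, which matches the document-term layout described on the slide.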

Methodology
- Feature ranking with information gain
  - Each term in the text is ranked in decreasing order of its importance for classification using the IG method.
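As a rough sketch of the ranking step, the code below scores each term with the standard information gain definition, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], and sorts terms in decreasing order. The function names, the binary presence test (`count > 0`), and the matrix layout are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(term_present, labels):
    """IG of a term, given per-document presence flags and class labels."""
    with_term = [c for present, c in zip(term_present, labels) if present]
    without_term = [c for present, c in zip(term_present, labels) if not present]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def rank_terms(matrix, vocabulary, labels):
    """Return (term, IG) pairs sorted by decreasing information gain."""
    scores = []
    for j, term in enumerate(vocabulary):
        present = [row[j] > 0 for row in matrix]
        scores.append((term, information_gain(present, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A term that appears only in documents of one class gets the highest score, while a term spread evenly over all classes scores close to zero and ends up at the bottom of the ranking.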

Methodology
- Dimension reduction methods
  - Principal component analysis
  - Genetic algorithm for feature selection
    - Individual’s encoding
    - Fitness function
    - Mutation
    - Crossover
    - Selection

[Figure: example binary-encoded individuals 11011, 00110, 01110, 11110]
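To make the principal component analysis step concrete, here is a small pure-Python sketch that centers the data, builds the covariance matrix, and extracts the leading components by power iteration with deflation. The approach and all function names are assumptions chosen to keep the example dependency-free; a practical implementation would normally rely on a linear algebra library, and the slides do not say how the authors computed the components.

```python
import math
import random


def center(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix (d x d) of zero-mean data."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_eigenpair(cov, iterations=500, seed=0):
    """Dominant eigenvalue and eigenvector of a symmetric PSD matrix."""
    d = len(cov)
    rng = random.Random(seed)
    vector = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in vector))
    vector = [x / norm for x in vector]
    value = 0.0
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        value = math.sqrt(sum(x * x for x in product))
        if value < 1e-12:  # no variance left in this matrix
            break
        vector = [x / value for x in product]
    return value, vector


def pca(matrix, n_components):
    """Project rows of `matrix` onto up to n_components principal axes."""
    centered = center(matrix)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        value, vector = top_eigenpair(cov)
        if value < 1e-12:  # stop early: remaining directions carry no variance
            break
        components.append(vector)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - value * vector[i] * vector[j] for j in range(d)]
               for i in range(d)]
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in centered]
```

The sign of each component is arbitrary, so projected values may come out negated from one run setup to another; only their magnitudes and relative positions are meaningful.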

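The genetic algorithm components named on this slide (encoding, fitness function, selection, crossover, mutation) can be sketched as follows. Each individual is a list of 0/1 flags, one per feature, in the spirit of the bit strings shown on the slide. The tournament selection, single-point crossover, elitism, and all parameter values are assumptions chosen for illustration, and `fitness` is a caller-supplied function (for example, classifier accuracy on the selected features); the slides do not state which variants the authors used.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap the tails of two bit lists (length >= 2)."""
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def tournament(population, scores, rng, size=2):
    """Pick the fittest of `size` randomly drawn individuals."""
    picks = rng.sample(range(len(population)), size)
    return population[max(picks, key=lambda i: scores[i])]


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve a 0/1 mask over features and return the best mask found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        next_population = [best]  # elitism: carry the best mask forward
        while len(next_population) < pop_size:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            if rng.random() < crossover_rate:
                a, b = crossover(a, b, rng)
            next_population.append(mutate(a, mutation_rate, rng))
            if len(next_population) < pop_size:
                next_population.append(mutate(b, mutation_rate, rng))
        population = next_population
        best = max(population, key=fitness)
    return best
```

A call such as `genetic_feature_selection(5, fitness=sum)` evolves 5-bit masks toward the mask with the most selected features; in a real setting the fitness function would instead reward masks that improve classification.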

Methodology
- Text categorization methods
  - KNN classifier
  - C4.5 decision tree classifier
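For the first of the two classifiers, a k-nearest-neighbour prediction can be sketched as a majority vote among the closest training documents. Euclidean distance, the default `k=3`, and the function name are assumptions for this example; the slides do not state the distance measure or the value of k. The C4.5 decision tree is not sketched here.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Here each vector would be one row of the reduced document-term matrix, so fewer features directly mean cheaper distance computations.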


Methodology
- Evaluation of the performance
  - Precision
  - Recall
  - F-measure
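The three measures can be computed per category from the true and predicted labels. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall) and treats one category at a time as the positive class; how scores are averaged over categories is not stated on the slide, so it is left out.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision is the share of documents assigned to the category that truly belong to it, and recall is the share of the category's documents that were found.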

Experiments
- Datasets
  - Reuters-21578 dataset

    Category name    Number of documents
    Earn             3743
    Acquisition      2179
    Money-fx          633
    Crude             561
    Grain             542
    Trade             500

  - Classic3 dataset

    Category name    Number of documents
    CRANFIELD        1398
    MEDLINE          1033
    CISI             1460

Experiments
- Reuters-21578
  - A document-term matrix with a dimension of 8158 × 7542 is obtained at the end of pre-processing.


Experiments
- Classic3
  - A document-term matrix with a dimension of 3891 × 6679 is obtained at the end of pre-processing.


Conclusions
- Text categorization with the C4.5 decision tree and KNN algorithms is more successful with the fewer features selected via IG-PCA and IG-GA than with the features selected via IG alone.
- Two-stage feature selection methods can improve the performance of text categorization.

Comments
- Advantages
  - Understand the basic methods
- Applications
  - Text categorization