Text Classification

References:

Manning, C. and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.

Yang, Y. and Liu, X. "A Re-examination of Text Categorization Methods". At http://www.inf.ufes.br/~csbgoncalves/cursos/ct08/artigos/yang99reexamination.pdf

Yang, Y. "Decision Tree in Text Categorization". At www.cs.cmu.edu/afs/cs/academic/class/15381-f00/public/www/handouts/lec1024.ps

McCallum, A. and Nigam, K. "A Comparison of Event Models for Naive Bayes Text Classification". In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press, 1998. http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf




So far


Looked at AI problems as problems of


knowledge representation


search


Discussed several different search methods:


Uninformed search


Informed search


Adversarial search (game playing)


Studied examples of strong-method problem solving:


Expert Systems


Planning


Explored probability theory as a model for dealing with
uncertainty.

In the Remaining Lectures


Natural Language Processing is a challenging area of AI. We will explore
text categorization as an NLP task that deals with uncertainty.

Natural Language Processing in AI


Goal: to understand and generate spoken and written human language.

Natural Language Processing in AI

Applications


Search and retrieval
: utilizing large databases of documents
that would otherwise be impossible to search through.


Categorization
: organizing large repositories such as the
WWW.


Translation
: translating documents between languages.


User Interface
: communicating with users in an intelligent
manner.


Text summarization
: summarizing a large document or several documents.


Document filtering
: filtering groups of documents to retain
desired ones and/or discard others (e.g., junk mail).


Feature extraction
: automatically extracting important
information from news articles.


Automatic Classification


“The task of assigning objects from a universe to two or more classes or
categories.” (Manning and Schütze, Foundations of Statistical Natural
Language Processing. MIT Press, 1999)


Automatic Classification

Challenges


Language does not usually have clear strict rules.


We use language to refer to objects and events in our
world which may be inaccessible to a computer.


Language is sometimes ambiguous.

Applications of Automatic Classification


Document classification:


Text categorization


Author identification


Language identification


“Word” classification:


Word sense disambiguation.


Part of Speech tagging.

Text Categorization


Goal: to classify documents into one or more of a
predetermined set of categories.


Applications:


Webpage categorization.


News filtering.


Spam filtering


Text categorization is a
supervised learning
method.


Supervised Learning


Goal is to learn a function from provided examples of input and the
corresponding correct output:


Each example is represented as a pair (x, f(x)), where x is the input and
f(x) is the correct output.


Therefore, supervised learning methods attempt to generate a function h
which approximates f.
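As a concrete illustration of this setup (not from the slides; all names here are made up), a minimal Python sketch of a training set of (x, f(x)) pairs and a toy learner that returns a hypothesis h approximating f:

from typing import Callable, List, Tuple

# A training example is a pair (x, f(x)); here x is a document text and
# f(x) is its correct category.
Example = Tuple[str, str]

def learn(training_set: List[Example]) -> Callable[[str], str]:
    """Toy learner: returns a hypothesis h that approximates f by memorizing
    the training pairs and falling back to the most frequent category."""
    memory = dict(training_set)
    labels = [c for _, c in training_set]
    default = max(set(labels), key=labels.count)

    def h(x: str) -> str:
        return memory.get(x, default)

    return h

# h = learn([("cheap pills now", "spam"), ("meeting at noon", "not spam")])
# h("meeting at noon")  ->  "not spam"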

Supervised Learning in Text
Categorization

[Diagram: a collection of training documents ("this is a document, blah blah
blah") is converted, via the data representation model, into a training set
of pairs <x1, C(x1)>, <x2, C(x2)>, ...; the training procedure then selects
an appropriate categorization model from the model class.]

Data Representation Model in Text
Categorization


The training data of a categorization system consists of documents and
their categories.


A data representation model (R(d), c) is the representation R(d) of a
document d and the document category c.


The choice of R(d) depends on the language model and the classification
model used.

Modeling Language


Language modeling
is an abstract view that captures
the important aspects of language and specifies
underlying assumptions.


Many different language models exist.


The choice of model is influenced mainly by the
intended application.


In text categorization our main concern is with
documents and their categories.

Document Models


Documents as structured objects


linear structure


title, keywords, sequence of sections


hierarchical structure


book, chapter, sections


grammatical structure


hyperlinked


Documents as unstructured objects


ordered sequence of strings


unordered/partially ordered sequence of strings



Document Models

Example

The following slides use this document as a running example:

documents as structured objects

linear structure: title, keywords, sequence of sections

hierarchical structure: book, chapter, sections

grammatical structure

hyperlinked

documents as unstructured objects

ordered sequence of strings

unordered/partially ordered sequence of strings



Document Views

Documents as Bags of Words


Words are completely independent of each other (unigram model).


Word pairs tend to co-occur (bigram model).

example: "swimming pool" vs "swimming shoe"; "running, fitness" vs
"running, window"


Many words tend to co-occur (n-gram model).

Document Models

The unigram model


Assume that the language consists of the words W = {w1, w2, ..., wn}.


Assume that word occurrences are independent.


Associate each word with a weight r(wi) reflecting its relevance to the
content of the document.


A document is then represented as a vector of (word, weight) pairs.


Many systems use feature selection, which reduces the size of the document
vector by keeping only the "best" features (words) in the vector space
model.

Document Unigram Model

Example

The example document above is represented as an unordered bag of its words:

as, Document, documents, Views, structured, objects, Linear, structure,
title, keywords, sequence, of, sections, hierarchical, book, chapter,
sections, grammatical, Hyperlinked, unstructured, ordered, sequence,
strings, unordered, partially
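A minimal Python sketch of this representation (the choice of raw term frequency as the weight r(wi) and the frequency-based feature selection are illustrative assumptions, not taken from the slides):

from collections import Counter
from typing import Dict, List

def unigram_vector(document: str) -> Dict[str, int]:
    """Represent a document as (word, weight) pairs; here the weight r(w)
    is simply the term frequency of w in the document."""
    words = document.lower().split()
    return dict(Counter(words))

def select_features(vectors: List[Dict[str, int]], k: int = 1000) -> List[str]:
    """Simple feature selection: keep only the k most frequent words across
    the training collection as the vector-space vocabulary."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return [w for w, _ in total.most_common(k)]

doc = "documents as structured objects linear structure title keywords"
print(unigram_vector(doc))   # {'documents': 1, 'as': 1, 'structured': 1, ...}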



Document Models

The Bigram Model


Assume that the language consists of the words W = {w1, w2, ..., wn}.


Assume that words tend to occur in pairs: {(w11, w12), (w21, w22), ...}.


Associate each word sequence si with a weight r(si) reflecting its
relevance to the content of the document.


A document is then a vector of (word sequence, weight) pairs.

The Bigram Model

Example

The example document is represented by its word pairs:

(document, Views), (documents, structured), (structured, objects),
(Linear, structure), (hierarchical, structure), (grammatical, structure),
(documents, unstructured), (unstructured, objects), (ordered, sequence),
(sequence, strings), (unordered, partially), (partially, ordered),
(ordered, sequence)


Document Models

The N-gram Model


Assume that the language consists of the words W = {w1, w2, ..., wn}.


Assume that words tend to co-occur in sets of n words:
{(w11, w12, ..., w1n), (w21, w22, ..., w2n), ...}.


Supervised Learning in Text
Categorization

[The training-procedure diagram is repeated: a collection of training
documents is converted, via the data representation model, into the training
set <x1, C(x1)>, <x2, C(x2)>, ...; the training procedure then selects an
appropriate categorization model from the model class.]

Text Categorization Models


Naive Bayes’ Classifier


K Nearest Neighbour.


Decision Trees

Naive Bayes’ Classifier

Naive Bayes’ Learning Model


A common Bayesian network machine learning model.


Makes the assumption of word independence.


In some cases, uses the multivariate document model (multivariate
probability distribution).

[Diagram: a Bayesian network in which the Category node is the parent of
the word nodes W1, W2, W3, W4.]

The Multivariate Document Model


Assume independence of words.


A document is a sequence of word occurrences.


The occurrence of a word w in a document is an event with probability pw.


Document words are drawn from the vocabulary list V.


Therefore:

di = <w1, w2, w3, ..., w|d|>

P(c | di) = P(c | w1, w2, w3, ..., w|d|)

P(w1, w2, w3, ..., w|d| | c) = ∏i P(wi | c)




Naive Bayes’ Classifier

Training Procedure


The model must be trained to determine, for each class/category ci and
word wi:


P(C = ci)


P(wi | C = ci)


Given a set of categorized documents, we may use the total number of words
in documents under ci and the total number of words in the training set to
calculate P(C = ci).


Similarly, we can use the total frequency of word wi in all documents under
ci and the total number of words in ci to calculate P(wi | C = ci).
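A minimal Python sketch of this training step. It follows the word-count-based estimates described above; the add-one (Laplace) smoothing is an extra assumption introduced here to avoid zero probabilities and is not part of the slides:

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def train_naive_bayes(docs: List[Tuple[List[str], str]]):
    """docs: list of (list of words, category) pairs.
    Returns the priors P(C = c) and the conditionals P(w | C = c)."""
    class_word_totals = Counter()          # total number of words under each category
    word_counts = defaultdict(Counter)     # frequency of each word under each category
    for words, c in docs:
        class_word_totals[c] += len(words)
        word_counts[c].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    total_words = sum(class_word_totals.values())
    priors = {c: n / total_words for c, n in class_word_totals.items()}
    # Add-one smoothing (an assumption added here, not in the slides).
    cond = {c: {w: (word_counts[c][w] + 1) / (class_word_totals[c] + len(vocab))
                for w in vocab}
            for c in class_word_totals}
    return priors, cond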


Naive Bayes’ Classifier

Given P(C = ci) and P(wi | C = ci)


Classification can be performed using Naive Bayes’ rule:

P(C | w1, w2, ..., wn) = (P(C) ∏i P(wi | C)) / P(w1, w2, ..., wn)


The proper category is the one that maximizes P(C | w1, w2, ..., wn).


Since P(w1, w2, ..., wn) is the same for all classes, we use:

P(C | w1, w2, ..., wn) = α P(C) ∏i P(wi | C)

∝ P(C) ∏i P(wi | C)



Naive Bayes’ Classifier


Therefore, to classify a document into one of two categories c1 and c2:


Calculate the approximate value of P(C = c1 | w1, w2, ..., wn) (do not
calculate the value of the α factor).


Calculate the approximate value of P(C = c2 | w1, w2, ..., wn) (do not
calculate the value of the α factor).


Classify the document under category c1 if
P(C = c1 | w1, w2, ..., wn) > P(C = c2 | w1, w2, ..., wn);
otherwise, choose c2.
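A minimal Python sketch of this decision rule, consuming the estimates from the training sketch above. Working in log-probabilities is an implementation detail added here (not in the slides) to avoid numerical underflow; since log is monotonic, the comparison between categories is unchanged, and the α factor is never computed:

import math
from typing import Dict, List

def classify(words: List[str],
             priors: Dict[str, float],
             cond: Dict[str, Dict[str, float]]) -> str:
    """Return the category c that maximizes P(c) * prod_i P(w_i | c),
    ignoring the shared denominator P(w_1, ..., w_n)."""
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in cond[c]:               # skip out-of-vocabulary words
                score += math.log(cond[c][w])
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# priors, cond = train_naive_bayes([(["profit", "rose"], "earnings"),
#                                   (["match", "won"], "sports")])
# classify(["profit", "fell"], priors, cond)  ->  "earnings"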



Naive Bayes’ Classifier


Advantages


Very effective in comparison to other classification methods.


Simple and efficient.


Disadvantages


The assumption of word independence is not always appropriate.


K Nearest Neighbour


Another machine learning technique.


Rationale: To classify a new document, find the
training document most similar to it, then assign it the
category of that document.


[Figure: illustration of KNN classification. Image source:
https://engineering.purdue.edu/people/mireille.boutin.1/ECE301kiwi/KNN.jpg]

KNN

General Approach (for K = 1)


Input:


a document represented by a vector y.


a training set X of pre-categorized documents.


Goal: categorize y based on the training set X.


Determine the largest similarity with any element in the training set:

sim_max(y) = max over x in X of sim(x, y)


Let A be the set consisting of the items that are most similar to y.


Let n1 and n2 be the number of items in A which belong to categories c1
and c2, respectively:


p(c1 | y) = n1 / (n1 + n2)


p(c2 | y) = n2 / (n1 + n2)


Categorize y under c1 if p(c1 | y) > p(c2 | y); otherwise categorize it
under c2.
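A minimal Python sketch of this procedure, assuming documents are given as (word, weight) vectors and using cosine similarity as sim(x, y); the slides leave the similarity measure and tie handling open, so both choices here are illustrative:

import math
from typing import Dict, List, Tuple

def cosine_sim(x: Dict[str, float], y: Dict[str, float]) -> float:
    """One possible sim(x, y): cosine of the angle between the weight vectors."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(y: Dict[str, float],
                 training: List[Tuple[Dict[str, float], str]]) -> str:
    """K = 1: find the training items with the largest similarity to y
    (keeping ties, the set A), then return the majority category among them."""
    sims = [(cosine_sim(x, y), c) for x, c in training]
    sim_max = max(s for s, _ in sims)
    nearest = [c for s, c in sims if s == sim_max]     # the set A
    return max(set(nearest), key=nearest.count)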

KNN


What is the best value for k?


How can we choose the best value for k?


What makes a good similarity metric?


How can we weigh probabilities p(c|y) using the k
nearest documents when k>1?


KNN

Advantages


Robust


Conceptually simple.


Often performs well when using an appropriate
similarity measure.

KNN

Disadvantages


Sensitive to similarity measure used to compare
documents.


Can be inefficient in calculating all required
similarities.


Decision Trees

Example: a decision tree for the "earnings" category. Each node is labelled
with the number of training articles it covers and the probability P(C|n)
that an article in that node belongs to the category; internal nodes split
on the weight of a word (cts, net, vs) against a splitting value.

node 1: 7681 articles, P(C|n1) = 0.3; split on cts, value 2
  cts < 2  -> node 2: 5977 articles, P(C|n2) = 0.116; split on net, value 1
    net < 1  -> node 3: 5436 articles, P(C|n3) = 0.05
    net >= 1 -> node 4: 541 articles, P(C|n4) = 0.649
  cts >= 2 -> node 5: 1704 articles, P(C|n5) = 0.943; split on vs, value 2
    vs < 2   -> node 6: 301 articles, P(C|n6) = 0.694
    vs >= 2  -> node 7: 1403 articles, P(C|n7) = 0.996

Given a document with weight 1 for cts and 3 for net: is it an "earnings"
document?

Following the tree for this document: cts has weight 1 < 2, so go from
node 1 to node 2; net has weight 3 >= 1, so go to node 4, where
P(C|n4) = 0.649 (versus 0.05 at node 3). So the answer is yes: with
probability 0.649 it is an "earnings" document.

[Figure: the corresponding partition of the (cts, net) feature space.]

Creating Decision Tree for a Category

This is the training procedure for this model:

1. Grow(Tree)

2. Prune the tree.

Creating Decision Tree for a Category

This is the training procedure for this model:

1. Grow(Tree)

   1. Create the vocabulary V from the training data.

   2. If Tree is empty:

      1. Set the initial node n with the full training set.

      2. Grow(n)

   3. Else if (the training set in the current node is too small) or (all
      documents have the same category), then:

      1. Make the current node a leaf node.

      2. Return.

   4. Else:

      1. Find the word w with the highest information gain (this is our
         splitting criterion).

      2. Divide the training set into: subset S1 containing documents where
         w has weight < the splitting value, and subset S2 containing
         documents where w has weight >= the splitting value.

      3. Create a node n1 containing S1.

      4. Create a node n2 containing S2.

      5. Grow(n1)

      6. Grow(n2)
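A minimal Python sketch of Grow, assuming word weights are small integers, a fixed splitting value of 1, and information gain computed from the entropy of the category labels; the vocabulary is passed in, pruning is omitted, and all names are illustrative rather than a definitive implementation:

import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Doc = Tuple[Dict[str, int], str]          # (word -> weight, category)

def entropy(docs: List[Doc]) -> float:
    counts = Counter(c for _, c in docs)
    n = len(docs)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(docs: List[Doc], w: str, split: int) -> float:
    s1 = [d for d in docs if d[0].get(w, 0) < split]
    s2 = [d for d in docs if d[0].get(w, 0) >= split]
    if not s1 or not s2:
        return 0.0
    n = len(docs)
    return entropy(docs) - (len(s1) / n) * entropy(s1) - (len(s2) / n) * entropy(s2)

class Node:
    def __init__(self):
        self.word: Optional[str] = None    # splitting word (None for a leaf)
        self.split: int = 1                # splitting value (fixed here)
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.category: Optional[str] = None

def grow(docs: List[Doc], vocab: List[str], min_size: int = 5) -> Node:
    """Grow a tree for a non-empty training set docs."""
    node = Node()
    # Stop if the training set is too small or all documents share a category.
    if len(docs) <= min_size or len({c for _, c in docs}) == 1:
        node.category = Counter(c for _, c in docs).most_common(1)[0][0]
        return node
    # Splitting criterion: the word with the highest information gain.
    node.word = max(vocab, key=lambda w: info_gain(docs, w, node.split))
    s1 = [d for d in docs if d[0].get(node.word, 0) < node.split]
    s2 = [d for d in docs if d[0].get(node.word, 0) >= node.split]
    if not s1 or not s2:                   # no useful split: make a leaf
        node.word = None
        node.category = Counter(c for _, c in docs).most_common(1)[0][0]
        return node
    node.left = grow(s1, vocab, min_size)
    node.right = grow(s2, vocab, min_size)
    return node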

Creating Decision Tree for a Category

This is the training procedure for this model:

1. Grow(Tree)

2. Prune the tree: repeatedly remove leaf nodes that are found to be least
   useful.


Pruning helps us avoid overfitting the data.


Pruning also reduces the tree size, thus improving efficiency.

Decision Trees


Advantages:


Easy to interpret and understand.


Disadvantages:


Does not perform very well compared to other models
such as KNN.


High training cost.


Costly when re-training is needed.


Summary


Discussed NLP and its applications


Examined the issue of modelling language.


Considered how to model documents.


Explored the components of a text categorization
system.


Looked at three different models of text categorization
all of which are machine learning methods:


Naive Bayes’ Model


K Nearest Neighbour.


Decision Trees.