lecture_16

Artificial Intelligence and Robotics


Text Classification and Naïve Bayes


An example of text classification

Definition of a machine learning problem

A refresher on probability

The Naive Bayes classifier

1

Google News

2

Different approaches to classification

Human labor (people assign categories to every incoming article)

Hand-crafted rules for automatic classification (see the sketch below)

If article contains: stock, Dow, share, Nasdaq, etc. → Business

If article contains: set, breakpoint, player, Federer, etc. → Tennis

Machine learning algorithms
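A minimal sketch of the hand-crafted-rules approach above; the keyword lists come from this slide, while the function classify_by_rules and the fallback category are illustrative:

    # A minimal sketch of hand-crafted keyword rules for news classification.
    # The keyword lists come from the slide; everything else is illustrative.
    RULES = {
        "Business": {"stock", "dow", "share", "nasdaq"},
        "Tennis": {"set", "breakpoint", "player", "federer"},
    }

    def classify_by_rules(article: str) -> str:
        words = set(article.lower().split())
        for category, keywords in RULES.items():
            if words & keywords:          # any keyword present?
                return category
        return "Unknown"                  # no rule fired

    print(classify_by_rules("Federer wins the first set against the player"))  # Tennis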

3

What is Machine Learning?

4

Definition: A computer program is said to learn from experience E when its performance P at a task T improves with experience E.

Tom Mitchell, Machine Learning, 1997

Examples:

- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon

Components of a ML System (1)

Experience (a set of examples that combines input and output for a task)

Text categorization: document + category

Speech recognition: spoken text + written text

Experience is referred to as Training Data. When training data is available, we talk of Supervised Learning.

Performance metrics

Error or accuracy on the Test Data

Test Data are not present in the Training Data

When there are few training data, methods like ‘leave-one-out’ or ‘ten-fold cross-validation’ are used to measure error (see the sketch below).
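A minimal sketch of ten-fold cross-validation, assuming generic train_classifier and accuracy functions (both are hypothetical placeholders; the slide does not prescribe an implementation):

    # Ten-fold cross-validation: split the labeled data into 10 folds,
    # train on 9 folds and measure error on the held-out fold, then average.
    # `train_classifier` and `accuracy` are hypothetical placeholders.
    import random

    def ten_fold_cv(examples, train_classifier, accuracy, k=10, seed=0):
        data = examples[:]
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]       # k roughly equal folds
        scores = []
        for i in range(k):
            test = folds[i]
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train_classifier(train)          # learn on k-1 folds
            scores.append(accuracy(model, test))     # evaluate on held-out fold
        return sum(scores) / k                       # mean accuracy estimate

Setting k equal to the number of examples gives ‘leave-one-out’.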

5

Components of a ML System (2)

Type of knowledge to be learned (known as the target function, which maps between input and output)

Representation of the target function


Decision trees


Neural networks


Linear functions

The learning algorithm


C4.5 (learns decision trees)


Gradient descent (learns a neural network)


Linear programming (learns linear functions)



6

Defining Text Classification

7

Task:

d ∈ X: the document in the multi-dimensional space X

C = {c1, c2, …, cJ}: a set of classes (categories, or labels)

D: the training set of labeled documents ⟨d, c⟩, with ⟨d, c⟩ ∈ X × C

Target function: γ : X → C

Learning algorithm: Γ(D) = γ

Example: ⟨d, c⟩ = ⟨“Beijing joins the World Trade Organization”, China⟩, i.e., γ(d) = China
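A minimal sketch of the same setup in code, assuming documents and classes are represented as strings (the names Document, Class, TrainingExample and Gamma are illustrative, not from the slides):

    # The training set D: labeled documents <d, c>, as in the slide's example.
    from typing import Callable, List, Tuple

    Document = str
    Class = str
    TrainingExample = Tuple[Document, Class]

    D: List[TrainingExample] = [
        ("Beijing joins the World Trade Organization", "China"),
    ]

    # The target function gamma maps a document to a class; the learning
    # algorithm Gamma takes the training set D and returns such a function.
    Gamma = Callable[[List[TrainingExample]], Callable[[Document], Class]]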

Naïve Bayes Learning

8

Learning Algorithm: Naïve Bayes

Target Function: c_map = argmax_{c ∈ C} P(c|d)

The generative process:

P(c): the a priori probability of choosing a category c

P(d|c): the conditional probability of generating d, given the fixed c

P(c|d): the a posteriori probability that c generated d
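A minimal sketch of this generative process: first draw a category c with probability P(c), then draw the document's terms independently from P(t|c). The probability tables below are invented for illustration; they are not estimated from any data:

    # Generative view of Naive Bayes: pick a class, then emit terms i.i.d.
    # The probability tables below are made-up illustrative numbers.
    import random

    P_CLASS = {"China": 0.75, "not China": 0.25}
    P_TERM = {
        "China":     {"Chinese": 0.6, "Beijing": 0.2, "Macao": 0.2},
        "not China": {"Tokyo": 0.4, "Japan": 0.4, "Chinese": 0.2},
    }

    def generate_document(length: int = 3) -> tuple:
        c = random.choices(list(P_CLASS), weights=P_CLASS.values())[0]
        terms = random.choices(list(P_TERM[c]), weights=P_TERM[c].values(), k=length)
        return c, terms

    print(generate_document())  # e.g. ('China', ['Chinese', 'Chinese', 'Beijing'])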

A Refresher on Probability

9

Visualizing probability

A is a random variable that denotes an uncertain event


Example: A = “I’ll get an A+ in the final exam”

P(A) is “the fraction of possible worlds where A is true”

10

[Venn diagram: the event space of all possible worlds has area 1; P(A) is the area of the region of worlds in which A is true, the rest being worlds in which A is false.]

Slide: Andrew W. Moore

Axioms and Theorems of Probability

Axioms:

0 <= P(A) <= 1

P(True) = 1

P(False) = 0

P(A or B) = P(A) + P(B) - P(A and B)

Theorems:

P(not A) = P(~A) = 1 - P(A)

P(A) = P(A ^ B) + P(A ^ ~B)

11

Conditional Probability

P(A|B) = the probability of A being true, given that we know that B is true

12

[Venn diagram of the overlapping events F and H]

H = “I have a headache”

F = “Coming down with flu”

P(H) = 1/10

P(F) = 1/40

P(H|F) = 1/2

Slide: Andrew W. Moore

Headaches are rare and flu even rarer, but if you got that flu, there is a 50-50 chance you’ll have a headache.
Deriving the Bayes Rule

13

Conditional Probability: P(A|B) = P(A, B) / P(B)

Chain rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)

Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
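As a quick worked check, applying the Bayes Rule to the headache/flu numbers from the previous slide gives the probability of flu given a headache (this computation is an added illustration, not part of the original slide):

    # Bayes rule applied to the headache/flu example:
    # P(F|H) = P(H|F) * P(F) / P(H)
    p_H = 1 / 10        # P(H): probability of a headache
    p_F = 1 / 40        # P(F): probability of flu
    p_H_given_F = 1 / 2 # P(H|F): probability of a headache given flu

    p_F_given_H = p_H_given_F * p_F / p_H
    print(p_F_given_H)  # 0.125 -> only a 1-in-8 chance of flu, even with a headache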

Back to the Naïve Bayes Classifier

14

Deriving the Naïve Bayes

15

Given two classes c1 and c2, and the document d, the Bayes Rule gives:

P(c1|d) = P(d|c1) P(c1) / P(d)

P(c2|d) = P(d|c2) P(c2) / P(d)

We are looking for the class c that maximizes the a-posteriori probability P(c|d).

P(d) (the denominator) is the same in both cases.

Thus: c_map = argmax_{c ∈ C} P(d|c) P(c)

Estimating parameters for the target function

We are looking for the estimates P̂(c) and P̂(d|c).

16

P(c) is the fraction of possible worlds where c is true; we estimate it from the training data as

P̂(c) = N_c / N

N: number of all documents

N_c: number of documents in class c

d is a vector in the space X, where each dimension is a term:

d = ⟨t_1, t_2, …, t_nd⟩   (nd is the number of tokens in d)

By using the chain rule, we have:

P(d|c) = P(⟨t_1, …, t_nd⟩ | c) = P(t_1|c) · P(t_2|c, t_1) · … · P(t_nd | c, t_1, …, t_(nd−1))
Naïve assumptions of independence

1. All attribute values are independent of each other given the class (the conditional independence assumption).

2. The conditional probabilities for a term are the same independent of its position in the document.

We assume the document is a “bag-of-words”.

17

Finally, we get the target function of Slide 8:

c_map = argmax_{c ∈ C} P̂(c) · ∏_{1 ≤ k ≤ nd} P̂(t_k|c)
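A minimal sketch of the bag-of-words assumption in code: under the two assumptions above only term counts matter, not positions (the whitespace tokenization is a simplification for illustration):

    # Bag-of-words: a document reduces to term counts; word order is ignored.
    from collections import Counter

    def bag_of_words(document: str) -> Counter:
        return Counter(document.lower().split())   # naive whitespace tokenization

    print(bag_of_words("Chinese Beijing Chinese"))
    # Counter({'chinese': 2, 'beijing': 1})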

Again about estimation

18

For each term t, we need to estimate P(t|c). The maximum-likelihood estimate is

P̂(t|c) = T_ct / Σ_{t' ∈ V} T_ct'

Because this estimate will be 0 if a term does not appear with a class in the training data, we need smoothing:

Laplace Smoothing:  P̂(t|c) = (T_ct + 1) / Σ_{t' ∈ V} (T_ct' + 1) = (T_ct + 1) / ((Σ_{t' ∈ V} T_ct') + |V|)

|V| is the number of terms in the vocabulary

T_ct is the count of occurrences of term t in all training documents of class c
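A minimal sketch of the Laplace-smoothed estimate in code, assuming the per-class term counts T_ct have already been collected (the function name and variables are illustrative; the sample counts are those of class China from Example 13.1 below):

    # Laplace (add-one) smoothing for P(t|c):
    # P_hat(t|c) = (T_ct + 1) / (sum over t' of T_ct' + |V|)
    from collections import Counter

    def smoothed_term_prob(term: str, class_counts: Counter, vocabulary: set) -> float:
        total = sum(class_counts.values())             # sum of T_ct' over all terms t'
        return (class_counts[term] + 1) / (total + len(vocabulary))

    counts_china = Counter({"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1})
    V = {"Beijing", "Chinese", "Japan", "Macao", "Shanghai", "Tokyo"}
    print(smoothed_term_prob("Chinese", counts_china, V))  # 6/14 = 0.4285...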

An Example of classification with Naïve Bayes

19

Example 13.1 (Part 1)

20

Training set:

    docID 1: “Chinese Beijing Chinese”              c = China? Yes
    docID 2: “Chinese Chinese Shanghai”             c = China? Yes
    docID 3: “Chinese Macao”                        c = China? Yes
    docID 4: “Tokyo Japan Chinese”                  c = China? No

Test set:

    docID 5: “Chinese Chinese Chinese Tokyo Japan”  c = China? ?

Two classes: “China”, “not China”

N = 4

V = {Beijing, Chinese, Japan, Macao, Shanghai, Tokyo}, so |V| = 6

Example 13.1 (Part 2)

21

(Training set and test set as in Part 1.)

Estimation (with Laplace smoothing):

    P̂(China) = 3/4                P̂(not China) = 1/4

    P̂(Chinese|China) = (5+1)/(8+6) = 6/14 = 3/7
    P̂(Tokyo|China) = P̂(Japan|China) = (0+1)/(8+6) = 1/14

    P̂(Chinese|not China) = (1+1)/(3+6) = 2/9
    P̂(Tokyo|not China) = P̂(Japan|not China) = (1+1)/(3+6) = 2/9

Classification:

    P̂(China|d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
    P̂(not China|d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001

The classifier therefore assigns the test document d5 to the class China.
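As a cross-check, here is a minimal sketch of multinomial Naïve Bayes training and classification with Laplace smoothing that reproduces these numbers on Example 13.1 (a sketch written for this document, not a reference implementation; the function names are illustrative):

    # Multinomial Naive Bayes with add-one smoothing on Example 13.1.
    from collections import Counter, defaultdict
    from math import log

    train = [
        ("Chinese Beijing Chinese", "China"),
        ("Chinese Chinese Shanghai", "China"),
        ("Chinese Macao", "China"),
        ("Tokyo Japan Chinese", "not China"),
    ]

    def train_nb(examples):
        class_docs = Counter(c for _, c in examples)          # N_c
        term_counts = defaultdict(Counter)                    # T_ct per class
        vocab = set()
        for doc, c in examples:
            tokens = doc.split()
            term_counts[c].update(tokens)
            vocab.update(tokens)
        priors = {c: n / len(examples) for c, n in class_docs.items()}
        return priors, term_counts, vocab

    def classify(doc, priors, term_counts, vocab):
        scores = {}
        for c in priors:
            total = sum(term_counts[c].values())
            score = log(priors[c])
            for t in doc.split():
                if t in vocab:                                # ignore unseen terms
                    score += log((term_counts[c][t] + 1) / (total + len(vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

    priors, term_counts, vocab = train_nb(train)
    print(classify("Chinese Chinese Chinese Tokyo Japan", priors, term_counts, vocab))
    # -> China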

Summary: Miscellaneous

Naïve Bayes is linear in the time it takes to scan the data.

When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we add logarithms of probabilities instead of multiplying the probabilities themselves:

c_map = argmax_{c ∈ C} [ log P̂(c) + Σ_{1 ≤ k ≤ nd} log P̂(t_k|c) ]

For a large training set, the vocabulary is large. It is better to select only a subset of the terms; for that, “feature selection” is used (Section 13.5).
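A short illustration of why the logarithm trick matters (the small probabilities below are arbitrary, chosen only to trigger the underflow):

    # Multiplying many small probabilities underflows to 0.0 in floating point;
    # summing their logarithms does not, and the argmax is unchanged.
    from math import log

    probs = [1e-5] * 100                      # 100 tiny per-term probabilities
    product = 1.0
    for p in probs:
        product *= p
    print(product)                            # 0.0  (underflow)

    log_score = sum(log(p) for p in probs)
    print(log_score)                          # about -1151.29, still usable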


22