CS 479, section 1: Natural Language Processing

Lecture #32: Text Clustering with Expectation Maximization


Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Announcements

Homework 0.3
  Key has been posted (requires authentication)
  If you haven’t submitted this assignment but you’re doing Project #4, perhaps it’s time to submit what you have and move on.

Project #4
  Note updates to the requirements involving horizontal markovization

Reading Report #13: M&S ch. 13 on alignment and MT
  Due Monday, online again
  See directions sent to the announcements list

Project #5
  Help session: Tuesday

Final Project
  Proposal prerequisites:
    Discuss your idea with me before submitting the proposal
    Identify and possess data
  Proposal due: today
  Project report:
    Early: Wednesday after Thanksgiving
    Due: Friday after Thanksgiving

Where are we going?

Earlier in the semester: Classification
  “Supervised” learning paradigm
  Joint / generative model: Naïve Bayes

Moving into “unsupervised” territory …

Objectives

Understand the text clustering problem

Understand the Expectation Maximization (EM) algorithm for one model

See unsupervised learning at work on a real application

Clustering Task

Given a set of documents, put them into groups so that:
  Documents in the same group are alike
  Documents in different groups are different

“Unsupervised” learning: no class labels
Goal: discover hidden patterns in the data

Useful in a range of NLP tasks:
  Information retrieval (IR)
  Smoothing
  Data mining and text mining
  Exploratory data analysis
Clustering Task: Example

Twenty short example documents (each cell on the original slide is one document):

  Dog dog dog cat | Canine dog woof | Dog collar | Feline cat
  Cat cat cat cat dog | Kitten cat litter | Kennel dog pound | Dog dog dog dog
  Dog dog dog | Raining cats and dogs | Year of the dog | Feline cat meow cat
  Dog dog | Cat dog cat | Dog cat mouse kitten | Cat kitten cub
  Puppy dog litter | Dog dog cat cat | Dog dog cat | Dog cat cat cat cat

k-Means Clustering

The simplest model-based technique.

Procedure (the original slide illustrates this with a figure; the standard steps are):
  1. Pick k initial cluster centroids (e.g., k randomly chosen documents).
  2. Assign each document to its nearest centroid.
  3. Recompute each centroid as the mean of the documents assigned to it.
  4. Repeat steps 2 and 3 until the assignments stop changing.

Failure cases: [figure omitted; documents were plotted by Count(cat) vs. Count(dog)]

Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
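
To make the procedure above concrete, here is a minimal k-Means sketch over bag-of-words count vectors. This is a rough illustration, not code from the course: it assumes NumPy, k = 2, and a two-word dog/cat vocabulary, so each document is just a (Count(dog), Count(cat)) pair like in the failure-case plot.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-Means over bag-of-words count vectors X (n_docs x vocab_size)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k randomly chosen documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Step 2: assign each document to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned documents.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Count vectors (Count(dog), Count(cat)) for a few of the example documents,
# e.g. "Dog dog dog cat" -> (3, 1), "Cat cat cat cat dog" -> (1, 4).
X = np.array([[3, 1], [1, 4], [4, 0], [1, 4], [2, 2], [2, 1]], dtype=float)
labels, _ = kmeans(X, k=2)
print(labels)
```

Different random initializations can converge to different clusterings, which is the kind of failure case the Count(cat) vs. Count(dog) plot on the slide was illustrating.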


High-Level: Expectation-Maximization (EM)

Iterative procedure:
  1. Guess some initial parameters for the model.
  2. Use the model to estimate partial label counts for all docs (E-step).
  3. Use the new “complete” data to learn a better model (M-step).
  4. Repeat steps 2 and 3 until convergence.

k-Means is “hard” EM; equivalently, EM is “soft” k-Means (see the short contrast in code below).
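
To make the “hard” vs. “soft” contrast concrete, here is a tiny sketch. The posterior values are illustrative; how to compute them is covered later in the lecture.

```python
import numpy as np

# Posterior over k = 2 clusters for one document (illustrative values).
posterior = np.array([0.2, 0.8])

# "Hard" EM / k-Means: commit the document entirely to its most likely cluster.
hard_counts = np.zeros_like(posterior)
hard_counts[posterior.argmax()] = 1.0    # -> [0.0, 1.0]

# "Soft" EM: split the document across clusters in proportion to its posterior.
soft_counts = posterior.copy()           # -> [0.2, 0.8]
```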


EM

The quantities that cycle in EM:
  P(c_i | x_i): posterior distribution over clusters for each document
  n: total partial label counts, accumulated from the posteriors for each document
  θ: model parameter estimates, computed from the counts n

When to stop? A standard criterion: stop when the data log likelihood (equivalently, the parameter estimates) stops changing appreciably from one iteration to the next.

Model

[Diagram, word-token perspective: a class node c_i generates the word tokens t_{i,1}, t_{i,2}, ..., t_{i,M_i} of document i.]

This is what we called “Naïve Bayes” when we discussed classification.

Multinomial Mixture Model

[Diagram: a cluster node c_i generates the document x_i.]

Word-Token Perspective: c_i generates the word tokens t_{i,1}, t_{i,2}, ..., t_{i,M_i}
Word-Type Perspective: c_i generates a count x_{i,w} for each word type w = 1, ..., V

Parameters (categorical distributions):
  P(c_i = c) = λ_c
  P(w | c_i = c) = β_{c,w}

Computing Data Likelihood

For a single document x_i (tokens t_{i,1}, ..., t_{i,M_i}; word-type counts x_{i,w}):

  P(x_i) = Σ_c P(x_i, c_i = c)                                     (marginalization)
         = Σ_c P(c_i = c) · P(x_i | c_i = c)                       (factorization)
         = Σ_c P(c_i = c) · Π_j P(t_{i,j} | c_i = c)               (conditional independence)
         = Σ_c P(c_i = c) · Π_{w=1}^{V} P(w | c_i = c)^{x_{i,w}}   (make word type explicit + conditional indep.)
         = Σ_c λ_c · Π_{w=1}^{V} β_{c,w}^{x_{i,w}}                 (categorical parameters)

Log Likelihood of the Data

We want the probability of the unlabeled data according to a model with parameters θ:

  L(θ) = log P(x_1, ..., x_N | θ)
       = log Π_{i=1}^{N} P(x_i | θ)                                   (independence of data; logarithm)
       = Σ_{i=1}^{N} log P(x_i | θ)                                   (log of product)
       = Σ_{i=1}^{N} log Σ_c λ_c Π_w β_{c,w}^{x_{i,w}}                (data likelihood from the previous slide)
       = Σ_{i=1}^{N} logsum_c ( log λ_c + Σ_w x_{i,w} log β_{c,w} )   (log of sum → logsum; log of product → sum of logs)

where logsum_c(a_c) = log Σ_c exp(a_c), computed stably in the log domain.

For the computation involved in the logsum and its justification, see:
https://facwiki.cs.byu.edu/nlp/index.php/Log_Domain_Computations
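
A sketch of that computation (this is the standard log-sum-exp trick, not code from the linked page; it assumes NumPy arrays log_lambda of shape (k,) and log_beta of shape (k, V) holding log λ_c and log β_{c,w}, names chosen here for illustration):

```python
import numpy as np

def logsum(a):
    """log(sum(exp(a))), computed stably by factoring out the max."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def doc_log_likelihood(x_counts, log_lambda, log_beta):
    """log P(x_i | theta) = logsum_c [ log lambda_c + sum_w x_{i,w} * log beta_{c,w} ]."""
    per_cluster = log_lambda + log_beta @ x_counts   # one entry per cluster c
    return logsum(per_cluster)

def data_log_likelihood(X, log_lambda, log_beta):
    """L(theta) = sum over documents of log P(x_i | theta)."""
    return sum(doc_log_likelihood(x, log_lambda, log_beta) for x in X)
```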


Likelihood Surface

[Figure omitted: data likelihood plotted as a surface over model parameterizations.]

EM Properties

Each step of EM is guaranteed not to decrease the data likelihood: it is a hill-climbing procedure.

It is not guaranteed to find the global maximum of the data likelihood:
  The data likelihood typically has many local maxima for a general model class and rich feature set
  There are many “patterns” in the data that we can fit our model to …

Idea: random restarts!

EM for Naïve Bayes: E Step

[The slide shows the same collection of example documents as in “Clustering Task: Example” above; the E step is applied to every one of them.]
EM for Naïve Bayes: E Step

Take one document, x_j = “Dog cat cat cat cat” (one “dog” token, four “cat” tokens).

1. Compute the posteriors for document x_j under the current parameters θ:

     P(c_j = 1 | x_j, θ) = 0.2
     P(c_j = 2 | x_j, θ) = 0.8

2. Use them as partial counts, and total them up:

     n(1) += 0.2              n(2) += 0.8
     n(1, dog) += 1 · 0.2     n(2, dog) += 1 · 0.8
     n(1, cat) += 4 · 0.2     n(2, cat) += 4 · 0.8
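
A sketch of this E step in code. This is a minimal illustration, not the course’s code: the two-cluster λ/β values below are arbitrary toy parameters, so the resulting posterior will not exactly reproduce the 0.2 / 0.8 shown above, but the partial-count bookkeeping is the same.

```python
import numpy as np
from collections import Counter

vocab = ["dog", "cat"]

# Toy current parameters (arbitrary, for illustration only).
log_lambda = np.log(np.array([0.5, 0.5]))          # log P(c)
log_beta = np.log(np.array([[0.8, 0.2],            # log P(w | c = 1): prefers "dog"
                            [0.2, 0.8]]))          # log P(w | c = 2): prefers "cat"

def e_step_for_doc(doc_tokens):
    """Posterior over clusters for one document, plus its partial counts."""
    counts = Counter(w.lower() for w in doc_tokens)
    x = np.array([counts[w] for w in vocab], dtype=float)   # word-type counts x_{j,w}
    scores = log_lambda + log_beta @ x                       # log lambda_c + sum_w x_w * log beta_{c,w}
    scores -= scores.max()                                   # stabilize before exponentiating
    posterior = np.exp(scores)
    posterior /= posterior.sum()                             # P(c_j = c | x_j, theta)
    n_cluster = posterior.copy()                             # n(c)    += P(c | x_j)
    n_word = posterior[:, None] * x[None, :]                 # n(c, w) += x_w * P(c | x_j)
    return posterior, n_cluster, n_word

posterior, n_c, n_cw = e_step_for_doc("Dog cat cat cat cat".split())
```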

EM for Naïve Bayes: M Step

Once you have your partial counts, re-estimate the parameters as you would with regular counts (the +1 and +V below are add-one smoothing over a vocabulary of V word types):

  λ_1 = P(c = 1) = n(1) / N
  λ_2 = P(c = 2) = n(2) / N

  β_{1,dog} = P(dog | c = 1) = ( n(1, dog) + 1 ) / ( Σ_{v=1}^{V} n(1, v) + V )
  β_{1,cat} = P(cat | c = 1) = ( n(1, cat) + 1 ) / ( Σ_{v=1}^{V} n(1, v) + V )
  β_{2,dog} = P(dog | c = 2) = ( n(2, dog) + 1 ) / ( Σ_{v=1}^{V} n(2, v) + V )
  β_{2,cat} = P(cat | c = 2) = ( n(2, cat) + 1 ) / ( Σ_{v=1}^{V} n(2, v) + V )
Etc.

In the next E step, the partial counts will be different, because the parameters have changed: e.g., the posterior P(c_j = 1 | x_j, θ) computed with the new θ will generally differ from the one computed with the old θ.

And the likelihood will increase!

EM for Multinomial Mixture Model

Initialize the model parameters (e.g., randomly).

E-step:
  1. Calculate the posterior distribution over clusters for each document x_i:

       P(c_i = c | x_i, θ) ∝ λ_c · Π_{w=1}^{V} β_{c,w}^{x_{i,w}}

  2. Use the posteriors as fractional labels; total up your new counts n(c) and n(c, w).

M-step:
  1. Re-estimate λ:  λ_c = n(c) / N
  2. Re-estimate β from the fractionally labeled data, with smoothing:

       β_{c,w} = ( n(c, w) + 1 ) / ( Σ_{v=1}^{V} n(c, v) + V )

Repeat the E-step and M-step until convergence (a compact end-to-end sketch follows below).
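
Putting the two steps together, a compact end-to-end sketch. Assumptions (mine, not from the slides): NumPy, random Dirichlet initialization of β, uniform initial λ, and the toy dog/cat count vectors from the earlier example. It stops when the data log likelihood stops improving, which also answers the earlier “when to stop?” question.

```python
import numpy as np

def em_multinomial_mixture(X, k, max_iters=100, tol=1e-6, seed=0):
    """EM for a multinomial mixture over word-count vectors X (N x V)."""
    rng = np.random.default_rng(seed)
    N, V = X.shape
    log_lambda = np.log(np.full(k, 1.0 / k))
    log_beta = np.log(rng.dirichlet(np.ones(V), size=k))    # random row-stochastic init
    prev_ll = -np.inf
    for _ in range(max_iters):
        # E-step: posteriors P(c | x_i) for all documents, in the log domain.
        scores = log_lambda[None, :] + X @ log_beta.T                 # (N, k)
        m = scores.max(axis=1, keepdims=True)
        log_px = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))     # log P(x_i | theta)
        post = np.exp(scores - log_px[:, None])                       # P(c | x_i, theta)
        # Total up the fractional counts.
        n_c = post.sum(axis=0)                                        # n(c)
        n_cw = post.T @ X                                             # n(c, w)
        # M-step: re-estimate parameters (add-one smoothing on beta).
        log_lambda = np.log(n_c / N)
        log_beta = np.log((n_cw + 1.0) / (n_cw.sum(axis=1, keepdims=True) + V))
        # Stop when the data log likelihood stops improving.
        ll = log_px.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return np.exp(log_lambda), np.exp(log_beta), post

# Toy corpus over vocab ["dog", "cat"]; counts taken from the example documents.
X = np.array([[3, 1], [4, 0], [1, 4], [2, 2], [1, 4], [2, 1]], dtype=float)
lam, beta, post = em_multinomial_mixture(X, k=2)
print(post.argmax(axis=1))   # most likely cluster per document
```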

EM in General

EM is a technique for learning any time we have incomplete data (x, y), where x is observed and y is hidden:
  Induction scenario (ours): we actually want to know y (e.g., clustering)
  Convenience scenario: we want the marginal P(x), and including y just makes the model simpler (e.g., mixing weights)

General approach: learn y and θ
  E-step: make a guess at the posteriors P(y | x, θ)
    This means scoring all completions with the current parameters
    Treat the completions as (fractional) complete data, and count
  M-step: re-estimate θ to maximize log P(x, y | θ)
    Then compute (smoothed) ML estimates of the model parameters

Expectation Maximization (EM) for Mixing Parameters

How do we estimate mixing parameters λ_1 and λ_2?
  Sometimes you can just do line search, as we have discussed
  … or the “try a few orders of magnitude” approach

Alternative: use EM
  Think of mixing as a hidden choice between histories
  Given a guess at P_H, we can calculate expectations of which generation route a given token took (over held-out data)
  Use these expectations to update P_H, rinse and repeat (a small sketch follows below)
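
A minimal sketch of that loop for two components. The functions p1 and p2 and the held-out tokens below are made-up stand-ins for the two fixed “histories”; only the mixing weights P_H are being learned.

```python
def em_mixing_weights(heldout_tokens, p1, p2, iters=25):
    """Learn (lambda1, lambda2) for the mixture P(w) = lambda1*p1(w) + lambda2*p2(w)."""
    lam1 = 0.5                                    # initial guess at P_H(route 1)
    for _ in range(iters):
        expected_route1 = 0.0
        for w in heldout_tokens:
            # E-step: posterior probability that token w was generated via route 1.
            a, b = lam1 * p1(w), (1.0 - lam1) * p2(w)
            expected_route1 += a / (a + b)
        # M-step: new weight = expected fraction of held-out tokens taking route 1.
        lam1 = expected_route1 / len(heldout_tokens)
    return lam1, 1.0 - lam1

# Illustrative components: a skewed "model" distribution and a uniform fallback
# over a two-word vocabulary.
p_model = {"dog": 0.9, "cat": 0.1}
lam1, lam2 = em_mixing_weights(["dog", "dog", "cat", "dog"],
                               p1=lambda w: p_model[w],
                               p2=lambda w: 0.5)
```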

Advertisement

Learn more about text mining, including text classification, text clustering, topic models, text summarization, and visualization for text mining:
  Register for CS 679, Fall 2012

Next

Next topics:
  Machine Translation
  If time permits, Coreference Resolution

We can use the EM algorithm to attack both of these problems!