CS 479, Section 1: Natural Language Processing
Lecture #32: Text Clustering with Expectation Maximization
Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Announcements
Homework 0.3
Key has been posted; requires authentication
If you haven't submitted this assignment but you're doing Project #4, perhaps it's time to submit what you have and move on.
Project #4
Note updates to the requirements involving horizontal markovization
Reading Report #13: M&S ch. 13 on alignment and MT
Due Monday, online again; see directions sent to the announcements list
Project #5
Help session: Tuesday
Final Project
Proposal prerequisites: discuss your idea with me before submitting the proposal; identify and possess data
Proposal due: today
Project report: early deadline Wednesday after Thanksgiving; due Friday after Thanksgiving
Where are we going?
Earlier in the semester: Classification
“Supervised” learning paradigm
Joint / Generative model: Naïve Bayes
Moving into “Unsupervised” territory …
Objectives
Understand the text clustering problem
Understand the Expectation Maximization (EM) algorithm for one model
See unsupervised learning at work on a real application
Clustering Task
Given a set of documents, put them into groups so that:
Documents in the same group are alike
Documents in different groups are different
“Unsupervised” learning: no class labels
Goal: discover hidden patterns in the data
Useful in a range of NLP tasks:
IR
Smoothing
Data mining and Text mining
Exploratory data analysis
Clustering Task: Example
A collection of short example documents to be clustered:
"Dog dog dog cat", "Canine dog woof", "Dog collar", "Feline cat", "Cat cat cat cat dog", "Kitten cat litter", "Kennel dog pound", "Dog dog dog dog", "Dog dog dog", "Raining cats and dogs", "Year of the dog", "Feline cat meow cat", "Dog dog", "Cat dog cat", "Dog cat mouse kitten", "Cat kitten cub", "Puppy dog litter", "Dog dog cat cat", "Dog dog cat", "Dog cat cat cat cat"
k-Means Clustering
The simplest model-based technique
Procedure: pick k initial centroids, assign each document to its nearest centroid, recompute each centroid as the mean of its documents, and repeat until the assignments stop changing (see the sketch below)
Failure cases: results depend on initialization and can settle into poor groupings
[Figure: documents plotted by Count(dog) vs. Count(cat), with cluster assignments.]
Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
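A minimal k-means sketch in Python: the 2-D (Count(dog), Count(cat)) representation mirrors the figure above, while the function name, toy data, and random seed are purely illustrative.

import random

def kmeans(points, k, iters=100, seed=0):
    """Cluster 2-D count vectors, e.g., (Count(dog), Count(cat)), into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # 1. pick k initial centroids
    clusters = []
    for _ in range(iters):
        # 2. assign each point to its nearest centroid (the "hard" E step)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # 3. recompute each centroid as the mean of its assigned points (the "hard" M step)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:             # 4. stop once assignments stabilize
            break
        centroids = new_centroids
    return centroids, clusters

# Example: documents represented as (Count(dog), Count(cat))
docs = [(3, 1), (1, 0), (0, 4), (4, 0), (1, 4), (0, 3)]
centroids, clusters = kmeans(docs, k=2)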
High-Level: Expectation-Maximization (EM)
Iterative procedure:
1. Guess some initial parameters for the model
2. Use the model to estimate partial label counts for all docs (E step)
3. Use the new complete data to learn a better model (M step)
4. Repeat steps 2 and 3 until convergence
k-Means is "hard" EM; or, EM is "soft" k-Means
EM
When to stop?
Posterior distribution over clusters, P(c_i | x_i; θ), computed for each document
Total partial label counts, accumulated over all documents from the posteriors
θ: model parameter estimates, computed from the partial counts
Model (word-token perspective): c_i → t_{i,1}, t_{i,2}, …, t_{i,M_i}
This is what we called "Naïve Bayes" when we discussed classification.
Multinomial Mixture Model
Compact view: c_i → x_i
Word-token perspective: c_i → t_{i,1}, t_{i,2}, …, t_{i,M_i}
Word-type perspective: c_i → x_{i,1}, x_{i,2}, …, x_{i,V}
Multinomial Mixture Model
Compact view: c_i → x_i
Word-token perspective: c_i → t_{i,1}, t_{i,2}, …, t_{i,M_i}
Word-type perspective: c_i → x_{i,1}, x_{i,2}, …, x_{i,V}
P(c_i = k) = λ_k
P(t_{i,j} = w | c_i = k) = β_{k,w}
Computing Data Likelihood
P(x_i ; θ)
= Σ_k P(c_i = k, x_i ; θ)                                  Marginalization
= Σ_k P(c_i = k) ⋅ P(x_i | c_i = k)                         Factorization
= Σ_k P(c_i = k) ⋅ Π_j P(t_{i,j} | c_i = k)                 Conditional independence
= Σ_k P(c_i = k) ⋅ Π_{w ∈ V} P(w | c_i = k)^{x_{i,w}}       Make word type explicit + conditional indep.
= Σ_k λ_k ⋅ Π_{w ∈ V} β_{k,w}^{x_{i,w}}                     Categorical parameters
Log Likelihood of the Data
We want the probability of the unlabeled data according to a model with parameters θ:
log P(X ; θ)
= log Π_i P(x_i ; θ)                                        Independence of data; take the logarithm
= Σ_i log P(x_i ; θ)                                        Log of product
= Σ_i log Σ_k λ_k ⋅ Π_{w ∈ V} β_{k,w}^{x_{i,w}}             Log of sum (and log of product inside)
The log of a sum of very small terms is computed in the log domain with the logsum operation (a small sketch follows below).
For the computation involved in the logsum and its justification:
https://facwiki.cs.byu.edu/nlp/index.php/Log_Domain_Computations
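A small sketch of the logsum (log-sum-exp) computation in Python; the function name and the example values are mine, not from the linked page.

import math

def logsum(log_values):
    """Compute log(sum_i exp(log_values[i])) without leaving the log domain."""
    m = max(log_values)                      # factor out the largest term to avoid underflow
    if m == float("-inf"):                   # all terms are zero probability
        return float("-inf")
    return m + math.log(sum(math.exp(v - m) for v in log_values))

# Example: log P(x_i) as a logsum over clusters of  log λ_k + Σ_w x_{i,w} · log β_{k,w}
log_terms = [math.log(0.5) + 4 * math.log(0.1),
             math.log(0.5) + 4 * math.log(0.3)]
log_px = logsum(log_terms)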
Likelihood Surface
[Figure: data likelihood plotted over model parameterizations; the surface has multiple local maxima.]
EM Properties
Each step of EM is guaranteed not to decrease the data likelihood: a hill-climbing procedure
Not guaranteed to find the global maximum of the data likelihood
Data likelihood typically has many local maxima for a general model class and rich feature set
Many "patterns" in the data that we can fit our model to…
Idea: random restarts!
EM for Naïve Bayes: E Step
Document j: "Dog cat cat cat cat"
1. Compute posteriors for each document:
P(c_j = 1 | x_j ; θ) = 0.2
P(c_j = 2 | x_j ; θ) = 0.8
2. Use the posteriors as partial counts, and total them up:
count(1) += 0.2
count(2) += 0.8
count(1, dog) += 1 ⋅ 0.2
count(2, dog) += 1 ⋅ 0.8
count(1, cat) += 4 ⋅ 0.2
count(2, cat) += 4 ⋅ 0.8
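A sketch of this E step in Python. Only the counting scheme follows the slide; the current parameters lam and beta (and the posteriors they produce) are made up for illustration.

import math
from collections import Counter, defaultdict

def e_step(doc_counts, lam, beta):
    """Return the posterior P(c | doc) for each cluster c under the current parameters."""
    log_post = {c: math.log(lam[c]) +
                   sum(n * math.log(beta[c][w]) for w, n in doc_counts.items())
                for c in lam}
    m = max(log_post.values())                       # normalize in the log domain
    z = sum(math.exp(v - m) for v in log_post.values())
    return {c: math.exp(v - m) / z for c, v in log_post.items()}

# Hypothetical current parameters for clusters 1 and 2
lam = {1: 0.5, 2: 0.5}
beta = {1: {"dog": 0.7, "cat": 0.3}, 2: {"dog": 0.2, "cat": 0.8}}

doc = Counter({"dog": 1, "cat": 4})                  # document j: "Dog cat cat cat cat"
post = e_step(doc, lam, beta)                        # posterior over clusters for this document

# Use the posteriors as partial counts, and total them up
cluster_count = defaultdict(float)
word_count = defaultdict(float)
for c, p in post.items():
    cluster_count[c] += p                            # count(c) += P(c | doc)
    for w, n in doc.items():
        word_count[(c, w)] += n * p                  # count(c, w) += n(w, doc) * P(c | doc)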
EM for Naïve Bayes: M Step
Once you have your partial counts, re-estimate parameters as you would with regular counts:
λ_1 = count(c = 1) / N
λ_2 = count(c = 2) / N
β_{1,dog} = (count(1, dog) + 1) / (Σ_{v=1}^{V} count(1, v) + V)
β_{1,cat} = (count(1, cat) + 1) / (Σ_{v=1}^{V} count(1, v) + V)
β_{2,dog} = (count(2, dog) + 1) / (Σ_{v=1}^{V} count(2, v) + V)
β_{2,cat} = (count(2, cat) + 1) / (Σ_{v=1}^{V} count(2, v) + V)
Etc.
In the next E step, the partial counts will be different, because the parameters have changed. E.g.,
P(c_j = 1 | x_j ; θ) ∝ λ_1 ⋅ Π_{w ∈ V} β_{1,w}^{x_{j,w}}
And the likelihood will increase!
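A matching M-step sketch with the add-one smoothing shown on the slide. It assumes the fractional counts (cluster_count, word_count) accumulated in the E-step sketch above; all names are mine.

def m_step(cluster_count, word_count, vocab, n_docs):
    """Re-estimate lambda and beta from fractional counts, with add-one smoothing."""
    lam = {c: cluster_count[c] / n_docs for c in cluster_count}
    beta = {}
    for c in cluster_count:
        total = sum(word_count.get((c, w), 0.0) for w in vocab)
        beta[c] = {w: (word_count.get((c, w), 0.0) + 1.0) / (total + len(vocab))
                   for w in vocab}
    return lam, beta

# Example, continuing from the E-step sketch:
# lam, beta = m_step(cluster_count, word_count, vocab=["dog", "cat"], n_docs=1)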
EM for Multinomial Mixture Model
Initialize model parameters, e.g., randomly
E step:
1. Calculate posterior distributions over clusters for each document x_i
2. Use posteriors as fractional labels; total up your new counts!
M step:
1. Re-estimate λ
2. Re-estimate β from the fractionally labeled data, with smoothing
(A self-contained sketch of the full loop follows below.)
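A compact sketch tying the two steps together for the multinomial mixture model. The random initialization scheme, stopping tolerance, and toy corpus are my own choices for illustration, not the lecture's.

import math
import random
from collections import Counter, defaultdict

def em(docs, k, iters=50, tol=1e-6, seed=0):
    """EM for a multinomial mixture over bag-of-words documents (list of Counters)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Initialize parameters: uniform mixing weights, random normalized word distributions
    lam = {c: 1.0 / k for c in range(k)}
    beta = {c: {w: rng.random() + 0.5 for w in vocab} for c in range(k)}
    for c in range(k):
        z = sum(beta[c].values())
        beta[c] = {w: v / z for w, v in beta[c].items()}

    prev_ll = float("-inf")
    for _ in range(iters):
        # E step: posteriors over clusters, accumulated as fractional counts
        cluster_count = defaultdict(float)
        word_count = defaultdict(float)
        ll = 0.0
        for d in docs:
            log_post = {c: math.log(lam[c]) +
                           sum(n * math.log(beta[c][w]) for w, n in d.items())
                        for c in range(k)}
            m = max(log_post.values())
            log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
            ll += log_z                                  # log-likelihood of this document
            for c in range(k):
                p = math.exp(log_post[c] - log_z)
                cluster_count[c] += p
                for w, n in d.items():
                    word_count[(c, w)] += n * p
        # M step: re-estimate parameters with add-one smoothing
        lam = {c: cluster_count[c] / len(docs) for c in range(k)}
        for c in range(k):
            total = sum(word_count[(c, w)] for w in vocab)
            beta[c] = {w: (word_count[(c, w)] + 1.0) / (total + len(vocab)) for w in vocab}
        if ll - prev_ll < tol:                           # stop when the likelihood stops improving
            break
        prev_ll = ll
    return lam, beta

# Toy corpus in the spirit of the example documents
docs = [Counter(s.lower().split()) for s in
        ["dog dog dog cat", "canine dog woof", "feline cat meow cat", "cat cat cat cat dog"]]
lam, beta = em(docs, k=2)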
EM in General
EM is a technique for learning any time we have incomplete data (x, y), where y is unobserved
Induction scenario (ours): we actually want to know y (e.g., clustering)
Convenience scenario: we want the marginal, P(x), and including y just makes the model simpler (e.g., mixing weights)
General approach: learn y and θ
E step: make a guess at the posteriors P(y | x, θ)
This means scoring all completions with the current parameters
Treat the completions as (fractional) complete data, and count
M step: re-estimate θ to maximize log P(x, y | θ)
Then compute (smoothed) ML estimates of model parameters
Expectation Maximization (EM) for Mixing Parameters
How to estimate mixing parameters λ_1 and λ_2?
Sometimes you can just do line search, as we have discussed.
… or the "try a few orders of magnitude" approach
Alternative: use EM
Think of mixing as a hidden choice between histories:
Given a guess at P_H, we can calculate expectations of which generation route a given token took (over held-out data)
Use these expectations to update P_H, rinse and repeat (see the sketch below)
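A sketch of this idea for a two-way mixture. The per-token component probabilities p1 and p2 (e.g., from a bigram and a unigram model evaluated on held-out data) and all names are illustrative assumptions.

def em_mixing_weight(p1, p2, iters=20, lam=0.5):
    """Estimate lambda in  P_mix = lam * P1 + (1 - lam) * P2  by EM on held-out tokens.

    p1, p2: lists giving P1(token) and P2(token) for each held-out token.
    """
    for _ in range(iters):
        # E step: expected number of tokens generated via the P1 "route"
        expected = sum(lam * a / (lam * a + (1 - lam) * b) for a, b in zip(p1, p2))
        # M step: the new mixing weight is the fraction of tokens attributed to P1
        lam = expected / len(p1)
    return lam

# Example with made-up held-out probabilities
p1 = [0.20, 0.05, 0.30, 0.01]   # e.g., bigram model probabilities
p2 = [0.10, 0.10, 0.10, 0.10]   # e.g., unigram model probabilities
lam = em_mixing_weight(p1, p2)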
Advertisement
Learn more about text mining, including text classification, text clustering, topic models, text summarization, and visualization for text mining:
Register for CS 679, Fall 2012
Next
Next topics:
Machine Translation
If time permits, Coreference Resolution
We can use the EM algorithm to attack both of these problems!