Expectation Maximization (EM) Algorithm
Md. Rezaul Karim
Professor
Department of Statistics
University of Rajshahi, Bangladesh
September 21, 2012
Basic Concept (1)
Dr. M. R. Karim, Stats, R.U.
EM algorithm stands for "Expectation-Maximization" algorithm.
A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
The general form was given in Dempster, Laird, and Rubin (1977), although the essence of the algorithm appeared previously in various forms.
Basic Concept (2)
The EM algorithm is a broadly applicable iterative procedure for computing maximum likelihood estimates in problems with incomplete data.
The EM algorithm consists of two conceptually distinct steps at each iteration:
o the Expectation or E-step, and
o the Maximization or M-step.
Details can be found in Hartley (1958), Dempster et al. (1977), Little and Rubin (1987), and McLachlan and Krishnan (1997).
Formulation of the EM Algorithm (1)
Suppose we have a model for a set of complete data Y, with associated density f(Y | θ), where θ = (θ_1, θ_2, ..., θ_d) is a vector of unknown parameters with parameter space Ω.
We write Y = (Y_obs, Y_mis), where Y_obs represents the observed part of Y and Y_mis denotes the missing values.
• Complete data Y (e.g., what we'd like to have!)
• Observed data Y_obs (e.g., what we have)
• Missing data Y_mis (e.g., incomplete/unobserved)
Formulation of the EM Algorithm (2)
The EM algorithm is designed to find the value of θ, denoted θ̂, that maximizes the incomplete-data log-likelihood
    log L(θ) = log f(Y_obs | θ),
that is, the MLE of θ based on the observed data Y_obs.
The EM algorithm starts with an initial value θ^(0). Suppose that θ^(k) denotes the estimate of θ at the k-th iteration; then the (k+1)st iteration can be described in two steps as follows:
Formulation of the EM Algorithm (3)
E-step: Find the conditional expectation of the complete-data log-likelihood given the observed data and θ^(k):
    Q(θ; θ^(k)) = E[ log L(θ; Y_obs, Y_mis) | Y_obs, θ^(k) ]
                = ∫ log L(θ; Y_obs, Y_mis) f(Y_mis | Y_obs, θ^(k)) dY_mis,
which, in the case of the linear exponential family, amounts to estimating the sufficient statistics for the complete data.
M-step: Determine θ^(k+1) to be a value of θ that maximizes Q(θ; θ^(k)).
The MLE of θ is found by iterating between the E- and M-steps until a convergence criterion is met.
Formulation of the EM Algorithm (4)
[Diagram: starting from an initial guess of the unknown parameters, the E-step combines the observed data structure with a guess of the unknown/hidden data structure to form the Q function; the M-step then updates the parameter guess, and the two steps alternate.]
Formulation of the EM Algorithm (5)
In some cases, it may not be numerically feasible to find the value of θ that globally maximizes the function Q(θ; θ^(k)) in the M-step.
Formulation of the EM Algorithm (6)
In such situations, a Generalized EM (GEM) algorithm (Dempster et al. 1977) is used to choose θ^(k+1) in the M-step such that the condition
    Q(θ^(k+1); θ^(k)) ≥ Q(θ^(k); θ^(k))
holds. For any EM or GEM algorithm, the change from θ^(k) to θ^(k+1) increases the likelihood; that is,
    log L(θ^(k+1)) ≥ log L(θ^(k)),
which follows from the definition of GEM and Jensen's inequality (see Rao 1972, p. 47).
Formulation of the EM Algorithm (7)
This fact implies that the log-likelihood log L(θ) increases monotonically on any iteration sequence generated by the EM algorithm, which is the fundamental property for the convergence of the algorithm.
Detailed properties of the algorithm, including the convergence properties, are given in
o Dempster et al. (1977),
o Wu (1983),
o Redner and Walker (1984), and
o McLachlan and Krishnan (1997).
Formulation of the EM Algorithm (8)
Methods for obtaining the asymptotic variance-covariance matrix of the EM-computed estimator are derived by
o Meng and Rubin (1991),
o Louis (1982), and
o Oakes (1999).
Multinomial Example (1)
The data relate to a problem of estimation of linkage in genetics discussed by Rao (1973, pp. 368-369).
One considers data in which 197 animals are distributed multinomially into four categories with cell probabilities
    { 1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4 }
for some unknown θ ∈ [0, 1].
The observed number in each cell was Y_obs = (125, 18, 20, 34), with n = 197:
    Observed data:  y_1 = 125    y_2 = 18    y_3 = 20    y_4 = 34
    Probability:    1/2 + θ/4    (1−θ)/4     (1−θ)/4     θ/4
Multinomial Example (2)
The density of the observed data is
    f(y_obs | θ) = n! / (y_1! y_2! y_3! y_4!) · (1/2 + θ/4)^{y_1} ((1−θ)/4)^{y_2} ((1−θ)/4)^{y_3} (θ/4)^{y_4}
The log-likelihood function for θ is therefore, apart from an additive term not involving θ,
    l(θ | y_obs) = y_1 log(2 + θ) + (y_2 + y_3) log(1 − θ) + y_4 log(θ)
Differentiating w.r.t. θ, we have that
    ∂l(θ | y_obs)/∂θ = y_1/(2 + θ) − (y_2 + y_3)/(1 − θ) + y_4/θ
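As a check on the explicit maximization: setting the score above to zero and clearing denominators gives the quadratic n·θ² − (y_1 − y_4 − 2(y_2 + y_3))·θ − 2·y_4 = 0. A minimal Python sketch of the direct MLE (Python is used here purely for illustration; the slides' own code is in R):

```python
import math

# Direct MLE for the linkage data: the score equation
#   y1/(2+theta) - (y2+y3)/(1-theta) + y4/theta = 0
# clears denominators to the quadratic
#   n*theta^2 - (y1 - y4 - 2*(y2+y3))*theta - 2*y4 = 0.
y1, y2, y3, y4 = 125, 18, 20, 34
n = y1 + y2 + y3 + y4
b = y1 - y4 - 2 * (y2 + y3)
theta_hat = (b + math.sqrt(b * b + 8 * n * y4)) / (2 * n)  # positive root, lies in [0, 1]
print(theta_hat)
```

This root agrees with the value 0.6268215 that the EM iterations converge to later in the slides.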
Multinomial Example (3)
Although the log-likelihood can be maximized explicitly, we use the example to illustrate the EM algorithm.
To view the problem as an unobserved data problem, we think of it as a multinomial experiment with five categories with observations Y = (y_11, y_12, y_2, y_3, y_4) and cell probabilities (1/2, θ/4, (1−θ)/4, (1−θ)/4, θ/4).
That is, we split the first category into two, and we can only observe the sum y_1 = y_11 + y_12. Then y_11 and y_12 are considered as the unobservable variables.
Multinomial Example (4)
    Observed data:  y_1 = 125              y_2 = 18    y_3 = 20    y_4 = 34
    Missing data:   y_11       y_12
    Probability:    1/2        θ/4         (1−θ)/4     (1−θ)/4     θ/4
(n = 197; the first observed cell splits into the two unobserved cells y_11 and y_12.)
Multinomial Example (5)
The density of the complete data is then
    f(y | θ) = n! / (y_11! y_12! y_2! y_3! y_4!) · (1/2)^{y_11} (θ/4)^{y_12} ((1−θ)/4)^{y_2} ((1−θ)/4)^{y_3} (θ/4)^{y_4}
and the log-likelihood is (apart from a term not involving θ)
    l(θ | y) = (y_12 + y_4) log(θ) + (y_2 + y_3) log(1 − θ)
Since y_12 is unobservable, we cannot maximize this directly.
This obstacle is overcome by the E-step (as it handles the problem of filling in for unobservable data by averaging the complete-data log-likelihood over its conditional distribution given the observed data).
Multinomial Example (6)
Let θ^(0) be an initial guess for θ. The E-step requires computation of
    Q(θ; θ^(0)) = E[ l_c(θ | y) | y_obs, θ^(0) ]
                = E[ l_c(θ | y_11, y_12, y_2, y_3, y_4) | y_1, y_2, y_3, y_4, θ^(0) ]
                = ( E[y_12 | y_1, θ^(0)] + y_4 ) log(θ) + (y_2 + y_3) log(1 − θ)
Thus, we need to compute the conditional expectation of y_12 given y_1 and θ^(0). But this is a binomial distribution with sample size y_1 and probability parameter
    p^(0) = (θ^(0)/4) / (1/2 + θ^(0)/4)
Multinomial Example (7)
Hence, the expected value is
    E[y_12 | y_1, θ^(0)] = y_1 · (θ^(0)/4) / (1/2 + θ^(0)/4) = y_12^(0), say,
and the expression for Q(θ; θ^(0)) is
    Q(θ; θ^(0)) = (y_12^(0) + y_4) log(θ) + (y_2 + y_3) log(1 − θ)
In the M-step we maximize this with respect to θ to get
    θ^(1) = (y_12^(0) + y_4) / (y_12^(0) + y_2 + y_3 + y_4)
Multinomial Example (8)
Then, iterating this finally gives us the estimate for θ. Summarizing, we get the iterations
    θ^(i+1) = (y_12^(i) + y_4) / (y_12^(i) + y_2 + y_3 + y_4)
where
    y_12^(i) = y_1 · (θ^(i)/4) / (1/2 + θ^(i)/4)
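The two update formulas above translate directly into code. A minimal Python transcription (the slides implement the same iteration in R; Python is used here just for illustration):

```python
def e_step(theta, y1):
    # E[y12 | y1, theta]: mean of Binomial(y1, p) with p = (theta/4)/(1/2 + theta/4)
    return y1 * (theta / 4) / (0.5 + theta / 4)

def m_step(y12, y2, y3, y4):
    # maximizer of (y12 + y4)*log(theta) + (y2 + y3)*log(1 - theta)
    return (y12 + y4) / (y12 + y2 + y3 + y4)

def em(y1, y2, y3, y4, tol=1e-7, start=0.5):
    theta, last = start, float('inf')
    while abs(theta - last) > tol:
        last = theta
        theta = m_step(e_step(theta, y1), y2, y3, y4)
    return theta

print(em(125, 18, 20, 34))  # converges to about 0.6268215
```

Starting from θ^(0) = 0.5, the first update is (25 + 34)/(25 + 72) = 0.6082474, matching the iteration table shown later.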
Flowchart for EM Algorithm
[Flowchart: Set k = 0 and initialize θ^(k). E-step: compute Q(θ; θ^(k)). M-step: maximize to obtain θ*; set θ^(k+1) = θ*. If |θ^(k+1) − θ^(k)| < ε, stop with θ̂ = θ^(k+1); otherwise set k = k + 1 and repeat.]
R function for the Example (1)
(y1, y2, y3, y4 are the observed frequencies)

EM.Algo = function(y1, y2, y3, y4, tol, start0) {
  n = y1 + y2 + y3 + y4
  theta.current = start0
  theta.last = 0
  theta = theta.current
  while (abs(theta.last - theta) > tol) {
    y12 = E.step(theta.current, y1)
    theta = M.step(y12, y2, y3, y4, n)
    theta.last = theta.current
    theta.current = theta
    log.lik = y1 * log(2 + theta.current) +
      (y2 + y3) * log(1 - theta.current) + y4 * log(theta.current)
    cat(c(theta.current, log.lik), '\n')
  }
}
R function for the Example (2)

M.step = function(y12, y2, y3, y4, n) {
  return((y12 + y4) / (y12 + y2 + y3 + y4))
}

E.step = function(theta.current, y1) {
  y12 = y1 * (theta.current / 4) / (0.5 + theta.current / 4)
  return(c(y12))
}

# Results:
EM.Algo(125, 18, 20, 34, 10^(-7), 0.50)
R function for the Example (3)

Iteration (k)   θ^(k)       log L^(k)
0               0.5000000   64.62974
1               0.6082474   67.32017
2               0.6243210   67.38292
3               0.6264889   67.38408
4               0.6267773   67.38410
5               0.6268156   67.38410
6               0.6268207   67.38410
7               0.6268214   67.38410
8               0.6268215   67.38410

θ̂ = 0.6268215
Monte Carlo EM (1)
• In an EM algorithm, the E-step may be difficult to implement because of difficulty in computing the expectation of the log-likelihood.
• Wei and Tanner (1990a, 1990b) suggest a Monte Carlo approach: simulate the missing data Z from the conditional distribution k(z | y, θ^(k)) on the E-step of the (k+1)th iteration.
Monte Carlo EM (2)
• Then maximize the approximate conditional expectation of the complete-data log-likelihood
    Q_m(θ; θ^(k)) = (1/m) Σ_{j=1}^{m} log L_c(θ; y_obs, y_mis^(j))
• The limiting form of this as m tends to ∞ is the actual Q(θ; θ^(k)).
Monte Carlo EM (3)
Application of MCEM in the previous example:
• A Monte Carlo EM solution would replace the exact expectation
    y_12^(i) = y_1 · (θ^(i)/4) / (1/2 + θ^(i)/4)
with the empirical average
    y_12 = (1/m) Σ_{j=1}^{m} z^(j),
where the z^(j) are simulated from a binomial distribution with size y_1 and probability (θ^(i)/4) / (1/2 + θ^(i)/4).
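A stdlib-only Python sketch of this Monte Carlo E-step (the slides implement the same idea in R with rbinom; here each binomial draw is built from Bernoulli trials, and the choices m = 10000 and the seed are arbitrary):

```python
import random

def mc_e_step(theta, y1, m=10000, seed=1):
    # Approximate E[y12 | y1, theta] by averaging m simulated Binomial(y1, p)
    # draws, where p = (theta/4) / (1/2 + theta/4).
    rng = random.Random(seed)
    p = (theta / 4) / (0.5 + theta / 4)
    draws = [sum(rng.random() < p for _ in range(y1)) for _ in range(m)]
    return sum(draws) / m

print(mc_e_step(0.5, 125))  # close to the exact value 125 * 0.2 = 25
```

Plugging mc_e_step in place of the exact E-step gives the MCEM iteration; as m grows, the update approaches the deterministic EM update.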
Monte Carlo EM (4)
Application of MCEM in the previous example:
• The R code for the E-step becomes

E.step = function(theta.current, y1) {
  bprob = (theta.current / 4) / (0.5 + theta.current / 4)
  zm = rbinom(10000, y1, bprob)
  y12 = sum(zm) / 10000
  return(c(y12))
}
Applications of EM algorithm (1)
EM algorithm is frequently used for
– Data clustering (the assignment of a set of observations into subsets, called clusters, so that observations in the same cluster are similar in some sense), used in many fields, including machine learning, computer vision, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics
– Natural language processing (NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages)
Applications of EM algorithm (2)
– Psychometrics (the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits)
– Medical image reconstruction, especially in positron emission tomography (PET) and single photon emission computed tomography (SPECT)
Applications of EM algorithm (3)
More applications regarding data analysis examples are
– Multivariate Data with Missing Values
  o Example: Bivariate Normal Data with Missing Values
– Least Squares with Missing Data
  o Example: Linear Regression with Missing Dependent Values
  o Example: Missing Values in a Latin Square Design
– Example: Multinomial with Complex Cell Structure
– Example: Analysis of PET and SPECT Data
– Example: Mixture Distributions
– Example: Grouped, Censored and Truncated Data
  o Example: Grouped Log-Normal Data
  o Example: Lifetime Distributions for Censored Data
Advantages of EM algorithm (1)
– The EM algorithm is numerically stable, with each EM iteration increasing the likelihood.
– Under fairly general conditions, the EM algorithm has reliable global convergence (depending on the initial value and the likelihood!). Convergence is nearly always to a local maximizer.
– The EM algorithm is typically easily implemented, because it relies on complete-data computations.
– The EM algorithm is generally easy to program, since no evaluation of the likelihood nor its derivatives is involved.
Advantages of EM algorithm (2)
– The EM algorithm requires small storage space and can generally be carried out on a small computer (it does not have to store the information matrix nor its inverse at any iteration).
– The M-step can often be carried out using standard statistical packages in situations where the complete-data MLEs do not exist in closed form.
– By watching the monotone increase in likelihood over iterations, it is easy to monitor convergence and programming errors.
– The EM algorithm can be used to provide estimated values of the "missing" data.
Criticisms of EM algorithm
– Unlike Fisher's scoring method, it does not have an inbuilt procedure for producing an estimate of the covariance matrix of the parameter estimates.
– The EM algorithm may converge slowly, even in some seemingly innocuous problems and in problems where there is too much 'incomplete information'.
– The EM algorithm, like the Newton-type methods, does not guarantee convergence to the global maximum when there are multiple maxima (in this case, the estimate obtained depends upon the initial value).
– In some problems, the E-step may be analytically intractable, although in such situations there is the possibility of effecting it via a Monte Carlo approach.
References (1)
1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Royal Statist Soc B 39:1–38
2. Hartley HO (1958) Maximum likelihood estimation from incomplete data. Biometrics 14:174–194
3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York
4. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Royal Statist Soc B 44:226–233
5. McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley & Sons, Inc., New York
References (2)
6. Meng XL, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86:899–909
7. Oakes D (1999) Direct calculation of the information matrix via the EM algorithm. J Royal Statist Soc B 61:479–482
8. Rao CR (1972) Linear Statistical Inference and its Applications. John Wiley & Sons, Inc., New York
9. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
References (3)
10. Wei GCG, Tanner MA (1990a) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J Am Statist Assoc 85:699–704
11. Wei GCG, Tanner MA (1990b) Posterior computations for censored regression data. J Am Statist Assoc 85:829–839
12. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Statist 11:95–103
Thank You