Expectation-Maximization (EM) Algorithm


Md. Rezaul Karim
Professor
Department of Statistics
University of Rajshahi, Bangladesh

September 21, 2012

Basic Concept (1)

• EM algorithm stands for Expectation-Maximization algorithm.

• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).

• The general form was given in Dempster, Laird, and Rubin (1977), although the essence of the algorithm appeared previously in various forms.

Basic Concept (2)

• The EM algorithm is a broadly applicable iterative procedure for computing maximum likelihood estimates in problems with incomplete data.

• The EM algorithm consists of two conceptually distinct steps at each iteration:
  o the Expectation or E-step, and
  o the Maximization or M-step.

• Details can be found in Hartley (1958), Dempster et al. (1977), Little and Rubin (1987), and McLachlan and Krishnan (1997).

Formulation of the EM Algorithm (1)

• Suppose we have a model for a set of complete data Y, with associated density f(Y | θ), where θ = (θ_1, θ_2, ..., θ_d) is a vector of unknown parameters with parameter space Ω.

• We write Y = (Y_obs, Y_mis), where Y_obs represents the observed part of Y and Y_mis denotes the missing values:
  o Complete data Y (e.g., what we'd like to have!)
  o Observed data Y_obs (e.g., what we have)
  o Missing data Y_mis (e.g., incomplete/unobserved)
Formulation of the EM Algorithm (2)

• The EM algorithm is designed to find the value of θ, denoted θ̂, that maximizes the incomplete-data log-likelihood
$$\log L(\theta) = \log f(Y_{\mathrm{obs}} \mid \theta),$$
that is, the MLE of θ based on the observed data Y_obs.

• The EM algorithm starts with an initial value θ^(0). Suppose that θ^(k) denotes the estimate of θ at the kth iteration; then the (k+1)st iteration can be described in two steps as follows:

Formulation of the EM Algorithm (3)

• E-step: Find the conditional expected complete-data log-likelihood given the observed data and θ^(k):
$$Q(\theta \mid \theta^{(k)}) = E\left[\log L(\theta \mid Y) \mid Y_{\mathrm{obs}}, \theta^{(k)}\right] = \int \log L(\theta \mid Y)\, f(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta^{(k)})\, dY_{\mathrm{mis}},$$
which, in the case of the linear exponential family, amounts to estimating the sufficient statistics for the complete data.

• M-step: Determine θ^(k+1) to be a value of θ that maximizes Q(θ | θ^(k)).

• The MLE of θ is found by iterating between the E- and M-steps until a convergence criterion is met.


Formulation of the EM Algorithm (4)

[Schematic of the EM cycle: an initial guess of the unknown parameters is combined with the observed data structure in the E-step to produce a guess of the unknown/hidden data structure and the Q function; the M-step then updates the guess of the unknown parameters, and the cycle repeats.]
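The loop in the schematic can be written down compactly. The following is a minimal R sketch, not from the original slides; em.sketch, e.step.fn and m.step.fn are hypothetical placeholders for the two model-specific computations, and a concrete version for the multinomial example appears on the later R slides.

em.sketch = function(theta0, e.step.fn, m.step.fn, tol = 1e-7, max.iter = 1000) {
  # Generic EM driver (illustrative sketch only): e.step.fn(theta) should return
  # the expected complete-data quantities needed by the M-step, and
  # m.step.fn(expected) should return the theta maximizing Q(theta | current theta).
  theta = theta0
  for (k in 1:max.iter) {
    expected = e.step.fn(theta)       # E-step
    theta.new = m.step.fn(expected)   # M-step
    if (abs(theta.new - theta) < tol) return(theta.new)
    theta = theta.new
  }
  theta
}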

Formulation of the EM Algorithm (5)

• In some cases, it may not be numerically feasible to find the value of θ that globally maximizes the function Q(θ | θ^(k)) in the M-step.

Formulation of the EM Algorithm (6)

• In such situations, a Generalized EM (GEM) algorithm (Dempster et al. 1977) is used to choose θ^(k+1) in the M-step such that the condition
$$Q(\theta^{(k+1)} \mid \theta^{(k)}) \ge Q(\theta^{(k)} \mid \theta^{(k)})$$
holds. For any EM or GEM algorithm, the change from θ^(k) to θ^(k+1) increases the likelihood; that is,
$$\log L(\theta^{(k+1)}) \ge \log L(\theta^{(k)}),$$
which follows from the definition of GEM and Jensen's inequality (see Rao 1972, p. 47).
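A compact sketch of that argument (not spelled out on the slides): since f(Y | θ) = f(Y_obs | θ) f(Y_mis | Y_obs, θ), taking conditional expectations given Y_obs and θ^(k) gives
$$\log L(\theta) = Q(\theta \mid \theta^{(k)}) - H(\theta \mid \theta^{(k)}), \qquad H(\theta \mid \theta^{(k)}) = E\left[\log f(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta) \mid Y_{\mathrm{obs}}, \theta^{(k)}\right].$$
By Jensen's inequality, H(θ | θ^(k)) ≤ H(θ^(k) | θ^(k)) for every θ, so any θ^(k+1) satisfying the GEM condition above also satisfies log L(θ^(k+1)) ≥ log L(θ^(k)).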

Formulation of the EM Algorithm (7)

• This fact implies that the log-likelihood, log L(θ), increases monotonically on any iteration sequence generated by the EM algorithm, which is the fundamental property for the convergence of the algorithm.

• Detailed properties of the algorithm, including the convergence properties, are given in
  o Dempster et al. (1977),
  o Wu (1983),
  o Redner and Walker (1984), and
  o McLachlan and Krishnan (1997).
Formulation of the EM Algorithm (8)

• Methods for obtaining the asymptotic variance-covariance matrix of the EM-computed estimator are derived by
  o Meng and Rubin (1991),
  o Louis (1982), and
  o Oakes (1999).
Multinomial Example (1)

• The data relate to a problem of estimation of linkage in genetics discussed by Rao (1973, pp. 368-369).

• One considers data in which 197 animals are distributed multinomially into four categories with cell probabilities {1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4} for some unknown θ ∈ [0, 1].

• The observed number in each cell was Y_obs = (125, 18, 20, 34), i.e., with n = 197:

  Observed data      Probability
  y1 = 125           1/2 + θ/4
  y2 = 18            (1 − θ)/4
  y3 = 20            (1 − θ)/4
  y4 = 34            θ/4

Multinomial Example (2)

• The density of the observed data is
$$f(y_{\mathrm{obs}} \mid \theta) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}.$$

• The log-likelihood function for θ is therefore, apart from an additive term not involving θ,
$$l(\theta \mid y_{\mathrm{obs}}) = y_1 \log(2 + \theta) + (y_2 + y_3) \log(1 - \theta) + y_4 \log(\theta).$$

• Differentiating w.r.t. θ, we have that
$$\frac{\partial l(\theta \mid y_{\mathrm{obs}})}{\partial \theta} = \frac{y_1}{2 + \theta} - \frac{y_2 + y_3}{1 - \theta} + \frac{y_4}{\theta}.$$
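Since the score equation above can be solved directly (the next slide notes that the log-likelihood can be maximized explicitly), here is a minimal R cross-check, not part of the original slides, that maximizes l(θ | y_obs) with base R's optimize(); the numbers are the linkage counts from the previous slide.

y1 = 125; y2 = 18; y3 = 20; y4 = 34
obs.loglik = function(theta) {
  y1 * log(2 + theta) + (y2 + y3) * log(1 - theta) + y4 * log(theta)
}
optimize(obs.loglik, interval = c(0, 1), maximum = TRUE)
# $maximum is approximately 0.6268 and $objective approximately 67.38,
# matching the limit of the EM iterations reported on the results slide later.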

Multinomial Example (3)

• Although the log-likelihood can be maximized explicitly, we use the example to illustrate the EM algorithm.

• To view the problem as an unobserved-data problem, we think of it as a multinomial experiment with five categories with observations Y = (y11, y12, y2, y3, y4) and cell probabilities (1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).

• That is, we split the first category into two, and we can only observe the sum y1 = y11 + y12. Then y11 and y12 are considered as the unobservable variables.
Multinomial Example (4)

  Observed data    Missing data    Probability
  y1 = 125         y11             1/2
                   y12             θ/4
  y2 = 18                          (1 − θ)/4
  y3 = 20                          (1 − θ)/4
  y4 = 34                          θ/4
  (n = 197)

Multinomial Example (5)

• The density of the complete data is then
$$f(y \mid \theta) = \frac{n!}{y_{11}!\, y_{12}!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2}\right)^{y_{11}} \left(\frac{\theta}{4}\right)^{y_{12}} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$
and the log-likelihood is (apart from a term not involving θ)
$$l(\theta \mid y) = (y_{12} + y_4) \log(\theta) + (y_2 + y_3) \log(1 - \theta).$$

• Since y12 is unobservable, we cannot maximize this directly. This obstacle is overcome by the E-step (as it handles the problem of filling in for unobservable data by averaging the complete-data log-likelihood over its conditional distribution given the observed data).

Multinomial Example (6)

• Let θ^(0) be an initial guess for θ. The E-step requires computation of
$$\begin{aligned}
Q(\theta \mid \theta^{(0)}) &= E\left[\, l_c(\theta; y) \mid y_{\mathrm{obs}}, \theta^{(0)} \right] \\
&= E\left[\, l_c(\theta; y_{11}, y_{12}, y_2, y_3, y_4) \mid y_1, y_2, y_3, y_4, \theta^{(0)} \right] \\
&= \left( E\left[ y_{12} \mid y_1, \theta^{(0)} \right] + y_4 \right) \log(\theta) + (y_2 + y_3) \log(1 - \theta).
\end{aligned}$$

• Thus, we need to compute the conditional expectation of y12 given y1 and θ^(0). But, given y1, y12 follows a Binomial distribution with sample size y1 and probability parameter
$$p^{(0)} = \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4}.$$

Multinomial Example (7)

• Hence, the expected value is
$$E\left[ y_{12} \mid y_1, \theta^{(0)} \right] = y_1 \, \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4} = y_{12}^{(0)} \text{ (say)},$$
and the expression for Q(θ | θ^(0)) is
$$Q(\theta \mid \theta^{(0)}) = (y_{12}^{(0)} + y_4) \log(\theta) + (y_2 + y_3) \log(1 - \theta).$$

• In the M-step we maximize this with respect to θ to get
$$\theta^{(1)} = \frac{y_{12}^{(0)} + y_4}{y_{12}^{(0)} + y_2 + y_3 + y_4}.$$
Multinomial Example (8)

• Iterating this finally gives us the estimate for θ. Summarizing, we get the iterations
$$\theta^{(i+1)} = \frac{y_{12}^{(i)} + y_4}{y_{12}^{(i)} + y_2 + y_3 + y_4}, \quad \text{where} \quad y_{12}^{(i)} = y_1 \, \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}.$$
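As a quick worked check (not on the original slides), starting from θ^(0) = 0.5 the first pass of these updates gives
$$y_{12}^{(0)} = 125 \cdot \frac{0.5/4}{1/2 + 0.5/4} = 125 \cdot \frac{0.125}{0.625} = 25, \qquad \theta^{(1)} = \frac{25 + 34}{25 + 18 + 20 + 34} = \frac{59}{97} \approx 0.6082,$$
which agrees with the first iteration of the R output shown on the results slide.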

Flowchart for EM Algorithm

[Flowchart of the EM iteration:]
  1. Set k = 0; initialize θ^(k).
  2. E-step: compute Q(θ | θ^(k)).
  3. M-step: maximize Q(θ | θ^(k)) over θ to obtain θ*.
  4. Set θ^(k+1) = θ*.
  5. If θ^(k+1) ≈ θ^(k), stop and take θ̂ = θ^(k+1); otherwise set k = k + 1 and return to step 2.

R function for the Example (1)
(y1, y2, y3, y4 are the observed frequencies)

EM.Algo = function(y1, y2, y3, y4, tol, start0) {
  n = y1 + y2 + y3 + y4
  theta.current = start0
  theta.last = 0
  theta = theta.current
  while (abs(theta.last - theta) > tol) {
    y12 = E.step(theta.current, y1)       # E-step: expected y12 given y1
    theta = M.step(y12, y2, y3, y4, n)    # M-step: updated estimate of theta
    theta.last = theta.current
    theta.current = theta
    log.lik = y1 * log(2 + theta.current) +
      (y2 + y3) * log(1 - theta.current) + y4 * log(theta.current)
    cat(c(theta.current, log.lik), '\n')
  }
}

R function for the Example (2)

M.step = function(y12, y2, y3, y4, n) {
  return((y12 + y4) / (y12 + y2 + y3 + y4))
}

E.step = function(theta.current, y1) {
  y12 = y1 * (theta.current / 4) / (0.5 + theta.current / 4)
  return(c(y12))
}

# Results:
EM.Algo(125, 18, 20, 34, 10^(-7), 0.50)

R function for the Example (3)

  Iteration (k)   θ^(k)        log L(θ^(k))
  0               0.5000000    64.62974
  1               0.6082474    67.32017
  2               0.6243210    67.38292
  3               0.6264889    67.38408
  4               0.6267773    67.38410
  5               0.6268156    67.38410
  6               0.6268207    67.38410
  7               0.6268214    67.38410
  8               0.6268215    67.38410

  θ̂ = 0.6268215
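As a check on this limit (not on the original slides), setting the score from the Multinomial Example (2) slide to zero and clearing denominators gives the quadratic
$$n\theta^2 - \left(y_1 - 2(y_2 + y_3) - y_4\right)\theta - 2y_4 = 0, \quad \text{i.e.,} \quad 197\theta^2 - 15\theta - 68 = 0,$$
whose positive root is (15 + √53809)/394 ≈ 0.62682, in agreement with the EM limit above.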


Monte Carlo EM (1)

• In an EM algorithm, the E-step may be difficult to implement because of difficulty in computing the expectation of the log-likelihood.

• Wei and Tanner (1990a, 1990b) suggest a Monte Carlo approach, simulating the missing data Z from the conditional distribution k(z | y, θ^(k)) on the E-step of the (k+1)th iteration.

Monte Carlo EM (2)

• Then the M-step maximizes the approximate conditional expectation of the complete-data log-likelihood
$$Q(\theta \mid \theta^{(k)}) \approx \frac{1}{m} \sum_{j=1}^{m} \log L_c(\theta \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(j)}).$$

• The limiting form of this as m tends to ∞ is the actual Q(θ; θ^(k)).


Monte Carlo EM (3)

• Application of MCEM in the previous example:

• A Monte Carlo EM solution would replace the expectation
$$y_{12}^{(i)} = y_1 \, \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$
with the empirical average
$$y_{12}^{(i)} = \frac{1}{m} \sum_{j=1}^{m} z_j,$$
where the z_j are simulated from a binomial distribution with size y1 and probability
$$\frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}.$$
Monte Carlo EM (4)

• Application of MCEM in the previous example: the R code for the E-step becomes

E.step = function(theta.current, y1) {
  bprob = (theta.current / 4) / (0.5 + theta.current / 4)
  zm = rbinom(10000, y1, bprob)   # simulate the missing counts y12
  y12 = sum(zm) / 10000           # Monte Carlo estimate of E[y12 | y1]
  return(c(y12))
}
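A hedged usage note (not on the original slides): with this stochastic E.step in place of the exact one, the same driver EM.Algo can be reused, but the stopping tolerance should be loosened, since successive updates now carry Monte Carlo noise and a very tight tolerance may never be met; for example:

set.seed(1)                              # make the simulated E-step reproducible
EM.Algo(125, 18, 20, 34, 10^(-3), 0.50)  # looser tolerance than the exact version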


Applications of EM algorithm (1)

The EM algorithm is frequently used for:

• Data clustering (the assignment of a set of observations into subsets, called clusters, so that observations in the same cluster are similar in some sense), used in many fields, including machine learning, computer vision, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics.

• Natural language processing (NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages).

Applications of EM algorithm (2)

• Psychometrics (the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits).

• Medical image reconstruction, especially in positron emission tomography (PET) and single photon emission computed tomography (SPECT).

Applications of EM algorithm (3)

More applications regarding data analysis examples are:

• Multivariate Data with Missing Values
  o Example: Bivariate Normal Data with Missing Values

• Least Squares with Missing Data
  o Example: Linear Regression with Missing Dependent Values
  o Example: Missing Values in a Latin Square Design

• Example: Multinomial with Complex Cell Structure

• Example: Analysis of PET and SPECT Data

• Example: Mixture distributions

• Example: Grouped, Censored and Truncated Data
  o Example: Grouped Log Normal Data
  o Example: Lifetime distributions for censored data

Advantages of EM algorithm (1)

• The EM algorithm is numerically stable, with each EM iteration increasing the likelihood.

• Under fairly general conditions, the EM algorithm has reliable global convergence (depending on the initial value and the likelihood!). Convergence is nearly always to a local maximizer.

• The EM algorithm is typically easily implemented, because it relies on complete-data computations.

• The EM algorithm is generally easy to program, since no evaluation of the likelihood nor its derivatives is involved.

Advantages of EM algorithm (2)

• The EM algorithm requires little storage space and can generally be carried out on a small computer (it does not have to store the information matrix nor its inverse at any iteration).

• The M-step can often be carried out using standard statistical packages in situations where the complete-data MLEs do not exist in closed form.

• By watching the monotone increase in likelihood over iterations, it is easy to monitor convergence and detect programming errors.

• The EM algorithm can be used to provide estimated values of the "missing" data.

Criticisms of EM algorithm

• Unlike Fisher's scoring method, it does not have an inbuilt procedure for producing an estimate of the covariance matrix of the parameter estimates.

• The EM algorithm may converge slowly, even in some seemingly innocuous problems and in problems where there is too much 'incomplete information'.

• Like Newton-type methods, the EM algorithm does not guarantee convergence to the global maximum when there are multiple maxima (in this case, the estimate obtained depends upon the initial value).

• In some problems, the E-step may be analytically intractable, although in such situations there is the possibility of effecting it via a Monte Carlo approach.

References (1)

1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Royal Statist Soc B 39:1-38
2. Hartley HO (1958) Maximum likelihood estimation from incomplete data. Biometrics 14:174-194
3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York
4. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Royal Statist Soc B 44:226-233
5. McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley & Sons, Inc., New York

References (2)

6. Meng XL, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86:899-909
7. Oakes D (1999) Direct calculation of the information matrix via the EM algorithm. J Royal Statist Soc B 61:479-482
8. Rao CR (1972) Linear Statistical Inference and its Applications. John Wiley & Sons, Inc., New York
9. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195-239

References (3)

10. Wei GCG, Tanner MA (1990a) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J Am Statist Assoc 85:699-704
11. Wei GCG, Tanner MA (1990b) Posterior computations for censored regression data. J Am Statist Assoc 85:829-839
12. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Statist 11:95-103

Thank You