# Beyond IBM Model 1


4th April 2011
(adapted from notes from Philipp Koehn & Mary Hearne)

Dr. Declan Groves, CNGL, DCU
dgroves@computing.dcu.ie

## IBM Model 1: Formal Description

Generative model: break up translation into smaller steps.

More formally, the translation probability is defined over:

- a foreign (for this case, we'll say French) sentence $f = (f_1, \ldots, f_{l_f})$ of length $l_f$
- an English sentence $e = (e_1, \ldots, e_{l_e})$ of length $l_e$

Each French word $f_j$ is generated by an English word $e_{a(j)}$, as defined by the alignment function $a: j \rightarrow i$, with the probability $t(f_j \mid e_{a(j)})$:

$$P(f, a \mid e) = \frac{\epsilon}{(l_e + 1)^{l_f}} \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})$$

where $a(j)$ = the English position connected to the $j$-th French word in alignment $a$.
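As a sketch, the joint probability above can be computed directly for one fixed alignment. The t-table entries below are made-up toy values (and the normalisation constant $\epsilon$ is taken as 1):

```python
def model1_joint_prob(f_sent, e_sent, a, t, epsilon=1.0):
    """P(f, a | e) = epsilon / (l_e + 1)^{l_f} * prod_j t(f_j | e_{a(j)}).

    `a[j]` is the English position (0-based) generating French word j;
    -1 stands for the NULL token.
    """
    l_e, l_f = len(e_sent), len(f_sent)
    prob = epsilon / (l_e + 1) ** l_f
    for j, f_word in enumerate(f_sent):
        e_word = "NULL" if a[j] == -1 else e_sent[a[j]]
        prob *= t[(f_word, e_word)]  # one t-value per French word
    return prob

# Toy values: "la" aligned to "the", "maison" aligned to "house"
t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
p = model1_joint_prob(["la", "maison"], ["the", "house"], [0, 1], t)
# p = 1 / (2 + 1)^2 * 0.7 * 0.8
```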

$$P(a \mid e, f) = \frac{P(a, f \mid e)}{P(f \mid e)} = \frac{P(a, f \mid e)}{\sum_a P(a, f \mid e)} = \frac{\prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})}{\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})}$$

This means that in order to compute the denominator (the sum over all possible alignments), we need to compute the probabilities of all possible alignments for each sentence pair.

How many alignments are there? $(l_e + 1)^{l_f}$

In other words, we have to carry out on the order of $(l_f)(l_e + 1)^{l_f}$ arithmetic operations, which is far too many for a practical proposition.

Written out in full, the denominator enumerates one product of $t$-values per possible alignment:

$$\begin{aligned}
&t(f_1|e_1) \times t(f_2|e_1) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_1)\\
{}+{}&t(f_1|e_1) \times t(f_2|e_1) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_2)\\
{}+{}&t(f_1|e_1) \times t(f_2|e_1) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_3)\\
{}+{}&t(f_1|e_1) \times t(f_2|e_2) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_1)\\
{}+{}&t(f_1|e_1) \times t(f_2|e_2) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_2)\\
{}+{}&t(f_1|e_2) \times t(f_2|e_2) \times t(f_3|e_2) \times \cdots \times t(f_{l_f}|e_1)\\
{}+{}&\cdots
\end{aligned}$$

However, there are regularities in this expression which we can factor out. For example, we can factor out $t(f_1|e_1)$ from all the rows in which it occurs.

Compare $xy + xz$ (3 arithmetic operations) with $x(y + z)$ (2 arithmetic operations).

Factoring out $t(f_1|e_1)$ (and likewise $t(f_1|e_2)$, and so on) groups the rows:

$$\begin{aligned}
t(f_1|e_1) \times \big[\, &t(f_2|e_1) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_1)\\
{}+{}\ &t(f_2|e_1) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_2)\\
{}+{}\ &t(f_2|e_2) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_1)\\
{}+{}\ &t(f_2|e_2) \times t(f_3|e_1) \times \cdots \times t(f_{l_f}|e_2) + \cdots \,\big]\\
{}+{}\ t(f_1|e_2) \times \big[\, &t(f_2|e_2) \times t(f_3|e_2) \times \cdots \times t(f_{l_f}|e_1) + \cdots \,\big]\\
{}+{}\ &\cdots
\end{aligned}$$

Factoring out continually gives us:

$$\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) = \prod_{j=1}^{l_f} \sum_{i=0}^{l_e} t(f_j \mid e_i)$$

This makes the solution tractable: the right-hand side requires only about $l_f (l_e + 1)$ operations rather than an exponential number.
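The factorisation can be checked numerically on a toy t-table (the probability values below are illustrative, not trained): summing one product per alignment, $(l_e + 1)^{l_f}$ of them, equals a product of $l_f$ small sums.

```python
from itertools import product

def sum_over_alignments(f_sent, e_with_null, t):
    # brute force: one product of t-values per possible alignment
    total = 0.0
    for a in product(range(len(e_with_null)), repeat=len(f_sent)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(f_sent[j], e_with_null[i])]
        total += p
    return total

def factored(f_sent, e_with_null, t):
    # the trick: l_f sums of (l_e + 1) terms each
    total = 1.0
    for f_word in f_sent:
        total *= sum(t[(f_word, e_word)] for e_word in e_with_null)
    return total

e_with_null = ["NULL", "the", "house"]
f_sent = ["la", "maison"]
t = {("la", "NULL"): 0.1, ("la", "the"): 0.7, ("la", "house"): 0.05,
     ("maison", "NULL"): 0.1, ("maison", "the"): 0.1, ("maison", "house"): 0.8}

brute = sum_over_alignments(f_sent, e_with_null, t)
fast = factored(f_sent, e_with_null, t)
```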

## IBM Model 1: Pseudocode

The EM algorithm:

    initialize t(e|f) uniformly
    do
      set count(e|f) to 0 for all e,f
      set total(f) to 0 for all f
      for all sentence pairs (e_s, f_s)
        // compute normalisation
        for all words e in e_s
          total_s(e) = 0
          for all words f in f_s
            total_s(e) += t(e|f)
        // collect counts
        for all words e in e_s
          for all words f in f_s
            count(e|f) += t(e|f) / total_s(e)
            total(f)   += t(e|f) / total_s(e)
      // re-estimate probabilities
      for all f in domain( total(.) )
        for all e in domain( count(.|f) )
          t(e|f) = count(e|f) / total(f)
    until convergence
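The pseudocode transcribes directly into Python. The two-pair corpus below is a toy illustration; with repeated co-occurrence, EM pushes $t(\text{house} \mid \text{maison})$ towards 1:

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    e_vocab = {e for e_s, _ in corpus for e in e_s}
    f_vocab = {f for _, f_s in corpus for f in f_s}
    # initialize t(e|f) uniformly
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for e_s, f_s in corpus:
            # compute normalisation total_s(e) for this sentence pair
            total_s = {e: sum(t[(e, f)] for f in f_s) for e in e_s}
            # collect fractional counts
            for e in e_s:
                for f in f_s:
                    count[(e, f)] += t[(e, f)] / total_s[e]
                    total[f] += t[(e, f)] / total_s[e]
        # re-estimate probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

corpus = [(["the", "house"], ["la", "maison"]),
          (["house"], ["maison"])]
t = train_model1(corpus)
# t("house"|"maison") approaches 1 as iterations increase
```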

## Beyond IBM Model 1

So far, we've worked using IBM Model 1, computing according to the following formula:

$$\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) = \prod_{j=1}^{l_f} \sum_{i=0}^{l_e} t(f_j \mid e_i)$$

In other words, we're saying the likelihood of two strings corresponding is calculated only in terms of word-translation probabilities.

(Example: *color* aligns to *couleur* in *tête d'impression couleur*.)

Calculating alignment probabilities (and therefore, string correspondences) based only on word-for-word translation probabilities does not necessarily yield particularly good alignments.

This is why, before, we talked about things like *fertility*, *distortion* and *insertion*.

(Example: a word-for-word translation produces *modification de les options de impression* rather than the grammatical *modification des options d'impression*.)

## IBM Model 2

IBM Model 2 involves both word translations and distortions.

Distortion probabilities:

$$d(j \mid i, l_e, l_f)$$

where $j$ = French position, $i$ = English position, $l_e$ = English sentence length, $l_f$ = French sentence length.

Reverse distortion probabilities:

$$r(i \mid j, l_e, l_f)$$

Model 2 uses these reverse distortion probabilities.

How to incorporate reverse distortion probabilities:

$$P(f, a \mid e) = \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times \prod_{j=1}^{l_f} r(a(j) \mid j, l_e, l_f)$$

This model may:

- Penalise an alignment connecting words which are very far apart, even if these words are (statistically) good translations of each other.
- Prefer alignments on the basis of positioning, which can be useful where the words themselves are rare and, therefore, we don't have good t-values for them.

To find the *best* alignment:

$$\operatorname*{argmax}_a P(a \mid e, f) = \operatorname*{argmax}_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times r(a(j) \mid j, l_e, l_f)$$

where $a(j) = i$, a position in the English sentence.

EM training of this model works in the same way as for IBM Model 1.
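Because the Model 2 product factors over French positions $j$, the argmax can be taken independently for each position: $\hat{a}(j) = \operatorname{argmax}_i \, t(f_j \mid e_i) \, r(i \mid j, l_e, l_f)$. A sketch with made-up t- and r-values (NULL is ignored here):

```python
def best_alignment(f_sent, e_sent, t, r):
    # per-position argmax: the product over j factors, so each a(j)
    # can be chosen independently of the others
    l_e, l_f = len(e_sent), len(f_sent)
    a = []
    for j in range(1, l_f + 1):              # French positions, 1-based
        scores = [(t[(f_sent[j - 1], e_sent[i - 1])] * r[(i, j, l_e, l_f)], i)
                  for i in range(1, l_e + 1)]
        a.append(max(scores)[1])             # keep the best English position
    return a

e_sent = ["the", "house"]
f_sent = ["la", "maison"]
t = {("la", "the"): 0.7, ("la", "house"): 0.05,
     ("maison", "the"): 0.1, ("maison", "house"): 0.8}
# toy reverse distortion: same position favoured
r = {(i, j, 2, 2): (0.8 if i == j else 0.2) for i in (1, 2) for j in (1, 2)}
best = best_alignment(f_sent, e_sent, t, r)
# best = [1, 2]: "la" -> "the", "maison" -> "house"
```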

## Moving from Model 1 to Model 2

We could apply Model 2 directly to the sentence pair corpus to establish t- and rd-tables, thereby bypassing Model 1.

However, it may be better to settle on reasonable word-translation probabilities first, before introducing distortions.

We can:

- Run Model 1 for several iterations
- Use the resulting t-table as the t-table values for the first Model 2 iteration
- Use a uniformly-distributed rd-table as input to the first Model 2 iteration

This is called *transferring parameter values* from one model to another.
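A minimal sketch of this transfer, assuming a t-table from a previous Model 1 run (the values below are stand-ins for that output):

```python
def init_model2(t_model1, l_e, l_f):
    # transfer: reuse Model 1's word-translation estimates as-is
    t = dict(t_model1)
    # uniform reverse distortion: each French position j is equally
    # likely to connect to any English position i (NULL ignored here)
    r = {(i, j, l_e, l_f): 1.0 / l_e
         for j in range(1, l_f + 1)
         for i in range(1, l_e + 1)}
    return t, r

t_model1 = {("la", "the"): 0.9, ("maison", "house"): 0.9}  # pretend Model 1 output
t, r = init_model2(t_model1, l_e=2, l_f=3)
```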

## IBM Model 3

First, allow a single English word to translate many French words:

- *Fertility* probabilities

Next, translate the English words into French words.

- Model 3 also caters for the translation of the NULL token, to allow for the *insertion* of words.

Finally, allow the French units to be reordered:

- *Distortion* probabilities

Model 3 calculates distortion probabilities (as opposed to Model 2's reverse distortion probabilities).

## IBM Model 3: NULL values

Problem 1: We should not count distortions for French words generated by NULL.

- Eliminate any $d$-value for which $a(j) = 0$.

Problem 2: We did not include costs for inserting extra foreign words, i.e. foreign words generated by NULL.

- Any $f_j$ for which $a(j) = 0$ (i.e. $e_{a(j)}$ = NULL) is extra.
- There are $\phi_0$ extra French words ($\phi_0$ = the fertility of NULL).
- There are $(l_f - \phi_0)$ non-extra French words.
- After each one of these $(l_f - \phi_0)$ non-extra words, we have the option of inserting an extra word with probability $p_1$.
- How many ways are there to insert $\phi_0$ extra words?

$$\binom{l_f - \phi_0}{\phi_0} = \frac{(l_f - \phi_0)!}{\phi_0! \, (l_f - 2\phi_0)!}$$

- We must also factor in the cost of inserting/not inserting:
  - If we add an extra word $\phi_0$ times: $p_1^{\phi_0}$
  - If we do *not* add an extra word the remaining $(l_f - \phi_0) - \phi_0 = (l_f - 2\phi_0)$ times: $(1 - p_1)^{l_f - 2\phi_0}$
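Putting the three pieces of Problem 2 together, the NULL-insertion factor can be computed directly; the values of $l_f$, $\phi_0$ and $p_1$ below are illustrative:

```python
from math import comb

def null_insertion_cost(l_f, phi_0, p1):
    # choose which of the (l_f - phi_0) non-extra words are followed by
    # an extra word, then charge p1 per insertion and (1 - p1) per
    # declined insertion
    return comb(l_f - phi_0, phi_0) * p1**phi_0 * (1 - p1)**(l_f - 2 * phi_0)

cost = null_insertion_cost(l_f=6, phi_0=1, p1=0.1)
# comb(5, 1) * 0.1 * 0.9**4
```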

Problem 3: We did not include costs for permuting the extra French words into available slots.

- Do not use distortion probabilities (we don't have these values).
- Instead, we slot in the extra words after their positions have been generated.
- Since there are $\phi_0!$ possible orderings, all of which are deemed equally likely, we include the factor $\frac{1}{\phi_0!}$.

Problem 4: Word alignments do not encode all of the information about the process which turned $e$ into $f$.

- If an English word $e_i$ is connected to French words $f_j$ and $f_k$, we don't know what order they were generated in.
- The problem arises whenever a word has a fertility > 1, so we add the factor:

$$\prod_{i=1}^{l_e} \phi_i!$$

## Formula for IBM Model 3

$$P(f, a \mid e) = \binom{l_f - \phi_0}{\phi_0} \, p_1^{\phi_0} \, (1 - p_1)^{l_f - 2\phi_0} \times \prod_{i=1}^{l_e} n(\phi_i \mid e_i) \times \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times \prod_{j:\, a(j) \neq 0} d(j \mid a(j), l_e, l_f) \times \prod_{i=1}^{l_e} \phi_i! \times \frac{1}{\phi_0!}$$

The factors correspond, in order, to: insertion (problem 2), fertility, translation, distortion with NULL positions excluded (problem 1), generation order (problem 4), and extra-word orderings (problem 3).
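The formula transcribes into code factor by factor. The toy n-, t- and d-tables below are made-up values for a single sentence pair and alignment:

```python
from math import comb, factorial

def model3_joint_prob(f_sent, e_sent, a, t, n, d, p1):
    """P(f, a | e) per the Model 3 formula. `a[j-1]` is the English
    position (1-based) for French position j; 0 means NULL."""
    l_e, l_f = len(e_sent), len(f_sent)
    phi = [a.count(k) for k in range(l_e + 1)]  # fertilities; phi[0] = NULL's
    phi0 = phi[0]
    # insertion factor (problem 2)
    p = comb(l_f - phi0, phi0) * p1**phi0 * (1 - p1)**(l_f - 2 * phi0)
    for i in range(1, l_e + 1):
        p *= n[(phi[i], e_sent[i - 1])]         # fertility
    for j in range(1, l_f + 1):
        e_word = "NULL" if a[j - 1] == 0 else e_sent[a[j - 1] - 1]
        p *= t[(f_sent[j - 1], e_word)]         # translation
        if a[j - 1] != 0:                       # distortion, skipping NULL (problem 1)
            p *= d[(j, a[j - 1], l_e, l_f)]
    for i in range(1, l_e + 1):
        p *= factorial(phi[i])                  # generation order (problem 4)
    return p / factorial(phi0)                  # extra-word orderings (problem 3)

e_sent = ["the", "house"]
f_sent = ["la", "maison"]
a = [1, 2]
t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
n = {(1, "the"): 0.9, (1, "house"): 0.8}
d = {(1, 1, 2, 2): 0.8, (2, 2, 2, 2): 0.8}
p = model3_joint_prob(f_sent, e_sent, a, t, n, d, p1=0.1)
```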

## IBM Model 4

Technical extensions of Model 3:

- HMM modelling: words do not move independently of each other, so condition word movements on the previous word.
- Relative distortion model.

Crucially, the increased sophistication of Model 3 means that the computational trick no longer works for summing over all possible alignments:

- Exhaustive count collection becomes computationally too expensive.
- Instead, we have to sum over *high probability* alignments.
- Search for the most probable alignment. This is known as the *Viterbi* alignment: the alignment which maximizes $P(a \mid e, f)$.
- There is no efficient algorithm for finding the Viterbi alignment for Models 3 onwards.
- Obtain the sample to sum over by introducing slight modifications to the high probability alignment.

## IBM Models: Summary

IBM Model 1:

- Lexical translation (translation probabilities)
- Global unique maximum
- Efficiently estimated using the EM algorithm
- Assumes all alignment connections are equally likely

IBM Model 2:

- Lexical translation (translation probabilities)
- Distortion probabilities (uses reverse distortion)
- Efficiently estimated using the EM algorithm
- Assumes an alignment connection depends on position and the lengths of the source/target strings
- Estimates from Model 1 can be used as initial guesses for Model 2

IBM Model 3:

- Fertility probabilities (incl. dropping words, i.e. words with fertility 0)
- Lexical translation (translation probabilities)
- Distortion probabilities
- Insertion probabilities (using the special NULL token)
- The biggest computational change occurs in Model 3: efficient estimation is difficult, so we sum over only high probability alignments (around the Viterbi alignment)

IBM Model 4:

- Fertility model
- Translation model
- Relative distortion probabilities (placement of words is relative to the placement of surrounding words)
- Insertion probabilities
- Technical extensions of Model 3