Beyond IBM Model 1
4th April 2011
(adapted from notes from Philipp Koehn & Mary Hearne)
Dr. Declan Groves, CNGL, DCU
dgroves@computing.dcu.ie
IBM Model 1: Formal Description
Generative model: break up translation into smaller steps.
More formally, the translation probability is defined in terms of:
Foreign (for this case, we'll say French) sentence f = (f_1, …, f_m) of length m
English sentence e = (e_1, …, e_l) of length l
Each French word f_j is generated by an English word e_{a(j)}, as defined by the alignment function a: j → i, with the probability t(f_j | e_{a(j)})

P(f, a | e) = ε / (l+1)^m × ∏_{j=1}^{m} t(f_j | e_{a(j)})

a(j) = i: the English position connected to by the j-th French word in alignment a
IBM Model 1: Formal Description
P(a | f, e) = P(f, a | e) / P(f | e)
            = P(f, a | e) / ∑_a P(f, a | e)      ← sum over all possible alignments
            = ∏_{j=1}^{m} t(f_j | e_{a(j)}) / ∑_a ∏_{j=1}^{m} t(f_j | e_{a(j)})

This means in order to compute the denominator, we need to compute the probabilities of all possible alignments for each sentence pair.
How many alignments are there? Each of the m French words can connect to any of the (l+1) English positions (including NULL), so there are (l+1)^m possible alignments. In other words we have to carry out on the order of m × (l+1)^m arithmetic operations, which is far too many for a practical proposition.
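As a sanity check, the brute-force enumeration can be sketched in a few lines of Python. The sentences and t values below are invented for illustration, and NULL is omitted to keep the toy example small:

```python
from itertools import product
from math import prod

# Toy word-translation table t(f|e); all values are invented for illustration.
t = {
    ("la", "the"): 0.7, ("la", "house"): 0.05,
    ("maison", "the"): 0.1, ("maison", "house"): 0.8,
}
e = ["the", "house"]   # English sentence, length l = 2 (NULL omitted here)
f = ["la", "maison"]   # French sentence, length m = 2

# Every alignment maps each French position j to some English position i,
# so there are l**m alignments here ((l+1)**m once NULL is included).
alignments = list(product(range(len(e)), repeat=len(f)))

# Denominator: sum over all alignments of the product of t values.
total = sum(prod(t[(f[j], e[i])] for j, i in enumerate(a)) for a in alignments)

print(len(alignments), total)
```

Even in this 2×2 toy case there are already 4 alignments to enumerate; the count grows exponentially in the French sentence length.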
IBM Model 1: Formal Description
Writing out the denominator ∑_a ∏_{j=1}^{m} t(f_j | e_{a(j)}) term by term, one row per alignment:

  t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × ⋯ × t(f_m|e_1)
+ t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × ⋯ × t(f_m|e_2)
+ t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × ⋯ × t(f_m|e_3)
+ …
+ t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × ⋯ × t(f_m|e_1)
+ t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × ⋯ × t(f_m|e_2)
+ t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × ⋯ × t(f_m|e_3)
+ …
+ t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_1)
+ t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_2)
+ t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_3)
+ …
+ t(f_1|e_l) × t(f_2|e_l) × t(f_3|e_l) × ⋯ × t(f_m|e_l)
IBM Model 1: Formal Description
However, there are regularities in this expression which we can factor out. For example, we can factor out t(f_1|e_1) from all the rows in which it occurs.
Compare xy + xz (3 arithmetic operations) and x(y + z) (2 arithmetic operations).
IBM Model 1: Formal Description
  t(f_1|e_1) × [   t(f_2|e_1) × t(f_3|e_1) × ⋯ × t(f_m|e_1)
                 + t(f_2|e_1) × t(f_3|e_1) × ⋯ × t(f_m|e_2)
                 + …
                 + t(f_2|e_2) × t(f_3|e_1) × ⋯ × t(f_m|e_1)
                 + t(f_2|e_2) × t(f_3|e_1) × ⋯ × t(f_m|e_2)
                 + …
                 + t(f_2|e_l) × t(f_3|e_l) × ⋯ × t(f_m|e_l) ]
+ …
+ t(f_1|e_2) × [   t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_1)
                 + t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_2)
                 + t(f_2|e_2) × t(f_3|e_2) × ⋯ × t(f_m|e_3)
                 + … ]
+ …
+ t(f_1|e_l) × [ … ]
IBM Model 1: Formal Description
Factoring out continually gives us:

P(f | e) = ∑_a P(f, a | e) = ε / (l+1)^m × ∏_{j=1}^{m} ∑_{i=0}^{l} t(f_j | e_i)

This makes the solution tractable: instead of summing over (l+1)^m alignments, we only need about m × (l+1) arithmetic operations.
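The effect of this factoring can be checked numerically. A minimal sketch with invented t values (NULL omitted for brevity): the exponential sum over alignments and the factored product of sums give the same result:

```python
from itertools import product
from math import prod

# Invented t(f|e) values for a 2-word toy example.
t = {
    ("la", "the"): 0.7, ("la", "house"): 0.05,
    ("maison", "the"): 0.1, ("maison", "house"): 0.8,
}
e = ["the", "house"]
f = ["la", "maison"]

# Exponential version: sum over all l**m alignments of products of t values.
brute = sum(
    prod(t[(f[j], e[i])] for j, i in enumerate(a))
    for a in product(range(len(e)), repeat=len(f))
)

# Factored version: product over j of a sum over i -- only m*l operations.
factored = prod(sum(t[(fj, ei)] for ei in e) for fj in f)

print(brute, factored)  # the two agree
```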
IBM Model 1: Pseudocode
initialize t(e|f) uniformly
do
    set count(e|f) to 0 for all e, f
    set total(f) to 0 for all f
    for all sentence pairs (e_s, f_s)
        for all words e in e_s
            total_s(e) = 0
            for all words f in f_s
                total_s(e) += t(e|f)
        for all words e in e_s
            for all words f in f_s
                count(e|f) += t(e|f) / total_s(e)
                total(f)   += t(e|f) / total_s(e)
    for all f in domain( total(.) )
        for all e in domain( count(.|f) )
            t(e|f) = count(e|f) / total(f)
until convergence

This is the EM algorithm.
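The pseudocode above translates almost line for line into Python. This is a sketch, trained on a tiny invented corpus (not from the slides):

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """EM training for IBM Model 1, following the pseudocode above.
    corpus: list of (english_words, french_words) sentence pairs."""
    e_vocab = {e for e_s, _ in corpus for e in e_s}
    # initialize t(e|f) uniformly
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # count(e|f)
        total = defaultdict(float)  # total(f)
        for e_s, f_s in corpus:
            # normalisation term total_s(e) for each English word in this pair
            total_s = {e: sum(t[(e, fw)] for fw in f_s) for e in e_s}
            # E-step: collect fractional (expected) counts
            for e in e_s:
                for fw in f_s:
                    c = t[(e, fw)] / total_s[e]
                    count[(e, fw)] += c
                    total[fw] += c
        # M-step: re-estimate translation probabilities
        for (e, fw) in count:
            t[(e, fw)] = count[(e, fw)] / total[fw]
    return dict(t)

# Tiny invented corpus:
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]
t = train_ibm_model1(corpus)
print(t[("book", "livre")])
```

After a few iterations t(book|livre) dominates the other t(·|livre) entries, because "book" co-occurs with "livre" in every sentence pair containing "livre".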
Beyond IBM Model 1
So far, we've worked using IBM Model 1, computing translation probabilities according to the following formulae:

P(f, a | e) = ε / (l+1)^m × ∏_{j=1}^{m} t(f_j | e_{a(j)})

P(f | e) = ε / (l+1)^m × ∏_{j=1}^{m} ∑_{i=0}^{l} t(f_j | e_i)

In other words, we're saying the likelihood of two strings corresponding is calculated only in terms of word-translation probabilities, e.g.:

    color print head ↔ tête d'impression couleur
Beyond IBM Model 1
But what about this one:

    changing your printer settings ↔ modification de les options de impression

Calculating alignment probabilities (and therefore, string correspondences) based only on word-for-word translation probabilities does not necessarily yield particularly good alignments.
This is why, before, we talked about things like fertility, distortion and insertion.
IBM Model 2
IBM Model 2 involves both word translations and distortions.

Distortion probabilities:
    d(j | i, l, m): probability of French position j given English position i, the English sentence length l and the French sentence length m

Reverse distortion probabilities:
    r(i | j, l, m): probability of English position i given French position j, the English sentence length l and the French sentence length m

Model 2 uses these reverse distortion probabilities.
IBM Model 2
How to incorporate reverse distortion probabilities:

P(f, a | e) = ε × ∏_{j=1}^{m} t(f_j | e_{a(j)}) × r(a(j) | j, l, m)

(where a(j) = i, a position in the English sentence)

This model may:
Penalise an alignment connecting words which are very far apart, even if these words are (statistically) good translations of each other.
Prefer alignments on the basis of positioning, which can be useful where the words themselves are rare and, therefore, we don't have good t values for them.

To find the best alignment:

argmax_a P(a | f, e) = argmax_a ∏_{j=1}^{m} t(f_j | e_{a(j)}) × r(a(j) | j, l, m)

EM training of this model works the same way as for IBM Model 1.
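Because the product factorises over French positions, each a(j) can be chosen independently of the others. A sketch of this per-word argmax, with invented t and r tables:

```python
def best_alignment(e, f, t, r):
    """Pick a(j) = argmax_i t(f_j|e_i) * r(i|j,l,m) for each French position j."""
    l, m = len(e), len(f)
    alignment = []
    for j in range(1, m + 1):
        best_i = max(
            range(1, l + 1),
            key=lambda i: t.get((f[j - 1], e[i - 1]), 0.0) * r.get((i, j, l, m), 0.0),
        )
        alignment.append(best_i)
    return alignment  # alignment[j-1] = a(j), with 1-based English positions

e = ["print", "head"]
f = ["tête", "impression"]
# Invented translation probabilities t(f|e):
t = {("tête", "head"): 0.8, ("tête", "print"): 0.1,
     ("impression", "print"): 0.7, ("impression", "head"): 0.1}
# Invented reverse distortion r(i | j, l, m); these values mildly prefer the diagonal.
r = {(1, 1, 2, 2): 0.6, (2, 1, 2, 2): 0.4,
     (1, 2, 2, 2): 0.4, (2, 2, 2, 2): 0.6}

a = best_alignment(e, f, t, r)
print(a)  # tête -> head (position 2), impression -> print (position 1)
```

Note that here the strong translation probabilities override the mild diagonal preference of r, producing the crossing alignment.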
Moving from Model 1 to Model 2
We could apply Model 2 directly to the sentence-pair corpus to establish the t and rd (reverse distortion) tables, thereby bypassing Model 1. However, it may be better to settle on reasonable word-translation probabilities first, before introducing distortions.

We can:
Run Model 1 for several iterations
Use the resulting t table as input to the first Model 2 iteration, instead of using uniform t table values
Use a uniformly-distributed rd table as input to the first Model 2 iteration

This is called transferring parameter values from one model to another.
IBM Model 3
First allow a single English word to translate many French
words
Fertility
probabilities
Next translate the English words into French words
Model 3 also caters for the translation of the NULL token to
allow for the
insertion
of words
Finally allow the French units to be reordered
Distortion
probabilities
Model 3 calculates distortion probabilities (as opposed to Model 2's reverse distortion probabilities)
IBM Model 3: NULL values
Problem 1: We should not count distortions for French words generated by NULL.
    Eliminate any d value for which a(j) = 0.

Problem 2: We did not include costs for inserting extra foreign words, i.e. foreign words generated by NULL.
    A French word f_j for which a(j) = 0 (i.e. e_{a(j)} = NULL) is extra.
    There are φ₀ extra French words (φ₀ = fertility of NULL).
    There are (m − φ₀) non-extra French words.
    After each one of these (m − φ₀) non-extra words, we have the option of inserting an extra word with probability p₁.
    How many ways are there to insert φ₀ extra words?

        comb(m − φ₀, φ₀) = (m − φ₀)! / ( φ₀! × (m − 2φ₀)! )

    Must also factor in the cost of inserting/not inserting:
    If we add an extra word, φ₀ times: p₁^φ₀
    Thus if we do not add an extra word, ((m − φ₀) − φ₀) times: (1 − p₁)^(m − 2φ₀)
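The insertion bookkeeping for Problem 2 is easy to check numerically. The values of m, φ₀ and p₁ below are invented for illustration:

```python
import math

# Invented illustration: m = 7 French words, phi_0 = 2 of them generated by NULL.
m, phi0, p1 = 7, 2, 0.1

# Number of ways to slot the phi_0 extra words in after non-extra words:
ways = math.comb(m - phi0, phi0)  # C(5, 2)
assert ways == math.factorial(m - phi0) // (
    math.factorial(phi0) * math.factorial(m - 2 * phi0))

# Insertion decisions: insert phi_0 times, decline (m - 2*phi_0) times.
insertion_cost = ways * p1**phi0 * (1 - p1)**(m - 2 * phi0)
print(ways, insertion_cost)
```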
IBM Model 3: NULL values
Problem 3: We did not include costs for permuting the extra French words into available slots.
    Do not use distortion probabilities (we don't have these values).
    Instead, we slot in extra words after their positions have been generated.
    Since there are φ₀! possible orderings, all of which are deemed equally likely, any ordering generated adds an additional factor of:

        1 / φ₀!

Problem 4: Word alignments do not encode all of the information about the process which turned e into f.
    If an English word e_i is connected to French words f_j and f_k, we don't know what order they were generated in.
    The problem arises whenever a word has a fertility > 1, so we add the factor:

        ∏_{i=1}^{l} φ_i!
Formula for IBM Model 3
• P(f, a | e) =
•     comb(m − φ₀, φ₀) ×                      [problem 2]
•     p₁^φ₀ ×
•     (1 − p₁)^(m − 2φ₀) ×
•     ∏_{i=1}^{l} n(φ_i | e_i) ×              [fertility]
•     ∏_{j=1}^{m} t(f_j | e_{a(j)}) ×         [translation]
•     ∏_{j: a(j) ≠ 0} d(j | a(j), l, m) ×     [distortion + problem 1]
•     ∏_{i=1}^{l} φ_i! ×                      [problem 4]
•     1 / φ₀!                                 [problem 3]
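Putting the pieces together, the formula can be sketched as a function. Every table value below is invented, and NULL handling is simplified to the a(j) = 0 convention used above:

```python
import math

def model3_prob(e, f, a, n, t, d, p1):
    """Sketch of the Model 3 formula above for one aligned sentence pair.
    a[j-1] gives the 1-based English position for French position j (0 = NULL).
    n, t, d are toy probability tables; all values used below are invented."""
    l, m = len(e), len(f)
    # Fertility phi_i: how many French words English position i generates.
    phi = [a.count(i) for i in range(l + 1)]  # phi[0] = fertility of NULL
    p = math.comb(m - phi[0], phi[0])                    # problem 2
    p *= p1 ** phi[0] * (1 - p1) ** (m - 2 * phi[0])
    for i in range(1, l + 1):
        p *= n.get((phi[i], e[i - 1]), 0.0)              # fertility
    for j in range(1, m + 1):
        ei = "NULL" if a[j - 1] == 0 else e[a[j - 1] - 1]
        p *= t.get((f[j - 1], ei), 0.0)                  # translation
        if a[j - 1] != 0:                                # problem 1: skip NULL
            p *= d.get((j, a[j - 1], l, m), 0.0)         # distortion
    for i in range(1, l + 1):
        p *= math.factorial(phi[i])                      # problem 4
    return p / math.factorial(phi[0])                    # problem 3

# One-word toy example with invented tables:
n = {(1, "house"): 0.8}
t = {("maison", "house"): 0.7}
d = {(1, 1, 1, 1): 1.0}
p = model3_prob(["house"], ["maison"], [1], n, t, d, p1=0.1)
print(p)
```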
IBM Model 4
Technical extensions of Model 3:
    HMM modelling: words do not move independently of each other, so condition word movements on the previous word (a relative distortion model).

Crucially, the increased sophistication of Model 3 means that the computational trick no longer works for summing over all possible alignments:
    Exhaustive count collection becomes computationally too expensive.
    Instead, we have to sum over high-probability alignments.
    Search for the most probable alignment. This is known as the Viterbi alignment: the alignment which maximizes P(a | f, e).
    There is no efficient algorithm for finding the Viterbi alignment for Models 3 onwards.
    Obtain the sample for summing over by introducing slight modifications to the high-probability alignment.
IBM Models: Summary
IBM Model 1
Lexical translation (translation probabilities)
Global Unique Maximum
Efficiently estimated using EM algorithm
Assumes all alignment
connections are equally likely
IBM Model 2
Lexical translation (translation probabilities)
Distortion
probabilities (uses reverse distortion)
Efficiently estimated using EM algorithm
Assumes alignment connection depends on position & lengths of the source/target strings
Estimates from Model 1 can be used as initial guesses for Model 2
IBM Model 3
Fertility probabilities (incl. dropping words i.e. words with fertility of 0)
Lexical translation (translation probabilities)
Distortion probabilities
Insertion probabilities (using special NULL token)
Computationally, the biggest change occurs in Model 3
Efficient estimation difficult
–
sum over only high probability alignments
Viterbi alignment
IBM Model 4
Fertility
model
Translation model
Relative distortion probabilities (placement of words is relative to the placement of
surrounding words)
Insertion probabilities
Technical extensions of Model 3