Beyond IBM Model 1


Beyond IBM Model 1

4th April 2011

(adapted from notes from Philipp Koehn & Mary Hearne)

Dr. Declan Groves, CNGL, DCU
dgroves@computing.dcu.ie

IBM Model 1: Formal Description


Generative model: break up translation into smaller steps


More formally…


Translation probability:


Foreign (for this case, we’ll say French…) sentence $f = (f_1, \ldots, f_{l_f})$ of length $l_f$

English sentence $e = (e_1, \ldots, e_{l_e})$ of length $l_e$

Each French word $f_j$ is generated by an English word $e_{a(j)}$, as defined by the alignment function $a: j \rightarrow i$, with the probability $t(f_j \mid e_{a(j)})$

$P(f, a \mid e) = \frac{\epsilon}{(l_e + 1)^{l_f}} \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})$

$a(j)$ = English position connected by the $j$-th French word in alignment $a$
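
For concreteness, a minimal Python sketch (not from the slides) that scores a single alignment with this formula; the toy t-table, the example sentence pair and the $\epsilon$ value are made-up assumptions:

def model1_p_f_a_given_e(f_words, e_words, a, t, epsilon=1.0):
    """P(f, a | e) = epsilon / (l_e + 1)^l_f * prod_j t(f_j | e_{a(j)}).
    a[j-1] holds the English position for French position j (0 = NULL)."""
    l_e, l_f = len(e_words), len(f_words)
    e_with_null = ["NULL"] + e_words                      # position 0 is the NULL token
    p = epsilon / (l_e + 1) ** l_f
    for j, f_word in enumerate(f_words, start=1):
        p *= t.get((f_word, e_with_null[a[j - 1]]), 1e-12)   # small floor for unseen pairs
    return p

toy_t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
print(model1_p_f_a_given_e(["la", "maison"], ["the", "house"], [1, 2], toy_t))
# = 1.0 / 3**2 * 0.7 * 0.8, roughly 0.062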


IBM Model 1: Formal Description


$P(a \mid e, f) = \frac{P(a, f \mid e)}{P(f \mid e)} = \frac{P(a, f \mid e)}{\sum_a P(a, f \mid e)} = \frac{\prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})}{\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})}$

(the denominator is a sum over all possible alignments)

This means in order to compute the denominator, we need to compute the probabilities of all possible alignments for each sentence pair.

How many alignments are there? $(l_e + 1)^{l_f}$

In other words, we have to carry out roughly $(l_e + 1)^{l_f} \cdot l_f$ arithmetic operations, which is far too many for a practical proposition.

IBM Model 1: Formal Description


$\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})$ expands to:

t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_1) +
t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_2) +
t(f_1|e_1) × t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_3) +
…
t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × … × t(f_{l_f}|e_1) +
t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × … × t(f_{l_f}|e_2) +
t(f_1|e_1) × t(f_2|e_2) × t(f_3|e_1) × … × t(f_{l_f}|e_3) +
…
t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_1) +
t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_2) +
t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_3) +
…






IBM Model 1: Formal Description


However, there are regularities in this expression which we can factor out.

For example, we can factor out $t(f_1 \mid e_1)$ from all the rows in which it occurs.

Note the difference between xy + xz (3 arithmetic operations) and x(y + z) (2 arithmetic operations).


IBM Model 1: Formal Description


t(f_1|e_1) × ( t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_1) +
               t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_2) +
               t(f_2|e_2) × t(f_3|e_1) × … × t(f_{l_f}|e_1) +
               t(f_2|e_2) × t(f_3|e_1) × … × t(f_{l_f}|e_2) +
               t(f_2|e_1) × t(f_3|e_1) × … × t(f_{l_f}|e_3) + … ) +

t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_1) +
t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_2) +
t(f_1|e_2) × t(f_2|e_2) × t(f_3|e_2) × … × t(f_{l_f}|e_3) +
…






IBM Model 1: Formal Description


However, there are regularities in this expression which we can factor out.

For example, we can factor out $t(f_1 \mid e_1)$ from all the rows in which it occurs.

Note the difference between xy + xz (3 arithmetic operations) and x(y + z) (2 arithmetic operations).

Factoring out continually gives us:

$\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) = \prod_{j=1}^{l_f} \sum_{i=0}^{l_e} t(f_j \mid e_i)$

Makes the solution tractable.
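
As a sanity check on the factoring trick, a small Python sketch (not from the slides) that compares the brute-force sum over all $(l_e + 1)^{l_f}$ alignments with the factored product of sums, using made-up t-values:

from itertools import product as cartesian
import random

random.seed(0)
l_e, l_f = 3, 4
t = {(j, i): random.random() for j in range(1, l_f + 1) for i in range(0, l_e + 1)}

# brute force: sum over all (l_e + 1)^l_f alignments of the product over j
lhs = 0.0
for a in cartesian(range(l_e + 1), repeat=l_f):
    term = 1.0
    for j, i in enumerate(a, start=1):
        term *= t[(j, i)]
    lhs += term

# factored form: product over j of the sum over i
rhs = 1.0
for j in range(1, l_f + 1):
    rhs *= sum(t[(j, i)] for i in range(l_e + 1))

print(abs(lhs - rhs) < 1e-9)   # True: both give the same value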

IBM Model 1: Pseudocode

initialize t(e|f) uniformly

do

    set count(e|f) to 0 for all e,f
    set total(f) to 0 for all f

    for all sentence pairs (e_s, f_s)

        // compute normalization
        for all words e in e_s
            total_s(e) = 0
            for all words f in f_s
                total_s(e) += t(e|f)

        // collect fractional counts
        for all words e in e_s
            for all words f in f_s
                count(e|f) += t(e|f) / total_s(e)
                total(f)   += t(e|f) / total_s(e)

    // re-estimate probabilities
    for all f in domain( total(.) )
        for all e in domain( count(.|f) )
            t(e|f) = count(e|f) / total(f)

until convergence

(This is the EM algorithm.)
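
A minimal runnable Python version of the pseudocode above (not from the slides); the two-sentence toy corpus and the variable names are illustrative assumptions:

from collections import defaultdict
from itertools import product

corpus = [("the house", "la maison"), ("the flower", "la fleur")]
pairs = [(e.split(), f.split()) for e, f in corpus]

# initialise t(e|f) uniformly
e_vocab = {e for es, _ in pairs for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for _ in range(20):                        # a fixed number of EM iterations
    count = defaultdict(float)             # count(e|f)
    total = defaultdict(float)             # total(f)
    for es, fs in pairs:
        # compute normalisation total_s(e) = sum over f in f_s of t(e|f)
        total_s = {e: sum(t[(e, f)] for f in fs) for e in es}
        # collect fractional counts
        for e, f in product(es, fs):
            count[(e, f)] += t[(e, f)] / total_s[e]
            total[f] += t[(e, f)] / total_s[e]
    # re-estimate probabilities
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]

print(round(t[("house", "maison")], 3))    # approaches 1.0 as EM iterates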

Beyond IBM Model 1


So far, we’ve worked using IBM Model 1, computing according to the following formulae:

$P(f, a \mid e) = \frac{\epsilon}{(l_e + 1)^{l_f}} \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})$

$\sum_a \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) = \prod_{j=1}^{l_f} \sum_{i=0}^{l_e} t(f_j \mid e_i)$

In other words, we’re saying the likelihood of two strings corresponding is calculated only in terms of word-translation probabilities.

Example alignment: "color print head" ↔ "tête d’impression couleur"

Beyond IBM Model 1


But what about this one: "changing your printer settings" ↔ "modification de les options de impression"

Calculating alignment probabilities (and therefore, string correspondences) based only on word-for-word translation probabilities does not necessarily yield particularly good alignments.

This is why, before, we talked about things like fertility, distortion and insertion.

IBM Model 2


IBM Model 2 involves both word translations and distortions.

Distortion probabilities: $d(j \mid i, l_e, l_f)$ (French position $j$ given English position $i$, English sentence length $l_e$ and French sentence length $l_f$)

Reverse distortion probabilities: $r(i \mid j, l_e, l_f)$ (English position $i$ given French position $j$ and the two sentence lengths)

Model 2 uses these reverse distortion probabilities.

IBM Model 2


How to incorporate reverse distortion probabilities:

$P(f, a \mid e) = \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times r(a(j) \mid j, l_e, l_f)$

This model may:

Penalise an alignment connecting words which are very far apart, even if these words are (statistically) good translations of each other.

Prefer alignments on the basis of positioning, which can be useful where the words themselves are rare and, therefore, we don’t have good t-values for them.

To find the best alignment (where $a(j) = i$ is a position in the English sentence):

$\hat{a} = \operatorname*{argmax}_{a} P(a \mid e, f) = \operatorname*{argmax}_{a} \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times r(a(j) \mid j, l_e, l_f)$

EM training of this model works the same way as for IBM Model 1.
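
Because each factor in the product depends only on $j$, the best alignment can be found by picking, for each French position $j$ independently, the English position $i$ that maximises $t(f_j \mid e_i) \times r(i \mid j, l_e, l_f)$. A minimal Python sketch (not from the slides; the function name and toy tables are assumptions):

def best_model2_alignment(f_words, e_words, t, r):
    """Return [a(1), ..., a(l_f)] with a(j) = English position (0 = NULL)."""
    l_f, l_e = len(f_words), len(e_words)
    e_with_null = ["NULL"] + e_words                   # position 0 is the NULL token
    alignment = []
    for j, f_word in enumerate(f_words, start=1):
        best_i = max(range(l_e + 1),
                     key=lambda i: t.get((f_word, e_with_null[i]), 0.0)
                                   * r.get((i, j, l_e, l_f), 0.0))
        alignment.append(best_i)
    return alignment

toy_t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
toy_r = {(1, 1, 2, 2): 0.6, (2, 2, 2, 2): 0.6}
print(best_model2_alignment(["la", "maison"], ["the", "house"], toy_t, toy_r))   # [1, 2]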

Moving from Model 1 to Model 2


We could apply Model 2 directly to the sentence pair corpus to establish t- and rd-tables, thereby by-passing Model 1.

However, it may be better to settle on reasonable word-translation probabilities first, before introducing distortions.

We can:

Run Model 1 for several iterations

Use the resulting t-table as input to the first Model 2 iteration, instead of using uniform t-table values

Use a uniformly-distributed rd-table as input to the first Model 2 iteration

This is called transferring parameter values from one model to another.

IBM Model 3


First allow a single English word to translate many French words

Fertility probabilities

Next translate the English words into French words

Model 3 also caters for the translation of the NULL token to allow for the insertion of words

Finally allow the French units to be reordered

Distortion probabilities

Model 3 calculates distortion probabilities (as opposed to Model 2’s reverse distortion probabilities)

IBM Model 3: NULL values


Problem 1: We should not count distortions for French words generated by NULL

Eliminate any $d$ value for which $a(j) = 0$

Problem 2: We did not include costs for inserting extra foreign words, i.e. foreign words generated by NULL

A French word $f_j$ for which $a(j) = 0$ (i.e. $e_{a(j)}$ = NULL) is "extra"

There are $\phi_0$ extra French words ($\phi_0$ = fertility of NULL)

There are $(l_f - \phi_0)$ non-extra French words

After each one of these $(l_f - \phi_0)$ non-extra words, we have the option of inserting an extra word with probability $p_1$

How many ways are there to insert $\phi_0$ extra words?

$\mathrm{comb}(l_f - \phi_0,\; \phi_0) = \frac{(l_f - \phi_0)!}{\phi_0! \,(l_f - 2\phi_0)!}$

Must also factor in the cost of inserting/not inserting:

If we add an extra word $\phi_0$ times: $p_1^{\phi_0}$

Thus if we do not add an extra word, which happens $((l_f - \phi_0) - \phi_0)$ times: $(1 - p_1)^{(l_f - 2\phi_0)}$
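
A quick Python check of this insertion cost (not from the slides; the numbers are made up):

from math import comb

def null_insertion_cost(l_f, phi_0, p1):
    """comb(l_f - phi_0, phi_0) * p1^phi_0 * (1 - p1)^(l_f - 2*phi_0)."""
    return comb(l_f - phi_0, phi_0) * p1 ** phi_0 * (1 - p1) ** (l_f - 2 * phi_0)

print(null_insertion_cost(l_f=6, phi_0=1, p1=0.1))   # 5 * 0.1 * 0.9**4, roughly 0.328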

IBM Model 3: NULL values


Problem 3: We did not include costs for permuting the extra French words into available slots

Do not use distortion probabilities (we don’t have these values)

Instead, we slot in extra words after their positions have been generated

Since there are $\phi_0!$ possible orderings, all of which are deemed equally likely, any ordering generated adds an additional factor of: $\frac{1}{\phi_0!}$

Problem 4: Word alignments do not encode all of the information about the process which turned e into f

If an English word $e_i$ is connected to French words $f_j$ and $f_k$, we don’t know what order they were generated in

The problem arises whenever a word has a fertility > 1, so we add the factor: $\prod_{i=0}^{l_e} \phi_i!$

Formula for IBM Model 3


$P(f, a \mid e) = \mathrm{comb}(l_f - \phi_0,\; \phi_0) \times p_1^{\phi_0} \times (1 - p_1)^{(l_f - 2\phi_0)} \times \prod_{i=1}^{l_e} n(\phi_i \mid e_i) \times \prod_{j=1}^{l_f} t(f_j \mid e_{a(j)}) \times \prod_{j:\, a(j) \neq 0} d(j \mid a(j), l_e, l_f) \times \prod_{i=0}^{l_e} \phi_i! \times \frac{1}{\phi_0!}$

where the factors are, in order:

$\mathrm{comb}(l_f - \phi_0, \phi_0) \, p_1^{\phi_0} (1 - p_1)^{(l_f - 2\phi_0)}$: insertion (problem 2)

$\prod_{i=1}^{l_e} n(\phi_i \mid e_i)$: fertility

$\prod_{j=1}^{l_f} t(f_j \mid e_{a(j)})$: translation

$\prod_{j:\, a(j) \neq 0} d(j \mid a(j), l_e, l_f)$: distortion + problem 1

$\prod_{i=0}^{l_e} \phi_i!$: problem 4

$\frac{1}{\phi_0!}$: problem 3
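
A minimal Python sketch (not from the slides) that scores a single alignment with this formula; the n-, t- and d-tables, p1 and the fallback floor value are illustrative assumptions:

from collections import Counter
from math import comb, factorial

def model3_p_f_a_given_e(f_words, e_words, a, n, t, d, p1, floor=1e-12):
    """a[j-1] = English position for French position j (0 = NULL)."""
    l_e, l_f = len(e_words), len(f_words)
    e_with_null = ["NULL"] + e_words
    links = Counter(a)                                   # phi_i = number of French words linked to i
    phi = [links.get(i, 0) for i in range(l_e + 1)]
    p = comb(l_f - phi[0], phi[0]) * p1 ** phi[0] * (1 - p1) ** (l_f - 2 * phi[0])   # problem 2
    for i in range(1, l_e + 1):                          # fertility
        p *= n.get((phi[i], e_words[i - 1]), floor)
    for j in range(1, l_f + 1):                          # translation
        p *= t.get((f_words[j - 1], e_with_null[a[j - 1]]), floor)
    for j in range(1, l_f + 1):                          # distortion, skipping NULL links (problem 1)
        if a[j - 1] != 0:
            p *= d.get((j, a[j - 1], l_e, l_f), floor)
    for i in range(0, l_e + 1):                          # problem 4
        p *= factorial(phi[i])
    return p / factorial(phi[0])                         # problem 3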

IBM Model 4


Technical extensions of Model 3

HMM modelling: words do not move independently of each other, so condition word movements on the previous word

Relative distortion model

Crucially, the increased sophistication of Model 3 means that the computational trick no longer works for summing over all possible alignments

Exhaustive count collection becomes computationally too expensive

Instead, we have to sum over high probability alignments

Search for the most probable alignment: this is known as the Viterbi alignment, the alignment which maximizes $P(a \mid e, f)$

There is no efficient algorithm for finding the Viterbi alignment for Models 3 onwards

Obtain the sample for summing over by introducing slight modifications to the high probability alignment
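
A minimal Python sketch (not from the slides) of what such "slight modifications" typically look like: neighbouring alignments produced by moving one link or swapping two links, as used when climbing towards a high-probability alignment and collecting counts for Models 3 and up. The function name is an assumption:

def neighbours(a, l_e):
    """Yield alignments differing from a (a list of English positions, 0 = NULL)
    by one move or one swap."""
    l_f = len(a)
    for j in range(l_f):                       # moves: re-link French position j
        for i in range(l_e + 1):
            if i != a[j]:
                yield a[:j] + [i] + a[j + 1:]
    for j1 in range(l_f):                      # swaps: exchange the links of two positions
        for j2 in range(j1 + 1, l_f):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                yield b

print(len(list(neighbours([1, 2, 0], l_e=2))))   # number of single-change variants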

IBM Models: Summary

IBM Model 1

Lexical translation (translation probabilities)

Global unique maximum

Efficiently estimated using the EM algorithm

Assumes all alignment connections are equally likely

IBM Model 2

Lexical translation (translation probabilities)

Distortion probabilities (uses reverse distortion)

Efficiently estimated using the EM algorithm

Assumes an alignment connection depends on position & lengths of the source/target strings

Estimates from Model 1 can be used as initial guesses for Model 2

IBM Model 3

Fertility probabilities (incl. dropping words, i.e. words with fertility of 0)

Lexical translation (translation probabilities)

Distortion probabilities

Insertion probabilities (using the special NULL token)

The biggest computational change occurs in Model 3: efficient estimation is difficult, so we sum over only high-probability alignments (Viterbi alignment)

IBM Model 4

Fertility model

Translation model

Relative distortion probabilities (placement of words is relative to the placement of surrounding words)

Insertion probabilities

Technical extensions of Model 3