A Sentence Clustering Algorithm for Specialized Translation Memories




Rafał Jaworski


Uniwersytet im. Adama Mickiewicza

ul. Umultowska 87, 61-614 Poznań

rjawor@amu.edu.pl


ABSTRACT

In the era of the Internet, parallel Translation Memories (TM) are relatively easy to acquire. Online translation databases, published translation memories or even multilingual web pages provide an abundance of translations. One of the major fields in which translation memories are of use is Computer-Aided Translation (CAT). For a given sentence, a CAT system searches for a similar sentence in the translation memory. If such a sentence is found, its translation is used to produce the output sentence. It is crucial for a CAT system that the translation memory contain the exact translations of frequently used sentences. Large translation memories acquired automatically from the Internet do not meet this condition. This paper introduces the idea of creating a high-quality translation memory consisting of the sentences most frequently used in specialized texts. A clustering algorithm classifies sentences from a monolingual corpus into clusters of similar sentences and selects one representative for each cluster. The translation for each representative is then produced manually by human specialists. The database prepared in such a manner forms a high-quality specialized translation memory, covering a wide range of domain-specific sentences intended for translation. This algorithm has been used for a legal texts translation project, called LEX¹.




¹ The LEX Project is developed by Wolters Kluwer Polska Sp. z o.o., pwn.pl Sp. z o.o. and PolEng Sp. z o.o. The Project aims at producing a CAT tool for lawyers.



1. Introduction

The idea of basing automatic translation of a sentence on a previously translated, similar example is known as the Example-Based Machine Translation (EBMT) method. Numerous EBMT systems have been developed since 1984, when Makoto Nagao first suggested the idea ([1]). However, over the past few years researchers have been turning away from EBMT, as another translation paradigm, Statistical Machine Translation (SMT), has taken over the field of machine translation based on translation memories ([2]). One of the reasons for this process is a certain drawback of EBMT: an EBMT system is likely to produce a high-quality translation provided that an appropriate example is found in the translation memory ([3]). However, in most cases it is impossible to find such an example, due to data-sparseness constraints ([3]). This greatly limits the usability of an EBMT system. If this limitation could be overcome, the resulting machine translation system would be a powerful tool.

This paper presents two ideas for overcoming the problem of translation memory data sparseness. The first is specialization in a narrow domain (inspired by [4]). By limiting the range of texts for translation we restrict both vocabulary (to domain-specific terms) and grammar (to the most common constructions). Therefore, when searching the translation memory for a sentence taken from e.g. medical texts, the likelihood of finding a good example in a medical translation memory is far higher than that of finding such an example in a general translation memory of similar size. In our experiments the use of the EBMT method has been limited to legal texts.

The second idea is a novel method of preparing a specialized translation memory for a given purpose. It is based on the assumption that the most useful sentences that might appear in a domain-restricted translation memory are those which occur most frequently in texts from this domain. We suggest selecting such sentences by means of a clustering algorithm and translating them manually to form a high-quality specialized translation memory.

Section 2 of this paper gives a brief description of the LEX project. Section 3 presents an idea of using EBMT techniques in a CAT system. Section 4 describes the clustering algorithm. The ideas for future work are presented in Section 5.

2. The LEX Project

LEX is a project aimed at developing a CAT tool for translating legal texts. It involves creating legal glossaries and translation memories as well as improving CAT techniques.


The creation of glossaries and translation memories involves the participation of professional translators. First, a legal translation memory is obtained, either downloaded directly from the Internet or compiled from bilingual texts (using automatic alignment mechanisms). Then, the memory is verified by specialists in order to ensure the high quality of translation. Translation memories additionally serve as a source for bilingual dictionaries of legal phrases, which are extracted automatically and verified by professional linguists.

Polish-English translation memories collected so far include: the Eur-Lex translation memory [5], the Polish constitution and acts on higher education, copyrights, employment and company law. The total number of translation units exceeds 4.5 million.

3. Example-Based Machine Translation in a CAT System

3.1. General information

A novel feature of the LEX project is the use of an EBMT module, called NeLex, in a CAT system. According to the definitions in [6] and [7], NeLex implements the idea of "pure" EBMT: it uses a translation memory, and for each translated sentence an example is searched for in the memory before the transfer process is initiated. Like the EBMT system designed at the Chinese Academy of Science ([8]), NeLex consists of two basic modules: Example Matcher and Transferer. The former is responsible for finding in a translation memory an example best suited for the input sentence. The latter modifies the example target sentence so that it can be returned as the translation of the input sentence.
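To make the role of the Example Matcher more concrete, the following minimal Python sketch picks the most similar source sentence from a toy translation memory. The function names, the token-overlap similarity (computed with Python's difflib) and the 0.5 threshold are assumptions made for this illustration; the paper does not specify how NeLex scores candidate examples.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] (an assumed stand-in for NeLex's score)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def find_example(sentence, memory, threshold=0.5):
    """Return the (source, target) pair whose source is most similar to
    `sentence`, or None if no pair reaches `threshold`.

    `memory` is a list of (source, target) string pairs (hypothetical format).
    """
    best = max(memory, key=lambda pair: similarity(sentence, pair[0]), default=None)
    if best is None or similarity(sentence, best[0]) < threshold:
        return None
    return best


# Toy translation memory used only for illustration.
memory = [
    ("Uwzględniając Traktat ustanawiający Wspólnotę Europejską",
     "Having regard to the Treaty establishing the European Community"),
    ("Niniejszy artykuł został uchylony",
     "This article has been repealed"),
]
example = find_example("Uwzględniając Traktat ustanawiający Parlament Europejski", memory)
```

A real matcher would have to avoid scanning the whole memory for every input sentence, which is exactly the speed problem discussed in Section 4.1.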

3.2. Word substitution

NeLex's Transferer module performs a sequence of operations to produce the translation of the input sentence. One of them is word substitution. The mechanism of this substitution is illustrated by the following example:


INPUT SENTENCE (in Polish): "Uwzględniając Traktat ustanawiający Parlament Europejski".

(in English: Having regard to the Treaty establishing the European Parliament).


Example from the translation memory:

SOURCE SENTENCE (in Polish): "Uwzględniając Traktat ustanawiający Wspólnotę Europejską".

TARGET SENTENCE (in English): "Having regard to the Treaty establishing the European Community."


The translation result is:

"Having regard to the Treaty establishing the European Parliament".




Note that the word Community in the example was replaced by the word Parliament, the translation of the word Parlament from the input sentence.

This simple-looking operation requires a correct word alignment between sentence pairs. More information on word alignment in NeLex may be found in [9].
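The substitution step can be sketched in Python as shown below. This is only an illustration under strong assumptions: the input and the example source are taken to have the same number of tokens, the word alignment is given as a plain dictionary of token indices, and the bilingual dictionary is a toy one. None of these data structures are taken from NeLex itself.

```python
def substitute_words(input_tokens, source_tokens, target_tokens, alignment, dictionary):
    """Adapt the example's target sentence to the input sentence.

    alignment  -- maps a source token index to a target token index (assumed format)
    dictionary -- maps a source-language word to its translation (toy resource)
    """
    if len(input_tokens) != len(source_tokens):
        raise ValueError("this sketch only handles sentences of equal length")
    result = list(target_tokens)
    for i, (inp, src) in enumerate(zip(input_tokens, source_tokens)):
        # Only positions where the input differs from the example are touched.
        if inp != src and i in alignment and inp in dictionary:
            result[alignment[i]] = dictionary[inp]
    return result


input_tokens = "Uwzględniając Traktat ustanawiający Parlament Europejski".split()
source_tokens = "Uwzględniając Traktat ustanawiający Wspólnotę Europejską".split()
target_tokens = "Having regard to the Treaty establishing the European Community".split()
alignment = {3: 8, 4: 7}  # Wspólnotę -> Community, Europejską -> European
dictionary = {"Parlament": "Parliament", "Europejski": "European"}

translation = " ".join(
    substitute_words(input_tokens, source_tokens, target_tokens, alignment, dictionary)
)
```

The example also shows why the alignment matters: Polish and English order the noun and the adjective differently, so the substituted word lands at a different position in the target sentence.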

3.3. Named Entity Recognition and Translation

The NeLex system is also capable of substituting larger parts of sentences: Named Entities. A Named Entity is defined (as in [10]) as a continuous fragment of text referring to information units such as persons, geographical locations, names of organizations or locations, dates, percentages or amounts of money. Named Entity Recognition plays a key role in the process of Machine Translation ([10]). Usually, Named Entities carry the most important information in the sentence. Experience shows (as stated in [11]) that Named Entities are prone to translation errors. Hence, correct handling of Named Entities during the process of translation can considerably improve the translation quality ([10]).

Table 1 lists some specific Named Entity types handled by NeLex:

Table 1. Common Named Entity types.

Type        Description                                    Example(s)
JOLFull     Reference to the Journal of Laws               Journal of Laws of 2004/04/06 No. 1, item 12
Company     Name of a corporation                          ACME Ltd.
E-mail      An e-mail address                              login@example.com
Paragraph   Reference to a specific part of a regulation   §34 sec. 4 item g)
Number      A real number                                  315.871


In NeLex, Named Entities are managed by a substitution mechanism similar to word substitution. Recognition and translation of Named Entities is executed by a dedicated module (see [9]), which uses semi-supervised learned rules. The rules are written in a formalism called NERT (full specification of the formalism can be found in [10]). The NeLex formalism applies only a part of the NERT formalism: it disregards linguistic knowledge about words. A simple example of a NERT rule is presented below:


Direction: Polish to English

Match: <[A-Z]\w+> Sp\. z o\.o\.

Action: replace(\1 Ltd.)



The above rule applies to translation from Polish to English. It serves to substitute a Polish name of a company with its appropriate English equivalent.
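For readers more familiar with regular expressions than with NERT, the rule above can be approximated in Python as follows. The sketch assumes that the angle brackets in the Match line mark a captured group that the Action line references as \1; the function name is hypothetical and the NERT engine itself is not used here.

```python
import re

# Assumed reading of the rule: <...> is a capture group, \1 refers back to it.
COMPANY_RULE = re.compile(r"([A-Z]\w+) Sp\. z o\.o\.")


def translate_company_names(text: str) -> str:
    """Replace the Polish legal form 'Sp. z o.o.' after a capitalized name with 'Ltd.'."""
    return COMPANY_RULE.sub(r"\1 Ltd.", text)


rewritten = translate_company_names("Umowa z PolEng Sp. z o.o. została podpisana.")
```

As in the original rule, only the single capitalized word directly before the legal form is captured, and the class [A-Z] covers ASCII capitals only, so names starting with a Polish capital letter such as Ż would need a wider character class.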

3.4. Specialization in Legal Texts

NeLex's key feature is specialization in a dedicated domain of texts (the idea is described in general in [4]), namely legal texts. NeLex deals with characteristic features of texts in this domain, such as legal vocabulary and the presence of legal references, e.g. 'article 23 No. 78, item 483'.

Much of the most common legal vocabulary is contained in NeLex's dictionaries and used during word substitution. Legal references, on the other hand, are treated as a type of Named Entity.

Legal texts are likely to be translated correctly with the use of a translation memory, as they contain recurring sentences and phrases, such as: 'with subsequent amendments', 'with regard to the provisions of...', 'this article has been repealed'.

4. The Algorithm for Sentence Clustering

4.1. The Need for the Algorithm

In the first step of creating a translation memory for the LEX project we collected a large translation memory. However, usability tests revealed significant disadvantages of this translation memory.

One disadvantage is the low speed of example search. Although the search mechanism has been optimized, it still requires a considerable amount of time to find a good example for a given input sentence (especially for long input sentences) because of the translation memory size. This problem limits the usability of the system.

Moreover, when the translation system was tested on real-life examples, it became clear that the Eur-Lex translation memory, though vast, does not contain examples of sentences commonly used by lawyers. This is because it only contains the Official Journal of the European Union as well as the treaties, legislation, case-law and legislative proposals ([5]). It does not, for instance, contain any contracts, which are the main type of documents translated for lawyers.

A closer look into our translation memory also revealed that not all translations acquired automatically are of high quality.

4.2. Clustering: the Main Idea

Having discovered these problems, we developed a new method of building a translation memory. In the first step a monolingual corpus is selected, containing the types of texts most likely to be translated with the aid of the system. Such a monolingual corpus is obviously easier to obtain than an equivalent bilingual corpus. In the LEX project, we compiled a corpus containing various contract and pleading templates, as well as court decisions from the Polish legal text database commonly used by lawyers (LEX Prestige system).

The idea of clustering is based on the observation that many sentences in the corpus exhibit certain similarities. Thus, a valuable translation memory can be obtained by producing the translations of the most frequent sentences in the corpus. Indeed, translating all the sentences in the corpus would be time-consuming and labour-intensive, yet a relatively small list of the most typical sentences can be translated manually.

In the LEX project, a list of approximately 500 of the most typical Polish sentences used in contracts was extracted by means of clustering analysis. The sentences were then translated into English by professional translators.

4.3. The Clustering Algorithm

Clustering algorithms work on sets of objects. They use a measure of distance between these objects in order to divide sets into smaller chunks containing objects which are "close" to each other (in terms of the distance measure).

In our case, the set of objects is the monolingual corpus and the objects are sentences. In order to determine the distance between sentences we use two distance measures ("cheap" and "expensive"), inspired by [12]. The terms "cheap" and "expensive" refer to the complexity of the calculations.
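The paper does not give the formulas of the two measures, so the following Python sketch shows only one plausible pair, under stated assumptions: the "cheap" measure compares sentence lengths in characters, and the "expensive" measure compares sentence contents token by token using Python's difflib. Both function names and both formulas are hypothetical.

```python
from difflib import SequenceMatcher


def cheap_distance(a: str, b: str) -> float:
    """Relative difference of sentence lengths, in [0, 1].

    Needs only the two lengths, so it is cheap even for long sentences.
    """
    longest = max(len(a), len(b))
    return abs(len(a) - len(b)) / longest if longest else 0.0


def expensive_distance(a: str, b: str) -> float:
    """1 minus the token-level similarity ratio, in [0, 1].

    Has to compare the contents of both sentences, hence "expensive".
    """
    return 1.0 - SequenceMatcher(None, a.split(), b.split()).ratio()
```

The point of having two measures is that the cheap one can rule out most sentence pairs quickly, so that the expensive one is computed only within groups of sentences of similar length.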

The clustering algorithm is presented in Table 2.


IN: Set (S) of strings (representing sentences)

OUT: List (C) of clusters (representing clusters of sentences)

1. Divide set S into clusters using the QT algorithm ([13]) with the "cheap" sentence distance measure (the measure is based only on sentence length).

2. Sort the clusters in descending order by number of elements, resulting in the sorted list of clusters, C.

3. For each cluster cL in the list C:

   (a) Apply the QT algorithm with the "expensive" distance measure (based on sentence contents) to cL, resulting in subclusters.

   (b) Sort the subclusters in cL in descending order by number of elements.

   (c) Copy the sentences from the subclusters into C.
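The steps above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the LEX project: qt_cluster is a naive version of the QT (quality threshold) idea from [13], and the two distance functions and both thresholds in the usage example are assumptions chosen only to make the example run.

```python
from difflib import SequenceMatcher


def qt_cluster(items, distance, max_diameter):
    """Naive QT clustering: repeatedly extract the largest group of items whose
    pairwise distances all stay within `max_diameter`."""
    remaining = list(items)
    clusters = []
    while remaining:
        best = []
        for i, seed in enumerate(remaining):
            candidate = [seed]
            pool = remaining[:i] + remaining[i + 1:]
            while pool:
                # Pick the item that keeps the candidate cluster tightest.
                nearest = min(pool, key=lambda p: max(distance(p, m) for m in candidate))
                if max(distance(nearest, m) for m in candidate) > max_diameter:
                    break
                candidate.append(nearest)
                pool.remove(nearest)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(best)
        for item in best:
            remaining.remove(item)
    return clusters


def cluster_sentences(sentences, cheap, expensive, cheap_limit, expensive_limit):
    """Two-stage clustering following steps 1-3 of the algorithm above."""
    coarse = qt_cluster(sentences, cheap, cheap_limit)               # step 1
    coarse.sort(key=len, reverse=True)                               # step 2
    result = []
    for cluster in coarse:                                           # step 3
        subclusters = qt_cluster(cluster, expensive, expensive_limit)  # (a)
        subclusters.sort(key=len, reverse=True)                      # (b)
        result.extend(subclusters)                                   # (c)
    return result


# Assumed toy measures: token-count difference and token-level dissimilarity.
def length_gap(a, b):
    return abs(len(a.split()) - len(b.split()))


def content_gap(a, b):
    return 1.0 - SequenceMatcher(None, a.split(), b.split()).ratio()


corpus = [
    "this article has been repealed",
    "this article has been repealed",
    "this section has been repealed",
    "with subsequent amendments",
    "with later amendments",
]
clusters = cluster_sentences(corpus, length_gap, content_gap, 0, 0.3)
```

The naive QT loop recomputes many distances and is far too slow for a large corpus; it is written this way only to keep the control flow close to the description above.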


The final step is performed by humans: they manually select the most valuable sentences from the clusters. The task is facilitated by a frequency prompt (the most frequent sentences always appear at the beginning of the cluster).

5. Future work

Ideas for future work include:



- developing the algorithm for translation memory preparation,

- developing EBMT algorithms, mainly the optimization of translation memory search,

- adapting the system to deal with texts from other domains (such as medicine, business or computer science).

References

[1] Nagao M. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. In: Proceedings of the International NATO Symposium on Artificial and Human Intelligence, pages 173-180.

[2] Forcada M., Way A. 2009. Foreword of the Proceedings of the 3rd International Workshop on Example-Based Machine Translation.

[3] Smith J., Clark S. 2009. EBMT for SMT: A New EBMT-SMT Hybrid. In: Proceedings of the 3rd International Workshop on Example-Based Machine Translation.

[4] Gough N., Way A. 2003. Controlled generation in example-based machine translation. http://www.mt-archive.info/MTS-2003-Gough.pdf

[5] Access to European Union law. http://eur-lex.europa.eu/.

[6] Carl M., Way A. 2003. Recent Advances in Example-Based Machine Translation. Kluwer Academic Publishers.

[7] Turcato D., Popowich F. 2001. What is example-based machine translation? http://www.iai.uni-sb.de/~carl/ebmt-workshop/dt.pdf.

[8] Hou H., Deng D., Zou G., Yu H., Liu Y., Xiong D. 2004. An EBMT system based on word alignment. http://www.mt-archive.info/IWSLT-2004-Hou.pdf.

[9] Jaworski R. 2009. Tłumaczenie tekstów prawniczych przez analogię [Translation of legal texts by analogy]. Master's thesis under the supervision of dr Krzysztof Jassem.

[10] Jassem K., Marcińczuk M. 2008. Semi-supervised learning rule acquisition for Named Entity recognition and translation.

[11] Vilar D., Xu J., D'Haro L. F., Ney H. 2006. Error Analysis of Statistical Machine Translation Output. In: Proc. Language Resources and Evaluation conference.



[12] McCallum A., Nigam K., Ungar L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. http://www.kamalnigam.com/papers/canopy-kdd00.pdf.

[13] Heyer L. J., Kruglyak S., Yooseph S. 1999. Exploring expression data: Identification and analysis of coexpressed genes. In: Genome Research 9:1106-1115.