CLIR System Enhanced with Transliteration Generation and Mining

K Saravanan, Raghavendra Udupa & A Kumaran

Microsoft Research India




CLIR System

[Diagram: Query, Query Translator, Dictionary, Document Ranker, Indexed Documents, Results]

Indexed Documents: Telegraph India 04-07 articles

Example query: FIRE 2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव ....

Dictionary entries: सम्बन्ध: relations; मालिकों: owners; प्रसिद्ध: famous; ….

Baseline Retrieval System


Language Model-Based Retrieval

Probabilistic Translation Lexicon

~100K Hindi-English parallel sentences

~50K Tamil-English parallel sentences

IBM Model 3 alignment, GIZA++

J. Jagarlamudi and A. Kumaran, Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
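
A rough illustration of how a language-model-based ranker can use a probabilistic translation lexicon: each source-language query term is scored as the sum, over its dictionary translations, of translation probability times a smoothed document language model. The slides do not give the scoring formula, so the Jelinek-Mercer smoothing, the weight `lam`, the function name `score`, and the toy lexicon and documents are all assumptions made for this sketch.

```python
import math
from collections import Counter

def score(query_terms, doc_tokens, coll_counts, coll_len, lexicon, lam=0.7):
    """Log-likelihood of a source-language query given a target-language document.

    lexicon maps a source term to {target_word: translation_probability}.
    Smoothing (Jelinek-Mercer) and lam are assumptions, not from the slides.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for q in query_terms:
        p_q = 0.0
        for w, p_trans in lexicon.get(q, {}).items():
            p_doc = doc_counts[w] / doc_len if doc_len else 0.0
            p_coll = coll_counts[w] / coll_len
            p_q += p_trans * (lam * p_doc + (1 - lam) * p_coll)
        # Terms without a usable translation (e.g. OOV terms) are skipped,
        # so they contribute nothing to the ranking.
        if p_q > 0.0:
            log_p += math.log(p_q)
    return log_p

# Toy data (hypothetical): two documents and a two-entry lexicon.
docs = {
    "d1": "gutkha owners have links with the underworld".split(),
    "d2": "cricket match report".split(),
}
coll_counts = Counter(tok for toks in docs.values() for tok in toks)
coll_len = sum(coll_counts.values())
lexicon = {
    "मालिकों": {"owners": 0.8, "owner": 0.2},
    "सम्बन्ध": {"relations": 0.7, "links": 0.3},
}
query = ["मालिकों", "सम्बन्ध", "अन्डरवर्ल्ड"]  # the last term is OOV here

ranked = sorted(
    docs,
    key=lambda d: score(query, docs[d], coll_counts, coll_len, lexicon),
    reverse=True,
)
```

In this sketch the OOV term is silently dropped from the score, which is the gap that transliteration handling is meant to close.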

English Monolingual Results

Run ID                MAP      P@10
English-English-T     0.3653   0.344
English-English-TD    0.4571   0.406
English-English-TDN   0.5133   0.462

T: Title, D: Description, N: Narration

Monolingual performance is considered the upper limit for crosslingual performance
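
For readers unfamiliar with the two reported metrics, the sketch below uses the standard textbook definitions of precision at 10 (P@10) and mean average precision (MAP). The slides do not say which evaluation tool produced the numbers; the function names and the toy ranking are assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example (hypothetical document ids)
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
ap = average_precision(ranked, relevant)   # (1/2 + 2/4) / 3
p10 = precision_at_k(ranked, relevant)     # 2 relevant in the top 10
```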

Basic Crosslingual Results

Run ID                MAP      % of Mono IR   P@10
Hindi-English-T       0.293    80.24          0.26
Hindi-English-TD      0.4042   88.43          0.356
Hindi-English-TDN     0.47     92.5           0.424
Tamil-English-T       0.271    74.19          0.26
Tamil-English-TD      0.3439   75.24          0.35
Tamil-English-TDN     0.3912   76.21          0.37

Can we improve this?
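
The "% of Mono IR" column appears to be the crosslingual MAP divided by the English monolingual MAP for the same query fields. The short check below reproduces four of the reported percentages under that assumption; the dictionary and function names are illustrative.

```python
# English monolingual MAP per query-field setup (from the monolingual table)
mono_map = {"T": 0.3653, "TD": 0.4571, "TDN": 0.5133}

# Crosslingual MAP values as reported on the slide
cross_map = {
    ("Hindi", "TD"): 0.4042,
    ("Tamil", "T"): 0.271,
    ("Tamil", "TD"): 0.3439,
    ("Tamil", "TDN"): 0.3912,
}

def percent_of_mono(cross, mono):
    """Crosslingual MAP as a percentage of monolingual MAP, to two decimals."""
    return round(100.0 * cross / mono, 2)

pct = {run: percent_of_mono(m, mono_map[run[1]]) for run, m in cross_map.items()}
for run, value in pct.items():
    print(run, value)
```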


CLIR System & OOVs

[Diagram: Query, Query Translator, Dictionary, Document Ranker, Indexed Documents, Results]

Indexed Documents: Telegraph India 04-07 articles

Example query: FIRE 2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव ....

Dictionary entries: सम्बन्ध: relations; मालिकों: owners; प्रसिद्ध: famous; ….

OOVs (no dictionary entry): अन्डरवर्ल्ड: ?; प्रलेख: ?; माणिकचन्द: ?; ….

Out-of-Vocabulary (OOV) Query Terms

Many OOV terms are named entities (NEs)

NEs are often the focus of a query

NEs form an open class of terms in all languages

Hence, getting their transliterations right may help CLIR performance

E.g. इस्राइली (israel), तसलिमा (taslima), विजयेंद्र (vijayendra), हिजबुल्लाह (hizbollah)

Many OOV terms are borrowed terms

E.g. टेन्डर (tender), अन्डरवर्ल्ड (underworld), कौसमेटिक्स (cosmetics), एनकाउंटर (encounter), गुटखा (gutkha)

OOV Terms …

With the long query (TDN) setup:

Hindi queries have 73 OOV terms

31 of them are NEs or borrowed from English

Tamil queries have 129 OOV terms

61 of them are NEs or borrowed from English

Nearly 50% of them may be transliterated (fact) and may improve CLIR performance (hypothesis)
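
The "nearly 50%" figure can be checked directly from the counts above. The snippet is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# (total OOV terms, OOV terms that are NEs or borrowed from English), TDN setup
oov_counts = {"Hindi": (73, 31), "Tamil": (129, 61)}

def share(total, transliteratable):
    """Percentage of OOV terms that are candidates for transliteration."""
    return round(100.0 * transliteratable / total, 1)

shares = {lang: share(*counts) for lang, counts in oov_counts.items()}
overall = share(
    sum(total for total, _ in oov_counts.values()),
    sum(translit for _, translit in oov_counts.values()),
)
print(shares, overall)
```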


Two Ways of Handling OOV Terms

Transliteration Generation [Li et al., 2009; Khapra et al., 2010]

Transliterations of OOV terms are generated using an automatic Machine Transliteration system

Transliteration Mining (Udupa et al., 2009-b)

Transliterations of OOV terms are mined from the top-retrieved documents

Transliteration Generation - Direct

Based on Conditional Random Fields

The feature set includes character-alignment data and source and target bigrams and trigrams

Trained on 15K source-target language parallel single-word names

Transliteration Generation - Transitive

Serial combination of multiple direct transliteration systems

Useful when sufficient parallel data between the source and target languages is not directly available

[Hindi-English] = [Hindi-Kannada] + [Kannada-English]

15K parallel single-word names were used for training on each pair

For a given input, the top 10 results of the first system are given to the second system

The outputs of the second system are merged and finally re-ranked by their probability scores

M. Khapra, A. Kumaran and P. Bhattacharyya, Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, NAACL 2010.
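
The serial combination can be sketched as follows. The slide states only that the top 10 outputs of the first system are fed to the second and that the merged outputs are re-ranked by probability; multiplying the two scores and summing over bridge candidates is an assumption of this sketch, as are the function names and the toy candidate lists (the bridge strings are placeholders, not real Kannada output).

```python
from collections import defaultdict

def transitive_transliterate(word, first_system, second_system, top_k=10):
    """Chain two transliteration systems through a bridge language.

    Each system maps a word to a list of (candidate, probability) pairs,
    best first. Score combination (product, summed over bridges) is assumed.
    """
    merged = defaultdict(float)
    for bridge, p1 in first_system(word)[:top_k]:
        for target, p2 in second_system(bridge):
            merged[target] += p1 * p2
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-ins for the Hindi-Kannada and Kannada-English systems (hypothetical)
hi_kn = {"आँध्र": [("kn_cand_1", 0.6), ("kn_cand_2", 0.4)]}
kn_en = {
    "kn_cand_1": [("aandhra", 0.5), ("andhra", 0.5)],
    "kn_cand_2": [("aandra", 0.7), ("andhra", 0.3)],
}

result = transitive_transliterate(
    "आँध्र",
    lambda w: hi_kn.get(w, []),
    lambda w: kn_en.get(w, []),
)
```

Summing over bridge candidates rewards a target string that is reachable through several bridge forms, which is one plausible reading of "merged".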

Examples of Transliteration Generation

Hindi/Tamil OOV: आँध्र
Generation - Direct: aandhra, andhra, aandra, aanara, aandhara
Generation - Transitive: aandhra, andhra, aandhrar, andhrar, aandhar

Hindi/Tamil OOV: इस्राइली
Generation - Direct: israili, israeli, israili, israilli, istraili
Generation - Transitive: israily, isralie, israly, isrily, israly

Hindi/Tamil OOV: मसजिद
Generation - Direct: masjid, masajid, masjaid, msajid, masjed
Generation - Transitive: masjid, masajid, maszid, maszid, masajid

Hindi/Tamil OOV: என்செபாலிடிஸ்
Generation - Direct: encebalidis, encebalydis, encepalidis, encebalitis, ensebalidis
Generation - Transitive: -NA-

Hindi/Tamil OOV: மகராஷ்டிர
Generation - Direct: makrashtir, makrashir, makrashtira, magrashtir, makrashira
Generation - Transitive: -NA-

Transliteration Mining

Hypothesis:

The transliterations of many OOV query terms can be found in the top results of the CLIR system for that query.

Basic Idea:

Pair the query with each of the top N results.

Treat each pair as a comparable document pair.

Mine transliteration equivalents from the comparable document pairs.

R. Udupa, K. Saravanan, A. Bakalov and A. Bhole, “They are out there, if you know where to look”: Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval, ECIR 2009
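
A minimal sketch of the basic idea: each OOV term is compared against every word in the top-retrieved documents, and close matches are kept as mined transliterations. The slides do not describe the similarity model used for mining, so this sketch swaps in a plain string-similarity ratio from Python's `difflib` over an assumed rough romanization of the OOV term. The function name, the threshold, and the toy documents are hypothetical.

```python
from difflib import SequenceMatcher

def mine_transliterations(oov_romanized, top_documents, threshold=0.75):
    """Mine candidate transliterations of OOV terms from top-ranked documents.

    oov_romanized: {oov_term: rough romanization}, assumed to be available.
    The difflib ratio is a stand-in for a real transliteration-similarity model.
    """
    mined = {}
    for term, roman in oov_romanized.items():
        candidates = set()
        for doc in top_documents:
            for word in doc.lower().split():
                word = word.strip(".,;:!?\"'()")
                if word and SequenceMatcher(None, roman, word).ratio() >= threshold:
                    candidates.add(word)
        if candidates:
            mined[term] = sorted(candidates)
    return mined

# Toy top-N results for one query (hypothetical text)
top_docs = [
    "Israeli troops entered the area.",
    "The Israelis and the Palestinians met.",
]
mined = mine_transliterations({"इस्राइली": "israili"}, top_docs)
```

Because the candidates come from real documents, mined words are actual English spellings, whereas generated candidates may be plausible strings that never occur in the collection.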

Examples of Transliteration Mining

Hindi / Tamil OOV term: Mined English words

आँध्र: Andhra

इस्राइली: israel, israeli, israelis

मसजिद: masjid, masjids

என்செபாலிடிஸ்: encephalitis

மகராஷ்டிர: maharashtra, maharashtras

Hybrid Approach (Mining + Generation)

Combination of transliteration mining and generation

First, transliterations for the OOV terms were mined from the top results of the CLIR system

Second, transliterations were generated for those OOV terms for which mining found nothing
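
The two-step fallback can be sketched as below. The function and argument names are hypothetical; `mined` stands for the output of any mining step and `generate` for any transliteration generator.

```python
def hybrid_transliterations(oov_terms, mined, generate):
    """Use mined transliterations where available, otherwise fall back to generation.

    mined: {oov_term: [candidates]} from the mining step.
    generate: callable returning a list of generated candidates for a term.
    """
    result = {}
    for term in oov_terms:
        if mined.get(term):
            result[term] = list(mined[term])
        else:
            result[term] = generate(term)
    return result

# Toy inputs (hypothetical): mining found one term, generation covers the other
mined_toy = {"आँध्र": ["andhra"]}
generated_toy = {"मसजिद": ["masjid", "masajid"]}

hybrid = hybrid_transliterations(
    ["आँध्र", "मसजिद"],
    mined_toy,
    lambda term: generated_toy.get(term, []),
)
```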

Experimental Results

Hindi-English Crosslingual Results: T

Run ID                   MAP      % of Mono IR   P@10
Hindi-English-T          0.293    80.24          0.26
Hindi-English-T[G_D]     0.3168   87.72          0.282
Hindi-English-T[G_T]     0.314    85.96          0.276
Hindi-English-T[M]       0.339    92.8           0.304
Hindi-English-T[M+G_D]   0.3388   92.75          0.302
Hindi-English-T[M+G_T]   0.3388   92.75          0.302

M: Transliteration Mining, G_D: Transliteration Generation - Direct, G_T: Transliteration Generation - Transitive

Hindi-English Crosslingual Results: TD

Run ID                    MAP      % of Mono IR   P@10
Hindi-English-TD          0.4042   88.43          0.356
Hindi-English-TD[G_D]     0.4336   94.86          0.386
Hindi-English-TD[G_T]     0.4369   95.58          0.382
Hindi-English-TD[M]       0.4376   95.73          0.388
Hindi-English-TD[M+G_D]   0.4378   95.78          0.386
Hindi-English-TD[M+G_T]   0.4375   95.71          0.386

M: Transliteration Mining, G_D: Transliteration Generation - Direct, G_T: Transliteration Generation - Transitive

Hindi-English Crosslingual Results: TDN

Run ID                     MAP      % of Mono IR   P@10
Hindi-English-TDN          0.47     92.5           0.424
Hindi-English-TDN[G_D]     0.4942   96.28          0.434
Hindi-English-TDN[G_T]     0.497    96.82          0.438
Hindi-English-TDN[M]       0.4977   96.96          0.442
Hindi-English-TDN[M+G_D]   0.4971   96.84          0.444
Hindi-English-TDN[M+G_T]   0.4965   96.73          0.444

M: Transliteration Mining, G_D: Transliteration Generation - Direct, G_T: Transliteration Generation - Transitive

Tamil-English Crosslingual Results

Run ID                     MAP      % of Mono IR   P@10
Tamil-English-T            0.271    74.19          0.26
Tamil-English-T[G_D]       0.2891   79.14          0.3
Tamil-English-T[M]         0.2815   77.06          0.26
Tamil-English-T[M+G_D]     0.2816   77.09          0.27
Tamil-English-TD           0.3439   75.24          0.35
Tamil-English-TD[G_D]      0.3548   77.62          0.35
Tamil-English-TD[M]        0.3621   79.22          0.35
Tamil-English-TD[M+G_D]    0.3617   79.13          0.4
Tamil-English-TDN          0.3912   76.21          0.37
Tamil-English-TDN[G_D]     0.4068   79.25          0.38
Tamil-English-TDN[M]       0.4145   80.75          0.37
Tamil-English-TDN[M+G_D]   0.4139   80.64          0.4

M: Transliteration Mining, G_D: Transliteration Generation - Direct

Conclusion

We presented a modular CLIR system that allows experimentation with different methodologies

Our methodologies for handling OOV terms improved crosslingual retrieval performance significantly

Our Hindi-English (TDN) crosslingual performance is 97% of the monolingual performance


Publications

Jagarlamudi, J. and Kumaran, A. 2007. Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.

Udupa, R., Jagarlamudi, J. and Saravanan, K. 2008. Microsoft Research India at FIRE 2008: Hindi-English Cross-Language Information Retrieval. Working Notes for the Forum for Information Retrieval Evaluation (FIRE) 2008 Workshop.

Udupa, R., Saravanan, K., Bakalov, A. and Bhole, A. 2009. "They Are Out There, If You Know Where to Look": Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval. In 31st European Conference on IR Research, ECIR 2009.

Li, H., Kumaran, A., Pervouchine, V. and Zhang, M. 2009. Report of NEWS 2009 Machine Transliteration Shared Task. Proceedings of the ACL 2009 Workshop on Named Entities (NEWS 2009), Association for Computational Linguistics, August 2009.

Khapra, M., Kumaran, A. and Bhattacharyya, P. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Proceedings of NAACL 2010.

Thank You


Impact of M/G on Queries

No. of queries with +ive/-ive impact of M/D (with change in the 2nd digit)

Run ID                  +ive   -ive
Hindi-English-T[M]      9      2
Hindi-English-T[D]      6      0
Hindi-English-TD[M]     9      4
Hindi-English-TD[D]     11     4
Hindi-English-TDN[M]    11     7
Hindi-English-TDN[D]    9      3
Tamil-English-T[M]      4      2
Tamil-English-T[D]      6      0
Tamil-English-TD[M]     10     2
Tamil-English-TD[D]     5      1
Tamil-English-TDN[M]    12     4
Tamil-English-TDN[D]    6      2

Queries With High Positive Impact

Run ID: Hindi-English-T[M], Hindi-English-T[D]
Query with high +ive impact: 111
Possible reason: Both mining and generation got the valid equivalent ‘dance’ for the only OOV term ‘डान्स’.

Run ID: Hindi-English-TD[M]
Query with high +ive impact: 77
Possible reason: There are 2 OOV terms which are transliteratable. Mining got a valid equivalent ‘israeli’ for the OOV term ‘इस्राइली’.

Run ID: Hindi-English-TD[D]
Query with high +ive impact: 111
Possible reason: Generation produced a valid equivalent ‘dance’ for the only OOV term ‘डान्स’, which occurred thrice in the query.

Run ID: Hindi-English-TDN[M]
Query with high +ive impact: 112
Possible reason: 2 out of 5 OOV terms are transliteratable. Mining got a valid equivalent ‘underworld’ for the first OOV term ‘अन्डरवर्ल्ड’ and ‘manikchand’ for the second OOV term ‘माणिकचन्द’.

Run ID: Hindi-English-TDN[D]
Query with high +ive impact: 112
Possible reason: 2 out of 5 OOV terms are transliteratable. Generation produced a valid equivalent ‘manikchand’ for the OOV term ‘माणिकचन्द’.

Run ID: Tamil-English-T[M], Tamil-English-T[D]
Query with high +ive impact: 103
Possible reason: The only OOV term ‘பக்லிஹர்’ got a valid equivalent ‘baglihar’ from both mining and generation.

Run ID: Tamil-English-TD[M]
Query with high +ive impact: 81
Possible reason: There are two OOV terms: ‘என்செபாலிடிஸ்’ and its inflection ‘என்செபாலிடிஸ்ஸின்’. Mining got ‘encephalitis’ for the first OOV term.

Run ID: Tamil-English-TD[D]
Query with high +ive impact: 103
Possible reason: The only OOV term ‘பக்லிஹர்’, which occurred twice in the query, got a valid equivalent ‘baglihar’ from generation.

Run ID: Tamil-English-TDN[M]
Query with high +ive impact: 76
Possible reason: 3 out of 7 OOV terms are transliteratable. Mining got the valid equivalent ‘meenas’ for the OOV term ‘மீனாஸ்’, which occurred four times in the query.

Run ID: Tamil-English-TDN[D]
Query with high +ive impact: 103
Possible reason: 1 out of 2 OOV terms is transliteratable. The only transliteratable OOV term ‘பக்லிஹர்’, which occurred thrice in the query, got a valid equivalent ‘baglihar’ from generation.

Queries With High Negative Impact

Run ID: Hindi-English-T[M], Hindi-English-TD[M]
Query with -ive impact: 80
Possible reason: The only OOV term ‘मसजिद’ got two valid equivalents, ‘masjids’ and ‘masjid’, but it is not clear why the query was negatively affected.

Run ID: Hindi-English-TD[D]
Query with -ive impact: 90
Possible reason: The only OOV term ‘प्रतिनिधी’ is not transliteratable, so all the generation outputs are noise.

Run ID: Hindi-English-TDN[M]
Query with -ive impact: 123
Possible reason: There are two OOV terms, ‘फलस्तीनी’ and ‘प्रलेख’. For the first term, mining got two valid equivalents, ‘palestines’ and ‘palestine’. The second term ‘प्रलेख’, which is not transliteratable, got the noise word ‘palekar’.

Run ID: Hindi-English-TDN[D]
Query with -ive impact: 98
Possible reason: The only OOV term ‘प्रलेख’ is not transliteratable, so all the generated words are noise.

Run ID: Tamil-English-T[M]
Query with -ive impact: 106
Possible reason: Mining got 7 English equivalents for the OOV term ‘ஷேம்’, but all of them are noise.

Run ID: Tamil-English-TD[M], Tamil-English-TDN[M]
Query with -ive impact: 114
Possible reason: Mining got ‘fernandess’ for the OOV term ‘சபர்ணான்டர்ஸ்’, but the actual English query has ‘fernandez’.

Run ID: Tamil-English-TD[D], Tamil-English-TDN[D]
Query with -ive impact: 110
Possible reason: For the OOV term ‘நாதுலா’, generation produced ‘naatula, nadhula, nathula, nadula, natula’, but the actual English query has the equivalent as two words, ‘Nathu la’. Also, some of the other OOVs are not transliteratable, so all the generated words for them are noise.

Hindi Query 112:

Type: Query

Title: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव

Description: प्रसिद्ध गुटखा कम्पनी (माणिकचन्द और गोवा) के साथ दाऊद इब्राहिम के सम्बन्ध

Narration: प्रासंगिक प्रलेख में माणिकचन्द गुटखा और गोवा गुटखा मालिकों का अन्डरवर्ल्ड डॉन दाऊद इब्राहिम के साथ सम्बन्ध, से सम्बन्धित सूचनाएँ यहाँ होनी चाहिये। अन्य कम्पनियों के […] दाऊद इब्राहिम के सम्बन्ध यहाँ अप्रासंगिक हैं