Cross-Lingual Linking of News Stories using ESA




Copyright 2011 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute

www.deri.ie

Enabling Networked Knowledge

Cross-Lingual Linking of News Stories using ESA

Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia

DERI, NUI Galway, Ireland

OEG, UPM, Madrid, Spain

Tuesday, 18 Dec, 2012

CL!NSS, FIRE-2012


Overview


Problem Space


Approach


Search Space Reduction


Semantic Ranking


Cross-Lingual Explicit Semantic Analysis (CL-ESA)


Evaluations


Conclusion & Future Work


Problem Space


Cross-lingual news story linking


identify the same news articles in different languages


Cross-lingual plagiarism detection



Data set


50 English News Stories


50K Hindi News Stories



Challenge


Stories are not direct translations of one another


Similar keywords in different stories


Different keywords in similar stories


Approach


Search Space Reduction


News publication dates


by taking a window of K days


Vocabulary overlap


Translating English news stories using Google Translate



Semantic Ranking


Rank the news stories by their semantic relatedness


CL-ESA semantic relatedness score
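The date-based part of the search space reduction can be sketched as follows. This is only an illustration, assuming a symmetric window around the English story's publication date; the function names and the three toy Hindi stories are hypothetical and not taken from the CL!NSS data.

```python
from datetime import date, timedelta


def within_window(query_date, candidate_date, days_before, days_after):
    """True if candidate_date lies in the closed window around query_date."""
    start = query_date - timedelta(days=days_before)
    end = query_date + timedelta(days=days_after)
    return start <= candidate_date <= end


def candidates_by_date(query_date, stories, days_before=2, days_after=2):
    """Keep the ids of (story_id, pub_date) pairs published inside the window."""
    return [story_id for story_id, pub_date in stories
            if within_window(query_date, pub_date, days_before, days_after)]


# Hypothetical toy collection: (story id, publication date).
hindi_stories = [
    ("hi-001", date(2012, 3, 1)),
    ("hi-002", date(2012, 3, 4)),
    ("hi-003", date(2012, 3, 9)),
]

# English story published on 3 March 2012, default window of 2 days each side.
kept = candidates_by_date(date(2012, 3, 3), hindi_stories)
print(kept)
```

Widening `days_before` and `days_after` trades a larger candidate set for a lower chance of dropping the true match before ranking.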



Corpus-based Relatedness


Semantic meaning as a distributional vector


Words that occur in similar contexts tend to have similar or
related meanings, i.e. the meaning of a word can be defined in
terms of its context (Distributional Hypothesis; Harris, 1954)



Latent Semantic Analysis (LSA)


Latent or implicit semantics (unsupervised)



Explicit Semantic Analysis (ESA)


Explicit semantics from explicitly derived concepts (supervised)
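A minimal sketch of "meaning as a distributional vector": each word is represented by counts of the words around it, and two words are compared with the cosine of their vectors. The three-sentence corpus, the window size, and the function names are assumptions made for illustration; neither LSA nor ESA is implemented here.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus.
corpus = [
    "the minister announced the budget",
    "the minister presented the budget",
    "the team won the match",
]
vecs = context_vectors(corpus)
sim_related = cosine(vecs["announced"], vecs["presented"])
sim_unrelated = cosine(vecs["announced"], vecs["won"])
print(sim_related > sim_unrelated)
```

"announced" and "presented" never co-occur here, yet they score as close because they share the same contexts, which is the point of the distributional hypothesis.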




Semantic Ranking/Relatedness


Cross-lingual ESA (CL-ESA)

[Figure: CL-ESA pipeline. Term@en and Term@hi are looked up in per-language inverted indexes (EN, HI, ES) that map each word (Word 1 ... Word n) to a weighted concept vector w1*URI1 + w2*URI2 + ... + wn*URIn; the cosine of the two vectors gives the semantic relatedness.]


Multilingual Wikipedia Index


EN, DE, ES, PT, FR, NL, HI


Easily extendable to other languages


Performed better than CL-latent models
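The CL-ESA comparison can be sketched roughly as below: each text is mapped, through the index of its own language, to a weighted vector over concept URIs, and the two vectors are compared by cosine. The tiny indexes, weights, and `wiki:` URIs are invented for illustration, and the sketch assumes that concepts have already been aligned across languages so that both indexes share the same URIs; the real multilingual Wikipedia index is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical per-language inverted indexes: term -> {concept URI: weight}.
# Assumption: concept URIs are already aligned across languages.
EN_INDEX = {
    "election": {"wiki:Election": 0.9, "wiki:Parliament": 0.4},
    "cricket": {"wiki:Cricket": 1.0},
}
HI_INDEX = {
    "चुनाव": {"wiki:Election": 0.8, "wiki:Parliament": 0.5},
    "क्रिकेट": {"wiki:Cricket": 0.9},
}


def concept_vector(text, index):
    """Sum the concept vectors of all terms of `text` found in `index`."""
    vec = Counter()
    for term in text.split():
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse concept vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cl_esa(text_a, index_a, text_b, index_b):
    """Relatedness of two texts in different languages via shared concepts."""
    return cosine(concept_vector(text_a, index_a),
                  concept_vector(text_b, index_b))


same_topic = cl_esa("election", EN_INDEX, "चुनाव", HI_INDEX)
other_topic = cl_esa("election", EN_INDEX, "क्रिकेट", HI_INDEX)
print(same_topic > other_topic)
```

Because the comparison happens in concept space, adding a language only requires an index for that language over the same concepts, which is consistent with the "easily extendable" claim above.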



Run1


window of 4 days (2 days before and 2 days after)


Rank all news stories using CL-ESA


Run2


window of 14 days (7 days before and 7 days after)


Rank all news stories using Modified CL-ESA


Run3


English stories were translated into Hindi using Google Translate


Took the top 1000 Hindi news stories by vocabulary overlap


Re-rank all news stories using CL-ESA
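The two-stage shape of Run3 (shortlist by vocabulary overlap, then re-rank) might look like the sketch below. The slides do not define the overlap measure, so counting shared distinct tokens is an assumption; the re-ranking score is passed in as a function, and the Jaccard scorer used here is only a stand-in for CL-ESA. Document ids and single-letter tokens are placeholders.

```python
def vocabulary_overlap(query_tokens, doc_tokens):
    """Assumed overlap measure: number of distinct tokens shared."""
    return len(set(query_tokens) & set(doc_tokens))


def top_n_by_overlap(query, docs, n):
    """Ids of the n documents sharing the most vocabulary with the query."""
    ranked = sorted(
        docs.items(),
        key=lambda item: vocabulary_overlap(query.split(), item[1].split()),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]


def rerank(query, doc_ids, docs, score):
    """Order the shortlisted ids by a relatedness score, best first."""
    return sorted(doc_ids, key=lambda d: score(query, docs[d]), reverse=True)


def jaccard(a, b):
    """Stand-in scorer (NOT CL-ESA): shared tokens over all tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


# Hypothetical translated query and Hindi stories as placeholder tokens.
query = "a b c"
docs = {
    "hi-1": "a b c k l m n o p q r s t",
    "hi-2": "a b x y",
    "hi-3": "u v w",
}

shortlist = top_n_by_overlap(query, docs, n=2)
final = rerank(query, shortlist, docs, score=jaccard)
print(shortlist, final)
```

The shortlist bounds recall: a true match that shares little vocabulary with the translated query is dropped before the semantic ranker ever sees it.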







Experiments



CL!NSS challenge


Evaluation: Results



Initial approach for cross-lingual linking of news stories


Bigger window with modified CL-ESA works best


Translated vocabulary overlap did not work well



Use other ranking scores


LSA, LDA


Evaluate separate effect of components


Bigger window size vs. ranking function


Conclusion





Thank You






Questions?

