Multi-Language Ontology-based Search Engine - Western Kentucky ...


Robert Wyatt and Elizabeth Romero

The Office of Distance Learning

Western Kentucky University, KY 42101, USA

Robert.wyatt@wku.edu

Elizabeth.romero@wku.edu

Leyla Zhuhadar and Olfa Nasraoui

Knowledge Discovery and Web Mining Lab

Dept. of Computer Engineering and Computer Science

University of Louisville, KY 40292, USA

Leyla.zhuhadar@wku.edu

Olfa.nasraoui@louisville.edu

The “Big Issues” in Information Retrieval.


Performance: Efficient search and indexing (Bruce Croft, 2009);

Incorporating new data: Coverage and freshness (Bruce Croft, 2009);

Scalability: Growing with data and users (Bruce Croft, 2009);

Adaptability: Tuning for applications and users (Bruce Croft, 2009);

Current problems: Information overload, keyword matching, ambiguity, handling the evolution of domains and users (Nasraoui: PKDD-2006-Invited-Talk).



ACHI 2010: Multi-Language Ontology-based Search Engine

HyperManyMedia @WKU.



http://hypermanymedia.wku.edu

Semantic Search using Ontology.



Why do we need a Cross/Multi-language Information Retrieval System?


Some major reasons for designing an MLIR system:

1. A repository of documents written in multiple languages, where an individual document may contain more than one language, for example:

   1. technical documents written in a language other than English that use expressions (jargon terms) written in English,

   2. a document that uses quotes written in a language different from the language of the article itself, and

   3. a document that cites foreign articles, where those citations are written in a language different from the language of the article itself.

2. A user who is able to read or use documents written in a specific language, but is not fluent enough in that language to formulate the right query terms to find the document, for example:

   1. a user who is searching for images that are tagged and indexed in a language the user does not understand,

   2. a researcher who is interested in a specific research topic and would like to know which individuals or institutes worldwide are working on the same topic, and

   3. a user who has a system to translate documents into different languages and would like to search for those documents in languages he or she is unfamiliar with.



Natural Language Processing & Machine Translation.



Natural Language Processing: Phonetics & Phonology, Morphology, Syntax, Semantics, Pragmatics, Discourse

Machine Translation: Word-for-Word Approach, Syntactic Approach, Semantic Approach, Interlingua Approach

Multi-language Information Retrieval System.



MLIR Research Field: Multilingual Retrieval, Bilingual Retrieval, Monolingual Retrieval, Domain Specific Retrieval

MLIR Approaches: Text Translation Approach, Thesaurus-based Approach, Corpus-based Approach

Approaches to MLIR.



Thesaurus-based Approach: Query Translation, Document Translation, Mix of Query and Document Translation

Corpus-based Approach: Automatic Thesaurus Construction, Term Vector Translation, Latent Semantic Indexing

Some History.


First MLIR in 1969 by Gerard Salton, who enhanced the SMART system to retrieve multilingual documents (English & German)

Pigur's system IRRD in 1979, based on a Vocabulary Thesaurus that used three languages (English, French and German)

Van der Eijk in 1993 used linguistic knowledge: Subject Thesaurus, Concept List, Term List, and Lexicon.







HyperManyMedia Methods for Cross-language.


Falls into Domain Specific Retrieval (E-learning).

A synergistic approach:

Thesaurus-based Approach (Query Translation), and

Corpus-based Approach (Term Vector Translation).




Thesaurus-based Approach.



A simple bilingual ontology thesaurus listing terms, phrases, concepts, and subconcepts;

Uses domain-specific terminology to capture the HyperManyMedia domain in two languages (English and Spanish).



Building the OWL File Using Protégé.



http://protege.stanford.edu/

Thesaurus-based Approach.




Building HyperManyMedia Bilingual Ontology:

We used Protégé (current ontology consists of ~40,000 lines of code: http://161.6.105.21:8084/ontology/semantic.owl)

Thesaurus-based Approach (Query translation approach) Method.



Scenario: A user submits a query in the semantic search interface, and the following two parallel processes occur:

1. All documents relevant to the query term are retrieved and then ranked based on Eq(1).

2. An automatic semantic mapping is made between the query term and the HyperManyMedia ontology, which is resident in memory. If the query term is part of the HyperManyMedia ontology, the information retrieval system automatically presents two semantic entities:

   1. All the subconcepts related to the query term in both languages (English and Spanish)

   2. The synonym of the query term in the alternative language
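
The semantic mapping in the second process could be sketched as follows. This is a minimal illustration, not the HyperManyMedia implementation: the `Concept` class, the `build_index` and `semantic_mapping` helpers, and the toy ontology entries are all hypothetical, and a plain in-memory dictionary stands in for the OWL ontology.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept with an English label, a Spanish label, and subconcepts."""
    en: str
    es: str
    subconcepts: list = field(default_factory=list)


def build_index(roots):
    """Map every lower-cased label (English or Spanish) to its concept."""
    index = {}
    stack = list(roots)
    while stack:
        concept = stack.pop()
        index[concept.en.lower()] = concept
        index[concept.es.lower()] = concept
        stack.extend(concept.subconcepts)
    return index


def semantic_mapping(query, index):
    """Return the subconcepts in both languages and the synonym in the other language.

    Returns None when the query term is not part of the ontology.
    """
    key = query.strip().lower()
    concept = index.get(key)
    if concept is None:
        return None
    query_is_english = key == concept.en.lower()
    return {
        "subconcepts": [(c.en, c.es) for c in concept.subconcepts],
        "synonym": concept.es if query_is_english else concept.en,
    }


# Hypothetical toy ontology; the labels are illustrative only.
ontology = [
    Concept("chemistry", "química", [
        Concept("organic chemistry", "química orgánica"),
        Concept("inorganic chemistry", "química inorgánica"),
    ]),
]
index = build_index(ontology)
result = semantic_mapping("Chemistry", index)
```

Keeping the label index in memory mirrors the slide's note that the ontology is memory-resident, so the lookup adds little cost to the parallel document retrieval.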


Corpus-based Approach (Term Vector Translation) Method.



Scenario: A user submits a query in one of the languages, English or Spanish, and clicks on cross-language translation. If the query contains any of our indexed translated terms, the search engine does the following:

1. Translates the query to the alternative language, as shown in Algorithm 1.

2. Uses the Vector Space Model to calculate the dot product between the translated query and the documents in the HyperManyMedia repository, after substituting each term with its translation, then retrieves the relevant documents and ranks them based on the score Eq(1).
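
The two steps of this scenario could be sketched roughly as below. Neither Algorithm 1 nor Eq(1) is reproduced in these slides, so the sketch makes two loud assumptions: translation is a simple term-by-term dictionary lookup, and the score is a plain dot product of raw term-frequency vectors. The dictionary `EN_TO_ES`, the function names, and the sample documents are hypothetical.

```python
from collections import Counter

# Hypothetical English -> Spanish term dictionary (illustrative entries only).
EN_TO_ES = {"history": "historia", "art": "arte", "modern": "moderno"}


def translate_query(query, dictionary):
    """Replace each query term that has an indexed translation; keep the rest."""
    return [dictionary.get(term, term) for term in query.lower().split()]


def term_vector(tokens):
    """Bag-of-words vector as a term -> frequency mapping."""
    return Counter(tokens)


def dot_product(query_vec, doc_vec):
    """Dot product of two sparse term-frequency vectors."""
    return sum(weight * doc_vec[term] for term, weight in query_vec.items())


def cross_language_search(query, dictionary, documents):
    """Translate the query, score every document, and rank by descending score."""
    query_vec = term_vector(translate_query(query, dictionary))
    scored = [
        (doc_id, dot_product(query_vec, term_vector(text.lower().split())))
        for doc_id, text in documents.items()
    ]
    # Assumption: documents with a zero score are treated as not retrieved.
    return sorted(
        [(doc_id, score) for doc_id, score in scored if score > 0],
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical Spanish documents standing in for the repository.
docs = {
    "d1": "historia del arte moderno",
    "d2": "introducción a la historia",
    "d3": "química orgánica",
}
ranking = cross_language_search("modern art history", EN_TO_ES, docs)
```

A production system would normally weight terms (for example with TF-IDF) and normalize by document length; raw counts are used here only to keep the dot product easy to follow.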

Evaluation of Cross-Language Search Model.




Research Questions

1. Will there be a difference in Top-n-Recall and Top-n-Precision between College-level, Course-level, and Lecture-level?

2. Will there be a difference in Top-n-Recall and Top-n-Precision when we cross from the Spanish language to the English language vs. from the English language to the Spanish language?

Top-n-Recall/Precision for Cross-language Search Engine.




Top-n Recall: the number of relevant retrieved documents among the top n retrieved documents divided by the total number of relevant documents.

Top-n Precision: the number of relevant retrieved documents within the top n divided by n.
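
The two definitions translate directly into code. The sketch below is only an illustration of the formulas; the function names, document identifiers, and relevance judgments are hypothetical and are not taken from the evaluation data.

```python
def top_n_precision(ranked, relevant, n):
    """Relevant documents within the top n, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc_id in ranked[:n] if doc_id in relevant)
    return hits / n


def top_n_recall(ranked, relevant, n):
    """Relevant documents within the top n, divided by the total number of relevant documents."""
    if not relevant:
        raise ValueError("the set of relevant documents must not be empty")
    hits = sum(1 for doc_id in ranked[:n] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranking and relevance judgments.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}

precision_at_3 = top_n_precision(ranked, relevant, 3)  # 2 hits / 3
recall_at_3 = top_n_recall(ranked, relevant, 3)        # 2 hits / 4 relevant
```

Both measures share the same numerator; only the denominator differs, which is why precision tends to fall and recall tends to rise as n grows.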




Evaluation Conclusion.




The cross-language search engine achieves better Precision when we cross from the Spanish language to the English language, and better Recall when we cross in the opposite direction.

The following factors influenced the results:

English courses have been indexed and boosted in multiple stages during the design of the platform (during the last two years).

The Spanish courses were added over a very short period of time; thus we have not been able to add sophisticated tagging to these resources.

The ontology relationships between the two languages need to be logically improved using a higher level of interrelationship between entities and concepts.

Future Work.




In the domain of Natural Language Processing:

An area of research that could be beneficial is building the manual thesauri not only from the controlled vocabulary extracted from the domain ontology as concepts/subconcepts, but also by using computational linguistics; in this case, an integration between the thesauri and techniques based on corpus statistics is needed.

In the domain of Semantic Web:

“Linked Data” is the right place to extend this research (Linked Data is a project directed by Christian Bizer, Tom Heath and Tim Berners-Lee).

Multilinguality and linked data (generation, querying, visualization & presentation)

The growth of Linked Dataset (July 2009)