Question answering in the Holy Quran


Faculty of Engineering

Computer and Systems Department





Question Answering in the Holy Quran






Supervised by:

Dr. Mohamed El-Shafaey

Team members:

Lina Tarek Eweis
Marwa Naser Ghazi
Nermien Shawkat Ibrahim
Hala Gamal Mohammed
Yomna Salah El-Din Sayed





Developing a tool for the Holy Quran that helps people better understand the topics and the verses of this book is one of the best things we have achieved in our lives. It was the main source of pleasure and enthusiasm for nearly ten months.

We have tried hard to accomplish this task in the best possible way. May God accept this work from us.




Table of contents

Overview .................................................................................. 7
Introduction ............................................................................ 10
    An overview of the Quran
    An overview of the Arabic language
    Understanding the concepts of the Quran
    Goals
Arabic natural language processing challenges ................... 16
    Western developers vs. Arabic developers
    Some ANLP challenges
Background ............................................................................. 22
    Question Answering systems ........................................... 22
        Open domain QA system
        Main features of question answering system
            Question
            Docs
            Lexical Analyzer
            Stop words Remover
            Morphological Analyzer
            Synonyms Identifier
            Query Expansion
            Question classification
            Annotation Engines
            Indexing
            Ontology Knowledge Base
            Annotated Docs Database
            Knowledge Modeler/Retriever
            Answer Processing
    Closed domain QA system ............................................... 30
        QA Architecture
        Pattern Learning
        Question Patterns
        Examples from our project
    Semantic search ................................................................ 38
        Vocabularies, Taxonomies, and Ontologies
        Resource Description Framework (RDF)
        Web Ontology Language (OWL)
        Open Linked Data
        SPARQL
    Quran corpus ..................................................................... 45
        Semantic Quran Ontology
        Description
        Data Retrieval
        Animal Ontology
        SPARQL Queries and Results
    Recognizing textual entailment ....................................... 59
        Textual entailment
        Entailment examples
        Recognizing textual entailment
        Applications of recognizing textual entailment
        Approaches for recognizing textual entailment
Implementation ...................................................................... 70
    Ontology based module ................................................... 70
        Template Algorithm Description
        Finding matching question
    Ash-shuraa' module .......................................................... 73
        List of questions (Question Dataset)
        Main Database
        Synonyms Database
    Arabic Module Search ...................................................... 88
        Keyword Search Module
        Our First Keyword Search Module
Other Quranic Search Engines ............................................. 95
Future work ........................................................................... 100
References ............................................................................. 102





















































Abstract

The Holy Quran, due to its unique style and allegorical nature, needs special attention with regard to search and information retrieval. Much work has been done on keyword search over the Holy Quran. The main problem in all these works is that they are either static or do not provide semantic search or Tafsir.

The aim of this project is to develop a tool for searching for concepts in the Holy Quran. Although the Quran is one of the important religious books of the world, there is little computational analysis performed on it. This is due to many reasons: the absence of adequate morphological analyzers for Classical Arabic (the language of the Quran), and the absence of a Quranic WordNet.

The Quran has its unique style of describing topics. In some places, topics are explicitly mentioned, while others are meant implicitly. There are many difficulties in implementing semantic search for the Holy Quran, among them the nature of the text itself: it is not an ordinary text that we can put to standard machine processing, but rather text that is compiled in a special way, in terms of linguistic structures that can reveal different meanings across the ages.





Overview

The system presents three approaches (modules) to retrieving the answer to the input question.

1. Ontology-based QA module




Figure
1



This module is based on an RDF Quranic database (ontology) that contains the Quranic chapters, their verses, and their tafsir.

This ontology allows us to answer statistical questions, e.g., 'What is the number of verses in As-shu'raa chapter?'

The ontology is accessed using SPARQL queries. Possible input questions are stored in a [questions with patterns, const. questions] database.
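For illustration, a statistical question like 'What is the number of verses in As-shu'raa chapter?' might map to a SPARQL query of roughly this shape. The prefix, class, and property names below are hypothetical stand-ins, not necessarily those of the actual ontology:

```sparql
PREFIX qur: <http://example.org/quran#>   # hypothetical vocabulary

SELECT (COUNT(?verse) AS ?verseCount)
WHERE {
  ?chapter a qur:Chapter ;
           qur:chapterName "As-shu'raa" ;
           qur:hasVerse ?verse .
}
```

The query counts every verse linked to the chapter node whose name matches, which is the general pattern behind such statistical questions.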

[Figure 1 flow: the input Arabic question is translated to English; a matching question is found among the const. questions and the questions with patterns; if the match is a question with a pattern, a query is generated; if it is a question with a stored query, that query is executed against the RDF fact database; the answer is then retrieved.]



Questions with patterns are questions where some information, such as a chapter name or verse number, must be extracted from the input question.

Const. questions are of two types:

- Questions with queries: the queries are stored in the database, and their answers are retrieved by executing those queries.

- Other questions: their answers are stored directly in the database.

In this module, the answer is retrieved by matching the input question against all the questions stored in the database, then extracting the answer, executing the query, or generating the query of the question that best matches the input question.
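The matching step described above can be sketched as a simple word-overlap comparison between the input question and the stored questions. This is a hypothetical illustration; the system's actual similarity measure may differ:

```python
def best_matching_question(input_question, stored_questions):
    """Return the stored question sharing the most words with the input.

    stored_questions: list of (question_text, handler) pairs, where handler
    says whether the question has a stored answer, a stored query, or a
    pattern from which a query is generated.
    """
    input_words = set(input_question.lower().split())
    best, best_score = None, 0
    for question_text, handler in stored_questions:
        score = len(input_words & set(question_text.lower().split()))
        if score > best_score:
            best, best_score = (question_text, handler), score
    return best

match = best_matching_question(
    "what is the number of verses in As-shu'raa chapter?",
    [("what is the number of verses in <chapter> chapter?", "pattern"),
     ("how many chapters are in the Quran?", "stored_answer")],
)
```

Here the pattern question wins because it shares the most words with the input, so the module would then extract the chapter name and generate the corresponding query.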

2. As-shu'raa module



Figure
2

In this module, semantic questions (targeting the meaning) about the As-shu'raa chapter can be answered.

The expected input questions are represented by lists of main words; these lists are unique in that no two questions share the same main-words list. For each words list, the main database stores a list of verses that answer the question.

[Figure 2 flow: stopwords removal → find main words (using the main-words and synonyms databases) → generate query → execute query against the RDF fact database → answer retrieval.]

The answer to the input question is considered to be the verses that contain the answer in their meaning, together with the tafsir of those verses to clarify their meaning.

Because the input question may be similar to a question in the database while using different words with the same meaning, a synonyms database containing synonyms of the main words is built.

The generated query retrieves the verses, and their tafsir, that contain the answer; these are stored in the RDF fact database.
.


3. Search the Quran for the answer

If the input question is not one of the supported system questions, a keyword search is done over the input question's words after removing the stopwords.

Besides question answering, the system also supports a word analyzer that gets the root of any input Quranic word.

Also, if the input is one word, a search for this word in the Quran is done.
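The fallback keyword search can be sketched as follows; the stopword list and verse texts are toy placeholders, written in English for readability:

```python
STOPWORDS = {"what", "is", "the", "of", "in", "a"}  # toy stand-in list

VERSES = {
    "26:1": "ta seen meem",
    "26:2": "these are verses of the clear book",
}

def keyword_search(question):
    """Drop stopwords, then return IDs of verses containing any keyword."""
    keywords = [w for w in question.lower().split() if w not in STOPWORDS]
    return [ref for ref, text in VERSES.items()
            if any(k in text.split() for k in keywords)]

hits = keyword_search("what is the clear book")
```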






Introduction

An Overview of the Quran

The Quran consists of 114 chapters of varying lengths, each known as a sura. The title of each sura is derived from a name or quality discussed in the text or from the first letters or words of the sura. In general, the longer chapters appear earlier in the Quran, while the shorter ones appear later.

Each sura is formed from several ayahs, or verses, which originally meant a sign or portent sent by God. The number of ayahs is not the same across suras. The Quran was revealed in the Arabic language and has been translated to other languages. The Quran corpus consists of 77,784 word tokens and 19,287 word types.



An Overview of the Arabic Language

The Arabic language is a Semitic language with many varieties. It is the largest living member of the Semitic language family in terms of speakers. Modern Arabic is classified as a macrolanguage with 27 sub-languages in ISO 639-3. These varieties are spoken throughout the Arab world, and Standard Arabic is widely studied and known throughout the Islamic world. Modern Standard Arabic (MSA) derives from Classical Arabic, the only surviving member of the Old North Arabian dialect group. The modern Standard language is closely based on the Classical language, and most Arabs consider the two varieties to be two registers of one and the same language.

Classical Arabic, also known as Koranic (or Quranic) Arabic, is the form of the Arabic language used in the Quran as well as in numerous literary texts from Umayyad and Abbasid times (7th to 9th centuries). Modern Standard Arabic (MSA) is a modern version used in writing and in formal speaking (for example, prepared speeches and radio broadcasts). It differs minimally in morphology but has significant differences in syntax and lexicon, reflecting the influence of the modern spoken dialects. Classical Arabic is often believed to be the parent language of all the spoken varieties of Arabic.


Morphological Analysis Systems Developed for the Arabic Language

There are a number of Arabic morphological analysis systems developed for MSA, as it is the usual form of everyday written and printed materials. To name a few of the tools: [Beesley 1996; Beesley 1998; Beesley 2001; Al-Shalabi 1996; Darwish 2002].

A complete survey of these systems and the morphological analysis techniques used in developing them can be reviewed in [Al-Sughaiyer and Al-Kharashi, 2004]. In [Sawalha and Atwell 2008], a comparison between the accuracy of different Part-of-Speech taggers for MSA is conducted. Also, a recent Part-of-Speech tagging system for MSA text has been developed by [Alqrainy et al, 2008], which achieved an accuracy of 91%.

But the text of the Holy Quran, and more generally the collections of classical Arabic poetry, have a different lexicon, morphology and syntax from those of MSA. Hence the inadequacy of the available tools for analyzing classical Arabic material, especially the Quran, which is the most important book in the Muslim world.


Understanding the Concepts of the Quran

The Quran, the holy book of Islam, may well be the most powerful book in human history. Both in world history and contemporary affairs, it is doubtful that any other book now commands, or has in the past exerted, so profound an influence. Objectively, one of every five people on earth today is Muslim. Hence the importance of understanding the Quran for every Muslim, and also for those scholars who are interested in the study of man and society, since this book has been effectively instrumental not only in moulding the destinies of Islamic societies, but also in shaping the destiny of the human race as a whole [Mutahhari, 1984]. Therefore, understanding the concepts of the Quran is of paramount importance if one wishes to study this book comprehensively.


Defining the meaning of a 'Concept'

In the WordNet dictionary, a concept is defined as an abstract or general idea inferred or derived from specific instances.

Defining concepts for any domain of knowledge is far from an easy task. [Bennett, 2005] considered the problem of defining concepts within formal ontologies. He concluded that the disagreement stems from differences in understanding of the word 'concept'.

Bennett says, "those who are skeptical about the idea of precision and universality tend to regard a 'concept' as something rather close to a natural language term, …But for a (classical) logician, a concept is an abstract entity that is largely independent of the vagaries of natural language: only in idealized circumstances can a concept within a formal system be regarded as the referent of a natural term." From a computational point of view, the second approach is more appealing.

But if we consider the case of holy books in general, and the Quran in specific, all the classifications of concepts are developed by people who come from a humanities background, not a computing one; they prefer the first approach.

Concepts of the Quran

Concepts can be classified into two main categories: concrete concepts (lexical or keyword concepts) and abstract concepts (general concepts).

An example of a concrete concept would be any word type or term that already exists in the text, such as names of persons, names of prophets, names of places or cities, etc.

Abstract concepts are more general. They are not usually explicitly mentioned in the text. They represent general themes or features covered by the text. For instance, there are several verses in the Quran that describe the main pillars of Islam. This is an abstract concept, and the most important theme in the Quran, but it is never mentioned explicitly in the Quran book.

Concrete Concepts

Concrete concepts consider every word type of the Quran as a concept. Developing computational systems for concrete terms generally involves using keyword search tools. The main problem with keyword search tools is their poor recall values.

Abstract Concepts

Understanding the meaning of the Quran's verses through reading the Tafsir (detailed explanation of the meaning of the verses) is quite helpful but does not draw the complete picture of the message that this book tries to convey to its readers. This is because the Quran covers one theme in many different chapters, and to get the complete picture, the reader must refer to all the passages with their context. In addition, the reader needs to relate the subject to the central themes, which are the main unifying ideas of the book. Therefore, the use of a thematic approach helps us comprehend the Quran's message by avoiding pitfalls and the misuse of phrases picked out of context. This approach organizes subject matter around major or unifying themes, thus building them into a whole and enabling the readers to make important connections between them. This job must be carried out by an expert, as there are some necessary conditions that need to be fulfilled to be able to do this hard job properly.

Murtada Mutahhari, a professor of theology in the University of Tehran, [Mutahhari, 1984] lists these conditions as follows:

The understanding of the Qur'an requires certain preliminaries, which are briefly described here. The first essential condition necessary for the study of the Qur'an is knowledge of the Arabic language: just as, for the understanding of Hafiz and Sa'di, it is impossible to get anywhere without knowledge of the Persian language, in the same way, to acquaint oneself with the Qur'an without knowing the Arabic language is impossible.

The other essential condition is knowledge of the history of Islam. This book was revealed gradually during a long period of twenty-three years of the Prophet's life, a tumultuous time in the history of Islam. It is on this account that every verse of the Qur'an is related to a certain specific historical incident called "the reasons for the descent", which by itself does not restrict the meaning of the verses, but knowledge of the particulars of revelation throws more light on the subject of the verses in an effective way.

The third condition essential for the understanding of the Qur'an is correct knowledge of the sayings of the Prophet (S). He was, according to the Qur'an itself, the interpreter of the Qur'an par excellence.

The Qur'an says: "We have revealed to you the Reminder that you may make clear to men what has been revealed to them ... (16:44)".





Goals

Therefore, the main goals for this research project are:

- To develop an Arabic tool for Quranic keyword search based on (Word + Root) matching.

- To build the first Quranic question answering system for Arab people.

- To develop a root extractor tool that improves on the accuracy of the root extractor tools already available.

- To provide semantic search on Quran chapters.

- To provide the user with the tafsir of Quran verses.





Arabic natural language processing challenges

The Arabic language is both challenging and interesting. It is interesting due to its history [Versteegh 1997], the strategic importance of its people and the region they occupy, and its cultural and literary heritage [Bakalla 2002]. It is also challenging because of its complex linguistic structure [Attia 2008]. Strategically, it is the native language of more than 330 million speakers [CIA 2008]. It is also the language in which 1.4 billion Muslims perform their prayers five times daily. Linguistically, it is characterized by a complex diglossia situation [Diab and Habash 2007; Farghaly 1999; Ferguson 1959, 1996]. Because our work concerns Classical Arabic rather than the coexistence of Classical Arabic and Modern Standard Arabic (MSA), we will not consider this diglossia phenomenon further.

Any Arabic NLP system that does not take the specific features of the Arabic language into account is certain to be inadequate [Shaalan 2005a; 2005b]. For example, Arabic is written from right to left. Like Chinese, Japanese, and Korean, there is no capitalization in Arabic. In addition, Arabic letters change shape according to their position in the word. Like Italian, Spanish, Chinese, and Japanese, Arabic is a pro-drop language, that is, it allows subject pronouns to drop [Farghaly 1982], subject to recoverability of deletion [Chomsky 1965].

Western developers vs. Arabic developers:

Over the last few years, Arabic natural language processing (ANLP) has gained increasing importance, and several state-of-the-art systems have been developed for a wide range of applications, including machine translation, information retrieval and extraction, speech synthesis and recognition, localization and multilingual information retrieval systems, text to speech, and tutoring systems. These applications had to deal with several complex problems pertinent to the nature and structure of the Arabic language. Most ANLP systems developed in the Western world focus on tools to enable non-Arabic speakers to make sense of Arabic texts. Arabic tools such as Arabic named entity recognition, machine translation and sentiment analysis are very useful to intelligence and security agencies. Because the need for such tools was urgent, they were developed using machine learning approaches. Machine learning does not usually require deep linguistic knowledge and is fast and inexpensive. Developers of such tools had to deal with difficult issues. One problem is that Arabic texts include many translated and transliterated named entities whose spelling tends to be inconsistent [Shaalan and Raza 2008]. For example, a named entity such as the city of Washington could be spelled 'واشنجطون', 'وشنطن', 'واشنغطون', or 'واشنطن'. Another problem is the lack of a sizable corpus of Arabic named entities, which would have helped both rule-based and statistical named entity recognition systems. Efforts are being made to remedy this. For example, the LDC released in May 2009 an entity translation training/dev test for Arabic, English, and Mandarin Chinese. A third limitation is that NLP tools developed for Western languages are not easily adaptable to Arabic, due to the specific features of the Arabic language. Recognizing that developing tools for Arabic is vital for progress in ANLP, the MEDAR consortium has started an initiative for cooperation between Arabic and European Union countries on developing Arabic language resources [Choukri 2009].

On the other hand, ANLP applications developed in the Arab World have different objectives and usually employ both rule-based and machine-learning approaches.

The following are some of the objectives of ANLP for the Arab World:

1. Transfer of knowledge and technology to the Arab World. Most recent publications in science and technology are published in other languages.

2. Modernize and fertilize the Arabic language. This follows from (1) above: translating new concepts and terminology into Arabic.

3. Make information retrieval, extraction, summarization, and translation available to the Arab user (this field is related to our project topic).

Some ANLP challenges:

One of the challenges that faces ANLP developers is the Arabic script. It is one of the key linguistic properties of the Arabic language that poses a challenge to the automatic processing of Arabic. Although Arabic is a phonetic language, in the sense that there is a one-to-one mapping between the letters in the language and the sounds they are associated with, Arabic is far from being an easy language to read, due to the lack of dedicated letters to represent short vowels, changes in the form of a letter depending on its place in the word, and the absence of capitalization and minimal punctuation.

Also, since the Arabic script does not have dedicated letters to represent the short vowels in the language, short vowels have been represented by diacritics, which are marks above or below the letters. However, these diacritics have been disappearing in contemporary writings, and readers are expected to fill in the missing short vowels through their knowledge of the language.

Arabic letters have different shapes depending on the position of the letter in the word. All Arabic word processors implement these rules so that the user does not have to manually select the correct shape.

In NLP applications such as machine translation, information retrieval, information extraction, clustering, and classification, it is necessary to split a running text correctly into sentences, and a sentence splitter capitalizes on features such as capitalization and punctuation. But scripts such as Arabic, Chinese, Japanese, and Korean have neither capitalization nor strict rules of punctuation, and their absence [Shaalan and Raza 2008, 2009] makes the task of preprocessing a text much more difficult and challenging.

Another challenge facing researchers and developers of Arabic computational linguistics is the dilemma of normalization. The problem arises because of the inconsistency in the use of diacritic marks and certain letters in contemporary Arabic texts. Some Arabic letters share the same shape and are only differentiated by adding certain marks, such as a dot, a hamza or a madda, placed above or below the letter. For example, the "alif" in Arabic (ا) may be three different letters depending on whether it has a hamza above as in (أ), a hamza below as in (إ), or a madda above as in (آ). Recognizing these marks above or below a letter is essential to be able to distinguish between apparently similar letters.

For this problem, we normalized an alif with a hamza above or below, a bare alif, and an alif madda to simply an alif. We also normalize the final taa marbuuTa (ة or ةـ) with the final haa (ه or هـ), and the alif maqsuura (ى) with the yaa (ي), among other normalization processes.
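These normalization rules can be expressed as a small character-mapping function; this is a simplified sketch of the steps described above:

```python
def normalize(text):
    """Normalize common Arabic orthographic variants, as described above."""
    # alif with hamza above/below and alif madda -> bare alif
    for alif_variant in "أإآ":
        text = text.replace(alif_variant, "ا")
    # final taa marbuuTa -> haa, alif maqsuura -> yaa
    text = text.replace("ة", "ه").replace("ى", "ي")
    return text
```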

But it soon became apparent that although normalization improves recognition by removing variability in the input, it increases the probability of ambiguity [Farghaly 2010]. For example, normalizing an initial alif with a hamza above or below it removes an important distinction between (أن) ann and (إن) inn. The first means "that" and must be followed by a nominal sentence. The second means "to", which indicates the English infinitive, but whose translation is meaningless if followed by a noun. In short, although normalization solves recognition problems, it creates the unintended effect of increased ambiguity.



The many levels of ambiguity pose a significant challenge to researchers developing NLP systems for Arabic [Attia 2008]. The reason is that ambiguity exists on many levels, as evidenced by Maamouri and Bies [2010], who show 21 different analyses of the Arabic word (ثمن) tmn produced by BAMA. The average number of ambiguities for a token in Arabic is higher than in any other language. Ambiguity in Arabic is present at the following levels:

1. Homographs: A word belonging to more than one part of speech, such as قدم qdm, which could be a verb of Form II meaning "to introduce", a verb of Form I meaning "to arrive from", or a noun meaning "foot." Some homograph ambiguity can be resolved by contextual rules. For example, an Arabic word that could be either a noun or a verb can be disambiguated by the following rule, which says that such a word is disambiguated to a noun when preceded by a preposition.

Contextual homograph resolution, e.g., [كتب] N | V -> N / Prep ___
2.

Internal word structure ambiguity:

That is, when a complex Arabic word could
be segmented in differe
nt ways. For example, “
يلو
” wly could be segmented
into “
ي
+
ل
+
و
” corresponding to coordinate
-
prep
-
pronoun meaning “and for
me,” or may not be segmented at all meaning “a pious person favored by
God”.
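This segmentation ambiguity can be demonstrated with a toy lexicon-driven segmenter; the tiny lexicon here is a hypothetical sample, not a real analyzer's dictionary:

```python
# Toy lexicon: surface form -> gloss
LEXICON = {
    "و": "and",
    "ل": "for",
    "ي": "me",
    "ولي": "a pious person favored by God",
}

def segmentations(word, prefix=()):
    """Enumerate all ways to split `word` into lexicon entries."""
    if not word:
        yield prefix
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in LEXICON:
            yield from segmentations(word[i:], prefix + (piece,))

analyses = list(segmentations("ولي"))
```

With this lexicon, "ولي" yields exactly the two analyses discussed above: the three-morpheme split and the unsegmented noun.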

3. Syntactic ambiguity: As in the case of prepositional attachment, as in "قابلت مدير البنك الجديد" /qabaltu mudiir al-bank al-jadiid/, which could mean "I met with the new bank manager" or "I met with the manager of the new bank", depending on the internal analysis of the noun phrase.

4. Semantic ambiguity: Sentences and phrases may be interpreted in different ways. For example, "يحب علي أحمد أكثر من إبراهيم" /yhb 'ly ahmd aktr mn abrahym/ "Ali likes Ahmed more than Ibrahim." Does this mean that Ali likes Ahmed more than Ali likes Ibrahim, or that Ali and Ibrahim both like Ahmed, but Ali likes Ahmed more than Ibrahim does?

5. Constituent boundary ambiguity: For example, "مدير البنك الجديد" mdyr albnk algydyd could mean "the new manager of the bank" or "the manager of the new bank", depending on the boundary of the adjective phrase within this noun construct.



6. Anaphoric ambiguity: As in قال علي انه نجح /qala Ali annahu najah/ "Ali said that he succeeded." This sentence is ambiguous both in English and in Arabic. Chomsky's Binding principles account for sentences like this. The question here is: does "he" refer to Ali or to someone else?

In addition to these levels of ambiguity, the process of normalization, plus features of Arabic such as the pro-drop structure, complex word structure, lack of capitalization, and minimal punctuation, contribute to ambiguity; but it is the absence of short vowels that contributes most significantly to ambiguity.

With the absence of short vowels, two types of linguistic information are lost. The first is most of the case markers that define the grammatical function of Arabic nouns and adjectives using diacritics (e.g., Damma, fatHa, kasra). The absence of case markers, and thus of the grammatical function of a word, creates multiple ambiguities, due to the relatively free word order in Arabic and because Arabic is a pro-drop language. The second type of information that is lost due to the nature of the Arabic script is the lexical and part-of-speech information. Thus, in the absence of internal voweling, it is sometimes impossible to determine the part of speech (POS) without contextual clues.


For example, without contextual clues, a word like (من) mn could be a preposition meaning "from", a wh-phrase meaning "who", or a verb meaning "granted". An Arabic token such as (كتب) ktb without internal voweling could be a plural noun "books", an active past tense verb "wrote", a passive past tense verb "was written", or a causative past tense verb "he made him write."

While ambiguity is a challenge in any language, what makes Arabic so challenging is that all of these features are present in one language.

Another important challenge is nonconcatenative morphology [McCarthy 1981], which presents a challenge to the structuralist theory of the morpheme. The structuralists defined the morpheme as a minimal linguistic unit that has a meaning; by minimal it is meant that a morpheme cannot have a morpheme boundary within it. This definition works well for languages with concatenative morphology like English. McCarthy [1981] points out that the building blocks of Arabic words are the consonantal root, which represents a semantic field such as "KTB" "writing", and a vocalism that represents a grammatical form.




Arabic stems are described in terms of prosodic templates such as CVCVC, where the Cs represent the root radicals and the Vs represent the vocalism [Cavalli-Sforza et al. 2000]. Thus words such as "كتب" /katab/ are formed by an association of the radicals with the vocalism; McCarthy proposes that Arabic words are analyzed at two tiers (the root and the vocalism).

Arabic is an agglutinative language, and affixes that represent different parts of speech can be attached to a stem or root to form a token. Unlike English and most languages, Arabic has a complex word structure. For example, the Arabic sentence ورأيتهم /wara'aytuhum/ "and I saw them" is written as one word and may be decomposed into the following four morphemes:

1. و /wa/ Conjunction "and"

2. رأى /ra'aa/ Past tense Verb "saw"

3. ت /tu/ Subject Pronoun "I"

4. هم /hum/ Object Pronoun "them"

The order in which affixes are attached to stems is rule-governed, which makes the decomposition of Arabic words possible. However, it is far from an easy process, due to the high degree of ambiguity in Arabic [Attia 2008]. For example, the word وهم /whm/ has at least four valid analyses:

1. و+هم /wa+hum/ CONJ SUBJPRON/OBJPRON "and they"

2. و+هم /wa+hammun/ CONJ COMMON NOUN "and worry"

3. وهم /wahm/ COMMON NOUN "illusion"

4. و+هم /wa+hamma/ CONJ PVERB "and he initiated"





Background

A Question Answering (QA) system belongs to a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems pull answers from an unstructured collection of natural language documents. To determine which kind of method will be used to get the answer, we have to know the kind of questions we deal with. There are two kinds of questions:

1. Factoid questions: questions whose answers can be found in short spans of text and correspond to a specific, easily characterized category, often a named entity (e.g., person name, location).

2. Complex questions: questions whose answers are information whose scope is greater than a single factoid but less than an entire document; in such cases we might need a summary of a document or a set of documents.






Types of question answering systems

Open-domain question answering deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.

Main features of question answering system:

1. Question: It represents the input question of the user.

2. Docs: Represents the documents from which the answer will be retrieved.

3. Lexical Analyzer:

Lexical analysis is the process of converting a stream of characters (the text of the document or queries) into a stream of words (the candidate words to be adopted as index terms). This process is also called tokenization of the sentences, which is done to identify the words individually for further processing on them.
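At its simplest, this tokenization step can be sketched as splitting on non-word characters; real lexical analyzers handle many more cases:

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

tokens = tokenize("who founded Virgin Airlines?")
```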

Figure 3 represents the main features of question answering system.

4. Stop words Remover: words which are too frequent among the documents' and the queries' text in the collection are not good discriminators; they are referred to as stop words. These words are removed in this step with the objective of filtering out words with very low discrimination value for retrieval purposes.
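A minimal sketch of the filtering step; the stop list here is a tiny illustrative sample, not the list a real system would load:

```python
# A tiny illustrative stop list; a real system loads a fuller,
# language-specific list.
STOP_WORDS = {"the", "is", "in", "of", "a", "to", "and", "how", "many", "are"}

def remove_stop_words(tokens):
    # Drop terms with very low discrimination value for retrieval.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["How", "many", "verses", "are", "in", "chapter", "2"]))
```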

5. Morphological Analyzer: the words that appear in documents and in queries often have many morphological variants, also known as stemming words. This subsystem attempts to reduce words to their stems or root forms. Thus, the key terms of a query or document are represented by stems rather than by the original words. The rationale for such a procedure is that similar words generally have similar meanings, and thus retrieval effectiveness is enhanced if morphological variants are unified.
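The idea can be sketched with a crude suffix-stripping stemmer; real systems use a proper stemmer (e.g. Porter for English, or an Arabic light stemmer), so this is illustration only:

```python
def light_stem(word):
    # Strip a few common English suffixes so that morphological
    # variants map to the same index term. Deliberately naive: it
    # over- and under-stems, which real stemmers work hard to avoid.
    for suffix in ("ations", "ation", "ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(light_stem("interpretations"), light_stem("asking"))
```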

6. Synonyms Identifier: the synonyms of the words are identified in this subsystem to make the user's query broader and, in the same way, to obtain every possible concept present in the documents so as to respond to the user's query effectively.

7. Query Expansion: query expansion is a technique used to boost the performance of a document retrieval engine. Common methods of query expansion for Boolean keyword-based document retrieval engines include inserting query terms, such as alternate inflectional or derivational forms generated from existing query terms, or dropping query terms that are, for example, deemed to be too restrictive.
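For instance, one simple form of expansion rewrites each term into an OR-group of the term and its known synonyms; the synonym table below is hypothetical:

```python
# Hypothetical synonym table; a real system might consult WordNet or
# a domain-specific thesaurus.
SYNONYMS = {
    "chapter": ["sura", "surah"],
    "verse": ["ayah", "aya"],
}

def expand_query(terms):
    # Boolean-style expansion: each term becomes an OR-group of the
    # term plus its synonyms, and the groups are ANDed together.
    groups = []
    for term in terms:
        group = [term] + SYNONYMS.get(term, [])
        groups.append("(" + " OR ".join(group) + ")")
    return " AND ".join(groups)

print(expand_query(["verse", "interpretation"]))
```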


8. Question classification: also called "answer type recognition", this is the process of classifying the question by its expected answer type. For example, a question like "Who founded Virgin Airlines?" expects an answer of type PERSON; a question like "What Canadian city has the largest population?" expects an answer of type CITY. If we know the answer type for a question, we can avoid looking at every sentence or noun phrase in the entire suite of documents for the answer, focusing instead on just people or cities. Knowing an answer type is also important for presenting the answer: a definition question like "What is a prism?" might use a simple answer template like "A prism is ...", while an answer to a biography question like "Who is Zhou Enlai?" might use a biography-specific template.
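A rule-based sketch of answer type recognition; the patterns and type labels are illustrative, and real systems typically learn such classifiers from labelled questions:

```python
import re

# Illustrative surface-pattern rules mapping a question to its
# expected answer type.
ANSWER_TYPE_RULES = [
    (r"^who\b", "PERSON"),
    (r"^where\b", "LOCATION"),
    (r"^when\b", "DATE"),
    (r"^how many\b", "QUANTITY"),
    (r"^what .*\bcity\b", "CITY"),
    (r"^what is\b", "DEFINITION"),
]

def classify_question(question):
    q = question.lower().strip()
    for pattern, answer_type in ANSWER_TYPE_RULES:
        if re.search(pattern, q):
            return answer_type
    return "OTHER"

print(classify_question("Who founded Virgin Airlines?"))
```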

9. Annotation Engines: it is not enough to simply provide a computer with a large amount of data and expect it to learn to speak; the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However, in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technologies. There are different aspects of language that are studied and used for annotations: syntax, semantics, morphology, phonology (and phonetics), and the lexicon.

Syntax: the study of how words are combined to form sentences. This includes examining parts of speech and how they combine to make larger constructions.

Semantics: the study of meaning in language. Semantics examines the relations between words and what they are being used to represent.

Morphology: the study of units of meaning in a language. A "morpheme" is the smallest unit of language that has meaning or function, a definition that includes words, prefixes, affixes and other word structures that impart meaning.

Phonology: the study of how phones are used in different languages to create meaning. Units of study include segments (individual speech sounds), features (the individual parts of segments), and syllables.

Phonetics: the study of the sounds of human speech, and how they are made and perceived. A "phone" is the term for an individual sound, and a phone is essentially the smallest unit of human speech.

10. Indexing: the idea of index expansion is to create a highly tangled representation of the sentences where each word is directly connected to others, representing both meaning and relations. Instead of keeping the knowledge base separate, the relevant knowledge gets embedded within the text. We can hence use efficient indexing techniques to represent such knowledge and query it very effectively with suitably modified information retrieval techniques.
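The underlying machinery is the standard inverted index; a minimal sketch (function names are ours):

```python
def build_index(documents):
    # documents: {doc_id: text}. Build a map term -> set of doc ids.
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, term):
    # Return the ids of all documents containing the term.
    return sorted(index.get(term.lower(), set()))

index = build_index({1: "so remember me", 2: "remember your Lord"})
print(lookup(index, "remember"))
```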


11. Ontology Knowledge Base: in the context of knowledge sharing, an ontology is a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as a set of concept definitions, but more general; it is certainly a different sense of the word than its use in philosophy.

The task of this subsystem is to manage the ontological knowledge and the factual knowledge (instances and statements). The ontology knowledge base contains the designed ontology related to a specific domain. The domain ontology is defined in OWL to represent the concepts and relations in the specific domain. To facilitate the annotation engine, the ontology knowledge base has been pre-populated with plenty of entities of general importance and various relations between them. For the implementation of the OWL ontology and knowledge representation, the Protégé ontology editor and the Protégé OWL API were used.

Figure 4. Semblance of OWL classes in Protégé
12. Annotated Doc.s Database: the annotated documents' database stores all the documents that have gone through the process of annotation by the Annotation Engine. This database is ready to be queried by the knowledge retriever to present the list of documents related to the user's query. The figure below represents the structure of the annotated documents' database.

13. Knowledge Modeler/Retriever: the knowledge modeler/retriever has two main functions. First, it saves the annotated documents coming from the Annotation Engine into the Annotated Documents' Database for later use; here it acts as a Knowledge Modeler. Second, it analyzes the annotated query coming from the Annotation Engine, retrieves the results corresponding to that query from the Annotated Documents' Database, and returns the query's results to the application logic layer; here it acts as a Knowledge Retriever.



14. Answer Processing: the final stage of question answering is to extract a specific answer from the passage, so as to be able to present the user with an answer like "300 million" to the question "What is the population of the United States?".

Two classes of algorithms have been applied to the answer extraction task: one based on answer-type pattern extraction and one based on N-gram tiling.

In the pattern extraction method for answer processing, we use information about the expected answer type together with regular expression patterns. For example, for questions with a HUMAN answer type, we run the answer type or named entity tagger on the candidate passage or sentence, and return whatever entity is labeled with type HUMAN. Thus, in the following examples the underlined named entities are extracted from the candidate answer passages as the answers to the HUMAN and DISTANCE/QUANTITY questions:

"Who is the prime minister of India?"
Manmohan Singh, prime minister of India, had told left leaders that the deal would not be renegotiated.

"How tall is Mt. Everest?"
The official height of Mount Everest is 29035 feet.

An alternative approach to answer extraction, used solely in web search, is based on N-gram tiling, sometimes called the redundancy-based approach. This simplified method begins with the snippets returned from the web search engine, produced by a reformulated query. In the first step of the method, N-gram mining, every unigram, bigram and trigram occurring in a snippet is extracted and weighted. The weight is a function of the number of snippets the N-gram occurred in, and of the weight of the query reformulation pattern that returned it.

In the N-gram filtering step, N-grams are scored by how well they match the predicted answer type. Finally, an N-gram tiling step concatenates overlapping N-gram fragments into longer answers. A standard greedy method is to start with the highest-scoring candidate and try to tile each other candidate with this candidate. The best-scoring concatenation is added to the set of candidates, the lower-scoring candidate is removed, and the process continues until a single answer is built.
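The mining and tiling steps can be sketched as follows; this is a simplified illustration (reformulation-pattern weights omitted), not a production implementation:

```python
from collections import Counter

def mine_ngrams(snippets):
    # N-gram mining: weight every unigram, bigram and trigram by the
    # number of snippets it occurs in.
    weights = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        seen = set()
        for n in (1, 2, 3):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        weights.update(seen)
    return weights

def tile(a, b):
    # Tiling: merge two candidates whose edges overlap, producing a
    # longer answer; return None if they do not overlap.
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])
    return None

print(tile("mount everest is", "is 29035 feet"))
```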


Relevance Feedback:

To improve the response of the question answering system some systems use
relevance feedback.


The idea of

relevance feedback

(

) is to involve the user in the retrieval process so as
to improve the final

result set. In particular, the user gives feedback on the relevance
of documents in an initial set of results. The basic procedure is:



The user issues a (short, simple) query.



The system returns an initial set of retrieval results.



The user marks some
returned documents as relevant or nonrelevant.



The system computes a better representation of the information need based
on the user feedback.



The system displays a revised set of retrieval results.
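The update in the fourth step is classically done with the Rocchio algorithm; the sketch below (not from the report) treats queries and documents as term-weight dictionaries:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query vector toward the centroid of documents the user
    # marked relevant, and away from the centroid of nonrelevant ones.
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    updated = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        if weight > 0:
            updated[t] = round(weight, 4)
    return updated

print(rocchio({"quran": 1.0}, [{"quran": 1.0, "verse": 1.0}], []))
```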


Closed-domain: closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a limited type of questions is accepted, such as questions asking for descriptive rather than procedural information.

We chose The Holy Qur'an to be our specific domain, as the user can ask a question about The Holy Qur'an such as "كم عدد آيات القرآن الكريم؟" (How many verses are in the Holy Qur'an?), "كم سورة في القرآن الكريم؟" (How many suras are in the Holy Qur'an?), "ما تفسير الآية رقم 7 في سورة الكهف؟" (What is the interpretation of verse No. 7 in Surat Al-Kahf?), etc. We also chose Surat Al-Shu'araa (The Poets) as a specific closed domain within The Holy Qur'an, where the user can ask any question he wants about this chapter in particular, such as "ما هي أسماء الله - عز وجل - التي ذكرت في سورة الشعراء؟" (What are the names of Allah, the Almighty, mentioned in Surat Al-Shu'araa?), "اذكر قصة سيدنا موسى" (Recount the story of Musa), "ما هي الأدعية التي جاءت في سورة الشعراء؟" (What are the supplications that appear in Surat Al-Shu'araa?), etc.


QA Architecture:

QA systems typically consist of three core components: query formation, information retrieval and answer selection.

The query formation component takes a question string as input and transforms it into one or more queries, which are passed to the information retrieval component.

Depending on the type of the question, the IR component queries one or more knowledge sources that are suitable for that type and aggregates the results. These results are then passed as answer candidates to the answer selection component, which drops candidates that are unlikely to answer the question and returns a ranked list of answers.

This overall architecture is illustrated in Figure 6.

Figure 6. Typical architecture of QA system


Pattern Learning:

The technique based on question interpretation and answer extraction uses different types of text patterns. Question patterns are used to interpret a question, i.e. to determine the property it asks for and to extract the target and context objects. Answer patterns are used to extract answers from text passages. This section describes how these patterns are generated.

Initially, the properties and question patterns need to be specified manually. In a second step, the system automatically learns answer patterns for each of the properties, using question-answer pairs as training data. See the following two sections.

1. Question Patterns

For example:

who <be> <T_NONE>
<what> <be> <T_NOABBR>
(who|<what>) <be> <T> (in|of) <C>
(who|<what>) <be> <C>'s <T>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <T>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <T> (in|of) <C>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <C>'s <T>
what <T> <be> (called|known as|named)
what <T> (in|of) <C> <be> (called|known as|named)
what <C>'s <T> <be> (called|known as|named)
<T> <be> (called|known as|named) what
<T> (in|of) <C> <be> (called|known as|named) what
<C>'s <T> <be> (called|named) what
what do you call <T>
what do you call <T> (in|of) <C>

Figure 2. Question patterns for the property NAME.

At first, we determined the properties that the questions ask for and created an initial set of question patterns by simply replacing the target and context objects in the questions by <T> and <C> tags respectively.

Then we converted the first letter to lower case, dropped the final punctuation mark and reordered auxiliary verbs in the same way as it is done by the question normalization component. This is necessary because the question patterns are applied to a question string after normalization.

In the next step, we merged patterns by extending them to regular expressions (e.g. "who (assassinated|killed|murdered|shot) ..."). We then used our own knowledge and dictionaries to further extend the patterns by adding synonyms, and we added patterns to cover other possible formulations we could think of.

To further generalize the patterns, we introduced shortcuts, which are tags that represent regular expressions. These shortcuts are specified in a separate resource file. Table 1 shows the shortcuts and the regular expressions they stand for.


Table 1. Shortcuts and corresponding regular expressions.
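Since Table 1's contents are an image, the shortcut table below is hypothetical, but it shows how a shortcut like <be> can be expanded and a pattern compiled into an ordinary regular expression:

```python
import re

# Hypothetical shortcut table in the spirit of Table 1; the actual
# shortcuts in the resource file may differ.
SHORTCUTS = {
    "<be>": "(?:is|are|was|were)",
    "<what>": "(?:what|which)",
}

def compile_pattern(pattern):
    # The question patterns are already regular expressions, so we
    # only substitute the shortcut tags and turn <T>/<C> into named
    # capturing groups for the target and context objects.
    regex = pattern
    for tag, expansion in SHORTCUTS.items():
        regex = regex.replace(tag, expansion)
    regex = regex.replace("<T>", r"(?P<T>.*)")
    regex = regex.replace("<C>", r"(?P<C>.*)")
    return re.compile("^" + regex + "$", re.IGNORECASE)

m = compile_pattern("(who|<what>) <be> <T> (in|of) <C>").match(
    "what is the capital of Egypt")
print(m.group("T"), "/", m.group("C"))
```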

Sometimes it is hard to determine the property a question asks for. E.g. a question of the format "What is ...?" could ask for the DEFINITION of a term, the LONGFORM of an abbreviation or the NAME of a place. To allow a definite interpretation of such questions, we introduced a mechanism that uses object types. A target tag can be associated with an object type to restrict the format of a target object. Such target tags are not replaced by the general capturing group (.*) but by a more constraining regular expression. For example, the tag <T_ABBR> indicates that the target must be an abbreviation, i.e. a sequence of upper case letters. Table 2 gives an overview of the object types.


Table 2. Object types, their meanings and the corresponding capturing groups.

Finally, we reviewed the properties and question patterns. We dropped rare properties and merged some properties with similar patterns. E.g. at first we distinguished between the property NAME for proper names and the property TERM for technical terms, which turned out to be impractical.

Examples from our project:

1. For the following question: ما هو تفسير الآية رقم س في سورة رقم ص؟

First, we simply translate the question: What is the interpretation of verse No. Q in chapter No. R?

Then we determine the property, target and context:

Property: interpretation
Target: verse index, chapter index
Context: chapter, verse, chapter No., verse No.

Then the template will be:

What <be> the <P> of the <C> <T1> in <C> <T2>?

Then we take those two targets and enter them into a query to get the appropriate answer.

2. For the following question: كم عدد آيات سورة رقم س؟

Translation: How many verses are in chapter No. Q?

Determining P, T and C:

Property: number of verses
Target: chapter index
Context: chapter, chapter No., verses

The template:

How many <C> (of|in) <C> <T>?
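A sketch of how such a template can be applied to the normalized English form of the first question; the regular expression and group names are ours, not the project's exact implementation:

```python
import re

# Illustrative template for "What is the interpretation of verse
# No. Q in chapter No. R?"; the property and the two targets become
# named capturing groups.
TEMPLATE = re.compile(
    r"what is the (?P<property>\w+) of verse no\. ?(?P<verse>\d+) "
    r"in chapter no\. ?(?P<chapter>\d+)",
    re.IGNORECASE,
)

def parse_question(question):
    m = TEMPLATE.search(question)
    if not m:
        return None
    return {
        "property": m.group("property").lower(),
        "verse": int(m.group("verse")),
        "chapter": int(m.group("chapter")),
    }

print(parse_question("What is the interpretation of verse No. 7 in chapter No. 18?"))
```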



Building the System Question Dataset and Their Relative Queries:

We started by entering more than 100 questions and their queries manually; this defines the system's constraints and which questions can be answered.

Figure 7. Part of the question dataset

Storing Templates in Database:

We designed a database that stores each template, its property, its target, its relative questions and its answer.

We divided the questions into two main classes:

1. Questions with a query: these questions depend on the user's question to fill gaps in the query of the question.

Examples:

Q: اذكر الآية رقم 10 في سورة البقرة (Recite verse No. 10 of Surat Al-Baqara.)

Q: ما تفسير الجلالين للآية رقم 10 في سورة البقرة؟ (What is the Galalyn interpretation of verse No. 10 of Surat Al-Baqara?)

2. Const questions: these questions are divided into two subclasses:

1. Questions whose answers were entered manually into the DB.

Examples:

Q: ما السور التي سميت باسم نجم؟ (Which suras are named after a star?)
ANS: الشمس - النجم (Ash-Shams, An-Najm)

Q: ما السور التي تنتهي كل آياتها بحرف الراء؟ (Which sura has every verse ending with the letter Raa?)
ANS: الكوثر (Al-Kawthar)

2. Questions with a query to execute to get the answer.

Q: ما هي الآية التي تتكون من كلمة واحدة؟ (Which verse consists of a single word?)
ANS:
Q: ما أكثر آية ذكرت في القرآن الكريم؟ (What is the most repeated verse in the Holy Qur'an?)
ANS:
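The dispatch between the two classes can be sketched as below; the stored answers and the query template (including the qrn: property names) are illustrative, not the actual database contents:

```python
# Hypothetical database contents mirroring the two question classes.
CONST_ANSWERS = {
    "which suras are named after a star?": "Ash-Shams, An-Najm",
}
QUERY_TEMPLATES = {
    "interpretation":
        "SELECT ?tafsir WHERE {{ ?v qrn:chapterIndex {chapter} ; "
        "qrn:verseIndex {verse} ; qrn:tafsir ?tafsir }}",
}

def answer(question, prop=None, **slots):
    q = question.lower().strip()
    if q in CONST_ANSWERS:
        # Const question: the answer was entered manually.
        return CONST_ANSWERS[q]
    if prop in QUERY_TEMPLATES:
        # Question with query: fill the gaps and (in the real system)
        # execute the query; here we just return the filled query.
        return QUERY_TEMPLATES[prop].format(**slots)
    return None

print(answer("Which suras are named after a star?"))
```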




Database Schema:

Figure 8. Database Schema





Database Contents:

Figure 9. Snapshots of the database contents: answers for const questions, templates, and targets


Semantic search

The word semantics is derived from the Greek word semantikos, "significant", from semaino, "to signify, to indicate", and that from sema, "sign, mark". Linguistically, it is the study of the interpretation of signs or symbols as used in specific contexts.

Semantic analysis is the process of relating syntactic structures, from the levels of phrases, clauses, sentences and the whole text, to their language-independent meanings. Along the way, it strips away characteristics that are tied to particular linguistic contexts.

Vocabularies, Taxonomies, and Ontologies

Vocabularies, taxonomies, and ontologies are all related. Each contains a set of defined terms, and each is critical to the ability to express the meaning of information. Their differences lie in their expressiveness, or how much meaning each attaches to the terms that it describes:

- A vocabulary is a collection of unambiguously defined terms used in communication. Vocabulary terms should not be redundant without explicit identification of the redundancy. In addition, vocabulary terms are expected to have consistent meaning in all contexts.

- A taxonomy is a vocabulary in which terms are organized in a hierarchical manner. Each term may share a parent-child relationship with one or more other elements in the taxonomy. One of the most common parent-child relationships used in taxonomies is that of specialization and generalization, where one term is a more-specific or less-specific form of another term. The parent-child relationships can be many-to-many; however, many taxonomies adopt the restriction that each element can have only one parent. In this case, the taxonomy is a tree or a collection of trees (a forest).



- An ontology uses a predefined, reserved vocabulary of terms to define concepts and the relationships between them for a specific area of interest, or domain. Ontology can actually refer to a vocabulary, a taxonomy, or something more. Typically the term refers to a rich, formal logic-based model for describing a knowledge domain. Using ontologies, you can express the semantics behind vocabulary terms, their interactions, and their context of use.

In short, vocabularies are simple collections of well-defined terms, and taxonomies extend vocabularies by adding hierarchical relationships between terms.
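The single-parent restriction makes a taxonomy storable as a simple parent map; a toy sketch with invented terms:

```python
# A toy taxonomy as a child -> parent map; with at most one parent
# per term, the structure is a tree.
PARENT = {
    "meccan sura": "sura",
    "medinan sura": "sura",
    "sura": "text unit",
    "verse": "text unit",
}

def ancestors(term):
    # Walk up the specialization/generalization chain to the root.
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

print(ancestors("meccan sura"))
```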




The Semantic Web uses a combination of a schema language and an ontology language to provide the capabilities of vocabularies, taxonomies, and ontologies. RDF Schema (RDFS) provides a specific vocabulary for RDF that can be used to define taxonomies of classes and properties and simple domain and range specifications for properties. The OWL Web Ontology Language provides an expressive language for defining ontologies that capture the semantics of domain knowledge.


Resource Description Framework (RDF)

The Resource Description Framework (RDF) is a general-purpose language for representing information on the Web in a minimally constraining and maximally flexible way. Its purpose is the processing of machine-processable information without loss of information. RDF is intended for situations where information needs to be processed and exchanged between applications rather than only being displayed to people.

In the Semantic Web, RDF is the data model for representing metadata. It provides users with a domain-independent framework for representing information about resources in the WWW. With it, one can unambiguously express and formalize the meaning of concepts and facts.

Everything that can be described is called a resource. A resource can be anything: a city, a book, a person, a process, a company, etc. Resources being described have properties, which have values. These properties and values are specified by describing the resources in RDF statements. The part that identifies the resource the statement is about is called the subject. The part that identifies the property of the subject that the statement specifies is the predicate, and the part that identifies the value of that property is the object. Hence, an RDF statement is a triple consisting of a subject, a predicate, and an object.
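Triples can be illustrated with a tiny in-memory store; the names below are invented for the example:

```python
# RDF statements as (subject, predicate, object) tuples; the URIs
# are illustrative, not a real vocabulary.
triples = [
    ("ex:quran2-255", "rdf:type", "ex:Verse"),
    ("ex:quran2-255", "ex:chapterIndex", "2"),
    ("ex:chapter2", "ex:chapterName", "Al-Baqara"),
]

def match(s=None, p=None, o=None):
    # Pattern-match over the store; None acts as a wildcard, which
    # is essentially what a variable does in a SPARQL triple pattern.
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="ex:quran2-255"))
```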

RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a "triple"). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications.

This linking structure forms a directed, labeled graph, where the edges represent the named links between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.


Figure 10


Web Ontology Language (OWL)

The limited expressiveness of RDFS resulted in the need for a more powerful ontology modeling language, in particular one that permitted greater machine interpretability of Web content. This led to the W3C recommendation of the Web Ontology Language, OWL.

OWL allows modelers to use an expressive formalism to define various logical concepts and relations in ontologies to annotate Web content. The enriched content can then be consumed by machines in order to assist humans in various tasks. As such, OWL fulfills the requirement for an ontology language that can formally describe the meaning of terminology on Web pages. If machines and applications are expected to perform useful reasoning tasks on these Web documents, the language must surpass the semantics of RDFS. OWL has been designed to meet this need.

Similar to RDFS, OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms. The resulting ontologies are used by applications that need to process the content of information instead of just presenting it. OWL provides a richer vocabulary with more formal semantics than RDFS, allowing additional modeling primitives that result in increased expressivity for describing properties and classes.


Open Linked Data

Linked data describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.

Like the web of hypertext, the web of data is constructed with documents on the web. However, unlike the web of hypertext, where links are relationship anchors in hypertext documents written in HTML, here the links are between arbitrary things described by RDF, and the URIs identify any kind of object or concept.

There are four principles of linked data:

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF/SPARQL).
4. Include links to other URIs, so that they can discover more things.

Figure 11

Querying data on your own hard drive is useful, but the real fun of SPARQL starts when you query public data sources. You need no special software, because these data collections are often made publicly available through a SPARQL endpoint, which is a web service that accepts SPARQL queries.

The most popular SPARQL endpoint is DBpedia, a collection of data from the gray infoboxes of fielded data that you often see on the right side of Wikipedia pages. Like many SPARQL endpoints, DBpedia includes a web form where you can enter a query and then explore the results, making it very easy to explore its data. DBpedia uses a program called SNORQL to accept these queries and return the answers on a web page.

If you send a browser to http://dbpedia.org/snorql/, you'll see a form where you can enter a query and select the format of the results you want to see.

Figure 12. DBpedia's SNORQL web form


SPARQL

More and more people are using the query language SPARQL (pronounced "sparkle") to pull data from a growing collection of public and private data. Whether this data is part of a semantic web project or an integration of two inventory databases on different platforms behind the same firewall, SPARQL is making it easier to access it. In the words of W3C Director and Web inventor Tim Berners-Lee, "Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL."

SPARQL was not designed to query relational data, but to query data conforming to the RDF data model. RDF-based data formats have not yet achieved the mainstream status that XML and relational databases have, but an increasing number of IT professionals are discovering that tools using the RDF data model let them expose diverse sets of data (including relational databases) with a common, standardized interface. Both open source and commercial software have become available with SPARQL support, so you don't need to learn new programming language APIs to take advantage of these data sources.

This data and tool availability has led to SPARQL letting people access a wide variety of public data and providing easier integration of data silos within an enterprise.
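As an illustration, the following query could be pasted into a public endpoint such as DBpedia's SNORQL form; dbo:birthPlace and rdfs:label are real DBpedia/RDFS properties, but the query itself is our example, not one taken from the report:

```python
# A SPARQL query built as a plain string; sending it to an endpoint
# is just an HTTP request, omitted here.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
  ?person dbo:birthPlace <http://dbpedia.org/resource/Cairo> ;
          rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
LIMIT 10
"""
print(query)
```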

Quran corpus

Quranic Arabic is a unique form of Arabic used in the Quran, and it is the direct ancestor of Modern Standard Arabic (MSA). Annotating the Quran faces an extra set of challenges compared to MSA due to the fact that the text is over 1,400 years old, and the Quranic script is more varied than modern Arabic in terms of orthography, spelling and inflection: the same word may be spelled in different ways in different chapters. However, the Quran is fully diacritized, which reduces its ambiguity [3].

With respect to morphological segmentation, 54% of the Quran's 77,430 words require segmentation, resulting in 127,806 segments. A typical word in the Quran consists of multiple segments combined into a single whitespace-delimited word form. For example, "فاذكروني", which means "so remember me", should be segmented into four segments: "ف", "اذكر", "و", and "ني", where each segment represents an individual syntactic unit, namely a conjunction prefix, a verb, a subject pronoun, and an object pronoun, respectively.
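Given affix lists, the segmentation of such a word can be sketched as follows; real segmentation requires a morphological analyzer, so the affixes here are supplied by hand for this one example:

```python
def segment(word, prefixes, suffixes):
    # Split one prefix from the front, then peel listed suffixes off
    # the back, leaving the stem in the middle.
    parts = []
    for p in prefixes:
        if word.startswith(p):
            parts.append(p)
            word = word[len(p):]
            break
    trailing = []
    for s in suffixes:
        if word.endswith(s):
            trailing.insert(0, s)
            word = word[: -len(s)]
    return parts + [word] + trailing

# فاذكروني -> conjunction + verb + subject pronoun + object pronoun
print(segment("فاذكروني", ["ف"], ["ني", "و"]))
```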

The computational analysis of the Quran remains an intriguing but largely unexplored field, since only a few isolated efforts have been made in this area [1]. For the young learner, this may be not just a challenging requirement, but a deterrent.

The Quranic Arabic Corpus is an annotated linguistic resource consisting of the 77,430 words of Quranic Arabic. The corpus aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.

The grammatical analysis helps readers further in uncovering the detailed intended meanings of each verse and sentence. Each word of the Quran is tagged with its part-of-speech as well as multiple morphological features. Unlike other annotated Arabic corpora, the grammar framework adopted by the Quranic Corpus is the traditional Arabic grammar of i'rab (إعراب).

Corpus annotation assigns a part-of-speech tag and morphological features to each word. For example, annotation involves deciding whether a word is a noun or a verb, and whether it is inflected for masculine or feminine.

The annotated corpus also includes an annotated treebank of Quranic Arabic. The Quranic Arabic Dependency Treebank (QADT) [4] has been developed and implemented in Java at the University of Leeds, UK, as part of the Quran Corpus project [2]. It works as a treebank for Quranic sentences by providing a deep computational linguistic model based on historical traditional Arabic grammar (إعراب القرآن الكريم). The treebank is freely available under an open source license. QADT provides two levels of analysis: morphological annotation and syntactic representation.

The morphological segmentation has been applied to all of the 77,430 words in the Quran. The treebank also represents Quranic syntax through dependency graphs, as shown in the figure below. QADT is applied to each individual Quranic verse (or part of a verse) even if that verse does not form a complete sentence unless it is joined with its neighboring verse. In addition to QADT, there are very few other works in the field of morphological analysis of the Quran, such as [1].



Figure
37


A Quranic Ontology uses knowledge representation to define the key concepts in the
Quran, and shows the relationships between these concepts using predicate logic.
47


Named entities in verses, such as the names of
historic people and places mentioned
in the Quran, are linked to concepts in the ontology.

Semantic Quran Ontology

- It is a multilingual RDF representation of translations of the Quran.

- The dataset was created by integrating data from two different semi-structured sources. The data were aligned to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages, which belong to the most under-represented languages in Linked Data, including Arabic, Amharic and Amazigh.

- This is the ontology which we used within our question answering system.

Description

The data were extracted from two semi-structured sources: the data from the Tanzil project and the Quranic Arabic Corpus. It is an ontology for representing multilingual data extracted from sources consisting of different translations of the Quran as well as numbered chapters and verses.

In addition to providing aligned translations for each verse, it provides morphosyntactic information on each of the original Arabic terms used across the dataset. Moreover, the dataset is interlinked with three versions of Wiktionary as well as DBpedia, which ensures that the dataset abides by all four Linked Data principles.

To represent the data as RDF, its authors developed a general-purpose linguistic vocabulary. The vocabulary was specified with the aim of supporting datasets which display a hierarchical structure. It includes four basic classes: Chapter, Verse, Word and LexicalItem.

48




Chapter:

The chapter class provides the name of chapters in different
languages and localization data such as chapter index and order.

Additionally, the chapter class provides metadata such as the number of
verses in a chapter, revel
ation place and some provenance
information.

Finally, the chapter class provides inter
-
class linking properties to link it to
different verses contained in it.

For example each chapter provides a dcterms:tableOfContents for
each of
its verses in the form

qrn:quran<chapter>
-
<verse>.




Verse:

The verse class contains the verse text in different languages as well as numerous localization data such as the verse index and the related chapter index.

Additionally, this class provides related verse data such as different verse descriptions and provenance information.

Finally, it contains inter-class linking properties that link the verse in both directions: to the chapter of the verse and to all the words contained in that verse.



Word:

This class represents the next level of granularity below the verse and contains the word text in different languages as well as numerous localization data such as the related verse and chapter indexes.

Additionally, the word class provides related word provenance information and some inter-class linking properties to link it to the chapter and verse of the word.






Figure 13: UML class diagram of the Semantic Quran ontology


Data Retrieval

Since the Quran contains a lot of data concerning places, people and different events, multilingual sentences concerning such information can be easily retrieved from the dataset. The aligned multilingual representation allows searching for the same entities across different languages.

For example, Figure 14 shows a SPARQL query which retrieves the Arabic, English and German translations of verses which contain "Moses".
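The figure showing the actual SPARQL query is an image, so here is a plain-Python sketch of the same retrieval logic over toy aligned-translation records (the verse identifier and sample texts are illustrative stand-ins, not dataset content):

```python
# Toy aligned-translation records in the shape the dataset provides:
# (verse_id, language, text). The texts below are illustrative stand-ins.
records = [
    ("qrn:quran28-7", "en", "We inspired the mother of Moses ..."),
    ("qrn:quran28-7", "de", "... der Mutter des Moses ..."),
    ("qrn:quran28-7", "ar", "... موسى ..."),
    ("qrn:quran1-1",  "en", "In the name of God ..."),
]

def verses_containing(term: str, langs: set[str]) -> dict[str, dict[str, str]]:
    """Mimic the SPARQL query: find verses whose English text contains
    `term`, then return the aligned translations in the requested languages."""
    hits = {v for v, lang, text in records if lang == "en" and term in text}
    return {v: {lang: text for w, lang, text in records
                if w == v and lang in langs}
            for v in hits}

print(verses_containing("Moses", {"ar", "en", "de"}))
```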





Figure 14: verses that contain "Moses" in (i) Arabic, (ii) English and (iii) German


Example 2: retrieving the verses that contain the word whose root is "smw" ("سمو")

Figure 15


Example 3: List all the Arabic prepositions with an example statement for each

Figure 16




Example 4: List the names of the chapters which are Meccan

Figure 17


Example 5: Counting the number of the chapters in the Quran

Figure 18


Example 6: Get the Jalalayn Tafsir of a given verse

Figure 19




Example 7: Count the number of the verses in a certain chapter

Figure 20


Example 8: Count the number of parts in the Holy Quran

Figure 21


Example 9: Detect the revelation place of a given chapter

Figure 22


Example 10: Count the number of quarters in the Quran

Figure 23
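The counting queries in Examples 5, 7 and 8 are shown only as figures; their logic can be sketched in plain Python over a toy triple list (the predicate names here are assumptions, not necessarily the dataset's actual vocabulary):

```python
# Toy triples (subject, predicate, object); predicate names are assumptions.
triples = [
    ("qrn:chapter1", "rdf:type", "qrn:Chapter"),
    ("qrn:chapter2", "rdf:type", "qrn:Chapter"),
    ("qrn:quran1-1", "rdf:type", "qrn:Verse"),
    ("qrn:quran1-1", "qrn:chapterIndex", 1),
    ("qrn:quran1-2", "rdf:type", "qrn:Verse"),
    ("qrn:quran1-2", "qrn:chapterIndex", 1),
]

def count_instances(cls: str) -> int:
    """COUNT(?s) WHERE { ?s rdf:type <cls> }, in plain Python."""
    return sum(1 for s, p, o in triples if p == "rdf:type" and o == cls)

def count_verses_in_chapter(idx: int) -> int:
    """Count verses whose chapterIndex matches idx (cf. Example 7)."""
    verses = {s for s, p, o in triples if p == "rdf:type" and o == "qrn:Verse"}
    return sum(1 for s, p, o in triples
               if s in verses and p == "qrn:chapterIndex" and o == idx)

print(count_instances("qrn:Chapter"))   # 2 in this toy graph
print(count_verses_in_chapter(1))       # 2
```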



Animal Ontology

It is an OWL file located locally that supports semantic search on the Quran. It declares all concepts about animals exactly as defined in the Holy Quran.

There are about 167 direct or indirect references to animals in the Holy Quran.


The main difficulties in the implementation include:

- In Arabic, many words exist for one animal. For instance, a camel has words like Abal, Buaaeer and Gamal.

- The majority of animals are mentioned in similes comparing them to some people, which makes it difficult to model such metaphorical or abstract relationships. For example, the non-believers who have no faith in the Holy Quranic verses are likened to a donkey carrying books.

- Many animals are mentioned in terms of their behavior. For instance, the ass, mule and horse are described as a source of ornament and as carriers of luggage.






- Behavior is such a difficult domain that two university special interest groups are working extensively on the domain of animal behavior alone.



Figure 24

Figure 25: part of the object properties defined in the ontology






SPARQL Queries and Results:

Scenario 1: (domain: Fish, object property: swallow, range: Prophet)

Fish swallows Yunus.

Q1: What is the name of the Prophet who was swallowed by the Fish?

A: Yunus, subclass of Prophet

Figure 26

Q2: Which animal swallowed Yunus?

A: Fish

Figure 27

Q3: Which animal swallows a Prophet and also lives in the sea?
A: Fish





Figure 28
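The OWL queries in this section are shown only as figures; the reasoning they perform can be sketched in plain Python over assumed fact names (these are illustrative, not the actual ontology's identifiers):

```python
# Object-property facts plus a subclass hierarchy, queried the way
# Q1-Q3 above are answered. All names here are assumptions.
facts = [
    ("Fish", "swallow", "Yunus"),
    ("Fish", "lives_in", "Sea"),
    ("Man", "love", "BrandedHorse"),
]
subclass_of = {"Yunus": "Prophet", "BrandedHorse": "Horse"}

def is_a(individual: str, cls: str) -> bool:
    """Walk the subclass chain upward until cls is found or the chain ends."""
    while individual is not None:
        if individual == cls:
            return True
        individual = subclass_of.get(individual)
    return False

# Q1: name of the Prophet who was swallowed by the Fish
q1 = [o for s, p, o in facts
      if s == "Fish" and p == "swallow" and is_a(o, "Prophet")]

# Q3: which animal swallows a Prophet and also lives in the sea
swallowers = {s for s, p, o in facts if p == "swallow" and is_a(o, "Prophet")}
q3 = [s for s in swallowers if (s, "lives_in", "Sea") in facts]

print(q1, q3)  # ['Yunus'] ['Fish']
```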


Scenario 2: (class: Man, object property: love, class: Branded Horse, subclass of Horse)

Man loves Branded Horse.

Q4: Which animal is loved by man?

A: Branded Horse, subclass of Horse

Figure 29

Scenario 3: ALLAH forbade Swine (to eat)

Q5: Which animal is forbidden (to eat) by ALLAH?

A: Swine

Figure 30




Scenario 4: (class: ALLAH, object property: sent down, class: Quail)

ALLAH sent down Quail to the Children of Israel (Bani-Israel).

Q6: Who sent Quail to the Children of Israel?

A: ALLAH

Figure 31

Q7: Which bird was sent down to the Children of Israel?
A: Quail

Figure 32

Q8: Which bird did the Children of Israel eat?

A: Quail

Figure 33




Scenario 5: (data property: has_usage, string value: Ride)

Horse, Mule and Ass are used for riding.

Q9: Which animal has the usage of ride?

A: Horse, Mule and Ass

Figure 34


Scenario 6: (data property: has_voice, string value: harshest)

The Ass has the harshest voice.

Q10: Which animal has the harshest voice?

A: Ass

Figure 35






Recognizing Textual Entailment

Textual entailment:

- Textual entailment is defined as a directional relation between two text fragments, termed T (the entailing text) and H (the entailed hypothesis).

- T entails H if, typically, a human reading T would infer that H is most likely true.

- This somewhat informal definition assumes common human understanding of language as well as common background knowledge.




Entailment examples:

Text: The drugs that slow down or halt Alzheimer's disease work best the earlier you administer them.
Hypothesis: Alzheimer's disease is treated using drugs.
Entails: YES

Text: Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies."
Hypothesis: A case of rabies was confirmed.
Entails: NO

Text: Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England's Liverpool Airport as Liverpool John Lennon Airport.
Hypothesis: Yoko Ono is John Lennon's widow.
Entails: YES








Recognizing Textual Entailment:

- Recognizing Textual Entailment (RTE) is the task of deciding, given T and H, whether T entails H.

- It is carried out fully automatically by a computer program, without any human intervention.

- The class of NO entailment can be divided into two subclasses:

1. H is unsupported by T.

2. H is contradicted by T.


Applications of Recognizing Textual Entailment:

- Question Answering (QA): given a question, automatically extract correct answers from a collection of documents.

  Example question: "Who bought Overture?"

  The QA system analyses the question and creates the answer pattern "X bought Overture".

  It then searches the documents for text fitting the answer pattern.

- Similar arguments can be made for many other NLP applications, including Information Retrieval (IR), Information Extraction (IE), Automatic Summarization and Machine Translation (MT).

- Proponents of textual entailment believe that most NLP tasks can be reduced to RTE.


Approaches for Recognizing Textual Entailment:

Word Matching

If a text entails a hypothesis, the hypothesis is likely to share a large number of words with the text.

Algorithm

- Count the number of shared words between the text T and the hypothesis H.

- Set a threshold above which entailment is true.



Problems

The longer H is, the more words it will likely share with T.

Solutions

- Normalize for sentence length by dividing by the number of words in H.
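The word-matching baseline, with the length normalization applied, can be sketched as follows (the 0.6 threshold is an illustrative choice, not a value from the text):

```python
# Minimal word-matching baseline: normalized word overlap plus a threshold.
def word_match_score(t: str, h: str) -> float:
    """Fraction of H's words that also occur in T, normalized by |H|."""
    t_words = {w.strip('.,;:"\'') for w in t.lower().split()}
    h_words = [w.strip('.,;:"\'') for w in h.lower().split()]
    return sum(w in t_words for w in h_words) / len(h_words)

def entails(t: str, h: str, threshold: float = 0.6) -> bool:
    """Declare entailment when the normalized overlap clears the threshold."""
    return word_match_score(t, h) >= threshold

t = "Yoko Ono unveiled a bronze statue of her late husband, John Lennon."
print(entails(t, "Yoko Ono unveiled a statue."))  # True: every word of H occurs in T
```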




Lemma Matching

Word matching has obvious shortcomings. For example:

T: He is looking out of the windows

H: He looks out of a window

They entail, but applying word matching to them yields NO, because the word "looking" does not match the word "looks" and the word "windows" does not match the word "window".

The solution to this problem is to match the lemmas of the words rather than the words themselves.

Lemma: the underlying form of the surface form of a word ("look" is the lemma of "looking" and "looks"; "window" is the lemma of "windows" and "window").

Algorithm

- Count the number of shared word lemmas between the text T and the hypothesis H.

- Set a threshold above which entailment is true.
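Lemma matching can be sketched with a toy lemma table; a real system would use a lemmatizer (a morphological analyzer) instead of this hand-made dictionary:

```python
# Toy lemma table standing in for a real lemmatizer.
LEMMAS = {"looking": "look", "looks": "look", "windows": "window", "is": "be"}

def lemma(word: str) -> str:
    """Fall back to the word itself when no lemma entry exists."""
    return LEMMAS.get(word, word)

def lemma_match_score(t: str, h: str) -> float:
    """Shared-lemma overlap, normalized by the length of H."""
    t_lemmas = {lemma(w) for w in t.lower().split()}
    h_lemmas = [lemma(w) for w in h.lower().split()]
    return sum(l in t_lemmas for l in h_lemmas) / len(h_lemmas)

t = "He is looking out of the windows"
h = "He looks out of a window"
print(lemma_match_score(t, h))  # 5/6: only "a" has no counterpart in T
```

Note how "looking"/"looks" and "windows"/"window" now match via their shared lemmas, where plain word matching counted them as misses.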


Part-of-speech Mapping

Word matching has further shortcomings. For example:

T: The lead singer left to wave at us

H: The left wave threw lead at us

Here 6 out of 7 of H's words match a word in T, but some of the matching words have completely different meanings.



- "lead" in the sense of "first" does not match "lead" in the sense of "metal".

- "left" in the sense of "went away" does not match "left" in the sense opposite of "right".

- "wave" in the sense of "greeting" does not match "wave" in the sense of "movement in a fluid".

Such words are homographs: words with identical spelling but different meanings.

Homographs can often be distinguished by their word class (category), which is given by the part of speech (POS), the lexical category of the word.

Some major word categories in English are:

- Noun (N), as in "car"

- Verb (V), as in "drive"

- Adjective (A), as in "blue"

- Adverb (Adv), as in "quickly"

The POS of a word can be obtained by applying POS tagging, the process of labeling words according to a predefined set of POS tags, to the sentence (text/hypothesis). Example:

T: The/Det lead/N singer/N left/V
V