Faculty of Engineering
Computer and Systems Department
Question Answering in the Holy Quran
Lina Tarek Eweis
Marwa Naser Ghazi
Hala Gamal Mohammed
Yomna Salah El
Developing a tool for the Holy Quran that helps people better understand the topics and the verses of this book is one of the best things we have achieved in our life. It was the main source of pleasure and enthusiasm for nearly ten.
We have tried hard to accomplish this task in the best possible way. May God accept this work from us.
Table of Contents
An Overview of the Quran
An Overview of the Arabic Language
Understanding the Concepts of the Quran
Arabic Natural Language Processing Challenges
Western Developers vs. Arabic Developers
Some ANLP Challenges
Question Answering Systems
Open-Domain QA Systems
Main Features of a Question Answering System
Stop Words Remover
Ontology Knowledge Base
Annotated Documents Database
Closed-Domain QA Systems
Examples from Our Project
Vocabularies, Taxonomies, and Ontologies
Resource Description Framework (RDF)
Web Ontology Language (OWL)
Linked Open Data
SPARQL Queries and Results
Recognizing Textual Entailment
Approaches for Recognizing Textual Entailment
Ontology-Based Module
Template Algorithm Description
List of Questions (Question Dataset)
Arabic Search Module
Keyword Search Module
Our First Keyword Search
Other Quranic Search
The Holy Quran, due to its unique style and allegorical nature, needs special attention with regard to search and information retrieval. Many works have been done to accomplish keyword search over the Holy Quran. The main problem with all of these is that they are either static or do not provide semantic search or Tafsir.
The aim of this project is to develop a tool for searching for concepts in the Holy Quran. Although the Quran is one of the most important religious books of the world, there is little computational analysis performed on it. This is down to many reasons: the absence of adequate morphological analyzers for Classical Arabic (the language of the Quran), the absence of Quranic resources, and the fact that the Quran has its unique style of describing topics. In some places, topics are explicitly mentioned, while others are meant implicitly. There are many difficulties in implementing semantic search for the Holy Quran, among them the nature of the text itself: it is not an ordinary text that we can put through standard machine processing, but rather a text compiled in a special way in terms of linguistic structures that can reveal different meanings across the ages.
The system presents three approaches (modules) to retrieve the answer to the input question:

Ontology-based QA module
This module is based on an RDF Quranic database (ontology) that contains the Quranic chapters, their verses, and their Tafsir. This ontology allows us to answer statistical questions, e.g. "the number of verses in" a given chapter. The ontology is accessed using SPARQL queries. The possible input questions are stored in a database of questions with patterns and constant questions, and the module works by finding the question that matches the input and querying the RDF fact database.
Questions with patterns are questions where some information needs to be extracted from the input question, such as a chapter name or a verse number.
Constant questions are of two types:
Questions with queries: their queries are stored in the database, and their answers are retrieved by executing those queries.
Other questions: their answers are stored directly in the database.
In this module the answer is retrieved by matching the input question against all of the questions stored in the database and extracting the answer, executing the stored query, or generating the query of the question most closely matching the input question.
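The matching step of this module can be sketched in Python; the question patterns, the constant-question table, and the SPARQL templates below are illustrative stand-ins for the project's actual databases:

```python
import re

# Hypothetical "questions with patterns": each maps a question template to
# a SPARQL query template; "{chapter}" is filled from the matched question.
PATTERN_QUESTIONS = {
    r"how many verses (?:are there )?in (?P<chapter>\w+)":
        "SELECT (COUNT(?v) AS ?n) WHERE {{ :{chapter} :hasVerse ?v }}",
}

# Hypothetical constant questions whose answers are stored directly.
CONST_QUESTIONS = {
    "how many chapters are in the quran": "114",
}

def answer(question):
    """Match the input question against the stored questions and return
    either a stored answer or a filled-in SPARQL query to execute."""
    q = question.lower().strip(" ?")
    if q in CONST_QUESTIONS:
        return ("answer", CONST_QUESTIONS[q])
    for pattern, query_template in PATTERN_QUESTIONS.items():
        m = re.fullmatch(pattern, q)
        if m:  # extract the chapter name and build the query
            return ("query", query_template.format(**m.groupdict()))
    return ("unsupported", None)
```

In the real module the generated query would then be run against the RDF fact database; here the function only returns the query text.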
Semantic QA module
In this module, semantic (meaning-targeting) questions about a specific chapter can be answered. The expected input questions are represented by lists of main words; those word lists are unique, in that no two questions share the same main word list. The module finds the main words and consults the RDF fact database: for each word list, the main database holds a list of verses that answer the question. The answer to the input question is considered to be the verses that contain the answer in their meaning, together with the Tafsir of those verses to clarify their meaning. Because the input question may be similar to a question in the database but use different words with the same meaning, a synonyms database is built containing synonyms of the main words. The generated query retrieves, from the RDF fact database, the verses and their Tafsir that contain the answer.
Search in the Quran for the answer
If the input question is not one of the supported system questions, a keyword search is done with the input question words after removing the stopwords. Besides question answering, the system also supports a word analyzer for any input Quranic word. Also, if the input is a single word, a search for this word in the Quran is done.
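The fallback keyword search can be sketched as follows; the verse texts and the stopword list are illustrative stand-ins for the project's Quranic text and stopword databases:

```python
# Toy verse corpus keyed by (chapter, verse) reference.
VERSES = {
    (1, 1): "in the name of god the most gracious the most merciful",
    (1, 2): "praise be to god lord of the worlds",
}
STOPWORDS = {"in", "the", "of", "to", "be", "is", "what"}

def keyword_search(question):
    """Remove stopwords from the question, then return the references of
    verses that contain any of the remaining keywords."""
    keywords = [w.strip("?.,") for w in question.lower().split()
                if w.strip("?.,") not in STOPWORDS]
    hits = [ref for ref, text in VERSES.items()
            if any(k in text.split() for k in keywords)]
    return keywords, sorted(hits)
```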
An Overview of the Quran
The Quran consists of 114 chapters of varying lengths, each known as a sura. The title of each sura is derived from a name or quality discussed in the text or from the first letters or words of the sura. In general, the longer chapters appear earlier in the Quran, while the shorter ones appear later. Each sura is formed from several ayahs, or verses, which originally meant a sign or portent sent by God. The number of ayahs is not the same in the various suras. The Quran was revealed in the Arabic language and has been translated into other languages. The Quran corpus consists of 77,784 word tokens and 19,287 word types.
Overview of the Arabic Language
The Arabic language is a Semitic language with many varieties. It is the largest living member of the Semitic language family in terms of speakers. Modern Arabic is classified as a macrolanguage with 27 sub-languages in ISO 639-3. These varieties are spoken throughout the Arab world, and Standard Arabic is widely studied and known throughout the Islamic world. Modern Standard Arabic (MSA) derives from Classical Arabic, the only surviving member of the Old North Arabian dialect group. The modern Standard language is closely based on the Classical language, and most Arabs consider the two varieties to be two registers of one and the same language. Classical Arabic, also known as Koranic (or Quranic) Arabic, is the form of the Arabic language used in the Quran as well as in numerous literary texts from Umayyad and Abbasid times (7th to 9th centuries). Modern Standard Arabic (MSA) is a modern version used in writing and in formal speaking (for example, prepared speeches and radio broadcasts). It differs minimally in morphology but has significant differences in syntax and lexicon, reflecting the influence of the modern spoken dialects. Classical Arabic is often believed to be the parent language of all the spoken varieties of Arabic.
Morphological Analysis Systems Developed for the Arabic Language
There are a number of Arabic morphological analysis systems developed for MSA, as it is the usual form of everyday written and printed materials. To name a few of the tools: [Beesley 1996; Beesley 1998; Beesley 2001; Al-...]. A complete survey of these systems and the morphological analysis techniques used in developing them can be reviewed in [Al-Sughaiyer and Al-Kharashi, 2004]. In [Sawalha and Atwell 2008], a comparison between the accuracy of different Part-of-Speech taggers for MSA is conducted. Also, a Part-of-Speech tagging system for MSA text has been developed by [Alqrainy et al, 2008], which achieved an accuracy of 91%.
But the text of the Holy Quran and, more generally, the collections of classical Arabic poetry have a different lexicon, morphology, and syntax from those of MSA. Hence the available tools are inadequate for analyzing classical Arabic material, especially the Quran, which is the most important book in the Muslim world.
Understanding the Concepts of the Quran
The Quran, the holy book of Islam, may well be the most powerful book in human history. Both in world history and contemporary affairs, it is doubtful that any other book now commands, or has in the past exerted, so profound an influence. Objectively, one of every five people on earth today is Muslim. Hence the importance of understanding the Quran for every Muslim, and also for those scholars who are interested in the study of man and society, since this book has been effectively instrumental not only in moulding the destinies of Islamic societies, but also in shaping the destiny of the human race as a whole [Mutahhari, 1984]. Therefore, understanding the concepts of the Quran is of paramount importance if one wishes to study this book comprehensively.
Defining the Meaning of a 'Concept'
In the WordNet dictionary, a concept is defined as an abstract or general idea inferred or derived from specific instances. Defining concepts for any domain of knowledge is far from an easy task. [Bennett, 2005] considered the problem of defining concepts within formal ontologies. He concluded that the disagreement stems from differences in the understanding of the word 'concept'. Bennett says, "those who are skeptical about the idea of precision and universality tend to regard a 'concept' as something rather close to a natural language term, ... But for a (classical) logician, a concept is an abstract entity that is largely independent of the vagaries of natural language: only in idealized circumstances can a concept within a formal system be regarded as the referent of a natural term." From a computational point of view, the second approach is more appealing.
But if we consider the case of holy books in general, and the Quran in specific, all the classifications of concepts are developed by people who come from a humanities background, not a computing one; they prefer the first approach.
Concepts of the Quran
Concepts can be classified into two main categories: concrete concepts (lexical or keyword concepts) and abstract concepts (general concepts).
An example of a concrete concept would be any word type or term that already exists in the text, such as names of persons, names of prophets, names of places or cities, etc.
Abstract concepts are more general. They are not usually explicitly mentioned in the text. They represent general themes or features covered by the text. For instance, there are several verses in the Quran that describe the main pillars of Islam. This is an abstract concept and is the most important theme in the Quran, but it was never mentioned explicitly in the Quran book.
For concrete concepts, we consider every word type of the Quran as a concept. Developing computational systems for concrete terms generally involves using keyword search tools. The main problem with keyword search tools is their poor recall.
Understanding the meaning of the Quran verses through reading the Tafsir (the detailed explanation of the meaning of the verses) is quite helpful, but does not draw the complete picture of the message that this book tries to convey to its readers. This is because the Quran covers one theme in many different chapters, and to get the complete picture, the reader must refer to all the passages in their context. In addition to that, the reader needs to relate the subject to the central themes, which are the main unifying ideas of the book. Therefore, the use of a thematic approach helps us to comprehend the Quran's message by avoiding pitfalls and the misuse of phrases picked out of context. This approach organizes subject matter around major or unifying themes, thus building them into a whole and enabling readers to make important connections between them. This job must be carried out by an expert, as there are some necessary conditions that need to be fulfilled to be able to do this hard job properly.
Murtada Mutahhari, a professor of theology at the University of Tehran, lists these conditions as follows [Mutahhari, 1984]:
The understanding of the Qur'an requires certain preliminaries, which are briefly described here. The first essential condition necessary for the study of the Qur'an is knowledge of the Arabic language, just as, for the understanding of Hafiz and Sa'di, it is impossible to get anywhere without knowledge of the Persian language. In the same way, to acquaint oneself with the Qur'an without knowing the Arabic language is impossible.
The other essential condition is knowledge of the history of Islam. This book was revealed gradually during a long period of twenty-three years of the Prophet's life, a tumultuous time in the history of Islam. It is on this account that every verse of the Qur'an is related to a certain specific historical incident called "the reasons for the descent", which by itself does not restrict the meaning of the verses, but knowledge of the particulars of revelation throws more light on the subject of the verses in an effective way.
The third condition essential for the understanding of the Qur'an is correct knowledge of the sayings of the Prophet (S). He was, according to the Qur'an itself, the interpreter of the Qur'an par excellence. The Qur'an says: "We have revealed to you the Reminder that you may make clear to men what has been revealed to them ..." (16:44)
Therefore, the main goals of this research project are:
To build an Arabic tool for word search based on (Word...).
To build the first Quranic question answering system for ...
To develop a root extractor tool that improves on the accuracy of the root extractor tools already existing.
To provide semantic search on Quran chapters.
To provide the user with the Tafsir of Quran verses.
Arabic Natural Language Processing Challenges
The Arabic language is both challenging and interesting. It is interesting due to its history [Versteegh 1997], the strategic importance of its people and the region they occupy, and its cultural and literary heritage [Bakalla 2002]. It is also challenging because of its complex linguistic structure [Attia 2008]. It is the native language of more than 330 million speakers [CIA 2008]. It is also the language in which 1.4 billion Muslims perform their prayers five times daily. Arabic is characterized by a complex diglossia situation [Diab and Habash 2007; Farghaly 1999; Ferguson 1959, 1996]. Although both Classical Arabic and Modern Standard Arabic (MSA) exist, we will not consider this diglossia phenomenon, as it is not our case. Any Arabic NLP system that does not take the specific features of the Arabic language into account is certain to be inadequate [Shaalan 2005a; 2005b]. For example, Arabic is written from right to left. Like Chinese, Japanese, and Korean, there is no capitalization in Arabic. In addition, Arabic letters change shape according to their position in the word. Like Italian, Spanish, Chinese, and Japanese, Arabic is a pro-drop language; that is, it allows subject pronouns to drop [Farghaly 1982], subject to recoverability of deletion [Chomsky 1965].
Western developers vs. Arabic developers
In the last few years, Arabic natural language processing (ANLP) has gained increasing importance, and several state-of-the-art systems have been developed for a wide range of applications, including machine translation, information retrieval and extraction, speech synthesis and recognition, localization and multilingual information retrieval systems, text to speech, and tutoring systems. These applications had to deal with several complex problems pertinent to the nature and structure of the Arabic language. Most ANLP systems developed in the Western world focus on tools that enable non-Arabic speakers to make sense of Arabic texts. Arabic tools such as Arabic named entity recognition, machine translation, and sentiment analysis are very useful to intelligence and security agencies. Because the need for such tools was urgent, they were developed using machine learning approaches. Machine learning does not usually require deep linguistic analysis and is fast and inexpensive. Developers of such tools had to deal with difficult issues. One problem is that Arabic texts include many translated and transliterated named entities whose spelling tends to be inconsistent in Arabic texts [Shaalan and Raza 2008]. For example, a named entity such as the city of Washington can be spelled in several different ways. Another problem is the lack of a sizable corpus of Arabic named entities, which would have helped in both rule-based and statistical named entity recognition systems. Efforts are being made to remedy this. For example, the LDC released in May 2009 an entity translation training/dev test set for Arabic, English, and Mandarin Chinese. A third limitation is that NLP tools developed for Western languages are not easily adaptable to Arabic due to the specific features of the Arabic language. Recognizing that developing tools for Arabic is vital for progress in ANLP, the MEDAR consortium has started an initiative for cooperation between Arabic and European Union countries to develop Arabic language resources [Choukri 2009].
On the other hand, ANLP applications developed in the Arab World have different objectives and usually employ both rule-based and machine learning approaches. The following are some of the objectives of ANLP for the Arab World:
1. Transfer of knowledge and technology to the Arab World. Most recent publications in science and technology are published in other languages.
2. Modernizing and fertilizing the Arabic language. This follows from (1) above.
3. Translating new concepts and terminology into Arabic.
4. Making information retrieval, extraction, summarization, and translation available to the Arab user (this field is related to our project topic).
Some ANLP challenges
One of the challenges facing ANLP developers is the Arabic script. It is one of the key linguistic properties of the Arabic language that poses a challenge to the automatic processing of Arabic. Although Arabic is a phonetic language, in the sense that there is a one-to-one mapping between the letters in the language and the sounds they are associated with, Arabic is far from being an easy language to read, due to the lack of dedicated letters to represent short vowels, the changes in the form of a letter depending on its place in the word, and the absence of capitalization and minimal punctuation.
Since the Arabic script does not have dedicated letters to represent the short vowels in the language, short vowels have been represented by diacritics, which are marks above or below the letters. However, these diacritics have been disappearing in contemporary writing, and readers are expected to fill in the missing short vowels through their knowledge of the language.
Arabic letters have different shapes depending on the position of the letter in the word. All Arabic word processors implement these rules, so the user does not have to manually select the correct shape.
In NLP applications such as machine translation, information retrieval, information extraction, clustering, and classification, it is necessary to split a running text correctly into sentences, and a sentence splitter capitalizes on capitalization and punctuation. But scripts such as Arabic, Chinese, Japanese, and Korean have neither capitalization nor strict rules of punctuation, and their absence [Shaalan and Raza 2008, 2009] makes the task of preprocessing a text much more difficult and challenging.
Another challenge facing researchers and developers of Arabic computational linguistics is the dilemma of normalization. The problem arises because of the inconsistency in the use of diacritic marks and certain letters in contemporary Arabic texts. Some Arabic letters share the same shape and are only differentiated by adding certain marks, such as a dot, a hamza, or a madda, placed above or below the letter. For example, the alif may be three different letters, depending on whether it has a hamza above it, a hamza below it, or a madda above it. Recognizing these marks above or below a letter is essential to be able to distinguish between apparently similar letters.
For this problem, we normalized an alif with a hamza above or below, a bare alif, and an alif madda to simply an alif. We also normalize the final taa marbuuTa to the final haa, and the alif maqsuura to the yaa, among other normalizations.
But it soon became apparent that although normalization improves recognition by removing variability in the input, it increases the probability of ambiguity [2010]. For example, normalizing an initial alif with a hamza above or below it removes an important distinction between ann and inn. The first means "that" and must be followed by a nominal sentence. The second means "to", which translates the English infinitive, but whose translation is meaningless if followed by a noun. In short, although normalization solves recognition problems, it creates the unintended effect of increased ambiguity.
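The normalization rules described above can be sketched as follows; the rule set is a minimal subset for illustration, not the project's full normalizer:

```python
def normalize(text):
    """Normalize common Arabic orthographic variants: hamzated and madda
    alifs -> bare alif, taa marbuuTa -> haa, alif maqsuura -> yaa, and
    strip the short-vowel diacritics (tashkeel)."""
    for alif in "أإآ":               # alif with hamza above/below, alif madda
        text = text.replace(alif, "ا")
    text = text.replace("ة", "ه")    # taa marbuuTa -> haa
    text = text.replace("ى", "ي")    # alif maqsuura -> yaa
    # remove tashkeel marks in the range U+064B..U+0652
    return "".join(ch for ch in text if not "\u064b" <= ch <= "\u0652")
```

Note that the function maps أن and إن to the same string, which is exactly the ann/inn ambiguity discussed above.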
The many levels of ambiguity pose a significant challenge to researchers developing NLP systems for Arabic [Attia 2008]. The reason is that ambiguity exists on many levels, as evidenced by Maamouri and Bies, who show 21 different analyses of the Arabic word tmn produced by BAMA. The average number of ambiguities for a token in Arabic is higher than in any other language. Ambiguity in Arabic is present at the following levels:
Part-of-speech ambiguity: a word may belong to more than one part of speech, such as qdm, which could be a Form II verb meaning "to introduce", a Form I verb meaning "to arrive from", or a noun meaning "foot". Some homograph ambiguity can be resolved by contextual rules. For example, an Arabic word that could be either a noun or a verb can be disambiguated by the following rule, which says that such a word is disambiguated to a noun when preceded by a preposition.
Contextual homograph resolution:
[N | V] -> N / Prep __
Internal word structure ambiguity: a complex Arabic word may be segmented in different ways. For example, "wly" could be segmented into a coordinating conjunction plus a pronoun, meaning "and for me", or may not be segmented at all, meaning "a pious person favored by God".
Attachment ambiguity: as in the case of prepositional attachment, e.g. /qabaltu mudiir al-bank al-jadiid/ (the noun phrase is "مدير البنك الجديد"), which could mean "I met the new bank manager" or "I met with the manager of the new bank", depending on the internal analysis of the noun phrase.
Scope ambiguity: sentences and phrases may be interpreted in different ways. For example, "يحب علي أحمد أكثر من..." /yhb 'ly ahmd aktr mn .../, "Ali likes Ahmed more than Ibrahim." Does this mean that Ali likes Ahmed more than Ali likes Ibrahim, or that both Ali and Ibrahim like Ahmed, but Ali likes Ahmed more than Ibrahim likes Ahmed?
Constituent boundary ambiguity: for example, "مدير البنك الجديد" /mdyr albnk algydyd/ could mean "the new manager of the bank" or "the manager of the new bank", depending on the boundary of the adjective phrase within this noun phrase.
Anaphoric ambiguity: "قال علي انه نجح" /qala Ali annahu najah/, "Ali said he succeeded." This sentence is ambiguous both in English and in Arabic. Chomsky's Binding principles account for sentences like this. The question here is: does "he" refer to Ali or to someone else?
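The contextual homograph rule [N | V] -> N / Prep __ can be sketched as a toy tagger; the homograph lexicon and preposition list are illustrative, not a real Arabic lexicon:

```python
# Toy lexicon: qdm is a noun/verb homograph; a few transliterated prepositions.
HOMOGRAPHS = {"qdm": {"N", "V"}}
PREPOSITIONS = {"fy", "mn", "ila"}   # "in", "from", "to"

def disambiguate(tokens):
    """Tag tokens, resolving a noun/verb homograph to a noun when the
    preceding token is a preposition (the contextual rule above)."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            tags.append("Prep")
        elif tok in HOMOGRAPHS and len(HOMOGRAPHS[tok]) > 1:
            if i > 0 and tokens[i - 1] in PREPOSITIONS:
                tags.append("N")      # rule fires: preceded by a preposition
            else:
                tags.append("N|V")    # still ambiguous without context
        else:
            tags.append("UNK")
    return tags
```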
In addition to these levels of ambiguity, the process of normalization, plus features of Arabic such as the pro-drop structure, the complex word structure, the lack of capitalization, and minimal punctuation, all contribute to ambiguity; but it is the absence of short vowels that contributes most significantly to ambiguity.
With the absence of short vowels, two types of linguistic information are lost. The first is most of the case markers that define the grammatical function of nouns and adjectives. The absence of case markers, and thus of the grammatical function of a word, creates multiple ambiguities, due to the relatively free word order in Arabic and because Arabic is a pro-drop language. The second type of information that is lost due to the nature of the Arabic script is the lexical and part-of-speech information. Thus, in the absence of internal voweling, it is sometimes impossible to determine the part of speech (POS) without context. For example, without contextual clues, a word like mn could be a preposition meaning "from", a wh-phrase meaning "who", or a verb meaning "granted". An Arabic token such as ktb without internal voweling could be a plural noun "books", an active past tense verb "wrote", a passive past tense verb "was written", or a causative past tense verb "he made him write".
While ambiguity is a challenge in any language, what makes Arabic so challenging is that all of these features are present in one language.
Another important challenge is nonconcatenative morphology [McCarthy], which presents a challenge to the structuralist theory of the morpheme. Structuralists defined the morpheme as a minimal linguistic unit that has a meaning; by minimal, it is meant that a morpheme cannot have a morpheme boundary within it. This definition works well for languages with concatenative morphology, like English. McCarthy points out that the building blocks of Arabic words are a consonantal root, which represents a semantic field, such as "KTB" for "writing", and a vocalism that represents a grammatical form. Arabic stems are described in terms of prosodic templates such as CVCVC. The Cs represent the root radicals and the Vs represent the vocalism [Cavalli-Sforza et al. 2000]. Words are formed by an association of the radicals with the vocalism, and McCarthy proposes that Arabic words are analyzed at separate tiers (the root and the vocalism).
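The root-and-template analysis can be illustrated in a few lines of Python; the root "ktb" is the textbook example, and the transliterated vocalisms below are illustrative:

```python
def apply_template(root, template):
    """Interleave the radicals of a consonantal root with a CV template:
    each 'C' consumes the next radical; every other symbol is a vowel
    contributed by the vocalism tier."""
    radicals = iter(root)
    return "".join(next(radicals) if ch == "C" else ch for ch in template)
```

For example, combining the root "ktb" with the template "CaCaC" yields "katab", while "CuCiC" yields "kutib": the same root surfaces in different grammatical forms depending on the vocalism.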
Arabic is an agglutinative language, and affixes that represent different parts of speech can be attached to a stem or root to form a word. Unlike English and most languages, Arabic has a complex word structure. For example, the Arabic sentence /wara'aytuhum/ "and I saw them" is written as one word and may be decomposed into the following four morphemes:
/wa/ conjunction "and"
/ra'ay/ past tense verb "saw"
/tu/ subject pronoun "I"
/hum/ object pronoun "them"
The order in which affixes are attached to stems is rule-governed, which makes the decomposition of Arabic words possible. However, it is far from an easy process, due to the high degree of ambiguity in Arabic [Attia 2008]. For example, the word /whm/ has at least four valid analyses:
/wa+hum/ CONJ + SUBJ/OBJ PRON "and they"
/wa+hammun/ CONJ + COMMON NOUN "and worry"
/wahm/ COMMON NOUN "illusion"
/wa+hamma/ CONJ + PVERB "and he initiated"
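The ambiguity of this decomposition can be illustrated with a toy segmenter; the lexicon and prefix table below are illustrative stand-ins for a real morphological analyzer:

```python
# Toy undiacritized lexicon mirroring the /whm/ example above: maps an
# unvoweled string to its possible voweled analyses.
LEXICON = {
    "hm": [("hum", "PRON 'they'"), ("hammun", "NOUN 'worry'"),
           ("hamma", "PVERB 'initiated'")],
    "whm": [("wahm", "NOUN 'illusion'")],
}
PREFIXES = {"w": "CONJ 'and'"}

def analyses(token):
    """Enumerate analyses of an unvoweled token: the whole token as one
    word, plus every prefix+remainder split found in the lexicon."""
    results = list(LEXICON.get(token, []))
    for pre, tag in PREFIXES.items():
        rest = token[len(pre):]
        if token.startswith(pre) and rest in LEXICON:
            results += [(pre + "+" + form, tag + " + " + t)
                        for form, t in LEXICON[rest]]
    return results
```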
Question Answering (QA) Systems
Question answering is a computer science discipline within the field of natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language. A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents. To determine which kind of method will be used to get the answer, we have to know the kind of questions we deal with. There are two kinds of questions:
Factoid questions are questions whose answers can be found in short spans of text and correspond to a specific, easily characterized category, often a named entity (a person name, a location).
Complex (narrative) questions are questions whose answers are information whose scope is greater than a single factoid but less than an entire document; in such cases we might need a summary of a document or a set of documents.
Kinds of question answering systems
Open-domain question answering deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Main features of a question answering system:
One component represents the input question of the user; another represents the documents from which the answer will be retrieved.
Tokenization is the process of converting a stream of characters (the text of the document or the queries) into a stream of words (the candidate words to be adopted as index terms). This process, also called the tokenization of the sentences, is used to identify the words individually for further processing.
Stop words Remover:
Words which are too frequent among the documents' and queries' text in the collection are not good discriminators; they are referred to as stopwords. The figure represents the main features of a question answering system. Stopwords are removed in this step, with the objective of filtering out words with very low discrimination value for retrieval purposes.
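The tokenization and stopword-removal steps can be sketched as a small inverted indexer; the stopword list and documents are illustrative:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "a", "in"}  # illustrative stopword list

def index_terms(doc_id, text, index):
    """Tokenize the text and add every non-stopword token to the
    inverted index as a candidate index term."""
    for token in re.findall(r"\w+", text.lower()):
        if token not in STOPWORDS:
            index[token].add(doc_id)

index = defaultdict(set)
index_terms(1, "The pillars of Islam", index)
index_terms(2, "The history of Islam", index)
```

Because stopwords never reach the index, queries are matched only against terms with real discrimination value.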
Stemming:
The words that appear in documents and in queries often have many morphological variants. This subsystem attempts to reduce words to their stems or root forms; thus, the key terms of a query or document are represented by stems rather than by the original words. The rationale for such a procedure is that similar words generally have similar meanings, and thus retrieval effectiveness is enhanced if morphological variants are unified.
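A light stemmer in this spirit might look like the sketch below; the transliterated prefix and suffix lists are illustrative only, and a real system would use a full Arabic analyzer:

```python
# Illustrative transliterated affixes (e.g. "al" = definite article).
PREFIXES = ("wal", "al", "wa", "bi")
SUFFIXES = ("iyya", "at", "a", "un")

def light_stem(word):
    """Strip at most one known prefix and one known suffix, keeping at
    least three characters so that short words survive intact."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[: -len(s)]
            break
    return word
```

Light stemming of this kind trades accuracy for speed: it conflates most inflectional variants without attempting a true root extraction.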
Synonym identification:
The synonyms of the words are identified in this subsystem to make the user's query more extensive, and in the same way to obtain every possible concept present in the document, so as to respond to the user's query effectively.
Query expansion:
Query expansion is a technique used to boost the performance of a document retrieval engine. Common methods of query expansion for Boolean keyword retrieval engines include inserting additional query terms, such as alternate inflectional or derivational forms generated from existing query terms, or dropping query terms that are, for example, deemed to be too restrictive.
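Thesaurus-based expansion can be sketched as follows; the synonym table is an illustrative stand-in for the project's synonyms database:

```python
# Illustrative synonym table; a real system would draw on a thesaurus
# such as WordNet or a domain-specific synonyms database.
SYNONYMS = {
    "charity": ["alms", "zakat"],
    "prayer": ["salat"],
}

def expand_query(terms):
    """Expand each query term with its known alternate forms, keeping
    the original terms first and avoiding duplicates."""
    expanded = []
    for t in terms:
        for form in [t] + SYNONYMS.get(t, []):
            if form not in expanded:
                expanded.append(form)
    return expanded
```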
Question classification:
Also called "answer type recognition", this is the process of classifying the question by its expected answer type. For example, a question like "Who founded Virgin Airlines?" expects an answer of type PERSON; a question like "What Canadian city has the largest population?" expects an answer of type CITY. If we know the answer type for a question, we can avoid looking at every sentence or noun phrase in the entire suite of documents for the answer, focusing instead on just people or cities. Knowing the answer type is also important for presenting the answer: a definition question like "What is a prism?" might use a simple answer template like "A prism is ...", while an answer to a biography question like "Who is Zhou Enlai?" might use a different template.
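A minimal rule-based answer type recognizer can be sketched like this; the rule list is illustrative, and real systems combine such rules with learned classifiers:

```python
# Ordered rules keyed on the question's opening words; more specific
# prefixes ("what city") must come before more general ones ("what is").
RULES = [
    ("who", "PERSON"),
    ("where", "LOCATION"),
    ("when", "DATE"),
    ("what city", "CITY"),
    ("how many", "NUMBER"),
    ("what is", "DEFINITION"),
]

def answer_type(question):
    """Return the expected answer type of a question, or OTHER."""
    q = question.lower()
    for prefix, atype in RULES:
        if q.startswith(prefix):
            return atype
    return "OTHER"
```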
It is not enough to simply provide a computer with a large amount of data and expect it to learn to speak; the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However, in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technologies. Different aspects of language are studied and used for annotations: syntax, semantics, morphology, phonology (and phonetics), and the lexicon.
Syntax: the study of how words are combined to form sentences. This includes examining parts of speech and how they combine to make larger constructions.
Semantics: the study of meaning in language. Semantics examines the relations between words and what they are being used to represent.
Morphology: the study of units of meaning in a language. A "morpheme" is the smallest unit of language that has meaning or function, a definition that includes words, prefixes, affixes, and other word structures that impart meaning.
Phonology: the study of how phones are used in different languages to create meaning. Units of study include segments (individual speech sounds), features (the individual parts of segments), and syllables.
Phonetics: the study of the sounds of human speech, and how they are made and perceived. A "phone" is the term for an individual sound, and a phone is essentially the smallest unit of human speech.
The idea here is to create a highly tangled representation of the sentences, where each word is directly connected to others, representing both meaning and relations. Instead of keeping the knowledge base separate, the relevant knowledge gets embedded within the text. We can hence use efficient indexing techniques to represent such knowledge and query it very effectively with suitably modified information retrieval techniques.
Ontology Knowledge Base:
In the context of knowledge sharing, an ontology means a specification of a
conceptualization. That is, an ontology is a description (like a formal specification of
a program) of the concepts and relationships that can exist for an agent or a
community of agents. This definition is consistent with the usage of ontology as a set
of concept definitions, but more general. And it is certainly a different sense of the
word than its use in philosophy.
The task of this subsystem is to manage the ontological knowledge and factual
knowledge (instances and statements).
The ontology knowledge base contains a designed ontology related to a specific
domain. The domain ontology is defined in OWL to represent the concepts and
relations in that domain. To facilitate the annotation engine, the ontology knowledge
base contains plenty of entities of general importance and various relations between
them. For the implementation of the OWL ontology and knowledge representation,
the Protégé ontology editor and the Protégé OWL API were used.
Semblance of OWL classes in Protégé
Annotated Documents Database:
The annotated documents database stores all the documents that have gone through
the process of annotation. This database is ready to be queried to present the list of
related documents to the users. The figure below represents the structure of the
annotated documents database.
This database subsystem has two main functions. First, it saves the incoming
annotated documents for later use. Second, it analyzes the annotated query, retrieves
the results corresponding to that query, and returns the query's results to the
application logic layer.
The final stage of question answering is to extract a specific answer from the
passage, so as to be able to present the user with an answer like "300 million" to the
question "What is the population of the United States?".
Two classes of algorithms have been applied to the answer extraction task: one
based on answer-type pattern extraction and one based on N-gram tiling.
In the pattern extraction method for answer processing, we use information about
the expected answer type together with regular expression patterns. For example, for
questions with a HUMAN answer type, we run the answer type or named entity
tagger on the candidate passage or sentence, and return whatever entity is labeled
with type HUMAN. Thus in the following examples the underlined named entities are
extracted from the candidate answer passages as the answers to the HUMAN and
DISTANCE questions:
"Who is the prime minister of India?"
…, prime minister of India, had told left leaders that the deal would not be
renegotiated.
"How tall is Mount Everest?"
The official height of Mount Everest is …
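The pattern-extraction idea above can be sketched as follows. The patterns here are illustrative stand-ins, not the system's actual rules: the HUMAN "tagger" is just a run of capitalized words, where a real system would call a named entity recognizer.

```python
import re

# Illustrative answer-type patterns (crude stand-ins for a real NE tagger).
ANSWER_TYPE_PATTERNS = {
    # HUMAN: a short run of capitalized words approximates a person name
    "HUMAN": re.compile(r"\b(?:[A-Z][a-z]+\s){1,3}[A-Z][a-z]+\b"),
    # DISTANCE: a number followed by a length unit
    "DISTANCE": re.compile(r"\b[\d,]+(?:\.\d+)?\s*(?:feet|meters|metres|km)\b"),
}

def extract_answer(answer_type, passage):
    """Return the first entity in the passage matching the expected answer type."""
    pattern = ANSWER_TYPE_PATTERNS.get(answer_type)
    match = pattern.search(passage) if pattern else None
    return match.group(0) if match else None

print(extract_answer("DISTANCE", "The official height of Mount Everest is 29,035 feet."))
# → 29,035 feet
```

A real system would replace these regexes with a trained named entity tagger, keeping the same "return whatever entity carries the expected type" logic.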
An alternative approach to answer extraction, used solely in web search, is based on
N-gram tiling, sometimes called the redundancy-based approach. This simplified
method begins with the snippets returned from the web search engine, produced by
a reformulated query. In the first step of the method, N-gram mining, every unigram,
bigram, and trigram occurring in the snippets is extracted and weighted. The weight
is a function of the number of snippets the N-gram occurred in, and the weight of
the query reformulation pattern that returned it.
In the N-gram filtering step, N-grams are scored by how well they match the
predicted answer type. Finally, an N-gram tiling step concatenates overlapping
N-gram fragments into longer answers. A standard greedy method is to start with
the highest-scoring candidate and try to tile each other candidate with this
candidate. The best-scoring concatenation is added to the set of candidates, the
lower-scoring candidate is removed, and the process continues until a single answer
is built.
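The mining and tiling steps can be sketched as below. This is a simplified illustration (function names and the snippet texts are mine): weights count only snippet frequency, ignoring the query-reformulation weight, and `tile` shows a single merge step of the greedy loop.

```python
from collections import Counter

def mine_ngrams(snippets, max_n=3):
    """N-gram mining: every unigram, bigram and trigram in each snippet,
    weighted by the number of snippets it occurs in."""
    weights = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        grams = {" ".join(words[i:i + n])
                 for n in range(1, max_n + 1)
                 for i in range(len(words) - n + 1)}
        weights.update(grams)          # each snippet counts a gram once
    return weights

def tile(a, b):
    """One step of greedy tiling: concatenate two N-grams if the end of
    one overlaps the start of the other, else return None."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])
    return None

snippets = ["everest is 29035 feet tall", "height of everest is 29035 feet"]
w = mine_ngrams(snippets)
print(w["29035 feet"])                 # → 2 (occurs in both snippets)
print(tile("is 29035", "29035 feet"))  # → is 29035 feet
```

Redundancy does the work here: fragments of the correct answer recur across snippets, so they accumulate weight and survive tiling.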
To improve the response of the question answering system, some systems use
relevance feedback. The idea of relevance feedback is to involve the user in the
retrieval process so as to improve the final result set. In particular, the user gives
feedback on the relevance of documents in an initial set of results. The basic
procedure is:
The user issues a (short, simple) query.
The system returns an initial set of retrieval results.
The user marks some returned documents as relevant or nonrelevant.
The system computes a better representation of the information need based
on the user feedback.
The system displays a revised set of retrieval results.
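The "better representation" step is classically computed with the Rocchio algorithm: move the query vector toward the centroid of relevant documents and away from the centroid of nonrelevant ones. The sketch below works over bag-of-words vectors; the weight values are conventional defaults, not values given in this text.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio relevance feedback over bag-of-words vectors."""
    def centroid(docs):
        c = Counter()
        for d in docs:
            c.update(Counter(d.lower().split()))
        return {t: v / len(docs) for t, v in c.items()} if docs else {}

    q = Counter(query.lower().split())
    rel, nonrel = centroid(relevant), centroid(nonrelevant)
    new_q = {}
    for t in set(q) | set(rel) | set(nonrel):
        w = alpha * q.get(t, 0) + beta * rel.get(t, 0) - gamma * nonrel.get(t, 0)
        if w > 0:                      # negative weights are usually clipped
            new_q[t] = w
    return new_q

# The revised query gains "big", "cat", "habitat" and loses nothing to "car".
q2 = rocchio("jaguar", ["jaguar big cat habitat"], ["jaguar car price"])
```

After this step, the system reruns retrieval with the expanded query to produce the revised result set of the last step.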
Closed-domain question answering deals with questions under a specific domain
(for example, medicine or automotive maintenance), and can be seen as an easier
task because NLP systems can exploit domain-specific knowledge frequently
formalized in ontologies. Alternatively, closed domain might refer to a situation
where only a limited type of questions is accepted, such as questions asking for
descriptive rather than procedural information.
We chose the Holy Qur'an to be our specific domain, as the user can ask a question
about the Holy Qur'an such as:
كم عدد آيات القرآن الكريم؟ كم سورة فى القرآن الكريم؟ ما تفسير … فى سورة الكهف؟
(How many verses are in the Holy Qur'an? How many chapters are in the Holy
Qur'an? What is the Tafsir of … in Surat Al-Kahf?)
We also chose Surat Ash-Shu'ara as a specific closed domain in the Holy Qur'an,
where the user can ask any question he wants within this particular chapter, such as:
ما هى أسماء الله التى ذكرت فى سورة الشعراء؟ اذكر قصة سيدنا موسى. ما هى الأدعية التى جاءت فى سورة الشعراء؟
(What are the names of Allah mentioned in Surat Ash-Shu'ara? Tell the story of our
Prophet Musa. What are the supplications that appear in Surat Ash-Shu'ara?)
QA systems typically consist of three core components: query formation,
information retrieval and answer selection.
The query formation component takes a question string as input and transforms it
into one or more queries, which are passed to the information retrieval component.
Depending on the type of the question, the IR component queries one or more
knowledge sources that are suitable for that type and aggregates the results.
These results are then passed as answer candidates
to the answer selection
component, which drops candidates that are unlikely to answer the question
and returns a ranked list of answers.
This overall architecture is illustrated in Figure 6.
Figure 6. Typical architecture of a QA system
The technique based on question interpretation and answer extraction uses
different types of text patterns. Question patterns are used to interpret a question,
i.e. to determine the property it asks for and to extract the target and context
objects. Answer patterns are used to extract answers from text passages. This
chapter describes how these patterns are generated.
Initially, the properties and question patterns need to be specified manually. In a
second step, the system automatically learns answer patterns for each of the
properties, using question-answer pairs as training data. See the following
examples:
who <be> <T_NONE>
<what> <be> <T_NOABBR>
(who|<what>) <be> <T> (in|of) <C>
(who|<what>) <be> <C>'s <T>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <T>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <T> (in|of) <C>
(name the|<what> <be> the (names?|term) (for|given to|of)|who <be> considered to be) <C>'s <T>
what <T> <be> (called|known as|named)
what <T> (in|of) <C> <be> (called|known as|named)
what <C>'s <T> <be> (called|known as|named)
<T> <be> (called|known as|named) what
<T> (in|of) <C> <be> (called|known as|named) what
<C>'s <T> <be> (called|named) what
what do you call <T>
what do you call <T> (in|of) <C>
Question patterns for the property NAME.
At first, we determined the properties that the questions ask for and created an
initial set of question patterns by simply replacing the target and context objects in
the questions by <T> and <C> tags respectively.
Then we converted the first letter to lower case, dropped the final punctuation mark
and reordered auxiliary verbs in the same way as it is done in the question
normalization component. This is necessary because the question patterns are
applied to a question string after normalization.
In the next step, we merged patterns by extending them to regular expressions (e.g.
"who (assassinated|killed|murdered|shot) …"). We then used our personal
knowledge and dictionaries to further extend the patterns by adding synonyms, and
we added patterns to cover other possible formulations we could think of.
To further generalize the patterns, we invented shortcuts, which are tags that
represent regular expressions. These shortcuts are specified in a separate resource
file. Table 1 shows the shortcuts and the regular expressions they represent.
Shortcuts and corresponding regular expressions.
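Expanding shortcuts into regular expressions can be sketched as follows. The entries in the shortcut table below are illustrative, not the ones from the system's resource file, and the named groups for <T> and <C> are my own convention.

```python
import re

# Hypothetical shortcut table: each tag expands to a regular expression.
SHORTCUTS = {
    "<be>": r"(?:is|are|was|were)",
    "<what>": r"(?:what|which)",
    "<T>": r"(?P<target>.+)",     # target object (greedy)
    "<C>": r"(?P<context>.+?)",   # context object (lazy)
}

def compile_pattern(pattern):
    """Turn a question pattern such as "(who|<what>) <be> <C>'s <T>"
    into a compiled regular expression."""
    for tag, regex in SHORTCUTS.items():
        pattern = pattern.replace(tag, regex)
    return re.compile("^" + pattern + "$", re.IGNORECASE)

p = compile_pattern(r"(who|<what>) <be> <C>'s <T>")
m = p.match("what is France's capital")
print(m.group("context"), m.group("target"))  # → France capital
```

Matching a normalized question against such compiled patterns yields both the property (via the pattern that fired) and the extracted target and context objects.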
Sometimes it is hard to determine the property a question asks for. E.g. a question
of the format "What is …?" could ask for the DEFINITION of a term, the LONGFORM
of an abbreviation or the NAME of a place. To allow a definite interpretation of such
questions, we invented a mechanism that uses object types. A target tag can be
associated with an object type to restrict the format of a target object. Such target
tags are not replaced by the general capturing group (.*) but by a more constraining
regular expression. For example, the tag <T_ABBR> indicates that the target must
be an abbreviation, i.e. a sequence of upper case letters. Table 2 gives an overview of
the object types.
Object types, their meanings and the corresponding capturing groups.
Finally, we reviewed the properties and question patterns. We dropped rare
properties and merged some properties with similar patterns. E.g. at first we
distinguished between the property NAME for proper names and the property
TERM for technical terms, which turned out to be impractical.
Examples from our project:
For the following question:
ما هو تفسير الآية رقم س فى سورة رقم ص؟
First, we simply translate the question:
What is the interpretation of verse No. X in chapter No. Y?
Then we determine the property, target and context:
Context: chapter, verse, chapter No., verse No.
Then the template will be:
What <be> the <P> of the <C> <T1> in <C> <T2>?
Then we take those two targets and enter a query to get the appropriate answer.
For the following question:
كم عدد آيات سورة رقم س؟
How many verses are in chapter No. X?
Determining P, T and C:
Property: number of verses
Context: chapter, chapter No., verse
How many <C> (of|in) <C> <T>?
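Taking the extracted targets and filling the gaps of a stored query can be sketched as below. The template regex and the SPARQL property names (`ex:chapterIndex`, `ex:numberOfVerses`) are hypothetical placeholders, not the ones actually stored in the system's database.

```python
import re

# Hypothetical compiled form of the "number of verses" template.
TEMPLATE = re.compile(
    r"^How many verses (?:of|in) chapter No\. (?P<chap>\d+)\?$",
    re.IGNORECASE)

# Hypothetical stored query with a gap for the chapter number.
QUERY_TEMPLATE = """
SELECT ?count WHERE {{
  ?chapter ex:chapterIndex {chap} ;
           ex:numberOfVerses ?count .
}}
"""

def build_query(question):
    """Match the question against the template and fill the query's gap."""
    m = TEMPLATE.match(question)
    if not m:
        return None                  # question is outside the system constraints
    return QUERY_TEMPLATE.format(chap=m.group("chap"))

q = build_query("How many verses in chapter No. 2?")
```

The same mechanism covers the whole question dataset: each stored template pairs a question regex with a query skeleton whose gaps are filled from the matched targets.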
Building the System Question Dataset and Their Relative Queries:
We start with entering more than 100 questions and their queries manually, which
defines the system constraints and which questions can be answered.
Figure 7. Part of the Question Dataset
Storing Templates in Database:
We designed a database that stores the templates, their relative questions and the
answers to them.
We divided the questions into two classes:
Question with Query: this question depends on the user question to fill gaps in the
query of this question, e.g.:
اذكر الآية رقم … فى سورة البقرة.
(Mention verse No. … in Surat Al-Baqara.)
ما تفسير الجلالين للآية رقم … فى سورة البقرة؟
(What is the Jalalayn Tafsir of verse No. … in Surat Al-Baqara?)
Constant Question: this class is divided into two subclasses: questions whose
answers are entered manually into the DB, and questions with a query to execute to
get the answer.
Figure 8. Database Schema
Figure 9. Snapshots
Answer for a constant question
The word semantics is derived from the Greek sēmaínō, "to signify, to indicate", and
that from sêma, "sign, mark". Linguistically, it is the study of the interpretation of
signs or symbols as used in particular contexts.
Semantic analysis is the process of relating syntactic structures, from the levels of
phrases, clauses, sentences and the whole text, to their language-independent
meanings. It involves stripping away the characteristics tied to particular linguistic
contexts along the way.
Vocabularies, Taxonomies, and Ontologies
Vocabularies, taxonomies, and ontologies are all related. Each contains a set of
defined terms, and each is critical to the ability to express the meaning of data.
Their differences lie in their expressiveness, or how much meaning each attaches to
the terms that it describes:
A vocabulary is a collection of unambiguously defined terms used in
communication. Vocabulary terms should not be redundant without explicit
identification of the redundancy. In addition, vocabulary terms are expected to have
consistent meaning in all contexts.
A taxonomy is a vocabulary in which terms are organized in a hierarchical manner.
Each term may share a parent-child relationship with one or more other elements in
the taxonomy. One of the most common parent-child relationships used in
taxonomies is that of specialization and generalization, where one term is a more
specific or less specific form of another term. The parent-child relationships can be
many-to-many; however, many taxonomies adopt the restriction that each element
can have only one parent. In this case, the taxonomy is a tree or a collection of trees
(a forest).
An ontology uses a predefined, reserved vocabulary of terms to define concepts and
the relationships between them for a specific area of interest, or domain. Ontology
can actually refer to a vocabulary, a taxonomy, or something more. Typically the
term refers to a rich, formal logic-based model for describing a knowledge domain.
Using ontologies, you can express the semantics behind vocabulary terms, their
interactions, and their context of use.
In short, vocabularies are simple collections of well-defined terms, and taxonomies
extend vocabularies by adding hierarchical relationships between terms.
The Semantic Web uses a combination of a schema language and an ontology
language to provide the capabilities of vocabularies, taxonomies, and ontologies.
RDF Schema (RDFS) provides a specific vocabulary for RDF that can be used to
define taxonomies of classes and properties and simple domain and range
specifications for properties. The OWL Web Ontology Language provides an
expressive language for defining ontologies that capture the semantics of domain
knowledge.
Resource Description Framework (RDF)
The Resource Description Framework (RDF) is a general-purpose language for
representing information on the Web in a minimally constraining and maximally
flexible way, so that the information can be processed without loss of meaning.
RDF is intended for situations where information needs to be processed and
exchanged between applications rather than only being displayed to people.
In the Semantic Web, RDF is the data model for representing the metadata. It
provides users with a domain-independent framework for representing information
about resources in the WWW. With it, one can unambiguously express the meaning
of concepts and facts.
Everything that can be described is called a resource. A resource can be anything: a
city, a book, a person, a process, a company, etc. Resources being described have
properties, which have values. These properties and values are specified by
describing the resources in RDF statements. The part that identifies the resource
the statement is about is called the subject. The part that identifies the property of
the subject that the statement specifies is the predicate, and the part that identifies
the value of that property is the object. Hence, an RDF statement is a triple
consisting of a subject, a property, and an object.
RDF extends the linking structure of the Web to use URIs to name the relationship
between things as well as the two ends of the link (this is usually referred to as a
"triple"). Using this simple model, it allows structured and semi-structured data to
be mixed, exposed, and shared across different applications.
This linking structure forms a directed, labeled graph, where the edges represent
the named link between two resources, represented by the graph nodes. This graph
view is the easiest possible mental model for RDF and is often used in
easy-to-understand visual explanations.
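The triple model described above can be illustrated with a toy graph. The URIs and property names below are invented examples, not terms from an actual vocabulary.

```python
# Each RDF statement is a (subject, predicate, object) tuple; the set of
# statements forms a directed, labeled graph.
triples = {
    ("ex:Quran", "ex:hasChapter", "ex:AlBaqara"),
    ("ex:AlBaqara", "ex:numberOfVerses", "286"),
    ("ex:AlBaqara", "ex:revelationPlace", "Madina"),
}

def objects(subject, predicate):
    """Follow the labeled edges leaving `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("ex:AlBaqara", "ex:numberOfVerses"))  # → {'286'}
```

SPARQL queries over an RDF store are essentially pattern matches against this same graph of triples.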
Web Ontology Language (OWL)
The limited expressiveness of RDFS resulted in the need for a more powerful
ontology modeling language, in particular one that permitted greater machine
interpretability of Web content. This led to the W3C recommendation of the Web
Ontology Language OWL.
OWL allows modelers to use an expressive formalism to define various logical
concepts and relations in ontologies to annotate Web content. The enriched content
can then be consumed by machines in order to assist humans in various tasks. As
such, OWL fulfills the requirement for an ontology language that can formally
describe the meaning of terminology on Web pages. If machines and applications
are expected to perform useful reasoning tasks on these Web documents, the
language must surpass the semantics of RDFS. OWL has been designed to meet this
need.
Similar to RDFS, OWL can be used to explicitly represent the meaning of terms in
vocabularies and the relationships between those terms. The resulting ontologies
are intended for use by applications that need to process the content of information
instead of just presenting it. OWL provides a richer vocabulary with formal
semantics than RDFS by allowing additional modeling primitives that result in an
increased expressivity for describing properties and classes.
Open Linked Data
Linked data describes a method of publishing structured data so that it can be
interlinked and become more useful. It builds upon standard Web technologies such
as HTTP and URIs, but rather than using them to serve web pages for human
readers, it extends them to share information in a way that can be read
automatically by computers. This enables data from different sources to be
connected and queried.
The Semantic Web isn't just about putting data on the web. It is about making
links, so that a person or machine can explore the web of data. With linked data,
when you have some of it, you can find other, related, data.
Like the web of hypertext, the web of data is constructed with documents on the
web. However, unlike the web of hypertext, where links are relationship anchors in
hypertext documents written in HTML, for data the links are between arbitrary
things described by RDF. The URIs identify any kind of object or concept.
There are four principles of linked data:
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards.
Include links to other URIs, so that they can discover more things.
Querying data on your own hard drive is useful, but the real fun of SPARQL starts
when you query public data sources. You need no special software, because these
data collections are often made publicly available through a SPARQL endpoint,
which is a web service that accepts SPARQL queries.
The most popular SPARQL endpoint is DBpedia, a collection of data from the gray
infoboxes of fielded data that you often see on the right side of Wikipedia pages.
Like many SPARQL endpoints, DBpedia includes a web form where you can enter a
query and then explore the results, making it very easy to explore its data. DBpedia
uses a program called SNORQL to accept these queries and return the answers on a
web page. If you point a browser at the SNORQL endpoint, you'll see a form where
you can enter a query and select the format of the results you want to see.
DBpedia's SNORQL web form
More and more people are using the query language SPARQL (pronounced
"sparkle") to pull data from a growing collection of public and private data sources.
Whether this data is part of a semantic web project or an integration of two
inventory databases on different platforms behind the same firewall, SPARQL is
making it easier to access it.
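Programmatic access works the same way as the web form: a SPARQL endpoint is an ordinary HTTP service that takes the query as a URL parameter. The sketch below only builds the request URL (sending it with `urllib.request` would need network access); the query itself is an illustrative example.

```python
from urllib.parse import urlencode

def sparql_request(endpoint, query, fmt="application/sparql-results+json"):
    """Build the GET URL for a SPARQL endpoint call."""
    params = urlencode({"query": query, "format": fmt})
    return f"{endpoint}?{params}"

url = sparql_request(
    "https://dbpedia.org/sparql",
    "SELECT ?label WHERE { <http://dbpedia.org/resource/Quran> rdfs:label ?label } LIMIT 5",
)
```

Because the protocol is plain HTTP, any language with an HTTP client can query an endpoint without endpoint-specific APIs, which is exactly the portability argument made above.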
In the words of W3C Director and Web inventor Tim Berners-Lee, "Trying to use
the Semantic Web without SPARQL is like trying to use a relational database
without SQL."
SPARQL was not designed to query relational data, but to query data conforming to
the RDF data model. RDF-based data formats have not yet achieved the mainstream
status that XML and relational databases have, but an increasing number of IT
professionals are discovering that tools using the RDF data model let them expose
diverse sets of data (including relational databases) with a common, standardized
query language.
Both open source and commercial software have become available with SPARQL
support, so you don't need to learn new programming language APIs to take
advantage of these data sources. This data and tool availability has led to SPARQL
letting people access a wide variety of public data and providing easier integration
of data silos within an enterprise.
Quranic Arabic is a unique form of Arabic used in the Quran, and it is the direct
ancestor language of Modern Standard Arabic (MSA). Annotating the Quran faces
an extra set of challenges compared to MSA due to the fact that the text is over
1,400 years old, and the Quranic script is more varied than modern Arabic in terms
of orthography, spelling and inflection. The same word is spelled in different ways in
different chapters. However, the Quran is fully diacritized, which reduces its
ambiguity.
With respect to morphological segmentation, 54% of the Quran's 77,430 words
require segmentation, resulting in 127,806 segments. A typical word in the Quran
consists of multiple segments combined into a single whitespace-delimited form.
For example, "فاذكروني", which means "so remember me", should be segmented into
four segments: "ف", "اذكر", "و" and "ني", where each segment represents an
individual syntactic unit: a conjunction prefix, a verb, a subject pronoun, and an
object pronoun, respectively.
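The "so remember me" example can be made concrete with a toy lookup. A real segmenter derives boundaries from morphological analysis, not from a table; the table below just encodes the one example discussed in the text.

```python
# Toy segmentation table: a single whitespace-delimited Quranic word maps
# to its syntactic segments (illustrative, covering only the text's example).
SEGMENTATIONS = {
    "فاذكروني": [
        ("ف", "conjunction prefix"),
        ("اذكر", "verb"),
        ("و", "subject pronoun"),
        ("ني", "object pronoun"),
    ],
}

def segment(word):
    """Return the word's segments, or the word itself if unanalyzed."""
    return SEGMENTATIONS.get(word, [(word, "unsegmented")])

print(len(segment("فاذكروني")))  # → 4
```

Note that the four segment strings concatenate back to the original surface form, which is the invariant any segmenter must preserve.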
The computational analysis of the Quran remains an intriguing but underexplored
field, since only a few isolated efforts have been made in it. For the young learner,
this may be not just a challenging requirement, but a deterrent.
The Quranic Arabic Corpus is an annotated linguistic resource consisting of the
77,430 words of Quranic Arabic. The corpus aims to provide morphological and
syntactic annotations for researchers wanting to study the language of the Quran.
The grammatical analysis helps readers further in uncovering the detailed intended
meanings of each verse and sentence. Each word of the Quran is tagged with its part
of speech as well as multiple morphological features. Unlike other annotated Arabic
corpora, the grammar framework adopted by the Quranic Corpus is the traditional
Arabic grammar of i'rāb (إعراب).
Corpus annotation assigns a part-of-speech tag and morphological features to each
word. For example, annotation involves deciding whether a word is a noun or a
verb, and whether it is inflected for masculine or feminine.
The annotated corpus also includes an annotated treebank of Quranic Arabic. The
Quranic Arabic Dependency Treebank (QADT) has been developed and
implemented in Java at the University of Leeds, UK, as part of the Quranic Corpus
project. It works as a treebank for Quranic sentences by providing a deep
computational linguistic model based on historical traditional Arabic grammar
(i'rāb). The treebank is freely available under an open source license. QADT has two
levels of analysis: morphological annotation and syntactic representation.
The morphological segmentation has been applied to all the 77,430 words in the
Quran. The treebank also represents Quranic syntax through dependency graphs, as
shown in the figure below. QADT is applied to each individual Quranic verse (or
part of a verse), even if that verse does not form a complete sentence unless it is
joined with its neighbor verse. In addition to QADT, there are very few works done
in the field of computational analysis of the Quran.
A Quranic Ontology uses knowledge representation to define the key concepts in the
Quran, and shows the relationships between these concepts using predicate logic.
Named entities in verses, such as the names of
historic people and places mentioned
in the Quran, are linked to concepts in the ontology.
Semantic Quran Ontology
It is a multilingual RDF representation of translations of the Quran. The dataset
was created by integrating data from two different semi-structured sources, and was
aligned to an ontology designed to represent multilingual data from sources with a
hierarchical structure. The resulting RDF data encompasses 43 different languages,
which belong to the most under-represented languages in Linked Data, including
Arabic and Amharic.
This is the Ontology which we used within our Question Answer System.
The data were extracted from two semi-structured sources: the data from the
Tanzil project and the Quranic Arabic Corpus.
It is an ontology for representing multilingual data extracted from sources
consisting of different translations of the Quran book as well as numbered chapters
and verses. In addition to providing aligned translations for each verse, the dataset
provides syntactic information on each of the original Arabic terms utilized across
translations. The authors interlinked the dataset to three versions of Wiktionary as
well as DBpedia, and ensured therewith that the dataset abides by all the four
Linked Data principles.
To represent the data as RDF, they developed a general-purpose vocabulary. The
vocabulary was specified with the aim of supporting datasets which display a
hierarchical structure.
It includes four basic classes, among them Chapter, Verse and Word.
The chapter class provides the name of chapters in different languages and
localization data such as chapter index and order. Additionally, the chapter class
provides metadata such as the number of verses in a chapter, revelation place and
some provenance information. Finally, the chapter class provides inter-class linking
properties to link it to the different verses contained in it. For example, each
chapter provides a dcterms:tableOfContents for its verses in the form of a list.
The verse class contains the verse text in different languages as well as numerous
localization data such as verse index and related chapter index. Additionally, this
class provides related verse data such as different descriptions and provenance
information. Finally, it contains inter-class linking properties to link the verse in
both directions: to the chapter of the verse and to the words contained in such a
verse.
The word class encompasses the next level of granularity below the verse and
contains the word text in different languages as well as numerous localization data
such as the related verse and chapter indexes. Additionally, the word class provides
related word provenance information and some inter-class linking properties to
link it to the chapter and verse of such a word.
UML class diagram of the semantic Quran ontology
Since the Quran contains a lot of data concerning places, people and different
events, multilingual sentences concerning such information can be retrieved from
the dataset. The aligned multilingual representation allows searching for the same
entities across different languages. The figure below shows a SPARQL query which
allows retrieving the Arabic, English and German translations of verses which
contain "Moses".
Verses that contain Moses in (i) Arabic, (ii) English and (iii) German
Example 2: List the verses that contain a given word
Example 3: List all the Arabic prepositions with an example statement for each
Example 4: List the names of the chapters which are Makkian
Example 5: Count the number of the chapters in the Quran
Example 6: Get the Jalalayn Tafsir of a given verse
Example 7: Count the number of the verses in a certain chapter
Example 8: Count the number of parts in the Holy Quran
Example 9: Detect the revelation place of a given chapter
Example 10: Count the number of quarters in the Quran
It is an OWL file located locally, which supports semantic search on the Quran. The
ontology declares all concepts about animals exactly as defined in the Holy Quran.
There are about 167 direct or indirect references to animals in the Holy Quran.
Main difficulties in the implementation include:
In Arabic, many words exist for one animal. For instance, a camel has words like
Abal, Buaaeer and Gamal.
The majority of animals have been mentioned by way of giving similarity to some
people. This creates difficulty in creating metaphorical or abstract relationships. For
example, the non-believers who have no faith in the Holy Quranic verses are likened
to a donkey carrying books.
Many animals have been mentioned regarding their behavior. For instance, the ass,
mule and horse have been described as a source of ornament and riding. Behavior is
such a difficult domain that two university SIGs are extensively working merely on
the domain of animal behavior.
part of the object properties defined in the ontology
SPARQL Queries and Results:
Scenario 1: (object property: swallow, with its domain and range classes)
Q1: Name the Prophet who was swallowed by a Fish.
A: Yunus, subclass of Prophet
Q2: Which animal swallowed Yunus?
Q3: Which animal swallowed a Prophet and also lives in the sea?
Scenario 2: (class: man, object property: love, class: branded horse)
Man loves Branded Horse.
Q4: Which animal is loved by man?
A: Branded Horse, subclass of Horse
Scenario 3: ALLAH forbade Swine (to eat)
Q5: Which animal is forbidden (to eat) by ALLAH?
Scenario 4: (class: ALLAH, object property: sent down, class: Quail)
ALLAH sent down Quail to the Children of Israel (Bani Israel).
Q6: Who sent down Quail to the Children of Israel?
Q7: Which bird was sent down to the Children of Israel?
Q8: What did the Children of Israel eat?
Scenario 5: (data property: has_usage, string value: Ride)
Horse, Mule and Ass are used for riding.
Q9: Which animal has the usage of riding?
A: Horse, Mule and Ass
Scenario 6: (data property: has_voice, string value: harshest)
Ass has the harshest voice.
Q10: Which animal has the harshest voice?
Recognizing Textual Entailment
Textual entailment is defined as a directional relation between two text fragments,
termed T, the entailing text, and H, the entailed hypothesis. T entails H if, typically,
a human reading T would infer that H is most likely true. This somewhat informal
definition assumes common human understanding of language as well as common
background knowledge.
T: The drugs that slow down or halt Alzheimer's disease work best the earlier you
administer them.
H: Alzheimer's disease is treated using drugs. (entailed)
T: Drew Walker, NHS Tayside's public health director, said: "It is important to
stress that this is not a confirmed case of rabies."
H: A case of rabies was confirmed. (contradicted)
T: Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to mark
the official renaming of England's Liverpool Airport as Liverpool John Lennon
Airport.
H: Yoko Ono is John Lennon's widow. (entailed)
Recognizing Textual Entailment
Recognizing Textual Entailment (RTE) is the task of deciding, given T and H,
whether T entails H. This is carried out fully automatically by a computer program,
without any human intervention. The class of NO entailment can be divided into
two subclasses:
H is unsupported by T
H is contradicted by T
Applications of Recognizing Textual Entailment
Question Answering (QA): given a question, automatically extract correct answers
from a collection of documents.
Example Q: "Who bought Overture?"
The QA system analyses the question and creates an answer pattern: "X bought
Overture", then searches in the documents for text fitting the answer pattern.
Similar arguments can be made for many other NLP applications: Information
Retrieval (IR), Information Extraction (IE), Automatic Summarization and Machine
Translation (MT). Proponents of textual entailment believe that most NLP tasks can
be reduced to it.
Approaches for Recognizing Textual Entailment
Word matching: if a text entails a hypothesis, the hypothesis is likely to share a
large number of words with it:
count the number of shared words between text T and hypothesis H
set a threshold above which entailment is true
The longer H is, the more words it will likely share with T, so we normalize for
sentence length by dividing by the number of words in H.
Word matching has obvious shortcomings:
T: He is looking out of the windows
H: He looks out of a window
T entails H, but applying word matching to them yields NO, because the word
"looking" does not match the word "looks", and the word "windows" does not
match the word "window".
The solution to this problem is to match the lemmas of the words, not the words
themselves. A lemma is the underlying form of the surface form of a word ("look" is
the lemma of "looking" and "looks"; "window" is the lemma of "windows" and
"window"):
count the number of shared word lemmas between text T and hypothesis H
set a threshold above which entailment is true
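The lemma-overlap procedure can be sketched as follows. The hand-written lemma table and the threshold value are illustrative assumptions; a real system would use a proper lemmatizer.

```python
# Tiny illustrative lemma table (a real system would use a lemmatizer).
LEMMAS = {"looking": "look", "looks": "look", "windows": "window"}

def lemma(word):
    return LEMMAS.get(word.lower(), word.lower())

def entails(t, h, threshold=0.8):
    """Word-overlap entailment: fraction of H lemmas found among T lemmas."""
    t_lemmas = {lemma(w) for w in t.split()}
    h_lemmas = [lemma(w) for w in h.split()]
    shared = sum(1 for w in h_lemmas if w in t_lemmas)
    return shared / len(h_lemmas) >= threshold  # normalize by length of H

print(entails("He is looking out of the windows", "He looks out of a window"))
# → True (5 of 6 H lemmas are shared: he, look, out, of, window)
```

On the document's example, plain word matching fails ("looking" vs "looks") while the lemma version succeeds, which is exactly the improvement described above.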
Word matching has more shortcomings:
T: The lead singer left to wave at us
H: The left wave threw lead at us
Here 6 out of the 7 H words match a T word, but some matching words have
completely different meanings:
lead in the sense of "first" does not match lead in the sense of the metal
left in the sense of "going away" does not match left in the sense of the opposite of
"right"
wave in the sense of "greeting" does not match wave in the sense of "movement in
fluid"
Such words are homographs: words with identical spelling but different meanings.
Homographs can often be distinguished by their word class (category), which can
be known from the part of speech (POS), "the lexical category of a word". Some
major word categories in English are:
Noun (N) as in car
Verb (V) as in drive
Adjective (A) as in blue
Adverb (Adv) as in quickly
The POS of a word can be obtained by applying POS tagging, "labeling words
according to a predefined set of POS tags", on the sentence (text / hypothesis), for
example:
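POS tagging can be illustrated with a toy lexicon lookup. The tag set and the lexicon below are illustrative; real taggers are trained on annotated corpora and also use the surrounding context to disambiguate homographs.

```python
# Toy one-word-one-tag lexicon (illustrative; ignores ambiguity).
LEXICON = {"the": "DET", "lead": "N", "singer": "N", "left": "V",
           "to": "TO", "wave": "V", "at": "P", "us": "PRON"}

def pos_tag(sentence):
    """Label each word with a tag from a predefined POS tag set."""
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence.split()]

print(pos_tag("The lead singer left to wave at us"))
```

With tags attached, the overlap computation can require both the lemma and the POS tag to match, so that the noun "wave" in H no longer matches the verb "wave" in T.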