The Future of Inference Engine using MapReduce and Semantic Web


Abstract

Reasoning in the Semantic Web has been greatly simplified by the MapReduce model. MapReduce simplifies administration, provides high performance and fault tolerance, automates task management, and reduces the problem of load balancing. However, the model is beset by a number of weaknesses that have reduced its utility. In this paper, we review the underlying concepts of MapReduce with reference to the Semantic Web and identify the major weaknesses of the MapReduce model. We also identify the solutions that have been developed to help offset these weaknesses and propose future developments that might help to further optimize the model.

Index Terms

Inference engine, MapReduce, OWL, RDF, Semantic Web.

I. INTRODUCTION

THE Semantic Web provides an extension of the World Wide Web through the provision of clear semantics to services and information. The objective is to infuse semantic meaning into the data present in the World Wide Web. In the Semantic Web, languages are developed which enable information to be expressed in a form that can be processed by machines, and this allows machines to make more sense of the web [1, 2].

In this paper, we look at the concepts behind the MapReduce model and its place in the Semantic Web. We also discuss the various solutions that have been proposed for the optimization of MapReduce.

II. THE SEMANTIC WEB

The Semantic Web consists of languages and tools which form the Semantic Web stack (figure 1 below). The standard language of the Semantic Web is the extensible markup language (XML), because the language allows a wide variety of data to be encoded and is widely used [3, 4]. XML is a flexible metalanguage used for exchanging data, which can be used for the definition of data structures and personalized tags. The structures can be made platform-independent and data defined automatically. Data definitions are stored in a document type definition (DTD) document or schema [3].

The Semantic Web is poised to solve the weaknesses associated with the current web, such as the lack of automatic information transfer, the inability of machines to comprehend user data, data vagueness arising from poorly interconnected data, the lack of a suitable structure for representing data, and trust issues arising from the large number of users and amount of content [3, 5].



Fig. 1: The Semantic Web stack [3]


The Semantic Web has helped to improve the retrieval, sorting, and classification of data, enhanced data automation and integration, resolved the problem of interoperability present in available web technologies, and enhanced data reuse [3, 5].


A. Reasoning in the Semantic Web

Reasoning is the process used by machines in the Semantic Web to originate new information from an already existing dataset [1]. Inference engines or machine reasoners perform the reasoning. Inference engines are programs that attempt to generate answers from a dataset through reasoning. Reasoning can either be backward reasoning, in which case it is conducted at query time, or forward reasoning, where it is done before the query. It can also be deductive or inductive, and classical or large-scale [6, 7, 8]. In reasoning, there are sets of rules that are followed. Commonly used rule sets are the Resource Description Framework (RDF) rule set and the Web Ontology Language (OWL) [7, 9].

1) RDF

RDF is a widely accepted W3C standard for encoding data. It can be visualized as a data model for the description of objects and associated relations. It breaks down information into triples and specifies the rules that govern the triples. In RDF, information can be represented by means of a labelled directed graph with edges and vertices. Facts or associations inherent between 2 entities are represented by the edges. In this respect, a fact is made up of a subject, object, and verb or predicate. The subject corresponds to the leading part of the edge, the verb to the type of edge, and the object to the terminal part of the edge. Thus, a fact is a subject-predicate-object triple. Another name for a fact is a statement. The classes and characteristics of RDF resources are described by the RDF Schema (RDFS) [9, 10, 11, 12, 13].
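To make the triple structure concrete, the following minimal sketch (in Python, with invented example URIs) holds a small RDF graph as a set of subject-predicate-object tuples and answers simple pattern queries, which is exactly the graph-edge reading described above:

    # Sketch of RDF as subject-predicate-object triples (invented URIs).
    triples = {
        ("ex:Alice",  "rdf:type",        "ex:Person"),
        ("ex:Alice",  "ex:knows",        "ex:Bob"),
        ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    }

    def match(s=None, p=None, o=None):
        """Return all triples matching a pattern; None is a wildcard."""
        return [t for t in triples
                if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

    print(match(s="ex:Alice"))    # all edges leaving the vertex ex:Alice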


2) OWL-Horst Rule Set

OWL is an ontology language that is based on RDF but has higher complexity than RDF. A combination of the OIL and DAML languages, it is highly robust, with a broader vocabulary and more advanced machine interoperability than RDF. It has higher expressiveness than RDF [6, 14, 15, 16]. OWL is also written in XML, with 3 sublanguages, namely OWL Full, OWL Lite, and OWL DL. OWL Full is completely compatible with RDF. This compatibility, however, makes it difficult to come up with an effective and comprehensive reasoner, since this is a problem that is undecidable and hence cannot be executed using algorithmic methods. Whereas the full compatibility makes OWL Full more sophisticated with regard to reasoning, effective implementation of reasoning is very difficult. This drawback was resolved through the implementation of the OWL Lite and OWL DL languages. Unlike OWL Full, OWL DL and OWL Lite are not fully compatible with RDF and their expressivity is lower. However, their implementation remains difficult to perform [9, 17, 18, 19].


3) Implementations of Classical Reasoning

Implementations of classical reasoning include Openrdf-Sesame, which is an RDF reasoner, and SwiftOWLIM, which is a plugin used for OWL pD* reasoning [21, 22]. Others are BigOWLIM, which is an implementation of SwiftOWLIM, Jena, which is a rule-based inference engine written in Java, Pellet, FaCT, Hoolet, Racer, and F-OWL. Pellet supports OWL DL reasoning, is based on Java, supports XML datatypes, is decidable, has a complete consistency checker, and its interface is Java or DIG. FaCT supports OWL DL reasoning, is based on Lisp, does not support XML datatypes, is decidable, has a complete consistency checker, has no support for Abox, and its interface can be command line or DIG [22, 23, 24, 25].

Hoolet supports OWL DL, is based on Vampire, is not decidable, does not support XML datatypes, has no complete consistency checker, its interface is Java, and it scales poorly [26]. Racer supports OWL DL reasoning, is based on Lisp, is decidable, supports XML datatypes, has a complete consistency checker, and its interface can be GUI, DIG, or Java [27]. F-OWL supports OWL Full reasoning and is based on Flora and the XSB logic programming system. It supports XML datatypes but is neither decidable nor does it have a complete consistency checker. Interfaces can be Java, GUI, and command line. Its major limitation is that it scales poorly [20]. Surnia supports OWL Full reasoning and is based on full FOL logic and the Otter theorem prover. It does not support XML datatypes, is not decidable, has no complete consistency checker, and its interface is based on Python. Its main limitation is that it scales poorly [20].


Standards used in the evalu
ation of classical reasoners
include the LUBM, SP2Bench, BSBM, and UOBM [
28, 29,
30
]. [
31
] and [
32
] developed large
-
scale OWL reasoners. The
former reasoner uses information crawled on the web. Other
large scale reasoners have been proposed by
[33], [34],

and
[35]
. Reasoning models can also be standard or rule
-
based
inference methods. Standard inference methods comprise of
RDFS and OWL.
An OWL inference engine should be able to
check the consistency of ontologies hence ensure that the
language and syntax o
f all the terms plus the instances adhere
to the set restrictions. It must carry out computing entailment
and be able to process queries, handle XML data types, and
reason with rules. OWL inference engines have been designed
based on 3 different approaches
. The first approach involves
the use of a specialized description logic reasone
r

while the
second approach involves the use of full first order logic
(FOL) theorem prover. The third approach involves the use of
a reasoner that‟s designed for a subset of F
OL [
20
].


The MapReduce is an example of the rule
-
based inference
method.


III. THE MAPREDUCE MODEL

The MapReduce programming model is based on forward deductive reasoning and is used to process data in a distributed and parallel manner on machine clusters [36, 37]. The model supports fault tolerance, distributed computing, automatic parallelization, and management of tasks [38, 39]. It is an effective framework for tasks that involve a lot of data [40]. Data is processed in 2 stages: the first stage is the map stage, while the second stage is the reduce stage. Both stages of data processing are parallel (figure 3). The principle behind MapReduce is the division of data into several smaller bits and the assignment of each bit to an inactive node for processing.


Fig. 2: Diagram depicting the basic principle of MapReduce. Data is processed in 2 stages: the Map and Reduce stage. In the Map stage, input consisting of key/value pairs is read, transformed to an intermediate output, and grouped or partitioned before being sent to the Reduce stage [44]

A. Map Stage

In the map stage, input is read by the mapper but can also be received from a dedicated master node. Based on map functions that are specified by users, intermediate outputs are generated by the map nodes. The input is a key-value pair, while the intermediate output is a set of key-value pairs returned by the map function [40, 41]. The following equation represents the function of the map nodes:

Map: (k1, v1) → list(k2, v2)    (1)

Propagation of the output pairs to successive stages is done, followed by partitioning and/or grouping and sorting based on the keys. A hash function is applied over the keys during partitioning. This is performed by a partitioner and results in grouping. The number of reduce tasks corresponds to that of the partitions [36, 38, 40, 41].
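To make equation (1) concrete, the following minimal sketch expresses a map function in plain Python, not tied to Hadoop or any other implementation. The input value is assumed to be one RDF triple, and the emitted key is the term on which a later join will group; the rule fragment shown (grouping rdf:type and rdfs:subClassOf triples by their shared class term) is an illustrative choice, completed by the reduce sketch in the next subsection:

    # A sketch of equation (1): one input key/value pair comes in, a list
    # of intermediate key/value pairs goes out. Here the value is an RDF
    # triple and the emitted key is the join term (an illustrative choice).
    def map_fn(key, triple):
        s, p, o = triple
        out = []
        if p == "rdf:type":
            out.append((o, ("type-instance", s)))   # group instances by class
        if p == "rdfs:subClassOf":
            out.append((s, ("subclass", o)))        # group superclasses by subclass
        return out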


B. Reduce Stage

In the Reduce stage, the intermediate outputs are consumed by reduce nodes. Values are grouped by key, whereby reducers obtain matching partitions in a manner that allows all values with the same key to be dispatched to the same reducer. Merging of the output pairs is done based on the same key, and the key/list pairs are propagated to the reduce function that is specified by the user. Processing of the pairs is done by the function as described in the following representation [36, 38, 40, 41]:

Reduce: (k2, list(v2)) → list(v3)    (2)
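Continuing the sketch from the map stage, the reduce function of equation (2) receives one key together with the list of all values sharing it; grouping on a class term joins the rdf:type triples with the rdfs:subClassOf triples, which is the RDFS subclass-instance rule (rdfs9). The small driver simulates both stages sequentially; it illustrates the data flow only and is not a distributed implementation:

    # A sketch of equation (2): one key and the list of values sharing it
    # come in, derived values go out (here: triples via RDFS rule rdfs9).
    def reduce_fn(cls, values):
        instances = [v for (tag, v) in values if tag == "type-instance"]
        supers    = [v for (tag, v) in values if tag == "subclass"]
        return [(i, "rdf:type", sup) for i in instances for sup in supers]

    # Sequential simulation of the two (normally parallel) stages.
    def run(job_input):
        intermediate = {}
        for k1, v1 in job_input:                      # map stage
            for k2, v2 in map_fn(k1, v1):
                intermediate.setdefault(k2, []).append(v2)
        derived = []
        for k2, values in intermediate.items():       # reduce stage
            derived += reduce_fn(k2, values)
        return derived

    facts = [("ex:Alice", "rdf:type", "ex:Student"),
             ("ex:Student", "rdfs:subClassOf", "ex:Person")]
    print(run((None, t) for t in facts))
    # -> [('ex:Alice', 'rdf:type', 'ex:Person')]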


C. MapReduce Services

The 3 main services provided by the MapReduce system are automatic distribution of tasks, fault tolerance, and a distributed file system. Automatic distribution of tasks involves instantiation of the Map and Reduce tasks across many nodes, partitioning and assignment of the input to distinct Map tasks, and coordination of the duplication of intermediate outcomes, which are stored locally, from the nodes performing Map tasks to those performing Reduce tasks. Optimization usually involves co-location of a Map task to a node that has a duplicate of the allocated input partition in the distributed file system. This involves monitoring of the executed tasks and re-execution of the tasks in the event there is a failure. Reduce tasks are usually launched after the end of the map tasks, because any map task may provide inputs for the reduce tasks. Backup executions are scheduled close to the end of either of the tasks in order to protect against nodes that may take exceptionally long to complete assigned tasks. The MapReduce system allows the storage of inputs and outputs in a distributed file system. This enables these outputs to be accessed globally. Advanced features of the system include the InputFormat, Partitioner, Sorter, OutputFormat, and Combiner, as depicted in figure 3 below [36, 38, 40, 41].

The function of the InputFormat is to extract input records that consist of a key/value pair from the input, while that of the Partitioner is to calculate a Reduce task ID from the mid_key and ascertain the local file that should be attached to the intermediate result. The Sorter completes the group-by operation by organizing the intermediate results. The default option here is merge sort. Once the Reduce calculation has been performed, the output results are formatted by the OutputFormat for inscription to the output files. The Combiner can be used to help minimize the network traffic associated with duplication [38].
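Two of these features can be sketched directly. The Partitioner below is the usual hash-over-the-key scheme described above; the Combiner assumes, for illustration, count-like values that can be pre-summed locally before being copied over the network. Both are sketches of the concepts, not any particular system's API:

    # Partitioner sketch: a hash over the key picks one of R reduce tasks,
    # so all values sharing a key meet at the same reducer. A real system
    # needs a hash that is stable across machines (Python's built-in hash
    # of strings is process-salted, which is fine only for a local demo).
    def partition(key, num_reducers):
        return hash(key) % num_reducers

    # Combiner sketch: a local pre-reduction on each map node that shrinks
    # the intermediate data before it crosses the network. Summing counts
    # is an illustrative choice; a combiner must be valid on partial data.
    def combine(key, values):
        return [(key, sum(values))]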





Fig. 3: MapReduce features. The Map (M) and Reduce (R) tasks are distributed in an automatic manner by MapReduce across many nodes [38]

D. Characteristics and Advantages of MapReduce

In MapReduce, creation of partitions can be done arbitrarily and scheduling performed across many nodes in parallel. The reason behind this is that the map acts on distinct data pieces with no dependencies [6]. Secondly, all data with common keys are operated on by the Reduce task. Assignment of proper keys to the data during Map allows partitioning of the data to be carried out in Reduce. Load imbalance on the nodes performing computation is caused by skews in the partitioning, and effective performance of MapReduce depends on the use of balanced data partitions [6]. Thirdly, iterations of all values are operated on by the Reduce task, since the value sets are too big and cannot fit in memory. Consequently, correlations between the items can only be made use of in part by the Reducer during processing, as they are received as a stream and not as a set. Fourthly, the location of the data is on local nodes and the physical position of the data is recognized. This means that locality-aware scheduling is carried out and that scheduling of mapping and reducing can be done on the node with the pertinent data, thereby transferring the computation and not the data [6].

The advantages of MapReduce are that it provides high performance and simplifies administration. It also provides fault tolerance, helps to minimize the load balancing problem through the dynamic scheduling of tasks on the nodes that are available, and is designed in a way that minimizes the transfer of data [6, 40]. Implementations of MapReduce are many and include Hadoop, a Java open source implementation of MapReduce, Skynet, Twister, and LarKC [42, 43, 44].



E. Discussion


1) Execution of Data Joins

Execution of data joins using MapReduce is inefficient. This leads to problems with load balancing, since multiple rules and inputs may be used to derive common statements, causing far too many duplicate inferences. Additionally, rule-matching requires that a term be used to group antecedents together and that one reduce function operating on one machine process the grouped antecedents. Since Semantic Web triples are significantly skewed, serious problems with load balancing occur when single terms are grouped together. Skewed data results from the frequent usage of some terms and statements more than others, and this interferes with performance, since nodes that have data that is used more often need to work harder than the others [6, 36, 45, 46].

To overcome these problems, [36] came up with the Web-scale Parallel Inference Engine (WebPIE), a web-scale parallel inference engine that is used for scalable reasoning in pD* semantics and which is based on the MapReduce model [36]. Optimizations that were performed include loading the schema triples in memory and executing the join as required rather than in the reduce stage, executing the joins in the reduce stage and grouping the triples using the map function so as to avert duplicates, reducing the MapReduce tasks by executing the RDFS rules in a particular sequence, and use of contextual data to minimize duplicates when carrying out joins between instance triples. Other optimizations included carrying out redundant tasks to avert problems associated with load balancing and restricting the exponential derivation of owl:sameAs triples by constructing a sameAs table [36].

WebPIE solved the duplicate inferences problem by using the generated inferred triples to group triples. This resulted in the simplification of the way in which duplicates are eliminated, besides making this elimination more effective. Load balancing was improved through filtering out triples which are incapable of firing any rules for the present job and through the application of some rules in distinct MapReduce jobs. Schema statements were also loaded in memory, taking into consideration the fact that schema statements are present in many rules and that their numbers are limited. This enabled the streaming nature of MapReduce processes to be exploited and the statements in a single area of the join to be iterated. Required iterations over the dataset were reduced through optimization of the order of rule application [36].
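The first of these optimizations (schema triples are few, so every node can hold them in memory and perform the join during the map, while keying derived triples on themselves collapses duplicates in the reducer) can be sketched as follows. This is an illustration of the idea under assumed data structures, not WebPIE's actual code:

    # Sketch of a WebPIE-style map-side join: the small set of schema
    # triples is loaded into memory on every node, and streamed instance
    # triples are joined against it inside the map function itself.
    SUBCLASS = {}                       # subclass -> set of superclasses

    def load_schema(schema_triples):
        for s, p, o in schema_triples:
            if p == "rdfs:subClassOf":
                SUBCLASS.setdefault(s, set()).add(o)

    def map_with_schema(key, triple):
        s, p, o = triple
        if p == "rdf:type":
            # Key each derived triple on itself, so identical derivations
            # produced by different mappers collapse into one reduce group.
            return [((s, "rdf:type", sup), None) for sup in SUBCLASS.get(o, ())]
        return []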

To evaluate the performance of the system, Urbani et al. implemented a prototype based on the Hadoop framework. They conducted the evaluation using a DAS3 multi-cluster comprising 64 nodes and a gigabit Ethernet connection. Each of the nodes had 4 GB of main memory and 250 GB of SATA hard disk storage. The runtime of the system was measured using real-world datasets, namely Falcon, Swoogle, BigWeb, LDSR, LUBM, and Uniprot. It was found that the runtime for OWL-Horst reasoning was longer and more complex than that of RDFS and that the execution time was not proportional to the input size. The scalability of the system was also evaluated using the conventional LUBM standard [36].

Their results show that there is no increase in the complexity of the produced dataset whenever the size increases. It is also shown that the system scales in a linear manner even when the size of the input is very big. Findings show that better performance is obtained with larger datasets, and this is rationalized by the reduced overhead relative to the total time required for execution [36]. According to their results, the WebPIE system performs reasoning 60 times quicker than BigOWLIM on a large server and outperforms the Marvin platform described by [48] and the system described by [49] with regard to throughput and data size. It was shown to improve reasoning in the web by up to 60 times [36].

It was therefore concluded that WebPIE eliminates the problem of computing using a very large dataset, minimizes the load balancing problem, thereby resolving the problem of data join execution, outperforms other state of the art approaches for reasoning and inference in the Semantic Web, and supports inference under OWL-Horst. However, the system does not deal with distributed data in an efficient manner and is unable to select the most appropriate execution for each rule and the order of execution based on the input. These provide a scope for further research [36].

A platform known as Pig, together with the Pig Latin language, was developed atop Hadoop by [34] and has helped to simplify joins for an easier description by developers. However, it does not get rid of processing inefficiencies [34]. To resolve the problem of joins, [50] brought in a third stage known as 'merge' to MapReduce to create a Map-Reduce-Merge model. This new model enables an extension of the MapReduce framework, thereby improving its handling of join operations. The merge stage involves the merging of outcomes that have been partitioned and sorted by reducers [50]. [51] implemented the Pig framework over the SPARQL query engine using a 4-stage algorithm comprising pre-processing, map, reduce, and merge stages. The diagram below shows the last 3 stages of this model.


Fig. 4: The last 3 stages of the model involving the implementation of Pig over SPARQL [51]
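The merge stage introduced by this model can be pictured with a minimal sketch: two MapReduce lineages each end with reducer output grouped by key, and the merge stage combines matching partitions, here as a relational-style join. This is a simplified illustration of the model's idea, not the implementation of [50] or [51]:

    # Sketch of the merge stage of Map-Reduce-Merge: outputs of two reduce
    # stages (dicts mapping key -> list of values) are joined on their keys.
    def merge(reduced_a, reduced_b):
        for key in reduced_a.keys() & reduced_b.keys():
            for va in reduced_a[key]:
                for vb in reduced_b[key]:
                    yield (key, va, vb)

    # e.g. joining two reduced outputs on a shared key:
    print(list(merge({"ex:Bob": ["ex:Person"]}, {"ex:Bob": ["ex:Paris"]})))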

2) Problem of Too Much Data

The amount of data that is available in the Semantic Web has increased exponentially over the past few years. This exponential increase in data has rendered reasoning a data intensive problem. The limited processing capabilities and storage capacities of single machines have thus become obvious hindrances to the reasoning process. Together with the problem of scalability, the availability of too much data has necessitated the optimization of the MapReduce model to enable it to handle the large amount of information as well as scale effectively [52].

To solve the problem caused by the intensiveness of data, [53] proposed a model known as the Pipelined-MapReduce model. In this model, the batched MapReduce model is expanded by the use of a pipeline to facilitate exchange of data between operations, leading to enhanced utilization rates and lower completion times. Unlike in the traditional MapReduce model, where HTTP requests are issued by reduce tasks before output can be pulled from each TaskTracker, thereby decoupling execution of map tasks from reduce tasks, in Pipelined-MapReduce data is pushed to reducers following its generation [53].

The diagram below captures the differences between MapReduce and Pipelined-MapReduce. In the Pipelined-MapReduce model, data is sent directly from the map task to the reduce task, and pipelined data is stored in in-memory buffers. Evaluation was done, and the results obtained indicate that Pipelined-MapReduce is highly scalable and is capable of processing large datasets in an effective manner. Its utility in cloud computing has, however, not been studied [53].



Fig. 5: Illustration of the Pipelined-MapReduce model [53]
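The pushing behaviour that distinguishes Pipelined-MapReduce from the batched model can be sketched with a thread and a queue standing in for the in-memory buffers of figure 5. The map function here is a trivial stand-in; the point of the sketch is only that each intermediate pair reaches the reducer as soon as it is produced:

    # Sketch of pipelining: the mapper pushes each intermediate pair the
    # moment it is produced, instead of the reducer pulling finished map
    # output as in the batched model.
    import threading, queue

    buf = queue.Queue()                 # stands in for the in-memory buffers
    DONE = object()                     # sentinel marking end of map output

    def map_step(key, triple):          # trivial stand-in map function
        s, p, o = triple
        return [(o, s)]

    def mapper(records):
        for key, value in records:
            for pair in map_step(key, value):
                buf.put(pair)           # pushed immediately
        buf.put(DONE)

    def reducer():
        groups = {}
        while (item := buf.get()) is not DONE:
            k2, v2 = item
            groups.setdefault(k2, []).append(v2)   # built incrementally
        return groups

    t = threading.Thread(target=mapper,
                         args=([(None, ("ex:Alice", "ex:knows", "ex:Bob"))],))
    t.start()
    print(reducer())                    # -> {'ex:Bob': ['ex:Alice']}
    t.join()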

The large amount of data has also necessitated the need for efficient methods of compressing and decompressing data, so that the performance of applications can be enhanced and the data size decreased. Methods that use a distributed approach to compress data have not been particularly effective, since they partition the input before using several machines to process the partitions. This compels that storage of the dictionary table be done on a single machine while retrieval of numerical IDs is performed by the other machines, thereby creating enormous network communication to the location of the table. The result is that communication becomes a hindrance if the infrastructure that is used is slow. In addition, the main memory of the machine may not be able to hold the dictionary table where use is made of datasets with billions of statements, and this compels that a disk be used to store the table, which can significantly slow the retrieval of data [54].

To overcome these problems, [54] have described a dictionary encoding technique that is based on the MapReduce algorithm for compressing and decompressing data in the Semantic Web. The method scales in a linear manner, can resolve load balancing with caching and sampling, and can create a huge dictionary containing entries in the order of hundreds of millions. To further improve the performance and utility of this dictionary encoding method, future work needs to extend the proposed algorithms so that they can be used in different domains, as well as perform incremental updates without always having the input recompressed [54].
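The core of dictionary encoding (each distinct term is assigned a numerical ID once, so triples shrink to triples of integers) can be sketched in a few lines. The contribution of [54] lies in distributing this table with MapReduce jobs and taming popular terms with caching and sampling, all of which this single-machine sketch omits:

    # Sketch of dictionary encoding: terms map to integer IDs, triples are
    # stored as integer triples, and the table inverts the mapping back.
    term_to_id, id_to_term = {}, []

    def encode(term):
        if term not in term_to_id:
            term_to_id[term] = len(id_to_term)
            id_to_term.append(term)
        return term_to_id[term]

    def compress(triples):
        return [tuple(encode(t) for t in triple) for triple in triples]

    def decompress(encoded):
        return [tuple(id_to_term[i] for i in triple) for triple in encoded]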


3) Parallelism in Execution of Jobs

For the functions in MapReduce to work, only a small part of the input is required and access to other data is not necessary. This makes it possible to implement the 2 functions in a distributed manner across numerous nodes following the division of the inputs into different pieces. This parallelism is a disadvantage during the execution of joins involving information from different varying sources, as concurrent access to several triples is not possible [4].

To redress the problems of parallelization in Semantic Web reasoning, the Marvin system and the use of distributed hash tables (DHTs) have been proposed for RDF [55, 56]. However, little has been done with regard to OWL reasoning. [57] sought to fill this lacuna by developing a MapReduce algorithm for classifying ontologies in the description logic EL+. Their work consists of modification of the algorithm used for classifying EL+ ontologies, followed by its conversion into a MapReduce algorithm based on the notions of the RDF schema. Whereas the work provides a basis for the development of models for languages that are more expressive, it has neither been implemented nor evaluated, hence the benefits of the method have not been mapped [57].

[58] used Oracle to develop an inference engine as a database application. In this model, facts are stored in database tables and retrieved using SQL queries. Rewriting of the derived tuples to database tables is by way of standard DML statements, and the PL/SQL language is used for the procedural code. This model is beneficial since it is not limited by the amount of data that it can handle. Besides, it can produce rule translations using the SQL language and supports standard RDFS/OWL constructs [58].
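The set-oriented style of [58] can be pictured with a single INSERT ... SELECT that implements the RDFS subclass rule over a triples table. The sketch below uses SQLite and an invented one-table layout purely for illustration; it is not Oracle's schema or the authors' code:

    # Sketch of database-backed inference: facts live in a triples table
    # and one rule application is one set-oriented DML statement.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT, UNIQUE(s, p, o))")
    db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
        ("ex:Alice", "rdf:type", "ex:Person"),
        ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    ])

    # rdfs9: (x rdf:type c) and (c rdfs:subClassOf d) => (x rdf:type d)
    db.execute("""
        INSERT OR IGNORE INTO triples
        SELECT t.s, 'rdf:type', sc.o
        FROM triples t JOIN triples sc ON t.o = sc.s
        WHERE t.p = 'rdf:type' AND sc.p = 'rdfs:subClassOf'
    """)
    print(db.execute("SELECT * FROM triples").fetchall())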


4) Transfer of Huge Amounts of Data

According to [6], transfer of huge amounts of data across many nodes can easily cause the disk bandwidth or network to become saturated, since reasoning is data intensive. This calls for minimization of data transfer. Derivation of duplicates, which is associated with the transfer of large amounts of data, is a problem bedevilling the effective performance of MapReduce. When large datasets are used, the number of duplicates generated far outweighs that of the triples, and this prevents the scaling of the implementation to large sizes, since the storage and communication layers are rendered incapable of storing the extra data. To overcome this problem, a pre-processing step was introduced in order to minimize the number of duplicates. In WebPIE, the duplicate inferences problem was solved by using the generated inferred triples to group triples. This resulted in the simplification of the way in which duplicates are eliminated, besides making this elimination more effective [6].


5) Complexity of Reasoning and Fixpoint Iteration

Reasoning can exhibit anywhere from linear to exponential worst-case complexity. Fixpoint iteration is one of the main challenges associated with the MapReduce model and is an example of the problems associated with reasoning complexity. It is present in both the OWL and RDFS rulesets, and it refers to the joining of the output to the input, followed by processing in a repeated manner until there is no derivation of new statements. This occurs because of the recursive nature of reasoning [6, 36].
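The iteration itself is simple to state: apply the rules, join the derivations back into the input, and stop once a round derives nothing new. A minimal sketch, with the transitivity of rdfs:subClassOf (rule rdfs11) as an illustrative rule set:

    # Sketch of fixpoint iteration: output is joined back to the input and
    # the rules reapplied until no new statements are derived.
    def fixpoint(triples, apply_rules):
        closed = set(triples)
        while True:
            derived = apply_rules(closed) - closed
            if not derived:              # nothing new: fixpoint reached
                return closed
            closed |= derived

    # Illustrative rule set: transitivity of rdfs:subClassOf (rdfs11).
    def apply_rules(ts):
        return {(a, "rdfs:subClassOf", c)
                for (a, p, b) in ts if p == "rdfs:subClassOf"
                for (b2, q, c) in ts if q == "rdfs:subClassOf" and b2 == b}

    print(fixpoint({("ex:A", "rdfs:subClassOf", "ex:B"),
                    ("ex:B", "rdfs:subClassOf", "ex:C")}, apply_rules))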

This problem was minimized by [36] through optimization of the order of rule application, leading to a reduction in the number of necessary iterations over the dataset [36]. [48] described Marvin, a platform that is used to process large RDF datasets in a distributed and parallel manner. The platform consists of machines that are coupled loosely based on the peer-to-peer approach. This platform was created in order to resolve the reasoning complexities associated with divide-and-conquer methods such as MapReduce. In particular, Marvin was developed in order to use logical reasoning to compute a dataset's deductive closure. The platform was also intended to solve the poor scalability of these methods, especially in light of the large datasets present in the Semantic Web [48].

The approach used in Marvin consists of a unique divide-and-conquer-swap method, an iterative method that yields outcomes that converge towards completeness with the passage of time. It also has anytime algorithms which generate accurate outcomes that manifest increased completeness over time. In addition, Marvin involves the use of parallel hardware with algorithms that are distributed and which make optimal use of the hardware notwithstanding its scale [48].


6) Interdependence between Map and Reduce Functions

Whereas the distributed nature of MapReduce enables the limitations imposed by physical hardware to be overcome, it also increases reasoning complexity and introduces more problems that affect the effectiveness of the MapReduce model. One of the main drawbacks of the distributed system is that the data is strongly correlated, thus making it impossible to split the input and causing interdependence between the nodes. The interdependence causes the nodes to communicate with each other, and this produces overhead, a factor that leads to a decline in performance. A solution for this is to partition the data [4, 59].

Interdependence is also a big problem in large scale parallelized computing. This problem was addressed through delay scheduling and copy-compute splitting and was observed to improve response times and throughput by up to 10 times. In this particular instance, a fair scheduler for MapReduce at Facebook was developed [60]. The distributed system is also associated with the load balancing problem, where the distribution of the workload between different nodes is often unequal, causing some nodes to work more than others, thereby diluting the benefits of parallelism [4].


7) Scalability Problem

One of the major problems affecting reasoning in the Semantic Web is the issue of scalability [36]. This problem has been compounded by the exponential increase in data and the size of the web. According to [6], the number of triples in the Semantic Web was estimated at 4.4 billion in 2009 and increased to 13 billion triples in 2010. Based on this rate of growth, scalability in reasoning is expected to become more and more problematic. Whereas distributed reasoning provides improved scalability, it is associated with challenges that centre on reasoning complexity, load balancing, and transfer of huge amounts of data [61]. Scalability is a particularly important aspect of the Semantic Web, since the latter lies atop the traditional web and is thus affected by the rapid increase in the amount of data on the internet [4].

[62] have extended the use of MapReduce to carry out reasoning in large scale fuzzy pD* semantics. The main objective of their work was to determine how MapReduce can be used to resolve the scalability problems associated with fuzzy OWL reasoning. Their work considers scalable reasoning on top of semantic data under fuzzy pD* semantics. A prototype system was developed, and evaluation of this system against WebPIE shows that they have comparable running times [62]. The problem of scalability has also been redressed through the MapReduce optimizations described previously in WebPIE and Marvin [48, 54].


8) Programming Rigidity

Whereas the MapReduce model enables the efficient distribution of different tasks, it is a rigid model that introduces inflexibility during programming. This confines the user only to mapping and reducing tasks and makes it virtually impossible to encode some operations, process only certain tuples, or resend output information to a mapper [52].


9) Cloud Computing Issues

[63] described a novel model known as Cloud MapReduce for managing cloud resources. The method involves implementation of MapReduce on top of a cloud operating system, thereby avoiding weaknesses such as poor scalability and the weaker consistency guarantee associated with clouds while maintaining optimal performance. This model has incremental stability, is heterogeneous, and is symmetrical and decentralized. Use of the cloud operating system enables the design and execution of MapReduce to be simplified greatly. They describe a fully distributed architecture for execution of the model. In this architecture, job tasks and global status are pulled by nodes so as to ascertain their distinct actions. Results are shuffled by queues from Map to Reduce and written by Mappers immediately they are available. Duplicate results, as well as results from failed nodes, are filtered by reducers (figure 6 below). However, full scale evaluation of the system is yet to be carried out [63].


Fig. 6: Architecture for the Cloud MapReduce model [63]

Other cloud implementations designed to solve the scalability problem include the Google File System (GFS), Dynamo, BigTable, and Dryad [65, 66, 67, 68].

However, Cloud MapReduce is associated with numerous disadvantages, such as the inability to effectively process streaming data, the sequential running of the Map and Reduce stages that results in augmented delays, lack of parallelization of the Map and Reduce stages, and compulsory batch processing. To solve this problem, [69] have proposed a scalable and lightweight pipelined MapReduce approach that not only supports streaming data but also utilizes pipelining between the 2 stages in Cloud MapReduce, thus enabling the output from the Map stage to be available to the Reduce stage immediately after its generation. This has led to minimization of delays and support for streaming data. [70] has proposed the improvement of fault tolerance and minimization of latency for Cloud MapReduce applications through provisioning redundant copies of tasks.


IV. CONCLUSION AND FUTURE WORK

This paper reviewed the state of the art with regard to reasoning in the Semantic Web using the MapReduce model. Problems associated with the MapReduce model include inefficient execution of data joins, intensiveness of data, parallelism in the execution of jobs, transfer of huge amounts of data leading to bandwidth or network saturation, and fixpoint iteration. Others are interdependence between map and reduce functions leading to suboptimal performance, poor scalability, programming rigidity, and problems associated with cloud computing. Various implementations have been made to overcome these problems, and these have managed to resolve some of the sticking issues, such as scalability, data intensiveness, interdependence between Map and Reduce tasks, and join execution.

In future, the following need to be done:

1. Full scale evaluation of the MapReduce implementations for managing cloud resources in order to assess their performance.

2. Cloud MapReduce implementations still have weaknesses such as latency, poor fault tolerance, compulsory batch processing, and poor parallelization. These need to be addressed to enhance the utility of MapReduce cloud applications.

3. Indexing should be enhanced to simplify the task of information retrieval and speed up data processing.

4. The rigidity of the MapReduce model has not yet been addressed. Future work needs to look at this problem so as to redress the inflexibility of the model.

5. Collection of statistics: the performance of MapReduce can be enhanced through the collection of statistics. Collected statistics can be used in subsequent repetitive tasks, thereby improving the performance of successive runs.

6. Fine-grained provenance tracking.

7. Parallelism in the execution of jobs is still a problem that needs to be solved, especially for the OWL schema, where there are virtually no MapReduce optimizations.

8. Sampling: this will enable the system to learn about the features of tasks prior to beginning a full-scale run. This can be done through random sampling of inputs. This feature can enable systems to project the runtime, determine if there is adequate space, or identify skews, hence helping to minimize problems of load balancing.


REFERENCES

1. T. Berners-Lee, J. Hendler, and O. Lassila. "The Semantic Web". Scientific American, 284(5):34-43, 2001.
2. N. Shadbolt, T. Berners-Lee, and W. Hall. "The Semantic Web Revisited". IEEE Intelligent Systems, 21(3):96-101, 2006.
3. S. Decker, S. Melnik, F. van Harmelen, D. Fensel, M.C.A. Klein, J. Broekstra, M. Erdmann, and I. Horrocks. "The Semantic Web: The roles of XML and RDF". IEEE Internet Computing, 4(5):63-74, 2000.
4. J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. "Scalable distributed reasoning using MapReduce". In Proceedings of ISWC '09, 2009.
5. G. Madhu, A. Govardhan, and T.V. Rajinikanth. "Intelligent Semantic Web Search Engines: A Brief Survey". International Journal of Web & Semantic Technology (IJWesT), Vol. 2, No. 1, January 2011.
6. J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. Bal. "WebPIE: A Web-scale parallel inference engine using MapReduce". 2011.
7. F. Baader, D. Calvanese, D.L. McGuinness, D. Nardi, and P.F. Patel-Schneider. The Description Logic Handbook. Cambridge University Press, New York, NY, USA, 2007.
8. U. Hustadt, B. Motik, and U. Sattler. "Reasoning in Description Logics by a Reduction to Disjunctive Datalog". Journal of Automated Reasoning, Volume 39, Issue 3, pp. 351-384, October 2007.
9. T. Wang, B. Parsia, and J. Hendler. "A survey of the web ontology landscape". In Proc. 5th Int. Semantic Web Conference, Athens, GA, USA, November 5-9, 2006, LNCS 4273.
10. O. Lassila. "Enabling Semantic Web Programming by Integrating RDF and Common Lisp". Proceedings of the First Semantic Web Working Symposium, Stanford University, 2001.
11. O. Lassila. "Taking the RDF Model Theory Out for a Spin". First International Semantic Web Conference (ISWC 2002), Sardinia, Italy, June 2002.
12. D. Brickley and R.V. Guha, editors. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, February 2004.
13. P. Hayes, editor. RDF Semantics. W3C Recommendation, February 2004.
14. D. Fensel, F. van Harmelen, I. Horrocks, D.L. McGuinness, and P.F. Patel-Schneider. "OIL: An ontology infrastructure for the Semantic Web". IEEE Intelligent Systems, 16(2):38-45, 2001.
15. I. Horrocks. "DAML+OIL: description logic for the Semantic Web". IEEE Data Engineering Bulletin, 25(1):4-9, 2002.
16. D. Robertson, G. Antoniou, and F. van Harmelen. "A Semantic Web Primer". The MIT Press, 2004.
17. H.J. ter Horst. "Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary". Journal of Web Semantics, 3 (2005), pp. 79-115.
18. B. Motik, B. Cuenca Grau, I. Horrocks, Z. Wu, A. Fokoue, and C. Lutz, editors. OWL 2 Web Ontology Language: Profiles. W3C Recommendation, 2009.
19. W3C OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C Recommendation, 2009.
20. Y. Zou, T. Finin, and H. Chen. "F-OWL: an Inference Engine for the Semantic Web".
21. J. Broekstra, A. Kampman, and F. van Harmelen. "Sesame: A generic architecture for storing and querying RDF and RDF Schema". In International Semantic Web Conference, pages 54-68, 2002.
22. A. Kiryakov, D. Ognyanov, and D. Manov. "OWLIM - a pragmatic semantic repository for OWL". In WISE Workshops, pages 182-192, 2005.
23. B. McBride. "Jena: A Semantic Web toolkit". IEEE Internet Computing, 6(6):55-59, 2002.
24. E. Sirin, B. Parsia, B.C. Grau, A. Kalyanpur, and Y. Katz. "Pellet: A practical OWL-DL reasoner". Web Semantics: Science, Services and Agents on the World Wide Web, 5(2):51-53, 2007.
25. D. Tsarkov and I. Horrocks. "FaCT++ description logic reasoner: System description". In IJCAR, pages 292-297, 2006.
26. A. Riazanov. "Implementing an Efficient Theorem Prover". PhD thesis, University of Manchester, 2003.
27. V. Haarslev and R. Moller. "RACER system description". Proceedings of the International Joint Conference on Automated Reasoning, Volume 2083, pages 701-705, Springer, 2001.
28. Y. Guo, Z. Pan, and J. Heflin. "LUBM: A benchmark for OWL knowledge base systems". Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):158-182, 2005. Selected papers from the International Semantic Web Conference (ISWC 2004).
29. L. Ma, Y. Yang, Z. Qiu, G.T. Xie, Y. Pan, and S. Liu. "Towards a complete OWL ontology benchmark". In European Semantic Web Conference, pages 125-139, 2006.
30. M. Schmidt, T. Hornung, N. Kuchlin, G. Lausen, and C. Pinkel. "An experimental comparison of RDF data management approaches in a SPARQL benchmark scenario". In International Semantic Web Conference, pages 82-97, 2008.
31. A. Hogan, A. Harth, and A. Polleres. "SAOR: Authoritative reasoning for the web". In Asian Semantic Web Conference, pages 76-90, 2008.
32. R. Soma and V.K. Prasanna. "Parallel inferencing for OWL knowledge bases". In Proceedings of the 2008 37th International Conference on Parallel Processing, pages 75-82, Washington, DC, USA, 2008. IEEE Computer Society.
33. P. Mika and G. Tummarello. "Web semantics in the clouds". IEEE Intelligent Systems, 23(5):82-87, 2008.
34. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing". In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110, New York, NY, USA, 2008. ACM.
35. A. Schlicht and H. Stuckenschmidt. "Distributed resolution for ALC". In Proceedings of the International Workshop on Description Logics, 2008.
36. J. Urbani, S. Kotoulas, J. Maassen, N. Drost, F. Seinstra, F. van Harmelen, and H. Bal. "WebPIE: a Web-scale Parallel Inference Engine".
37. M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. "Improving MapReduce performance in heterogeneous environments". In OSDI '08: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008.
38. S.W. Schlosser and S. Chen. "Map-Reduce Meets Wider Varieties of Applications". Intel Corporation, 2008.
39. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. "Evaluating MapReduce for multi-core and multiprocessor systems". In Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture, 2007.
40. J. Dean and S. Ghemawat. "MapReduce: Simplified data processing on large clusters". In Proceedings of the USENIX Symposium on Operating Systems Design & Implementation (OSDI), pp. 137-147, 2004.
41. T. Condie, N. Conway, P. Alvaro, and J.M. Hellerstein. "MapReduce online". In Proc. NSDI, 2010.
42. D. Borthakur. The Hadoop Distributed File System: Architecture and Design. 2007.
43. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox. "Twister: a runtime for iterative MapReduce". In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, New York, NY, USA, 2010.
44. The Large Knowledge Collider (LarKC). Investigation & Design for Rule-based Reasoning. 2010.
45. F.N. Afrati and J.D. Ullman. "Optimizing joins in a map-reduce environment". In EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology, pages 99-110, New York, NY, USA, 2010. ACM.
46. S. Kotoulas, E. Oren, and F. van Harmelen. "Mind the data skew: Distributed inferencing by speeddating in elastic regions". In Proceedings of the WWW, 2010.
47. Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. "Benchmarking MapReduce Implementations for Application Usage Scenarios". In Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on, 21-23 Sept. 2011, Lyon.
48. E. Oren, S. Kotoulas, G. Anadiotis, and R. Siebes. "Marvin: distributed reasoning over large-scale Semantic Web data". Journal of Web Semantics, 2009.
49. J. Weaver and J. Hendler. "Parallel materialization of the finite RDFS closure for hundreds of millions of triples". In 8th International Semantic Web Conference (ISWC 2009), 2009.
50. H. Yang, A. Dasdan, R. Hsiao, and D.S. Parker. "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, 2007.
51. N. Soule. "Efficient SPARQL Query Processing via Map-Reduce-Merge".
52. J. Urbani. RDFS/OWL reasoning using the MapReduce framework. Master's thesis, Vrije Universiteit, Amsterdam, 2009.
53. L. Wang, Z. Ni, Y. Zhang, Z.J. Wu, and L. Tang. "Pipelined-MapReduce: An improved MapReduce parallel programming model". In Proceedings of the 2011 4th International Conference on Intelligent Computation Technology and Automation, March 2011, pp. 871-874.
54. J. Urbani, J. Maassen, and H. Bal. "Massive Semantic Web data compression with MapReduce". HPDC '10, June 20-25, 2010, Chicago, Illinois, USA.
55. Z. Kaoudi, I. Miliaraki, and M. Koubarakis. "RDFS Reasoning and Query Answering on Top of DHTs". In A.P. Sheth et al., editors, Proceedings of the 7th International Semantic Web Conference, ISWC 2008, Karlsruhe, Germany, October 26-30, 2008, volume 5318 of Lecture Notes in Computer Science, pages 499-516. Springer, 2008.
56. Q. Fang, Y. Zhao, G. Yang, and W. Zheng. "Scalable distributed ontology reasoning using DHT-based partitioning". In Proceedings of the Asian Semantic Web Conference (ASWC), 2008.
57. R. Mutharaju, F. Maier, and P. Hitzler. "A MapReduce Algorithm for EL+". In Proc. 23rd Int. Workshop on Description Logics (DL2010), CEUR-WS 573, Waterloo, Canada, 2010.
58. Z. Wu, G. Eadon, S. Das, E.I. Chong, V. Kolovski, M. Annamalai, and J. Srinivasan. "Implementing an Inference Engine for RDFS/OWL Constructs and User-Defined Rules in Oracle". 2008.
59. D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. "Scalable semantic web data management using vertical partitioning". In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 411-422. VLDB Endowment, 2007.
60. M. Zaharia, D. Borthakur, J.S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. "Job Scheduling for Multi-User MapReduce Clusters". Technical Report No. UCB/EECS-2009-55, University of California at Berkeley, April 30, 2009, pp. 1-16.
61. J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. Bal. "OWL reasoning with WebPIE: calculating the closure of 100 billion triples". 2010.
62. C. Liu, G. Qi, H. Wang, and Y. Yu. "Large Scale Fuzzy pD* Reasoning Using MapReduce".
63. H. Liu and D. Orban. "Cloud MapReduce: a MapReduce Implementation on top of a Cloud Operating System". 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 464-474, 2011.
64. A. Newman, Y. Li, and J. Hunter. "Scalable semantics: the silver lining of cloud computing". In Proceedings of the 4th IEEE International Conference on eScience, 2008.
65. S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google file system". In 19th ACM Symposium on Operating Systems Principles, October 2003.
66. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. "Dynamo: Amazon's highly available key-value store". SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205-220, 2007.
67. F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, and D.A. Wallach. "Bigtable: A distributed storage system for structured data". In Proc. OSDI, 2006, pp. 205-218.
68. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. "Dryad: Distributed data-parallel programs from sequential building blocks". In European Conference on Computer Systems (EuroSys), March 2007.
69. R. Karve, D. Dahiphale, and A. Chhajer. "Optimizing Cloud MapReduce for Processing Stream Data Using Pipelining". In Computer Modeling and Simulation (EMS), 2011 Fifth UKSim European Symposium, 16-18 Nov. 2011.
70. Q. Zheng. "Improving MapReduce fault tolerance in the cloud". In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium, 19-23 April 2010.