Thesis Proposal: ONTOSEARCH2



Edward Thomas

Department of Computing Science, University of Aberdeen

Aberdeen, AB24 3UE

ethomas@csd.abdn.ac.uk



The problem we are addressing is performing queries across large OWL ontologies. Current approaches use Description Logic reasoners, which require a large amount of computation to calculate the entailments of a particular ontology and to query it; in the worst case this takes time exponential in the size of the ontology. Our approach is to map each ontology into a more tractable subset of OWL. This allows certain queries to be performed across these ontologies far more quickly than is possible using a full DL reasoner. We also propose a method for storing large numbers of these representations in a relational database, allowing queries to operate across multiple ontologies and removing the overhead of recreating the representations for each query execution.

Thesis

Conventional description logic reasoners are inefficient for querying large knowledge bases; the approach to ontology searching and querying devised here is significantly more efficient.


1. Introduction and Motivation

On the Semantic Web [BHL01], OWL [SWM03] has become the de facto standard for ontologies and semantic data definition.

A growing library of these ontologies is
available online, covering a very wide range of subjects.

Currently it is very difficult to find ontologies suitable for a particular purpose. Search engines like Swoogle allow ontologies to be searched for particular keywords, but further analysis or refinement of these searches is not readily possible. Because each ontology must have its entailments calculated in a reasoner before questions can be answered of it, this further analysis can be very time consuming.

In this document we explore a possible way of building such an index which will
allow deeper searches to be performed on the ontologies which make up the Semantic
Web.

Calvanese et al showed that a description logic they proposed, DL-Lite [C05], could capture basic ontology structures while still having a low reasoning overhead (worst case polynomial time, compared to worst case exponential time for most description logics).

In “A Survey of the Web Ontology Landscape” [WPH06], Wang et al show that the majority of OWL ontologies can easily be patched to become OWL DL ontologies without changing the semantics of the ontology. They further show that a large proportion of these ontologies exceed the expressivity of DL-Lite.

By using DL-Lite and mapping OWL ontologies to a simpler DL-Lite representation, we will show that querying both the structure and the instances (the TBox and the ABox) of ontologies on a large scale is practical. This will allow other semantic web applications to query across millions of instances to find information, and will give knowledge engineers access to a huge variety of prebuilt ontologies and classes which they can reuse in new applications.

2. Other Approaches

ONTOSEARCH [ZVS04] is the predecessor to the ONTOSEARCH2 system, which will be based around the DL-Lite systems described in this proposal. ONTOSEARCH was developed to allow simple keyword based querying of ontologies by passing keywords to Google, packaged in such a way as to only return ontological data in RDF. This was supported by functions to display the context of matching terms in the ontology, and a simple visualization of the structure of the ontologies found.

Swoogle [D04] is a Semantic Web search engine developed at the University of Maryland. Swoogle crawls all types of Semantic Web Documents (SWDs); these documents are indexed and stored in a triple store database. Swoogle allows this database to be queried using a simple keyword based interface: all ontologies which match the keywords are returned to the user, with additional contextual descriptions providing information from linked SWDs. The interface to Swoogle limits its usefulness as a query tool. Only simple keyword based searches are possible, and additional work must be performed if one requires ontologies which contain certain structures or if instance data for a particular class is required.

OntoKhoj [P03] is a system developed by the University of Missouri. It crawls the Web searching for ontologies, which it aggregates and classifies using an intelligent algorithm trained on the DMOZ database; the latter contains a large number of websites, sorted into human classified categories. Its search rankings are performed using an algorithm influenced by the PageRank algorithm developed by Google. The ontologies are ranked using a calculated weighting based on the number of hyperlinked references to the ontology from other Semantic Web Documents. These are prioritised by the type of relationship: instantiation, sub-class and domain/range. OntoKhoj suffers the same drawback for ontology searching as Swoogle, in that only keyword based searching is possible. OntoKhoj differs from ONTOSEARCH and Swoogle in that it only allows searching of ontologies, not of other Semantic Web documents which reference ontologies.


3. Related Research

3.1 Tractable Description Logics

DL-Lite [C05] is a tractable description logic proposed by Diego Calvanese at the University of Bolzano. It has a limited expressivity which reduces the complexity of queries to worst case polynomial time with respect to the size of the ABox. By converting the concepts and roles found in the ABox to tables in a relational database and populating these tables with the instance data found, the problem of query answering across the description logic is reduced to expansion of the query and query answering across a database.
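To make this mapping concrete, the following is a minimal sketch (the concept, role and table names are hypothetical and not taken from the ONTOSEARCH2 schema) of how a concept and a role become relational tables, and of the plain SQL join that instance retrieval then reduces to:

-- one unary table per concept, one binary table per role
create table tabStudent  (id varchar(255))                       -- a row per a with Student(a) in the ABox
create table tabHasTutor (domain varchar(255), range varchar(255)) -- a row per pair (a,b) with HasTutor(a,b) in the ABox

-- "find every student together with their tutor" becomes a simple join
select tabStudent.id, tabHasTutor.range
from tabStudent, tabHasTutor
where tabStudent.id = tabHasTutor.domain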

We use DL-Lite to provide the framework for ABox querying and storage in ONTOSEARCH2 and have extended the algorithms and language to also allow storage of multiple TBoxes, to handle searching multiple ontologies.

3.2 Ontology Metamodelling

RDFS(FA) (Resource Description Framework Schema Fixed layer metamodelling Architecture) [PH02,PH03] and OWL(FA) (Web Ontology Language Fixed layer metamodelling Architecture) [PHS05] are fixed layered meta-modelling architectures for RDFS and OWL. These try to remove the confusion caused by the dual roles given to certain structures in both OWL and RDFS by creating new modelling primitives to describe classes and properties at the object, language, and ontology layers. In this model, a class at the ontology layer is an instance of a class at the language layer, which in turn is an instance of a class at the meta-language layer. We have adopted a similar model to provide the meta-modelling capabilities used in ONTOSEARCH2 for describing TBox information in terms that can be queried as ABox information against a meta-TBox which describes the top level language structure of DL-Lite.

4. Approach

Our approach to ontology searching is to reduce the complexity of the ontology to a DL-Lite representation. It has been shown that querying a DL-Lite knowledge base has, at worst, polynomial time complexity, whereas searching and querying a more expressive description logic such as OWL DL has a worst case complexity of exponential time [C04]. In addition to this, the DL-Lite model allows us to use a relational database to store ABox information, giving access to a large number of mature systems for storing and querying this underlying model.

The system manages large numbers of ontologies by first describing the structure of a DL-Lite TBox in terms of meta-concepts and meta-roles. This allows us to express the TBox of an ontology as an ABox created against this structure. By extending the DL-Lite Normalise(KB) algorithm to include this step, we can reuse the ABox query methods to effectively query TBox data (see Appendix A for more information).


4.1 Query Answering

ABox queries within a DL-Lite knowledge base are made in a conjunctive query format. These queries are expanded using the PerfectRef algorithm, which is part of the DL-Lite specification [C05]. The expanded queries can be mapped directly to one or more SQL queries which are run directly on the underlying database. In ONTOSEARCH2 we have extended this query format with two additional features. The first allows querying of datatype properties of OWL ontologies, which are stored as additional columns in each concept's table within the database, and is described in Appendix B. The second allows role and concept names to be bound to variables, allowing us to perform queries which run across the ABox and TBox of multiple ontologies simultaneously (albeit at the cost of increased query complexity; this is therefore intended for standalone installations of ONTOSEARCH2), and is described in Appendix C.

Queries are made in SPARQL. These are first processed by a modified version of the parser from the ARQ query engine in Jena, and the resulting structures are used to build a DL-Lite conjunctive query. This is processed using the modified PerfectRef algorithm to produce SQL queries which are executed on the database. The results of these queries are returned to the user.
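As a hedged illustration of this rewriting (the TBox axiom, class names and table names below are hypothetical), a SPARQL query asking for all students would be turned into a conjunctive query, expanded by PerfectRef against a TBox in which GraduateStudent is subsumed by Student, and finally emitted as a union of SQL queries over the concept tables:

-- SPARQL input:       SELECT ?x WHERE { ?x rdf:type univ:Student }
-- PerfectRef output:  q1(?x) <- Student(?x)   and   q2(?x) <- GraduateStudent(?x)
select id from tabStudent
union
select id from tabGraduateStudent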

ONTOSEARCH2 currently uses the Pellet description logic reasoner (http://www.mindswap.org/2003/pellet/) to perform validation and classification services, and to calculate some entailments on OWL-DL ontologies before they are added to the ONTOSEARCH2 repository. Entailed properties of ontologies which are defined by OWL constructs like TransitiveProperty or InverseProperty are made explicit, to reduce the complexity of the ontology to structures that are supported by DL-Lite. The ontology is then mapped into a DL-Lite representation and is manipulated by the normalisation and consistency checking algorithms, implemented as Jena rulesets, before being passed to the database.

4.2 Search

We propose a general purpose search interface, where simple keyword based searches can be combined with search parameters to limit the searches to particular namespaces, to instances of particular classes, or to values of a particular datatype property.

The search will be made on a full-text index of the ontologies in the repository, keyed by namespace, class and property. This will be combined with queries on the TBox portion of the repository to expand classes and properties to incorporate all possible subsumptions. We will make it possible to incorporate ranking algorithms (to order search results) as part of the future work, for example the AKTive Rank algorithm [AB05].
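A minimal sketch of how such a search might be evaluated, assuming a hypothetical fulltext table (uri, text) holding the indexed labels and the includes table used elsewhere in this proposal; the actual index structure may differ:

-- find classes whose indexed text matches the keyword, then use the
-- TBox tables to also return their subclasses
select fulltext.uri as matched_class
from fulltext
where fulltext.text like '%wine%'
union
select includes.domain
from fulltext, includes
where fulltext.text like '%wine%'
and includes.range = fulltext.uri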





4.3 Infrastructure/Environment


[Figure 1: ONTOSEARCH2 system architecture. The diagram connects the SPARQL parser, OWL DL reasoner, rule engine processes (approximation, normalisation, consistency checking), PerfectRef, SQL builders and results formatter with the TBox and ABox tables held in the RDBMS.]


Figure 1 shows the system architecture of ONTOSEARCH2. The elements shown in the centre of the diagram are the rulesets which mediate between the standards used on the Semantic Web for ontologies and querying (OWL and SPARQL respectively) and the DL-Lite derived methods used by ONTOSEARCH2 to store and query the ontologies. The rule engine converts ontologies from an entailed OWL model into the internal DL-Lite representation, and converts SPARQL queries into DL-Lite conjunctive queries which can be expanded and converted to SQL queries to run on the database. Currently there is a web interface that allows queries to be made using a web browser, returning results formatted as HTML.

The infrastructure of ONTOSEARCH2 will allow queries to be made through two interfaces. There will be a web interface which will allow free text Google style queries, as well as more complex structured queries, and which will present results in an easily readable format. There will also be a web service interface with an API which allows direct integration with other Semantic Web applications, allowing SPARQL queries and returning an RDF-based results list.

In addition to providing a web platform for searching the semantic web, we will also provide a version of the ONTOSEARCH2 platform for installation on local machines or networks. Both the web service API and the web search tools will be included, to allow very simple deployment across networks. In addition to tools for easy querying of ontologies, we will provide functionality allowing new data to be added to the repository in real time, through the same API.

Potential Applications

Web based system:

By providing formal query tools for a good proportion of ontologies and instance
data on the semantic web, we can allow applications to access a huge pool of data.


Local system:

The local system will be enhanced with the ability to make arbitrary SPARQL queries by implementing the extension to SPARQL described in Appendix C. In situations where one or more very large ontologies have to be queried and updated regularly, ONTOSEARCH2 can be used to provide a general purpose framework for storing, updating and querying these ontologies.

5. Deliverables

5.1 Software/Algorithms

The ONTOSEARCH2 system will be completed so that it allows arbitrary queries to be made across ontologies stored in a relational database. The software will have the following functions:




Ability to query over many ontologies

Given the performance of the underlying database technology, the ONTOSEARCH2 system should be able to support queries across several thousand ontologies and return results within 10 seconds for most queries using low end server hardware. The system should be scalable across larger capacity servers, or using database clustering technology.




SPARQL query interface

In addition to the conjunctive query syntax developed for DL-Lite and extended for ONTOSEARCH2, the system will also be able to execute SPARQL queries on any ontological data that has been captured.




OWL DL compatibility

The system will be able to import OWL-Lite and OWL-DL ontologies. We will show the extent to which queries on OWL-DL ontologies produce correct and complete results, and will tune the import algorithms so that the results produced are always correct with respect to the original ontology.




Semantic Imports

In order to provide consistency when querying across multiple ontologies, we will use semantic imports to maintain separation between ontologies and results.



Web Service and HTML interfaces

A web service interface designed to support SPARQL queries from other semantic web applications will be provided. An HTML based interface to support human queries will also have a simplified query language available. We will evaluate several natural language query tools designed to work with other semantic query systems to see if this would be a suitable application for natural language queries.

5.2 Publications

In addition to the paper submitted to the IADIS WWW 2006 conference, four publications are planned to cover the development of ONTOSEARCH2.




Journal of Data Semantics
Submission date: 28th October 2006

We will expand the paper already written for the IADIS WWW/Internet 2006 conference into a full journal article. The publishing date is September 2007.




SWAP 2007
Submission date: 1st October 2006

A publication is planned which will announce the new features present in the alpha release of ONTOSEARCH2 in order to gain feedback on the proposed applications. We will publish the extensions to the DL-Lite algorithms and methods which we have implemented.




ISWC/ESWC 2007

We will introduce ONTOSEARCH2 with an overview paper at either the ESWC or the ISWC conference in 2007. This will cover the full functionality of ONTOSEARCH2 and will garner wider interest in the project. We will publish the details of the SPARQL query interface and the results of the evaluation of OWL DL correctness and completeness at this conference.




AAAI 2007

We will use the AAAI conference to publish details of the combined
ABox and TBox querying approach used within ONTOSEARCH2,
presenting a technical view of the algorithms used.

6. Evaluation

The performance of this new system will be quantitatively evaluated against “traditional” storage and query systems for ontologies; a set of tests will check the recall and precision of queries as well as the response times for several different ontologies. We have performed some initial benchmarking of the system using the LUBM benchmark [GPH05], which has given some promising results; the full results of this exercise are included in Appendix D.

The key metrics on which performance will be evaluated are the speed and accuracy of the ONTOSEARCH2 system over large numbers of ontologies, compared to traditional ontology reasoners such as Racer, FaCT and Pellet, as well as other knowledge base management systems such as 3Store, DLDB and Sesame.




Test of accuracy of SPARQL query system

By using the test data and queries made available by the World Wide Web Consortium's Data Access Working Group (http://www.w3.org/2001/sw/DataAccess/tests/), we can test the accuracy of the SPARQL query system using a standard test suite.




Evaluation of performance against other systems using the LUBM benchmark

We will re-run the benchmarks performed in [GPH04] so that we can get a direct comparison between the latest revisions of the software tested by Guo et al and ONTOSEARCH2. This will show us the performance of the query answering facility over large datasets.




Evaluation of performance against a very large data set

By using web crawling techniques, we will increase the size of the ONTOSEARCH2 repository until it is at the limit of what a single machine's disk space can handle; this will allow us to test the responsiveness of the system using complex queries on a very large data set. Additional tests will also measure the comparative performance when multiple queries are submitted from different clients at the same time.

7. Work done so far

So far, we have created a DL-Lite reasoning engine in Java using Jena and RDF as the basic tools. We have performed mappings from OWL ontologies to DL-Lite representations using Jena rules and have been able to perform queries on both the structure and the data stored in the ontology.

This has resulted in a usable beta version of the ONTOSEARCH2 system, which
has been available for use on the web since August 2006.

As part of an evaluation of the ONTOSEARCH2 system, we have run the LUBM benchmark to test its performance on data sets representing 1, 5, 10, 20 and 50 universities' RDF data. The full results of this benchmarking are included in Appendix D. The results showed that while ONTOSEARCH2 has a slightly higher overhead per query (as the query is converted and expanded into a DL-Lite conjunctive query), its performance when querying large data sets exceeds that of other comparable systems (see [GPH04]). As part of a wider evaluation, we will redo this exercise against a wider range of systems and using a larger data set.

8. Workplan

We will add the functionality described above in a staged approach, to ensure that we can provide a stable service for users of the system. There will be an ongoing program to expand the repository of ontologies, as well as the following milestones in the development process:








Add Web Service interface for querying ONTOSEARCH2

To help integrate other semantic web applications, we will create a web service API to allow direct querying of the repository using SPARQL.

Estimated completion date: October 2006




Improve reasoning capabilities

While performing a benchmark exercise (see Appendix D) on ONTOSEARCH2, we identified constructs present in OWL DL which cause incomplete results to be returned for some queries. These will be addressed so that accurate results are returned for any possible query.

Estimated completion date: November 2006




Create a simple search facility

To help users find potential candidate ontologies for further query, and to make the system more approachable, we will add a keyword based search interface which will search a full text index of the ontology repository.

Estimated completion date: November 2006




Complete full evaluation

This will be published as part of a conference paper, and will contain a full comparison of the facilities and performance of ONTOSEARCH2 against comparable systems.

Estimated completion date: December 2006




Improve results presentation

To make it easier for users to evaluate results, we will expand the user interface for the web based version of ONTOSEARCH2 (and add supplementary functions to the Web Service API) to show the provenance of returned entities, and add supplementary information on how matches were achieved. We are investigating using an AJAX system, with the web browser as a direct client for the Web Service interface, to reduce maintenance and improve usability.

Estimated completion date: February 2007




Create arbitrary query engine

As part of the standalone version of ONTOSEARCH2, we will implement the arbitrary query facility (Appendix C). This will allow ONTOSEARCH2 to be used when more complex queries need to be made, without having to run a backup reasoner.

Estimated completion date: April 2007

References

[AB05] Alani H, Brewster C, 2005. Ontology ranking based on the analysis of concept structures. In Proceedings of the 3rd International Conference on Knowledge Capture, Banff, Alberta, Canada, pp. 51-58.

[BHL01] Berners-Lee T, Hendler J, and Lassila O, 2001. The Semantic Web. Scientific American, 284(5), pp. 34-43.

[C04] Calvanese D et al, 2004. DL-Lite: Practical Reasoning for Rich DLs. In Proceedings of the 2004 Description Logic Workshop, Whistler, British Columbia, Canada.

[C05] Calvanese D et al, 2005. DL-Lite: Tractable Description Logics for Ontologies. In Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, Pennsylvania, USA, pp. 602-607.

[D04] Ding L et al, 2004. Swoogle: A Search and Metadata Engine for the Semantic Web. In Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management.

[GPH04] Guo Y, Pan Z, and Heflin J, 2004. An Evaluation of Knowledge Base Systems for Large OWL Datasets. Third International Semantic Web Conference, Hiroshima, Japan, LNCS 3298, pp. 274-288.

[GPH05] Guo Y, Pan Z, and Heflin J, 2005. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics, 3(2), pp. 158-182.

[PH02] Pan J, and Horrocks I, 2002. Metamodeling architecture of web ontology languages. In The Emerging Semantic Web, Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam (NL).

[PH03] Pan J, and Horrocks I, 2003. RDFS(FA) and RDF MT: Two semantics for RDFS. In Dieter Fensel, Katia Sycara, and John Mylopoulos, editors, Proc. of the 2003 International Semantic Web Conference (ISWC 2003), number 2870 in Lecture Notes in Computer Science, pages 30-46. Springer.

[PHS05] Pan J, Horrocks I, and Schreiber G, 2005. OWL FA: A Metamodeling Extension of OWL DL. In Proc. of the International Workshop on OWL: Experience and Directions (OWL-ED2005).

[P03] Patel C et al, 2003. OntoKhoj: A semantic web portal for ontology searching, ranking, and classification. In Proc. 5th ACM Int. Workshop on Web Information and Data Management, New Orleans, Louisiana, USA, pp. 58-61.

[SWM03] Smith M, Welty C, and McGuinness D, 2003. OWL Web Ontology Language Guide. http://www.w3.org/TR/owl-guide.

[WPH06] Wang T, Parsia B, Hendler J, 2006. A Survey of the Web Ontology Landscape. In Proceedings of the International Semantic Web Conference, Athens, Georgia, USA.

[ZVS04] Zhang Y, Vasconcelos W, and Sleeman D, 2004. OntoSearch: An Ontology Search Engine. Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK.

Appendix A: Representing TBox data as ABox data

To allow TBox data to be queried using the same methods as ABox data, we convert the TBox into an ABox, modelled against a meta-TBox. The meta-TBox which describes the structure of all DL-Lite TBoxes is shown below:


AtomicConcept ⊑ BasicConcept
BasicConcept ⊑ GeneralConcept
Negation ⊑ GeneralConcept
ExistentialQualification ⊑ BasicConcept
Existential ⊑ ExistentialQualification
InverseExistential ⊑ ExistentialQualification
FunctionalRole ⊑ Role
InverseFunctionalRole ⊑ Role
∃domain ⊑ Role
∃domain⁻ ⊑ BasicConcept
∃range ⊑ Role
∃range⁻ ⊑ BasicConcept
∃negationOf ⊑ Disjoint
∃negationOf⁻ ⊑ BasicConcept
∃subsumes ⊑ GeneralConcept
∃subsumes⁻ ⊑ BasicConcept
∃hasRole ⊑ ExistentialQualification
∃hasRole⁻ ⊑ Role


The algorithm used to make the conversion between a TBox and this ABox representation is given below:


Given a normalised DL-Lite knowledge base K(T, A)
Given a representation of the DL-Lite root structures and roles Tm

Create a new ABox AT against Tm

For each atomic concept A1 in T
    assert AtomicConcept(A1) into AT
For each basic concept B1 in T
    assert BasicConcept(B1) into AT
For each role R in T
    assert Role(R) into AT
For each assertion A1 ⊑ A2 in T
    assert includes(A1, A2) into AT
For each assertion A1 ⊑ ∃R in T
    create new concept _T
    assert Existential(_T)
    assert hasRole(_T, R)
    assert includes(_T, A1)
For each assertion A1 ⊑ ∃R⁻ in T
    create new concept _T
    assert InverseExistential(_T)
    assert hasRole(_T, R)
    assert includes(_T, A1)
For each assertion ∃R ⊑ C1 in T
    assert domain(R, C1) into AT
For each assertion ∃R⁻ ⊑ C2 in T
    assert range(R, C2) into AT
For each assertion B1 ⊑ ¬B2 in T
    create new concept _T
    assert Disjoint(_T) into AT
    assert negationOf(_T, B2) into AT
    assert includes(_T, B1) into AT
For each functionality assertion (funct R) in T
    assert FunctionalRole(R) into AT
For each functionality assertion (funct R⁻) in T
    assert InverseFunctionalRole(R) into AT


The new ABox AT can then be stored in the RDBMS alongside the original knowledge base. Given the normalised TBox Tn given in the example above, we create the following ABox AT with respect to Tm, shown below.


AtomicConcept(Student)
AtomicConcept(Professor)
Role(TeachesTo)
Role(HasTutor)
FunctionalRole(HasTutor)
domain(HasTutor, Student)
range(HasTutor, Professor)
domain(TeachesTo, Professor)
range(TeachesTo, Student)
Disjoint(_T1)
negationOf(_T1, Student)
subsumes(_T1, Professor)
Existential(_T2)
hasRole(_T2, TeachesTo)
subsumes(_T2, Professor)
Existential(_T3)
hasRole(_T3, HasTutor)
subsumes(_T3, Student)


Note that a new concept _T is created in order to represent the disjoint inclusion in
the TBox. This is analogous to a BNode or anonymous class in OWL.


Example Query: Find all named classes which are subclasses of foaf:Person which have a property which refers to a pop3:MBox class:

q(?x) ← includes(?x, foaf:Person), domain(?y, ?x), range(?y, pop3:MBox)

Application of the PerfectRef algorithm gives:

q1(?x) ← includes(?x, foaf:Person), domain(?y, ?x), range(?y, pop3:MBox)
q2(?x) ← includes(?x, foaf:Person), BasicConcept(?x), range(?y, pop3:MBox)
q3(?x) ← includes(?x, foaf:Person), Role(?y), range(?y, pop3:MBox)
q4(?x) ← BasicConcept(?x), Role(?y), range(?y, pop3:MBox)
q5(?x) ← GeneralConcept(?x), Role(?y), range(?y, pop3:MBox)


This can be translated to an SQL query such as:

select AtomicConcept.id
from AtomicConcept, includes, Role, domain, range
where includes.domain = AtomicConcept.id
and includes.range = 'foaf:Person'
and Role.id = domain.domain
and domain.range = AtomicConcept.id
and range.domain = Role.id
and range.range = 'pop3:MBox'
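The remaining expansions q2 to q5 are translated in the same way, and the overall answer is the union of the individual result sets. As a hedged sketch, q4, whose atoms touch only the meta-level tables, might become:

-- q4(?x) <- BasicConcept(?x), Role(?y), range(?y, pop3:MBox):
-- every basic concept qualifies, provided some role with range pop3:MBox exists
select BasicConcept.id
from BasicConcept, Role, range
where Role.id = range.domain
and range.range = 'pop3:MBox'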

Appendix B: Storing and Querying Datatype Properties

The standard DL-Lite model does not allow for datatype properties of concepts. These are properties of an instance which exist as literal data within the ontology; examples include a person's name, phone number, and date of birth. To allow users of ONTOSEARCH2 to search this data, we had to expand the DL-Lite algorithms which store and query ontologies.


The rules for storing concepts are:

For each basic concept B occurring in A, we define a table tabB of arity 1, such that ⟨a⟩ ∈ tabB iff B(a) ∈ A

For each role R occurring in A, we define a relational table tabR of arity 2, such that ⟨a, b⟩ ∈ tabR iff R(a, b) ∈ A

To this we add the following rule:

For each datatype property P on concept B, we create a column colP on tabB


When we populate the new table with the instances of concept B, we include the
value of the datatype property
P in the appropriate column.
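As a concrete (hypothetical) illustration of this rule, a concept Person with datatype properties name and phone would be stored as a single table whose extra columns carry the literal values:

-- the unary concept table gains one column per datatype property
create table tabPerson (
    id       varchar(255),   -- the instance, as in the standard DL-Lite scheme
    colName  varchar(255),   -- value of the datatype property "name"
    colPhone varchar(255)    -- value of the datatype property "phone"
)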


When querying this model, we can specify a datatype property as a type of role, with a literal value, for example:

q(?x, ?y) ← name(?x, “John Smith”), worksFor(?x, ?y)


The query engine will see that two role atoms have been included in the query and will perform sub-queries to see if each role is of type ObjectProperty or DatatypeProperty in the TBox of the repository. In the case of a DatatypeProperty, this is translated into a constraint in the query which restricts results to instances of the concepts which have “John Smith” in the name column.
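Under the hypothetical table layout sketched above (a tabPerson table with a colName column, and a tabWorksFor table for the object property), the example query would therefore translate into SQL along these lines:

-- name is recognised as a DatatypeProperty and becomes a column constraint;
-- worksFor is an ObjectProperty and keeps its own binary table
select tabPerson.id, tabWorksFor.range
from tabPerson, tabWorksFor
where tabPerson.colName = 'John Smith'
and tabWorksFor.domain = tabPerson.id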

Appendix C: Performing compound queries across ABox and TBox data

Consider a query such as: “Find all instances of the classes which are subclasses of vin:Wine and have a property vin:hasVintage, where the instance has a value of vin:vintage1998 for this property”. This requires the query to examine the TBox to find a list of suitable classes, and then to extract the instance data from these classes and examine it to get the relevant information.

This requires that the query format is changed from that proposed by Calvanese et al. Allowing a variable to be used as a concept or role name gives the flexibility required; the query is then executed in two steps. The query is split into two queries: the atom which contains the query on the concept name, together with any atoms which share variables with it (the dependent atoms), is built into a sub-query which is executed for each result returned by the first query. We can now formulate our query as follows:


q(?x) ← includes(vin:Wine, ?z), domain(?y, ?z), range(?y, vin:hasVintage), ?z(?x), vin:hasVintage(?x, vin:vintage1998)

This query is rewritten as two queries:

q(?z) ← includes(vin:Wine, ?z), domain(?y, ?z), range(?y, vin:hasVintage)

foreach (q(?z) as _CANDIDATE_) {
    q(?x) ← _CANDIDATE_(?x), vin:hasVintage(?x, vin:vintage1998)
}


And can be translated as the following SQL queries:

select includes.range as z
from includes, domain, range
where includes.domain = 'vin:Wine'
and includes.range = domain.range
and domain.domain = range.domain
and range.range = 'vin:hasVintage'

The results of this query are then bound to the variable ?CANDIDATE in the following query, which is executed once for each result returned:

select ?CANDIDATE.id as x
from ?CANDIDATE, vin:hasVintage
where ?CANDIDATE.id = vin:hasVintage.domain
and vin:hasVintage.range = 'vin:vintage1998'
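For illustration, if the first query returned the candidate class vin:Zinfandel (an illustrative class name only), the per-candidate sub-query would be instantiated against that class's concept table as:

-- one instantiation of the sub-query, with ?CANDIDATE bound to vin:Zinfandel
select vin:Zinfandel.id as x
from vin:Zinfandel, vin:hasVintage
where vin:Zinfandel.id = vin:hasVintage.domain
and vin:hasVintage.range = 'vin:vintage1998'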



Appendix D: Full results of LUBM benchmark exercise

We have evaluated the ONTOSEARCH2 system using the Lehigh University Benchmark (LUBM) [GPH05] to measure its performance on large data sets. We have run benchmarks using generated data sets representing 1, 5, 10, 20 and 50 universities; these are generated using the same seed and index values as used in [GPH04], so we can directly compare the performance of the ONTOSEARCH2 system against the systems tested there. The smallest dataset (0,1) contains approximately 136,000 triples in 14 RDF files, and the largest dataset (0,50) contains approximately 6,800,000 triples in 999 RDF files.

The test machine specification is listed below in Table 1. The Java platform used was JDK 1.5.0; we used Pellet 1.3 as the OWL DL reasoner and PostgreSQL 8.1 as the RDBMS. These were set up with default installations; no additional configuration was performed.


Model:      Dell Precision 370
CPU:        Intel Pentium 4, 3.0 GHz
RAM:        1024 MB
Hard Disc:  80 GB (ATA)

Table 1: Test machine specification


The data set was loaded separately for each set of queries, and the database was cleared between each data set. After loading the data, the command “VACUUM FULL ANALYSE” was executed on the database to remove redundant data and update the system catalogue with accurate statistics about each table, to allow the queries to be executed in as efficient a manner as possible. The timings recorded were the total time taken for the query to be parsed from SPARQL, expanded, converted to SQL, sent to the database, and the results retrieved. The time taken for the results to be sent back to the web browser of the test machine is not included.

The results obtained are shown in Table 2. The columns represent the time taken to execute the query for each data set (T), the precision of the results (P, where 1.00 means every result returned is a valid result for the query) and the recall of the results (R, where 1.00 means every valid result for the query is returned). The precision and recall figures are calculated for the results of the query on the first dataset only, as reference result sets for the other datasets are not currently available.


Query   T[0,1] (ms)   T[0,5] (ms)   T[0,10] (ms)   T[0,20] (ms)   T[0,50] (ms)   P[0,1]   R[0,1]
Q1          156           185            435            597             921       1.00     1.00
Q2          220           301            503           2013            9250       1.00     1.00
Q3          172           224            375            609            1468       1.00     1.00
Q4           83           102            231            309             399       1.00     1.00
Q5           96           142            373            504            1313       1.00     1.00
Q6          204           263           1364           2120            9875       1.00     1.00
Q7          108           165            353            879            1781       1.00     1.00
Q8          166           264            738           1432            2201       1.00     1.00
Q9          820          1224           2193           8720          201351       1.00     1.00
Q10         741          1405           2932           6339           15745       1.00     1.00
Q11         232           294            823           1568            2567       0.00     0.00
Q12         189           216            302            499             781       1.00     1.00
Q13          67            65             72             76              78       1.00     1.00
Q14         236           302            398            460             822       1.00     1.00

Table 2: Results of the Lehigh University Benchmark queries against different data sets


We see from these results that, with one exception, the results obtained are returned to the user within 20 seconds. The more promising results are given for dataset (0,50) where, excluding Q9, the lowest time is 0.078 seconds and the highest is 15.745 seconds (average 3.630 seconds). Query 9 requires three of the largest tables to be joined, which caused a considerable amount of disk access on the database. We will try to mitigate this in future revisions of the ONTOSEARCH2 software by tuning the database and making more RAM available for index caching.

The precision and recall figures show perfect precision and recall for all queries except for Q11. This is described in the benchmark as follows (http://swat.cse.lehigh.edu/projects/lubm/query.htm):

    Property hasAlumnus is defined in the benchmark ontology as the inverse of property degreeFrom, which has three subproperties: undergraduateDegreeFrom, mastersDegreeFrom, and doctoralDegreeFrom. The benchmark data state a person as an alumnus of a university using one of these three subproperties instead of hasAlumnus. Therefore, this query assumes subPropertyOf relationships between degreeFrom and its subproperties, and also requires inference about inverseOf.

ONTOSEARCH2 fails Q11 as it does not deal with subPropertyOf axioms in its current approximation; this will be solved in our next version of ONTOSEARCH2.


[Figure 3: Performance of ONTOSEARCH2 against DLDB as data size increases. Log-scale plot of elapsed time (ms) against dataset size, from (0,1) to (0,50), for queries Q1, Q7, Q13 and Q14 run on ONTOSEARCH2 and on DLDB.]


Figure 3 shows how the performance of ONTOSEARCH2 changes with four different queries from the LUBM query set. As the size of the data set grows, the time required to complete each query grows, but this growth is linear when compared to the increase in size of the data set. This is partially due to the overhead of parsing the query and expanding it using the PerfectRef algorithm, which remains constant no matter the size of the DL-Lite ABox being queried, and partially due to the way that PostgreSQL uses indexes, which are increasingly efficient for larger data sets. Also included in this chart are the results of the DLDB-OWL system, taken from [GPH04]. This is the only knowledge base management system from the comparison performed by Guo et al which was able to complete the queries with the largest data set. While the two benchmarks were performed at different times and on different machines, we can draw certain conclusions from the results obtained. Although in some cases DLDB was able to return results for smaller datasets more quickly than the ONTOSEARCH2 system, the time taken to return results for the larger datasets grows far more quickly with DLDB than with ONTOSEARCH2.