Balancing Expressivity and Implementability in OWL Ontologies for Semantic Data Frameworks: The Journey from 2004 to 2009 and Beyond


Nov 15, 2013 (3 years and 7 months ago)


Peter Fox

Tetherless World Constellation, RPI

Australia Ontology Workshop 2009

Outline

The origins of this effort

Why a framework and not a system?

Semantics in 2004

The design and development methods

Ontologies and the software and production!

Semantics between 2004 and 2009

Discussion of the expressivity and implementability balance, and one more …

Since it is almost 2010 … what we are up to

Background

Scientists should be able to access a global, distributed knowledge base of scientific data that:

  appears to be integrated
  appears to be locally available

But… data is obtained by multiple instruments, using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) metadata. It may be inconsistent, incomplete, evolving, and distributed.

And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…


Origins

In 2000-2001 the need for capturing and preserving knowledge in science data became very clear, but the barriers were high

In 2004 we started a virtual observatory project based on semantic technologies

Use case driven: in solar and solar-terrestrial physics, with an emphasis on instrument-based measurements and real data pipelines; we needed implementations

We knew we also needed integration and provenance (but that came later)

We aimed to push semantics into our systems to build new ‘prototypes’ but we ‘failed’ ;-)

In 2004

OWL was a W3C Recommendation!!

Protégé 2.x and the Protégé-Java-OWL API

SWOOP was a viable editor

Jena and the Jena API were in good shape

Pellet worked

SPARQL was still a twinkle in the RDF working group’s eye

Semantics were still the realm of computer scientists; luckily we had one of the best

Frameworks vs. Systems

Prior to 2005, we built systems

Rough definitions:

  Systems have very well-defined entry and exit points. A user tends to know when they are using one. Options for extensions are limited and usually require engineering
  Frameworks have many entry and use points. A user often does not know when they are using one. Extension points are part of the design

You don’t have to agree; this was our view

Ontology Spectrum

[Figure: the ontology spectrum, in order of increasing expressivity: Catalog/ID; Terms/glossary; Thesauri “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]

Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness.

Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

Design and Development

We made a conscious decision only to develop ontologies that were required to answer specific use cases

We made a conscious effort to use whatever ontologies were available**

We were pretty sure that rules would be needed

We ignored query

Content: Coupling Energetics and Dynamics of Atmospheric Regions

Community data archive for observations and models of Earth's upper atmosphere and geophysical indices and parameters needed to interpret them.

Includes browsing capabilities by periods, instruments, models, …

Content: Mauna Loa Solar Observatory

Near real-time data from Hawaii from a variety of solar instruments.

Source for space weather, solar variability, and basic solar physics

Other content used too

CISM: Center for Integrated Space Weather Modeling

Virtual Observatories

Make data and tools quickly and easily accessible to a wide audience.

Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on their local computer in their preferred language: i.e. appear to be local and integrated

Likely to provide controlled vocabularies that may be used for interoperation in appropriate domains, along with database interfaces for access and storage and “smart” tools for evolution and maintenance.

Early days of VxOs

[Diagram labels: VO 1, VO 2, VO 3; DB 1, DB 2, DB 3, …, DB n; “?”]

The Astronomy approach: data-types as a service

[Diagram labels: VO App 1, VO App 2, VO App 3; VO layer; DB 1, DB 2, DB 3, …, DB n]

VOTable

Simple Image Access Protocol

Simple Spectrum Access Protocol

Simple Time Access Protocol

Limited interoperability

Lightweight semantics

Limited meaning, hard coded

Limited extensibility

Under review

OGC: {WFS, WCS, WMS} and SWE {SOS, SPS, SAS} use the same approach

Science and technical use cases

Find data which represents the state of the neutral atmosphere anywhere above 100km and toward the arctic circle (above 45N) at any time of high geomagnetic activity.

  Extract information from the use-case: encode knowledge
  Translate this into a complete query for data: inference and integration of data from instruments, indices and models

Provide semantically-enabled, smart data query services via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere Observatory that retrieve data, filtered by constraints on Instrument, Date-Time, and Parameter in any order and with constraints included in any combination.
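The "any order, any combination" requirement can be sketched as a filter whose constraints are all optional keyword arguments. This is only an illustration of the idea, not the VSTO service: the `Record` type, the `select` function and the sample records (including the "NeutralWind" and "IncoherentScatterRadar" names) are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record: one instrument, one parameter, one day."""
    instrument: str
    parameter: str
    day: date


def select(records, *, instrument=None, parameter=None, start=None, end=None):
    """Keep the records that satisfy every constraint that was supplied.

    Constraints left as None are ignored, so callers may combine them freely.
    """
    matches = []
    for record in records:
        if instrument is not None and record.instrument != instrument:
            continue
        if parameter is not None and record.parameter != parameter:
            continue
        if start is not None and record.day < start:
            continue
        if end is not None and record.day > end:
            continue
        matches.append(record)
    return matches


# Placeholder records, invented for this sketch only.
SAMPLE = [
    Record("FabryPerot", "NeutralTemperature", date(2000, 1, 12)),
    Record("FabryPerot", "NeutralWind", date(2000, 1, 12)),
    Record("IncoherentScatterRadar", "NeutralTemperature", date(2000, 2, 3)),
]

january_temperatures = select(
    SAMPLE,
    parameter="NeutralTemperature",
    start=date(2000, 1, 1),
    end=date(2000, 1, 31),
)
```

Because the constraints are keyword-only, the order in which a caller supplies them has no effect on the result.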


Use Case example

Plot the neutral temperature from the Millstone-Hill Fabry Perot, operating in the non-vertical mode during January 2000 as a time series.

Objects:

  Neutral temperature is a (temperature is a) parameter
  Millstone Hill is a (ground-based observatory is a) observatory
  Fabry-Perot is an interferometer is an optical instrument is an instrument
  Non-vertical mode is an instrument operating mode
  January 2000 is a date-time range
  Time is an independent variable/coordinate
  Time series is a data plot is a data product

Knowledge representation

Statements as triples: {subject-predicate-object}

  interferometer is-a optical instrument
  Fabry-Perot is-a interferometer
  Optical instrument has focal length
  Optical instrument is-a instrument
  Instrument has instrument operating mode
  Instrument has measured parameter
  Instrument operating mode has measured parameter
  NeutralTemperature is-a temperature
  Temperature is-a parameter

A query*: select all optical instruments which have operating mode vertical

An inference: infer operating modes for a Fabry-Perot Interferometer which measures neutral temperature
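The triples above can be held in a plain list to show what "is-a" inference buys. This is a minimal sketch in plain Python rather than OWL or a reasoner; the CamelCase identifiers and the helper functions `ancestors`, `inherited_properties` and `subclasses_of` are naming assumptions made for the example.

```python
# Triples transcribed from the slide; "is-a" and "has" are its two predicates.
TRIPLES = [
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("FabryPerot", "is-a", "Interferometer"),
    ("OpticalInstrument", "has", "FocalLength"),
    ("OpticalInstrument", "is-a", "Instrument"),
    ("Instrument", "has", "InstrumentOperatingMode"),
    ("Instrument", "has", "MeasuredParameter"),
    ("InstrumentOperatingMode", "has", "MeasuredParameter"),
    ("NeutralTemperature", "is-a", "Temperature"),
    ("Temperature", "is-a", "Parameter"),
]


def ancestors(term, triples):
    """Return every class reachable from `term` by following is-a links."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "is-a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def inherited_properties(term, triples):
    """Collect 'has' objects declared on `term` or on any of its ancestors."""
    classes = {term} | ancestors(term, triples)
    return {o for s, p, o in triples if p == "has" and s in classes}


def subclasses_of(cls, triples):
    """Return every term that is, directly or indirectly, a kind of `cls`."""
    subjects = {s for s, p, _ in triples if p == "is-a"}
    return {s for s in subjects if cls in ancestors(s, triples)}


# A Fabry-Perot is never stated to have an operating mode; it inherits one
# because it is an interferometer, hence an optical instrument, hence an instrument.
fabry_perot_properties = inherited_properties("FabryPerot", TRIPLES)
optical_instruments = subclasses_of("OpticalInstrument", TRIPLES)
```

A query such as "select all optical instruments" then needs only `subclasses_of`, while the inference about operating modes follows from `inherited_properties`.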


Fox, APAC 2007, Driving e-research: Grids and Semantics

[Diagram labels: VO Portal, Web Serv., VO API; Semantic mediation layer - mid-upper-level; Semantic mediation layer - VSTO - low level; DB 1, DB 2, DB 3, …, DB n. Annotations: Education, clearinghouses, other services, disciplines, etc.; Metadata, schema, data; Query, access and use of data; Semantic query, hypothesis and inference; Semantic interoperability; Added value (four times)]

Mediation Layer

Ontology: capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes

Maps queries to underlying data

Generates access requests for metadata, data

Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.




Partial exposure of Instrument class hierarchy - users seem to LIKE THIS

Semantic filtering by domain or instrument hierarchy

Inferred plot type and return required axes data

Semantic Web Services

OWL document returned using VSTO ontology; it can be used either syntactically or semantically

Semantic Web Benefits

Unified/abstracted query workflow: Parameters, Instruments, Date-Time

Decreased input requirements for query: in one case reducing the number of selections from eight to three

Generates only syntactically correct queries: this could not always be ensured in previous implementations without semantics

Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to expose only coherent queries (portal and services)

Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, exposed as smart web services

  understanding of coordinate systems, relationships, data synthesis, transformations
  returns independent variables and related parameters

A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)

http://escience.rpi.edu/schemas/vsto_all.owl

Semantic Web Methodology and Technology Development Process

[Process diagram, stages as labelled: Use Case; Small Team, mixed skills; Analysis; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Science/Expert Review & Iteration; Develop model/ontology; Evaluation]

Developing ontologies

Use cases and small team (7-8; 2-3 domain/data experts, 2 knowledge experts, 1 software engineer, 1 facilitator, 1 scribe)

Identify classes and minimal properties (leverage controlled vocab.)

  Start with narrower terms, generalize when needed or possible
  Adopt a suitable conceptual decomposition (e.g. SWEET)
  Import modules when concepts are orthogonal

Add service classes and properties where needed

Review, vet, publish

Only code them (in RDF or OWL) when needed (CMAP, …)

Ontologies: small and modular

Species validation

Expressivity VSTO 1.0

Expressivity VSTO dev. version

Yikes

Ontologies and the software

Protégé 2.x and then 3.x built from our ontology on the web

Java class generation

Eclipse as a development environment

Leveraged a portal code base (from the Earth System Grid project)

Implementation choices

Our big challenge was time: in use cases and in the representation

Depending on the level of granularity there were > 200,000 day-time records, and > 70,000,000 sub-day time intervals; no triple store could handle this**

We descoped our effort to delay use cases such as: find all neutral temperature data around the summer solstice for the last decade

We chose a minimal time encoding in the ontology and delegated time handling to a relational DB

Reasoning in finite time does not mean 3-4 secs!
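Delegating time to a relational database can be sketched with Python's built-in `sqlite3` module. The table layout, the `dataset_id` keys and the two sample rows are assumptions made for illustration; the slides do not describe the actual VSTO schema or database engine.

```python
import sqlite3


def build_time_index(rows):
    """Load (dataset_id, start_utc, end_utc) rows into an in-memory table.

    Timestamps are ISO 8601 strings, which sort correctly as plain text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE time_interval ("
        "dataset_id TEXT NOT NULL, "
        "start_utc TEXT NOT NULL, "
        "end_utc TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX idx_time_interval ON time_interval (start_utc, end_utc)"
    )
    conn.executemany("INSERT INTO time_interval VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def datasets_overlapping(conn, start_utc, end_utc):
    """Return ids of datasets with at least one interval overlapping the window."""
    cursor = conn.execute(
        "SELECT DISTINCT dataset_id FROM time_interval "
        "WHERE start_utc <= ? AND end_utc >= ? "
        "ORDER BY dataset_id",
        (end_utc, start_utc),
    )
    return [row[0] for row in cursor]


# Placeholder rows, invented for this sketch only.
ROWS = [
    ("dataset-a", "2000-01-01T00:00:00", "2000-01-31T23:59:59"),
    ("dataset-b", "2000-06-01T00:00:00", "2000-06-30T23:59:59"),
]
conn = build_time_index(ROWS)
january_hits = datasets_overlapping(conn, "2000-01-10T00:00:00", "2000-01-12T00:00:00")
```

In this arrangement the ontology only needs to know that a dataset has a time coverage; the millions of interval rows stay in the database, and the dataset ids returned by the SQL query link the two.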


VSTO - semantics and ontologies in an operational environment: www.vsto.org

Web Service

Implications and OWL 1.0

Lack of numeric support meant that the rules and procedural logic were implemented in Java, i.e. in the code

On several occasions the tools (not to be named) pushed us into OWL-Full, introduced inconsistencies, etc.

Finally, they stabilized, and in 2005 (and again in 2006 and twice in 2007) we had stable releases

Evaluation

Highlights:

  Fewer clicks to data
  Auto identification and retrieval of independent variables & plotting support
  Faster
  Support for finding instruments without specifying the id (includes finding data from instruments that the user did not know to ask for)

Questions (potentially with 35 responses):

  What do you like about the new searching interface? (9)
  Are you finding the data you need? (35: Yes=34, No=1)
  What is the single biggest difference? (8)
  How do you like to search for data? Browse, type a query, visual? (10, Browse=7, Type=0, Visual=3)
  What other concepts are you interested in using for search, e.g. time of high solar activity, campaign, feature, phenomenon, others? (5, all of these)
  Do the interface and services deliver the functionality, speed, flexibility you require? (30, Yes=30, No=0)
  How often do you use the interface in your normal work? (19, Daily=13, Monthly=4, Longer=2)
  Are there places where the interface/services fail to perform as desired? (5, Yes=1, No=4)

Iteration

We need the ability to evolve the ontology and not break the framework

  As we broaden re-use of these ontologies and creation of new ones

We needed visual tools like CMAP Ontology Editor

We needed the visual tools to work with the editing/plugin tools; they do not

We needed to use natural language forms; this ended up being sparse, but that need will increase

Need tools aimed at software engineers and domain scientists: a three-pronged, interoperable approach:

  OWL in editors (e.g. Protégé, SWOOP, etc.)
  Visual (e.g. CMAP/COE)
  Natural Language (e.g. Rabbit, CL, Peng)

Maintenance

Support for collaborative feedback, evolution

Change management

Support for ‘comments’ and ‘annotations’, i.e. self-documentation

Package management: creation, dependency, consistency checking

Semantics between 2004 and 2009

Ontologies were needed for data integration

  and provenance
  and mediation for data mining

Protégé 3.x and then 4.0 came out

SWOOP development was interrupted

Cmap added OWL predicate support*

SPARQL became a recommendation

Triple stores exploded in use and capability

Linked Open Data started to take off

Pellet 2.0 came out

We invaded OWLED 2006, 2007, and 2009

Semantic Web Layers

Other projects: ontologies for faceted search

For data integration

Ontology packaging

Provenance

Discussion of E versus I

We had to expand the balance to now include maintainability (/evolvability)

  E-M-I briefly

E.g. modularization has become essential to facilitate ontology packaging -> need to take advantage of OWL 2

Separation of class and instances

  Makes visual development possible
  Also facilitates SPARQL end-point approaches

As tools and applications improve we reconsider our past choices

  Adding time** back into VSTO and moving to OWL 2

2010

Recently funded to take our developments into a configurable SDF, thus we will push ontology languages and tools in new ways:

  OWL 2: RL in particular
  Annotations
  Property chaining
  SPARQL (yawn)
  RIF: probably not for a while

However, the tools still lag behind, especially for visual and natural language development

Modularization

One of the primary goals of VSTO 2.0 is to modularize the VSTO ontology, e.g., an instrument module does not require any other classes besides the instrument and maybe an instrument operating mode to substantiate what an instrument is.

The problem with modularization, however, is that although a subset may substantiate a concept, that concept, especially in VSTO, has a number of relations linking it with other concepts within the ontology, for instance the instrument module may measure a number of parameters in the parameter module, or have a time coverage that would be defined in the time module.

Each observatory that the VSTO integrates data for will import only the modules that are appropriate for the observatory's domain.

There are also some modules that will always be required, regardless of the domain, like the instrument, parameter, and time modules. Each observatory ontology has its own way of linking these modular concepts, which will be called link properties.

This presents a problem, as the VSTO portal may not know which link property to use to associate an instrument with a set of parameters or a time coverage, as it becomes the responsibility of the ontology for the respective observatory to define the link properties.

‘Interfaces’ or ‘Extensions’

This is where the VSTO interface ontology comes in. It doesn't have to be called the VSTO interface, it could be VSTO link properties, or anything for that matter.

The purpose of this ontology is to define a few link properties that will be required for navigation to data in the VSTO portal. For instance, the guided workflows, as they work now, would require a number of link properties. E.g. for the Start by Instrument workflow, the VSTO interface would require an instrument and time coverage link property to get from step 1 to step 2 in the workflow.

In the case that an instrument of the CEDAR observatory is selected in step 1, this link property could be created in a rule-based logic as…

(Instrument_1 hasInstrumentOperatingMode IOM_1 ^ IOM_1 hasDataset Dataset_1 ^ Dataset_1 hasTimeCoverage TimeInterval_1) => Instrument_1 hasTimeCoverage TimeInterval_1

Of course, this would have to be done for all instrument operating modes and all datasets associated with those operating modes to determine the full time coverage of an instrument.
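The rule can be tried out over a handful of triples in plain Python. This is a sketch of the chain, not a rule engine or an OWL 2 property chain axiom; `IOM_2`, `Dataset_2` and `TimeInterval_2` are extra placeholder individuals added to show the "all operating modes and all datasets" case.

```python
def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def infer_time_coverage(triples):
    """Apply the chain rule from the slide to every matching instrument.

    instrument hasInstrumentOperatingMode iom
      ^ iom hasDataset dataset
      ^ dataset hasTimeCoverage interval
      => instrument hasTimeCoverage interval
    """
    inferred = set()
    for instrument, predicate, iom in triples:
        if predicate != "hasInstrumentOperatingMode":
            continue
        for dataset in objects(triples, iom, "hasDataset"):
            for interval in objects(triples, dataset, "hasTimeCoverage"):
                inferred.add((instrument, "hasTimeCoverage", interval))
    return inferred


def full_time_coverage(instrument, triples):
    """Union of inferred intervals over all modes and datasets of an instrument."""
    return {o for s, _, o in infer_time_coverage(triples) if s == instrument}


EXAMPLE = [
    ("Instrument_1", "hasInstrumentOperatingMode", "IOM_1"),
    ("Instrument_1", "hasInstrumentOperatingMode", "IOM_2"),  # placeholder
    ("IOM_1", "hasDataset", "Dataset_1"),
    ("IOM_2", "hasDataset", "Dataset_2"),  # placeholder
    ("Dataset_1", "hasTimeCoverage", "TimeInterval_1"),
    ("Dataset_2", "hasTimeCoverage", "TimeInterval_2"),  # placeholder
]

coverage = full_time_coverage("Instrument_1", EXAMPLE)
```

Here the coverage is returned as a set of interval identifiers; merging them into a single contiguous range would need the actual start and end times, which this sketch leaves out.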


OWL 2 considerations

What's good?:

  new syntactic sugar to simplify ontology
  ability to compare numerics

OWL 2 QL Synopsis:

  focused on ontology interoperability with database systems where scalable reasoning and query answering over large numbers of instances is the most important task

Why is it a good match?:

  per the synopsis above, query answering over a large number of time instances will have to be performed

Why isn't it a good match?:

  does not support enumerations, a feature required by some concepts in VSTO
  does not support functional properties, a feature required by some properties in VSTO
  does not support property inclusions involving property chains, a feature we hope to utilize to define rules for VSTO
  does not support keys, a feature we hope to add when Protégé 4.1 is released (along with support for creation of keys)

OWL 2 considerations

OWL 2 RL Synopsis:

  focused on ontology interoperability with rule-extended DBMSs where scalable reasoning over large datasets is the most important task

Likely final choice:

  supports all OWL features currently required by VSTO, including enumerations and functional properties
  supports property inclusions involving property chains, so potential for rules can be addressed, namely for reasoning over time intervals
  supports keys
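The comparison on these two slides amounts to a set difference between the features VSTO needs and the features each profile offers. The sketch below encodes only the support claims made on the slides, not the full OWL 2 profile specifications; the feature labels and function names are assumptions chosen for the example.

```python
# Features the slides say VSTO requires or hopes to use.
REQUIRED_BY_VSTO = {
    "enumerations",
    "functional_properties",
    "property_chains",
    "keys",
}

# Per-profile support for those four features, as stated on the slides.
PROFILE_SUPPORT = {
    "OWL 2 QL": set(),
    "OWL 2 RL": {
        "enumerations",
        "functional_properties",
        "property_chains",
        "keys",
    },
}


def missing_features(profile):
    """Return the required features the given profile does not offer, sorted."""
    return sorted(REQUIRED_BY_VSTO - PROFILE_SUPPORT[profile])


def suitable_profiles():
    """Return the profiles that cover every required feature, sorted by name."""
    return sorted(p for p in PROFILE_SUPPORT if not missing_features(p))


ql_gaps = missing_features("OWL 2 QL")
candidates = suitable_profiles()
```

Keeping the requirements as data makes it cheap to rerun the comparison when a new feature is needed or a tool adds support.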


Back to Semantic Data Frameworks

With the substantial adoption of semantics in science data applications

  There is a need for a higher level of application/tool infrastructure
  Others are experiencing the same lessons with ontology and application development

We are aggregating our efforts into a: Semantic eScience Framework (SESF)*

  Configurable, i.e. ontology loadable and driven

Inference vs. Query

The real power of semantic web in science is likely to lie in the ability to balance implementation choices between inference (RDFS and OWL) and query (even SPARQL)

It is clear to us that the effect upon expressivity and maintainability will be an essential consideration

Recall the OWL QL versus OWL RL findings

Also depends on how dynamic the KB is…
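One way to picture the trade-off is a subclass index that can answer either by walking the hierarchy at query time or from a closure materialized in advance, which must be discarded whenever the knowledge base changes. The `SubclassIndex` class is a hypothetical illustration in plain Python, not part of VSTO or of any RDF library.

```python
class SubclassIndex:
    """Answer 'what is X a kind of?' by traversal or from a cached closure."""

    def __init__(self, edges):
        self.edges = set(edges)  # (child, parent) pairs
        self._closure = None     # materialized lazily, dropped on every update
        self.rebuilds = 0        # how often the closure had to be recomputed

    def add(self, child, parent):
        """Assert a new subclass link; any materialized closure is now stale."""
        self.edges.add((child, parent))
        self._closure = None

    def _walk(self, cls):
        seen, stack = set(), [cls]
        while stack:
            current = stack.pop()
            for child, parent in self.edges:
                if child == current and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def ancestors_at_query_time(self, cls):
        """Query-style answer: traverse the asserted links on every call."""
        return self._walk(cls)

    def ancestors_materialized(self, cls):
        """Inference-style answer: compute the full closure once, then look up."""
        if self._closure is None:
            nodes = {c for c, _ in self.edges} | {p for _, p in self.edges}
            self._closure = {node: self._walk(node) for node in nodes}
            self.rebuilds += 1
        return self._closure.get(cls, set())


index = SubclassIndex([
    ("FabryPerot", "Interferometer"),
    ("Interferometer", "OpticalInstrument"),
])
before = index.ancestors_materialized("FabryPerot")
index.add("OpticalInstrument", "Instrument")
after = index.ancestors_materialized("FabryPerot")
```

With a static knowledge base the closure is built once and every lookup is cheap; with frequent updates the rebuild count climbs and query-time traversal becomes the more attractive choice.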


I.e. SDF vs LOD

Linked open data

  RDFS and SPARQL
  http://linkeddata.org
  Emergent ontology versus, well, an engineered one
  Current chaos due to owl:sameAs
  Dynamic content

One of the present challenges for us is to accommodate the web of data into emerging needs for federated search and access, as SDFs are curated.

And yes, there is RDFS 2.0 to consider

Summary

We set out to build a prototype and ended up with a production semantic data framework

Language and tools served us well

Even with modest expressivity we challenged the tools of the time and made many compromises

All along the way, we evaluated our ontology developments and implementations to gauge the benefits of semantics

Maintainability, esp. modularization, is driving new expressivity needs

We continue to need to bridge the computer science and application communities

Further Information

http://tw.rpi.edu/portal/SESF

Contacts: pfox@cs.rpi.edu