A platform for experimenting with measures of semantic similarity and supporting individual perspectives onto shared ontologies




Mark Gahegan*, Ritesh Agrawal, Anuj Jaiswal, Junyan Luo and Kean-Huat Soon



GeoVISTA Center, Department of Geography, The Pennsylvania State University, USA.

*Now at: School of Geography, Geology and Environmental Science, University of Auckland, New Zealand. Email: m.gahegan@auckland.ac.nz




Abstract


This paper describes two developments in the ongoing search for better semantic similarity tools. Such methods are important when attempting to reconcile or integrate knowledge, or knowledge-related resources such as ontologies and database schemas. The first is an open, extensible platform for experimenting with different measures of similarity for ontologies and concept maps. The platform is based around three different types of similarity, which we ground in cognitive principles and which provide a taxonomy and structure by which new methods can be integrated. The platform supports a variety of specific similarity methods, to which researchers can add others. It also provides flexible ways to combine the results from multiple methods, and some graphic tools for visualizing and communicating multi-part similarity scores. Details of the system, which forms part of the ConceptVista open codebase, are described, along with the interfaces by which users can add new methods, choose which methods are used and select how multiple similarity scores are aggregated. We offer this as a community resource, since many similarity methods have been proposed but there is still much confusion about which one(s) might work well for different geographical problems; hence a test environment that all can access and extend would seem to be of practical use. We also provide some examples of the platform in use.



The second part of the paper describes in detail the idea of ‘perspectives’: a means of defining specific views onto semantic knowledge that can overcome some of the smaller differences in ontology that sometimes are a stumbling block for compatibility or acceptance. Perspectives are designed to help reconcile a user’s specific (but individual) understanding, or their current needs, to the contents of an established domain ontology, but without forcing the user to adopt all the constructs of the ontology directly. Minor differences can be overcome without the need for the user’s conceptual model (or related application programs) to be changed. Perspectives offer a convenient way to customize how ontologies appear to their users, rather like ‘views’ in a relational database (but significantly more powerful). We argue that perspectives are a kind of mediating transformation by which knowledge resources can be integrated. In fact, they operationalize the notion of foreground and background in cognition, allowing currently irrelevant details to be moved to the periphery. So far in our implementation, perspectives allow: (i) properties to be recast as concepts in their own right (and conversely, concepts and sub-graphs to be reduced to properties), (ii) differences in specialization/generalization to be bypassed and, more generally, (iii) implied relations to be used to connect concepts directly. We describe perspectives from a cognitive standpoint, and give examples of how they can be used.

Keywords: semantic similarity, ontology mediation, open platform, concept mapping, GIScience


1 Introduction

Knowledge computing (ontologies, provenance, workflows, etc.) has opened the possibility of representing and reasoning with knowledge about the geographical world, which (just as with data before it) has led in turn to many further questions regarding interoperability, integration and update of such knowledge resources. Not surprisingly, then, we often find ourselves faced with the problem of reconciling knowledge captured from different experts, or scraped from published documents and databases. But faced with the plethora of choices as to how knowledge should be recognized as similar and integrated, the way forward is sometimes uncertain. Which methods should we use? Such demands have led us to construct an evaluation platform for various semantic similarity methods, which we describe below in the first part of the paper.

The platform is implemented in ConceptVista, a concept mapping, semantic search and knowledge integration environment (http://www.geovista.psu.edu/ConceptVISTA/index.jsp). ConceptVista differs from ontology tools such as Protégé in its support for less formal kinds of knowledge such as concept maps, its links to many other Web 2.0 technologies, and its support of highly visual interaction and display. A fuller description is available from Gahegan et al. (2007).


On the positive side, the semantic similarity research literature has grown very rapidly in the past five or so years, so there are by now many useful similarity methods and associated metrics to draw from. Harvey et al. (1999), Winter (2001), Kuhn (2005) and Agrawal (2007) provide good introductions to geospatial semantics and some of the specific problems that must be overcome in a geographical setting. More general accounts of data semantics and ontologies are given by Gruber (1993), Sheth (1999), Guarino (1998) and Davis et al. (2006), while Wach et al. (2001) provide a useful summary of ontology integration methods. Measuring semantic difference in geo-ontologies is specifically addressed in the work of Rodriguez and Egenhofer (2003), Kavouras et al. (2005) and Fonseca et al. (2006), in which practical metrics are proposed. Haase et al. (2005), Klein (2004) and Bloehdorn et al. (2006) specifically discuss the problem of checking for, and maintaining, consistency in distributed (co-evolving) ontologies. Sowa (2006) provides a very thought-provoking account of dynamic ontology and, with colleague Majumdar (Sowa & Majumdar, 2003), describes a system that can find matches between knowledge fragments (conceptual graphs) using a sophisticated analogy reasoning engine. Nevertheless, live systems that can be used by an entire research community to experiment with ontology-based knowledge integration are rare. One very basic but useful example from the Geosciences Network (GEON) is described by Lin and Ludäscher (2003); it can be accessed from the GEON cyberinfrastructure portal at https://portal.geongrid.org/gridsphere/gridsphere.



However, a number of deeper questions about geographical knowledge construction and integration that relate to the philosophy of geo-ontology (Frodeman, 2003), via the fields of hermeneutics, pragmatics and situated cognition, require very careful attention since they impinge on the practicality and validity of some matching methods. An excellent account of the background to these questions and the related research literature over the past 40 years is provided by Schwering (2008). Many of these questions arise, we believe, because of the situated and often contested sense of meaning that is common within geography and the natural sciences (Clancy, 1994; Brodaric and Gahegan, 2007; Brodaric, 2007; Pike and Gahegan, 2007). By and large, the concepts and relations we use to describe the world do not exist in nature; they are constructed by humans. Hence it is not surprising that meaning differs between individuals, and through time. The assumption that conveying and reconciling understanding is a merely linguistic problem that would not occur in a purified language like description logic (ontology) is, in fact, almost entirely wrong (Braspenning, 2002; Sowa, 2005). Thus there are often no perfect theoretical solutions for geographical knowledge integration, but rather subjective measures and practices that, on balance, provide useful results.


Take the case of two experts who study different but overlapping domains, such as (i) vulnerability of local places to climate change and (ii) crisis management and disaster relief. It is likely that they will share some knowledge that is obviously the same, but there may also be knowledge that is not identical but is commensurate to some degree, and finally of course there is knowledge that is not common to both of them. The same can be said of computer programs, databases and ontologies; when compared with each other there are grades of similarity and overlap. Putting it another way, there is intersecting knowledge, where the problem is to ensure that ontological clauses are not repeated and that small inconsistencies are recognized and resolved. Then there is augmenting knowledge, which extends what is known by each party separately but remains compatible with the intersecting core. Finally, there is the possibility of knowledge that (so far) is disjoint and does not (yet) fit in. Readers are directed to a thought-provoking view of this problem in a presentation by Von Schweber (2006), where Venn diagrams are used to differentiate what is currently known, what could be known (building on current knowledge) and what cannot be known, as a means of understanding the limits of knowledge exchange between two parties.


For the first case, of intersecting knowledge, we need to recognize equivalence, though there may be naming conventions and differences in level of detail that must be resolved in order to do so. For the second case, of augmenting knowledge, we might expect fewer overlapping details, and the problem becomes one of knowledge integration. More importantly, there are likely to be concepts that play quite different roles, i.e. have different properties and engage in different relations. For example, the vulnerability expert might be able to identify and understand at-risk populations, but have no idea how to evacuate them from the path of a hurricane. Yet much of the underlying semantic structure may well be compatible, when examined. In the third case, where knowledge does not yet overlap, the best strategy might be to avoid placing too much weight on similarities because they may be coincidental. Of course, the complicating factor is that in most cases we do not know beforehand which of these three situations applies for a given knowledge fragment: we need to infer it.


These apparent conflicts in purpose point to the need for very flexible methods for recognizing and resolving semantic similarity. Many of the tools developed to date for similarity resolution assume that the problem is actually of the first type. And it is true that many integration problems so far studied use ontologies to resolve apparent differences between knowledge communities that are very ‘close’ to each other, as in the many examples of ontologies used to build schema mappings between databases with similar content. But these are not the only kinds of knowledge integration problems we are faced with.


From our point of view then, we see an urgent need, not for another kind of metric to calculate semantic similarity, but for an environment that allows:

1. Methods and their metrics to be readily evaluated and compared,
2. Easy extension with new methods for specific kinds of similarity and matching problems,
3. Better support for augmenting knowledge (second case described above),
4. Flexibility in the way that methods and results are combined and communicated, and
5. Simple, but effective ways to investigate and visualize the results.


In short, we do not know enough about the problem to be able to move straight to a solution; we need first to conduct experiments and evaluate strategies on their merits for resolving particular classes of problem. Without such experimentation, how will we ever know which strategies are in fact the most successful?


A description of the environment we have constructed for evaluating similarity forms the first part of this paper. The second part introduces a new idea to help knowledge communities look past unimportant differences in ontology, so that they do not become a dogmatic stumbling block for potential users. Specifically, we describe the idea of ‘perspectives’ onto knowledge, which provide different views onto an underlying knowledge schema, adapted specifically for some purpose. To give just a simple example for now, two experts (or two applications) might differ in the status afforded a notion such as ‘Mesozoic’. To one scientist, this might be simply a property that describes the age of a fossil; but to another, Mesozoic might be a complex concept in its own right, with an intricate web of relations. The first scientist, not interested in these details, may wish to continue to treat Mesozoic as a property, and in our opinion should be allowed to do so because it is not inconsistent with the roles it can play. But the second scientist should be able to treat it as a highly connected concept. The trick here is to support both views from the same ontology.



1 A semantic similarity platform for testing and evaluation

As noted above, much has been written already on the construction of different metrics for computing similarity across ontologies, and the related task of matching, or building an equivalence mapping between two schemas. However, it seems all too often that new methods are not compared to existing ones, nor is the software developed made available so that tests can be replicated and methods compared. To remedy this shortcoming, and to allow us to evaluate similarity methods in different contexts within the geographical and geological realms, we have constructed an open, extensible similarity platform (test bed), with graphical interfaces and pictorial display of the similarity measures used.


However, before describing our similarity platform, some clarification of purpose and intent is needed. Firstly, we see our contribution here as providing a solid, conceptual structure for organizing the various similarity methods available and thereby creating a clean and extensible programming interface for ourselves and others to use. Good structure is needed if we are to manage a growing collection of more complex methods. Secondly, by no means do we claim to have a complete set of methods to hand, nor that the methods we describe are necessarily the best available. We do provide a structure, and have populated it with methods that seem interesting and useful to us. But it is far from complete. One of our purposes in writing this paper is to generate interest in sharing methods between research groups. If your favorite method is missing, please consider supplying it to us! Finally, it is not our purpose here to provide an in-depth review of similarity measures; we restrict ourselves to some brief notes on different types of similarity methods, as they relate to supporting them computationally.


We also note that calculating similarity is just one part of a complex process of knowledge integration and management, especially when the knowledge resources are community-based, that is, shared by a number of users. We envisage the research reported here fitting within a five-stage model of ontology management, as follows:


1. Choose a strategy for assessing similarity, based on an understanding of the task and the parties involved.
2. Recognize similarities: use the chosen methods to compute similarity scores, and to summarize and report the findings.
3. Update local knowledge resources according to the similarities encountered. As part of this same project (but not reported here) we have developed different strategies for merging and integrating knowledge, from importing absent concepts to anchoring new concepts into formal domain ontologies (footnote 1).
4. Manage the revision and maintenance of ontologies within a community: this under-researched area must address questions of how knowledge communities are formed, rules for participation and a policy for approving revisions. Some ideas as to how this might work are presented in Klein (2004) and Pike et al. (2005). The tools for software version control can address the technological aspects of the problem.
5. Broadcast changes to community knowledge resources using a regular update cycle and distribution mechanism (footnote 2).
1.1 Specific similarity metrics

Following an idea suggested by Sowa and Majumdar (2003), we have developed a categorization scheme for similarity metrics according to Peirce’s notions of firstness, secondness, and thirdness (Peirce, 1931). In his writings, Peirce coins these terms to describe the different kinds of knowledge that are needed to form and relate categories. To wit:


“First is the conception of being or existing independent of anything else. Second is the conception of being relative to, the conception of reaction with, something else. Third is the conception of mediation, whereby a first and a second are brought into relation.” (Peirce, 1891)


So, firstness describes the internal or intrinsic character of some entity, secondness relates an entity to other entities, and thirdness describes mediating transformations by which entities can become connected. If we extend these ideas into the realm of semantic similarity, we obtain a concise and clear taxonomy for similarity methods that makes it easier to understand what the different methods do, and what their operational parameters will be. This scheme provides some much-needed structure for the software development, as summarized in Table 1.


Table 1. Modes of similarity, descriptions of what they do and the programming interfaces they require.

Mode | Description | Parameters: Type
Firstness methods | Match by internal property values, by property type, and also by property name or identity. | Properties(A): Vector, Properties(B): Vector
Secondness methods | Match concepts according to their relationships with other concepts, i.e. their graph neighborhood: a sub-graph centered on each concept. | Neighborhood(A): Graph, Neighborhood(B): Graph
Thirdness methods | Introduce a mediating graph that brings A and B into relation. | Properties(A): Vector, Properties(B): Vector, Mediation(A, B): Graph

Footnote 1: This leads naturally to the question of exactly how such merging should take place. Should properties from matching concepts be merged into one new concept? Should they be subsumed into one or more existing concepts? When combining knowledge across two communities, concepts (though their names and other properties may be similar) may actually mean different things to their knowledge community, or play significantly different roles. Thus assuming they are equivalent (and can therefore be merged) may be unhelpful.

Footnote 2: Imagine, if you will, Critical Knowledge Updates to your PC!


1.1.1 Firstness Methods

Value and Structure Similarity

Value similarity methods calculate similarity scores based on the commonality of property values. The metrics are designed to be directly proportional to the number of values that two concepts share, and are sometimes called set-theoretic measures, employing notions of contrast (Tversky, 1977). Structural similarity matches not on values of properties, but on their types. Thus it is basically a count of the number of properties whose types are the same, again normalized by the number that are different. The form of the method is typically a measure of the information they share: a similarity score calculated as the sum over all properties or values (p_i) that are common between two concepts (A and B), divided by the total information contained in A and B separately. Commonality calculations vary with the type of properties; a different method is needed for each new property type. See the equations below for examples of how these notions can be computed. Note that commonality can be computed in a number of ways: the formulae below show integers compared according to their closeness to each other (giving a numeric score in the range 0-1), and strings compared for exact matches (giving a score of either 0 or 1). Partial string matching and more complex lexicographical analysis are often more reliable when comparing concept maps and ontologies not created using a controlled vocabulary. The general form for all firstness methods is (from Table 1) Properties(A): Vector, Properties(B): Vector; in the case of simply comparing the names of concepts, the vectors reduce to simple strings or URIs.








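\[
\text{Similarity score}(A, B) = \frac{\sum_{i=1}^{n} \text{commonality}(p_i, A, B)}{\text{total information}(A, B)}
\]

\[
\text{Commonality}(p, A, B) =
\begin{cases}
1 - \dfrac{|A_p - B_p|}{\text{range}(p)} & \text{if } p \text{ is of type int} \\[1ex]
1 \text{ if } A_p = B_p,\quad 0 \text{ if } A_p \neq B_p & \text{if } p \text{ is of type string}
\end{cases}
\]

\[
\text{Total information}(A, B) = n_A + n_B - n_{A \cap B}.
\]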
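As an illustration only, the firstness calculation can be sketched in a few lines of Python (the platform itself is written in Java, and this is not its code). All names here are hypothetical. The sketch assumes that properties are paired by name, that each integer property has a known value range, and that total information is the number of distinct properties across both concepts.

```python
from typing import Dict, Union

Value = Union[int, str]


def commonality(a_val: Value, b_val: Value, value_range: float = 1.0) -> float:
    """Score one shared property: closeness for integers, exact match otherwise."""
    if isinstance(a_val, int) and isinstance(b_val, int):
        # 1 - |A_p - B_p| / range(p), clamped so the score stays in [0, 1]
        return max(0.0, 1.0 - abs(a_val - b_val) / value_range)
    return 1.0 if a_val == b_val else 0.0


def firstness_similarity(a: Dict[str, Value], b: Dict[str, Value],
                         ranges: Dict[str, float]) -> float:
    """Sum commonality over shared property names, divided by total information."""
    shared = a.keys() & b.keys()
    # Assumption: total information = number of distinct properties in A and B
    total_information = len(a) + len(b) - len(shared)
    if total_information == 0:
        return 0.0
    score = sum(commonality(a[p], b[p], ranges.get(p, 1.0)) for p in shared)
    return score / total_information
```

Pairing properties by name is what prevents the "age of a dog versus length of a train" comparison; a different pairing rule would change the scores.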


It is in the nature of these kinds of measures that they are narrowly focused on the internal aspects of concepts. As a consequence, it is usually not safe to rely on them alone. Types and values can be chosen quite arbitrarily, and it is difficult to be sure that the comparisons are meaningful. For example, unless constrained to compare only properties with the same names, these methods would compare the age of a dog and the length of a train. But they do add evidence, and can be useful when used to support more advanced methods.

1.1.2 Secondness Methods

Secondness methods concern the relations between concepts, or more broadly the similarity of their graph neighborhoods. They are sometimes referred to as network methods. The simplest kinds are very similar to those described above for properties and values, except that they are used on relations (and possibly their properties also). So, following from the equations given above, secondness measures can compute similarity based on commonality among the relations, again normalized by the total information. One complicating question is the depth of the subgraphs to be compared, and whether all types of relations contribute equally to the overall score. For example, one could define a semantic distance metric, so that more distant relations (not directly connected, but connected by intermediary concepts) count for less.
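One way such a distance-weighted measure might look is sketched below in Python. This is an assumption-laden illustration, not a method from the platform: the neighborhood representation (a map from relation/target pairs to hop counts), the geometric decay factor of 0.5 and all names are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (relation name, target concept)


def neighborhood_similarity(a: Dict[Edge, int], b: Dict[Edge, int],
                            decay: float = 0.5) -> float:
    """Weighted overlap of two graph neighborhoods.

    Each dictionary maps an edge to its hop count from the central concept
    (1 = directly connected). More distant edges receive smaller weights.
    """
    def weight(hops: int) -> float:
        return decay ** (hops - 1)

    union = a.keys() | b.keys()
    if not union:
        return 0.0
    # Shared edges contribute the smaller of their two weights
    common = sum(min(weight(a[e]), weight(b[e])) for e in a.keys() & b.keys())
    # Every edge in either neighborhood contributes its larger weight to the total
    total = sum(max(weight(a[e]) if e in a else 0.0,
                    weight(b[e]) if e in b else 0.0) for e in union)
    return common / total
```

Setting decay to 1.0 recovers an unweighted overlap in which all relations count equally, whatever their depth.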


More useful relational measures can be calculated using the explicit semantics of the relations represented in the ontology. For example, one can count the network distance that must be travelled across a generalization hierarchy to connect two concepts A and B: the number of generalization steps from A to their closest shared generalization (G), then the number of further specialization steps from G to B. Various other families of proximal relations can be used as well, including the spatial (Kuipers, 2000; Schwering and Raubal, 2005). This kind of method works well with ontologies that share type hierarchies explicitly, but more sophisticated measures are needed otherwise, such as those described by Rodriguez and Egenhofer (2003). The general form remains Neighborhood(A): Graph, Neighborhood(B): Graph (Table 1), though some of the more advanced methods may require additional parameters.
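The hierarchy distance described above can be sketched as follows. This Python fragment is illustrative only; it assumes a single-inheritance hierarchy without cycles, stored as a hypothetical child-to-parent map, and the function names are not taken from the platform.

```python
from typing import Dict, List, Optional


def ancestors(concept: str, parent: Dict[str, str]) -> List[str]:
    """Chain from a concept up to its root, starting with the concept itself."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def hierarchy_distance(a: str, b: str, parent: Dict[str, str]) -> Optional[int]:
    """Steps from A up to the closest shared generalization G, plus steps from G down to B.

    Returns None when A and B share no generalization.
    """
    steps_up_from_b = {c: steps for steps, c in enumerate(ancestors(b, parent))}
    for steps_a, c in enumerate(ancestors(a, parent)):
        if c in steps_up_from_b:
            return steps_a + steps_up_from_b[c]
    return None
```

A smaller distance suggests closer concepts; converting the distance into a 0-1 similarity score would need a further normalization choice that is left open here.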

1.1.3 Thirdness Methods

Implied semantic relationships connecting concepts (‘nym matching service)

It is often the case when combining knowledge gathered from different human experts or different communities that we need to recognize connections by certain implicit (or missing) semantic relationships, such as toponyms, hypernyms, hyponyms, meronyms, holonyms, synonyms and antonyms. Although conceptually simple (use WordNet, Cyc or some other web-based thesaurus and gazetteer to check if two concepts have some kind of ‘nym relationship), the problems here are ones of computational performance and the reliability and completeness of these external resources. Performance is a problem because searches through an external corpus must be made between all concepts that might be similar (thus on the order of n(n-1)/2 pairwise searches for n concepts, assuming symmetry). Completeness is a problem because specialized scientific vocabulary is often missing from these general-purpose thesauri.


We have developed a separate matching service that uses the semantic relationships from WordNet. The service searches for concepts that are similar based on these semantic relationships. Effectively this offloads the computational burden to a remote computer as an RMI service, using the powerful Lucene indexing tool for efficiency. Users of the service do not need to be aware of these complexities; the interface is as described in Table 1, i.e. A: Concept, B: Concept, Mediation(A, B): Graph, but the Graph in this case reduces to a single ‘nym relation.
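The shape of such a lookup, in which the mediating graph reduces to a single relation, might be sketched as below. This Python fragment does not call WordNet, RMI or Lucene; it uses a tiny hand-made thesaurus, and the entries and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical, hand-made entries; a real service would query WordNet instead.
THESAURUS: Dict[Tuple[str, str], str] = {
    ("lake", "water body"): "hyponym",  # a lake is a kind of water body
    ("shore", "lake"): "meronym",       # a shore is part of a lake
    ("stream", "brook"): "synonym",
}

INVERSE = {"hyponym": "hypernym", "hypernym": "hyponym",
           "meronym": "holonym", "holonym": "meronym",
           "synonym": "synonym", "antonym": "antonym"}


def nym_relation(a: str, b: str) -> Optional[str]:
    """Return the single relation that mediates between a and b, if one is known."""
    a, b = a.lower(), b.lower()
    if (a, b) in THESAURUS:
        return THESAURUS[(a, b)]
    if (b, a) in THESAURUS:
        return INVERSE[THESAURUS[(b, a)]]
    return None


def match_all(concepts: List[str]) -> List[Tuple[str, str, str]]:
    """Check each unordered pair of concepts once and keep the pairs that are related."""
    found = []
    for a, b in combinations(concepts, 2):
        relation = nym_relation(a, b)
        if relation is not None:
            found.append((a, b, relation))
    return found
```

Because each relation has a known inverse, only one direction of each pair needs to be stored or searched, which is the symmetry assumption mentioned in the text.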


Mediating subgraphs (analogies and perspectives)

The real power of mediation (thirdness) occurs when graphs more complex than a single relation can be used to form connections between concepts. Sowa and Majumdar (2003) describe an analogy engine to search for these mediating subgraphs in existing knowledge resources (such as ontologies). In our own work, we have been frustrated by the lack of detailed domain knowledge by which these subgraphs might be constructed, so we have developed a human-led method by which users can construct their own mediating subgraphs. We term these subgraphs ‘perspectives’ and dedicate the next section of the paper to explaining how they work and how they can be used. As far as the API is concerned they require no special treatment.



Of course, the most successful strategies are likely to be those that can combine multiple methods and
resolve how to combine their results (see Schwering, 2005 as an example). We briefly touch on this
problem below.

1.2 Details of the Application Programming Interface

The Application Programming Interface is constructed around the following methods:


Analyzers: Methods for performing comparisons in values and types (described above). Three different families are implemented so far for the firstness methods: StringAnalyzer, NumericAnalyzer and DateAnalyzer (footnote 3). Thirdness analyzers include a ‘nym matching service and the perspectives mechanism described below in the Perspectives section.

Filters: Exclude certain properties of concepts and relations from any comparison.

GUI Interfaces: Allow users to configure the various options and parameters that different Analyzers can use.

Extractors: Extract needed information from more complex fields, such as the local concept name from a concept URI string (for when concepts are identified with URIs).

Summarizers: These define the expressions used to combine multiple scores together.

Visualizers: Ways of reporting the similarity scores in the ConceptVista application.

Footnote 3: We have not included analyzers for secondness in our work so far, as there are so many to borrow from colleagues.


Supporting the Analyzers is a SimilarityRegistry class that maintains a HashMap index between similarity methods and their graphical user interface (GUI) component (an instance of SimilarityEditorInterface). This class allows users to add a new HashMap entry at runtime, if there is a need to modify or add an analyzer. It also decouples similarity measures from their GUI component, hence different graphical interfaces to methods can also be substituted as needed. Specific RegistryAnalyzers maintain a registry of which GUI components to use to set the different parameters of a specific implementation of an Analyzer interface.
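The decoupling that such a registry provides can be illustrated with a small Python sketch. It is not the SimilarityRegistry class itself, which is Java and whose signatures are not given here; ToyRegistry, Entry and the editor names are hypothetical stand-ins, with a plain string taking the place of a GUI component.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Analyzer = Callable[[str, str], float]


@dataclass
class Entry:
    analyzer: Analyzer
    editor: str  # stand-in for a GUI editor component


class ToyRegistry:
    """Hypothetical registry mapping a method name to its analyzer and editor."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, name: str, analyzer: Analyzer, editor: str) -> None:
        # Entries can be added or overwritten at runtime
        self._entries[name] = Entry(analyzer, editor)

    def set_editor(self, name: str, editor: str) -> None:
        # Swap the editor without touching the analyzer
        self._entries[name].editor = editor

    def editor_for(self, name: str) -> str:
        return self._entries[name].editor

    def compare(self, name: str, a: str, b: str) -> float:
        return self._entries[name].analyzer(a, b)
```

Keeping the analyzer and its editor in separate fields is what lets either one be replaced independently of the other.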


Filters are provided to allow the user to exclude different properties of a concept from being used by the various similarity metrics. This is necessary because some properties may be known to be misleading, and also because various details unrelated to semantics sometimes find their way into some concept maps.
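In its simplest form, a filter of this kind is just a projection over a concept's properties. The following Python sketch shows the idea; the function name and the example property names are hypothetical, and the platform's own Filter interface may differ.

```python
from typing import Dict, Set


def apply_filter(properties: Dict[str, object], excluded: Set[str]) -> Dict[str, object]:
    """Return a copy of the properties without the excluded names."""
    return {name: value for name, value in properties.items() if name not in excluded}
```

For example, layout details such as a node's screen position or color could be excluded before any analyzer sees the concept.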


GUI Interfaces are the graphical components by which the user interacts with the various similarity methods. Because of the resemblance in parameters shown in Table 1, new similarity methods can sometimes make use of existing GUI Interfaces, so no additional code is needed.


Extractors are used to mine out information for matching from more complex properties, such as creating concept names from their URI. They provide a way to restrict what is compared for the properties selected for use.
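A minimal sketch of the URI case is given below in Python. The rule used (take the fragment after '#', otherwise the last path segment) is an assumption about how local names are usually formed, not a description of the platform's Extractor; the example URIs are invented.

```python
def local_name(uri: str) -> str:
    """Return the fragment after '#', else the last path segment, else the input."""
    trimmed = uri.rstrip("/#")
    for separator in ("#", "/"):
        if separator in trimmed:
            return trimmed.rsplit(separator, 1)[1]
    return trimmed
```

The extracted local name can then be passed to a string analyzer in place of the full URI, so that differing namespaces do not dominate the lexical score.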


Summarizers are the means to combine similarity scores from multiple methods. We envisage here a process by which the scores from different methods can be weighted and combined. Over time, we can perhaps learn which weighted combinations of methods work best for different knowledge integration problems and domains. So far, we have restricted our Summarizers to work only with the firstness measures described above. It remains an open question how to build a more general summarizer that will work across all kinds of similarity methods.
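One simple weighting scheme is a weighted mean of the individual scores, sketched here in Python. The choice of a weighted mean, the function name and the method labels are assumptions made for illustration; the platform's Summarizers may combine scores differently.

```python
from typing import Dict


def weighted_summary(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of method scores; methods without a weight are ignored."""
    total_weight = sum(weights.get(method, 0.0) for method in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(method, 0.0) for method, score in scores.items())
    return weighted / total_weight
```

With equal weights this is a plain average; raising the weight on one method moves the combined score toward that method's result.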


Visualizers are a class of methods for visually depicting the similarity scores. In the example shown below in Figure 1, simple bars are used to show these scores. We have experimented with other visual devices and selected this one for now, though we intend to add more methods soon.




A complete UML diagram for the less experimental aspects of the whole platform is given for reference purposes in Appendix 1. The entire application, which includes ConceptVista and the ‘nym matching service, is available for download as a package from: http://www.personal.psu.edu/arj135/Projects/CV4/CV4-Setup.exe. The authors will also be happy to make source code available to any interested parties. The codebase uses an LGPL license and is written in Java using the JENA ontology tools.

1.3 Example of use

When the user begins an evaluation of semantic similarity, two (or more) ontologies (or more informal concept maps) are first loaded and displayed concurrently. The user then clicks on the similarity tools panel to choose which methods to use and, via their GUI Interfaces, sets any necessary operational parameters. Having configured the methods, and selected a Summarizer, the user then clicks on a concept in one ontology from which to begin the search for similarity. The system responds by calculating similarity scores between this concept and all the concepts in the second ontology. The results are projected into the display. The user can then choose to act on the computed scores, perhaps by creating new relations to represent the uncovered connections, or by proceeding to compare further concepts.


The example in Figure 1 shows the concept of SurfaceWater (highlighted in red) from the SWEET EarthRealm ontology (source: www.nasa.gov/earthrealm) compared to several concepts from the AktiveSA ontology for crisis management / disaster relief (source: https://www.edefence.org/~ps/aktivesa/OntoWeb/index.htm). The concept SurfaceWaterObject has the best overall similarity scores. As configured in this example, the pink bar represents a structural similarity score, which is calculated based on how many common links are associated with both concepts. The red bar denotes a similarity score based on lexical similarities of the concept names. Finally, the brown bar shows a score that combines both the structural and string value measures.

For this possible match, structural similarity has the highest score, as the two concepts share an almost identical set of properties. The string value similarity receives only a moderate score because of the lexical difference between “SurfaceWaterObject” and “SurfaceWater”. This in turn lowers the combined score, which considers both structural and value similarity measures.
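To make the three bars concrete, the following Java sketch computes one possible structural score, one possible lexical score and a combination of the two. The specific measures are assumptions chosen for illustration (a Jaccard index over link names, a normalized Levenshtein distance over concept names, and a weighted mean); the text does not state which formulas the platform uses. The class name TwoPartSimilarity and the link names in main are likewise invented and are not taken from SWEET or AktiveSA.

```java
import java.util.HashSet;
import java.util.Set;

final class TwoPartSimilarity {

    /** Structural score: shared links over all links (Jaccard index), an assumed measure. */
    static double structural(Set<String> linksA, Set<String> linksB) {
        if (linksA.isEmpty() && linksB.isEmpty()) return 1.0;
        Set<String> shared = new HashSet<>(linksA);
        shared.retainAll(linksB);
        Set<String> all = new HashSet<>(linksA);
        all.addAll(linksB);
        return (double) shared.size() / all.size();
    }

    /** Classic dynamic-programming Levenshtein edit distance, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Lexical score: 1 minus the edit distance normalized by the longer name. */
    static double lexical(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    /** Combined score: a weighted mean; the weight is a free parameter. */
    static double combined(double structural, double lexical, double structuralWeight) {
        return structuralWeight * structural + (1.0 - structuralWeight) * lexical;
    }

    public static void main(String[] args) {
        // Invented link sets for illustration only.
        Set<String> water = Set.of("hasLocation", "hasDepth", "partOf", "flowsInto");
        Set<String> waterObject =
                Set.of("hasLocation", "hasDepth", "partOf", "flowsInto", "hasOwner");
        double s = structural(water, waterObject);                 // 4 shared of 5 links = 0.8
        double l = lexical("SurfaceWater", "SurfaceWaterObject");  // 1 - 6/18, about 0.667
        System.out.printf("structural=%.3f lexical=%.3f combined=%.3f%n",
                s, l, combined(s, l, 0.5));
    }
}
```

With these assumed measures the pattern in the text is reproduced: the structural score is highest, the lexical score is moderate because six characters must be inserted to turn one name into the other, and the combined score falls between the two.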


Note, of course, that different methods may well change the outcomes. For example, a more refined lexical string matching method might improve the results, as might the use of a Filter to remove the sub-string “Object” from all concepts in the AktiveSA ontology. The graph neighborhoods are very similar, so secondness methods may not be so effective in this simple example.
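The idea of a Filter that strips a sub-string before lexical matching can be sketched in a few lines of Java. The class NameFilterSketch and its two methods are hypothetical and do not reflect the platform's actual Filter interface; the sketch simply normalizes concept names so that a lexical method sees “SurfaceWater” rather than “SurfaceWaterObject”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class NameFilterSketch {

    /** Builds a filter that deletes every occurrence of the given sub-string from a name. */
    static UnaryOperator<String> removeSubstring(String unwanted) {
        return name -> name.replace(unwanted, "");
    }

    /** Applies a filter to all concept names before they reach the lexical method. */
    static List<String> apply(UnaryOperator<String> filter, List<String> conceptNames) {
        List<String> filtered = new ArrayList<>();
        for (String name : conceptNames) {
            filtered.add(filter.apply(name));
        }
        return filtered;
    }

    public static void main(String[] args) {
        List<String> names = List.of("SurfaceWaterObject", "RoadObject", "Shelter");
        System.out.println(apply(removeSubstring("Object"), names));
        // prints [SurfaceWater, Road, Shelter]
    }
}
```

One caveat with any such filter is that two distinct concepts may collapse onto the same name once the sub-string is removed, so the filtered names are best used only for matching and not as identifiers.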






Figure 1: The concept of SurfaceWater from the SWEET EarthRealm ontology (highlighted in red) is compared to several concepts from the AktiveSA ontology for crisis management. The resulting similarity scores are shown by the multi-bar glyph symbols in the display. See text for further details.

2 Perspectives

Perspectives are sequences of ontological transformations, specifically designed to overcome some of the possible conflicts that can occur in the process of human-oriented ontology creation and use. For example:



(i) seemingly arbitrary decisions, where concepts could be linked by many relations, concerning which ones should be made explicit and which ones are simply implied;

(ii) a different degree of interest or sophistication that may arise between ontology producers and ontology consumers; or

(iii) a dissimilar propensity for levels of generalization among practitioners, where the ontology is strongly hierarchical but the user conceives of a much flatter structure (or vice versa).

The aim of perspectives in all these cases is to finesse such differences.


An early implementation of perspectives as ontological filters was reported by Gahegan et al. (2008), as a predominantly visual filter used to draw attention to certain themes in the display, built automatically around pre-defined semantic types. Filters were used to show the visual intersection of different views onto a concept map. Here perspectives are extended to a deeper cognitive level, designed to mediate conceptual knowledge so it better matches an expert’s personal understanding or current need. Importantly, in this sense perspectives broaden the notion of what might be considered commensurate knowledge beyond the kinds of similarity measures described above. Specifically, they allow us to operationalize an idea from the writings of Whitehead (1929; 1933), where he describes the ability of humans to move ideas (concepts and relationships) between the light of enquiry and the penumbral background, where details are not known precisely, or are not currently needed. Using this idea, we can define a filter that highlights a specific sub-graph, but which ‘rolls-up’ the concepts on the periphery of the filter, temporarily removing or recasting them as simple properties of the concepts of interest.


Figure 2 describes schematically how perspectives work in the case of providing such conceptual focus. The top diagram shows a concept map or ontology upon which two different perspectives (A and B) will be defined. The left diagram represents the effect of applying perspective A: concepts inside the filter are unchanged (numbers 6, 7, 8 and 9), while concepts connected directly to those inside are reduced to being properties of the included concepts and are now shown as circles (2, 5, 10, 11, 12, 15). Concepts not directly connected to those within the perspective are temporarily removed (shown grayed out). The right diagram shows perspective A retracted and perspective B asserted (concepts 6, 7, 12 and 14 now in focus), with corresponding changes to the surrounding nodes. The movement from A to B represents conceptual refocusing, so some previously relevant details are no longer required, some previously truncated concepts are re-inflated and some additional concepts become visible.
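The three-way treatment of concepts under a perspective (in focus, rolled up into a property, or temporarily hidden) can be expressed as a simple classification over a graph. The Java sketch below is a minimal illustration under stated assumptions: the class PerspectiveSketch, the Role enum and the five-node chain used in main are invented here, the concept map is treated as an undirected graph of integer-labelled nodes, and the edges of Figure 2 are not reproduced because they cannot be recovered from the text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class PerspectiveSketch {

    enum Role { IN_FOCUS, PROPERTY, HIDDEN }

    /** Undirected concept map stored as adjacency sets. */
    private final Map<Integer, Set<Integer>> neighbours = new HashMap<>();

    void link(int a, int b) {
        neighbours.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Classifies every linked concept relative to a focus set; the graph is never modified. */
    Map<Integer, Role> apply(Set<Integer> focus) {
        Map<Integer, Role> roles = new TreeMap<>();
        for (Integer concept : neighbours.keySet()) {
            if (focus.contains(concept)) {
                roles.put(concept, Role.IN_FOCUS);
            } else if (touches(concept, focus)) {
                roles.put(concept, Role.PROPERTY);   // rolled up into an in-focus neighbour
            } else {
                roles.put(concept, Role.HIDDEN);     // temporarily removed from view
            }
        }
        return roles;
    }

    /** True when the concept is directly linked to at least one in-focus concept. */
    private boolean touches(int concept, Set<Integer> focus) {
        for (Integer n : neighbours.get(concept)) {
            if (focus.contains(n)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // An invented chain 1-2-3-4-5, not the graph of Figure 2.
        PerspectiveSketch map = new PerspectiveSketch();
        map.link(1, 2);
        map.link(2, 3);
        map.link(3, 4);
        map.link(4, 5);
        System.out.println("A: " + map.apply(Set.of(3)));     // 2 and 4 become properties
        System.out.println("B: " + map.apply(Set.of(4, 5)));  // refocus: 3 is now a property
    }
}
```

Because apply only returns a classification and leaves the adjacency sets untouched, retracting one perspective and asserting another amounts to calling it again with a different focus set, which matches the movement from A to B described above.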









Figure 2. An overview of how perspective filters work. The upper diagram shows a simplified ontology as a series of connected nodes. Two perspectives (A and B) onto this ontology are shown on the lower diagrams. Concepts on the periphery of a perspective are recast as properties and shown as circles. Concepts falling outside a perspective are temporarily removed (grayed out). See text for details.


As an example of how perspectives are created, Figure 3 shows a snapshot from a user session where a perspective is being constructed. Its purpose is to map a local expert’s understanding of vulnerability onto the AktiveSA ontology. The bottom left panel in the display visually depicts the perspective (here comprising three expressions, shown as nodes that have been added separately by the user), and provides a visual editor with which to create and interact with these expressions. The form of these expressions is described in Gahegan et al. (2008).4




4 Note to reviewers. I have a sequence of images showing how a perspective is created to bring the AktiveSA and local ontology into alignment… but they would take up a lot of space. Hence I have included only one snapshot for now. Perhaps these additional images could be added to an accompanying website?






Figure 3. A snapshot of the process of creating a perspective, to map local user knowledge onto an established ontology. The perspective editor, shown at bottom left, contains a visual portrayal of a perspective, at this stage comprising three expressions (the green circles).


As mentioned above, many semantic differences arise because of quite arbitrary choices during ontology construction, or because of a predilection for ‘lumping’ or ‘splitting’, i.e. the degree to which specificity is added to a generalization hierarchy. In Figure 4, a simple hierarchy is shown that contains the concepts of Tree and Eucalypt Tree, and a single instance: e.Maculata. Eucalypts are Evergreens, so this category could be added into the hierarchy, but in doing so, Tree would no longer be directly related to Eucalypt; the dashed relation would be removed, and the two dotted relations would be added. Many local measures of similarity can be misled by such simple differences, even though both ontologies are in most senses entirely commensurate with each other. More to the point, if the concept Evergreen is superfluous, confusing, contentious or absent from the conceptual model of the ontology user, then it need not be shown.















Figure 4. A simple generalization hierarchy, showing the optional inclusion of an additional concept (Evergreen Tree).


Following Sowa and Majumdar (2003), we recognize that the relationship between Tree and Eucalypt remains equally true whether implied via the Evergreen concept, or explicitly linked. For some applications, the concepts of Evergreen and Deciduous may be useful, but for others they are unnecessary. Should a researcher who does not need this distinction be forced to work with it? We believe not, and that it is perfectly acceptable for them to ‘see’ and use this ontology as if Eucalypt and Tree were directly connected.
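Seeing Eucalypt and Tree as directly connected, without editing the stored hierarchy, can be sketched as a lookup that skips any ancestor the user has chosen to hide. The Java class below is a minimal illustration and not the platform's implementation: HierarchyViewSketch and its methods are hypothetical names, each concept is assumed to have at most one parent, and the hierarchy is assumed to be free of cycles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class HierarchyViewSketch {

    /** child -> parent links of the stored hierarchy (single inheritance for brevity). */
    private final Map<String, String> parentOf = new HashMap<>();

    void addSubclass(String child, String parent) {
        parentOf.put(child, parent);
    }

    /** Nearest ancestor that the user's perspective does not hide, or null above the root. */
    String visibleParent(String concept, Set<String> hidden) {
        String parent = parentOf.get(concept);
        while (parent != null && hidden.contains(parent)) {
            parent = parentOf.get(parent);   // the implied relation is followed, not stored
        }
        return parent;
    }

    public static void main(String[] args) {
        HierarchyViewSketch h = new HierarchyViewSketch();
        h.addSubclass("EvergreenTree", "Tree");
        h.addSubclass("Eucalypt", "EvergreenTree");

        // Full hierarchy: Eucalypt sits under EvergreenTree.
        System.out.println(h.visibleParent("Eucalypt", Set.of()));
        // prints EvergreenTree

        // Perspective hiding EvergreenTree: Eucalypt appears directly under Tree.
        System.out.println(h.visibleParent("Eucalypt", Set.of("EvergreenTree")));
        // prints Tree
    }
}
```

The stored links are identical in both calls; only the view differs, which is the sense in which the two versions of the hierarchy remain commensurate.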


The same logic can be applied to non-generalization relations too, and is very useful in emphasizing specific (but implied and therefore non-obvious) patterns in knowledge. The following example should make this point clear.


2.1 Examples of Perspectives in use

As an example, consider an ontology of authors, articles and thematic areas. Typically, we know which articles the various authors have written, and (via keywords) which themes the articles relate to. But we may wish to know, or see directly, which authors share interests in the same themes, or how themes cluster together because they are studied by the same authors. Trying to glean this information from the concept map can be difficult; the indirect connections via articles to researchers add a great deal of (what for this question is) noise.


Figure 5 shows the GIST Body of Knowledge, recently developed to provide a consolidated account of the various themes that comprise GISystems and Science from an educational point of view (http://www.ucgis.org/priorities/education/modelcurriculaproject.asp). An ontology was constructed from the GIST major themes along with their hierarchical relationships. Each major theme (such as analytical methods, cartography and visualization) is colored differently, with the window on the left of the screen acting as a legend for the various themes. Note that the figure is designed to show the breadth of the ontology; the details are not important here.



Figure 5. The GIST ontology, created from the GIST Body of Knowledge document that describes the major teaching themes in the field of GIS. The ontology, like the document, is a hierarchy, comprised of major themes that are further subdivided into specific topics; color is used to differentiate the various themes. The left panel in the display is a navigable legend, displayed as a tree.


The next image, Figure 6, shows a close-up of part of the GIST ontology after authors have been added in (scraped from Google Scholar using the GIST themes as keywords), and connected to the various themes that they have published on. There is a proliferation of new relationships, but they are useful to see which authors publish broadly, and which narrowly (the broader authors have multiple connections, and include different colored themes). However, if the user wishes to see which topics seem to be closely related (i.e. often studied by the same researchers) then this display does not help. The GIST Body of Knowledge is structured as a hierarchy, but in general, many GIScience researchers work on several subtopics across the field.

The final image in this sequence, Figure 7, shows relationships between GIST topics based on authors who work across different areas. The idea is to find topics that are closely related to each other (in terms of what authors study) but classified in GIST into different themes. A perspective was created to derive relationships between topics based on common authors, and then to remove the authors (the concepts that actually link themes together). The figure is shown in close-up, so the detail is readable. To achieve this transformation, the perspective effectively internalizes (rolls up) the relations from topics to authors inside the topic concepts, so that authors are now effectively attributes of topics. Topics are then linked together if they share the same value for any of their author properties.
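The two steps of this transformation (internalizing author links as topic properties, then linking topics that share a property value) can be sketched in Java as follows. This is an illustration under stated assumptions rather than the perspective mechanism itself: the class CoAuthorPerspective, its method names, and the author and topic names in main are all invented, and each author-to-topic relation is represented as a two-element string array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class CoAuthorPerspective {

    /** Step 1: internalize {author, topic} links so each topic carries a set of author values. */
    static Map<String, Set<String>> rollUp(List<String[]> authorTopicLinks) {
        Map<String, Set<String>> authorsOfTopic = new TreeMap<>();
        for (String[] link : authorTopicLinks) {
            authorsOfTopic.computeIfAbsent(link[1], k -> new TreeSet<>()).add(link[0]);
        }
        return authorsOfTopic;
    }

    /** Step 2: link two topics when their author properties share at least one value. */
    static List<String> deriveTopicLinks(Map<String, Set<String>> authorsOfTopic) {
        List<String> topics = new ArrayList<>(authorsOfTopic.keySet());
        List<String> links = new ArrayList<>();
        for (int i = 0; i < topics.size(); i++) {
            for (int j = i + 1; j < topics.size(); j++) {
                Set<String> shared = new TreeSet<>(authorsOfTopic.get(topics.get(i)));
                shared.retainAll(authorsOfTopic.get(topics.get(j)));
                if (!shared.isEmpty()) {
                    links.add(topics.get(i) + " -- " + topics.get(j));
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Invented author and topic names; each array is {author, topic}.
        List<String[]> links = List.of(
                new String[] {"AuthorA", "Cartography"},
                new String[] {"AuthorA", "Geovisualization"},
                new String[] {"AuthorB", "Geovisualization"},
                new String[] {"AuthorB", "SpatialAnalysis"},
                new String[] {"AuthorC", "DataModeling"});
        System.out.println(deriveTopicLinks(rollUp(links)));
        // prints [Cartography -- Geovisualization, Geovisualization -- SpatialAnalysis]
    }
}
```

Authors never appear in the output: they survive only as property values inside the rolled-up map, which corresponds to the researchers being absent from the display in Figure 7.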


















Figure 6. A close-up of part of the GIST ontology with researchers added in and connected to themes, based on the articles they have published. GIST themes are shown via the different colored ellipses, authors are displayed as light blue rectangles (the currently selected author, Alan MacEachren, is highlighted in red). The inset schematic at bottom left shows how the topics and authors are connected.




















Figure 7. A perspective filter is applied to the ontology shown in Figure 6 above, to make explicit the implied relationships between topics (based on co-occurrence of links to researchers). As the schematic at bottom left shows, topics are now directly connected together when they share a researcher, and researchers are now absent. See text for more details.


A perspective, then, is a special kind of mediating expression that extends the Jena query capability, so that concepts, relations and properties can be hidden, internalized, and externalized but cannot be created or destroyed. Perspectives do not change the underlying ontology whatsoever; rather, they support specific views onto it, provided they do not conflict with the underlying semantics (interpreted as above). Using perspectives, a user or a knowledge engineer can shape the ontology to reflect their own understanding, compressing it where it is too specific, and truncating it where it is too broad. They can therefore still use an agreed, community ontology without the need to proliferate new versions to suit their immediate needs, with all the savings this entails in terms of additional maintenance and reconciliation. Small differences can be overlooked where the overall logic is not compromised.

3 Conclusions and Future Work

There seems to be no end to the set of possible semantic similarity methods. Of the many kinds reported, we have experienced mixed, patchy results with all of them. Moving forward from this point demands more rigorous evaluation and comparison. The work we describe in the first part of this paper addresses the problem via an open, extensible platform for computing semantic similarity. We have drawn on the Peircean notions of firstness, secondness and thirdness to provide a strong conceptual underpinning that we believe neatly describes the parameters and goals of different methods and so provides a firm foundation for application development. Our API supports several methods for computing similarity, and eases the problem of adding more. Results are provided visually, and the tools provide a high degree of user interaction.


In parallel with this, with the aim of also producing more cognitively plausible tools with which to filter and mediate ontologies, we have developed the notion of perspectives, described in the second half of the paper. Perspectives are a special case of mediating transformations, or thirdness. They provide some further sophistication to similarity methods, because they offer ways to see past what we argue are small differences in the way knowledge can be encoded. We hope eventually to use these ideas as part of a more advanced similarity reasoner that can itself take different perspectives during the process of computing similarity.




Firstness, secondness and thirdness do seem to be robust notions on which to design a semantic similarity platform, but it appears that a fourthness category may be needed in addition, to deal with the growing possibilities of computing similarity using measures of association from massive collections of use-cases. Similarity by emergence perhaps describes it best. Mitra (2004) describes how this might work, by scraping information from web pages returned from Google searches on the concepts to be compared. We have experimented with these methods at length, and hope to include them in a later release of our software.


Having constructed a platform for evaluating semantic similarity, it is perhaps time to move to a stage of careful experimentation and comparison of our collective set of methods. Such a program will take time and effort, but is needed. Answers to the following deeper questions must now be sought:


1. Which semantic methods are the most reliable and useful in uncovering similarity, or in merging ontologies? This remains an open question that requires some sophisticated experimentation with users, using tools such as the one shown above and more besides! We are keen to hear of experiences from other researchers on this topic.

2. How should different measures of similarity be combined when they are used? Are there contexts or tasks for which some measures work better than others, and if so, what are they? Given that we might discover these contexts, is it possible to achieve useful matching results without supervision by human experts?

3. Is there a useful role for hermeneutics to play in constructing knowledge horizons or some other form of perspective around concept maps, and showing where two horizons might intersect? So far in our experiments with local perspectives we have not tried to explicitly highlight where perspectives diverge, but this might hold some promise for communicating differences in understanding.


Acknowledgements:

This research was funded by the US National Science Foundation (NSF) via grants BCS-9978052 (HERO), ITR (BCS)-0219025, and ITR (EAR)-0225673 (GEON). The authors would like to thank Stephen Weaver for his work in turning GIST into an RDF document (no mean feat) and the organizers of the COSIT 2007 Semantic Similarity Workshop, where some of these ideas were first aired.


References

Agarwal, P. 2005. Ontological Considerations in GIScience. International Journal of Geographical Information Science 19 (5): 501-536.



Bloehdorn, S., Haase, P., Sure, Y., and Voelker, J. 2006. Ontology Evolution. In Semantic Web Technologies: Trends and Research in Ontology-based Systems, eds. Davies, Studer and Warren, 51-70. London: John Wiley & Sons Ltd.

Braspenning, P.J. 2000. Symposium on Intelligent Agents in Software Engineering for Planning, KaHo St.-Lieven, Gent, 23rd February, 2000.

Brodaric, B. 2007. Geo-Pragmatics for the Geospatial Semantic Web. Transactions In GIS 11 (3): 453-477.

Brodaric, B. and Gahegan, M. 2007. Experiments to examine situated geoscientific concepts. Spatial Cognition and Computation Journal (Special Issue on Cognitive Semantics and Ontologies) 7 (1): 61-95.

Clancey, W. 1994. Situated cognition: How representations are created and given meaning. Lessons from Learning. R Lewis and P Mendelsohn (Eds.) Amsterdam, North-Holland: pp. 231-242.

Davies, J., Studer, R., and Warren, P. 2006. Semantic Web Technologies: Trends and Research in Ontology-based Systems. London: John Wiley & Sons Ltd.

Fonseca, F., Camara, G., and Monteiro, A.M. 2006. A Framework for Measuring the Interoperability of Geo-Ontologies. Spatial Cognition and Computation 6 (4): 309-331.

Frodeman, R. 2003. Geo-Logic: Breaking ground between philosophy and the earth sciences. Albany, SUNY Press.

Gahegan, M., Agrawal, R. and DiBiase, D. 2007. Building rich, semantic descriptions of learning activities to facilitate reuse in digital libraries. International Journal on Digital Libraries 7 (1-2): 81-97. URL: http://www.springerlink.com/content/q102m641460h77v6/?p=cae7b09531014c3d9605c2f74b10dbfa&pi=4

Gahegan, M. and Pike, W. 2006. A Situated Knowledge Representation of Geographical Information. Transactions In GIS 10 (5): 727-749.

Gahegan, M., Luo, J., Weaver, S., Pike, W. and Banchuen, T. (in review). Connecting GEON: making sense of the myriad resources, researchers and concepts that comprise a geoscience cyberinfrastructure. Computers & Geosciences, special issue of cyberinfrastructure for the geosciences.

Gruber, T. R. 1993. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5: 199-220.

Guarino, N. 1998. Formal ontology in information systems. In: Guarino, N. (ed.) Formal Ontology in Information Systems. Proc. FOIS'98, Trento, Italy, June 6-8 1998. IOS Press, Amsterdam, pp. 3-15.

Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., and Sure, Y. 2005. A Framework of Handling Inconsistency in Changing Ontologies. International Conference on Semantic Web (ISWC 2005) Lecture Notes in Computer Science 3729, pp. 353-367.

Harvey, F., Kuhn, W., Pundt, H., Bishr, Y., and Riedemann, C. 1999. Semantic Interoperability: A Central Issue for Sharing Geographic Information. The Annals of Regional Science 33: 213-232.

Kavouras, M., Kokla, M., and Tomai, E. 2005. Comparing categories among geographic ontologies. Computers and Geosciences 31 (2): 145-154.

Klein, M. 2004. Change Management for Distributed Ontologies. PhD Dissertation, Dutch Graduate School for Information and Knowledge Systems, Vrije Universiteit, Amsterdam.

Kuhn, W. 2005. Geospatial Semantics: What, of What, and How? Journal on Data Semantics III LNCS 3534: 1-24.



Kuipers, B. 2000. The Spatial Semantic Hierarchy. Artificial Intelligence 111: 191-233.

Lin, K. and Ludäscher, B. 2003. A system for semantic integration of geological maps via ontologies. Proc. of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW).

Mitra, P. 2004. An Algebraic Framework for the Interoperation of Ontologies. PhD Dissertation, Department of Electrical Engineering, Stanford University.

Peirce, C. S. 1891. Review of: Principles of Psychology by William James, Nation 53: 32.

Peirce, C. S. 1931. The Collected Papers of Charles Sanders Peirce. Harvard University Press: Cambridge, MA.

Pike, W., Yarnal, B., MacEachren, A., Gahegan, M., and Yu, C. 2005. Infrastructure for collaboration: Building the future for local environmental change. Environment 47 (2): 8-21.

Pike, W. and Gahegan, M. 2007. Beyond ontologies: towards situated representations of scientific knowledge. International Journal of Human-Computer Studies 65 (7): 674-688.

Rodriguez, M. A., and Egenhofer, M.J. 2003. Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering 15 (2): 442-456.

Schwering, A. 2008. Approaches to Semantic Similarity Measurement for Geo-Spatial Data: A Survey. Transactions in GIS 12 (1): 5-29.

Schwering, A. 2005. Hybrid Model for Semantic Similarity Measurement. Ordnance Survey Research Report Series, Southampton, UK.

Schwering, A. and Raubal, M. 2005. Spatial Relations for Semantic Similarity Measurement. 2nd International Workshop on Conceptual Modeling for Geographic Information Systems (CoMoGIS2005), Klagenfurt, Austria, Springer: Berlin.

Sheth, A. 1999. Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics. In Interoperating Geographic Information Systems, eds. Goodchild, Egenhofer, Fegeas and Kottman, Kluwer: New York, pp. 5-29.

Smart, P.D., Abdelmoty, A.I., El-Geresy, B.A., and Jones, C.B. 2007. A Framework for Combining Rules and Geo-ontologies. RR 2007 Lecture Notes in Computer Science 4524, pp. 133-147.

Sowa, J.F. 2005. The Challenge of Knowledge Soup. Research Trends in Science, Technology and Mathematics Education, pp. 55-90.

Sowa, J.F. 2006. A Dynamic Theory of Ontology. In Formal Ontology in Information Systems, eds. Bennett and Fellbaum, Amsterdam: IOS Press, pp. 204-213.

Sowa, J., and Majumdar, A. 2003. Analogical reasoning, in de Moor A., Lex, W. and Ganter B. (eds.), Conceptual Structures for Knowledge Creation and Communication, LNAI 2746, Springer-Verlag, Berlin, pp. 16-36. http://www.jfsowa.com/pubs/analog.htm

Tversky, A. 1977. Features of Similarity. Psychological Review 84 (4): 327-352.

Von Schweber, E. (2006) URL: http://colab.cim3.net/file/work/Expedition_Workshop/2006_01_24_BootstrappingSOAthroughCOIs/VonSchweber_LivingSystems_2006_01_24.ppt (accessed April 8, 2008).

Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., and Hübner, S. 2001. Ontology-based integration of information: a survey of existing approaches. In: Stuckenschmidt, H. (ed.), IJCAI-01 Workshop: Ontologies and Information Sharing, pp. 108-117.



Whitehead, A. N. 1929. Process and Reality: An Essay in Cosmology. Social Science Book Store: New York.

Whitehead, A. N. 1933. Adventures of Ideas, Macmillan: New York.

Winter, S. 2001. Ontology: buzzword or paradigm shift in GIScience? International Journal of Geographical Information Science 15 (7): 587-590.




Appendix A: UML specification of the similarity Application Programming Interface (1)




Appendix A: UML specification of the similarity Application Programming Interface (2)