Content-based Ontology Matching for GIS Dataset


Jeffrey Partyka, Neda Alipanah, Latifur Khan, B. Thuraisingham
Department of Computer Science
University of Texas at Dallas
{jlp072000, na06100, lkhan, Bhavani.thuraisingham}@utdallas.edu



ABSTRACT

The alignment of separate ontologies by matching related concepts continues to attract great attention within the database and artificial intelligence communities, especially since semantic heterogeneity across data sources remains a widespread and relevant problem. The GIS domain adds additional complexity to ontology alignment due to a multitude of characteristic attributes, including geometry and location. Consequently, the GIS domain presents unique forms of semantic heterogeneity that require a variety of matching approaches.

Our approach considers content-based techniques for aligning GIS ontologies. We examine the associated instance data of the compared concepts and apply a content-matching strategy that measures similarity based on value types derived from the N-grams present in the data. We focus special attention on a method applying the concepts of mutual information and N-grams, developing two separate variations and testing them over a multi-jurisdictional GIS dataset. In order to align concepts, we first find the appropriate columns. For this, we exploit the mutual information between two columns based on the type distribution of their content; intuitively, if two columns are the same, their type distributions should be very similar. We justify the conceptual validity of our ontology alignment technique with a series of experimental results that demonstrate the efficacy and utility of our algorithms on a wide variety of authentic GIS data.


1. INTRODUCTION

The problem of information integration has experienced a number of manifestations since its inception, which resulted from the meteoric popularity of databases after the 1960's. However, the core of this problem has always been the need to consolidate heterogeneous data sources under a single, unified schema. Over the last few decades, a tremendous amount of effort has been expended to discover novel information integration strategies; some of the more popular approaches have used column name similarity, content similarity, and graph similarity to semantically align columns from one data source to the columns of another data source.

Ontology alignment is the most recent incarnation of the information integration problem, but in order to define this properly, the definition of ontology should first be understood. The most popular definition of an ontology is that of a "formal, explicit specification of a shared conceptualization" proposed by Gruber [1]. In practice, ontologies for a given domain consist of a series of classes (or concepts) along with their properties, restrictions and instances, many of which are related by various types of relationships. The alignment of ontologies, therefore, entails deriving correspondences between concepts and their associated properties and instances. Because of the expanded scope provided by ontologies as opposed to relational databases, determining similarity via matching is a more challenging problem.


Within the domain of geographic information systems (GIS), further complexities are introduced within ontology alignment due to geometry and geospatial characteristics (i.e., location) within geodatabases. Often in GIS, determining similar data requires matching location as well as semantics. This poses several challenges. First, a concept may be described by a set of attributes (i.e., columns) rather than a single description (as in text documents, a bag of keywords). Second, attributes are heterogeneous in nature across multiple concepts; some attributes match and some do not. Third, rather than one-to-one matching between attributes, one-to-many matching is possible. Finally, for mapping concepts, we need to find the appropriate attributes between two concepts from two different ontologies. Therefore, in the first step, attribute-level matching between concepts is estimated, and next, concept-level matching is done.


Various approaches towards a solution have been proposed in the GIS literature, involving strategies that measure syntactic, semantic and relational differences between the compared ontologies. This often takes the form of distance measures between compared concept names that analyze the syntactic heterogeneities between the words themselves, as well as heterogeneities regarding their possible meanings in different contexts. Hess et al. [2] outline a classification of semantic heterogeneity types that are typically encountered during ontology alignment in the GIS domain. In regards to ontology alignment algorithms, Sunna and Cruz [3] discuss structural heterogeneity between GIS ontologies and construct a tool [4] that applies their ideas to generate a similarity score that quantifies the synonymy between compared concepts. The approach undertaken here has applied some of the measures discussed above and considered many others, but it goes one step further by exploring similarities between the various attributes associated with the instances of the compared concepts, using a mathematically derived similarity measure based on the concept of mutual information. Furthermore, we apply our similarity measure over a multi-jurisdictional GIS dataset.


In developing a strategy for aligning GIS ontologies, we consider a novel approach based on the information-theoretic concept of mutual information that utilizes content-matching techniques. Specifically, we identify type distributions over distinct N-grams among the columns within the instance data of compared concepts and use these to obtain a similarity value. In particular, using distinct types, we strive to capture patterns from the raw text of a GIS dataset. We extend our similarity measure towards 1:M matching of attributes between concepts, and we present a complete mathematical derivation of our similarity measure based on mutual information. Therefore, while previous work has focused on the use of N-grams within schema integration, our work leverages some of these ideas along with some new variations towards the construction of a wholly innovative ontology matching algorithm.


The rest of this paper is organized as follows. In Section 2, we discuss an overview of related work in the areas of ontology alignment and schema integration. Section 3 states the problem to be solved and our proposed solution. Section 4 presents in detail the ontology alignment algorithm and its mathematical underpinnings. A series of experiments and their associated results, involving different variations of our method over different kinds of text, are discussed in Section 5. Finally, in Section 6, we outline our future work.




2. RELATED WORK

Ontology matching continues to attract extensive interest, particularly with regards to the domain of GIS. Related work includes [2], which formally describes the various ways in which semantic heterogeneity may be encountered during the ontology alignment process in the GIS domain. In addition to name and content heterogeneity, the authors also discuss coordinate heterogeneity, which compares the spatial position of two instances, and relationship heterogeneity, which compares the object references of two instances. Sunna and Cruz [3] describe matching ontologies using structural properties such as sibling similarity and descendant similarity. Using these ideas, they introduce an ontology alignment tool for use in the GIS domain called AgreementMaker [4]. A number of challenges in the GIS domain continue to inspire considerable attention, particularly regarding composite matching.


Schema matching also plays an important role in the content matching process. The survey of approaches to automated schema matching by Rahm and Bernstein [5] presents a comprehensive and useful overview of the methodologies derived over a number of years. Included is a taxonomy which uses several criteria to categorize the matching approaches, such as schema- and content-based methods, element-level and structure-level methods, and linguistic and constraint-based methods. A more recent survey by Doan and Halevy [6] covers newer developments in this area with a focus on related AI and machine learning research.


A number of schema matching publications describing methods tailored more to the database community influenced our work. Dai, Koudas et al. [7] discussed content-based schema matching based on distributions of N-grams among compared columns. Despite the influence of this publication, some crucial differences exist between their approach and the methods explained here. First, their approach used data sources containing raw text from any given domain, whereas our methods specifically target the GIS domain. Second, their approach is designed for the area of schema matching, while our methods are made for the area of ontology matching, which means that in our work, additional complexities needed to be considered, such as concept matching over names as well as content. Third, they defined statistical types only over distributions of N-grams and used these to determine column similarity. In addition to this idea, our approach considers a number of variations. One of these treats N-grams themselves as distinct types extracted from the tuple values of compared columns. Also, these two approaches are applied over regular text and over encoded text, which allowed us to observe our algorithm's performance over vastly different kinds of data. Fourth, we present a complete mathematical derivation of our own similarity measure based on the concept of mutual information. Finally, while they briefly mentioned the potential application of their algorithm to 1:M matching, we apply our similarity measure towards the creation of a 1:M matching algorithm between concepts.

Kang and Naughton [8] propose a two-step process for schema matching when column names in schemas are ambiguous, involving the generation of pairwise attribute correlations and their associated frequency within the compared tables. A few publications seek to optimize the discovery of schema matches. Notable methods include Clio [9], a schema-matching tool with extensive capabilities, and SPIDER [10], built on top of Clio, which uses an idea known as routes, describing the relationship between source and target data, in order to debug schema mappings.


Within the AI community, a number of works in the schema matching area apply machine learning and statistical methods to learn attribute properties from data and examples. Li and Clifton [11] describe a tool known as SEMINT, which uses neural networks to determine match candidates by matching attributes from similar clusters between columns in a 1:1 match. Berlin and Motro [12] describe a tool known as Autoplex, which uses supervised machine learning techniques for automating the discovery of content for virtual database systems. He et al. [13] create a co-occurrence matrix between compared tables and attempt to find the minimum distance between the matrices to determine an optimal similarity value. Finally, Naumann, Ho et al. [14] focus on determining optimal match candidates by classifying source attributes using a modified Naïve Bayes classifier.



3. PROBLEM STATEMENT AND PROPOSAL

3.1 Problem Statement

Given two data sources, S1 and S2, each of which is represented by ontologies O1 and O2 respectively, the goal is to find similar concepts between O1 and O2 by examining their names and their respective instances. Let us assume that O1 and O2 are derived from the GIS domain. Additionally, these ontologies may vary in breadth, depth, and relationship types between their constituent concepts. Figures 1 and 2 display O1 and O2, the ontologies to be aligned. Also displayed for each ontology are their constituent concepts and two sample identifying attributes for each concept. Both ontologies are derived from the Roads and Ferries package of the Geographic Data Files (GDF) data model and the Ontology for Traffic Networks.



With this in mind, an effective ontology alignment procedure would be expected to match up concepts which are semantically equivalent. In this case, O1 and O2 both feature Road and Ferry concepts, so a strong similarity value between each pair would be expected. Furthermore, a close semantic equivalence would also seem to exist between the Residential Area concept of O1 and the Address Area concept of O2, and between Traffic Area of O1 and Enclosed Traffic Area of O2. There may also be a fairly strong semantic similarity between Junction of O1 and Intersection of O2. Our goal is to determine this semantic similarity given instances for concepts.


3.2 Proposed Solution

The challenge involved in the alignment of these ontologies, assuming that they have already been constructed, is based on the derivation of procedures that will maximize the semantic similarity between any two concepts between the ontologies.

A diagram outlining the generalized ontology matching process is displayed in Figure 3. The process consists of the matching of names and content between compared concepts. The name match attempts to determine the degree of synonymy between the concept names. The content match determines similarity between the instances of each concept by measuring their mutual information, and it accomplishes this by the extraction of features known as N-grams from the compared columns. The details of this process will be discussed in Section 4. Both name and content matching contribute equal weight to the overall similarity between the compared concepts.


Pairwise semantic similarity between two concepts will be computed using Algorithm 1. This algorithm executes in the following way: it takes as input one concept named a ∈ O1 and one concept named b ∈ O2. Lines 1 and 2 compute the name and content similarity, respectively, between concepts a and b. Lines 4 and 5 set the weights of name and content similarity equal to each other. Finally, line 6 computes the overall similarity between a and b using the weights assigned in the previous two lines.

The overall similarity between two concepts is an equally weighted, normalized sum of the name similarity and the content similarity. Algorithm 2 displays pseudocode outlining the execution of the ontology alignment program from a higher-level perspective. The end result of this code, once all pairs of concepts between O1 and O2 are examined for similarity, will be the alignment of the ontologies, represented by a 2-dimensional matrix of pairwise similarity values between the concepts, normalized to take on values between 0 and 1.
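For concreteness, the following is a minimal Python sketch of Algorithms 1 and 2 as described above; name_similarity and content_similarity are hypothetical placeholders for the name match and the EBD-based content match of Section 4, each returning a score in [0, 1].

# Sketch of Algorithms 1-2. name_similarity and content_similarity are
# hypothetical stand-ins for the matchers described in Sections 3-4.
def concept_match(a, b, name_similarity, content_similarity):
    """Algorithm 1: equally weighted sum of name and content similarity."""
    s_n = name_similarity(a, b)
    s_c = content_similarity(a, b)
    w_n = w_c = 0.5  # name and content contribute equal weight
    return w_n * s_n + w_c * s_c

def ontology_alignment(O1, O2, name_similarity, content_similarity):
    """Algorithm 2: similarity matrix M plus the best match AC[a] in O2
    for every concept a in O1."""
    M = {(a, b): concept_match(a, b, name_similarity, content_similarity)
         for a in O1 for b in O2}
    AC = {a: max(O2, key=lambda b: M[(a, b)]) for a in O1}
    return M, AC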


Note that we are not attempting to solve the problem of matching specific instances between concepts that model the same geographic object (i.e., verifying that an instance of the concept Road from O1 refers to the same road as an instance of the concept Road from O2). This is because the content similarity between two concepts depends only on the degree to which the instances are members of both concepts. For instance, when comparing a concept Road from O1 and a concept Road from O2, it may very well be the case that the instances of Road from O1 represent roads from Alaska while instances of Road from O2 represent roads from Florida, yet the content similarity between the concepts is measured to be very high. If the Road concepts from O1 and O2 are semantically similar, then the geographic distance between the individual roads becomes irrelevant. In a case such as this, other criteria, such as the use of domain-specific patterns in the content (e.g., Avenue, Blvd for the Road concept), would be more responsible for deciding similarity.








Figure 1: Concepts and attributes of Ontology O1.

Figure 2: Concepts and attributes of Ontology O2.

Figure 3: General Overview of Ontology Matching Process.



Algorithm 1: Ontology-Concept-Match(a, b)

Input: Concept a ∈ O1, Concept b ∈ O2
Output: OS, a normalized, real-valued similarity measure of the input concepts

1: SN = computeNameSimilarity(a, b)
2: SC = computeContentSimilarity(a, b)   // see Section 4.1
3: // assign weight variables to be equivalent
4: WN = 0.5
5: WC = 0.5
6: OS = (WN * SN) + (WC * SC)




Algorithm 2: Ontology-Alignment(O1, O2)

Input: Ontology O1, Ontology O2
Output: AC[a], an array indexed by all concepts a ∈ O1 whose corresponding value for each index is the highest-matching concept b ∈ O2

1: for each concept a of O1 {
2:   for each concept b of O2 {
3:     // compute overall concept similarity
4:     M[a,b] = Ontology-Concept-Match(a, b)
5:   end for
6:   // find appropriate concept from O2 for a
7:   AC[a] = argmax b ∈ O2 (M[a,b])
8: end for


4. MATCHING ALGORITHM: SEMANTIC SIMILARITY BETWEEN TWO CONCEPTS

Our ontology matching algorithm consists of two separate ideas explored in different research areas. The first part involves concept name similarity, a strategy that is traditionally incorporated into ontology alignment algorithms. The second part of our approach involves content similarity. We derive a measure of similarity based on the concepts of entropy and information gain, and we also apply various derivatives of the algorithm in order to maximize our matching potential.

4.1 Content Similarity

At the first level of content similarity, attribute-level matching between concepts is estimated; next, concept-level matching is done using the attribute-level similarity (see Section 4.1.5).

Content matching between two concepts involves measuring the similarity between the instance values for a pair of attributes. This is accomplished by extracting instance values from the compared attributes, subsequently extracting a characteristic set of N-grams from these instances, and finally comparing the respective N-grams for each attribute. An N-gram is simply a substring of length N consisting of contiguous characters. N may be any number; during all of our experiments involving N-grams in this paper, the value of N was set equal to 2.
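For illustration, a minimal sketch of this extraction step (the function name is ours, not the paper's):

def ngrams(value, n=2):
    """Return the contiguous character n-grams of a string."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]

# ngrams("AVENUE") -> ['AV', 'VE', 'EN', 'NU', 'UE']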


We have experimented with a number of varying approaches using 2-grams that ultimately determine the instance similarity between the compared attributes. The approaches differ based on how a value type is specifically defined, and also based on the type of text making up the values in the compared attributes. We will generally consider a value type associated with an instance within a particular attribute to be defined as either a unique 2-character pattern that occurs with a particular frequency, or a unique set of these 2-grams.


4.1.1 Feature Extraction of Distinct N-Grams

In our first approach, we extract distinct N-gram features from the instances themselves and consider each unique 2-gram extracted as a value type. The similarity between the attributes is measured by determining the disparity between the 2-grams extracted and between the frequency of 2-grams they have in common.

Figure 5 shows a comparison of the types extracted from the tuple values based on the relations of compared concepts CA and CB using DNF and TPF. Among the types extracted from the columns A.StrName and B.Street using DNF are LO, OC, CU, ST, TR, RA, R4, and 5/.

Because each distinct 2-gram is considered to be a different type, and due to the fact that the 2-grams extracted from the compared attributes are over regular text, we name this approach the Distinct N-Gram Feature Extraction Approach (DNFR).
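Under our reading of DNF, each distinct 2-gram occurring anywhere in a column is a value type, and the column is summarized by the frequency of each type; a minimal sketch (helper name is ours):

from collections import Counter

def dnf_types(column_values, n=2):
    """DNF: each distinct n-gram in the column is a value type; return
    the frequency of every type across all instance values."""
    counts = Counter()
    for value in column_values:
        for i in range(len(value) - n + 1):
            counts[value[i:i + n]] += 1
    return counts

# dnf_types(["LOCUST", "LONE OAK"]) counts 'LO' twice, 'OC' once, ...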


4.1.2 Feature Extraction of Distinct N-gram Sets From Tuples

An alternative approach to the aforementioned method of content similarity via 2-gram feature extraction is to collect all 2-grams and their corresponding frequencies for each tuple value within one of the compared attributes and use this information to construct a 2-gram set. In this case, the set itself would be considered a value type, rather than any of its constituent 2-grams. As a result, each tuple value within a compared attribute would be associated with an N-gram set of a particular type. In turn, the similarity of two compared attributes between different concepts would be determined by measuring the disparity between the N-gram sets created for each tuple value, along with the frequencies of common N-grams within each set. When executed over regular text, this approach is known as Tuple Feature Extraction (TPFR). While this approach lacks the granularity of DNF, it excels due to its faster execution time and enhanced ability to capture 2-gram information at the tuple level. The latter property allows TPF to create types based on the character makeup of a related subset of tuples.




In Figure 5, using TPF, two types extracted are {LO, OC, CU, ST} and {TR, RA, R4, 5/}. Despite the fact that the same 2-grams for the first tuple string will be extracted using both methods, in TPF only the set counts as a value type. Notice that the set of 2-grams extracted for each tuple effectively defines a value type which will later be matched by any other tuples with a similar character pattern.
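A sketch of TPF under the same assumptions: the value type of a tuple is the whole multiset of its 2-grams (2-grams together with their within-tuple frequencies), so two tuples share a type only when these multisets match exactly:

from collections import Counter

def tpf_type(value, n=2):
    """TPF: the value type of one tuple is the full multiset of its
    n-grams; a frozenset of (gram, count) pairs makes it hashable."""
    grams = Counter(value[i:i + n] for i in range(len(value) - n + 1))
    return frozenset(grams.items())

def tpf_types(column_values, n=2):
    """Frequency of each tuple-level value type across a column."""
    return Counter(tpf_type(v, n) for v in column_values)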



Figure 5: Comparison of columns A.StrName and B.Street.



4.1.3 Measuring Type Similarity

Although different versions of the attribute similarity algorithm involving N-grams have been discussed, we have yet to discuss the specific measure used to quantify similarity between compared attributes. This measure is known as Entropy Based Distribution (EBD), and it takes the following form:

EBD = H(C | T) / H(C)    (2)









Refer to Section 4.1.4 for a derivation of EBD. In this equation, C and T are random variables, where C indicates the union of the column types C1 and C2 involved in the comparison and T indicates the value type. Recall that T is a distinct N-gram for the DNF algorithm. EBD is a normalized value with a range from 0 to 1, where 0 indicates the lowest EBD, or no similarity whatsoever between compared attributes, and 1 indicates the highest EBD. Most of our experiments involve 1:1 comparisons between attributes of compared concepts, so the value of C would simply be C1 ∪ C2. The form of a single value type will either be a 2-gram or a set of 2-grams associated with an instance value, depending upon the specific approach implemented. H(C) represents the entropy of a set of instance values for a particular attribute (or column), while H(C | T) indicates the conditional entropy of a set of instance values associated with a particular value type.


Algorithm 3: Calculate_EBD(a, b)

Input: Attribute a ∈ O1, Attribute b ∈ O2
Output: EBD, a normalized, real-valued similarity measure of the input attributes

1: (C1, T) = extract N-grams & value types for a
2: (C2, T) = extract N-grams & value types for b
3: EBD = H(C | T) / H(C), computed over (C1 ∪ C2, T)
4: return EBD

Algorithm 3 above describes the method by which the EBD for all 1:1 column comparisons between two concepts is calculated. All N-grams and value types are extracted from the compared columns C1 and C2 in lines 1-2, and line 3 computes the final EBD value over C1 ∪ C2.
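The following is a minimal sketch of Algorithm 3 under our reading: C labels each type occurrence with its source column, T is the value type, and the identity H(C | T) = H(C, T) - H(T) from Definition 2 (Section 4.1.4) gives the numerator. Function names are ours; the type tables are frequency Counters such as those produced by the DNF/TPF sketches above.

import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a frequency table (a Counter)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def calculate_ebd(types_c1, types_c2):
    """EBD = H(C|T) / H(C) over the union of two columns' type tables."""
    joint = Counter({("C1", t): c for t, c in types_c1.items()})
    joint.update({("C2", t): c for t, c in types_c2.items()})
    h_c = entropy(Counter({"C1": sum(types_c1.values()),
                           "C2": sum(types_c2.values())}))
    h_c_given_t = entropy(joint) - entropy(types_c1 + types_c2)  # H(C,T)-H(T)
    return h_c_given_t / h_c if h_c else 1.0

With identical type distributions, knowing T tells us nothing about the source column, so H(C | T) = H(C) and EBD = 1; with disjoint type sets, T determines the column, so H(C | T) = 0 and EBD = 0.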


Intuitively, EBD is a comparison of the ratio of column types for each distinct value type (conditional entropy) with the ratio of column types in C (entropy). A column C contains high entropy if it is impure; that is, the ratios of column types making up C are similar to one another. On the other hand, low entropy in C exists when one column type exists at a much higher ratio than any other type. Conditional entropy is similar to entropy in the sense that ratios of column types are being compared. However, the difference is that before computing this ratio, we are given a subset of tuple values that are all associated with a given value type. Figures 6a and 6b provide examples to help visualize the concept. In both examples, crosses indicate column types originating from C1, while squares indicate column types originating from C2. The value types are represented as clusters (larger circles), each of which is associated with a number of tuple values from C1 and C2. In Figure 6a, the total number of crosses is 10 and the total number of squares is 11, which implies that entropy is very high. The conditional entropy is also quite high, since the ratios of crosses to squares within two of the clusters are equal, and nearly equal within the other. Thus, the ratio of conditional entropy to entropy will be very close to 1, since the ratio of crosses to squares is nearly the same from an overall perspective and from an individual cluster perspective. Figure 6b portrays a different situation: while the entropy is 1.0 (since the number of crosses is equal to the number of squares overall), the ratio of crosses to squares within each individual cluster varies considerably. One cluster features all crosses and no squares, while another cluster features a 3:1 ratio of squares to crosses. When computing the EBD value for this example, we will derive a value that is lower than the EBD for the first example because H(C | T) will be a much lower value. Intuitively, this makes sense because the ratios of value types between the compared attributes are dissimilar.


Figure 6a: Distribution of column types and value types when EBD is high; H(C) is similar to H(C|T).

Figure 6b: Distribution of column types and value types when EBD is low; H(C) and H(C|T) have dissimilar values.


4.1.4 Mathematical Basis of EBD

The discussion of EBD in the previous subsection is based on an intuitive understanding of entropy and conditional entropy. This subsection lends some mathematical credence to the aforementioned ideas.

4.1.4.1 Definitions

Definition 1: Let X be a discrete random variable. The entropy H(X) of the distribution p_i = p(x_i) = p(X = x_i) is computed as:

H(X) = - Σ_i p_i log p_i    (3)

Definition 2: The conditional entropy of a random variable X given Y is denoted by H(X | Y) and is computed as:

H(X | Y) = H(X, Y) - H(Y)    (4)

Definition 3: The mutual information between two discrete random variables X and Y is defined as:

I(X; Y) = H(X) - H(X | Y)    (5)

Definition 4: Relative entropy (the Kullback-Leibler distance) is a measure for comparing two distributions p and q and is denoted by KL(p, q). It is computed as follows:

KL(p, q) = Σ_x p(x) log [ p(x) / q(x) ]    (6)

Definition 5: The Jensen-Shannon distance is a standard measurement for comparing two distributions p and q. It is computed as:

JS(p, q, α, β) = α·KL(p, m) + β·KL(q, m), where m = α·p + β·q, 0 < α, β < 1, α + β = 1    (7)

4.1.4.2 Derivation of EBD

Let C1 and C2 be two columns from two different ontologies O1 and O2, respectively, that we would like to compare based on their value types. Using the soft clustering of the two concepts, it is required to associate a conditional probability p(t | Ci) with each cluster Ci in C1 ∪ C2 and each t ∈ T, the set of types; the type distribution p(T | Ci) of each cluster Ci in C1 ∪ C2 is computed from the frequencies of the value types extracted from that column.

We then find the distance between the two type distributions using Equation (7). Our goal is to determine the similarity between the two type distributions that came from C1 ∪ C2. With π_i = p(Ci) and m = π_1·p(T | C1) + π_2·p(T | C2) = p(T), we have:

JS(p(T | C1), p(T | C2), π_1, π_2) = π_1·KL(p(T | C1), m) + π_2·KL(p(T | C2), m)    (8)

Using Definition 4 for cluster Ci and type t, we have:

KL(p(T | Ci), m) = Σ_t p(t | Ci) log [ p(t | Ci) / p(t) ]    (9)

Substituting (9) in Equation (8):

JS(p(T | C1), p(T | C2), π_1, π_2) = Σ_i π_i Σ_t p(t | Ci) log [ p(t | Ci) / p(t) ]    (10)

Noting that

p(Ci) · p(t | Ci) = p(Ci, t)    (11)

and using (11) in Equation (10) results in the following:

JS(p(T | C1), p(T | C2), π_1, π_2) = Σ_i Σ_t p(Ci, t) log [ p(Ci, t) / (p(Ci) · p(t)) ]    (12)

After some simplification, we have:

I(C; T) = JS(p(T | C1), p(T | C2), π_1, π_2)    (13)

The above equation proves that for more similar columns C1 and C2 (as indicated by their value types), I(C; T) will be lower. For example, for two identical columns C1 and C2, I(C; T) = 0. On the other hand, in the case when C1 and C2 are dissimilar, I(C; T) will be higher.

Based on Definition 3:

I(C; T) = H(C) - H(C | T)    (14)

Therefore:

H(C | T) = H(C) - I(C; T)    (15)

We define Entropy Based Distribution (EBD) as follows:

EBD = H(C | T) / H(C)    (16)

and

EBD = 1 - I(C; T) / H(C)    (17)

The above equation proves that EBD and mutual information are inversely proportional. The higher the EBD between C1 and C2, the lower the mutual information, and the opposite is also true.

4.1.5 Algorithm for 1:1 Content Matching

Given Algorithm 3, Calculate_EBD, we can now present the complete algorithm for content matching. Let C1 be a concept from O1 that is compared with C2 from O2, and let A and B be the sets of attributes associated with concepts C1 and C2, respectively. For each attribute in A, we find the corresponding attribute in B that represents an optimal match candidate, and we subsequently calculate the EBD between these attributes. The end result of this process is an array of EBD values (let us call this ArrayEBD), with each attribute in A being associated with exactly one of these values. The final content similarity value between C1 and C2 is computed by taking the average of all values in ArrayEBD; this value is a normalized real number between 0 and 1.
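A minimal sketch of this 1:1 step, assuming calculate_ebd from the earlier sketch and per-attribute type tables for the two concepts:

def content_similarity_1to1(attrs_a, attrs_b):
    """For each attribute of A, take the best-EBD candidate in B, then
    average; the result is a normalized value in [0, 1]."""
    array_ebd = [max(calculate_ebd(a, b) for b in attrs_b)
                 for a in attrs_a]
    return sum(array_ebd) / len(array_ebd)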

4.1.6 Algorithm for 1:M Content Matching

Algorithm 4 below describes our approach to 1:M matching. Let a be a column from concept A in O1 that is compared with M columns b from concept B in O2. The algorithm for this matching is as follows:

Algorithm 4: Multiple-Match

Input: Attribute a of concept A in ontology O1, and a set of attributes b1, b2, ..., bn of concept B in ontology O2
Output: The concatenation Ck of M attributes from O2 which is most similar to a from O1

1: for each attribute bi ∈ B
2:   EBD ← Calculate_EBD(a, bi)
3:   add EBD to similarity list SL
4: end for
5: sort SL in descending order based on EBD
6: pick Ck of highest EBD from SL, without replacement
7: EBD ← Calculate_EBD(a, Ck)
8: repeat
9:   if SL is not empty then
10:    Chighest ← pick the attribute from SL with the highest EBD, without replacement
11:    Ck ← Concat(Chighest, Ck)
12:    EBD' ← Calculate_EBD(a, Ck)
13:  else
14:    break
15:  end if
16: until (EBD' - EBD) < δ
17: output Ck

The algorithm takes as input one attribute a from concept A ∈ O1 and n attributes named b1, b2, ..., bn from B ∈ O2. Lines 2 and 3 compute the EBD values and add them to a similarity list. Line 5 sorts the list based on EBD values. In line 6, the algorithm picks the attribute with the largest EBD. Line 7 finds the new value of EBD for the concatenated attributes of b and attribute a. In line 8, the algorithm uses a loop and checks whether SL is not empty, so that we can find another similar column with regard to EBD in greedy fashion (if one exists). This loop finishes when the difference between the new EBD and the previous EBD is less than a threshold; in other words, when we cannot find any new attributes that improve the EBD score.
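A sketch of Algorithm 4's greedy loop under these assumptions; concat is a hypothetical operation that merges two columns' instance values (and hence their type tables), and delta plays the role of the threshold δ:

def multiple_match(a, b_columns, concat, delta=0.01):
    """Greedy 1:M match: concatenate columns of B, best EBD first,
    while the EBD against attribute a improves by at least delta."""
    sl = sorted(b_columns, key=lambda b: calculate_ebd(a, b), reverse=True)
    c_k = sl.pop(0)                    # start from the best 1:1 match
    ebd = calculate_ebd(a, c_k)
    while sl:
        candidate = concat(sl.pop(0), c_k)
        new_ebd = calculate_ebd(a, candidate)
        if new_ebd - ebd < delta:      # no meaningful improvement: stop
            break
        c_k, ebd = candidate, new_ebd
    return c_k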

5. EXPERIMENTS

We now present the experiments that we conducted regarding concept matching via name similarity and content similarity between two separate ontologies in the GIS domain.

5.1 Experimental Setup

5.1.1 Platform

The machine that conducted the experiments contains an Intel(R) Pentium(R) M 1.70 GHz CPU and 504 MB of RAM. The operating system is Windows XP with Service Pack 2, and our implementation was created using Java version 1.6.0_02 with the Java SE Runtime Environment (build 1.6.0_02) and the Java HotSpot Client VM.

5.1.2 Datasets

Because data from several different areas of the United States were employed in our experiments, we effectively created a multi-jurisdictional GIS environment. Also, the GIS data assigned to the concepts of O1 is disjoint from the data assigned to the concepts of O2. In cases where GIS data for a particular concept was fairly plentiful (e.g., Road), it was a simple matter to find GIS data modeling roads for different cities. However, for Junction and Intersection, only one data source suitable for our experiments was discovered. This single data source was then split into disjoint sets over the relevant concepts.

Table 1 displays a summary of the relevant information regarding the datasets involved in our experiments. As can be seen, the instance data is extremely varied in several ways. The number of instances is as low as 24 (Ferry) and as high as 91059 (Junction and Intersection). Meanwhile, the number of attributes is as low as 3 (Ferry) and as high as 26 (Enclosed Traffic Area), the geographic scope ranges from a particular city (e.g., Dallas) to an entire state (Virginia), and the number of different locations modeled by our data is nearly equal to the number of unique concepts between the ontologies. This last feature highlights the multi-jurisdictional nature of our dataset, and in turn implies that our data is disjoint. As a result, our algorithm was tested in an extremely challenging GIS environment.


Table 1. Description of Data Sources.



5.2 Results

The sizes of the datasets associated with some of the concepts are not uniform, so we decided to calculate the EBD between two columns using a fixed sample of 500-1000 instances.


5.2.1 DNF and TPF Over Encoded Text

Besides using DNF and TPF to compare the instance similarity of two columns between two concepts over regular text, we also decided to test DNF and TPF over encoded text, thus creating DNFE and TPFE. In this experiment, instead of representing each 2-gram using its literal characters, each character is assigned one of three character types. Every character that is a letter (a-z, A-Z) is represented by α, every character that is a digit (0-9) is represented by β, and every non-alphanumeric character is left untouched. Our goal is to find a characteristic pattern in the tuple string which might be consistently associated with a concept attribute. These characteristic patterns for a concept's attributes can act as a membership identifier for its contained tuple strings, which can justify its matches or mismatches with the compared attributes of another concept.
The following is an example of converting a tuple string to its encoded equivalent:

"U.S. highway 75"  →  "α.α. ααααααα ββ"

Unique 2-grams are extracted from the encoded text, and from this point we had the choice of applying either DNFE or TPFE in order to determine attribute similarity. In the following section, we have provided results for encoded text using both methods.
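A sketch of the encoding step as described (letters to α, digits to β, everything else untouched):

def encode(value):
    """Replace letters with 'α' and digits with 'β'; keep other
    characters (spaces, punctuation) as-is."""
    return "".join("α" if ch.isalpha() else "β" if ch.isdigit() else ch
                   for ch in value)

# encode("U.S. highway 75") -> "α.α. ααααααα ββ"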

5.2.2 Concept Similarity Between O1 and O2

The results of the alignment of O1 and O2 using only content similarity of the compared concepts are shown in Table 2. Each cell in the table represents a similarity calculation between one concept in O1 and another concept in O2, and is composed of four separate values. The first two values represent the content similarity over encoded text using TPFE and DNFE, respectively. The last two values represent content similarity over regular text using TPFR and DNFR, respectively. From the results, a number of conclusions can be drawn. First, for most of the concept comparisons, the calculated similarity values generated by using DNF, independent of the text type, are significantly higher than the values generated by TPF. These results can be explained by the more stringent matching requirements of a value type in TPF as opposed to DNF. Keep in mind that for two tuples to have a matching value type in TPF, the sets of 2-grams contained within each must match exactly. If there is even one 2-gram contained in one tuple that the other tuple lacks, then the tuples will represent different value types in TPF. The end result of this situation is that the tuples will not have any value type information in common. However, in DNF, these same tuples would be able to match on nearly all of their 2-grams, which in turn would raise the conditional entropy H(C|T) and result in a higher overall EBD value between the compared columns.




Table 2. EBD values between concepts of O1 and O2.




The second observation to be made from Table 2 is that the EBD values obtained over raw text were far lower than those obtained over encoded text. The reason for this is that for DNF, the large increase in the number of possible 2-grams generated trivially leads to a large number of value types between the compared columns. For TPF, all that is required to distinguish one 2-gram set from another is a single 2-gram; consequently, the number of unique sets of 2-grams generated via TPF will also rise sharply. Because of the expanded possibilities in 2-grams and 2-gram sets, there will also be far more value types present within the compared columns. This means that there is a greater possibility of unmatched types, and as a result, the conditional entropy values are more likely to be dissimilar. As a result of the above reasoning, along with the reasoning for TPF values being lower than DNF values, the value of TPFR is notably low.


A final observation from Table 2 is that despite the discrepancies noted above, some sensible correlations emerge. For instance, the concepts Traffic Area and Enclosed Traffic Area share a high concept similarity based on TPF and DNF over both encoded and raw text. This is particularly evident when measuring the relative similarity values for either concept as compared to other matching concepts. The content similarity between Traffic Area and Enclosed Traffic Area using TPF over encoded text was at a minimum .38 higher than that of other concepts for Traffic Area and .45 higher than that of other concepts for Enclosed Traffic Area. Notable correlations also existed between Residential Area-Address Area and Junction-Intersection.


However, note that a few strong correlations which do not seem sensible are present in the data. For example, the content similarity algorithm measured a high score between Junction-Address Area; in fact, this score was closely related to the score generated between Junction-Intersection. On the other hand, there are a few concepts that seemingly should have a higher similarity score, but instead display a low score in Table 2. For instance, the Road concepts of O1 and O2 should possess higher content similarity.


The kind of anomaly exhibited by Junction-Address Area exists for the following reason. In situations exhibited by the falsely inflated score of the Junction-Address Area comparison, the similarity is high when many columns match well on their value types. For example, since the data type matches between these two columns, it is quite possible that the ratios of extracted 2-gram features will be closely related. This can occur between columns of the integer data type that represent id values of different kinds. Even though there is little semantic correlation between the columns, the computed EBD value will tend to be high because of the limited number of possible 2-grams that can be extracted from these columns.


The second type of anomaly, exhibited by the Road concepts of O1 and O2, occurs when semantically similar columns are modeled using different data type conventions. For example, the attribute Road.FTYPE of O1 is a string-valued data type containing the types of road names (Rd., Ave., etc.), while Road.FEAT_TYPE from O2 models this same information as a series of integers (0-7), and thus is an integer data type. Another example is Road.FDSUF from O1, a string type, and Road.SURFACE, an integer type. While the Road concepts do have a number of columns that match strongly, there are too many column matchings that fit the profile described above to allow for a higher overall content match.



Table 3 displays concept similarity resu
lts in a format
exactly like Table 2, with the only difference being that
name similarity is now factored into the process with a
weight equal to content similarity.


Notice that many of the aforementioned scoring anomalies have disappeared. The Road concepts from O1 and O2 now exhibit the kind of strong match that we originally expected. Furthermore, the Junction-Address Area match has weakened substantially, and the Junction-Intersection similarity score is now the strongest correlation for the Junction concept by about .20. The scores for other anomalies mimicking the behavior of Junction-Address Area are generally much lower than the scores from Table 2. This makes sense given the fact that concept name similarity exerts an influence equal to that of content similarity; in this case, anomalous concept pairings with vastly different names will exhibit a degraded score, while other concept pairings such as Road-Road and Ferry-Ferry will benefit.


5.2.3 EBD Computations

The second set of results from our experiments illustrates EBD computations for two different concept comparisons using DNF. The first concept pairing, Junction-Intersection, illustrates a situation where the columns between the concepts yield a high EBD value, while the second concept pairing illustrates a scenario where the matching columns yield a low EBD value.

Table 4 illustrates the Junction-Intersection pairing along with all of its column pairings. Note that, other than the comparison between Intersection.INTSEC_DIR and Junction.description, all EBD values calculated for the remaining column pairings are above .90. This is consistent with our data, since each of these column pairings is indeed semantically similar. The matching of Intersection.INTSEC_DIR and Junction.description resulted from the lack of a semantically equivalent column for both, and a result of .10 clearly shows that their content was largely unrelated.

Table 5 illustrates the opposite situation of Table 4. Here, an example of a concept match featuring low overall EBD is displayed, along with all constituent column pairings. Other than the comparison of Junction.OLD_KEY and Enclosed Traffic Area.%_Secondary_VMT, all column comparisons display an EBD below .50.






Table 3. Name + EBD values between concepts of O1 and O2.





In reality, none of the column pairings are semantically similar to each other; the high EBD score between Junction.OLD_KEY and Enclosed Traffic Area.%_Secondary_VMT results from their matching data types and the lack of any corresponding column for either in the datasets for Junction and Enclosed Traffic Area. This is also the reason why Junction.roadID is paired up with Enclosed Traffic Area.State_Total_Road_Length; in this case, however, there was enough variation in the extracted 2-grams to lower the EBD score.



Table 4. EBD Computation for Intersection-Junction using DNF.

Table 5. EBD Computation for Junction-Enclosed Traffic Area using DNF.



5.2.4 EBD Computation for Composite Match

Table 6 below illustrates the EBD computation between two concepts containing instance data relating to residential areas when a composite match applies. Concept A contains a single column called 'Address' which contains a string value composed of a city, state and zip code, whereas Concept B contains each of these fields in separate columns along with some other unrelated columns. The Multiple-Match algorithm defined earlier first computes the EBD value between A.Address and all other columns in Concept B and sorts the results in descending order of EBD value. It starts with a 1:1 match with B.City, since this column produced the highest EBD value with A.Address. Next, B.State, which had the second highest EBD value with A.Address, is concatenated with B.City, and the EBD value of the new combined column with A.Address is computed. Since it is significantly higher, we then concatenate B.Zip. The EBD value jumps to .83, since now we are effectively comparing two columns that both feature city, state and zip code information. When we try to add the next column, B.PhoneNumber, the overall EBD value drops, signifying the end of the algorithm. This makes sense, since no phone number information is contained within A.Address.



Table 6. EBD Computation for Composite Match using DNF over raw text.



6. CONCLUSION & FUTURE WORK

In this paper, we have outlined an algorithm that aligns two separate ontologies from the GIS domain using content similarity. We focused special attention on the content similarity algorithm, which calculates the EBD between the associated concept columns. Next, we illustrated and discussed the results of a series of experiments which displayed the capabilities of our content matching algorithm over both encoded text and raw text. Multiple variations of the algorithm, based on two different N-gram extraction techniques, were explored. They were tested on authentic, multi-jurisdictional GIS data, and the content matching results were compared against results that used both concept name matching and content matching. Finally, we provided a more in-depth look at some sample computations of the EBD value between two different concept pairings.


Future efforts regarding ontology matching within and beyond the GIS domain will focus on improvement using a variety of strategies. We will apply structure-level matching techniques in an attempt to more accurately and thoroughly examine concept similarity. We will seek to analyze some of the more traditional techniques, such as parent-child and sibling relationship similarity, and apply them in accordance with our specific EBD. We will also explore the application of domain-specific knowledge in an attempt to increase the content matching accuracy. Another possibility is to seek out variable-length N-grams in the data which designate some common substring of a GIS object, such as "AVE.", "RD.", "TERMINAL".


REFERENCES

[1] T. R. Gruber, "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition, 5(2), pp. 199-220, 1993.

[2] G. N. Hess, C. Iochpe, A. Ferrara, and S. Castano, "Towards Effective Geographic Ontology Matching," GeoS 2007, pp. 51-65.

[3] W. Sunna, "Multilayered Approach to Aligning Heterogeneous Ontologies," Ph.D. dissertation, University of Illinois at Chicago, 2007.

[4] W. Sunna and I. Cruz, "Structure-based Methods to Enhance Geospatial Ontology Alignment," Second International Conference on Geospatial Semantics, Mexico City, Mexico, November 2007.

[5] E. Rahm and P. A. Bernstein, "A survey of approaches to automatic schema matching," VLDB Journal, vol. 10, pp. 334-350, 2001.

[6] A. Doan and A. Halevy, "Semantic integration research in the database community," AI Mag., vol. 26, no. 1, pp. 83-94, 2005.

[7] B. T. Dai, N. Koudas, D. Srivastava, A. K. H. Tung, and S. Venkatasubramanian, "Validating Multi-column Schema Matchings by Type," 24th International Conference on Data Engineering (ICDE), pp. 120-129, 2008.

[8] J. Kang and J. F. Naughton, "Schema matching with opaque column names and data values," in Proc. SIGMOD, 2003, pp. 205-216.

[9] L. L. Yan, R. J. Miller, L. M. Haas, and R. Fagin, "Data-driven understanding and refinement of schema mappings," in Proc. SIGMOD, 2001, pp. 485-496.

[10] L. Chiticariu and W. C. Tan, "Debugging schema mappings with routes," in Proc. VLDB, 2006, pp. 79-90.

[11] W. S. Li and C. Clifton, "SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks," Data Knowl. Eng., vol. 33, no. 1, pp. 49-84, 2000.

[12] J. Berlin and A. Motro, "Autoplex: Automated discovery of content for virtual databases," in Proc. CoopIS, 2001, pp. 108-122.

[13] B. He, K. C.-C. Chang, and J. Han, "Discovering complex matchings across web query interfaces: a correlation mining approach," in Proc. KDD, 2004, pp. 148-157.

[14] F. Naumann, C.-T. Ho, X. Tian, L. M. Haas, and N. Megiddo, "Attribute classification using feature analysis," in Proc. ICDE, 2002, p. 271.