
A Framework for Matching Schemas of Relational Databases

by

Ahmed Saimon Adam

Supervisor
Dr. Jixue Liu

A minor thesis submitted for the degree of
Master of Science (Computer and Information Science)

School of Computer and Information Science
University of South Australia

7th February 2011





Declaration

I declare that this thesis does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university; and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text.

………………………………………………..
Ahmed Saimon Adam
February 2011





Acknowledgments

I would like to thank my supervisor Dr. Jixue Liu for all the help, support and advice he has provided to me during the course of writing this thesis. Without his expert advice and invaluable guidance, I would not have been able to complete this thesis.

I would also like to thank our program director Dr. Jan Stanek for providing outstanding support and encouragement in times of difficulty and at all times.






CONTENTS

1. Introduction
2. Schema matching architecture
   2.1. Input component
   2.2. Schema matching process
        2.2.1. Matching at element level
               2.2.1.1. Pre-processing element names
               2.2.1.2. String matching
                        2.2.1.2.1. Prefix matching
                        2.2.1.2.2. Suffix matching
               2.2.1.3. String similarity matching
                        2.2.1.3.1. Edit distance (Levenshtein distance)
                        2.2.1.3.2. N-gram
               2.2.1.4. Identifying structural similarities
                        2.2.1.4.1. Data type constraints
                        2.2.1.4.2. Calculating final structural similarity score
        2.2.2. Matching at instance level
               2.2.2.1. Computing the instance similarity
   2.3. Schema matching algorithms
        2.3.1. Algorithm 1: Find name similarity of schema element
   2.4. Similarity computation
        2.4.1. Computation at matcher level
        2.4.2. Computation at attribute level
        2.4.3. Computing similarity at table level
        2.4.4. Tie-breaking
               2.4.4.1. Tie breaking in table matching
               2.4.4.2. Tie breaking in attribute matching
3. Empirical evaluation
   3.1. Experimental setup
        3.1.1. Database
        3.1.2. Schema matcher prototype
   3.2. Evaluating the accuracy of matching algorithms
        3.2.1. Prefix matching algorithm
        3.2.2. Suffix matching algorithm
        3.2.3. N-grams matching algorithm
        3.2.4. Structural matching algorithm
        3.2.5. Instance matching
        3.2.6. Overall accuracy of similarity algorithms
        3.2.7. Effect of schema size on the overall precision
        3.2.8. Efficiency of schema matching process
4. Conclusion
5. References








List of tables

Table 1: Variations in data types across database vendors
Table 2: Schema matching algorithms
Table 3: Examples of string tokenization
Table 4: Prefix matching example 1
Table 5: Prefix matching example 2
Table 6: Suffix matching example
Table 7: Structural metadata for structural comparison
Table 8: Structural matching example
Table 9: Data type matching example
Table 10: Properties utilized for instance matching
Table 11: Sample dataset for Schema1
Table 12: Examples showing how the statistical calculations are performed
Table 13: Matcher level similarity calculation example
Table 14: Combined similarity calculation at attribute level
Table 15: Table similarity calculation example 1
Table 16: Table similarity calculation example 2
Table 17: Tie breaking example in attribute matching
Table 18: Summary of experimental schemas
Table 19: Software tools used for prototype
Table 20: Prefix matching sample results
Table 21: Suffix matching sample results
Table 22: N-gram matching sample results
Table 23: Structure matching sample results
Table 24: Instance matching sample results
Table 25: Best matching attributes for smaller schemas with 4 tables
Table 26: Best matching attributes for single table schemas
Table 27: Decrease in efficiency with increase in schema size






List of figures

Figure 1: Schema matcher architecture
Figure 2: Sample snippet of a schema in DDL
Figure 3: Combined final similarity matrix
Figure 4: Tie breaking example 1
Figure 5: Tie breaking example 2
Figure 7: Schema matcher prototype
Figure 8: Precision of matching algorithms (%)
Figure 9: Precision vs. schema size
Figure 10: Schema size vs. processing time



Abstract

Schema matching, the process of identifying semantic correlations between database schemas, is one of the vital tasks in database integration. This study is based on relational databases. In this thesis, we investigate past research in schema matching to study the techniques and methodologies in the field. We define an architecture for a schema matching system and propose a framework based on it, considering the factors of scalability, efficiency and accuracy. With the framework, we also develop several algorithms that can be used in detecting similarities in schemas. We build a prototype and conduct an experimental evaluation in order to evaluate the framework and the effectiveness of its algorithms.


1. INTRODUCTION

1.1. BACKGROUND

The schema of a database describes how its concepts, their relationships and constraints are structured. Schema matching, the process of identifying semantic correlations between database schemas, is one of the vital and most challenging tasks in database integration (Halevy, Rajaraman & Ordille 2006).

Many organizations, in adapting to the rapid and continuous changes taking place over the years, have been developing and using various databases for their varied applications (Parent & Spaccapietra 1998), and this continues to be the case after so many years (Ziegler & Dittrich 2004). This tendency to create diverse applications and databases is due to the vibrant business environment: structural changes to organizations, the opening of new branches at geographically dispersed locations and the exploitation of new business markets are some of the reasons.

As databases grow, organizations need to use their databases in a consolidated manner for research and analytical purposes. Data warehousing and data mining are results of some of the large scale integration of databases (Doan & Halevy 2005).

Currently, many data sharing applications such as data warehousing, e-commerce and semantic web processing require matching schemas for this purpose. In such applications, a unified virtual database that comprises multiple databases needs to be created in order to flawlessly access the information available from those databases (Doan & Halevy 2005).



Database integration research has been ongoing since the late 70s (Batini, Lenzerini & Navathe 1986; Doan & Halevy 2005), and the first of such mechanisms was attained in 1980 (Ziegler & Dittrich 2004). Since the early 90s, database integration methods have been put to use commercially as well as in the research field (Doan & Halevy 2005; Halevy, Rajaraman & Ordille 2006).


1.2. Motivation

Schema matching has traditionally been done in a highly manual fashion (Do, H 2006; Do, H & Rahm 2007); but it is laborious, time consuming and highly susceptible to errors (Huimin & Sudha 2004; Doan & Halevy 2005; Po & Sorrentino 2010), and sometimes not even possible (Domshlak, Gal & Roitman 2007).

One of the major issues in the schema matching process is the disparity in the structure and semantics of the databases involved. Often, how database designers perceive a certain concept differs from one designer to another. Therefore, the same real world concept might have different representations, or it is possible that two different concepts are represented as the same in the database (Batini, Lenzerini & Navathe 1986).

Moreover, when building a database integration project, the information necessary for correctly mapping semantics might not be available. The major sources of information, the original designers, might not be available, or the documentation might not be sufficient. Consequently, the matching has to be performed from the limited evidence available from schema element names and instance values (Doan & Halevy 2005).

Nonetheless, even the information available from the schema elements might not be enough for adequate identification of matches (Doan & Halevy 2005). Usually, in order to find the right correspondences in schema elements, an exhaustive matching process has to be conducted where every element needs to be evaluated to ensure the best match among all the available elements; this is very costly and cumbersome (Doan & Halevy 2005).

In addition to these obstacles, schema matching is largely domain specific. For example, an attribute pair declared as highly correlated in one domain might be completely unrelated in another domain (Doan & Halevy 2005). Hence, because of its highly cognitive nature, the likelihood of making the process fully automatic is yet uncertain (Doan, Noy & Halevy 2004; Aumueller et al. 2005; Doan & Halevy 2005; Do, H & Rahm 2007; Po & Sorrentino 2010).




However, the demand for more efficient, scalable and autonomous schema matching systems is continuously rising (Doan, Noy & Halevy 2004; Po & Sorrentino 2010) due to the emergence of the semantic web, advancements in e-commerce and the increasing level of collaboration among organizations.

Consequently, according to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), researchers have proposed several semi-automatic methods. Some adopt heuristic criteria (Doan & Halevy 2005), while others have used machine learning and information retrieval techniques in some approaches (Rahm & Bernstein 2001; Shvaiko & Euzenat 2005).

However, no one method can be chosen as the best schema matching solution due to the diversity of data sources (Bernstein et al. 2004; Domshlak, Gal & Roitman 2007). Therefore, a generic solution that can be easily customised for dealing with different types of applications is more useful (Bernstein et al. 2004).

1.3. Previous work

According to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan, Noy & Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), many approaches and techniques have been proposed in the past for resolving schema matching problems. Some methods adopt techniques from machine learning and information retrieval, while others use heuristic criteria on composite and hybrid techniques.

For a better interpretation of the underlying semantics in the schemas, a combination of different types of information is used in many systems, and mainly two levels of database information are exploited in schema matching techniques (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007). They are the metadata available at schema level and the actual instance data.

Schema information is used for measuring similarities at element level (e.g. attribute and table names) and at structural level (e.g. constraints such as data type, field length, value range); at instance level, patterns in data instances are analysed. Moreover, many algorithms use auxiliary information such as user feedback, dictionaries, thesauri and domain knowledge for dealing with various problems in schema matching (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007).










1.4. Research objectives and methodology

Investigations on schema matching

In this research, our focus is on relational databases only. This research will investigate existing schema matching techniques and their architectures, and identify what information can be exploited, how, and under what circumstances (e.g. what technique can be used when attribute names are ambiguous?).

Propose a schema matching framework

Based on the findings, we propose a schema matching framework for relational databases. This framework is based on the findings on how to best represent input schemas, output results and other interfaces in the architecture, considering scalability, accuracy and customizability. Taking into account the findings from this investigation, we also implement multiple matching algorithms that exploit different clues available from the database, together with specifications of how to implement different types of algorithms on the system. We also establish how matching accuracy can be measured in a quantifiable manner.

Prototype and empirical evaluation

Based on this framework, we build a prototype and conduct an empirical evaluation to demonstrate the validity of the research. In our experiments, we demonstrate the accuracy of the matching algorithms and the scalability and customisability of the architecture. We also demonstrate the effects on matching accuracy when we use different combinations of algorithms. We use precision and recall measurements, techniques from information retrieval (IR), for measuring accuracy and providing results.








2.

SCHEMA MATCHING ARCH
ITECTURE

The
schema matching
architecture of
this system is
defined in
three
major
components
, similar to many other systems
(Rahm & Bernstein 2001; Doan & H
alevy
2005; Shvaiko & Euzenat 2005)
;

Input

component:

establishes acceptable formats and standards for the schemas
.
Also,
it converts schemas into a format that can be manipulated by the schema matching
softwar
e.

Matching component
:

Matching

tasks are performed

in
this component. It c
onsists of
a
set
of elementary algorithmic functions, each of which is called a
matcher
.

Each
matcher utilizes some type of information (e.g. element names in name matchers) to
perform a sub
-
function in the matching process. Matchers are implemented by
executing them in
a series
.

Output
:

component:
delivers the final mapping results.

This pr
ocess is depicted in the diagram below

(
Figure
1
: Schema matcher architecture

)
.




Figure
1
: Schema matcher architecture

2.1. Input component

Schemas of relational databases are often created in SQL DDL, which is a series of SQL "CREATE" statements. Therefore, information about schema elements such as table names, attribute names, data types, keys and other available information can be obtained from the schema, by the input module, in a standard format.



In this framework, the input component accepts database schemas in SQL DDL (Data Definition Language) format and parses them, similar to (Li & Clifton 2000), to extract the schema information. The software accepts two schemas at a time, and the SQL DDL has to be in a text file with a .sql extension. Once the required information is extracted, the input component utilizes these data to build the appropriate Schema objects and hands the Schema objects over to the Matching Component. These Schema objects are manipulated by the software to perform all the necessary operations.

Schemas from different database engines will have variations in terms of data types, constraints and other vendor specific properties (Li & Clifton 2000). For example, in Table 1, the attribute income from Schema A has the data type SMALLMONEY and needs to be matched with an attribute in Schema B. As Schema B is from an Oracle database, it does not have a SMALLMONEY data type as in SQL Server.

                  Schema A        Schema B
Database engine   MS SQL Server   Oracle
Data type         SMALLMONEY      NUMBER
Attribute name    Income          income

Table 1: Variations in data types across database vendors

Therefore, similar to the data type conversion process in (Li & Clifton 2000), we use conversion tables to map data types if the schemas are from different database types, so that schemas from different databases can also be compared. Currently, the system supports conversion between Oracle 11i and SQL Server 2008 schemas. The conversion tables can be modified to support additional relational database engines such as MySQL. This conversion table is given in APPENDIX 1.

The example in Figure 2 shows a snippet of a schema that has been converted into the standard SQL DDL format and is ready to be used by the system for conducting schema matching operations.

…..
CREATE TABLE ADAAS001.CUSTOMERS (
    CUSTOMER_ID NUMBER NOT NULL,
    FIRST_NAME VARCHAR2(10) NOT NULL,
    LAST_NAME VARCHAR2(10) NOT NULL,
    DOB DATE,
    PHONE VARCHAR2(12),
    PRIMARY KEY (CUSTOMER_ID)
);
CREATE TABLE ADAAS001.EMPLOYEES (
    EMPLOYEE_ID NUMBER NOT NULL,
    MANAGER_ID NUMBER,
    FIRST_NAME VARCHAR2(10) NOT NULL,
    LAST_NAME VARCHAR2(10) NOT NULL,
    TITLE VARCHAR2(20),
    SALARY NUMBER(6),
    PRIMARY KEY (EMPLOYEE_ID)
);
…..

Figure 2: Sample snippet of a schema in DDL
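
Conceptually, the extraction step pulls (table, attribute, data type) triples out of the CREATE statements. The following is a minimal Python sketch of that idea; parse_ddl and its simplified one-definition-per-line grammar are our illustrative assumptions, not the prototype's actual implementation:

    import re

    # A minimal sketch of the input component's extraction step, assuming a
    # simplified DDL grammar with one attribute definition per line (Figure 2).
    def parse_ddl(ddl_text):
        """Extract {table: [(attribute, data_type), ...]} from CREATE TABLE DDL."""
        schema, table = {}, None
        for line in ddl_text.splitlines():
            line = line.strip().rstrip(",")
            create = re.match(r"CREATE\s+TABLE\s+(\S+)\s*\(", line, re.IGNORECASE)
            if create:
                table = create.group(1)
                schema[table] = []
            elif table and line.startswith(")"):
                table = None                    # end of the CREATE statement
            elif table:
                attr = re.match(r"(\w+)\s+(\w+(?:\(\d+(?:,\s*\d+)?\))?)", line)
                if attr and attr.group(1).upper() not in ("PRIMARY", "FOREIGN", "UNIQUE"):
                    schema[table].append((attr.group(1), attr.group(2)))
        return schema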


2.2. Schema matching process

It is very likely that elements that correspond to the same concept in the real world will have similarities in databases in terms of structure and patterns in data instances (Li & Clifton 2000; Huimin & Sudha 2004; Halevy, Rajaraman & Ordille 2006). Therefore, according to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan, Noy & Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), by using a combination of different types of information, a better interpretation of the underlying semantics in the schemas can be achieved.

At schema level, we exploit metadata available from the schemas for measuring similarities at element level (e.g. attribute and table names) and at structural level (e.g. constraints such as data type, field length, value range).

As in other composite matching systems (Doan, Domingos & Halevy 2001; Embley, Jackman & Xu 2001; Do, Hai & Rahm 2002), we perform the schema matching in stages where the source schema S_S is compared with the target schema S_T.

Once the Schema objects are constructed by the input component and received at the Matching Component, the matching operations begin. At the beginning of the schema matching process, a similarity matrix, M, is initialized, and at each stage a function, match, is executed to compute the similarity score S for that function. M is refined with the values of S at each stage.

The sequence of match execution, as in (Giunchiglia & Yatskevich 2004), and the scoring weights are predefined. The algorithms we use are in Table 2.

Sequence   Matcher name
1          matchPrefix
2          matchSuffix
3          matchNgrams
4          matchEditDistance
5          matchStructure
6          matchInstance

Table 2: Schema matching algorithms





2.2.1. Matching at element level

In the first stage of the schema matching process, emphasis is given to determining the syntactic similarity in attribute and table names by employing several name matching functions. In every match, each of the elements in the source schema S_S is compared with those of the target schema S_T to obtain the similarity score S for the match.

2.2.1.1. Pre-processing element names

In designing schemas, abbreviations or multiple words are mostly used for naming elements instead of single words (Huimin & Sudha 2004); as a result, element names that have the same meaning might differ syntactically (Madhavan, Bernstein & Rahm 2001).

Prior to performing the string matching process, element names need to be pre-processed for a better understanding of the semantics in element names and for achieving higher accuracy in the subsequent processes.

Expansion of abbreviations and acronyms: First, we tokenize the element names, similar to (Monge & Elkan 1996; Do, Hai & Rahm 2002; Wu et al. 2004), by isolating them based on delimiters such as punctuation marks, spaces and substrings in camel-case names. Tokenization is conducted in order to uncover parts in the name that may go undetected if this possibility is not considered. For example, consider comparing the two strings hum-res and human-resource. Performing a prefix matching operation on them will discover that these two strings are the same if done after tokenizing. However, executing the same operation without tokenizing will not detect the similarity to the level it deserves. The tokenization process is depicted in Table 3.

Element name   Tokenized substrings   Isolation based on
finHRdistr     fin, HR, distr         Camel-case naming
Daily-income   Daily, income          Hyphen delimitation
In_patient     In, patient            Underscore delimitation

Table 3: Examples of string tokenization
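
As an illustration of these conventions, a minimal tokenizer could look as follows (a Python sketch with our own function name; the prototype's actual implementation may differ):

    import re

    # A minimal sketch of the pre-processing step, assuming the three
    # delimiting conventions of Table 3: camel-case, hyphens, underscores.
    def tokenize(name):
        parts = re.split(r"[-_\s.,;]+", name)    # punctuation/space delimiters
        tokens = []
        for part in parts:
            # capitalized word | acronym | lowercase word | digit run
            tokens.extend(re.findall(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|\d+", part))
        return tokens

    print(tokenize("finHRdistr"))    # ['fin', 'HR', 'distr']
    print(tokenize("Daily-income"))  # ['Daily', 'income']
    print(tokenize("In_patient"))    # ['In', 'patient']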







2.2.1.2. String matching

Once pre-processing has been done, the string matching operations are performed consecutively. A string matching algorithm, a matcher, is used to calculate a similarity score at each matching process.

Matching suffixes and prefixes are two forms of string matching done in many systems, as in (Madhavan, Bernstein & Rahm 2001; Do, Hai & Rahm 2002; Melnik, Garcia-Molina & Rahm 2002; Giunchiglia & Yatskevich 2004; Shvaiko & Euzenat 2005).

2.2.1.2.1. Prefix matching

First, prefix matching is done between the element names of the schemas to identify whether the string with the longer length starts with the shorter string. This type of matching is especially useful in detecting abbreviations (Shvaiko & Euzenat 2005). It gives a score in the range [0, 1], where 0 means no similarity and 1 an exact match.

When comparing two name strings, every string token in one string is compared with every token in the other string. The match score for both prefix matching and suffix matching is calculated using this formula:

S = k / ((x + y) / 2)

where k is the number of matching sub-strings, x is the number of substrings in the element name A from S_S, and y is the number of substrings in the element name B from S_T.

In both prefix and suffix matching, we use a first-come, first-served tie breaking strategy similar to (Langville & Meyer 2005). The rationale for using this strategy is further explained in section 2.4.4 Tie-breaking.

The prefix matching process is depicted in the following example. Consider two strings being compared, inst_tech_demonstrator and institute-technology-demo. After tokenization, the two strings become

inst_tech_demonstrator → (inst, tech, demonstrator)
institute-technology-demo → (institute, technology, demo)

By performing a prefix matching operation on every token with each other, the number of matching substrings, k, is obtained. Table 4 shows an example of prefix matching.


               Element B
Element A      institute   technology   demo
inst           1           0            0
tech           0           1            0
demonstrator   0           0            1

Table 4: Prefix matching example 1

In this case, k = 3.

Similarity S between A and B
= 3 / ((3 + 3) / 2)
= 1

A score of 1 indicates that these two fields match exactly when prefix matching is done.
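
Building on the tokenizer sketched in section 2.2.1.1, the prefix matcher itself can be sketched as follows (Python, illustrative names; it implements S = k / ((x + y) / 2) with the first-come, first-served tie breaking of section 2.4.4). A suffix matcher is obtained by replacing startswith with endswith.

    # A minimal sketch of the prefix matcher over tokenized names,
    # assuming the tokenize() helper sketched in section 2.2.1.1.
    def match_prefix(name_a, name_b):
        p, q = tokenize(name_a), tokenize(name_b)
        matched = set()                 # indices of q tokens already paired
        k = 0
        for pm in p:
            for j, qn in enumerate(q):
                if j in matched:
                    continue
                longer, shorter = (pm, qn) if len(pm) >= len(qn) else (qn, pm)
                if longer.lower().startswith(shorter.lower()):
                    matched.add(j)
                    k += 1
                    break               # first-come, first-served tie breaking
        return k / ((len(p) + len(q)) / 2)

    print(match_prefix("inst_tech_demonstrator", "institute-technology-demo"))  # 1.0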

Sometimes, prefix matching is not very accurate in determining the similarity. It might give a high score if the token prefixes match, even though the strings are not related.

            Element B
Element A   maths   teacher
tea         0       1
mat         1       0

Table 5: Prefix matching example 2

This shows a perfect match although the two names have different meanings. Therefore, to reduce the effect of such false positives, we use multiple string matching functions.

2.2.1.2.2. Suffix matching

Suffix matching identifies whether one string ends with another. This type of matching is especially useful in detecting words that are related in meaning (Shvaiko & Euzenat 2005) though they are not exactly the same. For example, 'saw' can be matched with handsaw, hacksaw and jigsaw; 'ball' can be matched with volleyball, baseball and football; and nut with peanut, chestnut and walnut.

String matching calculations are done the same way as in prefix matching. Consider the example below:


            Element B
Element A   human   resource   fund
human       1       0          0
resource    0       1          0
type        0       0          0

Table 6: Suffix matching example

Similarity S between Human-Resource-Type and Human-Resource-fund
= 2 / ((3 + 3) / 2)
= 0.67


Nevertheless, suffix matching does not always guarantee accurate results for every matching operation. It is not that effective in matching some words (e.g. car and Madagascar, rent and current). Therefore, we use additional string similarity matching functions, as described in the next sections, to reduce such effects.

2.2.1.3. String similarity matching

2.2.1.3.1. Edit distance (Levenshtein distance)

Similar to (Do, Hai & Rahm 2002; Chua, CEH, Chiang, RHL & Lim, E-P 2003; Cohen, Ravikumar & Fienberg 2003; Giunchiglia & Yatskevich 2004), we use a form of edit distance, the Levenshtein distance, to calculate the similarity between name strings. This is for determining how similar the characters in the two strings are.

In Levenshtein distance, the number of operations (character substitutions, insertions and deletions) required to transform one string into another is calculated, assigning a value of 1 to each operation performed. This value indicates the Levenshtein distance, k, a measure of error or dissimilarity (Navarro 2001) between the two strings; the shorter the distance, the higher the similarity (Navarro 2001; Giunchiglia & Yatskevich 2004).

As we need to compute the similarity, not the dissimilarity, the similarity S is calculated by representing S as a value between 0 and 1, as in (Chua, CEH, Chiang, RHL & Lim, E-P 2003; Giunchiglia & Yatskevich 2004), and excluding the error ratio from it. This is done by the following equation (Navarro 2001; Chua, CEH, Chiang, RHL & Lim, E-P 2003; Giunchiglia & Yatskevich 2004):

Let A and B be the two strings being compared, and

S = 1 - k / max(length(A), length(B))

where S is the similarity value between the two strings and k is the Levenshtein distance. Hence, for an identical match, the edit distance k = 0 and the similarity score S will be 1.

For example, the edit distance between infoSc and informationScience is 12, so

S = 1 - 12/18 = 0.333
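
As a sketch, the Levenshtein computation and its normalization look as follows (the standard dynamic programming recurrence, written in Python for illustration):

    # A minimal sketch of the edit-distance matcher with the
    # normalization S = 1 - k / max(|A|, |B|).
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def edit_similarity(a, b):
        return 1 - levenshtein(a, b) / max(len(a), len(b))

    print(round(edit_similarity("infoSc", "informationScience"), 3))  # 0.333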


2.2.1.3.2. N-gram

With Levenshtein distance, the reflected similarity value might not be very accurate for some types of strings. Therefore, in the next step of string similarity matching, we use n-gram matching, as in (Do, Hai & Rahm 2002; Giunchiglia, Shvaiko & Yatskevich 2004). In this technique, the number of common n-grams, n, between the two strings is counted, and a similarity score, S, is given by the following equation:

Let A and B be the two strings being compared, and

S = n / max(ngrams(A), ngrams(B))

For comparing some forms of strings, n-gram performs better than edit distance. For example, consider the two strings 'Contest' and 'Context'.

ngrams(Contest) → Con, ont, nte, tes, est = 5
ngrams(Context) → Con, ont, nte, tex, ext = 5

S = 3/5 = 0.6

With n-gram, the similarity score is 0.6; but if edit distance is computed on these two strings, it gives a similarity score of 0.86.
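
A trigram (n = 3) version of this matcher can be sketched as follows (Python, illustrative names; the one-match-per-n-gram constraint mirrors the matchFoundList used in Algorithm 1):

    # A minimal sketch of the n-gram matcher, implementing
    # S = n_common / max(ngrams(A), ngrams(B)).
    def ngrams(s, n=3):
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def ngram_similarity(a, b, n=3):
        grams_a, grams_b = ngrams(a, n), ngrams(b, n)
        total_b = len(grams_b)
        common = 0
        for g in grams_a:
            if g in grams_b:
                grams_b.remove(g)       # each n-gram may be matched only once
                common += 1
        return common / max(len(grams_a), total_b)

    print(ngram_similarity("Contest", "Context"))  # 0.6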




2.2.1.4. Identifying structural similarities

Schema elements that correspond to the same entities in the real world are likely to have similar structural properties (Li & Clifton 2000); therefore, structural properties are believed to carry some evidence for discovering the conceptual meanings embedded within schemas. Similar to (Li & Clifton 2000; Huimin & Sudha 2004), the structural data (metadata) utilized in this framework are given in Table 7.

#    Metadata       Details
1    Data type      Method in section 2.2.1.4.1
2    Field length   1 if same length
3    Range          1 if same range
4    Primary key    1 if both primary key
5    Foreign key    1 if both foreign key
6    Unique         1 if both unique key
7    Nullable       1 if both null allowed / not allowed
8    Default        1 if both same default
9    Precision      1 if both same precision
10   Scale          1 if both same scale

Table 7: Structural metadata for structural comparison


In this stage, the elements in schema S_S are compared with those in schema S_T for the structural similarities listed in Table 7. The match function, matchStructure, checks for structural similarities in the sequence specified in Table 7 and gives the score S in an m x n matrix, as in the example below (Table 8).

Attributes   Metadata       n1     n2
m1           Data type      0.65   0.8
             Field length   0      1
             Range          1      0
             Primary key    0      0
             Foreign key    0      0
             Unique         1      0
             Nullable       1      0
             Default        0      1
             Precision      1      1
             Scale          0      1
m2           Data type      0.85   0.7
             Field length   1      0
             Range          1      0
…..

Table 8: Structural matching example



For each type of metadata except data type, a similarity value of 1 is given if that property is common to both fields, and 0 if they do not match.

2.2.1.4.1. Data type constraints

For comparing similarities between data types, we construct a data type synonyms table and use that table for doing the data type comparisons, as done in other similar research (Li & Clifton 2000; Do, Hai & Rahm 2002; Thang & Nam 2008; Karasneh et al. 2009; Tekli, Chbeir & Yetongnon 2009).

Since there are variations in data types across different database systems (Li & Clifton 2000), the realization of similarities in data types is not very straightforward. Therefore, based on (Oracle 2008) and (Microsoft 2008), we first construct a Vendor Specific Data Types Table (VSDT) that consists of data type mappings for Oracle and SQL Server. This table is in APPENDIX 1.

Based on the VSDT table, data types that have a high level of similarity can be detected. For example, Oracle has a data type called Number, but SQL Server does not; from the VSDT table, it can be derived that Number in Oracle is equivalent to Float in SQL Server. Therefore, the two data types can be mapped as a match with maximum similarity. Similarly, SQL Server has a DATETIME data type but Oracle does not; it has a Date type instead. In such a case, these two data types are mapped and given a maximum data type similarity score of 1.

Often, cases will arise where two data types are not the same but possess some level of similarity. For example, integer and float are not the same, but they do have some similarity as both types are numbers (Li & Clifton 2000); likewise, char and varchar are similar cases. In view of this situation, in order to give consideration to data types that have some level of similarity, we further categorise all the data types into a more generic data type classification similar to (Li & Clifton 2000). We make this classification based on (Li & Clifton 2000; Oracle 2003; Microsoft 2007) and give a fixed similarity value of 0.5 for such cases, and a score of 0 if there is no match. We call this table the Generic Data Type Table (GDTT). The GDTT is in APPENDIX 2.

As an example, consider Table 9 below. In this example, attribute 'quantity' in SchemaA has the data type integer, and attribute 'amount' in SchemaB has the data type float. In this case, as integer maps to float in the Generic Data Type Table (GDTT), the pair gets a similarity value of 0.5.


                       Schema A   Schema B
Attribute              quantity   amount
Data type              integer    float
Data type similarity   0.5

Table 9: Data type matching example
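
As an illustration, the two-stage lookup can be sketched as follows in Python (the VSDT/GDTT contents shown are small illustrative fragments of our own, not the full appendix tables):

    # A minimal sketch of the data type comparison: exact or vendor-equivalent
    # types score 1, types related through a generic class score 0.5, else 0.
    VSDT = {("NUMBER", "FLOAT"), ("DATE", "DATETIME")}       # vendor equivalences
    GDTT = {"numeric": {"INTEGER", "FLOAT", "NUMBER", "SMALLMONEY"},
            "character": {"CHAR", "VARCHAR", "VARCHAR2"}}    # generic classes

    def datatype_similarity(t1, t2):
        t1, t2 = t1.upper(), t2.upper()
        if t1 == t2 or (t1, t2) in VSDT or (t2, t1) in VSDT:
            return 1.0
        for type_class in GDTT.values():
            if t1 in type_class and t2 in type_class:
                return 0.5
        return 0.0

    print(datatype_similarity("integer", "float"))  # 0.5, as in Table 9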



2.2.1.4.2. Calculating final structural similarity score

As described in sections 2.2.1.4 and 2.2.1.4.1, we obtain the property-comparison values for all the attributes and calculate their average to get a final structural similarity score for every attribute pair. That is, we add the property comparison values and divide by 10, as we are utilizing 10 different metadata properties. The formula is given as follows:

S = k / N

where S is the structural similarity score, k is the total property comparison score, and N is the total number of properties considered.
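
To make the scoring concrete, here is a minimal Python sketch of S = k / N over the ten properties of Table 7. It assumes the datatype_similarity helper sketched in section 2.2.1.4.1 and represents an attribute as a dict of its properties; the names are illustrative:

    # A minimal sketch of the final structural score S = k / N: the data type
    # contributes a graded value (0, 0.5 or 1) and each of the other nine
    # properties contributes 1 when it matches.
    PROPERTIES = ["length", "range", "primary_key", "foreign_key", "unique",
                  "nullable", "default", "precision", "scale"]

    def structural_similarity(attr_a, attr_b):
        k = datatype_similarity(attr_a["type"], attr_b["type"])
        k += sum(attr_a[p] == attr_b[p] for p in PROPERTIES)
        return k / (1 + len(PROPERTIES))    # N = 10 properties in total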


2.2.2. Matching at instance level

Instance values of the databases can also be used as an additional source for discovering relationships between databases (Huimin & Sudha 2004). It is possible that the same data might be represented differently in different databases (Huimin & Sudha 2004). For example, 'morning' might be represented as 'am' in one database, while another could represent it as 'm' or '1'. Although this issue exists in databases, information from the analysis of actual data values is often complementary to schema level matching (Do, Hai & Rahm 2002; Chua, C, Chiang, R & Lim, E 2003) and can be valuable especially in circumstances where the available schema information is limited (Huimin & Sudha 2004).

Similar to (Li & Clifton 2000; Huimin & Sudha 2004), we use several statistical features of the instance values for assessing similarities in database fields. The features we utilize in this framework are given in Table 10.



#   Properties
1   Mean length of the values
2   Standard deviation of the length of the values
3   Mean ratio of the number of numeric characters
4   Mean ratio of the number of non-alphanumeric characters
5   Mean ratio of the number of distinct values to total tuples in the table
6   Mean ratio of blank values to total tuples in the table

Table 10: Properties utilized for instance matching

To show how the statistical analysis is performed on a table of instances, Table 11 shows some sample data from a book store database, and Table 12 shows the details of how the calculations are performed.


ISBN         Author             Title                                                 Ref_no
0062039741   Justin Bieber      First Step 2 Forever                                  XA345
0061997811   Brian Sibley       Harry Potter Film Wizardry                            NAHKW1
1423113381   Rick Riordan       The Red Pyramid
1423101472   Mary-Jane Knight   Percy Jackson and the Olympians: The Ultimate Guide
1617804061   Justin Bieber      My World: Easy Piano                                  F9876001

Table 11: Sample dataset for Schema1




1 Mean(Length)
  ISBN:   (10+10+10+10+10)/5 = 10; normalised: 0
  Author: (13+12+12+16+13)/5 = 13.2; normalised: (13.2-12)/(16-12) = 0.3
  Title:  (20+26+15+51+20)/5 = 26.4; normalised: (26.4-15)/(51-15) = 0.317
  Ref_no: (5+6+0+0+9)/5 = 4; normalised: (4-0)/(9-0) = 0.444

2 StdDev(Length)
  ISBN:   0
  Author: Var = [(0.25-0.3)^2 + (0-0.3)^2 + (0-0.3)^2 + (1-0.3)^2 + (0.25-0.3)^2] / (5-1)
          = 0.675/4 = 0.16875; StdDev = 0.411
  Title:  0.397
  Ref_no: 0.437

3 Mean(Numeric)
  ISBN:   (10/10 + 10/10 + 10/10 + 10/10 + 10/10)/5 = 1
  Author: (0+0+0+0+0)/5 = 0
  Title:  ((1/20)+0+0+0+0)/5 = 0.01
  Ref_no: (3/5 + 1/6 + 0 + 0 + 8/9)/5 = 1.6556/5 = 0.331

4 Mean(Non-alphanumeric)
  ISBN:   0
  Author: (0+0+0+(1/16)+0)/5 = 0.0625/5 = 0.013
  Title:  (0+0+0+(1/51)+(1/20))/5 = 0.0696/5 = 0.014
  Ref_no: 0

5 Mean(Distinct)
  ISBN:   5/5 = 1
  Author: 4/5 = 0.8
  Title:  5/5 = 1
  Ref_no: 4/5 = 0.8

6 Mean(Blanks)
  ISBN:   0
  Author: (1/13 + 1/12 + 1/12 + 1/16 + 1/13)/5 = 0.383/5 = 0.077
  Title:  (3/20 + 3/26 + 2/15 + 5/51 + 3/20)/5 = 0.647/5 = 0.129
  Ref_no: 0

Table 12: Examples showing how the statistical calculations are performed




Although each of these properties contributes to the semantics at a different level, establishing the degree of relevance is not a straightforward task (Huimin & Sudha 2004), as the original dimensions are in varied units of measurement. Therefore, we normalize the values if they do not fall within the range [0, 1] (Li & Clifton 2000; Huimin & Sudha 2004). Of the six measurements we are considering for this framework, only the first two need to be normalized in this manner; the other measurements need not be normalized because, being ratios, they always fall in this range.
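
As an illustration of how these six measures can be computed for one column of instance values, consider the following Python sketch (column_statistics is our illustrative name; the min-max normalization of the first two measures, as applied in Table 12, is left to the caller):

    from statistics import mean, stdev

    # A minimal sketch of the six per-column statistics of Table 10,
    # computed over a list of string instance values.
    def column_statistics(values):
        n = len(values)
        lengths = [len(v) for v in values]
        return [
            mean(lengths),                                     # 1: mean length
            stdev(lengths) if n > 1 else 0.0,                  # 2: std dev of length
            mean(sum(c.isdigit() for c in v) / len(v) if v else 0.0
                 for v in values),                             # 3: numeric ratio
            mean(sum(not c.isalnum() and not c.isspace() for c in v) / len(v)
                 if v else 0.0 for v in values),               # 4: non-alphanumeric ratio
            len(set(values)) / n,                              # 5: distinct ratio
            sum(1 for v in values if not v.strip()) / n,       # 6: blank values ratio
        ]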

2.2.2.1. Computing the instance similarity

Similar to the works in (Chen 1995; Jain et al. 2002; Yeung & Tsang 2002; Kaur 2010), we calculate the similarity between two fields based on the average Manhattan distance, using this formula:

S = 1 - (Σ |x_i - y_i|) / n

where x_i represents the distance measure of a field property in Schema1, y_i represents the distance measure of the corresponding field property in Schema2, and n represents the number of statistical properties, or dimensions, considered. In this framework, as we are utilizing 6 properties, n = 6 by default.


After obtaining the matrices of statistical values for each of the fields in both schemas, we compute the instance similarities between them. For example, TableA in Schema1 has statistical values (from Table 12) as:

ISBN     {0, 0, 1, 0, 1, 0}
Author   {0.3, 0.411, 0, 0.013, 0.8, 0.077}
Title    {0.317, 0.397, 0.01, 0.014, 1, 0.129}
Ref_no   {0.444, 0.437, 0.331, 0, 0.8, 0}

Suppose that TableB in Schema2 has similar attributes with statistical values:

ISBN      {0, 0, 1, 0, 1, 0}
Author    {0.4, 0.45, 0.1, 0.023, 0.7, 0.06}
Title     {0.3, 0.29, 0.04, 0.02, 1, 0.14}
Code_no   {0.439, 0.433, 0.337, 0, 0.6, 0}

We calculate the similarity from the formula:


S(ISBN) = 1 - (|0-0| + |0-0| + |1-1| + |0-0| + |1-1| + |0-0|) / 6
        = 1

S(Author) = 1 - (|0.3-0.4| + |0.411-0.45| + |0-0.1| + |0.013-0.023| + |0.8-0.7| + |0.077-0.06|) / 6
          = 1 - 0.366/6
          = 1 - 0.061
          = 0.939

From the above two calculations, it can be seen that TableA.ISBN and TableB.ISBN have a similarity value of 1, indicating the highest possible similarity. TableA.Author and TableB.Author show a similarity of 0.939.
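
As a sketch, the computation over the two statistical vectors is direct (Python; this reproduces the Author calculation above):

    # A minimal sketch of the instance similarity S = 1 - (1/n) * Σ|x_i - y_i|
    # between two normalized statistical vectors.
    def instance_similarity(x, y):
        return 1 - sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)

    author_a = [0.3, 0.411, 0, 0.013, 0.8, 0.077]
    author_b = [0.4, 0.45, 0.1, 0.023, 0.7, 0.06]
    print(round(instance_similarity(author_a, author_b), 3))  # 0.939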















2.3. Schema matching algorithms


2.3.1. Algorithm 1: Find name similarity of schema element

Input:
  Set of attribute names in schema S_S: X = {U_a.A_1, U_a.A_2, … U_a.A_i, … U_a.A_n}
    U_a is a set of tables in schema S_S; A_i is an attribute in table U_a
  Set of attribute names in schema S_T: Y = {V_b.B_1, V_b.B_2, … V_b.B_n}
    V_b is a set of tables in schema S_T; B_i is an attribute in table V_b

Output:
  Table Similarity Matrix: a set of table similarity pairs in the form
  S_S.U_a : S_T.V_b -> S_TB
  where S_S is the source schema, U_a is a source table, S_T is the target schema,
  V_b is a target table, and S_TB is the similarity between the two tables.

begin
  call matchPrefix(X, Y)        // X and Y are the sets of attributes from schemas S_S and S_T respectively
  call matchSuffix(X, Y)
  call matchNgram(X, Y, ngram)  // ngram is the n-gram size passed to the matcher
  call getTableMatchResult()
end

function matchPrefix(X, Y)
{
  foreach A_i in X do           // compare every attribute in schema S_S with every attribute in S_T
  {
    foreach B_j in Y do
    {
      if A_i = B_j then S = 1
      else                      // if the two attribute names are not the same string, tokenize them
      {
        P = call tokenize(A_i)  // P and Q are the sets of tokens in strings A_i and B_j respectively
        Q = call tokenize(B_j)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach P_m in P do     // compare every token in A_i with every token in B_j
        {
          foreach Q_n in Q do
          {
            if Q_n not in matchFoundList then  // Q_n does not already have a match with a token in P
            {
              // if the longer token starts with the shorter one, a token match is found
              if max[length(P_m), length(Q_n)] starts with min[length(P_m), length(Q_n)] then
              {
                k = k + 1       // the match score is incremented by one for each token match
                add Q_n to matchFoundList
              }
            }
          }
        }
        S(A_i, B_j) = k / ((|P| + |Q|) / 2)    // S is the prefix similarity between A_i and B_j
      }
      // UpdateAMSM updates the Attribute Matcher Similarity Matrix: matchPrefix is the
      // matcher name; S_S.U_a.A_i and S_T.V_b.B_j are the attributes being compared;
      // S is the similarity between the two attributes
      call UpdateAMSM(matchPrefix, S_S.U_a.A_i, S_T.V_b.B_j, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}

// function to tokenize a string X
function tokenize(X)
{
begin
  initialize C                  // C is the set of tokens in X
  d = {-, _}                    // d is the set of accepted delimiters
  foreach d_i in d do
    add tokens of split(X, d_i) to C
  return C
end
}








function matchSuffix(X, Y)
{
  foreach A_i in X do           // compare every attribute in schema S_S with every attribute in S_T
  {
    foreach B_j in Y do
    {
      if A_i = B_j then S = 1
      else                      // if the two attribute names are not the same string, tokenize them
      {
        P = call tokenize(A_i)  // P and Q are the sets of tokens in strings A_i and B_j respectively
        Q = call tokenize(B_j)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach P_m in P do     // compare every token in A_i with every token in B_j
        {
          foreach Q_n in Q do
          {
            if Q_n not in matchFoundList then  // Q_n does not already have a match with a token in P
            {
              // if the longer token ends with the shorter one, a token match is found
              if max[length(P_m), length(Q_n)] ends with min[length(P_m), length(Q_n)] then
              {
                k = k + 1       // the match score is incremented by one for each token match
                add Q_n to matchFoundList
              }
            }
          }
        }
        S(A_i, B_j) = k / ((|P| + |Q|) / 2)    // S is the suffix similarity between A_i and B_j
      }
      call UpdateAMSM(matchSuffix, S_S.U_a.A_i, S_T.V_b.B_j, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}





function matchNgram(X, Y, ngram)
{
  foreach A_i in X do           // compare every attribute in schema S_S with every attribute in S_T
  {
    foreach B_j in Y do
    {
      if A_i = B_j then S = 1
      else                      // if the two attribute names are not the same string
      {
        P = call getNgrams(A_i, ngram)   // P and Q are the sets of n-grams in A_i and B_j respectively
        Q = call getNgrams(B_j, ngram)
        k = 0                   // initialize the match counter
        initialize new matchFoundList    // n-grams that have already found a matching n-gram
        foreach P_m in P do     // compare every n-gram in A_i with every n-gram in B_j
        {
          foreach Q_n in Q do
          {
            // if Q_n does not already have a match with another n-gram in P
            if Q_n not in matchFoundList then
            {
              if P_m = Q_n then
              {
                add Q_n to matchFoundList
                k = k + 1       // the match score is incremented by one for each n-gram match
              }
            }
          }
        }
        S(A_i, B_j) = k / max(|P|, |Q|)  // S is the n-gram similarity between A_i and B_j
      }
      call UpdateAMSM(matchNgram, S_S.U_a.A_i, S_T.V_b.B_j, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}

// returns the set of n-grams for a string A
function getNgrams(A, g)        // A is the element name string; g is the n-gram size
{
begin
  n = length(A)
  r = n - g                     // r is the last n-gram start position
  i = 0
  while i <= r do
  {                             // collect n-grams until all possible n-grams are obtained
    ngram_i = getSubstring(A, i, g)
    add ngram_i to C            // C is the set of n-grams obtained
    i = i + 1
  }
  return C
end
}








// Updates the Attribute Matcher Similarity Matrix (AMSM)
// matcher is the matcher name; S_S.U_a.A_i and S_T.V_b.B_j are attributes of
// schemas S_S and S_T respectively; S is the similarity between the two attributes
function UpdateAMSM(matcher, S_S.U_a.A_i, S_T.V_b.B_j, S)
{
  // AttributeMatcherSimilarity is an object that holds similarity data
  initialize new AttributeMatcherSimilarity(matcher, S_S.U_a.A_i, S_T.V_b.B_j, S)
  // AMSM is the Attribute Matcher Similarity Matrix, which holds the similarity
  // information in an array list
  add to AMSM(matcher, S_S.U_a.A_i, S_T.V_b.B_j, S)
}



function getAttributeMatchResult()
{
  foreach x in AMSM do
  {
    A_i = x.getSourceAttr()     // source attribute in the format schema.table.attribute
    B_j = x.getTargetAttr()     // target attribute in the format schema.table.attribute
    S_x = x.getScore()
    // an AttributeMatchResult object holds the aggregate similarity score for an attribute pair
    initialize new AttributeMatchResult(A_i, B_j, S_x)
    // if similarity data for the attribute pair is not in ASM, add it to ASM
    if AttributeMatchResult not in ASM then
      add AttributeMatchResult to ASM
    // if a score for the attribute pair already exists, add to the existing score
    else add AttributeMatchResult.score() to ASM.score(A_i, B_j)
  }
  // calculate the average score for each pair in ASM (the Attribute Similarity Matrix)
  foreach y in ASM do
  {
    S_aggregate = ASM.score(A_i, B_j)              // total score for the pair
    S_average = S_aggregate / ASM.countMatchers    // average over the executed matchers
    // update the Attribute Similarity Matrix with the average score
    updateASMscore(S_S.U_a.A_i, S_T.V_b.B_j, S_average)
  }
  return ASM                    // return the Attribute Similarity Matrix
}


function getTableMatchResult()
{
  foreach w in ASM do
  {
    P_m = w.getSourceTable()    // source table in the format schema.table
    Q_n = w.getTargetTable()    // target table in the format schema.table
    S_r = w.getScore()
    // a TableMatchResult object holds the aggregate similarity score for a table pair
    initialize new TableMatchResult(P_m, Q_n, S_r)
    // if similarity data for the table pair is not in TSM (the Table Similarity Matrix), add it
    if TableMatchResult not in TSM then
      add TableMatchResult to TSM
    // if a score for the table pair already exists, add to the existing score
    else add TableMatchResult.score() to TSM.score(P_m, Q_n)
  }
  // calculate the average score for each pair in TSM
  foreach z in TSM do
  {
    S_aggregate = TSM.score(P_m, Q_n)                             // total score for the table pair
    S_average = S_aggregate / TSM.possibleMaximumScore(P_m, Q_n)  // ratio to the maximum possible score
    // update the Table Similarity Matrix with the average score
    updateTSMscore(S_S.U_a, S_T.V_b, S_average)
  }
  return TSM                    // return the Table Similarity Matrix
}




function getSchemaMatchResult()
{
  foreach x in TSM do
  {
    P_m = x.getSourceSchema()   // source schema
    Q_n = x.getTargetSchema()   // target schema
    S_r = x.getScore()
    // a SchemaMatchResult object holds the aggregate similarity score for a schema pair
    initialize new SchemaMatchResult(P_m, Q_n, S_r)
    // if similarity data for the schema pair is not in SSV (the Schema Similarity Value object), add it
    if SchemaMatchResult not in SSV then
      add SchemaMatchResult to SSV
    // if a score for the schema pair already exists, add to the existing score
    else add SchemaMatchResult.score() to SSV.score(P_m, Q_n)
  }
  // calculate the average score for the schema pair
  S_average = SSV.getScore() / SSV.possibleMaximumScore(P_m, Q_n)
  updateSSVscore(S_S, S_T, S_average)
  return SSV                    // return the Schema Similarity Value
}








2.3.2. Algorithm 2: Find structural similarities of schemas

Input:
  Structural properties of the attributes in the source schema S_S. Each attribute
  is passed with its name in the form Schema.Table.Attribute, followed by its
  structural properties:
    DT - data type
    FL - field length
    R  - range
    PK - primary key
    FK - foreign key
    UK - unique key
    NU - nullable
    DE - default value
    PR - precision
    SC - scale

  Set of attributes in schema S_S:
  X = {U_a.A_1, DT_1, FL_1, R_1, PK_1, FK_1, UK_1, NU_1, DE_1, PR_1, SC_1;
       U_a.A_2, DT_2, FL_2, R_2, PK_2, FK_2, UK_2, NU_2, DE_2, PR_2, SC_2;
       … U_a.A_i, DT_i, FL_i, R_i, PK_i, FK_i, UK_i, NU_i, DE_i, PR_i, SC_i;
       … U_a.A_n, DT_n, FL_n, R_n, PK_n, FK_n, UK_n, NU_n, DE_n, PR_n, SC_n}

  Set of attributes in schema S_T:
  Y = {V_b.B_1, DT_1, FL_1, R_1, PK_1, FK_1, UK_1, NU_1, DE_1, PR_1, SC_1;
       V_b.B_2, DT_2, FL_2, R_2, PK_2, FK_2, UK_2, NU_2, DE_2, PR_2, SC_2;
       … V_b.B_j, DT_j, FL_j, R_j, PK_j, FK_j, UK_j, NU_j, DE_j, PR_j, SC_j;
       … V_b.B_n, DT_n, FL_n, R_n, PK_n, FK_n, UK_n, NU_n, DE_n, PR_n, SC_n}

Output:
  Set of table similarity pairs in the form S_S.U_a : S_T.V_b -> S_TB
  // S_S is the source schema; U_a is a source table; S_T is the target schema; V_b is a target table

begin
  call matchStructure(X, Y)
  call getMatchResult()
end





// calculates the structural similarity between two schemas
function matchStructure(X, Y)
{
begin
  if S_S and S_T are from the same type of database server then
    data type ref table = same-DB table
  else data type ref table = conversion-DB table

  // compare each attribute in the source schema to each in the target schema
  foreach A_i in X do
  {
    foreach B_j in Y do
    {
      k = 0
      if DT of A_i = DT of B_j then k = k + 1    // data type
      if FL of A_i = FL of B_j then k = k + 1    // field length
      if R of A_i = R of B_j then k = k + 1      // range
      if A_i is PK and B_j is PK then k = k + 1  // primary key
      if A_i is FK and B_j is FK then k = k + 1  // foreign key
      if A_i is UK and B_j is UK then k = k + 1  // unique key
      if A_i is NU and B_j is NU then k = k + 1  // nullable
      if DE of A_i = DE of B_j then k = k + 1    // default value
      if PR of A_i = PR of B_j then k = k + 1    // precision
      if SC of A_i = SC of B_j then k = k + 1    // scale

      // maxPossibleSimilarity is the count of structural properties
      S(A_i, B_j) = k / maxPossibleSimilarity()
      // UpdateAMSM updates the Attribute Matcher Similarity Matrix;
      // structure is the matcher name; S is the similarity between the two attributes
      call UpdateAMSM(structure, S_S.U_a.A_i, S_T.V_b.B_j, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
end
}




function matchInstance(X, Y)
{
  foreach A_i in X do           // compare the values of every field in schema S_S with every field in S_T
  {
    foreach B_j in Y do
    {
      call getMeanLength(B_j)
      // ... (the remaining statistical properties of Table 10 would be gathered in the same manner)
    }
  }
}

// T is the set of instances of an attribute for which the mean length is to be calculated
function getMeanLength(T)
{
  foreach t_i in T do
  {
    add length(t_i) to totalLength
  }
  meanLength = totalLength / |T|
  return meanLength
}






2.4. Similarity computation

2.4.1. Computation at matcher level

Match operations are performed on every element of schema S_S = {S_S1, S_S2, … S_Sm} with every element of S_T = {S_T1, S_T2, … S_Tn}, and a similarity value is computed for each operation on a table by table basis, as in the example below, where the element teaMat from S_S is compared with the elements mathsTeacher and diplomat from S_T.

Match operation   mathsTeacher   diplomat
Prefix            1              0
Suffix            0              0.667
Edit Distance     0.25           0.375
n-gram            0.2            0.167
Structural        0.5            0.3
Instance          0.04           0.2
…                 …              …

Table 13: Matcher level similarity calculation example




2.4.2. Computation at attribute level

To obtain a combined matching result, the average score is computed for the pair of elements, similar to (Do, Hai & Rahm 2002; Bozovic & Vassalos 2008). For example, after the above match operations, the combined similarity values are computed as in the table below (Table 14).

Schema S_S, Table x   Schema S_T, Table y
                      mathsTeacher   diplomat
teaMat                0.3625         0.3022

Table 14: Combined similarity calculation at attribute level

Consequently, the combined final result of the matching operations is given in a similarity matrix, M, with all the elements m in S_S and the elements n in S_T in an m x n matrix, as in the example below.










                     Elements in Table y, S_T
                     n1     n2     n3
Elements in          m1
Table x, S_S         m2
                     m3
                     m4

Figure 3: Combined final similarity matrix

The highest score in each row of the matrix indicates the element in S_T that has the highest similarity to the corresponding element of S_S.
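
As a sketch, the attribute-level combination step is a plain average of the matcher-level scores recorded for one attribute pair (Python, with an illustrative name of our own); the combined values in Table 14 are produced in this way:

    # A minimal sketch of the attribute-level combination: the scores recorded
    # by the executed matchers for one attribute pair are averaged into a
    # single similarity value.
    def combine_matcher_scores(scores):
        """scores: list of matcher-level similarity values for one pair."""
        return sum(scores) / len(scores)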


2.4.3. Computing similarity at table level

After computing the similarities for the attributes in S_S and S_T, we compute the combined similarity between tables. This is done by taking the ratio of the sum of all the similarity values between table x in schema S_S and table y in schema S_T to the maximum possible similarity, similar to the computations in (Do, Hai & Rahm 2002). Table similarity is computed from the formula given below:

Table Similarity = (sum of similarities between x and y) / (combined maximum similarity of x and y)

Example 1

Schema S_S, Table x1 compared with Schema S_T, Table y1 (rows: elements of Table x1 in S_S; columns: elements of Table y1 in S_T):

      n1     n2
m1    0.25   0.3
m2    1      0.5
m3    0.7    1

Table 15: Table similarity calculation example 1

Table Similarity = (0.25 + 0.3 + 1 + 0.5 + 0.7 + 1) / 6 = 0.625
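
Under this formula, the table-level score is the sum of the cells of the attribute similarity matrix over the number of cells (the maximum possible similarity). A Python sketch reproducing Example 1:

    # A minimal sketch of the table-level score: the sum of all attribute
    # similarity values over the maximum possible similarity (1 per cell).
    def table_similarity(matrix):
        cells = [s for row in matrix for s in row]
        return sum(cells) / len(cells)

    example1 = [[0.25, 0.3], [1, 0.5], [0.7, 1]]
    print(round(table_similarity(example1), 3))  # 0.625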


Example 2

Schema S_S, Table x1 compared with Schema S_T, Table y2:

      n1    n2    n3
m1    0.4   0.5   0.8
m2    1     0.7   0.8
m3    0.9   1     1

Table 16: Table similarity calculation example 2

Table Similarity = (0.4 + 0.5 + 0.8 + 1 + 0.7 + 0.8 + 0.9 + 1 + 1) / 9 = 0.789

From the above two examples, Table x1 has a higher similarity to Table y2 than to Table y1.

The end result of matching the table similarity between two schemas is a matrix that gives similarity values between all the tables in both schemas.

           Table y1   Table y2   Table y3
Table x1   0.12       0.9        0.61
Table x2   0.6        0.74       0.58
Table x3   1          0.27       0.2
Table x4   0.5        0.91       0.1


2.4.4. Tie-breaking

As matching attributes or tables are determined based on the highest similarity score, there is a possibility of a tie if an attribute or table gets more than one matching element with the same similarity score. In such cases, it is necessary to implement a tie breaking strategy to resolve the issue.

2.4.4.1. Tie breaking in table matching

When selecting the best match for tables, we use a maximum cardinality matching strategy (Wu et al. 2004). For example, consider the set of tables in Schema1 as {m1, m2, m3} and the set of tables in Schema2 as {n1, n2, n3}, with the similarity matrix shown below. In maximum cardinality matching, we choose the matching tables in such a way that every table gets a matching table and the overall similarity is the highest. In this case, the best matching will be (m1, n3), (m2, n1), (m3, n2).

      Tables in schema S_T
      n1    n2    n3
m1    0.4   0.5   0.8
m2    1     0.7   0.8
m3    0.9   1     1
(rows: tables in schema S_S)
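
A maximum cardinality matching over such a matrix can be sketched as an exhaustive search over column permutations (Python, illustrative names and an exhaustive search of our own choosing; for larger schemas an assignment algorithm such as the Hungarian method would be the scalable alternative):

    from itertools import permutations

    # A minimal sketch of maximum cardinality matching for tie breaking:
    # choose one column per row so that the total similarity is maximized.
    # Assumes a square similarity matrix, as in the example above.
    def best_matching(matrix, rows, cols):
        best_total, best_perm = -1.0, None
        for perm in permutations(range(len(cols))):
            total = sum(matrix[i][perm[i]] for i in range(len(rows)))
            if total > best_total:
                best_total, best_perm = total, perm
        return [(rows[i], cols[best_perm[i]], matrix[i][best_perm[i]])
                for i in range(len(rows))]

    m = [[0.4, 0.5, 0.8], [1, 0.7, 0.8], [0.9, 1, 1]]
    print(best_matching(m, ["m1", "m2", "m3"], ["n1", "n2", "n3"]))
    # [('m1', 'n3', 0.8), ('m2', 'n1', 1), ('m3', 'n2', 1)]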