Matching and Reuse of XML Schemas

farmpaintlick, Internet and Web Applications

21 Oct 2013


1

Matching and Reuse of XML Schemas

2

Sample XML Schema

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="car">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="make" type="xs:string"/>
        <xs:element name="model" type="xs:string"/>
        <xs:element name="year" type="xs:string"/>
        <xs:element name="color" type="xs:string"/>
        <xs:element name="driver">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="first" type="xs:string"/>
              <xs:element name="last" type="xs:string"/>
              <xs:element name="license" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

3

What is XML schema matching


Matching: identifying the relations among the corresponding elements of two schemas
    e.g. customer/firstName <==> client/name/first
         customer/name <==> concatenate(client/name/first, client/name/last)

Calculate the distance between two schemas
    e.g. the distance between customer.xsd and client.xsd is 0.67

4

Why XML Schema matching


From the data integration point of view:
    Purpose: automatically identify corresponding elements between two schemas
    Relevant works:
        Database schema matching/mapping, e.g., A. Doan, et al. Reconciling schemas of disparate data sources: A machine-learning approach. SIGMOD, 2001.
        Generic schema mapping, e.g., J. Madhavan, P. A. Bernstein, E. Rahm. Generic schema matching with Cupid. VLDB, 2001.
        XML Schema matching, e.g., H. Do, E. Rahm. COMA: A system for flexible combination of schema matching approaches. VLDB, 2002.

From the web service composition point of view:
    e.g., matching the output type of one service with the input of another in sequential composition

From the software reuse point of view:
    Purpose: build XML Schema categories and search engines
    Relevant works:
        Software component search: A. Mili, R. Mili, R. T. Mittermeir. A survey of software reuse libraries. Annals of Software Engineering, 1998.
        Agent and service matching: Katia Sycara, Jianguo Lu, Matthias Klusch. Interoperability among Heterogeneous Software Agents on the Internet. Technical Report CMU-RI-TR-98-22, CMU.

5

What are the problems


Modelling
    As a graph
    As tree matching

Node similarity
    Name, type, cardinality

Structure similarity
    Tree edit distance
        K. Zhang, D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 1989.


6

Overview of our system


[Figure: system overview. Box labels: XML Schema (two inputs), Modelling, Name Similarity, Node Similarity, Structural Similarity, Name Relations, Node Relations, Structural Relations, Results, retrieval.]

7

Three similarities

Name similarity (node name): WordNet, string matching, Hungarian method

Node similarity: compatibility tables for user-defined data type, built-in data type, and cardinality

Structural similarity (hierarchical structure): tree matching algorithm

8

Modelling


<xs:element name="driver" type="driverType"/>

<xs:attribute name="license" type="xs:string"/>

Model schemas as trees
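As a rough illustration of modelling a schema as a tree, the sketch below parses an XSD with Python's standard `xml.etree.ElementTree` and keeps only element and attribute declarations as nodes. The `Node` class, the `build` and `schema_to_trees` helpers, and the shortened sample schema are assumptions made for this example. The sketch does not resolve named types such as `driverType`, references, imports, or recursion.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

XS = "{http://www.w3.org/2001/XMLSchema}"

@dataclass
class Node:
    """Hypothetical tree node: one element or attribute declaration."""
    name: str
    type: str = ""
    children: list = field(default_factory=list)

def build(decl):
    """Turn one xs:element / xs:attribute declaration into a tree node."""
    node = Node(decl.get("name", ""), decl.get("type", ""))
    pending = list(decl)
    while pending:
        item = pending.pop(0)
        if item.tag in (XS + "element", XS + "attribute"):
            node.children.append(build(item))
        else:
            # Skip wrappers such as complexType/sequence, keeping document order
            pending = list(item) + pending
    return node

def schema_to_trees(xsd_text):
    """Return one tree per top-level element declaration."""
    root = ET.fromstring(xsd_text)
    return [build(e) for e in root.findall(XS + "element")]

# Shortened, illustrative variant of the sample schema
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="car">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="make" type="xs:string"/>
        <xs:element name="model" type="xs:string"/>
        <xs:element name="driver">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="first" type="xs:string"/>
              <xs:element name="last" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="license" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

trees = schema_to_trees(SAMPLE)
```

Here both the nested `driver` element and the `license` attribute end up as ordinary tree nodes, which is one plausible reading of the slide.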

9

Modelling


[Figure: example schema trees for three modelling cases.
    Reference: customerOrder tree with nodes shipping, billing, address, date, ship2Add, bill2Add, street, province, postcode.
    Importing and inclusion: customerOrder tree with address subtrees from Address_ca.xsd (street, province, postcode) and Address_us.xsd (street, state, zip).
    Recursion: paper tree with nodes author, title, contents, reference, refNo, paper.]

10

Information excluded in Modelling



Related to elements or attributes
    Default value, value range, unique, nullable…

Related to structure
    Sequence
    All
    Choice

[Figure: two trees, name with children (first, last) and name with children (last, first).]

11

Computing node similarity


Compute name similarity with the help of:
    WordNet and its API
    String matching
    Hungarian method

Add the similarity of other information:
    Data type
    Minimum cardinality
    Maximum cardinality


Node similarity
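One way to read "add the similarity of other information" is a weighted sum of name, data type, and cardinality similarities. The sketch below is only an assumption about how such a combination could look. The weights, the entries of the type compatibility table, and the `SchemaNode` class are hypothetical and are not taken from the slides. The name score uses `difflib` as a stand-in for the WordNet-based measure.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class SchemaNode:
    """Hypothetical node record: name, data type, and cardinality."""
    name: str
    type: str
    min_occurs: int = 1
    max_occurs: float = 1  # float("inf") can stand for "unbounded"

# Illustrative compatibility table for built-in types (values are made up)
TYPE_COMPAT = {
    frozenset(["xs:int", "xs:integer"]): 0.9,
    frozenset(["xs:integer", "xs:decimal"]): 0.7,
    frozenset(["xs:string", "xs:token"]): 0.8,
}

# Illustrative weights; they sum to 1 so identical nodes score 1.0
WEIGHTS = {"name": 0.5, "type": 0.3, "min": 0.1, "max": 0.1}

def type_similarity(t1, t2):
    """Look up type compatibility; unknown pairs count as incompatible."""
    if t1 == t2:
        return 1.0
    return TYPE_COMPAT.get(frozenset([t1, t2]), 0.0)

def name_similarity(n1, n2):
    """Plain string similarity, standing in for a WordNet-based score."""
    return SequenceMatcher(None, n1.lower(), n2.lower()).ratio()

def node_similarity(a, b):
    """Weighted sum of name, type, and cardinality similarities."""
    parts = {
        "name": name_similarity(a.name, b.name),
        "type": type_similarity(a.type, b.type),
        "min": 1.0 if a.min_occurs == b.min_occurs else 0.0,
        "max": 1.0 if a.max_occurs == b.max_occurs else 0.0,
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)
```

With these made-up numbers, two `year` nodes that differ only in `xs:int` versus `xs:integer` and in minimum cardinality would score about 0.87.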

12

Name similarity from token lists


Tokenize names
    e.g. clientName -> client name
         submittedReports -> submit report

Similarity between two token lists
    Using the Hungarian method for Weighted Bipartite Graph Matching (WBGM)

[Figure: bipartite graph for customerDeliveryAddress vs. clientRequiredShippingAddress, with tokens customer, delivery, address on one side and client, require, shipping, address on the other; edge (i, j) carries the weight sim_i,j, starting from sim_0,0.]

Node similarity
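The token-list comparison can be sketched as follows. To stay within the standard library, this sketch replaces WordNet with `difflib` string similarity and replaces the Hungarian method with an exhaustive search over assignments, which finds the same optimal matching but is only practical for short token lists. Normalising by the length of the longer list is an assumption, and stemming (submitted -> submit) is not performed.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations

def tokenize(name):
    """Split a camelCase identifier into lower-case tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def token_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a WordNet-based score."""
    return SequenceMatcher(None, a, b).ratio()

def name_similarity(name1, name2):
    """Best one-to-one token matching, averaged over the longer token list."""
    short, long_ = sorted((tokenize(name1), tokenize(name2)), key=len)
    if not short:
        return 0.0
    # Exhaustive search over assignments (same optimum as the Hungarian method)
    best = max(
        sum(token_similarity(s, long_[j]) for s, j in zip(short, perm))
        for perm in permutations(range(len(long_)), len(short))
    )
    return best / len(long_)

score = name_similarity("customerDeliveryAddress", "clientRequiredShippingAddress")
```

For longer names a real Hungarian-method implementation would be the natural replacement for the `permutations` loop.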

13

Determine the structural relation

Tree 1

Tree 2

Structure similarity

14

Common substructure

[Figure: two schema trees. One has nodes car, make, model, year, color, driver, firstName, lastName, license; the other has nodes car, make, model, year, color, driver, first, last, license.]

15

Approximate Common Structure

[Figure: the two schema trees again. One has nodes car, make, model, year, color, driver, firstName, lastName, license; the other has nodes car, make, model, year, color, driver, first, last, license.]

16

Mappings in an ACS

[Figure: tree with nodes car, make, model, year, color, driver, first (firstName), last (lastName), license, with regions labelled ACS1 and ACS2.]

m_ACS1 = {(s1.car, s2.car),
          (s1.make, s2.make),
          (s1.year, s2.year),
          (s1.color, s2.color)}

m_ACS2 = {(s1.driver, s2.driver),
          (s1.first, s2.firstName),
          (s1.last, s2.lastName),
          (s1.license, s2.license)}

17

Evaluation


Criteria
    Matching outcomes
        Mappings
        Schema similarity
    Execution time

Collected four groups of schemas
    Purchase orders used in COMA (5)
    Large schemas from XML.org (86)
    Schemas in the hospitality domain (95)
    Schemas extracted from WSDL (419)

18

Comparison with edit distance algorithm: element mapping on data group 1

[Figure: results for Method 1 (our algorithm) and Method 2 (edit distance).]

19

Comparison with edit distance: schema similarity on data groups 3 and 4

[Figure: results for Method 1 (our algorithm) and Method 2 (edit distance).]

20

Comparison with edit distance: performance on data group 2

[Figure: results for Method 1 (our algorithm) and Method 2 (edit distance).]

21

Comparison with COMA (Mapping)



             COMA 'All'    COMA 'All+SchemaM'    Our algorithm
Precision    about 0.95    about 0.93            0.88
Recall       about 0.78    about 0.89            0.87
Overall      0.73          0.82                  0.75

Overall is a measure that combines precision and recall. It reflects the effort of removing incorrect mappings and adding missing ones.
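The slide does not spell out the formula. The sketch below assumes the definition commonly used in the COMA evaluation, Overall = Recall × (2 − 1/Precision), which agrees with the rounded values in the table. The function name `overall` is chosen for this example.

```python
def overall(precision, recall):
    """Overall = recall * (2 - 1/precision), assumed from the COMA evaluation.

    The value drops below zero when precision is under 0.5, meaning that
    fixing the result would take more effort than matching by hand.
    """
    if precision <= 0:
        raise ValueError("precision must be positive")
    return recall * (2 - 1 / precision)

ours = overall(0.88, 0.87)
```

Rounded to two decimals, `ours` reproduces the 0.75 reported for our algorithm.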


Evaluation

22

Conclusion



Scalable schema matching
    Wang Lian, David W. Cheung, Nikos Mamoulis, and Siu-Ming Yiu. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. TKDE, 2005.

Subtyping

Apply to web service matching

23

Web service synthesis

24

Web Service Composition


Composite web service: “service implemented by combining the functionality provided by other web services” (G. Alonso et al.)

Web service composition: the process of developing a composite web service

Approaches to web service composition:
    Conventional programming languages, such as Java, C#;
    Web service composition languages, such as BPEL;
    Workflow, pi-calculus, Petri nets, automata…
    Web service synthesis.


composition

25

Web Service Synthesis


BPEL and the like are still programming languages
    They describe exactly how to compose the web services.

Web service synthesis
    We describe what the service is, but not how to implement it;
    We do not even know which component services are involved;
    The relevant services are discovered and invoked dynamically;
    The implementation is synthesized from the web service specification, automatically.

Program synthesis has a long history.

26

Web Service Synthesis

[Figure: web service WS with a syntactic specification (WSDL), a semantic specification (Datalog), and a service implementation; component services WS1 and WS2, each with a service specification (WSDL/Datalog); service implementation of WS (BPEL).]

27

Synthesis Example

Chapters service
    Service specification
        Syntactic specification: …
        Semantic specification:
            chapters(ISBN, PRICE, TITLE, AUTHOR) <- Chapters(ISBN, PRICE), Book1(TITLE, ISBN, AUTHOR).

amazon service
    Service specification
        Syntactic specification: WSDL file
        Semantic specification:
            amazon(ISBN, PRICE, RATE, TITLE, AUTHOR) <- Amazon(ISBN, PRICE), Book1(TITLE, ISBN, AUTHOR), Book2(ISBN, COMMENT, RATE).
    Service implementation: Java code, database

MetaSearchService
    Service specification
        Syntactic: interface definition defined by WSDL
        Semantic:
            Q(ISBN, PRICE, TITLE, RATE) <- Chapters(ISBN, PRICE), Book1(TITLE, ISBN, AUTHOR), Book2(ISBN, COMMENT, RATE).
    MetaSearchService implementation: ??

28

Generate the abstract implementation by query rewriting

Service specifications of Chapters, amazon, and MetaSearchService: as in the Synthesis Example.

MetaSearchService abstract implementation:
    Q(ISBN, PRICE, TITLE, RATE) <- amazon(ISBN, PRICE, RATE, TITLE', AUTHOR'), chapters(ISBN, PRICE0, TITLE, AUTHOR).

29

Generate the Concrete Implementation

Service specifications of Chapters and amazon: as in the Synthesis Example.

MetaSearchService semantic specification:
    Q(ISBN, PRICE, PRICE0, TITLE, RATE) <-

MetaSearchService abstract implementation:
    Q(ISBN, PRICE, PRICE0, TITLE, RATE) <- amazon(ISBN, PRICE, RATE, TITLE', AUTHOR'), chapters(ISBN, PRICE0, TITLE, AUTHOR).

MetaSearchService concrete implementation:
    Invoke amazon;
    Invoke chapters;
    Combine the output;
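The three steps of the concrete implementation amount to a join on ISBN. The sketch below is a minimal illustration in Python rather than BPEL. `invoke_amazon` and `invoke_chapters` are hypothetical stand-ins for the real service calls, the rows they return are made up, and the code assumes each ISBN appears at most once in the chapters result.

```python
def invoke_amazon():
    """Hypothetical stand-in for calling the amazon service (made-up rows)."""
    return [
        {"ISBN": "111", "PRICE": 30.0, "RATE": 4.5, "TITLE": "Book A", "AUTHOR": "X"},
        {"ISBN": "222", "PRICE": 12.0, "RATE": 3.0, "TITLE": "Book B", "AUTHOR": "Y"},
    ]

def invoke_chapters():
    """Hypothetical stand-in for calling the chapters service (made-up rows)."""
    return [
        {"ISBN": "111", "PRICE": 28.0, "TITLE": "Book A", "AUTHOR": "X"},
        {"ISBN": "333", "PRICE": 9.0, "TITLE": "Book C", "AUTHOR": "Z"},
    ]

def meta_search():
    """Q(ISBN, PRICE, PRICE0, TITLE, RATE) <- amazon(...), chapters(...)."""
    # Index chapters rows by the shared variable ISBN
    chapters_by_isbn = {row["ISBN"]: row for row in invoke_chapters()}
    results = []
    for a in invoke_amazon():
        c = chapters_by_isbn.get(a["ISBN"])
        if c is None:
            continue  # no matching chapters row, so the join drops this tuple
        # PRICE and RATE come from amazon; PRICE0 and TITLE from chapters
        results.append((a["ISBN"], a["PRICE"], c["PRICE"], c["TITLE"], a["RATE"]))
    return results
```

Only the ISBN present in both outputs survives, carrying both prices side by side as in the head of Q.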

composition

30

It is a lightweight approach…


Web services are restricted to database queries, or functions that can be described by database queries or Datalog;

Semantic specification is Datalog instead of a more powerful specification mechanism employing ontology;

Compositions are restricted to data composition instead of full-blown process specification such as BPEL.

All those choices are meant for the construction of a practical web service synthesis system…

31

Mapping between Datalog and Web Services


Database vendors also provide wrappers for web services
    Behind a web service there is a SQL query that corresponds to the web service;
    SQL defines the semantics of the web service.

Major database vendors support the mapping between SQL and web services;

We experimented with DB2WS.
    Malaika, S. et al. DB2 and Web Services. IBM Systems Journal, 41(4), pp. 666-685, 2002.

32

Generate the Abstract Implementation by Query Rewriting

Definition: Given a query Q and a set of views V, a rewriting of Q using V is a query Q’ such that Q’ is equivalent to Q (Q = Q’) and Q’ refers to one or more views in V.

Query:
    Q <- T1, T2, T3.

Views:
    V1 <- T1, T2.
    V2 <- T2, T3.

Rewriting 1:
    Q <- V1, T3.

Rewriting 2:
    Q <- V1, V2.
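The slide's example can be checked mechanically under a strong simplification: treat each body as a set of relation names and ignore variables and join conditions. The sketch below unfolds views into base tables and compares the result with the query body. It is an illustration only; a real equivalence test for conjunctive queries needs containment mappings over the variables. The names `unfold` and `is_rewriting` are chosen for this example.

```python
VIEWS = {"V1": ["T1", "T2"], "V2": ["T2", "T3"]}
QUERY = ["T1", "T2", "T3"]

def unfold(body, views):
    """Replace every view name in a body by the base tables it is defined over."""
    tables = set()
    for atom in body:
        if atom in views:
            tables |= unfold(views[atom], views)
        else:
            tables.add(atom)
    return tables

def is_rewriting(query_body, candidate_body, views):
    """True if the candidate uses at least one view and unfolds to the query body."""
    uses_view = any(atom in views for atom in candidate_body)
    return uses_view and unfold(candidate_body, views) == set(query_body)

rewriting_1 = is_rewriting(QUERY, ["V1", "T3"], VIEWS)
rewriting_2 = is_rewriting(QUERY, ["V1", "V2"], VIEWS)
```

Both rewritings from the slide pass this check, while a body consisting of `V1` alone does not, because `T3` is never covered.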

composition

33

Our query rewriting system

composition

34

Limitations of our approach


Focus on database web services;

Datalog is not expressive enough.
    Query rewriting in Description Logic, or OWL.

Assume the existence of global database schemas:
    Service providers need to provide the semantic definition of web services in terms of a global database schema;
    The new service specification is also defined using the common schema
    Schema matching

35

Other threads


Web service collection and clustering
    From UDDI, crawler, search engines such as Google
    Master's thesis to be finished this summer

Web service metrics

Schema subtyping
    Based on regular tree grammar
    Master's thesis to be finished this summer

Bottom-up web service composition

Semantic web service


36

Service Oriented Architecture

[Figure: SOA triangle with three roles, discovery agency, provider, and requester, connected by the operations publish, find, and interact.]

37

Web service discovery


Keyword search
    Based on IR techniques, such as the vector space model
    Fast, but not accurate

Signature matching
    Decide subtype relations between the inputs and outputs of web services
    Used in service composition, to find composable web services

Relaxed matching
    Approximate matching, allowing small deviations in both structure and words/tags

Semantic matching
    Matching functional requirements of web services
    Used in adaptive, autonomous systems
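For the keyword search option, a minimal vector space model can be sketched with raw term counts and cosine similarity. The service names and descriptions below are invented for illustration, and real systems would usually add TF-IDF weighting and stemming.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def search(query, services):
    """Return service names ranked by similarity, dropping zero scores."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(desc)), name) for name, desc in services.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

# Hypothetical service descriptions
SERVICES = {
    "BookPrice": "look up book price by ISBN",
    "Weather": "weather forecast by city",
}

hits = search("book price", SERVICES)
misses = search("novel cost", SERVICES)
```

The second query returns nothing even though `BookPrice` is relevant, which illustrates the "fast, but not accurate" remark: pure keyword overlap misses synonyms.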