The Rise and Fall and Rise of Dependency Theory Part II: The Rise from the Ashes

31 Oct 2013



Ronald Fagin


IBM Almaden Research Center





Dependencies were Considered Harmful


Dependencies were undesirable


Except for keys and referential integrity constraints


Database normalization eliminated dependencies


BCNF: each FD is a logical consequence of keys


4NF: each MVD is a logical consequence of keys


5NF: each JD is a logical consequence of keys


But then:


Dependencies took on a new, very positive role!

Data Integration and Data Exchange

Data integration: Describe data in a global schema in terms of data in local schemas

Data exchange: Describe data in a target schema in terms of data in a source schema, and actually produce the target database






Data Integration and Data Exchange

These are old, but recurrent, database problems

Phil Bernstein, 2003: Data exchange is the oldest database problem

EXPRESS: IBM San Jose Research Lab, 1977, for transforming data between hierarchical databases

The universal relation model is an early case of data integration

We will focus mainly on data exchange

Schema Mappings & Data Exchange

[Figure: source schema S and target schema T related by Σ; source instance I and target instance J]

Schema mapping M = (S, T, Σ)

Source schema S, target schema T

High-level, declarative assertions Σ that specify the relationship between S and T

Data exchange via the schema mapping M = (S, T, Σ):

Transform a given source instance I to a target instance J, so that <I, J> satisfy the specifications Σ of M

Schema Mapping Specification Language

The relationship between source and target is typically given by source-to-target tgds (tuple-generating dependencies)

φ(x) → ∃y ψ(x, y)

where

φ(x) is a conjunction of atoms over the source

ψ(x, y) is a conjunction of atoms over the target

(Student(s) ∧ Enrolls(s,c)) → ∃t ∃g (Teaches(t,c) ∧ Grade(s,c,g))

There may also be target tgds and egds (equality-generating dependencies):

(Grade(s,c,g) ∧ Grade(s,c,g’)) → (g = g’)
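
A source-to-target tgd can be checked mechanically against a pair of instances. The following is a minimal sketch, assuming instances are dictionaries from relation names to sets of tuples and that every term in an atom is a variable; the helper names `matches` and `satisfies_tgd` and the sample data are illustrative, not taken from the slides.

```python
# Instances map a relation name to a set of tuples.
# An atom is (relation_name, (variable, ...)); all terms are variables here.

def matches(atoms, instance, binding):
    """Yield each assignment extending `binding` that maps every atom to a fact."""
    if not atoms:
        yield dict(binding)
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(terms):
            continue
        new = dict(binding)
        # setdefault binds an unbound variable, or returns the existing value
        if all(new.setdefault(var, val) == val for var, val in zip(terms, fact)):
            yield from matches(rest, instance, new)


def satisfies_tgd(source, target, premise, conclusion):
    """True if every match of the premise in the source extends to a match
    of the conclusion in the target (existential variables bound freely)."""
    for b in matches(premise, source, {}):
        if next(matches(conclusion, target, b), None) is None:
            return False
    return True


# The Student/Enrolls example, with hypothetical data.
premise = [("Student", ("s",)), ("Enrolls", ("s", "c"))]
conclusion = [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))]

I = {"Student": {("alice",)}, "Enrolls": {("alice", "db")}}
J_good = {"Teaches": {("N1", "db")}, "Grade": {("alice", "db", "N2")}}
J_bad = {"Teaches": {("N1", "db")}, "Grade": set()}

print(satisfies_tgd(I, J_good, premise, conclusion))  # True
print(satisfies_tgd(I, J_bad, premise, conclusion))   # False
```

Here `J_good` is a solution for `I` with respect to this single tgd, while `J_bad` is not, because no Grade fact exists for alice and db.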

New Role of Dependencies


In data exchange, dependencies play a crucial role in
describing how to transform data from one format to another

Solutions in Schema Mappings

Definition: Schema mapping M = (S, T, Σ)

If I is a source instance, then a solution for I is a target instance J such that <I, J> satisfy Σ

Fact: In general, for a given source instance I,

there may be no solutions at all

or

there may be multiple solutions; in fact there may be infinitely many solutions

Universal Solutions in Data Exchange

[Fagin, Kolaitis, Miller, Popa - ICDT 2003] introduced universal solutions as the “best” solutions in data exchange

By definition, a solution is universal if it has homomorphisms to all other solutions

Thus, it is a “most general” solution

Constants: entries in source instances

Variables (labeled nulls): entries besides constants in target instances

Homomorphism h: J₁ → J₂ between target instances:

h(c) = c, if c is a constant

If P(a₁,…,aₘ) is in J₁, then P(h(a₁),…,h(aₘ)) is in J₂
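
The homomorphism definition can be made concrete with a small backtracking search. This is a sketch under stated assumptions: labeled nulls are modeled by a hypothetical `Null` class, every other value counts as a constant, and `find_homomorphism` is an illustrative name rather than anything from the cited paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null; any value that is not a Null is treated as a constant."""
    label: str


def find_homomorphism(j1, j2):
    """Return a dict h on the nulls of j1 such that every fact of j1,
    with h applied, is a fact of j2; return None if no such h exists."""
    facts = [(rel, fact) for rel, tuples in j1.items() for fact in tuples]

    def extend(i, h):
        if i == len(facts):
            return h
        rel, fact = facts[i]
        for image in j2.get(rel, set()):
            if len(image) != len(fact):
                continue
            new = dict(h)
            ok = True
            for a, b in zip(fact, image):
                if isinstance(a, Null):
                    if new.setdefault(a, b) != b:   # null already mapped elsewhere
                        ok = False
                        break
                elif a != b:                        # h(c) = c for every constant c
                    ok = False
                    break
            if ok:
                result = extend(i + 1, new)
                if result is not None:
                    return result
        return None

    return extend(0, {})


J1 = {"Teaches": {(Null("N1"), "db")}}
J2 = {"Teaches": {("bob", "db")}}

print(find_homomorphism(J1, J2))  # maps the null N1 to "bob"
print(find_homomorphism(J2, J1))  # None: the constant "bob" cannot be mapped to a null
```

In this example J1 is the more general instance: it maps into J2, but not the other way round.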



How to Obtain a Universal Solution?

Answer: Use our old friend the chase!

Theorem [Fagin, Kolaitis, Miller, Popa - ICDT 2003]: If there is a solution, then the chase produces a universal solution
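
To illustrate the idea, here is a minimal sketch of a chase restricted to source-to-target tgds only (no target tgds or egds, so termination is immediate and the chase cannot fail). The `Null` class, the function names, and the sample data are assumptions made for illustration; the chase in the cited paper also handles target constraints.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Null:
    """A labeled null invented by the chase."""
    label: str


def matches(atoms, instance, binding):
    """Yield each assignment extending `binding` that maps every atom to a fact."""
    if not atoms:
        yield dict(binding)
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(terms):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(terms, fact)):
            yield from matches(rest, instance, new)


def chase_st_tgds(source, tgds):
    """Chase `source` with s-t tgds given as (premise, conclusion) pairs."""
    target = {}
    fresh = count(1)
    for premise, conclusion in tgds:
        for b in list(matches(premise, source, {})):
            # Skip the step if the conclusion already holds for this match.
            if next(matches(conclusion, target, b), None) is not None:
                continue
            env = dict(b)
            for rel, terms in conclusion:
                row = []
                for var in terms:
                    if var not in env:              # existential variable
                        env[var] = Null(f"N{next(fresh)}")
                    row.append(env[var])
                target.setdefault(rel, set()).add(tuple(row))
    return target


tgd = ([("Student", ("s",)), ("Enrolls", ("s", "c"))],
       [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))])
I = {"Student": {("alice",)}, "Enrolls": {("alice", "db")}}

J = chase_st_tgds(I, [tgd])
print(J)
```

For this input the result contains Teaches(N1, db) and Grade(alice, db, N2), where N1 and N2 are fresh labeled nulls standing for the unknown teacher and grade.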



Standard schema mappings

[Fagin, Kolaitis, Miller, Popa - ICDT 2003] define a weakly acyclic set of tgds

[Deutsch, Tannen - ICDT 2003] have a slightly more restrictive notion

Let a standard schema mapping be one specified by s-t tgds, target egds, and a weakly acyclic set of target tgds.

Theorem [Fagin, Kolaitis, Miller, Popa - ICDT 2003]: For standard schema mappings, the chase runs in polynomial time (data complexity)


Query Answering in Data Exchange

[Figure: source schema S with instance I, target schema T with instance J, related by Σ; query q posed over T]

Question: What is the semantics of target query answering?

Definition: The certain answers of a query q over T on I:

certain(q, I) = ∩ {q(J): J is a solution for I}

Note: It is the standard semantics in data integration



Computing the Certain Answers

Theorem [Fagin, Kolaitis, Miller, Popa - ICDT 2003]: Assume a standard schema mapping. Let q be a union of conjunctive queries over the target.

If I is a source instance and J is a universal solution for I:

certain(q, I) = the set of all “null-free” tuples in q(J).

Hence, certain(q, I) is computable in polynomial time

1. Compute a universal solution J, using the chase, in polynomial time

2. Evaluate q(J) and remove tuples with nulls
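
Step 2 of the procedure above can be sketched as follows, assuming a universal solution J is already available. The representation is hypothetical: nulls are instances of an illustrative `Null` class, and a union of conjunctive queries is a list of (head variables, body atoms) pairs whose atoms contain variables only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null; any other value is a constant."""
    label: str


def matches(atoms, instance, binding):
    """Yield each assignment extending `binding` that maps every atom to a fact."""
    if not atoms:
        yield dict(binding)
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(terms):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(terms, fact)):
            yield from matches(rest, instance, new)


def evaluate_cq(head, body, instance):
    """Evaluate one conjunctive query, treating nulls like ordinary values."""
    return {tuple(b[v] for v in head) for b in matches(body, instance, {})}


def certain_answers(ucq, universal_solution):
    """Evaluate a union of conjunctive queries and keep the null-free tuples."""
    answers = set()
    for head, body in ucq:
        answers |= evaluate_cq(head, body, universal_solution)
    return {t for t in answers if not any(isinstance(v, Null) for v in t)}


# A universal solution for a source with Student(alice), Enrolls(alice, db).
J = {"Teaches": {(Null("N1"), "db")}, "Grade": {("alice", "db", Null("N2"))}}

who_takes_what = [(("s", "c"), [("Grade", ("s", "c", "g"))])]
who_teaches = [(("t",), [("Teaches", ("t", "c"))])]

print(certain_answers(who_takes_what, J))  # {('alice', 'db')}
print(certain_answers(who_teaches, J))     # set()
```

The second query has no certain answers: the teacher is only known as a null, so no concrete teacher appears in every solution.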




Composing Schema Mappings

Given M₁₂ = (S₁, S₂, Σ₁₂) and M₂₃ = (S₂, S₃, Σ₂₃), derive a schema mapping M₁₃ = (S₁, S₃, Σ₁₃) that is “equivalent” to the sequence M₁₂ and M₂₃

[Figure: schemas S₁, S₂, S₃; M₁₂ maps S₁ to S₂, M₂₃ maps S₂ to S₃, and M₁₃ maps S₁ to S₃]

What does it mean for M₁₃ to be “equivalent” to the composition of M₁₂ and M₂₃?

Semantics of Composition

Σ₁₃ has to have the property that:

<I₁, I₃> ⊨ Σ₁₃ if and only if there exists I₂ such that <I₁, I₂> ⊨ Σ₁₂ and <I₂, I₃> ⊨ Σ₂₃





Result of the composition

Question: If M₁₂ and M₂₃ are each specified by s-t tgds, what language is needed for specifying the composition of M₁₂ and M₂₃?

Answer [Fagin, Kolaitis, Popa, Tan - PODS 2004]: second-order tgds




Second-Order Tgds

Definition: Let S be a source schema and T a target schema.

A second-order tuple-generating dependency (SO-tgd) is a formula of the form:

∃f₁ … ∃fₘ ( (∀x₁ (φ₁ → ψ₁)) ∧ … ∧ (∀xₙ (φₙ → ψₙ)) ),

where

each fᵢ is a function symbol

each φᵢ is a conjunction of atoms over S and equalities of terms

each ψᵢ is a conjunction of atoms from T

Example:

∃f ( ∀e (Emp(e) → Mgr(e, f(e))) ∧ ∀e (Emp(e) ∧ (e = f(e)) → SelfMgr(e)) )
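
The existential quantifier over the function f in the Emp/Mgr example can be read as "some choice of manager per employee makes both conjuncts true". The sketch below checks this by brute force on small instances. It is only an illustration: the function names and data are hypothetical, and f is searched only over values in the second column of Mgr, which loses nothing here because the first conjunct forces (e, f(e)) to be a Mgr fact.

```python
from itertools import product


def holds_for(f, emp, mgr, self_mgr):
    """Check both conjuncts of the example for one candidate function f (a dict)."""
    first = all((e, f[e]) in mgr for e in emp)               # Emp(e) -> Mgr(e, f(e))
    second = all(e in self_mgr for e in emp if e == f[e])    # Emp(e), e = f(e) -> SelfMgr(e)
    return first and second


def satisfies_so_tgd(emp, mgr, self_mgr):
    """True if some function f on the employees satisfies the example SO-tgd."""
    employees = sorted(emp)
    candidates = sorted({m for _, m in mgr})  # only these values can satisfy conjunct one
    for images in product(candidates, repeat=len(employees)):
        f = dict(zip(employees, images))
        if holds_for(f, emp, mgr, self_mgr):
            return True
    return False


emp = {"ann", "bob"}
mgr = {("ann", "bob"), ("bob", "bob")}

print(satisfies_so_tgd(emp, mgr, set()))     # False: bob manages himself but SelfMgr(bob) is missing
print(satisfies_so_tgd(emp, mgr, {"bob"}))   # True
```

The search is exponential in the number of employees, so it is meant only to make the semantics tangible, not as a practical check.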


Composition and SO-Tgds

Theorem [Fagin, Kolaitis, Popa, Tan - PODS 2004]:

The composition of any finite sequence of schema mappings specified by s-t tgds can be specified by an SO-tgd

Conversely, every SO-tgd specifies the composition of a finite sequence of mappings that are each specified by s-t tgds.

Recently [Arenas, Fagin, Nash - ICDT 2010] showed that the sequence need only be of size 2


Composition with Target Constraints

[Arenas, Fagin, Nash - ICDT 2010] defined s-t SO dependencies, which generalize SO-tgds by allowing not only target atoms but also equalities in the conclusion

Theorem [Arenas, Fagin, Nash - ICDT 2010]:

The composition of any finite sequence of standard schema mappings can be specified by an s-t SO dependency (along with target egds and target tgds)

Conversely, every s-t SO dependency specifies the composition of a finite sequence of standard schema mappings

In fact, again, the sequence need only be of size 2

The chase procedure can be extended to schema mappings specified by s-t SO dependencies, so that it produces universal solutions in polynomial time (data complexity)

Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in IBM InfoSphere Data Architect.


Dependency theory is alive and well!








Extra slides


The Smallest Universal Solution

Fact: Universal solutions need not be unique

Question: Is there a “best” universal solution?

Answer: [Fagin, Kolaitis, Popa - PODS 2003] took a “small is beautiful” approach:

There is a smallest universal solution (if solutions exist); hence, the most compact one to materialize

Definition: The core of an instance J is the smallest subinstance J’ that is homomorphically equivalent to J

Fact:

Every finite relational structure has a core

The core is unique up to isomorphism
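
The definition of the core can be turned directly into a brute-force search: try subinstances in order of increasing size and return the first one that J maps into homomorphically (the inclusion map already gives the homomorphism in the other direction). This is an exponential-time sketch for tiny instances only, using a hypothetical `Null` class for labeled nulls; it is not the polynomial-time method of the cited papers.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Null:
    """A labeled null; any other value is a constant."""
    label: str


def find_homomorphism(j1, j2):
    """Return a mapping of j1's nulls that sends every fact of j1 into j2, or None."""
    facts = [(rel, fact) for rel, tuples in j1.items() for fact in tuples]

    def extend(i, h):
        if i == len(facts):
            return h
        rel, fact = facts[i]
        for image in j2.get(rel, set()):
            if len(image) != len(fact):
                continue
            new = dict(h)
            ok = True
            for a, b in zip(fact, image):
                if isinstance(a, Null):
                    if new.setdefault(a, b) != b:
                        ok = False
                        break
                elif a != b:                     # constants must map to themselves
                    ok = False
                    break
            if ok:
                result = extend(i + 1, new)
                if result is not None:
                    return result
        return None

    return extend(0, {})


def core(j):
    """Smallest subinstance of j that j maps into homomorphically."""
    facts = sorted(((rel, fact) for rel, tuples in j.items() for fact in tuples), key=repr)
    for size in range(len(facts) + 1):
        for subset in combinations(facts, size):
            sub = {}
            for rel, fact in subset:
                sub.setdefault(rel, set()).add(fact)
            if find_homomorphism(j, sub) is not None:
                return sub
    return j  # not reached: j always maps into itself


J = {"Teaches": {(Null("N1"), "db"), ("bob", "db")}}
print(core(J))  # {'Teaches': {('bob', 'db')}}
```

The fact Teaches(N1, db) is redundant because N1 can be mapped to bob, so the core keeps only Teaches(bob, db).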


Core: The smallest universal solution

Theorem [Fagin, Kolaitis, Popa - PODS 2003]:

All universal solutions have the same core

The core of the universal solutions is the smallest universal solution

If the target constraints are egds, then the core is polynomial-time computable (data complexity)

Theorem [Gottlob and Nash - PODS 2006]:

If the target constraints are egds and a weakly acyclic set of tgds, then the core is polynomial-time computable







Old Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in practice!

Lately, even probabilistic dependencies have been studied

[Dong, Halevy, Yu - VLDB 2007]

[Das Sarma, Dong, Halevy - SIGMOD 2008]

[Fagin, Kimelfeld, Kolaitis - ICDT 2010]


Probabilistic dependencies on probabilistic databases



Dependency theory is alive and well!