# The Rise and Fall and Rise of Dependency Theory Part II: The Rise from the Ashes

31 Oct 2013

Ronald Fagin

## Dependencies were Considered Harmful

Dependencies were undesirable, except for keys and referential integrity constraints

Database normalization eliminated dependencies

BCNF: each FD is a logical consequence of keys

4NF: each MVD is a logical consequence of keys

5NF: each JD is a logical consequence of keys

But then:

Dependencies took on a new, very positive role!

## Data Integration and Data Exchange

Data integration: describe data in a global schema in terms of data in local schemas

Data exchange: describe data in a target schema in terms of data in a source schema, and actually produce the target database

## Data Integration and Data Exchange

These are old, but recurrent, database problems

Phil Bernstein (2003): Data exchange is the oldest database problem

EXPRESS (IBM San Jose Research Lab, 1977): for transforming data between hierarchical databases

The universal relation model is an early case of data integration

We will focus mainly on data exchange

## Schema Mappings & Data Exchange

Schema mapping $M = (S, T, \Sigma)$:

Source schema $S$, target schema $T$

High-level, declarative assertions $\Sigma$ that specify the relationship between $S$ and $T$

Data exchange via the schema mapping $M = (S, T, \Sigma)$: transform a given source instance $I$ into a target instance $J$, so that $\langle I, J \rangle$ satisfies the specifications $\Sigma$ of $M$

[Figure: source schema $S$ (instance $I$) connected by $\Sigma$ to target schema $T$ (instance $J$)]

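## Schema Mapping Specification Language

The relationship between source and target is typically given by source-to-target tuple-generating dependencies (s-t tgds):

$\forall \mathbf{x}\, (\varphi(\mathbf{x}) \rightarrow \exists \mathbf{y}\, \psi(\mathbf{x}, \mathbf{y}))$

where $\varphi(\mathbf{x})$ is a conjunction of atoms over the source and $\psi(\mathbf{x}, \mathbf{y})$ is a conjunction of atoms over the target

$(\mathrm{Student}(s) \wedge \mathrm{Enrolls}(s,c)) \rightarrow \exists t\, \exists g\, (\mathrm{Teaches}(t,c) \wedge \ldots)$

There may also be target tgds and equality-generating dependencies (egds):

$\ldots \rightarrow (g = g')$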
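To make the notion of satisfying an s-t tgd concrete, here is a minimal Python sketch. It assumes an instance is a dict from relation names to sets of tuples and an atom is a pair `(relation, variables)`; the helper names `matches` and `satisfies_tgd` are hypothetical. The `Grade(s, c, g)` atom is an assumption introduced only so that the existential variable `g` is used.

```python
def matches(atoms, instance, binding=None):
    """Yield every extension of `binding` that maps all atoms into `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(args):
            continue
        new = dict(binding)
        ok = True
        for var, val in zip(args, fact):
            # Bind the variable, or check agreement with an earlier binding.
            if new.setdefault(var, val) != val:
                ok = False
                break
        if ok:
            yield from matches(rest, instance, new)


def satisfies_tgd(I, J, premise, conclusion):
    """True if every match of the premise in I extends to the conclusion in J."""
    return all(
        any(True for _ in matches(conclusion, J, b))
        for b in matches(premise, I)
    )


# Source instance and the example tgd (Grade is an assumed target relation).
I = {"Student": {("ann",)}, "Enrolls": {("ann", "db")}}
premise = [("Student", ("s",)), ("Enrolls", ("s", "c"))]
conclusion = [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))]

J_good = {"Teaches": {("N1", "db")}, "Grade": {("ann", "db", "N2")}}
J_bad = {"Teaches": {("N1", "ai")}, "Grade": {("ann", "db", "N2")}}

print(satisfies_tgd(I, J_good, premise, conclusion))
print(satisfies_tgd(I, J_bad, premise, conclusion))
```

In `J_bad` nobody teaches `db`, so the existential variable `t` has no witness and the pair fails the tgd.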

## New Role of Dependencies

In data exchange, dependencies play a crucial role in describing how to transform data from one format to another

## Solutions in Schema Mappings

Definition: Let $M = (S, T, \Sigma)$ be a schema mapping. If $I$ is a source instance, then a solution for $I$ is a target instance $J$ such that $\langle I, J \rangle$ satisfies $\Sigma$

Fact: In general, for a given source instance $I$, there may be no solutions at all, or there may be multiple solutions; in fact, there may be infinitely many solutions

## Universal Solutions in Data Exchange

[Fagin, Kolaitis, Miller, Popa, ICDT 2003] introduced universal solutions as the “best” solutions in data exchange

By definition, a solution is universal if it has homomorphisms to all other solutions

Thus, it is a “most general” solution

Constants: entries in source instances

Variables (labeled nulls): entries besides constants in target instances

Homomorphism $h: J_1 \rightarrow J_2$ between target instances:

$h(c) = c$, if $c$ is a constant

If $P(a_1, \ldots, a_m)$ is in $J_1$, then $P(h(a_1), \ldots, h(a_m))$ is in $J_2$
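The two homomorphism conditions can be checked by a small backtracking search. The sketch below is illustrative only: it assumes instances are dicts from relation names to sets of tuples, and that labeled nulls are strings starting with an underscore (a convention chosen here, not taken from the slides). The function name `find_homomorphism` is hypothetical.

```python
def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def find_homomorphism(J1, J2):
    """Return a dict h on the nulls of J1 witnessing J1 -> J2, or None."""
    facts = [(rel, fact) for rel, rows in J1.items() for fact in rows]

    def extend(i, h):
        if i == len(facts):
            return h
        rel, fact = facts[i]
        for candidate in J2.get(rel, set()):
            if len(candidate) != len(fact):
                continue
            new = dict(h)
            ok = True
            for a, b in zip(fact, candidate):
                if is_null(a):
                    # A null may map anywhere, but consistently.
                    if new.setdefault(a, b) != b:
                        ok = False
                        break
                elif a != b:
                    # h(c) = c for every constant c.
                    ok = False
                    break
            if ok:
                result = extend(i + 1, new)
                if result is not None:
                    return result
        return None

    return extend(0, {})


J1 = {"Teaches": {("_t", "db")}}   # an unknown teacher of db
J2 = {"Teaches": {("bob", "db")}}  # bob teaches db

print(find_homomorphism(J1, J2))
print(find_homomorphism(J2, J1))
```

Here `J1` maps into `J2` by sending the null `_t` to `bob`, but not the other way round, because the constant `bob` must stay fixed. In that sense `J1` is the more general of the two.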

## How to Obtain a Universal Solution?

Answer: Use our old friend the chase!

Theorem [Fagin, Kolaitis, Miller, Popa, ICDT 2003]: If there is a solution, then the chase produces a universal solution
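As a rough illustration of the chase, the following Python sketch handles the simplest case only: s-t tgds with no target tgds or egds, so a single pass over the source suffices. The representation (dicts of tuple sets, atoms as `(relation, variables)` pairs), the null naming scheme `_N1, _N2, ...`, and the `Grade` relation are all assumptions made for the example.

```python
from itertools import count


def matches(atoms, instance, binding=None):
    """Yield every extension of `binding` that maps all atoms into `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(args):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(args, fact)):
            yield from matches(rest, instance, new)


def chase_st_tgds(I, tgds):
    """Build a target instance by firing each s-t tgd on each premise match."""
    J = {}
    fresh = count(1)
    for premise, conclusion in tgds:
        for b in list(matches(premise, I)):
            if any(True for _ in matches(conclusion, J, b)):
                continue  # conclusion already holds for this match
            env = dict(b)
            for rel, args in conclusion:
                for var in args:
                    if var not in env:
                        # Existential variable: invent a fresh labeled null.
                        env[var] = f"_N{next(fresh)}"
                J.setdefault(rel, set()).add(tuple(env[v] for v in args))
    return J


I = {"Student": {("ann",)}, "Enrolls": {("ann", "db")}}
tgd = (
    [("Student", ("s",)), ("Enrolls", ("s", "c"))],
    [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))],
)
J = chase_st_tgds(I, [tgd])
print(J)
```

A full chase would also repeatedly apply target tgds and egds until nothing changes, and would fail when an egd equates two distinct constants; that part is deliberately left out here.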

## Standard schema mappings

[Fagin, Kolaitis, Miller, Popa, ICDT 2003] define a weakly acyclic set of tgds

[Deutsch, Tannen, ICDT 2003] have a slightly more restrictive notion

Let a standard schema mapping be one specified by s-t tgds, target egds, and a weakly acyclic set of target tgds.

Theorem [Fagin, Kolaitis, Miller, Popa, ICDT 2003]: For standard schema mappings, the chase runs in polynomial time (data complexity)
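The slides do not spell out weak acyclicity, so the sketch below relies on the usual formulation, which should be checked against the cited paper. Nodes are positions `(R, i)`. For each universally quantified variable that also occurs in the conclusion, every premise position of that variable gets a regular edge to each of its conclusion positions and a special edge to each position of an existentially quantified variable. The set is weakly acyclic if no cycle goes through a special edge. The function name and the tgd encoding are hypothetical.

```python
def is_weakly_acyclic(tgds):
    """Check that no cycle of the position graph contains a special edge."""
    edges = {}       # position -> set of successor positions
    special = set()  # the subset of edges (p, q) that are special

    for premise, conclusion in tgds:
        premise_pos = {}
        for rel, args in premise:
            for i, var in enumerate(args):
                premise_pos.setdefault(var, set()).add((rel, i))
        conclusion_vars = {v for _, args in conclusion for v in args}

        for var, sources in premise_pos.items():
            if var not in conclusion_vars:
                continue  # variable is not propagated to the conclusion
            for p in sources:
                for rel, args in conclusion:
                    for i, v in enumerate(args):
                        q = (rel, i)
                        if v == var:
                            edges.setdefault(p, set()).add(q)
                        elif v not in premise_pos:  # existential variable
                            edges.setdefault(p, set()).add(q)
                            special.add((p, q))

    def reaches(start, goal):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return False

    # A special edge p -> q lies on a cycle exactly when q reaches p.
    return not any(reaches(q, p) for p, q in special)


# E(x, y) -> exists z. E(x, z): nulls land in E.1 but never feed back.
keeps_source = ([("E", ("x", "y"))], [("E", ("x", "z"))])
# E(x, y) -> exists z. E(y, z): each new null triggers another one.
shifts = ([("E", ("x", "y"))], [("E", ("y", "z"))])

print(is_weakly_acyclic([keeps_source]))
print(is_weakly_acyclic([shifts]))
```

The second tgd is the classic case where the chase does not terminate: every fresh null in the second column of `E` becomes a first-column value that demands yet another null.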

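[Figure: schema $S$ with instance $I$, mapped by $\Sigma$ to schema $T$ with instance $J$; a query $q$ is posed over $T$]

Question: What is the semantics of target query answering?

Definition: The certain answers of a query $q$ over $T$ on $I$:

$\mathrm{certain}(q, I) = \bigcap \{\, q(J) : J \text{ is a solution for } I \,\}$

Note: It is the standard semantics in data integration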

Theorem [Fagin, Kolaitis, Miller, Popa, ICDT 2003]: Assume a standard schema mapping. Let $q$ be a union of conjunctive queries over the target. If $I$ is a source instance and $J$ is a universal solution for $I$, then $\mathrm{certain}(q, I)$ is the set of all “null-free” tuples in $q(J)$.

Hence, $\mathrm{certain}(q, I)$ is computable in polynomial time:

1. Compute a universal solution $J$, using the chase, in polynomial time

2. Evaluate $q(J)$ and remove tuples with nulls
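Step 2 can be sketched in a few lines of Python for a single conjunctive query (a union of conjunctive queries would take the union of the per-query results). The universal solution `J` below is written by hand rather than computed. The underscore convention for nulls, the `Grade` relation, and the function names are assumptions for the example.

```python
def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def matches(atoms, instance, binding=None):
    """Yield every variable binding that maps all atoms into `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(args):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(args, fact)):
            yield from matches(rest, instance, new)


def certain_answers(head, body, J):
    """Evaluate the conjunctive query head :- body on J, then drop null tuples."""
    answers = {tuple(b[v] for v in head) for b in matches(body, J)}
    return {row for row in answers if not any(is_null(v) for v in row)}


# A hand-written universal solution with two labeled nulls.
J = {"Teaches": {("_N1", "db")}, "Grade": {("ann", "db", "_N2")}}

# q1(s, c) :- Grade(s, c, g)
enrolled = certain_answers(("s", "c"), [("Grade", ("s", "c", "g"))], J)
# q2(t, c) :- Teaches(t, c)
teachers = certain_answers(("t", "c"), [("Teaches", ("t", "c"))], J)

print(enrolled)
print(teachers)
```

The first query has a certain answer because the null `_N2` is projected away. The second has none, since the only teacher in `J` is an unknown value.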

## Composing Schema Mappings

Given $M_{12} = (S_1, S_2, \Sigma_{12})$ and $M_{23} = (S_2, S_3, \Sigma_{23})$, derive a schema mapping $M_{13} = (S_1, S_3, \Sigma_{13})$ that is “equivalent” to the sequence $M_{12}$ and $M_{23}$

[Figure: schemas $S_1$, $S_2$, $S_3$, with $M_{12}$ from $S_1$ to $S_2$, $M_{23}$ from $S_2$ to $S_3$, and $M_{13}$ from $S_1$ to $S_3$]

What does it mean for $M_{13}$ to be “equivalent” to the composition of $M_{12}$ and $M_{23}$?

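## Semantics of Composition

$\Sigma_{13}$ has to have the property that:

$\langle I_1, I_3 \rangle \models \Sigma_{13}$ if and only if there exists $I_2$ such that $\langle I_1, I_2 \rangle \models \Sigma_{12}$ and $\langle I_2, I_3 \rangle \models \Sigma_{23}$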

## Result of the composition

Question: If $M_{12}$ and $M_{23}$ are each specified by s-t tgds, what language is needed for specifying the composition of $M_{12}$ and $M_{23}$?

[Fagin, Kolaitis, Popa, Tan, PODS 2004]: second-order tgds

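## Second-Order Tgds

Definition: Let $S$ be a source schema and $T$ a target schema. A second-order tuple-generating dependency (SO-tgd) is a formula of the form:

$\exists f_1 \ldots \exists f_m\, \big( (\forall \mathbf{x}_1 (\varphi_1 \rightarrow \psi_1)) \wedge \ldots \wedge (\forall \mathbf{x}_n (\varphi_n \rightarrow \psi_n)) \big)$,

where

each $f_i$ is a function symbol

each $\varphi_i$ is a conjunction of atoms over $S$ and equalities of terms

each $\psi_i$ is a conjunction of atoms from $T$

Example:

$\exists f\, \big( \forall e\, (\mathrm{Emp}(e) \rightarrow \mathrm{Mgr}(e, f(e))) \wedge \forall e\, (\mathrm{Emp}(e) \wedge (e = f(e)) \rightarrow \mathrm{SelfMgr}(e)) \big)$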
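The second-order quantifier in the example says that some choice of manager function `f` must make both clauses true. A brute-force Python sketch of that reading follows. It assumes `Emp` and `SelfMgr` are sets of names and `Mgr` is a set of pairs; the function name `satisfies_so_tgd` is hypothetical. Because the first clause forces `(e, f(e))` to be in `Mgr`, only the managers listed for `e` need to be tried as values of `f(e)`.

```python
from itertools import product


def satisfies_so_tgd(emp, mgr, self_mgr):
    """Search for a function f that makes both clauses of the example hold."""
    employees = sorted(emp)
    # Clause 1: Mgr(e, f(e)) must hold, so f(e) is one of e's listed managers.
    choices = [[m for (e2, m) in mgr if e2 == e] for e in employees]
    for values in product(*choices):
        f = dict(zip(employees, values))
        # Clause 2: whenever e = f(e), SelfMgr(e) must hold.
        if all(e in self_mgr for e in employees if f[e] == e):
            return True
    return False


emp = {"alice", "bob"}

# bob's only manager is bob, so f(bob) = bob and SelfMgr(bob) is required.
forced = {("alice", "carol"), ("bob", "bob")}
print(satisfies_so_tgd(emp, forced, set()))
print(satisfies_so_tgd(emp, forced, {"bob"}))

# With a second manager available, f(bob) = carol avoids the second clause.
flexible = forced | {("bob", "carol")}
print(satisfies_so_tgd(emp, flexible, set()))
```

The last call shows why the dependency is second-order: the target instance is acceptable as long as one suitable `f` exists, even though another candidate `f` would violate the second clause.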

## Composition and SO-Tgds

Theorem [Fagin, Kolaitis, Popa, Tan, PODS 2004]:

The composition of any finite sequence of schema mappings specified by s-t tgds can be specified by an SO-tgd

Conversely, every SO-tgd specifies the composition of a finite sequence of mappings that are each specified by s-t tgds.

Recently [Arenas, Fagin, Nash, ICDT 2010] showed that the sequence need only be of size 2

## Composition with Target Constraints

[Arenas, Fagin, Nash, ICDT 2010] defined s-t SO dependencies, which generalize SO tgds by allowing not only target atoms but also equalities in the conclusion

Theorem [Arenas, Fagin, Nash, ICDT 2010]:

The composition of any finite sequence of standard schema mappings can be specified by an s-t SO dependency (along with target egds and target tgds)

Conversely, every s-t SO dependency specifies the composition of a finite sequence of standard schema mappings

In fact, again, the sequence need only be of size 2

The chase procedure can be extended to schema mappings specified by s-t SO dependencies, so that it produces universal solutions in polynomial time (data complexity)

## Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in IBM InfoSphere Data Architect.

Dependency theory is alive and well!

## Extra slides

## The Smallest Universal Solution

Fact: Universal solutions need not be unique

Question: Is there a “best” universal solution?

[Fagin, Kolaitis, Popa, PODS 2003] took a “small is beautiful” approach:

There is a smallest universal solution (if solutions exist); hence, the most compact one to materialize

Definition: The core of an instance $J$ is the smallest subinstance $J'$ that is homomorphically equivalent to $J$

Fact:

Every finite relational structure has a core

The core is unique up to isomorphism
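The definition of the core can be turned directly into a brute-force search: try subinstances in order of increasing size and return the first one that the whole instance maps into. This Python sketch is exponential and meant only to illustrate the definition, not the polynomial-time algorithms from the cited papers. It assumes an instance is a flat set of facts `(relation, tuple)` and that nulls are strings starting with an underscore; all names are hypothetical.

```python
from itertools import combinations


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def has_homomorphism(source, target):
    """True if some map on nulls (fixing constants) sends source into target."""
    source = list(source)

    def extend(i, h):
        if i == len(source):
            return True
        rel, args = source[i]
        for rel2, args2 in target:
            if rel2 != rel or len(args2) != len(args):
                continue
            new = dict(h)
            if all(
                (new.setdefault(a, b) == b) if is_null(a) else (a == b)
                for a, b in zip(args, args2)
            ):
                if extend(i + 1, new):
                    return True
        return False

    return extend(0, {})


def core(J):
    """Return a smallest subinstance of J that J maps into homomorphically."""
    facts = sorted(J)
    for size in range(len(facts) + 1):
        for subset in combinations(facts, size):
            # A subinstance always maps into J by inclusion, so only the
            # direction J -> subset needs checking.
            if has_homomorphism(facts, set(subset)):
                return set(subset)


J = {
    ("Teaches", ("_N1", "db")),   # some unknown teacher of db
    ("Teaches", ("bob", "db")),   # bob teaches db
    ("Grade", ("ann", "db", "_N2")),
}
print(core(J))
```

In this instance the fact with the unknown teacher `_N1` is redundant, because `_N1` can be mapped to `bob`, so the core keeps only the other two facts.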

## Core: The smallest universal solution

Theorem [Fagin, Kolaitis, Popa, PODS 2003]:

All universal solutions have the same core

The core of the universal solutions is the smallest universal solution

If the target constraints are egds, then the core is polynomial-time computable (data complexity)

Theorem [Gottlob and Nash, PODS 2006]:

If the target constraints are egds and a weakly acyclic set of tgds, then the core is polynomial-time computable

## Old Conclusions

Dependencies now play a crucial role in data integration and data exchange

We even have second-order dependencies, which have in fact been implemented in practice!

Lately, even probabilistic dependencies have been studied

[Dong, Halevy, Yu, VLDB 2007]

[Das Sarma, Dong, Halevy, SIGMOD 2008]

[Fagin, Kimelfeld, Kolaitis, ICDT 2010]

Probabilistic dependencies on probabilistic databases

Dependency theory is alive and well!