Data Integration: The Teenage Years



Alon Halevy (Google)

Anand Rajaraman (Kosmix)

Joann Ordille (Avaya)


VLDB 2006

Agenda


A few perspectives on the last 10 years


Technical, commercial


Perspectives from our personal paths


Wild speculations about the future


This is not a survey on data integration

(See the paper in the proceedings for another non-survey)

Acknowledgements

Other members of the Information Manifold Project:


Jaewoo Kang (NCSU, Korea Univ.)


Divesh Srivastava (AT&T Labs)


Shuky Sagiv (Hebrew U.)


Tom Kirk


Acknowledgements

To the SIGMOD 1996 Program Committee

For rejecting the earlier version of the paper.

Timeline

[Figure: timeline, 1995 to 2006]

Data Integration

[Figure: kinds of sources to integrate: Legacy Databases, Services and Applications, Enterprise Databases]

[Figure: example schema with entities Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment]

The Information Manifold


Goal: integrate data from multiple sources on the web:

Find the Woody Allen movies playing in my area, and their reviews

Need to describe the data sources:

Contents, constraints, access patterns

[Figure: architecture diagram. Labels: five source wrappers; Mediated Schema; Semantic mappings; query reformulation; optimization & execution; Design time; Run time]

Semantic Mappings

[a.k.a. Source Descriptions]


Books (Title, ISBN, Price, DiscountPrice, Edition)

CDs (Album, ASIN, Price, DiscountPrice, Studio)

BookCategories (ISBN, Category)

CDCategories (ASIN, Category)

Artists (ASIN, ArtistName, GroupName)

Authors (ISBN, FirstName, LastName)

Mediated Schema:

CD: ASIN, Title, Genre,

Artist: ASIN, name,

logic

Global-as-View (GAV)

Mediated Schema:

CD: ASIN, Title, Genre,

Artist: ASIN, name,

Sources: R1, R2, R3, R4, R5

Mapping:

CD(A,T,G) :- R1(A,T,G)

CD(A,T,G) :- R2(A,T), R3(T,G)
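Under GAV, a query over the mediated schema can be answered by unfolding each mediated relation into the source queries that define it. The following is a minimal sketch of the two CD rules from the GAV slide; the rows placed in R1, R2, and R3 are made up for illustration, and only the relation names and the rules come from the slides.

```python
# Hypothetical source contents (assumptions for illustration only).
R1 = {("a1", "Album One", "Jazz")}       # R1(ASIN, Title, Genre)
R2 = {("a2", "Album Two")}               # R2(ASIN, Title)
R3 = {("Album Two", "French")}           # R3(Title, Genre)


def cd():
    """CD(A,T,G) :- R1(A,T,G)   and   CD(A,T,G) :- R2(A,T), R3(T,G)."""
    from_r1 = set(R1)
    # Join R2 and R3 on the shared Title attribute.
    from_r2_r3 = {(a, t, g) for (a, t) in R2 for (t2, g) in R3 if t == t2}
    return from_r1 | from_r2_r3


def french_titles():
    """A query over the mediated schema: titles of CDs whose genre is French."""
    return {t for (_a, t, g) in cd() if g == "French"}


print(sorted(cd()))
print(french_titles())
```

The user query only mentions CD; the rule bodies decide which sources are contacted, which is why reformulation in GAV amounts to view unfolding.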

Local-as-View (LAV)

Mediated Schema:

CD: ASIN, Title, Genre, Year

Artist: ASIN, Name,

Sources: R1, R2, R3, R4, R5

Mapping:

R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970

R2(A,T) :- CD(A,T,"French",Y)
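Under LAV the direction is reversed: each source is described as a view over the mediated schema. The sketch below spells out what the two LAV rules for R1 and R2 mean by evaluating them over a hypothetical mediated instance. The rows, and the helper `source_is_consistent`, are assumptions for illustration; a real LAV system never materializes the mediated relations.

```python
# Hypothetical instance of the mediated schema (illustration only).
CD = {
    ("a1", "Album One", "Jazz", 1959),
    ("a2", "Album Two", "French", 1947),
    ("a3", "Album Three", "House", 2001),
}
ARTIST = {("a1", "Artist X"), ("a2", "Artist Y"), ("a3", "Artist Z")}


def view_r1():
    """R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970."""
    return {(a, t, g) for (a, t, g, y) in CD
            for (a2, _n) in ARTIST if a == a2 and y < 1970}


def view_r2():
    """R2(A,T) :- CD(A,T,"French",Y)."""
    return {(a, t) for (a, t, g, _y) in CD if g == "French"}


def source_is_consistent(source_rows, view_rows):
    """A sound but possibly incomplete source holds a subset of its view."""
    return set(source_rows) <= view_rows


print(sorted(view_r1()))
print(sorted(view_r2()))
```

Because sources are described independently of one another, adding a new source means adding one more view definition, without touching the existing ones.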

Query Answering in LAV = Answering queries using views

Given a set of views V_1, …, V_n,

And a query Q,

Can we answer Q using only the answers to V_1, …, V_n?

AQUV (I)


[Larson et al., 85 & 87], [Tsatalos et al., 94], [Chaudhuri et al., 95],



Focus on AQUV for:


Query optimization


Supporting physical data independence



Every commercial DBMS supports AQUV.


AQUV (II)


AQUV for data integration:


Find a maximally contained rewriting

Not necessarily an equivalent rewriting


Algorithms:


Bucket algorithm [LRO, 96]


Inverse rules [Duschka, 97]


MiniCon [Pottinger and Halevy, 2000]


Views and security: [Miklau and Suciu, 04]

Survey: Halevy, VLDB Journal, 2001

Some Subsequent Results


Semantics of data integration:


Abiteboul & Duschka, 1998: certain answers


Open vs. closed world assumption


CWA is bad complexity news!

Survey: Lenzerini, PODS 2002

Certain Answers

Mediated schema: Route (Origin, Destination)

Source 1: Origins

SF

NY

Source 2: Destinations

Seattle

Seoul

Query: Route (SF, Seattle)?

Possible databases:

Origin    Destination
SF        Seattle
NY        Seoul

Origin    Destination
SF        Seoul
NY        Seattle
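The Route example can be checked mechanically: a tuple is a certain answer only if it appears in every database consistent with the sources. The sketch below makes two simplifying assumptions that are not stated on the slide: only the four cities mentioned are considered, and each source is treated as an exact projection of Route.

```python
from itertools import chain, combinations

ORIGINS = {"SF", "NY"}               # Source 1
DESTINATIONS = {"Seattle", "Seoul"}  # Source 2


def possible_databases():
    """All Route instances over the mentioned cities whose projections
    match both sources exactly."""
    pairs = [(o, d) for o in sorted(ORIGINS) for d in sorted(DESTINATIONS)]
    subsets = chain.from_iterable(
        combinations(pairs, k) for k in range(len(pairs) + 1))
    result = []
    for subset in subsets:
        route = set(subset)
        if ({o for o, _ in route} == ORIGINS
                and {d for _, d in route} == DESTINATIONS):
            result.append(route)
    return result


def certain_answers(databases):
    """Tuples that appear in every possible database."""
    return set.intersection(*databases) if databases else set()


dbs = possible_databases()
certain = certain_answers(dbs)
print(len(dbs))                       # number of candidate databases
print(("SF", "Seattle") in certain)   # is Route(SF, Seattle) certain?
```

Route(SF, Seattle) holds in some of these databases but not in all of them (the two listed on the slide already disagree), so it is not a certain answer.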

Some Subsequent Results


Limitations due to binding patterns


Input title, get book info [Rajaraman et al., 95]


Additional query processing capabilities


Form applies multiple predicates


Disjunction, negation in sources.


Ordering sources, probabilistic mappings


[Florescu et al., 97, Doan et al., Dong et al.]


GLAV [Millstein et al., 99]

Survey: Lenzerini, PODS 2002

A word on Description Logics


Selecting relevant sources = reasoning.

Description logics to the rescue:

[Catarci and Lenzerini, 93]

Information Manifold

Combined the Classic DL with Datalog (CARIN)

See AAAI-96 (not SIGMOD)


Brought DL and DB closer together.


A very active area of research today.




XML and Semi-structured Data

Tsimmis: semi-structured data for integration.

XML: whetted the integration appetites

We have the syntax

Now just solve the silly semantics problems

Don’t bother: we’ll all standardize on DTDs.

XML will have a significant role in the data integration industry and research.




Back in the Lab…


Two observations:


Who’s going to write all these LAV/GAV formulas?

This was the bottleneck.

Once we have mappings, how can we execute queries?

Traditional plan-then-execute doesn’t work.

Semantic Mappings

Inventory Database A:

BooksAndMusic (Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B:

Books (Title, ISBN, Price, DiscountPrice, Edition)

CDs (Album, ASIN, Price, DiscountPrice, Studio)

BookCategories (ISBN, Category)

CDCategories (ASIN, Category)

Artists (ASIN, ArtistName, GroupName)

Authors (ISBN, FirstName, LastName)

Standards are great, but there are too many of them.


Techniques for Schema Mapping

[Survey by Rahm and Bernstein, VLDBJ 2001]



Compare schema elements based on:


Names (or n-grams)


Data types and instances


Text descriptions, integrity constraints



Combine multiple techniques:


[Momis, Cupid, LSD, Coma]


Create mappings from matches


[Clio @ IBM + Miller]
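As an illustration of the first cue on this slide, comparing element names by n-grams, here is a minimal sketch that scores attribute names by the Jaccard similarity of their character trigrams. The scoring function and the helper names are assumptions chosen for brevity, not the method of any of the systems cited; the attribute names are borrowed from the real-estate example elsewhere in this deck.

```python
def ngrams(name, n=3):
    """Character n-grams of a lowercased name, ignoring non-alphanumerics."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def name_similarity(a, b, n=3):
    """Jaccard similarity of the two names' n-gram sets, from 0.0 to 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_matches(source_attrs, mediated_attrs):
    """Map each source attribute to the highest-scoring mediated attribute."""
    return {s: max(mediated_attrs, key=lambda m: name_similarity(s, m))
            for s in source_attrs}


mediated = ["address", "price", "agent-phone", "description"]
print(best_matches(["listed-price", "phone"], mediated))
```

Names alone can mislead: `comments` shares the trigram `ent` with `agent-phone` and nothing with `description`, which is one reason to combine several techniques.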


A Machine Learning Approach

[Doan et al., 2001, ACM Distinguished Dissertation 2003]


Many mapping tasks are repetitive


Learn from previous experience:


Build a classifier for every element of the mediated schema.

Many kinds of cues

Meta-strategy learning

Matching Real-Estate Sources

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

realestate.com

location      listed-price   phone            comments
Miami, FL     $250,000       (305) 729 0831   Fantastic house
Boston, MA    $110,000       (617) 253 1429   Great location
...           ...            ...              ...

homes.com

price       contact-phone    extra-info
$550,000    (278) 345 7215   Beautiful yard
$320,000    (617) 335 2315   Great beach
...         ...              ...

Learned hypotheses

If “phone” occurs in the name => agent-phone

If “fantastic” & “great” occur frequently in data values => description

Reference Reconciliation

To Join or not to Join?


Many ways to refer to the same object in the world:

“IBM”, “International Business Machines”

Alon Levy, Alon Halevy

Automated methods are a necessity


Can’t go through all the data manually


Very active area in ML, KDD, DB, UAI, …
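To make the two examples on this slide concrete, the sketch below combines an acronym check with a generic string-similarity score from Python's standard `difflib`. The heuristic, the function names, and the 0.85 threshold are all assumptions for illustration; real reconciliation systems use far richer evidence than names alone.

```python
from difflib import SequenceMatcher


def is_acronym(short, long):
    """True if `short` spells the initials of the words in `long`."""
    initials = "".join(word[0] for word in long.split() if word)
    return short.replace(".", "").lower() == initials.lower()


def string_similarity(a, b):
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_entity(a, b, threshold=0.85):
    """Heuristic: acronym match in either direction, or high similarity."""
    return (is_acronym(a, b) or is_acronym(b, a)
            or string_similarity(a, b) >= threshold)


print(same_entity("IBM", "International Business Machines"))
print(same_entity("Alon Levy", "Alon Halevy"))
print(same_entity("IBM", "Intel"))
```

A fixed threshold trades false merges against missed matches, which is part of why the "to join or not to join" decision is hard to automate.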

Query Processing

To Plan or to Execute?


In addition to distributed query processing issues:


Few statistics, if any.


Network behavior issues: latency, burstiness,…


Garlic @IBM



“Adaptive query processing”:


Stonebraker saw it coming in Ingres.


Revivals by Graefe (1993) and DeWitt (1998).


Query scrambling [Urhan & Franklin]


Eddies [Avnur & Hellerstein]


Convergent query processing [Ives et al.]
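The common thread in these systems is that decisions are revised while the query runs instead of being fixed by an up-front plan. The sketch below is a deliberately simplified illustration of that idea, not an implementation of query scrambling, Eddies, or convergent query processing: a conjunctive filter reorders its predicates per tuple according to the pass rates observed so far. The function name and the 0.5 prior for unseen predicates are assumptions.

```python
def adaptive_filter(tuples, predicates):
    """Apply every predicate to every tuple, trying first the predicate
    that has rejected the largest share of the tuples it has seen."""
    seen = [0] * len(predicates)     # tuples each predicate has evaluated
    passed = [0] * len(predicates)   # tuples each predicate has accepted

    def pass_rate(i):
        # No statistics yet: assume a neutral prior of 0.5.
        return passed[i] / seen[i] if seen[i] else 0.5

    output = []
    for row in tuples:
        # Re-derive the order for each tuple from the running statistics.
        order = sorted(range(len(predicates)), key=pass_rate)
        keep = True
        for i in order:
            seen[i] += 1
            if predicates[i](row):
                passed[i] += 1
            else:
                keep = False
                break                # short-circuit on the first rejection
        if keep:
            output.append(row)
    return output


preds = [lambda x: x % 2 == 0, lambda x: x > 90]
print(adaptive_filter(range(100), preds))
```

The result is the same as applying the predicates in any fixed order; only the amount of work changes, which matters when statistics are missing and source behavior shifts during execution.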




Commercialization


Late 90’s: anything goes.

Want money from VC’s?

Say “XML” 3 times loud and clear.

Academia at the forefront:

Nimble (UW), Cohera (Berkeley), Enosys (UCSD),…


Big companies took notice


Some faster than others

Commercialization Retrospective

[See Panel-of-Experts, SIGMOD 05]

Uphill battle vs. the warehousing folks

Virtual integration was more “pay-as-you-go”

Another battle with the EAI folks

Should really be a symbiosis there.

Go vertical or horizontal?

Obvious: go vertical if you can find the right one.

The technology worked

But it’s all in the timing…


After $30M…

[Figure: Nimble architecture diagram. Labels: XML Query; User Applications; Lens; File; InfoBrowser; Software Developers Kit; NIMBLE APIs; Front-End; XML; Lens Builder; Management Tools; Integration Builder; Security Tools; Data Administrator; Concordance Developer; Integration Layer; Nimble Integration Engine (Compiler, Executor); Metadata Server; Cache; Common XML View; sources: Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages]

[Figure: timeline, 1995 to 2006, labeled NASDAQ]

So… Back in the Lab


Model management


Peer data management systems


Data exchange

Model Management

[Bernstein et al.]


Generic infrastructure for managing schemas and mappings:

Manipulate models and mappings as bulk objects

Operators to create & compose mappings, merge & diff models

Short operator scripts can solve schema integration, schema evolution, reverse engineering, etc.


First challenge: semantics of operators.

Peer Data Management Systems

[Figure: network of peers: Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo). Query labels: Q, Q1, Q2, Q3, Q4, Q5, Q6. Mapping labels: LAV, GLAV]

PDMS-Related Projects


Piazza (Washington)


Hyperion (Toronto)


PeerDB (Singapore)


Local relational models (Trento, Toronto)


Active XML (INRIA)


Edutella (Hannover, Germany)


Semantic Gossiping (EPFL Lausanne)


Raccoon (UC Irvine)


Orchestra (U. Penn)

PDMS Challenges

[Figure: network of peers: Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo)]

Semantics:

Careful about cycles

Optimization:

Compose mappings

Prune paths

Manage networks:

Consistency

Quality

Caching


Data Exchange


Key question: given an instance of S and a mapping, create an instance for T.

[Fagin, Kolaitis, Popa & Tan]

[Figure: source schema S, target schema T, mapping M]
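A small sketch may help make the key question concrete. It materializes a target instance from a source instance under one mapping rule, inventing a labeled null wherever the rule says a value exists but does not say which one. The schemas (Emp, Works, Dept), the rule, and the `Null` class are hypothetical and chosen only for illustration; they do not come from the slides or from the cited work.

```python
from itertools import count


class Null:
    """A labeled null: a placeholder for a value the mapping leaves unknown."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return f"N{self.label}"


def exchange(emp_rows):
    """Apply the hypothetical rule
        Emp(name, dept) -> exists m: Works(name, dept), Dept(dept, m)
    and return the target relations (Works, Dept)."""
    fresh = count(1)
    works, dept = set(), {}
    for name, d in sorted(emp_rows):
        works.add((name, d))
        if d not in dept:              # rule not yet satisfied for this dept
            dept[d] = Null(next(fresh))
    return works, set(dept.items())


source = {("ann", "db"), ("bob", "db"), ("cy", "ai")}
works, dept = exchange(source)
print(sorted(works))
print(sorted(dept, key=lambda row: row[0]))
```

Each department gets exactly one null manager, so the target satisfies the rule without asserting anything the source does not imply.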

[Figure: timeline, 1995 to 2006, followed by "?"]

2006 Status Report

[The People Angle]


Joann @ Avaya


Integrating communications into business processes


Anand @ Kosmix



Creating a new kind of search company


Alon @ Google


Working for Joann’s old boss


Deep web evangelist


Pondering data management for the masses

2006 Status Report

[Enterprise Angle]


Enterprise Information Integration is established:

IBM, BEA, Oracle, MetaMatrix, Composite, Actuate, …


Impact on design tools:


IBM Rational Data Architect


ADO .NET v. 3

Forrester Says…

"Enterprises are facing the growing challenges of using disparate sources of data managed by different applications, including problems with data integration, security, performance, availability and quality.... New technology is emerging that Forrester has coined "information fabric," a term defined as a virtualized data layer that integrates heterogeneous data and content repositories in real time.... The potential benefits of this technology are so great that enterprises should develop a strategy to leverage information fabric technology as it becomes more widely available."

2006 Status Report

[Web Angle]


Vertical search engines: one domain

At scale: need even better source descriptions

Deep web can be surfaced

Terminology: Data integration = mashups!

Wikipedia:

A mashup is a website or Web 2.0 application that uses content from more than one source to create a completely new service. This is akin to transclusion.

Looking Ahead


Data management: from the enterprise to the masses

Challenges:

Databases of everything

Need support for collaboration

Help people structure their data

Pay-as-you-go data management

Pay-as-you-go Data Management

[Figure: Benefit plotted against Investment (time, cost), with curves for Dataspaces and Data integration solutions. Artist: Mike Franklin]

Dataspaces: Franklin, Halevy, Maier [see PODS 2006]

Big Carrots

Reusing Human Attention


Principle:


User action = statement of semantic relationship


Leverage actions to infer other semantic relationships


Examples


Providing a semantic mapping


Infer other mappings


Writing a query


Infer content of sources, relationships between sources


Creating a “digital workspace”


Infer “relatedness” of documents/sources


Infer co-reference between objects in the dataspace


Annotating, cutting & pasting, browsing among docs


Conclusion


We’ve done extremely well as a community!


Next challenge: data management and integration tools for the masses