Semistructured data: from practice to theory

Semistructured data -- June 2001


Serge Abiteboul

INRIA & Xyleme SA


Serge.Abiteboul@inria.fr

Serge.Abiteboul@xyleme.com

http://www-rocq.inria.fr/verso

http://www.xyleme.com


Organization


Motivations


XML


Typing XML


Querying XML


XML and the Web


Illustrations: 2 problems


Incomplete information


Xyleme


Conclusion

Motivations

Motivation: Complex data


Structure is irregular (missing/extra data…)


Schema does not exist or is unknown


Schema is rapidly evolving




Relational and ODB models are too rigid


Examples: BibTeX, HTML, SGML, XML, ASN.1, STEP/Express…

Complex data: mediation

[Figure: mediation architecture with a User, a Mediator (Ontology, meta-data), and six Sources, each with its own wrapper]

Many data sources coming and going

Motivations: The Web today

Terabytes of data

Private web: not publicly available pages

Deep web: data hidden behind forms

A lot of public pages

The standard is a document/hypertext language: HTML

The Web today

Browsing

Search engines

in: list of words

out: sorted list of URLs

Applications: hand-made wrappers

Expensive

Incomplete

Short-lived, not adapted to the Web's constant changes

[Raghavan ’00]

A new standard: XML

HTML is not appropriate for data exchange on the Web

Standard database models are too constraining for the Web

The solution: a semistructured data model, XML

Reminder: a data model consists of a type definition language, a query/update language, and more

The most successful semistructured data model: XML

The origin of XML


Parents


SGML


Relational and OO databases


SGML: markup language for documents


HTML and the Web: billions of pages


Not appropriate for data exchange


XML: eXtensible Markup Language


W3C and most industrial companies [B2B]


Main idea: separate content and presentation


Use tags to represent structure and semantics


XML: documents + databases

HTML                                        | XML
comes from SGML                             | also
hypertext language                          | semistructured data
fixed number of tags                        | not fixed
content and presentation are mixed          | not mixed
very difficult to extract data from a page  | much easier
old standard for the Web                    | new standard

HTML = Hypertext Language

Information System

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

HTML

The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.

Text + presentation

Where is the data?

Finding the data: hard

XML = Semistructured Data

Information System

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...

XML

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  ...
</product-table>

Data + Structure

Semistructured: more flexible

Finding the data: easy
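
To illustrate why extraction becomes easy once the data carries its own structure, here is a minimal Python sketch that reads a product table of this shape with the standard library. The document below is a hypothetical, completed version of the fragment (the description contents are elided in the original and left empty here), and the function name extract_products is illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, completed version of the product table; the <description>
# contents are elided in the original and left empty here.
PRODUCT_TABLE = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description/>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description/>
  </product>
</product-table>
"""

def extract_products(xml_text):
    """Return (reference, designation, price, unit) tuples from a product table."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        price = product.find("price")
        rows.append((
            product.get("reference"),                 # attribute of <product>
            product.findtext("designation").strip(),  # text of a child element
            float(price.text),                        # float() ignores padding
            price.get("unit"),
        ))
    return rows

rows = extract_products(PRODUCT_TABLE)
```

The same values are present in the HTML version, but only as text inside presentation tags, so no comparably simple navigation by tag name is available there.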

XML: example

<dealer>
  <UsedCars>
    <ad>
      <model>Honda</model>
      <year>96</year>
    </ad>
  </UsedCars>
  <NewCars>
    <ad>
      <model>Acura</model>
    </ad>
  </NewCars>
  <NewCars>
    <ad>
      <model>R406</model>
    </ad>
  </NewCars>
</dealer>

[Figure: the same document drawn as a tree: dealer with children UsedCars, NewCars, NewCars; each has an ad child; the ad nodes have children model (Honda) and year (96), model (Acura), and model (R406)]

It is just an unranked, tagged, ordered tree

XML

Tree or graph

Data and structure/semantics are mixed

Tags contain typing information

Core constructor is the list of tag/value pairs

Details

Each node may have an arbitrary number of children, whose tags need not be distinct

Nodes also have attributes that are unordered and unique per node

Standard means to represent cyclic data: ID/IDREF

XML

Very active/noisy field: standards

types (DTD/XML Schema), style sheets (XSL), resource description (RDF...)

DOM, SAX…

WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)...

How fast will XML conquer the Web?

so far rather slow (now about 1% of the visible Web; much more in intranets); accelerating (e.g., with Explorer 5.5)


Typing XML

Typing XML

This is heresy for the freedom of the Web

Essential for data management: query optimization, user interfaces, applications

Differences with standard database typing

Collections are sequences instead of sets

Types may be very large (e.g., from integration)

Data is more irregular, so types should be more permissive

Sometimes new issues: given the data, extract its type or an approximate type

Intuition: the type is a tree

Semantics and structure are in paths

dealer/UsedCars/ad

dealer/UsedCars/ad/model

[Figure: type tree: dealer with children UsedCars and NewCars; UsedCars has ad with children model (text) and year (text); NewCars has ad with child model (text)]

DTD: a grammar

Catalog → Product*
Product → Name Price? Cat (Part Quantity)*
Part → BasicPart + ComposedPart
BasicPart → Name
ComposedPart → Name (Part Quantity)*

Nice and simple

Shortcoming: the type of an element is independent of its context
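
Reading a DTD as a grammar suggests a simple validation procedure: for each element, check that the sequence of its children's tags belongs to the regular language given for that tag. The following Python sketch does this for the grammar above. It makes several assumptions: "+" is read as a choice, leaf elements (Name, Price, Cat, Quantity) are assumed to contain text only, and the names CONTENT_MODELS and is_valid are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Content models written as regular expressions over space-terminated child
# tag names. "+" in the grammar is read as a choice (assumption).
# Leaf elements are assumed to hold text only, so their model is empty.
CONTENT_MODELS = {
    "Catalog": r"(Product )*",
    "Product": r"Name (Price )?Cat (Part Quantity )*",
    "Part": r"(BasicPart |ComposedPart )",
    "BasicPart": r"Name ",
    "ComposedPart": r"Name (Part Quantity )*",
    "Name": r"",
    "Price": r"",
    "Cat": r"",
    "Quantity": r"",
}

def is_valid(element):
    """Check an element and, recursively, its children against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # tag not declared in the grammar
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False  # child sequence is not in the content model
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<Catalog><Product><Name>bike</Name><Cat>sport</Cat>"
    "<Part><BasicPart><Name>wheel</Name></BasicPart></Part>"
    "<Quantity>2</Quantity></Product></Catalog>"
)
bad = ET.fromstring(
    "<Catalog><Product><Price>1</Price><Name>x</Name><Cat>c</Cat></Product></Catalog>"
)
```

Note that the rule for a tag is looked up by the tag alone, which is exactly the shortcoming stated above: the same content model applies wherever the element occurs.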

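More complex: specialization

Type of ad depends on its context

[Figure: two trees. Type tree: dealer with children UsedCars and NewCars; UsedCars has ad^used with children model and year; NewCars has ad^new with child model. Document tree: the same shape, with both nodes tagged ad.]

One way to view it: homomorphism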

Regular tree automata

Set of accepted trees: regular tree languages

Definable in monadic second-order logic

[Figure: a tree with node labels p, q, r, s annotated with states q0 and qf (all leaves labeled qf); and the dealer tree with children Used and New, four ad nodes, and leaves m, y, m, y, m, m]

Acceptance: there is a computation such that all leaves are labeled qf

Variants: top-down/bottom-up, nondeterminism, unranked trees
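
As a rough illustration, here is a minimal Python sketch of one of these variants: a bottom-up, nondeterministic automaton on unranked trees, applied to the dealer example. This is not the top-down formulation described above. Each state is tied to one tag and to a regular expression over the states of the children; the two states for the tag ad play the role of specialized types. The state names, the RULES table, and the tuple encoding of trees are all assumptions made for the example.

```python
import re
from itertools import product

# A tree is (label, [children]). Each state maps to the label it applies to
# and a regular expression over the space-terminated states of the children.
# State names such as ad_used and ad_new are illustrative.
RULES = {
    "dealer":   ("dealer",   r"(UsedCars |NewCars )*"),
    "UsedCars": ("UsedCars", r"(ad_used )*"),
    "NewCars":  ("NewCars",  r"(ad_new )*"),
    "ad_used":  ("ad",       r"model year "),
    "ad_new":   ("ad",       r"model "),
    "model":    ("model",    r""),
    "year":     ("year",     r""),
}

def states(tree):
    """Return the set of states the automaton can assign to the root of tree."""
    label, children = tree
    child_states = [states(child) for child in children]
    result = set()
    for state, (state_label, model) in RULES.items():
        if state_label != label:
            continue
        # Nondeterminism: try every combination of states for the children.
        for choice in product(*child_states):
            word = "".join(s + " " for s in choice)
            if re.fullmatch(model, word):
                result.add(state)
                break
    return result

def accepts(tree, final_states=frozenset({"dealer"})):
    """Accept if some run labels the root with a final state."""
    return bool(states(tree) & final_states)

doc = ("dealer", [
    ("UsedCars", [("ad", [("model", []), ("year", [])])]),
    ("NewCars", [("ad", [("model", [])])]),
])
wrong = ("dealer", [("NewCars", [("ad", [("model", []), ("year", [])])])])
```

Enumerating all combinations of child states is exponential in the worst case; it is used here only to keep the sketch short.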

DTDs+specialization

Result: DTDs+specialization = regular tree languages



Closure (intersection, union, complement)


Tests for validation, inclusion


Static analysis

Situation today

Many people are using DTDs

Nice and simple in spite of ugly syntax

New proposal: XML Schema

More powerful but too complicated?

Other proposals: RELAX, TREX

Usually based on some kind of regular tree automata

From experience: one will win, and not necessarily the best

Query languages for XML

Query Languages for XML

Extensions of SQL

first-order logic

Information retrieval: keyword search

Navigation via regular expressions + pattern matching

Lorel, XML-QL, XMAS…

Structural recursion

UnQL, XSLT…

No official winner

leader is XQuery

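Pattern matching

Tree with variables and constraints

Pattern matching between the query and the data

Each match provides a valuation for X, Y, Z

[Figure: pattern tree catalog/product with children name, price (<200), and cat=elec with child subcategory; the variables X, Y, Z are attached to nodes of the pattern]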
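
The idea of matching a pattern tree against the data can be sketched in a few lines of Python. The example below is only an approximation under stated assumptions: the XML encoding of the catalog (the category as the text of cat, with subcategory nested inside) is invented for the example, the product named acme is hypothetical, and the binding of X, Y, Z to name, price, and subcategory is a guess, since the figure does not survive in this text.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a catalog; "acme" is an invented product that
# fails the price constraint.
CATALOG = """
<catalog>
  <product><name>canon</name><price>120</price>
    <cat>elec<subcategory>camera</subcategory></cat></product>
  <product><name>nikon</name><price>199</price>
    <cat>elec<subcategory>camera</subcategory></cat></product>
  <product><name>sony</name><price>175</price>
    <cat>elec<subcategory>cdplayer</subcategory></cat></product>
  <product><name>acme</name><price>350</price>
    <cat>elec<subcategory>camera</subcategory></cat></product>
</catalog>
"""

def match_pattern(catalog_xml, max_price=200, category="elec"):
    """Yield one valuation {X, Y, Z} per product that matches the pattern."""
    root = ET.fromstring(catalog_xml)
    for prod in root.findall("product"):
        name, price, cat = prod.find("name"), prod.find("price"), prod.find("cat")
        if name is None or price is None or cat is None:
            continue  # the pattern requires all three children
        sub = cat.find("subcategory")
        if sub is None:
            continue
        if (cat.text or "").strip() != category:
            continue  # constraint cat = elec
        if float(price.text) >= max_price:
            continue  # constraint price < 200
        # Assumed binding: X = name, Y = price, Z = subcategory.
        yield {"X": name.text, "Y": float(price.text), "Z": sub.text}

valuations = list(match_pattern(CATALOG))
```

Each dictionary in valuations corresponds to one match of the pattern in the data tree.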

Example in Lorel

select <offer> Z/name, P/name, P'/price </offer>
from   P in catalog/product,
       Z in discount_stores/store,
       Z/storecatalog/product P'
where  P/category="camera" and P/make="canon" and
       P'/id = P/id

Joins like in relational databases

Construction of complex results

Regular expressions for paths (e.g., W/*/name = "Gates")
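
For readers who do not know Lorel, the join expressed by this query can be approximated in Python as nested loops over path results. This is a sketch, not a Lorel implementation: the sample document, its wrapping db element, and the store and product names are invented, and the child tags store, name, and price inside each constructed offer are an assumption about how the result might be laid out.

```python
import xml.etree.ElementTree as ET

# Hypothetical data with the paths used by the query.
DATA = """
<db>
  <catalog>
    <product><id>p1</id><name>EOS</name><category>camera</category><make>canon</make></product>
    <product><id>p2</id><name>D1</name><category>camera</category><make>nikon</make></product>
  </catalog>
  <discount_stores>
    <store><name>ShopA</name>
      <storecatalog>
        <product><id>p1</id><price>450</price></product>
        <product><id>p2</id><price>900</price></product>
      </storecatalog>
    </store>
    <store><name>ShopB</name>
      <storecatalog>
        <product><id>p1</id><price>430</price></product>
      </storecatalog>
    </store>
  </discount_stores>
</db>
"""

def offers(xml_text):
    """Build one <offer> per (catalog product, store product) pair that joins on id."""
    root = ET.fromstring(xml_text)
    result = ET.Element("offers")
    for p in root.findall("catalog/product"):               # P in catalog/product
        if p.findtext("category") != "camera" or p.findtext("make") != "canon":
            continue
        for z in root.findall("discount_stores/store"):     # Z in discount_stores/store
            for p2 in z.findall("storecatalog/product"):    # Z/storecatalog/product P'
                if p2.findtext("id") == p.findtext("id"):   # join condition
                    offer = ET.SubElement(result, "offer")
                    ET.SubElement(offer, "store").text = z.findtext("name")
                    ET.SubElement(offer, "name").text = p.findtext("name")
                    ET.SubElement(offer, "price").text = p2.findtext("price")
    return result

result = offers(DATA)
found = [(o.findtext("store"), o.findtext("name"), o.findtext("price")) for o in result]
```

The sketch covers the join and the construction of a complex result; it does not cover regular path expressions such as W/*/name.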

What is new in XML queries

A bit new: limited recursion (like in deductive databases)

A bit new but no big deal: constructed answers (like in OODB)

Very new: ordered data

Bothersome

Theoretical base is a bit messy: FO, tree automata, bisimulation

No yardstick like relational calculus/algebra

Proposal: k-pebble transducers

[Figure: labeled "stack"]

[Milo, Suciu, Vianu]

k-pebble transducers: result

[Figure: a tree rooted at root with nodes labeled a, b, c]

XML and the Web

Why it is the same old story

Massive amounts of data

Providers export data, users access data

• Query languages, indexing, optimization

• Database paradigm: still effective on the Web



Why it is not the same old story

Databases                                | The Web
rigid structure                          | flexible, no schema
transactions, concurrency control        | flexible protocols
data independence                        | fuzzy separation
controlled (e.g., known cost model)      | perfect mess (and that’s why people like it?)
coherent system, very polished artifact  | closer to a natural ecosystem!


The principles of the Web

The uncertainty principle: you can never be sure of anything or that the data is consistent

The incompleteness principle: they do not give you all the data you want (but some you don’t want :-)

The chaos principle: you can rarely assume the existence of some global schema

The instability principle: everything keeps changing

Every piece of data you get is probably wrong, incomplete, does not conform to its expected type, and is probably already stale

What can be reused?

Some technology? indexes, B-trees, distributed query processing (concurrency control and transactions not yet)

Database theory? little

Algebra and rewrite rules for optimization

Dependency theory

First order and other logics

It seems that the ordering opens the gates for many more tools such as regular/tree languages


Metaphor [AV]: the Web is infinite

Which pages point to my homepage?

Google solution: milliseconds, stale data

Freeze the Web: weeks to get exact answer

Exact answer: no means to get it

This leads us to reconsider the notion of computation

Computability

Finitely computable: give the answer in finite time

All pages reached from my homepage (HP) in fewer than 3 links

Eventually computable: each solution is given in finite time; computation may be infinite

All pages reached from my HP

Not computable

Can my HP be reached starting from my HP?

Also: approximate, partial, stale, pipelined answers
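
The difference between the first two notions can be made concrete with a small Python sketch. It assumes a toy model of an infinite Web in which pages are integers and page n links to pages 2n+1 and 2n+2; the functions links, reachable, and reachable_within are hypothetical names introduced for this example. The starting page is counted as reached with zero links.

```python
from collections import deque
from itertools import islice

def links(page):
    """Hypothetical infinite Web: page n links to pages 2n+1 and 2n+2."""
    return [2 * page + 1, 2 * page + 2]

def reachable(start):
    """Eventually computable: every reachable page is emitted after finitely
    many steps, but the generator never terminates on an infinite Web."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        yield page, depth
        for target in links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))

def reachable_within(start, max_links):
    """Finitely computable: stop at the first page more than max_links away.
    Breadth-first order guarantees depths are emitted in nondecreasing order."""
    pages = []
    for page, depth in reachable(start):
        if depth > max_links:
            break
        pages.append(page)
    return pages

# "Fewer than 3 links" means at most 2 links.
near = reachable_within(0, 2)
first_three = list(islice(reachable(0), 3))
```

A consumer of reachable can take any finite prefix of the answer, which matches the notion of a pipelined answer; it can never know that the answer is complete.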

Tough life: the Web is huge

Relational calculus/algebra: logspace data complexity (also AC0)

What is the data complexity of an XQuery query over the Web?

Complexity of computing on the Web

Logspace in the Web?

Need to trade quality for performance

The Web keeps changing

Classical: versions, temporal queries

Less classical: monitoring of the Web [Xyleme]


Smart crawling of the Web: flow of docs


Query subscription: query on this flow


Continuous queries


What is the underlying theory?

Illustration: incomplete information

Work with Victor Vianu

Example

Access to an electronic catalog

Q1: name, subcat, price of electronic products with price less than $200

Q2: name, pictures of cameras pictured at least once


Q1: name, subcat, price of electronic products with price less than 200

[Figure: answer to Q1 as a tree: catalog with three product nodes, (canon, 120, elec, camera), (nikon, 199, elec, camera), and (sony, 175, elec, cdplayer), plus starred nodes product1 and product2 marked as missing]

Missing data after Q1

[Figure: the two types of missing products, each with children name, price, cat (with subcategory), and picture: product1 with cat != elec, and product2 with cat = elec and price > 200]

Q2: name, pictures of cameras pictured at least once

[Figure: the data tree after Q2: catalog with products (canon, 120, elec, camera), (nikon, 199, elec, camera), (sony, 175, elec, cdplayer), and (akai, a.jpg, elec, camera); c.jpg also appears as a picture; missing data is represented by product1 and by product2, split into product2a, product2b, and product2c]

[Figure: known data and missing data after Q2. The types of missing products are product1 (cat != elec), product2a, product2b, and product2c (cat = elec, price > 200, with the annotations subcategory != camera and no picture), and product3 (name, price, cat elec)]

After two queries

Known information:

Prefix of the real data tree

Missing information

Complex type

Q3: name, price, pictures of cameras costing less than $100 and pictured at least once

can be completely answered using A1, A2

Q4: list all cameras

can be partially answered using A1, A2

Illustration: Xyleme

A dynamic warehouse of Web data

Warehouse

Xyleme stores huge quantities of data (terabytes)

Xyleme is not a search engine (only index) or a mediator (only virtual data)

XML

Xyleme is focused on XML

Dynamic

Xyleme is interested in data evolution/changes

Technical Challenges

1. Data Acquisition and Maintenance

discover data of interest and keep it up to date

2. Repository

store and index this data

3. Efficient Query Processing

4. Semantic Integration

provide a simple view of each semantic domain

5. Change Control

monitor the Web and offer services such as Query Subscription




Technical challenges


Scale to the web


Size of data: billions of pages


Size of index: terabytes


Number of customers


thousands of simultaneous queries


millions of subscriptions

Web Heterogeneity

Semantic domains, e.g., cinema

Many possible types for data in this domain, many DTDs

Semantic Integration

one abstract DTD for the domain

gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD

Discover the Domains

Cluster DTDs sharing similar « tags » using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains

[Figure: many concrete DTDs (cdtd1 to cdtd10) are grouped into fewer abstract DTDs (adtd1, adtd2, adtd4)]

Answering queries

Choose an abstract DTD (ADTD)

Automatically, manually, hybrid

For each concrete DTD in a domain

Find how it relates to the abstract DTD

Mappings between paths in both

Distributed query processing (cluster of PCs)

Many concrete DTDs; often not possible to compute a static execution plan

Dynamic generation of execution plans [Cluet et al.]
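
The role of the path mappings can be illustrated with a small Python sketch. It is not Xyleme's actual mechanism: the cinema paths, the mapping table, and the function rewrite are all hypothetical, and the sketch only shows how a query phrased on abstract paths could be translated into one query per concrete DTD.

```python
# Hypothetical path mappings for a "cinema" domain: for each concrete DTD,
# a dict from abstract paths to the corresponding concrete paths. An abstract
# path with no counterpart is simply absent from that DTD's dict.
MAPPINGS = {
    "cdtd1": {"movie/title": "film/name", "movie/director": "film/directed_by"},
    "cdtd2": {"movie/title": "picture/title"},
}

def rewrite(abstract_paths, mappings=MAPPINGS):
    """Translate a query over abstract paths into one query per concrete DTD.

    A concrete DTD is kept only if it has a counterpart for every requested path.
    """
    plans = {}
    for dtd, path_map in mappings.items():
        if all(path in path_map for path in abstract_paths):
            plans[dtd] = [path_map[path] for path in abstract_paths]
    return plans

titles = rewrite(["movie/title"])
titles_and_directors = rewrite(["movie/title", "movie/director"])
```

Because the set of concrete DTDs that can answer a query depends on the paths requested, the set of sub-queries is only known once the query is given, which is in line with the remark above that a static execution plan is often not possible.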


Conclusion

One Question Only

The Web is turning from a large collection of documents into a huge knowledge base

When will I be able to get the precise knowledge I need?

Database + Knowledge Base + Linguistics + ...