Predicate-based Indexing of Annotated Data



Donald Kossmann

ETH Zurich

http://www.dbis.ethz.ch

Observations


Data is annotated by apps and humans


Word: versions, comments, references, layout, ...


Humans: tags on del.icio.us


Applications provide views on the data


Word: Version 1.3, text without comments, …


del.icio.us: Provides search on tags + data


Search Engines see and index the raw data


E.g., treat all versions as one document


Search Engine’s view != User’s view


Search Engine returns the wrong results


or the right search logic must be hard-coded (del.icio.us)

Desktop Search

[Figure: the User queries a Desktop Search Engine (e.g., Spotlight, Google Desktop, ...), which crawls & indexes the File System (e.g., x.doc, y.xls, ...); Applications (e.g., Word, Wiki, Outlook, ...) read & update the files and provide Views to the User.]

Example 1: Bulk Letters

Raw data (x.doc, y.xls):

  x.doc (template):
    <address/>
    Dear <recipient/>,
    The meeting is at 12.
    CU, Donald

  y.xls (recipients):
    Peter
    Paul
    Mary

View:

  Dear Peter,
  The meeting is at 12.
  CU, Donald

  Dear Paul,
  The meeting is at 12.
  CU, Donald

Example 1: Traditional Search Engines

Inverted File:

  DocId   Keyword
  x.doc   Dear
  x.doc   Meeting
  y.xls   Peter
  y.xls   Paul
  y.xls   Mary

Query: Paul, meeting
Answer: -
Correct Answer: x.doc

Query: Paul, Mary
Answer: y.xls
Correct Answer: x.doc
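The failure mode above can be reproduced in a few lines. This is a minimal sketch: the file names come from the slides, but the flattened text contents are illustrative stand-ins for the real binary Office files. A traditional inverted file indexes each raw file separately, so keywords that only co-occur in a generated letter (a view) never co-occur in the index.

```python
# Minimal sketch of a traditional inverted file over the RAW files of
# Example 1. File contents here are illustrative assumptions.
import re
from collections import defaultdict

raw = {
    "x.doc": "Dear <recipient/>, The meeting is at 12. CU, Donald",  # template
    "y.xls": "Peter Paul Mary",                                      # recipients
}

index = defaultdict(set)  # keyword -> set of DocIds
for doc_id, text in raw.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(doc_id)

def search(*keywords):
    """AND-semantics keyword search over the raw data."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets)

print(search("Paul", "meeting"))  # -> set(): no single raw file holds both words
print(search("Paul", "Mary"))     # -> {'y.xls'}: the recipient list, not a letter
```

The engine cannot return x.doc for "Paul, meeting" because "Paul" only exists in the spreadsheet; the letter "Dear Paul, The meeting ..." exists only in the application's view.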

Example 2: Versioning (OpenOffice)

Raw Data:

  <deleted id="1"><info date="5/15/2006"/></deleted>
  <inserted id="2"><info date="5/15/2006"/></inserted>
  <delete id="1">Mickey likes Minnie</delete>
  <insert id="2">Donald likes Daisy</insert>

Instance 1 (Version 1): Mickey likes Minnie
Instance 2 (Version 2): Donald likes Daisy

Example 2: Versioning (OpenOffice)

Inverted File:

  DocId   Keyword
  z.swx   Mickey
  z.swx   likes
  z.swx   Minnie
  z.swx   Donald
  z.swx   Daisy

Query: Mickey likes Daisy
Answer: z.swx
Correct Answer: -

Query: Mickey likes Minnie
Answer: z.swx
Correct Answer: z.swx (V1)

Example 3: Personalization, Localization, Authorization

<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body>
  <field id="man"/> likes <field id="woman"/>.
</body>

Donald likes Daisy.
Mickey likes Minnie.

Donald Daisy Mickey Minnie likes .

Example 4: del.icio.us

Tag Table:

  user   tag        URL
  Joe    business   A
  Mary   software   B

  http://A.com: Yahoo builds software.
  http://B.com: Joe is a programmer at Yahoo.

Query: "Joe, software, Yahoo"

  both A and B are relevant, but in different worlds
  if context info is available, a choice is possible

Example 5: Enterprise Search


Web Applications


Application defined using "templates" (e.g., JSP)


Data both in JSP pages and database


Content = JSP + Java + Database


Content depends on Context (roles, workflow)


Links = URL + function + parameters + context


Enterprise Search



Search: Map Content to Link



Enterprise Search: Content and Link are complex


Example: Search Engine for J2EE PetStore


(see demo at CIDR 2007)

Possible Solutions


Extend Applications with Search Capabilities


Re-invents the wheel for each application


Not worth the effort for small apps


No support for cross-app search


Extend Search Engines


Application-specific rules for "encoded" data


"Possible Worlds" Semantics of Data


Materialize view, normalize view


Index normalized view


Extended query processing


Challenge: Views become huge!

[Figure: the User queries the Desktop Search Engine (e.g., Spotlight, Google Desktop, ...); Applications (e.g., Word, Wiki, Outlook, ...) read & update the File System (e.g., x.doc, y.xls, ...) and provide Views; the Desktop Search Engine crawls & indexes Views of the File System, with the Views defined by rules.]
Size of Views


One rule: size of view grows linearly with size
of document


E.g., for each version, one instance in view


Constant can be high! (e.g., many versions)


Several rules: size of view grows
exponentially with number of rules


E.g., #versions x #alternatives


Prototype, experiments: Wiki, Office, E-Mail, ...


About 30 rules; 5-6 applicable per document


View ~ 1000 x Raw data
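The blow-up can be checked with simple arithmetic: each applicable rule multiplies the number of materialized view instances by its number of alternatives. A back-of-the-envelope sketch; the concrete per-rule counts below are illustrative assumptions (the slides only fix the product at roughly 1000):

```python
# View blow-up: with several independent rules, the number of view
# instances is the product of the per-rule alternatives.
def view_instances(alternatives_per_rule):
    n = 1
    for a in alternatives_per_rule:
        n *= a
    return n

print(view_instances([10]))           # one rule with 10 versions: linear, 10 instances
print(view_instances([10, 4, 5, 5]))  # four rules: 10*4*5*5 = 1000 instances
```

One rule grows the view linearly in the document size; several rules compound multiplicatively, which is why the materialized view can reach ~1000x the raw data.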

Solution Architecture

Rules and Patterns


Analogy: Operators of relational algebra


Patterns sufficient for Latex, MS Office, OpenOffice, TWiki, E-Mail (Outlook)

Normalized View

Rule:

  <field match="//field"
         ref="//data[@id=$m/@id]/text()"
         key="$r/../@row"
  />

Raw Data:

  <header>
    <data row="duck" id="man">Donald</data>
    <data row="duck" id="woman">Daisy</data>
    <data row="mouse" id="man">Mickey</data>
    <data row="mouse" id="woman">Minnie</data>
  </header>
  <body>
    <field id="man"/> likes <field id="woman"/>.
  </body>

Normalized View:

  <body>
    <select pred="R1=duck">Donald</select>
    <select pred="R1=mouse">Mickey</select>
    likes
    <select pred="R1=duck">Daisy</select>
    <select pred="R1=mouse">Minnie</select>.
  </body>

Normalized View

Rule:

  <version match="//insert"
           key="//inserted[@id eq $m/@id]/info/@date"
  />

Raw Data:

  <inserted id="1"><info date="5/1/2006"/></inserted>
  <inserted id="2"><info date="5/16/2006"/></inserted>
  Mickey <insert id="1">Mouse</insert> likes Minnie <insert id="2">Mouse</insert>.

Normalized View:

  Mickey <select pred="R2>=5/1/2006">Mouse</select> likes
  Minnie <select pred="R2>=5/16/2006">Mouse</select>.

General Idea:

  Factor out common parts: "Mickey likes Minnie."
  Markup variable parts: <select .../>, <select .../>

Normalization Algorithm

Step 1: Construct Tagging Table

  Evaluate "match" expression
  Evaluate "key" expression
  Compute operator from pattern (e.g., > for version)

Step 2: Tagging Table -> Normalized View

  Wrap each match in <select> tags

Tagging Table:

  Rule   NodeId   Key Value   Op
  R1     19       duck        =
  R1     19       mouse       =
  R1     22       duck        =
  R1     22       mouse       =
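Step 1 can be sketched as follows. This is a hypothetical rendering: the rule's XQuery "match" and "key" expressions are replaced by plain Python callables, and the node ids (19, 22) are the ones shown in the tagging table for the two <field> nodes of Example 3.

```python
# Sketch of Step 1 (tagging-table construction). match_fn/key_fn stand in
# for the rule's XQuery "match" and "key" expressions; node ids are the
# illustrative ids from the slide's tagging table.
from dataclasses import dataclass

@dataclass(frozen=True)
class TagEntry:
    rule: str      # rule name, e.g. "R1"
    node_id: int   # id of the matched node
    key: str       # one key value, e.g. "duck"
    op: str        # operator derived from the pattern ("=" for select)

def build_tagging_table(name, match_fn, key_fn, op, nodes):
    table = []
    for node_id, node in nodes:
        if match_fn(node):            # evaluate the "match" expression
            for key in key_fn(node):  # evaluate the "key" expression
                table.append(TagEntry(name, node_id, key, op))
    return table

# The two <field> nodes of Example 3; each can be bound to either "row":
nodes = [(19, {"tag": "field", "rows": ["duck", "mouse"]}),
         (22, {"tag": "field", "rows": ["duck", "mouse"]})]

table = build_tagging_table("R1",
                            lambda n: n["tag"] == "field",
                            lambda n: n["rows"],
                            "=", nodes)
for e in table:
    print(e.rule, e.node_id, e.key, e.op)  # reproduces the four R1 rows
```

Step 2 then walks the document and wraps each matched node in <select> tags carrying the predicate built from (rule, key value, operator).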

Predicate-based Indexing

Normalized View:

  <body>
    <select pred="R1=duck">Donald</select>
    <select pred="R1=mouse">Mickey</select>
    likes
    <select pred="R1=duck">Daisy</select>
    <select pred="R1=mouse">Minnie</select>.
  </body>

Inverted File:

  DocId   Keyword   Condition
  z.swx   Donald    R1=duck
  z.swx   Mickey    R1=mouse
  z.swx   likes     true
  z.swx   Daisy     R1=duck
  z.swx   Minnie    R1=mouse
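Building the predicate-based inverted file from the normalized view amounts to one scan: every token wrapped in <select> is posted with its predicate, and unmarked tokens get the condition true. A regex-based sketch (an illustrative shortcut, not the paper's implementation):

```python
# Sketch: deriving the predicate-based inverted file from the normalized
# view of Example 3. The tiny regex "parser" is an illustrative stand-in
# for a real XML scan.
import re

normalized_view = (
    '<select pred="R1=duck">Donald</select> '
    '<select pred="R1=mouse">Mickey</select> likes '
    '<select pred="R1=duck">Daisy</select> '
    '<select pred="R1=mouse">Minnie</select>.'
)

def build_postings(doc_id, view):
    postings, rest = [], view
    for m in re.finditer(r'<select pred="([^"]+)">([^<]+)</select>', view):
        postings.append((doc_id, m.group(2).strip(), m.group(1)))
        rest = rest.replace(m.group(0), " ")
    for word in re.findall(r"[A-Za-z]+", rest):  # unmarked text: always visible
        postings.append((doc_id, word, "true"))
    return postings

for row in build_postings("z.swx", normalized_view):
    print(row)  # the five (DocId, Keyword, Condition) rows of the slide
```

Because common text is factored out, the index stays O(n) in the raw-data size instead of growing with the number of materialized worlds.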

Query Processing

Inverted File:

  DocId   Keyword   Condition
  z.swx   Donald    R1=duck
  z.swx   Mickey    R1=mouse
  z.swx   likes     true
  z.swx   Daisy     R1=duck
  z.swx   Minnie    R1=mouse

Query: Donald likes Minnie
  Condition: R1=duck ^ true ^ R1=mouse = false

Query: Donald likes Daisy
  Condition: R1=duck ^ true ^ R1=duck = R1=duck
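The evaluation above can be sketched as: look up each query keyword's condition, AND the conditions together, and report the document only if the conjunction is satisfiable. A minimal version, assuming conditions are either true or single equality atoms over rule variables, which suffices for this example:

```python
# Sketch of query processing over the predicate-based inverted file.
# A conjunction of equality atoms is unsatisfiable iff some rule
# variable would need two different values (e.g. R1=duck ^ R1=mouse).
postings = [  # (DocId, Keyword, Condition) as in the slide's inverted file
    ("z.swx", "Donald", "R1=duck"),
    ("z.swx", "Mickey", "R1=mouse"),
    ("z.swx", "likes",  "true"),
    ("z.swx", "Daisy",  "R1=duck"),
    ("z.swx", "Minnie", "R1=mouse"),
]

def satisfiable(conds):
    bindings = {}
    for c in conds:
        if c == "true":
            continue
        var, val = c.split("=")
        if bindings.setdefault(var, val) != val:
            return False  # conflicting values for the same rule variable
    return True

def search(*keywords):
    hits = []
    for doc in sorted({d for d, _, _ in postings}):
        conds = [c for kw in keywords
                 for d, k, c in postings if d == doc and k == kw]
        # every keyword must occur (one condition each, simplified),
        # and the conjunction of conditions must be satisfiable:
        if len(conds) == len(keywords) and satisfiable(conds):
            hits.append(doc)
    return hits

print(search("Donald", "likes", "Daisy"))   # -> ['z.swx']  (R1=duck ^ true ^ R1=duck)
print(search("Donald", "likes", "Minnie"))  # -> []         (R1=duck ^ true ^ R1=mouse)
```

"Donald likes Minnie" matches every keyword in z.swx, yet the document is correctly rejected because no single world satisfies R1=duck and R1=mouse at once.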

Qualitative Assessment


Expressiveness of rules / patterns


Good enough for "desktop data"


Extensible for other data


Unclear how good for general applications (e.g., SAP)


Normalized View



Size: O(n); with n size of raw data


Generation Time: depends on complexity of XQuery
expressions in rules; typically O(n)


Predicate-based Inverted File


Size: O(n), same as traditional inverted files


Generation Time: O(n)


But, constants can be large


Query Processing


Polynomial in #keywords in query (~ traditional)


High constants!

Experiments


Data sets from my personal desktop


E-Mail, TWiki, Latex, OpenOffice, MS Office, ...


Data-set dependent rules


E-Mail: different rule sets (here conversations)


Latex: include, footnote, exclude, …


TWiki: versioning, exclude, …


Hand-crafted queries


Vary selectivity, degree that involves instances


Measure size of data sets, indexes,
precision & recall, query running times

Data Size (TWiki)

                         Traditional   Enhanced
  Raw Data (MB)          4.77          4.77
  Normalized View (MB)   -             4.53
  Index (MB)             0.56          1.07
  Creation Time (secs)   9.00          9.62

Data Size (E-Mail)

                         Traditional   Enhanced
  Raw Data (MB)          51.46         51.46
  Normalized View (MB)   -             51.77
  Index (MB)             2.86          12.61
  Creation Time (secs)   106.61        132.62

Precision (TWiki)

             Traditional   Enhanced
  Query 1    0.985         1
  Query 2    0.071         1
  Query 3    0.339         1
  Query 4    0.875         1

Recall is 1 in all cases. TWiki: example for "false positives".

Recall (E-Mail)

             Traditional   Enhanced
  Query 1    0.322         1
  Query 2    0.821         1
  Query 3    0.499         1
  Query 4    0.5           1

Precision is 1 in all cases. E-Mail: example for "false negatives".

Response Time in ms (TWiki)

             Traditional   Enhanced
  Query 1    0.201         0.907
  Query 2    0.218         1.224
  Query 3    0.033         0.122
  Query 4    0.054         0.212

Enhanced is one order of magnitude slower, but still within milliseconds.

Response Time in ms (E-Mail)

             Traditional   Enhanced
  Query 1    0.003         0.864
  Query 2    0.004         6.091
  Query 3    0.020         1.845
  Query 4    0.027         0.055

Enhanced is orders of magnitude slower, but still within milliseconds.

Conclusion & Future Work


See data with the eyes of users!


Give search engines the right glasses


Flexibility in search: reveal hidden data


Compressed indexes using predicates



Future Work


Other apps: e.g., JSP, Tagging, Semantic Web


Consolidate different view definitions (security)


Search on streaming data