Efficient Information Retrieval


Ralf Schenkel

Antrittsvorlesung (Inaugural Lecture), March 19, 2010

Why this topic?


Outline

- A real-life example of IR (for everybody)
- Algorithms for text retrieval (for engineers)
- Approximation algorithms (for experts)
  (including some recent results)


First: An example from Real Life™

Here's a very simple task:

Find the best six cakes for a party.

Difficult for a number of reasons:

(1) unclear what a good cake is
(2) way too many options to evaluate


Let's do it like a scientist

Determine dimensions of cake quality:

- tastiness
- healthiness
- fruitiness
- nutritiousness
- expensiveness

Evaluate each cake in the dimensions (ratings).

[Figure: cakes rated on a 1-5 scale in each dimension]


Simplify: Two dimensions, find best cake

Possible way out: Skylines

- A cake dominates another cake if it is better in at least one dimension and not worse in any other dimension.
- Skyline: all non-dominated cakes.

[Figure: cakes plotted by tastiness vs. healthiness, with the non-dominated cakes forming the skyline]

Skylines often don't reduce the set of candidates enough.
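
As an illustration of the dominance test and the skyline above, here is a minimal Python sketch; the cake ratings are made-up examples, and the naive O(n²) scan stands in for more efficient skyline algorithms:

```python
from typing import Dict, List

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b: not worse in any dimension and strictly better in at least one."""
    return all(a[d] >= b[d] for d in a) and any(a[d] > b[d] for d in a)

def skyline(items: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """All non-dominated items (naive pairwise comparison)."""
    return [x for x in items if not any(dominates(y, x) for y in items if y is not x)]

cakes = [
    {"tastiness": 5, "healthiness": 1},
    {"tastiness": 4, "healthiness": 2},
    {"tastiness": 3, "healthiness": 4},
    {"tastiness": 2, "healthiness": 3},   # dominated by the cake above
    {"tastiness": 1, "healthiness": 5},
]
print(skyline(cakes))   # four of the five cakes survive -> still too many candidates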




Better: Consider value of each cake

- Compute a numerical value for each cake:
  score(cake) = 5 * tastiness(cake) + 1 * healthiness(cake)
- Order cakes by descending value:

  5*5 + 1 = 26
  5*4 + 2 = 22
  5*3 + 4 = 19
  5*2 + 3 = 13
  5*1 + 5 = 10

The cake with value 26 is the best cake
(your choice and weights may differ)
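
A minimal sketch of this weighted ranking, with hypothetical ratings chosen to match the numbers above:

```python
# Hypothetical (tastiness, healthiness) ratings on a 1-5 scale; weights as on the slide.
cakes = {"cake1": (5, 1), "cake2": (4, 2), "cake3": (3, 4), "cake4": (2, 3), "cake5": (1, 5)}

def value(tastiness: int, healthiness: int) -> int:
    return 5 * tastiness + 1 * healthiness

for name in sorted(cakes, key=lambda c: value(*cakes[c]), reverse=True):
    print(name, value(*cakes[name]))   # 26, 22, 19, 13, 10
```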


Life is more complex than this

- Need diverse cakes among the 6 selected cakes
  (people with different preferences will attend the party!)
- External constraints may reduce the options:
  - maximal amount of money to spend
  - outside temperature
  - availability of cakes or bakery
- Exact ratings of cakes are often unknown
  - requires estimation by popularity etc.

We abstract these problems away in this talk.



Outline

- A real-life example of IR (for everybody)
- Algorithms for text retrieval (for engineers)
- Approximation algorithms (for experts)
  (including some recent results)


Text Retrieval

Problem: find the best documents d from a large collection that match a query {t1, …, tn}.

Modeling and ranking: define a score for documents.

- tf(d,t): frequency of term t in document d
  (importance of t for document d: the more frequent, the better)
- df(t): number of documents containing t
  (importance of t in the collection: the less frequent, the better)
- Query score: linear combination of the per-term scores

Back to the cake analogy:

- Documents = cakes
- Terms = tastiness, …
- Score = rating
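
A minimal sketch of such a scoring on a toy collection; tf-idf weighting is one common instantiation of the "rare in the collection, frequent in the document" idea, and the documents and tokenization here are made up for illustration, not the weighting used in the talk:

```python
import math
from collections import Counter

# A toy collection; whitespace tokenization just for illustration.
docs = {
    "d1": "french pianist plays french music",
    "d2": "famous pianist on tour",
    "d3": "french bakery sells cakes",
}

tf = {d: Counter(text.split()) for d, text in docs.items()}    # tf(d,t)
df = Counter(t for counts in tf.values() for t in counts)      # df(t)
N = len(docs)

def score(d, query):
    # Linear combination of per-term scores; here each term contributes tf * log(N/df).
    return sum(tf[d][t] * math.log(N / df[t]) for t in query if df[t])

query = ["french", "pianist"]
for d in sorted(docs, key=lambda d: score(d, query), reverse=True):
    print(d, round(score(d, query), 3))   # d1 ranks first: it contains both query terms
```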


What about efficiency?

- Cannot compute this from scratch for each query (>> 10^10 documents)
- Solution:
  - Precompute per-term scores for each document
  - For each term, store a list of (d, score(d,t)) on disk
- When a query arrives:
  - combine entries from the lists
  - sort the results
  - return the top-k
  (merge-then-sort algorithm)
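
A minimal sketch of the merge-then-sort baseline on hypothetical index lists (the documents and scores are made up):

```python
from collections import defaultdict

# Hypothetical precomputed index lists: term -> list of (doc_id, score(d, t)).
index = {
    "french":  [("D8", 0.99), ("D4", 0.77), ("D2", 0.51), ("D1", 0.15), ("D3", 0.01)],
    "pianist": [("D4", 1.00), ("D2", 0.70), ("D7", 0.70), ("D1", 0.65), ("D8", 0.60)],
}

def merge_then_sort(query, k):
    """Baseline: read every list completely, sum the per-term scores, sort, return top-k."""
    totals = defaultdict(float)
    for term in query:
        for doc, score in index.get(term, []):
            totals[doc] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]

print([(d, round(s, 2)) for d, s in merge_then_sort(["french", "pianist"], k=2)])
# D4 first (0.77 + 1.00), then D8 (0.99 + 0.60)
```

Reading every list in full is exactly what the threshold algorithms below try to avoid.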


Family of Threshold Algorithms

But: lists can be very long (millions of entries)
=> the simple merge-then-sort algorithm is too expensive

Observation: "good" results have high scores

- Order the lists by decreasing score
- Use an "intelligent" algorithm with different list access modes and early stopping

[Figure: an index list ordered by decreasing score, e.g. T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01]


List Access Modes

Sequential (sorted, SA):

- Access tuples in list order
- Disk access time is amortized over many accesses

Random (RA):

- Look up the list entry for a specific item (e.g., "D2?")
- Pay the full cost (plus lookup cost) for each access
- C times more expensive than sequential access (C = 10-1000)

[Figure: the same list (D8: 0.99, D4: 0.77, D2: 0.51, D1: 0.15, D3: 0.01), ordered by decreasing score, read sequentially vs. probed at random for D2]

First example for an algorithm: sorted access only, no random access (NRA)


Example: Top-1 for a 2-term query (NRA)

Index lists, ordered by decreasing score:

L1: A: 0.9, G: 0.3, H: 0.3, I: 0.25, J: 0.2, K: 0.2, D: 0.15
L2: D: 1.0, E: 0.7, F: 0.7, B: 0.65, C: 0.6, A: 0.3, G: 0.2

NRA scans the lists round-robin with sorted accesses only. For every item seen so far it maintains a score interval [worstscore; bestscore]: the sum of its known per-list scores, plus, for each list where it has not been seen yet, the score at the current scan position. min-k is the worstscore of the current top-1 candidate.

- SA 1 (L1, A: 0.9): A: [0.9; 1.9]; unseen items: [0.0; 1.9]; top-1: A, min-k = 0.9
- SA 2 (L2, D: 1.0): D: [1.0; 1.9], A: [0.9; 1.9]; unseen: [0.0; 1.9]; top-1: D, min-k = 1.0
- SA 3 (L1, G: 0.3): D: [1.0; 1.3], A: [0.9; 1.9], G: [0.3; 1.3]; unseen: [0.0; 1.3]; min-k = 1.0
- SA 4 (L2, E: 0.7): D: [1.0; 1.3], A: [0.9; 1.6], G: [0.3; 1.0]; unseen: [0.0; 1.0]
  => no new candidates need to be considered, and G can never exceed min-k
- SA 5-11: the bounds keep shrinking, e.g. D: [1.0; 1.25] and A: [0.9; 1.55], then D: [1.0; 1.2] and A: [0.9; 1.5]
- SA 12 (L2, A: 0.3): A is now fully seen with score 0.9 + 0.3 = 1.2; top-1: A, min-k = 1.2; D: [1.0; 1.2] cannot exceed it

The algorithm safely terminates after 12 sorted accesses and returns A as the top-1 result.
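
The following is a compact Python sketch of NRA along the lines of this walkthrough (round-robin sorted accesses, interval bookkeeping, early stopping). It is a didactic reimplementation on the toy lists above, not the code behind the talk:

```python
from typing import Dict, List, Set, Tuple

def nra(lists: List[List[Tuple[str, float]]], k: int) -> List[Tuple[str, float]]:
    """Return the top-k by worstscore; at termination no other item can beat them."""
    m = len(lists)
    worst: Dict[str, float] = {}          # sum of known per-list scores
    seen_in: Dict[str, Set[int]] = {}     # lists in which the item has been seen
    high = [l[0][1] if l else 0.0 for l in lists]   # score at current scan position
    depth = 0
    while depth < max(len(l) for l in lists):
        for i, l in enumerate(lists):     # round-robin sorted accesses
            if depth >= len(l):
                high[i] = 0.0
                continue
            item, s = l[depth]
            high[i] = s
            worst[item] = worst.get(item, 0.0) + s
            seen_in.setdefault(item, set()).add(i)

        # bestscore = known part + the current high[i] of every still-unread list
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen_in[d])
                for d in worst}
        topk = sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:k]
        mink = topk[-1][1] if len(topk) == k else 0.0

        # stop when neither an unseen item (sum of highs) nor any other candidate
        # can still exceed min-k
        others_best = max((best[d] for d in best if d not in dict(topk)), default=0.0)
        if len(topk) == k and max(sum(high), others_best) <= mink:
            return topk
        depth += 1
    return sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:k]

L1 = [("A", 0.9), ("G", 0.3), ("H", 0.3), ("I", 0.25), ("J", 0.2), ("K", 0.2), ("D", 0.15)]
L2 = [("D", 1.0), ("E", 0.7), ("F", 0.7), ("B", 0.65), ("C", 0.6), ("A", 0.3), ("G", 0.2)]
print([(d, round(s, 2)) for d, s in nra([L1, L2], k=1)])   # [('A', 1.2)] after 12 SA
```

With the toy lists it stops after scanning depth 6 in both lists, i.e. 12 sorted accesses, matching the walkthrough.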

Random Accesses

Two main purposes for random accesses:

- They can speed up execution
- Some predicates cannot be read from sorted lists
  ("X and not Y") => expensive predicates

Scheduling problem: when to perform an RA, for which item, on which list?

This talk covers only the "when" aspect.


Random Access Scheduling

When?

- Immediately when an item is seen (TA)
  + scores are always correct, no candidates needed
  – most RA are wasted (items are seen again later)
  – really slow if RA are expensive
- Balanced: after C sorted accesses, do 1 RA (Combined Algorithm, CA)
  + faster than TA, but most RA are still wasted
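
A minimal sketch of the "immediately when an item is seen" strategy (TA) on the toy lists from the NRA example; the summation scoring, the dictionary standing in for random accesses, and the tie handling are assumptions for illustration:

```python
import heapq

# Hypothetical index lists, each ordered by decreasing score, plus a lookup table
# that plays the role of random accesses.
lists = [
    [("A", 0.9), ("G", 0.3), ("H", 0.3), ("I", 0.25), ("J", 0.2), ("K", 0.2), ("D", 0.15)],
    [("D", 1.0), ("E", 0.7), ("F", 0.7), ("B", 0.65), ("C", 0.6), ("A", 0.3), ("G", 0.2)],
]
random_access = [dict(l) for l in lists]

def threshold_algorithm(k):
    """TA: resolve each newly seen item's full score immediately via random accesses;
    stop once the k-th best complete score reaches the threshold, i.e. the sum of the
    scores at the current sorted-access positions."""
    top, seen = [], set()                       # min-heap of (score, item)
    for depth in range(max(len(l) for l in lists)):
        threshold = 0.0
        for l in lists:
            if depth >= len(l):
                continue
            item, s = l[depth]                  # sorted access
            threshold += s
            if item not in seen:
                seen.add(item)
                full = sum(ra.get(item, 0.0) for ra in random_access)  # random accesses
                if len(top) < k:
                    heapq.heappush(top, (full, item))
                elif full > top[0][0]:
                    heapq.heapreplace(top, (full, item))
        if len(top) == k and top[0][0] >= threshold:
            break                               # no unseen item can still make the top-k
    return sorted(top, reverse=True)

print([(item, round(s, 2)) for s, item in threshold_algorithm(k=1)])
# [('A', 1.2)] after reading only two positions per list
```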


Random Access Scheduling

When?

- LAST heuristic: switch from SA to RA when
  - all possible candidates have been seen, and
  - the expected future cost for RA is below the cost already spent for SA

Rationale: do expensive RA as late as possible to avoid wasting them.
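
The switch test might look like the following sketch; the concrete cost model (one RA of cost C per still-missing candidate/list pair) is an assumption made here for illustration and may differ from the published heuristic:

```python
def should_switch_to_ra(open_candidates, avg_missing_lists, cost_ratio_c, sorted_accesses_done):
    """LAST-style test, applied only once no new candidates can enter the candidate set:
    switch to the RA phase when the estimated cost of resolving all open candidates by
    random accesses drops below the SA cost already paid."""
    expected_ra_cost = open_candidates * avg_missing_lists * cost_ratio_c
    return expected_ra_cost <= sorted_accesses_done

# Example: 40 open candidates, 1.5 unread lists each on average, C = 100, 8000 SAs so far.
print(should_switch_to_ra(40, 1.5, 100, 8000))   # True: 6000 <= 8000
```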


Experiments with TREC Benchmark

- TREC Terabyte collection:
  ~24 million docs from the .gov domain, ~420 GB (unpacked)
  (we now have one with 10^9 docs, 5 TB compressed)
- 50 keyword queries from TREC Terabyte 2005
- Performance measures:
  - number of sequential and random accesses
  - weighted cost: #SA + C * #RA
  - wall-clock runtime


Experiments: (TA and) CA on TREC

Lower bound, computed for each query [VLDB06]:

- compute the top-k results R and the final min-k
- find the minimum over all combinations of scan depths that see R:
  SA cost + RA cost for all candidates with bestscore > min-k
- considers blocks of entries for tractability

[Figure: average abstract cost (#SA + 1000 x #RA, up to ~4,000,000) and average wall-clock runtime (milliseconds, up to ~250) for k = 10, 50, 100, 200, 500, comparing merge-then-sort, NRA, CA, our approach (OURS), and the lower bound]


Outline

- A real-life example of IR (for everybody)
- Algorithms for text retrieval (for engineers)
- Approximation algorithms (for experts)
  (including some recent results)


Beyond Exact Top-k Results

- Improve performance by considering approximate results with probabilistic guarantees [VLDB04]
  - drop a candidate when its probability of being a top-k result is < ε (sketched below)
  - estimate the probabilities from per-list score distributions
  - reasonable improvement in performance (stop earlier)
  - probabilistic guarantee: E[relative recall @ k] = 1 − ε
- Maximize result quality within a fixed budget for execution cost (number of accesses, time) [ICDE09]
  - adaptive scheduling: initially prefer high scores, later high score drops
  - experimental results close to the optimal (offline) results

"Can I see the next cake, please?"
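
As referenced above ("sketched below"), here is a minimal illustration of probability-based candidate pruning. The score model (unknown per-list scores uniform on [0, high_i]) and the Monte-Carlo estimation are simplifying assumptions; real estimators work with per-list score histograms or convolutions:

```python
import random

def prob_enters_topk(worstscore, missing_highs, mink, samples=10_000):
    """Estimate P(final score >= min-k) for a candidate that has not been seen in some
    lists yet. Assumption for illustration: the unknown score in list i is uniformly
    distributed on [0, high_i], where high_i is the score at the current scan position."""
    gap = mink - worstscore
    if gap <= 0:
        return 1.0
    hits = sum(sum(random.uniform(0, h) for h in missing_highs) >= gap
               for _ in range(samples))
    return hits / samples

# Drop a candidate when its chance of still making the top-k falls below epsilon.
epsilon = 0.1
p = prob_enters_topk(worstscore=0.5, missing_highs=[0.3], mink=1.0)
print(p, "-> drop" if p < epsilon else "-> keep")   # 0.0 -> drop
```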


Even More Heuristics: Proximity

Observation [SPIRE07]: "good" results have term matches close together.

=> Add a second type of list: for each term pair, include documents with close occurrences of the two terms, ordered by a distance-based score.

Example lists (all ordered by descending score):

TL(french):           A: 9.3,  T: 7.2,  E: 5.0,  B: 4.5
TL(pianist):          F: 9.1,  B: 8.6,  A: 5.9,  D: 4.6
CL(french, pianist):  B: (3.0, 8.6, 4.5),  F: (0.7, 9.1, 1.5),  T: (0.5, 3.0, 7.2),  G: (0.2, 2.0, 1.7)


Query Processing

Observation: very small prefixes of the lists already yield good results.

=> Prune and reorganize the index lists: keep only short prefixes of TL(french), TL(pianist), and CL(french, pianist), re-sort them by ascending doc id, combine them with a merge join, and sort the joined results to obtain the top-k.

[Figure: the pruned list prefixes, re-sorted by ascending doc id, feeding a merge join that produces the top-k results]

- Parameters tuned through exhaustive search in the parameter space (4h on an 80-core Hadoop cluster)
- The resulting index is approximately as large as the collection
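
A minimal sketch of this pruned-index query processing (prefix pruning, doc-id reordering, merge join) on the example lists above; the prefix length, the plain-sum score combination, and the omission of the CL list are simplifying assumptions, not the tuned setup from the slides:

```python
def prune_and_reorder(lst, prefix_len):
    """Keep only the top-scoring prefix, then re-sort it by ascending doc id
    so that it can be consumed by a merge join."""
    return sorted(lst[:prefix_len])

def merge_join(a, b):
    """Sort-merge join on doc id over two doc-id-ordered lists, summing the scores."""
    out, i, j = {}, 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out[a[i][0]] = a[i][1] + b[j][1]
            i, j = i + 1, j + 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out

tl_french  = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]   # from the example above
tl_pianist = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]

joined = merge_join(prune_and_reorder(tl_french, 4), prune_and_reorder(tl_pianist, 4))
topk = sorted(joined.items(), key=lambda kv: kv[1], reverse=True)[:2]
print([(d, round(s, 1)) for d, s in topk])   # A: 9.3 + 5.9, B: 4.5 + 8.6
```

The CL(term1, term2) proximity lists can be pruned, reordered, and joined in exactly the same way.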


Evaluation at INEX 2009

- Standard benchmark for XML retrieval
- 2.6 million XML documents with semantic annotations from YAGO
- 113 human-defined queries, 75 of which come with a list of relevant results
- Explicit efficiency task


Runtime vs. Quality at INEX 2009

[Figure: result quality plotted against runtime (ms)]


What did we learn?

- From an abstract point of view, cake search is equivalent to text retrieval.
- Standard retrieval algorithms are quite efficient, but can still be improved a lot.
- Often, fast approximations are good enough.
- Application-specific optimizations can further improve efficiency.


Thank you.


Questions?

- Did you actually run the algorithm (with the cakes)?
  Yes, last week.
- How long did it take?
  Five minutes.
- How many cakes did you have to access?
  Won't tell.
- Did you do a careful study of the result quality?
  We'll do that now as a crowdsourcing initiative; you are all invited.