# Efficient Information Retrieval

Ralf Schenkel

March 19, 2010, Antrittsvorlesung (inaugural lecture)

## Why this topic?

## Outline

1. A real-life example of IR (for everybody)
2. Algorithms for text retrieval (for engineers)
3. Approximation algorithms (for experts)
   (including some recent results)

## First: An example from Real Life™

Here's a very simple task: find the best six cakes for a party.

This is difficult for a number of reasons:

1. It is unclear what a good cake is.
2. There are way too many options to evaluate.

## Let's do it like a scientist

Determine dimensions of cake quality:

- tastiness
- healthiness
- fruitiness
- nutritiousness
- expensiveness

Evaluate each cake in these dimensions (ratings from 1 to 5).

## Simplify: Two dimensions, find best cake

Possible way out: skylines.

A cake *dominates* another cake if it is better in at least one dimension
and not worse in any other dimension.

Skyline: all non-dominated cakes.

(Figure: cakes plotted by tastiness vs. healthiness; the skyline is the set
of non-dominated cakes.)

Problem: skylines often don't reduce the set of candidates enough.
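Skylines are easy to compute; here is a minimal sketch of the quadratic algorithm, with hypothetical cake data:

```python
# Skyline = all non-dominated cakes. A cake dominates another if it is at
# least as good in every dimension and strictly better in at least one.
def dominates(a, b, dims):
    return all(a[d] >= b[d] for d in dims) and any(a[d] > b[d] for d in dims)

def skyline(cakes, dims):
    return [c for c in cakes
            if not any(dominates(o, c, dims) for o in cakes if o is not c)]

cakes = [  # made-up ratings, for illustration only
    {"name": "Sacher", "tastiness": 5, "healthiness": 1},
    {"name": "Carrot", "tastiness": 3, "healthiness": 4},
    {"name": "Plain",  "tastiness": 2, "healthiness": 3},  # dominated by Carrot
]
print([c["name"] for c in skyline(cakes, ["tastiness", "healthiness"])])
# -> ['Sacher', 'Carrot']
```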

## Better: Consider value of each cake

Compute a numerical value for each cake:

score(cake) = 5 · tastiness(cake) + 1 · healthiness(cake)

Order the cakes by descending value:

- 5·5 + 1 = 26 ← this is the best cake
- 5·4 + 2 = 22
- 5·3 + 4 = 19
- 5·2 + 3 = 13
- 5·1 + 5 = 10

(Your choice and weights may differ.)
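The same ranking in a few lines of code, with the ratings taken from the slide (selecting a best-3 just for illustration):

```python
# Rank cakes by the weighted linear score 5*tastiness + 1*healthiness.
import heapq

cakes = [(5, 1), (4, 2), (3, 4), (2, 3), (1, 5)]  # (tastiness, healthiness)

def score(cake):
    t, h = cake
    return 5 * t + 1 * h

# top-3 without sorting the whole list
print(heapq.nlargest(3, cakes, key=score))
# -> [(5, 1), (4, 2), (3, 4)]  with scores 26, 22, 19
```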

## Life is more complex than this

- We need diverse cakes among the 6 selected cakes
  (people with different preferences will attend the party!).
- External constraints may reduce the options:
  - maximal amount of money to spend
  - outside temperature
  - availability of cakes or bakeries
- Exact ratings of cakes are often unknown;
  they must be estimated from popularity etc.

We abstract these problems away in this talk.

## Outline

1. A real-life example of IR (for everybody)
2. **Algorithms for text retrieval (for engineers)**
3. Approximation algorithms (for experts)
   (including some recent results)

## Text Retrieval

Problem: find the best documents d from a large collection that match a
query {t1, …, tn}.

Back to the cake analogy:

- documents = cakes
- terms = tastiness, …
- score = rating

Modeling and ranking: define a score for each document.

- tf(d,t): frequency of term t in document d
  (importance of t for document d: the more frequent, the better)
- df(t): number of documents containing t
  (importance of t in the collection: the less frequent, the better)
- The query score is a linear combination of the per-term scores:
  score(d, {t1, …, tn}) = Σᵢ s(d, tᵢ).
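A minimal sketch of such a scoring function. The slide does not fix the exact per-term weight, so the classic tf·idf choice s(d,t) = tf(d,t) · log(N/df(t)) is an assumption here, and the collection is a toy:

```python
import math
from collections import Counter

docs = {  # hypothetical toy collection
    "d1": "french pianist plays french music",
    "d2": "pianist concert tonight",
    "d3": "french bakery sells cakes",
}

tf = {d: Counter(text.split()) for d, text in docs.items()}   # tf(d,t)
df = Counter(t for counts in tf.values() for t in counts)     # df(t)
N = len(docs)

def score(d, query_terms):
    # linear combination of per-term scores
    return sum(tf[d][t] * math.log(N / df[t]) for t in query_terms if df[t])

print(sorted(docs, key=lambda d: score(d, ["french", "pianist"]), reverse=True))
# -> ['d1', 'd2', 'd3']: d1 matches both terms, 'french' even twice
```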

## Merge-then-sort

This score cannot be computed from scratch for each query
(there are >>10^10 documents).

Solution:

- Precompute per-term scores for each document.
- For each term, store a list of (d, score(d,t)) entries on disk.
- When a query arrives:
  - combine the entries from the term lists,
  - sort the results, and
  - return the top-k
  (the merge-then-sort algorithm).
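A sketch of this baseline, with hypothetical list contents:

```python
# Merge-then-sort: one posting list per term, entries (doc, score(d,t));
# combine by summing per document, then sort everything, return the top-k.
from collections import defaultdict

index = {  # term -> list of (doc, score(d,t)), as stored on disk
    "french":  [("A", 0.9), ("G", 0.3), ("H", 0.3)],
    "pianist": [("D", 1.0), ("E", 0.7), ("A", 0.3)],
}

def merge_then_sort(query_terms, k):
    combined = defaultdict(float)
    for t in query_terms:                 # merge: scan every list completely
        for doc, s in index.get(t, []):
            combined[doc] += s
    # sort all combined entries by descending score, keep the top k
    return sorted(combined.items(), key=lambda e: e[1], reverse=True)[:k]

print(merge_then_sort(["french", "pianist"], 1))  # -> [('A', 1.2)]
```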

## Family of Threshold Algorithms

But: the lists can be very long (millions of entries), so the simple
merge-then-sort algorithm is too expensive.

Observation: "good" results have high scores.

- Order each list by decreasing score
  (e.g. T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01).
- Use an "intelligent" algorithm with different list access modes
  and early stopping.

## List Access Modes

Sequential access (sorted, SA):

- access tuples in list order (decreasing score)
- disk access time is amortized over many accesses

Random access (RA):

- look up the list entry for a specific item ("what is D2's score?")
- pay the full cost (plus lookup cost) for each access
- C times more expensive than sequential access (C ≈ 10–1000)

First example of an algorithm: sorted access only, no random access (NRA).

## Example: Top-1 for a 2-term query (NRA)

Two index lists, ordered by decreasing score:

| L1      | L2      |
|---------|---------|
| A: 0.9  | D: 1.0  |
| G: 0.3  | E: 0.7  |
| H: 0.3  | F: 0.7  |
| I: 0.25 | B: 0.65 |
| J: 0.2  | C: 0.6  |
| K: 0.2  | A: 0.4  |
| D: 0.15 | G: 0.2  |

NRA reads the lists round-robin with sorted accesses (SA). Each seen item
keeps a score interval [worstscore; bestscore]: the worstscore sums the
scores seen so far, the bestscore additionally assumes the last score read
from each still-unread list. min-k is the k-th best worstscore (k = 1
here); an unseen item can score at most the sum of the last scores read.

| SA | access    | A          | D          | G          | min-k | unseen items |
|----|-----------|------------|------------|------------|-------|--------------|
| 1  | L1: A 0.9 | [0.9; 1.9] | –          | –          | 0.9   | [0.0; 1.9]   |
| 2  | L2: D 1.0 | [0.9; 1.9] | [1.0; 1.9] | –          | 1.0   | [0.0; 1.9]   |
| 3  | L1: G 0.3 | [0.9; 1.9] | [1.0; 1.3] | [0.3; 1.3] | 1.0   | [0.0; 1.3]   |
| 4  | L2: E 0.7 | [0.9; 1.6] | [1.0; 1.3] | [0.3; 1.0] | 1.0   | [0.0; 1.0]   |

After 4 SA, no more new candidates need to be considered: the bestscore of
unseen items no longer exceeds min-k, and G is dropped for the same reason.
The scan continues and tightens the bounds of A and D (SA 5–11 read H, F,
I, B, J, C, K), e.g. D = [1.0; 1.25] after SA 7 and A = [0.9; 1.55] after
SA 8; after 11 SA, A = [0.9; 1.5] and D = [1.0; 1.2]. The 12th access reads
A: 0.4 from L2 and completes A's score: A = [1.3; 1.3] and min-k = 1.3.
Every remaining candidate's bestscore (D: 1.2) is now below min-k, so the
algorithm safely terminates after 12 SA: A is the top-1 item.
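A compact, runnable sketch of NRA for this example (top-1 only, round-robin sorted access; the bookkeeping mirrors the trace above):

```python
def nra_top1(lists):
    """NRA with sorted access only; returns (top-1 item, score, #SA)."""
    seen = {}                             # item -> {list index: score}
    high = [lst[0][1] for lst in lists]   # last score read per list
    pos = [0] * len(lists)
    sa = 0
    while True:
        for i, lst in enumerate(lists):   # one round-robin SA per list
            if pos[i] < len(lst):
                item, s = lst[pos[i]]
                pos[i] += 1
                sa += 1
                high[i] = s
                seen.setdefault(item, {})[i] = s
        worst = {it: sum(sc.values()) for it, sc in seen.items()}
        best = {it: worst[it] + sum(h for i, h in enumerate(high)
                                    if i not in seen[it]) for it in seen}
        top = max(worst, key=worst.get)
        mink = worst[top]                 # min-k for k = 1
        # safe early stopping: nobody (seen or unseen) can still beat top
        if sum(high) <= mink and all(best[it] <= mink
                                     for it in seen if it != top):
            return top, worst[top], sa

L1 = [("A",0.9),("G",0.3),("H",0.3),("I",0.25),("J",0.2),("K",0.2),("D",0.15)]
L2 = [("D",1.0),("E",0.7),("F",0.7),("B",0.65),("C",0.6),("A",0.4),("G",0.2)]
print(nra_top1([L1, L2]))   # -> ('A', 1.3, 12), modulo float rounding
```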

## Random Accesses

Two main purposes of random accesses:

- They can speed up execution.
- Some predicates cannot be read from sorted lists
  ("X and not Y"): so-called expensive predicates.

Scheduling problem: when to perform an RA, for which item, on which list?

This talk covers only the "when" aspect.

## Random Access Scheduling: When?

Immediately, when an item is seen (TA):

- (+) scores are always correct; no candidate bookkeeping needed
- (−) most RA are wasted (the items are seen again later anyway)
- (−) really slow if RA are expensive

Balanced: after C sorted accesses, do 1 RA (Combined Algorithm, CA):

- (+) faster than TA
- (−) but most RA are still wasted

LAST heuristic: switch from SA to RA when

1. all possible candidates have been seen, and
2. the expected future cost for RA is below the cost already spent on SA.

Rationale: do expensive RA as late as possible to avoid wasting them.
A toy sketch of the switch rule follows.
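```python
# A toy illustration of the LAST switch rule: keep scanning sequentially;
# switch to the RA phase once (1) no unseen item can still enter the top-k
# and (2) the estimated remaining RA cost is at most the SA cost already
# spent. The cost model (one RA per missing list entry, C = 100) and all
# numbers are illustrative assumptions, not the tuned values from the talk.

C = 100  # one RA costs as much as C sequential accesses

def should_switch(sa_done, open_candidates, unseen_bestscore, mink):
    # (1) all possible candidates have been seen
    if unseen_bestscore > mink:
        return False
    # (2) expected future RA cost is below the SA cost already spent
    expected_ra = sum(missing for _, missing in open_candidates)
    return expected_ra * C <= sa_done

# 25,000 SA spent; three open candidates, each missing one list entry;
# unseen items can score at most 0.35; current min-k is 0.8.
open_cands = [("D", 1), ("F", 1), ("B", 1)]
print(should_switch(25_000, open_cands, unseen_bestscore=0.35, mink=0.8))  # True
```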

## Experiments with TREC Benchmark

TREC Terabyte collection:

- ~24 million docs from the .gov domain, ~420GB (unpacked) size
  (we now have a collection with 10^9 docs, 5TB compressed size)
- 50 keyword queries from TREC Terabyte 2005

Performance measures:

- number of sequential and random accesses
- weighted cost: #SA + C · #RA
- wall-clock runtime
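As a small sanity check, the abstract cost measure in code (the plots below use C = 1000):

```python
# Weighted abstract cost: sequential accesses count 1, random accesses
# count C. The experiments below plot this with C = 1000.
def weighted_cost(num_sa: int, num_ra: int, c: int = 1000) -> int:
    return num_sa + c * num_ra

print(weighted_cost(2_000_000, 1_500))  # -> 3500000
```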

## Experiments: (TA and) CA on TREC

Lower bound, computed for each query [VLDB06]:

- compute the top-k results R and the final min-k,
- find the minimum, over all combinations of scan depths that see R, of
  the SA cost plus the RA cost for candidates with bestscore > min-k,
- considering blocks of entries to keep this tractable.

(Figures: average abstract cost, #SA + 1000 × #RA, and average wall-clock
runtime in milliseconds, both for k = 10, 50, 100, 200, 500, comparing
merge-then-sort, NRA, CA (ours), and the lower bound.)

## Outline

1. A real-life example of IR (for everybody)
2. Algorithms for text retrieval (for engineers)
3. **Approximation algorithms (for experts)**
   (including some recent results)

## Beyond Exact Top-K Results

Improve performance by computing approximate results with probabilistic
guarantees [VLDB04]:

- drop a candidate when its probability of being a top-k result is < ε
- estimate these probabilities from per-list score distributions
- reasonable improvement in performance (stop earlier)
- probabilistic guarantee: E[relative recall @ k] = 1 − ε

Maximize result quality within a fixed budget for execution cost
(number of accesses, time) [ICDE09]:

- adaptive scheduling: initially prefer high scores, later high score drops
- experimental results close to the optimal (offline) results

A sketch of the pruning test follows.
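```python
# A minimal sketch of the pruning test: drop a candidate when its
# probability of still reaching min-k falls below epsilon. The unknown
# score from each unread list is modeled here as uniform on [0, high_i];
# this uniform model is a simplifying assumption (the talk estimates the
# real per-list score distributions instead).

import random

def prob_reaches_mink(worstscore, highs_of_unread_lists, mink, samples=100_000):
    """Monte-Carlo estimate of P[final score > min-k] under the uniform model."""
    rng = random.Random(42)
    hits = sum(
        worstscore + sum(rng.uniform(0, h) for h in highs_of_unread_lists) > mink
        for _ in range(samples)
    )
    return hits / samples

eps = 0.05
# candidate with partial score 0.3; the unread list's current high is 0.7;
# min-k is 0.9, so it needs more than 0.6 from a U(0, 0.7) draw
p = prob_reaches_mink(0.3, [0.7], mink=0.9)
print(round(p, 2), "drop" if p < eps else "keep")   # ~0.14 -> keep
```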

## Even More Heuristics: Proximity

Observation [SPIRE07]: "good" results have term matches close together.

Add a second type of list: for each term pair, a combined list (CL) of
documents with close occurrences of the terms, ordered by a distance-based
score.

Example lists (descending score):

- TL(french): A: 9.3, T: 7.2, E: 5.0, B: 4.5
- TL(pianist): F: 9.1, B: 8.6, A: 5.9, D: 4.6
- CL(french, pianist): B: (3.0, 8.6, 4.5), F: (0.7, 9.1, 1.5),
  T: (0.5, 3.0, 7.2), G: (0.2, 2.0, 1.7)
  (each entry carries a proximity score together with the two per-term scores)
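A sketch of a distance-based proximity score for one term pair in one document. The exact formula from [SPIRE07] is not spelled out on the slide, so the common "accumulate 1/d² over close occurrence pairs" form is an assumption, and the positions are hypothetical:

```python
def proximity_score(positions_a, positions_b, window=10):
    """Sum of inverse squared distances over occurrence pairs within the window."""
    return sum(
        1.0 / (pa - pb) ** 2
        for pa in positions_a
        for pb in positions_b
        if 0 < abs(pa - pb) <= window
    )

french = [3, 17]    # token positions of "french" in some document
pianist = [4, 40]   # token positions of "pianist"
print(proximity_score(french, pianist))  # 1.0: only the pair (3, 4) is close
```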

## Query Processing

Observation: very small prefixes of the lists yield good results.

Prune and reorganize the index lists:

- keep only a short prefix of each TL and CL (descending score),
- re-sort each pruned list by ascending document id
  (e.g. TL(pianist) becomes A: 5.9, B: 8.6, D: 4.6, F: 9.1),
- process a query as a merge join over the did-ordered lists, followed by
  a sort to obtain the top-k results (see the sketch below).

Pruning parameters were tuned through exhaustive search in the parameter
space (4h on an 80-node cluster). The resulting index is approximately as
large as the collection itself.
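A runnable sketch of the merge join over did-ordered pruned lists. The list contents follow the slide's example; as a simplification, the CL is omitted and per-term scores are simply summed:

```python
import heapq

TL_french  = [("A", 9.3), ("B", 4.5), ("E", 5.0), ("T", 7.2)]  # ascending did
TL_pianist = [("A", 5.9), ("B", 8.6), ("D", 4.6), ("F", 9.1)]  # ascending did

def merge_join(lists):
    """Yield (doc, summed score) from did-ordered lists, in did order."""
    streams = [iter(l) for l in lists]
    heads = []
    for i, it in enumerate(streams):
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heads, (entry[0], i, entry[1]))
    while heads:
        doc = heads[0][0]
        total = 0.0
        while heads and heads[0][0] == doc:   # pop every list at this doc id
            _, i, s = heapq.heappop(heads)
            total += s
            nxt = next(streams[i], None)
            if nxt is not None:
                heapq.heappush(heads, (nxt[0], i, nxt[1]))
        yield doc, total

top_k = heapq.nlargest(2, merge_join([TL_french, TL_pianist]),
                       key=lambda e: e[1])
print(top_k)  # -> [('A', 15.2), ('B', 13.1)]
```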

## Evaluation at INEX 2009

- Standard benchmark for XML retrieval.
- 2.6 million XML documents with semantic annotation from YAGO.
- 113 human-defined queries; 75 of them come with a list of relevant results.

## Runtime vs. Quality at INEX 2009

(Figure: result quality plotted against runtime in milliseconds.)

## What did we learn?

- From an abstract point of view, cake search is equivalent to text retrieval.
- Standard retrieval algorithms are quite efficient, but they can still be
  improved a lot.
- Fast approximations are often good enough.
- Application-specific optimizations can further improve efficiency.

Thank you.


## Questions?

Q: Did you actually run the algorithm (with the cakes)?
A: Yes, last week.

Q: How long did it take?
A: Five minutes.

Q: How many cakes did you have to access?
A: Won't tell.

Q: Did you do a careful study of the result quality?
A: We'll do that now as a crowdsourcing initiative; you are all invited.