Probabilistic Information Retrieval


Probabilistic Information Retrieval [ProbIR]

Suman K Mitra


DAIICT, Gandhinagar

suman_mitra@daiict.ac.in

Acknowledgment

Alexander Dekhtyar, University of Maryland
Mandar Mitra, ISI, Kolkata
Prasenjit Majumder, DAIICT, Gandhinagar

Why use Probabilities?


- Information Retrieval deals with uncertain information
- Probability is a measure of uncertainty
- Probabilistic Ranking Principle: provable minimization of risk
- Probabilistic Inference: to justify your decision

Basic IR System

[Diagram: a Document Collection feeds a Document Representation (1); a Query feeds a Query Representation (2); the two representations are matched (3) to produce the results (4)]

1. How good is the representation?
2. How exact is the representation?
3. How well is the query matched?
4. How relevant is the result to the query?

Approaches and main Contributors


- Probability Ranking Principle: Robertson, 1970 onwards
- Information Retrieval as Probabilistic Inference: Van Rijsbergen et al., 1970 onwards
- Probabilistic Indexing: Fuhr et al., 1980 onwards
- Bayesian Nets in Information Retrieval: Turtle and Croft, 1990 onwards
- Probabilistic Logic Programming in Information Retrieval: Fuhr et al., 1990 onwards

Probability Ranking Principle

1. Collection of documents
2. Representation of documents
3. User issues a query
4. Representation of the query
5. A set of documents to return

Question: In what order should documents be presented to the user?

Logically: Best document first, then the next best, and so on.

Requirement: A formal way to judge the goodness of documents with respect to the query.

Possibility: The probability of relevance of the document with respect to the query.

Probability Ranking Principle

If a retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of goodness to the user who submitted the request ...

... where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose ...

... then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

W. S. Cooper

Probability Basics

Bayes' Rule

Let $a$ and $b$ be two events. Then

$$p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}, \qquad p(b \mid a) = \frac{p(a \mid b)\,p(b)}{p(a)}$$


The odds of an event $a$ are defined as

$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$
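As a quick numeric illustration, here is a minimal Python sketch of Bayes' rule and the odds transformation; the event probabilities are made up for the example:

```python
# Minimal sketch of Bayes' rule and odds; the numbers are illustrative only.

def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """p(a|b) = p(b|a) * p(a) / p(b)."""
    return p_b_given_a * p_a / p_b

def odds(p_a: float) -> float:
    """O(a) = p(a) / (1 - p(a))."""
    return p_a / (1.0 - p_a)

p_a, p_b, p_b_given_a = 0.3, 0.4, 0.6   # hypothetical event probabilities
print(bayes(p_b_given_a, p_a, p_b))     # p(a|b) = 0.6 * 0.3 / 0.4 = 0.45
print(odds(p_a))                        # 0.3 / 0.7 = 0.4286...
```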



Conditional probability satisfies all axioms of probability







(i) $0 \le p(a \mid b) \le 1$

(ii) $p(S \mid b) = 1$, where $S$ is the sample space

(iii) If the $a_i$ are mutually exclusive events, then
$$p\Big(\bigcup_i a_i \;\Big|\; b\Big) = \sum_i p(a_i \mid b)$$

Proof of (i): by definition
$$p(a \mid b) = \frac{p(ab)}{p(b)}, \qquad p(ab) \ge 0,\; p(b) > 0$$
so $p(a \mid b) \ge 0$; and since $ab \subseteq b$ implies $p(ab) \le p(b)$,
$$p(a \mid b) = \frac{p(ab)}{p(b)} \le 1.$$
Hence (i).

Proof of (ii):
$$p(S \mid b) = \frac{p(Sb)}{p(b)} = \frac{p(b)}{p(b)} = 1.$$
Hence (ii).

Proof of (iii): [Figure: mutually exclusive events $a_1, a_2, a_3, a_4$ intersecting an event $b$]
$$p\Big(\bigcup_i a_i \;\Big|\; b\Big) = \frac{p\big(\big(\bigcup_i a_i\big)\,b\big)}{p(b)} = \frac{p\big(\bigcup_i a_i b\big)}{p(b)} = \frac{\sum_i p(a_i b)}{p(b)} \quad [\text{the } a_i b \text{ are all mutually exclusive}]$$
$$= \sum_i \frac{p(a_i b)}{p(b)} = \sum_i p(a_i \mid b).$$
Hence (iii).
Probability Ranking Principle

Let $x$ be a document in the collection. Let $R$ represent relevance of a document w.r.t. a given (fixed) query, and let $NR$ represent non-relevance.

$p(R \mid x)$: probability that a retrieved document $x$ is relevant.

$p(NR \mid x)$: probability that a retrieved document $x$ is non-relevant.

$$p(R \mid x) = \frac{p(x \mid R)\,p(R)}{p(x)}, \qquad p(NR \mid x) = \frac{p(x \mid NR)\,p(NR)}{p(x)}$$

$p(R)$, $p(NR)$: prior probability of retrieving a relevant and a non-relevant document, respectively.

$p(x \mid R)$, $p(x \mid NR)$: probability that if a relevant (non-relevant) document is retrieved, it is $x$.


Ranking Principle (Bayes' Decision Rule):

If $p(R \mid x) > p(NR \mid x)$ then $x$ is relevant; otherwise $x$ is not relevant.
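A minimal Python sketch of this rule and of PRP-style ranking, assuming estimates of $p(R \mid x)$ are already available (the scores below are hypothetical):

```python
# Sketch: apply Bayes' decision rule and rank documents by p(R|x).
# The p(R|x) values are hypothetical placeholders.

p_rel = {"d1": 0.9, "d2": 0.3, "d3": 0.6}   # estimated p(R|x) per document

# Bayes' decision rule: x is relevant iff p(R|x) > p(NR|x) = 1 - p(R|x).
relevant = {d for d, p in p_rel.items() if p > 1 - p}

# PRP: present documents in order of decreasing p(R|x).
ranking = sorted(p_rel, key=p_rel.get, reverse=True)
print(ranking)    # ['d1', 'd3', 'd2']
print(relevant)   # {'d1', 'd3'}
```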

Probability Ranking Principle

Does the PRP minimize the average probability of error?

$$p(\text{error} \mid x) = \begin{cases} p(R \mid x) & \text{if we decide } NR \\ p(NR \mid x) & \text{if we decide } R \end{cases}$$

$$p(\text{error}) = \sum_x p(\text{error} \mid x)\,p(x)$$

$p(\text{error})$ is minimal when all $p(\text{error} \mid x)$ are minimal. Bayes' decision rule minimizes each $p(\text{error} \mid x)$.

Probability Ranking Principle

Decision vs. actual relevance (the two error cells are numbered):

                      Actual: X is R    Actual: X is NR
Decision: X is R           -               error 2
Decision: X is NR       error 1               -



$$p(\text{error}) = \sum_x p(\text{error} \mid x)\,p(x) = \sum_{x \in R} p(NR \mid x)\,p(x) + \sum_{x \in NR} p(R \mid x)\,p(x)$$

$X$ is either in $R$ or in $NR$, so
$$\sum_{x \in R} p(NR \mid x)\,p(x) + \sum_{x \in NR} p(NR \mid x)\,p(x) = \sum_x p(NR \mid x)\,p(x) = \text{Constant}$$

and therefore
$$p(\text{error}) = \text{Constant} + \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\,p(x)$$

and, symmetrically,
$$p(\text{error}) = \text{Constant} + \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\,p(x)$$

Minimization of $p(\text{error})$ is equivalent to minimization of $2\,p(\text{error})$:
$$\text{Min} \;\; \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\,p(x) + \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\,p(x)$$

($p(x)$ is the same for any $x$.)

Define
$$S_1 = \{x : p(R \mid x) - p(NR \mid x) \ge 0\} \qquad \text{and} \qquad S_2 = \{x : p(R \mid x) - p(NR \mid x) < 0\}$$

$S_1$ and $S_2$ are nothing but the decisions for $x$ to be in $R$ and $NR$ respectively.

Hence: the decisions $S_1$ and $S_2$ minimize $p(\text{error})$.


- How do we compute all those probabilities?
- We cannot compute exact probabilities; we have to use estimates from the (ground truth) data. (Binary Independence Retrieval) (Bayesian Networks??)

Assumptions

- "Relevance" of each document is independent of the relevance of other documents.
- Most applications are for the Boolean model.

Probability Ranking Principle

Issues

- Simple case: no selection costs.
- $x$ is relevant iff $p(R \mid x) > p(NR \mid x)$ (Bayes' Decision Rule)
- PRP: rank all documents by $p(R \mid x)$.

Probability Ranking Principle



- More complex case: retrieval costs.
  - $C$: cost of retrieval of a relevant document
  - $C'$: cost of retrieval of a non-relevant document
- Let $d$ be a document.
- Probability Ranking Principle: if
$$C \cdot p(R \mid d) + C' \cdot (1 - p(R \mid d)) \;\le\; C \cdot p(R \mid d') + C' \cdot (1 - p(R \mid d'))$$
for all $d'$ not yet retrieved, then $d$ is the next document to be retrieved.
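A small Python sketch of this cost-based selection, with hypothetical costs and $p(R \mid d)$ estimates; the document minimizing the expected retrieval cost is retrieved next:

```python
# Sketch: choose the next document by minimum expected retrieval cost
#     C * p(R|d) + C' * (1 - p(R|d)).
# Costs and probability estimates are hypothetical.

C, C_prime = 0.0, 1.0                        # cost of a relevant / non-relevant retrieval
p_rel = {"d1": 0.8, "d2": 0.5, "d3": 0.65}   # estimated p(R|d)

def expected_cost(d: str) -> float:
    return C * p_rel[d] + C_prime * (1.0 - p_rel[d])

not_yet_retrieved = set(p_rel)
next_doc = min(not_yet_retrieved, key=expected_cost)
print(next_doc)   # 'd1': with C < C', this is the highest-p(R|d) document
```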














Binary Independence Model

- Traditionally used in conjunction with the PRP
- "Binary" = Boolean: documents are represented as binary vectors of terms
$$\vec{x} = (x_1, \ldots, x_n)$$
  where $x_i = 1$ iff term $i$ is present in document $x$.
- "Independence": terms occur in documents independently
- Different documents can be modeled as the same vector


- Queries: binary vectors of terms
- Given query $q$:
  - for each document $d$, we need to compute $p(R \mid q, d)$
  - replace this with computing $p(R \mid q, \vec{x})$, where $\vec{x}$ is the vector representing $d$
  - we are interested only in the ranking
- Use odds:
$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}$$








Binary Independence Model

- Using the Independence Assumption:
$$\frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

- So:
$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

(The first factor is constant for each query; only the product needs estimation.)

Binary Independence Model

Since each $x_i$ is either 0 or 1:
$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}$$

Let
$$p_i = p(x_i = 1 \mid R, q), \qquad r_i = p(x_i = 1 \mid NR, q)$$

Assume that, for all terms not occurring in the query ($q_i = 0$), $p_i = r_i$.

Binary Independence Model

Split the product over matching and non-matching query terms:
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i}{r_i} \cdot \prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i}$$
(all matching terms; non-matching query terms)

$$= O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$
(all matching terms; all query terms)

Binary Independence Model

$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

The last factor is constant for each query; the middle product is the only quantity that needs to be estimated for ranking.

- Retrieval Status Value:
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)}$$
Binary Independence Model

- All boils down to computing the RSV:
$$RSV = \sum_{x_i = q_i = 1} \log \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)} = \sum_{x_i = q_i = 1} c_i, \qquad c_i = \log \frac{p_i\,(1 - r_i)}{r_i\,(1 - p_i)}$$

- So, how do we compute the $c_i$'s from our data?

Binary Independence Model



- Estimating RSV coefficients.
- For each term $i$, look at the following table:

| Documents | Relevant | Non-Relevant | Total |
|-----------|----------|--------------|-------|
| $x_i = 1$ | $s$ | $n - s$ | $n$ |
| $x_i = 0$ | $S - s$ | $N - n - S + s$ | $N - n$ |
| Total | $S$ | $N - S$ | $N$ |

- Estimates:
$$p_i \approx \frac{s}{S}, \qquad r_i \approx \frac{(n - s)}{(N - S)}$$
$$c_i \approx K(N, n, S, s) = \log \frac{s\,(N - n - S + s)}{(S - s)\,(n - s)}$$
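A sketch of these estimates in Python, with made-up counts. (The raw estimate breaks down when a cell of the table is zero, which motivates the 0.5 smoothing used in the probabilistic weighting scheme later.)

```python
import math

# Sketch: estimate p_i, r_i and the term weight c_i from contingency counts.
# N: total documents, S: relevant documents, n: documents containing term i,
# s: relevant documents containing term i. Counts are illustrative.
N, S, n, s = 1000, 20, 100, 10

p_i = s / S                # estimate of p(x_i = 1 | R, q)
r_i = (n - s) / (N - S)    # estimate of p(x_i = 1 | NR, q)

c_i = math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))
# Equivalent count form: log( s (N - n - S + s) / ((S - s)(n - s)) )
c_i_counts = math.log(s * (N - n - S + s) / ((S - s) * (n - s)))
print(c_i, c_i_counts)     # both ~ 2.29
```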

Binary Independence Model

PRP and BIR: the lessons

- Getting reasonable approximations of probabilities is possible.
- Simple methods work only with restrictive assumptions:
  - term independence
  - terms not in the query do not affect the outcome
  - Boolean representation of documents/queries
  - document relevance values are independent
- Some of these assumptions can be removed.

Probabilistic weighting scheme

$$w = \log \frac{s\,(N - n - S + s)}{(n - s)\,(S - s)}$$

The log of a ratio of probabilities may go to positive or negative infinity (e.g. when a count is zero), so add 0.5 to each term:

$$w = \log \frac{(s + 0.5)\,(N - n - S + s + 0.5)}{(n - s + 0.5)\,(S - s + 0.5)}$$

Probabilistic weighting scheme [S. Robertson]

In general form, the weighting function is

$$w = \frac{(k_1 + 1)\,f}{k_1 L + f} \cdot \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \log \frac{(s + 0.5)\,(N - n - S + s + 0.5)}{(n - s + 0.5)\,(S - s + 0.5)}$$

where

- $k_1$, $k_3$: constants
- $q$: within-query frequency (wqf)
- $f$: within-document frequency (wdf)
- $n$: number of documents in the collection indexed by this term
- $N$: total number of documents in the collection
- $s$: number of relevant documents indexed by this term
- $S$: total number of relevant documents
- $L$: normalised document length (i.e. the length of this document divided by the average length of documents in the collection)
Probabilistic weighting scheme [S. Robertson]

BM11

- Stephen Robertson's BM11 uses the general form for term weights, but adds an extra item to the sum of term weights to give the overall document score:

$$k_2 \cdot n_q \cdot \frac{(1 - L)}{(1 + L)}$$

- $n_q$: number of terms in the query (the query length)
- $k_2$: another constant
- This term is 0 when $L = 1$.

So the BM11 document score is

$$\sum_{\text{matching terms}} \frac{(k_1 + 1)\,f}{k_1 L + f} \cdot \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \log \frac{(s + 0.5)\,(N - n - S + s + 0.5)}{(n - s + 0.5)\,(S - s + 0.5)} \;+\; k_2 \cdot n_q \cdot \frac{(1 - L)}{(1 + L)}$$
Probabilistic weighting scheme

BM15

BM15 is the same as BM11 with the term $k_1 L + f$ replaced by $k_1 + f$:

$$\sum_{\text{matching terms}} \frac{(k_1 + 1)\,f}{k_1 + f} \cdot \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \log \frac{(s + 0.5)\,(N - n - S + s + 0.5)}{(n - s + 0.5)\,(S - s + 0.5)} \;+\; k_2 \cdot n_q \cdot \frac{(1 - L)}{(1 + L)}$$
Probabilistic weighting scheme

BM25

BM25 combines BM11 and BM15 with a scaling factor $b$, which turns BM15 into BM11 as it moves from 0 to 1. In the general form, $k_1 L$ (BM11) / $k_1$ (BM15) is replaced by

$$k = k_1\,(bL + (1 - b))$$

so the term weight becomes

$$w = \frac{(k_1 + 1)\,f}{k_1 (bL + (1 - b)) + f} \cdot \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \log \frac{(s + 0.5)\,(N - n - S + s + 0.5)}{(n - s + 0.5)\,(S - s + 0.5)}$$

- $b = 1$: BM11
- $b = 0$: BM15
- Default values used: $k_1 = 1$, $k_2 = 0$, $k_3 = 1$, $b = 0.5$
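A sketch of this term weight in Python, following the slide's notation and defaults. Without relevance information, setting $S = s = 0$ is a common convention that reduces the log factor to an idf-like weight; the $k_2$ correction term is omitted here since the slide's default is $k_2 = 0$.

```python
import math

# Sketch of the BM25 term weight in the slide's notation.
# f: within-document frequency, q: within-query frequency,
# L: normalised document length; N, n, S, s as in the contingency table.
def bm25_weight(f, q, L, N, n, S=0, s=0, k1=1.0, k3=1.0, b=0.5):
    k = k1 * (b * L + (1.0 - b))      # b = 1 gives BM11 (k1*L), b = 0 gives BM15 (k1)
    tf = (k1 + 1.0) * f / (k + f)     # document-frequency component
    qtf = (k3 + 1.0) * q / (k3 + q)   # query-frequency component
    w = math.log((s + 0.5) * (N - n - S + s + 0.5)
                 / ((n - s + 0.5) * (S - s + 0.5)))
    return tf * qtf * w

# Hypothetical term: 3 occurrences in an average-length document (L = 1),
# 1 occurrence in the query, present in 100 of 10000 documents, no relevance info.
print(bm25_weight(f=3, q=1, L=1.0, N=10000, n=100))   # ~ 6.9
```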

Bayesian Networks

The BIM rests on two assumptions:

1. Independence assumption: $x_1, x_2, \ldots, x_n$ are independent. Can they be dependent?
2. Binary assumption: $x_i$ = 0 or 1. Can it be $0, 1, 2, \ldots, n$?

A possible way could be Bayesian Networks: modeling of the JPD (joint probability distribution) in a compact way, using graph(s) to reflect conditional independence relationships.

BN: Representation

- Nodes: random variables
- Arcs: conditional independence (or causality)
- Undirected arcs: MRF
- Directed arcs: BNs

BN: Advantages

- An arc from node A to node B implies: A causes B
- Compact representation of the JPD of the nodes
- Easier to learn (fit to data)

Bayesian Network (BN) Basics

A Bayesian Network (BN) is a hybrid system of probability theory and graph theory. Graph theory gives the user an interface for modeling highly interacting variables. Probability theory, on the other hand, ensures that the system as a whole is consistent, and provides ways to interface models to data.


- $G$ = global structure, a DAG (Directed Acyclic Graph) that contains
  - a node for each variable $x_i \in U$, $i = 1, 2, \ldots, n$
  - edges that represent the probabilistic dependencies between the nodes of the variables $x_i \in U$
- $M$ = set of local structures $\{M_1, M_2, \ldots, M_n\}$: $n$ mappings for the $n$ variables. $M_i$ maps each value of $\{x_i, \mathrm{Par}(x_i)\}$ to a parameter $\theta$. Here $\mathrm{Par}(x_i)$ denotes the set of parent nodes of $x_i$ in $G$.

The joint probability distribution over $U$ can be decomposed by the global structure $G$ as

$$P(U \mid B) = \prod_i P(x_i \mid \mathrm{Par}(x_i), \theta, M_i, G)$$

BN: An Example (by Kevin Murphy)

[Figure: a four-node DAG with nodes Cloud (True/False), Sprinkler (On/Off), Rain (Yes/No) and Wet Grass (Yes/No); Cloud points to Sprinkler and Rain, which both point to Wet Grass]

| p(C=T) | p(C=F) |
|--------|--------|
| 1/2    | 1/2    |

| C | p(S=On) | p(S=Off) |
|---|---------|----------|
| T | 0.1     | 0.9      |
| F | 0.5     | 0.5      |

| C | p(R=Y) | p(R=N) |
|---|--------|--------|
| T | 0.8    | 0.2    |
| F | 0.2    | 0.8    |

| S, R   | p(W=Y) | p(W=N) |
|--------|--------|--------|
| On, Y  | 0.99   | 0.01   |
| On, N  | 0.9    | 0.1    |
| Off, Y | 0.9    | 0.1    |
| Off, N | 0.0    | 1.0    |

By the chain rule: $p(C,S,R,W) = p(C)\,p(S \mid C)\,p(R \mid C,S)\,p(W \mid C,S,R)$

Using the network structure: $p(C,S,R,W) = p(C)\,p(S \mid C)\,p(R \mid C)\,p(W \mid S,R)$
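A Python sketch that encodes these CPTs and does brute-force inference by enumerating the joint $p(C)\,p(S \mid C)\,p(R \mid C)\,p(W \mid S,R)$:

```python
from itertools import product

# Sketch: brute-force inference in the sprinkler network using the CPTs above.
pC = {True: 0.5, False: 0.5}                      # p(C = c)
pS = {True: 0.1, False: 0.5}                      # p(S = On | C = c)
pR = {True: 0.8, False: 0.2}                      # p(R = Yes | C = c)
pW = {(True, True): 0.99, (True, False): 0.9,     # p(W = Yes | S, R)
      (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)."""
    p = pC[c]
    p *= pS[c] if s else 1.0 - pS[c]
    p *= pR[c] if r else 1.0 - pR[c]
    p *= pW[(s, r)] if w else 1.0 - pW[(s, r)]
    return p

# Marginal p(W = Yes): sum the joint over C, S, R.
p_wet = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
print(p_wet)                      # 0.6471

# Posterior p(R = Yes | W = Yes).
p_rain_wet = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
print(p_rain_wet / p_wet)         # ~ 0.708
```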

BN: As Model

- A hybrid system of probability theory and graph theory
- Graph theory gives the user an interface for modeling the highly interactive variables
- Probability theory provides ways to interface the model to data

BN: Notations

$$B = (B_s, \Theta)$$

- $B_s$: structure of the network
- $\Theta$: set of parameters that encode the local probability distributions

$B_s$ again has two parts, $G$ and $M$:

- $G$ is the global structure (DAG)
- $M$ is the mapping for the $n$ variables (arcs) $\{x_i, \mathrm{par}(x_i)\}$
BN: Learning

- Structure: DAG
- Parameter: CPD

For $B_s$, the structure may be known or unknown, and the parameters may be known or unknown. Parameters can only be learnt if the structure is either known or has been learnt earlier.

- Structure is known and full data is observed (nothing missing): parameter learning by MLE, MAP
- Structure is known and full data is NOT observed (data missing): parameter learning by EM
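A toy Python sketch of the fully-observed case: with the structure known, the MLE of each CPT entry is just a normalised count. The data and single-parent layout here are made up for illustration.

```python
from collections import Counter

# Sketch: MLE of a CPT p(x | parent) from fully observed (parent, x) samples.
data = [(True, True), (True, True), (True, False),
        (False, True), (False, False), (False, False)]

pair_counts = Counter(data)
parent_counts = Counter(parent for parent, _ in data)

# MLE: p(x = v | parent = u) = count(u, v) / count(u)
cpt = {(u, v): pair_counts[(u, v)] / parent_counts[u] for (u, v) in pair_counts}
print(cpt)   # {(True, True): 0.667, (True, False): 0.333, (False, True): 0.333, ...}
```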

BN: Learning

Learning Structure

- To find the best DAG that fits the data:

$$p(B_s \mid D) = \frac{p(D \mid B_s)\,p(B_s)}{p(D)}$$

Objective function:

$$\log(p(B_s \mid D)) = \log(p(D \mid B_s)) + \log(p(B_s)) - \log(p(D))$$

where $\log(p(D))$ is constant and independent of $B_s$.

- Search algorithm: NP-hard
- Performance criteria used: MIC, BIC
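A toy sketch of structure scoring in this spirit: two candidate structures over a pair of binary variables are compared by penalised log-likelihood (a BIC-style score, BIC being one of the criteria named above; the data and candidate structures are made up):

```python
import math
from collections import Counter

# Sketch: BIC-style scores for two candidate structures over binary (x, y) data.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]
N = len(data)
px, py, pxy = Counter(x for x, _ in data), Counter(y for _, y in data), Counter(data)

def ll_independent():
    """Structure 1: no edge, p(x, y) = p(x) p(y)."""
    return sum(math.log(px[x] / N) + math.log(py[y] / N) for x, y in data)

def ll_edge():
    """Structure 2: x -> y, p(x, y) = p(x) p(y | x)."""
    return sum(math.log(px[x] / N) + math.log(pxy[(x, y)] / px[x]) for x, y in data)

def bic(ll, n_params):
    return ll - 0.5 * n_params * math.log(N)

# Free parameters: 1 + 1 for the independent model, 1 + 2 with the edge.
print(bic(ll_independent(), 2), bic(ll_edge(), 3))   # prefer the higher score
```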

References (basics)

1. S. E. Robertson, "The Probability Ranking Principle in IR", Journal of Documentation, 33, 294-304, 1977.

2. K. Sparck Jones, S. Walker and S. E. Robertson, "A probabilistic model of information retrieval: development and comparative experiments - Part 1", Information Processing and Management, 36, 779-808, 2000.

3. K. Sparck Jones, S. Walker and S. E. Robertson, "A probabilistic model of information retrieval: development and comparative experiments - Part 2", Information Processing and Management, 36, 809-840, 2000.

4. S. E. Robertson and H. Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond", Foundations and Trends in Information Retrieval, 3, 333-389, 2009.

5. S. E. Robertson, C. J. Van Rijsbergen and M. F. Porter, "Probabilistic models of indexing and searching", Information Retrieval Research, Oddy et al. (Eds.), 35-56, 1981.

6. N. Fuhr and C. Buckley, "A probabilistic learning approach for document indexing", ACM Trans. on Information Systems, 9, 223-248, 1991.

7. H. R. Turtle and W. B. Croft, "Evaluation of an inference network-based retrieval model", ACM Trans. on Information Systems, 9, 187-222, 1991.

Thank You

Discussions