Probabilistic Information Retrieval [ProbIR]
Suman K Mitra
DAIICT, Gandhinagar
suman_mitra@daiict.ac.in
Acknowledgment: Alexander Dekhtyar, University of Maryland; Mandar Mitra, ISI, Kolkata;
Prasenjit Majumder, DAIICT, Gandhinagar
Why use Probabilities?
• Information Retrieval deals with uncertain information
• Probability is a measure of uncertainty
• Probabilistic Ranking Principle
  – provable
  – minimization of risk
• Probabilistic Inference
  – to justify your decision
Basic IR System (schematic): a document collection is turned into a document
representation; the user's query is turned into a query representation; the two
representations are matched. Four questions arise:

1. How good is the document representation?
2. How exact is the query representation?
3. How well is the query matched?
4. How relevant is the result to the query?
Approaches and Main Contributors
• Probability Ranking Principle – Robertson, 1970 onwards
• Information Retrieval as Probabilistic Inference – van Rijsbergen et al., 1970 onwards
• Probabilistic Indexing – Fuhr et al., 1980 onwards
• Bayesian Nets in Information Retrieval – Turtle and Croft, 1990 onwards
• Probabilistic Logic Programming in Information Retrieval – Fuhr et al., 1990 onwards
Probability Ranking Principle

1. A collection of documents
2. Representation of documents
3. The user issues a query
4. Representation of the query
5. A set of documents to return

Question: in what order should the documents be presented to the user?
Logically: the best document first, then the next best, and so on.
Requirement: a formal way to judge the goodness of documents with respect to the query.
Possibility: the probability of relevance of a document with respect to the query.
Probability Ranking Principle

“If a retrieval system’s response to each request is a ranking of the documents in the
collection in order of decreasing probability of goodness to the user who submitted the
request, where the probabilities are estimated as accurately as possible on the basis of
whatever data have been made available to the system for this purpose, then the overall
effectiveness of the system to its users will be the best that is obtainable on the
basis of that data.”
(W. S. Cooper)
Probability Basics

Bayes’ Rule. Let a and b be two events. Then

    p(a|b) = p(b|a) p(a) / p(b)        and        p(b|a) = p(a|b) p(b) / p(a)

Odds. The odds of an event a are defined as

    O(a) = p(a) / p(ā) = p(a) / (1 − p(a))
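As a quick numeric sketch of the two definitions above (all probability values below are invented for illustration):

```python
# Toy check of Bayes' rule and odds; the probabilities are invented.
p_a = 0.3          # p(a)
p_b = 0.5          # p(b)
p_b_given_a = 0.8  # p(b|a)

# Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)
p_a_given_b = p_b_given_a * p_a / p_b

# Odds: O(a) = p(a) / (1 - p(a))
odds_a = p_a / (1 - p_a)

print(round(p_a_given_b, 2))  # 0.48
print(round(odds_a, 4))       # 0.4286
```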
Conditional probability satisfies all the axioms of probability:

(i)   0 ≤ p(a|b) ≤ 1
(ii)  p(S|b) = 1, where S is the sample space
(iii) if a_1, a_2, … are mutually exclusive events, then p(∪_i a_i | b) = Σ_i p(a_i | b)

Proof of (i): p(a|b) = p(ab)/p(b), where p(ab) ≥ 0 and p(b) > 0, so p(a|b) ≥ 0; and
since ab ⊆ b implies p(ab) ≤ p(b), we also get p(a|b) ≤ 1. Hence (i).

Proof of (ii): p(S|b) = p(Sb)/p(b) = p(b)/p(b) = 1. Hence (ii).

Proof of (iii): p(∪_i a_i | b) = p((∪_i a_i) b)/p(b) = Σ_i p(a_i b)/p(b)
[the a_i b are all mutually exclusive] = Σ_i p(a_i | b). Hence (iii).
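The three axioms can be checked numerically on a small discrete example (the sample space and joint distribution below are invented for illustration):

```python
# Verify axioms (i)-(iii) for conditional probability on a tiny sample space:
# outcomes of two coin flips with an invented, non-uniform joint distribution.
space = {('H', 'H'): 0.4, ('H', 'T'): 0.1, ('T', 'H'): 0.2, ('T', 'T'): 0.3}

def p(event):                      # probability of a set of outcomes
    return sum(space[o] for o in event)

def p_cond(a, b):                  # p(a|b) = p(ab) / p(b)
    return p(a & b) / p(b)

b  = {o for o in space if o[0] == 'H'}   # conditioning event: first flip heads
a1 = {o for o in space if o[1] == 'H'}   # second flip heads
a2 = {o for o in space if o[1] == 'T'}   # second flip tails (disjoint from a1)

assert 0 <= p_cond(a1, b) <= 1                                # axiom (i)
assert abs(p_cond(set(space), b) - 1) < 1e-12                 # axiom (ii)
assert abs(p_cond(a1 | a2, b)
           - (p_cond(a1, b) + p_cond(a2, b))) < 1e-12         # axiom (iii)
print("all three axioms hold on this example")
```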
Probability Ranking Principle
Let x be a document in the collection. Let R represent relevance of a document w.r.t. a
given (fixed) query, and let NR represent non-relevance.

p(R|x) – probability that a retrieved document x is relevant.
p(NR|x) – probability that a retrieved document x is non-relevant.
By Bayes’ rule,

    p(R|x)  = p(x|R) p(R) / p(x)
    p(NR|x) = p(x|NR) p(NR) / p(x)

p(R), p(NR) – prior probabilities of retrieving a relevant and a non-relevant document,
respectively.
p(x|R), p(x|NR) – probability that if a relevant (non-relevant) document is retrieved,
it is x.
Ranking Principle (Bayes’ Decision Rule):
If p(R|x) > p(NR|x), then x is relevant; otherwise x is not relevant.
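A minimal sketch of the decision rule and the PRP ordering, assuming we already have p(R|x) estimates for a handful of hypothetical documents:

```python
# Rank documents by p(R|x) (PRP) and label each relevant iff p(R|x) > p(NR|x).
# The p(R|x) values below are invented estimates; p(NR|x) = 1 - p(R|x).
p_rel = {'d1': 0.72, 'd2': 0.35, 'd3': 0.58}

ranking = sorted(p_rel, key=p_rel.get, reverse=True)        # best document first
relevant = [d for d in p_rel if p_rel[d] > 1 - p_rel[d]]    # Bayes' decision rule

print(ranking)   # ['d1', 'd3', 'd2']
print(relevant)  # ['d1', 'd3']
```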
Probability Ranking Principle

Does the PRP minimize the average probability of error?

    p(error|x) = p(R|x)    if we decide NR
                 p(NR|x)   if we decide R

    p(error) = Σ_x p(error|x) p(x)

p(error) is minimal when every p(error|x) is minimal, and Bayes’ decision rule minimizes
each p(error|x).
Probability Ranking Principle

Decision vs. actual relevance of a document x:

                        Actual: x is R     Actual: x is NR
    Decide: x is R      correct            error (2)
    Decide: x is NR     error (1)          correct

    p(error) = Σ_x p(error|x) p(x)
             = Σ_{x: decide R} p(NR|x) p(x) + Σ_{x: decide NR} p(R|x) p(x)

Since every x is decided to be either R or NR, and Σ_x {p(R|x) + p(NR|x)} p(x) = 1,

    2 p(error) = Constant + Σ_{x: decide R} {p(NR|x) − p(R|x)} p(x)
                          + Σ_{x: decide NR} {p(R|x) − p(NR|x)} p(x)

Minimizing p(error) is the same as minimizing 2 p(error); the constant does not depend on
the decision, and p(x) is the same for any x whatever we decide.

Define

    S1 = {x : p(R|x) − p(NR|x) > 0}        S2 = {x : p(R|x) − p(NR|x) ≤ 0}

S1 and S2 are nothing but the decisions for x to be in R and in NR, respectively. With
these regions, every summand above is non-positive, which is the smallest each term can
be. Hence the decision based on S1 and S2 minimizes p(error).
• How do we compute all those probabilities?
  – We cannot compute exact probabilities; we have to use estimates from the
    (ground-truth) data. (Binary Independence Retrieval; Bayesian Networks?)
• Assumptions
  – “Relevance” of each document is independent of the relevance of other documents.
  – Most applications are for the Boolean model.
Probability Ranking Principle: Issues
• Simple case: no selection costs.
• x is relevant iff p(R|x) > p(NR|x) (Bayes’ Decision Rule).
• PRP: rank all documents by p(R|x).
Probability Ranking Principle
• More complex case: retrieval costs.
  – C – cost of retrieval of a relevant document
  – C′ – cost of retrieval of a non-relevant document
  – let d be a document
• Probability Ranking Principle: if

      C · p(R|d) + C′ · (1 − p(R|d)) ≤ C · p(R|d′) + C′ · (1 − p(R|d′))

  for all documents d′ not yet retrieved, then d is the next document to be retrieved.
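A sketch of the cost-based selection rule; the costs and probabilities below are invented (C is negative here, i.e. retrieving a relevant document is a gain):

```python
# Pick the next document as the one with the lowest expected retrieval cost
#   C * p(R|d) + C' * (1 - p(R|d)).
# All numbers are illustrative.
C, C_prime = -1.0, 0.2                       # relevant: gain; non-relevant: cost
p_rel = {'d1': 0.9, 'd2': 0.4, 'd3': 0.7}    # p(R|d) for the unretrieved documents

def expected_cost(d):
    return C * p_rel[d] + C_prime * (1 - p_rel[d])

next_doc = min(p_rel, key=expected_cost)
print(next_doc)  # d1
```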
Binary Independence Model
• Traditionally used in conjunction with the PRP.
• “Binary” = Boolean: documents are represented as binary vectors of terms:
  – x = (x_1, x_2, …, x_n)
  – x_i = 1 iff term i is present in document x.
• “Independence”: terms occur in documents independently.
• Different documents can be modeled as the same vector.
• Queries: binary vectors of terms.
• Given query q,
  – for each document d, we need to compute p(R|q,d);
  – replace this with computing p(R|q,x), where x is the vector representing d.
• We are interested only in the ranking.
• Use odds:

    O(R|q,x) = p(R|q,x) / p(NR|q,x)
             = [p(R|q) / p(NR|q)] · [p(x|R,q) / p(x|NR,q)]
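The odds computation can be sketched directly; the priors and likelihoods below are invented, since in practice they must be estimated:

```python
# O(R|q,x) = [p(R|q) / p(NR|q)] * [p(x|R,q) / p(x|NR,q)]
p_R_q, p_NR_q = 0.01, 0.99      # invented priors for the query
p_x_R, p_x_NR = 0.004, 0.0005   # invented likelihoods of document vector x

odds = (p_R_q / p_NR_q) * (p_x_R / p_x_NR)
print(round(odds, 4))  # 0.0808
```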
Binary Independence Model
• Using the independence assumption:

    p(x|R,q) / p(x|NR,q) = Π_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• So:

    O(R|q,d) = O(R|q) · Π_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

  The first factor is constant for each query; only the product needs estimation.
Binary Independence Model

    O(R|q,d) = O(R|q) · Π_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• Since each x_i is either 0 or 1:

    O(R|q,d) = O(R|q) · Π_{x_i=1} [p(x_i=1|R,q) / p(x_i=1|NR,q)]
                      · Π_{x_i=0} [p(x_i=0|R,q) / p(x_i=0|NR,q)]

• Let p_i = p(x_i=1|R,q) and r_i = p(x_i=1|NR,q).
• Assume that for all terms not occurring in the query (q_i = 0), p_i = r_i.
Binary Independence Model

    O(R|q,x) = O(R|q) · Π_{x_i=1, q_i=1} (p_i / r_i)              [all matching terms]
                      · Π_{x_i=0, q_i=1} [(1−p_i) / (1−r_i)]      [non-matching query terms]

Regrouping so that one product runs over all matching terms and the other over all
query terms:

    O(R|q,x) = O(R|q) · Π_{x_i=q_i=1} [p_i (1−r_i) / (r_i (1−p_i))]
                      · Π_{q_i=1} [(1−p_i) / (1−r_i)]
Binary Independence Model

    O(R|q,x) = O(R|q) · Π_{q_i=1} [(1−p_i) / (1−r_i)]
                      · Π_{x_i=q_i=1} [p_i (1−r_i) / (r_i (1−p_i))]

The first two factors are constant for each query; the last product is the only quantity
that must be estimated for ranking.

• Retrieval Status Value:

    RSV = log Π_{x_i=q_i=1} [p_i (1−r_i) / (r_i (1−p_i))]
        = Σ_{x_i=q_i=1} log [p_i (1−r_i) / (r_i (1−p_i))]
Binary Independence Model
• Everything boils down to computing the RSV:

    RSV = Σ_{x_i=q_i=1} c_i,    where    c_i = log [p_i (1−r_i) / (r_i (1−p_i))]

So, how do we compute the c_i’s from our data?
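A minimal sketch of computing the RSV for one document; the term sets and the p_i, r_i estimates are invented for illustration:

```python
import math

# RSV = sum over terms present in both query and document of
#   c_i = log[ p_i (1 - r_i) / (r_i (1 - p_i)) ]
query_terms = {'probabilistic', 'retrieval'}
doc_terms = {'probabilistic', 'retrieval', 'model'}

p = {'probabilistic': 0.8, 'retrieval': 0.6}   # p_i = p(x_i = 1 | R, q)
r = {'probabilistic': 0.3, 'retrieval': 0.2}   # r_i = p(x_i = 1 | NR, q)

rsv = sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
          for t in query_terms & doc_terms)
print(round(rsv, 3))  # 4.025
```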
Binary Independence Model
• Estimating the RSV coefficients: for each term i, look at the following table
  (N documents in the collection, S of them relevant):

    Documents    Relevant    Non-relevant     Total
    x_i = 1      s           n − s            n
    x_i = 0      S − s       N − n − S + s    N − n
    Total        S           N − S            N

• Estimates:

    p_i ≈ s / S        r_i ≈ (n − s) / (N − S)

    c_i ≈ K(N, n, S, s) = log [ (s / (S − s)) / ((n − s) / (N − n − S + s)) ]
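Plugging the contingency-table estimates into c_i, with invented counts:

```python
import math

# Invented counts for one term i in a relevance-judged collection.
N, S = 1000, 20   # total documents, total relevant documents
n, s = 50, 12     # documents containing term i, relevant ones among them

p_i = s / S                   # estimate of p_i
r_i = (n - s) / (N - S)       # estimate of r_i

# c_i = log [ (s / (S - s)) / ((n - s) / (N - n - S + s)) ]
c_i = math.log((s / (S - s)) / ((n - s) / (N - n - S + s)))
print(round(c_i, 3))
```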
PRP and BIR: The Lessons
• Getting reasonable approximations of probabilities is possible.
• Simple methods work only with restrictive assumptions:
  – term independence
  – terms not in the query do not affect the outcome
  – Boolean representation of documents/queries
  – document relevance values are independent
• Some of these assumptions can be removed.
Probabilistic weighting scheme [S. Robertson]

With the estimates above, the term weight is

    log [ s (N − n − S + s) ] / [ (n − s)(S − s) ]

A log of a ratio of probabilities can go to positive or negative infinity when a cell of
the table is zero, so 0.5 is added to each term:

    log [ (s + 0.5)(N − n − S + s + 0.5) ] / [ (n − s + 0.5)(S − s + 0.5) ]

In its general form, the weighting function is

    w = [ (k1 + 1) f / (k1 L + f) ] · [ (k3 + 1) q / (k3 + q) ]
        · log [ (s + 0.5)(N − n − S + s + 0.5) ] / [ (n − s + 0.5)(S − s + 0.5) ]

where
    k1, k3 : constants,
    q : within-query frequency (wqf),
    f : within-document frequency (wdf),
    n : number of documents in the collection indexed by this term,
    N : total number of documents in the collection,
    s : number of relevant documents indexed by this term,
    S : total number of relevant documents,
    L : normalised document length (the length of this document divided by the average
        length of documents in the collection).
Probabilistic weighting scheme [S. Robertson]

BM11. Stephen Robertson’s BM11 uses the general form for the weights, but adds an extra
item to the sum of term weights to give the overall document score:

    score = Σ_terms [ (k1 + 1) f / (k1 L + f) ] · [ (k3 + 1) q / (k3 + q) ]
            · log [ (s + 0.5)(N − n − S + s + 0.5) ] / [ (n − s + 0.5)(S − s + 0.5) ]
            + k2 · nq · (1 − L) / (1 + L)

where
    nq : number of terms in the query (the query length),
    k2 : another constant.

The extra term k2 · nq · (1 − L) / (1 + L) is 0 when L = 1.
Probabilistic weighting scheme

BM15:

    score = Σ_terms [ (k1 + 1) f / (k1 + f) ] · [ (k3 + 1) q / (k3 + q) ]
            · log [ (s + 0.5)(N − n − S + s + 0.5) ] / [ (n − s + 0.5)(S − s + 0.5) ]
            + k2 · nq · (1 − L) / (1 + L)

BM15 is the same as BM11 with the term k1 L + f replaced by k1 + f.
Probabilistic weighting scheme

BM25 combines BM11 and BM15 with a scaling factor b, which turns BM15 into BM11 as it
moves from 0 to 1: in the general form, k1 (or k1 L) is replaced by

    k1 · ((1 − b) + b L)        b = 1 → BM11,    b = 0 → BM15

    score = Σ_terms [ (k1 + 1) f / (k1 ((1 − b) + b L) + f) ] · [ (k3 + 1) q / (k3 + q) ]
            · log [ (s + 0.5)(N − n − S + s + 0.5) ] / [ (n − s + 0.5)(S − s + 0.5) ]
            + k2 · nq · (1 − L) / (1 + L)

Default values used: k1 = 1, k2 = 0, k3 = 1, b = 0.5 (with 0 ≤ b ≤ 1).
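The BM25 term weight can be sketched as a small scoring function. This is an illustrative implementation, not Robertson's reference code: it uses the slides' default constants, and since the relevance counts s and S are usually unknown at query time, the log weight is shown with s = S = 0, which reduces to an idf-like weight log[(N − n + 0.5)/(n + 0.5)]:

```python
import math

def bm25_term_weight(f, q, n, N, L, k1=1.0, k3=1.0, b=0.5):
    """Weight of one query term for one document.
    f: within-document frequency, q: within-query frequency,
    n: documents containing the term, N: collection size,
    L: normalised document length."""
    K = k1 * ((1 - b) + b * L)          # b = 1 -> BM11, b = 0 -> BM15
    # RSJ weight with s = S = 0 (no relevance information available):
    w = math.log((N - n + 0.5) / (n + 0.5))
    return ((k1 + 1) * f / (K + f)) * ((k3 + 1) * q / (k3 + q)) * w

# One term occurring 3 times in a slightly long document, once in the query,
# present in 50 of 1000 documents (all values invented):
print(round(bm25_term_weight(f=3, q=1, n=50, N=1000, L=1.2), 3))
```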
Bayesian Networks

1. Independence assumption: x_1, x_2, …, x_n are assumed independent. Can they be
   dependent?
2. Binary assumption: x_i = 0 or 1. Can it be 0, 1, 2, …, n?

A possible way forward is Bayesian Networks (BN): modeling the joint probability
distribution (JPD) in a compact way by using a graph (or graphs) to reflect conditional
independence relationships.
BN: Representation
• Nodes – random variables
• Arcs – conditional independence (or causality)
• Undirected arcs – MRFs; directed arcs – BNs

BN: Advantages
• An arc from node A to node B implies: A causes B.
• Compact representation of the JPD of the nodes.
• Easier to learn (to fit to data).
Bayesian Network (BN) Basics

A Bayesian Network (BN) is a hybrid system of probability theory and graph theory. Graph
theory provides the user with an interface to model highly interacting variables. On the
other hand, probability theory ensures that the system as a whole is consistent, and
provides ways to interface models to data.
G = the global structure, a DAG (Directed Acyclic Graph) that contains
• a node for each variable x_i ∈ U, i = 1, 2, …, n
• edges representing the probabilistic dependencies between the nodes of the
  variables x_i ∈ U

M = the set of local structures {M_1, M_2, …, M_n}: n mappings for the n variables.
M_i maps each value of {x_i, Par(x_i)} to a parameter θ, where Par(x_i) denotes the set
of parent nodes of x_i in G.

The joint probability distribution over U can be decomposed by the global structure G as

    P(U | B) = Π_i P(x_i | Par(x_i), θ, M_i, G)
BN: An Example (by Kevin Murphy)

Four binary variables: Cloud (C: true/false), Sprinkler (S: on/off), Rain (R: yes/no)
and Wet Grass (W: yes/no). C is a parent of S and R; S and R are parents of W.

    p(C=T) = 1/2, p(C=F) = 1/2

    C   p(S=On)  p(S=Off)        C   p(R=Y)  p(R=N)
    T   0.1      0.9             T   0.8     0.2
    F   0.5      0.5             F   0.2     0.8

    S    R   p(W=Y)  p(W=N)
    On   Y   0.99    0.01
    On   N   0.9     0.1
    Off  Y   0.9     0.1
    Off  N   0.0     1.0

By the chain rule, p(C,S,R,W) = p(C) p(S|C) p(R|C,S) p(W|C,S,R); using the conditional
independences encoded in the graph, p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R).
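The factorization in this example can be computed directly from the conditional probability tables (a sketch; the tables are copied from the slide):

```python
# Joint probability from the sprinkler network:
#   p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)
pC = {'T': 0.5, 'F': 0.5}
pS_C = {'T': {'On': 0.1, 'Off': 0.9}, 'F': {'On': 0.5, 'Off': 0.5}}
pR_C = {'T': {'Y': 0.8, 'N': 0.2}, 'F': {'Y': 0.2, 'N': 0.8}}
pW_SR = {('On', 'Y'): {'Y': 0.99, 'N': 0.01}, ('On', 'N'): {'Y': 0.9, 'N': 0.1},
         ('Off', 'Y'): {'Y': 0.9, 'N': 0.1}, ('Off', 'N'): {'Y': 0.0, 'N': 1.0}}

def joint(c, s, r, w):
    return pC[c] * pS_C[c][s] * pR_C[c][r] * pW_SR[(s, r)][w]

# e.g. cloudy, sprinkler off, raining, grass wet:
print(round(joint('T', 'Off', 'Y', 'Y'), 3))  # 0.324
```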
BN: As a Model
• A hybrid system of probability theory and graph theory.
• Graph theory provides the user with an interface to model highly interacting variables.
• Probability theory provides ways to interface the model to data.
BN: Notations

    B = (Bs, θ)

    Bs : structure of the network
    θ  : set of parameters that encode the local probability distributions

Bs again has two parts, G and M:
    G is the global structure (the DAG),
    M is the set of mappings {x_i, Par(x_i)} for the n variables (the arcs).
BN: Learning

    Bs : structure (DAG)        θ : parameters (CPDs)

Both the structure and the parameters may be known or unknown; the parameters can only
be learnt if the structure is either known or has been learnt earlier.

• Structure known and full data observed (nothing missing): parameter learning by
  MLE or MAP.
• Structure known and full data NOT observed (data missing): parameter learning by EM.
BN: Learning Structure
• Goal: find the best DAG that fits the data D.

Objective function:

    p(Bs | D) = p(D | Bs) p(Bs) / p(D)

    log p(Bs | D) = log p(D | Bs) + log p(Bs) − log p(D)

where log p(D) is constant and independent of Bs.

Searching over structures is NP-hard. Performance criteria used: MIC, BIC.
References (basics)
1. S. E. Robertson, “The Probability Ranking Principle in IR”, Journal of Documentation,
   33, 294–304, 1977.
2. K. Sparck Jones, S. Walker and S. E. Robertson, “A Probabilistic model of information
   retrieval: development and comparative experiments – Part 1”, Information Processing
   and Management, 36, 779–808, 2000.
3. K. Sparck Jones, S. Walker and S. E. Robertson, “A Probabilistic model of information
   retrieval: development and comparative experiments – Part 2”, Information Processing
   and Management, 36, 809–840, 2000.
4. S. E. Robertson and H. Zaragoza, “The Probabilistic relevance framework: BM25 and
   beyond”, Foundations and Trends in Information Retrieval, 3, 333–389, 2009.
5. S. E. Robertson, C. J. van Rijsbergen and M. F. Porter, “Probabilistic models of
   indexing and searching”, in Information Retrieval Research, Oddy et al. (Eds.),
   35–56, 1981.
6. N. Fuhr and C. Buckley, “A probabilistic learning approach for document indexing”,
   ACM Trans. on Information Systems, 9, 223–248, 1991.
7. H. R. Turtle and W. B. Croft, “Evaluation of an inference network-based retrieval
   model”, ACM Trans. on Information Systems, 9, 187–222, 1991.
Thank You
Discussions