Selectivity Estimation using
Probabilistic Models
Author:
Lise Getoor, Ben Taskar, Daphne Koller
Presenter:
Qidan Cheng
Outline
Introduction
Estimation for single tables
SRM: Statistical Relational Models
Selectivity Estimation using SRMs
Learning SRM
Introduction
Accurate estimates of the result size of
queries are crucial to several query
processing components of DBMS.
Cost

based query optimizers: choose
the optimal query execution plan.
Query profilers: predicting resource
consumption and distribution of query
results.
Answer counting queries
Introduction
How to estimate the size of a selection query
over multiple attributes for a single table?
The result size is determined by the joint
frequency distribution of the values of these
attributes.
Size(Q) = R
P
D
(a
1
,
…
,a
k
)
Query:
select * from R
where R.A1 = a1 and
…
and
R.Ak = ak
Introduction
But
…
exponential in # of attributes v
n
,
representing all combination of attribute
values is infeasible.
Attribute value independence assumption:
joint distribution is product of single attribute
distributions
Problem: overestimate or underestimate the
query size.
→ Bayesian Network
Estimation for single table
Example:
Given a simple relation R with three
attributes:
Education (high

school, college,degree)
3 values
Income (low, medium, high)
3 values
Home

Owner (false, true)
2 values
Joint distribution need 18 numbers.
Observation:
Some of the correlations between
attributes might be indirect ones,
mediated by other attributes.
→ Conditional Independence
Estimation for single table
P(H=hE=e,I=i)=P(H=hI=i)
Home

owner is conditionally
independent of Education given
Income.
→ Compact form of the joint distribution
P(H,E,I) = P(E)P(IE)P(HI,E)=P(E)P(IE)P(HI)
Home

owner
Income
Education
Estimation for single table
Figure (b) can encode precisely the same joint
distribution as in Figure (a)
Bayesian Networks
X

Ray
Lung Infiltrates
Sputum Smear
Tuberculosis
Pneumonia
Nodes = random variables
Edges = direct probabilistic
influence
Network structure encodes independence
assumptions:
X

Ray
conditionally
independent of
Pneumonia
given
Infiltrates
Bayesian Networks
X

Ray
Lung Infiltrates
Sputum Smear
Tuberculosis
Pneumonia
Associated with each node
X
i
there is a
conditional probability distribution
P(X
i
Pa
i
:
⤠
—
distribution over
X
i
for each assignment to
parents
0.8
t
0.2
p
t
p
0.6
0.4
0.01
0.99
0.2
0.8
t
p
t
p
T
P
P(I P, T )
BN Semantics
Compact & natural representation:
nodes have
k
parents
2
k
n
vs.
2
n
params
conditional
independencies
in BN structure
+
local
probability
models
full joint
distribution
over domain
=
X
I
S
T
P
BNs for Query Estimation
Query:
select * from R
where R.A
1
= a
1
and
…
and R.A
k
= a
k
P(a
1
,a
2
,
…
a
n
)=
倨P
i
parents(a
i
))
Use Bayesian inference algorithm to compute
P
D
(a
1
,
…
,a
k
)
Algorithm complexity depends on BN
connectivity; efficient in practice
Size(Q) = R
P
D
(a
1
,
…
,a
k
)
Join Selectivity Estimation
Person
Purchase
Uniform Join Assumption
Size(Purchase Person) =  Purchase 
Assuming referential integrity
Naïve Approach
Join Selectivity Estimation
Example query Q:
“
person.income=high and
purchase.type=luxury
”
p = P (person.income=high)
q = P (purchase.type=luxury)
Size
Q
= Purchase*p*q
Problems:
Joining Two Tables
Correlated Attributes
Person
Purchase
Income = high
Income = low
Type = luxury
Type = necessity
The attributes of the two different tables are often
correlated
Skewed Join
Person
Purchase
Income = high
Income = low
Type = luxury
Type = necessity
The probability that two tuples join with each other can
also be correlated with various attributes.
Join Indicator
S
R
Query: select * from R, S
where R.F = S.K
and R.A = a and S.B = b
P(J
F
) = prob. randomly chosen tuple from R
joins with a randomly chosen tuple from S
size(Q) =  R   S  P(J
F
, a, b)
Statistical Relational Models
Model distribution of attributes across
multiple tables
Allow attribute values to depend on
attributes in the same table (like a
BN)
Allow attribute values to depend on
attributes in other tables along a
foreign key join
Can model the join probability of two
tuples using
join indicator variable
Statistical Relational Model
A SRM for a relational database is a
pair(S,
θ
),which specifies a local
probabilistic model for each of the following
variables
A variable R.A for each table R and each
attribute A
刮R
A boolean join indicator variable R.J
F
For each variable R.X
S specifies a set of parents Pa(R.X)
Θ
specifies a CPD P(R.XPa(R.X))
Example SRM
Person
Income
Age
School
Prestige
J
school
Purchase
J
person
Type
Attended
Bought

by
0.9998, 0.0002
Type=necessity
false
true
false
true
0.999, 0.001
Income = high
0.99, 0.01
Universal Foreign Key Closure
Schema: R, S, T.R.F refers
to S, S.F refers to T
stratification: T < S < R
r
s
t
r.F
1
= s.K
s.F
2
= t.K
Schema: R, S
R.F
1
refers to S, R.F
2
refers to S
stratification: S < R
r
r.F
1
= s
1
.K
s
1
s
2
r.F
2
= s
2
.K
Universal Foreign Key Closure
Minimal extension Q
+
to a query Q:
Let Q be a keyjoin query over r
1
,r
2
,
…
r
k
For each r, if there is an attribute R.A
with parent R.F.B where R.F points to S,
then there is a unique tuple variable s
representing the join r.F=s.K
Proposition: Let Q be a query and let
Q
+
be its minimal extension. Then
size
Q
[D]=size
Q+
[D]
Answering Queries Using SRMs
Construct Query Evaluation BN for Query:
select * from Person, Purchase
where Person.id = Purchase.buyer

id
and Person.Income = high and
Purchase.Type=luxury
Person
Purchase
J
person
Income
Type
Age
Prestige
J
school
School
Compute upward closure of query attributes
by including all parents as well
SRM Learning
Learn parameters & qualitative
dependency structure
Extend known techniques for learning
Bayesian networks from data.
Database
Patient
Strain
Contact
Structure selection
Define scoring function:
log

likelihood function
l
(
θ
,S  D)=log P(D  S,
θ
)
Finding the model that has maximum
log

likelihood given data.
Do the greedy local structure search
Parameter Estimation
The model contains a parameter
Θ
ax
for each value a of A and each assignment
of values x to X.
Θ
ax
= P(R.A=a X=x)
Θ
ax
= F
D
(R.A=a,X=x)/F
D
(X=x)
System Architecture
Model
Constructor
Database
offline
Selectivity
Estimator
execution time
Query Q
Size(Q)
Conclusions
SRM is unique in its ability to handle
select and join operators
Estimates the high

dimensional joint
distribution using a set of lower

dimensional conditional distributions
To do:
Incremental maintenance of the SRM
as the database changes
Joins over non

key attributes
Selected Publications
o
“Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D.
Koller and B. Taskar, JMLR 2002.
o
“Probabilistic Models of Text and Link Structure for Hypertext Classification”,
L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond
Classification’, 2001.
o
“Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D.
Koller, SIGMOD

01.
o
“Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller,
and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N.
Lavrac, 2001.
o
see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI

99.
o
“Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman,
D. Koller, and B. Taskar, ICML

01.
o
“From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D.
Koller and N. Friedman, ICML Workshop on Attribute

Value and Relational
Learning: Crossing the Boundaries, 2000.
o
Notes from AAAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2000.
o
Notes from IJCAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2003.
See http://www.cs.umd.edu/~getoor
Comments 0
Log in to post a comment