Selectivity Estimation using
Probabilistic Models
Author:
Lise Getoor, Ben Taskar, Daphne Koller
Presenter:
Qidan Cheng
Outline
Introduction
Estimation for single tables
SRM: Statistical Relational Models
Selectivity Estimation using SRMs
Learning SRM
Introduction
Accurate estimates of the result size of
queries are crucial to several query
processing components of DBMS.
Cost
-
based query optimizers: choose
the optimal query execution plan.
Query profilers: predicting resource
consumption and distribution of query
results.
Answer counting queries
Introduction
How to estimate the size of a selection query
over multiple attributes for a single table?
The result size is determined by the joint
frequency distribution of the values of these
attributes.
Size(Q) = |R|
P
D
(a
1
,
…
,a
k
)
Query:
select * from R
where R.A1 = a1 and
…
and
R.Ak = ak
Introduction
But
…
exponential in # of attributes v
n
,
representing all combination of attribute
values is infeasible.
Attribute value independence assumption:
joint distribution is product of single attribute
distributions
Problem: overestimate or underestimate the
query size.
→ Bayesian Network
Estimation for single table
Example:
Given a simple relation R with three
attributes:
Education (high
-
school, college,degree)
3 values
Income (low, medium, high)
3 values
Home
-
Owner (false, true)
2 values
Joint distribution need 18 numbers.
Observation:
Some of the correlations between
attributes might be indirect ones,
mediated by other attributes.
→ Conditional Independence
Estimation for single table
P(H=h|E=e,I=i)=P(H=h|I=i)
Home
-
owner is conditionally
independent of Education given
Income.
→ Compact form of the joint distribution
P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)
Home
-
owner
Income
Education
Estimation for single table
Figure (b) can encode precisely the same joint
distribution as in Figure (a)
Bayesian Networks
X
-
Ray
Lung Infiltrates
Sputum Smear
Tuberculosis
Pneumonia
Nodes = random variables
Edges = direct probabilistic
influence
Network structure encodes independence
assumptions:
X
-
Ray
conditionally
independent of
Pneumonia
given
Infiltrates
Bayesian Networks
X
-
Ray
Lung Infiltrates
Sputum Smear
Tuberculosis
Pneumonia
Associated with each node
X
i
there is a
conditional probability distribution
P(X
i
|Pa
i
:
⤠
—
distribution over
X
i
for each assignment to
parents
0.8
t
0.2
p
t
p
0.6
0.4
0.01
0.99
0.2
0.8
t
p
t
p
T
P
P(I |P, T )
BN Semantics
Compact & natural representation:
nodes have
k
parents
2
k
n
vs.
2
n
params
conditional
independencies
in BN structure
+
local
probability
models
full joint
distribution
over domain
=
X
I
S
T
P
BNs for Query Estimation
Query:
select * from R
where R.A
1
= a
1
and
…
and R.A
k
= a
k
P(a
1
,a
2
,
…
a
n
)=
倨P
i
|parents(a
i
))
Use Bayesian inference algorithm to compute
P
D
(a
1
,
…
,a
k
)
Algorithm complexity depends on BN
connectivity; efficient in practice
Size(Q) = |R|
P
D
(a
1
,
…
,a
k
)
Join Selectivity Estimation
Person
Purchase
Uniform Join Assumption
Size(Purchase Person) = | Purchase |
Assuming referential integrity
Naïve Approach
Join Selectivity Estimation
Example query Q:
“
person.income=high and
purchase.type=luxury
”
p = P (person.income=high)
q = P (purchase.type=luxury)
Size
Q
= |Purchase|*p*q
Problems:
Joining Two Tables
Correlated Attributes
Person
Purchase
Income = high
Income = low
Type = luxury
Type = necessity
The attributes of the two different tables are often
correlated
Skewed Join
Person
Purchase
Income = high
Income = low
Type = luxury
Type = necessity
The probability that two tuples join with each other can
also be correlated with various attributes.
Join Indicator
S
R
Query: select * from R, S
where R.F = S.K
and R.A = a and S.B = b
P(J
F
) = prob. randomly chosen tuple from R
joins with a randomly chosen tuple from S
size(Q) = | R | | S | P(J
F
, a, b)
Statistical Relational Models
Model distribution of attributes across
multiple tables
Allow attribute values to depend on
attributes in the same table (like a
BN)
Allow attribute values to depend on
attributes in other tables along a
foreign key join
Can model the join probability of two
tuples using
join indicator variable
Statistical Relational Model
A SRM for a relational database is a
pair(S,
θ
),which specifies a local
probabilistic model for each of the following
variables
A variable R.A for each table R and each
attribute A
刮R
A boolean join indicator variable R.J
F
For each variable R.X
S specifies a set of parents Pa(R.X)
Θ
specifies a CPD P(R.X|Pa(R.X))
Example SRM
Person
Income
Age
School
Prestige
J
school
Purchase
J
person
Type
Attended
Bought
-
by
0.9998, 0.0002
Type=necessity
false
true
false
true
0.999, 0.001
Income = high
0.99, 0.01
Universal Foreign Key Closure
Schema: R, S, T.R.F refers
to S, S.F refers to T
stratification: T < S < R
r
s
t
r.F
1
= s.K
s.F
2
= t.K
Schema: R, S
R.F
1
refers to S, R.F
2
refers to S
stratification: S < R
r
r.F
1
= s
1
.K
s
1
s
2
r.F
2
= s
2
.K
Universal Foreign Key Closure
Minimal extension Q
+
to a query Q:
Let Q be a keyjoin query over r
1
,r
2
,
…
r
k
For each r, if there is an attribute R.A
with parent R.F.B where R.F points to S,
then there is a unique tuple variable s
representing the join r.F=s.K
Proposition: Let Q be a query and let
Q
+
be its minimal extension. Then
size
Q
[D]=size
Q+
[D]
Answering Queries Using SRMs
Construct Query Evaluation BN for Query:
select * from Person, Purchase
where Person.id = Purchase.buyer
-
id
and Person.Income = high and
Purchase.Type=luxury
Person
Purchase
J
person
Income
Type
Age
Prestige
J
school
School
Compute upward closure of query attributes
by including all parents as well
SRM Learning
Learn parameters & qualitative
dependency structure
Extend known techniques for learning
Bayesian networks from data.
Database
Patient
Strain
Contact
Structure selection
Define scoring function:
log
-
likelihood function
l
(
θ
,S | D)=log P(D | S,
θ
)
Finding the model that has maximum
log
-
likelihood given data.
Do the greedy local structure search
Parameter Estimation
The model contains a parameter
Θ
a|x
for each value a of A and each assignment
of values x to X.
Θ
a|x
= P(R.A=a |X=x)
Θ
a|x
= F
D
(R.A=a,X=x)/F
D
(X=x)
System Architecture
Model
Constructor
Database
offline
Selectivity
Estimator
execution time
Query Q
Size(Q)
Conclusions
SRM is unique in its ability to handle
select and join operators
Estimates the high
-
dimensional joint
distribution using a set of lower
-
dimensional conditional distributions
To do:
Incremental maintenance of the SRM
as the database changes
Joins over non
-
key attributes
Selected Publications
o
“Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D.
Koller and B. Taskar, JMLR 2002.
o
“Probabilistic Models of Text and Link Structure for Hypertext Classification”,
L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond
Classification’, 2001.
o
“Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D.
Koller, SIGMOD
-
01.
o
“Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller,
and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N.
Lavrac, 2001.
o
see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI
-
99.
o
“Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman,
D. Koller, and B. Taskar, ICML
-
01.
o
“From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D.
Koller and N. Friedman, ICML Workshop on Attribute
-
Value and Relational
Learning: Crossing the Boundaries, 2000.
o
Notes from AAAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2000.
o
Notes from IJCAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2003.
See http://www.cs.umd.edu/~getoor
Enter the password to open this PDF file:
File name:
-
File size:
-
Title:
-
Author:
-
Subject:
-
Keywords:
-
Creation Date:
-
Modification Date:
-
Creator:
-
PDF Producer:
-
PDF Version:
-
Page Count:
-
Preparing document for printing…
0%
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο