Selectivity Estimation using Probabilistic Models - Database and ...

ocelotgiantΤεχνίτη Νοημοσύνη και Ρομποτική

7 Νοε 2013 (πριν από 3 χρόνια και 5 μήνες)

57 εμφανίσεις

Selectivity Estimation using
Probabilistic Models

Author:

Lise Getoor, Ben Taskar, Daphne Koller

Presenter:

Qidan Cheng

Outline


Introduction


Estimation for single tables


SRM: Statistical Relational Models


Selectivity Estimation using SRMs


Learning SRM

Introduction


Accurate estimates of the result size of
queries are crucial to several query
processing components of DBMS.


Cost
-
based query optimizers: choose
the optimal query execution plan.


Query profilers: predicting resource
consumption and distribution of query
results.


Answer counting queries


Introduction


How to estimate the size of a selection query
over multiple attributes for a single table?

















The result size is determined by the joint
frequency distribution of the values of these
attributes.


Size(Q) = |R|


P
D
(a
1
,

,a
k
)

Query:

select * from R



where R.A1 = a1 and


and



R.Ak = ak

Introduction


But

exponential in # of attributes v
n
,
representing all combination of attribute
values is infeasible.


Attribute value independence assumption:
joint distribution is product of single attribute
distributions


Problem: overestimate or underestimate the
query size.








→ Bayesian Network

Estimation for single table


Example:



Given a simple relation R with three

attributes:


Education (high
-
school, college,degree)

3 values


Income (low, medium, high)


3 values


Home
-
Owner (false, true)



2 values



Joint distribution need 18 numbers.


Observation:



Some of the correlations between

attributes might be indirect ones,

mediated by other attributes.






→ Conditional Independence

Estimation for single table


P(H=h|E=e,I=i)=P(H=h|I=i)



Home
-
owner is conditionally

independent of Education given

Income.

→ Compact form of the joint distribution

P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)

Home
-
owner

Income

Education

Estimation for single table


Figure (b) can encode precisely the same joint
distribution as in Figure (a)

Bayesian Networks

X
-
Ray

Lung Infiltrates

Sputum Smear

Tuberculosis

Pneumonia

Nodes = random variables


Edges = direct probabilistic




influence

Network structure encodes independence
assumptions:
X
-
Ray

conditionally
independent of
Pneumonia

given
Infiltrates

Bayesian Networks

X
-
Ray

Lung Infiltrates

Sputum Smear

Tuberculosis

Pneumonia


Associated with each node
X
i

there is a
conditional probability distribution
P(X
i
|Pa
i
:




distribution over
X
i

for each assignment to
parents

0.8

t

0.2

p

t

p

0.6

0.4

0.01

0.99

0.2

0.8

t

p

t

p

T

P

P(I |P, T )

BN Semantics

Compact & natural representation:

nodes have


k

parents


2
k
n

vs.
2
n


params


conditional

independencies

in BN structure

+

local

probability

models

full joint

distribution

over domain

=

X

I

S

T

P

BNs for Query Estimation


Query:

select * from R



where R.A
1

= a
1

and


and R.A
k

= a
k






P(a
1
,a
2
,

a
n
)=


倨P
i
|parents(a
i
))



Use Bayesian inference algorithm to compute
P
D
(a
1
,

,a
k
)



Algorithm complexity depends on BN
connectivity; efficient in practice


Size(Q) = |R|


P
D
(a
1
,

,a
k
)

Join Selectivity Estimation

Person

Purchase

Uniform Join Assumption

Size(Purchase Person) = | Purchase |

Assuming referential integrity

Naïve Approach

Join Selectivity Estimation

Example query Q:



person.income=high and
purchase.type=luxury




p = P (person.income=high)



q = P (purchase.type=luxury)

Size
Q

= |Purchase|*p*q

Problems:



Joining Two Tables

Correlated Attributes

Person

Purchase

Income = high

Income = low

Type = luxury

Type = necessity

The attributes of the two different tables are often
correlated


Skewed Join

Person

Purchase

Income = high

Income = low

Type = luxury

Type = necessity

The probability that two tuples join with each other can
also be correlated with various attributes.

Join Indicator

S

R

Query: select * from R, S


where R.F = S.K


and R.A = a and S.B = b


P(J
F
) = prob. randomly chosen tuple from R
joins with a randomly chosen tuple from S



size(Q) = | R | | S | P(J
F
, a, b)

Statistical Relational Models


Model distribution of attributes across
multiple tables


Allow attribute values to depend on
attributes in the same table (like a
BN)


Allow attribute values to depend on
attributes in other tables along a
foreign key join


Can model the join probability of two
tuples using
join indicator variable




Statistical Relational Model



A SRM for a relational database is a
pair(S,
θ
),which specifies a local
probabilistic model for each of the following
variables


A variable R.A for each table R and each
attribute A

刮R


A boolean join indicator variable R.J
F


For each variable R.X


S specifies a set of parents Pa(R.X)


Θ

specifies a CPD P(R.X|Pa(R.X))


Example SRM

Person

Income

Age

School

Prestige

J
school

Purchase

J
person

Type

Attended

Bought
-
by

0.9998, 0.0002

Type=necessity

false

true

false

true

0.999, 0.001

Income = high

0.99, 0.01

Universal Foreign Key Closure


Schema: R, S, T.R.F refers
to S, S.F refers to T

stratification: T < S < R


r

s

t

r.F
1

= s.K

s.F
2

= t.K


Schema: R, S

R.F
1

refers to S, R.F
2

refers to S

stratification: S < R


r

r.F
1

= s
1
.K

s
1

s
2

r.F
2

= s
2
.K

Universal Foreign Key Closure


Minimal extension Q
+
to a query Q:


Let Q be a keyjoin query over r
1
,r
2
,

r
k



For each r, if there is an attribute R.A
with parent R.F.B where R.F points to S,
then there is a unique tuple variable s
representing the join r.F=s.K



Proposition: Let Q be a query and let
Q
+

be its minimal extension. Then
size
Q
[D]=size
Q+

[D]



Answering Queries Using SRMs


Construct Query Evaluation BN for Query:


select * from Person, Purchase


where Person.id = Purchase.buyer
-
id


and Person.Income = high and
Purchase.Type=luxury

Person

Purchase

J
person

Income

Type

Age

Prestige

J
school

School

Compute upward closure of query attributes

by including all parents as well

SRM Learning


Learn parameters & qualitative
dependency structure


Extend known techniques for learning
Bayesian networks from data.

Database

Patient

Strain

Contact


Structure selection


Define scoring function:


log
-
likelihood function



l
(

θ
,S | D)=log P(D | S,
θ
)




Finding the model that has maximum
log
-
likelihood given data.



Do the greedy local structure search


Parameter Estimation


The model contains a parameter
Θ
a|x


for each value a of A and each assignment
of values x to X.

Θ
a|x

= P(R.A=a |X=x)

Θ
a|x

= F
D
(R.A=a,X=x)/F
D
(X=x)

System Architecture

Model

Constructor

Database

offline

Selectivity

Estimator

execution time


Query Q

Size(Q)

Conclusions


SRM is unique in its ability to handle
select and join operators


Estimates the high
-
dimensional joint
distribution using a set of lower
-
dimensional conditional distributions

To do:


Incremental maintenance of the SRM
as the database changes


Joins over non
-
key attributes


Selected Publications

o
“Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D.
Koller and B. Taskar, JMLR 2002.

o
“Probabilistic Models of Text and Link Structure for Hypertext Classification”,
L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond
Classification’, 2001.

o
“Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D.
Koller, SIGMOD
-
01.

o
“Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller,
and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N.
Lavrac, 2001.

o
see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI
-
99.

o
“Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman,
D. Koller, and B. Taskar, ICML
-
01.

o
“From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D.
Koller and N. Friedman, ICML Workshop on Attribute
-
Value and Relational
Learning: Crossing the Boundaries, 2000.

o
Notes from AAAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2000.

o
Notes from IJCAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2003.

See http://www.cs.umd.edu/~getoor