# Selectivity Estimation using Probabilistic Models - Database and ...

AI and Robotics

Nov 7, 2013 (4 years and 8 months ago)

94 views

Selectivity Estimation using
Probabilistic Models

Author:

Lise Getoor, Ben Taskar, Daphne Koller

Presenter:

Qidan Cheng

Outline

Introduction

Estimation for single tables

SRM: Statistical Relational Models

Selectivity Estimation using SRMs

Learning SRM

Introduction

Accurate estimates of the result size of
queries are crucial to several query
processing components of DBMS.

Cost
-
based query optimizers: choose
the optimal query execution plan.

Query profilers: predicting resource
consumption and distribution of query
results.

Introduction

How to estimate the size of a selection query
over multiple attributes for a single table?

The result size is determined by the joint
frequency distribution of the values of these
attributes.

Size(Q) = |R|

P
D
(a
1
,

,a
k
)

Query:

select * from R

where R.A1 = a1 and

and

R.Ak = ak

Introduction

But

exponential in # of attributes v
n
,
representing all combination of attribute
values is infeasible.

Attribute value independence assumption:
joint distribution is product of single attribute
distributions

Problem: overestimate or underestimate the
query size.

→ Bayesian Network

Estimation for single table

Example:

Given a simple relation R with three

attributes:

Education (high
-
school, college,degree)

3 values

Income (low, medium, high)

3 values

Home
-
Owner (false, true)

2 values

Joint distribution need 18 numbers.

Observation:

Some of the correlations between

attributes might be indirect ones,

mediated by other attributes.

→ Conditional Independence

Estimation for single table

P(H=h|E=e,I=i)=P(H=h|I=i)

Home
-
owner is conditionally

independent of Education given

Income.

→ Compact form of the joint distribution

P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)

Home
-
owner

Income

Education

Estimation for single table

Figure (b) can encode precisely the same joint
distribution as in Figure (a)

Bayesian Networks

X
-
Ray

Lung Infiltrates

Sputum Smear

Tuberculosis

Pneumonia

Nodes = random variables

Edges = direct probabilistic

influence

Network structure encodes independence
assumptions:
X
-
Ray

conditionally
independent of
Pneumonia

given
Infiltrates

Bayesian Networks

X
-
Ray

Lung Infiltrates

Sputum Smear

Tuberculosis

Pneumonia

Associated with each node
X
i

there is a
conditional probability distribution
P(X
i
|Pa
i
:

distribution over
X
i

for each assignment to
parents

0.8

t

0.2

p

t

p

0.6

0.4

0.01

0.99

0.2

0.8

t

p

t

p

T

P

P(I |P, T )

BN Semantics

Compact & natural representation:

nodes have

k

parents


2
k
n

vs.
2
n

params

conditional

independencies

in BN structure

+

local

probability

models

full joint

distribution

over domain

=

X

I

S

T

P

BNs for Query Estimation

Query:

select * from R

where R.A
1

= a
1

and

and R.A
k

= a
k

P(a
1
,a
2
,

a
n
)=

i
|parents(a
i
))

Use Bayesian inference algorithm to compute
P
D
(a
1
,

,a
k
)

Algorithm complexity depends on BN
connectivity; efficient in practice

Size(Q) = |R|

P
D
(a
1
,

,a
k
)

Join Selectivity Estimation

Person

Purchase

Uniform Join Assumption

Size(Purchase Person) = | Purchase |

Assuming referential integrity

Naïve Approach

Join Selectivity Estimation

Example query Q:

person.income=high and
purchase.type=luxury

p = P (person.income=high)

q = P (purchase.type=luxury)

Size
Q

= |Purchase|*p*q

Problems:

Joining Two Tables

Correlated Attributes

Person

Purchase

Income = high

Income = low

Type = luxury

Type = necessity

The attributes of the two different tables are often
correlated

Skewed Join

Person

Purchase

Income = high

Income = low

Type = luxury

Type = necessity

The probability that two tuples join with each other can
also be correlated with various attributes.

Join Indicator

S

R

Query: select * from R, S

where R.F = S.K

and R.A = a and S.B = b

P(J
F
) = prob. randomly chosen tuple from R
joins with a randomly chosen tuple from S

size(Q) = | R | | S | P(J
F
, a, b)

Statistical Relational Models

Model distribution of attributes across
multiple tables

Allow attribute values to depend on
attributes in the same table (like a
BN)

Allow attribute values to depend on
attributes in other tables along a
foreign key join

Can model the join probability of two
tuples using
join indicator variable

Statistical Relational Model

A SRM for a relational database is a
pair(S,
θ
),which specifies a local
probabilistic model for each of the following
variables

A variable R.A for each table R and each
attribute A

A boolean join indicator variable R.J
F

For each variable R.X

S specifies a set of parents Pa(R.X)

Θ

specifies a CPD P(R.X|Pa(R.X))

Example SRM

Person

Income

Age

School

Prestige

J
school

Purchase

J
person

Type

Attended

Bought
-
by

0.9998, 0.0002

Type=necessity

false

true

false

true

0.999, 0.001

Income = high

0.99, 0.01

Universal Foreign Key Closure

Schema: R, S, T.R.F refers
to S, S.F refers to T

stratification: T < S < R

r

s

t

r.F
1

= s.K

s.F
2

= t.K

Schema: R, S

R.F
1

refers to S, R.F
2

refers to S

stratification: S < R

r

r.F
1

= s
1
.K

s
1

s
2

r.F
2

= s
2
.K

Universal Foreign Key Closure

Minimal extension Q
+
to a query Q:

Let Q be a keyjoin query over r
1
,r
2
,

r
k

For each r, if there is an attribute R.A
with parent R.F.B where R.F points to S,
then there is a unique tuple variable s
representing the join r.F=s.K

Proposition: Let Q be a query and let
Q
+

be its minimal extension. Then
size
Q
[D]=size
Q+

[D]

Construct Query Evaluation BN for Query:

select * from Person, Purchase

-
id

and Person.Income = high and
Purchase.Type=luxury

Person

Purchase

J
person

Income

Type

Age

Prestige

J
school

School

Compute upward closure of query attributes

by including all parents as well

SRM Learning

Learn parameters & qualitative
dependency structure

Extend known techniques for learning
Bayesian networks from data.

Database

Patient

Strain

Contact

Structure selection

Define scoring function:

log
-
likelihood function

l
(

θ
,S | D)=log P(D | S,
θ
)

Finding the model that has maximum
log
-
likelihood given data.

Do the greedy local structure search

Parameter Estimation

The model contains a parameter
Θ
a|x

for each value a of A and each assignment
of values x to X.

Θ
a|x

= P(R.A=a |X=x)

Θ
a|x

= F
D
(R.A=a,X=x)/F
D
(X=x)

System Architecture

Model

Constructor

Database

offline

Selectivity

Estimator

execution time

Query Q

Size(Q)

Conclusions

SRM is unique in its ability to handle
select and join operators

Estimates the high
-
dimensional joint
distribution using a set of lower
-
dimensional conditional distributions

To do:

Incremental maintenance of the SRM
as the database changes

Joins over non
-
key attributes

Selected Publications

o
“Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D.
Koller and B. Taskar, JMLR 2002.

o
“Probabilistic Models of Text and Link Structure for Hypertext Classification”,
L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond
Classification’, 2001.

o
“Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D.
Koller, SIGMOD
-
01.

o
“Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller,
and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N.
Lavrac, 2001.

o
see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI
-
99.

o
“Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman,
D. Koller, and B. Taskar, ICML
-
01.

o
“From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D.
Koller and N. Friedman, ICML Workshop on Attribute
-
Value and Relational
Learning: Crossing the Boundaries, 2000.

o
Notes from AAAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2000.

o
Notes from IJCAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2003.

See http://www.cs.umd.edu/~getoor