Data Mining for Query Optimization

sunfloweremryologistData Management

Oct 31, 2013 (4 years and 12 days ago)

121 views

Data Mining for Query Optimization

2

Outline


Semantic Query Optimization


Soft Constraints


Query Optimization via Soft Constraints


Selectivity Estimation via Soft Constraints


3

Semantic Query Optimization


Use integrity constraints associated with a database to rewrite


a query into a form that may be evaluated more efficiently

Some Techniques:


Join Elimination


Predicate Elimination


Join Introduction


Predicate Introduction


Detecting an Empty Answer Set

4

Commercial implementations of SQO

Early Experiences:



Could not spend too much time on optimization


Few integrity constraints are ever defined


Association with deductive databases

Few (if any!)

5

Join elimination: example

select

p_name, p_retailprice, s_name, s_address

from

tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier

where

p_partkey = ps_partkey and s_suppkey = ps_suppkey and



ps_partkey = l_partkey and ps_suppkey = l_suppkey;

RI constraints:

part
-
partsupp (on partkey)



supplier
-
partsupp (on partkey)



partsupp
-
lineitem (on partkey and suppkey)


select

p_name, p_retailprice, s_name, s_address

from

tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier

where

p_partkey = l_partkey and s_suppkey = l_suppkey;




6

Algorithm for join elimination

1. Derive column transitivity classes from the
join predicates in the query

2. Divide the relations in the query that are
related through RI constraints into
removable and non
-
removable

3. Eliminate all removable relations from the
query

4. Add
is not null
predicate to foreign key
columns of all tables whose RI parents were
removed


7

Algorithm for join elimination: example

C.C

PS.S

O.C

S.S

PS.S

O.C

C.C

S.S

PS.S

O.C

9

Performance results for join
elimination

10

Predicate Introduction: Example

select

sum(l_extendedprice * l_discount) as revenue

from

tpcd.lineitem

where

shipdate >date('1994
-
01
-
01');

select

sum(l_extendedprice * l_discount) as revenue

from

tpcd.lineitem

where

shipdate >date('1994
-
01
-
01') and receiptdate >= date('1994
-
01
-
01');

Check constraint:
receiptdate >= shipdate

Clustered Index on
receiptdate

11

Algorithm for Predicate Introduction

N

-

set of predicates derivable from the query and
check constraints



If

N

is inconsistent, stop.


Else, for each predicate A op B in
N
, add it to the
query if:


A or B is a join column


B is a major column of an index


no other index on B’s table can be used in the plan
for the original query

13

Queries

select


100.00 * sum



(case




when p_type like 'PROMO%'




then l_extendedprice * (1
-

l_discount)




else 0



end)



/ sum(l_extendedprice * (1
-

l_discount)) as promo_revenue

from

tpcd.lineitem, tpcd.part

where

l_partkey = p_partkey and



l_shipdate >= date('1998
-
09
-
01') and



l_shipdate < date('1998
-
09
-
01') + 1 month;

Given the check constraint
l_receiptdate >= l_shipdate

we may add

a new predicate to the query:




l_receiptdate >= date(‘1998
-
09
-
01’)

14

Performance Results for Index
Introduction

15

The Culprit

New query plan uses an index, but the original table

scan is still better!

Why did this happen:


incorrect estimate of the filter factor


underestimation of the CPU cost of locking index pages

16

Soft Constraints

17

Soft Constraints

Traditional (“hard”) integrity constraints are
defined to prevent incorrect updates. A
soft
constraint

is a statement that is true about
the current state of the database, but does
not verify updates. In fact, a soft constraint
can be
invalidated

by an update.


18

Soft Constraints (cont.)


Absolute soft constraints



no violation in

the current state of the database


Absolute soft constraints can be used for optimization in exactly
the same way traditional constraints are.



Statistical soft constraints



can have some

(small) degree of violation


Statistical soft constraints can be used for improved selectivity
estimation

19

Implementation of Soft Constraints

In Oracle the standard integrity constraints are
marked with a
rely

option, so that they are
not verified on updates.


In DB2 soft constraints are called
informational constraints.

20

Informational Check Constraint

Example 1:

Create an employee table where a
minimum salary of $25,000 is guaranteed by the
application


CREATE TABLE

emp(empno
INTEGER NOT NULL
PRIMARY KEY,


name
VARCHAR
(20),


firstname
VARCHAR
(20),


salary
INTEGER CONSTRAINT minsalary


CHECK

(salary >= 25000)


NOT ENFORCED


ENABLE QUERY
OPTIMIZATION
);

21

Enforcing Validation

Example 2
: Alter the employee table to start
enforcing the minimum wage of $25,000
using DB2. DB2 will also verify existing
data right away.


ALTER TABLE

emp
ALTER
CONSTRAINT

minsalary
ENFORCED


22

Informational RI Constraint

Example 3:

Create a department table where the
application ensures the existence of departments to
which the employees belong.


CREATE TABLE

dept(deptno
INTEGER NOT NULL PRIMARY
KEY
,


deptName

VARCHAR
(20),


budget
INTEGER
);


ALTER TABLE

emp
ADD COLUMN

dept
INTEGER NOT NULL


CONSTRAINT

dept_exist


REFERENCES

dept


NOT ENFORCED


ENABLE QUERY OPTIMIZATION
);

23

Query Optimization via Empty
Joins

24

Example

select


Model

from


Tickets T, Registration R

where

T.RegNum = R.RegNum and T.date > “1990
-
01
-
01”


and R.Model LIKE “BMW Z3%”

select


Model

from


Tickets T, Registration R

where

T.RegNum = R.RegNum and T.date >
“1997
-
01
-
01”


and R.Model LIKE “BMW Z3%”

First BMW Z3 series cars were made in 1997.

25

Matrix representation of empty joins



A,B
(R

S)

26

Staircase data structure

27

Properties of the algorithm


Time Complexity
O
(nm)


requires a single scan of the sorted data


Space Complexity
O(min(n,m))


only two rows of the matrix need be kept in memory


Scalable with respect to:


number of tuples in the join result


number of discovered empty rectangles


size of the domain of one of the attributes

28

How many empty rectangles are there?

Tests done on 4 pairs of attributes with numerical domain present in
typical joins in a real
-
world workload of a health insurance company.

29

How big are the rectangles?

30

Query rewrite: simple case

select …

from R, S,...

where R.C=S.C and


60<R.A<80 and


20<S.B<80 and...



select …

from R, S,...

where R.C=S.C and


60<R.A<80 and


20<S.B<60

and...

31

Query rewrite: complex case

select



from

R, S,...

where

R.C=S.C and



60<R.A<80 and



20<S.B<80 and...

select



from

R, S,...

where

R.C=S.C and


(… and …) or


(… and …) or


(… and …) or


...

32

Experiment I: Size of the Overlap

33

Experiment 2: Type of Overlap

34

Experiment 3: Number of Empty
Joins Used in Rewrite

35

How much do the rectangles
overlap with queries?

36

Query optimization experiments



real
-
world workload of 26 queries



5 of the queries “qualified” for the rewrite



only simple rewrites were considered



all rewrites led to improved performance

37

Query Cardinality Estimate via
Empty Joins

38

Query Cardinality Estimate via
Empty Joins (SIEQE)


Cardinality estimates crucial for designing
good query evaluation plans


Uniform data distribution (UDA): standard
assumption in database systems


Histograms effective in single dimensions:
too expensive to build and maintain
otherwise

39

The Strategy

Q1

Q2



With UDA, the “density”: 1 tuple/sq unit


Empty joins cover 20% of the area


Adjusted density: 1.25 tuples/sq unit

Cardinality

UDA

SIEQE

Q1

100

62

Q2

100

125

40

Experiments


Number of queries for which the error is less than a given limit

42

Discovery of Check Constraints
and Their Application in DB2


We discover two types of (rules) check constraints:


correlations between attributes over ordered domains


partitioning of attributes

43

Correlations between attributes
over ordered domains

Rules have the form:



Y = bX + a + [emin, emax]





Algorithm


for all tables in the database


for all comparable variable pairs (X and Y) in the table



apply OLS estimation to get the function of the





form: Y = a + bX




calculate the max and min error (or residual)





emax and emin


endfor

endfor

44

Partitioning

Rules have the form
:

If X = a, then Y


[emin, emax]






Algorithm



for

all

tables

in

the

database


for

any

qualifying

variable

pair

(X

and

Y)

in

the

table




calculate

partitions

by

using

GROUP

BY

X

statements



find

the

max

and

min

value

of

Y

for

each

partition


endfor

endfor

45

Experiments in TPC
-
H

Rules

discovered

through

partitioning
:


If

L_LINESTATUS=F,

then

L_SHIPDATE=(
01
/
04
/
1992
,

06
/
17
/
1995
),

m

=

0
.
50

If

L_LINESTATUS=O,

then

L_SHIPDATE=(
06
/
19
/
1995
,

12
/
25
/
1998
),

m

=

0
.
50


TPC
-
H contains the following check constraint:


L_RECEIPTDATE > L_SHIPDATE


Our algorithm discovered the following rule:


L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0.0114.


46

Applications


DBA Wizard


Semantic Query Optimization


Improved Filter Factor Estimates

47

Example


ARRIVAL DATE <= ‘1999
-
06
-
15’
AND

DEPARTURE_DATE >= ‘1999
-
06
-
15’


The

filter factor estimate for the query would be:

ff = ff1 * ff2


Consider a query issued against a hotel database, that requests the number of guests
staying in the hotel on a given date.

If
‘1999
-
06
-
15’

was approximately midway in the date ranges, we would estimate a
quarter of all the guests that came in over the number of years would be in the answer
of the query!

48

Example (cont.)

Assume that the following check constraint was discovered:


DEPARTURE_DATE >= ARRIVAL_DATE + (1 DAY, 5 DAYS)


The original condition in the query predicate can then be changed to:


ARRIVAL_DATE <= ‘1999
-
06
-
15’
AND

ARRIVAL_DATE >= ‘1999
-
06
-
18’

or

ARRIVAL_DATE
BETWEEN

‘1999
-
06
-
15’
AND

‘1999
-
06
-
18’

The filter factor is now estimated to:


ff = (ff1 + ff2

1)

49

Other Research on the Use of Soft
Constraints in Query Optimization

50

Query
-
driven Approach


Built multidimensional histograms based on
query results (Microsoft)


Improve cardinality estimates by looking at
the intermediate query results (IBM)


Both techniques generate statistical soft constraints

51

Data
-
driven Approach


Lots of methods using Bayesian networks to
infer statistical soft constraint


Lots of methods to discover functional
dependencies in data (absolute soft
constraints)


Most recently, BHUNT and CORDS use
sampling to discover soft constraints (IBM)

52

References


Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B.
Schiefer:
Implementation of Two Semantic Query Optimization
Techniques in DB2 Universal Database
. VLDB 1999.


J. Edmonds, J. Gryz, D. Liang, R. Miller:
Mining for Empty Rectangles
in Large Data Sets.
ICDT 2001.


J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte
:
Discovery and Application
of Check Constraints in DB2.
ICDE 2001.


P. Godfrey, J. Gryz, C. Zuzarte:
Exploiting Constraint
-
Like Data
Characterizations in Query Optimization
. SIGMOD 2001.


J. Gryz, D. Liang:
Query Optimization via Empty Joins
. DEXA 2002.


J. Gryz, D. Liang:
Query Cardinality Estimation via Data Mining
. IIS
2004.