Non-Distributive Aggregate Functions

basesprocketData Management

Oct 31, 2013 (3 years and 9 months ago)

87 views

Incremental Maintenance for

Non
-
Distributive Aggregate Functions

work done at IBM Almaden Research Center

Themis Palpanas (U of Toronto)

Richard Sidle

Bobbie Cochrane

Hamid Pirahesh

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

2

Motivation

large amounts of data stored in databases


often times data warehouses are used


consolidate data from many sources


offer more general and descriptive view of data


queried by business intelligence tools and decision support
systems


produce expensive OLAP queries


these OLAP queries have nice properties:


based on same set of tables


perform similar aggregations


UoA, Sep 2002

Themis Palpanas
-

U of Toronto

3

Motivation (cont’d)

can efficiently support such queries with Automatic
Summary Tables (ASTs)


materialized queries defined over a set of base tables


precomputed once, used many times


answer complex queries fast


must maintain ASTs when base tables change


inserts, updates, deletes

4

Motivation (cont’d)

base

tables

AST

insert/update/delete

AST

definition

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

5

Aggregate Functions

characterization of functions wrt insertion and
deletion operations


updates are series of deletions and insertions

distributive aggregate functions


new value computed based on old value and value of
operation


SUM()

non
-
distributive aggregate functions


above property does not hold


STDDEV()


MIN() (because of deletions)

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

6

Problem Statement

given ASTs with aggregate functions


distributive


SUM, COUNT


non
-
distributive


STDDEV, CORRELATION, REGRESSION, MIN/MAX,
XMLAGG, …

when base tables change


incrementally maintain affected ASTs



efficient maintenance of ASTs with


non
-
distributive aggregate functions

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

7

Outline

Current Approach


Our Solution


Experimental Evaluation


Related Work


Conclusions


8

AST

Current Approach

base

tables

insert/update/delete

delta

combine

old and

new values

AST

definition

Propagate

phase

Apply

phase

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

9

Current Approach (cont’d)

works for distributive


SUM, COUNT


does not work for
non
-
distributive


STDDEV, CORRELATION, REGRESSION


MIN/MAX


XMLAGG


need new way to deal with these functions

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

10

Our Solution

selective recomputation


no longer enough to compute
delta


must recompute some aggregation groups


minimize work to be done


choose which groups to recompute


optimize query plan

11

Our Solution (cont’d)

AST

base

tables

insert/update/delete

delta

recompute

affected

groups

combine

old and

new values

Propagate

phase

AST

definition

Apply

phase

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

12

Our Solution (cont’d)

the 5 steps


1.
compute new aggregate values

2.
change column derivation

3.
recompute only affected groups

4.
eliminate unnecessary operations

5.
optimize for special cases

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

13

Initial Query Plan

prop

UDI

LOJ

AST

Query Graph Model (QGM)

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

14

1. Compute New Aggregate Values

compute
delta
for distributive
functions

recompute non
-
distributive
functions

get those values only for
affected groups



duplicate computation for
distributive functions!

prop

UDI

LOJ

AST

AST

LOJ

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

15

2. Change Column Derivation

change column derivation

rewrite phase projects out
unused columns



entire AST gets recomputed!

prop

UDI

LOJ

AST

AST

LOJ

non
-
distributive
only

distributive
only

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

16

2. Change Column Derivation

example AST:


SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary)

FROM employees

GROUP BY dept_id


result of COUNT() computed from old propagate phase

results of MAX() and STDDEV() from AST definition

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

17

3. Recompute Affected Groups

push join predicate
down in AST


only affected groups
are recomputed



special rules for super
-
aggregates


GROUPING SETS


ROLLUP


CUBE

prop

UDI

LOJ

AST

AST*

LOJ

non
-
distributive
only

distributive


only

T1

Tk



J

J

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

18

3. Recompute Affected Groups

special treatment for ASTs with super
-
aggregates


predicates not pushdownable


caution not to compute totals of totals


build special join predicate


ensure correct aggregations


change rewrite rules


allow predicate pushdown through super aggregates


applicable only for special join predicate

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

19

4. Remove Unnecessary Operations


outerjoin not always needed

when changes are only inserts



reroute columns from
propagate phase through AST


remove outerjoin operator



same for updates not
referencing AST grouping
columns and predicates

prop

UDI

LOJ

AST

AST

T1

Tk



J

J

all columns

distributive


only

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

20

4. Remove Unnecessary Operations

example AST:

SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary)

FROM employees

GROUP BY dept_id


modification on base tables:

UPDATE employees SET salary=10 WHERE age>40


outerjoin operation will not be built


update does not refer to grouping column (
dept_id
), and no
predicate in AST refers to updated column (
salary
)


certain that no tuples in AST will be deleted

only STDDEV() will be recomputed


the rest are not affected by changes


UoA, Sep 2002

Themis Palpanas
-

U of Toronto

21

5. Optimize for Special Cases

recomputation step not needed when



only insertions and only MIN/MAX functions


build predicate in apply phase


check if new min/max should replace old values



only deletions referring only to grouping columns of
AST


can only cause entire groups to be deleted


handled in apply phase

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

22

5. Optimize for Special Cases

example AST:

SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salary)

FROM employees

GROUP BY dept_id


modification on base tables:

DELETE FROM employees WHERE dept_id>40


selective recomputation step not needed


deletion refers only to grouping column (
dept_id
)


certain that entire groups will be deleted from AST


no other groups will be affected

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

23

Experimental Evaluation

prototype implementation in IBM DB2 UDB

star schema database


sales of products over 5 year time period


fact table: 10 million tuples

AST with non
-
distributive aggregate function


240,000 tuples

workload simulates nightly updates

1.
add/delete data for first day of month

2.
add/delete data for second day of month

3.
add/delete data for full month

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

24

Experimental Evaluation (cont’d)

workload 1

workload 2

workload 3

incremental

286

294

420

full refresh

699

702

692

deletions require 40
-
60% of full refresh time


workload 1

workload 2

workload 3

incremental

3

n/a

31

full refresh

699

702

692

optimized deletions require 1
-
4% of full refresh time

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

25

Experimental Evaluation (cont’d)

workload 1

workload 2

workload 3

incremental

151

158

180

full refresh

702

702

721

insertions/updates require 20
-
25% of full refresh time

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

26

Related Work

incremental view maintenance



differential refresh algorithms


Lindsay et al. 1986, Blakeley et al. 1986, Qian and
Wiederhold 1991, Ceri and Widom 1991



deferred incremental maintenance


Colby et al. 1996, Salem et al. 2000



views with aggregation


Quass 1996, Mumick et al. 1997

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

27

Conclusions

incremental maintenance for ASTs with non
-
distributive aggregate functions


support MIN/MAX, STDDEV, CORRELATION,
REGRESSION, XMLAGG, …


efficient selective recomputation


recompute only affected groups


optimize query plan


customize for special cases


significant performance improvements

UoA, Sep 2002

Themis Palpanas
-

U of Toronto

28

Future Work

examine use of work areas


temporary storage space


store intermediate values


maintenance without recomputation


STDDEV, MIN/MAX(?), …


very helpful for ASTs defined with super
-
aggregates


ASTs with HAVING clauses


do not know when groups will enter/leave AST