Best Practices to Improve Query Performance in a

muttchessAI and Robotics

Nov 8, 2013 (3 years and 5 months ago)

47 views

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

0

Best Practices to Improve Query Performance in a
Data Warehouse
-

1

Calisto Zuzarte, STSM, IBM, calisto@ca.ibm.com


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

1

Data Warehouse Life Cycle

Database design / Application design


The Warehouse Application architects and Database
Administrators work together to design the queries and schema
before they put the application in production


Database performance layer implementation


In order to meet SLAs, DBAs usual go through some iterations
augmenting the database with performance layer objects and
set up the initial configuration to get good performance


Database tuning operations


During production, with changing requirements and change in
data, there is on
-
going tuning required to keep operations
smooth.

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

2

Motivation

Data warehouse environments characteristics:


Large volumes of data


Millions/Billions of rows involved in some tables


Large Joins


Large Sorts,


Large Aggregations


Many tables involved


Large amount of data rolled
-
in and rolled
-
out


Complex queries


Report Queries


Ad Hoc Queries


It is important to pay attention to query performance

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

3

Objective

Provide recommendations from a DB2
optimizer perspective to improve query
performance through the Data
Warehouse life cycle


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

4

Agenda

SESSION 1

Best Practices


Database Design

Best Practices


Application Design

Best Practices


Configuration and Operations




SESSION 2

Best Practices


Performance Layer



December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

5

Best Practices


Database Design

Parallelism


Inter
-
partition Shared nothing parallelism (DPF)


Intra
-
Query Parallelism (SMP)


Partitioning


Database Partitioning


Table Partitioning


Table (Range) Partitioning


UNION ALL Views


Multi
-
Dimension Clustering


Schema


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

6

Best Practices
-

Parallelism

DPF or SMP or both ?


Database partition feature (DPF) is generally recommended
to achieve parallelism in a data warehouse


Achieves scalability and query performance


SMP (Intra
-
Query Parallelism) is NOT recommended in
concurrent multi
-
user environments with heavy CPU usage


SMP is only recommended


When CPUs are highly under utilized and when DPF is not an
option

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

7

Partitioning (Complimentary Strategies in DB2)

Database Partitioning (DPF)


CREATE TABLE … DISTRIBUTE BY HASH


Key Benefit : Better scalability and performance through
parallelism


Table Partitioning


Table (Range) Partitioning


CREATE TABLE …PARTITION BY RANGE


Key Benefit : Better data management (roll
-
in and roll
-
out of data)



UNION ALL Views


CREATE VIEW V AS (SELECT … FROM F1 UNION ALL … )


Key Benefit : Independent branch optimization


Multidimensional Clustering (MDC)


CREATE TABLE … ORGANIZE BY DIMENSION


Key Benefit : Better query performance through data clustering

“Database Partitioning”

“Distribution Key”

“Table Partitioning”

“Table Partitioning Key”

“UNION ALL branch

Partitioning”

“Cells”, “Blocks”,
“Dimensions”

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

8

Distribute By … Partition By … Organize By ..

CREATE TABLE …


DISTRIBUTE BY HASH


PARTITION BY RANGE


ORGANIZE BY DIMENSION

East

West

East

West

East

West

East

West

East

West

East

West

North

South

North

South

North

South

North

South

North

South

North

South

TS1

TS2

TS1

TS2

TS1

TS2

Jan

Feb

Jan

Feb

Jan

Feb

Database

Partition 1

Database

Partition 2

Database

Partition 3

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

9

Best Practices


DPF Partitioning

Collocate the fact and largest frequently joined dimension


Choose to avoid significant skew on some partitions


Avoid DATE dimension where active transactions for current
date all fall on one database partition (TIMESTAMP is good)


Possibilities for workload isolation for data marts


Different partition groups but common dimension tables


Recommend that dimension tables be replicated (discussed
later)

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

10

Best Practices


Table Partitioning

Recommend partitioning the fact tables

Recommend using the DATE dimension

Works better with application key predicates applied directly

Table (Range) Partitioning


Consider partitioned indexes with V9.7


Choose partitioning based on roll
-
in / roll
-
out granularity

UNION ALL Views


Define view predicates or CHECK Constraints to get branch
elimination with query predicates (with constants only)


Use UNION ALL views only with well designed applications


Dangers of materialization with ad hoc queries


Large number of branches needs time and memory to optimize

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

11

Best Practices


Multidimensional Clustering (MDC)

Recommend defining MDC on the fact table


Guaranteed clustering (Avoids the need to REORG for
clustering)


I/O optimization


Compact indexes (compact, coexists with regular indexes)

Choose dimensions based on query predicates


Recommend the use of 1 to 4 dimensions


Need to ensure dimensions are chosen such that they do not
waste storage

Could choose a finer granularity of Table partitioning range


For example: Table partition range by month, MDC by date

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

12

Star Schema

Product_id

Store_id

Channel_id

Date_id


Amount

Quantity



SALES

Store_id


Region_id



STORE

Date_id


Month_id

Quarter_id

Year_id

TIME

Channel_id




CHANNEL

Product_id


Class_id

Group_id

Family_id

Line_id

Division_id



PRODUCT

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

13

Dimension Hierarchy

Channel

Store

Month

Product

Class

Group

Family

Line

Division

Quarter

Year

Retailer

Sales Fact

Product Dimension

Time Dimension

Store Dimension

Channel Dimension

Level 5

Level 1

Level 2

Level 3

Level 4

Level 0

Date

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

14

Best Practices
-

Schema

Surrogate Keys


As far as possible use application keys themselves


allows predicates to be applied/transferred directly on the fact table


DATE is a good candidate (easier to roll
-
in/roll
-
out and for MDC )

Star Schema / Snowflakes


Separate tables for each dimension hierarchy (snowflake) may
result in a large number of joins


Flattened dimensions may contain a lot of redundancy (space)

Define Columns NOT NULL when appropriate


Many optimizations that are done based on NOT NULL

Define Uniqueness when appropriate


Primary Keys / Unique Constraints / Unique Indexes

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

15

Agenda

SESSION 1

Best Practices


Database Design

Best Practices


Application Design

Best Practices


Configuration and Operations




SESSION 2

Best Practices


Performance Layer



December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

16

Application Considerations
-

Expressions

Use constants instead of expressions in the query


Example


SELECT … WHERE DateCol <= CURRENT DATE


5


Use VALUES(CURRENT DATE


5) to get the constant first and
use it in the query

Avoid expressions on indexed columns


Example


SELECT … WHERE DATECOL


2 DAYS > ‘2009
-
10
-
22’


SELECT … WHERE DATECOL > ‘2009
-
10
-
22’ + 2 DAYS

Similar recommendation with cast functions


Example


SELECT … WHERE INT(CHARCOL) = 2009


SELECT … WHERE CHARCOL = ‘2009’


Note you may lose Errors/Warnings


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

17

Application Considerations


Table Partitioning / MDC

As far as possible put local predicates directly on Table
Partition or MDC dimension columns of the fact table


SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F

WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and


T.Date = ‘2009
-
01
-
15’ and T.KEYCOL= F.TIMEKEYCOL


Simplify if the TIMEKEYCOL is correlated to the TIME values

(For example TIMEKEYCOL= 20090115 for the date ‘2009
-
01
-
15’)



SELECT ... FROM CUSTDIM C, FACT F

WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and


F.TIMEKEYCOL = 20090115

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

18

Application Considerations


Table Partitioning / MDC

Another example … consider

SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F

WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and


T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL



First get the values for MINKEY and MAXKEY




SELECT MIN(KEYCOL) FROM TIMEDIM WHERE YEAR=2009



SELECT MAX(KEYCOL) FROM TIMEDIM WHERE YEAR=2009


Then write the SQL as follows

SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F

WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and


T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL AND


F.TIMEKEYCOL >= MINKEY AND


F.TIMEKEYCOL <= MAXKEY

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

19

Application Considerations


General Recommendations

Avoid repetitions of complex expressions


Use Global Temporary Tables to split a query if it contains
more than about 15 tables and compile time is an issue


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

20

Agenda

SESSION 1

Best Practices


Database Design

Best Practices


Application Design

Best Practices


Configuration and Operations




SESSION 2

Best Practices


Performance Layer



December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

21

Best Practices


Configuration and Operations

Configuration


Database Configuration


DBMS Configuration


Registry Settings

Operations


Collecting Statistics


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

22

Configuration

Optimization Level 5


Avoid multiple bufferpools of the same page size


Configuration thumb rules


BUFFPOOL ~= SHEAPTHRES


SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

23

Registry Variables

DB2_ANTIJOIN=EXTEND


If slow queries have NOT EXISTS, NOT IN predicates

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

24

Registry Variables

DB2_REDUCED_OPTIMIZATION=YES


Set if compile time is an issue

IBM Service may recommend a more complex setting for
example:


DB2_REDUCED_OPTIMIZATION=
10
,15,20
,
00011000….



First Part

: DB2_REDUCED_OPTIMIZATION=
A,B,C


IF more than C joins, then "quick greedy"


ELSE IF more than B joins, then use “greedy”


ELSE IF more than A joins, use reduced “dynamic” strategy.


Second Part

not documented (Mainly intended for setting by
service)


December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

25

Best Practices


Optimization Level 5

BUFFERPOOL~=SHEAPTHRES

DB2_ANTIJOIN=EXTEND

DB2_REDUCED_OPTIMIZATION=YES

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

26

Collecting Statistics

The DB2 Query Optimizer relies on reasonably accurate
statistics to get a good query plans

User runs RUNSTATS when data changes (part of ETL)

Statistics Fabrication (unreliable)


DB2 keeps UPDATE / DELETE / INSERT counters


Fabrication limited to a few statistics


Not enough

Automatic Statistics


Automatically collects statistics on tables in need


Runs in the background as a low priority job

Real Time Statistics


Collects statistics on
-
the
-
fly

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

27

AUTO RUNSTATS

Set Under Automatic Table Maintenance hierarchy


AUTO_RUNSTATS cannot be ON unless
AUTO_TBL_MAINT is ON




Automatic maintenance (AUTO_MAINT) = ON


Automatic database backup (AUTO_DB_BACKUP) = OFF


Automatic table maintenance (AUTO_TBL_MAINT) = ON


Automatic runstats (AUTO_RUNSTATS) = ON


Automatic statement statistics (AUTO_STMT_STATS) = OFF


Automatic statistics profiling (AUTO_STATS_PROF) = OFF


Automatic profile updates (AUTO_PROF_UPD) = OFF


Automatic reorganization (AUTO_REORG) = OFF

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

28

REAL TIME STATISTICS

Set Under Automatic Table Maintenance hierarchy


Real Time Statistics cannot be ON unless AUTO
RUNSTATS is ON


AUTO_RUNSTATS cannot be ON unless
AUTO_TBL_MAINT is ON




Automatic maintenance (AUTO_MAINT) = ON


Automatic database backup (AUTO_DB_BACKUP) = OFF


Automatic table maintenance (AUTO_TBL_MAINT) = ON


Automatic runstats (AUTO_RUNSTATS) = ON


Automatic statement statistics (AUTO_STMT_STATS) = ON


Automatic statistics profiling (AUTO_STATS_PROF) = OFF


Automatic profile updates (AUTO_PROF_UPD) = OFF


Automatic reorganization (AUTO_REORG) = OFF

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

29

Best Practices


RUNSTATS

Distribution Statistics


Collect large Quantile Statistics for Date columns


Collect distribution statistics on columns used in predicates

Index Statistics


Do not collect DETAILED INDEX statistics . Use SAMPLED
DETAILED INDEX statistics instead

Avoid statistics on columns you know will never be used in
predicates or GROUP BY columns

Use TABLESAMPLE option for very large tables and
statistical views

Use RUNSTATS Profiles to store customized invocations

RUNSTATS with ATTACH ?

COMMIT immediately after RUNSTATS of each table

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

30

Collecting Statistics

Automatic RUNSTATS

Real Time Statistics

SAMPLED DETAILED INDEX

TABLESAMPLE

Selective column statistic specification

Use RUNSTATS PROFILES

December 01
-
03, 2009 •Minneapolis, Chicago, Milwaukee

31

Summary

Tips and best practices to improve data warehouse query
performance have been discussed.


Database Design


Application Design


Configuration and Operations

These include key considerations related to :


Parallelism


Partitioning


Schema


Application queries


Configuration

Session 2 will cover the Performance Layer

32

© Copyright IBM Corporation [current year]. All rights reserved.

U.S. Government Users Restricted Rights
-

Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.


THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES
ONLY.


WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE
INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT
PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.


IBM
SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE
RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS
PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR
REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND
CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR
SOFTWARE.


Please update paragraph below for the particular product or family brand trademarks you mention such as
WebSphere, DB2, Maximo, Clearcase, Lotus, etc


IBM, the IBM logo, ibm.com,
[IBM Brand, if trademarked],

and
[IBM Product, if trademarked]

are trademarks or registered
trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these
symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml


If you have mentioned trademarks that are not from IBM, please update and add the following lines:


[Insert any special 3rd party trademark names/attributions here]

Other company, product, or service names may be trademarks or service marks of others.


Disclaimer