Lecture 4
CS

541
Lecture 4
Decision Support System
OLAP introduction, definitionand operations
Cube computation
Intersting OLAP Application: Discovery using OLAP cubes
SQL Extensions for OLAP and analytiics
Lecture 4
CS

541
Decision Support Systems
Decision

support systems
are used to make business
decisions often based on data collected by OLTP systems.
Examples of business decisions:
What product brands would be popular this season?
Who to advertise the ski package deal?
What services to upgrade in the new year.
Examples of data used for making decisions
Transaction retail data
Customer data
Performance data
Lecture 4
CS

541
Decision

Support Systems: Major categories
Data analysis
tasks are simplified by specialized tools and SQL
extensions
Example tasks
For each product category and each region, what were the total sales in
the last quarter and how do they compare with the same quarter last
year
As above, for each product category and each customer category
Statistical analysis
packages (e.g., : S++) can be interfaced with
databases
Data mining
seeks to discover knowledge automatically in the form of
statistical rules and patterns from Large databases.
A
data warehouse
archives information gathered from multiple
sources, and stores it under a unified schema, at a single site.
Important for large businesses which generate data from multiple divisions,
possibly at multiple sites
Data may also be purchased externally
Lecture 4
CS

541
Decision Support Systems: Datawarehouse
A
data warehouse
archives information gathered from multiple
sources, and stores it under a unified schema, at a single site.
Major properties of a data warehouse:
Subject

focused:
focused around a particular subject; irrleevant
data is not included
Integrated and consolidated:
Integrating data from multiple,
heterogeneous sources; issues like data cleaning, consistency are
involved.
historical:
historical data is of more significance than in OLTP dbs.
Rare updates:
Unlike in OLTP DBs, there is no concept of
transactions that update data; data once loaded will only typically be
acessed not updated (only deletions at the end).
Lecture 4
CS

541
Data Analysis and OLAP
Online Analytical Processing (OLAP)
Interactive analysis of data, allowing data to be summarized and
viewed in different ways in an online fashion (with negligible delay)
The object of interest in OLAP is a
cube.
A cube allows data to
be viewed in multiple
dimensions.
An
n

dimensional cube
is defined as a group of k

dimensional
(k<=n)
cuboids
arranged by the dimensions of the data. A
cell
of a cuboid represents an association of a measure m (eg total
sales) with a member of every dimension (product

>group=”toys”, location

>state=”NJ”, time

>year=2003”).
The n

D cuboid is called a
base cuboid
. The top most 0

D
cuboid, which holds the highest

level of summaries, is called the
apex cuboid
.
Lecture 4
CS

541
Cube Lattice
AB
ABCD
ABC
ABD
ACD
BCD
AC
AD
BC
BD
CD
A
B
C
D
all
Cube lattice for a 3

d data cube
Lecture 4
CS

541
Data cube with dimension heirarchy
Location

>state
PA
toys
clothes
cosmetics
2001
2002
2003
NJ
NY
Item

>category
Period

>year
2.5M
item
brand
category
location
city
state
country
period
day
month
year
Lecture 4
CS

541
Modelling of dimensions and measures in a
datawarehouse
Star schema:
A fact table in the middle containing the dimension
keys surrounded by a set of dimensional tables. The fact table
contains the set of measures on which the attributes are to be
aggregated.
Item_key
period_key
location_key
Num_units_sold
Amount_sold
item_key
item_name
Item_type
brand
period_key
day
month
year
location_key
street
city
state
country
Lecture 4
CS

541
Modelling of dimensions and measures in a
datawarehouse
Fact constellation:
Multiple fact tables share dimension tables
Item_key
period_key
location_key
Num_units_sold
Amount_sold
item_key
item_name
Item_type
brand
period_key
day
month
year
dept_key
period_key
location_key
Num_employees
expenses
dept_key
dept_name
dept_head
Dept_function
location_key
street
city
state
country
Lecture 4
CS

541
OLAP Implementation Architecture
multidimensional OLAP (MOLAP)
: use multidimensional arrays
to store data cubes; the advantage is faster performance; the
disadvantage is that data needs to be precomputed limiting the
data that can be looked at; proprietary tachnology. Examples:
Business Objects, Cognos, Hyperion
relational OLAP (ROLAP)
: uses relational database features; can
handle larger amounts of data; the disadvantage is slower
performance. Examples: Microstrategy
hybrid OLAP (HOLAP):
Hybrid systems, which store some
summaries in memory and store the base data and other
summaries in a relational database,
Lecture 4
CS

541
OLAP cube operations
Roll up (drill

up):
summarize data
by climbing up hierarchy or by
reducing the number of dimensions.
Drill down (roll down):
from higher level summary to lower level
summary or detailed data, or introducing new dimensions.
Slice operation:
Selects the dimensions of the cube to be
viewed. Eg view “Sales volume” as a function of “
category ”
by
“
Country
“by “
year”.
Dice operation:
Specifies the values along one or more
dimensions. Eg view “Sales volume” as a function of “
category ”
by “
Country
“by “
year for the year 2004.
Pivot operation:
Reorient the cube by replacing one or more of
the dimensions
Lecture 4
CS

541
OLAP Cube Computation
Top

down Approach:
Start by computing the base cuboid (groupby for which no cube
dimensions are aggregated).A single pass is made over the data,
a record is examined and the appropriate base cell is
incremented.
The remaining groupbys are computed by aggregating over
already computed finer grade groupby. If a groupby can be
computed from one or more possible parent groupbys, the
algorithm uses the parent which is smallest in size.
If a groupby can be computed from one or more possible parent
groupbys, the algorithm uses the parent which is smallest in size.
Lecture 4
CS

541
Top

down approach
AB
ABCD
ABC
ABD
ACD
BCD
AC
AD
BC
BD
CD
A
B
C
D
all
Figure 4: Top

down cube computation
Lecture 4
CS

541
Top

down optimizations: Pipe

sort and pipe

hash
The basic idea of both algorithms is that a minimum spanning
tree should be generated from the original lattice such that the
cost of traversing edges will be minimized. The optimizations for
the costs these algorithms include are:
Cache

results:
This optimization aims at ensuring the results of a
groupby is cached (in memory), so that it can be used by other
group

bys in future.
Amortize

scans:
This optimization amortizes the cost of a disk read
by computing maximum possible number of group

bys, together in
memory.
Share

sorts:
For a sort

based algorithm, this aims at sharing sorting
cost across multiple group

bys.
Share

partitions:
For a hash

based algorithm, when the hash

table
is too large to fit in memory, data is partitioned and aggregated to fit
in memory. This can be achieved by sharing this cost across
multiple group

bys.
Lecture 4
CS

541
Multiway Array Aggregation for MOLAP
Arrays are partitioned into chunks (a small subcube which fits in memory).
If sparse array, use compressed array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times
to visit each cell, and reduces memory access and storage cost.
Objective find the best order to do multi

way aggregation
A
B
29
30
31
32
1
2
3
4
5
9
13
14
15
16
64
63
62
61
48
47
46
45
a1
a0
c3
c2
c1
c 0
b3
b2
b1
b0
a2
a3
C
B
44
28
56
40
24
52
36
20
60
Lecture 4
CS

541
Multiway Array Aggregation for MOLAP
A
B
29
30
31
32
1
2
3
4
5
9
13
14
15
16
64
63
62
61
48
47
46
45
a1
a0
c3
c2
c1
c 0
b3
b2
b1
b0
a2
a3
C
44
28
56
40
24
52
36
20
60
B
After scan {1,2,3,4}:
•
b
0
c
0
chunk is computed
•
a
0
c
0
and a
0
b
0
are not
computed
Lecture 4
CS

541
Multiway Array Aggregation for MOLAP
A
B
29
30
31
32
1
2
3
4
5
9
13
14
15
16
64
63
62
61
48
47
46
45
a1
a0
c3
c2
c1
c 0
b3
b2
b1
b0
a2
a3
C
44
28
56
40
24
52
36
20
60
B
After scan 1

13:
•
a
0
c
0
and b
0
c
0
chunks are
computed
•
a
0
b
0
is not
computed (we will
need to scan 1

49)
We need to keep 4
a

c chunks in
memory
We need to keep 16
a

b chunks in
memory
We need to keep a
single b

c chunk in
memory
Lecture 4
CS

541
Multiway Array Aggregation for MOLAP
Method: the planes should be sorted and computed according to their size
in ascending order.
The proposed scan is optimal if C>B>A
MOLAP cube computation is faster than ROLAP
Limitation of MOLAP: computing well only for a small number of
dimensions
If there are a large number of dimensions use the iceberg cube
computation: process only “dense” chunks
Lecture 4
CS

541
OLAP Cube computation: Bottom

up
approach
Computing only the cuboid cells whose count or other
aggregates satisfying the condition:
HAVING COUNT(*) >=
minsup
The motivation is that we don’t want to compute all cells for the
cube. Only cells that have the
measure
above a certain threshold
is interesting.
Lecture 4
CS

541
Bottom

up cube compuation
BUC (Beyer & Ramakrishnan, SIGMOD’99)
Apriori property:
Aggregate the data, then move to the next level
If
minsup
is not met, stop!
This can be applied to other aggregates like MIN, MAX, SUM by
using the concept of
monotonicity of a query
AB
ABCD
ABC
ABD
ACD
BCD
AC
AD
BC
BD
CD
A
B
C
D
all
Figure 5: Bottom

up cube computation
Lecture 4
CS

541
Monotonicity of a query
A query is monotonic for a cell C in database D if the condition
Q(C) is FALSE imples Q(C’) is FALSE for any subcell C’ of C in
database D.
Suppose we have measure M, with every value of M>=0.
Suppose, we have a query Q()=SUM(M)>1000 and that for a cell
A=a1B=b1, SUM(M1)=600. Q() is FALSE for C. Q() is also
FALSE for any subcell of C. Why? Thus Q is monotonic for C.
However for arbitrary queries involving MIN, MAX, SUM, AVG
and operators (<,=,>) it is NP

hard to determine whetehr a given
query is monotonic at a given cell C.
Lecture 4
CS

541
View monotonicity
A view for a a cell C on set of measures (m1, m2, ..mn} is the set
of values assigned for the measures in the set.
For example, a view on cell C on (avg(sales), MAX(sales)) might
be (50k, 500k).
A query Q() is view monotonic on a view V if for any cell C in any
database D, s.t. V is the view for C, the condition Q is FALSE for
C implies Q is FALSE for all C’ subset of C.
Suppose we have a query COUNT(*)>=100 and
AVG(salesMilk)<=20. and view V { (COUNT()=500,
MIN(salesMilk)=10, MAX(salesMilk)=40),
SUM(salesMilk)=19970}. The query Q is view monotonic for
view V.
Lecture 4
CS

541
Checking for view monotonicity
Suppose we have a query Q in disjunctive normal form
consisting of m conjuncts in J dimensions and K distinct measure
attributes. Then the monotonicity of Q for a given view can be
tested in O(m(J+klogk)) time.
The idea is to reduce the set of constraints to a set of linear
inequalities in terms of COUNT().
Lecture 4
CS

541
Discovery Driven Exploration of OLAP Data
cubes
A business analyst is interested in exploring the data cubes,
looking for regions of anomalies ie where the value is
unexpected.
This is useful in finding problem areas or new opportunities.
For a user to drill down or rollup through the cube to find these
interesting regions is laborious and boring
. An interesting
value may not be so obvious or may lay deep into the data.
Rather, let the system automatically identify such regions and
guide the user discovery process.
Lecture 4
CS

541
Surprising cells
A value in a cell of a data cube is surprising if it is significantly different
from the anticipated value based on a statistical model. This model
computes the anticipated value of a cell in context of its position in the
data cube and combines trends along different dimensions that the cell
belongs to.
Assume Gaussian distribution, Exceptions fall outside 99% probability:
Surprise
Cell
= S, if S >2.5
Surprise
Cell
= 0, if S <2.5
Where
SD
Estimate
Measure
S
Lecture 4
CS

541
Estimating the measure m
The estimate is a function f of contributions of higher

level group

bys.
For example, the estimate for cell at position i on dim A, j on dim
B, k on dim C is computed in terms of the top

level group

by,
group

by A, group

by B, group

by C, group

by Ab, group

by AC,
group

by BC.
Estimate based on trimmed means.
Estimate M for a cell at position (i1,i2,i3,… in) is found by Log M
=
C
G
Where C stands for “contribution”
G is a possible Group

By and
C
None
= mean of all values
C
Dr
Ir
= A
Dr
Ir

C
None
where A
Dr
Ir
= mean over values along Irth member of
dimension Dr
Thus, this denotes how much
A
Dr
Ir
differs from overall
average
C
Dr
Ir
Ds
Is
= A
Dr
Ir
Ds
Is

C
Dr
Ir

C
Ds
rs

C
None
And so on.
Lecture 4
CS

541
Estimating the measure m (contd)
Coefficients corresponding to any group

by G obtained by
subtracting from the average A value all the coefficients from
higher level group

bys than G.
Very hard to compute (short cuts and optimizations presented in
paper)
Assumes logarithms of measures are distributed as Gaussian, all
with the same variance.
These expressions use ‘trimmed’ mean; exclude 25% outliers
Lecture 4
CS

541
Examples of how the estimate is computed
The log of the estimate for (Product=’clothes’, Store=’NJ’,
Year=’2001’) is computed as summation of contributions from
the following cells (C[Product=’clothes’, store=’NJ’],
C[Product=’clothes’, Year=’2001’], C[Store=’NJ’, year=’2001’],
C[Product=’clothes’], C[Store=’NJ’], C[Year=’2001’], C[None] or
C[all] )
C[None]=trimmed mean of the 0

D cuboid =trimmed mean of
measure M for the set of tuples in the base table.
C[Year=’2001’]=trimmed mean of measure M for all the members in
the cell [Year=’2001’] (ie the set of tuples with year=’2001’)
–
C[None]
C[Product=’clothes’, Year=’2001’]=trimmed mean of measure M for
members in the cell [year=’2001’, product=’clothes’

C[Year=’2001’]

C[Product=’clothes’]

C[None]
Lecture 4
CS

541
Computing the standard deviation
Initially tried using root mean square of Actual
–
Estimate. This
provides poor fit for OLAP data
If Poisson distiribution is assumed then the variance is equal to
the mean; but this would underestimate the standard deviation
here.
The technique set variance=(Estimate)
P
Assume Gaussian distribution, using Maximum Likelihood
criterion then the value p must satisfy the equation:
(Measure

Estimate)
2
/(Estimate)
p
log Estimate =
log Estimate
Lecture 4
CS

541
Type of surprise values
SurpriseCell (SelfExp):
as defined earlier
SurpriseCellDrill (InExp):
Maximum surprise of all cells that can
be reached from this cell by drilling down.
SurpriseCellPath (PathExp):
Maximum surprise of all cells that
can be reached from this cell by drilling down a particular
dimension.
To visually depict the degree of exception of each cell, cues such
as background color are used.
Lecture 4
CS

541
Example: Discovery Driven exploration of data
cubes
Lecture 4
CS

541
SQL Extended Aggregation
SQL

92 aggregation quite limited. Many useful aggregates were
not part of the standard SQL.
Data cube
Complex aggregates (median, variance)
binary aggregates (correlation, regression curves)
ranking queries (“assign each student a rank based on the total
marks”
SQL:1999 OLAP extensions provide a variety of aggregation
functions to address above limitations. They are supported by
several databases including Oracle and IBM DB2.
Lecture 4
CS

541
Extended Aggregation in SQL:1999
The
cube
operation computes union of
group by
’s on every subset of
the specified attributes
Consider the query
select
item, location, year,
sum
(
revenues
)
from
sales
group by cube
(
item, location, year
)
This computes the union of eight different groupings of the
sales
relation:
{ (
item, location, year
), (
item, location
),
(
item, year
), (
location, year
),
(
item
), (
location
),
(
year
), ( ) }
where ( ) denotes an empty
group by
list.
For each grouping, the result contains the null value
for attributes not present in the grouping ie “all”.
Lecture 4
CS

541
Extended Aggregation (Cont.)
The function
grouping()
can be applied on an attribute. It returns 1 if the
value is a null value representing “all”, and returns 0 in all other cases.
select
item, location, year,
sum
(
revenues
),
grouping
(item) as item

flag,
grouping
(location) as location

flag,
grouping
(year) as year

flag
from
sales
group by cube
(
item, location, year
)
Can use the function
decode()
in the
select
clause to replace
such nulls by a value such as
all
Decode syntax:
decode (
expression
, search , result [, search , result]... [,
default] )
E.g. replace
item

name
in first query by
decode
(
grouping
(item), 1, ‘all’,
item
)
Lecture 4
CS

541
Extended Aggregation (Cont.)
The
rollup
construct generates group by on every prefix of
specified list of attributes
E.g.
select
item, location, year,
sum
(
revenues
)
from
sales
group by rollup
(
item, location, year
)
Generates union of four groupings:
{ (
item, location, year
), (
item, location
), (
item
), ( ) }
Multiple rollups and cubes can be used in a single group by
clause
Each generates set of group by lists, cross product of sets gives
overall set of group by lists
Lecture 4
CS

541
Analytic Functions
AVG
*
CORR
*
COVAR_POP
*
COVAR_SAMP
*
COUNT
*
CUME_DIST
DENSE_RANK
FIRST
FIRST_VALUE
*
LAG
LAST
LAST_VALUE
*
LEAD
MAX
*
MIN
*
NTILE
PERCENT_RANK
PERCENTILE_CONT
PERCENTILE_DISC
RANK
RATIO_TO_REPORT
REGR_ (Linear Regression) Functions
*
ROW_NUMBER
STDDEV
*
STDDEV_POP
*
STDDEV_SAMP
*
SUM
*
VAR_POP
*
VAR_SAMP
*
VARIANCE
*
Lecture 4
CS

541
Analytic function syntax
The Syntax of analytic functions is:
Analytic

Function(<Argument>,<Argument>,...)
OVER (
<Query

Partition

Clause>
<Order

By

Clause>
<Windowing

Clause>
)
Example:
Running total of the employee salary
SQL> select eid, deptno, salary, sum(salary) OVER (ORDER BY deptno, eid) as
RUNTOTAL from emp;
EID DEPTNO SALARY RUNTOTAL




1 1 50000 50000
3 1 40000 90000
4 2 30000 120000
5 2 60000 180000
2 3 10000 190000
6 3 80000 270000
Lecture 4
CS

541
Ranking
Ranking is done in conjunction with an order by specification.
SQL> select eid, salary, rank() over (order by salary desc) as salRank from
emp;
EID SALARY SALRANK



6 80000 1
5 60000 2
1 50000 3
7 50000 3
3 40000 5
4 30000 6
2 10000 7
An extra
order by
clause is needed to ensure they are in sorted order
Ranking may leave gaps: e.g. 2 employees share the 3
rd
salary rank and next
one has the rank of 5.
dense_rank
does not leave gaps, so next dense rank would be 4
Lecture 4
CS

541
Ranking (Cont.)
Ranking can be done within partition of the data.
SQL> select eid, salary, deptno, rank() over (partition by deptno order by salary
desc) as salRak from emp;
EID SALARY DEPTNO SALRAK




1 50000 1 1
7 50000 1 1
3 40000 1 3
5 60000 2 1
4 30000 2 2
6 80000 3 1
2 10000 3 2
Multiple
rank
clauses can occur in a single
select
clause
Ranking is done
after
applying
group by
clause/aggregation
Lecture 4
CS

541
Ranking (Cont.)
Other ranking functions:
percent_rank
(within partition, if partitioning is done): As an analytic
function, for a row R, PERCENT_RANK calculates the rank of R minus 1,
divided by 1 less than the number of rows being evaluated (the entire query
result set or a partition).
cume_dist
cumulative distribution: CUME_DIST calculates the cumulative
distribution of a value in a group of values. The range of values returned by
CUME_DIST is >0 to <=1. Tie values always evaluate to the same
cumulative distribution value. For a row R, assuming ascending ordering, the
CUME_DIST of R is the number of rows with values lower than or equal to
the value of R, divided by the number of rows being evaluated
row_number
(returns a running serial number to a partition of records) non

deterministic in presence of duplicates)
SQL:1999 permits the user to specify
nulls first
or
nulls last. NULLS
LAST
is the default for ascending order, and
NULLS FIRST
is the
default for descending order
rank
( )
over
(
order by salary
desc nulls last
)
as
s

rank
Lecture 4
CS

541
Ranking (Cont.)
For a given constant n, the ranking the function ntile(n) takes the tuples in each
partition in the specified order, and divides them into n buckets with equal
numbers of tuples.
SQL> select salary, ntile(4) over (order by salary) as quartile from emp;
SALARY QUARTILE


10000 1
30000 1
40000 2
50000 2
50000 3
60000 3
80000 4
Antother example, we can sort employees by salary, and use ntile(3) to find
which range (bottom third, middle third, or top third) each employee is in, and
compute the total salary earned by employees in each range:
select
threetile
,
sum
(
salary
)
from
(
select
salary
,
ntile
(3)
over
(
order by
salary
)
as
threetile
from
emp
)
as
s
group by
threetile
Lecture 4
CS

541
LEAD/LAG FUNCTION
LEAD allows to look at rows coming after the current row and return the
value as part of the current row.
The general syntax of LEAD is shown below:
LEAD (<sql_expr>, <offset>, <default>) OVER (<analytic_clause>)
The syntax of LAG is similar except that the offset for LAG goes into the
previous rows
<sql_expr> is the expression to compute from the leading row.
<offset> is the index of the leading row relative to the current row.
<offset> is a positive integer with default 1.
<default> is the value to return if the <offset> points to a row outside the
partition range.
SELECT deptno, eid, salary,
LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY saaryl
DESC NULLS LAST) NEXT_LOWER_SAL,
LAG(salary, 1, 0) OVER (PARTITION BY dept ORDER BY sal DESC
NULLS LAST) PREV_HIGHER_SAL
FROM emp ORDER BY deptno, salary DESC;
Lecture 4
CS

541
Windowing
E.g.: “Given sales values for each date, calculate for each date the average
of the sales on that day, the previous day, and the next day”
Such
moving average
queries are used to smooth out random variations.
In contrast to group by, the same tuple can exist in multiple windows
Window specification
in SQL:
Ordering of tuples, size of window for each tuple, aggregate function
E.g. given relation
sales(date, value)
select
date,
sum
(
value
)
over
(
order by
date
row between
1
preceding and
1
following
)
from
sales
Examples of other window specifications:
between unbounded preceding and current
range between
10
preceding and current row
All rows with values between current row value
–
10 to current value
range interval
10
day preceding
Not including current row
For ROW type windows the definition is in terms of row numbers before or after the
current row. For RANGE type windows the definition is in terms of values before or
after the current ORDER.
Lecture 4
CS

541
Windowing (Cont.)
Can do windowing within partitions
E.g. Given a relation
transaction
(
account

number, date

time,
value
), where value is positive for a deposit and negative for a
withdrawal
“Find total balance of each account after each transaction on the
account”
select
account

number, date

time
,
sum
(
value
)
over
(
partition by
account

number
order by
date

time
row between current row and unbounded preceding
)
as
balance
from
transaction
order by
account

number, date

time
Lecture 4
CS

541
OTHER Interesting functions
Variance and standard deviation functions
Covariance function
Correlation function
Linear Regression functions:
Slope
Y

intercept
Goodness of fit
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο