Cube Computation and

concretecakeΠολεοδομικά Έργα

29 Νοε 2013 (πριν από 3 χρόνια και 11 μήνες)

160 εμφανίσεις

1

Cube Computation and
Indexes for Data Warehouses

CPS 196.03

Notes 7

2

Processing


ROLAP servers vs. MOLAP servers


Index Structures


Cube computation


What to Materialize?


Algorithms

Client

Client

Warehouse

Source

Source

Source

Query & Analysis

Integration

Metadata

3

ROLAP Server


Relational OLAP Server

relational

DBMS

ROLAP

server

tools

utilities

sale
prodId
date
sum
p1
1
62
p2
1
19
p1
2
48
Special indices, tuning;

Schema is “denormalized”

4

MOLAP Server


Multi
-
Dimensional OLAP Server

multi
-
dimensional

server

M.D. tools

utilities

could also

sit on

relational

DBMS

Product

Date

1 2 3 4

milk

soda

eggs

soap

A

B

Sales

5

MOLAP

Total annual sales

of TV in U.S.A.

Date

Country

sum

sum



TV

VCR

PC

1Qtr

2Qtr

3Qtr

4Qtr

U.S.A

Canada

Mexico

sum

6

MOLAP

A

B

29

30

31

32

1

2

3

4

5

9

13

14

15

16

64

63

62

61

48

47

46

45

a1

a0

c3

c2

c1

c 0

b3

b2

b1

b0

a2

a3

C

44

28

56

40

24

52

36

20

60

B

7

Challenges in MOLAP


Storing large arrays for efficient access


Row
-
major, column major


Chunking


Compressing sparse arrays


Creating array data from data in tables


Efficient techniques for Cube computation

Topics are discussed in the paper for reading

8

Index Structures


Traditional Access Methods


B
-
trees, hash tables, R
-
trees, grids, …


Popular in Warehouses


inverted lists


bit map indexes


join indexes


text indexes

9

Inverted Lists

20
23
18
19
20
21
22
23
25
26
r4
r18
r34
r35
r5
r19
r37
r40
rId
name
age
r4
joe
20
r18
fred
20
r19
sally
21
r34
nancy
20
r35
tom
20
r36
pat
25
r5
dave
21
r41
jeff
26
. . .

age

index

inverted

lists

data

records

10

Using Inverted Lists


Query:


Get people with age = 20 and name = “fred”


List for age = 20:
r4, r18, r34, r35


List for name = “fred”:
r18, r52


Answer is intersection:
r18

11

Bit Maps

20
23
18
19
20
21
22
23
25
26
id
name
age
1
joe
20
2
fred
20
3
sally
21
4
nancy
20
5
tom
20
6
pat
25
7
dave
21
8
jeff
26
. . .

age

index

bit

maps

data

records

1
1
0
1
1
0
0
0
0
0
0
1
0
0
0
1
0
1
1
12

Bitmap Index


Index on a particular column


Each value in the column has a bit vector: bit
-
op is fast


The length of the bit vector: # of records in the base table


The

i
-
th bit is set if the

i
-
th row of the base table has the
value for the indexed column


not suitable for high cardinality domains

Cust
Region
Type
C1
Asia
Retail
C2
Europe
Dealer
C3
Asia
Dealer
C4
America
Retail
C5
Europe
Dealer
RecID
Retail
Dealer
1
1
0
2
0
1
3
0
1
4
1
0
5
0
1
RecID
Asia
Europe
America
1
1
0
0
2
0
1
0
3
1
0
0
4
0
0
1
5
0
1
0
Base table

Index on Region

Index on Type

13

Using Bit Maps


Query:


Get people with age = 20 and name = “fred”


List for age = 20:
1101100000


List for name = “fred”:
0100000001


Answer is intersection:
010000000000


Good if domain cardinality small


Bit vectors can be compressed

14

Join

sale
prodId
storeId
date
amt
p1
c1
1
12
p2
c1
1
11
p1
c3
1
50
p2
c2
1
8
p1
c1
2
44
p1
c2
2
4


“Combine” SALE, PRODUCT relations



In SQL:
SELECT * FROM SALE, PRODUCT WHERE ...

product
id
name
price
p1
bolt
10
p2
nut
5
joinTb
prodId
name
price
storeId
date
amt
p1
bolt
10
c1
1
12
p2
nut
5
c1
1
11
p1
bolt
10
c3
1
50
p2
nut
5
c2
1
8
p1
bolt
10
c1
2
44
p1
bolt
10
c2
2
4
15

Join Indexes

product
id
name
price
jIndex
p1
bolt
10
r1,r3,r5,r6
p2
nut
5
r2,r4
sale
rId
prodId
storeId
date
amt
r1
p1
c1
1
12
r2
p2
c1
1
11
r3
p1
c3
1
50
r4
p2
c2
1
8
r5
p1
c1
2
44
r6
p1
c2
2
4
join index

16

Cube Computation for Data
Warehouses


17

Counting Exercise


How many cuboids are there in a cube?


The full or nothing case


When dimension hierarchies are present


What is the size of each cuboid?

18

Lattice of Cuboids

city, product, date

city, product

city, date

product, date

city

product

date

all

day 2

c1
c2
c3
p1
44
4
p2
c1
c2
c3
p1
12
50
p2
11
8
day 1

c1
c2
c3
p1
56
4
50
p2
11
8
c1
c2
c3
p1
67
12
50
129

19

Dimension Hierarchies

all

state

city

cities
city
state
c1
CA
c2
NY
20

Dimension Hierarchies

city, product

city, product, date

city, date

product, date

city

product

date

all

state, product, date

state, date

state, product

state

not all arcs shown...

21

Efficient Data Cube Computation


Data cube can be viewed as a lattice of cuboids


The bottom
-
most cuboid is the base cuboid


The top
-
most cuboid (apex) contains only one cell


How many cuboids in an n
-
dimensional cube with L
levels?



Materialization of data cube


Materialize
every

(cuboid) (full materialization),
none
(no materialization), or
some (partial materialization)


Selection of which cuboids to materialize


Based on size, sharing, access frequency, etc.

)
1
1
(




n
i
i
L
T
22

Derived Data


Derived Warehouse Data


indexes


aggregates


materialized views (next slide)


When to update derived data?


Incremental vs. refresh

23

Idea of Materialized Views


Define new warehouse tables/arrays

sale
prodId
storeId
date
amt
p1
c1
1
12
p2
c1
1
11
p1
c3
1
50
p2
c2
1
8
p1
c1
2
44
p1
c2
2
4
product
id
name
price
p1
bolt
10
p2
nut
5
joinTb
prodId
name
price
storeId
date
amt
p1
bolt
10
c1
1
12
p2
nut
5
c1
1
11
p1
bolt
10
c3
1
50
p2
nut
5
c2
1
8
p1
bolt
10
c1
2
44
p1
bolt
10
c2
2
4
does not exist

at any source

24

Efficient OLAP Processing


Determine which operations should be performed on available cuboids


Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection


Determine which materialized cuboid(s) should be selected for OLAP:


Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:

1) {year, item_name, city}

2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

Which should be selected to process the query?


Explore indexing structures & compressed vs. dense arrays in MOLAP

25

What to Materialize?


Store in warehouse results useful for
common queries


Example:

day 2

c1
c2
c3
p1
44
4
p2
c1
c2
c3
p1
12
50
p2
11
8
day 1

c1
c2
c3
p1
56
4
50
p2
11
8
c1
c2
c3
p1
67
12
50
c1
p1
110
p2
19
129

. . .

total sales

materialize

26

Materialization Factors


Type/frequency of queries


Query response time


Storage cost


Update cost

Will study a concrete algorithm later

27

Iceberg Cube


Computing only the cuboid cells
whose count or other aggregates
satisfying the condition like


HAVING COUNT(*) >=
minsup


Motivation


Only a small portion of cube cells may be “above the
water’’ in a sparse cube


Only calculate “interesting” cells

data above certain
threshold

28

Challenges in MOLAP


Storing large arrays for efficient access


Row
-
major, column major


Chunking


Compressing sparse arrays


Creating array data from data in tables


Efficient techniques for Cube computation

Topics are discussed in the paper for reading