Data Warehouse Design

separatesnottySoftware and s/w Development

Nov 25, 2013 (3 years and 8 months ago)

94 views

Data Warehouse Design

Enrico Franconi

CS 636

CS 336

2

Implementing a Warehouse


Monitoring
: Sending data from sources


Integrating
: Loading, cleansing,...


Processing
: Query processing, indexing, ...


Managing
: Metadata, Design, ...

CS 336

3

Monitoring


Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news
-
wire, …


How to get data out?


Replication tool


Dump file


Create report


ODBC or third
-
party “wrappers”

CS 336

4

Monitoring Techniques


Periodic snapshots


Database triggers


Log shipping


Data shipping (replication service)


Transaction shipping


Polling (queries to source)


Screen scraping


Application level monitoring

CS 336

5

Monitoring Issues


Frequency


periodic: daily, weekly, …


triggered: on “big” change, lots of changes, ...


Data transformation


convert data to uniform format


remove & add fields
(e.g., add date to get history)


Standards
(e.g., ODBC)


Gateways

CS 336

6

Wrapper

Converts data and queries from one data model to another

Extends query capabilities for sources with limited capabilities

Data

Model

B

Data

Model

A

Queries

Data

Queries

Source

Wrapper

CS 336

7

Wrapper Generation


Solution 1: Hard code for each source


Solution 2: Automatic wrapper generation

Wrapper

Wrapper

Generator

Definition

CS 336

8

Integration


Data Cleaning


Data Loading


Derived Data

Client

Client

Warehouse

Source

Source

Source

Query & Analysis

Integration

Metadata

CS 336

9

Data Integration


Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse


Rule
-
based


Actions


Resolve inconsistencies


Eliminate duplicates


Integrate into warehouse (may not be empty)


Summarize data


Fetch more data from sources (wh updates)


etc.


CS 336

10

Data Cleaning


Find (& remove) duplicate tuples


e.g., Jane Doe vs. Jane Q. Doe


Detect inconsistent, wrong data


Attribute values that don’t match


Patch missing, unreadable data


Insert default values


Notify sources of errors found

CS 336

11

Data Cleaning


Migration (e.g., yen to dollars)


Scrubbing: use domain
-
specific knowledge (e.g., social
security numbers)


Fusion (e.g., mail list, customer merging)




billing DB

service DB

customer1(Joe)

customer2(Joe)

merged_customer(Joe)

CS 336

12

Loading Data in the Warehouse


Incremental vs. refresh


Off
-
line vs. on
-
line


Frequency of loading


At night, 1x a week/month, continuously


Parallel/Partitioned load

CS 336

13

Warehouse Maintenance


Warehouse data


materialized view


Initial loading


View maintenance


Derived Warehouse Data


indexes


aggregates


materialized views


View maintenance





CS 336

14

Materialized Views


Define new warehouse relations using SQL
expressions

sale
prodId
storeId
date
amt
p1
c1
1
12
p2
c1
1
11
p1
c3
1
50
p2
c2
1
8
p1
c1
2
44
p1
c2
2
4
product
id
name
price
p1
bolt
10
p2
nut
5
joinTb
prodId
name
price
storeId
date
amt
p1
bolt
10
c1
1
12
p2
nut
5
c1
1
11
p1
bolt
10
c3
1
50
p2
nut
5
c2
1
8
p1
bolt
10
c1
2
44
p1
bolt
10
c2
2
4
does not exist

at any source

CS 336

15

Differs from Conventional View
Maintenance...


Warehouses may be highly aggregated and
summarized


Warehouse views may be over history of
base data


Process large batch updates


Schema may evolve

CS 336

16

Differs from Conventional View
Maintenance...


Base data doesn’t participate in view
maintenance


Simply reports changes


Loosely coupled


Absence of locking, global transactions


May not be queriable

CS 336

17

Warehouse Maintenance Anomalies


Materialized view maintenance in loosely
coupled, non
-
transactional environment


Simple example

Sales

Comp.

Integrator

Data

Warehouse

Sale(item,clerk)

Emp(clerk,age)

Sold (item,clerk,age)

Sold = Sale Emp

CS 336

18

Warehouse Maintenance Anomalies

1. Insert into
Emp(Mary,25)
, notify integrator

2. Insert into
Sale (Computer,Mary)
, notify integrator

3. (1)


integrator adds
Sale (Mary,25)

4. (2)


integrator adds
(Computer,Mary) Emp

5. View incorrect (duplicate tuple)

Sales

Comp.

Integrator

Data

Warehouse

Sale(item,clerk)

Emp(clerk,age)

Sold (item,clerk,age)

CS 336

19

Maintenance Anomaly
-

Solutions


Incremental update algorithms (ECA,
Strobe, etc.)


Research issues: Self
-
maintainable views


What views are self
-
maintainable


Store auxiliary views so original + auxiliary
views are self
-
maintainable

CS 336

20

Self
-
Maintainability: Examples

Sold(item,clerk,age) =

Sale(item,clerk) Emp(clerk,age)


Inserts into
Emp

If
Emp.clerk

is key and
Sale.clerk

is foreign
key (with ref. int.) then no effect


Inserts into
Sale

Maintain auxiliary view:
Emp
-

clerk,age
(Sold)


Deletes from
Emp

Delete from
Sold

based on
clerk

CS 336

21

Self
-
Maintainability: Examples


Deletes from
Sale

Delete from
Sold

based on
{item,clerk}

Unless age at time of sale is relevant



Auxiliary views for self
-
maintainability


Must themselves be self
-
maintainable


One solution: all source data


But want minimal set






CS 336

22

Partial Self
-
Maintainability


Avoid (but don’t prohibit) going to sources

Sold=Sale(item,clerk) Emp(clerk,age)


Inserts into
Sale


Check if
clerk

already in
Sold
, go to source if
not


Or replicate all clerks over age 30


Or ...

CS 336

23

Warehouse Specification (ideally)

Extractor/

Monitor

Extractor/

Monitor

Extractor/

Monitor

Integrator

Warehouse

...

Metadata

Warehouse

Configuration

Module

View Definitions

Integration

rules

Change

Detection

Requirements

CS 336

24

Processing


ROLAP servers vs. MOLAP servers


Index Structures


What to Materialize?


Algorithms

Client

Client

Warehouse

Source

Source

Source

Query & Analysis

Integration

Metadata

CS 336

25

ROLAP Server


Relational OLAP Server

relational

DBMS

ROLAP

server

tools

utilities

sale
prodId
date
sum
p1
1
62
p2
1
19
p1
2
48
Special indices, tuning;

Schema is “denormalized”

CS 336

26

MOLAP Server


Multi
-
Dimensional OLAP Server

multi
-
dimensional

server

M.D. tools

utilities

could also

sit on

relational

DBMS

Product

Date

1 2 3 4

milk

soda

eggs

soap

A

B

Sales

CS 336

27

Index Structures (sketch)


Traditional Access Methods


B
-
trees, hash tables, R
-
trees, grids, …


Popular in Warehouses


inverted lists


bit map indexes


join indexes


text indexes

CS 336

28

What to Materialize?


Store in warehouse results useful for
common queries


Example:

day 2

c1
c2
c3
p1
44
4
p2
c1
c2
c3
p1
12
50
p2
11
8
day 1

c1
c2
c3
p1
56
4
50
p2
11
8
c1
c2
c3
p1
67
12
50
c1
p1
110
p2
19
129

. . .

total sales

materialize

CS 336

29

Materialization Factors


Type/frequency of queries


Query response time


Storage cost


Update cost

CS 336

30

Cube Aggregates Lattice

city, product, date

city, product

city, date

product, date

city

product

date

all

day 2

c1
c2
c3
p1
44
4
p2
c1
c2
c3
p1
12
50
p2
11
8
day 1

c1
c2
c3
p1
56
4
50
p2
11
8
c1
c2
c3
p1
67
12
50
129

use greedy

algorithm to

decide what

to materialize

CS 336

31

Dimension Hierarchies

all

state

city

cities
city
state
c1
CA
c2
NY
CS 336

32

Dimension Hierarchies

city, product

city, product, date

city, date

product, date

city

product

date

all

state, product, date

state, date

state, product

state

not all arcs shown...

CS 336

33

Interesting Hierarchy

all

years

quarters

months

days

weeks

time
day
week
month
quarter
year
1
1
1
1
2000
2
1
1
1
2000
3
1
1
1
2000
4
1
1
1
2000
5
1
1
1
2000
6
1
1
1
2000
7
1
1
1
2000
8
2
1
1
2000
conceptual

dimension table

CS 336

34

Managing


Metadata


Warehouse Design


Tools

Client

Client

Warehouse

Source

Source

Source

Query & Analysis

Integration

Metadata

CS 336

35

Metadata


Administrative


definition of sources, tools, ...


schemas, dimension hierarchies, …


rules for extraction, cleaning, …


refresh, purging policies


user profiles, access control, ...

CS 336

36

Metadata


Business


business terms & definition


data ownership, charging


Operational


data lineage


data currency (e.g., active, archived, purged)


use stats, error reports, audit trails

CS 336

37

Design Summary


What data is needed?


Where does it come from?


How to clean data?


How to represent in warehouse (schema)?


What to summarize?


What to materialize?


What to index?

CS 336

38

Tools


Development


design & edit: schemas, views, scripts, rules, queries, reports


Planning & Analysis


what
-
if scenarios
(schema changes, refresh rates)
, capacity planning


Warehouse Management


performance monitoring, usage patterns, exception reporting


System & Network Management


measure traffic (sources, warehouse, clients)


Workflow Management


“reliable scripts” for cleaning & analyzing data

CS 336

39

Current State of Industry


Extraction and integration done off
-
line


Usually in large, time
-
consuming, batches


Everything copied at warehouse


Not selective about what is stored


Query benefit vs storage & update cost


Query optimization aimed at OLTP


High throughput instead of fast response


Process whole query before displaying anything

CS 336

40

State of Commercial Practice ...


Connectivity to sources


Apertus


Information Builders


Informix Enterprise Gateway


Oracle Open Connect


CA
-
Ingres gateway


MS ODBC


Platinum InfoHub


Data extract, clean,
transform, refresh


CA
-
Ingres Replicator


ETI
-
Extract


IBM Data Joiner, Data
Propagator


Prism Warehouse manager


SAS Access


Sybase Replication Server


Trinzic InfoPump

CS 336

41

… State of Commercial Practice ...


Multidimensional
Database Engines


Arbor Essbase


Oracle RIR Express


Comshare Commader


SAS System


Warehouse Data Servers


CA
-
Ingres


Oracle 8


RedBrick


Sybase IQ


Informix Dynamic Server


IBM DB2


ROLAP Servers


HP Intelligent Warehouse


Informix Metacube


MicroStrategy DSS Server


Information Advantage Asxys


CS 336

42

… State of Commercial Practice


Query/Reporting
Environments


IBM DataGuide


SAS Access CA Visual Express
Platinum Forest&Trees


Informix ViewPoint


Multidimensional Analysis


Kenan Systems Acumate


Microsoft Excel


Arbor Essbase Analysis server


Cognos PowerPlay


IQ Software IQ/Vision


Lotus 123


SAS OLAP++


Business Objects


Lots and lots of consulting!!

CS 336

43

Future Directions


Better performance


Larger warehouses


Easier to use


What are companies & research labs
working on?

CS 336

44

Research (1)


Incremental Maintenance


Data Consistency


Data Expiration


Recovery


Data Quality


Error Handling (Back Flush)

CS 336

45

Research (2)


Rapid Monitor Construction


Temporal Warehouses


Materialization & Index Selection


Data Fusion


Data Mining


Integration of Text & Relational Data


Conceptual Modelling

CS 336

46

Conclusions


Massive amounts of data and
complexity of queries will push limits
of current warehouses


Need better systems:


easier to use


provide quality information