OLAP Products, Challenges, and Related Technologies

separatesnottySoftware and s/w Development

Nov 25, 2013 (3 years and 6 months ago)

108 views

OLAP

Products, Challenges, and Related Technologies

Agenda


Basic Introduction and Overview



Thoughts on Models and Schemas



Product Challenges



RDBMS Focus on Warehousing



Parting Comments

Some Terms


Data Warehouse


The primary repository for report


Typically a secondary data source created with extracts from
primary data sources


Datamarts


Smaller repositories created from the warehouse


Also small warehouses or summary tables


DSS


Reporting
--

accomplished by any type of reporting tool or
application


Business Intelligence


Same as above

Some More Terms


OLAP


Any type of “live” reporting, as opposed to batch



Query Tools


Requires user to write SQL, may be on
-
line or batch


OLAP Tools


Generates SQL for user based on semantic model, metadata


Datamining Tools


Stand alone applications focused on high end analytics



Difference between ROLAP and MOLAP


Amount of work done in the relational database. And that’s it...


The DSS Value Proposition


Banks and Financial Services


4
-
7% churn rates


Fraud costs billions worldwide


1% default rate = $500 million for $50 billion assets


Cross
-
selling products is worth tens of millions



Telecommunications



25% churn rates; turn over entire customer base in 4 years


Typical Telco loses $100+ million/yr to churn


$300,000+/day in losses



Database Marketing



More targeted mail campaigns can save hundreds of millions



Retail


Inventory Management

The Basic Idea


Warehousing


Collect the data


Who purchased mutual funds in the last 3 years?



Basic Reporting


Analyze data


What is the income distribution of mutual fund buyers?


Who are my most profitable customers?



Advanced Reporting


Predict


What do customers buy in combination?


Who will buy a mutual fund in the next 6 months and why?



The DSS System Life Cycle


Customers “grow” into high end DSS



Most customers struggle to build the
warehouse. Once the warehouse is
in place they progress fairly

rapidly up the “DSS chain”

Warehouse

Simple Reporting

OLAP

Data Mining

Closed Loop DSS

Raw Data

(OLTP, external
data)

Extracts, Load,
Transformation


Customers opt to build
Datamarts because of
the “difficulty” of
building a proper
warehouse


“Actionable DSS”

So How Are Customers Doing?


Customer’s succeed when they know what they
want to know.


Victoria’s Secret Inventory Management


Wal*Mart Pharmacy



Many customers “fail / struggle” at first.


Poor source information


Distracted by technology


Database benchmarks?


Internal politics


Unrealistic scope/timeframes


Attempt to implement inappropriate technology


Database gateways?



Types of Reporting / OLAP Strategies


MD OLAP


Extract data into a cube or MD database and perform analysis
on extracted data


Datamarts


Extract data into a smaller relational repository and perform
analysis on datamart, using some SQL based tool


Structured ROLAP


Build a schema tailored to a ROLAP tool and perform analysis
on the structured schema using SQL based tool


Flexible ROLAP


Perform analysis directly on the warehouse


Require an intelligent SQL engine which is used for more than
simple extractions

Different Approaches to DSS

Warehouse

ROLAP

Raw Data

(OLTP, external
data)

MDDB

Datamart

Structured
Schema

MOLAP/
HOLAP

SQL

MD API

SQL

SQL

Models & Schemas


Dimensional Modeling?


Process of putting a semantic object layer over the physical
schema


Semantic model typically includes dimensions; attributes or
level; facts, metrics or measures; and possibly other objects


Wide degree of variance in products on how closely the physical
structure must resemble the logical presentation layer.



Star
--

Structured


A MDDB stored relationally


Snowflake
--

Normalized


Terrible “new” name for an old concept



The Real World?


TPC
-
D as a good example

Lookup_Geography

geo_key

store_name

market_name

region_name

level

Lookup_Time

time_key

day

month_name

year

level


Lookup_Product

product_key

item_name

class_name

department_name

division_name

level

Fact_Sales

product_key

geo_key

time_key

reg_sls_unit

reg_sls_dollar

pln_sls_unit

pln_sls_dollar

The Original Star Schema

“A relational cube”

Lookup_item

item_id

item_name

class_id

Fact_Sales

item_id

store_id

day

reg_sls_unit

reg_sls_dollar

pln_sls_unit

pln_sls_dollar

Lookup_class

class_id

class_name

department_id

Lookup dept

department_id

department_name

division_id

Lookup_division

division_id

division_name

Lookup_store

store_id

store_name

market_id

Lookup_region

region_id

region_name

Lookup_market

market_id

market_name

region_id

Lookup_day

day

month_id

Lookup_month

month_id

month_name

year

Lookup_year

year

The Snowflake Schema

Lookup_item

item_id

item_name

class_id

class_name

department_id

department_name

division_id

division_name

Fact_Sales

item_id

store_id

day

reg_sls_unit

reg_sls_dollar

cle_sls_unit

cle_sls_dollar

pml_sls_unit

pml_sls_dollar

pln_sls_unit

pln_sls_dollar

Lookup_class

class_id

class_name

department_id

Lookup dept

department_id

department_name

division_id

Lookup_division

division_id

division_name

Lookup_store

store_id

store_name

market_id

market_name

region_id

region_name

Lookup_region

region_id

region_name

Lookup_market

market_id

market_name

region_id

Lookup_day

day

month_id

month_name

year

Lookup_month

month_id

month_name

year

Lookup_year

year

The Star / Snowflake Schema?

A Real Schema: TPC
-
D for example

Part Key

Name

MFGR

Brand

Type

Size

Container

Retail Price

Comment

Part

Supp Key

Name

Address

Nation Key

Phone

Acct Bal

Comment

Supplier

Nation Key

Name

Region Key

Comment

Cust Nation

Region Key

Name

Comment

Cust Region

Part Key

Supp Key

Avail Qty

Supply Cost

Comment

Part Supp

Cust Key

Name

Address

Nation Key

Phone

Acct Bal

Mkt Segment

Comment

Customer

Order Key

Cust Key

Order Status

Total Price

Order Date

Order Priority

Clerk

Ship Priority

Comment

Order

Time Key

Alpha

Year

Month

Week

Day

Order Time

Order Key

Part Key

Supp Key

Line Number

Quantity

Extend Price

Discount

Tax

Return Flag

Line Status

Ship Date

Commit Date

Receipt Date

Ship Instruct

Ship Mode

Comment

Line Item

Time Key

Alpha

Year

Month

Week

Day

Ship Time

Time Key

Alpha

Year

Month

Week

Day

Commit Time

Time Key

Alpha

Year

Month

Week

Day

Receipt Time

Nation Key

Name

Region Key

Comment

Supp Nation

Region Key

Name

Comment

Supp Region

Dimensional Model for TPC
-
D

Nation Key

Name

Region Key

Comment

Cust Nation

Region Key

Name

Comment

Cust Region

Cust Key

Name

Address

Nation Key

Phone

Acct Bal

Mkt Segment

Comment

Customer

Order Key

Cust Key

Order Status

Total Price

Order Date

Order Priority

Clerk

Ship Priority

Comment

Order

Time Key

Alpha

Year

Month

Week

Day

Order Time

Order Key

Part Key

Supp Key

Line Number

Quantity

Extend Price

Discount

Tax

Return Flag

Line Status

Ship Date

Commit Date

Receipt Date

Ship Instruct

Ship Mode

Comment

Line Item

Orders

Cust Region

Cust Nation

Customer

Mkt Segment

Order

Order Date

Order Priority

Ship Priority

Clerk

Order Status

Line Item

Commit Date

Ship Date

Line Status

Receipt Date

Return Flag

Ship Instruct

Ship Mode

Part Key

Supp Key

Logical Business Model for the Order Dimension

Modeling Conclusions


Seem complicated?



Modeling data is fairly simple if the data and the
capabilities/requirement of the tools are well
understood.



Not all tools are created equally so often many
data transformations must occur to achieve
desired results.



Real world data is rarely “cube” or “star” like.


Caveat...

Industry Benchmarks
-

Comparison


There exist interesting differences between the
two DSS benchmarks: APB
-

1 and TPC
-
D



APB
-

1 (built by the OLAP council)


Is a basic budgeting application


Contains no many to many relationships


Contains “clean” dimensions


Is very “star” and “cube” like



TPC
-

D (built by the RDBMS community)


Is a basic order entry system


Contains facts with different dimensional keys


Is relatively normalized


Contains cross dimensional attributes relationships


Contains “table
-
less” dimensions

The Tough Problems


Handling Large Volumes



Working with Complex / Varied Data Structures



Performing Advanced Calculations
--

Efficiently




Also called “Depth
-

Breadth
-

Reporting Range”




State of Technology


Vendors
-

Database (good)


Database engines add more scalability and flexibility



Vendors
-

OLAP (bad)


Continue to focus on making simple problems simpler


Basing solutions on too many assumptions


Working to confuse market
--

ROLAP, HOLAP, MOLAP


Working with inherently limited architectures


Not utilizing underling RDBMS capabilities


Working within fixed database schemas



Net Result (still much room for improvement)


Vendors failing to solve customers true needs


The market is pushing back to datamarts


Market is living with simpler reporting
--

which may not be bad

Large Systems?


>> 50
-
100+ Gigabytes of Raw Data



Customers


Want central data warehouse


Find large systems difficult to build and maintain


Have data in a variety of structures (table formats)



OLAP Vendors


Advocate storing subsets of data in different structure


Build proprietary MDDB


Push for datamarts



ROLAP / RDBMS


Push for less data restructuring


Design for less data movement

RDBMS Vendors

Fortunately, with advances in RDBMS technology, ROLAP is
increasingly recognized as the best approach



Key Enhancements
-

System


Better support for large systems


Partitioning (a.k.a. segmentation, AKA fragmentation)


Hash and bitmap joins, hash and bitmap index technology


Parallel and clustered processing



Key Enhancements
--

Function


Temporary table support


Derived table support


Outer Joins


“OLAP” functions

OLAP Demands On An RDMBS


Ability to efficiently perform


Joins and


Aggregations


Row restrictions
--

filters



Ability to generate Counts and Sums



Ability to perform iterative calculations and filters


Temporary Tables


Derived Tables or Table Expressions


The above gets you 80 percent of the way there, except for
Ranks, Cumulative Sums, Moving Sums

Example Analysis


DSS Question

“Show me customer revenue & customers’ percent
contribution (customer rev / total rev), only for
those customers who contributed more than 1% to
total revenue”



Popular OLAP Approach


Fetch revenue data for each Customer into OLAP Server


Calculate percent to total revenue for each Customer


Restrict result set to those Customers whose Contribution
is greater than 1%

Example Analysis (2)


Pure ROLAP Approach


select Customer, Sum(Revenue) as REV

into Temp1

from Customer_Fact

group by Customer

select Sum(REV) as TOT_REV

into Temp2

from Temp1

select Temp1.Customer, Temp1.REV/Temp2.TOT_REV as CONT

from Temp1, Temp2

where Temp1.REV/Temp2.TOT_REV >= .01



SQL Extensions


Temporary Table
--

Declared Local Tables (ANSI ‘92)


Derived Tables
--

Selects in FROM clause (ANSI ‘92)

Example Analysis (3)


ROLAP using Table Expressions



select

Temp1.Customer,



Temp1.REV/Temp2.TOT_REV as CONT


from

(select Customer, Sum(Revenue) as REV



from Customer_Fact



group by Customer)
as Temp1
,





(select Sum(Revenue) as TOT_REV



from Custom_Fact)
as Temp2


where Temp1.REV/Temp2.TOT_REV >= .01




Either Implementation is known as Multi
-
pass SQL

So What About the Other 20%?


How do you calculate Ranking, Moving Sums,
and Cumulative Sums?



Currently OLAP tools must do this on their own.



RDBMS vendors begin to add support for this.


Teradata and Red Brick have commercial implementations.


Proposal put forth by Oracle and IBM for ANSI SQL ‘99.

(just approved)


Few RDBMS vendors have incorporated


Aggregate Navigation


Aggregate Navigation involves two parts


Materialized View support


Query rewrite capabilities



Materialized Views


A “Summary Table” defined as a view


Additional properties telling the database how to update the
view


An advanced type of index



Query Rewrite


The ability for the optimizer to redirect a query to a “higher”
materialized view based on group by and where clause
evaluation

Query Rewrite Example

Base_Sales

store_id

sls_unit

sls_dollar


Aggregate_Sales

region_id

sls_unit

sls_dollar


Base Table

Materialized View

Select

Region,


Sls_unit

from

Aggregate_Sales

Select

Region,


Sum(Sls_unit)

from

Base_Sales

group

by Region

Query Rewrite

OLAP & RDBMS




How does all this affect OLAP tools?


As RDBMS vendors add more functionality
--

OLAP tools must
become smarter in terms of generating SQL



ROLAP does not replace OLAP tools, the tools must work
together more intelligently



It lessens the appeal of MD OLAP solutions




Database Technologies

Who are the “leaders”?



By market share...


International Data Corp
--

$9.7 billion market


40.4% Oracle with $3.93 billion


17.8% UDB with $1.73 billion


5.7% Informix


5.1% Microsoft


4.4% Sybase



Dataquest Inc


32.3% UDB


29.3% Oracle


10.2% Microsoft


4.4% Informix


3.5% Sybase

Database Technologies (cont.)

Who are the “leaders”?



By benchmarks...


See TPC
-
D results






Another Tangent

--

Database Gateways?


Cohera, ISG Navigator, IBM’s Data Joiner


OLAP Technologies

Who are the “leaders”?



By Market Share


“The OLAP Report”
-

$2+ billion



34% Hyperion Solutions Inc. (merged with Arbor)


17% Oracle Express (from 21% slipping)


9.6% Cognos (slipping slightly)


6.4% MicroStrategy (up from 4.5% and rising)





Parting Comments


Customers need

...


More flexible OLAP tools


RDBMS optimized for DSS



Sophisticated problems


Limitations of Multidimensional Model?


Large volumes


Schema support


Management of the environment


True ROLAP calculations
--

minimize data movement


HOLAP is not necessarily ROLAP



DSS is becoming mission critical


Systems need to ensure success and availability.



The End

Some further detail on schemas

Lookup_Geography

geo_key

geo_name

store_id

market_id

region_id

level

Star #1

GEO_KEY
GEO_NAME
STORE_ID
MARKET_ID
REGION_ID
LEVEL
1001
Boston
101
20
1
1
1002
Greenwich
102
20
1
1
1003
Providence
103
20
1
1
1004
Baltimore
104
10
1
1
1005
Philadelphia
105
10
1
1
1006
Charlotte
106
30
2
1
1007
Durham
107
30
2
1
1008
Greenville
108
30
2
1
1009
Atlanta
109
40
2
1
1010
Fayetteville
110
40
2
1
1011
Mid-Atlantic
10
1
2
1012
New England
20
1
2
1013
Carolinas
30
2
2
1014
Deep South
40
2
2
1015
Northeast
1
3
1016
South
2
3
A Sample Star Lookup Table


Star Schema lookup tables hold all of the elements within a dimension in one physical
lookup table.


Each dimensional lookup table will have a single column primary key, that is unique within
the dimension, regardless of the attribute.


Each dimension lookup table will include a ‘level’ field which indicates the attribute level.

Fact_Sales

product_key

geo_key

time_key

reg_sls_unit

reg_sls_dollar

cle_sls_unit

cle_sls_dollar

pml_sls_unit

pml_sls_dollar

pln_sls_unit

pln_sls_dollar

Two types of Star Schemas


Atomic data only


Consolidated

Original Star Fact Tables


Atomic data only


Base tables contain only one level of data (per table).


No ‘in
-
table’ aggregation.


Consolidated


Base tables contain base table data as well as aggregate data for
every

possible level of aggregation


‘In
-
table’ aggregation = storing aggregate data in the same table as
atomic level data, for example, storing store, market, and region level
information within the same fact table.

REGION_ID
REGION_NAM
E
1
Northeast
2
South
MARKET_ID
MARKET_NAME
REGION_ID
10
Mid-Atlantic
1
20
New England
1
30
Carolinas
2
40
Deep South
2
STORE_ID
STORE_NAME
MARKET_ID
REGION_ID
101
Boston
20
1
102
Greenwich
20
1
103
Providence
20
1
104
Baltimore
10
1
105
Philadelphia
10
1
106
Charlotte
30
2
107
Durham
30
2
108
Greenville
30
2
109
Atlanta
40
2
110
Fayetteville
40
2
Lookup_store

store_id

store_name

market_id

region_id

Lookup_region

region_id

region_name

Lookup_market

market_id

market_name

region_id

A Sample Snowflake Lookup


The snowflake design typically has one physical lookup table per attribute, with each
attribute identified by a unique key and having its own description column.


Attributes are related to each other by including foreign key columns in attribute
lookup tables, as region_id is stored in the Lookup_Market table.