Research Report on Data Warehousing

previousdankishSoftware and s/w Development

Nov 25, 2013 (3 years and 8 months ago)

88 views


Research Report on Data
Warehousing

IS 2200 Red Team


Chandler Ardis, Danita Austin, Amy Griffith, Kenny Griffith, Kaitlyn Pack, Sarah Roberts

12/2/2011










1


Table of Contents




Title Page























2

Table of Contents



















3

Executive Summary




















4

Data Mining






















4

Business Uses of Data Mining and Warehousing









10

Cost of Data Warehousing






















13

Summary of Report Findings



















15

References


16





2


Executive Summary


Data
warehousing is the technology used to store data and have the ability to retrieve this data
easily and quickly. It is characterized by subject
-
oriented, integrated, non
-
volatile, time
-
variant,
accessible, and process
-
oriented. Data mining is different from

data warehousing in the fact
that it uses pattern recognition to identify trends in the data set. The data sources are input
from operational systems and flat files, comprised in the warehouse in metadata, summary
data, and raw data, and the users are abl
e to analyze, report, and mine the inputted data.
Metadata is main data and other structures used to generate data warehouses. Summary data
is comprised aggregated data from facts and possibly dimension tables. Raw data is
unprocessed and unformatted and i
s useless until processed manually or by machine. When
creating data warehouses, it is important to understand what you want the data warehouse to
be used for so that when setting it up, you know what it needs to be able to do when done. The
main steps whe
n creating data warehousing are: gather and analyze existing informational
capabilities, identify significant business questions and key metrics from end users, breakdown
the metrics into their component parts with specific definitions, and map the compone
nt parts
to the informational model and systems of record. ETL, extract, transform, and load, is a key
component of data warehousing because this is how someone sifts through data and records it
in the warehouse. It is used more and more in big businesses
where large volumes of data are
needing to be documented and retrievable. There are multiple sources for programs to employ
data warehouses in businesses and multiple ways in which someone pays for the services. Data
warehousing seems to be getting more po
pular as the understanding of their capabilities are
increasing, because in all reality, they are very useful and practical in not only the business
world, but with anything that has technology as a source of data.


Data Warehousing


The data warehousing
technology has been embraced by medium to large companies and it
satisfies strategic corporate data requirements.

It has proven to be useful to business leaders
and managers who would not want to run a business without it.

Data warehouses are software
pr
ograms that are used to store large volumes of data and can run queries and reports from
that data.

Data warehousing can extract and store data to allow easier reporting.

What Is Data Warehousing?


Data warehousing is a subject
-
orientated, integrated, tim
e
-
variant and non
-
volatile collection
of data in support of management’s decision making process, as defined by Bill Inmon in
1990.

Bill Inmon was a computer scientist; that was credited for creating data
warehousing.

The data warehouse has five characte
ristics, subject
-
oriented, integrated, non
-
volatile, time variant, accessible and process
-
oriented.

These five characteristics are defined
below.

Subject
-
Oriented

Subject
-
Oriented means that the data is manipulated to provide specific information about a
3


subject or area of interest.

Data warehouses are designed to help the user analyze data.

Integrated

Integrated means the data from different sources is gathered together to form one coherent
source.

The data warehouse will contain a variety of
information from different sources.

Non
-
Volatile

Non
-
Volatile means that once the information is entered into the warehouse it will not change,
even when new information is entered into the warehouse from day to day operations of the
business.

This charac
teristic allows for a consistent picture of the business.

Time
-
Variant

The time
-
variant function of the warehouse allows for the data to be identified with time
periods.

Historical data is an important function of the data warehouse.

Accessible

The data w
arehouse must be accessible to provide information to end
-
users.

Process
-
Oriented

Data warehousing is a process in delivering information and must have ongoing maintenance.

Basic Data Warehouse Architecture

The figure below shows a simple architecture for
a data warehouse.

On one side you have the
data sources from operational systems and flat files loaded into the warehouse software.

On
the other side you have the users where analysis, reporting and data mining are being
performed.

In the middle you ha
ve the components of the warehouse, metadata, summary data
and raw data.














4


Architecture of a Data Warehouse



Warehouse Nucleus

Metadata

Metadata describes data and other structures, such as objects, business rules, and processes.
For example,
the schema design of a data warehouse is typically stored in a repository as
metadata, which is used to generate scripts used to build and populate the data warehouse. A
repository contains metadata.

For example, metadata is used to generate and populate

the
data warehouse with information, definitions of tables, columns, business rules, or discounts of
10 percent after selling 1,000 items.

Summary Data

A pre
-
computed table comprising aggregated or joined data from fact and possibly dimensions
tables.

Raw

Data

Raw data is data that has not been processed or formatted in a presentable form. Raw data is
meaningless and is processed manually or by a machine.

Data Warehouse Creation


5


Data Warehouse Sources

To create a data warehouse the following sources are
needed, the current on
-
line transaction
processing system, which holds the day to day information about the business.

Historical data
from prior periods is also needed to create the warehouse.

Data Warehouse Creation

Creating a data warehouse has to done
with great care to ensure that design supports the types
of queries the warehouse will be used for.

The creator must understand the database schema
to be created and have interaction with the anticipated users of the database.

The database
must be modifi
ed several times before the model will be stabilized.

Stabilizing the database
must be done carefully, because once the database is populated with large amounts of
information which may be hard to recreate, the database cannot be easily changed.

Informati
onal Requirements



Gather and analyze existing informational capabilities



Identify significant business questions and key metrics from end users.



Breakdown the metrics into their component parts with specific definitions



Map the component parts to the
informational model and systems of record.


First Increment of Information

The warehouse architecture is initially developed, and then the first increment is developed
based on the architecture.

The first increment will cause architectural changes and ope
ration
of the warehouse will cause architectural changes.

Once you have created your first data warehouse increment, the following should be
considered for the additional increments that are added.



Start with one subject area and one target user group.



Con
tinue and add subject areas, user groups and informational capabilities to the
architecture based on the organization’s requirements for information.



Improvements are made from what was learned from previous increments.



Improvements are made from what was
learned about warehouse operation and
support.



The end user requirements are reanalyzed after each improvement.


Extract, Transform and Load (ETL)

6



The warehouse is populated through extracting the data from the source environment, and
then changing the da
ta to have the warehouse characteristics, such as subject
-
orientation.

Lastly, load the data into the software.

Data Modifications

Data warehouses are updated regularly through the ETL process.

Updates are usually done at
night or weekly.

Bulk data modi
fication techniques are used which means the end users would
not update the data warehouse.

Data warehouses usually store many months or years of data
to support historical analysis.

This is why the warehouse should go through a data cleansing
process.

D
ata Cleansing

Data cleansings are done during the ETL process; it is also called data scrubbing or data quality
assurance.

A data warehouse that has incorrect information is useless and can be dangerous to
the companies’ decision
-
making.

Once the warehous
e is built it will provide the company with a competitive advantage and
return on investment.

A tool that helps search for patterns in data warehouses is data mining,
it helps ask questions that a company may have not thought to ask.

Data Mining


Sometime
s data warehousing and data mining is thought to be the same in the business
world.

Data mining is differs because of the system designs, methodology used and the
purpose.

Data mining uses pattern recognition to identify trends within a data set.

Usuall
y a
statistical analysis software package is used to identify specific patterns based on the data set
and queries the end users uses.

Data mining is used to target marketing programs, financial
fraud and to identify unusual patterns in behavior for securi
ty reviews.

Key Properties of Data Mining

The key properties of data mining are; automatic discovery of patterns, prediction of likely
outcomes, creation of actionable information and focus on large data sets and databases.

Building Models (Automatic Disco
very)

A data mining model uses an algorithm to act on a set of data.

The models can be used to mine
the data from which they are built from.

The process of applying a model to data is called
scoring.

Data mining algorithms often require large data sets
for the creation of quality models.

Prediction

Data Mining can be predictive, predication probabilities are also known as confidence.

7


Actionable Information

Data mining can pull information that can for example help a town planner use a model that
can pre
dict income based on demographics to develop a plan for low
-
income housing.

Data Mining Process


The figure below shows the data mining process.

The results of data mining trigger new
business questions, which can help develop more specific models.



Problem Definition

This phase focuses on the understanding of the objectives and requirements.

You can develop
an preliminary implementation plan once you have decided on your business perspective.

For
instance, if you needed to know who is will purchase

a product.

The data on who has bought
the product in the past must be assembled.

Data Gathering and Preparation

This phase involves collecting the data and checking to see if it addresses the business
problem.

Data quality issues and data patterns can a
lso be found and corrected during this
stage.

Data preparation can significantly improve the information that can be discovered
8


through data mining.

Model Building and Evaluation

In this phase, the selection and application of modeling techniques are put
together to create
the best values.

When building the model, it should be created in small sets of data, because the
final table may contain thousands or millions of cases.

At this phase the model needs to be
evaluated to make sure if satisfies the origi
nal business goal.


Knowledge Deployment

This phase is the use of data mining in a target environment.

Insight and actionable
information can be derived from the data.

This phase involves scoring, which is the application
of models to new data, the extr
action of model details, data warehouse infrastructure, or query
and reporting tools.


Business Uses of Data Mining and Warehousing



Data warehousing and data mining are fairly new technologies that are helping companies
worldwide to find and use
data quickly and efficiently. Businesses that use this new
technology are more organized in their decision making and are resourceful because they are
able to organize and present data in a variety of different ways. The reason for doing this is to
accommo
date the diversity of their customers and their needs.





According to one article, data mining is useful in,“customer relationship management
(Drew et al. 2001; Garver 2002), financial analysis (Cowan 2002; Hui and Weigend 2004),
credit card
management (Adams et al. 2001), health service debt management (Zurada and
Lonial 2005), banking (Sung et al. 1999), insurance (Smith et al. 2000), and many other areas
of business involving services” (Data Mining in Business Services). Having a data wareh
ouse
is becoming an increasingly popular resource to have in businesses because of its vast
knowledge of information.

Data Resources



An example of data mining in a large company would be Washington Radio Reports. WRR
(
www.wrronline.com
) is a small company in Pennsylvania that mines multiple data sources
to build a massive data warehouse of radio frequencies. Data from the Federal
Communications Commission (
www.fcc.gov
), Industry Canada
(
http://www.ic.gc.ca/eic/site/sd
-
sd.nsf/eng/home
), US Postal Service zip codes, Canadian
zip codes and many others are integrated into the data wareho
use. Then, subscribers pay
Washington Radio Reports to utilize their search tools that access this wealth of information.


Knowledge Discovery

9




There are many types of businesses that use data mining as a means of knowledge
discovery. In the medical
field, data mining is very important because it helps doctors to
detect ADE (adverse drug events) in different types of medical conditions. By having all of the
medical information in a data warehouse, doctors and other workers

in the medical industry
are

able to detect ADE’s much earlier and quicker than in using previous methods. Thus
being able to treat patients effectively.


Analytical Purposes



On the other hand, libraries “are discovering, as businesses have, the value of merging
existing data o
r full text resources to form a very large data warehouse that can be mined for
analytical purposes”(Data Mining and Data Warehousing p.13). Libraries are not only using
data mining for administrative purposes, but for scholarly research as well. Librarian
s
“recognize data mining

techniques as offering new approaches to analyzing content and knowledge discovery within
these very large databases and the Web” (Data Mining and Data Warehousing p.13). As a
student, data mining makes research and reporting simp
ler and more defined. Therefore, if
students are able to use this type of resource efficiently, businesses should use data mining
as an essential solution to their business needs.


How To Make Data Mining Work for Businesses




According to one website
, it’s a fact that “
in the retail sector, implementing data mining in
sales, ordering and market profiling has improved profits by 60%” (www.sdart.co.uk). This
goes to show that this up and coming development of data mining is becoming a way to save
money
and increase revenue. There is a sparking interest in many different types of
businesses for data warehousing and data mining for this reason.
A good example of how to
use Data Mining in a modern business setting would be as follows:





The u
ses of data mining in business settings are numerous and can be used in many
different areas. One website that clearly defines and explains how data mining is used in
businesses has a list of examples of areas where data mining can be used. The highlighted

portions are important key factors of why using data mining will help businesses succeed.

Examples of the areas where Data Mining can be used:



Sales data analysis

-

If you sell anything you know how important it is to know how
to promote your products. As

there are patterns in sales, it is possible to find out
which products people buy together. You can therefore reduce unsuccessful
promotions, find out what customers need and sell more products.



Churn analysis

-

There are situations when even a loyal customer could change any
of its business partners.
Data Mining churn analysis could help you identify these
customers before they go to your competitors.
Keep your customers with you and
don't lose their loyalty.

10




Customer profiling

-

According to the Pareto Principle only 20% of your customers
generate 80% of your income. This Data Mining analysis will help you identify key
customers.

You could prepare special offers for them. Moreover, you could divide
your custom
ers into several groups according to their age or interests.



Employee profiling

-

The success of every company lies in the hands of its
personnel
. By analysing personnel data you can find a bright star or again a weak link
that might decrease overall perfo
rmance. As a result of this Data Mining analysis you
could give extra training to employees with low performance and reward or promote
those who do best for you.



Message segregation

-

When you receive a vast number of messages it is important
to segregate them to right branches of the company.
With automatic segregation
mechanism all incoming messages will be sent to the right employees.

It will reduce
the number of messages, which ha
ve to be resent and any time wasted by

manual
segregation.



Financial performance analysis

-

Have you wondered why financial crises
happen? The answer is very simple. Without good financial knowledge, investment
becomes an emotional and often irrational pr
ocess, so when the economy validates
the market, a crisis starts. Therefore,
you need a source of automatic financial
performance Data Mining analysis to evaluate companies' financial standings.



Weblog analysis

-

Have you invested in an advanced website, b
ut find that not many
users read it? Maybe they cannot find the information they want in the right place?
Log analysis of website usage will identify patterns of use and help create a proper
website structure.
Sections of websites can then be generated mor
e naturally or
suggested links could be presented to the user.



Overall, these are some great innovative ways that businesses can use data mining. It
will help with their financial performance, data analysis, customer and employee profiling,
and a

variety of other areas as well. The key part in the way businesses make their money is
through the data mining process, therefore the more use and acquisition of information used,
the better for the company. An ending quote to summarize the use of data mi
ning in business
comes from Stephan Kudyba.

Regarding how retrieving data is useful to a business he says,
“The bottom line to business success is to increase the knowledge of decision makers at all
levels of an organization. The process of knowledge crea
tion and enhancement comes from
information and information is nothing more than data that has been collected, accessed,
formatted and analyzed. The more efficiently organizations can access value added data, the
better they can gain greater insights into
what drives their activities which enables them to
better devise business strategy”(
Data mining efforts increase business productivity and
efficiency).





11


Cost of Data Warehousing


The pricing of data warehousing systems are most commonly done by the
cost per terabyte.
The price usually fluctuates according to what is included in the systems package. One of the
companies I researched, IBM, has a warehousing system named InfoSphere. They give their
clients two package options, the Enterprise edition and

the Departmental edition. Both are
priced using the price per terabyte method. The Enterprise edition is the more advanced
option, priced at $70,000/TB. The Departmental edition is the more basic IBM InfoSphere
package, and is priced at $35, 000/TB.




On the website, IBM compared their packages with Oracle’s packages. Oracle is a competing
company, who is also offering an enterprise and basic package. Instead of listing their system
by price per terabyte, Oracle lists theirs by price per user and

price per processor. These are
two more examples of the many way’s to price a data warehousing system. I’ve included their
comparison chart below.



IBM InfoSphere Warehouse and Oracle Data Warehouse Packaging:

INFOSPHERE WAREHOUSE ENTERPRISE
EDITION

ORA
CLE DATABASE 11g ENTERPRISE
EDITION

Included:



Partitioning



Deep Compression



Cubing Services



Intelligent Miner



Management tools



List price per TB: $70,000

Separate charge:



Partitioning



Advanced Compression



OLAP



Data
Mining



Diagnostics & Tuning Packs


List price/user: $2,530; List price/processor:
$126,500



INFOSPHERE WAREHOUSE
DEPARTMENTAL EDITION

ORACLE DATABASE 11g STANDARD EDITION

Included:



Partitioning



Deep Compression



Cubing Services



Intelligent Miner



Management tools



List price per TB: $35,000

Not Included:



Partitioning



Advanced Compression



OLAP



Data Mining

Separate Charge:



Diagnostics & Tuning Packs

List price/user:
$550; List price/processor: $27,500


12




*ORACLE DATABASE 11g ENTERPRISE EDITION


On another website I discovered, businessman/blogger Curt Monash compares the list prices
various warehousing systems and what they offer, and he discusses the advantages and
disadvantages of pricing sys
tems ‘per terabyte’ versus ‘per user’. He claims that although the
price per terabyte method is most the most common, listing the systems by terabyte usage is
outdated. He lists the disadvantages of price per terabyte as:



Price
-
per
-
terabyte metrics

ignore issues of throughput, latency, workload, and so on.



Price
-
per
-
terabyte metrics ignore quality of storage medium (slow disks, fast disks, Flash,
etc.)



Price
-
per
-
terabyte metrics can be radically affected by changes in disk size.


Comparatively, listing as price per user has certain advantages, which include:



User data is what matters.



User data is what users doing product evaluations or setting budgets can best estimate in
advance.



User data is a reasonable
and popular basis for software
-
only analytic DBMS pricing.


The disadvantages, according to Monash, seem to be that this system makes assumption based
on the amount of compression and indexing used, which are “dependent on the specifics of the
data set”(Mo
nash, 2010).


Some systems that he included were Netezza, Oracle, and EMC.

Netezza has two systems, 2x
compression and 4x compression. The 2x compression is a little under $20,000/TB, and the 4x
compression is about $11,000/TB according to Phil Fransico.
The EMC system is listed at $1
million for 18 terabytes of uncompressed user data, or $14,000/TB for 4x compression.


Based on this research, it seems that the best way to choose the system you need, is not by the
way the system is priced, but by what the
system has to offer and how that fits your needs. The
best priced system, and the most commonly used, seems to be the Oracle systems. By listing
their prices multiple ways, it gives the buyer a more rounded and suitable way to choose the
most appropriate w
arehouse system for their company.








13


Summary of Report Findings

After researching data mining and data warehousing, we have come to the understanding of
how they work and the conclusion that they can be very helpful in a multitude of ways. Like
previously stated, data warehouses are programs used to store data and run queries and
reports of that data. Bill Inmon is credited as the creator and he describes data warehousing
with six main characteristics: subject
-
oriented, integrated, non
-
volatile,
time
-
variant,
accessible, and process
-
oriented. When you break down what the warehouse itself is, there are
three components: metadata, summary data, and raw data. Users are able to access the data
within the warehouse with ease and whenever necessary, if
set up properly. There are the
obvious business usages for them, such as storing data and being able to easily access it and
things pertaining to it, such as the history of many company aspects from customer service to
financial analysis. Hospitals and lib
raries are also beginning to implement data warehousing
systems in their businesses because it helps them easily organize their data and have the
capabilities of going back into the warehouses to find their information. Data warehousing is
even something t
hat people can use on daily bases. in general, data warehousing and mining
are becoming very popular methods within larger companies when it comes to gathering and
managing data, but can also be used in daily procedures for everyday people.









14


Refere
nces


Reed, Micheal. "Intranet Journal: Feature: Finding Your Way Around E
-
commerce."
Intranet Journal
-

Intranet Design, Development, Content Management, Collaboration
Tools, Microsoft SharePoint, Wikis, Lotus Notes
-

Intranet Journal
. Technology
Evaluation.com. Web. 18 Nov. 2011.
<http://www.intranetjournal.com/features/datawarehousing.html>.


Dambrine, Thibault. "Data Warehouse: Primary Concepts." (2009). Toronto Users
Group for Power Systems, Mar. 2009. Web. 18 Nov. 2011.
<http://www.tylogix.com
/Articles/Data_Wareouse_Primary_Concepts.pdf>.


"Data Warehousing Guide."
Oracle
. 2002. Web. 18 Nov. 2011.
<http://download.oracle.com/docs/cd/B10500_01/server.920/a96520.pdf>.


"What Is Data Mining?"
Oracle Documentation
. Oracle, 2008. Web. 30 Nov. 2011.
<http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/process.htm>.


Francois, Carol. "What Is the Difference Between Data Mining and Data Warehousing?"
WiseGEEK: Clear Answers for Common Questions
. Conjecture Corporation, 18 Aug.
2011. Web. 30 Nov. 201
1. <http://www.wisegeek.com/what
-
is
-
the
-
difference
-
between
-
data
-
mining
-
and
-
data
-
warehousing.htm>.


15


Mento, Barbara, and Brendan Rapple. "Data Mining and Data Warehousing."
Data
Mining and Data Warehousing
. Washington, D.C.: Association of Research Libraries
,
Office of Leadership and Management Services, 2003. 9
-
13. Web.


Olson, David L. "Data Mining in Business Services."
Service Business

1.3 (2007): 181
-
93. Web Journal.


"Data Driven Innovation | SDART Ltd."
SDART Innovation through Software
Developement an
d Research Technology | SDART Ltd
. Web. 02 Dec. 2011.
<http://sdart.co.uk/data_driven_innovation>.


Wilson, A. M., Thabane, L., & Holbrook, A. (2004). Application of data mining
techniques in pharmacovigilance.
British Journal of Clinical Pharmacology
, 57(
2), 127
-
134. doi:10.1046/j.1365
-
2125.2003.01968.x


Kudyba, Stephan. "Data Mining Efforts Increase Business Productivity and
Efficiency."
Kennesaw State University Proxy
. Jan. 2003. Web. 02 Dec. 2011.


Paladin, D. (2011).
Sap business warehouse
. Retrieved f
rom

http://www.squidoo.com/BW


OSIsoft. (2010).
Pi processbook
. Retrieved
from

http://www.osisoft.com/software
-
16


support/products/PI_ProcessBook.aspx


Leonhardi, B., Sieb, C., Wurst, M.
, Rollins, J., & Ramos, J. (2008, October 30).
I
ntegrate

infosphere warehouse data mining with ibm cognos reporting, part
1: overview of info sphere warehouse and cognos integration architecture
. Retrieved
from http://www.ibm.com/develop
erworks/data/library/techarticle/dm
-
081wurst/index.html


Grossman, R., Hornick, M., & Meyer, G. (2002).
Data mining standards & initiatives.
Association for Computing Machinery, 45
(8), 59
-
61. Retrieved from
http://content.ebscohost.com/pdf17_20/pdf/2002/AC
M/01Aug02/11863431.pdf?T=P&P=
AN&K=11863431&S=R&D=bth&EbscoContent=dGJyMNLe80Sepq84wtvhOLCmr0mep
7RSsqy4SreWxWXS&ContentCustomer=dGJyMOnb7Em549+B7LHjgO3p8gAA


ITG. (2011, May).
Bottom
-
line advantages of ibm infosphere warehouse
. FTP to

ftp://ftp.software.ibm.com/software//data/sw
-
library/infosphere/whitepapers/Bottom
-
line
-
Advantages
-
Inf
oSphere
-
Warehouse
-
Terabyte
-
Pricing.pdf


Monash, C. (2010, October 15). [Web log message]. Retrieved from

http://www.dbms2.com/2010/10/15/pricing
-
of
-
data
-
warehouse
-
applia
nces/