Research Report on Data
IS 2200 Red Team
Chandler Ardis, Danita Austin, Amy Griffith, Kenny Griffith, Kaitlyn Pack, Sarah Roberts
Table of Contents
Table of Contents
Business Uses of Data Mining and Warehousing
Cost of Data Warehousing
Summary of Report Findings
warehousing is the technology used to store data and have the ability to retrieve this data
easily and quickly. It is characterized by subject
oriented, integrated, non
accessible, and process
oriented. Data mining is different from
data warehousing in the fact
that it uses pattern recognition to identify trends in the data set. The data sources are input
from operational systems and flat files, comprised in the warehouse in metadata, summary
data, and raw data, and the users are abl
e to analyze, report, and mine the inputted data.
Metadata is main data and other structures used to generate data warehouses. Summary data
is comprised aggregated data from facts and possibly dimension tables. Raw data is
unprocessed and unformatted and i
s useless until processed manually or by machine. When
creating data warehouses, it is important to understand what you want the data warehouse to
be used for so that when setting it up, you know what it needs to be able to do when done. The
main steps whe
n creating data warehousing are: gather and analyze existing informational
capabilities, identify significant business questions and key metrics from end users, breakdown
the metrics into their component parts with specific definitions, and map the compone
to the informational model and systems of record. ETL, extract, transform, and load, is a key
component of data warehousing because this is how someone sifts through data and records it
in the warehouse. It is used more and more in big businesses
where large volumes of data are
needing to be documented and retrievable. There are multiple sources for programs to employ
data warehouses in businesses and multiple ways in which someone pays for the services. Data
warehousing seems to be getting more po
pular as the understanding of their capabilities are
increasing, because in all reality, they are very useful and practical in not only the business
world, but with anything that has technology as a source of data.
The data warehousing
technology has been embraced by medium to large companies and it
satisfies strategic corporate data requirements.
It has proven to be useful to business leaders
and managers who would not want to run a business without it.
Data warehouses are software
ograms that are used to store large volumes of data and can run queries and reports from
Data warehousing can extract and store data to allow easier reporting.
What Is Data Warehousing?
Data warehousing is a subject
orientated, integrated, tim
variant and non
of data in support of management’s decision making process, as defined by Bill Inmon in
Bill Inmon was a computer scientist; that was credited for creating data
The data warehouse has five characte
oriented, integrated, non
volatile, time variant, accessible and process
These five characteristics are defined
Oriented means that the data is manipulated to provide specific information about a
subject or area of interest.
Data warehouses are designed to help the user analyze data.
Integrated means the data from different sources is gathered together to form one coherent
The data warehouse will contain a variety of
information from different sources.
Volatile means that once the information is entered into the warehouse it will not change,
even when new information is entered into the warehouse from day to day operations of the
teristic allows for a consistent picture of the business.
variant function of the warehouse allows for the data to be identified with time
Historical data is an important function of the data warehouse.
The data w
arehouse must be accessible to provide information to end
Data warehousing is a process in delivering information and must have ongoing maintenance.
Basic Data Warehouse Architecture
The figure below shows a simple architecture for
a data warehouse.
On one side you have the
data sources from operational systems and flat files loaded into the warehouse software.
the other side you have the users where analysis, reporting and data mining are being
In the middle you ha
ve the components of the warehouse, metadata, summary data
and raw data.
Architecture of a Data Warehouse
Metadata describes data and other structures, such as objects, business rules, and processes.
the schema design of a data warehouse is typically stored in a repository as
metadata, which is used to generate scripts used to build and populate the data warehouse. A
repository contains metadata.
For example, metadata is used to generate and populate
data warehouse with information, definitions of tables, columns, business rules, or discounts of
10 percent after selling 1,000 items.
computed table comprising aggregated or joined data from fact and possibly dimensions
Raw data is data that has not been processed or formatted in a presentable form. Raw data is
meaningless and is processed manually or by a machine.
Data Warehouse Creation
Data Warehouse Sources
To create a data warehouse the following sources are
needed, the current on
processing system, which holds the day to day information about the business.
from prior periods is also needed to create the warehouse.
Data Warehouse Creation
Creating a data warehouse has to done
with great care to ensure that design supports the types
of queries the warehouse will be used for.
The creator must understand the database schema
to be created and have interaction with the anticipated users of the database.
must be modifi
ed several times before the model will be stabilized.
Stabilizing the database
must be done carefully, because once the database is populated with large amounts of
information which may be hard to recreate, the database cannot be easily changed.
Gather and analyze existing informational capabilities
Identify significant business questions and key metrics from end users.
Breakdown the metrics into their component parts with specific definitions
Map the component parts to the
informational model and systems of record.
First Increment of Information
The warehouse architecture is initially developed, and then the first increment is developed
based on the architecture.
The first increment will cause architectural changes and ope
of the warehouse will cause architectural changes.
Once you have created your first data warehouse increment, the following should be
considered for the additional increments that are added.
Start with one subject area and one target user group.
tinue and add subject areas, user groups and informational capabilities to the
architecture based on the organization’s requirements for information.
Improvements are made from what was learned from previous increments.
Improvements are made from what was
learned about warehouse operation and
The end user requirements are reanalyzed after each improvement.
Extract, Transform and Load (ETL)
The warehouse is populated through extracting the data from the source environment, and
then changing the da
ta to have the warehouse characteristics, such as subject
Lastly, load the data into the software.
Data warehouses are updated regularly through the ETL process.
Updates are usually done at
night or weekly.
Bulk data modi
fication techniques are used which means the end users would
not update the data warehouse.
Data warehouses usually store many months or years of data
to support historical analysis.
This is why the warehouse should go through a data cleansing
Data cleansings are done during the ETL process; it is also called data scrubbing or data quality
A data warehouse that has incorrect information is useless and can be dangerous to
the companies’ decision
Once the warehous
e is built it will provide the company with a competitive advantage and
return on investment.
A tool that helps search for patterns in data warehouses is data mining,
it helps ask questions that a company may have not thought to ask.
s data warehousing and data mining is thought to be the same in the business
Data mining is differs because of the system designs, methodology used and the
Data mining uses pattern recognition to identify trends within a data set.
statistical analysis software package is used to identify specific patterns based on the data set
and queries the end users uses.
Data mining is used to target marketing programs, financial
fraud and to identify unusual patterns in behavior for securi
Key Properties of Data Mining
The key properties of data mining are; automatic discovery of patterns, prediction of likely
outcomes, creation of actionable information and focus on large data sets and databases.
Building Models (Automatic Disco
A data mining model uses an algorithm to act on a set of data.
The models can be used to mine
the data from which they are built from.
The process of applying a model to data is called
Data mining algorithms often require large data sets
for the creation of quality models.
Data Mining can be predictive, predication probabilities are also known as confidence.
Data mining can pull information that can for example help a town planner use a model that
dict income based on demographics to develop a plan for low
Data Mining Process
The figure below shows the data mining process.
The results of data mining trigger new
business questions, which can help develop more specific models.
This phase focuses on the understanding of the objectives and requirements.
You can develop
an preliminary implementation plan once you have decided on your business perspective.
instance, if you needed to know who is will purchase
The data on who has bought
the product in the past must be assembled.
Data Gathering and Preparation
This phase involves collecting the data and checking to see if it addresses the business
Data quality issues and data patterns can a
lso be found and corrected during this
Data preparation can significantly improve the information that can be discovered
through data mining.
Model Building and Evaluation
In this phase, the selection and application of modeling techniques are put
together to create
the best values.
When building the model, it should be created in small sets of data, because the
final table may contain thousands or millions of cases.
At this phase the model needs to be
evaluated to make sure if satisfies the origi
nal business goal.
This phase is the use of data mining in a target environment.
Insight and actionable
information can be derived from the data.
This phase involves scoring, which is the application
of models to new data, the extr
action of model details, data warehouse infrastructure, or query
and reporting tools.
Business Uses of Data Mining and Warehousing
Data warehousing and data mining are fairly new technologies that are helping companies
worldwide to find and use
data quickly and efficiently. Businesses that use this new
technology are more organized in their decision making and are resourceful because they are
able to organize and present data in a variety of different ways. The reason for doing this is to
date the diversity of their customers and their needs.
According to one article, data mining is useful in,“customer relationship management
(Drew et al. 2001; Garver 2002), ﬁnancial analysis (Cowan 2002; Hui and Weigend 2004),
management (Adams et al. 2001), health service debt management (Zurada and
Lonial 2005), banking (Sung et al. 1999), insurance (Smith et al. 2000), and many other areas
of business involving services” (Data Mining in Business Services). Having a data wareh
is becoming an increasingly popular resource to have in businesses because of its vast
knowledge of information.
An example of data mining in a large company would be Washington Radio Reports. WRR
) is a small company in Pennsylvania that mines multiple data sources
to build a massive data warehouse of radio frequencies. Data from the Federal
Communications Commission (
), Industry Canada
), US Postal Service zip codes, Canadian
zip codes and many others are integrated into the data wareho
use. Then, subscribers pay
Washington Radio Reports to utilize their search tools that access this wealth of information.
There are many types of businesses that use data mining as a means of knowledge
discovery. In the medical
field, data mining is very important because it helps doctors to
detect ADE (adverse drug events) in different types of medical conditions. By having all of the
medical information in a data warehouse, doctors and other workers
in the medical industry
able to detect ADE’s much earlier and quicker than in using previous methods. Thus
being able to treat patients effectively.
On the other hand, libraries “are discovering, as businesses have, the value of merging
existing data o
r full text resources to form a very large data warehouse that can be mined for
analytical purposes”(Data Mining and Data Warehousing p.13). Libraries are not only using
data mining for administrative purposes, but for scholarly research as well. Librarian
“recognize data mining
techniques as offering new approaches to analyzing content and knowledge discovery within
these very large databases and the Web” (Data Mining and Data Warehousing p.13). As a
student, data mining makes research and reporting simp
ler and more defined. Therefore, if
students are able to use this type of resource efficiently, businesses should use data mining
as an essential solution to their business needs.
How To Make Data Mining Work for Businesses
According to one website
, it’s a fact that “
in the retail sector, implementing data mining in
sales, ordering and market profiling has improved profits by 60%” (www.sdart.co.uk). This
goes to show that this up and coming development of data mining is becoming a way to save
and increase revenue. There is a sparking interest in many different types of
businesses for data warehousing and data mining for this reason.
A good example of how to
use Data Mining in a modern business setting would be as follows:
ses of data mining in business settings are numerous and can be used in many
different areas. One website that clearly defines and explains how data mining is used in
businesses has a list of examples of areas where data mining can be used. The highlighted
portions are important key factors of why using data mining will help businesses succeed.
Examples of the areas where Data Mining can be used:
Sales data analysis
If you sell anything you know how important it is to know how
to promote your products. As
there are patterns in sales, it is possible to find out
which products people buy together. You can therefore reduce unsuccessful
promotions, find out what customers need and sell more products.
There are situations when even a loyal customer could change any
of its business partners.
Data Mining churn analysis could help you identify these
customers before they go to your competitors.
Keep your customers with you and
don't lose their loyalty.
According to the Pareto Principle only 20% of your customers
generate 80% of your income. This Data Mining analysis will help you identify key
You could prepare special offers for them. Moreover, you could divide
ers into several groups according to their age or interests.
The success of every company lies in the hands of its
. By analysing personnel data you can find a bright star or again a weak link
that might decrease overall perfo
rmance. As a result of this Data Mining analysis you
could give extra training to employees with low performance and reward or promote
those who do best for you.
When you receive a vast number of messages it is important
to segregate them to right branches of the company.
With automatic segregation
mechanism all incoming messages will be sent to the right employees.
It will reduce
the number of messages, which ha
ve to be resent and any time wasted by
Financial performance analysis
Have you wondered why financial crises
happen? The answer is very simple. Without good financial knowledge, investment
becomes an emotional and often irrational pr
ocess, so when the economy validates
the market, a crisis starts. Therefore,
you need a source of automatic financial
performance Data Mining analysis to evaluate companies' financial standings.
Have you invested in an advanced website, b
ut find that not many
users read it? Maybe they cannot find the information they want in the right place?
Log analysis of website usage will identify patterns of use and help create a proper
Sections of websites can then be generated mor
e naturally or
suggested links could be presented to the user.
Overall, these are some great innovative ways that businesses can use data mining. It
will help with their financial performance, data analysis, customer and employee profiling,
variety of other areas as well. The key part in the way businesses make their money is
through the data mining process, therefore the more use and acquisition of information used,
the better for the company. An ending quote to summarize the use of data mi
ning in business
comes from Stephan Kudyba.
Regarding how retrieving data is useful to a business he says,
“The bottom line to business success is to increase the knowledge of decision makers at all
levels of an organization. The process of knowledge crea
tion and enhancement comes from
information and information is nothing more than data that has been collected, accessed,
formatted and analyzed. The more efficiently organizations can access value added data, the
better they can gain greater insights into
what drives their activities which enables them to
better devise business strategy”(
Data mining efforts increase business productivity and
Cost of Data Warehousing
The pricing of data warehousing systems are most commonly done by the
cost per terabyte.
The price usually fluctuates according to what is included in the systems package. One of the
companies I researched, IBM, has a warehousing system named InfoSphere. They give their
clients two package options, the Enterprise edition and
the Departmental edition. Both are
priced using the price per terabyte method. The Enterprise edition is the more advanced
option, priced at $70,000/TB. The Departmental edition is the more basic IBM InfoSphere
package, and is priced at $35, 000/TB.
On the website, IBM compared their packages with Oracle’s packages. Oracle is a competing
company, who is also offering an enterprise and basic package. Instead of listing their system
by price per terabyte, Oracle lists theirs by price per user and
price per processor. These are
two more examples of the many way’s to price a data warehousing system. I’ve included their
comparison chart below.
IBM InfoSphere Warehouse and Oracle Data Warehouse Packaging:
INFOSPHERE WAREHOUSE ENTERPRISE
CLE DATABASE 11g ENTERPRISE
List price per TB: $70,000
Diagnostics & Tuning Packs
List price/user: $2,530; List price/processor:
ORACLE DATABASE 11g STANDARD EDITION
List price per TB: $35,000
Diagnostics & Tuning Packs
$550; List price/processor: $27,500
*ORACLE DATABASE 11g ENTERPRISE EDITION
On another website I discovered, businessman/blogger Curt Monash compares the list prices
various warehousing systems and what they offer, and he discusses the advantages and
disadvantages of pricing sys
tems ‘per terabyte’ versus ‘per user’. He claims that although the
price per terabyte method is most the most common, listing the systems by terabyte usage is
outdated. He lists the disadvantages of price per terabyte as:
ignore issues of throughput, latency, workload, and so on.
terabyte metrics ignore quality of storage medium (slow disks, fast disks, Flash,
terabyte metrics can be radically affected by changes in disk size.
Comparatively, listing as price per user has certain advantages, which include:
User data is what matters.
User data is what users doing product evaluations or setting budgets can best estimate in
User data is a reasonable
and popular basis for software
only analytic DBMS pricing.
The disadvantages, according to Monash, seem to be that this system makes assumption based
on the amount of compression and indexing used, which are “dependent on the specifics of the
Some systems that he included were Netezza, Oracle, and EMC.
Netezza has two systems, 2x
compression and 4x compression. The 2x compression is a little under $20,000/TB, and the 4x
compression is about $11,000/TB according to Phil Fransico.
The EMC system is listed at $1
million for 18 terabytes of uncompressed user data, or $14,000/TB for 4x compression.
Based on this research, it seems that the best way to choose the system you need, is not by the
way the system is priced, but by what the
system has to offer and how that fits your needs. The
best priced system, and the most commonly used, seems to be the Oracle systems. By listing
their prices multiple ways, it gives the buyer a more rounded and suitable way to choose the
most appropriate w
arehouse system for their company.
Summary of Report Findings
After researching data mining and data warehousing, we have come to the understanding of
how they work and the conclusion that they can be very helpful in a multitude of ways. Like
previously stated, data warehouses are programs used to store data and run queries and
reports of that data. Bill Inmon is credited as the creator and he describes data warehousing
with six main characteristics: subject
oriented, integrated, non
accessible, and process
oriented. When you break down what the warehouse itself is, there are
three components: metadata, summary data, and raw data. Users are able to access the data
within the warehouse with ease and whenever necessary, if
set up properly. There are the
obvious business usages for them, such as storing data and being able to easily access it and
things pertaining to it, such as the history of many company aspects from customer service to
financial analysis. Hospitals and lib
raries are also beginning to implement data warehousing
systems in their businesses because it helps them easily organize their data and have the
capabilities of going back into the warehouses to find their information. Data warehousing is
even something t
hat people can use on daily bases. in general, data warehousing and mining
are becoming very popular methods within larger companies when it comes to gathering and
managing data, but can also be used in daily procedures for everyday people.
Reed, Micheal. "Intranet Journal: Feature: Finding Your Way Around E
Intranet Design, Development, Content Management, Collaboration
Tools, Microsoft SharePoint, Wikis, Lotus Notes
Evaluation.com. Web. 18 Nov. 2011.
Dambrine, Thibault. "Data Warehouse: Primary Concepts." (2009). Toronto Users
Group for Power Systems, Mar. 2009. Web. 18 Nov. 2011.
"Data Warehousing Guide."
. 2002. Web. 18 Nov. 2011.
"What Is Data Mining?"
. Oracle, 2008. Web. 30 Nov. 2011.
Francois, Carol. "What Is the Difference Between Data Mining and Data Warehousing?"
WiseGEEK: Clear Answers for Common Questions
. Conjecture Corporation, 18 Aug.
2011. Web. 30 Nov. 201
Mento, Barbara, and Brendan Rapple. "Data Mining and Data Warehousing."
Mining and Data Warehousing
. Washington, D.C.: Association of Research Libraries
Office of Leadership and Management Services, 2003. 9
Olson, David L. "Data Mining in Business Services."
1.3 (2007): 181
93. Web Journal.
"Data Driven Innovation | SDART Ltd."
SDART Innovation through Software
d Research Technology | SDART Ltd
. Web. 02 Dec. 2011.
Wilson, A. M., Thabane, L., & Holbrook, A. (2004). Application of data mining
techniques in pharmacovigilance.
British Journal of Clinical Pharmacology
Kudyba, Stephan. "Data Mining Efforts Increase Business Productivity and
Kennesaw State University Proxy
. Jan. 2003. Web. 02 Dec. 2011.
Paladin, D. (2011).
Sap business warehouse
. Retrieved f
Leonhardi, B., Sieb, C., Wurst, M.
, Rollins, J., & Ramos, J. (2008, October 30).
infosphere warehouse data mining with ibm cognos reporting, part
1: overview of info sphere warehouse and cognos integration architecture
Grossman, R., Hornick, M., & Meyer, G. (2002).
Data mining standards & initiatives.
Association for Computing Machinery, 45
61. Retrieved from
ITG. (2011, May).
line advantages of ibm infosphere warehouse
. FTP to
Monash, C. (2010, October 15). [Web log message]. Retrieved from