
Introduction to data mining

Literature

Data mining in commerce


About 13 million customers per month contact the West
Coast customer service call center of the Bank of America


In the past, each caller would have listened to the same
marketing advertisement, whether or not it was relevant to
the caller’s interests.


Chris Kelly, vice president and director of database
marketing: “rather than pitch the product of the week, we
want to be as relevant as possible to each customer”


Thus, based on individual customer profiles, the customer
can be informed of new products that may be of greatest
interest.



Data mining helps to identify the type of marketing
approach for a particular customer, based on the
customer’s individual profile.

Recommendation systems

Why mine data: commercial viewpoint


Lots of data is being collected


Web data, e-commerce


purchases at department/grocery stores


Bank/Credit Card transactions


Computers have become cheaper and more powerful


Competitive pressure is strong


Provide better, customized services

R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”


Data collected and stored at enormous speeds (GB/hour)


remote sensors on a satellite


telescopes scanning the skies


microarrays generating gene expression data


scientific simulations generating terabytes of data


Traditional techniques infeasible for raw data


Data mining may help scientists


in classifying and segmenting data


in hypothesis formation


Why mine data: scientific viewpoint

R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

Data mining in bioinformatics


Brain tumors represent
the most deadly cancer
among children



A gene expression database for pediatric brain tumors was built in an effort to develop more effective treatment.


Clearly, a lot of data is being collected.


However, what is being learned from all this
data? What knowledge are we gaining from all
this information?


“we are drowning in information but starved
for knowledge”


The problem today is not that there is not enough data. Rather, the problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge.


Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.


(www.gartner.com)



Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases.


(Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998.)


The growth in this field has been fueled
by several factors:


growth in data collection


storing of the data in data warehouses


availability of increased access to data from the Web


competitive pressure to increase market
share


development of data mining software suites


tremendous growth in computing power
and storage capacity

Need for human direction of DM


Don’t believe software vendors advertising their analytical software as plug-and-play, out-of-the-box applications providing solutions without the need for human interaction!


Data mining is not a product that can be
bought, it is a discipline that must be
mastered
!


Automation is not a substitute for human input.


Data mining is easy to do badly.


Software always gives some result.


A little knowledge is especially dangerous


e.g. analysis carried out on unpreprocessed data can lead to erroneous conclusions; the models can be way off


if deployed, the errors can lead to very expensive
failures


The costly errors stem from the black-box approach.

Data mining trap


If we try hard enough, we always find some patterns.


However, they may be just a matter of chance.


They don’t have to be characteristic of the process that generates the data.


Google defines data mining as:


Data mining is the equivalent to sitting a huge number of monkeys down at keyboards, and then reporting on the monkeys who happened to type actual words.


Instead, apply a “white-box” methodology,

i.e. understand the algorithms and statistical model structures underlying the software.




The white-box approach is the reason why you are attending this lecture (apart from the fact that the lecture is compulsory).

Data mining as a process


One of the fallacies associated with DM is that DM represents an isolated set of tools.

Instead, DM should be viewed as a process.


The process is standardized


CRISP-DM framework (http://www.crisp-dm.org/)


Cross-Industry Standard Process for Data Mining


developed in 1996 by analysts from DaimlerChrysler,
SPSS, and NCR


provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit

CRISP-DM

1. Business understanding phase


Formulate the project objectives and requirements

2. Data understanding phase


collect the data


use EDA (exploratory data analysis) to familiarize
yourself with the data


evaluate the quality of the data

3. Data preparation phase


prepare from the initial raw data the final data set. This phase is very labor intensive.


select the cases and variables you want to analyze


perform transformation of variables, if needed


clean the raw data so they are ready for modelling
tools

4. Modeling phase


select and apply appropriate modeling techniques


calibrate model settings to optimize results


often, several different techniques may be used


if necessary, loop back to the data preparation phase
to bring the form of the data into line with the
specific requirements of a particular data mining
technique

5. Evaluation phase


evaluate models for quality and effectiveness


establish whether some important facet of the
business or research problem has not been
accounted for sufficiently

6. Deployment phase


make use of the models created


examples of deployment:


report


implement a parallel DM process in another
department

CRISP-DM example


Business understanding


Objectives: reduce costs associated with
warranty claims and improve customer
satisfaction


Specific business problems can be formulated:


Are there interdependencies among warranty claims?


Are past warranty claims associated with similar
claims in the future?


Investigated patterns in the warranty claims for DaimlerChrysler automobiles.

Jochen Hipp and Guido Lindner, Analyzing warranty claims of automobiles: an application description following the CRISP-DM data mining process, in Proceedings of the 5th International Computer Science Conference (ICSC ’99), pp. 31-40, Hong Kong, December 13-15, 1999.


Data understanding


use of DaimlerChrysler’s Quality Information
System (QUIS)


it contains information on over 7 million vehicles
and is about 40 gigabytes in size


QUIS contains production details about how and
where a particular vehicle was constructed +
warranty claim information


researchers stressed the fact that the database
was entirely unintelligible to domain nonexperts


experts from different departments had to be located
and consulted, a task that turned out to be rather costly


Data preparation


the QUIS DB did not contain all information
needed for the modelling purposes


e.g. the variable “number of days from selling date until first claim” had to be derived from the appropriate date attributes


researchers then turned to DM software where
they ran into a common roadblock: data format
requirements varied from algorithm to algorithm


result was further exhaustive preprocessing of the data


researchers mention that the data preparation
phase took much longer than they had planned


Modeling


to investigate dependencies, researchers used


Bayesian networks


Association rules mining


the details of the results are confidential, but we can get a general idea of the dependencies uncovered by the models


a particular combination of construction specifications doubles the probability of encountering an automobile electrical cable problem


Evaluation


The researchers were disappointed that
association rules models were found to be lacking
in effectiveness and to fall short of the objectives
set for them in the business understanding phase


“In fact, we did not find any rule that our domain
experts would judge as interesting.”


To account for this, the researchers point to the
“legacy” structure of the database, for which
automobile parts were categorized by garages and
factories for historic or technical reasons and not
designed for data mining.


They suggest redesigning the database to make it
more amenable to knowledge discovery.


Deployment


It was a pilot project, without intention to deploy any large-scale models from the first iteration.


Product: report describing lessons learned from
this project


e.g. change of the structure of the database (new
variables, different categorization of automobile parts)

Lessons learned


uncovering hidden nuggets of knowledge in
databases is a rocky road


intense human participation and supervision
is required at every stage of the data mining
process


there is no guarantee of positive results

Connection to other fields

[Diagram: Data Mining at the intersection of Statistics, Database systems, Visualization, Machine learning, and Pattern recognition]

Machine learning


A subfield of artificial intelligence.

A discipline concerned with the design and development of algorithms that allow computers to evolve behavior based on experience.


experience: empirical data, such as from sensors or databases

evolve behavior: usually through a search for patterns in data

ML has a similar goal to DM, and DM uses algorithms from ML

Pattern recognition


The problem of searching for patterns is a fundamental one, with a long and successful history.

For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics.

Pattern recognition


automatic discovery of
regularities in data
through the use of computer algorithms and
with the use
of these
regularities to take
actions such as classifying the data into
different
categories

Pattern recognition


[Figure: toy trains example; from the data, the discovered pattern is “if a train has 2 wagons, it goes to the left”]

More real patterns

[Figure: face detection]


Connection to other fields

Iris Sample Data Set


Many of the exploratory data techniques are illustrated with Fisher’s Iris Plant data set.

From the statistician R. A. Fisher, mid-1930s.

Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html

based on WEKA tutorial

Fisher, R.A. (1936). “The Use of Multiple Measurements in Taxonomic Problems”. Annals of Eugenics 7: 179-188, http://digital.library.adelaide.edu.au/coll/special//fisher/138.pdf.


[Photos: iris setosa, iris versicolor, iris virginica]

Contains flower dimension measurements on 50 samples of each species.

Data mining terminology


The four iris dimensions are termed attributes, input attributes, or features.

The three iris species are termed classes or output attributes.

Each example of an iris is termed a sample, instance, object, or data point.
These dimensions were measured:


sepal length

sepal width

petal length

petal width

Measurements on these iris species:


setosa


versicolor


virginica

based on WEKA tutorial

[Table figure: iris samples as rows, with the four numerical attributes and the nominal class attribute; based on WEKA tutorial]

Statistics


statistical analysis


summary statistics (mean, median, standard deviation); see the sketch after this list

Exploratory Data Analysis (EDA)


A preliminary exploration of the data to better
understand its characteristics.


Created by statistician John Tukey


A nice online introduction can be found in Chapter 1
of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
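For illustration, a small sketch of these summary statistics on the iris attributes (assuming pandas and scikit-learn; illustrative, not the lecture's code):

  import pandas as pd
  from sklearn.datasets import load_iris

  iris = load_iris()
  df = pd.DataFrame(iris.data, columns=iris.feature_names)

  print(df.mean())    # mean of each attribute
  print(df.median())  # median of each attribute
  print(df.std())     # standard deviation of each attribute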

EDA


Helps to select the right tool for preprocessing or
analysis


People can recognize patterns not captured by
data analysis tools


In EDA, as originally defined by Tukey


The focus was on visualization


Clustering and anomaly detection were viewed as
exploratory techniques


In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory


In EDA, the human makes and validates hypotheses, while in DM the computer makes and validates hypotheses.


[Scatter plots: sepal length vs. sepal width, colored by species (setosa, versicolor, virginica); based on WEKA tutorial]

Connection to other fields

Visualization


Can reveal hypotheses

based on WEKA tutorial
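A rough sketch of how such a scatter plot can be produced (assuming matplotlib and scikit-learn; illustrative only):

  import matplotlib.pyplot as plt
  from sklearn.datasets import load_iris

  iris = load_iris()
  x, y = iris.data[:, 0], iris.data[:, 1]  # sepal length, sepal width

  # one scatter series per species, so class-specific patterns become visible
  for label, name in enumerate(iris.target_names):
      mask = iris.target == label
      plt.scatter(x[mask], y[mask], label=name)
  plt.xlabel("sepal length (cm)")
  plt.ylabel("sepal width (cm)")
  plt.legend()
  plt.show()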

Connection to other fields

Data warehouse


A data warehouse is a repository of an organization's electronically stored data.

Data warehouses are designed to facilitate reporting and analysis.


Technology:


relational database system


multidimensional database system

Data warehousing


the process of constructing and using a data warehouse

Data warehousing is the coordinated, periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytical and informational processing.


data warehousing includes



business intelligence tools



tools to extract, transform, and load data



tools to manage and retrieve metadata


Business intelligence tools


a type of application software designed to report, analyze and present data (stored e.g. in a data warehouse)


they include


reporting and querying software


“Tell me what happened.”


tools that extract, sort, summarize, and present selected data


OLAP (On-Line Analytical Processing)

“Tell me what happened and why.”

data mining

“Tell me what might happen.” (predict)

“Tell me something interesting.” (relationships)

OLAP


Query and report data is typically presented in row after row of two-dimensional data.


OLAP:
“Tell me what happened and why.”


To support this type of processing, OLAP
operates against
multidimensional databases
.

Example: Iris data


We show how the attributes petal length, petal width, and species type can be converted to a multidimensional array.

First, we discretized the petal width and length to have categorical values: low, medium, and high.

We get the following table; note the count attribute.

[Table: discretized petal length and width, species, and the count of matching flowers]

Slices of the multidimensional array are shown by the following cross-tabulations (a code sketch of the conversion follows the two key steps below).

[Cross-tabulations: petal width x petal length counts for Setosa, Versicolor, and Virginica]

Creating a Multidimensional Array


Two key steps in converting tabular data into a multidimensional array:

1. Identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array.

The attributes used as dimensions must have discrete values.

The target value is typically a count or a continuous value.

2. Find the value of each entry in the multidimensional array by summing the values (of the target attribute) or counting all objects that have the attribute values corresponding to that entry.
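A sketch of both steps on the iris example above (assuming pandas and scikit-learn; the equal-width bins for low/medium/high are an illustrative assumption, not the lecture's exact cut points):

  import pandas as pd
  from sklearn.datasets import load_iris

  iris = load_iris()
  df = pd.DataFrame(iris.data, columns=iris.feature_names)
  df["species"] = iris.target_names[iris.target]

  # Step 1: dimensions must be discrete - bin petal length/width into 3 categories
  # (equal-width bins are an assumption made for this sketch)
  for col in ["petal length (cm)", "petal width (cm)"]:
      df[col] = pd.cut(df[col], bins=3, labels=["low", "medium", "high"])

  # Step 2: the target attribute is a count - count the objects in each cell
  counts = df.groupby(["petal length (cm)", "petal width (cm)", "species"],
                      observed=False).size()
  cube = counts.to_numpy().reshape(3, 3, 3)  # petal length x petal width x species

Each per-species cross-tabulation is then a slice cube[:, :, i] of this array.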

OLAP Operations: Data Cube


The key operation of OLAP is the formation of a data cube.


A data cube is a multidimensional
representation of data, together with all
possible aggregates.



By all possible aggregates,
we mean the aggregates
that result by selecting a
proper subset of the
dimensions and summing
over all remaining
dimensions.


For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
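Continuing the illustrative sketch above, summing the iris count cube over every dimension except species yields exactly this one-dimensional aggregate:

  # keep only the species axis; sum over petal length (axis 0) and width (axis 1)
  species_totals = cube.sum(axis=(0, 1))
  print(species_totals)  # [50 50 50] - the number of flowers of each type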



Consider a data set that records the sales of
products at a number of company stores at
various dates.


This data can be represented as a 3-dimensional array.

There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total).
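A numpy sketch of this counting argument (the array shape and sales values are made-up placeholders):

  import numpy as np
  from itertools import combinations

  rng = np.random.default_rng(0)
  sales = rng.integers(0, 100, size=(4, 5, 6))  # product x store x date (toy data)

  dims = ("product", "store", "date")
  # 3 two-dimensional aggregates: sum over the single left-out dimension
  for kept in combinations(range(3), 2):
      dropped = tuple(d for d in range(3) if d not in kept)
      print([dims[d] for d in kept], sales.sum(axis=dropped).shape)
  # 3 one-dimensional aggregates and the single zero-dimensional aggregate
  for k in range(3):
      print(dims[k], sales.sum(axis=tuple(d for d in range(3) if d != k)).shape)
  print("overall total:", sales.sum())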


Data Cube Example


The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total.


OLAP Operations

Various operations are defined on the data cube:

Slicing/Dicing: selecting a group/subgroup of cells from the entire multidimensional array by specifying a specific value for one or more dimensions (a small indexing sketch follows).

Roll-up and Drill-down: changing the granularity of the aggregation (see the next slide).
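On the toy sales array from the previous sketch, slicing and dicing reduce to plain array indexing (illustrative):

  # slice: fix one dimension to a single value, e.g. the store at index 2
  store_slice = sales[:, 2, :]    # a 2D product x date slice
  # dice: select a subgroup of values on one or more dimensions
  sub_cube = sales[0:2, :, 0:3]   # first two products, first three dates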


OLAP Operations: Roll-up and Drill-down


Attribute values often have a hierarchical
structure.


Each date is associated with a year, month, and week.


A location is associated with a continent, country, state
(province, etc.), and city.


Products can be divided into various categories, such as
clothing, electronics, and furniture.


Note that these categories often nest and form a tree or lattice:

A year contains months, which contain days.

A country contains states, which contain cities.


This hierarchical structure gives rise to the roll-up and drill-down operations.


For sales data, we can aggregate (roll up) the sales
across all the dates in a month.


Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
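A small pandas sketch of roll-up on dates (the toy table and column names are assumptions for illustration):

  import pandas as pd

  daily = pd.DataFrame({
      "date": pd.to_datetime(["2013-01-05", "2013-01-20", "2013-02-03"]),
      "sales": [100, 150, 80],
  })

  # roll-up: aggregate daily sales into monthly totals
  monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
  print(monthly)  # 2013-01 -> 250, 2013-02 -> 80
  # drill-down is the reverse: return to the finer daily rows in `daily`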

The End