Data Mining vs. Statistics


20 Nov 2013

Pavel Brusilovsky

Objectives

Intro to Data Mining
Data Mining vs. Statistics
Data Mining vs. Text Mining
Applications of Data Mining
Data Mining

Data mining is a cutting-edge technology for analyzing diverse, multidisciplinary, and multidimensional complex data

Data mining could identify relationships in your multidimensional and heterogeneous data that cannot be identified in any other way

Successful application of state-of-the-art data mining technology to marketing and sales is indicative of analytic maturity and the success of a company

Working definition of Data Mining:
  Data mining is a process of discovering previously unknown and potentially useful hidden patterns in your data
What is the Taxonomy of Data Mining?

Data mining taxonomy, based on application:
  Data Mining
  Text Mining
  Web Mining
  Image Mining…

Data mining taxonomy, based on the usage of domain knowledge:
  Verification-driven data mining
    Is associated with traditional quantitative approaches that permit a decision maker to express and verify organizational and personal domain knowledge
  Discovery-driven data mining
    Is tied to knowledge discovery technology capable of automatically discovering previously unknown patterns hidden in the data
  Combining both classes leads to a synergy that can produce meaningful and reliable results that may not be obtainable within either class of data mining on its own

Data mining taxonomy, based on estimation paradigm:
  Supervised learning
  Unsupervised learning
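
The two estimation paradigms can be contrasted on a toy example. The sketch below is illustrative only: the data, the nearest-centroid classifier, and the two-cluster k-means routine are assumptions chosen for brevity, not methods named in the slides. The supervised part uses known labels to learn a rule; the unsupervised part sees only the values and has to find the grouping itself.

```python
from statistics import mean

# Toy 1-D data with two groups (illustrative, not from the slides).
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]

def fit_centroids(xs, ys):
    """Supervised: use the known labels to compute one centroid per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def two_means(xs, iterations=10):
    """Unsupervised: 1-D k-means with k=2; no labels are used."""
    lo, hi = min(xs), max(xs)  # initial cluster centers
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return lo, hi

centroids = fit_centroids(values, labels)
centers = two_means(values)
print(predict(centroids, 4.5))  # class predicted for a new value
print(centers)                  # cluster centers found without labels
```

On this toy data both approaches recover the same two groups, but only the supervised model can attach the names "low" and "high" to them.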

What is the difference between “Search” and “Discover”?

Source: http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt

Example: Amazon.com purchase suggestion

Amazon.com increased sales by 15%, using purchase suggestions generated by data/text mining

Data Mining and Related Fields

Statistics: “The model is king” (Hand)

Data Mining: “The data is king”

Is Data Mining an extension of Statistics?

Data Mining and Statistics: mutual fertilization with convergence

Statistical Data Mining (graduate course, George Mason University)

Statistical Data Mining and Knowledge Discovery (hardcover), edited by Hamparsum Bozdogan
  An overview of Bayesian and frequentist issues that arise in multivariate statistical modeling involving data mining

Data Mining with Stepwise Regression (Dean Foster, Wharton School)
  Use interactions to capture non-linearities
  Use Bonferroni adjustment to pick variables to include
  Use the sandwich estimator to get robust standard errors
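
The three ingredients above can be combined in a small sketch. This is not Foster's procedure: it is a simplified forward-stagewise variant, assumed here for brevity, in which each step regresses the current residuals on one centered candidate at a time. Candidates are the main effects plus all pairwise interactions, the test statistic uses a sandwich (HC0) standard error, and a candidate enters only if its two-sided normal p-value passes a Bonferroni-adjusted threshold. The function names, the synthetic data, and `alpha` are all hypothetical.

```python
import math
import random
from itertools import combinations

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def robust_t(x, r):
    """Slope of r on x (both centered) and its t-ratio with a sandwich (HC0) SE."""
    sxx = sum(a * a for a in x)
    beta = sum(a * b for a, b in zip(x, r)) / sxx
    resid = [b - beta * a for a, b in zip(x, r)]
    # Sandwich form: (x'x)^-1 * sum(x_i^2 * e_i^2) * (x'x)^-1
    var = sum((a * e) ** 2 for a, e in zip(x, resid)) / sxx ** 2
    return beta, beta / math.sqrt(var)

def stagewise_select(columns, y, alpha=0.05):
    """Greedy selection; a candidate enters only if it passes Bonferroni."""
    # Candidates: main effects plus all pairwise interactions.
    cands = {name: center(col) for name, col in columns.items()}
    for a, b in combinations(columns, 2):
        cands[f"{a}*{b}"] = center([u * v for u, v in zip(columns[a], columns[b])])
    threshold = alpha / len(cands)  # Bonferroni-adjusted significance level
    resid, chosen = center(y), []
    while len(chosen) < len(cands):
        stats = {n: robust_t(x, resid) for n, x in cands.items() if n not in chosen}
        best = max(stats, key=lambda n: abs(stats[n][1]))
        beta, t = stats[best]
        p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal p-value
        if p > threshold:
            break
        chosen.append(best)
        resid = [r - beta * x for r, x in zip(resid, cands[best])]
    return chosen

# Synthetic data: y depends on x1 and on the x1*x2 interaction; x3 is noise.
rng = random.Random(0)
n = 500
cols = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 * a + 3 * a * b + rng.gauss(0, 1) for a, b in zip(cols["x1"], cols["x2"])]
selected = stagewise_select(cols, y)
print(selected)
```

A full stepwise regression would refit all selected terms jointly at each step; the one-variable-at-a-time update is used here only to keep the example free of matrix algebra.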


What are Data Mining Myths?

Myth 1: Data mining automatically discovers hidden patterns in your data

Myth 2: Data mining is designed for business analysts who are not professionals in quantitative fields

Myth 3: Data mining findings can be easily translated into decision-maker actions

Myth 4: Data mining encompasses decision analysis/decision support technology
What are the logical steps of Data Mining?

SEMMA methodology (SAS Enterprise Miner)

The core process of conducting a data mining study includes the following steps (SEMMA):
  Sample
  Explore
  Modify
  Model
  Assess

SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining

SEMMA is focused on the model development aspects of data mining
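
To make the five steps concrete, here is a minimal sketch in plain Python. It does not use or imitate the SAS Enterprise Miner interface. The synthetic data, the median imputation, and the straight-line model are assumptions chosen only to give each SEMMA step one visible action.

```python
import random
from statistics import mean, median

rng = random.Random(42)
# Synthetic (x, y) records with about 10% missing x values; purely illustrative.
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = 3.0 * x + 5.0 + rng.gauss(0, 1)
    data.append((None if rng.random() < 0.1 else x, y))

# Sample: draw a training partition and keep the rest as a holdout.
rng.shuffle(data)
train, holdout = data[:150], data[150:]

# Explore: simple summaries of the training partition.
observed = [x for x, _ in train if x is not None]
summary = {"n": len(train),
           "missing_x": len(train) - len(observed),
           "median_x": median(observed)}

# Modify: impute missing inputs with the training median.
def impute(rows, fill):
    return [(fill if x is None else x, y) for x, y in rows]

train_m = impute(train, summary["median_x"])
holdout_m = impute(holdout, summary["median_x"])

# Model: ordinary least squares for y = a + b * x.
mx = mean(x for x, _ in train_m)
my = mean(y for _, y in train_m)
b = (sum((x - mx) * (y - my) for x, y in train_m)
     / sum((x - mx) ** 2 for x, _ in train_m))
a = my - b * mx

# Assess: mean squared error on the holdout partition.
mse = mean((y - (a + b * x)) ** 2 for x, y in holdout_m)
print(summary, round(b, 2), round(mse, 2))
```

The imputation value is computed on the training partition only and then reused for the holdout, so the Assess step is not contaminated by holdout information.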

CRoss-Industry Standard Process for Data Mining (CRISP-DM)

Six phases of CRISP-DM:

1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Model deployment

SPSS Clementine

www.crisp-dm.org

Statistics vs. Data Mining: Concepts

| Feature | Statistics | Data Mining |
|---|---|---|
| Type of problem | Well structured | Unstructured / semi-structured |
| Inference role | Explicit inference plays a great role in any analysis | No explicit inference |
| Objective of the analysis and data collection | First objective formulation, and then data collection | Data are rarely collected for the objective of the analysis/modeling |
| Size of data set | Data set is small and hopefully homogeneous | Data set is large and heterogeneous |
| Paradigm/approach | Theory-based (deductive) | Synergy of theory-based and heuristic-based approaches (inductive) |
| Signal-to-noise ratio (STNR) | STNR > 3 | 0 < STNR <= 3 |
| Type of analysis | Confirmatory | Exploratory |
| Number of variables | Small | Large |
Statistics vs. Data Mining: Regression Modeling

| Feature | Statistics | Data Mining |
|---|---|---|
| Number of inputs | Small | Large |
| Type of inputs | Interval-scaled and categorical with a small number of categories (percentage of categorical variables is small) | Any mixture of interval-scaled, categorical, and text variables |
| Multicollinearity | Wide range of degree of multicollinearity, with intolerance to multicollinearity | Severe multicollinearity is always there; tolerance to multicollinearity |
| Distributional assumptions, homoscedasticity, outliers, missing values | Intolerance to violations of distributional assumptions and homoscedasticity, outliers/leverage points, and missing values | Tolerance to violations of distributional assumptions, outliers/leverage points, and missing values |
| Type of model | Linear / non-linear / parametric / non-parametric in low-dimensional X-space (intolerance to uncharacterizable non-linearities) | Non-linear and non-parametric in high-dimensional X-space, with tolerance to uncharacterizable non-linearities |

What is an unstructured problem?

| | Well-structured Business Problem | Unstructured Business Problem |
|---|---|---|
| Definition | Can be described with a high degree of completeness | Cannot be described with a high degree of completeness |
| | Can be solved with a high degree of certainty | Cannot be resolved with a high degree of certainty |
| | Experts usually agree on the best method and best solution | Experts often disagree about the best method and best solution |
| | Can be easily and uniquely translated into a quantitative counterpart | Cannot be easily and uniquely translated into a quantitative counterpart |
| Goal | Find the best solution | Find a reasonable solution |
| Complexity | Ranges from very simple to complex | Ranges from complex to very complex |

What are the differences between Data/Text Mining and Statistics?

Statistical analysis is designed to deal with structured data in order to solve structured problems:
  Results are software and researcher independent
  Inference reflects statistical hypothesis testing

Data mining is designed to deal with structured data in order to solve unstructured business problems:
  Results are software and researcher dependent (absence of implementation standards)
  Inference reflects computational properties of the data mining algorithm at hand

Text mining is designed to deal with unstructured data in order to solve unstructured problems:
  Results are software and researcher dependent
  Inference reflects computational properties and visualization capability of the text mining algorithm at hand



When is data mining technology appropriate?

Data mining technology is appropriate if:
  The business problem is unstructured
  Accurate prediction is more important than explanation
  The data include a mixture of interval, nominal, ordinal, count, and text variables, and the role and the number of non-numeric variables are essential
  Among those variables there are many irrelevant and redundant attributes
  The relationship among variables could be non-linear with uncharacterizable non-linearities
  The data are highly heterogeneous with a large percentage of outliers, leverage points, and missing values
  The sample size is relatively large

Important marketing and sales studies/projects have the majority of these features

Accurate prediction is more important than the explanation

What is the Breiman Uncertainty Principle?

Breiman uncertainty principle:
  Accuracy * Interpretability = Breiman’s constant

The Breiman uncertainty principle means that the higher a method’s accuracy, the lower its interpretability, and vice versa


What are great Data Mining Ideas?

Injecting randomness into the function estimation procedure

Bagging (Breiman, 1996):
  Apply the same unstable algorithm to different samples (drawn with replacement) of the original data
  Different samples yield different models
  The average of the predictions of these models might be better than the predictions from any single model
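
The bagging recipe can be sketched in a few lines. The base learner here is a one-split regression tree (a "stump"), assumed as a simple example of an unstable algorithm. The function names, the synthetic data, and the number of models are hypothetical.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree (an unstable learner) to (x, y) pairs."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i][0] == pts[i - 1][0]:
            continue  # no valid threshold between equal x values
        left, right = pts[:i], pts[i:]
        lm, rm = mean(y for _, y in left), mean(y for _, y in right)
        sse = (sum((y - lm) ** 2 for _, y in left)
               + sum((y - rm) ** 2 for _, y in right))
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, lm, rm)
    if best is None:  # all x values equal: predict the overall mean
        m = mean(y for _, y in pts)
        return lambda x: m
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def bagged_model(points, n_models=25, seed=0):
    """Fit one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [fit_stump([rng.choice(points) for _ in points])
              for _ in range(n_models)]
    return lambda x: mean(m(x) for m in models)

# Synthetic data: a noisy upward trend on [0, 1].
rng = random.Random(1)
data = [(i / 39, i / 39 + rng.gauss(0, 0.1)) for i in range(40)]
single = fit_stump(data)
bagged = bagged_model(data)
print(single(0.1), bagged(0.1), bagged(0.9))
```

Each bootstrap sample places its split at a slightly different threshold, so the averaged prediction changes more smoothly across x than any single stump does.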



Boosting (Friedman, Hastie, and Tibshirani, 1999):
  Each model is based on the same original data
  The first individual model is fit to the original data
  For the second model, subtract the predicted value from the original target value, and use the difference as the target value to train the second model
  For the third model, subtract the weighted average of the predictions from the original target value, and use the difference as the target value to train the third model, and so on
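
The residual-fitting loop described above can be sketched as follows. This is a common simplification for squared-error loss, not a reproduction of the cited paper: each round fits a one-split regression tree to the current residuals and adds a shrunken copy of it to the ensemble. The `learning_rate` weight, the stump learner, and the data are assumptions made for illustration.

```python
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree to (x, target) pairs."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i][0] == pts[i - 1][0]:
            continue  # no valid threshold between equal x values
        left, right = pts[:i], pts[i:]
        lm, rm = mean(t for _, t in left), mean(t for _, t in right)
        sse = (sum((t - lm) ** 2 for _, t in left)
               + sum((t - rm) ** 2 for _, t in right))
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, lm, rm)
    if best is None:  # all x values equal: predict the overall mean
        m = mean(t for _, t in pts)
        return lambda x: m
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def boost(points, n_rounds=20, learning_rate=0.5):
    """Each round trains on what the previous rounds left unexplained."""
    models = []
    residuals = list(points)  # round 1 is trained on the original targets
    for _ in range(n_rounds):
        stump = fit_stump(residuals)
        models.append(stump)
        # Subtract the weighted prediction; the difference is the next target.
        residuals = [(x, r - learning_rate * stump(x)) for x, r in residuals]
    return lambda x: sum(learning_rate * m(x) for m in models)

def mse(model, points):
    return mean((y - model(x)) ** 2 for x, y in points)

# Deterministic toy data: y = x^2 on [0, 1].
data = [(i / 29, (i / 29) ** 2) for i in range(30)]
one_round = boost(data, n_rounds=1)
many_rounds = boost(data, n_rounds=20)
print(mse(one_round, data), mse(many_rounds, data))
```

Unlike bagging, every round sees the same inputs; only the target changes, which is why the training error keeps falling as rounds are added.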


What are the best Data Mining Conferences?

Annual SAS Data Mining Technology Conference
  The world’s largest data mining conference, balancing theory and practice

Annual International Conference on Knowledge Discovery and Data Mining (KDD)
  Sponsored by the American Association for Artificial Intelligence (AAAI)

Annual International Salford Systems Data Mining Conference
  Focusing on solving real-world challenges
  Business applications of CART, MARS, TreeNet, and Random Forest
  Keynote speakers: Jerome Friedman (Stanford University) and Leo Breiman (University of California, Berkeley)


What are the best data mining tools?

Salford Systems’ Tools (CART, Random Forest, MARS, TreeNet)
SAS Enterprise Miner/Text Miner
SPSS Clementine
Megaputer Intelligence PolyAnalyst


References (Data Mining)

Randall Matignon (2007), Neural Network Modeling Using SAS Enterprise Miner, SAS® Institute Inc.

David J. Hand (1998), Data Mining: Statistics and More? The American Statistician, May 1998, Vol. 52, No. 2
http://www.amstat.org/publications/tas/hand.pdf

J.H. Friedman (1997), Data Mining and Statistics: What’s the Connection? Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, May 1997, Houston, Texas

Doug Wielenga (2007), Identifying and Overcoming Common Data Mining Mistakes, SAS Global Forum Paper 073-2007

Nathan Treloar (2002), Text Mining: Tools, Techniques, and Applications
http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt