
13 June 2013 | Virtual Business Analytics Chapter

A Best Practices Framework for Data Mining


Mark Tabladillo, Ph.D., Data Mining Scientist


Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor

About MarkTab

Training and Consulting with http://marktab.com

Data Mining Resources and Blog at http://marktab.net

Ph.D., Industrial Engineering, Georgia Tech

Training and consulting internationally across many industries

SAS and Microsoft

Contributed to peer-reviewed research and legislation

Mentoring doctoral dissertations at the accredited University of Phoenix

Presenter

About Artus

Assistant Professor for Analytic Information Systems and Business Intelligence

PhD in computer science

Research: data mining for e-commerce and mobile business

Consultant

Section One

DATA MINING FOUNDATION

Definition 1 (Informal)

Data mining is the automated or semi-automated process of discovering patterns in data.



Definition 2

Data Mining is a process using

1. Exploratory Data Analysis
Statistical and visual data analysis techniques.
Forming a hypothesis

2. Data Modeling & Predictions
Describe data using probability distributions and Machine Learning algorithms (“model”).
Fitting a hypothesis

3. Statistical Learning Theory
Model selection, model evaluation

Data Mining Visualized

Target: attribute we are interested in.

Input: data available for our predictions.

Function f: describes the relationship between target and input.

Regrettably, f is unknown and unknowable.

Target = f(Input)

Data Mining Visualized

Real world: Target = f(Input), where f is unknown.

Data Mining model: Target ≈ h(Input), where h is the hypothesis.

Need to find “good” h.

h is your DM “algorithm”.

Input data has to be appropriate. Select and transform as needed.

Correct modeling of target is crucial.
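
The idea of approximating the unknown f with a hypothesis h can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the "real world" function `f`, the noise level, and the choice of a straight line as the hypothesis are hypothetical, and in practice f is never available to inspect. The sketch fits h on one part of the data and judges it on held-out data.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of the hypothesis h(x) = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(h, xs, ys):
    """Average squared difference between h(input) and the observed target."""
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical "real world": we pretend f exists only to generate noisy samples.
rng = random.Random(0)


def f(x):
    return 3.0 * x + 2.0


inputs = [rng.uniform(0, 10) for _ in range(200)]
targets = [f(x) + rng.gauss(0, 1) for x in inputs]

# Keep some data aside so that h is judged on inputs it has not seen.
train_x, test_x = inputs[:150], inputs[150:]
train_y, test_y = targets[:150], targets[150:]

a, b = fit_line(train_x, train_y)


def h(x):
    return a * x + b


test_mse = mean_squared_error(h, test_x, test_y)
print(f"h(x) = {a:.2f} * x + {b:.2f}, held-out MSE = {test_mse:.2f}")
```

A "good" h in this sense is one whose held-out error stays low, not one that merely reproduces the training data.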

Top 10 Expectations

BEST PRACTICE: LEARN FROM EXPERIENCE


Expectation Ten

Marketing: People can start data mining in 10 minutes…

More Scientific: Better models come from days, weeks or months of iterative improvement.


Expectation Nine

Marketing: Data miners can provide provably good models with little or zero knowledge of the specific industry…

More Scientific: Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.


Expectation Eight

Marketing: Open source software can provide quality results worthy of peer-reviewed literature…

More Scientific: Commercial software with years-long service options is required for enterprise scale.


Expectation Seven

Marketing: We can learn a lot from the current data warehouses, cubes, and big data…

More Scientific: We can improve our modeling by creating new data collection strategies.


Expectation Six

Marketing: People can build data mining models with little or zero data cleaning…

More Scientific: Better results happen when we organize and rearrange data for best success.


Expectation Five

Marketing: Data mining can provide answers to problems…

More Scientific: Most times we only get detailed insights toward larger problems, and sometimes uncover more problems than we started with.


Expectation Four

Marketing: A little data mining knowledge can provide an organization with a competitive edge…

More Scientific: The edge grows along with experience and better study of the methodology and mathematics.


Expectation Three

Marketing: Individual professionals can deliver excellent predictive analysis…

More Scientific: Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.


Expectation Two

Marketing: Numbers speak for themselves and can influence better decision making…

More Scientific: Leadership strategy helps teams deliver results in the best way given the current culture.


Expectation One

Marketing: A lot of data mining best practices and strategies can be communicated in an hour or a day…

More Scientific: The best commitment is ongoing education on both data mining and machine learning technology.

Section Two

ANALYZING AND PREPARING DATA

Best practice: study individual attributes

Histograms and frequencies (discrete)

Kernel density estimates

Cumulative distribution function

Rank-order plots and lift charts

Summary statistics (continuous)

Box-and-whisker plots
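
Several of the single-attribute studies listed above can be sketched with the Python standard library alone. The sketch below is one possible approach, not a prescribed toolchain: the function names and the sample columns `ages` and `regions` are hypothetical. It computes summary statistics for a continuous attribute (including the quartiles that a box-and-whisker plot draws), frequencies for a discrete attribute, and an empirical cumulative distribution function.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summary statistics for a continuous attribute (the box-plot numbers plus mean/stdev)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def frequencies(values):
    """Frequency count for a discrete attribute (the data behind a bar-style histogram)."""
    return Counter(values)


def empirical_cdf(values):
    """Pairs (value, share of observations less than or equal to it), in sorted order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Hypothetical sample columns, for illustration only.
ages = [23, 31, 35, 35, 41, 47, 52, 58, 64]
regions = ["north", "south", "north", "east", "north", "south", "east", "north", "south"]

print(summarize(ages))
print(frequencies(regions).most_common())
print(empirical_cdf(ages)[:3])
```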



Best practice: study combinations

Pivot tables

Scatter plots

Logarithmic plots

Naïve Bayes

Correlation matrices

False-Color plots

Scatter-Plot matrix

Co-plot
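
Of the combination studies listed above, the correlation matrix is the easiest to show in a few lines. The following is a minimal sketch using plain Python and the Pearson correlation coefficient; the column names and values are hypothetical, and the code assumes every column is numeric, non-constant, and of equal length.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attributes, for illustration only.
columns = {
    "visits": [1, 2, 3, 4, 5, 6],
    "spend": [12, 19, 33, 38, 52, 61],
    "returns": [5, 5, 4, 2, 2, 1],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Strongly correlated inputs found this way are candidates for a closer look with scatter plots before modeling.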

Section Three

MACHINE LEARNING ALGORITHMS

How to Choose an Algorithm


Choosing an algorithm or series of algorithms is an art


One algorithm could perform different tasks


Be willing to experiment with algorithms and algorithm parameters
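
Experimenting with algorithm parameters can be as simple as trying several values and comparing them on data that was held back from training. The sketch below assumes a k-nearest-neighbour classifier as the algorithm and `k` as the parameter; both are illustrative choices, and the two synthetic groups of points are hypothetical data.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Share of validation points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, point, k) == label for point, label in validation)
    return hits / len(validation)


# Hypothetical data: two noisy groups of 2-D points with labels "a" and "b".
rng = random.Random(1)


def make_points(n, center, label):
    return [((rng.gauss(center[0], 1), rng.gauss(center[1], 1)), label) for _ in range(n)]


data = make_points(100, (0, 0), "a") + make_points(100, (3, 3), "b")
rng.shuffle(data)
train, validation = data[:150], data[150:]

# Try several parameter values and keep the one that does best on held-out data.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5, 9, 15)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

The same loop works for comparing entirely different algorithms, as long as each one is scored on the same held-out data.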


Algorithms for Data Mining Tasks (1 of 2)

Algorithm Name | Description
Microsoft Time Series | Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.
Microsoft Decision Trees | Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.
Microsoft Linear Regression | If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.
Microsoft Clustering | Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.
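
To make "iterative techniques to group records into clusters" concrete, here is a generic k-means sketch in plain Python. It is an illustration of the general idea only and is not the Microsoft Clustering implementation; the function names, the fixed seed, and the sample points are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Generic k-means: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen records
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            mean_point(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the grouping is stable
            break
        centers = new_centers
    return centers, clusters


# Hypothetical records with two numeric attributes each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```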

Algorithms for Data Mining Tasks (2 of 2)

Algorithm Name | Description
Microsoft Naïve Bayes | Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.
Microsoft Logistic Regression | Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.
Microsoft Neural Network | Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used to classify discrete attributes and regression of continuous attributes.
Microsoft Association Rules | Builds rules that describe which items are likely to appear together in a transaction.
Microsoft Sequence Clustering | Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
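
The idea behind rules that "describe which items are likely to appear together in a transaction" can be sketched with the two standard measures, support and confidence. This is a generic illustration restricted to rules between single items; it is not the Microsoft Association Rules implementation, and the baskets, thresholds, and function names are hypothetical.

```python
from itertools import combinations


def support(transactions, itemset):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Rules lhs -> rhs between single items that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support(transactions, {a, b})
        if pair_support < min_support:
            continue  # the pair is too rare to be worth reporting
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the transactions containing lhs, how many also contain rhs.
            confidence = pair_support / support(transactions, {lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical shopping baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

for lhs, rhs, s, c in pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```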

Best practice: Document your science

Describe the business problem

Determine how to measure success (including baseline)

Document what was learned during data preparation and analysis

Justify the algorithms used during the investigation

List the assumptions that were made
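
Measuring success against a baseline can be sketched as follows. The example assumes a classification problem, accuracy as the success measure, and "always predict the most common class" as the baseline; these are illustrative choices, and the labels and model predictions are made up.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Share of cases where the prediction matches the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(train_labels, n):
    """Baseline that always predicts the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n


# Hypothetical labels and model output, for illustration only.
train_labels = ["no"] * 8 + ["yes"] * 2
actual = ["no", "no", "yes", "no", "yes", "no", "no", "no", "no", "yes"]
model_predictions = ["no", "no", "yes", "no", "no", "no", "no", "yes", "no", "yes"]

baseline_accuracy = accuracy(majority_baseline(train_labels, len(actual)), actual)
model_accuracy = accuracy(model_predictions, actual)
improvement = model_accuracy - baseline_accuracy

print(f"baseline={baseline_accuracy:.2f} model={model_accuracy:.2f} improvement={improvement:+.2f}")
```

Recording the baseline next to the model result makes it clear how much of the reported accuracy the model actually earned.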


Section Four

ACHIEVING BUSINESS VALUE

Leadership challenges


Build on organizational communications


Consider redoing analysis


Find results champions


Celebrate the results


Best practice: prepare the next cycle


Note strengths, weaknesses, opportunities, risks


Build consensus on model expiration dates


Encourage and improve the process


Create insight into new future data collection





Conclusion


Best Practices Framework



Provide a data mining foundation



Prepare the data



Evaluate machine learning output



Plan to move toward actionable decisions


Resources

http://www.lfd.uci.edu/~gohlke/pythonlibs/
Free Win x64 Python libs

http://www.enthought.com/products/epd.php
Commercial Python

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
R Tutorial

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx
SQL Server Analysis Services Data Mining

http://marktab.net
Data Mining Portal

http://sqlserverdatamining.com
Data Mining Team Portal

Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”