Data Mining with Analysis Services - SQL Server Consultants

muscleblouseAI and Robotics

Oct 19, 2013 (4 years and 20 days ago)

96 views

Data Mining



with Analysis Services

Carlos Bossy

Principal Consultant

MCTS, MCITP BI


Aabcom Solutions

www.aabcomsolutions.com

www.carlosbossy.com

My Background


25 years Software Development


15 years software companies as Programmer, Architect,
Manager, Director, VP, CTO


5+
years Business Intelligence Consultant


Data
Warehouse deployment
for State of
WA
Child
Welfare


Data
Integration/Warehouse architecture for Solar Energy company


Data Warehouse model for State of OR


Data Mining model for Houston sports league

Current Projects

Experience

Overview

Make
Data Mining an integral
component of your Data Architecture.

Session Overview

What is
Data Mining

Data Mining
Algorithms
in SQL
Server

Creating a
Data Mining
Model

Using the
Model in an
Application

Using the
Model in
SSIS

Awareness
of Data
Mining
Architecture
and Process

Application Architecture

Application

Service

Business

Data

Relational Database

Application Architecture with BI

Application

Service

Business

Data

Relational Database

Data Integration

Data
Warehouse

Cube

Text
Warehouse

Performance
Management

Reports

Analysis

Data Mining

Text Mining

Ad
-
hoc

Data Mining

If
you want to…


Predict the
future


Get rich buying
stocks


Win the
lottery


Win your Fantasy Football
League

Data
Mining is for you!

Why Mine

Data?

Really
-

Why Mine
Data?

Un c o v e r H i d d e n
R e l a t i o n s h i p s

F i n d s o me t h i n g u n u s u a l o r
u n e x p e c t e d

I mp r o v e u p o n d o ma i n
e x p e r t ’ s k n o wl e d g e

Ma n a g e l a r g e d a t a s e t s

C r e a t e P r e d i c t i v e A n a l y t i c s
p l a t f o r m

Maximize
value of Data

Competitive
Edge

Data Mining Defined

Data Mining is the process of sorting through large
amounts of
data and
picking out relevant information (Wikipedia).


Data
Mining is the extraction of hidden predictive information from
large databases.



The Process of Knowledge Discovery


Sophisticated Statistical Model


Discovery of Patterns and Relationships

It is
not
Dredging, Snooping or an
Invasion of Privacy

Today vs. Yesterday


Explosion of Data doubles every 3
years (Moore’s Law)


Data Volumes can’t be comprehended
by humans


Uncover complex and difficult to find
patterns for competitive edge


Improve professional judgment of
Domain Expert (small but valuable)


Knowledge Discovery


Converting Data to Information


N


Infinity

Future

Legacy


Manageable Volumes of Data


The power of SQL


Domain Experts could grasp and
analyze a complete Database


Limited CPU Horsepower


Finite


Outcomes

Predict number of runs Colorado Rockies would score in their next game
using Neural Network.


Input


Previous 12 games runs scored


Home/Away

Predict


Number runs next game

Model


Home

Runs = (G7*.142)+(G1*.118)+…


Away


Runs = (G7*.129)+(G1*.091)+…

Non
-
linear Results

Fitting a
line
to a set of data points to measure the effect
of a single independent variable.

y =
mx

+ b

Statistical Methods

Data Mining

Outcomes


Decision Tree

Annual Income = 108,491.880+6.394*(Savings Balance
-
27,898.703)
-
2,443.789*(
Avg

Check Size
-
59.555)
-
0.247*(Credit Card Limit
-
8,250.000)+33.430*(Age
-
38.500)
-
1,032.285*(Over Drafts
-
1.000)

Outcomes


Time Series

Total Gen
-
1 < 36276.430 and Flow Date < 3/2/2007 4:58:07 AM and Flow Date <
8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM

Total Gen = 31452.534 + 0.126 * Total Gen(
-
1)

Total Gen
-
1 < 36276.430 and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date <
5/16/2007 7:46:52 PM and Flow Date >= 3/15/2007 12:56:15 PM

Total Gen = 6186.992 + 0.361 * Total Gen(
-
2) + 0.602 * Total Gen(
-
1)

Applications

o
Credit Risk Analysis

o
Churn Analysis

o
Customer Retention

o
Targeted
Marketing

o
Market Basket Analysis

o
Sales Forecasting

o
Stock Predictions

o
Medical Diagnosis

o
Bioscience Research

o
Surveys

o
Insurance Rate Quotes

o
Credit Card Fraud

o
Web Site Events

o
Loan Applications

o
Hiring and Recruiting

o
Cross
-
Marketing

o
Social Science

o
Economics

Data Mining
Sources

SSAS

Cube

DW


Data Mining


OLTP

Database

Table

Col 1

Col 2

Col

3

SQL Server

Oracle

MySQL

Etc.

Cleanse

Massage

Select
Training Set

Apply DM
Algorithm

Train

Test

Training the Model

Using the Model

Customer

CustomerID

CustName

Mining Model

CheckingBalance

SavingsBalance

NumTransactions

MonthlyBankAccount

BankAccountID

CheckingBalance

SavingsBalance

NumTransactions

CustomerID

Prediction

Income

SQL Join

DMX Prediction
Join

Result

Prediction Join

DMX

SELECT


Predict([Movie Purchases],3) as Movies

From


[Customer Movie Association]

NATURAL PREDICTION JOIN

(SELECT 44 AS [Age],


'Female'

AS [Gender],


'Rent'

AS [Home Ownership],


'Divorced'

AS [Marital Status],


(SELECT
'Mrs. Doubtfire'

AS [Movie Title]


UNION SELECT
'My Big Fat Greek Wedding'

AS [Movie Title]


UNION SELECT
'Patriot Games'

AS [Movie Title]) AS [Movie Purchases]) AS t

Approaches

Clustering

Classification

Regression

Market
Basket
Analysis

Mining Algorithms


Time Series


Naïve Bayes


Association


Clustering


Decision Trees


Logistic Regression


Clustering


Sequence Clustering


Neural Networks

Data Mining Algorithms

Analytical problem

Examples

Algorithms

Classification: Assign cases to
predefined classes

Credit risk analysis

Churn analysis

Customer retention

Decision Trees

Na
i
ve

Bayes

Neural Nets

Segmentation:
T
axonomy for grouping
similar cases

Customer profile analysis

Mailing campaign

Clustering

Sequence Clustering

Association: Advanced counting for
correlations

Market basket analysis

Advanced data exploration

Decision Trees

Association

Time Series Forecasting: Predict the
future

Forecast sales

Predict stock prices

Time Series

Prediction: Predict a value for a new
case based on values for similar cases

Quote insurance rates

Predict customer income

All

Deviation analysis: Discover how a case
or segment differs from others

Credit card fraud detection

Network infusion analysis

All

*
Andy
Cheung, Microsoft

Time Series


Uses
Autoregression

+ Decision Tree to build model


Each time series is a single case


No prediction join with test or actual cases


Prediction is always the same for given time slots

Analyzes how a variable changes over time.

Time Series Tree

All

Stock(t
-
5)
<= 22.83

Stock(t
-
5)
> 22.83

Stock(t
-
1)
<= 30.15

Stock(t
-
1)
> 30.15

Stock = 6.85 + .62*Stock(t
-
1) + .21*Stock(t
-
2)

Node Regression Formula

Naïve Bayes

Probabilistic classifier based on
Bayes
’ theorem with strong
(naive) independence assumptions.



Simple Classification Algorithm


Good starting point for better understanding of your data


Uses only discrete data

Naïve Bayes Example

Cell Phone
Service

Gender

Premium
Service

Custom Ring Tones

International
Calls

Female

53%

19%

56%

27%

Male

47%

41%

14%

38%

Premium Svc = Yes Ring Tones = Yes International = No

Likelihood of Female = .53 * .19 * .56 * .73 = .0412

Likelihood of Male = .47 * .41 * .14 * .62 = .0167

P(Female) = .0412/(.0412 + .0167) = 71.2%

P(Male) = .0167/(.0412 + .0167) = 28.8%

Association Rules

Detect relationships or associations between specific values of
categorical variables in large data sets.


Uses only Discrete Data


Rule
-

Attribute value conditions that occur frequently together in a given
dataset {Male, IT, Star Wars}


{Star Trek}


Itemset
-

A set of attribute values.


Support
-

Total number of transactions.


Confidence
-

Probability that {X}


{Y}


Importance (Lift)

Interestingness
. Measure of whether Correlation is
positive, negative or none.

Logistic Regression

Predict the probability of a discrete outcome from a set of
variables that are continuous, discrete or both
.



Non
-
linear regression model that produces results between 0 and 1.


Popular in health science for disease prediction.


Marketing uses for dichotomous predictions (buy or not buy, renew or
cancel).


Same as Neural Network without the hidden layer.

Probability = 1/1 + e
-
z

where z = c + yx
1

+ yx
2
+ …


If x1 = Weight and x2 = Age then

Risk of heart attack = 1/1 + e
-
z

where z =
-
2.4 + 1.3x
1

-

.7x
2

Decision Tree

Graphical representation displaying options, risks and the
decision
-
making sequence
.



Most popular data mining model.


East to visualize because of it’s graphical representation. Branches
represent choices with associated risks, costs, results, or probabilities.


Each test examines the value of a single column in the data and uses it to
determine the next test to apply. The results of all tests determine which
label to predict.


Similar to human thought process when making a decision.


Finds non
-
linear relationships.


Supports classification, regression and association within the model.

Neural Network

Classifies large and complex data sets by grouping cases together
in a way loosely based on the brain.



Most sophisticated algorithm but difficult to interpret.


Works well with non
-
linear data and finds smooth non
-
linear relationships.


Modeled as a group of interconnected nodes.


No agreed upon definition. Microsoft algorithm is one of many techniques.


Can build multiple models based on discrete inputs.

I

I

I

H

H

H

H

O

O

Back to Input layer after
weights adjusted for error

Clustering

Places data elements into related groups without
advance knowledge of the group definitions
.



Good starting point for better understanding of your data.


Finds the hidden variable that accurately classifies data.


Data grouped into clusters have a high similarity based on
the attribute values.


Sequence Clustering

Discovers the properties of sequences by grouping them
into clusters and assigning
them to one
of
the
clusters
.



Hybrid of sequence and clustering techniques.


Typically used with web and event logs as data sources.


Demo

Conclusion

Slow Adoption

Where do you
start?

Science + Art

Not quite A.I.
… yet!

More Info and References

TDWI


The Data Warehousing Institute

ACM, IEEE

Books: Data Mining with SQL Server 2005/8 (Wiley)



Mining the Talk (IBM Press)



Data Mining
know it all

(Morgan/Kaufman)