Data Mining
with Analysis Services
Carlos Bossy
Principal Consultant
MCTS, MCITP BI
Aabcom Solutions
www.aabcomsolutions.com
www.carlosbossy.com
My Background
25 years Software Development
15 years software companies as Programmer, Architect,
Manager, Director, VP, CTO
5+
years Business Intelligence Consultant
Data
Warehouse deployment
for State of
WA
Child
Welfare
Data
Integration/Warehouse architecture for Solar Energy company
Data Warehouse model for State of OR
Data Mining model for Houston sports league
Current Projects
Experience
Overview
Make
Data Mining an integral
component of your Data Architecture.
Session Overview
What is
Data Mining
Data Mining
Algorithms
in SQL
Server
Creating a
Data Mining
Model
Using the
Model in an
Application
Using the
Model in
SSIS
Awareness
of Data
Mining
Architecture
and Process
Application Architecture
Application
Service
Business
Data
Relational Database
Application Architecture with BI
Application
Service
Business
Data
Relational Database
Data Integration
Data
Warehouse
Cube
Text
Warehouse
Performance
Management
Reports
Analysis
Data Mining
Text Mining
Ad

hoc
Data Mining
If
you want to…
Predict the
future
Get rich buying
stocks
Win the
lottery
Win your Fantasy Football
League
Data
Mining is for you!
Why Mine
Data?
Really

Why Mine
Data?
Un c o v e r H i d d e n
R e l a t i o n s h i p s
F i n d s o me t h i n g u n u s u a l o r
u n e x p e c t e d
I mp r o v e u p o n d o ma i n
e x p e r t ’ s k n o wl e d g e
Ma n a g e l a r g e d a t a s e t s
C r e a t e P r e d i c t i v e A n a l y t i c s
p l a t f o r m
Maximize
value of Data
Competitive
Edge
Data Mining Defined
Data Mining is the process of sorting through large
amounts of
data and
picking out relevant information (Wikipedia).
Data
Mining is the extraction of hidden predictive information from
large databases.
The Process of Knowledge Discovery
Sophisticated Statistical Model
Discovery of Patterns and Relationships
It is
not
Dredging, Snooping or an
Invasion of Privacy
Today vs. Yesterday
Explosion of Data doubles every 3
years (Moore’s Law)
Data Volumes can’t be comprehended
by humans
Uncover complex and difficult to find
patterns for competitive edge
Improve professional judgment of
Domain Expert (small but valuable)
Knowledge Discovery
Converting Data to Information
N
Infinity
Future
Legacy
Manageable Volumes of Data
The power of SQL
Domain Experts could grasp and
analyze a complete Database
Limited CPU Horsepower
Finite
Outcomes
Predict number of runs Colorado Rockies would score in their next game
using Neural Network.
Input
Previous 12 games runs scored
Home/Away
Predict
Number runs next game
Model
Home
Runs = (G7*.142)+(G1*.118)+…
Away
Runs = (G7*.129)+(G1*.091)+…
Non

linear Results
Fitting a
line
to a set of data points to measure the effect
of a single independent variable.
y =
mx
+ b
Statistical Methods
Data Mining
Outcomes
–
Decision Tree
Annual Income = 108,491.880+6.394*(Savings Balance

27,898.703)

2,443.789*(
Avg
Check Size

59.555)

0.247*(Credit Card Limit

8,250.000)+33.430*(Age

38.500)

1,032.285*(Over Drafts

1.000)
Outcomes
–
Time Series
Total Gen

1 < 36276.430 and Flow Date < 3/2/2007 4:58:07 AM and Flow Date <
8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM
Total Gen = 31452.534 + 0.126 * Total Gen(

1)
Total Gen

1 < 36276.430 and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date <
5/16/2007 7:46:52 PM and Flow Date >= 3/15/2007 12:56:15 PM
Total Gen = 6186.992 + 0.361 * Total Gen(

2) + 0.602 * Total Gen(

1)
Applications
o
Credit Risk Analysis
o
Churn Analysis
o
Customer Retention
o
Targeted
Marketing
o
Market Basket Analysis
o
Sales Forecasting
o
Stock Predictions
o
Medical Diagnosis
o
Bioscience Research
o
Surveys
o
Insurance Rate Quotes
o
Credit Card Fraud
o
Web Site Events
o
Loan Applications
o
Hiring and Recruiting
o
Cross

Marketing
o
Social Science
o
Economics
Data Mining
Sources
SSAS
Cube
DW
Data Mining
OLTP
Database
Table
Col 1
Col 2
Col
3
SQL Server
Oracle
MySQL
Etc.
Cleanse
Massage
Select
Training Set
Apply DM
Algorithm
Train
Test
Training the Model
Using the Model
Customer
CustomerID
CustName
Mining Model
CheckingBalance
SavingsBalance
NumTransactions
MonthlyBankAccount
BankAccountID
CheckingBalance
SavingsBalance
NumTransactions
CustomerID
Prediction
Income
SQL Join
DMX Prediction
Join
Result
Prediction Join
DMX
SELECT
Predict([Movie Purchases],3) as Movies
From
[Customer Movie Association]
NATURAL PREDICTION JOIN
(SELECT 44 AS [Age],
'Female'
AS [Gender],
'Rent'
AS [Home Ownership],
'Divorced'
AS [Marital Status],
(SELECT
'Mrs. Doubtfire'
AS [Movie Title]
UNION SELECT
'My Big Fat Greek Wedding'
AS [Movie Title]
UNION SELECT
'Patriot Games'
AS [Movie Title]) AS [Movie Purchases]) AS t
Approaches
Clustering
Classification
Regression
Market
Basket
Analysis
Mining Algorithms
Time Series
Naïve Bayes
Association
Clustering
Decision Trees
Logistic Regression
Clustering
Sequence Clustering
Neural Networks
Data Mining Algorithms
Analytical problem
Examples
Algorithms
Classification: Assign cases to
predefined classes
Credit risk analysis
Churn analysis
Customer retention
Decision Trees
Na
i
ve
Bayes
Neural Nets
Segmentation:
T
axonomy for grouping
similar cases
Customer profile analysis
Mailing campaign
Clustering
Sequence Clustering
Association: Advanced counting for
correlations
Market basket analysis
Advanced data exploration
Decision Trees
Association
Time Series Forecasting: Predict the
future
Forecast sales
Predict stock prices
Time Series
Prediction: Predict a value for a new
case based on values for similar cases
Quote insurance rates
Predict customer income
All
Deviation analysis: Discover how a case
or segment differs from others
Credit card fraud detection
Network infusion analysis
All
*
Andy
Cheung, Microsoft
Time Series
Uses
Autoregression
+ Decision Tree to build model
Each time series is a single case
No prediction join with test or actual cases
Prediction is always the same for given time slots
Analyzes how a variable changes over time.
Time Series Tree
All
Stock(t

5)
<= 22.83
Stock(t

5)
> 22.83
Stock(t

1)
<= 30.15
Stock(t

1)
> 30.15
Stock = 6.85 + .62*Stock(t

1) + .21*Stock(t

2)
Node Regression Formula
Naïve Bayes
Probabilistic classifier based on
Bayes
’ theorem with strong
(naive) independence assumptions.
Simple Classification Algorithm
Good starting point for better understanding of your data
Uses only discrete data
Naïve Bayes Example
Cell Phone
Service
Gender
Premium
Service
Custom Ring Tones
International
Calls
Female
53%
19%
56%
27%
Male
47%
41%
14%
38%
Premium Svc = Yes Ring Tones = Yes International = No
Likelihood of Female = .53 * .19 * .56 * .73 = .0412
Likelihood of Male = .47 * .41 * .14 * .62 = .0167
P(Female) = .0412/(.0412 + .0167) = 71.2%
P(Male) = .0167/(.0412 + .0167) = 28.8%
Association Rules
Detect relationships or associations between specific values of
categorical variables in large data sets.
Uses only Discrete Data
Rule

Attribute value conditions that occur frequently together in a given
dataset {Male, IT, Star Wars}
{Star Trek}
Itemset

A set of attribute values.
Support

Total number of transactions.
Confidence

Probability that {X}
{Y}
Importance (Lift)
–
Interestingness
. Measure of whether Correlation is
positive, negative or none.
Logistic Regression
Predict the probability of a discrete outcome from a set of
variables that are continuous, discrete or both
.
Non

linear regression model that produces results between 0 and 1.
Popular in health science for disease prediction.
Marketing uses for dichotomous predictions (buy or not buy, renew or
cancel).
Same as Neural Network without the hidden layer.
Probability = 1/1 + e

z
where z = c + yx
1
+ yx
2
+ …
If x1 = Weight and x2 = Age then
Risk of heart attack = 1/1 + e

z
where z =

2.4 + 1.3x
1

.7x
2
Decision Tree
Graphical representation displaying options, risks and the
decision

making sequence
.
Most popular data mining model.
East to visualize because of it’s graphical representation. Branches
represent choices with associated risks, costs, results, or probabilities.
Each test examines the value of a single column in the data and uses it to
determine the next test to apply. The results of all tests determine which
label to predict.
Similar to human thought process when making a decision.
Finds non

linear relationships.
Supports classification, regression and association within the model.
Neural Network
Classifies large and complex data sets by grouping cases together
in a way loosely based on the brain.
Most sophisticated algorithm but difficult to interpret.
Works well with non

linear data and finds smooth non

linear relationships.
Modeled as a group of interconnected nodes.
No agreed upon definition. Microsoft algorithm is one of many techniques.
Can build multiple models based on discrete inputs.
I
I
I
H
H
H
H
O
O
Back to Input layer after
weights adjusted for error
Clustering
Places data elements into related groups without
advance knowledge of the group definitions
.
Good starting point for better understanding of your data.
Finds the hidden variable that accurately classifies data.
Data grouped into clusters have a high similarity based on
the attribute values.
Sequence Clustering
Discovers the properties of sequences by grouping them
into clusters and assigning
them to one
of
the
clusters
.
Hybrid of sequence and clustering techniques.
Typically used with web and event logs as data sources.
Demo
Conclusion
Slow Adoption
Where do you
start?
Science + Art
Not quite A.I.
… yet!
More Info and References
TDWI
–
The Data Warehousing Institute
ACM, IEEE
Books: Data Mining with SQL Server 2005/8 (Wiley)
Mining the Talk (IBM Press)
Data Mining
know it all
(Morgan/Kaufman)
Comments 0
Log in to post a comment