MARK333-GMO_v3


Oct 19, 2013


Green Mail Order

MARK 333 L2 Group 5

Business Objectives & Success Criteria



Boost sales


Increase response rate


Reduce mailing costs by targeting potential customers more accurately

CRISP-DM


STEP 1: Business Understanding

Data Mining Goal


Decide who to send catalog to


Customers: high probability of purchase



STEP 1: Business Understanding

Data Understanding

Verifying data quality




Missing data:


“0” in ‘COUNTY’



“-999” in ‘RETURN’


“none” in ‘NTITLE’



Mismatched data:


“Ms” in ‘NTITLE’ & “1” in ‘MARITAL’
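The checks above could be sketched as follows. This is a minimal illustration, assuming each record is a dictionary of string values; the helper names and the sample rows are hypothetical, and only the sentinel codes and the NTITLE/MARITAL combination come from the slide.

```python
# Sentinel codes treated as missing, as listed on the slide.
MISSING_CODES = {"COUNTY": {"0"}, "RETURN": {"-999"}, "NTITLE": {"none"}}

def clean_record(record):
    """Return a copy of the record with sentinel codes replaced by None."""
    cleaned = dict(record)
    for column, codes in MISSING_CODES.items():
        if str(cleaned.get(column)) in codes:
            cleaned[column] = None
    return cleaned

def has_title_mismatch(record):
    """Flag the combination the slide lists: NTITLE "Ms" with MARITAL "1"."""
    return record.get("NTITLE") == "Ms" and str(record.get("MARITAL")) == "1"

# Hypothetical rows, for illustration only.
rows = [
    {"COUNTY": "0", "RETURN": "-999", "NTITLE": "none", "MARITAL": "0"},
    {"COUNTY": "12", "RETURN": "3", "NTITLE": "Ms", "MARITAL": "1"},
]
cleaned_rows = [clean_record(r) for r in rows]
mismatches = [r for r in rows if has_title_mismatch(r)]
```

Replacing the codes with None rather than dropping the rows keeps the decision about how to treat missing values for the Data Preparation step.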



STEP 2: Data Understanding

Data Preparation

Data Cleansing


Resolve inconsistencies





STEP 3: Data Preparation

Data Preparation


Data Selection


Filter away irrelevant variables


VALRATION


STATECOD


NTITLE or SEX


RETURN


DINING/DISHES/FLATWARE/KITCHEN



STEP 3: Data Preparation



ACCTNUM



COUNTY



CUSTDATE


Decision Tree

9 Selected attributes: SEX, JOB, EDLEVEL, HEAT, FLATWARE, APPAREL, APRTMNT, TELIND, FREQUENCY

Highest Accuracy: 61.82%



STEP 4: Modeling (Decision Tree)

Mechanism of selecting variables

1. Logic

JOB, EDLEVEL, RACE, INCOME, TRAVTIME, HEAT, HOMEVAL, NUMCARS, TELIND, MOBILE, JEWELRY

2. Association Rule

Interesting relationships between variables

3. Trial & Error

Repeatedly add and delete variables



STEP 4: Modeling (Decision Tree)

Evaluation


Gains Chart



STEP 4: Modeling (Decision Tree)

Stability Testing by random seed

Average    57.58%
S.D.       2.47%
Maximum    61.82%
Minimum    55.77%



STEP 4: Modeling (Decision Tree)

Neural Network

19 Selected attributes: AMOUNT, FREQUENCY, RECENCY, MOBILE, APPAREL, PROMO13, MENSWARE, FLATWARE, DISHES, LAMPS, LINENS, BLANKETS, OUTDOOR, COATS, WCOAT, WAPPAR, HHAPPAR, JEWELRY, DINING



STEP 4: Modeling (Artificial Neural Network)

Highest Accuracy: 62.74%

Mechanism of Selecting Variables

1. Quick method

Refer to the previous findings

2. Relative Importance

Delete variables with low importance





STEP 4: Modeling (Artificial Neural Network)

Mechanism of Selecting Variables

3. Trial & Error


Test with different random seeds to check stability

4. Further evaluation by other methods


Dynamic


Multiple




STEP 4: Modeling (Artificial Neural Network)

Evaluation - by Random Seeds

Seed Number    Accuracy Rate
111111         62.74%
3086           62.57%
77             58.57%
9876           61.57%
555            59.71%
4575715        58.62%
2174006        57.03%
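The seed-by-seed accuracy rates can be summarised with a few lines of Python. This is only a sketch: the figures are copied from the table, the variable names are illustrative, and the choice of population standard deviation is an assumption, since the slides do not say which formula was used.

```python
from statistics import mean, pstdev

# Accuracy rates (%) by random seed, copied from the table above.
accuracy_by_seed = {
    111111: 62.74, 3086: 62.57, 77: 58.57, 9876: 61.57,
    555: 59.71, 4575715: 58.62, 2174006: 57.03,
}

rates = list(accuracy_by_seed.values())
avg = mean(rates)
sd = pstdev(rates)  # population SD (assumption); statistics.stdev gives the sample SD
best_seed = max(accuracy_by_seed, key=accuracy_by_seed.get)

print(f"mean={avg:.2f}%  sd={sd:.2f}%  best seed={best_seed}")
```

A small spread across seeds suggests the model is stable; a large one suggests the reported maximum owes a lot to a lucky split.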



STEP 4: Modeling (Artificial Neural Network)

Evaluation - by Different Methods

Method      Mean Accuracy Rate    Standard Deviation
Quick       60.45%                2.05%
Dynamic     60.60%                2.66%
Multiple    59.67%                2.31%



STEP 4: Modeling (Artificial Neural Network)

Evaluation Gains Charts



STEP 4: Modeling (Artificial Neural Network)

Logistic Regression

12 Selected attributes: FREQUENCY, RECENCY, TELIND, APPAREL, PROMO13, FLATWARE, LAMPS, OUTDOOR, COATS, WCOAT, JEWELRY, NUMCARS



STEP 4: Modeling (Logistic Regression)

Highest Accuracy: 64.05%

Mechanism of selecting variables

1. Forward method


6 variables: FREQUENCY, RECENCY, LAMPS, COATS, PROMO13 and OUTDOOR

2. Association rule


Strong Support: APPAREL, FLATWARE and JEWELRY


100% Confidence: If APPAREL, FLATWARE and JEWELRY then NUMCARS


3. Trial & Error


Two variables: TELIND and WCOAT
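Support and confidence, the two measures used in the association rule step, can be computed roughly as below. This is a sketch under assumptions: each customer is represented as a set of flagged variables, the customer data is hypothetical, and the function names are illustrative rather than taken from the tool the group used.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); None if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return None
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical customers, each a set of flagged variables.
customers = [
    {"APPAREL", "FLATWARE", "JEWELRY", "NUMCARS"},
    {"APPAREL", "FLATWARE", "JEWELRY", "NUMCARS"},
    {"APPAREL", "LAMPS"},
    {"COATS", "NUMCARS"},
]

rule_if = {"APPAREL", "FLATWARE", "JEWELRY"}
rule_then = {"NUMCARS"}
rule_support = support(customers, rule_if | rule_then)
rule_confidence = confidence(customers, rule_if, rule_then)
```

In this toy data every customer with APPAREL, FLATWARE and JEWELRY also has NUMCARS, so the rule has 100% confidence, which mirrors the kind of rule the slide reports.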



STEP 4: Modeling (Logistic Regression)

Evaluation of Logistic Regression Model

Maximum Likelihood function

Model Fitting Information

                  Model Fitting Criteria    Likelihood Ratio Tests
Model             -2 Log Likelihood         Chi-Square    df    Sig.
Intercept Only    1527.174
Final             1437.696                  89.478        12    .000
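The likelihood ratio test in the table can be checked by hand: the chi-square statistic is the drop in -2 log likelihood between the two models. The sketch below assumes the first value belongs to the intercept-only model and the second to the final model; the helper function is illustrative and uses a closed form that holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

neg2ll_intercept = 1527.174  # -2 log likelihood, intercept-only model (assumed row)
neg2ll_final = 1437.696      # -2 log likelihood, final model (assumed row)

chi_square = neg2ll_intercept - neg2ll_final
p_value = chi2_sf_even_df(chi_square, df=12)

print(f"chi-square={chi_square:.3f}, p={p_value:.3g}")
```

A p-value far below 0.001 is what a reported Sig. of .000 indicates: the 12 predictors together fit the data significantly better than the intercept alone.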



STEP 4: Modeling (Logistic Regression)

Accuracy Rate

Average    60.81%
S.D.       2.02%
Maximum    64.05%
Minimum    56.91%



STEP 4: Modeling (Logistic Regression)

Stability testing by random seed and different testing data size

50% sampled -> 64%

Predictive power of the final model - Gain Chart

Model Evaluation

Evaluation Criteria

Which model should we use?


1. Accuracy Rate

The higher the rate, the better

Compare accuracy rates on testing data



STEP 5: Evaluation

Evaluation Criteria

2. Confusion Matrix

More detailed information

Positives are especially important

How many of the positives does our model cover?
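The two criteria so far, accuracy rate and the share of positives covered, both fall out of the confusion matrix. A minimal sketch, assuming 0/1 labels where 1 means the customer purchased; the labels and function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def positives_covered(tp, fn):
    """Share of actual positives that the model predicts as positive (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels: 1 = purchased, 0 = did not purchase.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
covered = positives_covered(tp, fn)
```

In this toy case the model is 70% accurate yet covers only half of the buyers, which is why accuracy alone is not enough for a mailing decision.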




STEP 5: Evaluation

Evaluation Criteria

3. Gain Chart

Which model performs best depending on how many customers we serve?

Look at the 60% mark
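One point on a gain chart can be computed as below: rank customers by predicted score, take the top fraction, and measure what share of all buyers falls inside it. This is a sketch with hypothetical scores and outcomes; the function name is illustrative.

```python
def cumulative_gain(scores, labels, fraction):
    """Share of all positives found in the top `fraction` of customers by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * fraction)
    found = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)
    return found / total if total else 0.0

# Hypothetical model scores and outcomes (1 = purchased).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

gain_at_60 = cumulative_gain(scores, labels, 0.60)
```

Mailing a random 60% of customers would be expected to reach about 60% of buyers, so a model is useful at that mark only if its gain is clearly higher.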




STEP 5: Evaluation

Model Selection (1)


Accuracy rate


Logistic Regression (avg): 58.8%


Decision Tree (avg): 56.9%


Artificial Neural Network (avg): 56.9%



STEP 5: Evaluation


Confusion Matrix (positives covered)


LR: 51.3%, stable


DT: 53.6%, not stable


ANN: 53.3%, not stable




STEP 5: Evaluation

Model Selection (2)

Gain Chart




STEP 5: Evaluation

Deployment


Produce prediction result (CSV file)

Send catalogues to customers with positive predictions

Monitor efficiency

If efficiency is high: adjust the model over time for future use

If efficiency is low: rework the old model or build a new model
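The deployment steps could look roughly like this in code. Everything here is an assumption made for illustration: the 0.5 threshold, the column names `score` and `send_catalog`, the sample accounts, and the use of response rate as the efficiency measure are not specified in the slides.

```python
import csv
import io

def write_predictions(rows, threshold, out):
    """Write one CSV line per customer and return the accounts predicted positive."""
    writer = csv.writer(out)
    writer.writerow(["ACCTNUM", "score", "send_catalog"])
    mailing_list = []
    for acct, score in rows:
        send = score >= threshold
        writer.writerow([acct, f"{score:.2f}", int(send)])
        if send:
            mailing_list.append(acct)
    return mailing_list

def response_rate(mailed, purchasers):
    """Share of mailed customers who went on to purchase."""
    return len(set(mailed) & set(purchasers)) / len(mailed) if mailed else 0.0

# Hypothetical scored customers: (ACCTNUM, predicted purchase probability).
scored = [("A001", 0.81), ("A002", 0.35), ("A003", 0.62)]

buffer = io.StringIO()  # stands in for a real CSV file
mailing_list = write_predictions(scored, threshold=0.5, out=buffer)
rate = response_rate(mailing_list, purchasers=["A001"])
```

Tracking the response rate after each mailing gives the signal for the last two steps: keep adjusting the model while the rate holds up, and rework or rebuild it when the rate drops.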



STEP 6: Deployment

Q & A Session

Confusion Matrix



STEP 4: Modeling (Logistic Regression)