Introduction to Data Mining

© Deloitte Consulting, 2004

James Guszcza, FCAS, MAAA

CAS 2004 Ratemaking Seminar

Philadelphia

March 11-12, 2004


Themes


What is Data Mining?


How does it relate to statistics?


Insurance applications


Data sources


The Data Mining Process


Model Design


Modeling Techniques


Louise Francis’ Presentation


Themes


How does data mining need actuarial science?


Variable creation


Model design


Model evaluation


How does actuarial science need data mining?


Advances in computing, modeling techniques


Ideas from other fields can be applied to insurance problems


Themes


“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”
-- Ian Hacking



Data mining gives us new ways of approaching the age-old problems of risk selection and pricing….


….and other problems not traditionally considered ‘actuarial’.

What is Data Mining?


My definition: “Statistics for the Computer Age”


Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of “statistics”


Not a radical break with traditional statistics


Complements, builds on traditional statistics


Statistics enriched with brute-force capabilities of modern computing


Opens the door to new techniques


Therefore Data Mining tends to be associated with industrial-sized data sets

Buzz-words


Data Mining


Knowledge Discovery


Machine Learning


Statistical Learning


Predictive Modeling


Supervised Learning


Unsupervised Learning


….etc




What is Data Mining?


Supervised learning: predict the value of a target variable based on several predictive variables


“Predictive Modeling”


Credit / non-credit scoring engines


Retention, cross-sell models



Unsupervised learning: describe associations and patterns along many dimensions without any target information


Customer segmentation


Data Clustering


Market basket analysis (“diapers and beer”)


So Why Should Actuaries Do This Stuff?


Any application of statistics requires subject-matter expertise


Psychometricians


Econometricians


Bioinformaticians


Marketing scientists


…are all applied statisticians with a particular subject-matter expertise & area of specialty


Add actuarial modelers to this list!


“Insurometricians”!?


Actuarial knowledge is critical to the success of insurance data mining projects

Three Concepts


Scoring engines


A “predictive model” by any other name…


Lift curves


How much worse than average are the policies with the worst scores?


Out-of-sample tests


How well will the model work in the real world?


Unbiased estimate of predictive power


Classic Application: Scoring Engines


Scoring engine: formula that classifies or separates policies (or risks, accounts, agents…) into


profitable vs. unprofitable


Retaining vs. non-retaining…


(Non-)Linear equation f( ) of several predictive variables


Produces continuous range of scores

score = f(X_1, X_2, …, X_N)
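
To make the idea of a scoring engine concrete, here is a minimal Python sketch of a linear f applied to a handful of predictive variables. The variable names, weights and intercept are hypothetical, chosen only for illustration; they are not taken from the presentation, and a real engine could just as well be non-linear.

```python
# Minimal sketch of a linear scoring engine: score = f(X_1, ..., X_N).
# All variable names and weights below are hypothetical illustrations.

HYPOTHETICAL_WEIGHTS = {
    "prior_claims": 12.0,    # more prior claims -> higher (worse) score
    "years_insured": -1.5,   # longer tenure -> lower (better) score
    "vehicle_age": 0.8,
}
INTERCEPT = 50.0


def score(policy: dict) -> float:
    """Return a continuous score; higher means worse expected profitability."""
    return INTERCEPT + sum(
        weight * policy[name] for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )


policies = [
    {"id": "A", "prior_claims": 0, "years_insured": 10, "vehicle_age": 5},
    {"id": "B", "prior_claims": 3, "years_insured": 1, "vehicle_age": 12},
]
ranked = sorted(policies, key=score)  # best (lowest score) first
```

Because the output is a continuous score rather than a yes/no label, the same function can be used to rank policies and to choose a cut-off later.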


What “Powers” a Scoring Engine?


Scoring Engine:

score = f(X_1, X_2, …, X_N)


The X_1, X_2, …, X_N are at least as important as the f( )!


Again, this is why actuarial expertise is necessary


Think of the predictive power of credit variables


A large part of the modeling process consists of variable creation and selection


Usually possible to generate 100’s of variables


Steepest part of the learning curve

Model Evaluation: Lift Curves


Sort data by score


Break the dataset into 10 equal pieces


Best “decile”: lowest score → lowest LR (loss ratio)


Worst “decile”: highest score → highest LR



Difference: “Lift”


Lift = segmentation power


Lift translates into ROI of the modeling project
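
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout (score, premium, loss) and the synthetic data are assumptions, and lift is taken here as the simple difference in loss ratio between the worst and best deciles.

```python
# Sketch of a decile lift table. The record layout (score, premium, loss)
# is an assumption made for illustration.

def decile_lift(records, n_bins=10):
    """Sort by score, cut into n_bins near-equal groups, return loss ratio per group."""
    ordered = sorted(records, key=lambda r: r["score"])
    n = len(ordered)
    loss_ratios = []
    for b in range(n_bins):
        group = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        premium = sum(r["premium"] for r in group)
        loss = sum(r["loss"] for r in group)
        # An empty group (fewer rows than bins) yields NaN rather than an error.
        loss_ratios.append(loss / premium if premium else float("nan"))
    return loss_ratios


# Synthetic data: 100 policies whose loss grows with score (purely illustrative).
records = [{"score": i, "premium": 100.0, "loss": 40.0 + i * 0.5} for i in range(100)]
lrs = decile_lift(records)
lift = lrs[-1] - lrs[0]  # worst-decile LR minus best-decile LR
```

A model with no segmentation power would give roughly the same loss ratio in every decile, so the difference would be close to zero.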


Out-of-Sample Testing


Randomly divide data into 3 pieces


Training data, Test data, Validation data


Use Training data to fit models


Score the Test data to create a lift curve


Perform the train/test steps iteratively until you have a model you’re happy with


During this iterative phase, validation data is set aside in a “lock box”


Once model has been finalized, score the Validation data and produce a lift curve


Unbiased estimate of future performance
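
A random three-way split of this kind might look as follows in Python. The 60/20/20 proportions and the fixed seed are assumptions for illustration; the slides do not specify how large each piece should be.

```python
import random


def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=2004):
    """Randomly partition rows into training, test and validation lists.

    The default fractions are an assumed 60/20/20 split, not a recommendation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(fractions[0] * len(shuffled))
    n_test = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # the "lock box": untouched until the end
    return train, test, validation


train, test, validation = three_way_split(range(1000))
```

Fixing the seed keeps the validation rows the same across modeling iterations, which is what allows them to stay locked away until the model is final.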

Data Mining: Applications


The classic: Profitability Scoring Model


Underwriting/Pricing applications


Credit models


Retention models


Elasticity models


Cross-sell models


Lifetime Value models


Agent/agency monitoring


Target marketing


Fraud detection


Customer segmentation


no target variable (“unsupervised learning”)

Skills needed


Statistical


Beyond college/actuarial exams… fast-moving field


Actuarial


The subject-matter expertise


Programming!


Need scalable software, computing environment


IT - Systems Administration


Data extraction, data load, model implementation


Project Management


Absolutely critical because of the scope & multidisciplinary nature of data mining projects

Data Sources


Company’s internal data


Policy-level records


Loss & premium transactions


Billing


VIN……..


Externally purchased data


Credit


CLUE


MVR


Census


….


The Data Mining Process

Raw Data


Research/Evaluate possible data sources


Availability


Hit rate


Implementability


Cost-effectiveness


Extract/purchase data


Check data for quality (QA)


At this stage, data is still in a “raw” form


Often start with voluminous transactional data


Much of the data mining process is “messy”

Variable Creation


Create predictive and target variables


Need good programming skills


Need domain and business expertise


Steepest part of the learning curve


Discuss specifics of variable creation with company experts


Underwriters, Actuaries, Marketers…


Opportunity to quantify tribal wisdom


Variable Transformation


Univariate analysis of predictive variables


Exploratory Data Analysis (EDA)


Data Visualization


Use EDA to cap / transform predictive variables


Extreme values


Missing values


…etc
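
As one possible illustration of capping and missing-value handling, the Python sketch below fills missing values with the median and then caps the variable at chosen percentiles. Both choices (median imputation, 1st/99th percentile caps) are assumptions made for the example; the slides do not prescribe a particular treatment.

```python
import statistics


def cap_and_impute(values, lower_pct=1, upper_pct=99):
    """Replace missing values (None) with the median, then cap at percentiles.

    Median imputation and the 1st/99th percentile caps are illustrative
    assumptions, not a method taken from the presentation.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(observed, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    filled = [median if v is None else v for v in values]
    return [min(max(v, lo), hi) for v in filled]


# Toy variable with one missing value and one extreme value.
capped = cap_and_impute([1, 2, 3, None, 4, 5, 1000])
```

In practice the caps would be chosen after looking at the distribution during EDA, and the same caps would then be applied when the model is implemented.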


Multivariate Analysis


Examine correlations among the variables


Weed out redundant, weak, poorly distributed variables


Model design


Build candidate models


Regression/GLM


Decision Trees/MARS


Neural Networks


Select final model
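
The correlation-screening step at the top of this list can be sketched in plain Python. The variable names, the toy data and the 0.9 threshold are all hypothetical; the sketch flags pairs of candidate variables that carry nearly the same information so that one of each pair can be reviewed for removal.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def redundant_pairs(columns, threshold=0.9):
    """Return pairs of variable names whose absolute correlation exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(columns), 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical candidate variables (toy values).
columns = {
    "vehicle_age": [1, 3, 5, 7, 9],
    "vehicle_value": [30, 25, 19, 16, 10],  # nearly a mirror image of vehicle_age
    "prior_claims": [0, 2, 0, 1, 0],
}
flagged = redundant_pairs(columns)
```

Which member of a flagged pair to keep is a judgment call that usually depends on data quality, cost and how easy the variable is to explain.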

Model Analysis & Implementation


Perform model analytics


Necessary for client to gain comfort with the model


Calibrate Models


Create user-friendly “scale”


Client dictates


Implement models


Programming skills again are critical


Monitor performance


Distribution of scores/variables, usage of the models, …etc


Plan model maintenance schedule


Model Design

Where Data Mining Needs Actuarial Science

Model Design Issues


Which target variable to use?


Frequency & severity


Loss Ratio, other profitability measures


Binary targets: defection, cross-sell


…etc


How to prepare the target variable?


Period - 1-year or Multi-year?


Losses evaluated @?


Cap large losses?


Cat losses?


How / whether to re-rate, adjust premium?


What counts as a “retaining” policy?


…etc
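
To show how one of these choices feeds into the target variable, here is a small Python sketch of a loss ratio with large losses capped per claim. The cap amount, the function name and the example figures are hypothetical; whether and where to cap is exactly the kind of design question listed above.

```python
def capped_loss_ratio(claims, earned_premium, cap=100_000.0):
    """Loss ratio with each individual claim limited to `cap`.

    The default cap of 100,000 is a hypothetical value for illustration.
    """
    if earned_premium <= 0:
        raise ValueError("earned premium must be positive")
    capped_losses = sum(min(claim, cap) for claim in claims)
    return capped_losses / earned_premium


# One small claim and one large claim against 120,000 of earned premium.
lr = capped_loss_ratio([2_000.0, 250_000.0], earned_premium=120_000.0)
```

Capping limits the influence of a few very large claims on the fitted model, at the price of understating the true loss ratio for the policies that produced them.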



Model Design Issues


Which data points to include/exclude


Certain classes of business?


Certain states?


…etc


Which variables to consider?


Credit, or non-credit only?


Include rating variables in the model?


Exclude certain variables for regulatory reasons?


…etc


What is the “level” of the model?


Policy-term level, HH-level, Risk-level, …etc


Or should data be summarized into “cells” à la minimum bias?





Model Design Issues


How should model be evaluated?


Lift curves, Gains chart, ROC curve?


How to measure ROI?


How to split data into train/test/validation? Or cross-validation?


Is there enough data for lift curve to be “credible”?


Are your “incredible” results credible?


…etc
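
For the cross-validation alternative mentioned above, a minimal k-fold sketch in Python could look like this. The function name, the choice of k = 5 and the seed are assumptions for illustration; the point is only that every row is held out exactly once.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(23, k=5))
```

Cross-validation uses the data more efficiently than a single split, which can matter when there are too few policies for a lift curve on one test set to be credible.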

Not an exhaustive list: every project raises different actuarial issues!

Reference

My favorite textbook:

The Elements of Statistical Learning
-- Trevor Hastie, Robert Tibshirani, Jerome Friedman