© Deloitte Consulting, 2004
Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia
March 11-12, 2004
Themes
What is Data Mining?
How does it relate to statistics?
Insurance applications
Data sources
The Data Mining Process
Model Design
Modeling Techniques
Louise Francis’ Presentation
Themes
How does data mining need actuarial science?
Variable creation
Model design
Model evaluation
How does actuarial science need data mining?
Advances in computing, modeling techniques
Ideas from other fields can be applied to insurance problems
Themes
“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”
- Ian Hacking
Data mining gives us new ways of approaching the age-old problems of risk selection and pricing…
…and other problems not traditionally considered ‘actuarial’.
What is Data Mining?
What is Data Mining?
My definition:
“Statistics for the Computer Age”
Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of “statistics”
Not a radical break with traditional statistics
Complements, builds on traditional statistics
Statistics enriched with brute-force capabilities of modern computing
Opens the door to new techniques
Therefore Data Mining tends to be associated with industrial-sized data sets
Buzz-words
Data Mining
Knowledge Discovery
Machine Learning
Statistical Learning
Predictive Modeling
Supervised Learning
Unsupervised Learning
…etc
What is Data Mining?
Supervised learning: predict the value of a target variable based on several predictive variables
“Predictive Modeling”
Credit / non-credit scoring engines
Retention, cross-sell models
Unsupervised learning: describe associations and patterns along many dimensions without any target information
Customer segmentation
Data Clustering
Market basket analysis (“diapers and beer”)
So Why Should Actuaries Do This Stuff?
Any application of statistics requires subject-matter expertise
Psychometricians
Econometricians
Bioinformaticians
Marketing scientists
…are all applied statisticians with a particular subject-matter expertise & area of specialty
Add actuarial modelers to this list!
“Insurometricians”!?
Actuarial knowledge is critical to the success of insurance data mining projects
Three Concepts
Scoring engines
A “predictive model” by any other name…
Lift curves
How much worse than average are the policies with the worst scores?
Out-of-sample tests
How well will the model work in the real world?
Unbiased estimate of predictive power
Classic Application: Scoring Engines
Scoring engine: formula that classifies or separates policies (or risks, accounts, agents…) into
profitable vs. unprofitable
retaining vs. non-retaining…
(Non-)linear equation f( ) of several predictive variables
Produces continuous range of scores
score = f(X_1, X_2, …, X_N)
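A minimal sketch of a scoring engine in its simplest (linear) form may help make the formula concrete. The variable names and weights below are invented purely for illustration; in a real project they would come from a fitted model.

```python
# Hypothetical linear scoring engine: score = intercept + sum(w_i * X_i).
# The weights and variable names are illustrative assumptions, not fitted values.

def score(policy, weights, intercept=0.0):
    """Return the score of one policy as a weighted sum of its variables."""
    return intercept + sum(weights[name] * policy[name] for name in weights)

weights = {"prior_claims": 2.0, "vehicle_age": 0.5, "years_insured": -1.0}

policies = [
    {"id": "A", "prior_claims": 0, "vehicle_age": 2, "years_insured": 5},
    {"id": "B", "prior_claims": 3, "vehicle_age": 10, "years_insured": 1},
]

# Higher score = predicted to be less profitable under this made-up convention.
scores = {p["id"]: score(p, weights) for p in policies}
```

Every policy receives a number on a continuous scale, so the book of business can be sorted from best to worst predicted risk.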
What “Powers” a Scoring Engine?
Scoring Engine: score = f(X_1, X_2, …, X_N)
The X_1, X_2, …, X_N are at least as important as the f( )!
Again, this is why actuarial expertise is necessary
Think of the predictive power of credit variables
A large part of the modeling process consists of variable creation and selection
Usually possible to generate hundreds of variables
Steepest part of the learning curve
Model Evaluation: Lift Curves
Sort data by score
Break the dataset into 10 equal pieces
Best “decile”: lowest score, lowest loss ratio (LR)
Worst “decile”: highest score, highest LR
Difference: “Lift”
Lift = segmentation power
Lift translates into ROI of the modeling project
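The decile procedure above can be sketched in a few lines. The data here is synthetic and invented for illustration (expected loss is made to rise with score, plus noise); the field names `score`, `premium` and `loss` are assumptions.

```python
import random

def decile_loss_ratios(records, n_bins=10):
    """Sort by score, cut into n_bins equal pieces, return each piece's loss ratio."""
    ordered = sorted(records, key=lambda r: r["score"])
    n = len(ordered)
    ratios = []
    for b in range(n_bins):
        piece = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        premium = sum(r["premium"] for r in piece)
        loss = sum(r["loss"] for r in piece)
        ratios.append(loss / premium)
    return ratios

# Synthetic book of 100 policies (illustrative only, not real data).
rng = random.Random(0)
records = []
for i in range(100):
    s = i / 100
    records.append({
        "score": s,
        "premium": 1000.0,
        "loss": 1000.0 * (0.4 + 0.4 * s) + rng.uniform(-100.0, 100.0),
    })

ratios = decile_loss_ratios(records)
lift = ratios[-1] - ratios[0]  # worst-decile LR minus best-decile LR
```

Plotting `ratios` against decile number gives the lift curve; a flat curve would mean the score has no segmentation power.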
Out-of-Sample Testing
Randomly divide data into 3 pieces
Training data, Test data, Validation data
Use Training data to fit models
Score the Test data to create a lift curve
Perform the train/test steps iteratively until you have a model you’re happy with
During this iterative phase, validation data is set aside in a “lock box”
Once model has been finalized, score the Validation data and produce a lift curve
Unbiased estimate of future performance
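One way the random three-way split might be done is sketched below. The 60/20/20 proportions and the fixed seed are assumptions chosen for illustration; the slides do not prescribe particular proportions.

```python
import random

def three_way_split(rows, train=0.6, test=0.2, seed=2004):
    """Shuffle once, then cut into training, test and validation pieces.

    Whatever is left after the training and test shares goes to validation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )

policy_ids = list(range(1000))
train_ids, test_ids, validation_ids = three_way_split(policy_ids)
# validation_ids stays in the "lock box" until the model is finalized.
```

Splitting on a policy identifier rather than on individual rows is a design choice that keeps all records of one policy in the same piece.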
Data Mining: Applications
The classic: Profitability Scoring Model
Underwriting/Pricing applications
Credit models
Retention models
Elasticity models
Cross-sell models
Lifetime Value models
Agent/agency monitoring
Target marketing
Fraud detection
Customer segmentation
no target variable (“unsupervised learning”)
Skills needed
Statistical
Beyond college/actuarial exams… fast-moving field
Actuarial
The subject-matter expertise
Programming!
Need scalable software, computing environment
IT / Systems Administration
Data extraction, data load, model implementation
Project Management
Absolutely critical because of the scope & multidisciplinary nature of data mining projects
Data Sources
Company’s internal data
Policy-level records
Loss & premium transactions
Billing
VIN…
Externally purchased data
Credit
CLUE
MVR
Census
…
The Data Mining Process
Raw Data
Research/Evaluate possible data sources
Availability
Hit rate
Implementability
Cost-effectiveness
Extract/purchase data
Check data for quality (QA)
At this stage, data is still in a “raw” form
Often start with voluminous transactional data
Much of the data mining process is “messy”
Variable Creation
Create predictive and target variables
Need good programming skills
Need domain and business expertise
Steepest part of the learning curve
Discuss specifics of variable creation with company experts
Underwriters, Actuaries, Marketers…
Opportunity to quantify tribal wisdom
Variable Transformation
Univariate analysis of predictive variables
Exploratory Data Analysis (EDA)
Data Visualization
Use EDA to cap / transform predictive variables
Extreme values
Missing values
…etc
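As a rough illustration of capping extreme values and handling missing values, the sketch below fills missing entries with the median of the observed values and then clamps everything to a chosen range. Median imputation, the cap of 50 and the sample numbers are all assumptions for illustration; the right treatment depends on what the EDA shows.

```python
import statistics

def cap_and_fill(values, lower, upper):
    """Replace None with the median of observed values, then clamp to [lower, upper].

    Returns (cleaned_values, missing_flags); the flags can serve as an extra
    predictive variable, since "missing" is sometimes informative in itself.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    flags = [v is None for v in values]
    cleaned = [min(max(median if v is None else v, lower), upper) for v in values]
    return cleaned, flags

raw = [12, 15, None, 14, 400, 13, None, 11]  # 400 is an extreme value
cleaned, was_missing = cap_and_fill(raw, lower=0, upper=50)
```

Here the two missing entries become 13.5 (the median of the six observed values) and the outlier 400 is capped at 50.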
Multivariate Analysis
Examine correlations among the variables
Weed out redundant, weak, poorly distributed variables
Model design
Build candidate models
Regression/GLM
Decision Trees/MARS
Neural Networks
Select final model
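The correlation screen mentioned at the start of this step can be sketched as follows: compute the Pearson correlation for every pair of candidate variables and flag pairs above a threshold as redundant. The 0.9 threshold, the variable names and the data are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation meets the threshold."""
    names = sorted(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

columns = {
    "vehicle_age": [1, 3, 5, 7, 9, 11],
    "model_year": [2003, 2001, 1999, 1997, 1995, 1993],  # mirrors vehicle_age
    "prior_claims": [0, 2, 1, 0, 3, 1],
}
flagged = redundant_pairs(columns)
```

Only the vehicle age / model year pair is flagged, so one of the two could be dropped before building candidate models.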
Model Analysis & Implementation
Perform model analytics
Necessary for client to gain comfort with the model
Calibrate Models
Create user-friendly “scale” (client dictates)
Implement models
Programming skills again are critical
Monitor performance
Distribution of scores/variables, usage of the models, etc.
Plan model maintenance schedule
Model Design
Where Data Mining Needs Actuarial Science
Model Design Issues
Which target variable to use?
Frequency & severity
Loss Ratio, other profitability measures
Binary targets: defection, cross-sell
…etc
How to prepare the target variable?
Period: 1-year or multi-year?
Losses evaluated as of what date?
Cap large losses?
Catastrophe (cat) losses?
How / whether to re-rate, adjust premium?
What counts as a “retaining” policy?
What counts as a “retaining” policy?
…etc
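To see why the large-loss question matters, here is a small sketch of a loss ratio target computed with and without a per-claim cap. The claim amounts, premium and the 100,000 cap are made-up numbers for illustration only.

```python
def capped_loss_ratio(losses, premium, cap):
    """Loss ratio with each individual claim limited to `cap`."""
    return sum(min(loss, cap) for loss in losses) / premium

# One hypothetical policy with three claims, one of them very large.
claims = [1_200.0, 250_000.0, 3_500.0]

uncapped = capped_loss_ratio(claims, premium=10_000.0, cap=float("inf"))
capped = capped_loss_ratio(claims, premium=10_000.0, cap=100_000.0)
```

A single large claim moves the target from about 10.5 to about 25.5, so an uncapped target can let a handful of policies dominate the fit.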
Model Design Issues
Which data points to include/exclude
Certain classes of business?
Certain states?
…etc
Which variables to consider?
Credit, or non-credit only?
Include rating variables in the model?
Exclude certain variables for regulatory reasons?
…etc
What is the “level” of the model?
Policy-term level, household (HH) level, risk level, etc.
Or should data be summarized into “cells” à la minimum bias?
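Summarizing policy-level records into cells might look like the sketch below. It shows only the aggregation step, not the minimum bias fitting procedure itself, and the classification variables and amounts are invented for illustration.

```python
from collections import defaultdict

def summarize_into_cells(records, keys):
    """Aggregate policy-level records into cells defined by the given keys."""
    totals = defaultdict(lambda: {"policies": 0, "premium": 0.0, "loss": 0.0})
    for r in records:
        cell = totals[tuple(r[k] for k in keys)]
        cell["policies"] += 1
        cell["premium"] += r["premium"]
        cell["loss"] += r["loss"]
    # Attach a loss ratio to each cell once the totals are complete.
    return {
        key: dict(t, loss_ratio=t["loss"] / t["premium"])
        for key, t in totals.items()
    }

records = [
    {"territory": "urban", "age_group": "young", "premium": 1000.0, "loss": 900.0},
    {"territory": "urban", "age_group": "young", "premium": 1000.0, "loss": 300.0},
    {"territory": "rural", "age_group": "young", "premium": 800.0, "loss": 200.0},
]
cells = summarize_into_cells(records, keys=("territory", "age_group"))
```

Modeling at the cell level trades detail for stability: each cell pools several policies, but any variable not used as a key is averaged away.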
Model Design Issues
How should model be evaluated?
Lift curves, Gains chart, ROC curve?
How to measure ROI?
How to split data into train/test/validation? Or cross-validation?
Is there enough data for lift curve to be “credible”?
Are your “incredible” results credible?
…etc
Not an exhaustive list: every project raises different actuarial issues!
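For the cross-validation alternative raised above, a minimal k-fold sketch is given here. It assumes the records have already been shuffled, and the choice of 5 folds is an illustrative assumption.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, holdout_idx) pairs; each record is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, holdout

# 10 records, 5 folds: each model is fit on 8 records and scored on the other 2.
splits = list(k_fold_indices(10, 5))
```

Cross-validation reuses every record for both fitting and testing, which can help when there is too little data for a credible three-way split.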
Reference
My favorite textbook:
The Elements of Statistical Learning, by Jerome Friedman, Trevor Hastie, Robert Tibshirani