The Data Mining Process


Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia, March 11-12, 2004
© Deloitte Consulting, 2004
Themes

- What is Data Mining?
- How does it relate to statistics?
- Insurance applications
- Data sources
- The Data Mining Process
- Model Design
- Modeling Techniques (Louise Francis' presentation)
Themes

- How does data mining need actuarial science?
  - Variable creation
  - Model design
  - Model evaluation
- How does actuarial science need data mining?
  - Advances in computing and modeling techniques
  - Ideas from other fields can be applied to insurance problems
Themes

"The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions."
-- Ian Hacking

Data mining gives us new ways of approaching the age-old problems of risk selection and pricing...
...and other problems not traditionally considered 'actuarial'.
What is Data Mining?

- My definition: "Statistics for the Computer Age"
- Not a radical break with traditional statistics
  - Complements and builds on traditional statistics
- Statistics enriched with the brute-force capabilities of modern computing
  - Opens the door to new techniques
  - This is why data mining tends to be associated with industrial-sized data sets
- Many new techniques have come from Computer Science, Marketing, Biology... but all can (should!) be brought under the framework of "statistics"
Buzz-words

- Data Mining
- Knowledge Discovery
- Machine Learning
- Statistical Learning
- Predictive Modeling
- Supervised Learning
- Unsupervised Learning
- ...etc.
What is Data Mining?

- Supervised learning: predict the value of a target variable based on several predictive variables
  - "Predictive Modeling"
  - Credit / non-credit scoring engines
  - Retention, cross-sell models
- Unsupervised learning: describe associations and patterns along many dimensions without any target information
  - Customer segmentation
  - Data clustering
  - Market basket analysis ("diapers and beer")
So Why Should Actuaries Do This Stuff?

- Any application of statistics requires subject-matter expertise
- Psychometricians, econometricians, bioinformaticians, and marketing scientists are all applied statisticians with a particular subject-matter expertise and area of specialty
- Add actuarial modelers to this list!
  - "Insurometricians"!?
- Actuarial knowledge is critical to the success of insurance data mining projects
Three Concepts

- Scoring engines
  - A "predictive model" by any other name...
- Lift curves
  - How much worse than average are the policies with the worst scores?
- Out-of-sample tests
  - How well will the model work in the real world?
  - Unbiased estimate of predictive power
Classic Application: Scoring Engines

- Scoring engine: a formula that classifies or separates policies (or risks, accounts, agents...) into
  - profitable vs. unprofitable
  - retaining vs. non-retaining...
- A (non-)linear equation f( ) of several predictive variables
- Produces a continuous range of scores (a toy sketch follows):
  score = f(X1, X2, ..., XN)
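To make the idea concrete, here is a purely illustrative toy: a linear f( ) over three made-up predictors with made-up coefficients. A real engine's form and coefficients come out of the fitting process described later in the talk.

```python
# Toy scoring engine: score = f(X1, ..., XN).
# Variable names and coefficients are hypothetical, for illustration only.

def score(policy: dict) -> float:
    """Linear score over a few illustrative predictors."""
    coefficients = {
        "credit_score": -0.002,  # better credit -> lower (better) score
        "prior_claims": 0.15,    # more prior claims -> higher score
        "vehicle_age": 0.01,
    }
    intercept = 2.0
    return intercept + sum(c * policy[x] for x, c in coefficients.items())

# Score one policy record
print(score({"credit_score": 700, "prior_claims": 1, "vehicle_age": 5}))
```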
What "Powers" a Scoring Engine?

Scoring engine:
score = f(X1, X2, ..., XN)

- The X1, X2, ..., XN are at least as important as the f( )!
  - Again, this is why actuarial expertise is necessary
  - Think of the predictive power of credit variables
- A large part of the modeling process consists of variable creation and selection
  - It is usually possible to generate hundreds of variables
  - This is the steepest part of the learning curve
Model Evaluation: Lift Curves

- Sort data by score
- Break the dataset into 10 equal pieces ("deciles")
  - Best decile: lowest scores, lowest loss ratio
  - Worst decile: highest scores, highest loss ratio
- The difference between them is the "lift" (computed in the sketch below)
  - Lift = segmentation power
  - Lift translates into the ROI of the modeling project
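A minimal sketch of the decile-lift calculation, on simulated data; the simulated premiums and losses exist only so the code runs end to end:

```python
# Decile lift table: sort by score, cut into 10 equal pieces, compare loss ratios.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "score": rng.uniform(0, 1, n),          # model score (higher = worse risk)
    "premium": rng.uniform(500, 2000, n),
})
# Fabricated losses that drift upward with score, so some lift is visible
df["loss"] = df["premium"] * (0.4 + 0.6 * df["score"]) * rng.exponential(1.0, n)

df["decile"] = pd.qcut(df["score"], 10, labels=False) + 1  # 1 = best, 10 = worst

grp = df.groupby("decile")[["loss", "premium"]].sum()
grp["loss_ratio"] = grp["loss"] / grp["premium"]
print(grp["loss_ratio"])
print("lift (worst decile LR / best decile LR):",
      grp["loss_ratio"].iloc[-1] / grp["loss_ratio"].iloc[0])
```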
Out-of-Sample Testing

- Randomly divide the data into 3 pieces
  - Training data, Test data, Validation data (see the sketch below)
- Use the Training data to fit models
- Score the Test data to create a lift curve
  - Perform the train/test steps iteratively until you have a model you're happy with
  - During this iterative phase, the Validation data is set aside in a "lock box"
- Once the model has been finalized, score the Validation data and produce a lift curve
  - This gives an unbiased estimate of future performance
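A minimal sketch of the three-way split; the 60/20/20 proportions are my assumption, since the talk only says to divide the data randomly into three pieces:

```python
# Random three-way split into train / test / validation ("lock box") sets.
import pandas as pd

def train_test_validate(df: pd.DataFrame, seed: int = 42):
    """Shuffle rows, then split 60/20/20 into train, test, validation."""
    shuffled = df.sample(frac=1.0, random_state=seed)  # random shuffle
    n = len(shuffled)
    train = shuffled.iloc[: int(0.6 * n)]
    test = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
    validate = shuffled.iloc[int(0.8 * n):]  # untouched until the model is final
    return train, test, validate
```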
Data Mining: Applications

- The classic: Profitability Scoring Model
  - Underwriting/pricing applications
- Credit models
- Retention models
- Elasticity models
- Cross-sell models
- Lifetime Value models
- Agent/agency monitoring
- Target marketing
- Fraud detection
- Customer segmentation
  - No target variable ("unsupervised learning")
Skills needed

- Statistical
  - Beyond college/actuarial exams... a fast-moving field
- Actuarial
  - The subject-matter expertise
- IT / Systems Administration
  - Need scalable software and a scalable computing environment
- Programming!
  - Data extraction, data load, model implementation
- Project Management
  - Absolutely critical because of the scope and multidisciplinary nature of data mining projects
Data Sources

- Company's internal data
  - Policy-level records
  - Loss & premium transactions
  - Billing
  - VIN...
- Externally purchased data
  - Credit
  - CLUE
  - MVR
  - Census
  - ...
The Data Mining Process
Raw Data

- Research/evaluate possible data sources
  - Availability
  - Hit rate
  - Implementability
  - Cost-effectiveness
- Extract/purchase data
- Check data for quality (QA); a sketch of basic checks follows
- At this stage, the data is still in a "raw" form
  - Often start with voluminous transactional data
  - Much of the data mining process is "messy"
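As an illustration of the QA step, a few basic checks one might run on a raw extract; the column names (policy_id, premium) are hypothetical:

```python
# Simple data-quality checks on a raw extract (hypothetical field names).
import pandas as pd

def qa_report(df: pd.DataFrame) -> None:
    print("rows:", len(df))
    print("duplicate policy keys:", df["policy_id"].duplicated().sum())
    print("missing values per column:")
    print(df.isna().sum())
    # Range sanity check: premiums should be strictly positive
    print("non-positive premiums:", (df["premium"] <= 0).sum())
```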
Variable Creation

- Create predictive and target variables (an example is sketched below)
  - Needs good programming skills
  - Needs domain and business expertise
- Steepest part of the learning curve
- Discuss specifics of variable creation with company experts
  - Underwriters, actuaries, marketers...
  - An opportunity to quantify tribal wisdom
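A hypothetical example of variable creation: rolling transactional claim records up to policy level to build a prior-claim-frequency predictor. All names and values are invented for the sketch:

```python
# Variable creation: derive policy-level predictors from claim transactions.
import pandas as pd

claims = pd.DataFrame({
    "policy_id": [1, 1, 2, 3, 3, 3],
    "claim_amount": [500, 1200, 300, 700, 50, 2500],
})
policies = pd.DataFrame({
    "policy_id": [1, 2, 3, 4],
    "exposure_years": [3, 1, 2, 2],
})

# Roll up transactions to one row per policy
features = claims.groupby("policy_id").agg(
    prior_claim_count=("claim_amount", "size"),
    prior_claim_total=("claim_amount", "sum"),
)
policies = policies.merge(features, on="policy_id", how="left").fillna(0)
policies["claim_freq"] = policies["prior_claim_count"] / policies["exposure_years"]
print(policies)
```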
Variable Transformation

- Univariate analysis of the predictive variables
- Exploratory Data Analysis (EDA)
- Data visualization
- Use EDA to cap/transform the predictive variables (see the sketch below)
  - Extreme values
  - Missing values
  - ...etc.
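A sketch of two common EDA-driven transformations: capping extreme values at a percentile, and filling missing values while keeping a missingness flag. The 99th-percentile cap and median fill are illustrative choices, not prescriptions:

```python
# Cap extreme values and handle missing values for one predictor.
import pandas as pd

def cap_and_fill(s: pd.Series, cap_quantile: float = 0.99) -> pd.DataFrame:
    capped = s.clip(upper=s.quantile(cap_quantile))  # tame extreme values
    missing_flag = s.isna().astype(int)              # keep missingness as a signal
    filled = capped.fillna(capped.median())
    return pd.DataFrame({s.name: filled, f"{s.name}_missing": missing_flag})

# Example on a named series with an outlier and a missing value
print(cap_and_fill(pd.Series([1, 2, None, 500], name="prior_claims")))
```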
Multivariate Analysis

- Examine correlations among the variables
- Weed out redundant, weak, or poorly distributed variables (one such screen is sketched below)
- Model design
- Build candidate models
  - Regression/GLM
  - Decision Trees/MARS
  - Neural Networks
- Select the final model
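One simple way to weed out redundant variables is a pairwise-correlation screen; the 0.9 threshold below is an assumption for the sketch, not a recommendation:

```python
# Drop one of each pair of highly correlated predictors.
import numpy as np
import pandas as pd

def drop_redundant(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```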
Model Analysis & Implementation

- Perform model analytics
  - Necessary for the client to gain comfort with the model
- Calibrate the models
  - Create a user-friendly "scale"; the client dictates it (see the sketch below)
- Implement the models
  - Programming skills are again critical
- Monitor performance
  - Distribution of scores/variables, usage of the models, ...etc.
- Plan a model maintenance schedule
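One hypothetical way to put raw scores on a user-friendly scale is a percentile-rank mapping onto 1-100; the scale itself is whatever the client dictates:

```python
# Map raw model scores onto a 1-100 scale via percentile ranks.
import pandas as pd

def to_scale(raw_scores: pd.Series, low: int = 1, high: int = 100) -> pd.Series:
    pct = raw_scores.rank(pct=True)  # percentile rank in (0, 1]
    return (low + pct * (high - low)).round().astype(int)
```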
Model Design: Where Data Mining Needs Actuarial Science
Model Design Issues

- Which target variable to use?
  - Frequency & severity
  - Loss ratio, other profitability measures
  - Binary targets: defection, cross-sell
  - ...etc.
- How to prepare the target variable? (a loss-ratio example is sketched below)
  - Period: 1-year or multi-year?
  - Losses evaluated as of when?
  - Cap large losses?
  - Cat losses?
  - How / whether to re-rate or adjust premium?
  - What counts as a "retaining" policy?
  - ...etc.
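As one illustration of these design choices, a loss-ratio target with large losses capped; the $250,000 cap and the column names are invented for the sketch:

```python
# Prepare a loss-ratio target with large losses capped.
import pandas as pd

def loss_ratio_target(df: pd.DataFrame, cap: float = 250_000) -> pd.Series:
    capped_loss = df["incurred_loss"].clip(upper=cap)  # cap large losses
    return capped_loss / df["earned_premium"]
```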
Model Design Issues

- Which data points to include/exclude?
  - Certain classes of business?
  - Certain states?
  - ...etc.
- Which variables to consider?
  - Credit, or non-credit only?
  - Include rating variables in the model?
  - Exclude certain variables for regulatory reasons?
  - ...etc.
- What is the "level" of the model?
  - Policy-term level, household level, risk level, ...etc.
  - Or should the data be summarized into "cells" à la minimum bias?
Model Design Issues

- How should the model be evaluated?
  - Lift curves, gains chart, ROC curve?
  - How to measure ROI?
  - How to split the data into train/test/validation? Or cross-validation?
  - Is there enough data for the lift curve to be "credible"?
    - Are your "incredible" results credible?
  - ...etc.
- This is not an exhaustive list: every project raises different actuarial issues!
Reference

My favorite textbook:
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer)