The Data Mining Process
Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia
March 11-12, 2004
© Deloitte Consulting, 2004
Themes
What is Data Mining?
How does it relate to statistics?
Insurance applications
Data sources
The Data Mining Process
Model Design
Modeling Techniques
Louise Francis’ Presentation
Themes
How does data mining need actuarial science?
Variable creation
Model design
Model evaluation
How does actuarial science need data mining?
Advances in computing, modeling techniques
Ideas from other fields can be applied to insurance problems
Themes
“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”
-- Ian Hacking
Data mining gives us new ways of approaching the age-old problems of risk selection and pricing….
….and other problems not traditionally considered ‘actuarial’.
What is Data Mining?
What is Data Mining?
My definition: “Statistics for the Computer Age”
Not a radical break with traditional statistics
Complements, builds on traditional statistics
Statistics enriched with the brute-force capabilities of modern computing
Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of “statistics”
Opens the door to new techniques
Therefore Data Mining tends to be associated with industrial-sized data sets
Buzz-words
Data Mining
Knowledge Discovery
Machine Learning
Statistical Learning
Predictive Modeling
Supervised Learning
Unsupervised Learning
….etc
What is Data Mining?
Supervised learning: predict the value of a target variable based on several predictive variables
“Predictive Modeling”
Credit / non-credit scoring engines
Retention, cross-sell models
Unsupervised learning: describe associations and patterns along many dimensions without any target information
Customer segmentation
Data Clustering
Market basket analysis (“diapers and beer”)
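A minimal sketch of the two modes using scikit-learn; the synthetic data, the logistic regression, and the five-segment k-means below are illustrative assumptions, not the techniques used in the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                          # placeholder predictive variables
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # placeholder binary target (e.g. retain / defect)

# Supervised learning: predict a known target from the predictive variables
clf = LogisticRegression().fit(X, y)

# Unsupervised learning: find structure (e.g. customer segments) with no target at all
segments = KMeans(n_clusters=5, n_init=10).fit_predict(X)
```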
So Why Should Actuaries Do This Stuff?
Any application of statistics requires subject-matter expertise
Psychometricians
Econometricians
Bioinformaticians
Marketing scientists
…are all applied statisticians with a particular subject-matter expertise & area of specialty
Add actuarial modelers to this list!
“Insurometricians”!?
Actuarial knowledge is critical to the success of insurance data mining projects
Three Concepts
Scoring engines
A “predictive model” by any other name…
Lift curves
How much worse than average are the policies with the worst scores?
Out-of-sample tests
How well will the model work in the real world?
Unbiased estimate of predictive power
Classic Application: Scoring Engines
Scoring engine: formula that classifies or separates policies (or risks, accounts, agents…) into
profitable vs. unprofitable
Retaining vs. non-retaining…
(Non-)Linear equation f( ) of several predictive variables
Produces continuous range of scores
score = f(X1, X2, …, XN)
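A minimal sketch of such a scoring engine; the three predictive variables, the weights, and the logistic form are illustrative assumptions, not the model described here:

```python
import numpy as np

# Toy scoring engine: score = f(X1, X2, X3)
# Weights, intercept, and logistic link are illustrative assumptions
def score(x1, x2, x3, weights=(0.4, -1.2, 0.7), intercept=-0.5):
    linear = intercept + weights[0] * x1 + weights[1] * x2 + weights[2] * x3
    return 1.0 / (1.0 + np.exp(-linear))  # continuous score in (0, 1)

print(score(x1=2.0, x2=0.5, x3=1.0))
```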
What “Powers” a Scoring Engine?
Scoring Engine:
score = f(X1, X2, …, XN)
The X1, X2,…, XN are at least as important as the f( )!
Again why actuarial expertise is necessary
Think of the predictive power of credit variables
A large part of the modeling process consists of variable creation and selection
Usually possible to generate 100’s of variables
Steepest part of the learning curve
Model Evaluation: Lift Curves
Sort data by score
Break the dataset into 10 equal pieces
Best “decile”: lowest score → lowest LR
Worst “decile”: highest score → highest LR
Difference: “Lift”
Lift = segmentation power
Lift translates into ROI of the modeling project
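A minimal sketch of the decile calculation, assuming a pandas DataFrame with hypothetical 'score', 'loss', and 'premium' columns:

```python
import pandas as pd

def decile_loss_ratios(df: pd.DataFrame) -> pd.Series:
    """Loss ratio by score decile; decile 1 = lowest scores."""
    df = df.sort_values("score").copy()
    df["decile"] = pd.qcut(df["score"], 10, labels=False, duplicates="drop") + 1
    by_decile = df.groupby("decile").agg(loss=("loss", "sum"),
                                         premium=("premium", "sum"))
    # "Lift": compare the worst (highest-score) decile's LR to the best decile's LR
    return by_decile["loss"] / by_decile["premium"]
```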
Out-of-Sample Testing
Randomly divide data into 3 pieces: Training data, Test data, Validation data
Use Training data to fit models
Score the Test data to create a lift curve
Perform the train/test steps iteratively until you have a model you’re happy with
During this iterative phase, validation data is set aside in a “lock box”
Once the model has been finalized, score the Validation data and produce a lift curve
Unbiased estimate of future performance
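A minimal sketch of the random three-way split; the 60/20/20 proportions are an assumption, not the proportions used in the talk:

```python
import numpy as np
import pandas as pd

def split_three_ways(df: pd.DataFrame, seed: int = 0):
    """Randomly divide data into training, test, and validation pieces."""
    u = np.random.default_rng(seed).uniform(size=len(df))
    train = df[u < 0.6]                 # fit candidate models here
    test = df[(u >= 0.6) & (u < 0.8)]   # score iteratively to compare candidate models
    valid = df[u >= 0.8]                # "lock box": scored once, after the model is final
    return train, test, valid
```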
Data Mining: Applications
The classic: Profitability Scoring Model
Underwriting/Pricing applications
Credit models
Retention models
Elasticity models
Cross-sell models
Lifetime Value models
Agent/agency monitoring
Target marketing
Fraud detection
Customer segmentation
no target variable (“unsupervised learning”)
Skills needed
Statistical
Actuarial
Need scalable software, computing environment
IT - Systems Administration
The subject-matter expertise
Programming!
Beyond college/actuarial exams… fast-moving field
Data extraction, data load, model implementation
Project Management
Absolutely critical because of the scope & multidisciplinary nature of data mining projects
Data Sources
Company’s internal data
Policy-level records
Loss & premium transactions
Billing
VIN……..
Externally purchased data
Credit
CLUE
MVR
Census
….
The Data Mining Process
Raw Data
Research/Evaluate possible data sources
Availability
Hit rate
Implementability
Cost-effectiveness
Extract/purchase data
Check data for quality (QA)
At this stage, data is still in a “raw” form
Often start with voluminous transactional data
Much of the data mining process is “messy”
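A minimal QA sketch over a raw extract, assuming the data has been loaded into a pandas DataFrame (field-level checks only; the checks chosen are illustrative):

```python
import pandas as pd

def quick_qa(raw: pd.DataFrame) -> pd.DataFrame:
    """Field-level quality summary for a raw data extract."""
    return pd.DataFrame({
        "pct_missing": raw.isna().mean(),   # completeness by field
        "n_unique": raw.nunique(),          # spot constant or ID-like fields
        "dtype": raw.dtypes.astype(str),    # unexpected types from the load
    })
```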
Variable Creation
Create predictive and target variables
Need good programming skills
Need domain and business expertise
Steepest part of the learning curve
Discuss specifics of variable creation with company experts: Underwriters, Actuaries, Marketers…
Opportunity to quantify tribal wisdom
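A minimal variable-creation sketch, assuming transactional loss and premium records in a pandas DataFrame with hypothetical 'policy_id', 'claim_id', 'loss', and 'premium' columns:

```python
import pandas as pd

def build_policy_frame(transactions: pd.DataFrame) -> pd.DataFrame:
    """Summarize loss & premium transactions to one row per policy."""
    pol = transactions.groupby("policy_id").agg(
        loss=("loss", "sum"),
        premium=("premium", "sum"),
        n_claims=("claim_id", "nunique"),
    )
    pol["loss_ratio"] = pol["loss"] / pol["premium"]   # a candidate target variable
    return pol.reset_index()
```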
Variable Transformation
Univariate analysis of predictive variables
Exploratory Data Analysis (EDA)
Data Visualization
Use EDA to cap / transform predictive variables
Extreme values
Missing values
…etc
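A minimal capping/imputation sketch; the 99th-percentile cap and the median fill are illustrative choices driven by EDA, not prescriptions:

```python
import pandas as pd

def cap_and_fill(x: pd.Series) -> pd.Series:
    """Cap extreme values found in EDA and fill missing values."""
    cap = x.quantile(0.99)            # cap chosen after inspecting the distribution
    return x.clip(upper=cap).fillna(x.median())
```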
Multivariate Analysis
Examine correlations among the variables
Weed out redundant, weak, poorly distributed variables
Model design
Build candidate models
Regression/GLM
Decision Trees/MARS
Neural Networks
Select final model
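A minimal sketch of the correlation screen that precedes model fitting, assuming the predictive variables sit in a pandas DataFrame; the 0.9 threshold is an assumption:

```python
import pandas as pd

def highly_correlated_pairs(X: pd.DataFrame, threshold: float = 0.9):
    """List predictor pairs whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if corr.loc[a, b] > threshold]
```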
Model Analysis & Implementation
Perform model analytics
Calibrate Models
Create user-friendly “scale” – client dictates
Implement models
Necessary for client to gain comfort with the model
Programming skills again are critical
Monitor performance
Distribution of scores/variables, usage of the models, …etc
Plan model maintenance schedule
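A minimal sketch of putting raw model scores onto a user-friendly scale; the 1-100 percentile-rank scale here is an assumption (the client dictates the actual scale):

```python
import pandas as pd

def to_client_scale(scores: pd.Series) -> pd.Series:
    """Map raw model scores onto a 1-100 percentile-rank scale."""
    return (scores.rank(pct=True) * 99 + 1).round().astype(int)
```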
Model Design
Where Data Mining Needs Actuarial Science
Model Design Issues
Which target variable to use?
Frequency & severity
Loss Ratio, other profitability measures
Binary targets: defection, cross-sell
…etc
How to prepare the target variable?
Period - 1-year or Multi-year?
Losses evaluated as of which date?
Cap large losses?
Cat losses?
How / whether to re-rate, adjust premium?
What counts as a “retaining” policy?
…etc
Model Design Issues
Which data points to include/exclude
Which variables to consider?
Certain classes of business?
Certain states?
…etc
Credit, or non-credit only?
Include rating variables in the model?
Exclude certain variables for regulatory reasons?
…etc
What is the “level” of the model?
Policy-term level, HH-level, Risk-level ..etc
Or should data be summarized into “cells” à la minimum bias?
Model Design Issues
How should model be evaluated?
Lift curves, Gains chart, ROC curve?
How to measure ROI?
How to split data into train/test/validation? Or cross-validation?
Is there enough data for lift curve to be “credible”?
Are your “incredible” results credible?
…etc
Not an exhaustive list – every project raises different actuarial issues!
Reference
My favorite textbook:
The Elements of Statistical Learning
--Jerome Friedman, Trevor Hastie, Robert Tibshirani