Data Mining – Best Practices


Francis Analytics
Actuarial Data Mining Services
Data Mining – Best Practices
CAS 2008 Spring Meeting
Quebec City, Canada
Louise Francis, FCAS, MAAA
[email protected], www.data-mines.com
Topics in Data Mining Best Practices

• Introduction: Data Mining for Management
• Data Quality
• Data augmentation
• Data adjustments
• Method/Software issues
• Post-deployment monitoring
• References & Resources
Introduction

• Research by IBM indicates only 1% of data collected by organizations is used for analysis
• Predictive modeling and data mining are widely embraced by leading businesses
  – A 2002 Strategic Decision Making survey by Hackett Best Practices determined that world-class companies adopted predictive modeling technologies at twice the rate of other companies
  – An important commercial application is customer retention: a 5% increase in retention → a 95% increase in profit
  – It costs 5 to 10 times more to acquire new business than to retain existing business
• Another study, of 24 leading companies, found that they needed to go beyond simple data collection
Successful Implementation of Data Mining

• Data Mining: the process of discovering previously unknown patterns in databases
• Needs insights from many different areas
• A multidisciplinary effort:
  – Quantitative experts
  – IT
  – Business experts
  – Managers
  – Upper management
Becoming a better practitioner
Manage the Human Side of Analytics

Pay greater attention to the interaction of models and humans:
• Data collection
• Communicating business benefits
• Belief, model understanding, model complexity
• ‘Tribal’ knowledge as model attributes
• Behavioral change and transparency
• Disruption in ‘standard’ processes
• Threat of obsolescence (automation)

Don’t over-rely on the technology, and recognize the disruptive role you play.
CRISP-DM
• Cross Industry Standard Process for
Data Mining
• Standardized approach to data mining
• www.crisp-dm.org
Phases of CRISP-DM

• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Data Quality

• Scope of the problem
• How it is addressed
• New educational resources for actuaries
Survey of Actuaries

• Data quality issues have a significant impact on the work of general insurance (P&C) actuaries
  – About a quarter of their time is spent on such issues
  – About a third of projects are adversely affected
  – See “Dirty Data on Both Sides of the Pond” – 2008 CAS Winter Forum
  – Data quality issues consume significantly more time on large predictive modeling projects
Statistical Data Editing

• The process of checking data for errors and correcting them
• Uses subject matter experts
• Uses statistical analysis of data
• May include using methods to “fill in” missing values
• The final result of SDE is clean data as well as a summary of the underlying causes of errors
• See the article in the Encyclopedia of Data Warehousing and Data Mining
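As an illustration of these ideas (not from the presentation), here is a minimal pandas sketch of statistical data editing: rule-based checks of the kind subject matter experts supply, a summary of the errors found, and a simple “fill in” of missing values. All column names and limits are illustrative assumptions.

    import numpy as np
    import pandas as pd

    # Hypothetical claims extract; all column names and limits are illustrative.
    claims = pd.DataFrame({
        "paid_loss": [1200.0, -50.0, 3400.0, np.nan, 980.0],
        "claimant_age": [34, 220, 51, 45, np.nan],
    })

    # Rule-based edits from subject matter experts: flag impossible values.
    errors = pd.DataFrame({
        "negative_paid": claims["paid_loss"] < 0,
        "bad_age": (claims["claimant_age"] < 0) | (claims["claimant_age"] > 110),
    })
    print(errors.sum())  # summary of the underlying causes of errors

    # Set flagged values to missing, then "fill in" with a simple median impute.
    claims.loc[errors["negative_paid"], "paid_loss"] = np.nan
    claims.loc[errors["bad_age"], "claimant_age"] = np.nan
    claims = claims.fillna(claims.median())

A median impute is only the simplest option; in practice a model-based imputation may be preferred.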
EDA: Overview

[Sidebar diagram: data flow – Step 0: Data Requirements → Step 1: Data Collection → Step 2: Transformations & Aggregations → Step 3: Analysis → Step 4: Presentation of Results → Final Step: Decisions]
• Typically the first step in analyzing data
• Purpose:
  – Explore the structure of the data
  – Find outliers and errors
• Uses simple statistics and graphical techniques
• Examples include histograms, descriptive statistics and frequency tables
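A minimal sketch of such an EDA pass, not from the presentation; the file name “policies.csv” and the column names are illustrative assumptions.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Hypothetical policy data; file and column names are assumptions.
    df = pd.read_csv("policies.csv")

    print(df.describe())                  # descriptive statistics
    print(df["state"].value_counts())     # frequency table for a categorical field
    df["paid_loss"].hist(bins=30)         # histogram to spot outliers and errors
    plt.show()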
EDA: Histograms

[Figure: example histograms]
Data Educational Materials Working Party Formation

• The closest thing to data quality on the CAS syllabus is an introduction to statistical plans
• The CAS Data Management and Information Committee realized that SOX and predictive modeling have increased the need for quality data
• So they formed the CAS Data Management Educational Materials Working Party to find and gather materials to educate actuaries
CAS Data Management Educational Materials Working Party Publications

• Book reviews of data management and data quality texts in the CAS Actuarial Review, starting with the August 2006 edition
• These reviews are combined and compared in “Survey of Data Management and Data Quality Texts,” CAS Forum, Winter 2007, www.casact.org

This presentation references our recently published paper:
• “Actuarial IQ (Information Quality),” published in the Winter 2008 edition of the CAS Forum: http://www.casact.org/pubs/forum/08wforum/
Data Flow

Information Quality involves all steps:
• Data Requirements
• Data Collection
• Transformations & Aggregations
• Actuarial Analysis
• Presentation of Results
…in order to improve the Final Step:
• Making Decisions
Data Augmentation

• Add information from internal data
• Add information from external data
• For an overview of inexpensive sources of data, see “Free and Cheap Sources of Data” at the 2007 Predictive Modeling Seminar and “External Data Sources” at the 2008 Ratemaking Seminar
Data Augmentation – Internal Data

• Create aggregated statistics from internal data sources (see the sketch after this list):
  – Number of lawyers per zip
  – Claim frequency rate per zip
  – Frequency of back claims per state
• Use unstructured data
  – Text mining
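A minimal sketch of the aggregation idea, not from the presentation; the file and column names (“claims.csv”, “zip”, “claim_id”, “earned_exposure”) are illustrative assumptions.

    import pandas as pd

    # Hypothetical internal claims table; file and column names are assumptions.
    claims = pd.read_csv("claims.csv")   # one row per claim: zip, claim_id, earned_exposure

    # Aggregate internal data into new predictors, e.g. claim frequency per zip.
    by_zip = claims.groupby("zip").agg(
        claim_count=("claim_id", "count"),
        exposure=("earned_exposure", "sum"),
    ).reset_index()
    by_zip["claim_freq"] = by_zip["claim_count"] / by_zip["exposure"]

    # Merge the aggregated statistic back onto the modeling dataset.
    model_data = claims.merge(by_zip[["zip", "claim_freq"]], on="zip", how="left")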
Data Adjustments

• Trend
  – Adjust all records to a common cost level
  – Use a model to estimate trend
• Development
  – Adjust all losses to ultimate
  – Adjust all losses to a common age
  – Use a model to estimate future development

(Both adjustments are illustrated in the sketch below.)
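A minimal sketch of both adjustments, assuming a hypothetical loss file; the 5% trend rate and the age-to-ultimate factors are made up purely for illustration, not taken from the presentation.

    import pandas as pd

    # Hypothetical loss file with accident_year, age (months), paid_loss columns.
    losses = pd.read_csv("losses.csv")

    # Trend: adjust all records to a common cost level (5% annual trend assumed).
    COMMON_YEAR, TREND = 2008, 1.05
    years = COMMON_YEAR - losses["accident_year"]
    losses["trended_loss"] = losses["paid_loss"] * TREND ** years

    # Development: adjust all losses to ultimate with age-to-ultimate factors
    # (the factors below are illustrative, not fitted).
    age_to_ult = {12: 1.80, 24: 1.35, 36: 1.10, 48: 1.00}
    losses["ultimate_loss"] = losses["trended_loss"] * losses["age"].map(age_to_ult)

In practice both the trend rate and the development factors would themselves be estimated from models, as the slide notes.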
KDnuggets Poll on Data Methods

What are data miners using? How well does it work?
Major Kinds of Data Mining

• Supervised learning
  – The most common situation
  – A dependent variable
    • Frequency
    • Loss ratio
    • Fraud/no fraud
  – Some methods
    • Regression
    • Trees/Machine Learning
    • Some neural networks
• Unsupervised learning
  – No dependent variable
  – Group like records together
    • A group of claims with similar characteristics might be more likely to be fraudulent
    • Ex: Territory assignment, Text Mining
  – Some methods
    • Association rules
    • K-means clustering
    • Kohonen neural networks

(Both kinds are illustrated in the sketch below.)
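A minimal scikit-learn sketch of the contrast, not from the presentation, using synthetic data as a stand-in for claims: logistic regression (the baseline supervised method listed later) against k-means clustering.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for claims data: X = predictors, y = fraud/no fraud.
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # Supervised: a dependent variable (fraud/no fraud) guides the fit.
    clf = LogisticRegression().fit(X, y)
    fraud_prob = clf.predict_proba(X)[:, 1]

    # Unsupervised: no dependent variable; group like records together.
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    cluster_id = km.labels_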
KDnuggets Poll on Methods
KDnuggets Poll on Open Source Software
The Supervised Methods and Software Evaluated

Research by Derrig and Francis:
1) TREENET
2) Iminer Tree
3) S-PLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive CHAID
11) Naïve Bayes (Baseline)
12) Logistic regression (Baseline)
TREENET ROC Curve – IME

The area under the ROC curve (AUROC) measures how well a model ranks positives above negatives: 0.5 is no better than chance, 1.0 is perfect. For the TREENET model on the IME data, AUROC = 0.701.
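A minimal sketch of computing an ROC curve and AUROC with scikit-learn, not from the presentation; the toy arrays below are stand-ins for actual IME labels and model scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    # Toy stand-ins: y_true = 1 if the claim involved an IME, 0 otherwise;
    # y_score = the model's predicted probability for each claim.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

    auroc = roc_auc_score(y_true, y_score)    # 0.5 = chance, 1.0 = perfect ranking
    fpr, tpr, _ = roc_curve(y_true, y_score)  # points tracing the ROC curve
    print(f"AUROC = {auroc:.3f}")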
Monitoring Models

• Monitor use of the model
• Monitor the data going into the model (see the sketch below)
• Monitor performance
  – This requires more mature data
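The slides do not prescribe a technique for monitoring incoming data; one common choice is a population stability index (PSI) comparing the distribution of a predictor at scoring time against its training distribution. A minimal sketch under that assumption:

    import numpy as np

    def psi(expected, actual, bins=10):
        """Population stability index between training data and scoring-time data."""
        edges = np.histogram_bin_edges(expected, bins=bins)
        e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        # Avoid log(0) in sparsely populated bins.
        e_pct = np.clip(e_pct, 1e-6, None)
        a_pct = np.clip(a_pct, 1e-6, None)
        return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

    # Rough rule of thumb: PSI above ~0.25 suggests the inputs have drifted materially.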
Novelty Detection

An example of model interaction with people to improve business outcomes.

Problem statements:
• At the time of underwriting a risk, how different is the subject risk from the data used to build the model?
• How are the differences, if any, logically grouped for business meaning?
Clustering Methods Make Models

1. Select the features that you are interested in clustering, e.g. demographics, risk, auto, employment.
2. Run cluster algorithms within the grouped features to find homogeneous groups (let the data tell you the groupings). Each member has a distance to the ‘center’ of its cluster.
3. Explore each cluster and statistically describe it compared to the entire ‘world’ of the training data; create thresholds for the distance to the center that you care about; you may add additional description and learning.
4. Assign business meaning (names) to each homogeneous group of cluster members; deploy; score new data as it becomes available.
5. Look at novelty within each cluster on the new sample: distance, single-variable differences.
6. Use the thresholds to determine differences from the cluster membership (see the sketch below).
7. Investigate for business impact or unexpected changes.
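A minimal scikit-learn sketch of steps 2, 3, 5 and 6, not from the presentation: fit clusters on synthetic training data, set a per-cluster distance threshold, and flag new records beyond it. The data, the 6-cluster choice, and the 95th-percentile threshold are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 5))       # stand-in for the training "world"

    # Step 2: find homogeneous groups; each member has a distance to its center.
    km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_train)
    train_dist = km.transform(X_train)[np.arange(len(X_train)), km.labels_]

    # Step 3: a per-cluster threshold on distance to center (95th percentile here).
    thresholds = np.array([np.percentile(train_dist[km.labels_ == c], 95)
                           for c in range(km.n_clusters)])

    # Steps 5-6: score new data; flag records farther from their cluster's center
    # than that cluster's threshold.
    X_new = rng.normal(loc=0.5, size=(200, 5)) # new sample, slightly drifted
    labels = km.predict(X_new)
    dist = km.transform(X_new)[np.arange(len(X_new)), labels]
    novel = dist > thresholds[labels]
    print(f"{novel.mean():.1%} of new records flagged as novel")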
Novelty Score Uses

Novelty score: to detect ‘drift’ of aspects of clusters in predictor data over time.

• Dimensional novelty
  – Market cycles
  – Policy limits
  – Exposure
  – Geography
  – Demographics
• Operationalize
  – Book drift
  – Evaluation of pricing and marketing activities
  – Model refresh cycle
  – Regulatory support
Example – Automobile Insurance Data

Six clusters with the following statistical profile and distribution in the sample set; look at the data and assign names to the groups (in this case, three variables).
[Figure: demographic features and clusters – the “WORLD” view of the current book]
Display the distribution of named clusters within the grouping of features (Demographic Cluster) in the test set.
[Figure: view of the clusters in the current book of business within Demographics]
Monitor the changes in distribution of the clusters in the data over time.

[Figure: initial customer base vs. after 6 months – two clusters now show up in different percentages]
Humility

• Models incorporate significant uncertainties about parameters
• When deployed, models will likely not be as good as they were on historic data
• Need to appreciate the limitations of the models
Additional References

• www.kdnuggets.com
• www.r-project.org
• Encyclopedia of Data Warehousing and Data Mining, John Wang
• For GLMs: 2004 CAS Discussion Paper Program
• 2008 CAS Discussion Paper Program on multivariate methods
• “Distinguishing the Forest From the Trees” – 2006 Winter Forum, updated at www.data-mines.com
  – See other papers by Francis on the CAS web site
• Data Preparation for Data Mining Using SAS, Mamdouh Refaat
• Data Mining for Business Intelligence: Concepts, Techniques and Applications in Microsoft Excel with XLMiner, Shmueli, Patel and Bruce
Questions?