Data Mining and Actuarial Science
Data Mining Applications
in P&C Insurance
CASE Spring Meeting
April 12, 2005
Lijia Guo, PhD, ASA, MAAA
University of Central Florida
Agenda
Introduction to data mining modeling
Understanding the data mining process
Data mining (DM) techniques
Applications in P&C Insurance
Case Study
Introduction – What is Data Mining?
Process of exploration and analysis of large
quantities of data in order to discover meaningful
patterns and rules.
Uses a variety of data analysis tools to discover
relationships that may be used to make valid
predictions.
It is not a magic wand:
Must know your business
Understand your data
Understand the analytical methods
Introduction - DM Modeling
An information discovery process.
Knowing your goals
Understanding your data
Choosing the right methods
Understanding the limitations
Validation and testing
Make crucial business decisions
Introduction – DM Process
Understand the Economics
Define the Goal
Identify Data Sources
Prepare Data
Transform Data
Apply DM Models
Validate DM Models
IMPLEMENT
Introduction – DM Goals
Identifying responsive potential customers
Identifying existing customers that are more likely to terminate
Identifying high-risk purchasers
Identifying the factors that cause large claims
Identifying interactions among risk factors
Introduction – DM Process
DM Techniques
Decision Trees
Logistic regression
Neural Networks
Fuzzy Logic
Genetic Algorithms
Clustering
Association discovery
Sequence Discovery
Bayesian analysis
Visualization
Hybrid algorithms
DM Techniques -- Decision Trees
What are decision trees
Classify observations based on the values
of nominal, binary, or ordinal targets
Predict outcomes for interval targets
Predict the appropriate decision when you
specify decision alternatives
DM Techniques -- Decision Trees
Example: Classification Of Surrender Risk
Income > $50,000 (Yes or No)
  Yes: Job > 5 Years (Yes or No)
    If yes low risk, else high risk
  No: High Debt (Yes or No)
    If yes low risk, else high risk
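The surrender-risk example can be read as a pair of nested rules. The sketch below is only an illustration of that reading: the function name and argument names are hypothetical, and the leaf labels follow the slide as transcribed (including "low risk" for the high-debt branch).

```python
def surrender_risk(income, years_in_job, high_debt):
    """Classify surrender risk with the two-level tree from the slide.

    All argument names are illustrative assumptions; the slide only
    names the split questions and the leaf labels.
    """
    if income > 50_000:
        # "Yes" branch of the root: split on job tenure.
        return "low" if years_in_job > 5 else "high"
    # "No" branch of the root: split on the high-debt flag,
    # with leaf labels exactly as they appear on the slide.
    return "low" if high_debt else "high"


# Example: an applicant above the income threshold with 7 years in the job.
print(surrender_risk(income=60_000, years_in_job=7, high_debt=False))  # low
```

Each root-to-leaf path is one readable rule, which is why trees are often described as giving insight into the decision-making process.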
DM Techniques -- Decision Trees
Strengths and weaknesses
Insights into the decision-making process
Efficient, and thus suitable for large data sets
Relatively unstable
Difficult to detect linear or quadratic relationships
DM Techniques -- Logistic regression
What is logistic regression
How logistic regression works
Odds ratios
Each independent variable affects the logit linearly
$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}, \quad \text{where } i = 1, 2, \ldots, n.$$
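To make the logit relationship concrete, here is a minimal sketch assuming coefficients are already known (the values used are made up for illustration and do not come from the presentation). It shows that the logit of the predicted probability is the linear predictor, and that raising one input by one unit multiplies the odds by the exponential of its coefficient.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def predict_probability(beta0, betas, x):
    """Invert the logit: p = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))


def odds_ratio(beta_j):
    """Multiplicative change in the odds for a one-unit increase in x_j."""
    return math.exp(beta_j)


# Hypothetical coefficients and inputs, for illustration only.
p = predict_probability(beta0=-1.0, betas=[0.5, 0.2], x=[2.0, 1.0])
print(p, logit(p), odds_ratio(0.5))
```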
DM Techniques - Logistic Regression
Strengths and weaknesses
Maximum Likelihood Curve Fitting
Multiple Logistic Regression Model
Interaction-effect modifier
Multinomial Logistic Regression Model
DM Techniques -- Neural Networks
What are Neural Networks
Input layer: a unit for each input variable
Hidden layer: hidden units (neurons)
Output layer: the target
[Figure: network with inputs x1, x2, x3, hidden units H1 and H2, and output y; input-to-hidden weights w11, w21, w31, w12, w22, w32; hidden-to-output weights w1, w2]
DM Techniques – Neural Networks
$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$
$$H_1 = g_1(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$
$$H_2 = g_2(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$
$g_0(\cdot)$: output activation function.
$g_i(\cdot)$: activation functions (nonlinear transformations).
$w_{11}, w_{21}, \ldots, w_{32}, w_1, w_2$: weights
$w_0, w_{01}, w_{02}$: bias
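The three equations describe one forward pass through a network with three inputs, two hidden units, and one output. The sketch below assumes tanh for the hidden activations and a logistic output activation; both choices, and every weight value, are assumptions for illustration rather than settings from the presentation.

```python
import math


def logistic(eta):
    """Logistic function, used here as the assumed output activation g0."""
    return 1.0 / (1.0 + math.exp(-eta))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute E(y) = g0(w0 + w1*H1 + w2*H2) with H_k = tanh(w0k + sum_j wjk * x_j).

    hidden_weights[k] holds the weights feeding hidden unit k+1,
    e.g. [w11, w21, w31] for H1.
    """
    hidden = [
        math.tanh(bias + sum(w * xj for w, xj in zip(weights, x)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    eta = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    return logistic(eta)


# Hypothetical weights, chosen only to exercise the function.
y_hat = forward(
    x=[1.0, 0.5, -2.0],
    hidden_weights=[[0.3, -0.1, 0.2], [0.4, 0.6, -0.5]],
    hidden_biases=[0.1, -0.2],
    output_weights=[0.7, -0.3],
    output_bias=0.05,
)
print(y_hat)
```

Training would adjust the weights and biases to reduce prediction error; only the prediction step is shown here.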
DM Techniques –Neural Networks
How Neural Networks work
Processing elements
Training
Predicting
Activation Functions
• logistic function: $l(\eta) = \frac{1}{1 + e^{-\eta}}$
• hyperbolic tangent: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
DM Techniques -- Neural Networks
Strengths and weaknesses
• Accurate prediction for complex problems
• Black-box prediction engine
• Overtraining
• Training speed
DM Techniques -- Hybrid Algorithms
Problems with standard algorithms
Advanced algorithms
Discovery-driven approaches
Mixture of algorithms
DM Applications in P&C Insurance
Data Warehouse
Underwriting
Pricing/Rate Making
Claim Scoring
Risk Management
Policy Level Analysis
Variable Selection
Data Warehousing Example
[Figure: data warehouse flow for a patient-level data set; labels recovered from the slide follow]
Source data: Transactions, Surveys, Demographics
Transactions: Pharmacy (Rx) Claims, Hospital Claims, Physician Claims, Med Claims
Primary Selection: WHO? (Unique Patient List)
Secondary Selection: WHAT DATA?
Tertiary Selection: WHAT DOES THE TRANSACTION DATA TELL US? (Service Level Table, Derived Variables/Flags)
Group by Patient: Summary Level Table (Service Level Variables, Summary Level Variables)
Summary: WHAT DO WE KNOW ABOUT THIS PATIENT?
DM in Insurance Underwriting
Improving profit margin.
Gaining competitive edge
Risk evaluation process.
  Lots of variables
  Lots of interactions
Easy to follow procedure.
  Decision trees can be used
DM in Insurance Underwriting - Auto Driver’s Claim Information
Variable        Variable Type  Measurement Level  Description
Age             Continuous     Interval           Driver’s age in years
Car age         Continuous     Interval           Age of the car
Car type        Categorical    Nominal            Type of the car
Gender          Categorical    Binary             F=female, M=male
Coverage level  Categorical    Nominal            Policy coverage
Education       Categorical    Nominal            Education level of the driver
Location        Categorical    Nominal            Location of residence
Climate         Categorical    Nominal            Climate code for residence
Credit rating   Continuous     Interval           Credit score of the driver
ID              Input          Nominal            Driver’s identification number
No. of claims   Categorical    Nominal            Number of claims
DM in Insurance Underwriting
- Decision Tree Diagram
DM in Pricing/Rate Making
Data: Auto Driver’s Claim Information
Decision tree analysis to identify risk factors that predict profits, claims and losses
Logistic regression applied to model
  Claim frequency
  Effect of each risk factor
DM in Pricing/Rate Making
Effect T-scores from the logistic regression
DM in Pricing/Rate Making - Assessment
Cross-model comparisons of the expected to actual profits/losses
  Independent of all other factors (sample size,..)
Lift charts
  Compare the % claim-occurrence value captured by the model to a random baseline model
  Performance quality is shown by the degree to which the lift chart curve pushes upward and to the left
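As a rough illustration of how a lift chart is built, the sketch below ranks drivers by model score, computes the cumulative share of claims captured in each top fraction, and divides by the share a random baseline would capture. The function names and the toy scores are assumptions for illustration; they are not the data behind the charts in this presentation.

```python
def cumulative_gains(scores, claims, n_bins=10):
    """Share of all claims captured in the top k/n_bins of drivers, k = 1..n_bins.

    scores: model scores (higher = more likely to claim).
    claims: 1 if the driver had a claim, else 0.
    """
    ranked = sorted(zip(scores, claims), key=lambda pair: pair[0], reverse=True)
    total_claims = sum(claims)
    n = len(ranked)
    gains = []
    for k in range(1, n_bins + 1):
        cutoff = round(k * n / n_bins)
        captured = sum(claim for _, claim in ranked[:cutoff])
        gains.append(captured / total_claims)
    return gains


def lift(gains):
    """Ratio of captured share to the random-baseline share (k / n_bins)."""
    n_bins = len(gains)
    return [g / ((k + 1) / n_bins) for k, g in enumerate(gains)]


# Toy data: ten drivers, three of whom had a claim.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
claims = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(scores, claims)
print(gains)
print(lift(gains))
```

A model with no predictive power would have lift close to 1 in every bin, which is the random baseline the slide refers to.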
DM in Pricing/Rate Making - Lift Chart for Logistic Regression
Logistic regression
- Captured 30% of the drivers in the 10th percentile
- Better predictive power from about the 20th to the 80th percentiles
DM in Risk Management
Reinsurance
  To structure more effectively by segmentation
Hedging
Target retention and building loyalty
DM in Policy Level Analysis
Retention analysis
Profitability analysis
Policyholder’s behavior
DM methods used
Neural networks
Decision trees
Logistic regression
Applications – Variable Selection
Problem -- Given {Y, X} where $X = \{x_1, x_2, \ldots, x_N\}$
Find $F$ such that $F(X) \approx Y$
Find $Z \subseteq X$ and $F^*$ such that $F^*(Z) \approx Y$
Improving model accuracy and efficiency
Making crucial business decisions
Case Study - Group Insurance
Identify ways to build upon the current manual rating structure, utilizing existing rating variables to develop a practical tool to guide underwriting in rate adjustments
Identify any new rating variables with significant predictive power
  Currently gathered, but not utilized, data
  Transformations of existing variables
  Introduce new rating variables (e.g. external financial data)
Case Study – Group Insurance
Profit margin over x year period
128 input variables
Principal Components Analysis applied
42 variables remain
How to improve business profit?
Case Study - Goals
Developing a practical underwriting tool
  Detecting deviations
  Identifying key drivers
Improving model predictive power
  Risk selection
Function Approximation
$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \ldots + \beta_M T_M(X)$$
$F_0$ is the initial guess
Stagewise approximation
Each stage is added to reduce errors
Each stage is a weak learner: a small tree.
Sequential adjustment
Regression Tree Example
Profit = 6.5%
  + 0.8% if AS > 421, -0.5% otherwise
  + 1.2% if male younger than 30, -1.1% otherwise
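The regression tree example is an additive model: a base profit plus one adjustment per small tree. A minimal sketch of evaluating it follows; the function and argument names are hypothetical, and "AS" is kept as an unexplained rating variable because the slide does not define it.

```python
def predicted_profit(as_value, is_male, age):
    """Evaluate the two-stage additive model from the slide, in percent.

    as_value stands in for the slide's undefined "AS" variable.
    """
    profit = 6.5  # initial guess F0
    # Stage 1: adjustment from the split on AS.
    profit += 0.8 if as_value > 421 else -0.5
    # Stage 2: adjustment from the split on "male younger than 30".
    profit += 1.2 if (is_male and age < 30) else -1.1
    return profit


# Example: AS above the threshold, male, age 25.
print(round(predicted_profit(as_value=500, is_male=True, age=25), 1))  # 8.5
```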
Function Approximation
GIVEN
Y: Output and X: Inputs or Predictors
L(Y, F): Loss Function
ESTIMATE
$$F^*(X) = \arg\min_{F(X)} E_{Y,X}\left[L(Y, F(X))\right]$$
Classical Function Approximation
$F = F(X, \beta), \quad \beta = \{\beta_j\}$
Solve $\{\beta_j\}$ from
$$\min_{\beta} L(Y, F(X, \beta))$$
Nonparametric Function
Approximation
Initial guess $\{F_0(X_i)\}$
Compute $g = \left\{ \frac{\partial L}{\partial F(X_i)} \right\}_{i=1}^{N}$
Take a step in the steepest descent direction
Gradient Boosting
$$L(\{F(X_i)\}) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - F(X_i))^2$$
Initial guess $\{F_0(X_i)\}$
FOR m = 1 TO M
  $g_m = \nabla L(F_{m-1}(X_i))$
  Fit an L-node regression tree to the current residuals
  For each given node, calculate node average residual $h_m(X_i)$
  Update: $\{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i)$
END
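The loop can be sketched in a few lines for squared-error loss, where the residuals play the role of the negative gradient. This is a simplified illustration, not the software used in the case study: it assumes a single numeric predictor with at least two distinct values, uses two-node trees (stumps), and adds a `learning_rate` shrinkage factor that does not appear on the slide. All function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a two-node regression tree: pick the split with the lowest squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)      # node average residual
        right_mean = sum(right) / len(right)   # node average residual
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_stages=50, learning_rate=0.1):
    """Stagewise fit: start from the mean, then repeatedly fit stumps to residuals."""
    f0 = sum(y) / len(y)                       # initial guess F0
    predictions = [f0] * len(y)
    stages = []
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        h = fit_stump(x, residuals)
        stages.append(h)
        # Update F_m = F_{m-1} + (shrunken) h_m.
        predictions = [pi + learning_rate * h(xi) for xi, pi in zip(x, predictions)]
    return lambda xi: f0 + learning_rate * sum(h(xi) for h in stages)


# Toy data with a step between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(x, y)
print([round(model(xi), 2) for xi in x])
```

Setting `learning_rate=1.0` recovers the unshrunken update written on the slide; smaller values trade more stages for a smoother fit.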
Case Study
[Figures: Two Predictor Dependence For PROFIT_MARGIN (two plots)]
Case Study
- Single Stats and Variable Importance
Input       Additive  Multiplicative  Importance
Variable 1  0.2679    0.2690          100.00
Variable 2  0.2779    0.3203          75.23
Variable 3  0.1456    0.1771          54.65
Variable 4  0.2263    0.2469          47.41
Variable 5  0.1059    0.1425          42.81
Variable 6  0.2741    0.2847          34.81
Variable 7  0.1289    0.1306          34.27
Variable 8  0.0797    0.0864          25.35
Variable 9  0.1129    0.1148          23.37
Case Study
- Pair Stats and Variable Importance
Variables                Additive  Multiplicative
Variable 1 & Variable 2  0.3714    0.3847
Variable 2 & Variable 3  0.3704    0.4066
Variable 2 & Variable 4  0.3686    0.4010
Variable 2 & Variable 7  0.3401    0.3856
Variable 3 & Variable 4  0.2795    0.3137
Variable 3 & Variable 6  0.2895    0.3082
Variable 4 & Variable 7  0.2417    0.2592
Variable 5 & Variable 6  0.2622    0.2766
Variable 6 & Variable 7  0.2904    0.3066
Predictive Modeling
Predicts deviations from expected profitability (used 9 variables)
Practical guide for underwriters to use for rate adjustments
New variables identified to have strong predictive power
Improve business profit (20% Profit margin)
Importance of Multiple Techniques
Robust model with high predictive accuracy
Practical constraints
Algorithm complexity
Ease of understanding of results
Is Data Mining for you?
Defining the goals
Understanding your data
Using multiple techniques
Improving your decision making process
Gaining competitive edges!
Thank you!