Data Mining in Insurance


Does Credit Score Really Help Explain Insurance Losses?
Cheng-Sheng Peter Wu, FCAS, ASA, MAAA
Jim Guszcza, ACAS, MAAA, Ph.D.

Themes
• The History
• What Does the Question Mean?
• Simpson’s Paradox – Need for Multivariate Analysis
• What Has Been Done So Far?
• Our Large-Scale Data Mining Experience
• Going Beyond Credit
• Conclusions

The History
Pricing/Class Plans
• Few factors before World War II
• Explosion of class plan factors after the War
• Current class plans (Auto) – territory, driver, vehicle, loss and violation, others, tiers/company, etc.
• Actuarial techniques – Minimum Bias & GLM

The History
Credit
• First important factor identified in the past two decades
• Composite multivariate score vs. raw credit information
• Introduced in the late 1980s and early 1990s
• Viewed at first as a “secret weapon”
• Currently almost everyone is using it
• Industry scores vs. proprietary scores
• Quiet, confidential, controversial, black-box, etc.

What Does the Question Mean?
Can Credit Score Really “Explain” Insurance Losses?
“X explains Y”
• Weaker than claiming that X causes Y
• Stronger than merely reporting that X is correlated with Y

What Does the Question Mean?
Working Definition
• We say that “X helps explain Y” if:
  – X is correlated with Y
  – The correlation does not go away when other available, measurable information is introduced

What Does the Question Mean?
Intuition Behind the Definition
• It might be okay for X to be a proxy for a “true” cause of Y
  – Testosterone level might be a true cause of auto losses, but it is not available
  – Age/Gender is a reasonable proxy
• It might not be okay for X to be a proxy for other available predictive information

What Does the Question Mean?
Applying the Definition
• Suppose we see that credit score plays an important role in a multivariate regression equation that predicts loss ratio
• Then it is fair to say that credit helps explain insurance losses
• A multivariate study is needed

Simpson’s Paradox – Need for Multivariate Analysis
Statistics can lie
• Illustrates how a univariate association can lead to a spurious conclusion
• The “true” explanatory factor is masked by the spurious correlation
• Famous example: 1973 Berkeley admissions data

Simpson’s Paradox – Need for Multivariate Analysis
The Berkeley Example (stylized)
• 2200 people applied for admission
• 1100 men; 1100 women
• 210 men and 120 women were accepted
• Clear-cut case of gender discrimination…
• … Or is it?

Simpson’s Paradox – Need for Multivariate Analysis

            # Applicants           # Accepted             % Accepted
            Arts   Eng    Total    Arts   Eng    Total    Arts   Eng    Total
Female      1000   100    1100     100    20     120      10%    20%    11%
Male        100    1000   1100     10     200    210      10%    20%    19%
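
The arithmetic behind this slide can be checked with a short sketch. It uses only the stylized counts shown on the slide; the dictionary layout and the function name `acceptance_rate` are illustrative choices, not part of the original presentation.

```python
# Stylized Berkeley counts from the slide: (applicants, accepted).
counts = {
    ("Female", "Arts"): (1000, 100),
    ("Female", "Eng"): (100, 20),
    ("Male", "Arts"): (100, 10),
    ("Male", "Eng"): (1000, 200),
}

def acceptance_rate(gender, school=None):
    """Acceptance rate for a gender, optionally restricted to one school."""
    cells = [
        v for (g, s), v in counts.items()
        if g == gender and (school is None or s == school)
    ]
    applied = sum(a for a, _ in cells)
    accepted = sum(k for _, k in cells)
    return accepted / applied

for gender in ("Female", "Male"):
    by_school = {s: acceptance_rate(gender, s) for s in ("Arts", "Eng")}
    print(gender, by_school, "overall:", round(acceptance_rate(gender), 3))
```

Within each school the two rates are identical; the gap appears only in the totals, because most women applied to the school with the lower acceptance rate.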

Simpson’s Paradox – Need for Multivariate Analysis
REGRESSION RESULTS

            Beta    T-Score
Intercept   0.109   10.2
Gender      0.082   5.1

            Beta    T-Score
Intercept   0.10    9.20
Gender      0.00    0.00
School      0.10    3.80
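
The Beta columns can be reproduced from the stylized admissions counts with an ordinary least-squares fit. The sketch below is a minimal illustration, not the authors' code: the 0/1 coding (female = 0, male = 1; Arts = 0, Eng = 1) is an assumption, and the helpers `solve` and `ols` are hypothetical names. It solves the normal equations exactly with `fractions.Fraction`, and it does not attempt to reproduce the T-scores.

```python
from fractions import Fraction

# Stylized admissions cells: (gender, school, applicants, accepted).
# Assumed coding: gender 0 = female, 1 = male; school 0 = Arts, 1 = Eng.
CELLS = [
    (0, 0, 1000, 100),
    (0, 1, 100, 20),
    (1, 0, 100, 10),
    (1, 1, 1000, 200),
]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def ols(columns):
    """Fit accepted (0/1) on an intercept plus the named predictors.

    Builds X'X and X'y from cell counts instead of expanding 2200 rows.
    """
    k = len(columns) + 1
    xtx = [[Fraction(0)] * k for _ in range(k)]
    xty = [Fraction(0)] * k
    for gender, school, applied, accepted in CELLS:
        values = {"gender": gender, "school": school}
        x = [1] + [values[c] for c in columns]
        for i in range(k):
            xty[i] += x[i] * accepted  # y is 1 only for accepted applicants
            for j in range(k):
                xtx[i][j] += x[i] * x[j] * applied
    return solve(xtx, xty)

print([float(b) for b in ols(["gender"])])            # intercept, gender
print([float(b) for b in ols(["gender", "school"])])  # intercept, gender, school
```

With gender alone, the coefficients come out to 6/55 and 9/110, which round to the 0.109 and 0.082 on the slide. Once school is added, the gender coefficient is exactly zero and the school coefficient is 0.10.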

Simpson’s Paradox – Need for Multivariate Analysis

              # Policies                  # Policies w/Claims         Frequency
              Adult   Youthful   Total    Adult   Youthful   Total    Adult   Youthful   Total
Good Credit   1000    100        1100     100     20         120      10%     20%        11%
Bad Credit    100     1000       1100     10      200        210      10%     20%        19%

What Has Been Done So Far
• We (actuaries) have been quiet
• Few published actuarial studies/opinions
  – NAIC/Tillinghast (1997)
  – Monaghan’s Study (2000)
• Recent/related studies
  – Virginia State Study (1999)
  – CAS Sub-Committee (2002)
  – Washington State Study (2003)
  – University of Texas Study (2003)

What Has Been Done So Far
Relevant Actuarial/Statistical Principles
• Pure premium vs. loss ratio
  – Loss ratio studies go beyond existing rating plans, and are implicitly multivariate
• Independence vs. correlation
  – Most insurance variables are correlated
• Univariate vs. multivariate
  – Correlated variables call for multivariate studies for true answers (Simpson’s Paradox)
• Credibility vs. homogeneity
  – Studies need to be credible and representative
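
One way to see why a loss ratio study is implicitly multivariate is to reuse the Good Credit / Bad Credit policy counts from the earlier Simpson’s Paradox slide. The sketch below is illustrative only. The claim severity and the age-based premiums are hypothetical numbers, not from the source, and it assumes one claim per policy with a claim.

```python
# Policy counts from the deck's credit example: (policies, policies with claims).
cells = {
    ("Good Credit", "Adult"): (1000, 100),
    ("Good Credit", "Youthful"): (100, 20),
    ("Bad Credit", "Adult"): (100, 10),
    ("Bad Credit", "Youthful"): (1000, 200),
}

# Hypothetical assumptions, not from the source:
SEVERITY = 5000                             # cost per claim, one claim per policy
PREMIUM = {"Adult": 800, "Youthful": 1600}  # existing class plan rates by age only

def summarize(credit):
    """Return (pure premium, loss ratio) for one credit group."""
    policies = losses = premium = 0
    for (c, age), (n, claims) in cells.items():
        if c == credit:
            policies += n
            losses += claims * SEVERITY
            premium += n * PREMIUM[age]
    return losses / policies, losses / premium

for credit in ("Good Credit", "Bad Credit"):
    pure_premium, loss_ratio = summarize(credit)
    print(f"{credit}: pure premium {pure_premium:.2f}, loss ratio {loss_ratio:.3f}")
```

Under these assumptions the pure premiums differ sharply by credit group, while the loss ratios are identical. The premium already reflects age, so a loss ratio comparison only picks up what the existing rating plan has not explained.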

What Has Been Done So Far
The Tillinghast Study
• 9 companies’ data, seems representative
• Loss ratio study
• No other predictive variables included in the study
• No detailed information given about the data
• Strong correlation with loss ratio, seems credible
• This is true, but it doesn’t answer our question and doesn’t quiet the critics

What Has Been Done So Far
Tillinghast Study of 9 Companies' Data
Loss Ratio Relativity of the Best and Worst 20% of Credit Score

            Co1    Co2    Co3    Co4    Co5    Co6    Co7    Co8    Co9    Avg
Best 20%    -38%   -29%   -19%   -15%   -14%   -34%   -22%   -22%   -36%   -25%
Worst 20%   48%    20%    32%    30%    46%    59%    20%    22%    95%    41%
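
The Avg column appears consistent with an unweighted mean of the nine company relativities. The slide does not say whether companies were weighted (for example by premium volume), so the check below assumes a simple average; `simple_average` is an illustrative helper name.

```python
# Loss ratio relativities (in percent) for the nine companies on the slide.
best_20 = [-38, -29, -19, -15, -14, -34, -22, -22, -36]
worst_20 = [48, 20, 32, 30, 46, 59, 20, 22, 95]

def simple_average(values):
    """Unweighted mean; assumes every company counts equally."""
    return sum(values) / len(values)

print(round(simple_average(best_20)))
print(round(simple_average(worst_20)))
```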

What Has Been Done So Far
Monaghan’s Study
• Loss ratio study
• Large amount of data – credible analysis
• Analyzes individual credit variables as well as score
• Multivariate analysis – limited to score + 1 traditional rating variable at a time
• Shows that strong correlations with loss ratio do not go away in the presence of other variables
• Another good step, but we can go further

Our Large-Scale Data Mining Experience
Our Work
• Loss ratio studies
• Multiple studies – representative
• Large amounts of data – credible
• Hundreds of variables tested along with credit – truly multivariate
  – Policy, driver, vehicle, coverages, billing, agency, external data, synthetic, etc.
• Sound actuarial and statistical model design
• Disciplined data mining process

Our Large-Scale Data Mining Experience
What Have We Found Out?
• Credit score is always one of the top variables selected for the multivariate models
• Credit score has among the strongest parameters and statistical measurements (t-score)
  – Credit’s predictive power does not go away in the truly multivariate context
• Removing credit score dampens the predictive power of the models

Our Large-Scale Data Mining Experience
What Do We Conclude?
• We conclude that credit score bears an unambiguous relationship to insurance losses, and is not a mere proxy for other kinds of information available to insurance companies.
• This does not mean that credit score is the “cause” of insurance losses

Our Large-Scale Data Mining Experience
Why Is Credit Score Correlated with Insurance Losses?
• Beyond the scope of our work
  – Our emphasis is not on causation
• Plausible speculations include
  – Stress/planning & organization
  – Risk-seeking behavior
  – ??
• Analogy: Age/Gender might be a proxy for testosterone

Going Beyond Credit
Can We Do Well Without Credit?
• YES: non-credit predictive models are
  – Valuable alternative to credit scores
  – Flexible
  – Tailored to individual companies
  – Comparable predictive power to credit scores
• Also possible to build mixed credit/non-credit models

Going Beyond Credit
Keys to Building Successful Non-Credit Models:
• Fully utilize all sources of information
  – Leverage company’s internal data sources
  – Enrich with other external data sources
• Use large amount of data
• Employ disciplined analytical process
• Utilize state-of-the-art modeling tools
• Apply multivariate methodology

Going Beyond Credit
Advantages of Going Beyond Credit
• Next generation of competitive advantage
• More variables, more predictive power
• Leverages company’s internal data sources
• More flexibility
• Addresses regulatory issues and public concerns
• Expense savings
• Everyone gets a score (less of a “no hit” problem)
• More customized – less “plain vanilla” than credit score

Conclusions
• Credit works… even in a fully multivariate setting
• But non-credit models can work well too!
• What it means to us – beginning of a new era
  – Advances in computer technology
  – Advances in predictive modeling techniques
  – Large-scale multivariate studies now practical
  – More external and internal info, anything else out there?
  – Other ways to go beyond credit?

Conclusions
Future work on this topic
• Multivariate pure premium analysis would provide more insights
• Further study of public policy issues
  – WA, VA came to opposite conclusions
• Comparison of various existing scoring models