Data Mining – Best Practices
Part #2
Richard Derrig, PhD,
Opal Consulting LLC
CAS Spring Meeting
June 16-18, 2008
Data Mining
Data Mining, also known as Knowledge Discovery in Databases (KDD), is the
process of automatically searching large
volumes of data for patterns. In order to
achieve this, data mining uses
computational techniques from statistics,
machine learning and pattern recognition.
www.wikipedia.org
AGENDA
Predictive v Explanatory Models
Discussion of Methods
Example: Explanatory Models for
Decision to Investigate Claims
The “Importance” of Explanatory
and Predictive Variables
An Eight Step Program for Building
a Successful Model
Predictive v Explanatory Models
Both are of the form: Target or Dependent
Variable is a Function of Feature or
Independent Variables that are related to
the Target Variable
Explanatory Models assume all Variables
are Contemporaneous and Known
Predictive Models assume all Variables
are Contemporaneous and Estimable
Desirable Properties of a Data Mining
Method:
Any nonlinear relationship between target
and features can be approximated
A method that works when the form of the
nonlinearity is unknown
The effect of interactions can be easily
determined and incorporated into the model
The method generalizes well on
out-of-sample data
Major Kinds of Data Mining Methods
Supervised learning
Most common situation
Target variable: Frequency, Loss ratio, Fraud/no fraud
Some methods: Regression, Decision Trees, Some neural networks
Unsupervised learning
No Target variable
Group like records together (Clustering)
A group of claims with similar characteristics
might be more likely to be of similar risk of loss
Ex: Territory assignment
Some methods: PRIDIT, K-means clustering, Kohonen neural networks
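As a rough illustration of grouping like records together, the sketch below runs a plain k-means clustering on made-up two-feature claim records. The feature values, the `kmeans` helper and its parameters are assumptions for illustration and do not come from the presentation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct records
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
              for p in points]
    return centers, labels

# Hypothetical claims described by two standardized features.
claims = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(claims, k=2)
print(labels)
```

No target variable is used: the two groups emerge only from the similarity of the records.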
The Supervised Methods and Software
Evaluated
1) TREENET
2) Iminer Tree
3) SPLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive Chaid
11) Naïve Bayes (Baseline)
12) Logistic Regression (Baseline)
Decision Trees
In decision theory (for example risk
management), a decision tree is a graph of
decisions and their possible consequences,
(including resource costs and risks) used to
create a plan to reach a goal. Decision trees are
constructed in order to help with making
decisions. A decision tree is a special form of
tree structure.
www.wikipedia.org
CART – Example of 1st split on Provider 2
Bill, With Paid as Dependent
1st Split
All Data: Mean = 11,224
Bill < 5,021: Mean = 10,770
Bill >= 5,021: Mean = 59,250
For the entire database, the total squared deviation of paid losses around the
predicted value (i.e., the mean) is 4.95 x 10^13. The SSE declines to 4.66 x 10^13
after the data are partitioned using $5,021 as the cutpoint.
Any other partition of the provider bill produces a larger SSE than 4.66 x 10^13.
For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 x 10^13.
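The cutpoint search described above can be sketched as follows: try every candidate cutpoint on the predictor and keep the one with the smallest combined SSE of the two partitions. The bill and paid amounts are made-up toy values, and `best_split` is a hypothetical helper, not the CART software's API.

```python
def sse(values):
    """Sum of squared deviations of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cutpoint, total SSE) for the split x < cut versus x >= cut."""
    best_cut, best_total = None, None
    for cut in sorted(set(x))[1:]:  # skip the minimum so both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = sse(left) + sse(right)
        if best_total is None or total < best_total:
            best_cut, best_total = cut, total
    return best_cut, best_total

# Toy data: provider bill (predictor) and paid loss (dependent variable).
bill = [100, 200, 300, 6000, 7000, 8000]
paid = [10, 12, 11, 50, 55, 52]

cut, split_sse = best_split(bill, paid)
print(cut, split_sse, sse(paid))
```

A full tree repeats the same search inside each partition until a stopping rule is met.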
Different Kinds of Decision Trees
Single Trees (CART, CHAID)
Ensemble Trees, a more recent development
(TREENET, RANDOM FOREST)
A composite or weighted average of many trees
(perhaps 100 or more)
There are many methods to fit the trees and prevent
overfitting
Boosting:
Iminer Ensemble and Treenet
Bagging: Random Forest
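A minimal sketch of the bagging idea, assuming one-split trees (stumps) and made-up data: each tree is fit to a bootstrap resample and the ensemble prediction is the plain average of the trees. Random Forest additionally samples candidate features at each split, which this sketch omits. All function names are hypothetical.

```python
import random

def fit_stump(x, y):
    """Fit a one-split tree; returns (cut, left_mean, right_mean)."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((t - m) ** 2 for t in v)
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, cut, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # resample had a single distinct x value: no split possible
        mean = sum(y) / len(y)
        return (None, mean, mean)
    return best[1:]

def predict_stump(stump, xi):
    cut, left_mean, right_mean = stump
    if cut is None or xi < cut:
        return left_mean
    return right_mean

def bagged_stumps(x, y, n_trees=100, seed=0):
    """Fit each stump to a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagged(stumps, xi):
    """Ensemble prediction: unweighted average over all trees."""
    return sum(predict_stump(s, xi) for s in stumps) / len(stumps)

bill = [100, 200, 300, 6000, 7000, 8000]
paid = [10, 12, 11, 50, 55, 52]
stumps = bagged_stumps(bill, paid)
print(predict_bagged(stumps, 50), predict_bagged(stumps, 9000))
```

Boosting differs in that trees are fit sequentially, each one to the errors left by the trees before it.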
Neural Networks
Three Layer Neural Network
Input Layer (Input Data)
Hidden Layer (Process Data)
Output Layer (Predicted Value)
NEURAL NETWORKS
Self-Organizing Feature Maps
T. Kohonen 1982-1990 (Cybernetics)
Reference vectors of features map to
OUTPUT format in topologically faithful
way. Example: Map onto 40x40 2-dimensional square.
Iterative Process Adjusts All Reference
Vectors in a “Neighborhood” of the
Nearest One. Neighborhood Size
Shrinks over Iterations
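The iterative process above can be sketched as below: find the reference vector nearest to an input, move every reference vector in its grid neighborhood toward that input, and shrink the neighborhood as training proceeds. The 4x4 grid, the linear decay schedules, the learning rate and the data are all assumptions chosen for illustration.

```python
import random

def best_matching_unit(grid, x):
    """Grid cell whose reference vector is nearest to x."""
    return min(grid, key=lambda rc: sum((w - xi) ** 2 for w, xi in zip(grid[rc], x)))

def train_som(data, rows, cols, iterations=200, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One reference vector per grid cell, randomly initialized in [0, 1).
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    max_radius = max(rows, cols) / 2
    for t in range(iterations):
        frac = t / iterations
        radius = max_radius * (1 - frac)  # neighborhood shrinks over iterations
        rate = 0.5 * (1 - frac)           # learning rate decays as well
        x = rng.choice(data)
        bmu = best_matching_unit(grid, x)
        for (r, c), w in grid.items():
            if abs(r - bmu[0]) <= radius and abs(c - bmu[1]) <= radius:
                for i in range(dim):
                    w[i] += rate * (x[i] - w[i])  # move toward the input
    return grid

# Hypothetical claims with two features scaled to [0, 1].
data = [(0.1, 0.1), (0.15, 0.05), (0.9, 0.9), (0.85, 0.95)]
som = train_som(data, rows=4, cols=4)
print(best_matching_unit(som, (0.1, 0.1)), best_matching_unit(som, (0.9, 0.9)))
```

After training, each claim is placed on the map at its best matching unit, so similar claims land in nearby cells.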
FEATURE MAP
SUSPICION LEVELS
[Figure: feature map shaded by suspicion level; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]
FEATURE MAP
SIMILARITY OF A CLAIM
[Figure: feature map shaded by similarity of a claim; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]
DATA MODELING EXAMPLE: CLUSTERING
Data on 16,000
Medicaid providers
analyzed by
unsupervised neural net
Neural network
clustered Medicaid
providers based on
100+ features
Investigators validated a
small set of known
fraudulent providers
Visualization tool
displays clustering,
showing known fraud
and abuse
Subset of 100 providers
with similar patterns
investigated: Hit rate >
70%
© 1999 Intelligent Technologies Corporation
Cube size proportional to annual Medicaid revenues
Multivariate Adaptive Regression Splines
(MARS)
MARS fits a piecewise linear regression
BF1 = max(0, X - 1,401.00)
BF2 = max(0, 1,401.00 - X)
BF3 = max(0, X - 70.00)
Y = 0.336 + .145626E-03 * BF1 - .199072E-03 * BF2
- .145947E-03 * BF3
BF1, BF2, BF3 are basis functions
MARS uses statistical optimization to find best basis
function(s)
Basis function similar to dummy variable in regression.
Like a combination of a dummy indicator and a linear
independent variable
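The fitted equation on this slide can be evaluated directly. The sketch below codes the three basis functions and the coefficients exactly as shown; the function name `mars_prediction` is hypothetical, and X is whatever predictor the slide's model used.

```python
def mars_prediction(x):
    """Piecewise linear MARS fit from the slide, with knots at 70 and 1,401."""
    bf1 = max(0.0, x - 1401.00)   # active only above the knot at 1,401
    bf2 = max(0.0, 1401.00 - x)   # active only below the knot at 1,401
    bf3 = max(0.0, x - 70.00)     # active only above the knot at 70
    return 0.336 + 0.145626e-03 * bf1 - 0.199072e-03 * bf2 - 0.145947e-03 * bf3

for x in (0.0, 70.0, 1401.0, 5000.0):
    print(x, round(mars_prediction(x), 6))
```

Each basis function is zero on one side of its knot and linear on the other, so the slope of Y changes only at X = 70 and X = 1,401.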
Baseline Methods:
Naive Bayes Classifier
Logistic Regression
Naive Bayes assumes the feature (predictor)
variables are independent conditional on
each category
Logistic Regression assumes the log-odds of
the target is linear in the feature (predictor)
variables
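A small sketch of the two baseline assumptions, using made-up probabilities and coefficients. For Naive Bayes, the class-conditional probabilities of the features are simply multiplied; for logistic regression, the log-odds of the target is a linear function of the features. All names and numbers here are illustrative assumptions.

```python
import math

def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior class probabilities under conditional independence.

    priors: {cls: P(cls)}
    likelihoods: {cls: [ {value: P(value | cls)} for each feature ]}
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for table, value in zip(likelihoods[cls], observed):
            p *= table[value]  # independence: multiply feature by feature
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}

def logistic_probability(intercept, coefs, features):
    """P(target = yes) when the log-odds is linear in the features."""
    score = intercept + sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical two-feature example: attorney involved, emergency treatment.
priors = {"fraud": 0.2, "valid": 0.8}
likelihoods = {
    "fraud": [{"yes": 0.9, "no": 0.1}, {"yes": 0.3, "no": 0.7}],
    "valid": [{"yes": 0.4, "no": 0.6}, {"yes": 0.6, "no": 0.4}],
}
posterior = naive_bayes_posterior(priors, likelihoods, ("yes", "no"))

p = logistic_probability(-1.0, [0.5, 2.0], [2.0, 0.25])
log_odds = math.log(p / (1.0 - p))
print(posterior, p, log_odds)
```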
REAL CLAIM FRAUD
DETECTION PROBLEM
Classify all claims
Identify valid classes
Pay the claim
No hassle
Visa Example
Identify (possible) fraud
Investigation needed
Identify “gray” classes
Minimize with “learning” algorithms
The Fraud Surrogates used as Target
Decision Variables
Independent Medical Exam (IME)
requested
Special Investigation Unit (SIU) referral
IME successful
SIU successful
DATA: Detailed Auto Injury Closed Claim
Database for Massachusetts
Accident Years (1995-1997)
DM Databases
Scoring Functions
Graded Output
Non-Suspicious Claims: Routine Claims
Suspicious Claims: Complicated Claims
ROC Curve
Area Under the ROC Curve
Want good performance both on sensitivity and
specificity
Sensitivity and specificity depend on cut points
chosen for binary target (yes/no)
Choose a series of different cut points, and
compute sensitivity and specificity for each of
them
Graph results
Plot sensitivity vs 1 - specificity
Compute an overall measure of “lift”, or area
under the curve
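The steps above can be sketched as follows: sweep the cut point over the distinct scores, record sensitivity and 1 - specificity at each, and integrate with the trapezoid rule. The scores and labels are made-up, and the helper names are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs; a claim is 'yes' if score > cut."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s > cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > cut and y == 0)
        points.append((fp / negatives, tp / positives))
    points.append((1.0, 1.0))  # a cut point below every score flags all claims
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the plotted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical model scores and true outcomes (1 = yes, 0 = no).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auroc = area_under_curve(points)
print(auroc)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.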
True/False Positives and True/False
Negatives: The “Confusion” Matrix
Choose a “cut point” in the model score.
Claims > cut point, classify “yes”.
Sample Confusion Matrix: Sensitivity and Specificity
                 Prediction
True Class       No       Yes      Row Total
No               800      200      1,000
Yes              200      400      600
Column Total     1,000    600

                    Correct   Total   Percent Correct
Specificity (No)    800       1,000   80%
Sensitivity (Yes)   400       600     67%
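Using the counts in the sample confusion matrix (800 true negatives, 200 false positives, 200 false negatives, 400 true positives), the two rates can be computed as in this sketch. The helper names and the four-claim example are assumptions for illustration.

```python
def confusion_counts(scores, actual, cut_point):
    """Count (tn, fp, fn, tp) when claims with score > cut_point are 'yes'."""
    tn = fp = fn = tp = 0
    for score, is_yes in zip(scores, actual):
        predicted_yes = score > cut_point
        if is_yes and predicted_yes:
            tp += 1
        elif is_yes:
            fn += 1
        elif predicted_yes:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def sensitivity_specificity(tn, fp, fn, tp):
    """Sensitivity = share of true 'yes' caught; specificity = share of true 'no' cleared."""
    return tp / (tp + fn), tn / (tn + fp)

sensitivity, specificity = sensitivity_specificity(tn=800, fp=200, fn=200, tp=400)
print(round(sensitivity, 2), round(specificity, 2))

# Hypothetical scored claims, classified at a cut point of 0.5.
counts = confusion_counts([0.9, 0.2, 0.6, 0.4], [True, False, False, True], 0.5)
```

Moving the cut point trades one rate against the other, which is why the ROC curve sweeps over many cut points.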
TREENET ROC Curve – IME
AUROC = 0.701
Logistic ROC Curve – IME
AUROC = 0.643
Ranking of Methods/Software – IME Requested
Method/Software          AUROC    Lower Bound   Upper Bound
Random Forest            0.7030   0.6954        0.7107
Treenet                  0.7010   0.6935        0.7085
MARS                     0.6974   0.6897        0.7051
SPLUS Neural             0.6961   0.6885        0.7038
S-PLUS Tree              0.6881   0.6802        0.6961
Logistic                 0.6771   0.6695        0.6848
Naïve Bayes              0.6763   0.6685        0.6841
SPSS Exhaustive CHAID    0.6730   0.6660        0.6820
CART Tree                0.6694   0.6613        0.6775
Iminer Neural            0.6681   0.6604        0.6759
Iminer Ensemble          0.6491   0.6408        0.6573
Iminer Tree              0.6286   0.6199        0.6372
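One way to read the bounds in this ranking is to check whether two methods' intervals overlap. The sketch below does that for a few rows copied from the table; treating the bounds as an interval around each AUROC is an assumption, since the slide does not state the confidence level, and `intervals_overlap` is a hypothetical helper.

```python
# (lower bound, upper bound) for selected methods, copied from the ranking.
bounds = {
    "Random Forest": (0.6954, 0.7107),
    "Treenet": (0.6935, 0.7085),
    "Logistic": (0.6695, 0.6848),
    "Iminer Tree": (0.6199, 0.6372),
}

def intervals_overlap(a, b):
    """True when the two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

rf_vs_treenet = intervals_overlap(bounds["Random Forest"], bounds["Treenet"])
rf_vs_logistic = intervals_overlap(bounds["Random Forest"], bounds["Logistic"])
print(rf_vs_treenet, rf_vs_logistic)
```

Overlapping intervals suggest the ordering between those two methods should not be over-interpreted.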
Variable Importance (IME) Based on Average of Methods
Important Variable Summarizations for IME
Tree Models, Other Models and Total
Variable                       Variable Type   Total Score   Total Score Rank
Health Insurance               F               16529         1
Provider 2 Bill                F               12514         2
Injury Type                    F               10311         3
Territory                      F               5180          4
Provider 2 Type                F               4911          5
Provider 1 Bill                F               4711          6
Attorneys Per Zip              DV              2731          7
Report Lag                     DV              2650          8
Treatment Lag                  DV              2638          9
Claimant per City              DV              2383          10
Provider 1 Type                F               1794          11
Providers per City             DV              1708          12
Attorney                       F               1642          13
Distance MP1 Zip to Clt Zip    DV              1134          14
AGE                            F               1048          15
Avg. Household Price/Zip       DM              907           16
Emergency Treatment            F               660           17
Income Household/Zip           DM              329           18
Providers/Zip                  DV              288           19
Household/Zip                  DM              242           20
Policy Type                    F               4             21
Tree Score Rank and Other Score Rank
2
1
3
4
6
5
7
10
13
12
9
11
8
1
3
2
7
4
5
14
8
6
9
13
11
16
18
17
10
12
16
14
15
20
19
21
15
18
20
17
19
21
Claim Fraud Detection Plan
STEP 1: SAMPLE: Systematic benchmark of a
random sample of claims.
STEP 2: FEATURES: Isolate red flags and other
sorting characteristics.
STEP 3: FEATURE SELECTION: Separate
features into objective and subjective, early,
middle and late arriving, acquisition cost levels,
and other practical considerations.
STEP 4: CLUSTER: Apply unsupervised
algorithms (Kohonen, PRIDIT, Fuzzy) to cluster
claims, examine for needed homogeneity.
STEP 5: ASSESSMENT: Externally classify claims
according to objectives for sorting.
STEP 6: MODEL: Supervised models relating selected
features to objectives (logistic regression, Naïve Bayes,
Neural Networks, CART, MARS).
STEP 7: STATIC TESTING: Model output versus expert
assessment, model output versus cluster homogeneity
(PRIDIT scores) on one or more samples.
STEP 8: DYNAMIC TESTING: Real time operation of
acceptable model, record outcomes, repeat steps 1-7 as
needed to fine tune model and parameters. Use PRIDIT
to show gain or loss of feature power and changing data
patterns, tune investigative proportions to optimize
detection and deterrence of fraud and abuse.