Profiting from Data Mining
Bob Stine
Department of Statistics, The Wharton School, University of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob

Overview
 Critical stages of the data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
 Automated modeling
- Feature creation and selection
- Exploiting expert knowledge, “insights”
 Applications
- Little detail – Biomedical: finding predictive risk factors
- More detail – Financial: predicting returns on the market
- Lots of detail – Credit: anticipating the onset of bankruptcy

Predicting Health Risk
 Who is at risk for a disease?
- Example: detect osteoporosis without the expense of an x-ray
 Goals
- Improving public health
- Saving on medical care
- Confirming an informal model with data mining
 Many types of features, interested groups
- Clinical observations of doctors
- Laboratory measurements, “genetic”
- Self-reported behavior
 Missing data

Predicting the Stock Market
 Small, “hands-on” example
 Goals
- Better retirement savings?
- Money for that special vacation? College?
- Trade-offs: risk vs return
 Lots of “free” data
- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
 “Simple” modeling technique
 Validation

Predicting the Market: Specifics
 Build a regression model
- Response is the return on the value-weighted S&P
- Use standard forward/backward stepwise selection (sketched below)
- Battery of 12 predictors with interactions
 Train the model during 1992-1996 (training data)
- Model captures most of the variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
 Predict returns in 1997 (validation data)
 Another version appears in Foster, Stine & Waterman

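The talk shows no code, but a minimal sketch of the procedure just described, forward stepwise selection stopped by a Bonferroni cutoff, might look like this in Python; the synthetic data here are stand-ins for the monthly returns and the predictor battery.

    # Forward stepwise regression with a Bonferroni stopping rule (sketch only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 60, 78                  # 5 years of monthly returns; 12 predictors plus interactions
    X = rng.normal(size=(n, p))    # hypothetical predictor matrix
    y = rng.normal(size=n)         # stand-in for value-weighted S&P returns

    def abs_t(cols, j):
        """|t-ratio| of column j when added to the already-selected columns."""
        Z = np.column_stack([np.ones(n), X[:, cols], X[:, j]])
        beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        s2 = resid @ resid / (n - Z.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[-1, -1])
        return abs(beta[-1] / se)

    selected = []
    while True:
        rest = [j for j in range(p) if j not in selected]
        ts = [abs_t(selected, j) for j in rest]
        best = int(np.argmax(ts))
        cutoff = stats.t.ppf(1 - 0.025 / p, n - len(selected) - 2)  # Bonferroni, alpha = 0.05
        if ts[best] < cutoff:      # nothing clears the cutoff: stop
            break
        selected.append(rest[best])
    print("retained predictors:", selected)

On pure-noise data this sketch usually retains nothing, which is exactly the protection the Bonferroni rule is meant to provide.
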
Historical patterns?
[Figure: monthly value-weighted return (vwReturn, -0.06 to 0.08) by year, 1992-1998, with a “?” over the period to be predicted.]

Fitted model predicts...
[Figure: fitted values and predictions by year, 1992-1998, flagging an exceptional predicted February return.]

What happened?
[Figure: prediction error by year, 1992-1998, with the 1992-1996 training period marked.]

Claimed versus Actual Error
[Figure: squared prediction error versus model complexity (10 to 100 terms); the claimed error stays small as terms are added, while the actual error grows far larger.]

Over-confidence?
 Over-fitting
- Model fits the training data too well – better than it can predict the future.
- Greedy fitting procedure: “Optimization capitalizes on chance”
 Some intuition
- Coincidences
• Cancer clusters, the “birthday problem”
- Illustration with an auction
• What is the value of the coins in this jar?

Auctions and Over-fitting
What is the value of these coins?
[Figure: a jar of coins.]

Auctions and Over-fitting
 Auction a jar of coins to a class of MBA students
 Histogram shows the bids of 30 students
 Most were suspicious, but a few were not!
 Actual value is $3.85
 Known as the “Winner’s Curse”
 Similar to over-fitting: the best model is like the high bidder
[Figure: histogram of the 30 bids.]

Profiting from data mining?
 Where’s the profit in this?
- “Mining the miners” vs getting value from your data
- Lost opportunities
 Importance of domain knowledge
 Validation as a measure of success
- Prediction provides an explicit check
- Does your application predict something?

Pitfalls and Role of Management
Over-fitting is dominated by other issues…
 Management support
- Life in silos
- Coordination across domains
 Responsibility and reward
- Accountability
- Who gets the credit when it succeeds? Who suffers if the project is not successful?

Specific Potholes
 Moving targets
- “Let’s try this with something else.”
 Irrational expectations
- “I could have done better than that.”
 Not with my data
- “It’s our data. You can’t use it.”
- “You did not use our data properly.”

Back to a real application…
Emphasis on the statistical issues…

Predicting Bankruptcy
 Goal
- Reduce losses stemming from personal bankruptcy
 Possible strategies
- If we can identify those with the highest risk of bankruptcy, take some action:
• Call them for a “friendly chat” about their circumstances
• Unilaterally reduce the credit limit
 Trade-off
- Good customers borrow lots of money
- Bad customers also borrow lots of money

Predicting Bankruptcy
 “Needle in a haystack”
- 3,000,000 months of credit-card activity
- 2,244 bankruptcies
- A simple predictor that calls everyone OK looks pretty good (see the arithmetic below).
 What factors anticipate bankruptcy?
- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash advance + Las Vegas = Problem
 We consider more than 100,000 predictors!

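The base-rate arithmetic behind that remark, using the counts above:

    # Why "no one goes bankrupt" looks deceptively accurate.
    months, bankruptcies = 3_000_000, 2_244
    rate = bankruptcies / months
    print(f"base rate: {rate:.5f}")             # about 0.00075, i.e. roughly 0.075%
    print(f"trivial accuracy: {1 - rate:.4f}")  # ~0.9993: 99.9% right, 0 bankruptcies caught
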
Modeling: Predictive Models
 Build the model
Identify patterns in the training data that predict future observations.
- Which features are real? Which are coincidental?
 Evaluate the model
How do you know that it works?
- During the model-construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations

Are all prediction errors the same?
 Symmetry
- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
 Does a false positive = a false negative?
- Classification in data mining
- Credit modeling, flagging “risky” customers
- False positive: call a good customer “bad”
- False negative: fail to identify a “bad” customer
- Differential costs for the two types of errors (see the sketch below)

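One standard way to turn differential costs into a rule, sketched here with illustrative numbers that are not from the talk: flag a customer whenever the expected cost of a miss exceeds the expected cost of a false alarm.

    # Decision-theoretic cutoff from asymmetric costs (illustrative figures only).
    cost_fp = 50.0      # cost of bothering a good customer
    cost_fn = 2_500.0   # cost of missing a bankruptcy
    # Flag when p * cost_fn > (1 - p) * cost_fp, i.e. when p > cost_fp / (cost_fp + cost_fn).
    cutoff = cost_fp / (cost_fp + cost_fn)
    print(f"flag when Pr(bankrupt) > {cutoff:.4f}")   # ~0.0196 with these costs
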
Building a Predictive Model
So many choices…
 Structure: What type of model?
• Neural net
• CART, classification tree
• Additive model or regression spline
 Identification: Which features to use?
• Time lags, “natural” transformations
• Combinations of other features
 Search: How does one find these features?
• Brute force has become cheap.

Our Choices
 Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way and 4-way interactions
- Missing data handled with indicators (see the sketch below)
 Identification
- Conservative standard error
- Compare a conservative t-ratio to an adaptive threshold
 Search
- Forward stepwise regression
- Coming: a dynamically changing list of features
• A good choice affects where you search next.

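A minimal sketch of that structure, with made-up column names (the actual feature set is far larger):

    # Missing-data indicators plus all 2-way interactions (sketch with toy columns).
    from itertools import combinations
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"spend":    [120.0, np.nan, 300.0],
                       "payment":  [100.0,  50.0, np.nan],
                       "cash_adv": [  0.0,  40.0,  10.0]})

    base = list(df.columns)
    for col in base:                       # indicator first, then fill so the row stays usable
        df[col + "_miss"] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].mean())

    for a, b in combinations(base, 2):     # all 2-way interactions of the base features
        df[a + "*" + b] = df[a] * df[b]
    print(df.columns.tolist())
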
Identifying Predictive Features
 Classical problem of “variable selection”
 Thresholding methods (compare the t-ratio to a threshold; rough cutoffs below)
- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
 Arguments for adaptive thresholds
- Empirical Bayes
- Information theory
- Step-up/step-down tests

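The three families differ mainly in how large a |t| they demand before a feature enters; roughly, using standard approximations and sample sizes of the scale seen later in the talk:

    # Approximate |t| cutoffs implied by each selection criterion.
    import math
    n, p = 600_000, 100_000
    print("AIC       :", round(math.sqrt(2), 2))                # ~1.41, very permissive
    print("BIC       :", round(math.sqrt(math.log(n)), 2))      # ~3.65
    print("Bonferroni:", round(math.sqrt(2 * math.log(p)), 2))  # ~4.80, most conservative
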
Adaptive Thresholding
 Threshold changes to conform to attributes of the data
- Easier to add features as more are found.
 Threshold for the first predictor
- Compare a conservative t-ratio to the Bonferroni cutoff.
- Bonferroni is about sqrt(2 log p).
- If something significant is found, continue.
 Threshold for the second predictor
- Compare the t-ratio to a reduced threshold.
- The new threshold is about sqrt(2 log(p/2)).

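Numerically, the cutoff sqrt(2 log(p/q)) for the q-th accepted predictor falls as the search succeeds, which is what makes later features easier to admit:

    # Adaptive threshold for the q-th accepted predictor, with p candidate features.
    import math
    p = 100_000
    for q in (1, 2, 10, 100, 1_000):
        print(q, round(math.sqrt(2 * math.log(p / q)), 2))
    # prints: 1 -> 4.8, 2 -> 4.65, 10 -> 4.29, 100 -> 3.72, 1000 -> 3.03
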
Adaptive Thresholding: Benefits
 Easy
As easy and fast as implementing the standard criterion used in stepwise regression.
 Theory
The resulting model is provably as good as the best Bayes model for the problem at hand.
 Real world
It works! Finds models with real signal, and stops when the signal runs out.

Bankruptcy Model: Construction
 Data: reserve 80% for validation (split sketched below)
- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1,786 bankruptcies
 Selection via adaptive thresholding
- Compare the sequence of t-statistics to sqrt(2 log(p/q))
- Dynamic expansion of the feature space

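A sketch of that split; note the unusual twist that most of the data is held out for validation rather than for training:

    # Reserve ~80% of the 3,000,000 months for validation; train on the other ~20%.
    import numpy as np
    rng = np.random.default_rng(0)
    n = 3_000_000
    train = rng.random(n) < 0.20    # True for roughly 600,000 training months
    # fit on rows where train is True; validate on the remaining ~2,400,000 months
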
Bankruptcy Model: Preview
 Predictors
- Initial search identifies 39
• Validation SS falls monotonically to 1650
• A linear fit can do no better than 1735
- An expanded search of higher-order interactions finds a bit more
• Reveals the nature of the predictors comprising the interactions
• Validation SS drops 10 more
 Validation: lift chart
- The top 1000 candidates include 351 who go bankrupt
 More validation: calibration
- Close to the actual Pr(bankrupt) for most groups.

Bankruptcy Model: Fitting
 Where should the fitting process be stopped?
[Figure: residual sum of squares (400-470) versus number of predictors (0-150); the residual SS keeps falling as predictors are added.]

Bankruptcy Model: Fitting
 Our adaptive selection procedure stops at a model with 39 predictors.
[Figure: the same residual sum of squares curve, with the stop at 39 predictors.]

Bankruptcy Model: Validation
 The validation indicates that the fit keeps getting better as the model expands; the procedure avoids over-fitting.
[Figure: validation sum of squares (1640-1760) versus number of predictors (0-150).]

Bankruptcy Model: Linear?
 Choosing from linear predictors only (no interactions) does not match the performance of the full search.
[Figure: validation sum of squares versus number of predictors for the linear and quadratic searches.]

Bankruptcy Model: More?
 Searching higher-order interactions offers a modest improvement.
[Figure: validation sum of squares (1640-1680) versus number of predictors (0-60) for the quadratic and cubic searches.]

Lift Chart
 Measures how well the model classifies the sought-for group (computed in the sketch below):
Lift = (% bankrupt in DM selection) / (% bankrupt in all data)
 Depends on the rule used to label customers
- Very high threshold: lots of lift, but few bankrupt customers are found.
- Lower threshold: lift drops, but more bankrupt customers are found.

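A hedged sketch of the computation, assuming the inputs are model scores and a 0/1 bankruptcy flag:

    # Lift of the top_k highest-scored customers.
    import numpy as np

    def lift(scores, y, top_k):
        """(% bankrupt among the top_k flagged) / (% bankrupt overall)."""
        flagged = y[np.argsort(-scores)[:top_k]]
        return flagged.mean() / y.mean()

With the validation counts quoted earlier (351 bankruptcies among the top 1000 flagged, against a base rate of 1786 in 2,400,000 months), the lift is about (351/1000) / (1786/2,400,000), roughly 470.
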
Generic Lift Chart
[Figure: % responders (0.0-1.0) versus % chosen (0-100); the model’s curve lies well above the “random” diagonal.]

Bankruptcy Model: Lift
 Much better than the diagonal!
[Figure: % of bankruptcies found versus % of customers contacted, 0-100 on both axes.]

Calibration
 The classifier assigns a Prob(“BR”) rating to each customer.
 Like a weather forecast
 Among those classified as having a 2/10 chance of “BR”, how many are BR?
 Closer to the diagonal is better.
[Figure: actual percentage (0-100) versus claimed probability (10-90).]

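A minimal sketch of the check, assuming the inputs are claimed probabilities and a 0/1 outcome flag (not the talk’s code): bin customers by claimed Pr(“BR”) and compare each bin’s observed rate.

    # Calibration: within each claimed-probability bin, what fraction is actually "BR"?
    import numpy as np

    def calibration_table(claimed, y, bins=10):
        edges = np.linspace(0.0, 1.0, bins + 1)
        idx = np.digitize(claimed, edges[1:-1])   # bin index 0 .. bins-1
        return [(edges[i], edges[i + 1], y[idx == i].mean())
                for i in range(bins) if np.any(idx == i)]

A well-calibrated model puts each bin’s observed rate close to that bin’s claimed probability, i.e. close to the diagonal on the chart above.
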
Bankruptcy Model: Calibration
 The model over-predicts risk above a claimed probability of 0.4.
[Figure: calibration chart; actual rate (0-1.2) versus claimed probability (0-0.8).]

Summary of Bankruptcy Model
 Automatic, adaptive selection
- Finds patterns that predict new observations
- Predictive, but not easy to explain
 Dynamic feature set
- Current research
- Information theory allows a changing search space
- Finds more structure than a direct search could find
 Validation
- Essential for judging fit.
- Better than “hand-made” models that take years to create.

So, where’s the profit in DM?
 Automated modeling has become very powerful, avoiding the problems of over-fitting.
 A role for expert judgment remains
- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
 Collaboration
- Data sources
- Data analysis
- Strategic decisions