Profiting from Data Mining
Bob Stine
Department of Statistics
The Wharton School, Univ of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob
Overview
Critical stages of the data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
Automated modeling
- Feature creation and selection
- Exploiting expert knowledge, "insights"
Applications
- Little detail – Biomedical: finding predictive risk factors
- More detail – Financial: predicting returns on the market
- Lots of detail – Credit: anticipating the onset of bankruptcy
Predicting Health Risk
Who is at risk for a disease?
- Example: detect osteoporosis without the expense of an x-ray
Goals
- Improving public health
- Savings on medical care
- Confirm an informal model with data mining
Many types of features, interested groups
- Clinical observations of doctors
- Laboratory measurements, "genetic"
- Self-reported behavior
Missing data
Predicting the Stock Market
Small, "hands-on" example
Goals
- Better retirement savings?
- Money for that special vacation?
- College?
- Trade-offs: risk vs return
Lots of "free" data
- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
"Simple" modeling technique
Validation
Predicting the Market: Specifics
Build a regression model
- Response is the return on the value-weighted S&P
- Use standard forward/backward stepwise (sketched in code below)
- Battery of 12 predictors with interactions
Train the model during 1992-1996 (training data)
- Model captures most of the variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
Predict returns in 1997 (validation data)
Another version appears in Foster, Stine & Waterman
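A minimal sketch of this kind of search, with simulated noise standing in for the actual returns and the 12 predictors (which the slides do not list): forward stepwise regression over the predictors and their pairwise interactions, admitting a feature only when its t-ratio clears the Bonferroni cutoff. On pure noise the hard threshold usually stops the search immediately, which is exactly its purpose.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
n, k = 60, 12                        # 60 monthly returns, 12 base predictors (both hypothetical)
X = rng.normal(size=(n, k))
y = rng.normal(size=n)               # pure noise stands in for the market return

# Candidate features: the 12 predictors plus all pairwise interactions
cands = [X[:, i] for i in range(k)]
cands += [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]
p = len(cands)
bonf_t = stats.norm.ppf(1 - 0.025 / p)   # two-sided Bonferroni cutoff

cols, resid = [np.ones(n)], y.copy()
while cands:
    Q, _ = np.linalg.qr(np.column_stack(cols))
    best_t, best_j = 0.0, None
    for j, x in enumerate(cands):
        xr = x - Q @ (Q.T @ x)           # part of x orthogonal to the current model
        if xr @ xr < 1e-10:
            continue
        b = (xr @ resid) / (xr @ xr)     # slope for the candidate
        e = resid - b * xr
        se = np.sqrt((e @ e) / (n - len(cols) - 1) / (xr @ xr))
        if abs(b / se) > best_t:
            best_t, best_j = abs(b / se), j
    if best_j is None or best_t < bonf_t:
        break                            # nothing clears the Bonferroni bar
    cols.append(cands.pop(best_j))
    Q, _ = np.linalg.qr(np.column_stack(cols))
    resid = y - Q @ (Q.T @ y)            # refit: residual of y on the chosen features

print(f"features retained beyond the intercept: {len(cols) - 1}")
```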
Historical patterns?
[Chart: value-weighted return (vwReturn), roughly -0.06 to 0.08, plotted by year from 1992 through 1998; the period beyond the data is marked "?"]
Fitted model predicts...
[Chart: fitted and predicted returns by year, 1992 through 1998; an exceptional predicted February return stands out in 1997]
What happened?
[Chart: prediction error by year; errors are small within the 1992-1996 training period and large in 1997]
Claimed versus Actual Error
[Chart: claimed and actual squared prediction error as model complexity grows from 10 to 100 terms; the claimed error stays low while the actual error is far larger]
Over-confidence?
Over-fitting
- Model fits the training data too well – better than it can predict the future.
- Greedy fitting procedure: "Optimization capitalizes on chance" (simulated below)
Some intuition
- Coincidences
• Cancer clusters, the "birthday problem"
- Illustration with an auction
• What is the value of the coins in this jar?
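A quick simulation of "optimization capitalizes on chance" (a sketch, not from the talk; all sizes are made up): scan many pure-noise predictors, keep the best-looking one, and its t-ratio routinely appears significant even though nothing is there.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 60, 100, 200            # cases, noise predictors, repetitions (all hypothetical)
best_t = []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    Xs = (X - X.mean(0)) / X.std(0)  # standardize the columns
    ys = (y - y.mean()) / y.std()
    r = Xs.T @ ys / n                # correlation of y with each predictor
    t = r * np.sqrt((n - 2) / (1 - r ** 2))  # equivalent simple-regression t-ratios
    best_t.append(np.abs(t).max())   # "optimize": keep the best-looking predictor

print(f"median |t| of the best pure-noise predictor: {np.median(best_t):.2f}")
# Typically near 3 -- comfortably "significant" by the usual |t| > 2 rule,
# even though every predictor is noise.
```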
Auctions and Over-fitting
What is the value of these coins?
[Photo: a jar of coins]
Auctions and Over-fitting
- Auction a jar of coins to a class of MBA students
- Histogram shows the bids of 30 students
- Most were suspicious, but a few were not!
- Actual value is $3.85
- Known as the "Winner's Curse"
- Similar to over-fitting: the best model is like the high bidder
[Histogram: the bids of the 30 students]
Profiting from data mining?
Where's the profit in this?
- "Mining the miners" vs getting value from your data
- Lost opportunities
Importance of domain knowledge
Validation as a measure of success
- Prediction provides an explicit check
- Does your application predict something?
Pitfalls and Role of Management
Over-fitting is dominated by other issues…
Management support
- Life in silos
- Coordination across domains
Responsibility and reward
- Accountability
- Who gets the credit when it succeeds?
- Who suffers if the project is not successful?
Specific Potholes
Moving targets
- "Let's try this with something else."
Irrational expectations
- "I could have done better than that."
Not with my data
- "It's our data. You can't use it."
- "You did not use our data properly."
Back to a real application…
Emphasis on the statistical issues…
Predicting Bankruptcy
Goal
- Reduce losses stemming from personal bankruptcy
Possible strategies
- If we can identify those with the highest risk of bankruptcy, take some action
• Call them for a "friendly chat" about circumstances
• Unilaterally reduce the credit limit
Trade-off
- Good customers borrow lots of money
- Bad customers also borrow lots of money
Predicting Bankruptcy
"Needle in a haystack"
- 3,000,000 months of credit-card activity
- 2244 bankruptcies
- The simple predictor that everyone is OK is right 99.9% of the time, so it looks pretty good.
What factors anticipate bankruptcy?
- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash Advance + Las Vegas = Problem
We consider more than 100,000 predictors!
Modeling: Predictive Models
Build the model
Identify patterns in the training data that predict future observations.
- Which features are real? Coincidental?
Evaluate the model
How do you know that it works?
- During the model construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations
Are all prediction errors the same?
Symmetry
- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
Does a false positive = a false negative?
- Classification in data mining
- Credit modeling, flagging "risky" customers
- False positive: call a good customer "bad"
- False negative: fail to identify a "bad"
- Differential costs for different types of errors (see the sketch below)
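One way to act on differential costs (a sketch with made-up cost figures, not from the talk): choose the probability cutoff that minimizes expected cost rather than defaulting to 0.5.

```python
# With unequal costs, classify a customer as "bad" when
#   Pr(bad) * c_fn > (1 - Pr(bad)) * c_fp,
# i.e. when Pr(bad) exceeds c_fp / (c_fp + c_fn).
c_fp = 1.0    # hypothetical cost of calling a good customer "bad"
c_fn = 20.0   # hypothetical cost of missing a customer who goes bankrupt
cutoff = c_fp / (c_fp + c_fn)
print(f"flag customers with Pr(bankrupt) > {cutoff:.3f}")  # 0.048, far below 0.5
```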
Building a Predictive Model
So many choices…
Structure: What type of model?
• Neural net
• CART, classification tree
• Additive model or regression spline
Identification: Which features to use?
• Time lags, "natural" transformations
• Combinations of other features
Search: How does one find these features?
• Brute force has become cheap.
Our Choices
Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way, 4-way interactions
- Missing data handled with indicators (sketched in code below)
Identification
- Conservative standard error
- Comparison of the conservative t-ratio to an adaptive threshold
Search
- Forward stepwise regression
- Coming: dynamically changing list of features
• Good choice affects where you search next.
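A sketch of the feature construction described above, with hypothetical column names: pairwise interactions plus an indicator column for each variable with missing values (the missing entries themselves filled with the column mean).

```python
import numpy as np
import pandas as pd
from itertools import combinations

# Hypothetical raw features; NaN marks a missing value.
df = pd.DataFrame({
    "balance":      [1200.0, np.nan, 300.0, 4500.0],
    "cash_advance": [0.0, 250.0, np.nan, 800.0],
    "payments":     [100.0, 40.0, 25.0, np.nan],
})

feats = {}
for col in df.columns:
    miss = df[col].isna()
    feats[col] = df[col].fillna(df[col].mean())       # impute with the column mean
    if miss.any():
        feats[f"{col}_missing"] = miss.astype(float)  # indicator for missingness

base = pd.DataFrame(feats)
for a, b in combinations(df.columns, 2):              # all 2-way interactions
    base[f"{a}*{b}"] = base[a] * base[b]

print(base.shape)   # a few raw columns fan out into many candidate predictors
```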
Identifying Predictive Features
Classical problem of "variable selection"
Thresholding methods (compare t-ratio to a threshold)
- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
Arguments for adaptive thresholds
- Empirical Bayes
- Information theory
- Step-up/step-down tests
Adaptive Thresholding
Threshold changes to conform to attributes of the data
- Easier to add features as more are found.
Threshold for the first predictor
- Compare the conservative t-ratio to Bonferroni.
- The Bonferroni cutoff is about √(2 log p).
- If something significant is found, continue.
Threshold for the second predictor
- Compare the t-ratio to a reduced threshold.
- The new threshold is about √(2 log(p/2)) (schedule sketched below).
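A sketch of the threshold schedule as I read it from these slides (not code from the talk): the q-th feature admitted is tested against roughly √(2 log(p/q)).

```python
import math

def adaptive_threshold(p: int, q: int) -> float:
    """Approximate t-ratio cutoff for admitting the q-th feature
    when p candidate features are in play."""
    return math.sqrt(2 * math.log(p / q))

p = 100_000
for q in (1, 2, 10, 100, 1000):
    print(f"feature {q:>5}: |t| must exceed {adaptive_threshold(p, q):.2f}")
# The bar starts near the Bonferroni level (about 4.8 for p = 100,000)
# and eases as real signal is found.
```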
Adaptive Thresholding: Benefits
Easy
As easy and fast as implementing the standard criterion used in stepwise regression.
Theory
The resulting model is provably as good as the best Bayes model for the problem at hand.
Real world
It works! Finds models with real signal, and stops when the signal runs out.
Bankruptcy Model: Construction
Data: reserve 80% for validation
- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1786 bankruptcies
Selection via adaptive thresholding
- Compare the sequence of t-statistics to √(2 log(p/q))
- Dynamic expansion of the feature space
Bankruptcy Model: Preview
Predictors
- Initial search identifies 39
• Validation SS monotonically falls to 1650
• A linear fit can do no better than 1735
- Expanded search of higher interactions finds a bit more
• Nature of the predictors comprising the interactions
• Validation SS drops 10 more
Validation: Lift chart
- The top 1000 candidates include 351 who go bankrupt
More validation: Calibration
- Close to the actual Pr(bankrupt) for most groups.
Bankruptcy Model: Fitting
Where should the fitting process be stopped?
[Chart: residual sum of squares (about 470 down to 400) vs number of predictors (0 to 150); the residual SS falls steadily as predictors are added]
Bankruptcy Model: Fitting
Our adaptive selection procedure stops at a model with 39 predictors.
[Chart: the same residual sum of squares curve, with the stopping point at 39 predictors marked]
Bankruptcy Model: Validation
The validation indicates that the fit gets better as the model expands; the procedure avoids over-fitting.
[Chart: validation sum of squares (1760 down to about 1640) vs number of predictors (0 to 150)]
Bankruptcy Model: Linear?
Choosing from linear predictors only (no interactions) does not match the performance of the full search.
[Chart: validation SS vs number of predictors for the linear and quadratic searches; the quadratic search reaches a lower validation SS]
Bankruptcy Model: More?
Searching higher-order interactions offers modest improvement.
[Chart: validation SS vs number of predictors (0 to 60) for the quadratic and cubic searches]
Lift Chart
Measures how well the model classifies the sought-for group

Lift = (% bankrupt in DM selection) / (% bankrupt in all data), as sketched in the code below

Depends on the rule used to label customers
- Very high threshold: lots of lift, but few bankrupt customers are found.
- Lower threshold: lift drops, but finds more bankrupt customers.
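A sketch of the lift computation under this definition, using simulated scores and a made-up bankruptcy rate (none of these numbers come from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
bankrupt = rng.random(n) < 0.001                 # rare outcome, as in the haystack
# Hypothetical model score: higher on average for the actual bankruptcies
score = rng.normal(size=n) + 2.0 * bankrupt

def lift(score, outcome, top_frac):
    """Lift = rate of the outcome among the top-scored fraction
    divided by the overall rate."""
    k = int(top_frac * len(score))
    top = np.argsort(score)[::-1][:k]            # indices of the highest scores
    return outcome[top].mean() / outcome.mean()

for frac in (0.001, 0.01, 0.10):
    print(f"top {frac:>5.1%}: lift = {lift(score, bankrupt, frac):6.1f}")
# A high threshold gives big lift on few customers; lowering it finds
# more of the bankruptcies but with less lift, as the slide describes.
```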
Generic Lift Chart
[Chart: % of responders found vs % of customers chosen; the model's curve rises steeply above the straight "random" line]
Bankruptcy Model: Lift
Much better than the diagonal!
[Chart: % of bankruptcies found vs % of customers contacted, with the model's curve far above the diagonal]
Calibration
The classifier assigns a Prob("BR") rating to each customer, much like a weather forecast.
Among those classified as having a 2/10 chance of "BR", how many are BR?
Closer to the diagonal is better (see the sketch below).
[Chart: actual % bankrupt vs claimed probability]
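A sketch of this calibration check with simulated probabilities (the variable names and the data are hypothetical): bin customers by claimed probability and compare each bin's actual rate to its claim.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
claimed = rng.beta(1, 9, size=n)            # hypothetical Prob("BR") ratings
outcome = rng.random(n) < claimed           # outcomes drawn to match the claims

bins = np.linspace(0, 1, 11)                # deciles of claimed probability
idx = np.digitize(claimed, bins) - 1
for b in range(10):
    in_bin = idx == b
    if in_bin.sum() == 0:
        continue
    print(f"claimed {bins[b]:.1f}-{bins[b + 1]:.1f}: "
          f"actual rate {outcome[in_bin].mean():.2f} on {in_bin.sum()} customers")
# A calibrated model puts each bin's actual rate near the middle of the bin;
# a miscalibrated one drifts off the diagonal, as in the next slide.
```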
Bankruptcy Model: Calibration
The model over-predicts risk above a claimed probability of 0.4.
[Calibration chart: actual rate vs claimed probability (0 to 1.2); the points track the diagonal below 0.4 and fall below the claim above it]
Summary of Bankruptcy Model
Automatic, adaptive selection
- Finds patterns that predict new observations
- Predictive, but not easy to explain
Dynamic feature set
- Current research
- Information theory allows a changing search space
- Finds more structure than a direct search could find
Validation
- Essential only for judging fit.
- Better than "hand-made" models that take years to create.
So, where’s the profit in DM?
Wharton
Department of Statistics
Automated
modeling has become very powerful,
avoiding problems of over-fitting.
Role
for expert judgment remains
- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
Collaboration
- Data sources
- Data analysis
- Strategic decisions
39