
IBM Research
INFORMS Data Mining Contest 2009
Claudia Perlich
Grzegorz Świrszcz, Saharon Rosset
IBM Research, BAMS
© 2009 IBM Corporation
Who we are: Predictive Modeling Group
 Publish papers in leading KDD and ML conferences
• KDD Cup
• INFORMS
• KDD DM Practice Prize
• KDD (Research + Industrial Tracks)
• SIAM Data Mining
• ICDM
• > 45 papers in the above conferences/journals
 Core Machine Learning Skills
• Quantile Regression
• Statistics
• Relational Learning
• …
 Deliver innovative solutions to address key IBM challenges (IBM Ad-Hoc Analytics)
• MAP (Salesforce Alignment)
• Patent Quality and Value
• BANTER (Social Media Analytics)
• …
• OnTARGET and MAP were separate Research Technical Accomplishments in 2005 and 2006
 ROI on Research
• Project impact analysis
• Design metrics for business cases
• …
 Grow external reputation (e.g. via leading data mining competitions)
• We won the KDD Cup in 2007, 2008, and 2009
• 2nd place in the 2007 KDD Practice Prize
• We won the INFORMS Data Mining Contest in 2008
• Finalist, INFORMS Edelman Award 2009
 Develop novel, industry-specific solutions for key IBM clients
• Freightliner (Optimal Product Features)
• Xerox (Fault Classification)
• Cardinal Health (Pricing)
• Pfizer (Stratification for Diabetic Drugs)
• …
 Exploratory Research
• Machine Learning approaches to climate modeling
• We were awarded an IBM Research Exploratory Research project in climate modeling
Overview of INFORMS Data Mining Contests
 History: the first INFORMS Data Mining Contest, on hospital data, was run by Nick Street in 2008
 The 2009 contest was released on 8/1/09
 Website for registration and information: http://www.informsdmcontest2009.org/
 An additional blog publicized answers to common questions
 ~250 registrations
 27 submissions
• 6 picked only one task
• 21 submitted solutions to both tasks
 The winner and runner-up will publish their approaches in the journal “Statistical Analysis and Data Mining”
Organizing the 2009 Contest
 Criteria for the choice of the domain
• Data that can be published and are not publicly available already
• “Real data” with limited pre-processing
• Interesting/relevant problem
• Ideally some optimization component (but we gave up on this)
 Collaboration with Health Care Intelligence
• Focus on health care quality questions
• Data: sequences of hospital discharge data
• Task 1: Model the transfer decision (is there a “Golden rule”?)
• Task 2: Predict mortality
• Special: Find leakage
Structure and Stats of the Dataset
[Diagram: example of a temporal sequence of length 3 for one patient, with stays at hospitals H1, H2, and H3, each ending in Home, Transfer, or Death]
 Hospital discharge information for the last 10 years for a large population
 Complete coverage of patient history
 Disposition: outcome of the hospital stay
 Unique characteristics:
• Temporal
• Relational (Patient, Hospital)
 Train/Test split
• 50% random assignment of patients (a sketch of such a patient-level split follows the table below)
Entity            Count
# Visits (rows)   ~1 Million
# Hospitals       240
# Patients        280,000
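Because the split is by patient rather than by visit, a patient's entire history lands on one side of the split. The following is a minimal sketch of such a grouped split in Python, assuming a pandas DataFrame `visits` with a hypothetical `patient_id` column; the contest's actual schema and code may differ:

```python
import numpy as np
import pandas as pd

def split_by_patient(visits: pd.DataFrame, test_frac: float = 0.5,
                     seed: int = 0) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Assign whole patients, not individual visits, to train or test,
    so that no patient's history straddles the two sets."""
    rng = np.random.default_rng(seed)
    patients = visits["patient_id"].unique()
    n_test = int(len(patients) * test_frac)
    test_ids = set(rng.choice(patients, size=n_test, replace=False))
    in_test = visits["patient_id"].isin(test_ids)
    return visits[~in_test], visits[in_test]
```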
Dataset Content: 80 Variables
 Patient Data
• Race, Ethnicity, Gender, Age
 Hospital Data
• ID, Region
 Hospital Visit
• Admission Type, Diagnosis codes, Duration, Severity, Outcome, …
 Financial Data
• Charges, Reimbursement codes, Cost
Concerns in Designing the Task
 Simulate the ‘use case’ of a model in a medical domain
• Decide NOW whether to initiate a transfer
• Decide on potential preventive measures given the current risk of a case
 Maintain the history view of the patient
• Past visits should be considered
 Reasonable positive rate for the task of interest
• You cannot use the natural end of the sequence
 Avoid leakage
• Avoid models that use ex-post knowledge
Making it realistic and useful: choosing the target
[Diagram repeated: example temporal sequence of length 3 for one patient, with stays at H1, H2, and H3 ending in Home, Transfer, or Death]
• Observation 1: You cannot predict every visit
— You get mostly after-the-fact information
o If he comes back, he clearly did not die …
o If he comes back in one day, he was probably transferred
— You lose the temporal information in the test
• Observation 2: You cannot predict only the last visit in the sequence
— You get mostly death
— You get hardly any transfer
Choosing the Target
• Pick a random cutoff within the last year before the patient's last appearance
• Only predict the outcome of the patient's next stay after the cutoff
• Throw away any later visits
• Remove all absolute dates and maintain only differences
(A sketch of this construction follows the diagram below.)
[Diagram: temporal sequence of length 3 for one patient with the Cutoff marked; the outcome of the next stay after the cutoff (Transfer or Death) is labeled “?”]
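The bullets above amount to a small data-construction routine. The following is a hedged sketch of how the cutoff targeting could be implemented, assuming hypothetical column names `patient_id`, `admit_date` (a datetime column), and `disposition`; it illustrates the described procedure, not the organizers' actual code:

```python
import numpy as np
import pandas as pd

def build_cutoff_examples(visits: pd.DataFrame, seed: int = 0) -> list:
    """For each patient: sample a cutoff in the year before their last
    appearance, keep pre-cutoff visits as history, and label the example
    with the outcome of the first post-cutoff stay. Later visits are dropped."""
    rng = np.random.default_rng(seed)
    examples = []
    for pid, hist in visits.sort_values("admit_date").groupby("patient_id"):
        last_seen = hist["admit_date"].iloc[-1]
        cutoff = last_seen - pd.Timedelta(days=int(rng.integers(0, 365)))
        past = hist[hist["admit_date"] <= cutoff]
        future = hist[hist["admit_date"] > cutoff]
        if past.empty or future.empty:
            continue                             # no usable example here
        target = future.iloc[0]["disposition"]   # outcome of the next stay only
        history = past.copy()
        # keep only relative dates, never absolute ones
        history["days_before_cutoff"] = (cutoff - history["admit_date"]).dt.days
        history = history.drop(columns=["admit_date"])
        examples.append((pid, history, target))
    return examples
```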
Submission Performances
 Participants who did well in one task also did well in the other
 Most submissions have much higher performance than the ‘medical’ risk measure
 Baseline comparison for Task 2:
• The Severity field alone has an AUC of 0.81702 (a sketch of this baseline computation follows the figure)
[Figure: scatter plot of each submission's AUC on Task 1 vs. AUC on Task 2 (both axes roughly 0.5 to 0.9, Severity baseline marked), alongside histograms of the AUC distribution for each task]
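A single-field baseline like the Severity AUC reported above can be computed by treating the raw field as a ranking score. A minimal sketch, assuming hypothetical `died` (binary mortality label) and `severity` columns on the test set:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def severity_baseline_auc(test: pd.DataFrame) -> float:
    # Rank visits by the raw Severity field alone and score against mortality.
    return roc_auc_score(test["died"], test["severity"])
```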
Results Task 1
Winner
"Old Dogs with New Tricks": David J. Slate and Peter W. Frey
Runners-Up
"ID Analytics Team": Jianjun Xie and Stephen Coggeshall
"PriceWaterhouseCoopers": Kevin Hua, Aditya Sane, Lever Wang, Jafar Adibi, Jason Chiu, Stephen Bay, Krishna Kumaraswamy, Carter Shock, David Steier, Balaji Lakshminarayanan
Results Task 2
Winner
"GVCJAB": Gordon V. Cormack and Judy A. Brown
Runners-Up
"ID Analytics Team": Jianjun Xie and Stephen Coggeshall
Lars Schmidt-Thieme
"Old Dogs with New Tricks": David J. Slate and Peter W. Frey
On Leakage
 There was still something left …
 We had decided to remove all single-visit patients
• Very short history
• The cutoff approach would include them by default (no randomness)
 As a result, any patient who had only one visit must have come back, and therefore did not die during that first visit (see the check sketched below)
 Duration of the stay
• Post-hoc knowledge
• Not clear whether any signal leaked
• Decided to keep it in, since it might be very important in reality
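The single-visit leak is essentially a tautology in the data: a patient can only have a second visit if the first one was not fatal, so dropping single-visit patients makes "being a first visit" partially predictive of survival. A small sketch of a check one could run, again under hypothetical column names (`patient_id`, `admit_date`, `disposition`):

```python
import pandas as pd

def first_visit_death_rate(visits: pd.DataFrame) -> float:
    # Among patients with more than one visit, the first visit can never
    # end in death (otherwise there would be no second visit); this rate
    # should come out as exactly 0.0, which is the leaked signal.
    n_visits = visits.groupby("patient_id")["admit_date"].transform("count")
    multi = visits[n_visits > 1].sort_values("admit_date")
    first_visits = multi.groupby("patient_id").head(1)
    return float((first_visits["disposition"] == "Death").mean())
```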