RoR_MHS_Jayshree_final_Presentation
Predicting Risk of Re-hospitalization for Congestive Heart Failure Patients
(in collaboration with
)
Jayshree Agarwal
Senjuti Basu Roy, Ankur Teredesai, Si-Chi Chin, David Hazel, Kiyana, Mehrdad (UWT)
Paul Amoroso, Yoshi Williams, Dr. Lester Reed, Sheila, Eric Johnson (MHS)
Motivation
19.6% of patients readmitted within 30 days [Jencks et al. 2009]
31.1% of patients readmitted within 60 days [Jencks et al. 2009]
Many hospitalizations and readmissions
Cost of unplanned readmissions in 2004 = $17.4 billion [Jencks et al. 2009]
Congestive Heart Failure (CHF)
LOW readmission rate = HIGH quality of care by hospital
No reimbursement for readmission within 30 days
MHS - UWT Web and Data Science collaboration objectives
Predict the RISK of readmission for CHF patients
Reduce the readmission rate and cost
Improve patient satisfaction and quality of care
Appropriate pre-discharge and post-discharge planning
Proper resource utilization
Problem
Develop models that can predict the risk of readmission for CHF patients within
– 30 days after discharge
– 60 days after discharge
The readmission may be for CHF or for any other reason
Overall Approach
How to solve the problem?
– Apply predictive data mining techniques such as classification
What do these predictive mining techniques require?
– Data in a homogeneous format
• Information extraction, integration, and data preparation
• Prepare a labeled dataset to train the models and, later on, to test them
Our Challenges
Building domain knowledge
– Which variables to consider?
– How to merge and unify them in a homogeneous format (information extraction and integration)?
– How to understand the relative importance of the variables in the prediction task?
How to prepare data?
– Class label generation
– Noisy real-world data (missing values, inconsistencies, etc.)
– Serious skew in the dataset
Solution
Building Predictive Classification Models
Data Understanding → Data Preprocessing → Modeling → Evaluation
Data Understanding
Collect initial data
Acquire domain knowledge
Describe and explore dataset
Create data visualization
Data Preprocessing
Finding eligible CHF admissions
Define class label
Attribute selection
Data integration
Removal of incomplete data
Eligible CHF admissions and Generating Class Labels
All CHF admissions, with in-hospital deaths removed, give the eligible CHF admissions
For each eligible admission: is there any readmission within x days of discharge (x = 30 or x = 60)?
– YES: the class label is assigned as 1
– NO: the class label is assigned as 0
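The labeling rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`patient_id`, `admit`, `discharge`, `died_in_hospital`) and the sample rows are hypothetical, and any later admission of the same patient is treated as a readmission, whatever its cause.

```python
from datetime import date

def label_admissions(admissions, window_days):
    """Attach a 0/1 readmission label to each eligible admission.

    `admissions` is a list of dicts with hypothetical keys:
    patient_id, admit, discharge (datetime.date), died_in_hospital (bool).
    """
    # Step 1: in-hospital deaths are removed to obtain eligible admissions.
    eligible = [a for a in admissions if not a["died_in_hospital"]]

    labeled = []
    for index in eligible:
        # Step 2: look for any admission of the same patient that starts
        # within `window_days` days after this discharge.
        readmitted = any(
            other is not index
            and other["patient_id"] == index["patient_id"]
            and 0 <= (other["admit"] - index["discharge"]).days <= window_days
            for other in admissions
        )
        labeled.append({**index, "label": 1 if readmitted else 0})
    return labeled

# Made-up sample data, for illustration only.
admissions = [
    {"patient_id": "A", "admit": date(2011, 1, 1), "discharge": date(2011, 1, 5), "died_in_hospital": False},
    {"patient_id": "A", "admit": date(2011, 1, 25), "discharge": date(2011, 1, 30), "died_in_hospital": False},
    {"patient_id": "B", "admit": date(2011, 2, 1), "discharge": date(2011, 2, 3), "died_in_hospital": False},
    {"patient_id": "B", "admit": date(2011, 3, 20), "discharge": date(2011, 3, 22), "died_in_hospital": False},
    {"patient_id": "C", "admit": date(2011, 4, 1), "discharge": date(2011, 4, 9), "died_in_hospital": True},
]

labels_30 = [row["label"] for row in label_admissions(admissions, 30)]
labels_60 = [row["label"] for row in label_admissions(admissions, 60)]
print(labels_30)  # [1, 0, 0, 0]
print(labels_60)  # [1, 0, 1, 0]
```

Patient B returns 45 days after discharge, so that admission is labeled 0 for x = 30 and 1 for x = 60, which is why the two time frames produce different label distributions.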
Attribute selection
“Baseline”: Yale Model [Krumholz et al]
– Socio-demographic variables (2)
– Comorbidities (35)
“New”: additional predictor variables identified by us (14)
“All”: “Baseline” and “New” together
“Correlated”: attributes selected with the chi-square correlation test
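A chi-square test of association between one binary attribute and the class label might look like the sketch below. The attribute names, the sample data, and the 0.05 significance threshold are assumptions made for illustration; the slides do not state which threshold or implementation was used.

```python
def chi_square_2x2(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. a binary label."""
    # Observed counts: observed[feature_value][label_value]
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, labels):
        observed[f][y] += 1

    total = len(labels)
    row_totals = [sum(observed[0]), sum(observed[1])]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]

    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            # Expected count if the feature and the label were independent.
            expected = row_totals[f] * col_totals[y] / total
            if expected > 0:
                stat += (observed[f][y] - expected) ** 2 / expected
    return stat

# Critical value of chi-square with 1 degree of freedom at alpha = 0.05
# (the choice of alpha is an assumption).
CRITICAL_VALUE = 3.841

# Made-up data: 10 readmitted (1) and 10 not readmitted (0) patients.
labels = [1] * 10 + [0] * 10
features = {
    "prior_admission": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,  # hypothetical attribute
    "even_record_id": [1, 0] * 10,                              # hypothetical attribute
}

correlated = [
    name for name, column in features.items()
    if chi_square_2x2(column, labels) > CRITICAL_VALUE
]
print(correlated)  # ['prior_admission']
```

Attributes whose statistic exceeds the critical value are kept; the second attribute is independent of the label by construction, so its statistic is 0 and it is dropped.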
Data Extraction
Sources: labeled data, patient details, primary and secondary diagnosis, lab measurement, administrative data
Table joins combine the sources
Incomplete data removed
Result: data used for training the models
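The join-then-filter flow can be illustrated with plain Python dictionaries. Everything named here is hypothetical (table layouts, the `admission_id` and `patient_id` keys, the `age` and `creatinine` columns); the actual MHS schema is not given in the slides.

```python
def join_and_clean(labeled, patients, labs, required):
    """Inner-join labeled admissions with patient and lab rows,
    then drop rows that miss any required value."""
    patient_by_id = {p["patient_id"]: p for p in patients}
    lab_by_admission = {l["admission_id"]: l for l in labs}

    rows = []
    for admission in labeled:
        patient = patient_by_id.get(admission["patient_id"])
        lab = lab_by_admission.get(admission["admission_id"])
        if patient is None or lab is None:
            continue  # no matching row in one of the tables
        row = {**patient, **lab, **admission}
        # Removal of incomplete data: every required column must be present.
        if all(row.get(column) is not None for column in required):
            rows.append(row)
    return rows

# Made-up sample tables.
labeled = [
    {"admission_id": 1, "patient_id": "A", "label": 1},
    {"admission_id": 2, "patient_id": "B", "label": 0},
    {"admission_id": 3, "patient_id": "C", "label": 0},
]
patients = [
    {"patient_id": "A", "age": 71},
    {"patient_id": "B", "age": None},   # missing value
    {"patient_id": "C", "age": 64},
]
labs = [
    {"admission_id": 1, "creatinine": 1.4},
    {"admission_id": 2, "creatinine": 1.1},
    # admission 3 has no lab row
]

training_rows = join_and_clean(
    labeled, patients, labs, required=["age", "creatinine", "label"]
)
print(len(training_rows))  # 1
```

Only admission 1 survives: admission 2 has a missing age and admission 3 has no lab measurement, which shows how quickly strict removal can shrink the training set.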
Data Distribution
[Figure: bar charts of Readmit vs. No Readmit counts for the 30-day and 60-day time frames; vertical axis from 0 to 12000]
Modeling
Balancing imbalanced data by under-sampling and over-sampling
Selecting a modeling technique for binary classification
Building prediction models
• Logistic regression
• Naïve Bayes classifier
• Support Vector Machine
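Random under-sampling and over-sampling, the two balancing options named on this slide, could be sketched as follows. The slides do not say which sampling variant or ratio was used, so simple random resampling to a 1:1 class ratio is an assumption, as are the `label` key and the sample rows.

```python
import random

def split_by_class(rows, label_key="label"):
    """Return (minority, majority) row lists for a binary label."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) <= len(negatives):
        return positives, negatives
    return negatives, positives

def undersample(rows, label_key="label", seed=0):
    """Keep all minority rows and a random subset of the majority rows."""
    minority, majority = split_by_class(rows, label_key)
    rng = random.Random(seed)
    return minority + rng.sample(majority, len(minority))

def oversample(rows, label_key="label", seed=0):
    """Keep all rows and duplicate random minority rows until classes match."""
    minority, majority = split_by_class(rows, label_key)
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Made-up skewed dataset: 2 readmissions among 10 admissions.
rows = [{"id": i, "label": 1 if i < 2 else 0} for i in range(10)]

under = undersample(rows)
over = oversample(rows)
print(len(under), sum(r["label"] for r in under))  # 4 2
print(len(over), sum(r["label"] for r in over))    # 16 8
```

Under-sampling discards majority rows, while over-sampling repeats minority rows; in either case the resampling would normally be applied to the training portion only, so that test data keeps the original class distribution.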
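Logistic Regression Model
$$z = \ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_i X_i$$
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_i X_i)}} = \frac{1}{1 + e^{-z}}$$
[Figure: logistic curve, P (probability of Y) plotted against Z]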
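The relationship between the linear score z and the probability p can be checked numerically with a short sketch. The coefficient values and the input vector below are invented for illustration and are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: z = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def predict_probability(beta0, betas, x):
    """Compute p for one observation x given intercept and coefficients."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients and predictor values.
beta0 = -1.0
betas = [0.5, 2.0]
x = [2.0, 0.0]

p = predict_probability(beta0, betas, x)
print(p)  # 0.5, because z = -1.0 + 0.5 * 2.0 + 2.0 * 0.0 = 0
```

A score of z = 0 corresponds to p = 0.5; positive scores push the predicted readmission probability toward 1 and negative scores toward 0.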
Naïve Bayesian Classification
Statistical classifier that performs probabilistic prediction based on Bayes' theorem
Assumes that the attributes are conditionally independent given the class
Given a data tuple X and m classes $C_1, C_2, \ldots, C_m$:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Predicts that X belongs to $C_k$ only if $P(C_k \mid X)$ is the highest among all the $P(C_i \mid X)$ for the m classes
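A small worked example of the posterior computation, for binary attributes, is sketched below. The training tuples are made up, and the add-one (Laplace) smoothing is an assumption added to avoid zero probabilities; the slides do not mention smoothing.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (class, attribute index)
    for x, c in zip(rows, labels):
        for j, value in enumerate(x):
            value_counts[(c, j)][value] += 1
    return class_counts, value_counts

def posterior(x, class_counts, value_counts):
    """Return P(C_i | X) for every class, using conditional independence."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C_i)
        for j, value in enumerate(x):
            # P(x_j | C_i) with add-one smoothing over 2 possible values.
            score *= (value_counts[(c, j)][value] + 1) / (n_c + 2)
        scores[c] = score
    evidence = sum(scores.values())  # P(X)
    return {c: s / evidence for c, s in scores.items()}

# Made-up training data: two binary attributes per patient.
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = [1, 1, 1, 0, 0, 0]

class_counts, value_counts = train_naive_bayes(rows, labels)
post = posterior((1, 1), class_counts, value_counts)
predicted = max(post, key=post.get)
print(predicted, round(post[1], 3))  # 1 0.857
```

For the tuple (1, 1) the unnormalized scores are 0.24 for class 1 and 0.04 for class 0, so the posterior for class 1 is 0.24 / 0.28, about 0.857, and class 1 is predicted.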
Support Vector Machine
A method of classification for both linear and nonlinear data
Searches for the optimal hyperplane separating the two classes
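For the linear case, once a separating hyperplane w·x + b = 0 has been found, classification reduces to checking which side of the hyperplane a point falls on. The sketch below shows only that decision step and the margin width 2/||w||; it does not train an SVM, and the weight vector, bias, and points are invented.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class 1 on the positive side of the hyperplane, else class 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters (not learned from data).
w = [3.0, 4.0]
b = -5.0

print(classify(w, b, (3.0, 1.0)))  # 1, since 9 + 4 - 5 = 8
print(classify(w, b, (0.0, 0.0)))  # 0, since the score is -5
print(margin_width(w))             # 0.4
```

Training searches for the w and b that maximize this margin width while keeping the two classes on opposite sides; nonlinear data is handled by applying the same idea in a transformed feature space.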
Performance Evaluation Metrics
Precision – percentage of tuples labeled as positive that are actually positive = TP/(TP+FP)
Recall – percentage of positive tuples that are labeled positive = TP/(TP+FN)
Accuracy – percentage of tuples correctly classified = (TP+TN)/(P+N)
ROC curves and area under the curve (AUC) – show the trade-off between the true positive rate and the false positive rate
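These metrics can be computed directly from true labels and model outputs, as in the sketch below. The labels, scores, and the 0.5 decision threshold are made up; AUC is computed here as the fraction of positive/negative pairs that the scores rank correctly, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for 8 admissions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold is an assumption

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
area = auc(y_true, scores)
print(precision, recall, accuracy, area)
```

On skewed data such as readmissions, accuracy can look high even when few readmissions are caught, which is why recall and AUC are reported alongside it.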
Evaluation
• Predictive models are assessed using 10-fold cross-validation
• Performance is compared using the evaluation metrics mentioned previously
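The splitting logic behind 10-fold cross-validation can be sketched as below. The dataset size of 25 and the fixed random seed are assumptions for illustration, and this version does not stratify folds by class, which a real evaluation on skewed data might prefer.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each row is used for testing exactly once and for training in the other nine folds; the metric of interest is computed per fold and then averaged.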
RESULTS
[Figure: Logistic regression for 30 days; panels: Area Under the Curve (AUC), Recall]
[Figure: Logistic regression for 60 days; panels: Area Under the Curve (AUC), Recall]
[Figure: Naïve Bayes classifier for 30 days; Area Under the Curve (AUC) by attribute set (Baseline, New, All, Correlated); vertical axis from 0.56 to 0.64]
[Figure: Support Vector Machine for 30 days; Area Under the Curve (AUC) by attribute set (Baseline, New, All, Correlated); vertical axis from 0.58 to 0.635]
Conclusion and Discussion
Predicting readmission risk is a difficult problem to solve
Feature selection gives the best results
Data balancing improves the recall of the model
Future Work
Investigate other classification techniques such as ensemble methods and neural networks
Explore additional features and study their relevance
Employ other feature selection techniques
Devise a method to impute missing values
Deploy the predictive models
Acknowledgement
MultiCare Health System (MHS) and Dr. Lester Reed for giving us this opportunity
Data architects and domain experts at MHS for their inputs
Professors Dr. Ankur Teredesai and Dr. Senjuti Basu Roy for their guidance
Other team members at UWT for their support
References
S. F. Jencks, M. V. Williams, and E. A. Coleman, “Rehospitalizations among Patients in the Medicare Fee-for-Service Program,” New England Journal of Medicine, vol. 360, no. 14, pp. 1418–1428, 2009.
J. Han and M. Kamber, Data mining: concepts and techniques. Morgan Kaufmann, 2006.
H. M. Krumholz, S. L. T. Normand, P. S. Keenan, Z. Q. Lin, E. E. Drye, K. R. Bhat, Y. F. Wang, J. S. Ross, J. D. Schuur, and B. D. Stauffer, Hospital 30-day heart failure readmission measure methodology. Report prepared for the Centers for Medicare & Medicaid Services.
Questions