Team Dogecoin: An Experience in Predicting Hospital Readmissions
Max Payton, Katherine Ford
The Problem
Hospitals in the UK must keep track of which patients, once
released, return to the hospital within 30 days. A high rate of
patient readmission may suggest that a hospital is providing
insufficient care. Figuring out which patients are likely to return
helps hospitals improve their preventative care and patient
assessment strategies.
Given data from thousands of patients, including demographic
information, test results over time, and the ultimate outcome of
their hospital stay, we set out to learn a model to predict when
individual patients are likely to be readmitted.
Methodology - Toolset
Logistic Regression
• Starting point for evaluating feature selection
• We obtained a maximum accuracy of 56%
Random Forest
• Ensemble methods draw more complex decision boundaries than logistic regression and ease some feature-finding difficulties
• Two types of underlying weak learner: classification trees and regression trees
[Figures: Error with Classification; Error with Regression]
The Data
We had access to labeled data from 14,878 patients, with
unlabeled data from an additional 6,359 patients. For each patient,
we knew their age, gender, how they were admitted, to where they
were released, how many prior admissions they had, and, most
significantly, the results and times of every lab test given to each
patient.
To get a sense of what this data signified, we researched each test
to find what physiological system it related to and what its normal
ranges were. We tried to derive features and trends in the data
based on our analysis of the test results and the demographic
data.
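The derived features described on this poster (a length of stay, plus a count, a first-to-last change, and a last value for each lab test) can be sketched in Python. This is only an illustration: the record layout, the field names, the test names, the sign convention for the difference (last minus first), and the choice to encode missing tests as zeros are all assumptions, since the competition's data format is not shown here.

```python
from datetime import datetime

def length_of_stay_days(admitted, released):
    """Length of stay in days, from two ISO-format timestamp strings."""
    delta = datetime.fromisoformat(released) - datetime.fromisoformat(admitted)
    return delta.total_seconds() / 86400.0

def per_test_features(results):
    """results: list of (iso_timestamp, value) pairs for one lab test.

    Returns (count, last - first, last), with results ordered by time.
    """
    if not results:
        return (0, 0.0, 0.0)  # assumption: a test never given is encoded as zeros
    ordered = sorted(results, key=lambda r: r[0])  # ISO strings sort chronologically
    first, last = ordered[0][1], ordered[-1][1]
    return (len(ordered), last - first, last)

def patient_features(patient, test_names):
    """Flatten one (hypothetical) patient record into a feature list."""
    feats = [length_of_stay_days(patient["admitted"], patient["released"])]
    for name in test_names:
        feats.extend(per_test_features(patient["tests"].get(name, [])))
    return feats

# Hypothetical patient record, for illustration only
example = {
    "admitted": "2014-01-01T08:00:00",
    "released": "2014-01-04T20:00:00",
    "tests": {
        "sodium": [("2014-01-03T09:00:00", 141.0), ("2014-01-01T09:00:00", 138.0)],
    },
}
features = patient_features(example, ["sodium", "potassium"])
```

With two tests in `test_names`, each patient becomes a vector of seven numbers: one length of stay and three values per test.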
Boosting
• Alternative ensemble method focusing on multiple weighted weak learners; the weights represent the importance of reclassification
• Three methods used: AdaBoost, GentleBoost, LogitBoost
[Figures: Error with AdaBoost; Error with GentleBoost; Error with LogitBoost]
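To make the role of the weights concrete, here is a minimal sketch of discrete AdaBoost with threshold stumps as weak learners. It is not the MATLAB implementation used for the poster, and it does not cover GentleBoost or LogitBoost; the function names and the toy data are made up for illustration.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) for the best stump.

    A stump predicts `polarity` when row[feature] <= threshold, else -polarity.
    Labels in y are +1 or -1.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity

def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote of this weak learner
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correctly classified ones lose it
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
train_accuracy = sum(predict(model, r) == yi for r, yi in zip(X, y)) / len(X)
```

On this toy set a single stump gets at most four of the six points right, while three boosted stumps classify all six, because each round concentrates weight on the points the previous stumps got wrong.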
Our Generation Strategy
1. We clean up the features by converting admission and leave times into a length of stay
2. We add 3 features for each test: the number of times that test was administered, the difference between the first and last test results, and the value of the last test
3. We use the HMM illustrated on the poster, which has been trained on the training data and whose A and B matrices have been saved, to generate a hidden state sequence; for every test, we add a feature indicating which hidden state the patient is in after the final test was administered
4. We use cross-validation on the training data to determine the number of trees to use for the random forest learner
5. We train the random forest on the training data with the number of trees found above
6. We use the random forest to classify the test data
Hidden Markov Models
• Method for learning time-series data by using a hidden layer
• Used in an ensemble fashion because no single series had enough data to generate an accurate transition matrix
• Used to learn an underlying model for lab-test data
• Hidden states converted into features for later processing by another model
K-means Clustering
• Unsupervised learning algorithm that groups the data into k clusters by minimizing the distance between each point and the center (mean) of its cluster
• Potential distance measures: Cartesian, city block, and correlation
• Used to find outliers in the data for initial reweighting
Acknowledgements
Tools: MathWorks MATLAB, Python
Data: Dr Tony Wolf, MD
Testing Infrastructure: Kaggle Inc.
Lab Test Information: Royal College of Physicians and Surgeons of Canada, US National Library of Medicine, Mayo Clinic
Results
Team Name: dogecoin
Private Leaderboard Position: 20th
Final score: 0.59405
Strategy Evaluation
Hidden Markov Models
• Adding features derived from our HMMs did not improve our model
• Next steps involve synthesizing the hidden layer information into an overall organ state
• Also combine model data for patients on sequential visits
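For reference, the HMM-derived feature (the hidden state a patient is in after the final test) can be sketched as follows. The poster does not say how the hidden state sequence was generated; Viterbi decoding is assumed here as one common choice. The two-state model, the three discretized result levels, and every probability below are invented illustration values, not the trained A and B matrices.

```python
import math

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a list of observation indices.

    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability that state i emits observation symbol k
    """
    n = len(pi)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    score = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: state 0 = "stable", state 1 = "abnormal";
# symbols 0 / 1 / 2 = low / normal / high test result
pi = [0.8, 0.2]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.10, 0.80, 0.10],
     [0.45, 0.10, 0.45]]

observed = [1, 1, 2, 2]          # normal, normal, high, high
path = viterbi(observed, pi, A, B)
final_state_feature = path[-1]   # the per-test feature described in step 3
```

Only the last element of the decoded path is kept as a feature, so everything the model infers about the earlier trajectory is discarded; that may be one reason such features add little.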
K-Means for Outlier Detection
• Clusters had very low predictive accuracy, hurting our ability to detect outliers
• Next step: evaluate cluster centers to obtain more knowledge about the underlying data
• Also try distance measures other than Cartesian and evaluate their effectiveness
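A small sketch of how swapping the distance measure could be tried: a Lloyd-style clustering routine that takes the distance function as a parameter, followed by an outlier score equal to each point's distance to its nearest center. "Cartesian" is assumed here to mean Euclidean distance, correlation distance is left out for brevity, and the points and starting centers are toy values rather than patient data.

```python
import statistics

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, init_centers, distance, center=statistics.fmean, iters=20):
    """Cluster points around len(init_centers) centers.

    `center` summarizes each coordinate of a cluster: the mean suits
    Euclidean distance, the median suits city block distance.
    """
    centers = [list(c) for c in init_centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: distance(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center; an empty cluster keeps its old center
        new_centers = [
            [center(col) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def outlier_scores(points, centers, distance):
    """Distance from each point to its nearest center; larger is more unusual."""
    return [min(distance(p, c) for c in centers) for p in points]

# Toy data: two tight groups plus one stray point at index 8
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
          (5.0, 15.0)]
init = [points[0], points[4]]

euclid_centers = kmeans(points, init, euclidean)
euclid_scores = outlier_scores(points, euclid_centers, euclidean)

block_centers = kmeans(points, init, cityblock, center=statistics.median)
block_scores = outlier_scores(points, block_centers, cityblock)
```

Both measures flag the stray point here, but the centers differ: the median-based city block center is not pulled toward the stray point the way the mean is.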
Shrinking Features
• Our final model had over 100 features, some of which were clearly cluttering the model