Association Rule Mining in Type 2 Diabetes Risk Prediction
Gyorgy J. Simon
Dept. of Health Sciences Research
Mayo Clinic
SHARPn Summit 2012
Outline
• Introduction
• Modeling Diabetes Risk
– Association Rule Mining
• Results
– Diabetes Disease Network Reconstruction
– Diabetes Risk Prediction
• Applicability to SHARP
Diabetes
• In the US, 25.8 million people (8% of the population) suffer from
Diabetes Mellitus
– Type 2 Diabetes Mellitus (DM)
• DM leads to significant medical complications
• Effective preventive treatments exist
– Identifying subpopulations at risk is important
• Pre-Diabetes (PreDM) is a condition that precedes DM
– Fasting glucose 100-125 mg/dL
• Identify sets of risk factors that significantly increase the risk of
developing diabetes in a pre-diabetic population
– Risk factors:
• Co-morbid diseases: obesity, cardiac-, vascular conditions
• Vitals, lab test results, medications, co-morbid conditions
• 85k Mayo Patients 1999-2004 with research consent
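Design
[Figure: cohort flow diagram]
• Study period: 1/1/1999 to 12/31/2004; follow-up through 7/2010
• Normal 84,708 (start of study period)
• End of study period: Normal 44,156; PreDM 23,828; DM 16,664
• Follow-up
– PreDM 23,828: 21,826 remain PreDM, 2,002 progress to DM
– Normal 44,156: 43,809 remain Normal, 347 progress to DM
– DM 19,013 (16,664 + 2,002 + 347)
• Additional DM count shown in the diagram: 424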
Data
PID  OB  HTN  …  Glucose  Age  FUT  DM
001  Y   Y    …  110      55   1.8  Y
002           …  115      19   2.5  N
…    …   …    …  …        …    …    …
(OB, HTN, …: co-morbidities)
• Follow-up Time (FUT): Time since PreDM Dx
• Co-morbidities: before elevated glucose measurement
– hypertension, hyperlipidemia, obesity, various cardiac and vascular diseases
• Age and Follow-up time (FUT) are predictive of DM
– They are not modifiable, so we need to adjust for them
• Goal is different from high-throughput phenotyping
– None of the patients have the disease
– Predict the risk that patients progress to DM
Outline
• Introduction
• Modeling Diabetes Risk
– Association Rule Mining
• Results
– Diabetes Disease Network Reconstruction
– Diabetes Risk Prediction
• Applicability to SHARP
Computational Model
[Figure: three levels of factors feed an unknown disease mechanism that leads to the DM Dx]
• Level 1: unmodifiable "nuisance" factors (age, sex, …)
• Level 2: clinical factors of interest (bmi, HTN, hdl, statin, tobacco, …)
• Level 3: glucose, the "definition" of DM
• Goal: find sets of clinical factors (level 2) that are associated with elevated risk of DM
• We have to adjust for level 1 factors before we can assess the effect of level 2 factors!
Modeling Approaches
1. Logistic regression / Survival Analysis
– No ability to discover interactions
2. Decision Trees / Random Forest / Gradient-boosted Trees
– Greedy approach to discover interactions
– No ability to compensate for age and follow-up time (FUT)
3. Association Rule Mining (ARM)
– Specifically designed to discover interactions
– No ability to compensate for age and FUT
Regression Analysis + Association Rule Mining
• Regression analysis: remove the effect of age, gender and FUT
• Association rule mining: find associations between the risk factors and the DM risk not explained by age and FUT
Simon et al. AMIA 2011
Overview
[Figure: the patient table is processed in three phases. Sample rows: 001 (DM Y, Age 55, FUT 1.8) and 002 (DM N, Age 19, FUT 2.5); glucose values 103 and 112.]
• O: observed number of DM incidents
• 1st Phase: regression modeling (survival model or logistic regression)
– Columns used: PID, DM, Age, FUT
– E1: expected number of DM incidents based on age and sex only
– R1 = O - E1 (1st phase residual)
• 2nd Phase: Association Rule Mining
– Columns used: R1, co-morbidities (Obese, HTN, …)
– E2: expected number of DM incidents based on co-morbidities only (after adjusting for age and sex)
– R2 = O - (E1 + E2) = R1 - E2 (2nd phase residual)
• 3rd Phase
– Columns used: R2, Glucose
– E3: expected number of DM incidents based on glucose (after adjusting for everything else)
• Final prediction: E = E1 + E2 + E3
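The residual bookkeeping between phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four patients, their E1 values and the helper names are hypothetical, and E2 is taken to be the mean 1st-phase residual among patients carrying an itemset, which simplifies whatever estimator the study actually used.

```python
from statistics import mean

# Hypothetical toy cohort. O is the observed outcome (1 = progressed to DM);
# E1 is assumed to come from some phase-1 regression on age, sex and FUT.
patients = [
    {"pid": "001", "O": 1, "E1": 0.10, "items": {"OB", "HTN"}},
    {"pid": "002", "O": 0, "E1": 0.05, "items": set()},
    {"pid": "003", "O": 1, "E1": 0.08, "items": {"OB", "HTN"}},
    {"pid": "004", "O": 0, "E1": 0.12, "items": {"HTN"}},
]

def phase1_residuals(rows):
    """R1 = O - E1 for every patient."""
    return {r["pid"]: r["O"] - r["E1"] for r in rows}

def phase2_expectation(rows, r1, itemset):
    """Assumed E2: mean R1 among patients who have every item in the itemset."""
    covered = [r1[r["pid"]] for r in rows if itemset <= r["items"]]
    return mean(covered) if covered else 0.0

def phase2_residuals(rows, r1, itemset, e2):
    """R2 = R1 - E2 for covered patients; R1 is left unchanged elsewhere."""
    return {
        r["pid"]: r1[r["pid"]] - (e2 if itemset <= r["items"] else 0.0)
        for r in rows
    }

rule = {"OB", "HTN"}
r1 = phase1_residuals(patients)
e2 = phase2_expectation(patients, r1, rule)
r2 = phase2_residuals(patients, r1, rule, e2)
print(round(e2, 4), {pid: round(v, 4) for pid, v in r2.items()})
```

For a covered patient the identity R2 = O - (E1 + E2) = R1 - E2 holds by construction, which is the point of the sketch.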
Association Rule Mining
• Origins from sales data
• Items (columns): co-morbid conditions
• Transactions (rows): patients
• Itemsets: sets of co-morbid conditions
• Goal: find all itemsets (sets of conditions) that frequently co-occur in patients.
– One of those conditions should be DM.
• Support: # of transactions the itemset I appeared in
– Support({OB, HTN, IHD})=3
• Frequent: an itemset I is frequent if support(I)>minsup
Patient  OB  HTN  IHD  …  DM
001      Y   Y    Y       Y
002      Y   Y    Y       Y
003      (two entries marked Y)
004      (one entry marked Y)
005      Y   Y    Y
[Figure legend: X: infrequent]
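The support and frequency definitions can be made concrete with a small brute-force sketch. The rows for patients 003 and 004 are assumptions (the slide shows only how many conditions they have), the function names are illustrative, and a real miner would use Apriori or FP-growth rather than enumerating every candidate.

```python
from itertools import combinations

# Toy transaction database. Rows 003 and 004 are assumed for illustration.
transactions = {
    "001": {"OB", "HTN", "IHD", "DM"},
    "002": {"OB", "HTN", "IHD", "DM"},
    "003": {"OB", "HTN"},   # assumption
    "004": {"HTN"},         # assumption
    "005": {"OB", "HTN", "IHD"},
}

def support(itemset, db):
    """Number of transactions (patients) that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for items in db.values() if wanted <= items)

def frequent_itemsets(db, minsup):
    """All itemsets with support > minsup, enumerated level by level."""
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            s = support(candidate, db)
            if s > minsup:
                result[frozenset(candidate)] = s
                found_any = True
        if not found_any:
            # Supersets of infrequent itemsets cannot be frequent, so stop.
            break
    return result

print(support({"OB", "HTN", "IHD"}, transactions))
for itemset, s in frequent_itemsets(transactions, minsup=2).items():
    print(sorted(itemset), s)
```

The early exit relies on the anti-monotone property of support: adding an item to an itemset can never increase its support.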
Distributional Association Rule Mining
Distributional Association Rules associate an itemset with a continuous outcome.
[Figure: example table with columns PID, A, B, C, D and a continuous outcome R. Some patients have R near .40 (.40, .38, .39, .41) and others near zero (.00, .01, .02). Two histograms show the distribution of R (x-axis: R from 0 to 0.45; y-axis: Frequency).]
Application to Diabetes
Find all sets I of co-morbid conditions, such that the distribution of risk R is
significantly different between the patient population having I and without I
Simon et al, KDD 2011a
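One way to picture "significantly different distribution of R" is a two-sample comparison between patients with and without the itemset. The sketch below swaps in a plain permutation test on the difference of means; that is an assumption for illustration, not the test used in the cited work, and the residual values are hypothetical (they loosely echo the figure).

```python
import random
from statistics import mean

def mean_difference(residuals, has_itemset):
    """Mean R among patients with the itemset minus mean R among the rest."""
    with_i = [r for r, h in zip(residuals, has_itemset) if h]
    without_i = [r for r, h in zip(residuals, has_itemset) if not h]
    return mean(with_i) - mean(without_i)

def permutation_p_value(residuals, has_itemset, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean residual."""
    rng = random.Random(seed)
    observed = abs(mean_difference(residuals, has_itemset))
    labels = list(has_itemset)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between itemset and residual
        if abs(mean_difference(residuals, labels)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical residuals R and itemset membership for eight patients.
R = [0.40, 0.38, 0.39, 0.41, 0.00, 0.01, 0.02, 0.00]
has_I = [True, True, True, True, False, False, False, False]

diff = mean_difference(R, has_I)
p = permutation_p_value(R, has_I)
print(round(diff, 4), p)
```

With only four patients per group the smallest achievable p-value is about 2/70, so a real analysis would need far larger subpopulations, as in the study.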
Why Association Rule Mining?
Challenge: Solution
• Interactions: designed to discover associations
• Missing data: asymmetry in items
– Absence of an item does not mean that the risk factor was not present
• Clinical question: directly extracts sets of risk factors
– Allows for differences in modeling for prediction and for disease mechanism discovery
• Computational efficiency: efficient algorithms exist
Outline
• Introduction
• Modeling Diabetes Risk
– Association Rule Mining
• Results
– Diabetes Disease Network Reconstruction
– 4.5-yr DM Risk Prediction
• Applicability to SHARP
Diabetes Disease Network
Reconstruction
• Metabolic Syndrome: DM + cardiac/vascular
diseases
• Use Association Rule Mining to map out the
relationships between DM and other metabolic
syndrome diseases
– Also measure their effect on DM progression risk
• Predictors: Age, sex, FUT; co-morbid disease Dx
• 1st Phase model is survival model
• 2nd Phase ARM
Results
• 37 Distributional Association Rules were discovered
• 11 are significant (Poisson test; Bonferroni adjusted 5%)
Sup    Cases  P-value  RR    Itemset
7116   819    2.0e-7   1.32  HTN
4729   560    1.7e-8   1.45  OB
8612   964    2.6e-8   1.31  HL
1980   291    1.9e-9   1.78  HTN,OB
4171   534    1.5e-8   1.47  HTN,HL
553    85     8.3e-4   1.86  OB,IHD
2434   335    4.3e-9   1.68  OB,HL
382    66     7.7e-4   2.08  HTN,OB,IHD
1271   204    2.8e-8   1.93  HTN,OB,HL
470    76     7.2e-4   1.93  OB,IHD,HL
339    61     6.1e-4   2.15  HTN,OB,IHD,HL
• Interpretation: patients with HTN, OB, IHD and HL have an age- and FUT-adjusted RR of DM of 2.15.
• Effect of age and FUT adjustment
– The entire PreDM population has an 8.4% chance of DM (2,002 of 23,828).
– Without age and FUT adjustment, the above population has 61/339 = 18.0%
– With age and FUT adjustment, 1 - (1 - .084)^2.15 = 17.2%
Legend
• OB: Obesity
• HTN: Hypertension
• IHD: Ischemic Heart Disease
• HL: Hyperlipidemia
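The arithmetic behind the adjustment comparison can be checked directly. This sketch assumes that the baseline rate is the 2,002 PreDM patients who progressed to DM out of 23,828 (counts from the Design slide) and that the relative risk acts as an exponent on the baseline survival probability, which is how the slide's formula reads; the variable names are illustrative.

```python
# Counts taken from the slides; their interpretation is an assumption.
predm_total = 23_828   # PreDM patients at the end of the study period
predm_to_dm = 2_002    # of those, patients who progressed to DM
rule_support = 339     # patients with HTN, OB, IHD and HL
rule_cases = 61        # of those, patients who progressed to DM
relative_risk = 2.15   # age- and FUT-adjusted RR of the rule

baseline = predm_to_dm / predm_total             # overall PreDM risk
crude = rule_cases / rule_support                # unadjusted risk in the rule
adjusted = 1 - (1 - baseline) ** relative_risk   # RR applied to baseline survival

print(f"baseline {baseline:.1%}, crude {crude:.1%}, adjusted {adjusted:.1%}")
```

The baseline comes out at about 8.4%, which matches the .084 used in the slide's formula rather than the 8.04% printed next to it.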
Results
[Figure: diabetes disease network. Each node shows Condition(s) and Subpop. Size (Relative Risk).]
• IHD: 2366 (1.16) [p-value .11]
• HTN, OB, IHD: 382 (2.08)
• HTN, IHD, HL: 1210 (1.36) [p-value .015]
Legend
• OB: Obesity
• HTN: Hypertension
• IHD: Ischemic Heart Disease
• HL: Hyperlipidemia
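The "Bonferroni adjusted 5%" criterion can be illustrated with the p-values shown on the two Results slides. The sketch assumes the correction divides 5% by the 37 discovered rules; the slides do not state the denominator, so treat the threshold as an assumption.

```python
ALPHA = 0.05
N_RULES = 37                   # rules discovered (assumed number of tests)
threshold = ALPHA / N_RULES    # Bonferroni-adjusted significance level

# P-values as shown on the Results slides.
p_values = {
    "HTN": 2.0e-7,
    "OB": 1.7e-8,
    "HL": 2.6e-8,
    "HTN,OB": 1.9e-9,
    "HTN,HL": 1.5e-8,
    "OB,IHD": 8.3e-4,
    "OB,HL": 4.3e-9,
    "HTN,OB,IHD": 7.7e-4,
    "HTN,OB,HL": 2.8e-8,
    "OB,IHD,HL": 7.2e-4,
    "HTN,OB,IHD,HL": 6.1e-4,
    "IHD": 0.11,
    "HTN,IHD,HL": 0.015,
}

significant = sorted(rule for rule, p in p_values.items() if p < threshold)
print(f"threshold = {threshold:.5f}; {len(significant)} rules pass")
```

Under this assumption the eleven rules in the table pass, while IHD alone (p = .11) and HTN, IHD, HL (p = .015) do not, which agrees with the count of 11 significant rules.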
Outline
• Introduction
• Modeling Diabetes Risk
– Association Rule Mining
• Results
– Diabetes disease network re-construction
– 4.5-yr DM risk prediction
• Applicability to SHARP
DM Progression Risk Prediction
• Predicting the probability of progression to DM
within 4.5 years
• Predictors: age, sex, co-morbid Dx, laboratory
results and medication orders
• 1st Phase: spline logistic regression to adjust for
age and sex
• 2nd Phase: ARM
• 3rd Phase: linear regression using glucose
Machine Learned Indices
• Comparison to machine learning methods
– Gradient Boosted Trees (GBM)
• 10,000 trees
– Linear Model (LM)
– Random Forest (RF)
• 275-325 trees
– Association Rule Mining (ARM)
• 100 rules
• 10-fold CV repeated 50 times
• Same predictive performance but more interpretable model
[Figure: C-statistic of each method]
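The evaluation protocol (C-statistic under 10-fold cross-validation repeated 50 times) can be sketched as follows. This is a standard-library illustration, not the study's code: the function names are hypothetical and the pairwise C-statistic is the textbook concordance definition for a binary outcome.

```python
import random

def c_statistic(scores, outcomes):
    """Fraction of (case, non-case) pairs ranked correctly; ties count 0.5."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

def repeated_kfold(n, k=10, repeats=50, seed=0):
    """Yield (train_indices, test_indices) for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)                       # new random partition per repeat
        folds = [order[i::k] for i in range(k)]
        for test_fold in folds:
            held_out = set(test_fold)
            train = [i for i in order if i not in held_out]
            yield train, test_fold

print(c_statistic([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
print(sum(1 for _ in repeated_kfold(100)))
```

In use, a model would be fitted on each `train` split, scored on the matching test split, and the resulting C-statistics averaged over all splits.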
Traditional Indices
• Performance similar to San Antonio (Refit)
• ARM readily provides a justification as to why the risk
is high
• Proposed method places the patient on a path in the
diabetes network
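How a rule set can justify a high risk estimate can be sketched with the relative risks from the Results table. Choosing the matching rule with the highest RR is an assumption made for illustration; the slides do not say how overlapping rules are combined, and `explain_risk` is a hypothetical helper.

```python
# Relative risks as read from the Results table (age- and FUT-adjusted).
RULES = {
    frozenset({"HTN"}): 1.32,
    frozenset({"OB"}): 1.45,
    frozenset({"HL"}): 1.31,
    frozenset({"HTN", "OB"}): 1.78,
    frozenset({"HTN", "HL"}): 1.47,
    frozenset({"OB", "IHD"}): 1.86,
    frozenset({"OB", "HL"}): 1.68,
    frozenset({"HTN", "OB", "IHD"}): 2.08,
    frozenset({"HTN", "OB", "HL"}): 1.93,
    frozenset({"OB", "IHD", "HL"}): 1.93,
    frozenset({"HTN", "OB", "IHD", "HL"}): 2.15,
}

def explain_risk(conditions):
    """Return (rule, RR) for the highest-RR rule the patient satisfies, or None."""
    matches = [(rr, rule) for rule, rr in RULES.items() if rule <= conditions]
    if not matches:
        return None
    rr, rule = max(matches, key=lambda m: m[0])
    return sorted(rule), rr

print(explain_risk({"HTN", "OB"}))
print(explain_risk({"HTN", "OB", "IHD", "HL"}))
print(explain_risk({"IHD"}))
```

The returned itemset is the justification: it names the combination of conditions that places the patient at that point in the diabetes network.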
Clinical Validation
• Work in progress…
• Apply the rules to
both normoglycemic and Pre-DM
patients
• Each point is a rule
• Patterns similar for
lower-risk
subpopulations
• For high-RR rules,
risk of DM is higher
for Pre-DM patients
Outline
• Introduction
• Modeling Diabetes Risk
– Association Rule Mining
• Results
– Interpretability
– Predictive Performance
• Applicability to SHARP
High-Throughput Phenotyping (HTP)
• We can use the Association Rules as a HTP
algorithm
– Discover the rules with ARM
– Validate the rules with an expert clinician
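High-throughput Phenotyping vs. DM Risk Assessment
• Question
– High-throughput phenotyping: Does the patient currently have DM?
– DM risk assessment: Will the patient progress to DM in 4.5 yrs? (Interventions are possible)
• Output
– High-throughput phenotyping: binary decision (DM or not)
– DM risk assessment: probability of diabetes (prob. can be dichotomized into DM/no DM)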
Acknowledgment
Peter W. Li, PhD
Health Sciences Research, Mayo Clinic, MN
Pedro J. Caraballo, MD
Internal Medicine, Mayo Clinic, MN
M. Regina Castro, MD
Division of Endocrinology and Metabolism, Mayo Clinic, MN
Terry M. Therneau, PhD
Health Sciences Research, Mayo Clinic, MN
Vipin Kumar, PhD
Department of Computer Science,
University of Minnesota
References
Vemuri P, Simon G, Kantarci K, Whitwell J, Senjem M, Przybelski S, Gunter J,
Josephs K, Knopman D, Boeve B, Ferman T, Dickson D, Parisi J, Petersen R and Jack
C. Antemortem differential diagnosis of dementia pathology using structural MRI:
Differential-STAND. NeuroImage, 2010.
Caraballo P, Li P, Simon G. Use of Association Rule-mining to Assess Diabetes Risk
in Patients with Impaired Fasting Glucose, AMIA, 2011.
Simon G, Kumar V, Li P. A Simple statistical model and association rule filtering. In
Proc. ACM International Conference on Data Mining and Knowledge Discovery
(KDD), 2011.
Simon G, Li P, Jack C, Vemuri P. Understanding Atrophy Trajectories in Alzheimer’s
Disease Using Association Rules on MRI images. In Proc. ACM International
Conference on Data Mining and Knowledge Discovery (KDD), 2011.