Data Mining Cardiovascular Bayesian Networks
Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi
Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering
2 epidemiological models
2. Data Mining
Busselton Study data
Bayesian network software (Netica)
Causal discovery (CaMML) + other learners
Medical Experts
3. Evaluation
Knowledge Engineering BNs
from the medical literature
The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women
The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
General question: are models transferable across
populations?
The Busselton BN: nodes
The Busselton BN: arcs
uninformative
All nodes have an associated
conditional prob. distribution
P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
predictor variables
10-year risk of CHD event
BNs summarize the joint
distribution
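The factorization above can be checked numerically. The following is a minimal sketch, assuming all four variables are binary; the CPT numbers are invented for illustration and are not the Busselton parameters.

```python
from itertools import product

# Hypothetical CPTs for four binary variables (values 0/1).
# All numbers are invented for illustration only.
p_s = {1: 0.5, 0: 0.5}              # P(S = s)
p_b_given_s = {1: 0.30, 0: 0.20}    # P(B = 1 | S = s)
p_al_given_s = {1: 0.40, 0: 0.25}   # P(Al = 1 | S = s)
p_at_given_s = {1: 0.10, 0: 0.05}   # P(At = 1 | S = s)


def bernoulli(p_true, value):
    """Probability of a binary value, given P(value = 1)."""
    return p_true if value == 1 else 1.0 - p_true


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (p_s[s]
            * bernoulli(p_b_given_s[s], b)
            * bernoulli(p_al_given_s[s], al)
            * bernoulli(p_at_given_s[s], at))


# The factored form still defines a full joint distribution.
total = sum(joint(*values) for values in product((0, 1), repeat=4))

factored_params = 1 + 2 + 2 + 2   # one for P(S), one per value of S for each child
full_table_params = 2 ** 4 - 1    # unrestricted joint over four binary variables
```

Under these assumptions the network needs 7 free parameters where an unrestricted joint table needs 15, which is the sense in which a BN summarizes the joint distribution.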
The Busselton BN: discretization
binary nodes
discretization choices
The Busselton BN: reasoning
» Normal
» Bad cholesterol
» Heavy smoking
» More risk factors
A risk assessment tool for clinicians
Previous tool: TAKEHEART
Combine risk assessment (probability) with
costs.
Risk Assessment Tool: example
Young, predictor not observed – don’t treat
Old, predictor not observed – treat
Young, predictor observed – don’t treat
Not so old, predictor not observed – treat
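One way to read "combine risk assessment with costs" is as an expected-cost comparison between treating and not treating. The sketch below is only that reading; the function names, cost figures and relative risk reduction are hypothetical and are not taken from TAKEHEART or from the tool on the slides.

```python
def expected_costs(risk, event_cost, treatment_cost, relative_risk_reduction):
    """Return (untreated, treated) expected costs for one patient.

    risk: probability of a CHD event without treatment (e.g. from the BN).
    All cost inputs are hypothetical and share one arbitrary unit.
    """
    untreated = risk * event_cost
    treated = treatment_cost + risk * (1.0 - relative_risk_reduction) * event_cost
    return untreated, treated


def should_treat(risk, event_cost, treatment_cost, relative_risk_reduction):
    """Recommend treatment when it lowers the expected cost."""
    untreated, treated = expected_costs(
        risk, event_cost, treatment_cost, relative_risk_reduction)
    return treated < untreated


# Hypothetical patients: same costs, different 10-year risks.
high_risk = should_treat(0.30, event_cost=100.0, treatment_cost=5.0,
                         relative_risk_reduction=0.3)
low_risk = should_treat(0.05, event_cost=100.0, treatment_cost=5.0,
                        relative_risk_reduction=0.3)
```

With these made-up numbers the higher-risk patient is treated and the lower-risk patient is not, mirroring the treat / don't treat pattern in the examples.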
CaMML: a causal learner
Developed at Monash University
Data mines BNs from epidemiological data
Minimum message length (MML) metric:
Trades off complexity against goodness of fit
MCMC search over model space
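The complexity-versus-fit trade-off can be illustrated with a crude two-part score: bits to state the model plus bits to state the data given the model. The sketch below uses a BIC-style penalty of half a log per parameter, swapped in for illustration. It is not the MML metric CaMML uses, and it involves no MCMC search. The data sets and function names are hypothetical.

```python
import math
from collections import Counter


def nll_bits(counts):
    """Bits to encode outcomes using their maximum-likelihood frequencies."""
    n = sum(counts.values())
    return -sum(c * math.log2(c / n) for c in counts.values() if c > 0)


def two_part_score(data, dependent):
    """Crude score in bits for binary (x, y) pairs; lower is better."""
    n = len(data)
    x_counts = Counter(x for x, _ in data)
    data_bits = nll_bits(x_counts)
    if dependent:
        # Model X -> Y: Y is encoded separately for each value of X (3 parameters).
        for x_value in x_counts:
            data_bits += nll_bits(Counter(y for x, y in data if x == x_value))
        n_params = 3
    else:
        # Model with no arc: Y is encoded from its marginal (2 parameters).
        data_bits += nll_bits(Counter(y for _, y in data))
        n_params = 2
    model_bits = 0.5 * n_params * math.log2(n)  # BIC-style complexity cost
    return model_bits + data_bits


# Hypothetical data sets of 100 cases each.
coupled = [(0, 0)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5 + [(1, 1)] * 45
unrelated = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25
```

On the coupled data the extra arc pays for itself; on the unrelated data the simpler model scores better because the arc adds cost without improving fit.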
CaMML: example BN
Evaluation
Predicting 10-year risk of CHD using Busselton data
Split data 90-10 training/testing
10-fold cross-validation
Metrics:
» Predictive Accuracy
» ROC curves (area under curve): true positive rate vs
false positive rate
» Bayesian Information Reward (BIR)
Using Weka: Java environment for machine learning
tools and techniques
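The 90-10 split repeated over 10 folds can be sketched in plain Python as follows. This is an illustration only, not Weka's implementation; the function name and the unstratified shuffle are assumptions.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 100 cases and k=10, every split is 90 training / 10 testing.
splits = list(k_fold_indices(n_cases=100, k=10))
```

Each learner would be trained on `train` and scored on `test` for every split, with the metrics then averaged over the 10 folds.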
Predictive accuracy
Examine each joint observation in the
sample
Add any available evidence for the other
nodes
Update the network
Use the value with the highest probability as
the predicted value
Compare the predicted value with the actual value
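The steps above can be sketched as follows, assuming the network has already been updated for each case and its posterior over the target node is supplied as a dictionary. The state names and probabilities are hypothetical.

```python
def predictive_accuracy(posteriors, actual):
    """Fraction of cases whose most probable state equals the actual state."""
    correct = 0
    for dist, truth in zip(posteriors, actual):
        predicted = max(dist, key=dist.get)  # state with the highest probability
        if predicted == truth:
            correct += 1
    return correct / len(actual)


# Hypothetical posteriors over the target node for four test cases.
posteriors = [
    {"chd": 0.70, "no_chd": 0.30},
    {"chd": 0.20, "no_chd": 0.80},
    {"chd": 0.40, "no_chd": 0.60},
    {"chd": 0.55, "no_chd": 0.45},
]
actual = ["chd", "no_chd", "chd", "no_chd"]
accuracy = predictive_accuracy(posteriors, actual)
```

Note that this metric ignores how confident each prediction was: 0.55 and 0.99 count the same once they cross the majority threshold.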
Information Reward
Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends
on priors
IR = 0
For each state i (p[i] = reported probability, prior[i] = prior):
If i == correct state
IR += log ( p[i] / prior[i] )
else
IR += log ( (1 - p[i]) / (1 - prior[i]) )
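A runnable sketch of a reward with the properties listed above (zero for reporting priors, unbounded below, bounded above by a prior-dependent maximum) is given here. Averaging over states and over cases, the base-2 logarithm, and all names are assumptions of this sketch rather than details taken from the slides.

```python
import math


def bayesian_information_reward(posteriors, priors, actual):
    """Average information reward over cases, relative to the priors.

    posteriors: list of dicts mapping state -> reported probability.
    priors: dict mapping state -> prior probability.
    actual: list of true states, one per case.
    All probabilities must lie strictly between 0 and 1.
    """
    total = 0.0
    for dist, truth in zip(posteriors, actual):
        case_reward = 0.0
        for state, p in dist.items():
            prior = priors[state]
            if state == truth:
                case_reward += math.log2(p / prior)
            else:
                case_reward += math.log2((1.0 - p) / (1.0 - prior))
        total += case_reward / len(dist)  # average over states (assumption)
    return total / len(actual)


priors = {"chd": 0.2, "no_chd": 0.8}  # hypothetical base rates
reward_priors = bayesian_information_reward([priors], priors, ["chd"])
reward_good = bayesian_information_reward(
    [{"chd": 0.9, "no_chd": 0.1}], priors, ["chd"])
reward_bad = bayesian_information_reward(
    [{"chd": 0.9, "no_chd": 0.1}], priors, ["no_chd"])
```

Reporting the priors earns zero, a confident correct prediction earns a positive reward, and a confident wrong prediction is penalized, more heavily as the probability given to the true state approaches zero.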
Experimental Evaluation
Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
Experiment 2:
» Compare CaMML and other standard machine
learners (from Weka)
Evaluation: Weka learners
Naïve Bayes
J48 (version of C4.5)
CaMML – Causal BN learner, using MML
metric
AODE
TAN
Logistic
Experiment 1: ROC Results
[Figure: ROC curves. Annotations: area under curve (AUC); priors; extremes "Everyone at risk!" and "No-one at risk!"]
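The two extremes correspond to the ends of an ROC curve: flagging everyone gives true and false positive rates of 1, and flagging no one gives 0 for both. The area under the curve can be computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below uses that rank-based form; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. labels are 1 (event) or 0 (no event).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 0]
model_scores = [0.8, 0.4, 0.5, 0.2, 0.1]   # hypothetical predicted risks
constant_scores = [0.2] * 5                # same risk reported for every case
auc_model = auc(model_scores, labels)
auc_constant = auc(constant_scores, labels)
```

A predictor that reports the same risk for everyone cannot rank cases, so its AUC is 0.5; this is why ROC results measure relative rather than absolute risk.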
Experiment 2: ROC Results
Experiment 2: Bayesian Info Reward
Summary of Results
Experiment 1 (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
Experiment 2 (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary
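The "1 parameter per arc" point can be made concrete by counting free parameters. The sketch below compares a logistic model with an unrestricted conditional probability table for the same child node. The intercept term and the default of binary variables are assumptions, and the function names are hypothetical.

```python
def logistic_parameters(n_parents):
    """One weight per arc into the child, plus an intercept (assumed)."""
    return n_parents + 1


def cpt_parameters(n_parents, parent_states=2, child_states=2):
    """Free parameters in a full CPT: one distribution per parent configuration."""
    return (child_states - 1) * parent_states ** n_parents


# Parameter counts for a child node with 1 to 8 parents.
comparison = {k: (logistic_parameters(k), cpt_parameters(k)) for k in range(1, 9)}
```

The logistic count grows linearly with the number of parents while the CPT count grows exponentially, so with a fixed amount of data each logistic parameter is estimated from far more cases.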
Conclusions
Busselton & PROCAM models appear to perform
equally well on Busselton data, using an absolute risk
measure (BIR) from the literature
CaMML results suggest the data have high variance
and are too weak to support inference to complex
models. Combining data would help.
Future directions
Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined;
if so, do so
Investigate combination of continuous and discrete
variables in data mining and modeling
Develop new TAKEHEART model using BNs (taking
the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians
References
G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for
calculating the risk of acute coronary events based on the 10-year follow-up
of the Prospective Cardiovascular Münster (PROCAM) study. Circulation,
105(3):310-315, 2002.
M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation
for coronary heart disease: the Busselton Health Study. Australian & New
Zealand Journal of Public Health, 22:747-753, 1998.
C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML
Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data
Management, pages 89-111. Springer-Verlag, 1999.
www.datamining.monash.edu.au/software/camml
C.R. Twardy, A.E. Nicholson, K.B. Korb and J. McNeil. Data Mining
Cardiovascular Bayesian Networks. Technical Report 2004/165, School of
Computer Science and Software Engineering, Monash University, 2004.
C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering
cardiovascular Bayesian networks from the literature. Technical Report
2005/170, School of CSSE, Monash University, 2005.