Trend Analysis in STULONG Data
Jiří Kléma, Lenka Nováková,
Filip Karel, Olga Štěpánková
The Gerstner laboratory for
intelligent decision making and
control
Department of Cybernetics,
Czech Technical University,
Prague
PKDD 2004, Discovery Challenge
Outline
Previous CTU entry
– subgroup discovery (ENTRY), general CVD model
– trend analysis: global approach vs. windowing
Role of windowing in mining trends
– KM, Cox models in medicine
– (symbolic) temporal trends in data mining
Development of windowing approach
– temporal CVD definition
– role of the window length
– multi-feature interactions
Ordinal association rules
– processing of the windowed features
STULONG Data
Four tables: Entry, Control, Letter, Death
Dependent variable: (static) CVD
– CardioVascular Diseases
– Boolean attribute derived from the A2 questionnaire (Control table)
CVD = false
– The patient has no coronary disease.
CVD = true
– The patient has one of these attributes true (Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14)
– diagnoses covered: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischemic heart disease
We remove patients who have diabetes (Hodn4) or cancer (Hodn15) only.
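The labelling rule above can be sketched in a few lines of Python. This is only an illustration: the record layout (a mapping from attribute name to a Boolean), the function name static_cvd, and the reading of "only" as "diabetes or cancer without any CVD attribute" are assumptions, not details given in the slides.

```python
from typing import Mapping, Optional

# Attribute names are taken from the slide; the record layout is assumed.
CVD_ATTRS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_ATTRS = ("Hodn4", "Hodn15")  # diabetes, cancer


def static_cvd(record: Mapping[str, bool]) -> Optional[bool]:
    """Return the static CVD label, or None if the patient is removed."""
    if any(record.get(attr, False) for attr in CVD_ATTRS):
        return True
    if any(record.get(attr, False) for attr in EXCLUDE_ATTRS):
        # Diabetes or cancer without any CVD attribute: drop the patient.
        return None
    return False
```

Returning None for removed patients keeps the removal explicit, so they cannot silently end up in the nonCVD group.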
ENTRY - subgroup discovery
AQ no.6: Are there any differences in the ENTRY
examination for different CVD groups?
Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess the significance of subgroups
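As an illustration of the chi-square test mentioned above, the following sketch computes the Pearson statistic and p-value for a 2x2 table (subgroup membership vs. CVD). It does not reproduce the Statistica module; the function name and the table layout are assumptions, and no continuity correction is applied.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, chi_square_2x2(20, 10, 10, 20) on made-up counts gives a statistic of 20/3 and a p-value just under 0.01.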
Dependencies are relatively weak
Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: “positive effect” of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking,
BMI, cholesterol)
ENTRY - general model
General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
Predict CVD risk
– Identify principal variables
(Chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables
– discretized AGE_of_ENTRY
– discretized BMI
– Cholrisk - derived from CHLST
– AUC = 0.66
CONTROL - trend analysis
AQ no.7: Are there any differences in development
of risk factors for different CVD groups?
– increasing BMI contributes to the onset of CVD
ENTRY table
– ICO (primary key)
– Year of birth
– Year of entry
– Smoking
– Alcohol
– Cholesterol
– Body Mass Index
– Blood pressure
CONTR table
– ICO
– Risk factors followed over 20 years
Motivation
focus on development – trend gradients
possibilities
– contemporary statistical methods used in medicine
• KM, Cox models – analyze something other than what we want
• ANOVA etc. – features have to be developed anyway, lack of
data
– complex sequential data mining
• introduction of structural patterns and then e.g., association
rules
• interesting but again needs more data
our approach
– introduction of simple aggregates
– application of windowing
– statistical evaluation for simple dependencies
– ordinal association rules for more complex relations
Survival curves
Kaplan-Meier or Cox method
– typical example of temporal analysis in medicine
– considers the survival period, BUT disregards the development of risk factors (RFs)
– typical scenario
• distinguish groups of patients (ENTRY table)
• follow their “survival” periods (DEATH or CONTROL table)
Derived trend attributes
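[Figure: observed variable y plotted against decimal time x, with a fitted trend line]
– aggregates: intercept, gradient, correlation coefficient, mean, standard deviation
– y: observed variable
– x: decimal time (~ year + month/12)
– referential time: 1975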
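The derived trend attributes can be sketched as an ordinary least-squares fit of the observed variable against decimal time. The function names are hypothetical, and measuring time as an offset from the referential year 1975 (so that the intercept is the fitted value in 1975) is an assumption about how the reference was used.

```python
import math
from typing import Dict, Sequence

REFERENCE_YEAR = 1975.0  # referential time named on the slide


def decimal_time(year: int, month: int) -> float:
    """Decimal time (year + month/12) as an offset from the reference year."""
    return year + month / 12.0 - REFERENCE_YEAR


def trend_aggregates(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Fit y = intercept + gradient * x and return the five trend aggregates."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("all examinations share the same time stamp")
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    # Correlation is undefined for a constant series; 0.0 is reported instead.
    correlation = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    std = math.sqrt(syy / (n - 1))  # sample standard deviation
    return {
        "gradient": gradient,
        "intercept": intercept,
        "correlation": correlation,
        "mean": mean_y,
        "std": std,
    }
```

With few examinations per patient the gradient is sensitive to single measurements, so the number of points behind each fit is worth keeping alongside the aggregates.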
Global Approach
Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
– pivoting
Patients with no control entries are removed
– about 60 patients
Trend aggregates are calculated
[Pivoted table layout: one row per patient (ICO_1, ICO_2, ...); columns Entry, Contr1, Contr2, ..., ContrM, followed by the aggregates Aggr1, ..., AggrN]
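The pivoting step of the global approach can be sketched as follows: control examinations stored one per row are gathered into a single time-ordered series per patient, with the entry examination first. The input layout, tuples of (ICO, time, value), and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (decimal time, measured value)


def pivot_patient_rows(
    entry: Dict[int, Point],
    controls: Iterable[Tuple[int, float, float]],
) -> Dict[int, List[Point]]:
    """Build one series per patient: entry first, then controls ordered by time."""
    by_patient: Dict[int, List[Point]] = defaultdict(list)
    for ico, time, value in controls:
        by_patient[ico].append((time, value))
    wide: Dict[int, List[Point]] = {}
    for ico, first in entry.items():
        if ico not in by_patient:
            # Patients with no control entries are removed.
            continue
        wide[ico] = [first] + sorted(by_patient[ico])
    return wide
```

Each resulting series can then be passed to whatever routine computes the trend aggregates for that patient.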
Windowing Approach
Constant number of examinations for individuals
Issues:
– window length
• time period vs. number of checkups
• how many checkups to select? 5, 8, 10 tested
– single distinct window or sliding window?
• entry is used as the first examination
• more records per patient, so the records are not independent
– temporal CVD definition
• CVDi - time from the last examination to CVD
• yes/no (yes = CVD in the next year or CVD in future)
– missing values treatment
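A sliding window with a temporal CVD label can be sketched as below. Several choices here are assumptions rather than details from the slides: examinations are already sorted by time with the entry examination first, the "yes" label means CVD within one year of the last examination in the window (the horizon parameter), and windows ending at or after the CVD onset are discarded.

```python
from typing import List, Optional, Sequence, Tuple

Exam = Tuple[float, float]  # (decimal time, measured value)


def sliding_windows(
    exams: Sequence[Exam],
    length: int,
    cvd_time: Optional[float],
    horizon: float = 1.0,
) -> List[Tuple[List[Exam], bool]]:
    """Return (window, label) pairs for every window of `length` examinations."""
    windows: List[Tuple[List[Exam], bool]] = []
    for start in range(len(exams) - length + 1):
        window = list(exams[start:start + length])
        last_time = window[-1][0]
        if cvd_time is not None and cvd_time <= last_time:
            break  # CVD already occurred, so later windows are not used
        # Time from the last examination to CVD decides the yes/no label.
        label = cvd_time is not None and (cvd_time - last_time) <= horizon
        windows.append((window, label))
    return windows
```

Because one patient can contribute several overlapping windows, the resulting records are not independent, which matters for any significance test applied afterwards.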
Windowing – missing values
approach 1:
shift the series
approach 2:
introduce a new value
Window length selection
Window length effects
3 different lengths tested, 5 risk factors considered
compared with the global approach
test used,
– null hypothesis: independence of trends and CVD
– p-values are shown
– windowing: CVD1 vs. nonCVD group
– global: CVD vs. nonCVD group
– the global approach is completely misleading
– some risk factors prefer shorter windows (down-up effect)
– others prefer longer windows (only long-term changes may have an effect)
ControlCount vs. CVD
ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– higher ControlCount goes with lower CVD risk (AUC below 0.5)
– anachronistic attribute
– introduced by the design of the study
ControlCount influences the trend aggregates (gradients tend to be steeper, etc.)
Conclusion: global approach cannot be applied
(at least with the selected aggregates)
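To see how a single attribute such as ControlCount gets an AUC value, the sketch below uses the rank interpretation of AUC: the probability that a randomly chosen CVD patient has a higher score than a randomly chosen nonCVD patient, with ties counted as one half. The function name is hypothetical and this is not the tool used for the slides.

```python
from typing import List, Sequence


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos: List[float] = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg: List[float] = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value below 0.5 means the score is informative but inversely related to the positive class; an uninformative score would sit near 0.5.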
Influence of SYSTGrad (W5)
122 individual CVD1 observations in total
SYSTGrad (W5) equi-depth binned in 5 groups
representation of the CVD1 group increases significantly with increasing SYSTGrad group number
[Figure: bar chart of CVD rate per SYSTGrad group (equi-depth binning, groups 1 to 5) compared with the average rate; y-axis 0.000 to 0.040; bar labels in ascending order: 17, 18, 25, 28, 34, which sum to the 122 CVD1 observations]
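Equi-depth binning and the per-group CVD rate can be sketched as follows. The function names are hypothetical, ties are broken by input order, and the group sizes are only approximately equal when the number of observations is not a multiple of the number of bins.

```python
from typing import List, Sequence


def equi_depth_bins(values: Sequence[float], n_bins: int = 5) -> List[int]:
    """Assign each value a group number 1..n_bins with (nearly) equal group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * n_bins // len(values) + 1
    return groups


def rate_per_group(groups: Sequence[int], labels: Sequence[bool], n_bins: int = 5) -> List[float]:
    """Share of positive labels (e.g. CVD1 observations) within each group."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for group, is_pos in zip(groups, labels):
        counts[group - 1] += 1
        if is_pos:
            positives[group - 1] += 1
    return [p / c if c else 0.0 for p, c in zip(positives, counts)]
```

Comparing each group's rate with the overall rate mirrors the bar chart of CVD rate against the average rate.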
Averaged blood pressure
striking difference between the CVD1 and nonCVD groups
– linear vs. down-up development
– can also be observed for the individuals – see the next slide
– cannot be distinguished by longer windows
[Figure: two panels plotted against time to last examination (9 to 0 years). Left: average systolic blood pressure [mm Hg], series SystCVD and SystHealthy, axis 130 to 142. Right: average diastolic blood pressure [mm Hg], series DiastCVD and DiastHealthy, axis 81 to 88]
Averaged body mass index
difference between the CVD1 and nonCVD groups
– increasing BMI in the CVD1 group
– steady BMI in the nonCVD group
– longer windows express this trend better
– this graph shows that W10 may benefit from the increase between examinations 9 and 8
[Figure: average BMI, series BMICVD and BMIHealthy, axis 25.5 to 28, plotted against time to last examination (9 to 0 years)]
Trend factors – hypothesis testing
Influence of trend aggregates on CVD
– 9 gradients considered: SYST, DIAST, CHLSTMG, TRIGLMG, BMI,
HDL, LDL, POCCIG and MOC
Identified relations
– decreasing HDL cholesterol level relates to the increasing risk of
CVD (p=0.001)
– decreasing POCCIG (the average number of cigarettes smoked
per day) relates to the increasing risk of CVD (p=0.0001)
Again: correlation vs. causality
– statement 1 makes sense: HDL is a ’good’ cholesterol
– statement 2 suggests spurious dependency
[Diagram: patient state (cause) leads to both smoking habits (effect 1) and CVD onset (effect 2)]
Overview of AR found
Group a – relations among trend factors
– a great prevalence of the rules joining together either blood
pressures (DIASTGrad and SYSTGrad) or cholesterol attributes
(HDLGrad, LDLGrad and CHLSTGrad)
Group b - hypothesis to be verified by experts
– target groups are too small: 6% of transactions corresponds to 26 individuals, i.e., instead of the 10 expected diseased patients we actually observe 19
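The arithmetic behind this rule can be spelled out in a short sketch. It assumes that the 19 observed and 10 expected diseased patients are both counted among the 26 individuals covered by the rule; the variable names are hypothetical and the slides do not state the total number of transactions.

```python
# Figures quoted on the slide.
support_fraction = 0.06  # share of transactions covered by the rule
support_count = 26       # individuals covered by the rule
expected_diseased = 10   # diseased patients expected under independence
observed_diseased = 19   # diseased patients actually observed

# Implied total number of transactions (about 433).
implied_transactions = support_count / support_fraction

# Confidence of the rule and its lift over the expected count.
confidence = observed_diseased / support_count     # about 0.73
rule_lift = observed_diseased / expected_diseased  # 1.9
```

A lift of 1.9 looks strong, but with only 26 covered individuals the estimate is unstable, which is why such rules are handed to experts for verification.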
Conclusions
The main scope
– AQ no.7: Are there any differences in development of
risk factors for different CVD groups?
Contributions
– Pitfalls of the global approach revealed
– Windowing enabling multivariate temporal analysis
proposed, effects of various window lengths studied
– Development of the following risk factors may
influence future CVD occurrence:
• DIAST, SYST, BMI, (HDL) cholesterol, (POCCIG)
– Other trends may have or intensify their influence
under specific conditions (BMI trend and overweight,
etc.) – we lack data to prove it