Trend Analysis
in Stulong Data
Jiří Kléma, Lenka Nováková,
Filip Karel, Olga Štěpánková
The Gerstner laboratory for
intelligent decision making and
control
Department of Cybernetics,
Czech Technical University,
Prague
PKDD 2004, Discovery Challenge
Outline
 Previous CTU entry
– subgroup discovery (ENTRY), general CVD model
– trend analysis: global approach vs. windowing
 Role of windowing in mining trends
– KM, Cox models in medicine
– (symbolic) temporal trends in data mining
 Development of windowing approach
– temporal CVD definition
– role of the window length
– multi-feature interactions
 Ordinal association rules
– processing of the windowed features
STULONG Data
 Four tables: Entry, Control, Letter, Death
 Dependent variable: (static) CVD
– CardioVascular Diseases
– Boolean attribute derived from the A2 questionnaire (Control
table)
CVD = false
The patient has no coronary disease.
CVD = true
The patient has at least one of these attributes true
(Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14):
positive angina pectoris, (silent) myocardial infarction,
cerebrovascular accident, ischemic heart disease.
We remove patients who have only diabetes (Hodn4)
or cancer (Hodn15).
ENTRY - subgroup discovery
 AQ no.6: Are there any differences in the ENTRY
examination for different CVD groups?
 Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess the significance of
subgroups
 Dependencies are relatively weak
 Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: “positive effect” of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking,
BMI, cholesterol)
ENTRY - general model
 General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
 Predict CVD risk
– Identify principal variables
(Chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables
– discretized AGE_of_ENTRY
– discretized BMI
– Cholrisk - derived from CHLST
– AUC = 0.66
CONTROL - trend analysis
 AQ no.7: Are there any differences in development
of risk factors for different CVD groups?
– increasing BMI makes a contribution to CVD appearance
ENTRY table: ICO (primary key), year of birth, year of entry, smoking, alcohol, cholesterol, body mass index, blood pressure
CONTR table: ICO, risk factors followed during 20 years
Motivation
 focus on development – trend gradients
 possibilities
– contemporary statistical methods used in medicine
• KM, Cox models – analyze something other than what we want
• ANOVA etc. – features have to be developed anyway, lack of
data
– complex sequential data mining
• introduction of structural patterns and then e.g., association
rules
• interesting but again needs more data
 our approach
– introduction of simple aggregates
– application of windowing
– statistical evaluation for simple dependencies
– ordinal association rules for more complex relations
Survival curves
 Kaplan-Meier or Cox method
– typical example of temporal analysis in medicine
– regards survival period, BUT disregards development of RFs
– typical scenario
• distinguish groups of patients (ENTRY table)
• follow their “survival” periods (DEATH or CONTROL table)
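The Kaplan-Meier estimate named above can be illustrated with a minimal sketch. The function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration; they are not taken from the STULONG tables.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (e.g. years since entry)
    events:    True if the event was observed, False if the patient was censored
    """
    # Distinct times at which at least one event was observed.
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Patients with an observed event exactly at time t.
        failed = sum(1 for d, e in zip(durations, events) if e and d == t)
        survival *= 1.0 - failed / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data: 7 patients, 4 observed events, 3 censored.
durations = [2, 3, 3, 5, 8, 8, 10]
events = [True, True, False, True, False, True, False]
curve = kaplan_meier(durations, events)
```

The curve depends only on event and censoring times, which is why this kind of analysis says nothing about how the risk factors developed in between.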
Derived trend attributes
[Figure: line fitted to y (observed variable) against x (decimal time ~ year + 1/12 month), measured from the referential time (1975)]
– Intercept
– Gradient
– Correlation coefficient
– Mean
– Standard deviation
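A minimal sketch of how the five trend attributes could be computed for one patient and one risk factor. It assumes an ordinary least-squares line and a sample standard deviation (n - 1 in the denominator); the slides do not state which variants were used, and the names `decimal_time` and `trend_aggregates` are hypothetical.

```python
from math import sqrt

REFERENCE_YEAR = 1975  # referential time named on the slide


def decimal_time(year, month):
    """Decimal time relative to the reference year: year + month/12."""
    return (year - REFERENCE_YEAR) + month / 12.0


def trend_aggregates(times, values):
    """Least-squares trend attributes of one risk factor for one patient."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two examinations")
    mean_x = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, values))
    if sxx == 0:
        raise ValueError("all examinations share the same time")
    gradient = sxy / sxx
    return {
        "gradient": gradient,
        "intercept": mean_y - gradient * mean_x,
        # Correlation is undefined for a constant series; 0.0 is used here.
        "correlation": sxy / sqrt(sxx * syy) if syy > 0 else 0.0,
        "mean": mean_y,
        "std": sqrt(syy / (n - 1)),
    }


# Hypothetical systolic pressure readings at yearly checkups.
times = [decimal_time(1976, 0), decimal_time(1977, 0),
         decimal_time(1978, 0), decimal_time(1979, 0)]
aggr = trend_aggregates(times, [130.0, 134.0, 138.0, 142.0])
```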
Global Approach
 Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
 Selected control examinations are transformed
– pivoting
 Patients with no control entries are removed
– about 60 patients
 Trend aggregates are calculated
[Figure: pivoted table with one row per patient (ICO_1, ICO_2, ...) and columns ICO, Entry, Contr1, Contr2, ..., ContrM, Aggr1, ..., AggrN]
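The pivoting step can be sketched as follows. The record layout (dictionaries with `ICO`, `time`, and one key per risk factor) and the function name `pivot_controls` are assumptions for illustration; the real table schema is richer.

```python
def pivot_controls(entry, controls, factor):
    """Turn long-format control records into one time-ordered series per patient.

    entry:    {ICO: {"time": t, factor: value}} from the entry examination
    controls: list of {"ICO": ..., "time": t, factor: value} control records
    Returns   {ICO: [(time, value), ...]} with the entry examination first.
    Patients without any control record are dropped, as on the slide.
    """
    series = {}
    for row in sorted(controls, key=lambda r: (r["ICO"], r["time"])):
        if row.get(factor) is None:
            continue  # skip checkups where this factor was not measured
        series.setdefault(row["ICO"], []).append((row["time"], row[factor]))

    wide = {}
    for ico, checks in series.items():
        if ico in entry:
            first = (entry[ico]["time"], entry[ico][factor])
            wide[ico] = [first] + checks
    return wide


# Hypothetical data: patient 2 has no control examinations and is removed.
entry = {1: {"time": 0.0, "BMI": 25.0}, 2: {"time": 0.5, "BMI": 27.0}}
controls = [
    {"ICO": 1, "time": 2.0, "BMI": 26.0},
    {"ICO": 1, "time": 1.0, "BMI": 25.5},
]
wide = pivot_controls(entry, controls, "BMI")
```

Each resulting series can then be passed to whatever routine computes the trend aggregates.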
Windowing Approach
 Constant number of examinations for individuals
 Issues:
– window length
• time period vs. number of checkups
• how many checkups to select? 5, 8, 10 tested
– single distinct window or sliding window?
• entry is used as the first examination
• more records per patient, so the records are not independent
– temporal CVD definition
• CVDi - time from the last examination to CVD
• yes/no (yes = CVD in the next year or CVD in future)
– missing values treatment
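A minimal sketch of the sliding-window variant with a temporal CVD label. It assumes that windows ending at or after CVD onset are discarded and that "CVD in the next year" means onset within `horizon` years of the last examination in the window; both choices, and the name `sliding_windows`, are assumptions rather than the authors' exact procedure.

```python
def sliding_windows(exams, length, cvd_time=None, horizon=1.0):
    """Cut one patient's series into windows of a fixed number of checkups.

    exams:    [(time, value), ...] sorted by time, entry examination first
    length:   number of examinations per window (e.g. 5, 8 or 10)
    cvd_time: time of CVD onset, or None if the patient stayed healthy
    Returns   [(values, time_to_cvd, cvd_within_horizon), ...]
    """
    windows = []
    for start in range(len(exams) - length + 1):
        window = exams[start:start + length]
        last_time = window[-1][0]
        if cvd_time is not None and cvd_time <= last_time:
            break  # assumption: do not use examinations made after onset
        time_to_cvd = None if cvd_time is None else cvd_time - last_time
        label = time_to_cvd is not None and time_to_cvd <= horizon
        windows.append(([v for _, v in window], time_to_cvd, label))
    return windows


# Hypothetical patient: 7 yearly examinations, CVD onset at year 6.5.
exams = [(float(t), 130.0 + t) for t in range(7)]
patient_windows = sliding_windows(exams, length=5, cvd_time=6.5)
healthy_windows = sliding_windows(exams, length=5)
```

One patient contributes several overlapping windows here, which is the source of the dependence between records noted above.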
Windowing – missing values
approach 1:
shift the series
approach 2:
introduce a new value
Window length selection
Window length effects
 3 different lengths tested, 5 risk factors considered
 compared with the global approach

test used,
– null hypothesis: independence of trends and CVD
– p-values are shown
 windowing: CVD1 vs. nonCVD group
 global: CVD vs. nonCVD group
global approach is completely misleading
prefer shorter windows
down-up effect
prefers longer windows
only long term changes may have effect
ControlCount vs. CVD
 ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– higher ControlCount, lower CVD risk
– anachronistic attribute
– introduced by the design of the
study
 ControlCount influences the trend aggregates: depending on
ControlCount, gradients tend to be steeper, etc.
 Conclusion: global approach cannot be applied
(at least with the selected aggregates)
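To see why an AUC below 0.5 signals an inverse relation, the AUC can be computed as the probability that a randomly chosen CVD patient has a higher score than a randomly chosen non-CVD patient. The sketch below uses that pairwise definition; the function name `auc` and the toy data are made up for illustration.

```python
def auc(scores, labels):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: CVD patients tend to have fewer control examinations.
control_count = [3, 12, 5, 15, 8, 14]
cvd = [True, True, False, False, True, False]
auc_count = auc(control_count, cvd)
auc_negated = auc([-c for c in control_count], cvd)
```

With no ties, negating the score turns an AUC of a into 1 - a, so an attribute with AUC = 0.35 is as informative as one with AUC = 0.65 but points in the opposite direction.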
Influence of SYSTGrad (W5)
 122 individual CVD1 observations in total
 SYSTGrad (W5) equi-depth binned in 5 groups
 representation of the CVD1 group increases significantly with
increasing SYSTGrad group number
[Figure: bar chart of CVD rate per SYSTGrad group (equi-depth binning, groups 1 to 5) with the average rate marked; y-axis from 0.000 to 0.040; bar labels 17, 18, 25, 28 and 34 CVD1 observations]
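Equi-depth binning and the per-group CVD rate can be sketched as below. Ties are broken by input order, the helper names `equi_depth_bins` and `rate_per_group` are hypothetical, and the ten gradients are made-up values, not STULONG data.

```python
def equi_depth_bins(values, n_bins):
    """Assign each value a group number 1..n_bins with (nearly) equal group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank in sorted order decides the group, not the value itself.
        groups[i] = rank * n_bins // len(values) + 1
    return groups


def rate_per_group(groups, labels, n_bins):
    """Fraction of positive labels in each group (NaN for an empty group)."""
    rates = []
    for g in range(1, n_bins + 1):
        members = [l for grp, l in zip(groups, labels) if grp == g]
        rates.append(sum(members) / len(members) if members else float("nan"))
    return rates


# Hypothetical gradients of systolic pressure and CVD1 labels.
syst_grad = [0.5, -1.0, 2.0, 0.1, -0.3, 1.5, 3.0, -2.0, 0.9, 1.1]
cvd1 = [False, False, True, False, False, True, True, False, False, False]
groups = equi_depth_bins(syst_grad, 5)
rates = rate_per_group(groups, cvd1, 5)
```

Because every group holds about the same number of observations, the rates are directly comparable across groups and against the overall average rate.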
Averaged blood pressure
 striking difference in CVD1 and nonCVD groups
– linear vs. down-up development
– can also be observed for the individuals – see the next slide
– cannot be distinguished by longer windows
[Figure: two panels plotted against time to last examination [years], from 9 down to 0. Left: avg. systolic blood pressure [mm Hg], SystCVD vs. SystHealthy, axis 130 to 142. Right: avg. diastolic blood pressure [mm Hg], DiastCVD vs. DiastHealthy, axis 81 to 88]
Averaged body mass index
 difference in CVD1 and nonCVD groups
– increasing BMI in the CVD1 group
– steady BMI in the nonCVD group
– longer windows express this trend better
– this graph shows that W10 may benefit from the increase between
examinations 9 and 8
[Figure: average BMI, BMICVD vs. BMIHealthy, axis 25.5 to 28, plotted against time to last examination [years], from 9 down to 0]
Trend factors – hypothesis testing
 Influence of trend aggregates on CVD
– 9 gradients considered: SYST, DIAST, CHLSTMG, TRIGLMG, BMI,
HDL, LDL, POCCIG and MOC
 Identified relations
– decreasing HDL cholesterol level relates to the increasing risk of
CVD (p=0.001)
– decreasing POCCIG (the average number of cigarettes smoked
per day) relates to the increasing risk of CVD (p=0.0001)
 Again: correlation vs. causality
– statement 1 makes sense: HDL is a ’good’ cholesterol
– statement 2 suggests spurious dependency
[Diagram: patient state (cause) leads both to smoking habits (effect 1) and to CVD onset (effect 2)]
Overview of AR found
 Group a – relations among trend factors
– a great prevalence of rules joining together either blood
pressures (DIASTGrad and SYSTGrad) or cholesterol attributes
(HDLGrad, LDLGrad and CHLSTGrad)
 Group b - hypothesis to be verified by experts
– insufficient target groups: 6% of transactions corresponds to 26
individuals, i.e., instead of 10 prospective diseased patients we
actually observe 19
Conclusions
 The main scope
– AQ no.7: Are there any differences in development of
risk factors for different CVD groups?
 Contributions
– Pitfalls of the global approach revealed
– Windowing enabling multivariate temporal analysis
proposed, effects of various window lengths studied
– Development of the following risk factors may
influence future CVD occurrence:
• DIAST, SYST, BMI, (HDL) cholesterol, (POCCIG)
– Other trends may have or intensify their influence
under specific conditions (BMI trend and overweight,
etc.) – we lack data to prove it