Mining Episode Rules in STULONG dataset
N. Méger1, C. Leschi1, N. Lucas2 & C. Rigotti1
1 INSA Lyon - LIRIS FRE CNRS 2672
2 Université d’Orsay – LRI CNRS UMR 8623
This work has been partially funded by the European Project AEGIS (IST-2000-26450).
Content
Motivation
About WinMiner
Data Mining Effort
Conclusion
Motivation: Data
STULONG data: a 20-year longitudinal study of risk factors related
to atherosclerosis in a population of middle-aged men
Tables ENTRY and CONTROL:
– 1216 patients described by:
• Identification and social characteristics
• Behavior
• Health events
• Physical and biochemical examinations
– From 1 to 21 controls per patient
A sequence of controls for each patient
Motivation: Medical issues
identified risk factors
no treatment available
need to consider a global risk instead of concentrating prevention
efforts on individual risks
risk behaviors dramatically increase the emergence of cardiovascular
disease, but no one knows when
What are the relations between risk factors and the clinical
manifestation of atherosclerosis?
Over which time intervals are these relations valid?
Motivation: WinMiner
WinMiner: a single optimised procedure for finding sequential
patterns in data along with their optimal time
intervals, under user constraints
WinMiner suggests possible temporal dependencies among
occurrences of event types to experts
WinMiner outputs "small" collections of sequential
patterns
About WinMiner
Mining context
large event sequences
episode & episode rules
[Figure: example with event types A, B and C]
About WinMiner
Selecting patterns
support: how many times does an episode/episode rule occur within
an event sequence?
[Figure: examples AB and ABC]
confidence: what is the probability that the RHS of an episode rule
occurs, given that its LHS has already occurred?
[Figure: example ABC]
patterns are selected using:
– a minimum support threshold
– a minimum confidence threshold
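A minimal sketch of how support and confidence could be computed for serial episodes, assuming a simplified counting scheme: one count per start event from which the episode can be completed, in order, within a window span w. The function names, the toy sequence and its timestamps are illustrative assumptions; the slides do not give WinMiner's exact occurrence definition.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, str]  # (timestamp, event type); the sequence is sorted by timestamp


def count_occurrences(seq: Sequence[Event], episode: Sequence[str], w: int) -> int:
    """Count start events from which `episode` can be completed, in order,
    with all matched events lying within a span of at most `w` time units.
    Simplified counting scheme (an assumption, not WinMiner's definition)."""
    count = 0
    for i, (t0, e0) in enumerate(seq):
        if e0 != episode[0]:
            continue
        k = 1  # index of the next episode element to match
        for t, e in seq[i + 1:]:
            if k == len(episode) or t - t0 > w:
                break
            if e == episode[k]:
                k += 1  # greedy: take the earliest matching event
        if k == len(episode):
            count += 1
    return count


def confidence(seq: Sequence[Event], lhs: Sequence[str], rhs: Sequence[str], w: int) -> float:
    """Confidence of the rule lhs => rhs for window span w:
    support of the whole episode divided by the support of its LHS."""
    lhs_support = count_occurrences(seq, lhs, w)
    if lhs_support == 0:
        return 0.0
    return count_occurrences(seq, list(lhs) + list(rhs), w) / lhs_support


# Toy event sequence (hypothetical timestamps)
toy: List[Event] = [(1, "A"), (2, "B"), (3, "A"), (4, "A"),
                    (5, "B"), (6, "B"), (7, "C"), (8, "C")]
support_ab = count_occurrences(toy, ["A", "B"], 2)   # 3
conf_ab_c = confidence(toy, ["A", "B"], ["C"], 4)    # 2/3
```

A pattern would then be kept only if its support and confidence reach the two user-given thresholds.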
About WinMiner
Selecting the optimal window span
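[Figure: confidence of an episode rule plotted against the window span w. Marked on the plot: the minimum confidence threshold; a window span such that the episode rule is frequent; the First Local Maximum (FLM) of confidence, C1, whose window span is the optimal window span; and a later confidence value C2 with C2 <= C1 - C1*decRate.]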
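The First Local Maximum selection on this slide can be sketched as follows. This is one reading of the figure, not WinMiner's actual implementation: the candidate is the highest confidence seen so far among window spans where the rule is frequent and confident, and it is accepted as soon as a later confidence C2 satisfies C2 <= C1 - C1*decRate. The function name and the input layout (a list of `(w, support, confidence)` triples sorted by w) are assumptions.

```python
from typing import Iterable, Optional, Tuple


def first_local_maximum(
    curve: Iterable[Tuple[int, int, float]],
    min_support: int,
    min_conf: float,
    dec_rate: float,
) -> Optional[Tuple[int, float]]:
    """Return (optimal window span, confidence) of the first local maximum,
    or None if no maximum is followed by a sufficient decrease."""
    best: Optional[Tuple[int, float]] = None  # current candidate (w, C1)
    for w, support, conf in curve:
        if support < min_support:
            continue  # the rule is not frequent for this window span
        if best is None:
            if conf >= min_conf:
                best = (w, conf)  # first frequent and confident window span
        elif conf > best[1]:
            best = (w, conf)  # confidence still rising: move the candidate
        elif conf <= best[1] - best[1] * dec_rate:
            return best  # C2 <= C1 - C1*decRate: the candidate is the FLM
    return None


# Hypothetical confidence curve: (window span, support, confidence)
curve = [(1, 2, 0.2), (2, 5, 0.5), (3, 6, 0.7), (4, 7, 0.8),
         (5, 8, 0.75), (6, 9, 0.6), (7, 9, 0.9)]
flm = first_local_maximum(curve, min_support=5, min_conf=0.6, dec_rate=0.2)
# flm == (4, 0.8): the later, higher value at w = 7 is ignored
```

Stopping at the first such maximum is what favours the smallest window span for which the rule is both confident and clearly followed by a drop in confidence.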
About WinMiner
WinMiner:
– checks all possible episode rules that satisfy the
frequency and confidence thresholds
– outputs only the FLM-rules, along with their
respective optimal window sizes
– uses a maximal gap constraint
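A minimal sketch of a maximal gap check, assuming the constraint bounds the time elapsed between two consecutive events of an occurrence (the slides do not spell out the definition). The function name and the parameter `gap_max` are illustrative.

```python
from typing import Sequence


def satisfies_max_gap(timestamps: Sequence[int], gap_max: int) -> bool:
    """True if every pair of consecutive events in an occurrence
    is separated by at most `gap_max` time units."""
    return all(b - a <= gap_max for a, b in zip(timestamps, timestamps[1:]))


ok = satisfies_max_gap([1, 3, 4], gap_max=2)        # gaps 2 and 1: accepted
too_far = satisfies_max_gap([1, 5, 6], gap_max=2)   # first gap is 4: rejected
```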
DM effort: Aims
Give the medical expert:
a means to follow the evolution of risk factors
together with:
(1) the impact of medical intervention
(2) modifications in patients’ behavior
In addition:
– significant time periods of observation
– frequency
– probability
DM effort: Data preprocessing
Mainly focused on table CONTROL (1226 patients / 10572
examinations)
Join operations to bring in information from table ENTRY
Categorization of some factors
Choice of relevant factors according to:
– Medical expertise
– Mining approach
Result: table Contr_Mod_2
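The join and categorization steps might look like the sketch below. All column names (`patient_id`, `date`, `education`, `bmi`) are hypothetical, and the BMI bins are the common WHO cut-offs, used only as a stand-in for the categorization actually chosen with the medical experts.

```python
from typing import Dict, List


def bmi_category(bmi: float) -> str:
    """Discretize BMI with the usual WHO cut-offs (an assumption here)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def join_and_categorize(entry_rows: List[Dict], control_rows: List[Dict]) -> List[Dict]:
    """Attach ENTRY attributes to each CONTROL examination and
    replace the numeric BMI by a category."""
    entry_by_id = {row["patient_id"]: row for row in entry_rows}
    joined = []
    for control in control_rows:
        entry = entry_by_id.get(control["patient_id"])
        if entry is None:
            continue  # examination without a matching ENTRY record
        joined.append({
            "patient_id": control["patient_id"],
            "date": control["date"],
            "education": entry["education"],
            "bmi_cat": bmi_category(control["bmi"]),
        })
    return joined


# Hypothetical rows, for illustration only
entry = [{"patient_id": 1, "education": "university"}]
control = [{"patient_id": 1, "date": "1980-01", "bmi": 27.0},
           {"patient_id": 2, "date": "1980-02", "bmi": 22.0}]
rows = join_and_categorize(entry, control)
```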
DM effort: Data preprocessing
Important factors (according to medical experts):
– cholesterol
– hypertension
– smoking
– physical activity
– age
– diabetes
– alcohol consumption
– BMI
– family anamnesis
– level of education
DM effort: Data preprocessing
From Contr_mod_2 to a large event sequence
For each patient: a subsequence containing all his control
examinations
The coding guarantees that events corresponding to 2 different
patients cannot be associated in the same episode rule
Large event sequence: concatenation of all the subsequences
constructed for the patients.
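The slides do not say how the coding keeps patients apart. One possible way, sketched here purely as an assumption, is to shift each patient's subsequence so that it starts more than the maximal gap after the previous patient's last event. The function name, the input layout and `gap_max` are illustrative.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (time in months, event type)


def build_event_sequence(patients: Dict[int, List[Event]], gap_max: int) -> List[Event]:
    """Concatenate per-patient subsequences (each sorted by time) into one
    large event sequence, leaving more than `gap_max` time units between
    the last event of one patient and the first event of the next."""
    sequence: List[Event] = []
    offset = 0
    for patient_id in sorted(patients):
        events = patients[patient_id]
        if not events:
            continue
        first_time = events[0][0]
        for time, event_type in events:
            sequence.append((offset + time - first_time, event_type))
        # The next patient starts strictly beyond the maximal gap
        offset = sequence[-1][0] + gap_max + 1
    return sequence


# Hypothetical patients and event types
patients = {1: [(0, "A"), (12, "B")], 2: [(5, "A"), (6, "C")]}
large_sequence = build_event_sequence(patients, gap_max=40)
# large_sequence == [(0, "A"), (12, "B"), (53, "A"), (54, "C")]
```

With this layout, any occurrence respecting the maximal gap constraint necessarily stays inside a single patient's subsequence.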
DM effort: Results
Examples:
– "If the patient has no hypercholesterolemia, and if
he sometimes follows his diet, then the patient has
no hypercholesterolemia with a probability of 0.8
within 40 months. This rule is supported by 201
examples in the event sequence."
– "If a patient eats less fat and carbohydrates and
claudication is observed some time later, then
this claudication does not disappear, with a
probability of 0.8 over 30 months. This rule is
supported by 21 examples."
DM effort: Results
Well-known phenomena are recovered:
– an indication that both the pre-processing and the mining of
the data are correct
Added value: suggestions concerning their temporal aspects
To be expected:
– with new data and the new risk factors identified in the last
decade, the discovery of new phenomena along with their
optimal window sizes
Conclusion
With STULONG data: a search for temporal dependencies between
atherosclerosis risk factors and the clinical manifestation of
atherosclerosis, together with their optimal interval/window size
Offers the medical expert a way to make the impact of a risk factor
explicit and to refine its role relative to other factors within a time
interval
Only a few episode rules are obtained, which allows experts to analyse
the outputs manually
Could be applied to other medical data sets to help find unknown
phenomena
New perspectives both for data miners and
physicians