Mining Episode Rules in the STULONG dataset
N. Méger (1), C. Leschi (1), N. Lucas (2) & C. Rigotti (1)
(1) INSA Lyon - LIRIS FRE CNRS 2672
(2) Université d’Orsay - LRI CNRS UMR 8623
This work has been partially funded by the European Project AEGIS (IST-2000-26450).
Content

Motivation

About WinMiner

Data Mining Effort

Conclusion
Motivation: Data

STULONG data: a 20-year longitudinal study of risk factors related
to atherosclerosis in a population of middle-aged men

Tables ENTRY and CONTROL:
– 1216 patients described by:
• Identification and social characteristics
• Behavior
• Health events
• Physical and biochemical examinations
– From 1 up to 21 controls per patient
→ A sequence of controls for each patient
Motivation: Medical issues

identified risk factors

no treatment available

necessity to consider a global risk instead of concentrating prevention
efforts on individual ones

risk behaviors dramatically increase the emergence of cardiovascular
disease, but no one knows when
→ Relations between risk factors and clinical manifestation of
atherosclerosis?
→ Time intervals over which these relations are valid?
Motivation: WinMiner

WinMiner: a single optimised way to find sequential
patterns in data along with their optimal time
intervals, under user constraints

WinMiner suggests possible temporal dependencies
among occurrences of event types to experts

WinMiner outputs "small" collections of sequential
patterns
About WinMiner
Mining context

large event sequences

episode & episode rules
[Figure: diagram of episodes and episode rules over event types A, B and C]
About WinMiner
Selecting patterns

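support: how many times does an episode/episode rule occur within
an event sequence?
[Figure: example episode over A, B and example episode rule over A, B, C]

confidence: what is the probability that the RHS of an episode rule
occurs, given that its LHS has already occurred?
[Figure: example episode rule over A, B, C]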

patterns are selected using:
– a minimum support threshold
– a minimum confidence threshold
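
As a rough illustration of the two measures, here is a minimal Python sketch. It is a simplification and not WinMiner's actual definition: it assumes serial episodes only, counts one occurrence per starting event, and accepts an occurrence when it fits in a window of span w. The names `earliest_end`, `support` and `confidence` are hypothetical.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (timestamp, event type), sorted by timestamp


def earliest_end(seq: List[Event], start: int, episode: Tuple[str, ...]) -> Optional[int]:
    """Earliest end time of a serial occurrence of `episode` starting at seq[start]."""
    t, event_type = seq[start]
    if event_type != episode[0]:
        return None
    pos = start
    for wanted in episode[1:]:
        pos += 1
        # Greedily take the first later event of the wanted type.
        while pos < len(seq) and not (seq[pos][1] == wanted and seq[pos][0] > t):
            pos += 1
        if pos == len(seq):
            return None
        t = seq[pos][0]
    return t


def support(seq: List[Event], episode: Tuple[str, ...], w: int) -> int:
    """Number of start positions from which `episode` occurs within span w (assumed definition)."""
    count = 0
    for i in range(len(seq)):
        end = earliest_end(seq, i, episode)
        if end is not None and end - seq[i][0] <= w:
            count += 1
    return count


def confidence(seq: List[Event], lhs: Tuple[str, ...], rhs: Tuple[str, ...], w: int) -> float:
    """support(LHS followed by RHS) / support(LHS), both within span w."""
    lhs_support = support(seq, lhs, w)
    return support(seq, lhs + rhs, w) / lhs_support if lhs_support else 0.0


events = [(1, "A"), (2, "B"), (3, "C"), (5, "A"), (6, "B"), (9, "A"), (10, "B"), (11, "C")]
print(support(events, ("A", "B"), 3), support(events, ("A", "B", "C"), 3))
print(confidence(events, ("A", "B"), ("C",), 3))
```

A rule would then be kept only if its support and confidence reach the user-chosen minimum thresholds.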
About WinMiner

Selecting the optimal window span
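[Figure: confidence of an episode rule as a function of the window span w. Labels: minimum confidence threshold; the window span such that the episode rule is frequent; the First Local Maximum (FLM) with confidence C1, which gives the optimal window span; a later confidence C2 with C2 <= C1 - C1*decRate.]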
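
The selection of the optimal window span can be sketched as follows. This is one possible reading of the FLM criterion, not WinMiner's implementation: the candidate is the frequent window span whose confidence C1 exceeds that of all smaller frequent spans, and it is accepted once C1 reaches the minimum confidence and a later span shows C2 <= C1 - C1*decRate. The function name and the dictionary inputs are assumptions made for the example.

```python
from typing import Dict, Optional


def first_local_maximum(
    conf_by_w: Dict[int, float],
    supp_by_w: Dict[int, int],
    min_supp: int,
    min_conf: float,
    dec_rate: float,
) -> Optional[int]:
    """Return the window span of the first local maximum, or None if there is none."""
    cand_w: Optional[int] = None
    cand_c = -1.0
    for w in sorted(conf_by_w):
        if supp_by_w[w] < min_supp:
            continue  # the rule is not frequent for this span
        c = conf_by_w[w]
        if c > cand_c:
            # Confidence is still rising: this span becomes the candidate (C1).
            cand_w, cand_c = w, c
        elif cand_c >= min_conf and c <= cand_c - cand_c * dec_rate:
            # Confidence dropped to C2 <= C1 - C1*decRate: the candidate is the FLM.
            return cand_w
    return None


confidences = {1: 0.3, 2: 0.5, 3: 0.8, 4: 0.75, 5: 0.6, 6: 0.9}
supports = {w: 10 for w in confidences}
print(first_local_maximum(confidences, supports, min_supp=5, min_conf=0.7, dec_rate=0.2))
```

In this toy input the confidence peaks at span 3 and later falls far enough, so span 3 is reported even though span 6 has a higher confidence: only the first local maximum is kept.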
About WinMiner

WinMiner:
– checks all possible episode rules satisfying the
frequency and confidence thresholds
– outputs only the FLM-rules, along with their
respective optimal window sizes
– uses a maximal gap constraint
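
A maximal gap constraint can be illustrated with a short Python sketch. The slide does not state the exact semantics, so this assumes the constraint bounds the time elapsed between consecutive events of an occurrence; `satisfies_max_gap` and `max_window_span` are hypothetical names.

```python
from typing import Sequence


def satisfies_max_gap(times: Sequence[int], gap_max: int) -> bool:
    """True if consecutive event timestamps of an occurrence are at most gap_max apart."""
    return all(later - earlier <= gap_max for earlier, later in zip(times, times[1:]))


def max_window_span(episode_length: int, gap_max: int) -> int:
    """Largest span an occurrence of the given length can have under the gap constraint."""
    return (episode_length - 1) * gap_max


print(satisfies_max_gap([0, 12, 24], gap_max=12))
print(satisfies_max_gap([0, 12, 30], gap_max=12))
print(max_window_span(3, gap_max=12))
```

Under this assumption the gap constraint also bounds the window spans worth exploring for a rule of a given length.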
DM effort: Aims

Give the medical expert:
a means to follow both the evolution of risk factors
and:
(1) impact of medical intervention
(2) modifications in patients’ behavior
in addition:
– significant time periods of observation
– frequency
– probability
DM effort: Data preprocessing

Mainly focused on table CONTROL (1226 patients/10572
examinations)

Join operations to bring in information from table ENTRY

Categorization of some factors

Choice of relevant factors according to:
– Medical expertise
– Mining approach

Table Contr_Mod_2
DM effort: Data preprocessing

Important factors (according to medical experts):
– cholesterol
– hypertension
– smoking
– physical activity
– age
– diabetes
– alcohol consumption
– BMI
– family anamnesis
– level of education
DM effort: Data preprocessing

Contr_mod_2 → large event sequence

For each patient: a subsequence containing all his control
examinations

Coding guarantees that events corresponding to 2 different
patients cannot be associated in the same episode rule

Large event sequence: concatenation of all the subsequences
constructed for the patients.
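
The slides do not say which coding was used. One possible scheme, shown here purely as an assumption, is to shift each patient's timestamps so that consecutive patients are separated by more than the maximal gap, which prevents any occurrence from spanning two patients. The function name, the event labels and the month-based timestamps are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (time in months, event type)


def build_event_sequence(controls_by_patient: Dict[int, List[Event]], gap_max: int) -> List[Event]:
    """Concatenate per-patient subsequences, leaving more than gap_max between patients."""
    sequence: List[Event] = []
    offset = 0
    for patient_id in sorted(controls_by_patient):
        events = sorted(controls_by_patient[patient_id])
        if not events:
            continue
        first_time = events[0][0]
        for time, event_type in events:
            # Re-base the patient's timeline so it starts at the current offset.
            sequence.append((offset + time - first_time, event_type))
        # The next patient starts strictly more than gap_max after this one ends.
        offset = sequence[-1][0] + gap_max + 1
    return sequence


controls = {
    1: [(0, "chol_high"), (12, "chol_normal")],
    2: [(5, "smoker"), (17, "non_smoker")],
}
for event in build_event_sequence(controls, gap_max=40):
    print(event)
```

With this coding, any miner that enforces the maximal gap constraint cannot combine the last control of one patient with the first control of the next.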
DM effort: Results

Examples:
– "If the patient has no hypercholesterolemia, and if
he sometimes follows his diet, then the patient has
no hypercholesterolemia with a probability of 0.8
within 40 months. This rule is supported by 201
examples in the event sequence."
– "If a patient eats less fat and fewer carbohydrates and
claudication is observed some time later, then
this claudication does not disappear, with a
probability of 0.8, over 30 months. This rule is
supported by 21 examples."
DM effort: Results

Well-known phenomena are recovered:
– an indication that both the pre-processing and the mining of the
data are correct

Added-value: suggestion concerning their temporal aspects

To be expected:
– with new data and the new risk factors identified in the last
decade, discovery of new phenomena along with their
optimal window sizes
Conclusion

With STULONG data: searching for temporal dependencies between
atherosclerosis risk factors and clinical manifestation of
atherosclerosis that have an optimal interval/window size

Offers the medical expert a way to make the impact of a risk
factor explicit and to refine its role relative to other factors within a time
interval

Only a few episode rules are obtained, which lets experts analyse
the outputs manually

Could be applied to other medical data sets to help in finding unknown
phenomena
→ New perspectives both for data miners and
physicians