MUDIM (Petr Šimeček, Euromise)

System for multidimensional compositional models (Radim Jiroušek)
- C++ code, distributed as an R package
- focused on medical applications

Contents:
- idea of conditional independence and (de)composition
- possible applications of MUDIM
- expert system
- data mining
- STULONG dataset
CI - Theory of Storks
Do storks deliver newborns?
BIRTH RATE and STORK POPULATION are statistically connected.
No! [Diagram adds a third node, ENVIRONMENT, alongside BIRTH RATE and STORK POPULATION.]
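The stork argument can be checked on a small made-up joint distribution. The R sketch below assumes that STORK POPULATION and BIRTH RATE each depend only on ENVIRONMENT; all probabilities and the names `p_env`, `p_stork` and `p_birth` are illustrative and do not come from the talk. It shows that the two variables are then connected marginally but independent once ENVIRONMENT is known.

```r
# Illustrative numbers only; none of them come from the talk.
lv <- c("low", "high")
p_env <- c(rural = 0.5, urban = 0.5)                 # P(ENVIRONMENT)
# P(STORK | ENVIRONMENT) and P(BIRTH | ENVIRONMENT); rows = environment
p_stork <- matrix(c(0.2, 0.8,
                    0.8, 0.2), nrow = 2, byrow = TRUE,
                  dimnames = list(names(p_env), lv))
p_birth <- matrix(c(0.3, 0.7,
                    0.7, 0.3), nrow = 2, byrow = TRUE,
                  dimnames = list(names(p_env), lv))

# Joint P(E, S, B) = P(E) * P(S | E) * P(B | E)
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(ENV = names(p_env), STORK = lv, BIRTH = lv))
for (e in 1:2) for (s in 1:2) for (b in 1:2) {
  joint[e, s, b] <- p_env[e] * p_stork[e, s] * p_birth[e, b]
}

# Marginal table of STORK x BIRTH (ENVIRONMENT summed out)
sb <- apply(joint, c(2, 3), sum)
marginal_gap <- max(abs(sb - outer(rowSums(sb), colSums(sb))))

# Within each environment the conditional table factorises
conditional_gap <- max(sapply(1:2, function(e) {
  cond <- joint[e, , ] / sum(joint[e, , ])
  max(abs(cond - outer(rowSums(cond), colSums(cond))))
}))

print(c(marginal = marginal_gap, conditional = conditional_gap))
```

A clearly positive `marginal` gap together with a `conditional` gap that is zero up to rounding is what "statistically connected, but conditionally independent given ENVIRONMENT" means here.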
CI – Weather
[Diagram nodes: WEATHER YESTERDAY, WEATHER TODAY, WEATHER TOMORROW]
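The diagram itself is not preserved in the transcript. One common reading of such a weather example, stated here as an assumption rather than as the slide's content, is a Markov chain: once today's weather is known, yesterday's weather adds nothing to the prediction of tomorrow's. With $W_{y}$, $W_{t}$, $W_{m}$ as hypothetical symbols for the weather yesterday, today and tomorrow, that conditional independence reads:

```latex
P(W_{m} \mid W_{t}, W_{y}) = P(W_{m} \mid W_{t})
\quad\Longleftrightarrow\quad
P(W_{y}, W_{m} \mid W_{t}) = P(W_{y} \mid W_{t})\, P(W_{m} \mid W_{t})
```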
CI – Sample Medical Data
[node] = variable (attribute), e.g. AGE, BLOOD PRESSURE, …
[edge] = (unconditional) statistical connection (correlation) between the pair of variables
CI – Storks & Weather
[Diagram nodes: ENVIRONMENT, STORK POPULATION, BIRTH RATE]
[Diagram nodes: YESTERDAY, TODAY, TOMORROW]
CI – Sample Medical Data
[node] = variable (attribute), e.g. AGE, BLOOD PRESSURE, …
[arrow] = causality between the pair of variables
Locality - illustration
[Diagram: variable X, directly explanatory variables for X, other variables]
If we know the values of the directly explanatory variables for X, then knowledge of the other explanatory variables is useless for predicting X.
Applications – Expert Systems
Causality
Idea of Compositional Models
$$\pi(X_1, X_2) \triangleright \kappa(X_2, X_3) = \frac{\pi(X_1, X_2)\,\kappa(X_2, X_3)}{\kappa(X_2)}$$
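Reading the slide's formula as π(X1,X2) · κ(X2,X3) / κ(X2), the composition of two discrete distributions can be sketched in a few lines of R. The helper `compose2` and the two input tables are hypothetical illustrations; they are not part of the MUDIM package interface, which the transcript does not show.

```r
# Composition of two discrete distributions sharing the variable X2:
#   result(x1, x2, x3) = pi(x1, x2) * kappa(x2, x3) / kappa(x2)
# Hypothetical helper, not part of the MUDIM API.
compose2 <- function(pi12, kappa23) {
  stopifnot(ncol(pi12) == nrow(kappa23))
  kappa2 <- rowSums(kappa23)                 # marginal kappa(X2)
  # The composition is defined only if kappa(x2) > 0 wherever pi(x2) > 0
  stopifnot(all(kappa2[colSums(pi12) > 0] > 0))
  out <- array(0, dim = c(nrow(pi12), ncol(pi12), ncol(kappa23)))
  for (i in seq_len(nrow(pi12))) {
    for (j in seq_len(ncol(pi12))) {
      for (k in seq_len(ncol(kappa23))) {
        if (kappa2[j] > 0) {
          out[i, j, k] <- pi12[i, j] * kappa23[j, k] / kappa2[j]
        }
      }
    }
  }
  out
}

# Made-up input tables: rows/columns index the values of the variables
pi12    <- matrix(c(0.1, 0.3,
                    0.4, 0.2), nrow = 2, byrow = TRUE)   # pi(X1, X2)
kappa23 <- matrix(c(0.25, 0.25,
                    0.10, 0.40), nrow = 2, byrow = TRUE) # kappa(X2, X3)

comp <- compose2(pi12, kappa23)
print(sum(comp))                  # total probability of the composed table
print(apply(comp, c(1, 2), sum))  # (X1, X2) marginal of the composed table
```

In this sketch the composed table is again a probability distribution, and its (X1, X2) marginal reproduces `pi12`; only the conditional κ(X3 | X2) is taken from the second argument.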
Applications – Expert Systems
Causality
What is the distribution of … if we know … ?
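The variables in the slide's question are not preserved in the transcript. As a minimal sketch of this kind of query, assuming a joint distribution of two hypothetical binary variables `DISEASE` and `SYMPTOM` with made-up probabilities, conditioning amounts to selecting the entries that match the evidence and renormalising:

```r
# Hypothetical joint distribution P(DISEASE, SYMPTOM); numbers are made up.
joint <- matrix(c(0.30, 0.20,
                  0.10, 0.40), nrow = 2, byrow = TRUE,
                dimnames = list(DISEASE = c("no", "yes"),
                                SYMPTOM = c("no", "yes")))

# Evidence: SYMPTOM = "yes". Keep the matching column and renormalise.
evidence_col <- joint[, "yes"]
posterior <- evidence_col / sum(evidence_col)   # P(DISEASE | SYMPTOM = "yes")
print(posterior)
```

With many variables the full joint table becomes too large to store, which is where a factorised representation such as a compositional model is useful.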
Data Mining
We don’t know “anything”: there are lots of variables and lots of possible relations between them.
We need to formulate possible hypotheses, suggest some promising models, etc. (useful in pre-research).
Data Mining
[Diagram: Variables, Data]
Direction of Causality Problem
[diagram] is equivalent to [diagram]
[diagrams] are equivalent, but they are not equivalent to [diagram]
STULONG Dataset
= dataset containing research data on cardiovascular disease (1976-79)
- 1417 patients (Czech middle-aged men)
- 244 attributes surveyed with each patient at the entry examination
- 37 selected attributes are described here
(Incomplete) List of Attributes
- AGE
- MARITAL STATUS
- EDUCATION
- OCCUPATION
- PHYSICAL ACTIVITY
- TRANSPORT TO JOB
- SMOKING
- ALCOHOL
- TEA AND COFFEE
- MYOCARDIAL INFARCTION
- HYPERTENSION
- ICTUS
- HYPERLIPIDEMIA
- CHEST PAIN
- ASTHMA
- HEIGHT & WEIGHT
- BLOOD PRESSURE
- …
Graph of Correlated Pairs
464 of 666 possible pairs are statistically connected (p=0.05).
[Graph nodes: RESP, ACT.IN.JOB, EDUC, ACT.AFTER.JOB, MARIT.STAT, TRANSPORT, AGE, TRANSPORT.TIME, URINE, SMOKING, TRIGL, SMOKING.YR, CHLST, ALCOHOL.FREQ, SUBSC, BEER.DAILY, TRIC, WINE.DAILY, DIAST2, LIQ.DAILY, SYST2, COFFEE, DIAST1, TEA, SYST1, SUGAR, WEIGHT, IM, HT, HEIGHT, ASTHMA, HTD, PAIN.LL, HTL, PAIN.CHEST, DIABET, HYPLIP]
Graph of Correlated Pairs 2
160 of 666 possible pairs are statistically connected (p=0.05/666).
[Graph over the same 37 variables as in the previous graph]
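The two thresholds can be related by simple arithmetic: 37 attributes give 666 unordered pairs, and the level 0.05/666 has the form of a Bonferroni correction for testing all of them (the slides do not name the correction or the test used, so this is an interpretation). A short R sketch, with hypothetical p-values in `p_raw` for illustration only:

```r
n_vars    <- 37
n_pairs   <- choose(n_vars, 2)   # number of unordered pairs of attributes
alpha     <- 0.05
threshold <- alpha / n_pairs     # per-pair level after Bonferroni correction

# Hypothetical p-values for three pairs; not taken from the STULONG data
p_raw <- c(0.03, 1e-04, 2e-06)

uncorrected <- p_raw < alpha
corrected   <- p_raw < threshold
# Equivalent decision via adjusted p-values from base R's p.adjust()
adjusted    <- p.adjust(p_raw, method = "bonferroni", n = n_pairs) < alpha

print(n_pairs)
print(threshold)
print(rbind(uncorrected, corrected, adjusted))
```

The stricter level keeps only pairs whose connection would survive testing all 666 pairs at once, which is consistent with the drop from 464 to 160 connected pairs.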
56 arrows
[Directed graph over the same 37 variables]
Risk Factors for Hypertension
> summary(glm(HT ~ HYPLIP + IM + AGE + SUBSC, data = C, family = "binomial"))
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.322730   1.274252  -3.392 0.000693 ***
IM           1.246937   0.513342   2.429 0.015138 *
HYPLIP       1.126383   0.333971   3.373 0.000744 ***
SUBSC        0.009521   0.003978   2.393 0.016699 *
AGE          0.245182   0.136678   1.794 0.072835 .
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Risk Factors for Hypertension
Interpretation:
- HYPERLIPIDEMIA and IM each roughly triple the odds of hypertension
- Each three years of AGE roughly double the odds of hypertension
- There is also a small but demonstrable connection to the skinfold above musculus subscapularis (SUBSC)
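The interpretation follows from exponentiating the logistic-regression estimates, since each coefficient is a change in log-odds per unit of its variable. A small R sketch using the estimates printed in the coefficient table (the vector name `coefs` is just for illustration):

```r
# Estimates copied from the coefficient table of the fitted model
coefs <- c(IM = 1.246937, HYPLIP = 1.126383, SUBSC = 0.009521, AGE = 0.245182)

# exp(coefficient) = multiplicative change in the odds per one unit
odds_ratio <- exp(coefs)
print(round(odds_ratio, 2))

# Three units of AGE multiply the odds by exp(3 * 0.245182)
age_3 <- exp(3 * coefs[["AGE"]])
print(round(age_3, 2))
```

The odds ratios come out at about 3.5 for IM and 3.1 for HYPLIP, and three units of AGE give a factor of about 2.1, which matches "triple" and "double" as rounded statements. The AGE estimate has p ≈ 0.07 in the table, so that factor is the least certain of the three.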