Transcript Taxinge1

Brain Research and Data Mining
Stefan Arnborg, KTH and SICS
http://www.nada.kth.se/~stefan
Visualization or Statistics?
• A good visualization strikes the investigator
between the eyes with the truth
J. Tukey
• The human perception system is biased towards
wishful thinking - we normally see what we
want to see.
[Figure: scatter plot of points in the unit square; both axes run from 0 to 1]
The human eye finds structure
Bayes factor is 1.2 in favor of no structure vs structure
The points are generated completely randomly
[Figure: cumulative plots of the x- and y-coordinates; horizontal axis 0 to 1, vertical axis 0 to 100]
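A minimal sketch of how such a cumulative plot can be checked numerically rather than by eye. The sample size of 100, the fixed seed, and the function name `ks_distance_from_uniform` are assumptions for illustration; the slides do not say how the points or the Bayes factor were computed.

```python
import random

def ks_distance_from_uniform(values):
    """Largest vertical gap between the empirical CDF of `values`
    and the CDF of the uniform distribution on [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    gap = 0.0
    for i, v in enumerate(ordered):
        # The empirical CDF jumps from i/n to (i+1)/n at v.
        gap = max(gap, (i + 1) / n - v, v - i / n)
    return gap

rng = random.Random(0)  # fixed seed, chosen arbitrarily
points = [(rng.random(), rng.random()) for _ in range(100)]
gap_x = ks_distance_from_uniform([x for x, _ in points])
gap_y = ks_distance_from_uniform([y for _, y in points])
print(f"max gap, x: {gap_x:.3f}  y: {gap_y:.3f}")
```

A small gap for both coordinates means the cumulative plots hug the diagonal, as expected for completely random points.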
Variables in test matrix
scanid
Diagnosis (A or C)
Demographics: Gender, Height, Weight, BMI, Hand, Age-MRI, Birth-Mon, Age-Pma
Blood tests: B-MCV-01, S-ALAT-K1, S-ASAT-K1, S-CDT-B1, S-GLU-K2, S-GT-K1, S-KOL-K1, S-LDL-B1, S-PROL-K1, S-CDT-02, S-K-K1, S-TG-K1
MR Volumes
MR Volumes in test matrix
(144 subjects)
BrsCSF    BrsGrey   BrsWhite
Cer-CSF   CerGrey   CerWhite
FroCSF    FroGrey   FroWhite
OccCSF    OccGrey   OccWhite
ParCSF    ParGrey   ParWhite
SubCSF    SubGrey   SubWhite
TemCSF    TemGrey   TemWhite
VenCSF    VenGrey   VenWhite
Rel-volGrey  Total-intr  Total-CSF  Total-Grey  Total-White
Int-nocl  Int-blood
CSF/Grey  CSF/Total  Grey/Total  White/Total
Vermis (manually traced, 109 subjects):
CH TV AV PSV PIV
Thomas Bayes (c. 1701-1761)
• If I suspect that a coin used for betting is
unbalanced, how should I test it? By inverse probability.
• Prior: Before the experiment my probability for
heads is uniformly distributed between 0 and 1.
• Posterior: After the experiment my probability is
described by a ‘beta distribution’.
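A minimal sketch of the coin example, assuming the experiment is a sequence of independent tosses. With a uniform prior, which is Beta(1, 1), observing h heads and t tails gives the posterior Beta(1 + h, 1 + t). The function names and the counts 7 and 3 are illustrative, not from the slides.

```python
from math import lgamma, exp, log

def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Parameters of the Beta posterior; a = b = 1 is the uniform prior."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of Beta(a, b): the posterior expected probability of heads."""
    return a / (a + b)

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p))

a_post, b_post = beta_posterior(heads=7, tails=3)  # made-up counts
print(a_post, b_post, beta_mean(a_post, b_post))
```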
C. S. Peirce (1839 - 1914).
Pragmaticism: How does
our understanding of Nature
develop? How should Science
be developed?
Semeiotics: Meaning is
created by signs sent around in
the nervous system. Scientific
knowledge is created in a
never-ending process of
discontent with the current
theories, which forces new
models of thought.
C. S. Peirce (1839 - 1914).
The Sign of Three
Umberto Eco, Thomas Sebeok
Indiana University Press, 1983.
Chance, Love, and Logic:
C. S. Peirce, 1923
A person is not absolutely an
individual. His thoughts are
what he is ‘saying to himself’,
that is, saying to that other self
that is coming into life in the
flow of time. When one
reasons it is that critical self
that one is trying to persuade.
It is a necessity of Logic that
every logical evolution of
thought should be dialogic.
Every thought is a sign.
Sherlock Holmes: common sense
inference
The techniques used by Sherlock Holmes are
modelled on Conan Doyle’s
professor in medical school,
who followed the
methodological tradition of
Hippocrates and Galen.
Abductive reasoning, first
spelled out by Peirce, is found in
217 instances in the Sherlock
Holmes adventures, 30 of them
in the first novel, ‘A Study in
Scarlet’.
Bayes’ factor
• Choice between two hypotheses, H1 and H2,
given experimental/observational data D
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}$$
Posterior odds = Bayes factor × prior odds
Bayes factor 8 is
significant,
32 is strong
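A minimal sketch of the odds formula, assuming a coin-tossing experiment with H1 "the coin is fair" and H2 "the bias is unknown, uniform on [0, 1]". This choice of hypotheses and the function names are assumptions for illustration.

```python
from math import lgamma, exp, log

def log_marginal_fair(heads, tails):
    """log P(D | H1): every toss has probability 1/2."""
    return (heads + tails) * log(0.5)

def log_marginal_uniform(heads, tails):
    """log P(D | H2): integral of p^h (1-p)^t dp = h! t! / (h+t+1)!."""
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def bayes_factor(heads, tails):
    """P(D | H1) / P(D | H2)."""
    return exp(log_marginal_fair(heads, tails) - log_marginal_uniform(heads, tails))

def posterior_odds(bf, prior_odds=1.0):
    """Posterior odds = Bayes factor x prior odds."""
    return bf * prior_odds

balanced = posterior_odds(bayes_factor(5, 5))    # data that look fair
lopsided = posterior_odds(bayes_factor(10, 0))   # data that look biased
print(balanced, lopsided)
```

With even prior odds, five heads and five tails favour the fair coin, while ten heads in a row favour the unknown-bias alternative.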
Hierarchical models
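• Parametrized model:
$$H_\lambda:\ f(x \mid \lambda),\quad \lambda \in \Lambda$$
$$f(\lambda \mid x) \propto f(x \mid \lambda)\, f(\lambda)$$
• Hierarchical or composite model:
prior f(λ) for λ,
H1: f(x | λ) and f(λ),
$$P(D \mid H_1) = \int f(D \mid \lambda)\, f(\lambda)\, d\lambda = \int \prod_i f(d_i \mid \lambda)\, f(\lambda)\, d\lambda$$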
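A minimal sketch of the marginal likelihood integral, assuming binary observations with success probability as the single parameter and a uniform prior on [0, 1]. The midpoint rule and all names are illustrative choices. For this model the integral has the closed form h! t! / (h + t + 1)!, which is used as a check.

```python
from math import factorial

def likelihood(data, lam):
    """Product over i of f(d_i | lam) for outcomes d_i in {0, 1}."""
    result = 1.0
    for d in data:
        result *= lam if d == 1 else 1.0 - lam
    return result

def marginal_likelihood(data, prior=lambda lam: 1.0, steps=10000):
    """Midpoint-rule approximation of the integral of f(D | lam) f(lam) over [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * width
        total += likelihood(data, lam) * prior(lam)
    return total * width

data = [1, 1, 0, 1, 0, 1, 1]  # made-up observations
approx = marginal_likelihood(data)
h = sum(data)
t = len(data) - h
exact = factorial(h) * factorial(t) / factorial(h + t + 1)
print(approx, exact)
```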
Hypotheses in test matrix
• H1: (no effect) a data column is generated
independently of diagnosis (composite model)
• H2: the data for controls are generated by one
composite model, for affected by another one.
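A minimal sketch of the H1 versus H2 comparison, assuming a binary data column and a uniform prior on its success probability. This is an assumption made to keep the integrals in closed form, not the model used for the MR volumes. The ratio is P(D|H2) / P(D|H1), so values above 1 read as a sign of effect.

```python
from math import lgamma, exp

def log_marginal(ones, zeros):
    """log of the integral of p^ones (1-p)^zeros dp under a uniform prior."""
    return lgamma(ones + 1) + lgamma(zeros + 1) - lgamma(ones + zeros + 2)

def effect_bayes_factor(controls, affected):
    """P(D | H2) / P(D | H1) for a binary column split by diagnosis."""
    c1, c0 = sum(controls), len(controls) - sum(controls)
    a1, a0 = sum(affected), len(affected) - sum(affected)
    log_h1 = log_marginal(c1 + a1, c0 + a0)                # one shared model
    log_h2 = log_marginal(c1, c0) + log_marginal(a1, a0)   # one model per group
    return exp(log_h2 - log_h1)

same = effect_bayes_factor([0, 1] * 5, [0, 1] * 5)      # identical groups
different = effect_bayes_factor([0] * 10, [1] * 10)     # fully separated groups
print(same, different)
```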
Effect for TemGrey, not for CerGrey
Bayes factor 0.4: weak evidence of no effect
Legend: + affected, o controls
Bayes factor 4: weak sign of effect
Difference women-men
BF 0.1: no effect for women
BF 4: weak sign of effect for men
Mass testing effects, confounders
• In a 1000-column table (about 500,000 column pairs) there are ~5000
accidental associations at the 1% level.
• Bayesian analysis, properly applied, avoids all
problems of overfitting and mass testing.
• Observational studies are prone to misleading
conclusions from known and unknown confounders.
• Causal graphical models are a tool to avoid this.
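The figure of roughly 5000 accidental associations can be reproduced with a short calculation. It assumes that associations are counted between pairs of columns and that no real association exists. The function name is illustrative.

```python
from math import comb

def expected_false_associations(columns, level):
    """Expected number of column pairs significant at `level` by chance alone."""
    pairs = comb(columns, 2)  # number of distinct column pairs
    return pairs * level

pairs_1000 = comb(1000, 2)
expected = expected_false_associations(1000, 0.01)
print(pairs_1000, round(expected))
```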
Compensating mass testing
• Bonferroni 1937:
For level α and n tests, use level α/n
• Benjamini and Hochberg 1995: control the False Discovery Rate
• Composite Bayes model (1763-2001):
Theoretically optimal procedure, but requires
an explicit (composite) alternative to the null
hypothesis (with nuisance parameters).
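A minimal sketch of the first two corrections, using the standard Bonferroni rule and the Benjamini-Hochberg step-up procedure for the false discovery rate. The p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject test i when p_i <= alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= k * alpha / n."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / n:
            cutoff = rank
    rejected = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
by_bonferroni = bonferroni(p_values)
by_fdr = benjamini_hochberg(p_values)
print(sum(by_bonferroni), sum(by_fdr))
```

On these values the false discovery rate procedure rejects one hypothesis more than Bonferroni, which is the usual trade of strictness for power.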
Dependence on sample
(144 subject sample)
BrsCSF        7.4
TemCSF        6.6
VenWhite      6.3
Total-CSF     4.2
SubWhite      3.5
VenCSF        3.1
FrCSF         2.9
SubCSF        2.6
BMI           1.9
Rel-volGrey   1.6
S-GLU-K2      1.6
CerGrey       1.4

(109 subject sample)
PSV           9.5
BrsCSF        7.7
TemCSF        7.3
VenCSF        4.8
TV            4.7
Total-CSF     4.4
SubCSF        3.5
VenWhite      3.4
FrCSF         3.4
age           3.4
AV            2.3
SubWhite      2.2
OccWhite      1.1
S-GLU-K2      1.0
[Figure: two plots with vertical axes from 0 to 1; horizontal axes Age-first-MRI (20 to 70) and age (20 to 60)]
109 sample not matched with respect to age!
Gender differences
[Figure: two plots of SubWhite with vertical axes from 0 to 1, one panel for men and one for women]
Graphical models
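[Diagrams: three undirected graphs on the nodes X, Y, Z]
f(x,y,z) = f(x,z) f(x,y) / f(x)   (Y and Z independent given X)
f(x,y,z) = f(x) f(y) f(z)   (X, Y, Z mutually independent)
f(x,y,z)   (no independence assumed)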
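A minimal numerical check of the factorization f(x,y,z) = f(x,z) f(x,y) / f(x), assuming binary variables where Y and Z each depend only on X. The probability tables are made up for illustration.

```python
from itertools import product

f_x_prior = {0: 0.3, 1: 0.7}
f_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # indexed [x][y]
f_z_given_x = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}  # indexed [x][z]

# Joint distribution in which Y and Z are independent given X.
joint = {
    (x, y, z): f_x_prior[x] * f_y_given_x[x][y] * f_z_given_x[x][z]
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(table, keep):
    """Sum out every coordinate whose index is not in `keep`."""
    out = {}
    for key, p in table.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

f_x = marginal(joint, (0,))
f_xy = marginal(joint, (0, 1))
f_xz = marginal(joint, (0, 2))

max_error = max(
    abs(p - f_xz[(x, z)] * f_xy[(x, y)] / f_x[(x,)])
    for (x, y, z), p in joint.items()
)
print(max_error)
```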
144 sample without Vermis variables
[Graph figure; nodes: Totalwhite, SubCSF, Diagnosis, BrsCSF, TotalCSF, SubWhite, TemCSF, VenWhite, FrCSF, FrWhite, ParCSF]
106 sample with Vermis variables
[Graph figure; nodes: TV, Diagnosis, PIV, PSV, TotalGrey]
Graphical models, directed
[Diagrams: three directed graphs on the nodes X, Y, Z]
f(x,y,z) = f(x) f(y|x) f(z|y)
f(x,y,z) = f(x) f(y) f(z)
f(x,y,z) = f(x) f(y) f(z|x,y)
Experimental vs observational data
• Is there an association between treatment and
recovery?
• Is there a causal link, or a backdoor path (confounder)?
• Can we decide whether a patient would have recovered with a
different treatment?
• Can we decide which treatment has the best chance
of recovery for a given patient?
Cause or effect?
• The association between drinking red wine and good
health has long been known. Drinking 1 litre
a day is equivalent, for life insurance purposes,
to temperance (Skandia-If statistics, 1998)
• Does drinking red wine promote health?
• Does sound lifestyle promote drinking red wine?
• Or both? Causes are today only hypotheses!
(Svenska Dagbladet Sept 3 2001)
Causal graphs-Bayesian networks
[Diagram: X -> Y]
• Statistical DAG: f(x,y) = f(y|x) f(x)
• Causal graph: an arrow means causation:
y <- F(x,e)
Controlling eelworms by fumigants
(Cochran 1981)
[Diagram: causal graph on the nodes X, Y, B, Z0, Z1, Z2, Z3]
X: Fumigants
Y: Yield
B: Birds
Z0: Eelworms in winter
Z1: Eelworms at treatment
Z2: Eelworms after treatment
Z3: Eelworms at end of season
$$P(y \mid do(x)) = \sum_{z_0} P(y \mid x, z_0)\, P(z_0)$$
Must condition on Z0, or on Z1 and B.
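A minimal sketch of adjusting for Z0, assuming binary X (fumigant used), binary Z0 (many eelworms in winter) and binary Y (good yield). All probabilities are made up. The sketch contrasts the adjusted quantity, the sum over z0 of P(y | x, z0) P(z0), with the plain conditional P(y | x), which weights by P(z0 | x) instead.

```python
p_z = {0: 0.5, 1: 0.5}            # P(z0)
p_x1_given_z = {0: 0.2, 1: 0.8}   # P(x = 1 | z0): infested fields get treated more often
p_y1_given_xz = {                 # P(y = 1 | x, z0), keyed by (x, z0)
    (0, 0): 0.90, (1, 0): 0.95,
    (0, 1): 0.30, (1, 1): 0.40,
}

def p_x_given_z(x, z):
    """P(x | z0) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]

def adjusted(x):
    """Sum over z0 of P(y = 1 | x, z0) P(z0)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

def observed(x):
    """Plain conditional P(y = 1 | x), which weights by P(z0 | x)."""
    norm = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z) / norm

print("adjusted:", adjusted(0), adjusted(1))
print("observed:", observed(0), observed(1))
```

With these numbers the unadjusted comparison makes the fumigant look harmful, while the adjusted one shows a benefit in both strata of Z0.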
Classification (Cheeseman, Stutz)
[Figure: data table with cases as rows and variables as columns (entries such as a, b), plus a hidden class column]
Model assumption:
within each class,
columns are generated
independently of each
other.
(Other options exist
for numerical data)
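A minimal sketch of the model assumption, not of the classification program itself. Given class priors and per-class column distributions, the probability of a case within a class is the product over columns, and Bayes' rule gives the class membership probabilities. All numbers and names are illustrative.

```python
def class_posterior(case, class_priors, column_probs):
    """P(class | case) when columns are independent within each class.

    column_probs[c][j][v] is P(column j has value v | class c).
    """
    weights = []
    for c, prior in enumerate(class_priors):
        w = prior
        for j, value in enumerate(case):
            w *= column_probs[c][j][value]  # independence within the class
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

class_priors = [0.5, 0.5]
column_probs = [
    [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}],  # class 0
    [{"a": 0.2, "b": 0.8}, {"a": 0.3, "b": 0.7}],  # class 1
]
post = class_posterior(("a", "a"), class_priors, column_probs)
print(post)
```

A full classifier would also have to estimate the priors, the column distributions and the number of classes from the data; this sketch takes them as given.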
Classification explains data!
[Diagrams: one graph on X, Y, Z, W with an additional node H; one graph on X, Y, Z, W alone]
Autoclass1
Autoclass10
Autoclass10 vs CSF
Autoclass100
Mining causal chains
Are there pairs of variables where the association is different
for controls than for affected? Can this indicate a regulation path
that is disturbed for affected?
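A minimal sketch of comparing the association of one variable pair between the two groups. It uses the Pearson correlation per group as a simple stand-in; the slides do not say which measure of association was used. The rows are made up, and the coding 'A' for affected and 'C' for controls follows the Diagnosis column.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def group_correlations(rows, col_a, col_b, group_key="Diagnosis"):
    """Correlation between col_a and col_b, computed separately per group."""
    result = {}
    for label in sorted({row[group_key] for row in rows}):
        group = [row for row in rows if row[group_key] == label]
        result[label] = pearson([r[col_a] for r in group], [r[col_b] for r in group])
    return result

rows = [  # made-up values
    {"Diagnosis": "C", "VenGrey": 40, "SubWhite": 30},
    {"Diagnosis": "C", "VenGrey": 50, "SubWhite": 40},
    {"Diagnosis": "C", "VenGrey": 60, "SubWhite": 50},
    {"Diagnosis": "A", "VenGrey": 40, "SubWhite": 50},
    {"Diagnosis": "A", "VenGrey": 50, "SubWhite": 40},
    {"Diagnosis": "A", "VenGrey": 60, "SubWhite": 30},
]
corr = group_correlations(rows, "VenGrey", "SubWhite")
difference = corr["C"] - corr["A"]
print(corr, difference)
```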
[Figure: scatter plot with VenGrey on the horizontal axis (35 to 70); vertical axis 25 to 65]
Strong difference in association
Rel-volGrey   SubWhite    11.3
VenGrey       SubWhite    10.5
Cer-CSF       SubWhite    10.5
Total-intr    SubWhite    10.4
FroCSF        SubWhite    10.3
Int-nocl      SubWhite    10.2
TemGrey       SubWhite     9.9
ParGrey       Total-CSF    9.6