P(H|D) - School of Life Sciences


PHILOSOPHY OF SCIENCE:
Bayesian inference
Thomas Bayes
1702-1761
Zoltán Dienes, Philosophy of Psychology
Subjective probability:
Personal conviction in an opinion – to which a number is assigned
that obeys the axioms of probability.
Probabilities reside in the mind of the individual not the external
world. There are no true or objective probabilities.
You can’t be criticized for your subjective probability regarding any
uncertain proposition – but you must revise your probability in the
light of data in ways consistent with the axioms of probability.
Subjective odds of a theory being true:
How much would you just be willing to pay if the theory is
found to be false in compensation for a commitment from
someone else to pay you one unit of money if the theory is
found to be true?
E.g. the theory: The next toss will be heads.
I will pay you a pound if the next toss is heads.
Will you play if I want 50p if it is tails? 90p? £1? £1.50?
Assuming the highest amount you picked is £1
Odds in favour of the next toss being heads = 1.
NB: Odds = Prob(true)/Prob(false).
Subjective odds of a theory being true:
How much would you just be willing to pay if the theory is found to
be false in compensation for a commitment from someone else to
pay you one unit of money if the theory is found to be true?
Theory: There is a two-headed winged monster behind the door
I will pay you a pound if we open the door and find a monster
Will you play if I want 50p if there is no monster? 25p 0p??
Assuming you picked 0
Odds in favour of there being a monster = 0.
Odds = Prob(true)/Prob(false)
Prob(true) = odds/(odds + 1)
This is a notion of probability that applies to the truth of theories!
(Remember objective probability does not apply to theories)
So that means we can answer questions about p(H) – the probability
of a hypothesis being true – and also p(H|D) – the probability of a
hypothesis given data (which we cannot do on the Neyman-Pearson
approach).
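The odds-probability conversions just given can be sketched in a couple of lines (a hypothetical pair of helpers, not from the lecture):

```python
def odds_to_prob(odds):
    """Prob(true) = odds / (odds + 1)."""
    return odds / (odds + 1)

def prob_to_odds(p):
    """Odds = Prob(true) / Prob(false)."""
    return p / (1 - p)

# The coin bet: odds of 1 means Prob(heads) = 0.5
print(odds_to_prob(1))   # 0.5
# The monster: odds of 0 means Prob(monster) = 0
print(odds_to_prob(0))   # 0.0
```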
Axioms of probability include:
P(Y and C) = P(Y|C)*P(C)
P(C) = ½
P(Y|C) = 1/3
=> P(Y and C) = 1/6
ALSO
P(Y) = 4/6
P(C|Y) = 1/4
=> P(Y and C) = 1/6
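The two routes to P(Y and C) above can be checked with exact fractions (Y and C here are simply the two events from the slide's example, whatever they originally stood for):

```python
from fractions import Fraction as F

p_C = F(1, 2)                 # P(C) = 1/2
p_Y_given_C = F(1, 3)         # P(Y|C) = 1/3
route1 = p_Y_given_C * p_C    # P(Y and C) = P(Y|C) * P(C)

p_Y = F(4, 6)                 # P(Y) = 4/6
p_C_given_Y = F(1, 4)         # P(C|Y) = 1/4
route2 = p_C_given_Y * p_Y    # P(Y and C) = P(C|Y) * P(Y)

print(route1, route2)         # both 1/6
```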
Bayes Theorem:
P(H and D) = P(H|D)*P(D)
P(H and D) = P(D|H)*P(H)
=> P(H|D)*P(D) = P(D|H)*P(H)
=> P(H|D) = P(D|H)*P(H) /P(D)
. . . (1)
So, considering different hypotheses for the same data:
P(H|D) is proportional to P(D|H) * P(H) . . . (2)
where P(H|D) is the posterior probability of the hypothesis in the light of data D, P(D|H) is the likelihood, and P(H) is the prior probability.
All the support for a theory provided by data D is captured by the
likelihood.
What is a likelihood?
Consider an example from the Neyman-Pearson lectures:
sample mean blood pressure with drug = Md
sample mean blood pressure with placebo = Mp
[Figure: the likelihood p(D|H) plotted against population values of the mean difference, each value being a different H. The curve, with spread SE, peaks at the observed (Md – Mp): the hypothesis "population mean = sample mean" has the highest likelihood. The height of the curve at a population mean of 0 – the probability of obtaining exactly the observed mean difference under that hypothesis – gives the likelihood of the null hypothesis.]
Difference between likelihood and significance testing:
[Figure: two curves over (Md – Mp). Left, p(D|H) against different possible population means (H); with likelihood you are just interested in the height of the curve, e.g. p(D|H0) at a population mean of 0. Right, the distribution of different possible sample means; with sig testing, you are interested in the area under the curve beyond the observed point (the significance level).]
Likelihood gives a continuous graded measure of support for different hypotheses; significance testing asks you to make a black and white decision.
Likelihood reflects just what the data were; significance testing uses
tail areas, reflecting what might have happened but did not.
Likelihoods are insensitive to whether you are performing a post hoc
test or a planned comparison, and how many other tests you are
conducting.
P(H|D) is proportional to P(D|H) * P(H)
Posterior
probability of the
hypothesis in the
light of data D
Likelihood
Prior
We can use Bayes’ theorem to:
1. Calculate the probable values of a parameter (“credibility intervals”)
2. Compare the relative probability of different hypotheses, e.g. how
likely is the alternative hypothesis compared to the null? ( “Bayes
factor”)
1. Credibility intervals
We want to determine how probable different population values of
e.g. (Md – Mp) are.
First decide on a prior
Assume a normal distribution does not violate the shape of your prior too much: i.e. you think certain values are reasonably likely, and more extreme values less likely, in a symmetrical way.
[Figure: the prior probability plotted against possible population values of (Md – Mp), peaking at the value you think is most likely.]
The spread in your values – the standard deviation - can be assigned
by remembering:
You should think that plus or minus one standard deviation has a 68%
probability of including the actual population value
You should think that plus or minus two standard deviations has a
95% probability of including the actual population value
The bigger your standard deviation the more open minded you are
[Figure: the prior probability over possible population values of (Md – Mp), centred on what you think the most likely value is.]
If the standard deviation is infinite, you think all values are equally
likely – this is called a “flat prior”. You have NO idea what the
population value is likely to be.
[Figure: a flat prior – the same prior probability for every possible population value of (Md – Mp).]
To choose a prior decide:
1. Does it have a roughly normal shape?
2. The mean of your normal (call it M0)
3. The standard deviation of the normal (call it S0)
Remember: There are no “right” answers! This is YOUR prior!
A possible prior: You think an effect of 0 is most plausible and you are virtually certain that the true effect, whatever it is, lies between –10 and +10: M0 = 0, S0 = 3.
[Figure: this prior plotted against possible population values.]
[Figure: the prior and the likelihood plotted against possible population values.]
You collect data from a normal population; your sample has a
mean of 2.8 and a standard error of 1.09. Assuming your N is
above 30, you can represent the likelihood as a normal
distribution with a mean of 2.8 and a standard deviation of 1.09.
p(H|D) is proportional to p(D|H) * p(H): posterior probability = likelihood * prior (up to scaling).
Need to multiply corresponding points on the graphs:
[Figure: multiplying the likelihood and the prior curves point by point gives a curve that is again quite big around 0 – the posterior.]
Posterior = likelihood * prior
(scale so that area = 1)
[Figure: the posterior, likelihood, and prior plotted together against possible population values.]
Note: For a reasonably diffuse prior, the posterior is dominated by the likelihood, i.e. by the data.
If both prior and likelihood are normal, it turns out the posterior is
normal.
Mean of prior = M0
Mean of sample = Md
Standard deviation of prior = S0
Precision of prior: c0 = 1/S0^2
Precision of sample: cs = 1/SE^2
Posterior precision: c1 = c0 + cs
Posterior mean: M1 = (c0/c1)*M0 + (cs/c1)*Md
Posterior standard deviation: S1 = sqrt(1/c1)
95% credibility interval (also: probability interval, highest density
region or HDR).
[Figure: the posterior, with 95% of the area lying between M1 – 1.96*S1 and M1 + 1.96*S1.]
Find the values of blood pressure change that include 95% of the area:
M1, plus or minus 1.96*S1.
[Figure: the posterior over population blood pressure change, with 95% of the area lying between 0.5 and 4.5.]
You believe that there is a 95% probability that the true blood pressure change caused by the drug lies between 0.5 and 4.5 mmHg.
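The whole normal-normal update above, run with the lecture's numbers (M0 = 0, S0 = 3, sample mean 2.8, SE 1.09), reproduces the 0.5 to 4.5 interval (a minimal sketch):

```python
from math import sqrt

M0, S0 = 0.0, 3.0    # prior mean and standard deviation
Md, SE = 2.8, 1.09   # sample mean and standard error

c0 = 1 / S0 ** 2     # precision of prior
cs = 1 / SE ** 2     # precision of sample
c1 = c0 + cs         # posterior precision
M1 = (c0 / c1) * M0 + (cs / c1) * Md   # posterior mean
S1 = sqrt(1 / c1)                      # posterior standard deviation

lo, hi = M1 - 1.96 * S1, M1 + 1.96 * S1   # 95% credibility interval
print(round(lo, 1), round(hi, 1))         # 0.5 4.5
```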
If the prior is flat, the 95% credibility interval is the same interval as
the 95% confidence interval of Neyman-Pearson.
BUT different meaning:
Confidence interval is associated with an objective probability:
IF you repeated your experiment an indefinite number of times, the
true population value would lie in the 95% confidence interval 95%
of the time.
However, you CANNOT make any claim about how likely it is that THIS interval encloses the true population mean.
You cannot really be 95% confident that the true population value lies
in the 95% confidence interval
Savage: I know of no good use for a confidence interval if not to be
confident in it!
ALSO:
Confidence interval will have to be adjusted according to how
many other tests you conducted, under what conditions you
planned to stop collecting data, and whether the test was planned
or post hoc.
Credibility interval is unaffected by all these things (EXCEPT
some stopping rules – we discuss this later).
The credibility interval IS affected by any prior information you
had.
2. Compare the relative probability of different hypotheses, e.g. how
likely is the alternative hypothesis compared to the null? ( “Bayes
factor”)
Bayes: P(H|D) is proportional to P(D|H) * P(H)
=> P(H1|D) is proportional to P(D|H1) * P(H1)
P(H0|D) is proportional to P(D|H0) * P(H0)
=> P(H1|D) / P(H0|D) = P(D|H1)/P(D|H0) * P(H1)/P(H0)
i.e. posterior odds = likelihood ratio * prior odds
The likelihood ratio (in this case) is called the “Bayes factor” (B) in
favour of the alternative hypothesis
Consider a theory you might be testing in your project.
Prior odds of the theory being true:
How much would you just be willing to pay if the theory is found to be
false in compensation for a commitment from someone else to pay you
one unit of money if the theory is found to be true?
Experimental results tell you by how much to increase your odds (B)
Not a black and white decision like in significance testing.
If B = about 1, experiment was not sensitive.
(Automatically get a notion of sensitivity;
contrast: just relying on p values in significance testing.)
EXAMPLE WITH REAL DATA:
Sheldrake’s (1981) theory of morphic resonance
- Any system by virtue of assuming a particular form, becomes
associated with a “morphic field”
- The morphic field then plays a causal role in the development and maintenance of future systems, acting perhaps instantaneously through space and without decay through time
- The field guides future systems to take similar forms
- The effect is stronger the more similar the future system is to the
system that generated the field
- The effect is stronger the more times a form has been assumed by
previous similar systems
- The effect occurs at all levels of organization
Nature editorial by John Maddox 1981:
The “book is the best candidate for burning there has been in many years
. . . Sheldrake’s argument is pseudo-science . . . Hypotheses can be
dignified as theories only if all aspects of them can be tested.”
Wolpert, 1984:
“ . . . It is possible to hold absurd theories which are testable, but that
does not make them science. Consider the hypothesis that the poetic
Muse resides in tiny particles contained in meat. This could be tested by
seeing if eating more hamburgers improved one’s poetry”
Repetition priming
Subjects identify a stimulus more quickly or accurately with repeated
presentation of the stimulus
Lexical decision
Subjects decide whether a presented letter string makes a meaningful
English word or not (in the order actually presented).
Two aspects of repetition priming are consistent with an explanation
that involves morphic resonance:
Durability, stimulus specificity
Unique prediction of morphic resonance:
Should get repetition priming between separate subjects! (ESP)
Design:
Subject no:    1              2..9      10, ...
Subject type:  resonator      boosters  resonator ...
Stimuli:       shared+unique  shared    shared+unique . . .
- There were 10 resonators in total with nine boosters between each.
Resonators were assigned randomly in advance to their position in the
sequence.
- The shared stimuli received morphic resonance at ten times the rate of the unique stimuli
- There was a distinctive experimental context (white noise, essential
oil of ylang ylang, stimuli seen through a chequerboard pattern)
Prediction of theory of morphic resonance:
The resonators should become progressively faster on the shared as
compared to the unique stimuli
[Figure: (shared – unique) RT, from +80 to –80 ms, plotted against resonator number 1 to 10.]
Data for words.
slope (ms/resonator) = 0.9, SE = 3.1
Neyman-Pearson: p = 0.9, ns
[Figure: (shared – unique) RT, from +80 to –80 ms, plotted against resonator number 1 to 10.]
Data for words.
slope (ms/resonator) = –5.0, SE = 1.5
Neyman-Pearson: p = 0.009, significant
Morphic resonance theory: (shared – unique) RT should be more
negative in the same rather than different context.
6 more resonators run in same context (ylang ylang etc); 6 in different
context
[Figure: (shared – unique) RT against resonator number 1 to 10, plotted separately for the same context, the different context, and the total population.]
No difference between same and different contexts.
Overall slope = 2.8 ms/resonator, SE = 1.09, p = .018, still significant.
Bayesian analysis
P(H1|D) / P(H0|D) = P(D|H1)/P(D|H0) * P(H1)/P(H0)
posterior odds = likelihood ratio * prior odds
Need to determine p(D|H0) and p(D|morphic resonance exists)
p(D|H0):
H0: population slope = 0
p(D|H0) is just the height of the normal curve at a z of (mean slope)/(SE slope), i.e. at a z of –2.8/1.09 = –2.6. The height here is .013.
[Figure: the standard normal curve over z, with the height marked at z = –2.6.]
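That height can be checked against the standard normal density (stdlib only; `std_normal_height` is a hypothetical helper, but scipy.stats.norm.pdf gives the same number):

```python
from math import exp, pi, sqrt

def std_normal_height(z):
    """Height of the standard normal curve at z."""
    return exp(-z ** 2 / 2) / sqrt(2 * pi)

z = round(-2.8 / 1.09, 1)    # -2.6, as on the slide
height = std_normal_height(z)
print(round(height, 3))      # ~0.014; the slide truncates to .013
```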
p(D|morphic resonance) = ?
Morphic resonance is consistent with a number of population slopes; in
fact, at first blush, any slope > 0.
Need to determine p(population slope|morphic resonance) for all slope
values.
Completely flat prior for positive slopes:
[Figure: p(population slope|M) flat over all possible positive population slopes.]
But morphic resonance cannot allow any slope: the between-subject
priming must be less than within subject priming.
Within a subject, RT sped up by 20 ms with a repetition.
1 resonator = 10 boosters
Each booster saw each stimulus 3 times
So boosting between each resonator = 30 repetitions.
So slope cannot be more than 30 x 20 = 600 ms
Assuming we have no preference whatsoever in thinking any of the
slopes in the range from 0 to 600ms are more likely than any other
value (an implausible assumption but let’s just see the consequences):
[Figure: p(population slope|M) uniform over possible population slopes from 0 to 600 ms.]
To go from p(population slope|M) to p(observing a slope|M) need to
smear the graph by the SE of the sample.
In fact, since the SE is about 1ms this smearing is negligible in this
case.
p(data|M) is pretty much the same as p(population slope|M)
[Figure: p(data|M) against observed slope – roughly uniform from 0 to 600 ms, smeared at the edges by the SE.]
Assume p(data|M) is given by this distribution, where the obtained sample SE = 1.09.
Since the distribution is SO long, i.e. so many observed values are possible, the probability of observing a slope in any one 1 ms interval, e.g. 2-3 ms, is actually very small!
Actual value = 2.8 ms
p(observing slope = 2.8ms| this model of morphic resonance) = .002
So Bayes factor = p(D|M)/ p(D|H0) = .002/.013 = .15
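Putting the two densities together gives the Bayes factor for the flat 0-600 ms model (a sketch; the slide's .15 comes from rounding the densities to .002 and .013, so the unrounded ratio is a little lower):

```python
from math import exp, pi, sqrt

p_D_given_M = 1 / 600                               # flat model: ~.002 per 1 ms interval
p_D_given_H0 = exp(-2.6 ** 2 / 2) / sqrt(2 * pi)    # normal height at z = -2.6: ~.013
B = p_D_given_M / p_D_given_H0
print(round(B, 2))   # ~0.12: the data favour the null over this vague model
```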
Posterior odds = Bayes factor * prior odds
Bayes factor = .15
Means data should REDUCE your confidence in morphic resonance and
INCREASE your confidence in the null hypothesis!
Contrast Neyman-Pearson in this case: p = .018, so we reject the null
hypothesis!
Moral: On a Bayesian analysis, a significant result may lead one to prefer the null hypothesis even more than one did before over a poorly specified theory!
Contrast distributions for two theories below.
Note area under the curve must always be 1.
White theory is more precise than the yellow: So each possible
slope in its range has a higher p than yellow.
Finding data in the range 0-150 would support white more than
yellow because white more strongly predicted it.
[Figure: p(Slope|M) for the two theories over possible population slopes from 0 to 600. The white distribution extends only from 0 to 150, so at a slope of 100, p(100|white) / p(100|yellow) = 4.]
BUT morphic resonance is not so poorly specified as in this example.
The assumption that morphic resonance allows all slope values between
0 and 600 equally is implausible.
Based on studies Sheldrake has explained in terms of morphic resonance, between-subjects effects have been roughly .005 * within-subjects effects.
So a likely value for the slope is .005*600 = 3 ms.
A rectangular distribution doesn’t capture our intuitions very well
either; presumably the probability of large slopes is smaller than small
slopes:
[Figure: p(population slope|M) peaking at 3 ms and tailing off out to 600, over possible population slopes.]
With a distribution like this, Bayes factor = 16
i.e. whatever one’s prior odds in favour of morphic resonance you
should multiply them by 16 in the light of these data.
Contrast Neyman-Pearson: Result was significant so should
categorically reject null hypothesis
With the Bayesian approach, if before you had very low odds in favour of morphic resonance, they can still be very low afterwards.
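In general, p(D|M) for a peaked prior like this is obtained by integrating the prior over slopes against the normal likelihood of the data given each slope. A numerical sketch (the exponential prior shape here is a hypothetical stand-in, so the resulting factor will not reproduce the slide's 16 exactly, but it does come out well above 1):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Normal density of x for a given mean and sd."""
    return exp(-((x - mean) / sd) ** 2 / 2) / (sd * sqrt(2 * pi))

observed, SE = 2.8, 1.09   # observed slope and its standard error

# Hypothetical prior over slopes 0..600 ms, peaked at small values
# (NOT the slide's exact shape)
def prior(slope):
    return exp(-slope / 3.0)   # unnormalised

# Riemann-sum integration: p(D|M) = integral of prior(slope) * p(D|slope)
step = 0.1
slopes = [i * step for i in range(int(600 / step) + 1)]
norm = sum(prior(s) * step for s in slopes)
p_D_given_M = sum((prior(s) / norm) * normal_pdf(observed, s, SE) * step
                  for s in slopes)

p_D_given_H0 = normal_pdf(observed, 0.0, SE)
B = p_D_given_M / p_D_given_H0
print(round(B, 1))   # well above 1 under this peaked prior
```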
NB:
I ran further studies:
Expt 2:
20 boosters between each resonator, 2 resonators run after each set of boosters. Nineteen resonators also run in Göttingen – could they show which word set was being boosted in Sussex?
Expt 3:
20 boosters again, 2 resonators, one in each pair in a highly distinctive
context
All results flat as a pancake.
Combined Bayes factor for non-word data = about 1/5
Combining with word data = about 1/16
Does not rule out morphic resonance – just changes our odds
Summary:
A Bayes factor tells you how much to multiply your prior odds in the
light of data.
Advantages:
Insensitive experiments show up as having Bayes factors near 1.
You are not tempted to accept the null hypothesis just because the
experiment was insensitive.
Penalises vague theories – data significantly different from null may
still actually support null! (Compare and contrast Popper)
Disadvantages:
Note the somewhat arbitrary way in which we settled on p(D|M):
p(D|M) then reflects not only the data but also our subjective
judgements (so it is not a true likelihood)
So Bayes factor also does not reflect just the data
In ideal cases, the theory specifies p(D|theory) precisely; this will be
rare in psychology
Likelihoods, and hence Bayes factors, are insensitive to many
stopping rules:
e.g. consider example from previous lecture: the proportion of
women with G spot orgasm
“stop when have reached a certain sample size”
vs
“stop when have counted a certain number of women with G spot orgasm”
Makes no difference to likelihood, so makes no difference to Bayes
factor (or credibility interval).
Stopping rule can be conditioned on anything in itself uninformative
about the hypothesis.
So: “Stop when Bayes factor = 4 OR ¼” is a fine rule;
“Stop when Bayes factor = 4” cannot be used.
Similarly for credibility intervals:
“Stop when 95% credibility interval has a width of 4 mmHg” is a splendid rule
“Stop when 95% credibility interval excludes 0” is no good.
Explanation: The rule “Stop when Bayes factor = 4” means that the standardly
computed likelihood no longer contains ALL the information provided by the data
about the truth of the hypothesis; there is additional information in the time taken to
reach that likelihood. Therefore one cannot obtain the posterior by simply
multiplying the prior by the likelihood.
Summary
With a Bayesian analysis you can
1. Calculate credibility intervals – the probability that a population
value lies in certain intervals.
2. Calculate a Bayes factor: how much more likely the data make a theory compared to the null hypothesis.
Both procedures encourage you to think about the size of the effects your theory specifies.
The strengths of Bayesian analyses are also its weaknesses:
1. Are our subjective convictions really susceptible to the
assignment of precise numbers and are they really the sorts of
things that do or should follow the axioms of probability?
Should papers worry about the strength of our convictions in
their result sections, or just the objective reasons for why
someone might change their opinions?
2. Their insensitivity to stopping rules means Bayesian
procedures are not guaranteed to control error probabilities
(Type I, type II). (compare Mayo, 1996).
If you calculated 100 95% credibility intervals when the null
was always true, you would expect about 5 to exclude 0.
A Bayesian thinks this is just as it should be. But should she be worried?
10 measures of early toilet training are correlated with 10 measures
of adult personality
Out of these 100 correlations, 4 are found to be significant.
Neyman-Pearson:
One expects about 5 to be significant by chance alone; these are
weak data and do not lead one to accept any hypothesis about toilet
training affecting personality.
On a Bayesian analysis:
These four 95% credibility intervals exclude 0. No need to take into
account you also conducted 96 other tests. You have good support
for four specific hypotheses concerning toilet training and
personality.
But: Bayesian is not making a black and white decision; just getting a
continuous measure of support.
If something looks interesting she can simply decide to collect more
data until she is satisfied.
Also: Bayesian does not ignore all the null results in evaluating a
grand theory concerning toilet training and personality;
e.g. if a Freudian theory predicted ALL tested relationships, its
Bayes factor in the light of the 100 correlations could be very low!
BUT you can still pick out one result that is significant – that specific
hypothesis gets lots of support.
SHOULD one’s confidence in that hypothesis be reduced because of
all the other tests that were done?
But why should what else you may or may not have done matter?
If you were an editor would you publish the paper because it was good
support for one of the 100 specific hypotheses?
What if the author just reported testing that one correlation and
constructed a plausible theory for it to put in his introduction.
According to the Bayesian, there is nothing wrong with that.
According to classical statistics, that is clearly cheating.