Lecture on descriptive statistics etc.


Bayesian Statistics
HSTAT1101: 10 November 2004
Arnoldo Frigessi
[email protected]
Reverend Thomas Bayes (1702–1761)
Mathematician who first used probability inductively and established
a mathematical basis for probability inference
He set down his findings on probability in "Essay Towards Solving a
Problem in the Doctrine of Chances" (1763),
published posthumously in the Philosophical Transactions of the
Royal Society of London.
• Inventor of a “Bayesian analysis” for the binomial
model
• At around the same time, Laplace independently discovered Bayes’
Theorem and a new analytic tool for approximating
integrals
• Bayesian statistics was the method of statistics
until about 1910.
Statistics estimates unknown parameters (like the mean of a population).
Parameters represent things that are unknown: properties of the population
from which the data arise.
Questions of interest are expressed as questions about such parameters:
confidence intervals, hypothesis testing, etc.
Classical (frequentist) statistics considers parameters as specific to the
problem and not subject to random variability. Hence parameters are just
unknown numbers: they are not random, and it is not possible to make
probabilistic statements about them (such as “the parameter has a 35%
chance of being larger than 0.75”).
Bayesian statistics considers parameters as unknown and random, and hence
it is legitimate to make probabilistic statements about them (like the one above).
In Bayesian statistics parameters are uncertain either because they are random
or because of our imperfect knowledge of them.
Example:
”Treatment 2 is more cost-effective than treatment 1 for a certain hospital.”
Parameters involved in this statement:
- mean cost and mean efficacy for treatment 1
- mean cost and mean efficacy for treatment 2
across all patients in the population for which the hospital is responsible.
Bayesian point of view: we are uncertain about the statement, so this
uncertainty is described by a probability. We can calculate exactly the
probability that treatment 2 is more cost-effective than treatment 1 for the
hospital.
Classical point of view: either treatment 2 is more cost-effective or it is not.
Since this experiment cannot be repeated (it happens only once), we
cannot talk about its probability.
... but, in classical statistics we can make a test!
Null-hypothesis Ho: treatment 2 is NOT more cost-effective than treatment 1
... and we can obtain a p-value!
What is a p-value?
Correct: 2, the probability, computed assuming that Ho is true, of observing
data at least as extreme as those actually observed (quite a complicated
explanation, isn’t it?).
Answers 1 and 3 are ways significance is commonly interpreted,
BUT they are not correct.
Answer 3 makes a probabilistic statement about the hypothesis, which is
not random but either true or false.
Answer 1 is about individual patients, while the test is about cost-efficacy.
We cannot interpret a p-value as the probability of the hypothesis, because
in the classical setting it is irrelevant how probable the hypothesis was
a priori, before the data were collected.
Example:
Can a healer cure cancer?
A healer treated 52 cancer patients and 33 of these were better after
one session.
Null Hypothesis Ho: the healer does not heal.
p-value (one sided) = 3.5%. Hence reject Ho at the 5% level.
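A quick numerical check of this p-value, assuming the implicit null model
that each patient has a 50% chance of getting better regardless of the
healer (a minimal sketch in Python):

```python
from scipy.stats import binom

# One-sided p-value: probability of 33 or more improvements among 52
# patients when each improves with probability 0.5 under Ho
p_value = binom.sf(32, 52, 0.5)  # sf(32) = Pr(X >= 33)
print(round(p_value, 3))         # about 0.035, the 3.5% quoted above
```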
Should we believe that it is 96.5% sure that the healer heals?
Most doctors would regard healers as highly unreliable, and in no way
would they be persuaded by a single small experiment. After seeing the
experiment, most doctors would continue to believe in Ho: the result
was due to chance.
In practice, classical statistics would recognise that much stronger
evidence is needed to reject a very likely Ho.
So, the p-value in reality does not mean the same thing in all situations.
To interpret the p-value as the probability of the null hypothesis is not only
wrong but dangerous when the hypothesis is a priori highly unlikely.
Many practising statisticians are disturbed that a p-value cannot be seen as
the probability that the null hypothesis is true.
Similarly, it is disturbing that a 95% confidence interval for a treatment
difference does NOT mean that the true difference has a 95% chance of lying
in this interval.
Classical confidence interval:
[3.5, 11.6] is a 95% confidence interval for the mean cost of …
Interpretation: there is a 95% chance that the mean lies between 3.5 and 11.6.
Correct? NO!
It cannot mean this, since the mean cost is not random!
In the Bayesian context, parameters are random, and a Bayesian interval for
the mean has exactly the interpretation usually (but wrongly) given to a
classical confidence interval.
In classical inference, the words confidence and significance are technical terms
and should be interpreted as such!
One widely used way of presenting a cost-effectiveness analysis is through
the Cost-Effectiveness Acceptability Curve (CEAC), introduced by van Hout
et al. (1994). For each value of the threshold willingness to pay λ, the
CEAC plots the probability that one treatment is more cost-effective than
another. This probability can only be meaningful in a Bayesian framework:
it refers to the probability of a one-off event (the relative
cost-effectiveness of these two particular treatments is one-off,
and not repeatable).
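To make this concrete, here is a minimal Monte Carlo sketch of how a CEAC
can be computed; the posterior distributions and all numbers below are
purely illustrative assumptions, not the trial’s actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # posterior draws (illustrative normal posteriors)

# Hypothetical posterior draws of the incremental cost and effect
# of treatment 2 versus treatment 1
delta_cost = rng.normal(500.0, 300.0, n)    # extra cost (£)
delta_effect = rng.normal(0.05, 0.04, n)    # extra effect (e.g. QALYs)

# For each willingness-to-pay threshold lambda, the CEAC is the
# posterior probability of a positive incremental net benefit
for lam in [0, 5_000, 10_000, 20_000, 50_000]:
    prob = np.mean(lam * delta_effect - delta_cost > 0)
    print(f"lambda = £{lam:>6,}: P(treatment 2 cost-effective) = {prob:.2f}")
```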
Example: randomised clinical trial evidence
Studies:
1 RCT (n = 107)
Boyd O, Grounds RM, Bennett ED. A randomised clinical trial of the effect of
deliberate perioperative increase of oxygen delivery on mortality in high-risk
surgical patients. JAMA 1993; 270:2699-707
Comparators:
dopexamine and adrenaline vs standard care
Follow-up:
28 days
Economics:
Single cost analysis
Trial results:

Strategy       Cost (£), mean (se)     Survival (days), mean (se)
Standard       11,885 (3,477)          541 (50.2)
Adrenaline     10,847 (3,644)          615 (64.1)
Dopexamine      7,976 (1,407)          657 (34.7)
Trial CEAC curves
[Figure: CEAC for Control, Dopexamine and Adrenaline: the probability that
each strategy is cost-effective, plotted against the willingness to pay λ
from £0 to £100,000.]
The Bayesian method: learn from the data
The role of data is to add to our knowledge and so to update what we can say
about hypotheses and parameters.
If we want to learn from a new data set, we first have to say what we already
know about the hypothesis, a priori, before we see the data.
Bayesian statistics summarises the a priori knowledge about an unknown parameter
(say, the mean cost of something) in a distribution for the unknown quantity,
called the prior distribution.
The prior distribution synthesises what is known or believed to be true before we
analyse the new data.
We then analyse the new data and summarise again the total information
about the unknown hypothesis (or parameter) in a distribution called the posterior
distribution.
Bayes’ formula is the mathematical way to calculate the posterior distribution
given the prior distribution and the data.
[Figure: three curves, showing the prior (peak at 0), the data curve
(peak at 1.60) and the posterior (peak at 1.30).]
Bayes’ formula recognises the strength of each curve: the posterior is more
influenced by the data than by the prior, since the data have a narrower
distribution.
The data curve is called the likelihood, and it is also important in classical
statistics. It describes the support that comes from the data for the various
possible values of the unknown parameter.
Classical statistics uses only the likelihood; Bayesian statistics uses all
three curves.
The classical estimate here would be the peak of the likelihood (1.6).
The Bayes estimate is about 1.3, since it includes our prior belief that the
parameter should have a value below 2 or so.
The Bayesian estimate is a compromise between the data and
prior knowledge.
In this sense, Bayesian statistics is able to make use of more
information than classical statistics and hence obtain stronger
results.
Bayesian statistics reads confidence intervals, estimates etc. from the posterior
distribution.
A point estimate for the parameter is the a posteriori most likely value (the peak
of the posterior) or the expected value of the posterior.
If we have a hypothesis (for example, that the parameter is positive), then we
read from the posterior that the posterior probability for the parameter to be
larger than zero is 0.89.
If we are less sure about the parameter a priori, then we would use
a flatter prior. The consequence is that the posterior looks more similar to
the likelihood (the data curve).
posterior = (normalising constant) × prior × likelihood
          ∝ prior × likelihood
P(parameter | data) ∝ P(parameter) × P(data | parameter)
P(θ | data) ∝ P(θ) × P(data | θ)
P(θ | data) = P(θ) × P(data | θ) / P(data)   (Bayes’ formula)
How do we choose a prior distribution?
The prior is subjective. Two different experts can have different knowledge
and beliefs, which would lead to two different priors. If you have no
opinion, it is possible to use a totally flat prior, which adds no information
to what is in the data.
If we want clear probabilistic interpretations of confidence and
significance, then we need priors.
This is considered a weakness by many who are trained to reject
subjectivity whenever possible.
BUT:
- Science is not objective in any case. Why should the binomial or the Gaussian
distribution be the true one for a data set?
- Subjective evidence can be down-weighted as much as one wishes.
- If there is no consensus, and different priors lead to different decisions, why
hide it?
Example: Cancer at Slater School
(Example taken from an article by Paul Brodeur in the New Yorker in Dec. 1992.)
• Slater School is an elementary school where the staff were concerned that their
high cancer rate could be due to two nearby high-voltage transmission lines.
Key facts:
• there were 8 cases of invasive cancer over a long time among 145 staff members,
whose average age was between 40 and 44
• based on the national cancer rate among women of this age (approximately 3/100),
the expected number of cancers is 4.2
Assumptions:
1) the 145 staff members developed cancer independently of each other
2) the chance of cancer, π, was the same for each staff member.
Therefore, the number of cancers, X, follows a binomial distribution: X ~ bin(145, π)
How well does each of four simplified competing theories explain the data?
Theory A: π = .03 (the national rate, i.e. no effect of the lines)
Theory B: π = .04
Theory C: π = .05
Theory D: π = .06
The Likelihood of Theories A-D
To compare the theories, we see how well each explains the data.
That is, for each hypothesized π, we calculate the binomial probability:

Pr(X = 8 | π) = (145 choose 8) π^8 (1 − π)^137

Theory A: Pr(X = 8 | π = .03) ≈ .036
Theory B: Pr(X = 8 | π = .04) ≈ .096
Theory C: Pr(X = 8 | π = .05) ≈ .134
Theory D: Pr(X = 8 | π = .06) ≈ .136
This is a ratio of approximately 1:3:4:4. So, Theory B explains the
data about 3 times as well as theory A. There seems to be an
effect of the lines!
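These likelihoods are easy to check numerically; a minimal sketch (exact
evaluation gives values slightly above the rounded figures on the slide,
but the same overall pattern):

```python
from scipy.stats import binom

# Pr(X = 8 | pi) for X ~ bin(145, pi) under each theory
for theory, pi in zip("ABCD", [0.03, 0.04, 0.05, 0.06]):
    print(f"Theory {theory}: {binom.pmf(8, 145, pi):.3f}")
# prints about 0.040, 0.097, 0.138, 0.139
```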
A Bayesian Analysis
There are other sources of information about whether cancer can be
induced by proximity to high-voltage transmission lines.
- Some epidemiologists report positive correlations between cancer
and proximity.
- Other epidemiologists do not find these correlations, and physicists
and biologists maintain that the energy in magnetic fields
associated with high-voltage power lines is too small to have an
appreciable biological effect.
Suppose we judge the opposing expert opinions equally reliable.
Therefore, Theory A (no effect) is as likely as Theories B, C, and D
together, and we judge Theories B, C, and D to be equally likely.
So, Pr(A) = 0.5 = Pr(B) + Pr(C) + Pr(D)
Also, Pr(B) = Pr(C) = Pr(D) = 0.5 / 3 = 1/6
These quantities represent our prior distribution on the four
possible hypotheses.
Bayes’ Theorem:

Pr(A | X = 8) = Pr(A and X = 8) / Pr(X = 8)
             = Pr(A) Pr(X = 8 | A) /
               [Pr(A) Pr(X = 8 | A) + Pr(B) Pr(X = 8 | B)
                + Pr(C) Pr(X = 8 | C) + Pr(D) Pr(X = 8 | D)]
             = (1/2)(.036) /
               [(1/2)(.036) + (1/6)(.096) + (1/6)(.134) + (1/6)(.136)]
             ≈ 0.23
Likewise,
Pr(B | X = 8) = 0.21
Pr(C | X = 8) = 0.28
Pr(D | X = 8) = 0.28
Accordingly, we’d say that each of these four theories is almost equally
likely.
So the probability that there is an effect of the lines at Slater is about
0.21 + 0.28 + 0.28 = 0.77. The probability of an effect is pretty high,
but not close enough to 1 to be a proof.
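The whole update can be done in a few lines; a sketch using the prior and
the (rounded) likelihood values quoted above:

```python
# Prior and likelihoods (rounded) as given in the lecture
prior = {"A": 1/2, "B": 1/6, "C": 1/6, "D": 1/6}
lik = {"A": 0.036, "B": 0.096, "C": 0.134, "D": 0.136}

# Bayes' theorem: posterior ∝ prior × likelihood, then normalise
unnorm = {t: prior[t] * lik[t] for t in prior}
total = sum(unnorm.values())
posterior = {t: round(u / total, 2) for t, u in unnorm.items()}
print(posterior)
# about {'A': 0.23, 'B': 0.20, 'C': 0.28, 'D': 0.29}; the small
# differences from the slide's 0.21 and 0.28 come from rounding
# the likelihoods
print(1 - posterior["A"])  # probability of some effect, about 0.77
```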
A non-Bayesian Analysis
Classical test of the hypothesis
Ho: π = .03 (no effect)
against the alternative hypothesis π > .03.
Calculate the p-value; we find:
p-value = Pr(X = 8 | π = .03) + Pr(X = 9 | π = .03) + Pr(X = 10 | π = .03)
+ … + Pr(X = 145 | π = .03)   (138 terms to be added)
≈ .07
Under a classical hypothesis test, we would not reject the null
hypothesis at the 5% level. So there is no indication of an effect of the lines.
By comparison, the Bayesian analysis revealed that
Pr(π > .03 | X = 8) ≈ 0.77.
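This tail sum is a single call in scipy (a quick check):

```python
from scipy.stats import binom

# p-value = Pr(X >= 8) under Ho: X ~ bin(145, 0.03)
p_value = binom.sf(7, 145, 0.03)  # sf(7) = Pr(X >= 8)
print(round(p_value, 2))          # about 0.07
```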
Today’s posterior is the prior of tomorrow!
Example: Hospitalisation
A new drug seems to have good efficacy relative to a standard treatment.
Is it cost-effective? Assume that it would be, if it also reduced
hospitalisation.
Data:
100 patients in each treatment group.
Standard treatment group: 25 days in hospital in total (sample variance 1.2).
New treatment group: 5 days in total (sample variance 0.248).
A classical test (do it!) shows that the difference is significant at the 5% level.
The pharmaceutical company would then say: ”The mean number of days in
hospital under the new treatment is 0.05 per patient (5/100), while it is 0.25
with the standard treatment.” Cost-effective!!!!!
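A sketch of that classical test, assuming a two-sample z-test on the
per-patient means (the one-sided version is the one that reaches the 5% level):

```python
from math import sqrt
from scipy.stats import norm

n = 100
mean_std, var_std = 25 / n, 1.2    # standard treatment: 0.25 days/patient
mean_new, var_new = 5 / n, 0.248   # new treatment: 0.05 days/patient

# z-statistic for the difference in mean days per patient
se_diff = sqrt(var_std / n + var_new / n)
z = (mean_std - mean_new) / se_diff
print(round(z, 2), round(norm.sf(z), 3))  # z about 1.66, one-sided p about 0.048
```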
Example: Hospitalisation & genuine prior information
BUT: this was a rather small trial.
Is there other evidence available?
Suppose a much larger trial of a similar drug produced a mean number
of days in hospital per patient of 0.21, with a standard error of only 0.03.
This would suggest that the 0.05 for the new drug is optimistic, and there
is doubt about the real difference in cost between the new and standard treatments.
BUT the interpretation of how pertinent this evidence is, is subjective: it was
a similar drug, not the same one.
It is, however, reasonable to suppose that the new drug and this similar one
behave rather similarly.
Because the drug is not the same, we cannot simply pool the two data sets.
Classical statistics does not know what to do, except to lower the
required significance level.
Example: Hospitalisation
Bayesian statistics solves the problem by treating the early trial as
giving prior information to the new trial.
Assume that our prior says that the mean number of days in hospital
per patient with the new treatment should be 0.21, but with a larger
standard deviation, say 0.08, to mark that the two drugs are not the same.
Now we compute the posterior estimate, given the new small trial,
and obtain a mean of 0.095 days in hospital per patient.
So this is still better than the standard treatment (0.25 days).
In fact, we can compute the probability that the new drug reduces
hospitalisation with respect to the standard one, and we get 0.90!
Conclusion: the new treatment has a 90% chance (not 95%) of reducing
hospitalisation, and the mean number of days is about 0.1 (not 0.05).
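A sketch of this Bayesian update, assuming a normal-normal conjugate model
with the prior and data given above; treating the standard arm’s mean as
known only through its own sample estimate is my assumption:

```python
from math import sqrt
from scipy.stats import norm

n = 100
xbar_new = 5 / n                   # new trial: 0.05 days per patient
se_new = sqrt(0.248 / n)           # standard error of that mean, about 0.05

prior_mean, prior_sd = 0.21, 0.08  # prior from the earlier, similar drug

# Normal-normal conjugate update: precisions (1/variance) add
post_prec = 1 / prior_sd**2 + 1 / se_new**2
post_mean = (prior_mean / prior_sd**2 + xbar_new / se_new**2) / post_prec
post_sd = sqrt(1 / post_prec)
print(round(post_mean, 3))         # about 0.095, as in the text

# Probability the new drug beats the standard one (mean 0.25 days,
# standard error sqrt(1.2/100)), assuming a flat prior for that arm
se_std = sqrt(1.2 / n)
z = (0.25 - post_mean) / sqrt(post_sd**2 + se_std**2)
print(round(norm.cdf(z), 3))       # about 0.907: the "0.90" of the text
```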
http://www.bayesian-initiative.com/
The Bayesian Initiative in Health
Economics & Outcomes Research
I hope I have not confused you too much!
BUT I also hope that you are a bit confused now and that at later stages
in your education and profession you will want to learn this better!
For now: Bayesian statistics is NOT part of the syllabus.