Bayesian Statistics - National University of Singapore


SPH6004 Advanced Biostatistics
Part 1: Bayesian Statistics
Chapter 1: Introduction to Bayesian Statistics
Golden rule: please stop me to ask questions
Week  Starting  Tuesday   Friday
1     13 Jan    Alex      [Alex]
2     20 Jan
3     27 Jan
4     3 Feb     Alex      [Alex]
5     10 Feb    Alex      [Alex]
6     17 Feb    Alex      [Hyungwon]
R     24 Feb    (recess)
7     3 Mar     Hyungwon  Hyungwon
8     10 Mar    Hyungwon  Hyungwon
9     17 Mar    Hyungwon  Hyungwon
10    24 Mar    YY        YY
11    1 Apr     YY        YY
12    7 Apr     YY        YY
Week  Starting  Tuesday                              Friday
1     13 Jan    Introduction to Bayesian statistics  Importance sampling

Weeks 2-6 cover: Markov chain Monte Carlo; JAGS and STAN; hierarchical modelling; variable selection and model checking; Bayesian inference for mathematical models.
Objectives
● Describe differences between Bayesian and classical statistics
● Develop appropriate Bayesian solutions to non-standard problems: describe the model, fit it, relate the analysis to the problem
● Describe differences between computational methods used in Bayesian inference, understand how they work, implement them in a programming language
● Understand modelling and data-analytic principles
Expectations
Know already:
● Basic and intermediate statistics
● Likelihood function
● Generalised linear models
Expected to:
● Pick up programming in R
● Be able to read the notes
Why the profundity?
● Bayes' rule is THE way to invert conditional probabilities
● ALL probabilities are conditional
● Bayes' rule therefore provides the 'calculus' to manipulate probability, moving from p(A|B) to p(B|A)
For early detection of breast cancer, starting at some age, women are encouraged to have routine screening, even if they have no symptoms. Imagine you conduct such screening using mammography. (This example is due to Prof Gerd Gigerenzer.)

The following information is available about asymptomatic women aged 40 to 50 in your region who have mammography screening:
• The probability a woman has breast cancer is 0.8%
• If she has breast cancer, the probability is 90% that she has a positive mammogram
• If she does not have breast cancer, the probability is 7% that she still has a positive mammogram

The challenge:
• Imagine a woman who has a positive mammogram
• What is the probability she actually has breast cancer?
Their answers...
"I never inform my patients about statistical data. I would tell the patient that mammography is not so exact, and I would in any case perform a biopsy."
Recall the screening information above. Can we write it mathematically?
Key point 1
• Let B indicate breast cancer, M a positive mammogram, and A the age/risk class
• p(B = 1 | A = 1) is the probability prior to observing the mammogram
• p(B = 1 | M = 1, A = 1) is the probability after observing it
• Bayes' rule provides the way to update the prior probability to reflect the new information, giving the posterior probability
• (Even the prior is a posterior: it already conditions on the risk class)
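To make this concrete, here is a minimal R sketch (variable names are ours, not from the slides) plugging the screening numbers into Bayes' rule:

```r
# Screening numbers from the slide
p_bc       <- 0.008  # P(breast cancer) for this risk class (asymptomatic, aged 40-50)
p_pos_bc   <- 0.90   # P(positive mammogram | breast cancer)
p_pos_nobc <- 0.07   # P(positive mammogram | no breast cancer)

# Law of total probability: P(positive mammogram)
p_pos <- p_pos_bc * p_bc + p_pos_nobc * (1 - p_bc)

# Bayes' rule: P(breast cancer | positive mammogram)
p_bc_pos <- p_pos_bc * p_bc / p_pos
p_bc_pos  # about 0.094, ie roughly a 9% chance despite the positive test
```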
Key point 2
● Bayes' rule allows you to switch from
– pr(something known | something unknown)
to
– pr(something unknown | something known)
Bayesians and frequentists
Bayesians: Bayes' rule is used to switch to pr(unknowns | knowns) for all situations in which there is uncertainty, including parameter estimation.
Frequentists: Bayes' rule is only used to make probability statements about events that could, in principle, be repeatedly observed. Parameter estimation is done using methods that perform well under some arbitrary desiderata, such as being unbiased, and uncertainty is quantified by appealing to large samples.
The Thai AIDS vaccine trial
The modified intention-to-treat analysis:

             Seroconverted  Participated
Vaccine arm  51             8197
Placebo arm  74             8198

Q: what is the "underlying" probability pv of infection over this time window for those on the vaccine arm?
What does that actually mean?
• Participants are not randomly selected from the population: they are referred or volunteer
• Participants must meet eligibility requirements
• Not representative of the Thai population
• Risk of infection differs between Thailand and, eg, Singapore
• Nebulous: risk of infection in a hypothetical second trial in the same group of participants
• Hope the ratio pv/pu has some relevance in other settings
Model for data
• Seems appropriate to assume Xv ~ Bin(Nv, pv)
• Xv = 51 = number of vaccinees infected
• Nv = 8197 = number of vaccinees
• pv = ?
We want:
• a point estimate to summarise the data
• an interval estimate to summarise uncertainty
• (later) a measure of evidence that the vaccine is effective
Refresher: frequentist approach
• Traditional approach to estimate pv: find the value of pv that maximises the probability of the data, given that the hypothetical value were the true value
– using calculus, or
– numerically (Newton-Raphson, simulated annealing, cross-entropy, etc)
– in either case, work with the log-likelihood
Refresher: frequentist approach
• The log-likelihood is
$\ell(p_v) = \log \binom{N_v}{X_v} + X_v \log p_v + (N_v - X_v)\log(1 - p_v)$
• Differentiating with respect to the argument we want to maximise over, setting the derivative to zero, adding a hat and solving gives
$\hat{p}_v = X_v / N_v$
• which is just the empirical proportion infected
Refresher: frequentist approach
• To quantify the uncertainty we might take a 95% interval
• You probably know the Wald interval:
$\hat{p}_v \pm 1.96 \sqrt{\hat{p}_v(1 - \hat{p}_v)/N_v}$
• (this involves cheating: it assumes you know pv and that the sample size is close to infinity; there are actually better equations for small samples)
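A short R sketch of this frequentist analysis (our variable names), reproducing the numbers quoted in the next slide:

```r
# Frequentist analysis of the vaccine arm (trial data from the earlier slide)
x_v <- 51    # number of vaccinees infected
n_v <- 8197  # number of vaccinees

p_hat <- x_v / n_v                             # maximum likelihood estimate
se    <- sqrt(p_hat * (1 - p_hat) / n_v)       # large-sample standard error
wald  <- p_hat + c(-1, 1) * qnorm(0.975) * se  # 95% Wald interval

round(100 * c(estimate = p_hat, lower = wald[1], upper = wald[2]), 2)
# estimate ~0.62%, interval ~(0.45, 0.79)%
```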
Interpretation
• The maximum likelihood estimate of pv is not the most likely value of pv
• Classical statisticians cannot make probabilistic statements about parameters
• There is not a 95% chance that pv lies in the interval (0.45, 0.79)%
• Rather, 95% of such intervals over your lifetime (assuming no systematic error and large enough samples) will contain the true value
Tackling it Bayesianly
• Target: point and interval estimates
• Intermediate: the probability of the parameter pv given the data Xv and Nv, ie the posterior for pv:

$p(p_v \mid X_v, N_v) = \dfrac{p(X_v \mid p_v, N_v)\, p(p_v)}{\int_0^1 p(X_v \mid \pi, N_v)\, p(\pi)\, \mathrm{d}\pi}$

(the first factor in the numerator is the likelihood function, the second the prior for pv; the denominator integrates over a dummy variable π)
• The likelihood function is the same as before
• What is the prior?
What is the prior?
• There is no "the" prior
• There is "a" prior: you choose it just as you choose a binomial model for the data
• It represents information on the parameter (the proportion of vaccinees that would be infected) before the data are in hand
• Perhaps justifiable to assume all probabilities in [0,1] are equiprobable before the data are observed
What is the prior?
• Write 1{A} = 1 if A is true and 0 if A is false
• A uniform prior is then $p(p_v) = 1\{0 \le p_v \le 1\}$, ie pv ~ U(0,1)
• Nv can be dropped from the conditioning as I assume the sample size and the probability of infection are independent
What is the posterior?
• For pv on the range (0,1), and C a constant,
$p(p_v \mid X_v, N_v) = C\, p_v^{X_v} (1 - p_v)^{N_v - X_v}$
• Two ways to deal with the constant C:
1. the smart way (later)
2. the dumb way (now)
The dumb way
• Take a grid of values for pv, finely spaced, on a sensible range
• Evaluate the log posterior up to an additive constant
• Exponentiate to get the posterior up to a multiplicative constant C
• Approximate the integral by a sum over the grid
• Scale to get rid of C, exploiting the fact that the posterior is a pdf and integrates to 1
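A minimal R sketch of these steps (the grid range and spacing are our choices):

```r
# Grid approximation ('the dumb way') to the posterior of pv
x_v <- 51; n_v <- 8197
grid  <- seq(0.0001, 0.02, length.out = 10000)  # finely spaced, sensible range
width <- grid[2] - grid[1]

# Log posterior up to an additive constant: log likelihood + log prior;
# the U(0,1) prior contributes nothing on this range
log_post <- dbinom(x_v, n_v, grid, log = TRUE)

# Exponentiate (subtracting the max for numerical stability), then rescale
# so the grid approximation to the integral equals 1
post <- exp(log_post - max(log_post))
post <- post / (sum(post) * width)

plot(100 * grid, post, type = "l", xlab = "pv (%)", ylab = "posterior density")
```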
[Figures: the grid evaluation in R, and the resulting posterior density for pv]
Point estimates
• If you have a sample x1, x2, ... from a distribution, you can represent its overall location using the mean, median or mode
• Similarly, you can report as a point estimate the mean, median or mode of the posterior
In R

Method  Estimate
Mean    0.63%
Median  0.63%
Mode    0.62%
MLE     0.62%
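Continuing the grid sketch from above, these summaries can be computed as follows (note that with a flat prior the posterior mode equals the MLE exactly):

```r
# Point estimates from the grid posterior (grid, post, width as above)
post_mean   <- sum(grid * post) * width        # posterior mean
post_mode   <- grid[which.max(post)]           # posterior mode (= MLE under a flat prior)
cdf         <- cumsum(post) * width            # grid approximation to the CDF
post_median <- grid[which.min(abs(cdf - 0.5))] # posterior median
mle         <- x_v / n_v

round(100 * c(mean = post_mean, mode = post_mode, median = post_median, mle = mle), 2)
```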
Uncertainty
• Two common methods to get an uncertainty (credible) interval:
– quantiles of the posterior (eg the 2.5th and 97.5th percentiles)
– the highest posterior density (HPD) interval
• Since there is a 95% chance that a parameter value drawn from the posterior falls in this interval, the interpretation is the one many people (mistakenly) give to confidence intervals
Highest posterior density intervals

In R

Method                     95% interval
Posterior quantiles        (0.47, 0.82)%
Wald (frequentist)         (0.45, 0.79)%
Highest posterior density  (0.47, 0.81)%
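Again continuing the grid sketch, here is one way to get a quantile-based interval and an HPD interval; the HPD step accumulates grid points in order of decreasing density, which works because this posterior is unimodal:

```r
# Interval estimates from the grid posterior (grid, post, width as above)
cdf <- cumsum(post) * width

# Equal-tailed 95% credible interval from the posterior quantiles
eqt <- c(grid[which.min(abs(cdf - 0.025))],
         grid[which.min(abs(cdf - 0.975))])

# 95% highest posterior density interval: keep the highest-density grid
# points until they contain 95% of the probability, then take their range
ord  <- order(post, decreasing = TRUE)
keep <- ord[cumsum(post[ord]) * width <= 0.95]
hpd  <- range(grid[keep])

round(100 * rbind(quantiles = eqt, hpd = hpd), 2)
```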
Important points
• In some situations it doesn't really matter if you do a Bayesian or a classical analysis, as the results are effectively the same:
– the sample size is large, so asymptotic theory is justified
– there is no prior/external information for the analysis
– someone has already developed a classical routine
• In other situations, Bayesian methods come into their own!
Philosophical points
• If you really love frequentism and hate Bayesianism, you can pragmatically use Bayesian approaches and interpret them like classical ones
• If vice versa, you can
– use classical estimates from the literature as if they were Bayesian
– arguably, interpret classical point/interval estimates the way you want to
Priors and posteriors
• A prior probability of BC reflects the information you have before observing the mammogram: all you know is the risk class the patient sits in
• The posterior probability of BC reflects the information after observing the mammogram
• A prior probability density function for pv reflects the information you have before the study results are known
• The posterior probability density function reflects the information after the study, including anything known before and everything from the study itself

How much knowledge, how much uncertainty?
Justification
• Statistician, Ms A, is analysing some data. She comes up with a model for the data based on some simplifying assumptions. She must justify this choice if others are to believe her.
• Bayesian statistician, Mr B, is analysing some data. He must come up with a model for the data and for the parameters. He too must justify his choice.

For instance, Ms A wants to do a logistic regression on the following data. Outcome: got infected by H1N1, as measured by serology. Predictors: age, gender, recent overseas travel, number of children in household, ...

There is no reason why the effect of age on the risk of infection should be linear in the logit of risk. There is no reason why each predictor's effect should be additive on the logit of risk. There is no reason why individuals should be taken to be independent. These are choices made by the statistician.
Support
• Each parameter of a model has a support
• The prior should match this:
– $X \sim \mathrm{Bin}(n, p)$: $p \in [0,1]$
– $Y \sim N(\mu, \sigma^2)$: $\mu \in \mathbb{R}$, $\sigma^2 \in \mathbb{R}^+$
– $Y_i \sim N(a + b x_i, \sigma^2)$: $a \in \mathbb{R}$, $b \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$
• All a bit silly (priors that ignore the support):
– $p \sim N(0, 100^2)$
– $\sigma \sim \mathrm{Be}(10.2, 3.9)$
– $b \sim \exp(1000)$
Priors for multiple parameters
$Y_i \sim N(a + b x_i, \sigma^2)$, with $a \in \mathbb{R}$, $b \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$
• You must specify a joint prior for all parameters, eg p(a, b, σ)
• It is often easiest to assume the parameters are a priori independent, eg
p(a, b, σ) = p(a) p(b) p(σ)
• (note this does not force them to be independent a posteriori)
• But you can incorporate dependency if appropriate, eg if you analyse dataset 1 and use its posterior as a prior for dataset 2
Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate
Informative and non-informative priors

Informative: encapsulates information beyond that available solely in the data directly at hand. For instance, if someone has previously estimated the risk of infection by HIV in Thai adults and reported point and interval estimates, you could take those and convert them into an appropriate prior distribution.

Non-informative: the opposite: a distribution that is flat or approximately flat over the range of parameter values with high likelihood values. Eg pv ~ U(0,1) is non-informative as it is flat over the range 0.5-1.5% where the data tell you pv should be. Eg mu ~ U(-1000000, 1000000) might be non-informative for a parameter on the real line; as might N(0, 1000²).
When to choose which?

Use a non-informative prior if:
• your primary data set has so much information in it that you can estimate the parameters with no problems
• you only have one data set
• you have no really solid estimates from the literature with which to supplement the information from your primary data
• you want to approximate a frequentist analysis

Use an informative prior if:
• your primary data set doesn't give enough information to estimate all unknowns well (see next chapter for an example)
• you have multiple data sets and can best analyse them one at a time
• you have really good estimates from the literature that everyone accepts
• you are analysing the data for your own benefit, to make a decision, say, and do not need the acceptance of others
Q: I've decided I want a non-informative prior. But what form?

Parameter support       Possible non-informative priors
[0,1]                   U(0,1), Be(1,1), Be(1/2,1/2)
Positive real line      U(0,∞), U(0, big number), exp(big mean), gamma(big variance?), log N(mean 1, big variance?), truncated N(0, big variance)
Real line               U(−∞,∞), U(−big number, big number), N(0, big variance)

The exact choice rarely makes a difference.
Q: I've decided I want an informative prior and have found an estimate in the literature. So, how?
Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate
Proper and improper priors
• Recall: $\int_X f(x)\,\mathrm{d}x = 1$
• Distributions are supposed to integrate to 1
• Prior distributions really should, too
• A prior that integrates to 1 is proper
• One that doesn't is improper
Proper and improper posteriors
An improper posterior is a bad outcome!

Prior     Posterior
Proper    Proper     ✓
Improper  Proper     ✓
Improper  Improper   ✗
Bad likelihoods
• If the likelihood is 'badly behaved' then not only do you need a proper prior, you need an informative prior, as there is insufficient information in the data to estimate that parameter (or those parameters)
Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate
Conjugate priors
• So, with our binomial model, we moved
– from a prior for pv that was beta: pv ~ Be(1, 1), the uniform
– to a posterior for pv that was beta: pv | Xv ~ Be(1 + Xv, 1 + Nv − Xv)
• We therefore say that the beta is conjugate to the binomial
Conjugate priors
• There are a handful of other data models with conjugate priors
• We may encounter some later in the course
• Most real problems do not have conjugate priors, though
• If yours does, it makes sense to exploit it
• Eg for the Thai vaccine trial, once you realise pv is beta a posteriori, you can summarise the posterior directly
Summarising a posterior directly
Eg the posterior mean is $E(p_v \mid X_v, N_v) = (1 + X_v)/(2 + N_v)$; other summaries (median, quantiles) come straight from the Be(1 + Xv, 1 + Nv − Xv) distribution.
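With conjugacy there is no need for a grid: in R the summaries come directly from the Beta distribution functions. A sketch using the trial numbers:

```r
# Conjugate analysis: Be(1,1) prior + Bin(n_v, p_v) likelihood
# gives a Be(1 + x_v, 1 + n_v - x_v) posterior
x_v <- 51; n_v <- 8197
a <- 1 + x_v; b <- 1 + n_v - x_v

post_mean <- a / (a + b)                   # = (1 + x_v) / (2 + n_v)
cred      <- qbeta(c(0.025, 0.975), a, b)  # equal-tailed 95% credible interval

round(100 * c(mean = post_mean, lower = cred[1], upper = cred[2]), 2)
# matches the grid approximation to two decimal places
```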
Different kinds of priors
[Figures: comparison of conjugate and non-conjugate priors]
Information to Bayesians
[Diagrams: prior + data → posterior; then prior + data 1 → posterior 1, and posterior 1 + data 2 → posterior 2. Does updating in stages give the same answer?]
A Gedanken
• Consider experiments to estimate a probability p given a series of Bernoulli trials, xi, with yi = Σj=1:i xj
• Use a Be(α, β) prior for p
• Experimenter 1, instead of waiting for all the data to come in, recalculates the posterior from scratch based on yi and (α, β) each time a data point comes in
• Experimenter 2 uses his last posterior and xi to recalculate the posterior
[Figures: Experimenter 1's and Experimenter 2's posterior updates after each datum]
• The two experimenters, using the same prior and the same data, end with the same posterior
• Experimenter 1 started afresh each time with the original prior and all the data
• Experimenter 2 updated the old posterior with the new datum
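A small R sketch of the Gedanken under a Be(1,1) prior (the data are simulated and the parameter choices are ours), confirming that the two routes coincide:

```r
# Sequential vs batch conjugate updating give identical posteriors
set.seed(1)
x <- rbinom(20, 1, 0.3)  # 20 simulated Bernoulli trials
a0 <- 1; b0 <- 1         # Be(1,1) prior

# Experimenter 2: update the posterior one datum at a time
a <- a0; b <- b0
for (xi in x) { a <- a + xi; b <- b + 1 - xi }

# Experimenter 1: recalculate from scratch with the original prior and
# all the data (shown here at the final step)
a_batch <- a0 + sum(x)
b_batch <- b0 + length(x) - sum(x)

c(a, b) == c(a_batch, b_batch)  # TRUE TRUE: the same Be(a, b) posterior
```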
Implications
• If data come to you piecemeal, it doesn't matter if you analyse them once at the end, or at each intermediate point, updating your prior
• (In practice one or the other may be more convenient: eg if the posterior is not analytic, it makes sense to estimate/approximate once, rather than once per datum)
• You can take estimates from the literature and convert them into priors
• You can always treat an old posterior obtained elsewhere as a prior
What did we learn in chapter 1?
• Bayes' rule: applied to the probability of a state of nature (BC) given evidence (MG) and background risk (age)
• Refresher on frequentist estimation
• Estimating a proportion given x, n: saw how Bayes' rule can be used to derive the posterior probability density of a parameter given data
• Priors
• Accumulation of evidence
What did we learn in chapter 1?
• We don't yet know how to do Bayesian inference for problems with more than one parameter!

Chapters 2 & 3: computing posteriors
• Importance sampling
• Markov chain Monte Carlo