Bayesian Methods and Subjective Probability
Daniel Thorburn
Stockholm University
2011-01-10
Outline
1. Background to Bayesian statistics
2. Two simple rules
3. Why not design-based?
4. Bayes, public statistics and sampling
5. De Finetti's theorem, the Bayesian bootstrap
6. Comparisons between paradigms
7. Preposterior analysis
8. Statistics in science
9. Complementary Bayesian methods
1. Background
• Mathematically:
– Probability is a positive, finite, normed, σ-additive measure defined on a σ-algebra
• But what does that correspond to in real life?
What is the probability of heads in the following sequence?
Does it change? And when?
– This is a fair coin
– I am now going to toss it in the corner
– I have tossed it but no one has seen the result
– I have got a glimpse of it but you have not
– I know the result but you don't
– I tell you the result
• Laplace's definition: "All outcomes are equally probable if there is no information to the contrary". (Number of favourable elementary events / number of possible elementary events.)
• Choose heads and bet on it with your neighbour. You get one krona if you are right and lose one if you are wrong. When should you change from indifference?
• Frequency interpretation (LLN). If there is an infinite sequence of independent experiments, then the relative frequency converges a.s. towards the true value. This cannot be used as a definition for two reasons:
– It is a vicious circle: independence is defined in terms of probability.
– It is logically impossible to define uncountably many different quantities by a countable procedure.
Probabilities do not exist
(de Finetti)
• They only describe your lack of knowledge
• If there is a God almighty, he knows everything now, in the past and in the future. ("God does not play dice", Einstein)
• But lack of knowledge is personal, thus probability is subjective
• Kolmogorov's axioms alone do not say anything about the relation to reality
• Probability is the language which describes uncertainty
• If you do not know a quantity you should describe your opinion in terms of probability
• Probability is subjective and varies between persons and over time, depending on the background information
Rational behaviour – one person
• Axiomatic foundation of probability. Type:
– For any two events A and B exactly one of the following must hold: A < B, A > B or A ~ B (read: A less likely than B, A more likely than B, A and B equally likely)
– If A1, A2, B1 and B2 are four events such that A1∩A2 and B1∩B2 are empty, A1 ≥ B1 and A2 ≥ B2, then A1∪A2 ≥ B1∪B2. If further either A1 > B1 or A2 > B2, then A1∪A2 > B1∪B2
– …
• If these axioms hold, all events can be assigned probabilities which obey Kolmogorov's axioms (Villegas, Annals Math Stat, 1964)
• Axioms for behaviour. Type:
– If you prefer A to B, and B to C, then you must also prefer A to C
– …
• If you want to behave rationally, then you must behave as if all events were assigned probabilities (Anscombe and Aumann, Annals Math Stat, 1963)
• Axioms for probability (these six are enough to prove that a probability following Kolmogorov's axioms can be defined, plus the definition of conditional probability)
– For any two events A and B exactly one of the following must hold: A < B, A > B or A ~ B (read: A less likely than B, A more likely than B, A and B equally likely)
– If A1, A2, B1 and B2 are four events such that A1∩A2 and B1∩B2 are empty, A1 ≥ B1 and A2 ≥ B2, then A1∪A2 ≥ B1∪B2. If further either A1 > B1 or A2 > B2, then A1∪A2 > B1∪B2
– If A is any event, then A ≥ (the impossible (empty) event)
– If Ai is a strictly decreasing sequence of events and B a fixed event such that Ai > B for all i, then (the intersection of all Ai) > B
– There exists one random variable which has a uniform distribution
– For any events A, B and D: (A|D) < (B|D) if and only if A∩D < B∩D
• Then one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality…
• Further one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality
– For any two outcomes A and B, one either prefers A to B, or B to A, or is indifferent
– If you prefer A to B, and B to C, then you must also prefer A to C
– If P1 and P2 are two distributions over outcomes they may be compared, and you are indifferent between A and the distribution with P(A) = 1
– Two measurability axioms, like:
• If A is any outcome and P a distribution, then the event that P gives an outcome preferred to A can be compared to other events (more likely …)
– If P1 is preferred to P2 and A is an event, A > 0, then the game giving P1 if A occurs is preferred to the game giving P2 under A, if the results under not-A are the same
– If you prefer P1 to P and P to P2, then there exist numbers a > 0 and b > 0 such that P1 with probability 1−a and P2 with probability a is preferred to P, which is preferred to P1 with probability b and P2 with probability 1−b
There is only one type of numbers,
which may be known or unknown.
• Classical inference has a mess of different types of numbers, e.g.
– Parameters
– Latent variables, as in factor analysis
– Random variables
– Observations
– Independent (explanatory) variables
– Dependent variables
– Constants
– and so on
• Superstition!
2. Two simple requirements for
rational inference
Rule 1
• What you know/believe in advance +
the information in the data =
what you know/believe afterwards
• This is described by Bayes' formula:
• P(θ|K) · P(X|θ,K) ∝ P(θ|X,K)
• or in terms of the likelihood:
• P(θ|K) · L(θ|X) ∝ P(θ|X,K)
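Rule 1 can be sketched in a few lines of code: multiply the prior by the likelihood and renormalise. A minimal sketch, with invented numbers for three hypotheses:

```python
# Minimal sketch of Rule 1: the posterior is proportional to prior times
# likelihood.  The three hypotheses and their numbers are purely illustrative.

def posterior(prior, likelihood):
    """P(theta | X, K)  proportional to  P(theta | K) * L(theta | X)."""
    unnorm = {t: prior[t] * likelihood[t] for t in prior}
    z = sum(unnorm.values())                 # normalising constant
    return {t: p / z for t, p in unnorm.items()}

prior = {"T=0": 1 / 3, "T=2": 1 / 3, "T=4": 1 / 3}   # what you believe in advance
like  = {"T=0": 1.0, "T=2": 1 / 6, "T=4": 0.0}       # information in the data
post = posterior(prior, like)                        # what you believe afterwards
```

Feeding `post` back in as the prior of a second call implements the corollary below: the output has the same form as the input.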
Rule 1, corollary
• What you believe afterwards +
the information in a new study =
what you believe after both studies
• The result of the inference should be possible to use as an input to the next study
• It should thus be of the same form!
• Note that hypothesis tests and confidence intervals can never appear on the left-hand side, so they do not follow Rule 1
Rule 2
• Your knowledge must be given in a form that can be used for deciding actions. (At least in a well-formulated problem with well-defined losses/utility.)
• If you are rational, you must use the rule which minimizes expected "losses" (maximizes utility)
• D_opt = argmin_D E(Loss(D, θ) | X, K)
= argmin_D ∫ Loss(D, θ) P(θ | X, K) dθ
• Note that classical design-based inference has no interface with decisions.
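The argmin above is direct to compute for a discrete posterior. A sketch with an invented decision problem and loss table:

```python
# Sketch of Rule 2: the rational decision minimises posterior expected loss.
# The decisions, the loss function and the posterior are invented for illustration.

def bayes_decision(decisions, posterior, loss):
    """D_opt = argmin_D  sum_theta Loss(D, theta) * P(theta | X, K)."""
    def expected_loss(d):
        return sum(loss(d, theta) * p for theta, p in posterior.items())
    return min(decisions, key=expected_loss)

post = {"no effect": 0.7, "effect": 0.3}          # P(theta | X, K)
# Missing a real effect is assumed five times as costly as a false alarm.
loss = lambda d, theta: 0.0 if d == theta else (5.0 if theta == "effect" else 1.0)
best = bayes_decision(["no effect", "effect"], post, loss)
```

Note that with this loss the rational act is "effect" even though "no effect" is more probable: the posterior alone is not enough, the interface with the losses is what Rule 2 demands.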
Statistical tests are useless
• They cannot be combined with new data.
• They cannot be used even in simple decision problems.
• They can be compared to the blunt plastic knife given to a three-year-old child:
– He cannot do much sensible with it
– But he cannot harm himself either
3. An example of the stupidity of
frequency-based (design-based) methods
N = 4, n = 2, SRS. Dichotomous data, black or white. The variable is known to come in pairs, i.e. the total is T = 0, 2 or 4.
Probabilities:

Population \ outcome    0 white   1 white   2 white
No white,  T = 0           1         –         –
2 white,   T = 2          1/6       4/6       1/6
All white, T = 4           –         –         1

If you observe 1 white you know for sure that the population contains 2 white.
If you observe 0 or 2 white, the only unbiased estimate is T* = 0 resp. 4.
The variance of this estimate is 4/3 if T = 2 (= 1/6·4 + 4/6·0 + 1/6·4) and 0 if T = 0 or 4.
So if you know the true value the design-based variance is 4/3, and if you are uncertain the design-based variance is 0. (Standard unbiased variance estimates are 2 resp. 0.)
Bayesian analysis works OK
– We saw the Bayesian analysis when t = 1 (T* = 2).
– If all possibilities are equally likely a priori, the posterior estimate of T when t = 0 (2) is T* = 2/7 (26/7) and the posterior variance is 24/49.
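These numbers can be checked directly: a uniform prior on T ∈ {0, 2, 4} combined with the hypergeometric likelihood of the observed count t. A short re-derivation of the talk's own example:

```python
from math import comb

# Re-derivation of the example: N = 4, n = 2 SRS, totals restricted to
# T in {0, 2, 4} with a uniform prior; t = observed number of white.

def post_T(t, N=4, n=2):
    prior = {0: 1 / 3, 2: 1 / 3, 4: 1 / 3}
    like = {T: comb(T, t) * comb(N - T, n - t) / comb(N, n)
            for T in prior if t <= T and n - t <= N - T}
    z = sum(prior[T] * like[T] for T in like)
    return {T: prior[T] * like[T] / z for T in like}

p = post_T(0)                                         # observed no white balls
mean = sum(T * q for T, q in p.items())               # posterior mean, 2/7
var = sum(T * T * q for T, q in p.items()) - mean**2  # posterior variance, 24/49
```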
Always stupid?
• It is always stupid to believe that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions.)
• But it is not always as obvious and as stupid as in this example.
• Is this a consequence of the unusual prior, that T must be even?
Example without the prior info
Still stupid, but not quite as much

Population \ outcome     0      1      2     Var(2t | T)
T = 0                   6/6     –      –          0
T = 1                   3/6    3/6     –          1
T = 2                   1/6    4/6    1/6        4/3
T = 3                    –     3/6    3/6         1
T = 4                    –      –     6/6         0
Var(T | t)              9/20   6/10   9/20

If you observe 1, the true error is never larger than 1, but the standard deviation is at least 1 for all possible parameter values.
Always stupid?
• It is always stupid to assume that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions.)
• But it is not always as obvious and stupid as in these examples.
• Under suitable regularity conditions, design-based methods are asymptotically as efficient as Bayesian methods:

Var(θ* | θ) / Var(θ | X_n) → 1  a.s. P(θ, X_n)  as n → ∞
• Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian.
• So do Bayesians.
• But they also draw the conclusion:
• Always use Bayesian methods!
• Classical methods can sometimes be seen as quick and dirty approximations to Bayesian methods.
• Then you may use them.
4. What is special for many statistical
surveys, e.g. public statistics?
• Answer 1:
The producer of the survey is not the user.
– Often many readers and many users.
– The producer has no interest in the figures per se.
• P(θ | K_user) is not known to the producer, and is not unique:
P(θ | K_user) · L(θ|X) ∝ P(θ | X, K_user)
• Solution: publish L(θ|X), so that any reader can plug in his prior.
• Usually given in the form of the posterior with a vague, uninformative (often = constant) prior:
L(θ|X) ∝ P(θ|K_0) · L(θ|X) ∝ P(θ | X, K_0)
Describing the likelihood
• Estimates are often asymptotically normal. Then it is enough to give the posterior mean and variance, or a (symmetric) 95% prediction interval (for large samples).
• When the maximum likelihood estimator is approximately efficient and normal, the ML estimate and the inverse Fisher information are enough (⇒ standard confidence interval).
• Asymptotically efficient ⇒ for large samples almost as good as Bayesian estimates, which are known to be admissible also for finite samples.
What is special for many statistical
surveys, e.g. public statistics?
• Answer 2:
There is no parameter, or more exactly:
the parameter consists of all the N values of all the units in the population.
• Use this vector as the parameter θ in Bayes' formula.
• If you are interested in a certain function, e.g. the total Y_T, integrate out all nuisance parameters in the posterior to get the marginal of interest:

P(Y_T | X, K) = ∫…∫_{Σ θ_i = Y_T} P(θ_1, …, θ_N | X, K) ∏_{i=1}^{N−1} dθ_i
5. De Finetti's theorem
• Random variables are said to be exchangeable if there is no information in the ordering. This is for instance the case with SRS.
• If a sequence of random variables is infinitely exchangeable, then they can be described as independent variables given θ, where θ is a latent random variable. (The proof is simple but needs some knowledge of random processes. Formally θ is defined on the tail σ-algebra.)
• Latent means in this case that it does not exist, but can be useful when describing the distribution.
• This imaginary random variable can take the place of a parameter.
• But note that it does not exist (is not defined) until the full infinite sequence has been defined, and the full sequence will never be observed.
• Note also that most sequences in the real world are not independent but only exchangeable. If you toss a coin 1000 times and get 800 heads, it is more likely that the next toss will be heads (compared to the case with 200 heads).
• So obviously there is a dependence between the first 1000 tosses and the 1001st.
Dichotomous variables or
the Polya urn scheme
– In an urn there is one white and one black ball.
– Draw one ball at random. Note its colour.
– Put it back together with one more ball of the same colour.
– Draw one at random …
• This sequence can be shown to be exchangeable, and by de Finetti's theorem it can be described as:
– Take θ ~ U(0,1) = Beta(1,1)
– Draw balls independently with this probability of being white
• There is no way to see the difference between a Bernoulli sequence (binomial distribution) with an unknown p and a Polya urn scheme. Since the outcomes follow the same distribution, there cannot exist any test to differentiate between them.
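This equivalence can be checked exactly for small k: starting from one white and one black ball, the number of white draws among k is uniform on 0…k, exactly as for Bernoulli draws with p ~ Beta(1,1). A small exact computation (no simulation needed):

```python
from fractions import Fraction

# Exact distribution of the number of white draws after k Polya steps,
# starting from one white and one black ball.  De Finetti's theorem says
# this must equal the Beta(1,1)-mixed binomial, i.e. uniform on 0..k.

def polya_dist(k):
    states = {(1, 1): Fraction(1)}            # (white, black) -> probability
    for _ in range(k):
        nxt = {}
        for (w, b), p in states.items():
            tot = w + b
            nxt[(w + 1, b)] = nxt.get((w + 1, b), 0) + p * Fraction(w, tot)
            nxt[(w, b + 1)] = nxt.get((w, b + 1), 0) + p * Fraction(b, tot)
        states = nxt
    # whites drawn so far = current white count minus the initial white ball
    return {w - 1: p for (w, b), p in states.items()}

d = polya_dist(5)     # every count 0..5 has probability exactly 1/6
```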
Dichotomous variables or
the Polya urn scheme
• We could have started with another number of balls. This would have given other parameters in the prior Beta distribution:
• Beta(a, b) ⇔ a white balls and b black balls
• E(Beta) = a/(a + b)
• Var(Beta) = ab/((a + b)²(a + b + 1))
Dichotomous variables or
the Polya urn scheme
• This can be used to derive the posterior distribution of the number (Y_T) of balls/persons with a certain property (white) in a population, given an observed SRS sample of size n with y_S white balls/persons.
• Use a prior with parameters such that the expected value is your best guess of the unknown proportion, and the standard deviation describes your uncertainty about it.
Properties
• The posterior distribution can be shown to be (with C(·,·) the binomial coefficient and B(·,·) the beta function):

Pr(Y_T = y_T | y_S) = C(N−n, y_T−y_S) · B(a + y_T, b + N − y_T) / B(a + y_S, b + n − y_S)

E(Y_T | y_S) = y_S + (N − n)(a + y_S)/(n + a + b)

Var(Y_T | y_S) = (a + y_S)(n + b − y_S)(N + a + b)(N − n) / ((n + a + b)²(n + a + b + 1))

• With both parameters set to 0, the expected value is Np* and the variance p*(1 − p*)N(N − n)/(n + 1).
• The design-based estimate and variance estimator are good approximations to this (equal apart from n in place of n + 1).
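The mean and variance formulas above are straightforward to code, with the flat-prior limit as a check (the survey sizes here are invented):

```python
# Posterior mean and variance of the population total Y_T under a Beta(a, b)
# prior, as given above.  With a = b = 0 they reduce to the design-based
# expressions N*p_hat and p_hat*(1 - p_hat)*N*(N - n)/(n + 1).

def posterior_moments(N, n, y_s, a=0.0, b=0.0):
    mean = y_s + (N - n) * (a + y_s) / (n + a + b)
    var = ((a + y_s) * (n + b - y_s) * (N + a + b) * (N - n)
           / ((n + a + b) ** 2 * (n + a + b + 1)))
    return mean, var

m, v = posterior_moments(N=1000, n=100, y_s=30)      # flat prior, p_hat = 0.3
```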
Simulation
• It is often easier to simulate from the posterior than to give its exact form.
• In this case the urn scheme gives a simple way to simulate the full population: just continue with the Polya sampling scheme, starting from the sample.
• If you repeat this 1000 times, say, and plot the 1000 simulated population totals in a histogram, you will get a good description of the distribution of the unknown quantity.
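A sketch of that simulation, continuing the Polya scheme from the sample (sizes invented; the flat a = b = 0 prior is implicit in starting the urn from the data alone):

```python
import random

# Simulate the posterior of the population total by continuing the Polya
# urn from the observed sample until the population size N is reached.

def simulate_total(sample, N, rng):
    urn = list(sample)                   # start the urn from the observed SRS
    while len(urn) < N:
        urn.append(rng.choice(urn))      # Polya step: draw one, put back two
    return sum(urn)

rng = random.Random(1)
sample = [1] * 30 + [0] * 70             # y_S = 30 white out of n = 100
totals = [simulate_total(sample, 1000, rng) for _ in range(500)]
# a histogram of `totals` describes the posterior of the unknown total
```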
Dirichlet-multinomial
• If the distribution is discrete with a finite number of categories, a similar procedure is possible.
• Just draw from the set of all observations and put it back together with a similar observation. Continue until N.
• Repeat, and you get a number of populations which are drawn from the posterior distribution.
• For each population compute the parameter of interest, e.g. the mean or median, and plot the values in a histogram.
• If this is described as in de Finetti's theorem, the parameter comes from a Dirichlet distribution and the observations are conditionally independent multinomial.
The Bayesian bootstrap
• This procedure is called the Bayesian bootstrap (if an uninformative prior, i.e. all parameters = 0, is used).
• It can be generalised to variables measured on a continuous scale.
• The design-based estimate gives the same mean estimate as this (for polytomous populations).
• The design-based variance estimator is also close to the true variance, apart from a factor n/(n + 1).
• Note that if the distribution is skew, this method does not work well, since it does not use the prior information of skewness (nor do the design-based methods).
• Note also that with many categories it may be better to use even smaller parameters, e.g. −0.9.
Other Bayesian models
• There are many other models/methods within Bayesian survey sampling than the Bayesian bootstrap.
• Another approach starts with a normal-gamma model:
– Given μ, σ², data come from an iid Normal(μ, σ²) model
– The variance σ² follows a priori an inverse gamma distribution
– The mean μ follows a priori a normal distribution with mean m and variance kσ²
• and later relaxes the normality assumption,
• but I do not have enough time here.
6. Properties of some different paradigms within survey sampling

                   Design-based        Model-based            Bayesian
Uncertainty,       Home-made           Given by nature,       Subjective,
randomness                             frequency-based        rationality axioms
Main focus         Population          Population             Population
                   parameters          parameters             values
Unknown            –                   –                      Do not exist, but
unobservables                                                 useful (de Finetti)
Inference          Long-run            Long-run               This case,
                   properties based    properties based       probability based
Output             Point estimates,    Point estimates,       Full posterior
                   intervals           confidence intervals   distributions,
                                                              means, variances
Possible use       Not my problem      Not my problem         Interface with
                                                              decisions
7. Preposterior analysis
Study/experimental design
• In the design of a survey one must take the posterior distribution into account.
• You may e.g. want to:
– Get a small posterior variance
– Get a short 95% prediction interval
– Make a good decision
• This analysis of the possible posterior consequences, before the experiment is carried out, is called preposterior analysis.
Preposterior analysis with public statistics
• Usually, when you make a survey for your own benefit, you should use your own prior both in the preposterior and the posterior analysis.
• With public statistics you should have a flat prior in the posterior analysis,
– e.g. the posterior variance is Var(θ | X, K_0).
• But the design decision is yours, and you should use all your information for that decision,
– e.g. find the design which minimizes E(Var(θ | X, K_0) | K_You).
Example: Neyman allocation;
dichotomous data
• M strata with N_m elements. The unknown proportion in stratum m is p_m.
• How many elements should be drawn from each stratum in order to estimate the average proportion best?
• Neyman: choose n_m ∝ N_m (p_m(1 − p_m))^{1/2}.
• But p_m is unknown. Classical people: use your best subjective guess p_m0.
• Neyman: choose n_m ∝ N_m (p_m0(1 − p_m0))^{1/2}.
Bayesians:
Do not use a one-point prior.
It is too subjective!
Take also your prior uncertainty into account!
• Choose e.g. the prior Beta(a_m, b_m), m = 1, …, M,
– where p_m0 = a_m/(a_m + b_m)
– and Var(p_m) = a_m b_m/((a_m + b_m)²(a_m + b_m + 1))
Example: Optimal allocation
• M strata of size N_m, dichotomous data, independent priors (a_m, b_m) (as we saw above). Posterior variance:

Var(Y_tot | y_{S_m}, m = 1, …, M) = Σ_m p_m*(1 − p_m*) N_m(N_m − n_m)/(n_m + 1)

• The expected value of this is

Σ_m [N_m(N_m − n_m)/n_m] · [(n_m − 1)/(n_m + 1)] · a_m b_m/((a_m + b_m)(a_m + b_m + 1))

• Minimising this gives approximately the sample sizes

n_m + a_m + b_m ∝ (N_m/√c_m) · √(a_m b_m/((a_m + b_m)(a_m + b_m + 1)))

• The terms (a_m + b_m) on the left-hand side should not be there in the case of public statistics (c_m = cost per unit in stratum m).
• This differs from Neyman allocation, since it takes the prior uncertainty of the proportions into account.
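A sketch of the resulting allocation rule, with invented strata and equal costs c_m = 1:

```python
import math

# Illustrative Bayesian allocation: n_m + a_m + b_m taken proportional to
# (N_m / sqrt(c_m)) * sqrt(a_m b_m / ((a_m + b_m)(a_m + b_m + 1))).
# Stratum sizes and priors below are invented; costs are all set to 1.

def allocations(strata, n_total):
    """strata = [(N_m, a_m, b_m), ...]; returns the sample sizes n_m."""
    w = [N * math.sqrt(a * b / ((a + b) * (a + b + 1))) for N, a, b in strata]
    budget = n_total + sum(a + b for _, a, b in strata)  # total of n_m + a_m + b_m
    scale = budget / sum(w)
    return [scale * w_m - (a + b) for w_m, (_, a, b) in zip(w, strata)]

ns = allocations([(1000, 2, 2), (1000, 8, 2)], n_total=100)
```

The first stratum, with the more uncertain Beta(2, 2) prior, receives the larger sample, which is exactly the departure from Neyman allocation that the slide describes.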
8. Statistics in science
Science is more complicated
• One may divide it into (at least) three phases:
1. Exploratory
2. Trying to get a good picture, convincing yourself
3. Proving the fact – convincing others
• These phases may require different approaches and priors:
1. Your own prior, but critical
2. During work: your own prior, often informative, based on theory, arguments, experience or the exploration. In the presentation: usually a vague prior.
3. Other possible priors (but use also vague ones)
8.1 Exploratory
• Sometimes called hypothesis generating.
• Most theories are false. Most substances are useless against cancer.
• Use your own priors, which most often say that all facts are most unlikely. Some examples:
– Screening – all substances have a probability of 0.001 of having some effect.
– Regression situations with an abundance (M) of explanatory variables. If all variables are ordered after importance, all M! orderings are equally likely. Given the ordering, the m-th regression coefficient is N(0, 1/(m−1)²) (after standardisation of X). (Another possibility is that the reduction in unexplained variance 1 − R² is Beta(1, m²).)
• When there is support from theory or previous experiences, other priors may be used.
8.2 Getting a good picture yourself
(assuming that you are the scientist)
• In classical terms this is the phase when you can formulate the hypotheses that you want to test. (Your prior is strong enough to formulate hypotheses.)
• (In classical theory there is no description of the first phase. Mostly one said: if you can formulate a hypothesis you may test it, as long as you do not formulate too many.)
• Your priors should still be your own.
• The reporting in this phase is quite similar to what was said about official statistics, i.e. try to give a good picture of the likelihood function.
• But, contrary to public statistics, in the design of experiments it is your posterior precision that should be maximised. (It is assumed that you are an expert in the field, and there is no reason to believe that your opinion is far from the present state of knowledge.)
8.3 Proving scientific facts
• It is very easy to convince people who believe in the fact from the beginning.
• It is often fairly simple to convince yourself, even if you are broad-minded.
• But to prove a scientific fact you must convince also those that have reasonable doubts.
Proving scientific facts
• P(θ|X,K) ∝ P(X|θ,K) · P(θ|K)
• A person is convinced of a fact when his posterior probability for the fact is close to one.
• But to prove the fact scientifically, this must hold for all reasonable priors, including those describing reasonable doubt.
• Even if there is no such person, this must hold also for that prior as long as it is reasonable.
• I.e. a result is "proved" if
inf (P(θ|X,K); K reasonable) > 1 − α for some α.
• Reporting: use vague priors, but also show what the consequences are for some priors with (un-)reasonable doubt.
• When you prove something, all available data should be used. Type: meta-analysis. In some fields one study is usually not enough to convince people.
• Designing experiments: design your experiments so that you maximise
E(inf (P(θ|X,K); K reasonable) | K_YOU) (if you are convinced).
What is reasonable doubt?
Convincing others
• You have to contemplate what is meant by reasonable doubt.
• It depends on the importance of the subject.
• It can be just putting a very small prior probability on the fact to be proven.
• But you must also try to find the possible flaws in your theory, and design your experiments to counter them.
Priors with reasonable doubt
• Use priors with reasonable doubt:
– In an experiment to prove telepathic effects you could e.g. use priors like P(log-odds ratio = 0) = 0.9999. If the log-odds ratio is different from 0, it may be modelled as N(0, σ²), where σ² may be moderate or large.
– If the posterior then e.g. says that P(p > 0.5) > 0.95, you may consider the fact as proved. (Roughly, this means that you need about 20 successes in a row, where a random guess has probability ½ of being correct.)
– Never use the prior P = 1, since you must always be open-minded (only fundamentalists do so; they will never change their opinion whatever the evidence).
• In more standard situations you will probably not need quite so negative priors.
Modelling flaws in the theory/study
– Example: several studies needed
• An argument often met in medical studies is that no effect is proven until it is corroborated by at least three independent studies.
• This means that different conclusions must be drawn with different prior knowledge (no, one or two previous studies).
– People arguing like that violate the Neyman-Pearson theory.
• How can this be modelled in Bayesian terms? Some type of multilevel model:

θ_i* = μ + a_i + b_i + e_i,   i = 1, …, k

• where μ is the worldwide mean, a_i is the unknown methodology bias of study number i, b_i is the site-specific bias and e_i is the error of the experiment (usually with a known (posterior) distribution).
• μ has an uninformative prior.
• b_i can often be assumed to follow normal distributions with common mean, and variance following normal-χ⁻² distributions.
• a_i probably has prior distributions with much longer tails.
• The prior distributions for a_i and b_i might be estimated from other studies.
• If the prior for a_i is chosen so that two similar outliers out of two trials is not impossible, but three similar outliers is unlikely, we would end up requiring three independent trials with similar results.
• There should probably also be a selection bias included. This can be done, but it is too complicated for this short talk.
• In the same way, the distribution of μ may depend on how strict the inclusion criteria are.
• One may argue that the k trials are not independent: studies following the same protocol may get the same bias.
9. Some situations which design-based sampling cannot handle
• Many people say that one should choose the approach that is best for the problem at hand. Many problems are more difficult than others to handle design-based.
• For instance:
– Missing data
– Multiple imputation
– Small area estimation
– Outlier detection
– Editing
– Meta-analysis
– Synthetic estimation
– Coding and classification
– Total survey design
– …
Missing data
• A model for the missingness property is needed. The following Bayesian notions are commonly used, but not everyone realises that they are Bayesian.
• Missing completely at random (MCAR):
F(x, y, z, θ_xyz) = F(x, y, θ_xy) F(z, θ_z)
• Missing at random (MAR): "Given what you know, the response mechanism is independent of the other variables":
F(x, y, z, θ_xyz) = F(y, θ_y | X = x) F(z, θ_z | X = x) F(x, θ_x)
(where x is known; for unit non-response)
(or F(y, z, θ_yz | y, y ∈ R) = F(y, θ_y | y ∈ R) F(z, θ_z | y ∈ R), with item non-response)
• Not missing at random (NMAR)
• X auxiliary variables, Y study variables, Z missingness indicator; parameters are indexed with the variable for which they contain information.
Multiple imputation
• Many different situations. We only look at one situation with two y-variables (but use an MCMC technique).
• For some respondents one of the y-variables is missing, but which one differs between respondents.
• We assume MAR!
• We also assume, for now, that (y_1i, y_2i) comes from a normal super-population with unknown mean θ and unknown variance Σ.
Multiple imputation – MCMC procedure
1. Impute starting values for all missing values
2. Put b = 1
3. Find the true posterior of θ and Σ (assuming a vague normal inverse-gamma prior)
4. Draw possible θ_b and Σ_b from this distribution
5. Find the distribution of the missing values assuming these parameter values
6. Draw new random numbers from this distribution and impute them
7. Draw a random value from the conditional distribution of the sum of the non-sampled units (given parameters and imputed values)
8. Add all values to get an estimate of the totals Y_T1b, Y_T2b
9. Save them
10. If b < B_0 + B, set b = b + 1 and go to 3; else stop
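A heavily simplified sketch of steps 1–10: one y-variable instead of two, the vague limit of the normal-inverse-gamma prior, and invented data. It shows the shape of the loop, not the bivariate machinery of the talk:

```python
import random, statistics

# Univariate sketch of the MCMC procedure above (steps are marked).  A real
# multiple-imputation run would use the bivariate normal model from the talk.

def gibbs_totals(y_obs, n_miss, N, B0=100, B=400, seed=1):
    rng = random.Random(seed)
    miss = [statistics.mean(y_obs)] * n_miss         # step 1: starting values
    totals = []
    for b in range(B0 + B):                          # steps 2 and 10: the loop
        data = y_obs + miss
        n = len(data)
        ybar = statistics.mean(data)
        ss = sum((y - ybar) ** 2 for y in data)
        # steps 3-4: draw sigma^2 ~ Inv-Gamma((n-1)/2, ss/2), then mu | sigma^2
        sigma2 = (ss / 2) / rng.gammavariate((n - 1) / 2, 1.0)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # steps 5-6: re-impute the missing values given (mu, sigma^2)
        miss = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_miss)]
        # step 7: draw the sum of the N - n non-sampled units
        rest = rng.gauss((N - n) * mu, (sigma2 * (N - n)) ** 0.5)
        totals.append(sum(data) + rest)              # steps 8-9: save the total
    return totals[B0:]                               # discard the burn-in draws

draws = gibbs_totals([9, 10, 11, 10, 10, 9, 11, 10, 10, 10], n_miss=2, N=100)
```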
Multiple imputation cont.
• This is an ergodic Markov chain. The distribution converges to the true distribution of T, θ and Σ.
• Choose a burn-in period B_0 so large that convergence is reached.
• The remaining B observations are thus drawn from the true posterior.
• This distribution may be plotted.
Multiple imputation – not fully Bayesian
1–6. As before, to get imputed data set no. b
7. Estimate the total T_b (or whatever) from the sample with standard methods, and its variance S_b
8. Save them
9. If b < B_0 + B, set b = b + 1 and go to 3; else stop
• Compute the mean of the last B estimates (T_b). Use this as the estimate of the total
• Compute the mean of the last B variances (S_b)
• Compute the variance of the last B estimates (T_b)
• Use the sum of these two values as a variance estimator of the estimate of the total
• Note that this is a mixture of Bayesian and design-based variances. But it works as a classical estimate.
Multiple imputation cont.
• Posterior under normality.
• But what if the distribution is not normal?
• The means are still BLUE estimators of the parameter and the total, and the variance estimator is consistent.
• But if the distribution is skew, it will not be particularly good. This is not a defect of multiple imputation, but a problem with skew distributions in general.
Conclusions
• Always use Bayesian methods
– You will get new tools (e.g. full posterior distributions)
– You will produce something useful
– You will be logically consistent
– You will be able to tackle many more problems within the theory
• You may use design-based methods as "quick and dirty methods" when you know that the result will be almost equivalent to the Bayesian approach.
Thank you for your attention!