Bayesian Methods in Survey Sampling


Statistical Methods
Bayesian methods
Daniel Thorburn
Stockholm University
2012-03-20
Outline
1. Background to Bayesian statistics
2. Two simple rules
3. Why not design-based?
4. Bayes' theorem and likelihood
5. Binomial and Normal observations
1. Background to Bayesian statistics
• Mathematically:
  – Probability is a positive, finite, normed, σ-additive measure defined on a σ-algebra (Kolmogorov's axioms)
• But what does that correspond to in real life?
What is the probability of heads in the following sequence? Does it change? And when?
– This is a fair coin
– I am now going to toss it in the corner
– I have tossed it but no one has seen the result
– I have got a glimpse of it but you have not
– I know the result but you don't
– I tell you the result
What is the probability that Mitt Romney will be the next president of the USA?
• If I offer you 10 Swedish kronor if you guess right, which alternative would you choose (Yes or No)?
  – p < or > ½
• If I offer you 5 Swedish kronor if you correctly say yes and 15 if you correctly say no, what would you choose?
  – p < or > ¾
• If I offer you 3 Swedish kronor if you correctly say yes and 17 if you correctly say no, what would you choose?
  – p < or > 0.85
• And so on.
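A minimal sketch of the indifference arithmetic behind this elicitation (Python, added here; not part of the slides): with payoff a for a correct "yes" and b for a correct "no", you should say yes exactly when p > b/(a+b).

```python
def yes_threshold(payoff_yes: float, payoff_no: float) -> float:
    """Probability above which betting 'yes' has the higher expected payoff.

    Expected payoffs: yes -> payoff_yes * p, no -> payoff_no * (1 - p).
    Indifference where payoff_yes * p = payoff_no * (1 - p).
    """
    return payoff_no / (payoff_yes + payoff_no)

# The three offers from the slide:
for a, b in [(10, 10), (5, 15), (3, 17)]:
    print(f"yes pays {a}, no pays {b}: say yes if p > {yes_threshold(a, b):.2f}")
# -> 0.50, 0.75, 0.85
```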
Prior probability
• In this way you can determine your prior probability.
• It varies between persons, depending on their knowledge.
• Your opinion will also change over time as you gain more knowledge (e.g. the results of upcoming primaries).

• Laplace's definition: "All outcomes are equally probable if there is no information to the contrary." (number of favourable elementary events / number of possible elementary events)
• Choose heads and bet on it with your neighbour: you get one krona if you are right and lose one if you are wrong. When should you change from indifference?
• Frequency interpretation (LLN): if there is an infinite sequence of independent experiments, the relative frequency converges a.s. towards the true value. This cannot be used as a definition, for two reasons:
  – It is a vicious circle: independence is defined in terms of probability.
  – It is logically impossible to define uncountably many quantities by a countable procedure.
Probabilities do not exist (de Finetti)
• They only describe your lack of knowledge.
• If there is a God almighty, he knows everything now, in the past and in the future. ("God does not play dice", Einstein)
• But lack of knowledge is personal; thus probability is subjective.
• Kolmogorov's axioms alone say nothing about the relation of probability to reality.
• Probability is the language which describes uncertainty.
• If you do not know a quantity, you should describe your opinion in terms of probability (or, equivalently, odds = p/(1-p)).
• Probability is subjective and varies between persons and over time, depending on the background information.
The distribution of foreign students in this group
• My prior – before coming here
• See Excel sheet!
• Posterior after getting more and more observations
2. Two simple requirements for rational inference
Rule 1
• What you know/believe in advance + the information in the data = what you know/believe afterwards
• This is described by Bayes' formula:
  $P(\theta \mid K) \cdot P(X \mid \theta, K) \propto P(\theta \mid X, K)$
• or, in terms of the likelihood:
  $P(\theta \mid K) \cdot L(\theta \mid X) \propto P(\theta \mid X, K)$
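A minimal numeric sketch of Rule 1 (Python, added; the parameter grid, prior, and data are assumed for illustration, not from the slides): multiply the prior by the likelihood pointwise and renormalize.

```python
import numpy as np

# Three candidate parameter values and a uniform prior over them
theta = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])

# Data: 7 successes in 10 Bernoulli trials; likelihood P(X | theta)
k, n = 7, 10
likelihood = theta**k * (1 - theta)**(n - k)

# Rule 1: prior times likelihood, normalized, gives the posterior
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)   # mass moves toward theta = 0.8

# Rule 1 corollary: using this posterior as the prior for a second
# study gives the same answer as analysing all the data at once.
```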
Rule 1 corollary
• What you believe afterwards + the information in a new study = what you believe after both studies
• The result of the inference should be possible to use as input to the next study.
• It should thus be of the same form!
• Note that hypothesis tests and confidence intervals can never appear on the left-hand side, so they do not follow Rule 1.
Rule 2
• Your knowledge must be given in a form that can be used for deciding actions (at least in a well-formulated problem with well-defined losses/utility).
• If you are rational, you must use the rule which minimizes expected "losses" (maximizes utility):
  $D_{\mathrm{opt}} = \arg\min_D E(\mathrm{Loss}(D, \theta) \mid X, K) = \arg\min_D \int \mathrm{Loss}(D, \theta)\, P(\theta \mid X, K)\, d\theta$
• Note that classical design-based inference has no interface with decisions.
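A small sketch of the Rule 2 recipe for a discrete problem (Python, added; the posterior, decisions, and loss table are assumed for illustration): compute each action's posterior expected loss and take the argmin.

```python
import numpy as np

# Posterior over theta from some earlier analysis (assumed values)
theta = np.array([0.2, 0.5, 0.8])
posterior = np.array([0.1, 0.3, 0.6])

# Loss table: rows = decisions, columns = parameter values (assumed numbers)
decisions = ["accept", "reject"]
loss = np.array([
    [0.0, 2.0, 10.0],   # loss of "accept" under each theta
    [5.0, 1.0,  0.0],   # loss of "reject" under each theta
])

# Rule 2: expected loss of each decision under the posterior, then argmin
expected_loss = loss @ posterior
best = decisions[int(np.argmin(expected_loss))]
print(expected_loss, "->", best)   # [6.6, 0.8] -> reject
```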
Statistical tests are useless
• They cannot be combined with new data.
• They cannot be used even in simple decision problems.
• They can be compared to a blunt plastic knife given to a three-year-old child:
  – He cannot do much sensible with it
  – But he cannot harm himself either
3. An example of the stupidity of frequency-based methods

N=4, n=2, SRS. Dichotomous data, black or white. The variable is known to come in pairs, i.e. the total is T = 0, 2 or 4.

Probabilities P(t white in the sample | T):

Population \ outcome   0 white   1 white   2 white
No white  (T=0)           1         –         –
2 white   (T=2)          1/6       2/3       1/6
All white (T=4)           –         –         1

If you observe 1 white you know for sure that the population contains 2 white.
If you observe 0 or 2 white, the only unbiased estimate is T* = 0 resp. 4. (Prove it!)
The variance of this estimator is 4/3 if T=2 (= 1/6·4 + 4/6·0 + 1/6·4) and 0 if T=0 or 4.
So precisely when the observation can tell you the true value for sure (t=1, possible only when T=2), the sampling variance is 4/3; in the cases where you are left uncertain, it is 0.
(The standard unbiased variance estimate is 2 when t=1 and 0 when t=0 or 2.)
Bayesian analysis works OK
– We saw the Bayesian analysis when t=1 (T* = 2).
– If all possibilities are equally likely a priori, the posterior estimate of T when t=0 (t=2) is T* = 2/7 (26/7), and the posterior variance is 24/49.
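A quick check of these numbers (Python, added; not part of the slides): put a uniform prior on T ∈ {0, 2, 4}, reuse the sampling probabilities from the table, and compute the posterior mean and variance of T.

```python
import numpy as np

T = np.array([0, 2, 4])                  # possible population totals
prior = np.array([1/3, 1/3, 1/3])        # uniform prior over T

# P(t white in sample | T); rows T = 0, 2, 4; columns t = 0, 1, 2
lik = np.array([
    [1,   0,   0  ],
    [1/6, 2/3, 1/6],
    [0,   0,   1  ],
])

for t in range(3):
    post = prior * lik[:, t]
    post /= post.sum()
    mean = (post * T).sum()
    var = (post * T**2).sum() - mean**2
    print(f"t={t}: E[T|t]={mean:.4f}, Var[T|t]={var:.4f}")
# t=0 -> 2/7 ≈ 0.2857 and 24/49 ≈ 0.4898; t=1 -> exactly 2 and 0;
# t=2 -> 26/7 and 24/49, as on the slide
```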
Always stupid?
• It is stupid to believe that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run average over many repetitions when the parameter has a specified value.)
• But it is not always as obvious and as stupid as in this example.
• Is this a consequence of the unusual prior, where T must be even?
Example without the prior info
Still stupid, but not quite as much

Now T ∈ {0, 1, 2, 3, 4}, the estimator is θ* = 2t, and the posterior variances are computed under a uniform prior on T. Entries are P(t | T):

Population T \ outcome t   t=0    t=1    t=2    Var(θ*|θ) = Var(2t|θ)
0                          6/6     –      –             0
1                          3/6    3/6     –             1
2                          1/6    4/6    1/6           4/3
3                           –     3/6    3/6            1
4                           –      –     6/6            0
Var(θ|X)                   9/20   6/10   9/20

If you observe t=1, the true error is never larger than 1, but the standard deviation of the estimator is larger than (or equal to) 1 for all possible parameter values.
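The corresponding check for this table (Python sketch with scipy, added): hypergeometric sampling probabilities for N=4, n=2, posterior variances under a uniform prior on T, and the sampling variance of the estimator 2t.

```python
import numpy as np
from scipy.stats import hypergeom

N, n = 4, 2
Ts = np.arange(N + 1)
# lik[T, t] = P(t white in sample | T white in population)
lik = np.array([[hypergeom.pmf(t, N, T, n) for t in range(n + 1)] for T in Ts])

# Posterior variance of T for each outcome t, under a uniform prior on T
for t in range(n + 1):
    post = lik[:, t] / lik[:, t].sum()
    var_post = (post * Ts**2).sum() - (post * Ts).sum() ** 2
    print(f"Var(T | t={t}) = {var_post:.3f}")    # 0.450, 0.600, 0.450

# Sampling variance of the unbiased estimator 2t for each true T
for T in Ts:
    var_est = 4 * hypergeom.var(N, T, n)
    print(f"Var(2t | T={T}) = {var_est:.3f}")    # 0, 1, 4/3, 1, 0
```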
Always stupid?
• It is always stupid to assume that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions.)
• But it is not always as obvious and stupid as in these examples.
• Under suitable regularity conditions, design-based methods are asymptotically as efficient as Bayesian methods:
  $\frac{\mathrm{Var}(\theta^* \mid \theta)}{\mathrm{Var}(\theta \mid X_n)} \to 1 \quad \text{a.s. } P(\theta, X_n) \text{ as } n \to \infty$
• Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian.
• So do Bayesians.
• But they also draw the conclusion:
• Always use Bayesian methods!
• Classical methods can sometimes be seen as quick and dirty approximations to Bayesian methods.
• Then you may use them.
4. Bayes' theorem and likelihood

Bayes' theorem. Let A be an event and {B_i} a partition of the sample space:

$P(B_i \mid A) = \frac{P(B_i)\, P(A \mid B_i)}{\sum_j P(B_j)\, P(A \mid B_j)}$, i.e. $P(B_i \mid A) \propto P(B_i)\, P(A \mid B_i)$

For random variables Y and X with joint distribution $f_{YX}(y, x) = f_X(x)\, f_Y(y \mid X = x) = f_Y(y)\, f_X(x \mid Y = y)$:

$f_Y(y \mid X = x) = \frac{f_Y(y)\, f_X(x \mid Y = y)}{\int_s f_Y(s)\, f_X(x \mid Y = s)\, ds}$, i.e. $f_Y(y \mid X = x) \propto f_Y(y)\, f_X(x \mid Y = y)$

For a parameter $\Theta$:

$f_\Theta(\theta \mid X = x) \propto f_\Theta(\theta)\, f_X(x \mid \Theta = \theta) = f_\Theta(\theta)\, L(\theta, x)$
Likelihood
• You have heard about the likelihood. What does it mean?
• Does it in some way reflect how likely different values are?
• The maximum likelihood is often a good estimate, but otherwise?
• Are values with smaller likelihood less likely than those with higher likelihood values?

Likelihood
• Yes and no.
• The probability also depends on what else you know.
• But given that you have a uniform prior reflecting "no prior information" (sometimes called a "vague prior"), the likelihood describes how likely different parameter values are.
• If you have another prior, the probability increases most where the likelihood is high.
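A small grid illustration of this point (Python, added; the data and the informative prior are assumed): with a flat prior the posterior is exactly the normalized likelihood, while an informative prior tilts the posterior toward its own mass.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)   # grid of parameter values
k, n = 7, 10                          # binomial data: 7 successes in 10 trials
lik = theta**k * (1 - theta)**(n - k)

flat = np.ones_like(theta)            # "vague" uniform prior
skew = (1 - theta) ** 3               # prior favouring small theta (assumed)

post_flat = flat * lik / (flat * lik).sum()
post_skew = skew * lik / (skew * lik).sum()

print(np.allclose(post_flat, lik / lik.sum()))  # True: posterior = normalized likelihood
print(theta[post_flat.argmax()], theta[post_skew.argmax()])
# flat prior: mode at 0.70 (the MLE); skew prior: mode pulled down to 0.54
```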
5. Binomial and Normal distributions

Binomial
• Likelihood $\propto p^k (1-p)^{n-k}$
• Possible prior: the Beta distribution
  $f(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$
• See Excel sheet!
• This is a common prior, but there are many others which are possible.
Binomial
• Posterior: Beta distributed with parameters α+k and β+n-k.
• Proof: the posterior is proportional to prior × likelihood, $\theta^{\alpha-1}(1-\theta)^{\beta-1} \cdot \theta^k (1-\theta)^{n-k} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$, which is the kernel of a Beta(α+k, β+n-k) density.
• The prior has a weight corresponding to (α+β)/(n+α+β).
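A minimal sketch of this Beta-binomial update (Python with scipy, added; the prior Beta(2, 2) and the data are assumed for illustration):

```python
from scipy.stats import beta

def beta_binomial_update(a: float, b: float, k: int, n: int):
    """Posterior Beta parameters after k 'successes' in n trials."""
    return a + k, b + n - k

a0, b0 = 2.0, 2.0                    # assumed prior Beta(2, 2)
a1, b1 = beta_binomial_update(a0, b0, k=7, n=10)

print(beta.mean(a1, b1))             # posterior mean (a0+k)/(a0+b0+n) = 9/14
print(beta.interval(0.95, a1, b1))   # central 95% posterior interval
```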
Example
• A manufacturer gets material in large shipments. He knows from previous experience that the shipments vary in quality. The proportion of units with quality under the specified level is around 15 percent, with a standard deviation of 10 percentage points.
• He tests some units in a new shipment and wants to assess the quality of the shipment.
• He selects a prior with the known properties:
  – Beta mean: α/(α+β) = 0.15
  – Beta variance: αβ/((α+β)²(α+β+1)) = 0.01
• Solving for the parameters gives a prior Beta distribution with α = 1.763 and β = 9.99.
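Solving the two moment equations in code (Python sketch, added): with mean m and variance v, α + β = m(1-m)/v − 1, from which α and β follow.

```python
def beta_from_moments(m: float, v: float):
    """Beta(alpha, beta) parameters matching a given mean and variance."""
    common = m * (1 - m) / v - 1        # equals alpha + beta
    return m * common, (1 - m) * common

alpha, beta_ = beta_from_moments(0.15, 0.10**2)
print(alpha, beta_)   # about 1.763 and 9.99, as on the slide
```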
• He first tests 10 units and finds 2 under specification.
  – Here the prior information and the data carry roughly equal weight (α+β ≈ 11.75 vs n = 10).
• In the following 90 units, 25 are under the specification.
  – Here the data dominate, and the likelihood and posterior are almost equal.
  – It can be seen that the posterior distribution in this case approaches normality.
• Note that you could look at the data during the study and stop testing when the estimate or the estimated precision is good enough.
• (The stopping point may depend on the outcome. Very few units under the specified limits may mean that you can stop early, since you are certain that the shipment has good quality. Many units under the limits may also lead to a quick stop.)
• This is not allowed in standard statistics, where it is important to specify the sampling plan in advance. In Bayesian theory you only have to specify your prior in advance.
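A sketch of the two updates (Python, added), which also illustrates the Rule 1 corollary: updating in two steps gives the same posterior as one combined update.

```python
a0, b0 = 1.763, 9.99             # the manufacturer's prior from the slide

a1, b1 = a0 + 2, b0 + 10 - 2     # after 2 bad units among the first 10
a2, b2 = a1 + 25, b1 + 90 - 25   # after 25 more bad units among the next 90

print(a2, b2)                    # Beta(28.763, 82.99)
print(a0 + 27, b0 + 100 - 27)    # identical: one update with 27 bad in 100

print(a2 / (a2 + b2))            # posterior mean, about 0.257
```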
• What is the probability that the next unit will be under the specified limits?
• The expected value of the posterior is
  $\frac{\alpha + k}{\alpha + \beta + n}$
• And the probability that the following two units both are under the specified value is
  $\frac{\alpha + k}{\alpha + \beta + n} \cdot \frac{\alpha + k + 1}{\alpha + \beta + n + 1}$
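The same predictive probabilities in code (Python sketch, added; the posterior parameters are those from the updates above):

```python
def next_m_all_bad(a: float, b: float, m: int) -> float:
    """P(next m units are all 'bad') under a Beta(a, b) posterior."""
    p = 1.0
    for i in range(m):
        p *= (a + i) / (a + b + i)   # (a+k)/(a+b+n), then (a+k+1)/(a+b+n+1), ...
    return p

a_post, b_post = 28.763, 82.99
print(next_m_all_bad(a_post, b_post, 1))   # about 0.257
print(next_m_all_bad(a_post, b_post, 2))   # about 0.068
```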
Normal distribution with known variance
• $x_1, x_2, \dots$ are i.i.d. $N(\mu, \sigma^2)$
• Prior: $\mu \sim N(\mu_0, \tau_0^2)$
• $\bar{x} \sim N(\mu, \sigma^2/n)$
• Posterior: normal with mean and variance
  $\mu_n = \frac{\mu_0/\tau_0^2 + n\bar{x}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}, \qquad \tau_n^2 = \frac{1}{1/\tau_0^2 + n/\sigma^2}$
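A minimal sketch of this normal-normal update in precision form (Python, added; the data and prior values are assumed for illustration):

```python
import numpy as np

def normal_posterior(mu0, tau0_sq, xbar, sigma_sq, n):
    """Posterior mean and variance for a normal mean with known variance."""
    prec_prior = 1.0 / tau0_sq          # prior precision
    prec_data = n / sigma_sq            # data precision
    var_post = 1.0 / (prec_prior + prec_data)
    mean_post = var_post * (prec_prior * mu0 + prec_data * xbar)
    return mean_post, var_post

x = np.array([4.8, 5.3, 5.1, 4.9, 5.4])   # assumed data
m, v = normal_posterior(mu0=4.0, tau0_sq=1.0, xbar=x.mean(),
                        sigma_sq=0.25, n=len(x))
print(m, v)   # mean pulled from the prior 4.0 toward the sample mean 5.1
```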
Precision
• Precision is sometimes defined as 1/variance.
• The posterior mean is a weighted average of the prior mean and the data mean, with weights proportional to their precisions.
• And the posterior precision is the sum of the prior precision and the data precision.
A vague prior
• If you do not know anything in advance, the prior variance τ0² is large.
• Then the precision τ0⁻² is small, and so is the weight of the prior.
• Sometimes one lets the prior precision tend to 0, i.e. $N(\mu_0, \infty)$.
• This corresponds to a prior that is uniform on the real line, which is mathematically impossible (since the area under the curve cannot be made 1). But it is possible to do many of the usual calculations anyway.
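Letting the prior variance grow in the normal-normal sketch above (Python, added; same assumed numbers) shows the vague-prior limit: the posterior mean tends to the sample mean and the posterior variance to σ²/n.

```python
# Vague-prior limit for the normal-mean example (xbar=5.1, sigma^2=0.25, n=5)
for tau0_sq in [1.0, 100.0, 1e6]:
    prec_prior, prec_data = 1.0 / tau0_sq, 5 / 0.25
    v = 1.0 / (prec_prior + prec_data)
    print(v * (prec_prior * 4.0 + prec_data * 5.1), v)
# mean -> 5.1 and variance -> 0.05 = sigma^2/n as the prior precision -> 0
```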
Thank you for your attention!