Bild 1 - People of Statistics

Download Report

Transcript Bild 1 - People of Statistics

Statistical Methods
Bayesian methods 3
Daniel Thorburn
Stockholm University
2012-04-03
Slides presented previous time
Rational behaviour – one person
• Axiomatic foundation of probability. Type:
– For any two events A and B exactly one of the following must hold A <
B, A > B or A v B (pronounce A as less likely than B, B less likely than
A, equally likely)
– If A1, A2, B1 and B2 are four events such that A1A2 = B1B2 is empty and
A1 > B1 and A2 > B2 then A1 U A2 > B1 U B2. If further either A1 > B1 or
A2 > B2 then A1 U A2 > B1 U B2
– …(see next page)
• If these axioms hold all events can be assigned probabilities, which
obey Kolmogorovs axioms (Villegas, Annals Math Stat, 1964),
• Axioms for behaviour. Type …
– If you prefer A to B, and B to C then you must also prefer A to C
– …
• If you want to behave rationally, then you must behave as if all
events were assigned probabilities (Anscombe and Aumann, Annals
Math Stat, 1963)
• Axioms for probability (these six are enough to prove that a
probability following Kolmogorovs axioms can be defined plus
the definition of conditional probability)
– For any two events A and B exactly one of the following must hold A <
B, A > B or A v B (pronounce A as less likely than B, B less likely than
A, equally likely)
– If A1, A2, B1 and B2 are four events such that A1A2 = B1B2 is empty and
A1 > B1 and A2 > B2 then A1 U A2 > B1 U B2. If further either A1 > B1 or
A2 > B2 then A1 U A2 > B1 U B2
– If A is any event then A > (the impossible (empty) event)
– If Ai is an strictly decreasing sequence of events and B a fixed event
such that Ai > B for all i then V i Ai (the intersection of all Ai) > B
– There exists one random variable which has a uniform distribution
– For any events A, B and D, (A|D) < (B|D) if and only if AD < BD
4
• Further one needs some axioms about comparing outcomes,
(utilities) in order to be able to prove rationality
– For any two outcomes, A and B, one either prefers A to B or B to A or is
indifferent
– If you prefer A to B, and B to C then you must also prefer A to C
– If P1 and P2 are two distributions over outcomes they may be compared
and you are indifferent between A and the distribution with P(A)=1
– Two measurability axioms like
• If A is any outcome and P a distribution then the event that P gives an outcome
preferred two A can be compared to other events (more likely …)
– If P1 is preferred to P2 and A is an event, A > 0, then the game giving
P1 if A occurs is preferred to the game giving P2 under A if the results
under the not-A are the same.
– If you prefer P1 to P and P to P2, then there exists numbers a>0 and
b>0 such that P1 with probability 1-a and P2 with probability a is
preferred to P, which is preferred to P1 with probability b and P2 with
probability 1-b.
5
There is only one type of numbers,
which may be known or unknown.
• Classical inference has a mess of different
types of numbers e.g.
–
–
–
–
–
–
–
–
Parameters
Latent variables like in factor analysis
Random variables
Observations
Independent (explaining) variables
Dependent variables
Constants
a.s.o.
• Superstition!
Outline
10. Inference – intervals and proving
scientific results
11. Probability assessment
12. Linear models
13. Predictive distributions
7
10. Inference – intervals and
proving scientific results
Intervals
• Definiton of confidence intervals:
– An interval constructed in this way will in the long run cover the
true values in 1-a of all cases if it is repeated many many many
times.
– Like a person throwing rings (or horse-shoes) around a peg. If he
is skilful he will get the ring around the peg in 95% of all cases.
It’s a property of the person not a particular throw
• Definition of probability intervals.
– The true value lies with probability 1-a in the interval in this case
(given what is known)
– The interval is fixed and known. The probability statement refers
to the unknown quantity
– Synonyms (roughly): credibility intervals, prediction intervals
Probability intervals
• An interval (a,b) such that
b
1  a   f ( | X  x)d
a
• Synonyms: Prediction intervals, Credibility intervals
• Highest Posterior Density (HPD)-interval
f (a)  f (b)  f ( x) for a  x  b and
f (a)  f ( x) for x  a or f ( x)  b
(under unimodality)
Shortest Possible intervals
• Find two numbers a<b such
which are small for all
positive a and b
L ( a , b,  ) 
c(b  a )  dI (a    b)
b
• Minimise.
• Differentiate with respect to
a and b
• Putting the derivatives equal
to 0 shows that this is a
HPD-interval (if unimodal)
E ( L(a, b, ))  c(b  a)  d  f ( )d
a
 c  df (a);
c  df (b)
Note that this interval may be
empty if the distribution is wide
spread i.e.
f ( x)  d / c for all x
Comment
• One often speaks about HPD-regions, which
are regions defined by f()  c for some
positive value c
• Can in one dimension consist of several
intervals if the distribution is unimodal.
• In several dimensions it is often used. E.g. for
bivariate normal distributions it is an ellips.
”Hypothesis testing”
• One may differentiate between (at least)
two problems
1. Convince yourself
2. Prove the fact – convince all others
• These problems need different
approaches and priors.
– Much of the previous discussion has been
about your own subjective opinion
Prove scientific facts to others
• It is very easy to convince people who
believe in the fact from the beginning.
• It is often fairly simple to convince yourself
even if you are broadminded
• But to prove a scientific fact, you must
convince also all those who have
reasonable doubts.
Prove scientific facts
• P(|X,K) a P(X|,K) *P(|K)
• A person is convinced of a fact when his posterior probability is close to
one for the fact.
• But to prove the fact scientifically this must hold for all reasonable priors
P(|K) including those describing reasonable doubt.
• Even if there is no such person whose prior is K’ this must hold also for
that prior as long as K’ is reasonable.
• I.e. a result is ”proved” if
inf (P(|X,K); K reasonable) > 1 – a for some a.
You must check for many possible priors that might be reasonable. Try priors
that assess small probabilitis to the statement that you intend to prove.
• Reporting: Use vague priors, but also show what the
consequences are for some priors with (un-)
reasonable doubt.
• When you prove something all available data should
be used. Type: Meta analysis. In some fields one
study is usually not enough to convince people
• Designing experiments: Design your experiments so
that you maximise E(inf (P(|X,K); K reasonable) |
KYOU) (if you are convinced).
What is reasonable doubt?
Convincing others
• You have to contemplate what is meant by
reasonable doubt.
• Depends on the importance of the subject.
• It can be just putting very small prior probability
on the fact to be proven
• But you must also try to find the possible flaws in
your theory and designing your experiments to
counterprove them.
Priors with reasonable doubt
• Use priors with reasonable doubt
– In an experiment to prove telepathic effects you could e.g. use priors
like P(logodds ratio = 0) = 0,9999.
• If the logodds ratio is different from 0 it may be modelled as N(0,s2),
where s2 may be moderate or large.
• If the posterior e.g. says that P(p > 0,5) > 0.95, and may consider the fact
as proved. (Roughly this means that you need about 20 successes in a row,
where a random guess has probability ½ to be correct).
– Never use the prior P=1, since you must always be open-minded (only
fundamentalists do so. They will never change their opinion whatever
the evidence).
– In more standard situations you will probably not need quite so
negative priors
11. Probability assessment
Probability assessments
Probability of the outcomes under your assessments and
independence (upper dot is total ignorance)
Your results (correct sun
shined, OMX30 lower,
Sweden beat Canada, No
medal, ?, Oslo more to the
south, Swedish GDP less
than 4 Billions
1,6
1,4
1,2
1
0,8
0,6
0,4
0,2
0
0
0,01
0,02
0,03
0,04
0,05
0,06
0,07
0,08
0,09
Abrar Raza Khan was best
according to both measures.
L = P(pi) = 0.081 resp MSE
= 0.82.
Total square error of your assessments (upper dot is total
ignorance)
The results were not quite
as spread as last time, but
still the average was not
better than chance. (5 resp
6) worse than chance.)
1,6
1,4
1,2
1
0,8
0,6
0,4
0,2
0
0
0,5
1
1,5
2
2,5
3
3,5
There are many studies on how
people assess probability
•
Small familiar risks are underestimated (e.g. risk for smokers to get lung cancer by
smokers, driving accidents by drivers)
•
Small unfamiliar risks are overestimated (health risks of E211 in the food is
assessed higher than of sodium benzoate by people making jam at home or E160 c
more risky than Paprika extract))
•
People confuse the events with the conditions
– E.g. P(the beauty-contest was won a shortsighted librarian) is usually assessed smaller
that P(the beauty contest was won by a young blonde shortsighted librarian with in a
bathing suit)
•
The probability to win the top prizes on lotteries is considered to be overestimated
but many of the studies have not taken into account the value of the pleasure in
having the chance (speculating what to do).
Continuous priors
• There are many suggested methods to find priors in the
literature
• Without assumption on the form.
– Betting (See lecture 1)
– Successively dividing the intervals into two equally probable
events
• Try to find a value such that the probability that the value is above
or below are equal (=½)
• Suppose that the value is above this value. Try to find a value such
the value is above or below are equal (=½) given that it is above
• Suppose that the value is below this value. Try to find a value such
the value is above or below are equal (=½) given that it is below
• Continue dividing the intervals
– a.s.o.
Assessing continuous priors
• Of a special form (usually a conjugate)
– Give the most likely value and how many observations are needed for
the prior and posterior to contain the same information. Identify
parameters
– Give the most likely value and expected standard deviation uncertainty
(This is not adviced to try on persons without some statistical training)
– Give two values. One upper bound that you are pretty certain that the
true value does not exceed (should be surprised if it exceeds) and one
lower bond in the similar way. Say that this corresponds to a 95 %
probability interval and identify the parameters.
– Give three values: Most likely value and two bounds like above. (Identify
parameters of a skew distribution e.g. Weibull)
12. Linear models
Theorem
• Let (X,Y) have a bivariate normal distribution with
means (mx,my) and variances/covariance sxx, syy and
sxy
• Then the distribution of Y given X=x is normal with
mean my,+ (sxy/sxx)(x-mx) and variance syy- s2xy/
sxx.
• Apply to a simple Bayesian model: (m,X) has a
bivariate normal distribution with means (m,m) and
variances/covariance t2, t2+s2 and t2.
• Then the posterior given X=x is normal with mean
m+ t2/(t2+s2)(x-m) = (s2m+t2x)/(t2+s2) and
variance t2-t4/(t2+s2)s4/(t2+s2)
Multivariate version
– Let (X,Y) be a pair of row vectors having multivariate normal
distribution with means (mx,my) and variances/covariance Sxx, Syy
and Sxy
– Then the distribution of Y given X=x is normal with mean my,+
SxyS1xx(x-mx) and variance Syy - SyxSxx1Sxy.
• This can be applied to a Bayesian linear model:
– Let m, be a vector of parameters and with normal prior mean m
and covariance matrix T.
– Let X have a multivariate normal distribution given m with mean
Am and covariance matrix S.
TA’
T
– Then (m,X) is normal with mean
(m,Am) and covariance matrix
AT ATA’+S
– The posterior given X=x is normal with mean
m+ TA’(ATA’S)1(x-Am) and variance T TA’(ATA’S)1AT
Example
• Consider a agricultural study on six piglets
with two explanatory variables which are
strongly colinear. The piglets stem from
three litters.
• Model Yi=a +bx1i+cx2i+hk(i)+ei where all
error terms are iid Normal(0,1) (For
simplicity known and equal variances)
• This is a multilevel study. (Too small to be
realistic. Can easily be analysed by good
multilevel computer packages like MLwin)
Data
y
x1
x2
Litter
32
1
47
A
141
2
53
B
544
6
90
B
528
6
97
C
336
4
76
C
239
3
65
C
Parameters a=47 b=129 c=-3 ABC 0,87 -0,59 -1,2
Identification of arguments
• Prior m=(50,1,0,0,0,0)
• Prior variance S = I
•
• (Large uncertainty)
T=
• Data
• y=(32,141,544,528,336,239)
•
A=
1000
0
0
0
0
0
0
1000
0
0
0
0
0
0
50
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
1
1
47
1
0
0
1
2
53
0
1
0
1
6
90
0
1
0
1
6
97
0
0
1
1
4
76
0
0
1
1
3
65
0
0
1
Results
• Estimates a*=27.6 b1*=123.4 b2*=-2.48
• Covariance
89.9
23.2
-2.45
matrix
23.2
6.24
-0.65
-2.45
-0.65
0.067
• Note that the correlation between the estimates of the
regression coefficients is very close to -1. The
bayesian analysis tells us that we cannot differentiate
between them. Do not worry in the analysis – the
apparatus takes care of that
• We could have done the analysis successively for one
observation at a time. The number of observations
does not have to be larger than the number of
parameters, if the prior is proper.
13. Predictive distributions
Prediction
• Suppose that we intend to study a new piglet from a new
litter with parameters 5 and 82.
• Its expected value given the data is a+
b1*5+b2*82=27.6+123.4*5-2.48*82=441.3 (with better
precision)
• With variance 89.9+6.24*25+0.067*82*82+2*23.2*52*2.45*82-2*0.65*5*82 + 1 + 1 = 2.99 (with better
precision)
• Thus a 95%-prediction interval is 441.3+/-3.4
Predictive distributions
• Parameters do not exist – only observations. Parameters
are only a help in describing their distributions
• So the interesting things is how well we can predict future
observations by our inference
 P( x
new
| , data) f ( | data)d   P( xnew | ) f ( | data)d
• Where the last equality holds if future observations and
the old data are independent given .
• We just saw one example for the piglet data and we have
also seen it in Bernoulli examples
Simulate from posterior and
predictive distributions
• In some cases there is no closed
expression for the posterior distribution. In
most of these cases it is nevertheless
possible to simulate from it.
• We shall illustrate it in a simple situation
(where it is possible to get the posterior)
• (y,x) is N((0,0), 1, 1, r)
Algorithm
•
•
•
•
•
•
•
Starting value anything e.g. x0=0
Take y1 from the conditional distribution of Y1 given that X1=x0.
Y1 = Normalrand(rx0,1-r22);
Take x1 from the conditional distribution of X1 given that Y1=y1.
X1 = Normalrand(ry1,1-r22)
Continue like this. Take yi+1 from the conditional distribution of Yi+1 given that
Xi+1=xi ,
Yi+1 = Normalrand(rxi,1-r22);
Take xi+1
from the conditional distribution of Xi+1 given that Yi+1=yi+1. X1 =
Normalrand(ry1,1-r22)
The pairs of (Yi,Xi) form a Markov chain, which converges in distribution to
N((0,0), 1, 1, r) (convergenc is exponential, but slow for r close to one)
Thus after a ”burn-in” period the pairs will be a sequence of (dependent) random
variables from the correct distribution. (see Excel sheet)
In this way on can construct densities from a multivariate distribution using only
the one-dimensional conditional distributions. Histograms and different
multidemensional plots may be constructed.
This was a simple example of an MCMC-technique. Usually burn-ins of more than
100 are used and the chain is repeated for at least 10 000 times. Special techniques
are used to hurry upp the convergence.
Predictive distributions
• Missing data
– Suppose we have observed 100 x-values but 5 y values are missing
due to chance
• Model Y=a+bx+e where e is normal
– Predict a and b B=100 times using by drawing them from the posterior
distributions.
– For each pair (a,b) find a set of the five missing values using their
distribution given a, b, and the five x-values.
– Fill in the five values. You have now 100 sets with complete data.
–
• You may now do the analysis that you intended to do without losing
information.
– (The estimates should be the mean of the 100 estimates and the
posterior variance is the sum of the estimated variance and the between
prediction variances)