Transcript of slide 1

Statistical Methods
Bayesian methods 2
Daniel Thorburn
Stockholm University
2012-03-29
Outline
6. Probability assessment; exercise
7. Conjugate distributions
8. Vague and other priors
9. Inference - point estimates, decisions and intervals
6. Probability assessment
Exercise in probability assessment
• Probability assessment is difficult and training is needed.
• The results from last week's handout were not impressive.
• One observation is that many of you seem to use too small or too large probabilities. Never use 0 or 1 if you are not absolutely certain.
Results of your assessments
[Figure: probability of the outcomes under your assessments and independence (upper dot is total ignorance)]
Here the likelihood is used. 6 out of 13 performed worse than saying 50% on everything.
Jiayun Jin was best.
This approach is very sensitive to assigning probability 0 to events which occurred (or probability 1 to events which did not).
As a curiosity for those in favour of likelihood inference: Jin is the "ML-estimate" of the best probability assessor. An interval based on the deviance (-2 log-likelihood > 3.84) says that all with probabilities below 0.015 can be rejected at the 95% level.
According to this test none of you is significantly better than chance.
Results of your assessments
Another picture of the fit of your assessments, where the measure is mean square error. It is more forgiving to those who assess extreme probabilities.
Still, Jiayun Jin is best, and only 5 out of 13 assessed probabilities worse than total ignorance.
[Figure: total squared error of your assessments (upper dot is total ignorance)]
7. Conjugate Distributions
Recall Bayes' theorem
• What you know afterwards: $f_Q(q \mid X = x)$
• $\propto$ what you knew before: $f_Q(q)$
• $\times$ the information in the data: $f_X(x \mid Q = q) = L(x, q)$
• or, on the log scale, $\ln f_Q(q \mid X = x) = \ln f_Q(q) + \ln L(x, q) + \mathrm{const}$
• Compare the likelihood principle, which says that all the information in an experiment is contained in the likelihood.
Conjugate distributions
• In some cases there exist simple priors for which the calculations are very simple to perform.
• The posteriors corresponding to those priors belong to the same family of distributions, except for a change in the parameters reflecting the new information.
• This was more important previously, but nowadays, with good computer packages, one can also use more complicated prior distributions.
Binomial
• Binomial distribution
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
• n and k are sufficient. View this as a distribution in p, with parameters a = k + 1 and b = n - k + 1:
$$f_P(p) = C\, p^{a-1} (1 - p)^{b-1}, \qquad C = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}$$
• This will work as a prior distribution, and the number of parameters will be two (a small numeric sketch of the corresponding update follows below).
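A minimal numeric sketch of the Beta-binomial update, with hypothetical prior and data (SciPy is used only for the Beta quantiles):

```python
# Sketch, not from the slides: Beta prior + binomial data -> Beta posterior.
from scipy.stats import beta

a0, b0 = 2, 2      # hypothetical prior Beta(a, b)
n, k = 20, 7       # hypothetical data: k successes in n trials

a1, b1 = a0 + k, b0 + (n - k)      # posterior is Beta(a + k, b + n - k)

print("posterior mean:", a1 / (a1 + b1))                         # (a + k) / (a + b + n)
print("95% probability interval:", beta.ppf([0.025, 0.975], a1, b1))
```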
Conjugate priors
• The Beta distribution is called the conjugate distribution to the binomial.
• The same prior will be conjugate to all Bernoulli-based distributions (sampling schemes), e.g. the Geometric or the Negative binomial.
• Conjugate distributions are simple to work with, since it is possible to describe the posterior with only a few parameters.
• In (almost) all cases where finite-dimensional sufficient statistics exist, there also exist conjugate distributions (e.g. for the exponential family).
• Remark: the posterior will be the same regardless of the sampling plan. It does not matter whether you decided to toss a coin n times and got k positive, or decided to toss the coin until you had observed k positive.
– (A Neyman-Pearson person would make the inference differently. An unbiased estimate is k/n in the first case and (k-1)/(n-1) in the second.)
– In Bayesian analysis your result depends only on what you observed, and not on what you could have observed but did not.
Poisson
• Poisson distribution
$$P(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda)$$
Likelihood after n observations:
$$\frac{\lambda^{\sum x_i}}{\prod x_i!} \exp(-n\lambda)$$
• Sufficient statistics are n and $\sum x_i$. Normalising and putting $p = \sum x_i + 1$ and $b = n$ gives the Gamma distribution
$$f_\Lambda(x) = \frac{b^p x^{p-1} \exp(-xb)}{\Gamma(p)}$$
• The conjugate distribution of the Poisson is thus the Gamma distribution.
Example
• You are studying a homogeneous portfolio of car insurances. You want to study the risk per 1 000 km of having an accident.
• Your model is that the number of claims per car, $y_i$, is Po($\lambda x_i$), where $x_i$ is the distance driven by the car (in Mm, i.e. thousands of km).
• Your prior on $\lambda$ has been based on the experience from previous years of this and other car brands. It is Gamma(50, 5 000) (the prior mean is p/b = 50/5 000 = 0.01; the relative standard deviation is $(1/p)^{0.5} = 0.14$).
• Your posterior after observing the total number of accidents $\sum y_i$ for all cars is Gamma(50 + $\sum y_i$, 5 000 + $\sum x_i$).
• This year the company has had 260 cars of this brand insured, with 54 claims and a total driven distance of 4 800 Mm. The posterior is then Gamma(104, 9 800).
• The posterior mean for $\lambda$ is 104/9 800 = 0.0106 and the relative s.d. is $(1/104)^{0.5} = 0.098$.
• A 95% credibility interval is thus 0.0106 +/- 1.96*0.0106*0.098 = 0.0106 +/- 0.0020. (A Gamma with p = 104 is approximately normal; see the sketch below.)
• Credibility interval is the term for this type of interval in actuarial science.
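A short sketch reproducing these numbers (SciPy's gamma takes the shape p and the scale 1/b):

```python
# Sketch of the insurance example: Gamma prior for the claim rate per Mm.
from scipy.stats import gamma

p0, b0 = 50, 5000          # prior Gamma(p, b)
sum_y, sum_x = 54, 4800    # observed claims and total driven distance (Mm)

p1, b1 = p0 + sum_y, b0 + sum_x        # posterior Gamma(104, 9800)
post = gamma(a=p1, scale=1 / b1)

mean = post.mean()                                    # 104 / 9800 ≈ 0.0106
print("posterior mean:", mean)
print("relative s.d.:", post.std() / mean)            # 1 / sqrt(104) ≈ 0.098
print("normal approx. 95% interval:", (mean - 1.96 * post.std(), mean + 1.96 * post.std()))
print("exact 95% interval:", post.ppf([0.025, 0.975]))
```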
Gamma (χ²)
• Density
$$f(x) = \frac{b^p x^{p-1} \exp(-xb)}{\Gamma(p)}$$
(with b = 1/2 and p = d.f./2 this is the χ² density).
• Apart from normalising constants the density is symmetric in x and b, i.e. the conjugate distribution for b is a Gamma with parameters p + 1 and x.
• I.e. $\sigma^{-2}$ is Gamma distributed; the conjugate distribution for $\sigma^2$ is therefore said to follow an inverse Gamma distribution.
• Cf. $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$ given $\sigma$, and $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$ given $s$.
Normal with unknown variance
• Combine the above: the conjugate prior/posterior for the variance is inverse Gamma, and the prior for the mean given the variance is normal.
• This also means that the mean is unconditionally t-distributed.
Some conjugate distributions
• Binomial(n, p): p is Beta(a + k, b + n - k)
• Negative Binomial, Geometric: p is Beta(a + k, b + X - k)
• Poisson(λ): λ is Gamma(p + Σx, b + n)
• Exponential(λ): λ is Gamma(p + n, b + Σx)
• Normal(m, σ²), σ² known, prior m ~ Normal(μ, τ²): m is Normal((μσ² + τ²Σx)/(σ² + nτ²), τ²σ²/(σ² + nτ²))
• Normal(0, σ²): σ² is Inverse Gamma(p + n/2, b + Σx²/2)
• Normal(m, σ²): σ² is Inverse Gamma, m given σ² is normal, and unconditionally m is t
• Uniform(0, b): b is Pareto(max(a, xi), p + n)
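A sketch of the Normal-mean line of this list, with hypothetical numbers (known data variance σ², prior Normal(μ, τ²) for the mean):

```python
# Sketch: posterior for a normal mean with known variance.
import numpy as np

sigma2 = 4.0                          # assumed known data variance
mu0, tau2 = 0.0, 9.0                  # hypothetical prior Normal(mu0, tau^2) for the mean
x = np.array([1.8, 2.3, 2.9, 1.2])    # hypothetical observations
n = len(x)

post_var = 1.0 / (1.0 / tau2 + n / sigma2)              # = tau^2 * sigma^2 / (sigma^2 + n * tau^2)
post_mean = post_var * (mu0 / tau2 + x.sum() / sigma2)  # = (mu0 * sigma^2 + tau^2 * sum(x)) / (sigma^2 + n * tau^2)

print("posterior mean:", post_mean)   # the sample mean shrunk towards mu0
print("posterior s.d.:", post_var ** 0.5)
```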
8. Vague and other priors
Vague – uninformative – priors
• If you do not want your prior to affect the result, you may use uninformative priors.
• As you saw above, one of the parameters of the prior corresponds to a number of observations. If we decrease that parameter as far as possible we get what are called uninformative prior distributions. These are not always true distributions but can be handled as limits. (If they are not proper they are sometimes called "improper".)
• N(m, K) behaves like Uniform(-K, +K) for large K; f(x) ∝ 1
• Gamma(0, b); f(x) ∝ 1/x, or ln(x) is U(-K, K)
• Beta(0, 0); f(x) ∝ 1/(x(1-x)), or logit(x) is U(-K, K)
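A small numeric sketch (hypothetical data) of the Beta(0, 0) limit: as the prior parameters shrink, the posterior mean approaches the ML-estimate k/n:

```python
# Sketch: the posterior forgets a symmetric Beta(a, a) prior as a -> 0.
n, k = 20, 7                            # hypothetical data
for a in (5.0, 1.0, 0.1, 0.001):
    post_mean = (a + k) / (2 * a + n)   # mean of the Beta(a + k, a + n - k) posterior
    print(f"Beta({a}, {a}) prior -> posterior mean {post_mean:.4f}")
# As a -> 0 the posterior mean tends to k/n = 0.35.
```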
Statistical reporting
• Priors are your own and personal.
• Readers of scientific articles may have other opinions.
• When you report the result of an experiment, report the data so that all readers can plug it in together with their own opinions, i.e. report the likelihood function.
• This is sometimes technical, and it may be easier to report a posterior given an uninformative prior, which is often easier to understand.
• The ML-estimate is the mode of this posterior; if you believe that modes are a good way to describe a distribution, use it. Otherwise the mean or the median are often better.
• The observed Fisher information is one way of describing the spread (minus the second derivative of the log-likelihood function at the mode), but you may also use the standard deviation of the posterior.
• If the posterior (likelihood) is approximately normal, everything is equivalent.
• Other "reference priors" than an uninformative one are sometimes used.
Other priors
• Using a computer, it is nowadays often quite easy to handle more complicated priors.
• It is often sensible to safeguard against misspecifying the prior, e.g. by using a mixture of two priors.
• Previous experience may say one thing, e.g. Beta(a, b), but if this case is unique, prior experience is useless and a vague prior may be a good choice.
• Use a mixed distribution, e.g. Beta(a, b) with probability 0.99 and Beta(0, 0) with probability 0.01 (see the Excel sheet and the sketch below).
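A sketch of the mixture-prior update with hypothetical numbers; the improper Beta(0, 0) component is replaced by a proper but very vague Beta(0.01, 0.01) so that the mixture weights can be updated:

```python
# Sketch: posterior of a two-component Beta mixture prior for a binomial proportion.
import numpy as np
from scipy.special import betaln

n, k = 50, 49                            # hypothetical data, far from the informative prior
components = [((20.0, 20.0), 0.99),      # informative prior Beta(20, 20), weight 0.99
              ((0.01, 0.01), 0.01)]      # near-vague stand-in for Beta(0, 0), weight 0.01

log_w, posteriors = [], []
for (a, b), w in components:
    # marginal likelihood of the data under this component (the binomial coefficient cancels)
    log_ml = betaln(a + k, b + n - k) - betaln(a, b)
    log_w.append(np.log(w) + log_ml)
    posteriors.append((a + k, b + n - k))

log_w = np.array(log_w)
weights = np.exp(log_w - np.logaddexp.reduce(log_w))   # normalised posterior weights
for (a1, b1), w1 in zip(posteriors, weights):
    print(f"Beta({a1:.2f}, {b1:.2f}) with posterior weight {w1:.3f}")
# Because the data contradict the informative component, the vague component dominates the posterior.
```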
• You may see from the Excel sheet that the posterior for large values of n is close to normal.
• In fact the posterior distribution tends to a normal under very weak assumptions (it is enough that the density for q is twice continuously differentiable on a set of probability 1), with variance given by the inverse of the observed Fisher information.
• The posterior converges to the same distribution regardless of the prior information. When the data dominate, everyone agrees on the posterior.
9. Inference - Point estimates, decisions and intervals
Estimates
• Suppose we have a posterior distribution f(q).
• If we may give only one value, what should we do?
• Use what you know from descriptive statistics about how to describe a distribution of values with a single value:
– Mean: usually the best choice. The posterior mean has the smallest mean square error.
– Median: smallest mean absolute error.
– Mode: the most typical value (the ML-estimate corresponds to the mode under an uninformative prior); smallest 0-1 loss (allowing an error of at most e, where e is small). Least common in descriptive statistics.
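A sketch of the three summaries for a hypothetical Beta posterior:

```python
# Sketch: point estimates from a (hypothetical) Beta(8, 22) posterior.
from scipy.stats import beta

a, b = 8, 22
post = beta(a, b)

print("mean:  ", post.mean())               # minimises mean square error
print("median:", post.median())             # minimises mean absolute error
print("mode:  ", (a - 1) / (a + b - 2))     # most probable value (valid for a, b > 1)
```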
Decision theory approach
$$\text{minimise over } d: \quad \int L(q, d)\, f(q)\, dq$$
Squared error loss:
$$\text{minimise } \int (q - d)^2 f(q)\, dq; \quad \text{derivative: } 2\int (d - q) f(q)\, dq; \quad \text{which gives } d = \int q f(q)\, dq = E(q)$$
Absolute error loss:
$$\text{minimise } \int |q - d|\, f(q)\, dq = \int_{-\infty}^{d} (d - q) f(q)\, dq + \int_{d}^{\infty} (q - d) f(q)\, dq$$
$$\text{derivative: } \int_{-\infty}^{d} f(q)\, dq - \int_{d}^{\infty} f(q)\, dq; \quad \text{which gives } d = \text{median}$$
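A quick numeric check of these two results, minimising the expected losses over a grid of decisions d for a hypothetical skewed posterior:

```python
# Sketch: the posterior mean minimises squared loss, the posterior median absolute loss.
import numpy as np
from scipy.stats import gamma

post = gamma(a=3, scale=2)                # hypothetical skewed posterior f(q)
q = np.linspace(0, 40, 8001)              # integration grid (the tail beyond 40 is negligible)
fq, dq = post.pdf(q), q[1] - q[0]
d_grid = np.linspace(0, 12, 1201)         # candidate decisions d

sq_loss = [np.sum((q - d) ** 2 * fq) * dq for d in d_grid]
abs_loss = [np.sum(np.abs(q - d) * fq) * dq for d in d_grid]

print("argmin squared loss :", d_grid[np.argmin(sq_loss)], " vs mean  ", post.mean())
print("argmin absolute loss:", d_grid[np.argmin(abs_loss)], " vs median", post.median())
```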
Decision problem
• The demand Q for a product is unknown but modelled by the distribution N(100, 50²).
• If a shop orders d units, the net profit will be 10·min(Q, d) - 5·(d - min(Q, d)).
• How much should be ordered?
• The expected profit is
$$\int_{-\infty}^{d} \bigl(10q - 5(d - q)\bigr) f(q)\, dq + \int_{d}^{\infty} 10d\, f(q)\, dq$$
• Its derivative with respect to d is
$$10d\, f(d) - 5\int_{-\infty}^{d} f(q)\, dq - 10d\, f(d) + 10\int_{d}^{\infty} f(q)\, dq = 10\bigl(1 - F(d)\bigr) - 5F(d)$$
• Thus he should order the amount corresponding to the 66 2/3 percentile, i.e. 121.5 units in this case.
• The expected profit is easily found to be 727.4 (by doing the integral above).
• The expected profit if he knew the demand would be 10*100 = 1000.
• The expected value of perfect information (EVPI) is 1000 - 727.4 = 272.6.
• If he could pay 100 for a market research study giving an estimate of Q with a standard deviation of 25, should he do so?
• Combining this with what he already knows, his posterior would have the variance 1/(1/2500 + 1/625) = 500 = 22.36².
• His expected profit will thus be 1000 - 100 - (22.36/50)*272.6 = 778.1.
• He should thus order the market research, and his expected profit will increase by 50.7.
• (Check whether you can fill in the details of this example; a sketch follows below. Could you have solved it using classical methods?)
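A sketch that reproduces the numbers in this example (normal demand; gain 10 per sold unit, loss 5 per unsold unit):

```python
# Sketch of the order-quantity problem with N(100, 50^2) demand.
from scipy.stats import norm

mu, sigma = 100.0, 50.0
gain, loss = 10.0, 5.0

def expected_profit(d, mu, sigma):
    """E[gain*min(Q, d) - loss*(d - Q)^+] for Q ~ N(mu, sigma^2)."""
    z = (d - mu) / sigma
    e_min = mu * norm.cdf(z) - sigma * norm.pdf(z) + d * (1 - norm.cdf(z))
    e_over = (d - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return gain * e_min - loss * e_over

d_opt = mu + sigma * norm.ppf(gain / (gain + loss))   # F(d) = 2/3  ->  d ≈ 121.5
best = expected_profit(d_opt, mu, sigma)              # ≈ 727.4
evpi = gain * mu - best                               # ≈ 272.6
print(d_opt, best, evpi)

# Market research (cost 100, measurement s.d. 25): posterior s.d. ≈ 22.36.
# The expected loss relative to perfect information is proportional to the s.d. of the demand.
post_sd = (1 / (1 / sigma**2 + 1 / 25.0**2)) ** 0.5
print(gain * mu - 100 - (post_sd / sigma) * evpi)     # ≈ 778.1
```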
• Let us now instead suppose that the size of the market research is not fixed. A study of size n gives a standard deviation of 200/√n. The cost of such a market study is 36 + n. (The previous study corresponds to n = 64.)
• Doing the same calculations for different sizes and maximising the profit gives the picture on the next page. He should settle for a study of size n = 51, giving a total expected profit of almost 780.
• This illustrates the importance of statistics having an interface with decision theory. What would you have done if the manager had come to you with this question and you had been confined to classical statistics?
[Figure: expected total profit (roughly 760-782) as a function of the study size n (0 to 90)]
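A sketch of the optimisation over the study size n (measurement s.d. 200/√n, cost 36 + n), reusing the fact that the expected loss relative to perfect information is proportional to the posterior standard deviation of the demand:

```python
# Sketch: choosing the size n of the market study.
from scipy.stats import norm

mu, sigma = 100.0, 50.0
loss_per_sd = 15 * norm.pdf(norm.ppf(2 / 3))   # expected loss per unit of demand s.d. when ordering optimally (≈ 5.45)

def total_profit(n):
    post_sd = (1 / (1 / sigma**2 + n / 200.0**2)) ** 0.5   # posterior s.d. of demand after a study of size n
    return 10 * mu - (36 + n) - loss_per_sd * post_sd      # expected profit net of the study cost

best_n = max(range(1, 201), key=total_profit)
print(best_n, round(total_profit(best_n), 1))              # n = 51, total expected profit ≈ 779.7
```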
Confidence intervals
• A confidence interval (an interval constructed in the classical way) will in the long run cover the true value in a proportion 1-α of all cases, if the procedure is repeated many, many times.
• Like a person throwing rings around a peg: if he is skilful he will get the ring around the peg in 95% of all cases.
• Probability intervals: the true value lies with probability 1-α in the interval in this particular case (given what is known).
• Synonyms (roughly): credibility intervals, prediction intervals.
Probability intervals
• An interval (a, b) such that
$$1 - \alpha = \int_{a}^{b} f(q \mid X = x)\, dq$$
• HPD-interval: the shortest possible such interval, characterised by
$$f(a) = f(b) \le f(x) \text{ for } a \le x \le b, \qquad f(a) \ge f(x) \text{ for } x < a \text{ or } x > b$$
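A numeric sketch of an HPD interval for a hypothetical skewed Beta posterior: among all intervals with probability 1 - α, take the shortest one (for a unimodal posterior this is the HPD interval):

```python
# Sketch: HPD interval as the shortest interval with probability 1 - alpha.
import numpy as np
from scipy.stats import beta

post = beta(3, 12)                         # hypothetical skewed posterior
alpha = 0.05

lower_tail = np.linspace(0, alpha, 2001)   # probability mass placed below the interval
a = post.ppf(lower_tail)
b = post.ppf(lower_tail + (1 - alpha))
i = np.argmin(b - a)

print("HPD interval:         ", (a[i], b[i]))
print("equal-tailed interval:", tuple(post.ppf([alpha / 2, 1 - alpha / 2])))
```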