
GEOGG121: Methods
Introduction to Bayesian analysis
Dr. Mathias (Mat) Disney
UCL Geography
Office: 113, Pearson Building
Tel: 7670 0592
Email: [email protected]
www.geog.ucl.ac.uk/~mdisney
Lecture outline
• Intro to Bayes’ Theorem
– Science and scientific thinking
– Probability & Bayes’ Theorem – why is it important?
– Frequentist v. Bayesian
– Background, rationale
– Methods
– Advantages / disadvantages
• Applications:
– Parameter estimation, uncertainty
– Practical – basic Bayesian estimation
Reading and browsing
Bayesian methods, data analysis
• Gauch, H. (2002) Scientific Method in Practice, CUP.
• Sivia, D. S., with Skilling, J. (2008) Data Analysis: A Bayesian Tutorial, 2nd ed., OUP, Oxford.
• Shih and Kochanski (2006) Bayes’ Theorem teaching notes: a very nice short intro to Bayes’ Theorem: http://kochanski.org/gpk/teaching/0401Oxford/Bayes.pdf
Computational
• Press et al. (1992) Numerical Recipes in C, 2nd ed. – see http://apps.nrbook.com/c/index.html
• Flake, G. W. (2000) The Computational Beauty of Nature, MIT Press.
• Gershenfeld, N. (2002) The Nature of Mathematical Modelling, CUP.
• Wainwright, J. and Mulligan, M. (eds) (2004) Environmental Modelling: Finding Simplicity in Complexity, John Wiley and Sons.
Reading and browsing
Papers, articles, links
P-values
• Siegfried, T. (2010) “Odds are, it’s wrong”, Science News, 177(7), http://www.sciencenews.org/view/feature/id/57091/title/Odds_Are,_Its_Wrong
• Ioannidis, J. P. A. (2005) Why most published research findings are false, PLoS Medicine, 2(8), e124.
Bayes
• Hill, R. (2004) Multiple sudden infant deaths – coincidence or beyond coincidence?, Pediatric and Perinatal Epidemiology, 18, 320-326 (http://www.cse.salford.ac.uk/staff/RHill/ppe_5601.pdf)
• http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
• http://yudkowsky.net/rational/bayes
• http://kochanski.org/gpk/teaching/0401Oxford/Bayes.pdf
So how do we do science?
• Carry out experiments?
• Collect observations?
• Test hypotheses (models)?
• Generate “understanding”?
• Objective knowledge??
• Induction? Deduction?
Induction and deduction
• Deduction
– Inference, by reasoning, from general to particular
– E.g. Premises: i) every mammal has a heart; ii)
every horse is a mammal.
– Conclusion: Every horse has a heart.
– Valid if the truth of the premises guarantees the truth of the
conclusion; invalid otherwise.
– Conclusion is either true or false
Induction and deduction
• Induction
– Process of inferring general principles from
observation of particular cases
– E.g. Premise: every horse that has ever been
observed has a heart
– Conclusion: Every horse has a heart.
– Conclusion goes beyond information present, even
implicitly, in premises
– Conclusions have a degree of strength (weak ->
near certain).
Induction and deduction
• Example from Gauch (2003: 219) which we will
return to:
– Q1: Given a fair coin (P(H) = 0.5), what is P that 100
tosses will produce 45 heads and 55 tails?
– Q2: Given that 100 tosses yield 45 heads and 55 tails,
what is the P that it is a fair coin?
• Q1 is deductive: definitive answer – probability
• Q2 is inductive: no definitive answer – statistics
– Oh dear: this is what we usually get in science
Bayes: see Gauch (2003) ch 5
• Informally, the Bayesian Q is:
– “What is the probability (P) that a hypothesis (H) is
true, given the data and any prior knowledge?”
– Weighs different hypotheses (models) in the light of
data
• The frequentist Q is:
– “How reliable is an inference procedure, by virtue of
not rejecting a true hypothesis or accepting a false
hypothesis?”
– Weighs procedures (different sets of data) in the
light of hypothesis
Probability? See S&S (2006) p9
• To Bayes, Laplace, Bernoulli…:
– P represents a ‘degree-of-belief’ or plausibility
– i.e. degree of truth, based on evidence at hand
• BUT this appears to be subjective, so P was
redefined (Fisher, Neyman, Pearson etc.) :
– P is the ‘long-run relative frequency’ with which an event
occurs, given (infinite) repeated expts.
– We can measure frequencies, so P now an objective tool for
dealing with random phenomena
• BUT we do NOT have infinite repeated
expts…?
Bayes’ Theorem
• The “chief rule involved in the process of
learning from experience” (Jefferys, 1983)
• Formally:
$P(H|D) \propto P(D|H) \times P(H)$
• P(H|D) = Posterior i.e. probability of hypothesis
(model) H being true, given data D
• P(D|H) = Likelihood i.e. probability of data D
being observed if H is true
• P(H) = Prior i.e. probability of hypothesis being
true before measurement of D
Bayes: see Gauch (2003) ch 5
• Prior?
– What is known beyond the particular experiment at
hand, which may be substantial or negligible
• We all have priors: assumptions, experience,
other pieces of evidence
• Bayes approach explicitly requires you to
assign a probability to your prior (somehow)
• Bayesian view
– probability as degree of belief rather than a
frequency of occurrence (in the long run…)
Bayes’ Theorem
• Importance? P(H|D) appears on the left of BT
• i.e. BT solves the inverse (inductive) problem –
probability of a hypothesis given some data
• This is how we do science in practice
• We don’t have access to infinite repetitions of
expts (the ‘long run frequency’ view)
Bayes’ Theorem
$P(\mathrm{Hypoth.}\,|\,\mathrm{Data}, I) \propto P(\mathrm{Data}\,|\,\mathrm{Hypoth.}, I) \times P(\mathrm{Hypoth.}\,|\,I)$
• I is background (or conditioning) information as there is
‘no such thing as absolute probability’ (see S & S p 5)
• P(rain today) will depend on clouds this morning, whether
we saw forecast etc. etc. – I is usually left out but ….
• Power of Bayes’ Theorem
– Relates the quantity of interest i.e. P of H being true given D, to
that which we might estimate in practice i.e. P of observing D,
given H is correct
Bayes’ Theorem & marginalisation
• To go from ∝ to = we need to divide by P(D|I):
$$P(H|D,I) = \frac{P(D|H,I) \times P(H|I)}{P(D|I)}$$
• Where P(D|I) is known as the ‘Evidence’
• Normalisation constant which can be left out for parameter
estimation as independent of H
• But is required in model selection for e.g. where data
amount may be critical
Bayes’ Theorem: example
• Suppose a drug test is 99% accurate for true positives, and 99% accurate for true negatives, and that 0.5% of the population use the drug.
• What is the probability that someone who tests positive is in fact a user, i.e. what is P(User|+ve)?
• So
$$P(\mathrm{User}|{+\mathrm{ve}}) = \frac{P({+\mathrm{ve}}|\mathrm{User})\,P(\mathrm{User})}{P({+\mathrm{ve}}|\mathrm{User})\,P(\mathrm{User}) + P({+\mathrm{ve}}|\mathrm{Non\ user})\,P(\mathrm{Non\ user})} = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995} = 0.332$$
• The numerator is the true +ve term; the second term in the denominator is the false +ve term.
• P(D) on the bottom, the evidence, is the sum over all possible models (2 in this case) in the light of the data we observe
http://kochanski.org/gpk/teaching/0401Oxford/Bayes.pdf
http://en.wikipedia.org/wiki/Bayes'_theorem
Bayes’ Theorem: example
• So, for a +ve test, P(User) is only 33%, i.e. there is a 67% chance they are NOT a user
• This is NOT an effective test – why not?
• The number of non-users is very large compared to users (99.5% v. 0.5%)
• So false positives (0.01 × 0.995 = 0.995%) >> true positives (0.99 × 0.005 = 0.495%)
• False positives occur at roughly twice the rate of true positives, hence the 67% v. 33% split
• So we need to be very careful when considering large numbers / small results (a short numerical check is sketched below)
• See Sally Clark example at end….
http://kochanski.org/gpk/teaching/0401Oxford/Bayes.pdf
http://en.wikipedia.org/wiki/Bayes'_theorem
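As a quick check, the arithmetic above can be reproduced in a few lines of Python (the rates and the 0.5% prior are taken straight from the slide):

```python
# Minimal sketch of the drug-test example above (numbers from the slide).
p_user = 0.005               # prior: 0.5% of the population use the drug
p_pos_given_user = 0.99      # true positive rate
p_pos_given_nonuser = 0.01   # false positive rate (1 - 99% true-negative rate)

# Evidence P(+ve): sum over both hypotheses (User, Non-user)
p_pos = p_pos_given_user * p_user + p_pos_given_nonuser * (1.0 - p_user)

# Bayes' Theorem: P(User | +ve)
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(f"P(User | +ve) = {p_user_given_pos:.3f}")   # ~0.332
```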
Eg Laplace and the mass of Saturn
• Laplace (1749-1827) estimated MSaturn from orbital data
• i.e. posterior prob(M|{data},I) where I is background knowledge of
orbital mechanics etc.
• Shaded area under posterior pdf shows degree of belief that m1 ≤
MSaturn < m2 (he was right to within < 0.7%)
• How do we interpret this pdf in terms of frequencies?
– Some ensemble of universes all constant other than MSaturn? Distribution of
MSaturn in repeated experiments?
– But data consist of orbital periods, and these multiple expts. didn’t happen
[Figure: posterior pdf for MSaturn, indicating the best estimate of M and the degree of certainty of M]
• The posterior pdf expresses ALL our best understanding of the problem
Example: is this a fair coin?
Heads I win, tails you lose?
• H? HT? HTTTTHTHHTT?? What do we mean fair?
• Consider range of contiguous propositions (hypotheses)
about range in which coin bias-weighting, H might lie
• If H = 0, double tail; H = 1, double head; H = 0.5 is fair
• E.g. 0.0 ≤ H1 < 0.01; 0.01 ≤ H2 < 0.02; 0.02 ≤ H3 < 0.03 etc.
Example: is this a fair coin?
• If we assign high P to a given H (or range of Hs), relative to
all others, we are confident of estimate of ‘fairness’
• If all H are equally likely, then we are ignorant
• This is summarised by conditional (posterior) pdf
prob(H|{data},I)
• So, we need the prior prob(H|I) – if we know nothing, let’s use a
flat (uniform) prior i.e.
$$\mathrm{prob}(H|I) = \begin{cases} 1 & 0 \le H \le 1 \\ 0 & \mathrm{otherwise} \end{cases}$$
$P(\mathrm{Hypoth.}\,|\,\mathrm{Data}, I) \propto P(\mathrm{Data}\,|\,\mathrm{Hypoth.}, I) \times P(\mathrm{Hypoth.}\,|\,I)$
Example: is this a fair coin?
• Now need likelihood i.e. prob({data}|H,I)
• Measure of chance of obtaining {data} we have actually
observed if bias-weighting H was known
• Assume that each toss is independent event (part of I)
• Then prob(R heads in N tosses) is given by binomial
theorem i.e.
$$\mathrm{prob}(\{\mathrm{data}\}|H,I) \propto H^{R}(1-H)^{N-R}$$
– H is the chance of a head and there are R of them, so there must be N-R tails (each with chance 1-H).
$P(\mathrm{Hypoth.}\,|\,\mathrm{Data}, I) \propto P(\mathrm{Data}\,|\,\mathrm{Hypoth.}, I) \times P(\mathrm{Hypoth.}\,|\,I)$
Example: is this a fair coin?
• How does prob(H|{data},I) evolve?
[Figure: posterior prob(H|{data},I) under the uniform prior after short toss sequences such as TTT, HT, HTTTH, …]
$$\mathrm{prob}(\{\mathrm{data}\}|H,I) \propto H^{R}(1-H)^{N-R} \qquad \mathrm{prob}(H|I) = \begin{cases} 1 & 0 \le H \le 1 \\ 0 & \mathrm{otherwise} \end{cases}$$
Gaussian prior μ = 0.5, σ = 0.05
• How does prob(H|{data},I) evolve?
• H0 (the mean) is not always at the peak, particularly when N is small
[Figure: evolution of prob(H|{data},I) under the Gaussian prior as tosses accumulate]
$$\mathrm{prob}(\{\mathrm{data}\}|H,I) \propto H^{R}(1-H)^{N-R}$$
Summary
• The posterior pdf summarises our knowledge, based on
{data} and prior
– Note: {data} in this case is actually generated by np.random.binomial(N, p) – see the code sketch after this slide
• Weak (uniform) prior is shifted easily:
$$\mathrm{prob}(H|I) = \begin{cases} 1 & 0 \le H \le 1 \\ 0 & \mathrm{otherwise} \end{cases}$$
• Stronger Gaussian prior (rightly) requires a lot more data before we are convinced:
$$\mathrm{prob}(H|I) \propto e^{-(H-\mu)^2/2\sigma^2}$$
• See S & S for other priors….
• Bayes’ Theorem encapsulates the learning process
$P(\mathrm{Hypoth.}\,|\,\mathrm{Data}, I) \propto P(\mathrm{Data}\,|\,\mathrm{Hypoth.}, I) \times P(\mathrm{Hypoth.}\,|\,I)$
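A minimal Python/NumPy sketch of this learning process, using np.random.binomial as noted above; the grid size, the assumed 'true' bias and the number of tosses are illustrative choices, not values from the practical:

```python
# Posterior over the coin bias-weighting H on a grid, for a uniform prior
# and for the Gaussian prior (mu = 0.5, sigma = 0.05) from the slides.
import numpy as np

H = np.linspace(0.0, 1.0, 1001)                 # grid of bias-weightings
uniform_prior = np.ones_like(H)                 # prob(H|I) = 1 for 0 <= H <= 1
gauss_prior = np.exp(-(H - 0.5)**2 / (2 * 0.05**2))

true_H, N = 0.25, 100                           # assumed 'true' bias, no. of tosses
R = np.random.binomial(N, true_H)               # simulated number of heads

likelihood = H**R * (1.0 - H)**(N - R)          # prob({data}|H,I) ∝ H^R (1-H)^(N-R)

for name, prior in [("uniform", uniform_prior), ("Gaussian", gauss_prior)]:
    post = likelihood * prior
    post /= np.trapz(post, H)                   # normalise so the area is 1
    print(name, "prior: posterior peak at H =", H[np.argmax(post)])
```

Rerunning with more tosses shows the Gaussian prior being dragged, reluctantly, towards the true bias.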
Summary
• Takes a lot of coin tosses to estimate H to within 0.2-0.3
• If we toss 10 times and get 10 T, this might be strong
evidence for bias
• But if we toss 100 times and get 45H 55T, difference still
10 BUT much more uncertain
• Gaussian prior: although its value at H = 0.5 is ~250000 times its value at H = 0.25, 1000 tosses
gets the posterior to within 0.02
$P(\mathrm{Hypoth.}\,|\,\mathrm{Data}, I) \propto P(\mathrm{Data}\,|\,\mathrm{Hypoth.}, I) \times P(\mathrm{Hypoth.}\,|\,I)$
Reliability and uncertainty
• Can we summarise PDF prob(H|{data},I) concisely (mean, error)?
• Best estimate X0 of parameter X is given by the conditions
$$\left.\frac{dP}{dX}\right|_{X_0} = 0, \qquad \left.\frac{d^2P}{dX^2}\right|_{X_0} < 0$$
• Also want measure of reliability (spread of pdf around Xo)
• Use a Taylor series expansion:
$$f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \ldots$$
• Use L = loge[prob(H|{data},I)] - varies much more slowly with X
• Expand about X - X0 = 0, so
$$L = L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2 + \ldots$$
• The first term is constant, and the linear (X - X0) term vanishes because we are expanding about a maximum. So, ignoring higher order terms….
Reliability and uncertainty
• We find
$$\mathrm{prob}(X|\{\mathrm{data}\},I) \approx A \exp\left[\frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2\right]$$
• Where A is a normalisation constant. So what is this function??
• It is the pdf of the Gaussian (normal) distribution i.e.
$$\mathrm{prob}(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
• Where μ, σ are the maximum (mean) and width (sd)
• Comparing, we see that μ lies at X0 and
$$\sigma = \left(-\left.\frac{d^2L}{dX^2}\right|_{X_0}\right)^{-1/2}$$
• So X = X0 ± σ
http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
Reliability and uncertainty
• From the coin example, $\mathrm{prob}(H|\{\mathrm{data}\},I) \propto H^{R}(1-H)^{N-R}$
• So $L = \log_e\left[\mathrm{prob}(H|\{\mathrm{data}\},I)\right] = \mathrm{const} + R\log_e(H) + (N-R)\log_e(1-H)$
• Therefore (EXERCISE: verify these expressions yourself)
$$\left.\frac{dL}{dH}\right|_{H_0} = \frac{R}{H_0} - \frac{N-R}{1-H_0} = 0$$
• So H0 = R/N, and then
$$\left.\frac{d^2L}{dH^2}\right|_{H_0} = -\frac{R}{H_0^2} - \frac{N-R}{(1-H_0)^2} = -\frac{N}{H_0(1-H_0)} \qquad\Rightarrow\qquad \sigma = \sqrt{\frac{H_0(1-H_0)}{N}}$$
• H0 tends to a constant, therefore so does H0(1-H0), so σ ∝ 1/√N
• So can express key properties of pdf using Ho and σ
• NB largest uncertainty (σmax ) when Ho = 0.5 i.e. easier to identify
highly-biased coin than to be confident it is fair
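These closed-form results are easy to evaluate; a small helper, illustrated with 45 heads in 100 tosses (the earlier Gauch example):

```python
# Best estimate and uncertainty for the coin bias: H0 = R/N, sigma = sqrt(H0(1-H0)/N).
import numpy as np

def coin_estimate(R, N):
    H0 = R / N
    sigma = np.sqrt(H0 * (1.0 - H0) / N)
    return H0, sigma

print(coin_estimate(45, 100))   # 45 heads in 100 tosses -> (0.45, ~0.0497)
```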
Reliability and uncertainty
• Asymmetric pdf? H0 is still the best estimate
• But the real value is more likely to lie on one side of H0 than the other
• So what does an ‘error bar’ mean then?
• Confidence intervals (CI)
[Figure: pdf of X with the shortest 95% CI shaded]
– shortest interval enclosing X% of the area under the pdf, say 95%
• Assuming the posterior pdf is normalised (total area = 1), then we need X1, X2 such that
$$\mathrm{prob}(X_1 \le X < X_2 \,|\, \{\mathrm{data}\}, I) = \int_{X_1}^{X_2} \mathrm{prob}(X|\{\mathrm{data}\},I)\,dX \approx 0.95$$
• The region X1 ≤ X < X2 is then the shortest 95% CI (see the sketch below)
• For a normalised pdf, the weighted average is given by $\langle X \rangle = \int X\,\mathrm{prob}(X|\{\mathrm{data}\},I)\,dX$
• Multimodal pdf?
– As the pdf gets more complex, single estimates of the mean are not relevant
– Just show the posterior pdf, then you can decide…..
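One way to sketch the shortest CI numerically, assuming a gridded, normalised 1D posterior (e.g. from the coin example earlier); taking grid points in order of decreasing probability until ~95% of the area is enclosed gives the shortest interval for a unimodal pdf:

```python
# Shortest interval enclosing `frac` of the area under a gridded, normalised pdf.
import numpy as np

def shortest_interval(x, pdf, frac=0.95):
    dx = x[1] - x[0]                          # assumes an evenly spaced grid
    order = np.argsort(pdf)[::-1]             # grid points, highest pdf first
    csum = np.cumsum(pdf[order] * dx)         # accumulated probability
    keep = order[:np.searchsorted(csum, frac) + 1]
    return x[keep].min(), x[keep].max()
```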
Parameter estimation: 2 params
• Example: signal in the presence of background noise
• Very common problem: e.g. peak of lidar return from forest canopy?
Presence of a star against a background? Transiting planet?
[Figure: sketch of a signal peak of amplitude A sitting on a flat background B, as a function of position x]
See p 35-60 in Sivia & Skilling
Gaussian peak + background
• Data are e.g. photon counts in a particular channel, so we expect the count in the kth channel, Nk, to be ∝ A(xk) + B(xk), where A, B are the signal and background
• Assume the peak is Gaussian (for now), width w, centred on x0, so the ideal datum Dk is then given by
$$D_k = n_0\left[A\,e^{-(x_k - x_0)^2/2w^2} + B\right]$$
• Where n0 is a constant (integration time). Unlike Nk, Dk is not a whole number, so the actual datum is some integer close to Dk
• The Poisson distribution is the pdf which represents this property i.e.
$$\mathrm{prob}(N|D) = \frac{D^N e^{-D}}{N!}$$
Aside: Poisson distribution
• Poisson: prob. of N events occurring over some fixed time interval if events occur at a known rate, independently of the time of the previous event
• If the expected number over a given interval is D, the prob. of exactly N events is
$$\mathrm{prob}(N|D) = \frac{D^N e^{-D}}{N!}$$
Used in discrete counting experiments, particularly cases with a large number of possible outcomes, each of which is rare (the law of rare events), e.g.
• Nuclear decay
• No. of calls arriving at a call centre per minute – a large number arriving, BUT rare from the POV of the general population….
See practical page for poisson_plot.py
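A minimal stand-in for the idea behind poisson_plot.py (the actual script is on the practical page and may differ): evaluate prob(N|D) in log form to avoid overflowing N! for large counts.

```python
# Poisson pmf prob(N|D) = D^N e^{-D} / N!, evaluated via logs for numerical safety.
import numpy as np
from scipy.special import gammaln

def poisson_pmf(N, D):
    return np.exp(N * np.log(D) - D - gammaln(N + 1))

N = np.arange(0, 30)
print(poisson_pmf(N, D=5.0).sum())   # ~1.0: the probabilities sum to one
```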
Gaussian peak + background
• So the likelihood for datum Nk is
$$\mathrm{prob}(N_k|A,B,I) = \frac{D_k^{N_k}\,e^{-D_k}}{N_k!}$$
• Where I includes the relation between the expected counts Dk and A, B, i.e. for our Gaussian model x0, w, n0 are given (as is xk): $D_k = n_0\left[A\,e^{-(x_k - x_0)^2/2w^2} + B\right]$
• IF the data are independent, then the likelihood over all M data is just the product of the probs of the individual measurements i.e.
$$\mathrm{prob}(\{N_k\}|A,B,I) = \prod_{k=1}^{M} \mathrm{prob}(N_k|A,B,I)$$
• As usual, we want the posterior pdf of A, B given {Nk}, I
$$\mathrm{prob}(A,B|\{N_k\},I) \propto \mathrm{prob}(\{N_k\}|A,B,I) \times \mathrm{prob}(A,B|I)$$
Gaussian peak + background
• Prior? Neither A nor B can be –ve, so the most naïve prior pdf is
$$\mathrm{prob}(A,B|I) = \begin{cases} \mathrm{const} & A \ge 0 \ \mathrm{and}\ B \ge 0 \\ 0 & \mathrm{otherwise} \end{cases}$$
• To calculate the constant we need Amax, Bmax, but we may assume they are large enough not to cut off the posterior pdf, i.e. prob({Nk}|A,B,I) is effectively 0 by then
• So, the log of the posterior is
$$L = \log_e\left[\mathrm{prob}(A,B|\{N_k\},I)\right] = \mathrm{const} + \sum_{k=1}^{M}\left[N_k\log_e(D_k) - D_k\right]$$
• And, as before, we want A, B to maximise L
• Reliability is width of posterior about that point
Gaussian peak + background
• ‘Generate’ experimental data (see practical)
• n0 chosen to give a maximum expectation Dk = 100. Why are some Nk > 100? (see the sketch below)
Gaussian peak + background
• Posterior PDF is now 2D
• Max L A=1.11, B=1.89 (actual 1.09, 1.93)
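A minimal sketch of the 'generate data, map the posterior' steps just described, with illustrative values assumed for x0, w, n0 and the bin layout (not necessarily those of the practical script):

```python
# Simulate Poisson counts for a Gaussian peak on a background, then evaluate the
# log posterior L(A,B) = const + sum_k [Nk ln(Dk) - Dk] on a grid (flat prior, A,B >= 0).
import numpy as np

x0, w = 0.0, 2.0                                   # assumed peak position and width
A_true, B_true = 1.09, 1.93                        # 'actual' values quoted on the slide
n0 = 100.0 / (A_true + B_true)                     # scale so the peak expectation is ~100
xk = np.linspace(-7.5, 7.5, 15)                    # 15 channels (bins)

def expected_counts(A, B):
    return n0 * (A * np.exp(-(xk - x0)**2 / (2 * w**2)) + B)

Nk = np.random.poisson(expected_counts(A_true, B_true))   # 'experimental' data

A = np.linspace(0.01, 3.0, 200)                    # grids over the two parameters
B = np.linspace(0.01, 3.0, 200)
AA, BB = np.meshgrid(A, B, indexing="ij")
Dk = expected_counts(AA[..., None], BB[..., None]) # shape (200, 200, 15)
L = np.sum(Nk * np.log(Dk) - Dk, axis=-1)          # log posterior (up to a constant)

i, j = np.unravel_index(np.argmax(L), L.shape)
print("Max-L estimates: A =", A[i], " B =", B[j])
```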
Gaussian peak + background
• Changing the experimental setup?
– E.g. reducing counts per bin (SNR) e.g. because of
shorter integration time, lower signal threshold etc.
Same signal, but data look much
noisier – broader PDF
Truncated at 0 – prior important
Gaussian peak + background
• Changing the experimental setup?
– Increasing number of bins (same count rate, but spread
out over twice measurement range)
Much narrower posterior PDF
BUT reduction mostly in B
Gaussian peak + background
• More data, so why is the uncertainty in A and B not reduced
equally?
– Data far from origin only tell you about background
– Conversely – restrict range of x over which data are collected
(fewer bins) it is hard to distinguish A from B (signal from noise)
– Skewed & high correlation between A, B
Marginal distribution
• If we are only interested in A then, according to the marginalisation rule, we integrate the joint posterior PDF w.r.t. B i.e.
$$\mathrm{prob}(X|I) = \int_{-\infty}^{\infty} \mathrm{prob}(X,Y|I)\,dY$$
• So
$$\mathrm{prob}(A|\{N_k\},I) = \int_{0}^{\infty} \mathrm{prob}(A,B|\{N_k\},I)\,dB$$
• See previous experimental cases…..
[Figure panels: (1) 15 bins, ~100 counts maximum; (2) 15 bins, ~10 counts maximum]
Marginal distribution
• Marginal $\mathrm{prob}(A|\{N_k\},I) \ne \mathrm{prob}(A|\{N_k\},B,I)$ conditional
• Marginal pdf: takes into account prior ignorance of B
• Conditional pdf: assumes we know B, e.g. via calibration
• Least difference when measurements are made far from A (panel 3)
• Most difference when the data are close to A (panel 4) – see the code sketch below
[Figure panels: (3) 31 bins, ~100 counts maximum; (4) 7 bins, ~100 counts maximum]
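Continuing the earlier grid sketch (reusing its L, A, B and the peak indices i, j), the marginal and conditional pdfs for A might be compared as follows:

```python
# Marginal pdf prob(A|{Nk},I) = ∫ prob(A,B|{Nk},I) dB versus the conditional pdf
# with B fixed at its best value, using the gridded log posterior from before.
post = np.exp(L - L.max())                          # unnormalised posterior on the (A,B) grid
post /= np.trapz(np.trapz(post, B, axis=1), A)      # normalise over both parameters
marg_A = np.trapz(post, B, axis=1)                  # integrate out B
cond_A = post[:, j] / np.trapz(post[:, j], A)       # slice at the best-fit B, renormalised
```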
Note: DO NOT PANIC
• Slides from here are for completeness
• I will NOT ask you about uncertainty in the 2
param or general cases from here on
Uncertainty
• Posterior L shows the reliability of the parameters & we want the optimal values
• For parameters {Xj}, with posterior $P = \mathrm{prob}(\{X_j\}\,|\,\{\mathrm{data}\},I)$
• Max??
• The optimal {Xoj} is the solution of the set of simultaneous equations
$$\left.\frac{\partial P}{\partial X_i}\right|_{\{X_{oj}\}} = 0 \qquad \mathrm{for}\ i = 1, 2, \ldots, N_{\mathrm{params}}$$
• So for the log of P, i.e. $L = \log_e\left[\mathrm{prob}(\{X_j\}\,|\,\{\mathrm{data}\},I)\right]$, and for 2 parameters we want
$$\left.\frac{\partial L}{\partial X}\right|_{X_o,Y_o} = \left.\frac{\partial L}{\partial Y}\right|_{X_o,Y_o} = 0$$
• where
$$L = \log_e\left[\mathrm{prob}(X,Y\,|\,\{\mathrm{data}\},I)\right]$$
Sivia & Skilling (2006) Chapter 3, p 35-51
Uncertainty
• To estimate reliability of best estimate we want spread of P about
(Xo, Yo)
• Do this using a Taylor expansion i.e.
$$f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \ldots$$
• Or
$$f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n$$
• So for the first three terms (to quadratic) we have
$$L = L(X_o,Y_o) + \frac{1}{2}\left[\left.\frac{\partial^2 L}{\partial X^2}\right|_{X_o,Y_o}(X-X_o)^2 + \left.\frac{\partial^2 L}{\partial Y^2}\right|_{X_o,Y_o}(Y-Y_o)^2 + 2\left.\frac{\partial^2 L}{\partial X\,\partial Y}\right|_{X_o,Y_o}(X-X_o)(Y-Y_o)\right] + \ldots$$
• Ignore the (X-Xo) and (Y-Yo) linear terms as the expansion is about a maximum
Sivia & Skilling (2008) Chapter 3, p 35-51
http://en.wikipedia.org/wiki/Taylor_series
Uncertainty
• So we are mainly concerned with the quadratic terms. Rephrase via matrices
• For the quadratic term, Q:
$$Q = \begin{pmatrix} X-X_o & Y-Y_o \end{pmatrix}\begin{pmatrix} A & C \\ C & B \end{pmatrix}\begin{pmatrix} X-X_o \\ Y-Y_o \end{pmatrix}$$
• Where
$$A = \left.\frac{\partial^2 L}{\partial X^2}\right|_{X_o,Y_o}, \qquad B = \left.\frac{\partial^2 L}{\partial Y^2}\right|_{X_o,Y_o}, \qquad C = \left.\frac{\partial^2 L}{\partial X\,\partial Y}\right|_{X_o,Y_o}$$
[Figure: the contour Q = k in the X-Y plane is an ellipse centred on (Xo, Yo), with principal directions e1, e2 and widths l1, l2]
• A contour of Q in the X-Y plane is a line of constant L
• Orientation and eccentricity are determined by A, B, C
• The directions e1 and e2 are the eigenvectors of the matrix of 2nd derivatives (A, B, C) above
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• So the (x,y) components of e1 and e2 are given by the solutions of
$$\begin{pmatrix} A & C \\ C & B \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \lambda\begin{pmatrix} x \\ y \end{pmatrix}$$
• Where the eigenvalues λ1 and λ2 are 1/k² (& k1, k2 are the widths of the ellipse along the principal directions)
• If (Xo, Yo) is a maximum then λ1 and λ2 < 0
• So A < 0, B < 0 and AB > C²
• So if C ≠ 0 then the ellipse is not aligned to the axes, and how do we estimate error bars on Xo, Yo?
• We can get rid of parameters we don’t want (Y for e.g.) by integrating i.e.
$$\mathrm{prob}(X\,|\,\{\mathrm{data}\},I) = \int_{-\infty}^{+\infty} \mathrm{prob}(X,Y\,|\,\{\mathrm{data}\},I)\,dY$$
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• And then use Taylor again, with $\mathrm{prob}(X,Y\,|\,\{\mathrm{data}\},I) = \exp(L) \propto \exp\left(\frac{Q}{2}\right)$
• So (see S&S p 46 & Appendix)
$$\mathrm{prob}(X\,|\,\{\mathrm{data}\},I) \propto \exp\left(\frac{1}{2}\left[\frac{AB-C^2}{B}\right](X-X_o)^2\right)$$
• And so the marginal distn. for X is just a Gaussian with best estimate (mean) Xo and uncertainty (SD)
$$\sigma_X = \sqrt{\frac{-B}{AB-C^2}} \qquad\qquad \sigma_Y = \sqrt{\frac{-A}{AB-C^2}}$$
• So all fine and we can calculate uncertainty……right? (a numerical sketch follows below)
Sivia & Skilling (2008) Chapter 3, p 35-51
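A rough numerical sketch of these expressions, estimating A, B and C by finite differences on the gridded log posterior from the peak + background example (variable names reused from that sketch; a real analysis might differentiate analytically):

```python
# Second derivatives of L at the peak (i, j) by central differences, then the
# marginal error bars sigma_A, sigma_B and the covariance from the formulas above.
dA, dB = A[1] - A[0], B[1] - B[0]
Add = (L[i + 1, j] - 2 * L[i, j] + L[i - 1, j]) / dA**2        # 'A' = d2L/dA2
Bdd = (L[i, j + 1] - 2 * L[i, j] + L[i, j - 1]) / dB**2        # 'B' = d2L/dB2
Cdd = (L[i + 1, j + 1] - L[i + 1, j - 1]
       - L[i - 1, j + 1] + L[i - 1, j - 1]) / (4 * dA * dB)    # 'C' = d2L/dAdB

det = Add * Bdd - Cdd**2
sigma_A = np.sqrt(-Bdd / det)        # sigma_X = sqrt(-B / (AB - C^2))
sigma_B = np.sqrt(-Add / det)        # sigma_Y = sqrt(-A / (AB - C^2))
cov_AB = Cdd / det                   # sigma^2_XY = C / (AB - C^2), used below
print(sigma_A, sigma_B, cov_AB)
```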
Uncertainty
• Note that $AB - C^2$ is the determinant of $\begin{pmatrix} A & C \\ C & B \end{pmatrix}$ and is λ1 × λ2
• So if λ1 or λ2 → 0 then AB - C² → 0 and σX and σY → ∞
• Oh dear……
• So consider the variance of the posterior
$$\mathrm{Var}(X) = \left\langle (X-\mu)^2 \right\rangle = \int (X-\mu)^2\, \mathrm{prob}(X\,|\,\{\mathrm{data}\},I)\,dX$$
• Where μ is the mean X
• For a 1D normal distribution this gives $\left\langle (X-\mu)^2 \right\rangle = \sigma^2$
• For the 2D case (X,Y) here
$$\sigma_X^2 = \left\langle (X-X_o)^2 \right\rangle = \iint (X-X_o)^2\, \mathrm{prob}(X,Y\,|\,\{\mathrm{data}\},I)\,dX\,dY$$
• Which we have from before. Same for Y so…..
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• Consider the covariance σ²XY
$$\sigma_{XY}^2 = \left\langle (X-X_o)(Y-Y_o) \right\rangle = \iint (X-X_o)(Y-Y_o)\, \mathrm{prob}(X,Y\,|\,\{\mathrm{data}\},I)\,dX\,dY$$
• Which describes the correlation between X and Y; if the estimate of X has little/no effect on the estimate of Y then $\sigma_{XY}^2 \ll \sigma_X^2\sigma_Y^2$
• And, using Taylor as before,
$$\sigma_{XY}^2 = \frac{C}{AB-C^2}$$
• So in matrix notation
$$\begin{pmatrix} \sigma_X^2 & \sigma_{XY}^2 \\ \sigma_{XY}^2 & \sigma_Y^2 \end{pmatrix} = \frac{1}{AB-C^2}\begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -\begin{pmatrix} A & C \\ C & B \end{pmatrix}^{-1}$$
• Where we remember that
$$A = \left.\frac{\partial^2 L}{\partial X^2}\right|_{X_o,Y_o}, \qquad B = \left.\frac{\partial^2 L}{\partial Y^2}\right|_{X_o,Y_o}, \qquad C = \left.\frac{\partial^2 L}{\partial X\,\partial Y}\right|_{X_o,Y_o}$$
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• Covariance (or variance-covariance) matrix describes covariance
of error terms
• When C = 0, σ2XY = 0 and no correlation, and e1 and e2 aligned
with axes
• If C increases (relative to A, B), the posterior pdf becomes more skewed and elliptical – rotated at an angle ±tan⁻¹(√(A/B))
[Figure panels: C = 0, X and Y uncorrelated; large –ve correlation; large +ve correlation. After Sivia & Skilling (2008) fig 3.7 p. 48]
Uncertainty
• As the correlation grows, if C = (AB)^(1/2) then the contours become infinitely wide in one direction (except for prior bounds)
• In this case σX and σY v. large (i.e. very unreliable parameter
estimates)
• BUT large off-diagonals in covariance matrix mean we can
estimate linear combinations of parameters
• For –ve covariance, the posterior is wide in the direction Y = -mX, where m = (A/B)^(1/2), but narrow perpendicular to this, along Y + mX = c
• i.e. lot of information about Y+mX but little about Y – X/m
• For +ve correlation most info. on Y-mX but not Y + X/m
After Sivia & Skilling (2008) fig 3.7 p. 48
Uncertainty
• Seen the 2 param case, so what about generalisation of Taylor
quadratic approximation to M params?
• Remember, we want {Xoj} to maximise L, the (log) posterior pdf
• Rephrase in matrix form: for the vector Xo, i.e. for i = 1, 2, …, M, we want $\left.\partial L/\partial X_i\right|_{\mathbf{X}_o} = 0$
• The extension of the Taylor expansion to M variables is
$$L = L(\mathbf{X}_o) + \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}\left.\frac{\partial^2 L}{\partial X_i\,\partial X_j}\right|_{\mathbf{X}_o}(X_i - X_{oi})(X_j - X_{oj}) + \ldots$$
• So if X is an M x 1 column vector, $\mathbf{X} = (X_1, X_2, \ldots, X_M)^T$, then, ignoring higher terms, the exponential of the posterior pdf is
$$\mathrm{prob}(\mathbf{X}\,|\,\{\mathrm{data}\},I) \propto \exp\left[\frac{1}{2}(\mathbf{X}-\mathbf{X}_o)^T\,\nabla\nabla L(\mathbf{X}_o)\,(\mathbf{X}-\mathbf{X}_o)\right]$$
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• Where ∇∇L is a symmetric M x M matrix of 2nd derivatives
$$\nabla\nabla L = \begin{pmatrix} \dfrac{\partial^2 L}{\partial X_1\,\partial X_1} & \cdots & \dfrac{\partial^2 L}{\partial X_M\,\partial X_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 L}{\partial X_1\,\partial X_M} & \cdots & \dfrac{\partial^2 L}{\partial X_M\,\partial X_M} \end{pmatrix}$$
• And (X-Xo)^T is the transpose of (X-Xo) (a row vector)
• So this is the generalisation of Q from the 2D case:
$$\mathrm{prob}(X\,|\,\{\mathrm{data}\},I) \propto \exp\left(\frac{1}{2}\left[\frac{AB-C^2}{B}\right](X-X_o)^2\right) \quad\longrightarrow\quad \mathrm{prob}(\mathbf{X}\,|\,\{\mathrm{data}\},I) \propto \exp\left[\frac{1}{2}(\mathbf{X}-\mathbf{X}_o)^T\,\nabla\nabla L(\mathbf{X}_o)\,(\mathbf{X}-\mathbf{X}_o)\right]$$
• And the contour map from before is just a 2D slice through our now M-dimensional parameter space
• The constant of proportionality is $(2\pi)^{-M/2}\sqrt{\det(-\nabla\nabla L)}$
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• So what are the implications of all of this??
• The maximum of the M-parameter posterior PDF is Xo & we know $\nabla L(\mathbf{X}_o) = 0$
• Compare to the 2D case & see that ∇∇L is analogous to -1/σ²
• Can show that in the generalised case the covariance matrix is $[\sigma^2] = -\left(\nabla\nabla L(\mathbf{X}_o)\right)^{-1}$
• The square roots of the diagonals (i=j) give the marginal error bars and the off-diagonals (i≠j) describe the correlations between parameters
• So the covariance matrix contains most of the information describing the model fit AND the faith we have in the parameter estimates
Sivia & Skilling (2008) Chapter 3, p 35-51
Uncertainty
• Sivia & Skilling make the important point (p50) that the inverse of the diagonal elements of a matrix ≠ the diagonal of the inverse of the matrix
• i.e. do NOT try to estimate the value / spread of one parameter in the M-dim case by holding all the others fixed at their optimal values
[Figure: posterior contours in the (Xi, Xj) plane, showing the incorrect ‘best fit’ σii obtained by fixing Xj at Xoj versus the wider, correct marginal σii. After Sivia & Skilling (2008) p50.]
• Need to include marginalisation to get the correct magnitude for the uncertainty
• Discussion of multimodal and asymmetric posterior PDFs, for which a Gaussian is not a good approx: S&S p51….
Summary
• We have seen that we can express the condition for the best estimate of a set of M parameters {Xj} very compactly as $\nabla L(\mathbf{X}_o) = 0$
• Where the jth element of ∇L is $\partial L/\partial X_j$ (log posterior pdf) evaluated at X = Xo
• So this is a set of simultaneous equations which, IF they are linear, i.e. $\nabla L = \mathbf{H}\mathbf{X} + \mathbf{C}$,
• Then we can use linear algebra methods to solve, i.e. $\mathbf{X}_o = -\mathbf{H}^{-1}\mathbf{C}$ (see the sketch below)
• This is the power (joy?) of linearity! Will see more on this later
• Even if the system is not linear, we can often approximate it as linear over some limited domain to allow linear methods to be used
• If not, then we have to use other (non-linear) methods…..
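A tiny illustration of that linear solve, with made-up numbers for H and C (not from any slide); in practice we solve the system rather than invert H explicitly:

```python
# If grad L = H X + C, the optimum satisfies H X0 + C = 0, so X0 = -H^{-1} C.
import numpy as np

H = np.array([[-2.0, 0.5],
              [0.5, -1.0]])     # symmetric matrix of 2nd derivatives (illustrative)
C = np.array([1.0, -0.3])

X0 = np.linalg.solve(H, -C)     # solve H X0 = -C instead of forming H^{-1}
print(X0)
```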
Summary
• Two parameter eg: Gaussian peak + background
• Solve via Bayes’ T using Taylor expansion (to quadratic)
• Issues over experimental setup
– Integration time, number of bins, size etc.
– Impact on posterior PDF
• Can use linear methods to derive uncertainty estimates
and explore correlation between parameters
• Extend to multi-dimensional case using same method
• Be careful when dealing with uncertainty
• KEY: not always useful to look for summary statistics – if
in doubt look at the posterior PDF – this gives full
description
END
Sally Clark case: Bayesian odds ratios
• Two cot-deaths (SIDS), 1 year apart, aged 11 weeks and
8 weeks. Mother Sally Clark charged with double
murder, tried and convicted in 1999
– Statistical evidence was misunderstood, “expert” testimony was
wrong, and a fundamental logical fallacy was introduced
• What happened?
• We can use Bayes’ Theorem to decide between 2
hypotheses
– H1 = Sally Clark committed double murder
– H2 = Two children DID die of SIDS
• http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
• http://yudkowsky.net/rational/bayes
The tragic case of Sally Clark
$$\frac{P(H_1|D)}{P(H_2|D)} = \frac{P(D|H_1)}{P(D|H_2)} \times \frac{P(H_1)}{P(H_2)}$$
• Left-hand side: prob. of H1 or H2 given data D
• First term on the right: the likelihoods, i.e. prob. of getting data D IF H1 is true, or if H2 is true
• Last term: the very important PRIOR probabilities, i.e. our previous best guess
• Data? We observe there are 2 dead children
• We need to decide which of H1 or H2 are more
plausible, given D (and prior expectations)
• i.e. want ratio P(H1|D) / P(H2|D) i.e. odds of H1 being
true compared to H2, GIVEN data and prior
The tragic case of Sally Clark
• ERROR 1: events NOT independent
• P(1 child dying of SIDS)? ~1:1300, but for an affluent, non-smoking mother aged > 26 yrs, ~1:8500.
• Prof. Sir Roy Meadows (expert witness)
– P(2 deaths)? 1:8500*8500 ~ 1:73 million.
– This was KEY to her conviction & is demonstrably wrong
– ~650000 births a year in the UK, so at 1:73M a double cot death would be a 1-in-100-year event. BUT 1 or 2 occur every year – how come?? No one checked…
– The deaths are NOT independent: P(2nd death | 1st death) is 5-10 times higher, i.e. ~1:100 to 1:200, so P(H2) is actually ~ (1/1300) × (5/1300) ~ 1:300000
The tragic case of Sally Clark
• ERROR 2: “Prosecutor’s Fallacy”
– 1:300000 still VERY rare, so she’s unlikely to be innocent, right??
• Meadows “Law”: ‘one cot death is a tragedy, two cot deaths is suspicious and,
until the contrary is proved, three cot deaths is murder’
– WRONG: Fallacy to mistake chance of a rare event as chance that
defendant is innocent
• In large samples, even rare events occur quite frequently – someone wins the lottery (1:14M) nearly every week
• 650000 births a year, expect 2-3 double cot deaths…..
• AND we are ignoring rarity of double murder (H1)
The tragic case of Sally Clark
• ERROR 3: ignoring odds of alternative (also very rare)
– Single child murder v. rare (~30 cases a year) BUT generally significant
family/social problems i.e. NOT like the Clarks.
– P(1 murder) ~ 30:650000 i.e. 1:21700
– Double MUCH rarer, BUT P(2nd|1st murder) ~ 200 x more likely given first,
so P(H1|D) ~ (1/21700* 200/21700) ~ 1:2.4M
• So, two very rare events, but double murder ~ 10 x rarer than
double SIDS
• So P(H1|D) / P(H2|D)?
– P (murder) : P (cot death) ~ 1:10 i.e. 10 x more likely to be double SIDS
– Says nothing about guilt & innocence, just relative probability
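The odds-ratio arithmetic above can be written out directly; the inputs are the rough slide estimates, so the result is order-of-magnitude only:

```python
# Rough odds of double murder (H1) vs double SIDS (H2), given two dead children.
p_double_sids = (1 / 1300) * (5 / 1300)        # ~1 in 300,000 (deaths not independent)
p_double_murder = (1 / 21700) * (200 / 21700)  # ~1 in 2.4 million

odds = p_double_murder / p_double_sids         # ≈ P(H1|D) / P(H2|D)
print(f"P(H1|D)/P(H2|D) ≈ {odds:.2f}, i.e. roughly the 1:10 quoted above")
```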
The tragic case of Sally Clark
• Sally Clark acquitted in 2003 after 2nd appeal (but not on
statistical fallacies) after 3 yrs in prison, died of alcohol
poisoning in 2007
– Meadows “Law” redux: triple murder v triple SIDS?
• In fact, P(triple murder | 2 previous) : P(triple SIDS| 2 previous) ~ ((21700 x
123) x 10) / ((1300 x 228) x 50) = 1.8:1
• So P(triple murder) > P(SIDS) but not by much
• Meadows’ ‘Law’ should be:
– ‘when three sudden deaths have occurred in the same family, statistics give no
strong indication one way or the other as to whether the deaths are more or less
likely to be SIDS than homicides’
From: Hill, R. (2004) Multiple sudden infant deaths – coincidence or beyond coincidence, Pediatric and
Perinatal Epidemiology, 18, 320-326 (http://www.cse.salford.ac.uk/staff/RHill/ppe_5601.pdf)
Bayes’ Theorem & marginalisation
• Generally, using X for Hypothesis, and Y for Data
$$\mathrm{prob}(X|Y,I) = \frac{\mathrm{prob}(Y|X,I) \times \mathrm{prob}(X|I)}{\mathrm{prob}(Y|I)} \qquad\qquad \mathrm{prob}(X|I) = \int_{-\infty}^{+\infty} \mathrm{prob}(X,Y|I)\,dY$$
• Where prob(X|I) is the marginalisation equation
• But if Y is a proposition, how can we integrate over it?
Bayes’ Theorem & marginalisation
• Suppose instead of Y and Ȳ (not Y) we have a set of alternative possibilities: Y1, Y2, …, YM = {Yk}
• E.g. M candidates for an election: Y1 = prob. candidate 1 will win, Y2 = prob. candidate 2 will win, etc.
• The prob. that X is true, e.g. that unemployment will fall within 1 year, irrespective of who wins (Y), is
$$\mathrm{prob}(X|I) = \sum_{k=1}^{M} \mathrm{prob}(X,Y_k|I)$$
• As long as
$$\sum_{k=1}^{M} \mathrm{prob}(Y_k|X,I) = 1$$
• i.e. the various possibilities {Yk} are exhaustive and mutually exclusive, so that if one Yk is true all the others are false, but one must be true
Bayes’ Theorem & marginalisation
• As M gets larger, we approach
$$\mathrm{prob}(X|I) = \int_{-\infty}^{+\infty} \mathrm{prob}(X,Y|I)\,dY$$
• E.g. we could consider an arbitrarily large number of propositions about the range in which my weight WMD could lie
• Choosing contiguous intervals and a large enough range (M → ∞), we will have a mutually exclusive, exhaustive set of possibilities
• So Y represents the parameter of interest (WMD in this case) and the integrand prob(X,Y|I) is now a distribution – a probability density function (pdf):
$$\mathrm{pdf}(X, Y=y\,|\,I) = \lim_{\delta y \to 0}\frac{\mathrm{prob}(X,\, y \le Y < y + \delta y\,|\,I)}{\delta y}$$
• And the prob. that Y lies in a finite range between y1 and y2 (and X is also true) is
$$\mathrm{prob}(X,\, y_1 \le Y < y_2\,|\,I) = \int_{y_1}^{y_2} \mathrm{pdf}(X,Y|I)\,dY$$