Bayesian Reasoning: Maximum Entropy
A/Prof Geraint F. Lewis
Rm 560: [email protected]
Common Sense
We have spent quite a bit of time exploring the posterior probability
distribution, but, of course, to calculate this we need to use the
likelihood function and our prior knowledge.
However, how our prior knowledge is encoded is the biggest source of argument about Bayesian statistics, with cries that subjective choices influence the outcomes (but shouldn't this be the case?).
Realistically, we could consider a wealth of prior probability distributions that agree with the constraints we have (e.g. a specified mean), but which do we choose?
Answer: we pick the one which is maximally non-committal about
missing information.
Shannon’s theorem
In 1948, Shannon developed a measure of the uncertainty of a probability distribution, which he labeled entropy. He showed that the uncertainty of a discrete probability distribution {pi} is
S = - Σi pi ln( pi )
Jaynes argued that the maximally non-committal probability
distribution is the one with the maximum entropy; hence, of all
possible probability distributions we should choose the one that
maximizes S.
The other distributions will imply some sort of correlation (we’ll see
this in a moment).
Example
You are told that an experiment has two possible outcomes; what is
the maximally non-committal distribution you should assign to the
two outcomes?
Clearly, if we assign p1 = x, then p2 = (1 - x) and the entropy is
S = - x ln( x ) - (1 - x) ln( 1 - x )
The maximum value of the entropy occurs at p1=p2=1/2.
But isn’t this what you would have guessed?
If we have any further information (e.g. the existence of correlations between outcomes 1 and 2), we can build this into our measure above and re-maximize.
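This is easy to check numerically; a minimal sketch of my own (not from the lecture notes) using scipy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_entropy(x):
    """Negative entropy of a two-outcome distribution with p1 = x, p2 = 1 - x."""
    return x * np.log(x) + (1 - x) * np.log(1 - x)

res = minimize_scalar(neg_entropy, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)  # ~0.5: the maximally non-committal assignment is p1 = p2 = 1/2
```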
The Kangaroo Justification
Suppose you are given some basic information about the population of
Australian kangaroos;
1) 1/3 of kangaroos have blue eyes
2) 1/3 of kangaroos are left handed
How many kangaroos are blue eyed and left handed? We know the marginal proportions, which constrain the joint probabilities:

                        Left-Handed
                        True      False
  Blue eyes   True      p1        p2
              False     p3        p4

with p1 + p2 = 1/3, p1 + p3 = 1/3 and p1 + p2 + p3 + p4 = 1.
The Kangaroo Justification
What are the options?
1) Independent case (no correlation): p1 = 1/3 × 1/3 = 1/9
2) Maximal positive correlation: p1 = 1/3 (every blue-eyed kangaroo is left handed)
3) Maximal negative correlation: p1 = 0 (no blue-eyed kangaroo is left handed)
The Kangaroo Justification
So there are a range of potential p1 values (which set all the other
values), but which do we choose?
Again, we wish to be non-committal and not assume any prior
correlations (unless we have evidence to support any particular
prior).
What function of the {pi} can we maximize to select this particular case?

  Variational function    Optimal p1    Implied correlation
  - Σ pi ln( pi )         1/9           uncorrelated
  - Σ pi²                 1/12          negative
  Σ ln( pi )              0.1303        positive
  Σ pi^(1/2)              0.1218        positive
So the variational function that selects the non-committal case is
the entropy. As we will see, this is very important for image
reconstruction.
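We can reproduce this table numerically; a minimal sketch (my own check, not the lecture's code) that maximizes each variational function subject to the kangaroo constraints:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Parametrise the kangaroo table by z = p1; the marginal constraints then fix
# p2 = p3 = 1/3 - z and p4 = 1/3 + z, with 0 < z < 1/3.
def probs(z):
    return np.array([z, 1/3 - z, 1/3 - z, 1/3 + z])

variational = {
    "-sum pi ln(pi)": lambda p: -np.sum(p * np.log(p)),
    "-sum pi^2":      lambda p: -np.sum(p ** 2),
    "sum ln(pi)":     lambda p: np.sum(np.log(p)),
    "sum sqrt(pi)":   lambda p: np.sum(np.sqrt(p)),
}

for name, f in variational.items():
    res = minimize_scalar(lambda z: -f(probs(z)),
                          bounds=(1e-6, 1/3 - 1e-6), method="bounded")
    print(f"{name:16s} optimal p1 = {res.x:.4f}")
# Only the entropy recovers p1 = 1/9 = 0.1111, the uncorrelated assignment.
```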
Incorporating a prior
Section 8.4 of the textbook discusses a justification of the MaxEnt
approach, considering the rolling of a weighted die and examining
the “multiplicity” of the outcomes (i.e. some potential outcomes are
more likely than others).
Suppose you have some prior information you want to incorporate into the entropy measure, so that we have prior estimates {mi} of our probabilities {pi}. Following the same arguments, the quantity we want to maximize is the Shannon-Jaynes entropy
S = - Σi pi ln( pi / mi )
If the {mi} are all equal, the prior has no influence on the maximization; we will see that this is important in considering image reconstruction.
Incorporating a prior
When considering a continuous probability distribution, the entropy becomes
S = - ∫ p(y) ln( p(y) / m(y) ) dy
where m(y) is known as the Lebesgue measure.
This quantity (which still encodes our prior) ensures that the entropy is insensitive to a change of coordinates, as m(y) and p(y) transform in the same way.
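To see the invariance explicitly: under a change of variables y → z(y), both densities pick up the same Jacobian, so (a one-line sketch)

```latex
% p_z(z) = p_y(y) |dy/dz| and m_z(z) = m_y(y) |dy/dz|, so the ratio p/m is
% unchanged and p_y(y) dy = p_z(z) dz; hence
\[
S = -\int p_y(y)\,\ln\frac{p_y(y)}{m_y(y)}\,\mathrm{d}y
  = -\int p_z(z)\,\ln\frac{p_z(z)}{m_z(z)}\,\mathrm{d}z .
\]
```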
Some examples
Suppose you are told some experiment has n possible outcomes.
Without further information, what prior distribution would you assign
the outcomes?
Your prior estimates of the outcomes (without additional
information) would be to assign {mi } = 1/n; what does MaxEnt say
the values of {pi } should be?
The quantity we maximize is our entropy with a Lagrange multiplier to account for the constraint that the probabilities sum to one:
Sc = - Σi pi ln( pi / mi ) + λ ( 1 - Σi pi )
Some examples
Taking the (partial) derivative of Sc with respect to the pi and the multiplier λ, we can show that
∂Sc/∂pi = - ln( pi / mi ) - 1 - λ = 0
and so
pi = mi exp( -(1 + λ) )
All that is left is to evaluate λ, which we get from the normalization constraint:
Σi pi = exp( -(1 + λ) ) Σi mi = 1
Given that the {mi} sum to one, λ = -1 and {pi} = {mi}.
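As a numerical cross-check (a sketch of my own, with a made-up four-outcome prior), maximizing the Shannon-Jaynes entropy subject only to normalization does indeed return the prior:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical prior estimates {mi} for a four-outcome experiment (they sum to one)
m = np.array([0.1, 0.2, 0.3, 0.4])

def neg_shannon_jaynes(p):
    return np.sum(p * np.log(p / m))   # -S

normalisation = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
res = minimize(neg_shannon_jaynes, x0=np.full(4, 0.25),
               bounds=[(1e-9, 1.0)] * 4, constraints=[normalisation])
print(res.x)   # -> [0.1, 0.2, 0.3, 0.4]; with no further constraints, {pi} = {mi}
```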
Nicer examples
What if you have additional constraints, such as knowing the mean μ of the outcome? Then your constrained entropy is
Sc = - Σi pi ln( pi / mi ) + λ0 ( 1 - Σi pi ) + λ1 ( μ - Σi xi pi )
where we now have two Lagrange multipliers, one for each of the constraints.
Through the same procedure, we can look for the maximum and find
pi = mi exp( -(1 + λ0) - λ1 xi )
Generally, solving for the λs is difficult analytically, but it is straightforward numerically.
Nicer examples
Suppose that you are told that a die has a mean score of μ dots per roll; what is the probability weighting of each face? If the weightings are equal, the die is unweighted and fair. If, however, the probabilities are different, we should suspect that the die is unfair.
If μ = 3.5, it is easy to show from the constraints that λ0 = -1 and λ1 = 0 (write out the two constraints in terms of the previous equation and divide out the λ0 term). If we have no prior reason to think otherwise, each face would be weighted equally, and so the final result is that {pi} = {mi}.
The result is as we expect: for an (unweighted) average of 3.5, the most probable distribution is the one in which all faces have equal weight.
Nicer examples
Suppose, however, you were told that the mean was μ = 4.5; what is the most probable distribution for {pi}? We can follow the same procedure as in the previous example, but now find that λ0 = -0.37 and λ1 = 0.49; with this, the distribution in {pi} is skewed towards the higher die faces, as we would expect (increasing the mean over a sequence of rolls).
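Here is one way to compute these distributions numerically (a sketch under my own parametrization: with a uniform prior the solution can be folded into the single form pi ∝ exp(β i), where β plays the role of the mean-constraint multiplier, up to sign convention, and the normalization absorbs the other):

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def maxent_die(mu):
    """MaxEnt face probabilities for a die constrained to have mean mu,
    assuming a uniform prior; the solution has the form p_i ∝ exp(beta * i)."""
    def mean_error(beta):
        w = np.exp(beta * faces)
        return np.sum(faces * w) / np.sum(w) - mu
    beta = brentq(mean_error, -10.0, 10.0)   # root-find the multiplier numerically
    w = np.exp(beta * faces)
    return w / np.sum(w)

print(maxent_die(3.5))   # all faces 1/6: the fair die
print(maxent_die(4.5))   # weights increase with face value, skewing the mean upwards
```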
Additional constraints
Additional information will provide additional constraints on the probability distribution. If we know a mean and a variance, then
Sc = - ∫ p(y) ln( p(y)/m(y) ) dy + λ0 ( 1 - ∫ p(y) dy ) + λ1 ( μ - ∫ y p(y) dy ) + λ2 ( σ² - ∫ (y - μ)² p(y) dy )
Given what we have seen previously, we should expect the solution (when taking the continuum limit, with a uniform measure) to be of the form
p(y) ∝ exp( - λ1 y - λ2 y² )
which, when appropriately normalized, is the (expected) Gaussian distribution.
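Completing the square makes this explicit (a brief sketch, assuming a uniform measure m(y) so that it can be absorbed into the normalization):

```latex
\[
-\lambda_1 y - \lambda_2 y^2
  = -\lambda_2\left(y + \frac{\lambda_1}{2\lambda_2}\right)^{\!2}
    + \frac{\lambda_1^2}{4\lambda_2},
\qquad
p(y) = \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\!\left[-\frac{(y-\mu)^2}{2\sigma^2}\right],
\quad
\mu = -\frac{\lambda_1}{2\lambda_2},\;
\sigma^2 = \frac{1}{2\lambda_2}.
\]
```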
Image Reconstruction
In science, we are interested in gleaning underlying physical properties from data sets, although in general the data contain signals which are blurred (through optics or physical effects), with added noise (such as photon counting statistics or detector noise). So, how do we extract our image from the blurry, noisy data?
Image Reconstruction
Naively, you might assume that you can simply “invert” the process and
recover the original image. However, the problem is ill-posed, and a
“deconvolution” will amplify the noise in a (usually) catastrophic way.
We could attempt to suppress the noise (e.g. Wiener filtering) but isn’t
there another way?
Image Reconstruction
Our image consists of a series of pixels, each with a photon count Ii. We can treat this as a probability distribution, such that
pi = Ii / Σj Ij
The value in each pixel, therefore, is the probability that the next photon will arrive in that pixel.
Note that for an image, pi≥0, and so we are dealing with a “positive,
additive distribution” (note, this is important, as some techniques
like to add negative flux in regions to improve a reconstruction).
Image Reconstruction
We can apply Bayes' theorem to calculate the posterior probability of a proposed "true" image, {Ii}, given the data. Following the argument given in the text, we see that
p( {Ii} | D ) ∝ exp( αS ) exp( -χ²/2 )
where the entropic prior exp(αS) uses the Shannon-Jaynes entropy of the proposed image and the likelihood is expressed through the usual χ² misfit.
Image Reconstruction
So we aim to maximize the log posterior, i.e. a quantity of the form αS - χ²/2.
The method, therefore, requires us to have a way of generating proposal images (i.e. throwing down blobs of light), convolving them with our blurring function (to give the predicted, blurred image) and comparing to the data through χ².
The requirement that pi ≥ 0 ensures that the proposal image is everywhere positive (which is good!).
What does the entropy term do? It provides a "regularization" which drives the solution towards our prior distribution {mi}, while the χ² drives a fit to the data. Note, however, that we sometimes need to add additional regularization terms to enforce smoothness on the solution.
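A toy one-dimensional version of this procedure (a minimal sketch of my own; the test signal, psf width, noise level and the weight α are all made up, and a flat prior image is assumed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical "true" 1D sky: two point sources on a faint flat background
n = 64
truth = np.full(n, 0.1)
truth[20], truth[40] = 5.0, 3.0

# Blur with a Gaussian point-spread function and add Gaussian noise
sigma_psf, sigma_noise = 2.0, 0.05
data = gaussian_filter1d(truth, sigma_psf) + rng.normal(0.0, sigma_noise, n)

m = np.full(n, data.sum() / n)    # flat prior image {mi}, total flux taken from the data
alpha = 1.0                       # regularization weight (hand-tuned here)

def neg_log_posterior(img):
    model = gaussian_filter1d(img, sigma_psf)           # convolve the proposal with the psf
    chi2 = np.sum(((model - data) / sigma_noise) ** 2)  # misfit to the data
    S = -np.sum(img * np.log(img / m))                  # Shannon-Jaynes entropy of the proposal
    return 0.5 * chi2 - alpha * S                       # minimize chi^2/2 - alpha*S

res = minimize(neg_log_posterior, x0=m.copy(),
               bounds=[(1e-8, None)] * n, method="L-BFGS-B")
restored = res.x   # everywhere positive; sharper than `data`, without the wild
                   # swings of a naive deconvolution
```

In practice α is not hand-tuned but chosen to balance the two terms (for example, so that χ² is comparable to the number of data points).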
Image Reconstruction
Here is an example of MaxEnt reconstruction with differing point-spread functions (psf) and added noise. Exactly what you get back depends on the quality of your data, but in each case you can read the recovered message.
Image Reconstruction
Reconstruction of the radio galaxy M87 (Bryan & Skilling 1980)
using MaxEnt. Note the reduction in the noise and higher detail
visible in the radio jet.
Image Reconstruction
Not always a good thing!! (MaxEnt)