Bayesian methods - University of Sheffield


Bayesian methods, priors
and Gaussian processes
John Paul Gosling
Department of Probability and Statistics
Overview
• The Bayesian paradigm
• Bayesian data modelling
• Quantifying prior beliefs
• Data modelling with Gaussian processes
Bayesian methods
The beginning, the subjectivist philosophy,
and an overview of Bayesian techniques.
Subjective probability
• Bayesian statistics involves a very different way
of thinking about probability in comparison to
classical inference.
• The probability of a proposition is defined as a
measure of a person’s degree of belief.
• Wherever there is uncertainty, there is
probability.
• This covers both aleatory and epistemic uncertainty.
Differences with classical inference
To a frequentist, data are repeatable, parameters
are not:
P(data|parameters)
To a Bayesian, the parameters are uncertain, the
observed data are not:
P(parameters|data)
Bayes’s theorem for distributions
• In early probability courses, we are taught
Bayes’s theorem in a particular way: Bayes’s
theorem for events,

  P(A|B) = P(B|A)P(A) / P(B).

• In Bayesian statistics, we use this theorem
extended to continuous distributions:

  p(θ|x) = p(x|θ)p(θ) / ∫ p(x|θ)p(θ) dθ.
Prior to posterior updating
[Diagram: Prior + Data → Posterior]
Bayes’s theorem is used to update our beliefs.
The posterior is proportional to the prior times the
likelihood.
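To make the proportionality concrete, here is a minimal sketch of prior-to-posterior updating on a grid (numpy only is assumed; the Beta(2,2) prior and binomial likelihood anticipate the coin example later in the talk):

```python
import numpy as np

# Grid approximation of Bayes's theorem for a single parameter theta.
theta = np.linspace(0.001, 0.999, 999)

prior = theta * (1 - theta)              # Beta(2,2), up to a constant
likelihood = theta**3 * (1 - theta)**7   # binomial kernel: 3 tails in 10 tosses

# Posterior is proportional to prior times likelihood.
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()        # normalise over the grid

print(theta[np.argmax(posterior)])       # posterior mode, ~0.33
```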
Posterior distribution
• So, once we have our posterior, we have
captured all our beliefs about the parameter of
interest.
• We can use this to do informal inference, e.g.
intervals and summary statistics.
• Formally, to make choices about the parameter,
we must couple this with decision theory to
calculate the optimal decision.
Sequential updating
[Diagram: Prior beliefs → Posterior beliefs → Posterior beliefs → …]
Today’s posterior is tomorrow’s prior
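A minimal sketch of sequential updating for the coin example (scipy is assumed; the batch counts are made up for illustration):

```python
from scipy.stats import beta

a, b = 2, 2                  # Beta(2,2) prior

# First batch of data: 3 tails in 10 tosses.
a, b = a + 3, b + 7          # today's posterior, Beta(5,9)...

# ...is tomorrow's prior.  Second batch: 6 tails in 10 tosses.
a, b = a + 6, b + 4          # Beta(11,13)

# Updating with all 20 tosses at once gives exactly the same posterior.
print(beta.mean(a, b))       # ~0.458
```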
The triplot
• A triplot gives a graphical representation of prior
to posterior updating.
[Figure: triplot showing the prior, likelihood and posterior on the same axes]
Audience participation
Quantification of our prior beliefs
• What proportion of people in this room are
left handed? – call this parameter ψ
• When I toss this coin, what’s the probability
of me getting a tail? – call this θ
A simple example
• The archetypal example in probability
theory is the outcome of tossing a coin.
• Each toss of a coin is a Bernoulli trial with
the probability of tails given by θ.
• If we carry out 10 independent trials, we
know the number of tails (X) will follow a
binomial distribution: X | θ ~ Bi(10, θ).
Our prior distribution
• A Beta(2,2) distribution may reflect our beliefs
about θ.
Our posterior distribution
• If we observe X = 3, we get the following triplot:
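A sketch of how such a triplot might be produced (matplotlib and scipy are assumed; the likelihood is rescaled purely so that all three curves are visible on one axis):

```python
import numpy as np
from scipy.stats import beta, binom
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 501)
x, n, a, b = 3, 10, 2, 2

prior = beta.pdf(theta, a, b)
likelihood = binom.pmf(x, n, theta)
likelihood = likelihood * prior.max() / likelihood.max()  # rescale for display
posterior = beta.pdf(theta, a + x, b + n - x)             # conjugacy: Beta(5,9)

for curve, label in [(prior, "Prior"), (likelihood, "Likelihood"),
                     (posterior, "Posterior")]:
    plt.plot(theta, curve, label=label)
plt.xlabel("theta")
plt.legend()
plt.show()
```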
Our posterior distribution
• If we are more convinced, a priori, that θ = 0.5
and we observe X = 3, we get the following
triplot:
Credible intervals
• If asked to provide an interval in which there is a
90% chance of θ lying, we can derive this
directly from our posterior distribution.
• Such an interval is called a credible interval.
• In our example, using our first prior
distribution, we can report a 95% posterior
credible interval for θ of (0.14, 0.62).
• In frequentist statistics, there are confidence
intervals, but they cannot be interpreted in the
same way.
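The interval quoted above is just two quantiles of the Beta(5, 9) posterior; a one-line check (scipy assumed):

```python
from scipy.stats import beta

# 95% equal-tailed credible interval from the Beta(5, 9) posterior.
lo, hi = beta.ppf([0.025, 0.975], 5, 9)
print(round(lo, 2), round(hi, 2))   # roughly (0.14, 0.62)
```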
Basic linear model
• Yesterday we saw a lot of this:

  y = Xβ + ε.

• We have a least squares solution given by

  β̂ = (XᵀX)⁻¹Xᵀy.

• Instead of trying to find the optimal set of
parameters, we express our beliefs about them.
Basic linear model
• By selecting appropriate priors for the two
parameters (the coefficients β and the error
variance σ²), we can derive the posterior
analytically.
• It is a normal inverse-gamma distribution.
• With prior mean m₀ and prior dispersion V₀ for β,
the mean of our posterior distribution is then

  E[β | y] = (V₀⁻¹ + XᵀX)⁻¹ (V₀⁻¹m₀ + XᵀX β̂),

which is a weighted average of the LSE and the
prior mean.
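As a sketch of this weighted-average behaviour, here is the coefficient update with the noise variance treated as known, which is a simplification of the full normal inverse-gamma analysis (numpy assumed; m0, V0 and sigma2 are illustrative names):

```python
import numpy as np

def posterior_mean(X, y, m0, V0, sigma2):
    """Posterior mean of beta in y = X beta + e, e ~ N(0, sigma2 I),
    under the conjugate prior beta ~ N(m0, V0)."""
    precision = np.linalg.inv(V0) + X.T @ X / sigma2
    rhs = np.linalg.solve(V0, m0) + X.T @ y / sigma2
    return np.linalg.solve(precision, rhs)   # blends prior mean and LSE

# A vague prior (large V0) gives back (almost) the least squares estimate.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(0, 4, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)
print(posterior_mean(X, y, np.zeros(2), 100 * np.eye(2), 0.25))
```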
Bayesian model comparison
• Suppose we have two plausible models for a set
of data, M and N say.
• We can calculate posterior odds in favour of M
using

  P(M | x) / P(N | x) = B × P(M) / P(N),

i.e. posterior odds = Bayes factor (B) × prior odds.
Bayesian model comparison
• The Bayes factor is calculated using

  B = p(x | M) / p(x | N),

where each term is the marginal likelihood of the
data under that model.
• A Bayes factor that is greater than one would
mean that your odds in favour of M increase.
• Bayes factors naturally help guard against too
much model structure.
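For the coin example, both marginal likelihoods are available in closed form, so a Bayes factor can be computed directly. A sketch (scipy assumed; pitting the Beta(2,2) model against a fixed fair coin is my choice of comparison, not the slides'):

```python
import numpy as np
from scipy.special import betaln, comb

x, n = 3, 10

# Model M: theta ~ Beta(2,2).  Beta-binomial marginal likelihood:
# p(x|M) = C(n,x) B(a + x, b + n - x) / B(a, b).
a, b = 2, 2
log_pM = np.log(comb(n, x)) + betaln(a + x, b + n - x) - betaln(a, b)

# Model N: theta fixed at 0.5.
log_pN = np.log(comb(n, x)) + n * np.log(0.5)

print(np.exp(log_pM - log_pN))   # Bayes factor; >1 favours M
```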
Advantages/Disadvantages
• Bayesian methods are often more complex than
frequentist methods.
• There is not much software to give scientists off-the-shelf analyses.
• Subjectivity: all the inferences are based on
somebody’s beliefs.
Advantages/Disadvantages
• Bayesian statistics offers a framework to deal with all the
uncertainty.
• Bayesians make use of more information – not just the
data in their particular experiment.
• The Bayesian paradigm is very flexible, and it is able to
tackle problems that frequentist techniques cannot.
• In selecting priors and likelihoods, Bayesians are
showing their hands – they can’t get away with making
arbitrary choices when it comes to inference.
• …
Summary
• The basic principles of Bayesian statistics
have been covered.
• We have seen how we update our beliefs
in the light of data.
• Hopefully, I’ve convinced you that the
Bayesian way is the right way.
Priors
Advice on choosing suitable prior
distributions and eliciting their parameters.
Importance of priors
• As we saw in the previous section, prior beliefs
about uncertain parameters are a fundamental
part of Bayesian statistics.
• When we have few data about the parameter of
interest, our prior beliefs dominate inference
about that parameter.
• In any application, effort should be made to
model our prior beliefs accurately.
Weak prior information
• If we accept the subjective nature of Bayesian statistics
and are not comfortable using subjective priors, then
many have argued that we should try to specify prior
distributions that represent no prior information.
• These prior distributions are called noninformative,
reference, ignorance or weak priors.
• The idea is to have a completely flat prior distribution
over all possible values of the parameter.
• Unfortunately, this can lead to improper distributions
being used.
Weak prior information
• In our coin tossing example, Be(1,1), Be(0.5,0.5) and
Be(0,0) have been recommended as noninformative
priors. Be(0,0) is improper.
Conjugate priors
• When we move away from noninformative
priors, we might use priors that are in a
convenient form.
• That is a form where combining them with the
likelihood produces a distribution from the same
family.
• In our example, the beta distribution is a
conjugate prior for a binomial likelihood.
Informative priors
• An informative prior is an accurate
representation of our prior beliefs.
• We are not interested in the prior being part of
some conjugate family.
• An informative prior is essential when we have
few or no data for the parameter of interest.
• Elicitation, in this context, is the process of
translating someone’s beliefs into a distribution.
Elicitation
• It is unrealistic to expect someone to be able to
fully specify their beliefs in terms of a probability
distribution.
• Often, they are only able to report a few
summaries of the distribution.
• We usually work with medians, modes and
percentiles.
• Sometimes they are able to report means and
variances, but there are more doubts about
these values.
Elicitation
• Once we have some information about their
beliefs, we fit some parametric distribution to
them.
• These distributions almost never fit the
judgements precisely.
• There are nonparametric techniques that can
bypass this.
• Feedback is essential in the elicitation process.
Normal with unknown mean
Noninformative prior: p(θ) ∝ 1, which, for
xᵢ ~ N(θ, σ²) with σ² known, gives the posterior
θ | x ~ N(x̄, σ²/n).
Normal with unknown mean
Conjugate prior: θ ~ N(m, v), which gives a normal
posterior whose mean is a precision-weighted average
of the prior mean m and the sample mean x̄.
Normal with unknown mean
Proper prior:
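These prior choices lead to standard normal-normal updates. A minimal sketch of the conjugate case, assuming the observation variance σ² is known (numpy only; the flat, noninformative prior is recovered by letting the prior variance grow large):

```python
import numpy as np

def normal_mean_posterior(x, sigma2, m, v):
    """x_i ~ N(theta, sigma2) with sigma2 known, and theta ~ N(m, v):
    the posterior precision is the sum of prior and data precisions."""
    n = len(x)
    post_precision = 1 / v + n / sigma2
    post_mean = (m / v + x.sum() / sigma2) / post_precision
    return post_mean, 1 / post_precision

x = np.array([9.8, 10.4, 10.1, 9.7])

# Conjugate prior N(0, 1) vs an almost-flat prior N(0, 1e6):
print(normal_mean_posterior(x, 0.25, 0.0, 1.0))
print(normal_mean_posterior(x, 0.25, 0.0, 1e6))  # ~ (xbar, sigma2/n)
```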
Structuring prior information
• It is possible to structure our prior beliefs in a
hierarchical manner:

  Data model:             x | θ ~ p(x | θ)
  First level of prior:   θ | φ ~ p(θ | φ)
  Second level of prior:  φ ~ p(φ)

• Here, φ is referred to as the hyperparameter(s).
Structuring prior information
• An example of this type of hierarchical structure is a
nonparametric regression model:

  Data model:             the observations given the regression function μ
  First level of prior:   μ given its hyperparameters
  Second level of prior:  the hyperparameters themselves

• We want to know about μ, so the other
parameters must be removed. The other
parameters are known as nuisance parameters.
Analytical tractability
• The more complexity that is built into your prior
and likelihood, the more likely it is that you won’t
be able to derive your posterior analytically.
• In the 1990s, computational techniques were
devised to combat this.
• Markov chain Monte Carlo (MCMC) techniques
allow us to access our posterior distributions
even in complex models.
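A minimal random-walk Metropolis sketch for the coin posterior (numpy only; the step size and iteration count are arbitrary choices that would need tuning in practice):

```python
import numpy as np

def log_post(t):
    # Unnormalised log posterior: Beta(2,2) prior, 3 tails in 10 tosses.
    return 4 * np.log(t) + 8 * np.log(1 - t) if 0 < t < 1 else -np.inf

rng = np.random.default_rng(0)
theta, lp = 0.5, log_post(0.5)
samples = []
for _ in range(20000):
    prop = theta + 0.1 * rng.normal()          # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
        theta, lp = prop, lp_prop
    samples.append(theta)

print(np.mean(samples[2000:]))   # ~5/14, the Beta(5,9) posterior mean
```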
Sensitivity analysis
• It is clear that the elicitation of prior distributions
is far from being a precise science.
• A good Bayesian analysis will check that the
conclusions are sufficiently robust to changes in
the prior.
• If they aren’t, we need more data or more
agreement on the prior structure.
Summary
• Prior distributions are an important part of
Bayesian statistics.
• They are far from being ad hoc,
pick-the-easiest-to-use distributions when modelled
properly.
• There are classes of noninformative priors
that allow us to represent ignorance.
Gaussian processes
A Bayesian data modelling technique that
fully accounts for uncertainty.
Data modelling: a fully
probabilistic method
• Bayesian statistics offers a framework to
account for uncertainty in data modelling.
• In this section, we’ll concentrate on
regression using Gaussian processes and
the associated Bayesian techniques.
The basic idea
We have:

  y = f(x) + ε   or   y = f(x),

and both f(.) and ε are uncertain.

In order to proceed, we must elicit our beliefs
about these two.

ε can be dealt with as in the previous section.
Gaussian processes
A process is Gaussian if and only if every finite
sample from the process is a vector-valued
Gaussian random variable.
• We assume that f(.) follows a Gaussian process
a priori.
• That is,

  f(.) ~ GP(m(.), c(.,.)),

i.e. any sample of f(x)’s will follow a multivariate
normal distribution.
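A sketch of what this definition means in practice: evaluating the mean and covariance functions on a grid gives an ordinary multivariate normal that we can sample from (numpy assumed; the linear mean and Gaussian covariance are illustrative choices):

```python
import numpy as np

x = np.linspace(0, 4, 100)

mean = 1.0 + 2.0 * x                          # m(x): roughly linear
cov = np.exp(-(x[:, None] - x[None, :])**2)   # c(x, x'): smooth draws

# Any finite sample of f(x)'s is multivariate normal, so sampling the
# process on the grid is just sampling an MV-normal (jitter for stability).
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(x)), size=3)
print(draws.shape)   # (3, 100): three functions evaluated on the grid
```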
Gaussian processes
We have prior beliefs about the form of the
underlying model.
We observe/experiment to get data about
the model with which we train our GP.
We are left with our posterior beliefs about
the model, which can have a ‘nice’ form.
A simple example
Warning: more audience
participation coming up
A simple example
• Imagine we have data about some one-dimensional
phenomenon.
• Also, we’ll assume that there is no observational
error.
• We’ll start with five data points between 0 and 4.
• A priori, we believe f(.) is roughly linear and
differentiable everywhere.
A simple example
[Figures: the Gaussian process fit to the five data points]
A simple example with error
• Now, we’ll start over and put some
Gaussian error on the observations:

  yᵢ = f(xᵢ) + εᵢ,   εᵢ ~ N(0, σ²).

• Note: in kriging, this is equivalent to
adding a nugget effect.
A simple example with error
[Figure: the Gaussian process fit with observation error]
The mean function
Recall that our prior mean for f(x) is given by

  m(x) = h(x)ᵀβ,

where h(x) is a vector of regression functions
evaluated at x and β is a vector of unknown
coefficients.
The form of the regression functions is dependent
on the application.
The mean function
• It is common practice to use a constant (bias)
• Linear functions
• Gaussian basis functions
• Trigonometric basis functions
•…
It is important to capture your beliefs about f(.)
in the mean function.
The correlation structure
The correlation function defines how we believe f(.)
will deviate nonparametrically from the mean
function.
In the examples here, I have used a stationary
correlation function of the form

  c(x, x′) = exp{−(x − x′)ᵀ B (x − x′)},

where B is a diagonal matrix of roughness
parameters.
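Putting the mean and correlation pieces together, here is a sketch of GP prediction with a zero mean function for brevity (the slides use the regression mean h(x)ᵀβ); numpy only, and the values of B and the variance are illustrative:

```python
import numpy as np

def gp_predict(x_obs, y_obs, x_new, B=1.0, var=1.0, nugget=0.0):
    """Posterior mean and variance under a zero prior mean and the
    stationary correlation c(x, x') = exp(-B (x - x')^2).
    nugget > 0 corresponds to Gaussian observation error."""
    def corr(a, b):
        return np.exp(-B * (a[:, None] - b[None, :])**2)
    K = var * corr(x_obs, x_obs) + nugget * np.eye(len(x_obs))
    Ks = var * corr(x_new, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = var * corr(x_new, x_new) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Five noise-free points, as in the 1-D example (values illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + np.sin(3 * x)
m, v = gp_predict(x, y, np.linspace(0, 4, 9))
print(m.round(2), v.round(4))
```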
Dealing with the model parameters
We have the following hyperparameters: β, σ² and B.
β and σ² can be removed analytically using
conjugate priors.
B is not so easily accounted for…
A 2-D example
Rock porosity somewhere in the US
A 2-D example
Mean of our posterior beliefs about the underlying model, f(.).
A 2-D example
Mean of our posterior beliefs about the underlying model, f(.), in 3D!!!
A 2-D example
Our uncertainty about f(.): two standard deviations.
A 2-D example
Our uncertainty about f(.) looks much better in 3D.
A 2-D example - prediction
• The geologists held back two observations at:
P1 = (0.60, 0.35), z1 = 10.0 and P2 = (0.20, 0.90), z2 = 20.8.
• Using our posterior distribution for f(.) and ε, we
get the following 90% credible intervals:
z1 | rest of points in (8.7, 12.0) and
z2 | rest of points in (21.1, 26.0).
Diagnostics
• Cross validation allows us to check the validity of our GP
fit.
• Two variations are often used: leave-one-out or
leave-final-20%-out.
• Leave-one-out
  • Hyperparameters are estimated using all the data and are then
    fixed when prediction is carried out for each omitted point.
• Leave-final-20%-out (hold out)
  • Hyperparameters are estimated using the reduced data subset.
• Cross validation alone is not enough to justify the GP fit.
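A sketch of the leave-one-out scheme described above (numpy assumed; `fit` and `predict` stand in for whatever GP fitting and prediction routines are in use and are hypothetical names):

```python
import numpy as np

def loo_rmse(x, y, fit, predict):
    """Leave-one-out RMSE.  As on the slide: hyperparameters are estimated
    once from all the data, then held fixed for every omitted point."""
    params = fit(x, y)                        # hypothetical fitting routine
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        pred = predict(x[keep], y[keep], x[i:i + 1], params)
        errors.append(y[i] - pred[0])
    return np.sqrt(np.mean(np.square(errors)))
```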
Cross validation for the 2-D example
• Applying leave-one-out cross validation gives an RMSE of:
  Constant: 2.1787
  Linear:   2.1185
  (Using a linear mean function reduces RMSE by 2.8%.)
• Applying leave-last-20%-out cross validation gives:
  Constant: 6.8684
  Linear:   5.7466
  (A 16.3% difference.)
Benefits and limitations of GPs
• Gaussian processes offer a rich class of models
which, when fitted properly, is extremely flexible.
• They also offer us a framework in which we can
account for all of our uncertainty.
• If there are discontinuities, the method will
struggle to provide a good fit.
• The computation time hinges on the inversion of
an n × n matrix, where n is the number of data points.
Extensions
• Nonstationarity in the covariance can be
modelled by adding extra levels to the variance term
or by deforming the input space.
• Discontinuities can be handled by using piecewise
Gaussian process models.
• The GP model can be applied in a classification
setting.
• There is a lot more research on GPs, and there
will probably be a way of using them in your
applications.
Further details
I have set up a section on my website that
has a comprehensive list of references for
extended information on the topics covered
in this presentation.
j-p-gosling.staff.shef.ac.uk