
Bayesian Neural Networks
Bayesian statistics
• An example of a Bayesian probability statement: "The probability of it raining tomorrow is 0.3".
• Suppose we want to reason with information that contains probabilities, such as: "There is a 70% chance that the patient has a bacterial infection".
• Bayesian methods rest on the idea that every hypothesis has a prior probability of being true.
Priors
• Given a prior probability about some hypothesis
(e.g. does the patient have influenza?) there must
be some evidence we can call on to adjust our
views (beliefs) on the matter.
• Given relevant evidence we can modify this prior
probability to produce a posterior probability of
the same hypothesis given new evidence.
• The following terms are used:
Terms
• p(X) means prior probability of X
• p(X|Y) means probability of X given that we have
observed evidence Y
• p(Y) is the probability of the evidence Y occurring
on its own.
• p(Y|X) is the probability of the evidence Y
occurring given the hypothesis X is true (the
likelihood).
Bayes Theorem:
p(X|Y) = p(Y|X) p(X) / p(Y)

posterior = (likelihood × prior) / evidence
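As a minimal sketch, the theorem translates directly into Python; the function and argument names below are illustrative, not from any library:

def bayes_posterior(prior, likelihood, evidence):
    # p(X|Y) = p(Y|X) * p(X) / p(Y)
    return likelihood * prior / evidence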
Bayes rule
• We know what p(X) is - the prior probability of
patients in general having influenza.
• Assuming that we find that the patient has a fever, we would like to find p(X|Y), the probability of this particular patient having influenza given that we can see that they have a fever (Y).
• If we don't actually know this we can ask the
opposite question, i.e. if a patient has influenza,
what is the probability that they have a fever?
Bayes rule
• A fever is all but certain given influenza, so we'll assume p(Y|X) = 1.
• The term p(Y) is the probability of the evidence occurring on its own, i.e. what is the probability of anyone having a fever (whether they have influenza or not)? p(Y) can be calculated from:
Bayes
p(Y) = p(Y|X)p(X) + p(Y|notX)p(notX)
• This states that the probability of a fever occurring
in anyone is the probability of a fever occurring in
an influenza patient times the probability of
anyone having influenza plus the probability of
fever occurring in a non-influenza patient times
the probability of this person being a non-influenza case.
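A sketch of this calculation in Python; the function name and all the probability values are illustrative assumptions, not from the slides:

def evidence_probability(p_y_given_x, p_x, p_y_given_not_x):
    # p(Y) = p(Y|X) p(X) + p(Y|notX) p(notX)
    p_not_x = 1.0 - p_x
    return p_y_given_x * p_x + p_y_given_not_x * p_not_x

# e.g. fever given influenza vs. fever without influenza (made-up values):
p_y = evidence_probability(p_y_given_x=1.0, p_x=0.1, p_y_given_not_x=0.05)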
Bayes
• From the original prior probability p(X) held in our knowledge base we can calculate p(X|Y) after having asked about the patient's fever.
• We can now forget about the original p(X) and
instead use the new p(X|Y) as a new p(X).
• So the whole process can be repeated time and
time again as new evidence comes in from the
keyboard (i.e. the user enters answers).
Bayes
• Each time an answer is given the probability of the
illness being present is shifted up or down a bit
using the Bayesian equation.
• Each time, a different prior probability is used, derived from the previous posterior probability.
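The repeated-update loop described above might look like the following sketch; the evidence items and all probabilities are hypothetical, not a real diagnostic system:

def update(prior, likelihood, likelihood_given_not):
    # One Bayesian update: returns p(X | new evidence).
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

p_x = 0.1  # illustrative prior probability of influenza
# (p(Y|X), p(Y|notX)) for each answer the user types in:
answers = [(1.0, 0.05),   # fever
           (0.8, 0.20)]   # some second symptom
for lik, lik_not in answers:
    p_x = update(p_x, lik, lik_not)  # the posterior becomes the new prior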
Example
The hypothesis X is that 'the person is a man' and notX is that 'the person is a woman', and we want to calculate which is more likely given the available evidence.
Suppose the prior probability of X, p(X), is 0.7, so that p(notX) = 0.3.
We have evidence Y that X has long hair, and
suppose that p(Y|X) is 0.1 {i.e. most men don’t
have long hair} and p(Y) is 0.4 {i.e. quite a few
people have long hair}.
Example
• Our new estimate of p(X|Y), i.e. the probability that the person is a man given that we now know they have long hair, is:
• p(X|Y) = p(Y|X)p(X)/p(Y)
• = (0.1 × 0.7)/0.4
• = 0.175
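Checking the arithmetic in Python, with the values taken from the example above:

p_x, p_y_given_x, p_y = 0.7, 0.1, 0.4
p_x_given_y = p_y_given_x * p_x / p_y   # (0.1 * 0.7) / 0.4
print(round(p_x_given_y, 3))            # 0.175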
Example
• So our probability of 'the person is a man' has moved from 0.7 to 0.175, given the evidence of long hair.
• In this way, new posterior probabilities p(X|Y) are calculated from old probabilities as new evidence arrives.
• Eventually, having gathered all the evidence
concerning all of the hypotheses, we, or the
system, can come to a final conclusion about the
patient.
Inference
• What most systems using this form of inference do
is set an upper and lower threshold.
• If the probability exceeds the upper threshold that
hypothesis is accepted as a likely conclusion to
make.
• If it falls below the lower threshold then it is
rejected as unlikely.
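A minimal sketch of such thresholding; the threshold values 0.9 and 0.1 are illustrative, not standard:

UPPER, LOWER = 0.9, 0.1

def decide(p_hypothesis):
    if p_hypothesis > UPPER:
        return "accept"      # likely conclusion
    if p_hypothesis < LOWER:
        return "reject"      # unlikely
    return "undecided"       # keep gathering evidence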
Problems
• Computationally expensive
• Prior probabilities are not always available and are often subjective; there is much research into how to discover 'informative' prior probabilities.
Problems
• Often the Bayesian formulae don't correspond with the expert's degrees of belief.
• For Bayesian systems to work correctly, an expert
should tell us that ‘The presence of evidence Y
enhances the probability of the hypothesis X, and
the absence of evidence Y decreases the
probability of X’
Problems
• But in fact many experts will say that 'The presence of Y enhances the probability of X, but the absence of Y has no significance', which is inconsistent with a strict Bayesian framework.
• Assumes independent evidence
Bayes and NNs
• Bayesian methods are often used in statistics and in Artificial Intelligence, particularly in expert systems.
• However, they can also be used with neural
networks.
• Conventional training methods for multilayer
perceptrons (such as backpropagation) can be
interpreted in statistical terms as variations on
maximum likelihood estimation.
Bayes and NNs
• The idea is to find a single set of weights for the network that maximizes the fit to the training data, perhaps modified by some sort of weight penalty to prevent overfitting.
• Bayesian training automatically modifies weight
decay terms so that weights that are unimportant
decay to zero
• In this way unimportant weights are effectively
‘pruned’ – preventing overfitting
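The "fit plus weight penalty" objective can be read as a negative log posterior: the penalty term corresponds to a Gaussian prior on the weights. A sketch, using a linear model as a stand-in for the network (the data, model, and alpha value are illustrative assumptions):

import numpy as np

def neg_log_posterior(w, x, t, alpha):
    pred = x @ w                             # linear stand-in for the network
    nll = 0.5 * np.sum((t - pred) ** 2)      # fit to the training data
    penalty = 0.5 * alpha * np.sum(w ** 2)   # weight decay = Gaussian prior on w
    return nll + penalty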
Bayes and NNs
• Typically, the purpose of training is to make
predictions for future cases where only the inputs
to the network are known.
• The result of conventional network training is a
single set of weights that can be used to make such
predictions.
• In contrast, the result of Bayesian training is a
posterior distribution over network weights.
Bayes and NNs
• If the inputs of the network are set to the values
for some new case, the posterior distribution over
network weights will give rise to a distribution
over the outputs of the network, which is known
as the predictive distribution for this new case.
• If a single-valued prediction is needed, one might
use the mean of the predictive distribution, but the
full predictive distribution also tells you how
uncertain this prediction is.
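A sketch of how a predictive distribution arises from posterior weight samples; here the "posterior samples" are random stand-ins and the model is linear, so everything is illustrative:

import numpy as np

rng = np.random.default_rng(0)
x_new = np.array([1.0, 0.5])                 # inputs for the new case
weight_samples = rng.normal(size=(1000, 2))  # stand-in posterior samples

preds = weight_samples @ x_new               # one output per weight sample
mean_prediction = preds.mean()               # single-valued prediction
uncertainty = preds.std()                    # how uncertain that prediction is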
Why bother?
• The hope is that Bayesian methods will provide
solutions to such fundamental problems as:
• How to judge the uncertainty of predictions. This
can be solved by looking at the predictive
distribution, as described above.
• How to choose an appropriate network architecture (e.g., the number of hidden layers, the number of hidden units in each layer).
Why bother
• How to adapt to the characteristics of the data
(e.g., the smoothness of the function, the degree to
which different inputs are relevant).
• Good solutions to these problems, especially the
last two, depend on using the right prior
distribution, one that properly represents the
uncertainty that you probably have about which
inputs are relevant, how smooth the function you
are modelling is, how much noise there is in the
observations, etc.
Hyperparameters
• Such carefully vague prior distributions are
usually defined in a hierarchical fashion, using
hyperparameters, some of which are analogous
to the weight decay constants of more
conventional training procedures.
• An ‘Automatic Relevance Determination’ scheme
can be used to allow many possibly-relevant
inputs to be included without damaging effects.
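One way to picture ARD is as a separate weight-decay hyperparameter per input, so that the weights fanning out from an irrelevant input can be suppressed on their own. A sketch of that penalty structure; the shapes and values are illustrative assumptions:

import numpy as np

W = np.ones((3, 5))                   # first-layer weights: 3 inputs, 5 hidden units
alphas = np.array([0.1, 0.1, 100.0])  # per-input hyperparameters; input 3 looks irrelevant

# each input's outgoing weights get their own penalty term:
ard_penalty = 0.5 * np.sum(alphas[:, None] * W ** 2)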
Methods
• Implementing all this is one of the biggest
problems with Bayesian methods.
• Dealing with a distribution over weights (and
perhaps hyperparameters) is not as simple as
finding a single "best" value for the weights.
• Exact analytical methods for models as complex
as neural networks are out of the question.
• Two approaches have been tried:
Methods
• Find the weights/hyperparameters that are most
probable, using methods similar to conventional
training (with regularization), and then
approximate the distribution over weights using
information available at this maximum.
• Use a Monte Carlo method to sample from the
distribution over weights. The most efficient
implementations of this use dynamical Monte
Carlo methods whose operation resembles that of
backprop with momentum.
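As a toy illustration of the second approach, here is a random-walk Metropolis sampler over a single weight. Real implementations use the dynamical (e.g. hybrid/Hamiltonian) Monte Carlo methods mentioned above; the data, model, and tuning here are all made up:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])
t = np.array([0.1, 0.9, 2.1])              # made-up training data

def log_posterior(w, alpha=1.0):
    nll = 0.5 * np.sum((t - w * x) ** 2)   # fit term
    return -nll - 0.5 * alpha * w ** 2     # plus a Gaussian prior on w

w, samples = 0.0, []
for _ in range(5000):
    proposal = w + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(w):
        w = proposal                       # accept the proposed weight
    samples.append(w)                      # otherwise keep the current one
# `samples` now approximates the posterior distribution over the weight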
Advantages
• Network complexity (such as number of hidden
units) can be chosen as part of the training
process, without using cross-validation.
• Better when data is in short supply, as the data that would otherwise be held out for validation can (usually) be used to train the network.
• For classification problems, the tendency of conventional approaches to make overconfident predictions in regions of sparse training data can be avoided.
Regularisation
• Regularisation is a way of controlling the
complexity of a model by adding a penalty term
(such as weight decay). It is a natural consequence
of using Bayesian methods, which allow us to set
regularisation coefficients automatically (without
cross-validation).
• Large numbers of regularisation coefficients can
be used, which would be computationally
prohibitive if their values had to be optimised
using cross-validation.
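For instance, each group of weights (say, each layer) can get its own coefficient. A sketch of that penalty structure; the shapes and coefficient values are illustrative, and in a Bayesian treatment the coefficients are hyperparameters set from the data:

import numpy as np

W1 = np.ones((4, 8))                       # input-to-hidden weights
W2 = np.ones((8, 1))                       # hidden-to-output weights
alphas = {"layer1": 0.01, "layer2": 0.5}   # one coefficient per weight group

penalty = (0.5 * alphas["layer1"] * np.sum(W1 ** 2)
           + 0.5 * alphas["layer2"] * np.sum(W2 ** 2))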
Confidence
• Confidence intervals and error bars can be
obtained and assigned to the network outputs
when the network is used for regression problems.
• Allows straightforward comparison of different
neural network models (such as MLPs with
different numbers of hidden units or MLPs and
RBFs) using only the training data.
Advantages
• Guidance is provided on where in the input space
to seek new data (active learning allows us to
determine where to sample the training data next).
• Relative importance of inputs can be investigated (Automatic Relevance Determination)
• Very successful in certain domains
• Theoretically the most powerful method
Disadvantages
• Requires choosing prior distributions, often based on analytical convenience rather than real knowledge about the problem
• Computationally expensive (long training times, high memory requirements)
Summary
• In practice, Bayesian neural networks often outperform standard networks (such as MLPs trained with backpropagation).
• However, there are several unresolved issues (such
as how best to choose the priors) and more
research is needed
• Bayesian neural networks are computationally intensive and therefore take a long time to train.