Transcript, Slide 1
Bayesian Neural Networks
Petteri Nurmi
1. Introduction
The structure of the brain differs greatly from the way computers are built, yet brains can perform certain tasks, such as pattern recognition and classification, more effectively than any computer. As neurological research (the study of the nervous system) advanced, researchers were able to build a model of the structure of the brain. In computer science, the AI community welcomed this research and designed its own artificial brain model, which is a simplification of the actual structure of the brain. The key components are neurons and synapses. Neurons are the basic computational units of the brain: each receives some input and produces an output based on that input. The synapses form the connections between the neurons, and together these building blocks can be used to build large artificial (neural) networks. A neuron can either be activated or deactivated, or it can have a continuous activation level. This value is based on the input (plus possibly other parameters such as a bias), and the function that calculates the activation value from the input is called an activation function. The most widely used activation function is the sigmoid, which is shown in Figure 1.
Figure 1. The sigmoid function. Here α is the slope parameter and w is the vector of synaptic weights.
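The formula itself is not reproduced above; consistent with the caption, the sigmoid activation presumably has the standard form

y = 1 / (1 + exp(−α v)),

where v = w·x is the weighted sum of the neuron's inputs (plus a possible bias term).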
2. Multilayer Perceptron Networks (MLP) [Hay99]
The most widely used neural network architecture is the so-called multilayer feedforward network, in which the neurons are organized into layers in a hierarchical manner. The first layer of neurons forms the input layer of the network and the last layer forms the output layer. Between the input and output layers we may have an arbitrary number of hidden layers, through which the signals are propagated. Figure 2 gives an architectural overview of a multilayer feedforward (perceptron) network with two hidden layers.
Figure 2. A multilayer perceptron network with two
hidden layers
Learning in MLP networks is typically done with the backpropagation algorithm (backprop). The network is given a set of training samples of the form {x, t}, where x is an input vector (values for each neuron in the input layer) and t is the corresponding vector of target values (what the activation levels of the neurons in the output layer should be). First the inputs are propagated through the network, which gives us some output values y. These are compared to the target values and the errors of the output nodes are stored. Together the error values define the error energy of the network. The goal of learning is to minimize this error energy.
The standard approach to error minimization is to use gradient-based optimization. The theoretical background stems from multivariable calculus, as a necessary condition for a minimum is that the gradient at that point is zero. Thus, by moving in the direction of the negative gradient of the error, we can eventually converge to a minimizing point. Whether this point is the best possible estimate is impossible to say, but often even this simple framework produces good results. The name backprop comes from the fact that the error terms must be passed backwards from the output nodes to the nodes in the last hidden layer and so on; otherwise it would be impossible to estimate the gradient of the error locally.
A standard approach in gradient optimization methods is to use small step sizes. This is achieved by using a learning rate parameter η to weight the gradient values. To try to avoid local minima, the learning rate can be defined as a function of time, so that its value decreases as more and more learning iterations are performed.
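As a concrete illustration, a minimal sketch of a single weight update with such a decaying learning rate might look as follows (the particular schedule η = η0 / (1 + t/τ) is just one common choice, not one prescribed by the sources):

```python
import numpy as np

def gradient_step(weights, grad_error, t, eta0=0.1, tau=1000.0):
    """One gradient-descent update with a learning rate that decays over time.

    weights    -- current weight vector (numpy array)
    grad_error -- gradient of the error energy with respect to the weights
    t          -- iteration counter
    eta0, tau  -- hypothetical schedule parameters (illustrative only)
    """
    eta = eta0 / (1.0 + t / tau)          # learning rate shrinks as training proceeds
    return weights - eta * grad_error     # move against the gradient of the error
```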
The backprop framework suffers from a major drawback: if the data samples are noisy, the network will overfit the noise and will not generalize well to new cases. The simplest attempt to avoid this is to introduce a penalty on the weights and try to keep the weight values as small as possible. Now the error energy is defined in terms of a) the errors between the activation levels of the nodes in the output layer and the target values and b) the penalizing weight term. This complete framework, which also forms the basis for Bayesian neural networks, is illustrated in Figure 3. The equation shown in Figure 3 uses the mean squared error between the target values and the output values as the error score, and a squared penalty term for the weights. The goal of learning is now to find the weights that minimize this penalized error.
Figure 3. The penalized error term.
The first sum goes over the samples and the
second goes over the synapses in the network.
The terms α and β are weighting terms.
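The equation itself is not reproduced above; based on the caption, it is presumably a penalized error of the standard weight-decay form,

E(w) = β ∑_n (t_n − y_n)² + α ∑_j w_j²,

where the first sum runs over the training samples, the second over the synaptic weights, and α and β are the weighting terms (the exact placement of α and β is an assumption here).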
3. Bayesian Neural Networks
The Bayesian approach was designed to “fill in gaps in the backpropagation framework” [Mac91]. There are still some problematic issues even in the (weighted) penalized backprop learning scheme. The first problem relates to the weighting parameters α and β: how should one select values for these parameters? This issue was discussed by Mackay [Mac91], who suggested using a hierarchical model in which the weighting parameters are assigned Gaussian hyperparameter priors from which their values can be sampled. Neal [Nea96] dispenses with the weighting parameters, and we will focus mainly on Neal’s approach.
MLP networks are typically trained with a predefined network architecture. When Mackay formulated the Bayesian view of the network, he also made it possible to evaluate different network architectures. Similarly, it is possible to define probability distributions over the possible choices of penalizing functions, and so on. These problems can be solved with the same kinds of methods that are used for model selection in general. The commonly used methods nowadays include MDL, variational methods, and information-theoretic scoring criteria such as BIC (the Bayesian information criterion). For information about model learning methods, the interested reader is referred to [MP01] or [Hec99].
To be able to use Bayesian methods for learning tasks, one must formulate the problem in a probabilistic manner. Thus we have to define a correspondence between the input values and the target values. In data analysis this is typically done by using regression (see e.g. [GCS+04]), in which the values of one variable are modelled as a function of some other variables. To be precise, the actual model used here is a Gaussian regression model. Figure 4 gives the exact formula for calculating the probability of seeing a particular target value given an input, the network parameters (the weights), a network architecture and a noise model for the data.
Figure 4: Gaussian regression for neural networks
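The formula itself is not reproduced above; under the Gaussian noise model described in the text, the likelihood presumably has the standard regression form

p(t | x, w, A) ∝ exp(−(t − f(x, w))² / (2σ²)),

where f(x, w) is the output of the network with architecture A and weights w, and σ² is the noise variance (an equivalent parameterization with a precision β = 1/σ² is also common).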
The regression term defines the likelihood of the data in terms of the target values. What is still needed is a posterior distribution over the possible weight values. But before we can formulate a posterior distribution, we need to define a suitable prior distribution. The first approach, due to Mackay, was to use Gaussian priors, which allows the resulting posterior distribution to be approximated by a Gaussian. Using this Gaussian approximation of the posterior makes the process computationally somewhat feasible. However, Neal [Nea96] argues that this kind of approach breaks down as the number of hidden units in the network grows. Neal [Nea94] discusses priors for infinite networks, in which the number of hidden units approaches infinity. To avoid overfitting, the priors are scaled according to the number of hidden units in the network.
However, for our purposes it suffices to look at the Gaussian priors introduced by Mackay. Figure 5 gives the prior distribution of the weights and, in the lower equation, the resulting posterior distribution.
Figure 5: Gaussian priors along with the resulting
posterior distribution
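The equations of Figure 5 are not reproduced above; the standard Gaussian weight prior and the resulting (unnormalized) posterior are presumably of the form

p(w) ∝ exp(−(α/2) ∑_j w_j²),
p(w | D) ∝ p(D | w) p(w),

where D denotes the training data and the likelihood p(D | w) is the Gaussian regression term of Figure 4 (the role of α here is an assumption).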
Note that so far we have only defined the posterior distribution in terms of an individual sample, so we still need a way to evaluate the learning. The solution is to use as an evaluation metric the mean of the posterior predictive distribution: we calculate the average probability that is assigned to an unseen (and unclassified) sample. This approach is illustrated in Figure 6. Another possibility is to divide the training data into subsets and use some cross-validation technique (leave-one-out, k-fold cross-validation, Monte Carlo cross-validation, etc.).
Figure 6: Mean of the posterior predictive
distribution under the regression model
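As a concrete illustration of the first option, here is a minimal sketch of estimating the posterior predictive mean by averaging network outputs over weight samples drawn from the posterior. The helper network_output and the weight samples themselves are hypothetical placeholders; the Monte Carlo average approximates the integral presumably shown in Figure 6:

```python
import numpy as np

def posterior_predictive_mean(x_new, weight_samples, network_output):
    """Estimate the posterior predictive mean for a new input x_new.

    x_new          -- new input vector
    weight_samples -- list of weight vectors drawn from the posterior (e.g. by MCMC)
    network_output -- function (x, w) -> network output (hypothetical helper)
    """
    outputs = np.array([network_output(x_new, w) for w in weight_samples])
    return outputs.mean(axis=0)   # Monte Carlo estimate of the posterior predictive mean
```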
4. Hybrid MCMC Learning
Although many Bayesian learning methods have been proposed for neural networks (Gaussian approximation, Metropolis-Hastings, stochastic dynamics, Langevin Monte Carlo, generalized hybrid Monte Carlo and mean field estimation), we will concentrate on the (generic) Hybrid Markov chain Monte Carlo method, which is used by Neal [Nea96]. The first step in describing the algorithm is to formulate the learning task as a (closed) physical system.
Assume that our dynamical system consists of n particles. We will use a vector q to describe the locations of the particles at a given instant of time; thus q is a vector with n components. Assume first that the particles are not moving. Then the energy of the system is defined in terms of the potential energy of the particles, and we define the probability density for the variable q in the following way:
P(q) ∝ exp(-E(q))
Here E(q) is the “potential energy” function. Typically a pseudo-temperature variable T is also used (for annealing purposes), but in this case it is unnecessary.
We allow the particles to move around the system and
define a momentum term p, which describes the movement of
the particles. We define the kinetic energy of the system to be K(p) and select K(p) = ∑_i p_i² / (2m_i), where the sum runs over the n particles. Here p_i is the i-th component of the momentum term and m_i is the mass of the i-th particle.
Together the potential and kinetic energy distributions allow us to
define a joint distribution of the variables p and q, which defines
the phase-space distribution of the system. This probability is
expressed in terms of a Hamiltonian term H(q,p) = E(q) + K(p),
which gives the total energy of the system. The probability
distribution is again defined as the exponential of the negative
energy. Thus we have the following:
P(q,p) ∝ exp(-H(q,p))
Next we describe the evolution of the system in (pseudo-)time in terms of Hamiltonian dynamics. The evolution of the parameters is illustrated in Figure 7.
Figure 7: The Hamiltonian evolution of the phase-space dynamics
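The equations of Figure 7 are not reproduced above; they are presumably the standard Hamiltonian equations of motion,

dq_i/dτ = ∂H/∂p_i = p_i / m_i,    dp_i/dτ = −∂H/∂q_i = −∂E/∂q_i,

where τ denotes the (pseudo-)time.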
Hamiltonian dynamics are especially useful because a) as the parameters q and p vary (according to the dynamic equations), the total energy H stays the same, and b) the distribution P(q,p) is invariant with respect to transitions that consist of following a trajectory in the phase-space that evolves according to Hamiltonian dynamics.
If we could evaluate the Hamiltonian dynamics exactly, then the following sampling scheme could be used:
1) Sample new values for p and q with a fixed H
2) Sample a new value of H (by resampling the momentum variables)
3) Repeat
The above framework is called the stochastic dynamics method, and if the simulations can be done exactly, the method will (eventually) explore every trajectory. However, because the dynamics cannot be simulated exactly, one must resort to approximation methods. One such method is the leapfrog approximation, in which discrete steps are used to estimate the evolution of the state. The equations for this method are given in Figure 8.
Figure 8: Leapfrog state estimation
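The equations of Figure 8 are not reproduced above; the following sketch implements the standard leapfrog discretization (a half-step for the momentum, alternating full steps for position and momentum, and a final momentum half-step), which is presumably what the figure shows. The gradient function grad_E is an assumed helper returning ∂E/∂q:

```python
import numpy as np

def leapfrog(q, p, grad_E, eps, n_steps, mass=1.0):
    """Simulate Hamiltonian dynamics with the leapfrog discretization.

    q, p    -- position and momentum vectors (numpy arrays)
    grad_E  -- function returning dE/dq (assumed helper)
    eps     -- step size
    n_steps -- number of leapfrog steps L
    """
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_E(q)            # initial half-step for the momentum
    for _ in range(n_steps - 1):
        q += eps * p / mass               # full step for the position
        p -= eps * grad_E(q)              # full step for the momentum
    q += eps * p / mass                   # final full step for the position
    p -= 0.5 * eps * grad_E(q)            # final half-step for the momentum
    return q, p
```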
The leapfrog estimation will actually change the value of H slightly, and thus the method suffers from a systematic error. If the step size is reduced to zero (in the limit), then no error is made, but the evaluation of a single trajectory in the phase-space will require an infinite number of steps. To overcome this problem, it is possible to combine the stochastic dynamics method with the Metropolis algorithm and thus avoid the error. The idea is to perform the leapfrog simulation for the parameters p and q and then observe the change in the energy H. The new state is treated as a candidate for the Metropolis algorithm and is accepted with a probability determined by the change in energy. The exact algorithm is given below:
Algorithm I: Hybrid MCMC
1. Let the state be (q,p). Perform L leapfrog steps to arrive at state (q′,p′).
2. Perform Gibbs sampling on the momentum variable p′.
3. Accept the new state with probability min(1, exp(-(H(q′,p′) - H(q,p)))).
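A minimal sketch of one transition of this algorithm, reusing the leapfrog function sketched after Figure 8 (unit particle masses are assumed, and the momentum is resampled before the trajectory here, which is the ordering usually presented; E and grad_E stand for the potential energy of the network parameters and its gradient):

```python
import numpy as np

def hmc_step(q, E, grad_E, eps, n_steps, rng=np.random.default_rng()):
    """One hybrid Monte Carlo transition: resample momentum, run leapfrog, Metropolis accept."""
    p = rng.standard_normal(q.shape)                 # Gibbs step: fresh momentum ~ N(0, I)
    H_old = E(q) + 0.5 * np.dot(p, p)                # total energy of the current state (unit masses)
    q_new, p_new = leapfrog(q, p, grad_E, eps, n_steps)
    H_new = E(q_new) + 0.5 * np.dot(p_new, p_new)    # total energy of the proposed state
    if rng.random() < min(1.0, np.exp(-(H_new - H_old))):
        return q_new                                 # accept the proposed state
    return q                                         # reject: keep the old state
```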
What is still missing from this framework is the connection to learning in neural networks. The idea is to let the position variables q describe the parameters of the neural network; thus q consists of the biases and weights of the network. The momentum variables p correspond to the hyperparameters, which describe the (hierarchical) distribution of the network parameters. The algorithm first “anneals” the state of the weights and then performs Gibbs sampling on the hyperparameters of the network. The resulting state is considered as a candidate state for the Metropolis step and is either rejected or accepted. Eventually the algorithm visits all regions of the phase-space and the sampled weight distribution converges to the true posterior. The computations performed in this framework are very intensive, but on the other hand the goal is to devise optimal learning methods for very large neural networks.
5. Results
Neal tests the Hybrid Markov chain Monte Carlo method against the Metropolis algorithm on the robot-arm problem [Nea92]. In addition, he tests how simulated annealing affects the convergence of the Hybrid MCMC method. The extension to simulated annealing is fairly straightforward: a pseudo-temperature variable is introduced into the canonical form of the joint distribution and this temperature is gradually cooled. When the temperature is one, the method is equivalent to the plain Hybrid MCMC method.
In the robot-arm problem the goal is to predict the robot arm’s position at the next time step given the current position. The sensor data is modelled as noisy, and thus a Gaussian noise model is introduced. The prediction task can be described as a pair of equations (the data is two-dimensional), which describe the function the network has to learn.
The trajectory length of the Hybrid MCMC method was set to 1000 leapfrog steps and 200 iterations were run, which results in a total of 200,000 steps. Similarly, the standard Metropolis algorithm was evaluated using 200,000 iterations. The computations took around 20 minutes (on a machine with approx. 25 MIPS). Figure 9 illustrates how the Hybrid MCMC and the simulated annealing version of Hybrid MCMC perform. Both methods were run 9 times, and if the resulting prediction error was larger than 0.0075, the run was rated a failure. The size of the square represents the number of failures.
Figure 9: Hybrid MCMC with and without annealing
[Nea94].
From Figure 9 we can clearly see that simulated annealing improves performance. The constant parameters ε0 and ν are used to set the step size in the leapfrog estimation according to the formula ε = ε0 exp(νC), where C is sampled from a Cauchy distribution.
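As a small illustration of this step-size randomization (ε0 and ν are experimenter-chosen constants; the heavy-tailed Cauchy draw occasionally produces very large or very small step sizes):

```python
import numpy as np

def sample_step_size(eps0, nu, rng=np.random.default_rng()):
    """Draw a leapfrog step size eps = eps0 * exp(nu * C), with C ~ Cauchy(0, 1)."""
    C = rng.standard_cauchy()
    return eps0 * np.exp(nu * C)
```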
To further illustrate the advantage of using simulated
annealing, Figure 10 gives the prediction errors as a function of
the iterations.
Figure 10: Convergence curves with and without
annealing [Nea94].
Unfortunately, no study was available in which the different methods were extensively tested against each other. In this toy example the Hybrid MCMC method works well, but so did the Gaussian approximation method used by Mackay. On the other hand, Neal was able to motivate the use of Hybrid MCMC over the Metropolis algorithm, as the performance of Hybrid MCMC was superior. The best result obtained with the Metropolis algorithm exceeded the threshold value by a factor of approximately two, which clearly illustrates that 200,000 iterations were not enough for the Metropolis algorithm to reach convergence.
6. Future Directions
After the few initial publications, the research in this field
has notably decreased. The main contributions to this area have
been in the form of new algorithmic solutions, of which especially
the mean field method is a very promising technique. Also the
research with optimizations for MCMC methods has continued,
but both of these research directions are more generically related
to probabilistic modelling and not purely on Bayesian neural
networks.
The applications of Bayesian neural networks are not very numerous, and most of the existing applications are in very specific areas. For example, Vehtari et al. [VHL+98] use Bayesian neural networks to classify forest scenery data.
7. References
[GCS+04] A. Gelman, J. Carlin, H. Stern, D. Rubin, Bayesian Data
Analysis, 2nd ed., CRC Press, 2004.
[Hay99] S. Haykin, Neural Networks: A Comprehensive Foundation,
Prentice Hall, 1999.
[Hec99] D. Heckerman, A Tutorial on Learning with Bayesian
Networks, In Learning in Graphical Models, ed. M.I. Jordan, MIT
Press, 1999.
[Mac91] D. Mackay, Bayesian Methods for Adaptive Models, Ph.D. thesis, California Institute of Technology, 1991.
[MP01] G. McLachlan, D. Peel, Finite Mixture Models, John Wiley
& Sons, 2001.
[Nea92] R. Neal, Bayesian training of backpropagation networks by the
Hybrid Monte Carlo Method, Technical Report CRG-TR-92-1,
University of Toronto, Dept. of CS, 1992.
[Nea94] R. Neal, Priors for Infinite Networks, Technical Report
CRG-TR-94-1, University of Toronto, Dept. of CS, 1994.
[Nea96] R. Neal, Bayesian Learning for Neural Networks, Lecture
Notes in Statistics No. 118, Springer-Verlag, 1996.
[VHL+98] A. Vehtari, J. Heikkonen, J. Lampinen, J. Juujärvi, Using
Bayesian Neural Networks to Classify Forest Scenes, In Intelligent
Robots and Computer Vision XVII: Algorithms, Techniques and
Active Vision, pp. 66-73, 1998.