Transcript exin6

unit #6
Giansalvo EXIN Cirrincione
Radial basis functions
Exact interpolation
The exact interpolation problem requires every input vector
to be mapped exactly onto the corresponding target vector.
Problem: given a mapping: x d  t  and a TS of N points, find
a function h(x) such that:
h(x n) = t n
n = 1, … , N
N basis functions
Fnn’ = F(xn- xn’ )
(wn) (t n)
generalized linear
discriminant function
usually Euclidean
Radial basis functions
Exact interpolation
localized  x   e
x2
 2
2
 x   x  
2

2 
  0
Many properties of the interpolating function are relatively
insensitive to the precise form of the non-linear kernel.
thin-plate spline function
 x  x 2 lnx
multi-quadric function for  = 1/2
 x   x  
2
 0    1
2 
 x   x 3
 x   x non-linear in the
components of x
Radial basis functions
Exact interpolation
Gaussian noise,
zero mean,
 = 0.05
hx   0.5  0.4 sin2 x 
30 points
Gaussian basis functions with  = 0.067
(roughly twice the spacing of the data points)
highly oscillatory
Radial basis functions
Exact interpolation
more than one output variable
RBF
RBF’s provide a smooth interpolating function in which the number of
basis functions is determined by the complexity of the mapping to be
represented rather than by the size of the data set.
The number M of basis functions is less than N.
The centres of the basis functions are determined by the training process.
Each basis function has its own width j which is adaptive.
Bias parameters are included in the linear sum and compensate for the
difference between the average value over the TS of the basis function
activations and the corresponding average value of the targets.
RBF
universal
approximator
fixed at 1
Gaussian
mMd
Elements mji
kernel
best
approximator
RBF
There is a trade-off between using a smaller
number of basis with many adjustable parameters
and a larger number of less flexible functions
RBF
There is a trade-off between using a smaller
number of basis with many adjustable parameters
and a larger number of less flexible functions
homework
Suppose that all of the Gaussian basis functions
in the network share a common covariance
matrix S. Show that the mapping represented
by such a network is equivalent to that of a
network of spherical Gaussian basis functions
with common variance 2 = 1, provided the
input vector x is first transformed by an
appropriate linear transformation. Find
expressions relating the transformed input
vector and kernel centres to the corresponding
original vectors.
homework
RBF training
first-layer weights
decoupling
second-layer weights
w 
kj
j
fixed first-layer weights
generalized linear discriminant
fast learning
RBF training
normal equations
 = 10.0
RBF training
 = 0.4
 = 0.08
•M=5
• centres = random subset of the TS
`Regularization theory
differential
Set the functional derivative w.r.t.
y(x) to zero:
operator
Euler-Lagrange equations
adjoint differential
operator
to Pof the
The solution can be written down
in terms
^
Green’s functions G(x,x’) of the operator PP:
If P is translationally and rotationally invariant, G depends only on
the distance between x and x’ (radial). The solution is given by:
RBF
Regularization theory
Regularization theory
Integrate over a small region around xn
Using the solution in terms of the Green’s functions:
The Green’s function is Gaussian with width
parameter  if the operator P is chosen as:
n = 0 implies RB exact interpolation
Regularization theory
•  = 0.067
• n = 40
• RB’s centred on each data
n = 0 implies RB exact interpolation
homework
Consider the functional derivative of the regularization functional (w.r.t.
y(x)) given by:
By using successive integration by parts, and making use of identities:
show that the operator Pˆ P is given by:
It should be assumed that boundary terms arising from the integration by
parts can be neglected. Now find the radial Green’s function of this
operator as follows. First introduce the multidimensional Fourier
transform of G in the form:
By using the last two formulae and using the following formula for the
Fourier transform of the Dirac function:
where d is the dimensionality of x and s, show that the Fourier transform
of the Green’s function is given by:
Now substitute this result into the inverse Fourier transform of G and
then show that the Green’s function is given by:
Regularization theory
Regularization can also be applied to general RBF’s. Also,
regularization terms can be considered for which the kernels
are not necessarily the Green’s functions. For example:
penalizes mappings which have large curvatures. This regularizer
leads to second-layer weights which are found by the solution of:
Relation to kernel regression
Technique for estimating regression functions from noisy
data, based on methods of kernel density estimation
Consider a mapping : x  y and a corresponding TS; a complete
description of the statistical properties of the generator of the data is
given by the probability density p(x,t) in the joint input-target space.
Parzen Gaussian kernel estimator
Relation to kernel regression
The optimal mapping is given by forming the regression, or conditional
average tx, of the target data, conditioned on the input variables.
Relation to kernel regression
The optimal mapping is given by forming the regression, or conditional
average tx, of the target data, conditioned on the input variables.
Nadaraya-Watson estimator
Relation to kernel regression
Extension: replace the kernel estimator with an adaptive mixture model
second-layer weights
normalized Gaussians
Nadaraya-Watson estimator
RBF’s for classification
px Ck 
Model the class distributions by local kernel functions
Multilayer
perceptron
RBF’s for classification
The outputs represent approximations to the posterior probabilities
RBF
Hidden-to-output
weight
RBF’s for classification
Use a common pool of M basis functions, labelled by an
index j, to represent all of the class-conditional densities
where
RBF’s for classification
PCk x    wkj j x 
M
j 1
The activations of the basis functions can be interpreted as the posterior
probabilities of the presence of corresponding features in the input
space, and the weights can similarly be interpreted as the posterior
probabilities of class membership, given the presence of the features.
The outputs represent the posterior probabilities of class membership.
Comparison with the multilayer perceptron
The hidden unit activation in a MLP is constant
on parallel (d-1)-dimensional hyperplanes.
The hidden unit (RB) activation in a RBF is constant on concentric
(d-1)-dimensional hyperspheres (more generally hyperellipsoids).
homework
In a MLP a hidden unit has a
constant activation function for input
vectors which lie on a hyperplanar
surface in input space given by
wTx+w0=const., while for a spherical
RBF a hidden unit has constant
activation on a hyperspherical surface
defined by ||x- m||2 =const.. Show that,
for suitable choices of the parameters,
these surfaces coincide if the input
vectors are normalized to unit length.
Illustrate this equivalence
geometrically for vectors in a threedimensional input space.
homework
Comparison with the multilayer perceptron
The hidden unit activation in a MLP is constant
on parallel (d-1)-dimensional hyperplanes.
The hidden unit (RB) activation in a RBF is constant on concentric
(d-1)-dimensional hyperspheres (more generally hyperellipsoids).
The MLP forms a distributed representation in the space of activation
values for the hidden units (problems: local minima, flat regions).
The RBF forms a representation in the space of activation values
for the hidden units which is local w.r.t. the input space.
All of the parameters in a MLP are usually determined at the same
time as part of a single global supervised training strategy.
RBF is trained in two steps (kernels determined first by unsupervised
methods, weights then found by fast linear supervised methods).
basis function optimization
The basis function parameters
should be chosen to form a
representation of the pdf of
the input data.
ignore any target information
mj’s as prototypes of the inputs
Problem: if the basis function centres are used to fill out a compact
d-dimensional region of the input space, then the number of kernel
centres will be an exponential function of d.
Problem: input variables which have significant variance
but play little role in determining the appropriate output
variables (irrelevant inputs).
basis function optimization
y independent of x2
Problem: input variables which have significant variance
but play little role in determining the appropriate output
variables (irrelevant inputs).
basis function optimization
A MLP can learn to ignore irrelevant inputs and obtain
accurate results with a small number of hidden units.
y independent of x2
Problem: input variables which have significant variance
but play little role in determining the appropriate output
variables (irrelevant inputs).
basis function optimization
The optimal choice of basis function parameters for input
density estimation needs not be optimal for representing the
mapping of the output variables.
input
input pdf
target pdf
basis function optimization
subsets of data points
basis function centres
basis function widths
set them equal to a
start with all data points
random subset of TS data
as kernel centres and then
all equal and set to some
determined from the
(use as initial conditions)
selectively remove centres
multiple of the average in such a way as to haveaverage distance of each
distance between the minimum disruption on kernel to its L nearest
kernel centres (overlap the system performance neighbours, where L is
for smoothing)
typically small
• very fast
• significantly sub-optimal
basis function optimization
K-means (basic ISODATA) clustering algorithm
Suppose there are N data points xn in total; it tries to find
decidedainsetadvance
of K representatives vectors mj where j = 1, ... ,k.
It seeks to partition the data points into K disjoint subsets
Sj containing Nj data points in such a way as to minimize
the sum-of-squares clustering function given by:
where mj is the mean of the data points in set Sj and is given by:
basis function optimization
K-means (basic ISODATA) clustering algorithm
batch (Lloyd, 1982)
It begins by assigning the points at random to K
sets and then computing the mean vectors of the
points in each set. Next, each point is re-assigned
to a new set according to which is the nearest
mean vector. The means of the sets are then
recomputed. This procedure is repeated until
there is no further change in the grouping of the
data points.
At each such iteration, the value of J will not
increase.
basis function optimization
K-means (basic ISODATA) clustering algorithm
batch (Lloyd, 1982)
K=2
basis function optimization
K-means (basic ISODATA)
clustering
algorithm
Robbins-Monro
for finding
the
root of
a regression
function given
sequential
(MacQueen,
1967)
by the derivative of J w.r.t. mj
The initial centres are randomly chosen from the data points, and as
each data point xn is presented, the nearest mj is updated using:
learning rate
Once the centres of the kernels have been found, the covariance
matrices of the kernels can be set to the covariances of the points
assigned to the corresponding clusters.
homework
Write a numerical implementation of the K-means
clustering algorithm using both the batch and online versions. Illustrate the operation of the
algorithm by generating data sets in two
dimensions from a mixture of Gaussian
distributions, and plotting the data points together
with the trajectories of the estimated means during
the course of the algorithm. Investigate how the
results depend on the value of K in relation to the
number of Gaussian distributions, and how they
depend on the variances of the distributions in
relation to their separation. Study the performance
of the on-line version of the algorithm for different
values of the learning rate parameter  and
compare the algorithm with the batch version.
Gaussian mixture models
The basis functions of the RBF can be regarded
as the components of a mixture density model
whose parameters are to be optimized by ML.
max
• P(j)
• kernel parameters
Once the mixture model has been optimized, the mixing coefficients
P(j) can be discarded, and the basis functions then used in the RBF
in which the second-layer weights are found by supervised training.
K-means as a particular limit of the EM
optimization of a Gaussian mixture model
basis function optimization
(supervised training)
The basis function parameters for regression can be
found by treating the kernel centres and widths, along
with the second-layer weights, as adaptive parameters
to be determined by minimization of an error function.
a sum-of-squares error
a spherical Gaussian basis functions
non-linear computationally intensive optimization problem
basis function optimization
(supervised training)
RBF training
well localized kernels
input
basis function optimization
(supervised training)
RBF training
basis function optimization
(supervised training)
RBF training
activation
only for a
small fraction
of kernels
training procedures can be speeded up significantly
by identifying the relevant kernels and therefore
avoiding unnecessary computation
basis function optimization
(supervised training)
RBF training
coarse (unsupervised) to fine (supervised)
no guarantee basis function will remain localized !