Lecture notes


6. Radial-basis function (RBF) networks
RBF = radial-basis function: a function which depends
only on the radial distance from a point
Motivating example: the XOR problem, which is not linearly separable but is quadratically separable.
So RBFs are functions taking the form

  f(||x − x_i||)

where f is a nonlinear activation function, x is the input and x_i is the i'th position, prototype, basis or centre vector.
The idea is that points near the centres will have similar outputs, i.e. if x ≈ x_i then f(x) ≈ f(x_i), since they should have similar properties. Therefore, instead of looking at the data points themselves, we characterise the data by their distances from the prototype vectors (similar to kernel density estimation).
For example, in the simplest case f is the identity function, f(x) = x.
Take the four XOR points and two centres x_1 = (0,1) and x_2 = (1,0.5). The distances of each point from the two centres are:

  x        d1 = ||x − x_1||    d2 = ||x − x_2||
  (0,0)    1                   ≈ 1.1
  (1,1)    1                   0.5
  (0,1)    0                   ≈ 1.1
  (1,0)    √2 ≈ 1.4            0.5

Now use the distances as the inputs to a network and form a weighted sum of these.
Can be viewed as a two-layer network:

[Figure: input vector y feeds a hidden layer of basis-function units f_i(y) = f(||y − x_i||); a weighted sum of the hidden-unit outputs gives the network output.]

• output = Σ_i w_i f_i(y)
• adjustable parameters are the weights w_j
• number of hidden units = number of prototype vectors
• form of the basis functions decided in advance
• use a weighted sum of the outputs from the basis functions for e.g. classification, density estimation etc
• theory can be motivated by many things (regularisation, Bayesian classification, kernel density estimation, noisy interpolation etc), but all suggest that the basis functions are set so as to represent the data
• thus the centres can be thought of as prototypes of the input data
[Figure: MLP vs RBF response to an input point -- an MLP forms a distributed representation (many hidden units contribute), whereas an RBF network forms a local representation (only units with nearby centres respond).]
E.g. Bayesian interpretation: if we choose the basis functions to model the class-conditional probabilities and we choose appropriate weights, then we can interpret the outputs as the posterior probabilities:

  O_k = P(C_k | x) ∝ p(x | C_k) P(C_k)

[Figure: a network with basis functions F_1(x) = p(x|C_1), …, F_3(x) = p(x|C_3) in the hidden layer and weights P(C_1), …, P(C_3), so that output O_k gives the (unnormalised) posterior for class C_k.]
Starting point: exact interpolation
Each input pattern x must be mapped onto a target value d.

[Figure: targets d plotted against inputs x for a set of data points.]

That is, given a set of N vectors x_i and a corresponding set of N real numbers d_i (the targets), find a function F that satisfies the interpolation condition:

  F(x_i) = d_i   for i = 1, …, N

or, more exactly, find

  F(x) = Σ_{j=1..N} w_j f(||x − x_j||)

satisfying

  F(x_i) = Σ_{j=1..N} w_j f(||x_i − x_j||) = d_i
Example: XOR problem

  x        d
  (0,0)    0
  (1,1)    0
  (0,1)    1
  (1,0)    1

Exact interpolation: an RBF is placed at the position of each pattern vector (i.e. 4 hidden units in the network), using
1) a linear RBF: f_i(x) = ||x − x_i||
Network structure
With the centres ordered (0,0), (0,1), (1,0), (1,1), the interpolation conditions give

  | 0    1    1    √2 |  | w1 |     | 0 |
  | 1    0    √2   1  |  | w2 |  =  | 1 |
  | 1    √2   0    1  |  | w3 |     | 1 |
  | √2   1    1    0  |  | w4 |     | 0 |
Results

  | w1 |     |  1    |
  | w2 |  =  | −√2/2 |
  | w3 |     | −√2/2 |
  | w4 |     |  1    |

Check, by taking the dot product of each row of the matrix with w:

  d1 = (0, 1, 1, √2) · w = −√2/2 − √2/2 + √2 = 0
  d2 = (1, 0, √2, 1) · w = 1 − 1 + 1 = 1
And the general solution is

  F(x) = Σ_{j=1..N} w_j f(||x − x_j||)

i.e.

  F(x1, x2) = √(x1² + x2²) − (√2/2) √((x1−1)² + x2²) − (√2/2) √(x1² + (x2−1)²) + √((x1−1)² + (x2−1)²)
For N vectors we get:

  | f(||x1 − x1||)   …   f(||x1 − xN||) |  | w1 |     | d1 |
  |       ⋮          ⋱         ⋮        |  |  ⋮ |  =  |  ⋮ |
  | f(||xN − x1||)   …   f(||xN − xN||) |  | wN |     | dN |

    interpolation matrix Φ                  weights     targets

where f(||x_i − x_j||) is a scalar function of the distance between the vectors x_i and x_j.
Equivalently,

  Φ W = D

If Φ is invertible we have a unique solution of the above equation.
Micchelli's Theorem
Let x_i, i = 1, …, N, be a set of distinct points in R^d. Then the N-by-N interpolation matrix Φ, whose ji-th element is f(||x_i − x_j||), is nonsingular.

  W = Φ⁻¹ D

So provided Φ is nonsingular, the interpolation matrix has an inverse and there are weights that achieve exact interpolation.
It is easy to see that there is always a solution. For instance, if we take f(x − y) = 1 if x = y and 0 otherwise (e.g. a Gaussian with a very small σ), then setting w_i = d_i solves the interpolation problem. However, this is a bit trivial, as the only general conclusion about the input space is that the training data points are different.
To summarize, for a given data set containing N points (x_i, d_i), i = 1, …, N:
• choose an RBF function f
• calculate f(||x_j − x_i||)
• obtain the matrix Φ
• solve the linear equation Φ W = D
• get the unique solution
• done!
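As a concrete illustration of this recipe, here is a minimal NumPy sketch (my own code, not the lecture's), applied to the Gaussian XOR example worked through later in these notes; the basis function, its width and the data are the only assumptions:

```python
import numpy as np

# XOR training data: four points in R^2 with targets d
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])

def gaussian_rbf(r, sigma=1.0):
    """Gaussian RBF f(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

# Interpolation matrix: Phi[i, j] = f(||x_i - x_j||)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = gaussian_rbf(dists)

# Exact interpolation: solve Phi W = D
w = np.linalg.solve(Phi, d)
print(w)        # approximately [-3.0359, 3.4233, 3.4233, -3.0359]
print(Phi @ w)  # reproduces the targets [0, 1, 1, 0]
```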
Like MLPs, RBFNs can be shown to be able to approximate any function to arbitrary accuracy (using an arbitrarily large number of basis functions).
Unlike MLPs, however, they have the property of 'best approximation', i.e. there exists an RBFN with minimum approximation error.
Other types of RBFs include:

(a) Multiquadrics:          f(r) = (r² + c²)^(1/2)    for some c > 0
(b) Inverse multiquadrics:  f(r) = (r² + c²)^(−1/2)   for some c > 0
(c) Gaussian:               f(r) = exp(−r² / (2σ²))   for some σ > 0
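For reference, the three basis functions above written as plain functions (a small sketch; the parameter names c and sigma follow the notes):

```python
import numpy as np

def multiquadric(r, c=1.0):
    # f(r) = (r^2 + c^2)^(1/2): grows with distance ('nonlocalized')
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # f(r) = (r^2 + c^2)^(-1/2): decays with distance ('localized')
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):
    # f(r) = exp(-r^2 / (2 sigma^2)): decays with distance ('localized')
    return np.exp(-r**2 / (2.0 * sigma**2))
```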
The linear activation function has some undesirable properties, e.g. f_i(x_i) = 0: each basis function vanishes at its own centre. (NB f is still a non-linear function, as it is only piecewise linear in x.)
• Inverse multiquadric and Gaussian RBFs are both examples of 'localized' functions: as the distance from the centre increases, the output of the RBF decreases.
• Multiquadric RBFs are 'nonlocalized' functions: as the distance from the centre increases, the output of the RBF increases.
Example: XOR problem

  x        d
  (0,0)    0
  (1,1)    0
  (0,1)    1
  (1,0)    1

Exact interpolation: an RBF is placed at the position of each pattern vector (i.e. 4 hidden units in the network), using
2) a Gaussian RBF with σ = 1: f_i(x) = exp(−||x − x_i||² / 2)
Network structure
With the centres ordered (0,0), (0,1), (1,0), (1,1), the interpolation conditions give

  | exp(0)    exp(−.5)  exp(−.5)  exp(−1)  |  | w1 |     | 0 |
  | exp(−.5)  exp(0)    exp(−1)   exp(−.5) |  | w2 |  =  | 1 |
  | exp(−.5)  exp(−1)   exp(0)    exp(−.5) |  | w3 |     | 1 |
  | exp(−1)   exp(−.5)  exp(−.5)  exp(0)   |  | w4 |     | 0 |
Results

  | w1 |          | 0 |     | −3.0359 |
  | w2 |  =  Φ⁻¹  | 1 |  =  |  3.4233 |
  | w3 |          | 1 |     |  3.4233 |
  | w4 |          | 0 |     | −3.0359 |
1) f(x1, x2) = √(x1² + x2²) − (√2/2) √((x1−1)² + x2²) − (√2/2) √(x1² + (x2−1)²) + √((x1−1)² + (x2−1)²)

2) f(x1, x2) = −3.0359 exp(−(x1² + x2²)/2) + 3.4233 exp(−((x1−1)² + x2²)/2) + 3.4233 exp(−(x1² + (x2−1)²)/2) − 3.0359 exp(−((x1−1)² + (x2−1)²)/2)

[Plots of the Gaussian solution for a large width (σ = 1) and a small width (σ = 0.2).]
Problems with exact interpolation
• can produce poor generalisation performance, as only the data points constrain the mapping (overfitting problem)

Bishop (1995) example:
• underlying function f(x) = 0.5 + 0.4 sin(2πx)
• sampled randomly at 30 points, with Gaussian noise added to each data point
• 30 data points, 30 hidden RBF units
• fits all the data points, but creates oscillations due to the added noise and because the mapping is unconstrained between the data points

[Figure: the exact fit through all the data points vs a fit using only 5 basis functions.]
To fit an RBF to every data point is very inefficient, due to the computational cost of matrix inversion, and is very bad for generalisation, so:
• use fewer RBFs than data points, i.e. M < N
• therefore don't necessarily have RBFs centred at the data points
• can include bias terms
• can have Gaussians with general covariance matrices, but there is a trade-off between complexity and the number of parameters to be found: a single width gives 1 parameter, a diagonal covariance gives d parameters, and for d-dimensional inputs a full covariance matrix has d(d+1)/2 parameters
6. Radial-basis function (RBF) networks II
Generalised radial basis function networks
Exact interpolation is expensive due to the cost of matrix inversion, so:
• prefer fewer centres (hidden RBF units)
• centres not necessarily at data points
• can include biases
• can have general covariance matrices
This is now no longer exact interpolation, so

  y(x) = Σ_{i=0..M} w_i f_i(x)

where M (the number of hidden units) < N (the number of training data points).
Three-layer networks

[Figure: an n-dimensional input vector x feeds a hidden layer of basis functions f_0 = 1 (the bias, with weight w_0), f_1(x), …, f_M(x) = f(||x − x_M||); a weighted sum of the hidden-unit outputs gives the output y.]

1. output = Σ_i w_i f_i(x)
2. adjustable parameters are the weights w_j and the number of hidden units M (< N)
3. form of the basis functions decided in advance
[Figure: geometry of the two hidden-unit types. An RBF unit's output f(r_i) depends on the distance r_i of x from its centre, so it is constant on circles around the centre; an MLP unit's output sig(w_iᵀx) depends on the projection of x onto its weight vector, so it is constant on the hyperplanes w_iᵀx = k.]
Comparison of MLP to RBFN

MLP: hidden unit outputs are monotonic functions of a weighted linear sum of the inputs => constant on (d−1)-dimensional hyperplanes.
RBF: hidden unit outputs are functions of the distance from a prototype vector (centre) => constant on concentric (d−1)-dimensional hyperellipsoids.

MLP: distributed representation, as many hidden units contribute to the network output => interference between units => non-linear training => slow convergence.
RBF: localised hidden units mean that few contribute to the output => lack of interference => faster convergence.
Comparison of MLP to RBFN (continued)

MLP: can have more than one hidden layer.
RBF: one hidden layer.

MLP: global supervised learning of all the weights.
RBF: hybrid learning, with supervised learning of only one set of weights.

MLP: global approximations to nonlinear mappings.
RBF: localised approximations to nonlinear mappings.
Hybrid training of RBFN
Two-stage 'hybrid' learning process.

Stage 1: parameterise the hidden layer of RBFs
- number of hidden units (M)
- centres/positions (t_i)
- widths (σ)
Use unsupervised methods (see below), as they are quick and unlabelled data is plentiful. The idea is to estimate the density of the data.

Stage 2: find the weight values between the hidden and output units
- minimise the sum-of-squares error between the actual outputs and the desired responses
- invert the matrix Φ if M = N
- use the pseudoinverse of Φ if M < N

Stage 2 is dealt with later; for now concentrate on stage 1.
Random subset approach
Randomly select the centres of the M RBF hidden units from the N data points. The widths of the RBFs are usually common and fixed to ensure a degree of overlap, based on an average or maximum distance between the RBFs, e.g.

  σ = d_max / √(2M)

where d_max is the maximum distance between the chosen M RBF centres.
The method is efficient and fast, but suboptimal, and it is important to get σ right …

[Figure: fits obtained with σ = 10, σ = 0.4 and σ = 0.08.]
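A minimal sketch of this centre-selection heuristic (my own illustration in NumPy, not the lecture's code):

```python
import numpy as np

def random_subset_centres(X, M, seed=None):
    """Pick M centres at random from the N data points and set a common width
    sigma = d_max / sqrt(2M), where d_max is the largest distance between centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=M, replace=False)]
    pairwise = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    sigma = pairwise.max() / np.sqrt(2 * M)
    return centres, sigma
```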
Clustering methods: K-means algorithm
Divides the data points into K subgroups based on similarity.

Batch version
1. Randomly assign each pattern vector x to one of K subsets
2. Compute the mean vector of each subset
3. Reassign each point to the subset with the closest mean vector
4. Loop back to 2 until there are no further reassignments

On-line version
1. Randomly choose K data points to be the basis centres m_i
2. As each vector x_n is presented, update the nearest m_i using:
   Δm_i = η (x_n − m_i)
3. Repeat until no further changes

The covariance matrices (or widths σ) can now be set to the covariance of the data points in each subset.
-- However, note that K must be decided at the start
-- Also, the algorithm can be sensitive to initial conditions
-- Can get problems of no/few points being in a set: see
competitive learning lecture
-- Might not cover the space accurately
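A small sketch of the batch version (my own illustration, not the lecture's code); the returned assignment can then be used to set per-cluster widths or covariances:

```python
import numpy as np

def kmeans_centres(X, K, n_iter=100, seed=None):
    """Batch K-means: assign each point to the nearest mean, recompute the means,
    and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    assignment = np.full(len(X), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                              # no further reassignments
        assignment = new_assignment
        for k in range(K):
            members = X[assignment == k]
            if len(members) > 0:               # an empty cluster keeps its old mean
                means[k] = members.mean(axis=0)
    return means, assignment
```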
Other unsupervised techniques, such as self-organising maps and Gaussian mixture models, can also be used.
Another approach is to use supervised techniques where the parameters of the basis functions are adaptive and can be optimised. However, this negates the speed and simplicity advantages of the first stage of training.
Relationship with probability density function estimation
Radial basis functions can be related to the kernel density functions (Parzen windows) used to estimate probability density functions.
E.g. in 2 dimensions, the pdf at a point x can be estimated from the fraction of training points which fall within a square of side h centred on x.

[Figure: six training points in the plane, with a square of side h centred on the query point x.]

Here p(x) = (1/6) × (1/h²) × Σ_n H(x − x_n, h), where H = 1 if |x_n − x| < h, i.e. we estimate the density by the fraction of points within each square.
Alternatively, H(|x_n − x|) could be a Gaussian, giving a smoother estimate of the pdf.
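A small sketch of such a kernel (Parzen-window) density estimate with a Gaussian kernel, purely to illustrate the analogy (the width h and the data are assumptions):

```python
import numpy as np

def parzen_density(x, X_train, h=0.5):
    """Kernel density estimate p(x) = (1/N) * sum_n K_h(x - x_n),
    using a spherical Gaussian kernel of width h in d dimensions."""
    d = X_train.shape[1]
    sq_dists = np.sum((X_train - x)**2, axis=1)
    norm = (2.0 * np.pi * h**2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dists / (2.0 * h**2)) / norm)
```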
In radial basis networks, the first stage of training is an attempt to model the density of the data in an unsupervised way. As in kernel density estimation, we try to get an idea of the underlying density by picking some prototypical points, and then use the distribution of the data to approximate a prior distribution.
Back to stage 2, for a network with M < N basis vectors.
Now for each training data vector t_i and corresponding target d_i we want F(t_i) = d_i; that is, we must find a function F that satisfies the interpolation condition:

  F(t_i) = d_i   for i = 1, …, N

or, more exactly, find

  F(x) = Σ_{j=0..M} w_j f(||x − x_j||)

satisfying

  F(t_i) = Σ_{j=0..M} w_j f(||t_i − x_j||) = d_i
So the interpolation matrix becomes

  | 1   f(||t1 − x1||)   …   f(||t1 − xM||) |  | w0 |     | d1 |
  | ⋮         ⋮          ⋱         ⋮        |  | w1 |  =  |  ⋮ |
  | 1   f(||tN − x1||)   …   f(||tN − xM||) |  |  ⋮ |     | dN |
                                               | wM |

which can be written as

  Φ W = D

where Φ is an N × (M+1) matrix (not square).
To solve this we need to define an error function, such as the least-squares error

  E = (1/2) Σ_n ( y(t_n) − d_n )²

and minimise it.
As the derivative of the least-squares error is a linear function of the weights, it can be solved using linear matrix inversion techniques (usually singular value decomposition; Press et al., Numerical Recipes).
Other error functions can be used, but minimising the error then becomes a non-linear optimisation problem.
However, note that the problem is overdetermined. That is, by using N training vectors and only M centres we have M unknowns (the weights) and N pieces of information. E.g. training vectors (−2, 0) and (1, 0) with targets 1 and 2, one centre at (0, 0) and a linear RBF give Φ W = D =>

  | 2 |           | 1 |
  |   |  [ w ] =  |   |
  | 1 |           | 2 |

w = 0.5 or w = 2 ??? Unless N = M and there are no degeneracies (parallel or nearly parallel data vectors), we cannot simply invert the matrix and must use the pseudoinverse (using singular value decomposition).
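As a quick check of the toy system above (a sketch; the least-squares answer splits the difference rather than satisfying either equation exactly):

```python
import numpy as np

Phi = np.array([[2.0],
                [1.0]])      # linear-RBF distances of (-2, 0) and (1, 0) from the centre (0, 0)
D = np.array([1.0, 2.0])     # targets

w = np.linalg.pinv(Phi) @ D  # pseudoinverse (least-squares) solution
print(w)                     # [0.8] -- minimises (2w - 1)^2 + (w - 2)^2
```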
Alternatively, we can view this as an ill-posed problem.
Ill-posed problems (Tikhonov)
How do we infer the function F which maps X onto y from a finite data set?
This can be done if the problem is well-posed:
- existence: each input pattern has an output
- uniqueness: each input pattern maps onto only one output
- continuity: small changes in input pattern space imply small changes in y
In RBFs, however:
- noise can violate the continuity condition
- different output values for the same input pattern violate uniqueness
- insufficient information in the training data may violate the existence condition
Ill-posed problem: the finite data set does not yield a unique solution.
Regularization theory (Tikhonov, 1963)
To solve ill-posed problems we need to supplement the finite data set with prior knowledge about the nature of the mapping -- regularization theory.
• It is common to place the constraint that the mapping is smooth (since smoothness implies continuity).
• Add a penalty term to the standard sum-of-squares error for non-smooth mappings:

  E(F) = E_S(F) + λ E_C(F)

where e.g.

  E_S(F) = (1/2) Σ_i ( d_i − F(x_i) )²

and

  E_C(F) = (1/2) || DF ||²

and DF could be, say, the first- or second-order derivative of F, etc.
λ is called the regularization parameter:
  λ = 0: unconstrained (smoothness not enforced)
  λ = infinity: the smoothness constraint dominates and less account is taken of the training data error
λ controls the balance (trade-off) between a smooth mapping and fitting the data points exactly.

[Figure: fits with E_C = curvature for λ = 0 and λ = 40.]
Regularization networks
Poggio & Girosi (1990) applied regularization theory to RBF networks. By minimizing the new error function E(F) we obtain (using results from functional analysis)

  W = [Φ + λI]⁻¹ D

where I is the unit matrix. Provided E_C is chosen to be quadratic in y, this equation can be solved using the same techniques as the non-regularised network.
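A one-line sketch of this regularised solve (my own illustration; Phi here is the square N x N basis-function matrix and lam is a free choice, with lam = 0 recovering exact interpolation):

```python
import numpy as np

def regularised_weights(Phi, D, lam=0.1):
    """W = (Phi + lambda * I)^(-1) D, the regularisation-network solution."""
    return np.linalg.solve(Phi + lam * np.eye(Phi.shape[0]), D)
```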
Problems of RBFs
1. Need to choose the number of basis functions.
2. Due to the local nature of the basis functions, RBF networks have problems ignoring 'noisy' input dimensions, unlike MLPs (it helps to use dimensionality reduction such as PCA).

[Figure: 1D data covered by M RBFs; the same data with an added dimension of uncorrelated noise needs of the order of M² RBFs.]
Problems of RBFs 2
3. The optimal choice of basis function parameters may not be optimal for the output task.

[Figure: data drawn from a function h leads to an RBF centred at a, which gives a bad representation of h; in contrast, one centred at b would be perfect.]
Problems of RBFs 3
4. Because of the dependence on distance, if the variation in one input dimension is small with respect to the others it will contribute very little to the outcome ((l + ε)² ≈ l²). Therefore, preprocess the data to give zero mean and unit variance via the simple transformation:

  x* = (x − μ) / σ

(We could achieve the same using general covariance matrices, but this is simpler.)
However, this does not take into account correlations in the data. It is better to use whitening (Bishop, 1995, pp. 299-300):

  x* = Λ^(−1/2) Uᵀ (x − μ)

where U is a matrix whose columns are the eigenvectors u_i of Σ, the covariance matrix of the data, and Λ is a matrix with the corresponding eigenvalues λ_i on the diagonal, i.e.

  U = (u_1, …, u_n)    and    Λ = diag(λ_1, …, λ_n)

[Figure: a correlated data cloud with its principal axes λ_1 u_1 and λ_2 u_2.]
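A small NumPy sketch of this whitening transform (my own illustration of the formula above; it assumes the covariance matrix has no zero eigenvalues):

```python
import numpy as np

def whiten(X):
    """x* = Lambda^(-1/2) U^T (x - mu): the whitened data has zero mean
    and (approximately) identity covariance."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    eigvals, U = np.linalg.eigh(Sigma)     # columns of U are the eigenvectors u_i
    return (X - mu) @ U / np.sqrt(eigvals)
```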
Using RBF Nets in practice
• Choose a functional form (Gaussian generally, but prior knowledge/experience may suggest others)
• Select the type of pre-processing:
  -- reduce dimensionality (techniques to follow in the next few lectures)?
  -- normalise (whiten) the data?
  (no way of knowing if these will be helpful: may need to try a few combinations)
• Select a clustering method (k-means)
• Select the number of basis functions, cluster and find the basis centres
• Find the weights (via matrix inversion)
• Calculate a performance measure.
If only life were so simple…
• How do we choose k? This is similar to the problem of selecting the number of hidden nodes for an MLP.
• What type of pre-processing is best?
• Does the clustering method work for the data? E.g. it might be better to fix σ and try again.
There is NO general answer: each choice will be problem-specific. The only information you have is your performance measure.
Idea: try e.g. increasing k until the performance measure decreases (or gets to a minimum, or something more adventurous).

[Figure: performance measure plotted against k, with the optimal k at its minimum.]

Note the dependence on the performance measure (make sure it's a good one).
The good thing about RBF nets is that the training procedure is relatively quick, so lots of combinations can be tried.
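Putting the practical recipe together, here is an end-to-end sketch (my own illustration, under assumptions not fixed by the notes: Gaussian basis functions, random-subset centres, the d_max width heuristic, pseudoinverse weights, and mean-squared error on a held-out validation set as the performance measure), including the simple 'increase k until the validation error stops improving' idea:

```python
import numpy as np

def design_matrix(X, centres, sigma):
    """N x (M+1) matrix: a bias column plus Gaussian basis-function outputs."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
    return np.hstack([np.ones((len(X), 1)), np.exp(-dists**2 / (2.0 * sigma**2))])

def train_rbf(X, d, k, seed=0):
    """Stage 1: random-subset centres and sigma = d_max / sqrt(2k).
       Stage 2: least-squares (pseudoinverse) weights."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    pairwise = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    sigma = max(pairwise.max() / np.sqrt(2 * k), 1e-6)
    w, *_ = np.linalg.lstsq(design_matrix(X, centres, sigma), d, rcond=None)
    return centres, sigma, w

def predict(X, centres, sigma, w):
    return design_matrix(X, centres, sigma) @ w

# Toy 1D regression data in the style of the Bishop example above
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 1))
d = 0.5 + 0.4 * np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.05, size=60)
X_train, d_train, X_val, d_val = X[:30], d[:30], X[30:], d[30:]

# Increase k and keep the value with the best validation error
best = None
for k in range(2, 11):
    centres, sigma, w = train_rbf(X_train, d_train, k)
    val_mse = np.mean((predict(X_val, centres, sigma, w) - d_val) ** 2)
    print(f"k = {k:2d}  validation MSE = {val_mse:.4f}")
    if best is None or val_mse < best[0]:
        best = (val_mse, k)
print("chosen k:", best[1])
```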