Transcript PPT
Radial Basis Functions
If we are using such linear interpolation, then our
radial basis function (RBF) $\rho_0$, which weights an input
vector based on its distance to a neuron's reference
(weight) vector, is $\rho_0(D) = D^{-1}$.
For the training samples $x_p$, $p = 1, \dots, P_0$, surrounding
the new input $x$, we find for the network's output $o$:
$$o = \frac{1}{P_0} \sum_{p=1}^{P_0} d_p \, \rho_0(\lVert x - x_p \rVert), \quad \text{where } d_p = f(x_p)$$
(In the following, to keep things simple, we will assume
that the network has only one output neuron. However,
any number of output neurons could be implemented.)
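The interpolation formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `inverse_distance_output` is hypothetical, the samples passed in are taken to be the ones "surrounding" the input, and the handling of a zero distance (where $D^{-1}$ is undefined) is a choice made for this sketch, not something the slides specify.

```python
import math


def inverse_distance_output(x, samples, targets):
    """Compute o = (1/P0) * sum_p d_p / ||x - x_p|| over the given samples."""
    total = 0.0
    for x_p, d_p in zip(samples, targets):
        dist = math.dist(x, x_p)  # Euclidean distance ||x - x_p||
        if dist == 0.0:
            # rho_0(D) = 1/D is undefined at D = 0. Returning the stored
            # target is an assumption of this sketch, not part of the slides.
            return d_p
        total += d_p / dist
    return total / len(samples)
```

For instance, calling it with the input `(0.0, 0.0)`, the samples `[(1.0, 0.0), (0.0, 2.0)]` and the targets `[1.0, 4.0]` weights the first target by 1 and the second by 1/2 before averaging.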
November 4, 2010
Neural Networks
Lecture 15: Radial Basis Functions
Radial Basis Functions
Since it is difficult to define what “surrounding” should
mean, it is common to consider all P training samples
and use any monotonically decreasing RBF $\rho$:
$$o = \frac{1}{P} \sum_{p=1}^{P} d_p \, \rho(\lVert x - x_p \rVert)$$
This, however, implies a network that has as many
hidden nodes as there are training samples. This is
unacceptable because of its computational complexity
and likely poor generalization ability – the network
resembles a look-up table.
Radial Basis Functions
It is more useful to have fewer neurons and accept that
the training set cannot be learned 100% accurately:
$$o = \frac{1}{N} \sum_{i=1}^{N} \varphi_i \, \rho(\lVert x - \mu_i \rVert)$$
Here, ideally, each reference vector $\mu_i$ of these $N$
neurons should be placed in the center of an input-space
cluster of training samples with identical (or at
least similar) desired output $\varphi_i$.
To learn near-optimal values for the reference vectors
and the output weights, we can – as usual – employ
gradient descent.
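As a rough illustration of a network with fewer neurons than training samples, the sketch below evaluates the output of N neurons for one input. The Gaussian form of the RBF, the default `sigma`, and the names `gaussian_rbf`, `rbf_output`, `centers` and `phis` are assumptions made for this example; the slides allow any monotonically decreasing RBF here.

```python
import math


def gaussian_rbf(distance, sigma=1.0):
    """One possible monotonically decreasing RBF (assumed for this sketch)."""
    return math.exp(-distance ** 2 / sigma ** 2)


def rbf_output(x, centers, phis, rho=gaussian_rbf):
    """Compute o = (1/N) * sum_i phi_i * rho(||x - mu_i||)."""
    n = len(centers)
    return sum(phi * rho(math.dist(x, mu)) for mu, phi in zip(centers, phis)) / n
```

An input that sits on one reference vector and far from all others is dominated by that neuron's desired output, scaled by 1/N.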
The RBF Network
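Example: Network function $f: \mathbb{R}^3 \to \mathbb{R}$
[Figure: RBF network with three layers. Input layer: input vector $x_1, x_2, x_3$ plus the constant $x_0 = 1$. RBF layer: four neurons with parameters $(\mu_1, \sigma_1), \dots, (\mu_4, \sigma_4)$. Output layer: weights $w_0, \dots, w_4$ producing the output vector $o_1$.]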
Radial Basis Functions
For a fixed number of neurons N, we could learn the
following output weights and reference vectors:
$$w_1 = \frac{\varphi_1}{N}, \; \dots, \; w_N = \frac{\varphi_N}{N}, \quad \mu_1, \dots, \mu_N$$
To do this, we first have to define an error function E:
$$E = \sum_{p=1}^{P} E_p = \sum_{p=1}^{P} (d_p - o_p)^2$$
Taken together, we get:
$$E_p = \left( d_p - \sum_{i=1}^{N} w_i \, \rho(\lVert x_p - \mu_i \rVert) \right)^2$$
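The error function can be written down directly in code. The following is a minimal sketch, assuming the RBF is passed in as a callable `rho` that takes a distance; the function and parameter names are hypothetical and chosen only for this example.

```python
import math


def network_output(x, centers, weights, rho):
    """Compute o = sum_i w_i * rho(||x - mu_i||) for one input x."""
    return sum(w * rho(math.dist(x, mu)) for mu, w in zip(centers, weights))


def total_error(samples, targets, centers, weights, rho):
    """Compute E = sum_p (d_p - o_p)^2 over all P training samples."""
    return sum((d - network_output(x, centers, weights, rho)) ** 2
               for x, d in zip(samples, targets))
```

Because the error is a plain sum of squared differences, any parameter setting that reproduces every desired output exactly gives E = 0.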
Learning in RBF Networks
Then the error gradient with regard to $w_1, \dots, w_N$ is:
$$\frac{\partial E_p}{\partial w_i} = -2 (d_p - o_p) \, \rho(\lVert x_p - \mu_i \rVert)$$
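For $\mu_{i,j}$, the $j$-th vector component of $\mu_i$, we get:
$$\frac{\partial E_p}{\partial \mu_{i,j}} = -2 (d_p - o_p) \, w_i \, \frac{\partial \rho(\lVert x_p - \mu_i \rVert)}{\partial \lVert x_p - \mu_i \rVert^2} \cdot \frac{\partial \lVert x_p - \mu_i \rVert^2}{\partial \mu_{i,j}}$$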
Learning in RBF Networks
The vector length (||…||) expression is inconvenient,
because it is the square root of the given vector
multiplied by itself.
To eliminate this difficulty, we introduce a function $R$
with $R(D^2) = \rho(D)$ and substitute it for $\rho$.
This leads to a simplified differentiation:
$$\frac{\partial \rho(\lVert x_p - \mu_i \rVert)}{\partial \lVert x_p - \mu_i \rVert^2} = \frac{\partial R(\lVert x_p - \mu_i \rVert^2)}{\partial \lVert x_p - \mu_i \rVert^2} = R'(\lVert x_p - \mu_i \rVert^2)$$
Learning in RBF Networks
Together with the following derivative…
$$\frac{\partial \lVert x_p - \mu_i \rVert^2}{\partial \mu_{i,j}} = -2 (x_{p,j} - \mu_{i,j})$$
… we finally get the result for our error gradient:
$$\frac{\partial E_p}{\partial \mu_{i,j}} = 4 w_i (d_p - o_p) \, R'(\lVert x_p - \mu_i \rVert^2) \, (x_{p,j} - \mu_{i,j})$$
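One way to gain confidence in the derived gradients is to compare them with finite differences of the error. The sketch below does this for a single training sample. It assumes a Gaussian choice $R(D) = \exp(-D/\sigma^2)$ with one shared spread `SIGMA`; all function names are hypothetical and exist only for this check.

```python
import math

SIGMA = 1.0  # assumed common spread for all neurons in this sketch


def R(sq):
    """R takes the squared distance as its argument."""
    return math.exp(-sq / SIGMA ** 2)


def R_prime(sq):
    """Derivative of R with respect to the squared distance."""
    return -R(sq) / SIGMA ** 2


def sq_dist(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def output(x, centers, weights):
    return sum(w * R(sq_dist(x, mu)) for w, mu in zip(weights, centers))


def error(x, d, centers, weights):
    return (d - output(x, centers, weights)) ** 2


def analytic_gradients(x, d, centers, weights):
    """Gradients of E_p with respect to w_i and mu_{i,j}, as derived above."""
    diff = d - output(x, centers, weights)
    grad_w = [-2.0 * diff * R(sq_dist(x, mu)) for mu in centers]
    grad_mu = [[4.0 * w * diff * R_prime(sq_dist(x, mu)) * (xj - mj)
                for xj, mj in zip(x, mu)]
               for w, mu in zip(weights, centers)]
    return grad_w, grad_mu


def numeric_grad_w(x, d, centers, weights, i, eps=1e-6):
    """Central finite difference of E_p with respect to w_i."""
    plus, minus = list(weights), list(weights)
    plus[i] += eps
    minus[i] -= eps
    return (error(x, d, centers, plus) - error(x, d, centers, minus)) / (2 * eps)


def numeric_grad_mu(x, d, centers, weights, i, j, eps=1e-6):
    """Central finite difference of E_p with respect to mu_{i,j}."""
    plus = [list(mu) for mu in centers]
    minus = [list(mu) for mu in centers]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (error(x, d, plus, weights) - error(x, d, minus, weights)) / (2 * eps)
```

If the analytic and numeric values agree to several decimal places for arbitrary test parameters, the signs and constant factors in the derivation are very likely right.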
Learning in RBF Networks
This gives us the following updating rules:
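$$\Delta w_i = \eta_i (d_p - o_p) \, \rho(\lVert x_p - \mu_i \rVert)$$
$$\Delta \mu_{i,j} = -\eta_{i,j} \, w_i (d_p - o_p) \, R'(\lVert x_p - \mu_i \rVert^2) \, (x_{p,j} - \mu_{i,j})$$
where the (positive) learning rates $\eta_i$ and $\eta_{i,j}$ could be
chosen individually for each parameter $w_i$ and $\mu_{i,j}$.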
As usual, we can start with random parameters and
then iterate these rules for learning until a given error
threshold is reached.
Learning in RBF Networks
If the node function is given by a Gaussian, then:
$$R(D) = \exp\left(-\frac{D}{\sigma^2}\right)$$
As a result:
$$R'(D) = -\frac{1}{\sigma^2} \exp\left(-\frac{D}{\sigma^2}\right)$$
Learning in RBF Networks
The specific update rules are now:
$$\Delta w_i = \eta_i (d_p - o_p) \exp\left(-\frac{\lVert x_p - \mu_i \rVert^2}{\sigma^2}\right)$$
and
$$\Delta \mu_{i,j} = \eta_{i,j} \, w_i (d_p - o_p) (x_{p,j} - \mu_{i,j}) \exp\left(-\frac{\lVert x_p - \mu_i \rVert^2}{\sigma^2}\right)$$
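The Gaussian update rules translate into a short online training step. This is a sketch under several assumptions: a single shared `sigma`, one learning rate `eta_w` for all weights and one `eta_mu` for all reference-vector components (the slides allow individual rates), and hypothetical function names. The factor $1/\sigma^2$ from $R'$ is treated as absorbed into `eta_mu`.

```python
import math


def activations(x, centers, sigma):
    """exp(-||x - mu_i||^2 / sigma^2) for every neuron i."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, mu)) / sigma ** 2)
            for mu in centers]


def output(x, centers, weights, sigma=1.0):
    return sum(w * a for w, a in zip(weights, activations(x, centers, sigma)))


def train_step(x, d, centers, weights, sigma=1.0, eta_w=0.1, eta_mu=0.05):
    """Apply both update rules in place for one sample; return its error E_p."""
    acts = activations(x, centers, sigma)
    diff = d - sum(w * a for w, a in zip(weights, acts))
    for i, act in enumerate(acts):
        w_old = weights[i]  # use the pre-update weight for the center update
        weights[i] += eta_w * diff * act
        for j in range(len(x)):
            centers[i][j] += eta_mu * w_old * diff * (x[j] - centers[i][j]) * act
    return diff ** 2


def total_error(samples, targets, centers, weights, sigma=1.0):
    return sum((d - output(x, centers, weights, sigma)) ** 2
               for x, d in zip(samples, targets))
```

A training run would call `train_step` repeatedly over all samples (with `centers` given as lists of lists so they can be modified in place) and stop once `total_error` falls below a chosen threshold.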
Learning in RBF Networks
It turns out that, particularly for Gaussian RBFs, it is
more efficient and typically leads to better results to
use partially offline training:
First, we use any clustering procedure (e.g., k-means)
to estimate cluster centers, which are then used to set
the values of the reference vectors $\mu_i$ and their
spreads (standard deviations) $\sigma_i$.
Then we use the gradient descent method described
above to determine the weights $w_i$.
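The two-stage procedure can be sketched as follows. Everything specific here is an assumption made for illustration: a plain k-means seeded with the first k samples, spreads estimated as the root-mean-square distance of cluster members to their center (with a small floor to avoid division by zero), per-neuron spreads inside the Gaussian, and weights trained online from zero. All function names are hypothetical.

```python
import math


def kmeans(samples, k, iterations=20):
    """Plain k-means; seeding with the first k samples is an assumption."""
    centers = [list(s) for s in samples[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda i: math.dist(s, centers[i]))
            clusters[nearest].append(s)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
    return centers, clusters


def spreads(centers, clusters, minimum=1e-3):
    """Estimate sigma_i as the RMS distance of members to their center."""
    result = []
    for mu, members in zip(centers, clusters):
        var = (sum(math.dist(s, mu) ** 2 for s in members) / len(members)
               if members else 0.0)
        result.append(max(math.sqrt(var), minimum))
    return result


def activations(x, centers, sigmas):
    return [math.exp(-math.dist(x, mu) ** 2 / s ** 2)
            for mu, s in zip(centers, sigmas)]


def predict(x, centers, sigmas, weights):
    return sum(w * a for w, a in zip(weights, activations(x, centers, sigmas)))


def train_weights(samples, targets, centers, sigmas, eta=0.1, epochs=200):
    """Gradient descent on the output weights only; centers stay fixed."""
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, d in zip(samples, targets):
            acts = activations(x, centers, sigmas)
            diff = d - sum(w * a for w, a in zip(weights, acts))
            weights = [w + eta * diff * a for w, a in zip(weights, acts)]
    return weights
```

Since the reference vectors and spreads are fixed after clustering, the error is quadratic in the weights alone, which is one reason this stage tends to be better behaved than adapting all parameters at once.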