Transcript Document

Topic 1: Neural Networks

Outline
Neural Networks
Cerebellar Model Articulation Controller (CMAC)
Applications
References
C.L. Lin & H.W. Su, “Intelligent control theory in guidance and control system design: an overview,” Proc. Natl. Sci. Counc. ROC (A), pp. 15-30.
1. Neural Networks
As you read these words you are using a complex
biological neural network. You have a highly
interconnected set of 10^11 neurons to facilitate
your reading, breathing, motion and thinking.
In the artificial neural network, the neurons are
not biological. They are extremely simple
abstractions of biological neurons, realized as
elements in a program or perhaps as circuits
made of silicon.
Biological Inspiration
The human brain consists of a large number (about 10^11) of highly interconnected elements (about 10^4 connections per element) called neurons.
Three principal components are the
dendrites, the cell body and the axon.
The point of contact is called a synapse.
Biological Neurons
Dendrites: carry electrical signals into the cell body.
Cell body: sums and thresholds these incoming signals.
Axon: carries the signal from the cell body out to other neurons.
Synapse: the point of contact between an axon of one cell and a dendrite of another cell.
Neural Networks
Neural networks: a promising new generation of information processing systems that usually operate in parallel and demonstrate the ability to learn, recall, and generalize from training patterns or data.
Artificial neural networks are collections of
mathematical models that emulate some of the
observed properties of biological nervous systems
and draw on the analogies of adaptive biological
learning.
Basic Model ~ 1
[Figure: a single node with inputs s1, s2, …, sn and output y = f(s1, s2, …, sn)]
A neural network is composed of four pieces: nodes,
connections between the nodes, nodal functions, and a
learning rule for updating the information in the network.
Basic Model ~ 2
Nodes: a number of nodes, each an elementary processor (EP), is required.
Connectivity: this can be represented by a matrix that shows the connections between the nodes. The number of nodes plus the connectivity define the topology of the network. In the human brain, all neurons are connected to about 10^4 other neurons. Artificial nets can range from totally connected to a topology where each node is connected only to its nearest neighbors.
Basic Model ~ 3
Elementary processor functions: a node has inputs s1, …, sn and an output y, and the node generates the output y as a function of the inputs.
A learning rule: there are two types of learning:
Supervised learning: you have to teach the network the “answers.”
Unsupervised learning: the network figures out the answers on its own. All the learning rules try to embed information by sampling the environment.
Perceptron Model
Suppose we have a two-class problem. If we can separate these classes with a straight line (decision surface), then they are linearly separable.
The question is, how can we find the best line, and
what do we mean by “best.”
In n dimensions, we have a hyperplane separating
the classes. These are all decision surfaces.
Another problem is that you may need more than
one line to separate the classes.
Decision Surfaces
[Figure: two scatter plots of x and o samples; left, linearly separable classes divided by a single line; right, classes that require a multi-line decision surface]
Single Layer Perceptron Model
[Figure: a single node with inputs x1, …, xn, weights w1, …, wn, and nodal function f]

x_i: inputs to the node; y: output; w_i: weights; \theta: threshold value.
The net input is net = \sum_i w_i x_i - \theta, so the output y can be expressed as
y = f\left(\sum_i w_i x_i - \theta\right)
The function f is called the nodal (transfer) function and is not the same in every application.
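As an illustration (not part of the original slides), here is a minimal Python sketch of such a node, assuming a hard-limiter nodal function that outputs +1 or -1:

```python
import numpy as np

def hard_limiter(net):
    """Hard-limiter nodal function: +1 if net >= 0, else -1."""
    return 1.0 if net >= 0 else -1.0

def node_output(x, w, theta, f=hard_limiter):
    """Compute y = f(sum_i w_i * x_i - theta) for a single node."""
    net = np.dot(w, x) - theta
    return f(net)

# Example: a two-input node acting as a simple linear classifier.
y = node_output(x=np.array([0.5, 1.0]), w=np.array([1.0, 1.0]), theta=1.2)
print(y)  # +1, since 0.5 + 1.0 - 1.2 = 0.3 >= 0
```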
Nodal Function
[Figure: three common nodal functions, the hard-limiter, the threshold function, and the sigmoid function]
Single Layer Perceptron Model
Two-input case: w_1 x_1 + w_2 x_2 - \theta = 0
If we use the hard limiter, then we could say that
if the output of the function is a 1, the input vector
belongs to class A. If the output is a –1, the input
vector belongs to class B.
XOR: the exclusive-OR problem caused the field of neural networks to lose credibility in the late 1960s. The perceptron model could not draw a single line to separate the two classes given by the exclusive-OR.
Exclusive OR problem
[Figure: the XOR patterns; o at (0,0) and (1,1), x at (0,1) and (1,0); no single line separates the two classes]
Two-layer Perceptron Model
[Figure: a two-layer network with inputs x1, x2, hidden nodes y1, y2 (weights w11, w12, w21, w22), and an output node z (weights w'1, w'2)]

The outputs from the two hidden nodes are
y_1 = f(w_{11} x_1 + w_{12} x_2 - \theta)
y_2 = f(w_{21} x_1 + w_{22} x_2 - \theta)
The network output is
z = f(w_1' y_1 + w_2' y_2 - \theta)
Exclusive-XOR problem
[Figure: a network that solves XOR. Input units x and y feed a hidden unit g (weights +1, +1, threshold 1.5) and an output unit f (weights +1, +1, threshold 0.5); the hidden unit feeds the output unit with weight -2. The input patterns 00, 01, 10, 11 produce the output patterns 0, 1, 1, 0.]
Exclusive-XOR problem
g = sgn(1·x + 1·y - 1.5)
f = sgn(1·x + 1·y - 2g - 0.5)
input (0,0) → g = 0 → f = 0
input (0,1) → g = 0 → f = 1
input (1,0) → g = 0 → f = 1
input (1,1) → g = 1 → f = 0
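For illustration, a short Python check that this network reproduces the XOR truth table; sgn is taken here as a 0/1 unit step, matching the 0/1 values in the table above:

```python
def step(u):
    """0/1 unit step used as the sgn nodal function here."""
    return 1 if u > 0 else 0

def xor_net(x, y):
    g = step(1 * x + 1 * y - 1.5)          # hidden unit
    f = step(1 * x + 1 * y - 2 * g - 0.5)  # output unit
    return g, f

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), xor_net(x, y))  # g, f match the table above
```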
Multilayer Network
[Figure: a multilayer network with input patterns at the bottom, internal representation (hidden) units in the middle, and output patterns at the top]

o_k = f_k\left(\sum_j w_{kj} o_j\right)  (output units k)
o_j = f_j\left(\sum_i w_{ji} o_i\right)  (internal representation units j)
o_i = f_i(i_i)  (input units i)
Weight Adjustment
Adjust weights by w_{ji}(l+1) = w_{ji}(l) + \Delta w_{ji},
where w_{ji}(l) is the weight from unit i to unit j at time l (the l-th iteration) and \Delta w_{ji} is the weight adjustment.
The weight change may be computed by the delta rule:
\Delta w_{ji} = \eta \, \delta_j \, i_i
where \eta is a trial-independent learning rate and \delta_j is the error at unit j:
\delta_j = t_j - o_j
where t_j is the desired output and o_j is the actual output at output unit j.
Repeat iterations until convergence.
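As a sketch of how this rule might be applied (the patterns, learning rate, and iteration count below are made-up illustrations, not from the slides), consider a single linear output unit trained with the delta rule:

```python
import numpy as np

eta = 0.1                      # trial-independent learning rate
w = np.zeros(2)                # weights from the two input units

# Hypothetical training patterns: inputs i and desired outputs t.
patterns = [(np.array([1.0, 0.0]), 0.5),
            (np.array([0.0, 1.0]), -0.5),
            (np.array([1.0, 1.0]), 0.0)]

for iteration in range(100):
    for i_vec, t in patterns:
        o = np.dot(w, i_vec)        # actual output of a linear unit
        delta = t - o               # error at the output unit
        w += eta * delta * i_vec    # delta rule: dw_ji = eta * delta_j * i_i
print(w)                            # converges toward [0.5, -0.5]
```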
Generalized Delta Rule

\Delta_p w_{ji} = \eta (t_{pj} - o_{pj}) i_{pi} = \eta \, \delta_{pj} \, i_{pi}
t_{pj}: the target output for the j-th component of the output pattern for pattern p.
o_{pj}: the j-th element of the actual output pattern produced by the presentation of input pattern p.
i_{pi}: the value of the i-th element of the input pattern.
\delta_{pj} = t_{pj} - o_{pj}
\Delta_p w_{ji}: the change to be made to the weight from the i-th to the j-th unit following presentation of pattern p.
Delta Rule and
Gradient Descent
E_p = \frac{1}{2} \sum_j (t_{pj} - o_{pj})^2 : the error on input/output pattern p
E = \sum_p E_p : the overall measure of the error.
We wish to show that the delta rule implements a gradient descent in E when the units are linear. We will proceed by simply showing that
-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} \, i_{pi}
which is proportional to \Delta_p w_{ji} as prescribed by the delta rule.
Delta Rule &
Gradient Descent
When there are no hidden units it is easy to compute the
relevant derivative. For this purpose we use the chain rule
to write the derivative as the product of two parts: the
derivative of the error with respect to the output of the unit
times the derivative of the output with respect to the
weight.
E p E p o pj

w ji o pj w ji
The first part tells how the error changes with the output
of the jth unit and the second part tells how much
changing wji changes that output.
Delta Rule &
Gradient Descent
\frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj}) = -\delta_{pj}   (no hidden units)
The contribution of unit j to the error is simply proportional to \delta_{pj}.
Since we have linear units, o_{pj} = \sum_i w_{ji} i_{pi}, from which we conclude that
\frac{\partial o_{pj}}{\partial w_{ji}} = i_{pi}
Thus, we have
-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} \, i_{pi}
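The identity -\partial E_p / \partial w_{ji} = \delta_{pj} i_{pi} can also be checked numerically. The sketch below (an added illustration with made-up values) compares a finite-difference estimate of -\partial E_p / \partial w_{ji} with \delta_{pj} i_{pi} for a single linear unit:

```python
import numpy as np

def E_p(w, i_vec, t):
    """Error on one pattern: 0.5 * (t - o)^2 for a single linear unit."""
    o = np.dot(w, i_vec)
    return 0.5 * (t - o) ** 2

w = np.array([0.3, -0.2])       # made-up weights
i_vec = np.array([1.0, 2.0])    # made-up input pattern
t = 1.0                         # made-up target

delta = t - np.dot(w, i_vec)    # delta_pj = t_pj - o_pj
eps = 1e-6
for i in range(len(w)):
    w_eps = w.copy()
    w_eps[i] += eps
    num_grad = (E_p(w_eps, i_vec, t) - E_p(w, i_vec, t)) / eps
    print(-num_grad, delta * i_vec[i])   # the two values agree
```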
Delta Rule and
Gradient Descent
Combining this with the observation that
\frac{\partial E}{\partial w_{ji}} = \sum_p \frac{\partial E_p}{\partial w_{ji}}
should lead us to conclude that the net change in
wji after one complete cycle of pattern
presentations is proportional to this derivative
and hence that the delta rule implements a gradient
descent in E. In fact, this is strictly true only if the
values of the weights are not changed during this
cycle.
Delta Rule for Activation Functions in
Feedforward Networks
The standard delta rule essentially implements
gradient descent in sum-squared error for linear
activation functions.
Without hidden units, the error surface is shaped like
a bowl with only one minimum, so gradient descent is
guaranteed to find the best set of weights.
With hidden units, however, it is not so obvious how
to compute the derivatives, and the error surface is
not concave upwards, so there is the danger of getting
stuck in a local minimum.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
The main theoretical contribution is to show that
there is an efficient way of computing the
derivatives.
The main empirical contribution is to show that the
apparently fatal problem of local minima is
irrelevant in a wide variety of learning tasks.
A semilinear activation function is one in which
the output of a unit is a non-decreasing and
differentiable function of the net total input,
net_{pj} = \sum_i w_{ji} o_{pi}, where o_{pi} = i_{pi} if unit i is an input unit.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
Thus, a semilinear activation function is one in which o_{pj} = f_j(net_{pj}) and f is differentiable and non-decreasing.
To get the correct generalization of the delta rule, we must set
\Delta_p w_{ji} \propto -\frac{\partial E_p}{\partial w_{ji}}
where E is the same sum-squared error function
defined earlier.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
As in the standard delta rule, it is useful to see this derivative as resulting from the product of two parts: one part reflecting the change in error as a function of the change in the net input to the unit, and one part representing the effect of changing a particular weight on the net input:
\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ji}}
The second factor is
\frac{\partial net_{pj}}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} o_{pk} = o_{pi}
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
Define \delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}}.
Thus, -\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} \, o_{pi}.
This says that to implement gradient descent in E we should make our weight changes according to
\Delta_p w_{ji} = \eta \, \delta_{pj} \, o_{pi}
just as in the standard delta rule. The trick is to figure out what \delta_{pj} should be for each unit u_j in the network.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
Compute \delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}} = -\frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial net_{pj}}
The second factor is
\frac{\partial o_{pj}}{\partial net_{pj}} = f_j'(net_{pj})
which is simply the derivative of the function f_j for the j-th unit, evaluated at the net input net_{pj} to that unit. (Note: o_{pj} = f_j(net_{pj}).)
To compute the first factor, we consider two cases.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
First, assume that unit uj is an output unit of the
network. In this case, it follows from the definition of
E_p that
\frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj})
Thus, \delta_{pj} = (t_{pj} - o_{pj}) f_j'(net_{pj})
for any output unit u_j.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
If uj is not an output unit we use the chain rule to
write
E p net pk
E p

 net
k

k
pk
E p
net pk
o pj

k
net pk o pj
 wki o pi
i
wkj    pk wkj
k
Thus,  pj  f j ' (net pj )  pk wkj
k
whenever uj is not an output unit.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
If u_j is an output unit: \delta_{pj} = (t_{pj} - o_{pj}) f_j'(net_{pj})
If u_j is not an output unit: \delta_{pj} = f_j'(net_{pj}) \sum_k \delta_{pk} w_{kj}
The above two equations give a recursive procedure for computing the \delta's for all units in the network, which are then used to compute the weight changes in the network:
\Delta_p w_{ji} = \eta \, \delta_{pj} \, o_{pi}
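To make the recursion concrete, the following sketch (my own illustration, assuming logistic units and a single hidden layer) performs a forward pass and then computes the deltas for the output layer and the hidden layer, followed by the weight changes:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def backprop_deltas(i_p, t_p, W_hid, W_out, eta=0.5):
    """One pattern: forward pass, then recursive computation of the deltas."""
    # Forward pass (semilinear units, f = logsig)
    net_hid = W_hid @ i_p
    o_hid = logsig(net_hid)
    net_out = W_out @ o_hid
    o_out = logsig(net_out)

    # Output units: delta_pj = (t_pj - o_pj) * f'(net_pj), with f' = o(1 - o)
    delta_out = (t_p - o_out) * o_out * (1.0 - o_out)
    # Hidden units: delta_pj = f'(net_pj) * sum_k delta_pk * w_kj
    delta_hid = o_hid * (1.0 - o_hid) * (W_out.T @ delta_out)

    # Weight changes: Delta_p w_ji = eta * delta_pj * o_pi
    dW_out = eta * np.outer(delta_out, o_hid)
    dW_hid = eta * np.outer(delta_hid, i_p)
    return dW_hid, dW_out

# Made-up sizes and values, just to show the shapes involved.
rng = np.random.default_rng(0)
dW_hid, dW_out = backprop_deltas(i_p=np.array([1.0, 0.0]),
                                 t_p=np.array([1.0]),
                                 W_hid=rng.normal(size=(3, 2)),
                                 W_out=rng.normal(size=(1, 3)))
```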
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
The application of the
generalized delta rule, thus,
involves two phases.
During the first phase the
input is presented and
propagated forward through
the network to compute the
output value opj for each unit.
This output is then compared with the targets, resulting in an error signal \delta_{pj} for each output unit.
Delta Rule for Semilinear Activation
Functions in Feedforward Networks
The second phase
involves a backward
pass through the network
(analogous to the initial
forward pass) during
which the error signal is
passed to each unit in the
network and the
appropriate weight
changes are made.
Ex: Function Approximation

g(p) = 1 + \sin\left(\frac{\pi}{4} p\right)

[Figure: the 1-2-1 network receives the input p and produces the output a; the target t = g(p) is compared with a to form the error e]
Network Architecture
[Figure: the 1-2-1 network architecture, with input p and output a]
Initial Values
W^1(0) = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}, \quad b^1(0) = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}, \quad W^2(0) = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix}, \quad b^2(0) = \begin{bmatrix} 0.48 \end{bmatrix}

Initial Network Response:
[Figure: the initial network response a^2 plotted against p on [-2, 2], together with the sine wave g(p) it should approximate]
Forward Propagation
Initial input: a^0 = p = 1

Output of the 1st layer:
a^1 = f^1(W^1 a^0 + b^1) = \mathrm{logsig}\left(\begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} 1 + \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}\right) = \mathrm{logsig}\left(\begin{bmatrix} -0.75 \\ -0.54 \end{bmatrix}\right)
a^1 = \begin{bmatrix} \dfrac{1}{1 + e^{0.75}} \\ \dfrac{1}{1 + e^{0.54}} \end{bmatrix} = \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix}

Output of the 2nd layer:
a^2 = f^2(W^2 a^1 + b^2) = \mathrm{purelin}\left(\begin{bmatrix} 0.09 & -0.17 \end{bmatrix} \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix} + 0.48\right) = 0.446

Error:
e = t - a^2 = \left(1 + \sin\left(\frac{\pi}{4} p\right)\right) - a^2 = \left(1 + \sin\left(\frac{\pi}{4} \cdot 1\right)\right) - 0.446 = 1.261
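The forward pass above can be reproduced with a few lines of Python (an added check, not part of the slides); the printed values match a^1 ≈ [0.321, 0.368], a^2 ≈ 0.446, and e ≈ 1.261:

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))   # 1st-layer transfer function
purelin = lambda n: n                          # 2nd-layer transfer function

W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])

p = 1.0
a0 = np.array([[p]])
a1 = logsig(W1 @ a0 + b1)            # -> [[0.3208], [0.3682]]
a2 = purelin(W2 @ a1 + b2)           # -> [[0.4462]]
t = 1.0 + np.sin(np.pi / 4.0 * p)    # target g(p)
e = t - a2                           # -> [[1.2609]]
print(a1.ravel(), a2.item(), e.item())
```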
Transfer Func. Derivatives
f^1{}'(n) = \frac{d}{dn}\left(\frac{1}{1+e^{-n}}\right) = \frac{e^{-n}}{(1+e^{-n})^2} = \left(1 - \frac{1}{1+e^{-n}}\right)\left(\frac{1}{1+e^{-n}}\right) = (1 - a^1)(a^1)

f^2{}'(n) = \frac{d}{dn}(n) = 1
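A quick numerical confirmation (added for illustration) that the log-sigmoid derivative equals (1 - a)(a):

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

n = -0.75
a = logsig(n)
analytic = (1.0 - a) * a                              # f1'(n) = (1 - a) * a
eps = 1e-6
numeric = (logsig(n + eps) - logsig(n - eps)) / (2 * eps)
print(analytic, numeric)   # both are about 0.2180
```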
Backpropagation
E_p = \sum_j (t_{pj} - o_{pj})^2, \qquad \Delta_p w_{ji} = -\eta \frac{\partial E_p}{\partial w_{ji}} = -\eta \frac{\partial E_p}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ji}} = \eta \, \delta_{pj} \frac{\partial net_{pj}}{\partial w_{ji}}

The second-layer sensitivity, from \delta_{pj} = (t_{pj} - o_{pj}) f_j'(net_{pj}):
\delta^2 = -2 F^2{}'(n^2)(t - a) = -2 [f^2{}'(n^2)] e = -2 \cdot 1 \cdot 1.261 = -2.522

The first-layer sensitivity, from \delta_{pj} = f_j'(net_{pj}) \sum_k \delta_{pk} w_{kj}:
\delta^1 = F^1{}'(n^1)(W^2)^T \delta^2 = \begin{bmatrix} (1 - a^1_1)(a^1_1) & 0 \\ 0 & (1 - a^1_2)(a^1_2) \end{bmatrix} \begin{bmatrix} w^2_{1,1} \\ w^2_{1,2} \end{bmatrix} \delta^2
= \begin{bmatrix} (1 - 0.321)(0.321) & 0 \\ 0 & (1 - 0.368)(0.368) \end{bmatrix} \begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522) = \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}
Weight Update

\Delta_p w_{ji} = \eta \, \delta_{pj} \, o_{pi}, \qquad learning rate \eta = 0.1

W^2(1) = W^2(0) - \eta \, s^2 (a^1)^T = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} - 0.1 \, [-2.522] \begin{bmatrix} 0.321 & 0.368 \end{bmatrix} = \begin{bmatrix} 0.171 & -0.0772 \end{bmatrix}

b^2(1) = b^2(0) - \eta \, s^2 = [0.48] - 0.1[-2.522] = [0.732]

W^1(1) = W^1(0) - \eta \, s^1 (a^0)^T = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} [1] = \begin{bmatrix} -0.265 \\ -0.420 \end{bmatrix}

b^1(1) = b^1(0) - \eta \, s^1 = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} = \begin{bmatrix} -0.475 \\ -0.140 \end{bmatrix}
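Combining the forward pass, the sensitivities, and the weight update, the sketch below (my own code, following the sensitivity convention s = -2 f'(n) e used on the previous slides, so the update subtracts eta * s * a^T) reproduces W^2(1), b^2(1), W^1(1), and b^1(1):

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
eta = 0.1

# Forward pass
p = 1.0
a0 = np.array([[p]])
a1 = logsig(W1 @ a0 + b1)
a2 = W2 @ a1 + b2                     # purelin output layer
e = (1.0 + np.sin(np.pi / 4.0 * p)) - a2

# Backward pass: sensitivities s = dE/dn (the negative of the deltas above)
s2 = -2.0 * 1.0 * e                   # f2'(n) = 1
F1 = np.diagflat((1.0 - a1) * a1)     # diagonal of f1'(n) = (1 - a1) * a1
s1 = F1 @ W2.T @ s2

# Weight update: W(1) = W(0) - eta * s * a^T
W2 = W2 - eta * s2 @ a1.T;  b2 = b2 - eta * s2
W1 = W1 - eta * s1 @ a0.T;  b1 = b1 - eta * s1
print(W2, b2.item())                  # ~[[0.171, -0.0772]], 0.732
print(W1.ravel(), b1.ravel())         # ~[-0.265, -0.420], [-0.475, -0.140]
```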
Choice of Network Structure
Multilayer networks can be used to
approximate almost any function, if we
have enough neurons in the hidden layers.
We cannot say, in general, how many
layers or how many neurons are necessary
for adequate performance.
Illustrated Example 1
g(p) = 1 + \sin\left(\frac{i\pi}{4} p\right)

[Figure: responses of a trained 1-3-1 network for i = 1, 2, 4, and 8, plotted for p in [-2, 2]]
Illustrated Example 2
g(p) = 1 + \sin\left(\frac{6\pi}{4} p\right), \quad -2 \le p \le 2

[Figure: responses of trained 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks, plotted for p in [-2, 2]]
Convergence
g p = 1 + sinp  2  p  2
3
3
5
2
2
1
3
1
5
3
4
2
0
1
4
2
0
0
0
1
-1
-2
-1
0
1
2
-1
-2
-1
0
1
2
Convergence to Global Min.
Convergence to Local Min.
The numbers to each curve indicate the sequence of iterations.