No Slide Title

Download Report

Transcript No Slide Title

Neural Networks
• Artificial neural network (ANN) is a machine learning
approach inspired by the way in which the brain performs a
particular learning task.
• ANNs are modeled on human brain and consists of a
number of artificial neurons.
• Neuron in ANNs tend to have fewer connections than
biological neurons.
• Each neuron in ANN receives a number of inputs.
• A function called activation function is applied to these
inputs which results in activation level of neuron (output
value of the neuron).
• Knowledge about the learning task is given in the form of
examples called training examples.
• An Artificial Neural Network is specified by:
• neuron model: the information processing unit of the
NN,
• an architecture: a set of neurons and links
connecting neurons. Each link has a weight,
• a learning algorithm: used for training the NN
by modifying the weights in order to model a
particular learning task correctly on the training
examples.
• The aim is to obtain a NN that generalizes well,
i.e., that behaves correctly on new instances of the
learning task.
The Neuron
• The neuron is the basic information processing unit of a
NN. It consists of:
1 A set of links, describing the neuron inputs, with
weights W1, W2, …, Wm
2 An adder function (linear combiner) for computing the
weighted sum of the inputs:
m
(real numbers)
u   wjxj
j 1
3 Activation function  for limiting the amplitude of the
neuron output. Here ‘b’ denotes bias.
y   (u  b)
The Neuron diagram
Bias
b
x1
Input
values
w1
x2
w2

xm

wm
weights

Local
Field
Summing
function
v
Activation
function
 ()
Output
y
Bias of a Neuron
• The bias b has the effect of applying an transformation to
the weighted sum u
v=u+b
• The bias is an external parameter of the neuron. It can be
modeled by adding an extra input.
• v is called induced field of the neuron
m
v   wj x j
j 0
w0  b
Neuron Models
• The choice of activation function  determines the neuron model.
Examples:
a if v  c
• step function:
 (v )  
b if v  c
• ramp function:
a if v  c

 (v )  b if v  d
a  ((v  c )(b  a ) /(d  c )) otherwise

• sigmoid function with z,x,y parameters
1
 (v )  z 
1  exp( xv  y )
• Gaussian function:
 1  v   2 
1
 (v ) 
exp  
 

2 
 2   
Network architectures
• Three different classes of network architectures
– single-layer feed-forward
– multi-layer feed-forward
– recurrent
• The architecture of a neural network is linked with
the learning algorithm used to train
Single Layer Feed-forward
Input layer
of
source nodes
Output layer
of
neurons
Perceptron: Neuron Model
(Special form of single layer feed forward)
• The perceptron was first proposed by Rosenblatt (1958) is a simple
neuron that is used to classify its input into one of two categories.
• A perceptron uses a step function that returns +1 if weighted sum of
its input is greater than equal to a threshold  and -1 otherwise
 (v)
 1 if v  0
 
 1 if v  0
b (bias)
x1
x2
w1
w2
wn
xn
v
(v)
y
Perceptron for Classification
• The perceptron is used for binary classification.
• How can we train a perceptron for a classification task?
– We try to find suitable values for the weights in such a way
that the training examples are correctly classified.
– Geometrically, we try to find a hyper-plane that separates the
examples of the two classes.
• The perceptron can only model linearly separable classes.
• When the two classes are not linearly separable, it may be desirable
to obtain a linear separator that minimizes the mean squared error.
• Given training examples of classes C1, C2 train the perceptron in
such a way that :
– If the output of the perceptron is +1 then the input is assigned
to class C1
– If the output is -1 then the input is assigned to C2
Learning Process for perceptron
• Initially assign random weights to input between -0.5 and +0.5
• Training data is presented to perceptron and its output is observed.
• If output is incorrect, the weights are adjusted accordingly using following
formula.
wi  wi + (a* xi *e),
where e is error produced and ‘a’ is learning rate (0  a  1)
• ‘a’ is defined as 0 if output is correct, it is +ve if output is too low and –ve if
output is too high.
• Once the modification to weights has taken place, the next piece of training
data is used in the same way.
• Once all the training data have been applied, the process starts again until
all the weights are correct and all errors are zero.
• Each iteration of this process is known as an epoch.
Example: Perceptron to learn OR function
• Initially consider w1 = -0.2 and w2 = 0.4
• Training data say, x1 = 0 and x2 = 0. Expected output is 0.
• Compute y = Step(w1*x1 + w2*x2) = 0. Output is correct so
weights are not changed.
• For training data x1=0 and x2 = 1, output is 1
Compute y = Step(w1*x1 + w2*x2) = 0.4 = 1. Output is
correct so weights are not changed.
• Next training data x1=1 and x2 = 0 and output is 1
Compute y = Step(w1*x1 + w2*x2) = - 0.2 = 0. Output is
incorrect, hence weights are changed.
• Assume a = 0.2 and error e=1
wi = wi + (a * xi * e) gives w1 = 0 and w2 =0.4
• Repeat the process till we get stable result.
Perceptron: Limitations
• The perceptron can only model linearly separable
functions, (those can be drawn in 2-dim graph and single straight
line separating values in two part ) such as the following
boolean functions:
– AND
– OR
– COMPLEMENT
– It cannot model the XOR (non linearly separable).
• When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the
mean squared error.
Multi layer feed-forward
3-4-2 Network
Output
layer
Input
layer
Hidden Layer
Multi layer feed-forward NN (FFNN)
• FFNN is a more general network architecture.
• Between the input and output layers, there are hidden layers, as illustrated
below.
• Hidden nodes do not directly receive inputs nor send outputs to the external
environment.
• FFNNs overcome the limitation of single-layer NN: they can handle nonlinearly separable learning tasks.
Input
layer
Output
layer
Hidden Layer
XOR problem
• A typical example of non-linearly separable function is the XOR. This
function takes two input arguments with values in {-1,1} and returns one
output in {-1,1}, as specified in the following table:
x1
-1
-1
1
1
x2
-1
1
-1
1
x1 xor x2
-1
1
1
-1
• Here -1 and 1 are encoding of the truth values false and true,
Respectively and XOR computes the logical exclusive or, which yields
true if and only if the two inputs have different
truth values.
XOR problem
• In this graph of the XOR, input pairs giving output equal to 1 and -1 are
depicted with green and red circles, respectively.
• These two classes (green and black) cannot be separated using a line. We
have to use two lines, like those depicted in blue.
x1
1
-1
1
x2
-1
XOR problem
• The following NN with two hidden nodes realizes this non-linear separation,
where each hidden node describes one of the two blue lines.
•This NN uses the sign activation function. Arrows from input nodes to two
hidden nodes indicate the directions of the weight vectors (1,-1) and (-1,1).
• They indicate the regions where the network output will be 1.
• The output node is used to combine the outputs of the two hidden nodes.
-1
0.1
+1
x1
+1
-1
-1
x2
+1
+1
-1
FFNN NEURON MODEL
• The classical learning algorithm of FFNN is based on the
gradient descent method. For this reason the activation function
used in FFNN are continuous functions of the weights,
differentiable everywhere.
• A typical activation function that can be viewed as a continuous
approximation of the step (threshold) function is the Sigmoid
Function. The activation function for node j is:
 (v j ) 
wh ere v j 
1
w
1
av j
e
ji
wit h a  0
yi
i
wit h w ji weigh t o f lin k fro m n o de i
t o n o de j an d yi o ut p ut o f n o de i
FFNN NEURON MODEL
 (v j )
1
Increasing a
vj
-10 -8 -6 -4 -2
2
4
6
8
10
when ‘a’ tends to infinity then  becomes the step
function
Training Algorithm: Backpropagation
• The Backpropagation algorithm learns in the same way as
single perceptron.
• It searches for weight values that minimize the total error of
the network over the set of training examples (training set).
• Backprop consists of the repeated application of the
following two passes:
– Forward pass: in this step the network is activated on one example
and the error of (each neuron of) the output layer is computed.
– Backward pass: in this step the network error is used for updating
the weights (credit assignment problem). Start at the output layer,
the error is propagated backwards through the network layer by
layer. This is done by recursively computing the local gradient of
each neuron.
Backprop algorithm – Contd.
• Consider a network of three layers. Let us use i to represent nodes in
input layer, j to represent nodes in hidden layer and k represent nodes in
output layer.
• wij refers to weight of connection between a node in input layer and
node in hidden layer.
• The following equation is used to derive the output value Yj of node j
threshold for node j
Xj =  xi . wij - j , 1 i  n
Yj 
1
1
X j
e
the number of inputs to node j,
Total Mean Squared Error
• The error of output neuron k after the activation of the network on
the n-th training example (x(n), d(n)) is:
ek(n) = dk(n) – yk(n)
• The network error is the sum of the squared errors of the output
neurons:
E(n) 
1
2
2
e
 k (n)
j output node
• The total mean squared error is the average of the network errors
of the training examples.
EAV 
1
N
N
 E (n)
n 1
Weight Update Rule
• The Backprop weight update rule is based on the gradient
descent method:
• It takes a step in the direction yielding the maximum decrease
of the network error E.
• This direction is the opposite of the gradient of E.
wij  wij  wij
E
w ij  -
w ij
• Iteration of the Backprop algorithm is usually terminated when the
sum of squares of errors of the output values for all training data in
an epoch is less than some threshold such as 0.01
Backprop
• Back-propagation training algorithm
Network activation
Forward Step
Error propagation
Backward Step
• Backprop adjusts the weights of the NN in order to
minimize the network total mean squared error.
Error backpropagation
The flow-graph below illustrates how errors are backpropagated to hidden neuron j
w1
j ’(vj)
j
1
’(v1)
e1
k
’(vk)
ek
’(vm)
em
wk
j
wm j
m
Backprop learning algorithm
(incremental-mode)
n=1;
initialize w(n) randomly;
while (stopping criterion not satisfied or n <max_iterations)
for each example (x,d)
- run the network with input x and compute the output y
- update the weights in backward order starting from those of
the output layer:
w ji  w ji  w ji
with w ji computed using the (generalized) Delta rule
end-for
n = n+1;
end-while;
Backprop algorithm
• In the batch-mode the weights are updated only after
all examples have been processed, using the formula
w ji  w ji 
 w
x
ji
x trainingexample
• The learning process continues until the stopping
condition is satisfied.
• In the incremental mode choose a randomized
ordering for selecting the examples in the training
set in order to avoid poor performance.
Stopping criterions
• Sensible stopping criterions:
– total mean squared error change:
Back-prop is considered to have converged when the absolute
rate of change in the average squared error per epoch is sufficiently
small (in the range [0.1, 0.01]).
– generalization based criterion:
After each epoch the NN is tested for generalization. If the
generalization performance is adequate then stop. If this stopping
criterion is used then the part of the training set used for testing the
network generalization will not used for updating the weights.
Applicability of FFNN
Boolean functions:
• Every boolean function can be represented by a
network with a single hidden layer
Continuous functions:
• Every bounded piece-wise continuous function
can be approximated with arbitrarily small error
by a network with one hidden layer.
• Any continuous function can be approximated to
arbitrary accuracy by a network with two hidden
layers.
Applications of FFNN
Classification, pattern recognition:
• FFNN can be applied to tackle non-linearly separable
learning problems.
– Recognizing printed or handwritten characters,
– Face recognition
– Classification of loan applications into credit-worthy and noncredit-worthy groups
– Analysis of sonar radar to determine the nature of the source of a
signal
Regression and forecasting:
• FFNN can be applied to learn non-linear functions
(regression) and in particular functions whose inputs is a
sequence of measurements over time (time series).
Recurrent network
• Feed forward network is acyclic ( no cycle) as data passes from input
to the output nodes and not vice versa.
• Once the FFNN is trained, its state is fixed and does not alter as new
data is presented to it. It does not have memory.
• A recurrent network can have connections that go backward from
output to input nodes .
• In this way , a recurrent network’s internal state can alter as sets of
input data are presented. It can be said to have memory.
• Useful in solving problems where the solution depends not just on the
current inputs but on all previous inputs. (predict stock market price,
weather forecast).
• When learning, a recurrent network feeds its inputs through the
network, including feeding data back from outputs to inputs an repeats
this process until the values of the outputs do not change.
• This state is called equilibrium or stability.
Recurrent network
Recurrent Network with hidden neuron: unit delay operator d is
used to model a dynamic system
d
d
d
input
hidden
output
• A recurrent network can have connections that go
backward from output nodes to input nodes and in fact
can have arbitrary connections between any nodes.
• While learning, the recurrent network feeds its inputs
through the network including feeding data back from
outputs to inputs and repeat this process until the
values of the outputs do not change.
• Such point is called a state of equilibrium or stability.
• Useful applications to predict
• stock market,
• weather etc.
NN DESIGN
•
•
•
•
•
Data representation
Network Topology
Network Parameters
Training
Validation
Data Representation
• Data representation depends on the problem. In general ANNs
work on continuous (real valued) attributes. Therefore symbolic
attributes are encoded into continuous ones.
• Attributes of different types may have different ranges of values
which affect the training process. Normalization may be used,
like the following one which scales each attribute to assume
values between 0 and 1.
xi  mini
xi 
maxi  mini
for each value x i of attribute i , where mini and maxiare the
minimum and maximum value of that attribute over the
training set.
Network Topology
• The number of layers and of neurons depend on
the specific task. In practice this issue is solved
by trial and error.
• Two types of adaptive algorithms can be used:
– start from a large network and successively remove
some neurons and links until network performance
degrades.
– begin with a small network and introduce new
neurons until performance is satisfactory.
Network parameters
• How are the weights initialized?
• How is the learning rate chosen?
• How many hidden layers and how many
neurons?
• How many examples in the training set?
Initialization of weights
• In general, initial weights are randomly chosen, with typical
values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random
initialization may bias the network to give much more
importance to larger inputs. In such a case, weights can be
initialized as follows:
w ji   21N
wkj   21N

For weights from the input to
the first layer
1
|x i |
i 1,...,N

i 1,...,N
(
1
w ji x )

i
For weights from the first to the
second layer
Choice of learning rate
• The right value of  depends on the application.
Values between 0.1 and 0.9 have been used in
many applications.
• Other heuristics adapt  during the training as
described in previous slides.
Training
• Rule of thumb:
– the number of training examples should be at least
five to ten times the number of weights of the
network.
• Other rule:
|W|
N
(1 - a)
|W|= number of weights
a=expected accuracy on test set
Radial-Basis Function Networks
• A function is radial basis (RBF) if its output depends on (is a nonincreasing function of) the distance of the input from a given stored
vector.
• RBFs represent local receptors, as illustrated below, where each green
point is a stored vector used in one RBF.
• In a RBF network one hidden layer uses neurons with RBF activation
functions describing local receptors. Then one output node is used to
combine linearly the outputs of the hidden neurons.
w3
w2
w1
The output of the red vector
is “interpolated” using the three
green vectors, where each vector
gives a contribution that depends on
its weight and on its distance from
the red point. In the picture we have
w1  w3  w2
RBF ARCHITECTURE
x1
1
w1
x2
y
 m1
wm1
xm
• One hidden layer with RBF activation functions
1...m1
• Output layer with linear activation function.
y  w11 (|| x  t1 ||)  ...  wm1m1 (|| x  tm1 ||)
|| x  t || distanceof x  ( x1,...,xm ) from vectort
HIDDEN NEURON MODEL
• Hidden units: use radial basis functions
φ( || x - t||)
the output depends on the distance of
the input x from the center t
x1
x2
xm

φ( || x - t||)
t is called center
 is called spread
center and spread are parameters
Hidden Neurons
• A hidden neuron is more sensitive to data points
near its center.
• For Gaussian RBF this sensitivity may be tuned
by adjusting the spread , where a larger spread
implies less sensitivity.
• Biological example: cochlear stereocilia cells (in
our ears ...) have locally tuned frequency
responses.
Gaussian RBF φ
φ:
center
 is a measure of how spread the curve is:
Large 
Small 
Types of φ
• Multiquadrics:
 (r)  (r  c )
2
2
1
2
• Inverse multiquadrics:
1
 ( r )  2 2 12
(r  c )
c0
r || x  t ||
c0
• Gaussian functions (most used):
 r2 
 ( r )  exp   2 
 2 
 0
Example: the XOR problem
• Input space:
• Output space:
x2
(0,1)
(1,1)
(0,0)
(1,0)
0
1
x1
y
• Construct an RBF pattern classifier such that:
(0,0) and (1,1) are mapped to 0, class C1
(1,0) and (0,1) are mapped to 1, class C2
Example: the XOR problem
• In the feature (hidden layer) space:
1 (|| x  t1 ||)  e
|| x t1 ||2
 2 (|| x  t2 ||)  e
φ2
(0,0)
1.0
0.5
(0,1) and (1,0)
with t1  (1,1) and t2  (0,0)
|| x t2 ||2
Decision boundary
(1,1)
0.5
1.0
φ1
• When mapped into the feature space < 1 , 2 > (hidden layer), C1 and C2
become linearly separable. So a linear classifier with 1(x) and 2(x) as
inputs can be used to solve the XOR problem.
RBF NN for the XOR problem
1 (|| x  t1 ||)  e
|| x t1 ||2
 2 (|| x  t2 ||)  e
with t1  (1,1) and t2  (0,0)
|| x t2 ||2
x1
t1
-1
y
x2
t2
-1
+1
y  e
|| x t1||2
e
|| x t2 ||2
1
If y  0 thenclass 1 otherwiseclass 0
RBF network parameters
• What do we have to learn for a RBF NN with a
given architecture?
– The centers of the RBF activation functions
– the spreads of the Gaussian RBF activation functions
– the weights from the hidden to the output layer
• Different learning algorithms may be used for
learning the RBF network parameters. We
describe three possible methods for learning
centers, spreads and weights.
Learning Algorithm 1
• Centers: are selected at random
– centers are chosen randomly from the training set
• Spreads: are chosen by normalization:
Maximumdistancebetween any 2 centers dmax


m
number of centers
1
• Then the activation function of hidden neuron
becomes:

i x  t i
2

 m1
 exp  2 x  t i
 d max
2



i
Learning Algorithm 1
• Weights: are computed by means of the pseudoinverse method.
– For an example
network
the output of the
( xi , dconsider
i)
y( xi )  w11(|| xi  t1 ||)  ...  wm1m1(|| xi  tm1 ||)
– We would like
y( xi )  dfor
i each example, that is
w11(|| xi  t1 ||)  ...  wm1m1 (|| xi  tm1 ||)  di
Learning Algorithm 1
• This can be re-written in matrix form for one example
1(|| xi  t1 ||) ...m1(|| xi  tm1 ||)[w1...wm1]T  di
and
1 (|| x1  t1 ||)... m1 (|| x1  tm1 ||) 
...
[ w ...w ]T  [d ...d ]T
1
N

 1 m1
1 (|| x N  t1 ||)... m1 (|| x N  tm1 ||)
for all the examples at the same time
Learning Algorithm 1
 1 (|| x1  t1 ||) ...  m1 (|| xN  tm1 ||)


...


1 (|| x N  t1 ||) ...  m1 (|| xN  tm1 ||)
let
then we can write
If
 w1  d1 
 ...   ... 
   
 wm1  d N 
is the pseudo-inverse of the matrix
we

obtain the weights using the following formula



[w1...wm1 ]   [d1...d N ]
T
T
Learning Algorithm 1: summary
1. Choose the centers randomly from the
training set.
2. Compute the spread for the RBF function
using the normalization method.
3. Find the weights using the pseudo-inverse
method.
Learning Algorithm 2: Centers
• clustering algorithm for finding the centers
1 Initialization: tk(0) random k = 1, …, m1
2 Sampling: draw x from input space
3 Similarity matching: find index of center closer to x
k(x) arg mink x(n) t k (n)
4 Updating: adjust centers
t k ( n  1) 
t k (n)  x(n) t k (n)
t k ( n)
if k  k(x)
otherwise
5 Continuation: increment n by 1, goto 2 and continue until no
noticeable changes of centers occur
Learning Algorithm 2: summary
• Hybrid Learning Process:
• Clustering for finding the centers.
• Spreads chosen by normalization.
• LMS algorithm (see Adaline) for finding the
weights.
Learning Algorithm 3
• Apply the gradient descent method for finding centers,
spread and weights, by minimizing the (instantaneous)
1
squared error
E  ( y ( x )  d )2
2
• Update for:
centers
spread
weights
E
t j  t j
 tj
E
 j   j
 j
E
w ij  ij
w ij
Comparison with FF NN
RBF-Networks are used for regression and for performing
complex (non-linear) pattern classification tasks.
Comparison between RBF networks and FFNN:
• Both are examples of non-linear layered feed-forward networks.
• Both are universal approximators.
Comparison with multilayer NN
• Architecture:
– RBF networks have one single hidden layer.
– FFNN networks may have more hidden layers.
• Neuron Model:
– In RBF the neuron model of the hidden neurons is different from the one of
the output nodes.
– Typically in FFNN hidden and output neurons share a common neuron
model.
– The hidden layer of RBF is non-linear, the output layer of RBF is linear.
– Hidden and output layers of FFNN are usually non-linear.
Comparison with multilayer NN
• Activation functions:
– The argument of activation function of each hidden neuron in a
RBF NN computes the Euclidean distance between input vector
and the center of that unit.
– The argument of the activation function of each hidden neuron in
a FFNN computes the inner product of input vector and the
synaptic weight vector of that neuron.
• Approximation:
– RBF NN using Gaussian functions construct local approximations
to non-linear I/O mapping.
– FF NN construct global approximations to non-linear I/O mapping.
Application: FACE RECOGNITION
• The problem:
– Face recognition of persons of a known group in an
indoor environment.
• The approach:
– Learn face classes over a wide range of poses using an
RBF network.
Dataset
• database
– 100 images of 10 people (8-bit grayscale, resolution 384 x
287)
– for each individual, 10 images of head in different pose
from face-on to profile
– Designed to asses performance of face recognition
techniques when pose variations occur
Datasets
All ten images for
classes 0-3 from
the Sussex
database, nosecentred and
subsampled to
25x25 before
preprocessing
Approach: Face unit RBF
• A face recognition unit RBF neural networks is trained
to recognize a single person.
• Training uses examples of images of the person to be
recognized as positive evidence, together with selected
confusable images of other people as negative evidence.
Network Architecture
• Input layer contains 25*25 inputs which represent the
pixel intensities (normalized) of an image.
• Hidden layer contains p+a neurons:
– p hidden pro neurons (receptors for positive evidence)
– a hidden anti neurons (receptors for negative evidence)
• Output layer contains two neurons:
– One for the particular person.
– One for all the others.
The output is discarded if the absolute difference of the two output
neurons is smaller than a parameter R.
RBF Architecture for one face recognition
Output units
Linear
Supervised
RBF units
Non-linear
Unsupervised
Input units
Hidden Layer
• Hidden nodes can be:
– Pro neurons: Evidence for that person.
– Anti neurons: Negative evidence.
• The number of pro neurons is equal to the positive examples of
the training set. For each pro neuron there is either one or two
anti neurons.
• Hidden neuron model: Gaussian RBF function.
Training and Testing
• Centers:
– of a pro neuron: the corresponding positive example
– of an anti neuron: the negative example which is most similar to the
corresponding pro neuron, with respect to the Euclidean distance.
• Spread: average distance of the center from all other centers. So
the spread
of a hidden 
neuron
n is
n
n 
1
H
|| t

2
h
where H is the number of hidden neurons and
n
 t ||
h
i
ist the center of neuron .
i determined using the pseudo-inverse method.
• Weights:
• A RBF network with 6 pro neurons, 12 anti neurons, and R equal to 0.3,
discarded 23 pro cent of the images of the test set and classified correctly 96
pro cent of the non discarded images.