
Multilayer Perceptrons
CS/CMPE 333 – Neural Networks
Introduction

- Multilayer perceptrons (MLPs) are multilayer feedforward networks with continuously differentiable nonlinear activation functions
- MLPs can be considered an extension of the simple perceptron to multiple layers of neurons (hence the name multilayer perceptron)
- MLPs are trained by a popular supervised learning algorithm known as the error back-propagation (BP) algorithm
- Training an MLP involves both forward and backward information flow (hence the name back-propagation)
Historical Note

- The idea of having multiple layers of computation nodes dates from the early days of neural networks (1960s). However, at that time no algorithm was available to train such networks
- The emergence of the BP algorithm in the mid 1980s made it possible to train MLPs, which opened the way for their widespread use
- Since the mid 1980s there has been great interest in neural networks in general and MLPs in particular, with many contributions made to the theory and applications of neural networks
Distinguishing Characteristics of MLPs

Smooth nonlinear activation function
- Biological basis
- Ensures that the input-output mapping of the network is nonlinear
- The smoothness is essential for the BP algorithm

One or more hidden layers
- These layers enhance the mapping capability of the network
- Each additional layer can be viewed as adding more complexity to the feature detection capability of the network

High degree of connectivity
- Highly distributed information processing
- Fault tolerance
An MLP
Information Flow

- Function signals: input signals that move forward from layer to layer
- Error signals: error signals originate at the output layer and propagate backward layer by layer
The Back-Propagation (BP) Algorithm

- The BP algorithm is a generalization of the LMS algorithm. While the LMS algorithm applies corrections to a single layer of weights, the BP algorithm provides a mechanism for applying corrections to multiple layers of weights.
- The BP algorithm apportions errors at the output layer to errors at the hidden layers. Once the error at a neuron has been determined, the correction is applied according to the delta rule (as in the LMS algorithm).
Some Notations

- Indices i, j, k: identify neurons in successive layers; the layer containing neuron k follows the layer containing neuron j, which in turn follows the layer containing neuron i
- wji: weight associated with the link connecting neuron (or source node) i to neuron j
- yj(n), dj(n), ej(n), vj(n): actual response, target response, error, and net activity level of neuron j when pattern n is presented
- xi(n), oi(n): ith elements of the input vector and of the network output vector, respectively, when input vector x(n) is presented
- Neuron j has a bias input equal to -1 with weight wj0 = θj
BP Algorithm (1)

- Error at the output of neuron j at iteration n (nth training pattern):
  ej(n) = dj(n) – yj(n)
- Instantaneous sum of squared errors of the network:
  ξ(n) = ½ Σj∈C ej²(n)
  where C = set of all neurons in the output layer
- For N training patterns, the cost function is
  ξav = (1/N) Σn=1..N ξ(n)
- ξav is a function of the free parameters (weights and thresholds). The goal is to find the weights/thresholds that minimize ξav
- Weights are updated on a pattern-by-pattern basis
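To make these definitions concrete, here is a minimal Python/NumPy sketch that computes the instantaneous error energy ξ(n) and the average cost ξav for a small set of target/output pairs; the array names D and Y and the numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

def instantaneous_error_energy(d, y):
    """xi(n) = 1/2 * sum_j e_j(n)^2 for one pattern."""
    e = d - y                       # e_j(n) = d_j(n) - y_j(n)
    return 0.5 * np.sum(e ** 2)

def average_cost(D, Y):
    """xi_av = (1/N) * sum_n xi(n) over the N patterns (rows)."""
    return np.mean([instantaneous_error_energy(d, y) for d, y in zip(D, Y)])

# Example: 3 patterns, 2 output neurons
D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # desired responses
Y = np.array([[0.8, 0.1], [0.2, 0.7], [0.9, 0.6]])   # actual network outputs
print(average_cost(D, Y))
```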
BP Algorithm (2)

- Consider neuron j in the output layer:
  vj(n) = Σi=0..p wji(n) yi(n)
  and
  yj(n) = φj(vj(n))
- Instantaneous error-correction learning:
  wji(n+1) = wji(n) – η ∂ξ(n)/∂wji(n)
- Using the chain rule, the gradient can be written as
  ∂ξ(n)/∂wji(n) = [∂ξ(n)/∂ej(n)][∂ej(n)/∂yj(n)][∂yj(n)/∂vj(n)][∂vj(n)/∂wji(n)]
BP Algorithm (3)

- Computing the partial derivatives:
  ∂ξ(n)/∂ej(n) = ej(n)
  ∂ej(n)/∂yj(n) = -1
  ∂yj(n)/∂vj(n) = φj'(vj(n))
  ∂vj(n)/∂wji(n) = yi(n)
- The gradient at iteration n with respect to wji:
  ∂ξ(n)/∂wji(n) = –ej(n) φj'(vj(n)) yi(n)
- The weight change at iteration n:
  Δwji(n) = –η ∂ξ(n)/∂wji(n) = η ej(n) φj'(vj(n)) yi(n) = η δj(n) yi(n)
- This is known as the delta rule
BP Algorithm (4)

- Local gradient δj(n):
  δj(n) = –[∂ξ(n)/∂ej(n)][∂ej(n)/∂yj(n)][∂yj(n)/∂vj(n)] = ej(n) φj'(vj(n))
- Credit-assignment problem
  - How to compute ej(n)?
  - How to penalize and reward neuron j (the weights associated with neuron j) for ej(n)?
  - For the output layer?
  - For hidden layer(s)?
Output Layer

- If neuron j lies in the output layer, the desired response dj(n) is known and the error ej(n) can be computed directly
- Hence, the local gradient δj(n) can be computed and the weights updated using the delta rule (as given by the equations on the preceding slides); a small sketch follows below
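A minimal sketch of the output-layer case, assuming a logistic activation so that φ'(v) = y(1 − y); the variable names and numerical values are illustrative assumptions, not from the slides.

```python
import numpy as np

eta = 0.1                              # learning rate (assumed value)
y_i = np.array([1.0, 0.5, -0.3])       # outputs y_i(n) of the preceding layer
W = np.array([[0.2, -0.1, 0.4],        # W[j, i] = w_ji, one row per output neuron j
              [0.05, 0.3, -0.2]])
d = np.array([1.0, 0.0])               # desired responses d_j(n)

v = W @ y_i                            # net activity v_j(n)
y = 1.0 / (1.0 + np.exp(-v))           # y_j(n) with logistic activation
e = d - y                              # e_j(n) = d_j(n) - y_j(n)
delta = e * y * (1.0 - y)              # local gradient: delta_j = e_j * phi'(v_j)
W += eta * np.outer(delta, y_i)        # delta rule: w_ji <- w_ji + eta * delta_j * y_i
```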
Hidden Layer (1)
Hidden Layer (2)

- If neuron j lies in a hidden layer, there is no desired response with which to calculate the error signal
- The local gradient has to be computed recursively by considering all the neurons to which neuron j feeds:
  δj(n) = –[∂ξ(n)/∂yj(n)][∂yj(n)/∂vj(n)] = –[∂ξ(n)/∂yj(n)] φj'(vj(n))
- We need to calculate the value of the partial derivative ∂ξ(n)/∂yj(n)
Hidden Layer (3)

- The cost function is
  ξ(n) = ½ Σk∈C ek²(n)
- Partial derivatives:
  ∂ξ(n)/∂yj(n) = Σk ek(n)[∂ek(n)/∂vk(n)][∂vk(n)/∂yj(n)]
  ∂ek(n)/∂vk(n) = ∂[dk(n) – yk(n)]/∂vk(n) = ∂[dk(n) – φk(vk(n))]/∂vk(n) = –φk'(vk(n))
  ∂vk(n)/∂yj(n) = ∂[Σj=0..q wkj(n) yj(n)]/∂yj(n) = wkj(n)
Hidden Layer (4)

- Thus, we have
  ∂ξ(n)/∂yj(n) = –Σk ek(n) φk'(vk(n)) wkj(n) = –Σk δk(n) wkj(n)
- And hence
  δj(n) = φj'(vj(n)) Σk δk(n) wkj(n)
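A minimal sketch of back-propagating the local gradients to a hidden layer, again assuming logistic activations; delta_k and W_kj would come from the output-layer computation above, and the names and values are illustrative assumptions.

```python
import numpy as np

# Quantities assumed available from the forward pass and the output layer:
y_j = np.array([0.6, 0.3, 0.9])          # outputs y_j(n) of the hidden neurons
W_kj = np.array([[0.2, -0.1, 0.4],       # weights from hidden neuron j to output neuron k
                 [0.05, 0.3, -0.2]])
delta_k = np.array([0.08, -0.02])        # local gradients of the output neurons

# delta_j(n) = phi'(v_j(n)) * sum_k delta_k(n) * w_kj(n)
phi_prime = y_j * (1.0 - y_j)            # logistic derivative expressed via y_j
delta_j = phi_prime * (W_kj.T @ delta_k) # back-propagated local gradients
print(delta_j)
```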
BP Algorithm Summary

- Delta rule:
  wji(n+1) = wji(n) + Δwji(n)
  where Δwji(n) = η δj(n) yi(n)
- And δj(n) is given by:
  - If neuron j lies in the output layer: δj(n) = φj'(vj(n)) ej(n)
  - If neuron j lies in a hidden layer: δj(n) = φj'(vj(n)) Σk δk(n) wkj(n)
Sigmoidal Nonlinearity

- For the BP algorithm to work, we need the activation function φ to be continuous and differentiable
- For the logistic function:
  yj(n) = φj(vj(n)) = 1/[1 + exp(-vj(n))]
  ∂yj(n)/∂vj(n) = φj'(vj(n)) = exp(-vj(n))/[1 + exp(-vj(n))]² = yj(n)[1 - yj(n)]
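A small check of the logistic function and its derivative as stated above; the numerical differentiation is included only to verify the y(1 - y) identity and is not part of the slides.

```python
import numpy as np

def logistic(v):
    """phi(v) = 1 / (1 + exp(-v))"""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_derivative(v):
    """phi'(v) = y * (1 - y), with y = phi(v)"""
    y = logistic(v)
    return y * (1.0 - y)

v = 0.7
numeric = (logistic(v + 1e-6) - logistic(v - 1e-6)) / 2e-6
print(logistic_derivative(v), numeric)   # the two values should agree closely
```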
Learning Rate

- The learning-rate parameter controls the rate and stability of the convergence of the BP algorithm
  - Smaller η -> slower, smoother convergence
  - Larger η -> faster, unstable (oscillatory) convergence
- Momentum:
  Δwji(n) = α Δwji(n-1) + η δj(n) yi(n)
  - α is a (usually positive) constant called the momentum term
  - Improves stability of convergence by adding "momentum" from the previous change in weights
  - May prevent convergence to a shallow (local) minimum
  - This update rule is known as the generalized delta rule
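A sketch of the generalized delta rule with momentum; prev_dW stores Δw(n-1), and the parameter values and array shapes are placeholders chosen for illustration.

```python
import numpy as np

eta, alpha = 0.1, 0.9                      # learning rate and momentum term (assumed values)

delta = np.array([0.08, -0.02])            # local gradients delta_j(n)
y_i = np.array([1.0, 0.5, -0.3])           # inputs y_i(n) to this layer
W = np.zeros((2, 3))                       # weights w_ji
prev_dW = np.zeros_like(W)                 # Delta w_ji(n-1)

dW = alpha * prev_dW + eta * np.outer(delta, y_i)   # generalized delta rule
W += dW
prev_dW = dW                               # remembered for the next iteration
```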
Modes of Training

- N = number of training examples (input-output patterns) in the training set
- One complete presentation of the training set is called an epoch
- It is good practice to randomize the order of presentation of patterns from one epoch to another so as to enhance the search in weight space
Pattern Mode of Training

- Weights are updated after the presentation of each pattern
  - This is the manner in which the BP algorithm was derived in the preceding slides
- Average weight change after N updates:
  Δw'ji = (1/N) Σn=1..N Δwji(n) = –(η/N) Σn=1..N ∂ξ(n)/∂wji(n) = –(η/N) Σn=1..N ej(n) ∂ej(n)/∂wji(n)
Batch Mode of Training

- Weights are updated after the presentation of all patterns in the epoch
- Average cost function:
  ξav = (1/2N) Σn=1..N Σj∈C ej²(n)
  Δwji = –η ∂ξav/∂wji = –(η/N) Σn=1..N ej(n) ∂ej(n)/∂wji
- Comparison (a sketch contrasting the two modes follows below)
  - The weight updates after a complete epoch are different for the two modes
  - The pattern-mode update is an estimate of the batch-mode update
  - Pattern mode is suitable for on-line implementation, requires less storage, and provides a better search (because it is stochastic)
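A sketch contrasting the two modes for a single linear output layer (no hidden layer), just to show where the weight update happens: in pattern mode the update follows every pattern, in batch mode the gradient is accumulated over the whole epoch. The linear activation, data, and learning rate are simplifying assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 patterns, 3 inputs
D = X @ np.array([[0.5], [-1.0], [2.0]])     # targets from a known linear map
eta = 0.05

# Pattern (on-line) mode: update after every pattern, order shuffled for the epoch
W = np.zeros((1, 3))
for n in rng.permutation(len(X)):
    e = D[n] - W @ X[n]                      # error for this pattern (linear output neuron)
    W += eta * np.outer(e, X[n])             # delta rule applied immediately

# Batch mode: accumulate the gradient over the whole epoch, then update once
Wb = np.zeros((1, 3))
grad = np.zeros_like(Wb)
for n in range(len(X)):
    e = D[n] - Wb @ X[n]
    grad += np.outer(e, X[n])
Wb += eta / len(X) * grad
```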
Stopping Criteria

- The BP algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small threshold:
  ||g(w)|| < t
- The BP algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small:
  |[ξav(w(n+1)) - ξav(w(n))] / ξav(w(n))| < t
- The BP algorithm is terminated at the weight vector wfinal when ||g(wfinal)|| < t1, or ξav(wfinal) < t2
- The BP algorithm is stopped when the network's generalization properties are adequate
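A sketch of the first two stopping checks as they might appear inside a training loop; the gradient vector and the history of ξav values would come from the BP computation, and the threshold values are arbitrary assumptions.

```python
import numpy as np

grad_threshold = 1e-4      # t1: threshold on the Euclidean norm of the gradient
rel_threshold = 1e-3       # t: threshold on the relative change in xi_av per epoch

def should_stop(gradient, xi_av_history):
    """Return True if either stopping criterion from the slides is met."""
    if np.linalg.norm(gradient) < grad_threshold:          # ||g(w)|| < t1
        return True
    if len(xi_av_history) >= 2:
        prev, curr = xi_av_history[-2], xi_av_history[-1]
        if abs((curr - prev) / prev) < rel_threshold:       # relative change in xi_av
            return True
    return False
```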
Initialization

- The initial values assigned to the weights affect the performance of the network
- If prior information is available, it should be used to assign appropriate initial values to the weights
  - However, prior information is usually not known. Moreover, even when prior information about the problem is known, it is not possible to assign weights directly, since the behavior of an MLP is complex and not completely understood.
- If prior information is not available, the weights are initialized to uniform random values within a range, e.g. [0, 1]
- Premature saturation
Premature Saturation

- When the output of a neuron approaches the limits of the sigmoidal function, little change occurs in the weights. That is, the neuron is saturated, and learning and adaptation are hampered.
- How to avoid premature saturation?
  - Initialize weights from a uniform distribution within a small range (see the sketch below)
  - Use as few hidden neurons as possible
  - Premature saturation is least likely when neurons operate in the linear range (middle of the sigmoidal function)
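A minimal sketch of initializing weights from a uniform distribution over a small, zero-centred range so that the initial net activities fall in the roughly linear part of the sigmoid; the range of ±0.5 and the layer sizes are assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_out, n_in, limit=0.5):
    """Weights (including the bias column) drawn uniformly from [-limit, +limit]."""
    return rng.uniform(-limit, limit, size=(n_out, n_in + 1))  # +1 for the bias input of -1

W_hidden = init_layer(4, 2)   # e.g. 2 inputs, 4 hidden neurons
W_output = init_layer(2, 4)   # 4 hidden neurons, 2 output neurons
```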
The XOR Problem (1)
The XOR Problem (2)
Hyperbolic Tangent Activation Function

- The hyperbolic tangent is an antisymmetric (odd) sigmoidal function
  - Experience has indicated that using an antisymmetric activation function can speed up learning (i.e. it requires fewer training iterations)
- The hyperbolic tangent function varies from -a to a (as opposed to [0, a] for the logistic function), i.e. from -1 to 1 for a = 1:
  φ(v) = a tanh(bv) = a[1 – exp(-2bv)] / [1 + exp(-2bv)] = 2a/[1 + exp(-2bv)] – a
- Suggested values for a and b: a = 1.716, b = 2/3
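A sketch of the scaled hyperbolic tangent with the suggested constants. The derivative formula φ'(v) = (b/a)[a - φ(v)][a + φ(v)] follows from standard tanh algebra and is included as a convenience; it is not stated on the slide.

```python
import numpy as np

a, b = 1.716, 2.0 / 3.0                  # suggested constants from the slide

def phi(v):
    """phi(v) = a * tanh(b * v), ranging from -a to +a."""
    return a * np.tanh(b * v)

def phi_prime(v):
    """phi'(v) = (b/a) * (a - phi(v)) * (a + phi(v))."""
    y = phi(v)
    return (b / a) * (a - y) * (a + y)

v = 0.3
numeric = (phi(v + 1e-6) - phi(v - 1e-6)) / 2e-6
print(phi_prime(v), numeric)             # the two values should agree closely
```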
Some Implementation Tips (1)

- Normalize the desired (target) responses dj to lie within the limits of the activation function
  - If the activation function values range from -1.716 to 1.716, we can limit the desired responses to [-1, +1]
- The weights and thresholds should be uniformly distributed within a small range to prevent saturation of the neurons
- All neurons should desirably learn at the same rate. To achieve this, the learning-rate parameter can be set larger for layers farther away from the output layer
Some Implementation Tips (2)

- The order in which the training examples are presented to the network should be randomized (shuffled) from one epoch to another. This enhances the search for a better local minimum on the error surface.
- Whenever prior information is available, include it in the learning process
Pattern Classification

- Since the outputs of MLPs are continuous, we need to define decision rules for classification
- In general, classification into m classes requires m output neurons
- What decision rules should be used? (a sketch of both rules follows below)
  - A pattern x is classified to class k if output neuron k 'fires' (i.e. its output is greater than a threshold)
    - The problem with this decision rule is that it is ambiguous; more than one neuron may 'fire'
  - A pattern x is classified to class k if the output of neuron k is greater than that of all other neurons:
    yk > yj for all j not equal to k
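A small sketch of the two decision rules for an m-class problem; the output values and the threshold of 0.5 are illustrative assumptions.

```python
import numpy as np

y = np.array([0.2, 0.8, 0.6])            # outputs of the m = 3 output neurons for a pattern x

# Rule 1: neuron k "fires" if its output exceeds a threshold (can be ambiguous)
threshold = 0.5
fired = np.where(y > threshold)[0]       # here both neuron 1 and neuron 2 fire
print(fired)

# Rule 2: pick the neuron with the largest output (unambiguous up to ties)
k = int(np.argmax(y))                    # y_k > y_j for all j != k
print(k)
```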
Example (1)

- Classify between two 'overlapping' two-dimensional, Gaussian-distributed pattern classes
- Conditional probability density functions for the two classes:
  f(x | C1) = 1/(2πσ1²) exp[-||x – μ1||²/(2σ1²)], with mean μ1 = [0 0]T and variance σ1² = 1
  f(x | C2) = 1/(2πσ2²) exp[-||x – μ2||²/(2σ2²)], with mean μ2 = [2 0]T and variance σ2² = 4
  - x = [x1 x2]T = two-dimensional input
  - C1 and C2 = class labels
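A sketch of drawing training patterns from the two class-conditional densities above (isotropic Gaussians); equal class priors and the sample size are assumptions, since the slides do not state them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 500

mu1, var1 = np.array([0.0, 0.0]), 1.0    # class C1: mean [0 0]^T, variance 1
mu2, var2 = np.array([2.0, 0.0]), 4.0    # class C2: mean [2 0]^T, variance 4

X1 = mu1 + np.sqrt(var1) * rng.standard_normal((n_per_class, 2))
X2 = mu2 + np.sqrt(var2) * rng.standard_normal((n_per_class, 2))

X = np.vstack([X1, X2])                                     # training inputs
labels = np.array([0] * n_per_class + [1] * n_per_class)    # 0 -> C1, 1 -> C2
```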
Example (2)
Example (3)
Example (4)

- Consider a two-input, four-hidden-neuron, two-output MLP
  - Decision rule: an input x is classified to C1 if y1 > y2
- The training set is generated from the probability distribution functions
  - Using the BP algorithm, the network is trained for minimum mean-square error
- The testing set is generated from the probability distribution functions
  - The trained network is tested for correct classification
- For other implementation details, see the Matlab code
Example (5)
Experimental Design

- Number of hidden layers
  - Use the minimum number of hidden layers that gives the best performance (least mean-square error or best generalization)
  - In general, more than 2 hidden layers is rarely necessary
- Number of hidden layer neurons
  - Use the minimum number of hidden layer neurons (>= 2) that gives the best performance
- Learning-rate and momentum parameters, chosen as:
  - The parameters that, on average, yield convergence to a local minimum in the least number of epochs
  - The parameters that, on average or in the worst case, yield convergence to the global minimum in the least number of epochs
  - The parameters that, on average, yield a network with the best generalization
Generalization (1)

- A neural network is trained to learn the input-output patterns presented to it by minimizing an error function (e.g. mean-square error)
  - In other words, the neural network tries to learn the given input-output mapping or association as accurately as possible
  - But... can it generalize properly? And what is generalization?
- A network is said to generalize well when the input-output relationship computed by the network is correct (or nearly so) for input-output patterns (test data) never used in creating and training the network
  - In other words, a network generalizes well when it learns the input-output mapping of the system from which the training data was obtained
Generalization (2)

Properly fitted data; good generalization
Generalization (3)

Overfitted data; bad generalization
Generalization (4)

- How to achieve good generalization?
- In general, good generalization implies a smooth nonlinear input-output mapping
  - A rigorous mathematical criterion was presented by Poggio and Girosi (1990)
- In general, the simplest function that maps the input-output patterns will tend to be the smoothest
- In a neural network...
  - use the simplest architecture possible, with as few hidden neurons as needed for the mapping
  - use a training set that is consistent with the complexity of the architecture (i.e. more patterns for more complex networks)
Cross-Validation

- The design of a neural network is experimental. We select the "best" network parameters based on a criterion
- From statistics, cross-validation provides a systematic way of experimenting (a partitioning sketch follows below)
  - Randomly partition the data into training and testing samples
  - Further partition the training sample into an estimation sample and an evaluation (cross-validation) sample
  - Find the "best" model by training with the estimation sample and validating with the evaluation sample
  - Once the model has been found, train it on the entire training sample and test its generalization using the testing sample
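A sketch of the partitioning described above; the split fractions (80/20 and 75/25) and the data size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patterns = 1000
indices = rng.permutation(n_patterns)

# Randomly partition the data into training and testing samples (assumed 80/20 split)
n_train = int(0.8 * n_patterns)
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Further partition the training sample into estimation and evaluation samples (assumed 75/25)
n_est = int(0.75 * n_train)
est_idx, eval_idx = train_idx[:n_est], train_idx[n_est:]

# Model selection would train candidate networks on est_idx and compare them on eval_idx;
# the chosen architecture is then retrained on train_idx and assessed once on test_idx.
```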
Universal Approximator

- A neural network can be thought of as a mapping from a p-dimensional Euclidean space to a q-dimensional Euclidean space. In other words, a neural network can learn a function s: Rp -> Rq
- A multilayer feedforward network with nonlinear, bounded, monotone-increasing activation functions is a universal approximator
  - Universal approximation: the ability to learn a continuous nonlinear mapping to any degree of accuracy
- This is an existence theorem. That is, it does not say anything about practical optimality (complexity, computation time, etc.)
MLP and BP: Remarks (1)

- The BP algorithm is the most popular algorithm for supervised training of MLPs. It is popular because
  - It is simple to compute locally
  - It performs stochastic gradient descent in weight space (for the pattern-by-pattern mode of training)
- The BP algorithm does not have a biological basis. Many of its operations are not found in biological neural networks. Nevertheless, it is of great engineering importance.
- The hidden layers act as feature extractors/detectors
  - An MLP can be trained to learn an identity mapping (i.e. map a pattern to itself), in which case the hidden layers act as feature extractors
MLP and BP: Remarks (2)




A MLP with BP is an universal approximator in the
sense that it can approximate any continuous
multivariate function to any desired degree of accuracy,
provided that sufficiently many hidden neurons are
available
The BP algorithm is a first-order approximation of the
method of steepest descent. Consequently, it converges
slowly.
The BP algorithm is a hill-climbing technique and
therefore suffers from the possibility of getting trapped
in a local minimum of the cost surface
The MLP scales poorly because of full connectivity
Accelerating Convergence

Four heuristics (a sketch implementing heuristics 3 and 4 follows below):
1. Every adjustable network parameter of the cost function should have its own individual learning-rate parameter
2. Every learning-rate parameter should be allowed to vary from one iteration to the next
3. When the derivative of the cost function with respect to a synaptic weight has the same algebraic sign for several consecutive iterations of the algorithm, the learning-rate parameter for that particular weight should be increased
4. When the algebraic sign of the derivative of the cost function with respect to a particular synaptic weight alternates for several consecutive iterations of the algorithm, the learning-rate parameter for that weight should be decreased
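A simplified sketch of heuristics 3 and 4: each weight keeps its own learning rate, which is increased when the gradient sign persists and decreased when it alternates. The adjustment factors and the use of a single previous gradient (rather than "several consecutive iterations") are simplifying assumptions, not the slides' prescription.

```python
import numpy as np

increase, decrease = 1.05, 0.7          # assumed adjustment factors

def adapt_learning_rates(etas, grad, prev_grad):
    """Per-weight learning rates: grow where the gradient sign persists, shrink where it flips."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, etas * increase, etas * decrease)

# Example usage inside a hypothetical training loop:
etas = np.full((2, 3), 0.1)             # one learning rate per weight
prev_grad = np.zeros((2, 3))
grad = np.array([[0.2, -0.1, 0.05], [0.0, 0.3, -0.4]])
etas = adapt_learning_rates(etas, grad, prev_grad)
W = np.zeros((2, 3))
W -= etas * grad                        # gradient-descent step with individual rates
prev_grad = grad
```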