ANALYSIS AND IMPROVEMENT OF MULTIPLE OPTIMAL LEARNING FACTORS FOR FEED-FORWARD NETWORKS
PRAVEEN JESUDHAS
Graduate Student, Dept. of Electrical Engineering, University of Texas at Arlington
DR. MICHAEL T. MANRY, Sr. Member IEEE
Professor, Dept. of Electrical Engineering, University of Texas at Arlington
Contents of Presentation
Neural network overview
Review of the multilayer perceptron
Review of first-order training methods
The MOLF algorithm
Analysis of the MOLF algorithm
Collapsing the Hessian of the MOLF algorithm
Experimental results
Neural net overview
Notation:
  xp - input vector
  yp - actual network output
  tp - desired output
  Nv - number of training patterns
[Figure: contents of a training file. The file holds Nv patterns, each pairing an input vector with its desired output: (x1, t1), (x2, t2), ..., (xNv, tNv). The neural network maps each input xp to an output yp.]
The Multilayer Perceptron
[Figure: a single-hidden-layer MLP. The input layer xp(1), ..., xp(N+1) feeds the hidden layer through the input weight matrix W, producing the net functions np(1), ..., np(Nh) and the hidden unit outputs op(1), ..., op(Nh). The output layer yp(1), ..., yp(M) is formed from the hidden units through the output weights Woh and directly from the inputs through the bypass weights Woi.]
op(k) = f(np(k))
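To make the notation concrete, here is a minimal forward-pass sketch in NumPy for the network above; the sigmoid activation f and the convention that xp(N+1) is a constant bias input are assumptions, since the slides do not state them.

import numpy as np

def mlp_forward(xp, W, Woh, Woi, f=lambda n: 1.0 / (1.0 + np.exp(-n))):
    """Forward pass of the single-hidden-layer MLP with bypass weights.
    xp  : augmented input vector, shape (N+1,) (last element assumed to be a bias of 1)
    W   : input weights,            shape (Nh, N+1)
    Woh : hidden-to-output weights, shape (M, Nh)
    Woi : input-to-output bypass weights, shape (M, N+1)
    """
    n_p = W @ xp                 # net functions np(k), one per hidden unit
    o_p = f(n_p)                 # hidden unit outputs op(k) = f(np(k))
    y_p = Woh @ o_p + Woi @ xp   # outputs combine hidden and bypass contributions
    return n_p, o_p, y_p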
Overview of Back Propagation (BP)
Step 1: The weights to be trained are initialized with random values.
Step 2: The gradient vector G of the error E with respect to the weights is computed.
Step 3: In each iteration the weights are updated along G, scaled by a learning factor.
Step 4: Steps 2 and 3 are repeated until the iteration limit is reached.
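The error measure, gradient, and update formulas appeared as images in the original slides; the reconstruction below is a standard formulation consistent with the deck's notation, assuming a mean-squared error E, G taken as the negative gradient direction, and a scalar learning factor z.

E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} \left[ t_p(i) - y_p(i) \right]^2,
\qquad
G = -\frac{\partial E}{\partial W},
\qquad
W \leftarrow W + z \cdot G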
Overview of OWO-BP
The output and bypass weights Woh and Woi are solved linearly using orthogonal least squares (OLS). This process is denoted Output Weight Optimization (OWO).
The input weights W are then updated by BP.
This is faster and better than using only BP to update all the weights.
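As a rough illustration of the OWO step, the sketch below solves for the output and bypass weights by linear least squares with the hidden units held fixed; the helper name owo_step and the use of numpy.linalg.lstsq in place of the orthogonal least squares routine from the slides are illustrative simplifications.

import numpy as np

def owo_step(X, O, T):
    """Solve the output weights linearly with the hidden units fixed.
    X : augmented inputs,    shape (Nv, N+1)
    O : hidden unit outputs, shape (Nv, Nh)
    T : desired outputs,     shape (Nv, M)
    Returns Woh (M, Nh) and Woi (M, N+1) minimizing the squared output error.
    """
    A = np.hstack([O, X])                        # each output is a linear function of [op, xp]
    Wo, *_ = np.linalg.lstsq(A, T, rcond=None)   # least-squares solution, shape (Nh+N+1, M)
    Nh = O.shape[1]
    Woh, Woi = Wo[:Nh].T, Wo[Nh:].T
    return Woh, Woi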
MOLF BP (Review)
In this process, the output and bypass weights are calculated through OWO as described before.
The hidden weights are updated through the back-propagation procedure, but with a separate optimal learning factor zk for each hidden unit, hence the name Multiple Optimal Learning Factors (MOLF).
The hidden-weight update equation is given below.
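The update equation itself was an image in the original slides; the reconstruction below follows the MOLF description, with g(k, n) denoting the negative gradient of E with respect to the hidden weight w(k, n) (this indexing is an assumption based on the surrounding notation).

w(k, n) \leftarrow w(k, n) + z_k \cdot g(k, n),
\qquad
g(k, n) = -\frac{\partial E}{\partial w(k, n)},
\qquad 1 \le k \le N_h,\; 1 \le n \le N+1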
MOLF BP (Review)
The vector z containing the optimal learning factors zk is found through Newton's method by solving a linear system in z; a reconstruction is sketched below.
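The system and its terms were shown as images in the slides; the following is a reconstruction consistent with the MOLF description, where g_molf and H_molf are the gradient and Hessian of E with respect to the learning-factor vector z (treat the exact indexing as an assumption).

H_{\mathrm{molf}} \, z = g_{\mathrm{molf}},
\qquad
g_{\mathrm{molf}}(k) = -\frac{\partial E}{\partial z_k},
\qquad
h_{\mathrm{molf}}(k, j) = \frac{\partial^2 E}{\partial z_k \, \partial z_j}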
Problem: this approach lacks a convincing motivation.
Strongly Equivalent Networks (New)
[Figure: two single-hidden-layer MLPs with the same hidden and output structure. MLP 1 maps the inputs xp(1), ..., xp(N+1) through the input weight matrix W to the net functions np(1), ..., np(Nh), the hidden outputs op(1), ..., op(Nh), and the outputs yp(1), ..., yp(M) via Woh and the bypass weights Woi. MLP 2 is identical except that its input weights W' feed a transformation matrix C, producing net functions n'p(1), ..., n'p(Nh).]
MLP 1 and MLP 2 have identical performance if W = C • W'. However, they train differently.
Goal: maximize the decrease in the error E with respect to C.
Optimal net function transformation (New)
[Figure: the input layer xp(1), ..., xp(N+1) feeds W' to produce the net function vector n'p(1), ..., n'p(Nh), which the transformation matrix C maps to np(1), ..., np(Nh); equivalently, the same inputs feed W directly to produce np.]
Notation:
  np  - actual net function vector
  n'p - optimal net function vector
  C   - transformation matrix
  W = C • W'
MLP1 weight change equation (New)
The MLP 2 weight update equation is
  W' = W' + z • G',  where G' = C^T • G and W' = C^(-1) • W
Multiplying by C, we get
  W = W + z • C•C^T • G,  with R = C•C^T
So R•G is a valid update for MLP 1.
Why use MLP 1? Because MLP 2 requires C^(-1).
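Restating the derivation above in equation form (a direct transcription of the slide's relations, with ^T and ^(-1) denoting transpose and inverse):

W' \leftarrow W' + z\,G', \quad G' = C^{T} G, \quad W' = C^{-1} W
\;\Longrightarrow\;
C W' \leftarrow C W' + z\, C C^{T} G
\;\Longrightarrow\;
W \leftarrow W + z\, R\, G, \quad R = C C^{T}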
MOLF compared with Optimal Net Function Transformation (New)
The weight update equation for MLP 1 can be re-written as
  W = W + z • R•G,  where R = C•C^T and W = C • W'
If R is diagonal with kth diagonal element r(k), the individual weight update, with zk = z • r(k), is exactly the MOLF update equation; a reconstruction is sketched below.
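The per-hidden-unit form was an image in the original slides; this reconstruction follows from substituting a diagonal R into the MLP 1 update (the weight indexing w(k, n), g(k, n) is an assumption based on the earlier notation):

w(k, n) \leftarrow w(k, n) + z_k \cdot g(k, n),
\qquad
z_k = z \cdot r(k)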
Linearly dependent input layers in MOLF (Review)
Dependent inputs:
Lemma 1: When linearly dependent inputs are added to the network, each element of Hmolf gains some first- and second-degree terms in the variables b(n).
Linearly dependent hidden layers in MOLF (Review)
Lemma 2: When OLS is used to solve the above equation, each hidden unit that is linearly dependent on the hidden units already orthonormalized receives zero-valued weight changes.
Thus, based on Lemmas 1 and 2, the Hmolf Hessian matrix becomes ill-conditioned.
Collapsing the MOLF Hessian
  Hmolf  - MOLF Hessian
  Hmolf1 - collapsed MOLF Hessian
  Nh     - number of hidden units
  NOLF   - size of the collapsed MOLF Hessian
Variable Number of OLFs
Collapsing the MOLF Hessian is equivalent to assigning one or more hidden units to each OLF.
New number of OLFs = NOLF
Number of hidden units assigned to each OLF = Nh/NOLF
Advantages:
NOLF can be varied to improve performance or to decrease the number of multiplies.
The number of multiplies required to compute the OLFs in each iteration decreases cubically as NOLF decreases (see the sketch after this list).
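A minimal sketch of the collapsing idea, assuming hidden units are assigned to NOLF groups and that the rows and columns of the MOLF Hessian (and the corresponding gradient entries) are summed within each group before the smaller Newton system is solved; the grouping scheme, the summation, and the function name are assumptions, since the slides do not spell the construction out.

import numpy as np

def collapse_and_solve(H_molf, g_molf, n_olf):
    """Collapse an Nh x Nh MOLF Newton system to n_olf x n_olf and solve it.
    Returns one learning factor per hidden unit (shared within each group)."""
    Nh = g_molf.shape[0]
    groups = np.arange(Nh) * n_olf // Nh          # assign hidden unit k to one of n_olf groups
    A = np.zeros((n_olf, Nh))
    A[groups, np.arange(Nh)] = 1.0                # group-assignment (summing) matrix
    H_small = A @ H_molf @ A.T                    # collapsed Hessian, n_olf x n_olf
    g_small = A @ g_molf                          # collapsed gradient
    z_small = np.linalg.solve(H_small, g_small)   # solving this costs on the order of n_olf^3 multiplies
    return z_small[groups]                        # expand back to one factor per hidden unit

With n_olf equal to Nh this recovers the full MOLF system; with n_olf equal to 1, all hidden units share a single learning factor.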
No. of multiplies vs NOLF
[Figure: number of multiplies (scale x 10^4, from 0 to 7) versus NOLF (0 to 40).]
Simulation Results
We compare the performance of VOLF and MOLF to those of OWO-BP, LM, and conjugate gradient (CG) in terms of:
  Average training error vs. number of iterations
  Average training error vs. number of multiplies
We have used 4 data sets for our simulations, described in the table below.

Data Set Name       No. of Inputs   No. of Outputs   No. of Patterns
Twod.tra                  8               7               1768
Single2.tra              16               3              10000
Power12trn.tra           12               1               1414
Concrete Data Set         8               1               1030
Simulation conditions
Different numbers of hidden units are chosen for each data file, based on network pruning, to minimize validation error.
The K-fold validation procedure is used to calculate the average training and validation errors.
Given a data set, it is split into K non-overlapping parts of equal size; (K − 1) parts are used for training and the remaining part for validation. The procedure is repeated until all K combinations have been exhausted (K = 10 in our simulations); a sketch of the split follows this list.
In all our simulations we run 4000 iterations for the first-order algorithms OWO-BP and CG, 4000 iterations for MOLF and VOLF, and 300 iterations for LM. In each experiment, every algorithm starts from the same initial network.
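A minimal sketch of the K-fold split described above, assuming the patterns are simply partitioned in order into K nearly equal parts (the slides do not say whether the split is shuffled); the helper name is illustrative.

import numpy as np

def k_fold_indices(n_patterns, k=10):
    """Yield (train_idx, val_idx) index pairs for K-fold validation."""
    folds = np.array_split(np.arange(n_patterns), k)   # K non-overlapping parts of nearly equal size
    for i in range(k):
        val_idx = folds[i]                             # one part held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining K-1 parts
        yield train_idx, val_idx

Averaging the training and validation errors over the K runs gives the figures reported in the following slides.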
Twod data file trained with Nh=30
[Figure: average training MSE vs. number of iterations (10^0 to 10^4, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Twod data file trained with Nh=30
[Figure: average training MSE vs. number of multiplies (10^6 to 10^12, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Single2 data file with Nh=20
[Figure: average training MSE vs. number of iterations (10^0 to 10^4, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Single2 data file with Nh=20
[Figure: average training MSE vs. number of multiplies (10^7 to 10^13, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Power12trn data file with Nh=25
[Figure: average training MSE vs. number of iterations (10^0 to 10^4, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Power12trn data file with Nh=25
[Figure: average training MSE vs. number of multiplies (10^6 to 10^12, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Concrete data file with Nh=15
[Figure: average training MSE vs. number of iterations (10^0 to 10^4, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Concrete data file with Nh=15
[Figure: average training MSE vs. number of multiplies (10^5 to 10^11, log scale) for OWO-BP, CG, LM, MOLF, and VOLF.]
Conclusions
The error produced by the VOLF algorithm lies between the errors produced by the MOLF and OWO-BP algorithms, depending on the value of NOLF chosen.
The VOLF and MOLF algorithms produce good results approaching those of the LM algorithm, with computational requirements only on the order of first-order training algorithms.
Future Work
Applying MOLF to networks with more than one hidden layer.
Converting the proposed algorithm into a single-stage procedure in which all the weights are updated simultaneously.
Questions