Optimizing Pattern Recognition Systems
Professor Mike Manry
University of Texas at Arlington
Outline
Introduction
Feature Selection
Complexity Minimization
Recent Classification Technologies
Neural Nets
Support Vector Machines
Boosting
A Maximal Margin Classifier
Growing and Pruning
Software
Examples
Conclusions
Intro -- Problems In Pattern Recognition Systems
[Figure: a pattern recognition pipeline (Raw Images → Segment → Feature Extraction → Classify), annotated with typical problems at each stage: poor selection of training images, poorly chosen segmentation algorithms, poor or highly redundant feature vectors z, and an inefficient, improperly sized classifier.]
Intro -- Problems In Pattern Recognition Systems
[Figure: the same pipeline (good Raw Images → Segment → Feature Extraction → Classify) can still produce poor results.]
Problems: We usually don't know which block(s) are bad, and the leftmost blocks may be more expensive and difficult to change.
Solution: Using the system illustrated on the next slide, we can quickly optimize the rightmost blocks and narrow the list of potential problems.
Intro -- IPNNL Optimization System
( www-ee.uta.edu/eeweb/ip/ )
[Figure: the IPNNL Optimization System. The application pipeline (Raw Images → Segment → Feature Extraction → Classify) supplies training data { z, ic } with Dim(z) = N'. Feature Selection produces a final subset x with Dim(x) = N, where N << N'. The system predicts size and performance for the available classifier choices, prunes and validates, and the interpreted final subset x and final classifier feed back into the application.]
Intro -- Presentation Goals
Examine blocks in the Optimization System:
• Feature Selection
• Recent Classification Technologies
• Growing and Pruning of Neural Net Classifiers
Demonstrate the Optimization System on:
• Data from Bell Helicopter Textron
• Data from UTA Bioengineering
• Data from UTA Civil Engineering
Example System
Text Reader
[Figure: a segmented character passes through feature extraction (a 2-D DFT) and a classifier with weights w1 and w2. The ten output discriminants are yp1 through yp10; here yp4 = 0.91 while all the others are 0.01, so the input is assigned to class 4.]
Feature Selection
- Combinatorial Explosion

The number of size-N subsets of N' features is

    NS = C(N', N) = N'! / ( N! (N' - N)! )

• Scanning: Generate candidate subsets
• Subset Evaluation: Calculate subset goodness, and save the best ones

[Figure: the feature vector z enters a scanning method, which generates candidate subsets x; a subset evaluation metric scores them and outputs the chosen subsets x.]
Feature Selection
Example of Combinatorial Explosion

Given data for a classification problem with N' = 90 candidate features,
• there are 9.344 × 10^17 subsets of size N = 17,
• and a total of 2^90 = 1.2379 × 10^27 subsets.
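As a quick sanity check of these counts, here is a short Python sketch using only the standard library:

```python
import math

N_prime = 90   # number of candidate features
N = 17         # subset size of interest

size_N_subsets = math.comb(N_prime, N)   # C(90, 17)
total_subsets = 2 ** N_prime             # every possible subset of 90 features

print(f"size-{N} subsets: {size_N_subsets:.4e}")   # about 9.344e+17
print(f"all subsets:      {total_subsets:.4e}")    # about 1.2379e+27
```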
Feature Selection
- Scanning Methods

Available methods:
Brute Force (BF) Scanning: Examine every subset (see previous slide).
Branch and Bound (BB) [1]: Avoids examining subsets known to be poor.
Plus L minus R (L-R) [1]: Adds L good features and eliminates the R worst features.
Floating Search (FS) [2]: Faster than BB.
Feature Ordering (FO) [1]: Given the empty subset, repeatedly add the best additional feature to the subset. Also called forward selection (sketched below).
Ordering Based Upon Individual Feature Goodness (FG).

Let G measure the goodness of a subset scanning method. Then
G(FG) < G(FO) < G(L-R) < G(FS) < G(BB) = G(BF)
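A minimal sketch of Feature Ordering (forward selection), assuming a caller-supplied `evaluate_subset` function as the subset evaluation metric (lower is better); this is an illustration, not the IPNNL implementation:

```python
import numpy as np

def forward_selection(X, labels, evaluate_subset, n_select):
    """Feature Ordering (FO): start from the empty subset and repeatedly
    add the single feature that most improves the subset goodness.

    X               : (Nv, N') array of candidate features
    evaluate_subset : evaluate_subset(X_subset, labels) -> score, lower = better
    n_select        : number of features to pick
    """
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        # Score every subset formed by adding one remaining feature.
        scores = {k: evaluate_subset(X[:, selected + [k]], labels)
                  for k in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```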
Feature Selection
Scanning Methods: Feature Goodness vs. Brute Force

• Suppose that the available features are
  z1 = x + n, z2 = x + n, z3 = y + 2n, and z4 = n,
  where x and y are useful and n is noise.
• FG: Nested subsets {z1}, {z1, z2}, {z1, z2, z3}
• BF: Optimal subsets {z2}, {z1, z4}, {z1, z3, z4}, since x = z1 - z4 and y = z3 - 2z4
• An optimal subset may include features that are useless according to the FG approach (see the simulation below)
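An assumed numerical illustration of this point: feature z4 = n is useless on its own, yet the pair {z1, z4} recovers x exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)    # useful signal
y = rng.normal(size=10_000)    # another useful signal
n = rng.normal(size=10_000)    # noise

z1, z2, z3, z4 = x + n, x + n, y + 2 * n, n

# Individual feature goodness: z4 alone carries no information about x ...
print(np.corrcoef(z4, x)[0, 1])     # close to 0

# ... but the pair {z1, z4} recovers x exactly, since x = z1 - z4.
print(np.allclose(z1 - z4, x))      # True
```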
Feature Selection
- Subset Evaluation Metrics (SEMs)

Let x be a feature subset, and let xk be a candidate feature for possible inclusion in x.

Requirements for an SEM f():
• f( x ∪ xk ) < f( x )   ( ∪ denotes union )
• f() is related to classification success (Pe, for example)

Example SEMs:
• Brute Force Subset Evaluation: Design a good classifier and measure Pe
• Scatter Matrices [1]
Feature Selection
A New Approach [3]

Scanning approach: Floating Search
SEM: Pe for a piecewise linear classifier

[Figure: segmented data feeds five feature extraction groups FE1 through FE5, whose outputs are concatenated into a large feature vector z; feature selection then produces a small feature vector x.]
At the output, the absence of groups 1, 3, and 5 reveals problems.
Feature Selection
Example 1

Classification of Numeral Images: N' = 16 features and Nc = 10 classes.

Chosen subsets (by subset size N = 1 to 16):
{6}
{6, 9}
{6, 9, 14}
{6, 9, 14, 13}
{6, 9, 14, 13, 3}
{6, 9, 14, 13, 3, 15}
{6, 9, 14, 13, 11, 15, 16}
{6, 9, 14, 13, 11, 15, 16, 3}
{6, 9, 14, 13, 11, 15, 16, 3, 4}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2}
{6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2, 10}

[Plot: error % versus subset size N, for N = 1 to 16.]
Feature Selection
Example 2

Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani): N' = 90 features and Nc = 2 classes.

Chosen subsets (by subset size N = 1 to 17):
{11}
{11, 56}
{11, 56, 61}
{11, 28, 55, 63}
{11, 28, 55, 63, 27}
{11, 28, 55, 63, 62, 17}
{11, 28, 55, 63, 53, 17, 20}
{11, 28, 55, 63, 53, 17, 20, 62}
{11, 28, 55, 63, 53, 17, 20, 4, 40}
{11, 28, 55, 63, 53, 26, 20, 4, 40, 8}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 38}
{11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 45, 87}
{11, 28, 13, 19, 53, 26, 65, 4, 40, 30, 80, 85, 8, 48, 75, 87, 18}

[Plot: error % versus subset size N.]
Complexity Minimization

Complexity: defined as the number of free parameters or coefficients in a processor.

Complexity Minimization (CM) procedure:
• Minimize the training-set Pe for each classifier size, with respect to all weights or coefficients. Nw is the number of weights or coefficients.
• Measure Pe on a validation data set.
• Choose the network that minimizes the validation set's Pe.

• CM leads to smaller, better classifiers, if it can be performed. It is related to structural risk minimization (SRM) [4,5].
• Implication: To perform SRM, we need methods for quickly varying network size during training (growing) or after training (pruning).
Complexity Minimization

[Plot: validation error E versus the number of weights Nw.] Vary Nw until the validation error is minimized.
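A minimal sketch of the CM loop, assuming a hypothetical `train_classifier(n_hidden, train_data, val_data)` helper that trains a classifier of the given size and returns it together with its training and validation error rates:

```python
def complexity_minimization(train_classifier, candidate_sizes, train_data, val_data):
    """For each candidate size, minimize training-set Pe; then keep the
    network whose *validation* error is smallest (the CM procedure above).

    train_classifier is a placeholder, not the IPNNL routine; it is
    assumed to return (model, training_error, validation_error).
    """
    best_model, best_size, best_e_val = None, None, float("inf")
    for n_hidden in candidate_sizes:
        model, e_train, e_val = train_classifier(n_hidden, train_data, val_data)
        if e_val < best_e_val:
            best_model, best_size, best_e_val = model, n_hidden, e_val
    return best_model, best_size, best_e_val
```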
Recent Classification Technologies
Neural Nets

Minimize the MSE E(w), where tpi = 1 for the correct class (i = ic) and tpi = 0 for an incorrect class (i = id). yi(x) is the ith output discriminant of the trained classifier.

A neural net support vector x satisfies yic(x) - max{ yid(x) } = 1.

The MSE between yi(x) and bi(x) = P(i|x) is
    e(w) = E[ Σi ( yi(x) - bi(x) )^2 ]

Theorem 1 [5,6]: As the number of training patterns Nv increases, the training error E(w) approaches e(w) + C, where C is a constant.
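A small numeric illustration (not the author's code) of this target coding and the training MSE E(w); the per-pattern normalization is an assumption:

```python
import numpy as np

def training_mse(y, ic):
    """E(w): mean squared error between the output discriminants y and
    one-hot targets t, where t[p, i] = 1 if i == ic[p] and 0 otherwise."""
    Nv, Nc = y.shape
    t = np.zeros((Nv, Nc))
    t[np.arange(Nv), ic] = 1.0
    return np.sum((y - t) ** 2) / Nv

# Two patterns, three classes; the correct classes are 0 and 2.
y = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.1, 0.8]])
print(training_mse(y, np.array([0, 2])))   # 0.055
```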
Recent Classification Technologies
Neural Nets

Advantages
• Neural net outputs approximate the Bayes discriminant P(i | x)
• Training modifies all network weights
• CM is easily performed via growing and pruning methods
• Accommodates any size of training data file

Problems
• E(w) and e(w) are not proportional to Pe.
• From Theorem 1, yi = bi + εi, where εi is random zero-mean noise. Noise degrades performance, leaving room for improvement via SVM training and boosting.
Recent Classification Technologies
Support Vector Machines [4,5]

SVM: A neural net structure with these properties:
• Output weights form hyperplane decision boundaries
• Input vectors xp satisfying yp(ic) = +b and max{ yp(id) } = -b are called support vectors
• Correctly classified input vectors xp outside the decision margins do not adversely affect training
• Incorrectly classified xp do not strongly affect training
• In some SVMs, Nh (the number of hidden units) initially equals Nv (the number of training patterns)
• Training may involve quadratic programming
Recent Classification Technologies
Support Vector Machines

[Figure: two classes in the (x1, x2) plane; support vectors (SV) lie on margins a distance b on either side of the SVM discriminant, and the LMS discriminant is also shown.]
Correctly classified patterns distort the LMS discriminant.
Recent Classification Technologies
Support Vector Machines

Advantage: Good classifiers from small data sets

Problems
• SVM methods are practical only for small data sets
• Training is difficult when there are many classes
• The kernel parameter is found by hit or miss
• Fails to minimize complexity (far too many hidden units are required, and hidden unit weights don't adapt)

Current Work
• Developing an SVM pruning approach
• Developing regression-based maximal margin classifiers
Recent Classification Technologies
Boosting [5]

yik(x): the ith class discriminant for the kth classifier. K is the number of classifiers being fused.

In discriminant fusion, we calculate the ak so that the weighted average discriminant
    yi(x) = Σk ak yik(x)
has better performance than the individual yik(x).

Adaboost [5] sequentially picks a training subset {xp, tp}k from the available data and designs yik(x) and ak so they are functions of the previous (k-1) classifiers.
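A sketch of the fusion step itself, assuming the K per-classifier discriminants for one input are stacked in an array and the weights ak come from a boosting procedure such as Adaboost:

```python
import numpy as np

def fused_discriminant(y_k, a):
    """Weighted-sum discriminant y_i(x) = sum_k a_k * y_ik(x).

    y_k : (K, Nc) array, row k holds the Nc discriminants of classifier k
    a   : (K,) array of fusion weights (e.g., produced by Adaboost)
    """
    return a @ y_k            # shape (Nc,)

def fused_class(y_k, a):
    """Assign the class with the largest fused discriminant."""
    return int(np.argmax(fused_discriminant(y_k, a)))
```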
Recent Classification Technologies
Boosting
Advantage
• Final classifier can be used to process video signals in real
time when step function activations are used
Problems
• Works best for the two-class case, uses huge data files.
• Final classifier is a large, highly redundant neural net.
• Training can take days, CM not performed.
Future Work
• Pruning should be tried to reduce redundancy.
• Feature selection should be tried to speed up training
Recent Classification Technologies
Problems with Adaboost

N = 3, Nc = 2, K = 3 (the number of classifiers being fused)
Future work: Prune Adaboost-trained networks
A Maximal Margin Classifier
Problems With MSE Type Training
The ith class neural net discriminant can be modeled as
    yp(i) = tp(i) + εp(i)
This additive noise εp(i) degrades the performance of regression-based classifiers, as mentioned earlier.
Correctly classified patterns contribute to the MSE and can adversely affect training (see the following page). In other words, regression-based training tries to force all training patterns to be support vectors.
A Maximal Margin Classifier
Problems With MSE Type Training
[Figure: two-class scatter in the (x1, x2) plane with a regression-based decision boundary.]
(1) The standard regression approach tries to force all training vectors to be support vectors.
(2) Red lines are counted as errors, even though those patterns are classified more correctly than desired.
(3) Outliers can distort decision boundary locations.
A Maximal Margin Classifier [5]
Existence of Regression-Based Optimal Classifiers

Let X be the basis vector (hidden units + inputs) of a neural net. The output vector is y = W·X.

The minimum MSE is found by solving
    R·W = C                                   (1)
where
    R = E[ X·X^T ]  and  C = E[ X·t^T ]       (2)

If an "optimal" coefficient matrix Wopt exists, then Copt = R·Wopt from (1), so Copt exists. From (2), we can find Copt if the desired output vector t is defined correctly.

Regression-based training can mimic other approaches (a sketch of the solve follows).
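A minimal sketch of solving R·W = C from sample estimates; the small ridge term and the array shapes are assumptions, and this stands in for whatever solver the IPNNL tools actually use:

```python
import numpy as np

def solve_output_weights(X, T, ridge=1e-8):
    """Solve R @ W = C for the output weights, with R and C estimated
    from data: R ~ E[X X^T], C ~ E[X t^T].

    X : (Nv, Nu) basis outputs (inputs plus hidden units) per pattern
    T : (Nv, Nc) desired output vectors
    Returns W such that the network outputs are y = X @ W.
    """
    Nv, Nu = X.shape
    R = X.T @ X / Nv                   # sample estimate of E[X X^T]
    C = X.T @ T / Nv                   # sample estimate of E[X t^T]
    R += ridge * np.eye(Nu)            # small ridge for numerical safety
    return np.linalg.solve(R, C)
```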
A Maximal Margin Classifier
Regression-Based Classifier Design [7]

Consider the empirical risk (MSE)
    E = (1/Nv) Σp Σi [ tp(i) - yp(i) ]^2

If yp(ic) > tp(ic) (the correct-class discriminant is large), or
if yp(id) < tp(id) (an incorrect-class discriminant is small),
then the classification error decreases but E increases.
A Maximal Margin Classifier
Regression-Based Classifier Design (continued)

The discrepancy is fixed by re-defining the empirical risk as
    E' = (1/Nv) Σp Σi [ tp'(i) - yp(i) ]^2
where
    if yp(ic) > tp(ic), set tp'(ic) = yp(ic);
    if yp(id) < tp(id) for an incorrect class id, set tp'(id) = yp(id).
In both cases tp'(i) is set equal to yp(i), so no error is counted.

This algorithm partially mimics SVM training, since correctly classified patterns do not affect the MSE. (Related to Ho-Kashyap procedures, [5] pp. 249-256.)
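A sketch of this modified empirical risk E'; the array shapes and the per-pattern normalization are assumptions:

```python
import numpy as np

def modified_risk(y, t, correct_class):
    """E': MSE with re-defined targets t'. Whenever an output already
    satisfies its target (correct class above target, or an incorrect
    class below target), the target is moved onto the output so that
    entry contributes no error.

    y, t          : (Nv, Nc) network outputs and original targets
    correct_class : (Nv,) index ic of the correct class per pattern
    """
    Nv, Nc = y.shape
    is_correct = np.zeros_like(y, dtype=bool)
    is_correct[np.arange(Nv), correct_class] = True

    over = is_correct & (y > t)        # yp(ic) > tp(ic): no error counted
    under = ~is_correct & (y < t)      # yp(id) < tp(id): no error counted

    t_prime = np.where(over | under, y, t)
    return np.sum((y - t_prime) ** 2) / Nv
```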
A Maximal Margin Classifier
Problems With MSE Type Training
[Figure: the two-class scatter in the (x1, x2) plane again.] Recall that E counts all non-support vectors as erroneous.
A Maximal Margin Classifier
Errors Contributing to E’
[Figure: the two-class scatter in the (x1, x2) plane with decision margins.] In E', only errors (green lines) inside the margins are minimized. Outliers can be eliminated.
A Maximal Margin Classifier
Comments
The proposed algorithm:
(1) Adapts to any number of training patterns
(2) Allows for any number of hidden units
(3) Makes CM straightforward
(4) Is used to train the MLP. The resulting classifier is called a
maximal margin classifier (MMC)
Questions:
(1) Does this really work ?
(2) How do the MMC and SVM approaches compare ?
A Maximal Margin Classifier
Two-Class Example
Numeral Classification:
N = 16, Nc = 2, Nv = 600
Goal: Discriminate numerals 4 and 9.
SVM: Nh > 150, Et = 4.33% and Ev = 5.33 %
MMC: Nh = 1 , Et = 1.67% and Ev = 5.83 %
Comments: The 2-Class SVM seems better, but the
price is too steep.
A Maximal Margin Classifier
Multi-Class Examples
Numeral Classification:
N = 16, Nc = 10, Nv = 3,000
SVM: Nh > 608, Ev = 14.53 %
MMC: Nh = 32 , Ev = 8.1 %
Bell Flight Condition Recognition:
N = 24, Nc = 39, Nv = 3,109
SVM: Training fails
MMC: Nh = 20 , Ev = 6.97 %
Nh - number of hidden units
Nv - number of training patterns
Ev – validation error percentage
Conclusion: SVMs may not work for medium and large size
multi-class problems.
Growing and Pruning
Candidate MLP Training Block
If complexity minimization (CM) is used, the resulting Ef(Nh) curve is monotonic.
[Plot: validation error Ef versus the number of weights Nw.]
Practical ways of approximating CM are growing and pruning.
Growing and Pruning
Growing
[Plot: validation error Ef versus Nw as hidden units are added.]
Growing: Starting with no hidden units, repeatedly add Na units and train the network some more (a sketch follows below).
Advantages: Creates a monotonic Ef(Nh) curve. Usefulness is concentrated in the first few units added.
Disadvantage: Hidden units are not optimally ordered.
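A sketch of the growing loop, assuming a hypothetical network object with `add_hidden_units`, `train`, `n_hidden`, and `error_rate` members (these names are illustrative, not the IPNNL API):

```python
def grow_network(net, Na, n_steps, train_data, val_data):
    """Growing: starting from a network with no hidden units, repeatedly
    add Na hidden units, continue training, and record the validation
    error so the best size can be chosen afterwards."""
    history = []                                   # (n_hidden, validation error)
    for _ in range(n_steps):
        net.add_hidden_units(Na)                   # grow by Na units
        net.train(train_data)                      # train the network some more
        history.append((net.n_hidden, net.error_rate(val_data)))
    return history
```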
Growing and Pruning
Pruning [8]
[Plot: validation error Ef versus Nw as hidden units are removed.]
Pruning: Train a large network. Then repeatedly remove a less useful unit using OLS (a simplified sketch follows below).
Advantages: Creates a monotonic Ef(Nh) curve. Hidden units are optimally ordered.
Disadvantage: Usefulness is not concentrated in the first few units.
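A simplified pruning sketch. A greedy search stands in for the OLS/Schmidt-procedure ordering of [8], and `remove_unit` and `error_rate` are hypothetical helpers:

```python
def prune_network(net, train_data, val_data, remove_unit, error_rate):
    """Pruning (simplified): starting from a trained, over-sized network,
    repeatedly remove the hidden unit whose removal hurts the training
    error least, and track the validation error at each size."""
    history = [(net.n_hidden, error_rate(net, val_data))]
    while net.n_hidden > 1:
        # Try removing each hidden unit; keep the least damaging removal.
        trials = [(error_rate(remove_unit(net, h), train_data), h)
                  for h in range(net.n_hidden)]
        _, h_worst = min(trials)
        net = remove_unit(net, h_worst)
        history.append((net.n_hidden, error_rate(net, val_data)))
    return history
```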
Growing and Pruning
Pruning a Grown Network [9]
The data set is for inversion of radar scattering from bare soil surfaces. It has 20 inputs and 3 outputs.
[Plot: training error for inversion of radar scattering.]
Growing and Pruning
Pruning a Grown Network
[Plot: validation error for radar scattering inversion.]
Growing and Pruning
Pruning a Grown Network
The prognostics data set is for onboard flight load synthesis (FLS) in helicopters, where we estimate mechanical loads on critical parts using measurements available in the cockpit. There are 17 inputs and 9 outputs.
[Plot: training error for the prognostics data.]
Growing and Pruning
Pruning a Grown Network
[Plot: validation error for the prognostics data.]
Growing and Pruning
Pruning a Grown Network
The data set for estimating phoneme likelihood functions in speech has 39 inputs and 117 outputs.
[Plot: training error for the speech data.]
Growing and Pruning
Pruning a Grown Network
[Plot: validation error for the speech data.]
Growing and Pruning
• Remaining Work: Insert into the IPNNL Optimization System
IPNNL Software
Motivation
Theorem 2 (No Free Lunch Theorem [5] ) : In the
absence of assumptions concerning the training data,
no training algorithm is inherently better than
another.
Implication: We usually make some assumptions.
However, given training data, several classifiers
should be tried after feature selection.
IPNNL Software
Block Diagram
[Figure: block diagram of the IPNNL software. Data first goes through feature selection (analyze your data); a network type is then selected (MLP, PLN, FLN, LVQ, SVM, RBF, or SOM); finally the network is sized, trained, and pruned & validated to produce the final network.]
IPNNL Software
Examples
The IPNNL Optimization system is demonstrated on:
• Flight condition recognition data from Bell
Helicopter Textron (Prognostics Problem)
• Sleep apnea data from UTA Bioengineering (Prof.
Khosrow Behbehani and Mohammad Al-Abed)
• Traveler characteristics data from UTA Civil
Engineering (Prof. Steve Mattingly and Isaradatta
Rasmidatta)
Examples
Bell Helicopter Textron
• Flight condition recognition (prognostics) data
from Bell Helicopter Textron
• Features: N’ = 24 cockpit measurements
• Patterns: 4,745
• Classes: Nc = 39 helicopter flight categories
Run feature selection, and save new training and validation files with only 18 features.
Run MLP sizing; decide upon 12 hidden units.
Run MLP training; save the network.
Run MLP pruning, with validation. The final network has 10 hidden units.
Examples
Behbehani and Al-Abed
• Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani)
• Features: N' = 90 features, from co-occurrence features applied to the STDFT
• Patterns: 136
• Classes: Nc = 2 (Yes/No)
• Previous Software: Matlab Neural Net Toolbox

Run feature selection, and save new training and validation files with only 17 features. The curve is ragged because of the small number of patterns.
Run MLP sizing; decide upon 5 hidden units.
Run MLP training; save the network.
Run MLP pruning, with validation. The final network has 3 hidden units.
Examples
- Mattingly and Rasmidatta

• Classification of traveler characteristics data (Isaradatta Rasmidatta and Steve Mattingly)
• Features: N' = 22 features
• Patterns: 7,325
• Classes: Nc = 3 (car, air, bus/train)
• Previous Software: NeuroSolutions by NeuroDimension

Run feature selection, and save new training and validation files with only 4 features. The flat curve means few features are needed.
Run MLP sizing; decide upon 2 hidden units. The flat curve means few hidden units, if any, are needed.
Run MLP training; save the network.
Run MLP pruning, with validation. The final network has 1 hidden unit.
Conclusions
• An effective feature selection algorithm has been developed
• Regression-based networks are compatible with CM
• Regression-based training can extend maximal margin concepts to
many nonlinear networks
• Several existing and potential blocks in the IPNNL Optimization
System have been discussed
• The system has been demonstrated on three pattern recognition
applications
• A similar Optimization System is available for
approximation/regression applications
References
[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, 1990.
[2] P. Pudil, J. Novovicova, and J. Kittler, "Floating Search Methods in Feature Selection," Pattern Recognition Letters, vol. 15, pp. 1119-1125, 1994.
[3] Jiang Li, Michael T. Manry, Pramod Lakshmi Narasimha, and Changhua Yu, "Feature Selection Using a Piecewise Linear Network," accepted by IEEE Trans. on Neural Networks.
[4] Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd edition, John Wiley and Sons, 2001.
[6] Dennis W. Ruck et al., "The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function," IEEE Trans. on Neural Networks, vol. 1, no. 4, 1990.
[7] R. G. Gore, Jiang Li, Michael T. Manry, Li-Min Liu, Changhua Yu, and John Wei, "Iterative Design of Neural Network Classifiers through Regression," International Journal on Artificial Intelligence Tools, vol. 14, nos. 1&2, 2005, pp. 281-301.
[8] F. J. Maldonado and M. T. Manry, "Optimal Pruning of Feed Forward Neural Networks Using the Schmidt Procedure," Conference Record of the Thirty-Sixth Annual Asilomar Conference on Signals, Systems, and Computers, November 2002, pp. 1024-1028.
[9] Pramod Lakshmi Narasimha, Walter Delashmit, Michael Manry, Jiang Li, and Francisco Maldonado, "Fast Generation of a Sequence of Trained and Validated Feed-Forward Networks," accepted by the Nineteenth International Conference of the Florida AI Research Society, May 2006.