Transcript Slide 1

Towards Efficient Learning of Neural Network
Ensembles from Arbitrarily Large Datasets
Kang Peng, Zoran Obradovic and Slobodan Vucetic
Center for Information Science and Technology, Temple University
303 Wachman Hall, 1805 N Broad St, Philadelphia, PA 19122, USA
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
Introduction

More and more very large datasets become available






Geosciences
Bioinformatics
Intrusion detection
Credit fraud detection
…
Learning from arbitrarily large datasets is one of the
next generation data mining challenges
The MISR Data – a Real Life Example
MISR - Multi-angle Imaging SpectroRadiometer, launched into
orbit in December 1999 with the Terra satellite, for studying the
ecology and climate of Earth




9 cameras from different
angles
4 spectral bands at each
angle
Global coverage time of
every 9 days
Average data rate
3.3 Megabits per second
3.5 TeraBytes per year
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
Feed-Forward Neural Networks

Feed-forward Neural Network (NN) is a powerful
machine learning / data mining technique
weights
y1
weights
x1
y2
x2
y3
x3
1
Inputs


bias
1
Hidden
Layer
y4
bias
Output
Layer
Outputs
Universal approximator – applicable to both classification
and regression problems
Learning – weights adjustments (e.g. back-propagation)
Motivation

Learning a single NN from an arbitrarily large
dataset could be difficult due to

The unknown intrinsic complexity of the learning task




Difficult to determine appropriate NN architecture
Difficult to determine how much data is necessary for sufficient
learning
The computational constraints
On the other hand, learning an ensemble of NNs
would be advantageous if


Each component NN needs only a small portion of data
Accuracy is comparable to single NN from all data
Motivation

Need: To learn an ensemble of optimal accuracy but
with fewest computational effort, one still has to decide

Model complexity



The number (E) of component NNs
The number (H) of hidden neurons for component NNs
Training sample sizes (N) for component NNs

Open problem: No efficient algorithm exists to find an
exact solution (i.e. optimal combination of E, H and N) even
if the component NNs are required to have same H and N

Proposed: An iterative procedure that


learns near-optimal NN ensembles with reasonable computational
effort
Adapts to the intrinsic complexity of underlying learning task
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
NN Architecture Selection

Trial-and-error (manual) procedure




Training one model with each architecture
Trying as many architectures as possible and selecting
the one with highest accuracy
Ineffective and inefficient for large datasets
Constructive learning


Starting with a small network and gradually adding
neurons as needed
Examples




The tiling algorithm
The upstart algorithm
The cascade-correlation algorithm
Suitable for small datasets
NN Architecture Selection

Network pruning


Training a larger-than-necessary NN and then pruning redundant
neurons/weights
Examples




Optimal Brain Damage
Optimal Brain Surgeon
Suitable for small and medium datasets
Evolutionary algorithms




Population-based stochastic search algorithms
More efficient in searching NN architecture space
Applicable to learning rules selection as well as network training
(weight adjusting)
Inefficient for large datasets
Progressive Sampling
To achieve near-optimal accuracy but with significantly less
data than if using the whole dataset
accuracy

nmin
sample size

nall
 Originally proposed for decision tree learning
 It builds a series of models with increasingly
larger samples until accuracy no longer
improves
 The sample sizes follow a sample schedule
S = {n1, n2, …, nk }
where ni is sample size for the i-th model
 Geometric sampling schedule is efficient in
determining nmin
ni = n0* ai ,
where constant n0 is positive integer and a>1
Progressive sampling may not be suitable for NN learning

The learning algorithm should be able to adjust model complexity as samples
grow larger – this is not true for back-propagation algorithm
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
An Iterative Procedure for Learning NN
Ensembles from Arbitrarily Large Datasets
The idea: Building a series of NNs such that
 Each NN is trained on a sample much smaller than
the whole dataset
 The sample sizes for individual NNs are increased
as needed
 The numbers of hidden neurons for individual NNs
increase as needed
 The final predictor is the best one of all possible
ensembles constructed from the trained NNs
The Proposed Iterative Procedure
Initialize Ha and Nb to certain
small values (e.g. 1 and 40)
Draw a sample S of size N from dataset D
Train a NN of H hidden
neurons with sample S
Accuracyc significantly
improved?
No
Increase
H or N
No
Converged OR resource
limitsd reached?
Yes
Identify the best ensemble
as the final predictor
Yes
a) H – number of hidden
neurons
b) N – number of training
sample size
c) The best accuracy of all
possible ensembles from
trained NNs, estimated on
an independent set
d) Could be main memory
(maximal sample size ) or
cumulative execution time
The Use of Dataset D

Dataset D is divided into 3 disjoint subsets




DTR – for training NN
DVS – for accuracy estimation during learning
DTS – for accuracy estimation of the final predictor
To draw a sample of size N from DTR



Assumption - data points are stored in random order
Sequentially take N data points
Rewind if the end of dataset is encountered
Accuracy Estimation during Learning

Accuracy ACCi (for i-th iteration) is estimated on the
independent subset DVS as accuracy of the best
possible ensemble from i trained NNs
 To determine if ACCi is significantly higher than
ACCi-1, test condition ACCi > ACCi-1 AND CIi-1  CIi
=
Here,
ACCi is accuracy for i-th iteration,
CIi is the 90% confidence interval for ACCi
calculated as ACCi1.645SE(ACCi),
where SE(ACCi) is standard error of ACCi
Accuracy Standard Error Estimation

For classification problems
SE ( ACC i ) 

ACC i 1  ACC i 
DVS
For regression problems



Draw 1000 bootstrap samples from DVS
Calculate R2 on each bootstrap sample
SE(ACCi) = standard deviation of these R2 values
Adjusting Model Complexity
and Sample Size

If ACCi is NOT significantly higher than ACCi-1

If ACCi-1 is NOT significantly higher than ACCi-2



If already increased N in the i-1 th iteration
then increase H by a pre-defined amount IH (IH is positive integer)
If already increased H in the i-1 th iteration
then multiply N by a pre-defined factor FA (FA > 1)
If ACCi-1 is significantly higher than ACCi-2
(i.e. neither H nor N is increased in the i-1 th iteration)
then multiply N by a pre-defined factor FA (FA > 1)
Convergence Detection

In each (i-th) iteration, test condition
standard_deviation( ACCk )
 c
mean( ACCk )
where C is a small positive constant, and
k ranges from i-4 to i
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
The Waveform Dataset





Synthetic classification problem
From UCI Machine Learning Repository
3 classes of waveforms
21 continuous attributes
Originally reported accuracy of 86.8%


with an Optimal Bayes classifier
100,000 examples were generated for each class

|DTR| = 80,000, |DVS| = 10,000, |DTS| = 10,000
The Covertype Dataset

Real-life classification problem
 From UCI Machine Learning Repository
 7 classes of forest cover types
 44 binary and 10 continuous attributes


Originally reported accuracy of 70%


40 binary attributes (for soil type) were transformed into 7
continuous attributes
obtained using a neural network classifier
581,012 examples

|DTR| = 561,012, |DVS| = 10,000, |DTS| = 10,000
The MISR Dataset

Real-life regression problem
 From NASA
 1 continuous target


36 continuous attributes


retrieved aerosol optical depth
constructed from raw MISR data
45,449 examples


Retrieved over land for the 48 contiguous United States during a 15day period of summer 2002
|DTR| = 35,449, |DVS| = 5,000, |DTS| = 5,000
Experimental Setup

The procedure was repeated 50 times on each dataset



Stopped when convergence was reached or the sample size
exceeded a pre-defined upper limit Nmax = 20,000
Parameters IH = 4, FA = 1.5 and C = 0.0025 selected based
on preliminary experiments on Waveform dataset
For comparison purpose, “simple” NN ensembles of
known parameters were also built




Trained and tested on DTR and DTS, respectively
Ensemble size (E)  {1, 5, 10}
Number of hidden neurons (H)  {1, 5, 10, 20, 40, 80}
Sample size (N)  {200, 400, 800, 1600, …, 204800}
Evaluation Criteria

Prediction accuracy



Classification – percentage of correct classifications
Regression – percentage of variances in target variable that can be
explained by the regression model (coefficient of determination R2)
Computational learning cost

Ensembles learnt with the proposed procedure: i=1~E Hi*Ni
where E is ensemble size, Hi is # of hidden neurons for i-th NN, and Ni is
training sample size for i-th NN


“Simple” ensembles: H*N*E (since Hi = H and Ni = N for all i = 1~E)
Scatter plot

prediction accuracy vs. computational learning cost
Results Summary

For Waveform and MISR datasets


The resulting ensembles were comparable to the optimal
solution in terms of accuracy and computational effort
For Covertype data

The resulting ensembles were slightly inferior to the
optimal solution in terms of accuracy, but with near one
order of magnitude smaller computational effort
The optimal solution refers to the optimal combination of (E, H, N),
assuming exact same component NNs
Results – Waveform
87
86
accuracy (%)
85
84
83
proposed procedure
single NN
ensemble of 5 NN
ensemble of 10 NN
82
81
3
10
10
4
10
5
10
6
i=1~EH
or EH*N*E
H*N
*N
i*NS
i H*N*E
10
7
10
8
Results – Covertype
80
accuracy (%)
75
70
65
proposed procedure
single NN
ensemble of 5 NN
ensemble of 10 NN
60
10
3
10
4
10
5
10
6
i=1~EH*N
Hi*N*N
i or H*N*E
S
E
10
7
10
8
Results – MISR
0.9
0.8
0.7
R
2
0.6
0.5
0.4
0.3
proposed procedure
single NN
ensemble of 5 NN
ensemble of 10 NN
0.2
0.1
3
10
10
4
10
5
10
6
i=1~EH*N
Hi*NSi*N
orE H*N*E
10
7
10
8
Summary of the Resulting Ensembles
Dataset
Ntotal
Accuracy
E
N
H
waveform
12,753
86.10.1%
6-12
649-2,950
10-24
covertype
86,309
78.01.5%
2-6
9,983-14,022
31-37
MISR
100,815
0.850.01
4-8
7,854-14,187
18-26
Here,
Ntotal – total number of examples used,
Accuracy – prediction accuracy on DTS,
E – final ensemble size,
N – individual sample size,
H – number of hidden neurons for single NN
Agenda





Introduction
Motivation
Related Work
Proposed Work
Experimental Evaluation




Datasets
Experimental Setup
Results
Conclusions
Conclusions
A cost-effective iterative procedure was proposed to
learn NN ensembles from arbitrarily large datasets

It can learn ensembles of near-optimal accuracy
with moderate computational effort
 It is adaptive to the inherent complexity of the
datasets
 It is different from progressive sampling


Automatically adjusts model complexity
Utilize previously built models to guide the learning
process
Thanks!