GMDH and FAKE GAME:
Evolving ensembles of inductive
models
Pavel Kordík
CTU Prague, Faculty of Electrical Engineering,
Department of Computer Science
Outline of the tutorial …
• Introduction, FAKE GAME framework,
architecture of GAME
• GAME – niching genetic algorithm
• GAME – ensemble methods
– Trying to improve accuracy
• GAME – ensemble methods
– Models’ quality
– Credibility estimation
2/67
Theory in background
• Knowledge discovery
• Data preprocessing
• Data mining
• Neural networks
• Inductive models
• Continuous optimization
• Ensembles of models
• Information visualization
3/67
Problem statement
• Knowledge discovery (from databases) is a time-consuming task that demands expert skills.
• Data preprocessing is an extremely time-consuming process.
• Data mining involves many experiments to find proper methods and to adjust their parameters.
• Black-box data mining methods such as neural networks are not considered credible.
• Knowledge extraction from data mining methods is often very complicated.
4/67
Solutions proposed – FAKE GAME concept
[Diagram: INPUT DATA → AUTOMATED DATA PREPROCESSING → AUTOMATED DATA MINING (GAME ENGINE) → KNOWLEDGE EXTRACTION and INFORMATION VISUALIZATION (FAKE INTERFACE) → KNOWLEDGE]
5/67
Detailed view of proposed solutions
[Diagram: PROBLEM IDENTIFICATION, DATA COLLECTION, DATA INSPECTION, DATA CLEANING, DATA INTEGRATION and DATA WAREHOUSING produce the INPUT DATA; AUTOMATED DATA PREPROCESSING feeds the GAME ENGINE, which builds an ensemble of MODELs with credibility estimation; the FAKE INTERFACE provides classification, prediction, identification and regression, class boundaries, relationships of variables, math equations, interesting behaviour and feature ranking.]
6/67
The GAME engine for automated
data mining
[Diagram: PREPROCESSED DATA enters GAME, which builds an ensemble of MODELs used for REGRESSION, CLASSIFICATION, IDENTIFICATION or PREDICTION.]
What is hidden inside?
7/67
Background (GMDH)
• Group Method of Data Handling (GMDH) – prof. Ivakhnenko, 1968
• The principle of INDUCTION is employed: an inductive model grows from the data; it decomposes a multivariate problem into subproblems, solves them in a space of lower dimensionality (two dimensions for units with two inputs), and combines the partial solutions into the global solution of the problem.
[Diagram: a layered network of units (P) with polynomial transfer functions connecting the input variables to the output variable; each unit combines two inputs, for instance y = a·x1 + b·x2 + c. MIA GMDH is the most commonly used algorithm.]
8/67
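A minimal sketch of this inductive construction principle, assuming in-memory arrays for the data, normalized inputs and a simple stochastic-gradient fit for each unit; it illustrates the idea only and is not the actual GMDH/GAME implementation.

// Hypothetical sketch of GMDH-style inductive layer building.
// Each candidate unit models y = a*x1 + b*x2 + c on one pair of input features.
class PolyUnit {
    int i, j;              // indices of the two features the unit is connected to
    double a, b, c;        // coefficients of the transfer function y = a*x1 + b*x2 + c
    double rmse;           // error of the unit on the training data

    double output(double[] x) { return a * x[i] + b * x[j] + c; }
}

class GmdhSketch {
    // Builds one layer: creates a unit for every pair of features,
    // fits its coefficients and keeps only the best candidates for the next layer.
    static java.util.List<PolyUnit> buildLayer(double[][] x, double[] y, int keep) {
        java.util.List<PolyUnit> candidates = new java.util.ArrayList<>();
        int features = x[0].length;
        for (int i = 0; i < features; i++)
            for (int j = i + 1; j < features; j++)
                candidates.add(fitUnit(x, y, i, j));
        candidates.sort(java.util.Comparator.comparingDouble((PolyUnit u) -> u.rmse));
        return candidates.subList(0, Math.min(keep, candidates.size()));
    }

    static PolyUnit fitUnit(double[][] x, double[] y, int i, int j) {
        PolyUnit u = new PolyUnit();
        u.i = i; u.j = j;
        // simple gradient descent on the squared error (assumed training scheme)
        for (int epoch = 0; epoch < 1000; epoch++) {
            for (int k = 0; k < x.length; k++) {
                double err = u.output(x[k]) - y[k];
                u.a -= 0.01 * err * x[k][i];
                u.b -= 0.01 * err * x[k][j];
                u.c -= 0.01 * err;
            }
        }
        double sum = 0;
        for (int k = 0; k < x.length; k++) { double e = u.output(x[k]) - y[k]; sum += e * e; }
        u.rmse = Math.sqrt(sum / x.length);
        return u;
    }
}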
Our improvements
[Diagram comparing three architectures, each mapping the input variables to the output variable:
• MIA GMDH – unified units (P only), every unit has 2 inputs, non-heuristic search within a layer;
• ModGMDH – interlayer connections allowed, units with more inputs (e.g. 3), exhaustive search;
• GAME – diversified units (P, C, G, L, …), interlayer connections, a growing maximum number of inputs per layer, topology optimized by a genetic algorithm with deterministic crowding (GA+DC).]
9/67
Recent improvements (GAME)
• Group of Adaptive Models
Evolution (GAME)
- Heterogeneous units
- Several competing training
methods
- Interlayer connections; the number of inputs allowed increases with each layer
- Niching genetic algorithm (will
be explained) employed in
each layer to optimize the
topology of GAME networks.
- Ensemble of models generated
(will be explained)
10/67
[Diagram: a GAME network – input variables, first layer (units with up to 2 inputs), second layer (3 inputs max), third layer (4 inputs max), an interlayer connection, heterogeneous units (P, C, G, L) and the output layer producing the output variable.]
10/67
Heterogeneous units in GAME
x1, x2, …, xn denote the inputs of a unit; a0, a1, … are its coefficients, optimized during training.
• Linear (LinearNeuron): y = \sum_{i=1}^{n} a_i x_i + a_{n+1}
• Polynomial (CombiNeuron): y = \sum_{i=1}^{m} a_i \prod_{j=1}^{r} x_j + a_0
• Logistic (SigmNeuron): y = \frac{1}{1 + e^{-\sum_{i=1}^{n} a_i x_i}} + a_0
• Gaussian (GaussianNeuron): y = (1 + a_{n+1})\, e^{-\sum_{i=1}^{n} (x_i - a_i)^2 / (1 + a_{n+2}^2)} + a_0
• Exponential (ExpNeuron): y = a_{n+2}\, e^{a_{n+1} \sum_{i=1}^{n} a_i x_i} + a_0
• Sin (SinusNeuron): y = a_{n+1} \sin\!\left(a_{n+2} \sum_{i=1}^{n} a_i x_i + a_{n+3}\right) + a_0
• Rational (PolyFractNeuron): a ratio of polynomials in the inputs, built from terms a_{n \cdot i + j}\, x_i x_j (i, j = 1, …, n) plus a_{n^2+1} and a_0
• Universal (BPNetwork): y = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{pq}(x_p)\right)
11/67
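A sketch of how such heterogeneous units can share a common interface, using the Logistic and Gaussian transfer functions listed above; the SigmNeuron/GaussianNeuron names come from the slides, but the Java interface itself is an assumption, not the real GAME code.

// Hypothetical common interface for heterogeneous units.
interface TransferUnit {
    double output(double[] x);     // response of the unit for one input vector
}

class SigmUnit implements TransferUnit {
    double[] a;                    // a[0..n-1] input weights
    double a0;                     // output bias
    public double output(double[] x) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += a[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-s)) + a0;                  // logistic transfer function
    }
}

class GaussUnit implements TransferUnit {
    double[] a;                    // a[0..n-1] centres
    double aN1, aN2, a0;           // a_{n+1}, a_{n+2} shape coefficients and bias
    public double output(double[] x) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - a[i]) * (x[i] - a[i]);
        return (1.0 + aN1) * Math.exp(-s / (1.0 + aN2 * aN2)) + a0;   // Gaussian transfer function
    }
}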
Optimization of coefficients
(learning)
[Diagram: a Gaussian unit (GaussianNeuron) with inputs x_1, x_2, …, x_n and output y'; its transfer function is the Gaussian formula given above.]
We are looking for the optimal values of the coefficients a_0, a_1, …, a_{n+2}.
We have the inputs x_1, x_2, …, x_n and the target output y in the training data set.
The difference between the unit output y' and the target value y should be minimal for all vectors from the training data set:
E = \sum_{i=1}^{m} (y'_i - y_i)^2
12/67
How to derive analytic gradient?
Error of the unit (energy surface)
Gradient of the error
Unit with gaussian transfer function
Partial derivative of the error with respect to the coefficient a_i
13/67
Partial derivatives of the Gauss unit
14/67
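A sketch of the derivation for the Gaussian unit, assuming the transfer function given above; the notation below is an illustration rather than the exact formulas from the slides.

\[
E = \sum_{k=1}^{m} (y'_k - y_k)^2, \qquad
y'_k = (1 + a_{n+1})\, e^{-S_k} + a_0, \qquad
S_k = \frac{\sum_{i=1}^{n} (x_{k,i} - a_i)^2}{1 + a_{n+2}^2}
\]
\[
\frac{\partial E}{\partial a_j} = \sum_{k=1}^{m} 2\,(y'_k - y_k)\,\frac{\partial y'_k}{\partial a_j},
\qquad
\frac{\partial y'_k}{\partial a_j} = (1 + a_{n+1})\, e^{-S_k}\, \frac{2\,(x_{k,j} - a_j)}{1 + a_{n+2}^2}
\quad (j = 1, \dots, n),
\]
\[
\frac{\partial y'_k}{\partial a_0} = 1, \qquad
\frac{\partial y'_k}{\partial a_{n+1}} = e^{-S_k}, \qquad
\frac{\partial y'_k}{\partial a_{n+2}} = (1 + a_{n+1})\, e^{-S_k}\, S_k\, \frac{2\, a_{n+2}}{1 + a_{n+2}^2}.
\]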
Optimization of their coefficients
a) The unit does not provide an analytic gradient, only its error: given initial values of the coefficients a_1, a_2, …, a_n, the optimization method repeatedly proposes new values, the error on the training data is computed for them, and the gradient has to be estimated by the optimization method itself, until the final values are reached.
b) The unit provides both the analytic gradient and its error: in every iteration the unit computes the error on the training data together with the gradient of the error, and the optimization method uses both to propose new coefficient values until the final values are reached.
15/67
Very efficient gradient based training for
hybrid networks developed!
16/67
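A sketch of the two contracts described in a) and b), with hypothetical interface names (not the real GAME API): in case a) the optimizer has to estimate the gradient numerically from the error, in case b) it receives the analytic gradient directly.

// Hypothetical unit/optimizer contracts for the two training cases.
interface ErrorOnlyUnit {
    double error(double[] coeffs);                     // case a): error on training data only
}
interface GradientUnit extends ErrorOnlyUnit {
    double[] gradient(double[] coeffs);                // case b): analytic gradient as well
}

class OptimizerSketch {
    // Case a): the gradient must be estimated numerically from the error.
    static double[] estimateGradient(ErrorOnlyUnit unit, double[] c, double h) {
        double[] g = new double[c.length];
        double base = unit.error(c);
        for (int i = 0; i < c.length; i++) {
            double old = c[i];
            c[i] = old + h;
            g[i] = (unit.error(c) - base) / h;          // forward-difference estimate
            c[i] = old;
        }
        return g;
    }

    // Case b): the analytic gradient is used directly (one gradient-descent step).
    static void step(GradientUnit unit, double[] c, double rate) {
        double[] g = unit.gradient(c);
        for (int i = 0; i < c.length; i++) c[i] -= rate * g[i];
    }
}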
Outline of the talk …
• Introduction, FAKE GAME framework,
architecture of GAME
• GAME – niching genetic algorithm
• GAME – ensemble methods
– Trying to improve accuracy
• GAME – ensemble methods
– Models’ quality
– Credibility estimation
17/67
Genetic algorithm
18/67
Example of encoding into
chromosome
[Diagram: two units in the niching GA, each connected to a subset of the seven input features 1–7.]
Linear transfer unit (LinearNeuron):
  Chromosome – Inputs: 1001000 (positions 1–7); Transfer function: y = a_1 x_1 + a_2 x_2 + a_0; Settings: not implemented
Polynomial transfer unit (CombiNeuron):
  Chromosome – Inputs: 0000110; Transfer function encoding: 2115130 1203211, i.e. y = a_1 x_1 x_2^3 + a_2 x_1^2 x_2 + a_0; Settings: not implemented (n.i.)
19/67
The process of GAME models
evolution – Genetic Algorithm
Encoding the GAME unit to genotype (first layer): Inputs = 0 0 1 0, Type = P (transfer function), Other settings.
GA (no niching): select the N best individuals.
[Diagram: after the standard GA, all surviving units (P) are connected to the most significant feature.]
Result: REDUNDANT information extracted – all individuals are connected to the most significant feature after the standard GA.
20/67
A better way is to use worse, but non-correlated units
[Diagram: units A (f(A) = 8) and B (f(B) = 7.99) are combined into C with f(C) = 8; units X (f(X) = 8) and Y (f(Y) = 5) are combined into Z with f(Z) = 9; hence f(C) < f(Z).]
How to do it?
The outputs of units A and B are correlated – combining them does not bring any improvement in performance.
By using less significant features we get more additional information than by using several of the best individuals, all connected to the most significant feature!
21/67
Niching Genetic Algorithms
• A niching method extends the genetic algorithm so that it is able to locate multiple solutions.
• It promotes the formation and maintenance of stable subpopulations in the genetic algorithm.
Many methods exist for preserving the diversity of individuals (niching methods):
• Overspecification – good for dynamic systems
• Ecological genetic algorithms – multiple environments
• Heterozygote advantage – diploid genome, fitness increased by the distance of the parents
• Crowding (restricted replacement) – similar individuals are replaced within a subpopulation
• Restricted competition – tournament among compatible individuals (threshold)
• Fitness sharing – sharing methods – when a niche gets overcrowded, the search moves elsewhere
• Immune system models – antigens
(Taxonomy: canonical GAs vs. niching GAs)
22/67
Niching GA – Deterministic Crowding
• Deterministic Crowding: S. W. Mahfoud (1995).
• Subpopulations (niches) are based on the distance
of individuals.
For units in the GAME network, the distance of units in one population is based on the phenotypic difference of the units.
Distance = Hamming distance of chromosomes + Hamming distance of inputs processed

Unit  Chromosome  Inputs_processed
A     1000        1000
B     0010        0010
C     010010      1100
D     000011      1010

d(A, B) = H(1000, 0010) + H(1000, 0010) = 4
d(C, D) = H(010010, 000011) + H(1100, 1010) = 4
23/67
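A minimal sketch of this distance, assuming the chromosomes and the lists of processed inputs are stored as bit strings of equal length (illustration only, not the GAME implementation).

// Hypothetical sketch of the unit distance used by deterministic crowding:
// Hamming distance of the chromosomes plus Hamming distance of the inputs processed.
class UnitDistance {
    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    static int distance(String chromA, String inputsA, String chromB, String inputsB) {
        return hamming(chromA, chromB) + hamming(inputsA, inputsB);
    }

    public static void main(String[] args) {
        // reproduces the example from the slide: d(A, B) = 4
        System.out.println(distance("1000", "1000", "0010", "0010"));
    }
}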
Distance of GAME units
Distance(P1, P2) = genotypic distance + correlation of errors
• The correlation of errors is computed from the deviations of the units on the training & validation set.
• Genotypic distance = normalized distance of Inputs (Hamming(100010, 101100) + features used)
  + normalized distance of Transfer functions (Euclidean distance of coefficients)
  + normalized distance of Other attributes (distance of configuration variables).
[Diagram: two units P1 and P2 in the niching GA, connected to different subsets of the inputs 1–8; encoding of the units to chromosomes: P1 – Inputs 100010 (positions 1–6), Transfer function, Other; P2 – Inputs 101100 (positions 1–6), Transfer function, Other.]
24/67
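A sketch of the phenotypic part of this distance, assuming Pearson correlation of the units' deviations on the training & validation set; how the correlation is scaled and combined with the genotypic part is a guess, not the exact GAME formula.

// Hypothetical sketch: correlation of errors of two units and the combined distance.
class PhenotypicDistance {
    static double correlation(double[] devA, double[] devB) {
        int n = devA.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += devA[i]; meanB += devB[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (devA[i] - meanA) * (devB[i] - meanB);
            varA += (devA[i] - meanA) * (devA[i] - meanA);
            varB += (devB[i] - meanB) * (devB[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);            // Pearson correlation of the deviations
    }

    static double distance(double genotypicDistance, double[] devA, double[] devB) {
        // the slide combines the genotypic part with the correlation of errors;
        // the exact scaling/sign convention is not given, so this line is a guess
        return genotypicDistance + correlation(devA, devB);
    }
}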
The process of GAME models
evolution – GA with DC
Encoding the GAME unit to genotype (first layer): Inputs = 0 0 1 0, Type = P (transfer function), Other settings.
Niching GA (Deterministic Crowding): select the best individual from each niche.
[Diagram: after the niching GA, the surviving units (P, P, S) are connected to different features.]
Result: MAXIMUM information extracted – individuals connected to less significant features survived, so we gained more information.
25/67
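A minimal sketch of the deterministic crowding replacement rule (Mahfoud, 1995), assuming a generic Individual with a fitness and the distance measure above; the selection and crossover details of GAME are omitted.

// Hypothetical sketch of deterministic crowding: each offspring competes with the
// more similar parent and replaces it only if it is at least as fit.
class DeterministicCrowding {
    interface Individual {
        double fitness();
        double distanceTo(Individual other);    // e.g. the Hamming + correlation distance above
    }

    static Individual[] replace(Individual p1, Individual p2, Individual c1, Individual c2) {
        Individual[] next = new Individual[2];
        // pair each child with the closer parent
        if (p1.distanceTo(c1) + p2.distanceTo(c2) <= p1.distanceTo(c2) + p2.distanceTo(c1)) {
            next[0] = (c1.fitness() >= p1.fitness()) ? c1 : p1;
            next[1] = (c2.fitness() >= p2.fitness()) ? c2 : p2;
        } else {
            next[0] = (c2.fitness() >= p1.fitness()) ? c2 : p1;
            next[1] = (c1.fitness() >= p2.fitness()) ? c1 : p2;
        }
        return next;
    }
}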
Experimental Results: Standard GA
versus GA+Deterministic Crowding
Units connected to the same input belong to one niche.
[Diagram: a standard GA produces a single uniform population (all units P connected to the same input); GA with Deterministic Crowding produces three subpopulations (niches) with units P, P, S connected to different inputs.]
26/67
Experimental Results: Standard GA
versus GA+Deterministic Crowding
[Charts: RMS error of the models of energy consumption, cold water consumption and hot water consumption, comparing GA and GA+DC.]
Results: the statistical test proved that, at the 95% level of significance, GA+DC performs better than the simple GA for the energy and hot water consumption.
27/67
Experimental Results: Standard GA
versus GA+Deterministic Crowding
[Chart: average RMS error of 20 models on the on-ground nuclear tests data set (attributes: wave pressure, total radiation, instant radiation, fire radius, crater diameter, crater depth), with Deterministic Crowding off (simple GA) vs. on (GA+DC). The data set is small and simple.]
Results: except for the models of the fire radius attribute, the performance of all other models is significantly better with Deterministic Crowding enabled.
28/67
Results: the best way is to combine
genotypic and phenotypic distance
[Charts: RMS error on the Boston testing data set (Simple Ensemble and Weighted Ensemble); average, minimum and maximum RMS error of 10 ensemble models on the Boston data set, comparing the distance measures None, Genome, Correlation and Gen&Corr.]
29/67
Distance of units can be monitored
during the training phase
[Visualization: chromosome distance, correlation and error on training vectors (RMSE) of the units, shown at epoch 1, at epoch 30 and after sorting.]
At the start of the niching Genetic Algorithm the units are randomly initialized, trained and their error is computed; after 30 epochs the niching Genetic Algorithm terminates; finally the units are sorted according to their RMSE, chromosome difference and correlation.
30/67
Another visualization of diversity
[Visualization: matrices of genotypic distances and correlation of errors for two models (model imU, layer 1, epoch 9 and model pp, layer 10, epoch 10). Observations: clusters are not apparent in one matrix; each unit has a different genotype (units are connected to diverse inputs); some units are connected to the same input; half of the units are better trained than the others; some of them are well trained.]
31/67
GA or niching GA?
• Genetic search is more effective than
exhaustive search
• The use of niching genetic algorithm (e.g.
with Deterministic Crowding) is beneficial:
– More accurate models are generated
– Feature ranking can be derived
– Non-correlated units (active neurons) can be
evolved
• The crossover of active neurons in transfer
function brings further improvements
32/67
Remember? Units with various transfer functions compete in the niching GA
[The same set of heterogeneous units as on slide 11: Linear (LinearNeuron), Polynomial (CombiNeuron), Logistic (SigmNeuron), Gaussian (GaussianNeuron), Exponential (ExpNeuron), Sin (SinusNeuron), Rational (PolyFractNeuron) and Universal (BPNetwork).]
33/67
Natural selection of units really works!
The best performance is achieved when units of all types
are competing in the construction process.
[Chart: RMS error on the Boston data set (training and testing set) for GAME models built with individual unit types only (Linear, Combi, Sigm, Gauss, MultiGauss, Exp, Sin, Fract, Perceptron, …) and with all unit types competing ("all").]
34/67
Optimization methods available in
GAME
35/67
Experimental results of competing
training methods on Building data set
[Chart: RMS error on the testing data sets (Building data), averaged over 5 runs, for the competing optimization methods.]
Energy consumption: CG, DE, QN, SADE, all, SOS, CACO, PSO, HGAPSO, ACO, OS, palDE
Cold water consumption: QN, CG, SADE, DE, all, HGAPSO, CACO, SOS, palDE, ACO, PSO, OS
Hot water consumption: QN, all, DE, SADE, CG, OS, HGAPSO, CACO, SOS, palDE, ACO, PSO
36/67
RMS error on the Boston data set
37/67
Classification accuracy [%] on the
Spiral data set
38/67
Evaluation on diverse data sets
What is "all"?
39/67
Remember the Genetic algorithm
optimizing the structure of GAME?
[Diagram: the same chromosome encoding as on slide 19, with the optimization method added into the chromosomes.]
Linear transfer unit:
  Inputs: 1001000 (positions 1–7); Transfer function: y = a_1 x_1 + a_2 x_2 + a_0; Settings: not implemented; Optimization method: CACO
Polynomial transfer unit:
  Inputs: 0000110; Transfer function encoding: 2115130 1203211, i.e. y = a_1 x_1 x_2^3 + a_2 x_1^2 x_2 + a_0; Settings: n.i.; Optimization method: DE
40/67
Outline of the talk …
• Introduction, FAKE GAME framework,
architecture of GAME
• GAME – niching genetic algorithm
• GAME – ensemble methods
– Trying to improve accuracy
• GAME – ensemble methods
– Models’ quality
– Credibility estimation
41/67
Ensemble models – why use them?
• Hot topic in machine learning communities
• Successfully applied in several areas such as:
  – Face recognition [Gutta96, Huang00]
  – Optical character recognition [Drucker93, Mao98]
  – Scientific image analysis [Cherkauer96]
  – Medical diagnosis [Cunningham00, Zhou02]
  – Seismic signal classification [Shimshoni98]
  – Drug discovery [Langdon02]
  – Feature extraction [Brown02]
42/67
Ensemble models – what are they?
• A collection of models (e.g. neural networks) is trained for the same task.
• The outputs of the models are then combined.
[Diagram: the input variables feed Model 1, Model 2, Model 3, …, Model N; their output variables are combined into the ensemble output.]
43/67
Bagging
• Sample with replacement several training sets of size n (instead of having just one training set of size n).
• Build a model/classifier for each training set.
• Combine the classifiers' predictions (averaging or voting).
[Diagram: the training data is sampled with replacement into Sample 1 … Sample M; a learning algorithm builds Model 1 … Model M (classifiers); their outputs are combined by averaging or voting into the ensemble model (classifier) output.]
44/67
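A minimal sketch of the bagging procedure described above, assuming a hypothetical Learner interface instead of the actual GAME training call (regression case with averaging).

// Hypothetical bagging sketch: M bootstrap samples, one model per sample, averaged output.
class BaggingSketch {
    interface Model { double predict(double[] x); }
    interface Learner { Model train(double[][] x, double[] y); }   // assumed learner API

    static Model bag(Learner learner, double[][] x, double[] y, int m, long seed) {
        java.util.Random rnd = new java.util.Random(seed);
        Model[] models = new Model[m];
        for (int s = 0; s < m; s++) {
            // bootstrap sample of the same size as the training set (with replacement)
            double[][] xs = new double[x.length][];
            double[] ys = new double[x.length];
            for (int k = 0; k < x.length; k++) {
                int pick = rnd.nextInt(x.length);
                xs[k] = x[pick];
                ys[k] = y[pick];
            }
            models[s] = learner.train(xs, ys);
        }
        // ensemble output = simple average of the member models
        return input -> {
            double sum = 0;
            for (Model mod : models) sum += mod.predict(input);
            return sum / models.length;
        };
    }
}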
We employed Bagging in GAME
• Group of Adaptive Models Evolution is a method we are
working on in our department (see next slide).
[Diagram: the training data is sampled with replacement into Sample 1 … Sample M; GAME builds GAME model 1 … GAME model M; averaging or voting produces the GAME ensemble output.]
45/67
Results of Bagging GAME models
Weighted ensemble of GAME models
[Charts: RMS error of skeleton age estimation on the training & validation data set and on the testing data set, for the individual GAME models 1–27, the simple ensemble and the weighted ensemble.]
The weighted ensemble has an apparent tendency to overfit the data. While its performance is superior on the training and validation data, on the testing data there are several individual models performing better.
46/67
GAME ensembles are not
statistically significantly better than
the best individual GAME model!
Why?
• Are GAME models diverse?
  – Vary input data ✓ (bagging)
  – Vary input features ✓ (using a subset of features)
  – Vary initial parameters ✓ (random initialization of weights)
  – Vary model architecture ✓ (heterogeneous units used)
  – Vary training algorithm ✓ (heterogeneous units used)
  – Use a stochastic method when building the model ✓ (niching GA)
Yes, they use several methods to promote diversity!
47/67
Ensemble models – why do they work?
Bias-variance decomposition (a theoretical tool to study how the training data affect the performance of models/classifiers):
Total expected error of the model = variance + bias
• variance: the part of the expected error due to the nature of the training set
• bias: the part of the expected error caused by the fact that the model is not perfect
[Diagrams: ensembling reduces variance (models 1 and 2 trained on different training data, with the ensemble model between them); ensembling reduces bias (models 1 and 2 and the ensemble model).]
G. Brown: Diversity in Neural Network Ensembles, 2004
48/67
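A worked form of the decomposition for a single test point (the standard textbook formulation, with notation that is mine rather than the slide's): for a target value y and a model output \hat{f} that depends on the random training set,

\[
\mathbb{E}\big[(\hat{f} - y)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}] - y\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{f} - \mathbb{E}[\hat{f}])^2\big]}_{\text{variance}}.
\]

Averaging several diverse models lowers the variance term while leaving the bias of the averaged prediction essentially unchanged, which is what the pictures above illustrate.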
GAME ensembles are not
statistically significantly better than
the best individual GAME model!
Why?
Ensembling reduces bias …
• GAME models grow until they reach optimal complexity (they grow from the minimal form up to the required complexity).
• Therefore the bias cannot be further reduced.
• Bagging slightly reduces the variance, but not significantly (for the data sets we have experimented with).
… just for non-optimal models.
[Diagrams: for optimal models (model 1, model 2) the ensemble model brings little improvement; ensembling reduces bias only for non-optimal models (model 1, model 2, model 3 and their ensemble).]
49/67
Results of Bagging GAME models
Simple ensemble of GAME models
[Charts: RMS error of cold water consumption models (1–6 and their ensemble) and of age estimation models (1–12 and their ensemble).]
Is the ensemble significantly better than any of the individual models?
• Suboptimal GAME models: YES
• Optimal GAME models: NO
50/67
Outline of the talk …
• Introduction, FAKE GAME framework,
architecture of GAME
• GAME – niching genetic algorithm
• GAME – ensemble methods
– Trying to improve accuracy
• GAME – ensemble methods
– Models’ quality
– Credibility estimation
51/67
Ensembles can estimate credibility
• We found out that ensembling GAME models does not further improve the accuracy of modelling.
• Why use an ensemble then?
• There is one big advantage that ensembles of models can provide: using an ensemble of models, we can estimate the credibility of the response for any input vector.
• Ensembles are starting to be used in this manner in several real-world applications.
52/67
Visualization of model’s behavior
[Screenshots: visualization of the behaviour of GAME models.]
53/67
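A minimal sketch of how such a behaviour plot can be obtained: one input is swept from its minimum to its maximum while the other inputs stay constant (the Model interface is an assumption, as in the other sketches).

// Hypothetical sketch: response of a model to input x[dim] swept from min to max,
// while the remaining inputs are held constant.
class ResponseSweep {
    interface Model { double predict(double[] x); }

    static double[] sweep(Model model, double[] base, int dim, double min, double max, int steps) {
        double[] response = new double[steps];
        double[] x = base.clone();                    // the other inputs stay constant
        for (int s = 0; s < steps; s++) {
            x[dim] = min + (max - min) * s / (steps - 1);
            response[s] = model.predict(x);
        }
        return response;
    }
}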
Estimating credibility using
ensemble of GAME models
We need to know when the response of a model is based on the training data and when it is just random guessing – then we can estimate the credibility of the model.
[Table: class responses of the n GAME models (1, 2, …, n) in the ensemble for a set of test vectors (attributes EE, WE, WW, EW, SA, SV, FA, FV, RR, RS; classes such as S, P, Healthy); cells where the models' responses differ are shown with a dark background.]
When the responses of the models differ, the models are not credible (dark background).
54/67
While x2 varies from min to max and the other inputs stay constant, the output of the model shows the sensitivity of the model to variable x2 in the configuration x1, x3 = const.
In regions where we have enough data, all models from the ensemble give a compromise response – they are credible.
Artificial data set – partial definition
Advantages:
• No need for the training data set,
• Success of the modeling method is considered,
• Importance of the inputs is considered.
Credibility: the criterion is the dispersion of the models' responses.
55/67
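A minimal sketch of this dispersion criterion; the Model interface is the same assumption as in the earlier sketches, and the mapping from dispersion to a credibility value is an illustrative choice, not the GAME formula.

// Hypothetical sketch: credibility of an ensemble response for one input vector,
// measured as the dispersion (standard deviation) of the member models' outputs.
class CredibilitySketch {
    interface Model { double predict(double[] x); }

    static double dispersion(Model[] ensemble, double[] x) {
        double mean = 0;
        for (Model m : ensemble) mean += m.predict(x);
        mean /= ensemble.length;
        double var = 0;
        for (Model m : ensemble) {
            double d = m.predict(x) - mean;
            var += d * d;
        }
        return Math.sqrt(var / ensemble.length);
    }

    // Low dispersion (models agree) -> high credibility; high dispersion -> low credibility.
    static double credibility(Model[] ensemble, double[] x) {
        return 1.0 / (1.0 + dispersion(ensemble, x));   // illustrative mapping only
    }
}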
Practical example – Nuclear data
A.G. Ivakhnenko – experiments with GMDH
Extremely sparse data set – 6 dimensions, 10 data vectors
56/67
Typical application of softcomputing methods in industry:
Building data set
Looking for the optimal configuration of the GAME simulator – Pavel Staněk
[Diagram: the GAME ensemble as a black box comprising a model of energy consumption, a model of cold water consumption and a model of hot water consumption.]
57/67
Building data set
The data set is problem A of the "Great Energy Predictor Shootout" contest – The First Building Data Analysis and Prediction Competition; ASHRAE Meeting; Denver, Colorado; June 1993.
58/67
Credibility of models in classification
[Visualization: the class-membership outputs of GAME model 1, GAME model 2 and GAME model 3 are multiplied (1 × 2 × 3) for the Iris Setosa, Iris Versicolor and Iris Virginica classes.]
59/67
Credibility of models in classification
Before
After
60/67
Other visualization techniques
61/67
Automated retrieval of 3D plots
showing interesting behaviour
A genetic algorithm with a special fitness function is used to adjust all the other inputs (dimensions).
62/67
Conclusion
New quality in modeling:
• GAME models adapt to the complexity of the data set.
• Hybrid models are more accurate than models with a single type of unit.
• A compromise response of the ensemble indicates valid system states.
• A black-box model gives an estimate of its own accuracy (e.g. 2.3 ± 0.1).
• Automated retrieval of "interesting plots" of system behavior.
• A Java application was developed for real-time simulation of the models.
63/67
Benchmarks of GAME
Classification of a very complex data set: the Spiral data classification benchmark problem solved by GAME.
Note: the inner cross-validation mechanism that prevents overfitting was disabled for this problem.
64/67
Internet advertisements database
BackProp best result: 87.97%; GAME best result: 93.13%
65/67
Other benchmarking problems
Zoo database – accuracy of classification into genotypes:
Backpropagation: 80.77%; GAME: 93.5%
Stock value prediction (S&P DEP RECEIPTS) – change-in-direction prediction ratio:
Backpropagation with momentum: 82.4%; GAME: 91.3%
Pima Indians data set – 10-fold cross-validation results: classification accuracy [%]
66/67
Thank you!
67/67