A Genetic Programming System for Building
Block Analysis to Enhance Data Analysis and
Data Mining Techniques
Christoph F. Eick, Walter D. Sanz, and Ruijian Zhang
www.cs.uh.edu/~ceick/matta.html
University of Houston
Organization
1. Introduction
2. General Approach and Corresponding Methodology
3. Wols --- A Genetic Programming System to Find Building Blocks
4. Experimental Results
5. Summary and Conclusion
SPIE Data Mining Conference, Orlando, April 1999
1. Introduction
Regression analysis, which is also known as curve fitting, may be
broadly defined as the analysis of relationships among variables. The
actual specification of the form of the function, whether it is linear,
quadratic, or polynomial, is left entirely to the user.
Symbolic regression is the process of discovering both the functional
form of a target function and all of its necessary coefficients. Symbolic
regression involves finding a mathematical expression, in symbolic
form, that provides a good, best, or perfect fit between a finite sampling
of values of the input variables and the associated values of the
dependent variables.
Genetic Programming simulates the Darwinian evolutionary process
and naturally occurring genetic operations on computer programs.
Genetic programming provides a way to search for the fittest computer
program in a search space which consists of all possible computer
programs composed of functions and terminals appropriate to the
problem domain.
Example -- Evolving Syntax Trees with Genetic Programming:
[Figure: a population of 4 syntax trees, built from functions (*, -, +, /, cos) and terminals (the variables A, B, C, D, E and numeric constants), evolves from Generation 0 to Generation 1 through crossover, copying, and mutation; the best-so-far tree is tracked across generations.]
Motivating Example
Task: Approximate boundaries between classes with rectangles
[Figure: two examples with classes C1 and C2; in the first the class boundary is easy to approximate with rectangles, in the second it is difficult.]
Idea: Transform the representation space so that the approximation
problem becomes simpler (e.g., instead of (x,y), just use x-y).
Problem: How do we find such representation space transformations?
Building Blocks
Definition: Building Blocks are patterns which occur with a high frequency in a
set of solutions.
Example:
(- (* C D) (+ A B))
(- C (+ B A))
(- (+ A B) (COS A))
(- (+ C (/ E (SQRT D))) (+ A B))
Leaf Building Block: (+ A B)
Root Originating Building Block: (- ? (+ A B))
FUNC1 = (+ A B)
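To make the leaf building block idea concrete, here is a minimal sketch (not the actual WOLS code; the function names are ours) that tallies leaf-level patterns in a set of solution trees represented as S-expressions:

```lisp
;; Minimal sketch (not the WOLS implementation): count how often each
;; "leaf building block" -- a function applied only to atoms -- occurs
;; in a set of solution trees given as S-expressions.
(defun leaf-block-p (expr)
  "True if EXPR is a function call whose arguments are all atoms (leaves)."
  (and (consp expr) (every #'atom (rest expr))))

(defun count-leaf-blocks (solutions)
  "Tally every leaf-level pattern occurring in a list of solution S-expressions."
  (let ((table (make-hash-table :test #'equal)))
    (labels ((walk (expr)
               (when (consp expr)
                 (when (leaf-block-p expr)
                   (incf (gethash expr table 0)))
                 (mapc #'walk (rest expr)))))
      (mapc #'walk solutions)
      table)))

;; With the four example solutions above:
;; (count-leaf-blocks '((- (* C D) (+ A B))
;;                      (- C (+ B A))
;;                      (- (+ A B) (COS A))
;;                      (- (+ C (/ E (SQRT D))) (+ A B))))
;; => a table in which the literal pattern (+ A B) occurs 3 times
;;    (a real tool would likely also treat the commutative variant (+ B A) as a match).
```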
Constructive Induction
Definition: Constructive Induction is a general approach for coping with
inadequate, redundant, and useless attributes found in the original data.
Constructive Induction explores various representation spaces. Building block
analysis and constructive induction are used in this research to transform the
original data by adding new attributes and by dropping useless ones.
Example:
Primitive Attributes:

  Cases    A     B     C
  #1       0.8   0.1   0.2
  #2       0.4   0.5   0.3

Primitive Attributes and New Attributes:

  Cases    A     B     C     A-B    A-C    B-C
  #1       0.8   0.1   0.2   0.82   0.84   0.28
  #2       0.4   0.5   0.3   0.7    0.58   0.65
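Although the new columns are labeled A-B, A-C, and B-C, their values appear to have been computed with the por operator from the WOLS function set rather than literal subtraction, since por(a,b) = a + b - a*b reproduces the table (por(0.8, 0.1) = 0.82, por(0.8, 0.2) = 0.84, and so on). A minimal sketch of the corresponding data transformation, under that assumption:

```lisp
;; Minimal sketch (not WOLS code): extend each data row with derived
;; attributes built from a discovered building block.  Here the derived
;; attribute is por(a,b) = a + b - a*b, which matches the new-column
;; values in the example above.
(defun por (a b) (- (+ a b) (* a b)))

(defun extend-row (row)
  "ROW is (A B C); return (A B C por(A,B) por(A,C) por(B,C))."
  (destructuring-bind (a b c) row
    (list a b c (por a b) (por a c) (por b c))))

;; (mapcar #'extend-row '((0.8 0.1 0.2) (0.4 0.5 0.3)))
;; => ((0.8 0.1 0.2 0.82 0.84 0.28) (0.4 0.5 0.3 0.7 0.58 0.65))
;;    (up to floating-point rounding)
```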
2. General Approach and
Corresponding Methodology
Primary Learning System: the one we use for the task at hand (e.g., C5.0)
Secondary Learning System: the one used to find building blocks
General Steps of the Methodology:
0. Run the primary learning system for the data collection. Record prediction
performance and good solutions.
1. Run the genetic programming/symbolic regression/building block analysis system
(secondary learning system) several times for the data collection with a rich function
set. Record prediction performance and good solutions.
2. Analyze the good solutions found by the secondary learning system for the
occurrence of building blocks.
3. Using the results of step 2, transform the data collection by generating new attributes.
4. Redo steps 0-3 for the modified data collection, as long as a significant improvement
occurs.
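A minimal sketch of the overall loop follows; every helper function below (run-primary-learner, run-wols, find-building-blocks, transform-data, significant-improvement-p) is a hypothetical stand-in for the systems named above, not an actual WOLS or C5.0 interface:

```lisp
;; Hypothetical driver for steps 0-4; all helpers are stand-ins.
(defun building-block-loop (data)
  (let ((accuracy (run-primary-learner data)))            ; step 0
    (loop
      (let* ((solutions (run-wols data))                  ; step 1
             (blocks    (find-building-blocks solutions)) ; step 2
             (new-data  (transform-data data blocks))     ; step 3
             (new-acc   (run-primary-learner new-data)))  ; step 4: re-run step 0
        (unless (significant-improvement-p new-acc accuracy)
          (return (values data accuracy)))                ; stop when no longer improving
        (setf data new-data
              accuracy new-acc)))))
```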
3. Wols --- A Genetic Programming
System that Finds Building Blocks
[Architecture diagram: SRGP front-end interface -> executable SRGP code -> building block detection tool.]
Wols’ Major Inputs
Example:

  Cases    A1    A2    B
  #1       0.8   0.1   0.2
  #2       0.4   0.3   0.3
  #3       0.3   0.4   0.4
  #4       0.2   0.5   0.1

Target function: B = f(A1, A2)
Function Set: *, +, SIN
Example approximation found: B = A1 * SIN(A2 + A1)
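For illustration, a minimal, hypothetical sketch of how a candidate expression such as B = A1 * SIN(A2 + A1) can be evaluated against one row of such a table (eval-candidate is our name, not part of Wols):

```lisp
;; Hypothetical sketch: evaluate a candidate S-expression on one data row
;; by substituting the row's values for the variables A1 and A2.
(defun eval-candidate (expr row)
  "ROW is (a1 a2 b); substitute the variable values into EXPR and evaluate it."
  (destructuring-bind (a1 a2 b) row
    (declare (ignore b))
    (eval (sublis (list (cons 'A1 a1) (cons 'A2 a2)) expr))))

;; (eval-candidate '(* A1 (SIN (+ A2 A1))) '(0.8 0.1 0.2))
;; => 0.8 * sin(0.9), approximately 0.6267
```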
Features of the Wols System
The WOLS system is basically a modified version of the standard genetic programming model.
A standard genetic programming model employs crossover and mutation operators that evolve
solutions (here, approximation functions), relying on the principle of survival of the
fittest: better solutions reproduce with a higher probability.
The WOLS system employs the traditional “sub-tree swapping” crossover operator, but
provides a quite sophisticated set of 7 mutation operators that apply various random
changes to solutions.
The goal of the WOLS system is to find good approximations for an unknown function f,
and to analyze the decomposition of good approximations with the goal of identifying
building blocks.
The front end of the WOLS system takes a data collection DC=(A1,…,An, B) as its input
and tries to find good approximations h, where B=h(A1,…,An), of the unknown function f
with respect to a given function set and a given error function.
The symbolic regression genetic programming component (SRGP) of the WOLS system
includes an interactive front-end interface that allows a user to easily generate a genetic
program that employs symbolic regression and is tailor-made to the problem the user
wishes to solve.
Features of the Wols System (cont.)
The system run starts by asking the user interactively for parameters needed to run
the symbolic regression system. The information the user provides is then used to
generate the functions for the specific problem he wishes to solve. These problem
specific functions are then merged with a set of general genetic programming
functions to form a complete problem specific genetic program.
The SRGP component allows the user to select between three different error
functions:
• Manhattan error function (minimizes the absolute error in the output
variable)
• Least Square error function (minimizes the squared error in the output
variable)
• Classification error function (minimizes the number of misclassifications)
The WOLS system is capable of solving both regression analysis and classification
problems. In regression analysis problems the goal is to minimize the error, while in
classification problems the goal is to minimize the number of misclassifications.
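To make the three options concrete, here is a hedged sketch (not the actual SRGP code; whether SRGP averages or sums the per-case errors is not stated on the slides, so averages are assumed here):

```lisp
;; Sketch of the three selectable error functions.  PREDICTED and DESIRED
;; are lists of outputs (or class labels) for the n test cases.
(defun manhattan-error (predicted desired)
  "Average absolute error |x_i - y_i|."
  (/ (reduce #'+ (mapcar (lambda (x y) (abs (- x y))) predicted desired))
     (length desired)))

(defun least-square-error (predicted desired)
  "Average squared error (x_i - y_i)^2."
  (/ (reduce #'+ (mapcar (lambda (x y) (expt (- x y) 2)) predicted desired))
     (length desired)))

(defun classification-error (predicted desired)
  "Number of misclassified cases (labels compared with EQL)."
  (count nil (mapcar #'eql predicted desired)))

;; (manhattan-error '(0.2 0.4) '(0.25 0.35)) => 0.05 (approximately)
```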
Features of the Wols System (cont.)
The goal of the decision block analysis tool is to identify frequently occurring patterns
in sets of solutions. In our particular approach, building blocks are expressions that
involve function symbols, variable symbols, as well as the ‘?’ (“don’t care”) symbol.
For example, if DC=(A1, A2, A3, B) is a data collection, then the variable A1, the expression
(A2 * A3), and the expression ((sin A1) + (A2 * ?)) are all potential building blocks.
In general we are interested in finding patterns that occur in about 85% or more of the
set of analyzed approximations.
The identification of building blocks is useful both for improving the attribute space and
for determining which part of the search space has been explored.
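A building block containing the ‘?’ symbol can be checked against a candidate expression with a simple recursive matcher; the following is a sketch under our own naming (pattern-match-p is not a WOLS identifier):

```lisp
;; Hypothetical sketch: does pattern PAT (which may contain the ? wildcard)
;; match expression EXPR?
(defun pattern-match-p (pat expr)
  (cond ((eq pat '?) t)                      ; "don't care" matches anything
        ((atom pat) (eql pat expr))          ; symbols and constants must agree
        ((atom expr) nil)
        ((= (length pat) (length expr))
         (every #'pattern-match-p pat expr)) ; match element by element
        (t nil)))

;; (pattern-match-p '(- ? (+ A B)) '(- (* C D) (+ A B))) => T
;; (pattern-match-p '(- ? (+ A B)) '(- C (+ B A)))       => NIL
```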
Decision Block Analysis Functions
Root Analysis: Finds common root patterns among a population of
solution trees.
General Analysis: Finds common patterns anywhere in a solution tree.
Leaf Analysis: Finds common patterns at the leaf nodes of solution trees.
Frequency Analysis: Finds the total number of times a specific operator, variable,
or constant is used in the population of solution trees.
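As an illustration of the last of these analyses, a minimal sketch (hypothetical, not the actual tool) that produces raw usage counts; the real tool additionally reports in what percentage of the analyzed solutions each symbol appears:

```lisp
;; Hypothetical sketch of frequency analysis: tally how often each operator,
;; variable, and constant appears across a population of solution trees.
(defun frequency-analysis (solutions)
  (let ((counts (make-hash-table :test #'equal)))
    (labels ((walk (expr)
               (if (consp expr)
                   (progn (incf (gethash (first expr) counts 0)) ; operator symbol
                          (mapc #'walk (rest expr)))
                   (incf (gethash expr counts 0)))))             ; variable or constant
      (mapc #'walk solutions)
      counts)))

;; (frequency-analysis '((- (* C D) (+ A B)) (- C (+ B A))))
;; => e.g. "-" counted 2 times, "+" 2 times, C 2 times, ...
```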
4. Experimental Results
Experiment 1- 5 variable problem with 30 points training data
Experiment 2 - 5 variable problem with 150 points training data
Experiment 3 - Simple Linear Regression
Experiment 4 - Time for a Hot Object To Cool
Experiment 5 - Time for a Hot Object To Cool Using Constructive Induction
Experiment 6 - Glass Classification
Experiment 7 - Learning An Approximation Function
Experiment 7 - Learning An
Approximation Function
f(a, b, c, d, e) = |b-|d-0.44||*|log(a+1)| + a*b*(1-log(2))
Random values were plugged into function f in order to generate test
cases of the form (a, b, c, d, e, f(a, b, c, d, e))
training data set of 150 test cases; testing data set of 150 test cases
Goal was to find a function h(a, b, c, d, e) that best approximates
function f.
function set: +, *, loga(x) := |log(x)|, cosa(x) := |cos(x)|, sqrta(x) := |sqrt(x)|,
por(a,b) := a+b-a*b, diva(a,b) := |a/b|, aminus(a,b) := |a-b|,
expa(x) := e^min(x, 300), sin100a(x) := |sin(100*x)|
variable set: a, b, c, d, e
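For concreteness, the protected operators of this function set can be written directly from the definitions above, together with the target function f; this is a sketch, and the exact WOLS implementations may differ (e.g., in how loga and diva guard against log(0) or division by zero):

```lisp
;; Protected operators of the Experiment 7 function set (sketch).
(defun loga    (x)   (abs (log x)))              ; |log(x)|
(defun cosa    (x)   (abs (cos x)))              ; |cos(x)|
(defun sqrta   (x)   (abs (sqrt x)))             ; |sqrt(x)|
(defun por     (a b) (- (+ a b) (* a b)))        ; a + b - a*b
(defun diva    (a b) (abs (/ a b)))              ; |a/b|
(defun aminus  (a b) (abs (- a b)))              ; |a - b|
(defun expa    (x)   (exp (min x 300)))          ; e^min(x, 300)
(defun sin100a (x)   (abs (sin (* 100 x))))      ; |sin(100*x)|

;; Target function of Experiment 7:
;; f(a,b,c,d,e) = |b - |d - 0.44|| * |log(a+1)| + a*b*(1 - log(2))
(defun f (a b c d e)
  (declare (ignore c e))                         ; c and e do not appear in f
  (+ (* (aminus b (aminus d 0.44)) (loga (+ a 1)))
     (* a b (- 1 (log 2)))))
```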
Experiment 7 - continued
Fitness Function: Average Absolute Manhattan Distance

    error = ( Σ_{i=1}^{n} |x_i - y_i| ) / n

where:
    n   = total number of test cases
    x_i = output generated by the solution found through genetic programming
    y_i = output generated by the known function (desired output)
RUN 1 - Direct Approach
RUN 2 - Using Leaf Building Blocks
RUN 3 - Using Leaf Building Blocks and Reduced Operator and Variable Sets
For each run a population size of 200 was used.
Each run was terminated after 100 generations.
RUN 1 - Direct Approach
Average Absolute Manhattan Distance Error
Training: 0.01102
Testing: 0.01332
Leaf Building Blocks In The Top 10% Of The Population
Pattern (* B A) occurred 25 times in 100.0 % of the solutions.
Pattern (AMINUS 0.41999999999999998 D) occurred 32 times in 95.0 % of the solutions.
Pattern (EXPA (AMINUS 0.41999999999999998 D)) occurred 28 times in 95.0 % of the solutions.
Frequency Analysis In The Top 10% Of The Population
+        was used 9 times in 25.0% of population
aminus   was used 56 times in 100.0% of population
*        was used 176 times in 100.0% of population
/        was used 75 times in 100.0% of population
por      was used 111 times in 100.0% of population
sqrta    was used 63 times in 100.0% of population
Frequency Analysis (continued)
sin100a  was used 3 times in 10.0% of population
cosa     was used 1 times in 5.0% of the population
loga     was used 2 times in 5.0% of population
expa     was used 33 times in 100.0% of population
A        was used 76 times in 100.0% of population
B        was used 132 times in 100.0% of population
C        was used 9 times in 35.0% of population
D        was used 57 times in 100.0% of population
E        was used 21 times in 40.0% of population
CONSTANT was used 152 times in 100.0% of population
Leaf Building Blocks: (* B A), (AMINUS 0.41999999999999998 D), and
(EXPA (AMINUS 0.41999999999999998 D))
Since variables C and E are the only two variables not present in 100% of the
solutions analyzed, they will be eliminated in RUN 3 along with the least-used
operator cosa. Also notice that variables C and E were not actually present in
function f and were successfully filtered out.
RUN 2 - Using Leaf Building Blocks
Solution
(* ADF1
(SQRTA (+ (SQRTA (AMINUS (SQRTA (POR B
(SQRTA
(AMINUS
(*
(* B
(POR 0.93000000000000005
(AMINUS (+ ADF1 ADF1) ADF1)))
(POR
(AMINUS
(POR
(SQRTA
(AMINUS
(* 0.91000000000000003
(POR A
(EXPA
(AMINUS
(POR ADF2
(POR (SQRTA ADF2)
(+
(DIVA
0.80000000000000004
B)
(SQRTA
(AMINUS
(SQRTA
(AMINUS B ADF2))
ADF2)))))
ADF2))))
ADF2))
(LOGA
(DIVA 0.80000000000000004
B)))
ADF1)
ADF2))
ADF2))))
ADF2))
0.029999999999999999)))
Average Absolute Manhattan Distance Error
Training: 0.00886
Testing: 0.01056
In RUN 2 there is a 19.6% increase in accuracy in training
and a 20.7% improvement in testing over RUN 1.
Where:
ADF1 = (* B A)
ADF2 = (AMINUS 0.41999999999999998 D)
ADF3 = (EXPA (AMINUS 0.41999999999999998 D))
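In Lisp terms, the leaf building blocks discovered in RUN 1 amount to the following ADF-style helpers; this is a sketch, written as ordinary functions of the variables they reference and reusing the protected operators sketched earlier (the constant 0.42 is printed as 0.41999999999999998 in the raw output):

```lisp
;; The three leaf building blocks from RUN 1, expressed as ADF-style helpers (sketch).
(defun adf1 (a b) (* b a))                 ; (* B A)
(defun adf2 (d)   (aminus 0.42 d))         ; (AMINUS 0.42 D)
(defun adf3 (d)   (expa (aminus 0.42 d)))  ; (EXPA (AMINUS 0.42 D))
```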
RUN 3 - Using Leaf Building Blocks and
Reduced Operator and Variable Sets
Solution
(DIVA ADF1
(EXPA (* (LOGA ADF3)
(AMINUS (+ (DIVA (POR 0.23999999999999999 A)
(POR B
(LOGA
(POR ADF1
(POR
(POR (EXPA A)
(EXPA
(LOGA
(POR (DIVA ADF3 B)
(AMINUS ADF3 B)))))
(+ (EXPA 0.95000000000000007)
A))))))
ADF2)
ADF1))))
Where:
ADF1 = (* B A)
ADF2 = (AMINUS 0.41999999999999998 D)
ADF3 = (EXPA (AMINUS 0.41999999999999998 D))
Variables C and E and operator cosa were removed from the variable and operator sets.
Average Absolute Manhattan Distance Error
Training: 0.00825
Testing: 0.00946
In RUN 3 there is a 7.0% increase in accuracy in training and a 10.4% improvement
in testing over RUN 2, and a 25.1% increase in accuracy in training and a 29.0%
improvement in testing over RUN 1.
5. Summary and Conclusion
The WOLS system is a tool whose objective is to find good approximations
for an unknown function f, and to analyze the decomposition of good
approximations with the goal of identifying building blocks.
Our experiments demonstrated that significant improvements in the predictive
accuracy of the function approximations obtained can be achieved by
employing building block analysis and constructive induction. However,
more experiments are needed to demonstrate the usefulness of the WOLS system.
The symbolic regression/genetic programming front-end (SRGP) did quite
well for a number of regression system benchmarks. This interesting
observation has to be investigated in more detail in the future.
The WOLS system was written in GCL Common LISP Version 2.2. All
experiments were run on an Intel 266 MHz Pentium II with 96 MB RAM.
Future Work
Improve the WOLS system’s performance on classification problems.
Conduct more experiments that use the WOLS system as a secondary
learning system, whose goal is to search for good building blocks
useful for improving the attribute space for other inductive learning
methods such as C4.5.
Construct a better user interface, such as a graphical user interface.
Although the empirical results obtained from the large number of
experiments involving regression were encouraging, more work still
needs to be done in this area as well.
Experiment 3 - Simple Linear Regression
In this experiment the WOLS system, NLREG, and Matlab were all used to obtain an approximation
function for a simple linear regression problem.
NLREG, like most conventional regression analysis packages, is only capable of finding the numeric
coefficients for a function whose form (i.e. linear, quadratic, or polynomial) has been prespecified by
the user. A poor choice of the function's form by the user will in most cases lead to a very poor
solution that is not truly representative of the program's ability to solve the problem. In order
to avoid this problem, for this experiment both the data set and the actual functional form, from one of
the example problems provided with this software, were used.
In Matlab there are basically two different approaches available to perform regression analysis. The
quickest method, which was used in this experiment, is to use the command polyfit(x, y, n), which
instructs Matlab to fit an nth order polynomial to the data. The second method is similar to NLREG
and requires the functional form to be specified by the user.
training data set of 30 test cases; testing data set of 15 test cases
The following parameters were used for the WOLS system:
Population Size: 500
Generations: 200
function set: +, *, loga(x) := |log(x)|, cosa(x) := |cos(x)|, sqrta(x) := |sqrt(x)|, por(a,b) := a+b-a*b,
diva(a,b) := |a/b|, aminus(a,b) := |a-b|, expa(x) := e^min(x, 300), sin100a(x) := |sin(100*x)|
variable set: a
NLREG Solution
1: Title "Linear equation: y = p0 + p1*x";
2: Variables x,y;
// Two variables: x and y
3: Parameters p0,p1;
// Two parameters to be estimated: p0 and p1
4: Function y = p0 + p1*x;
// Model of the function
Stopped due to: Both parameter and relative function convergence.
Proportion of variance explained (R^2) = 0.9708 (97.08%)
Adjusted coefficient of multiple determination (Ra^2) = 0.9698 (96.98%)
---- Descriptive Statistics for Variables ----

Variable   Minimum value   Maximum value   Mean value   Standard dev.
x          1               8               4.538095     2.120468
y          6.3574          22.5456         14.40188     4.390238

---- Calculated Parameter Values ----

Parameter   Initial guess   Final estimate   Standard error   t       Prob(t)
p0          1               5.14417276       0.3336883        15.42   0.00001
p1          1               2.0399976        0.06681681       30.53   0.00001

---- Analysis of Variance ----

Source       DF   Sum of Squares   Mean Square   F value   Prob(F)
Regression   1    542.6514         542.6514      932.15    0.00001
Error        28   16.30013         0.5821475
Total        29   558.9515

y = 5.14417276 + 2.0399976x

The average absolute Manhattan distance error
Training: 0.6361
Testing: 0.7438
Matlab Solution
1st order polynomial (n = 1)
y = 2.0400x + 5.1442
Training: 0.6361
Testing: 0.7438

5th order polynomial (n = 5)
y = 0.0070x^5 - 0.1580x^4 + 1.3365x^3 - 5.2375x^2 + 11.5018x - 1.0632
Training: 0.5838
Testing: 0.8034

10th order polynomial (n = 10)
y = 0.0004x^10 - 0.0187x^9 + 0.3439x^8 - 3.5856x^7 + 23.2763x^6 - 97.5702x^5 + 265.0925x^4 - 456.1580x^3 + 469.7099x^2 - 254.6558x + 59.9535
Training: 0.4700
Testing: 0.8669
In viewing the results above, one of the first things noticed is that as the
order of the polynomial increases, the training performance improves, but
the testing performance degrades. This is because the higher-order
polynomials tend to fit the training data too closely, which in turn
causes poorer performance when the testing data set values are applied.
WOLS Solution
(+ (POR (+ X X) 0.080000000000000002)
(+ (+ (DIVA 0.95000000000000007 0.38)
(+ (DIVA (LOGA (* (POR (EXPA (POR 0.94000000000000006
(LOGA
(COSA
(+
(DIVA
(+
(EXPA
(DIVA
(+
(EXPA
(+ 0.68000000000000005
0.13))
0.51000000000000001)
0.38))
0.51000000000000001)
0.38)
X)))))
0.42999999999999999)
(POR X
(LOGA (SIN100A
(LOGA
(*
(POR 0.98999999999999999
0.42999999999999999)
(POR X
0.040000000000000001))))))))
0.38)
(COSA (LOGA (* (POR (EXPA (POR 0.46000000000000002 1.0))
0.46000000000000002)
(POR (SIN100A (COSA X))
(+ (DIVA
(DIVA 0.83000000000000007
0.17999999999999999)
0.38)
X)))))))
(SIN100A (* X 0.26000000000000001))))
Average Absolute Manhattan Distance
Training: 0.3788
Testing: 0.6906
Runtime-wise, the WOLS system cannot compete with
NLREG or Matlab, each of which found a solution in
around 1 second, while the WOLS system spent nearly 4
hours generating its solution.
The WOLS system produced a solution that had 40%
better training performance and 7% better testing
performance than both the NLREG and Matlab solutions.
This solution was obtained without using building block
analysis and constructive induction.