Transcript Jim_Slides

A Survey of Evolutionary Algorithms for
Data Mining and Knowledge Discovery
Alex Freitas (2001)
A look at Genetic Algorithms (GA) and Genetic Programming (GP),
also from the point of view of the user.
An overview of Data Mining and Knowledge Discovery
The three desirable properties of KD:
1) Accuracy. We would like a high predictive accuracy rate
for our data mining task of classification.
2) Comprehensibility. The user should not be presented with a
“black box” predictor that spouts out a number of IF-THEN rules
and says “this is the output.” Prediction rules should be
comprehensible.
3) Interestingness. In many cases the user is presented with an
ensemble of discovered rules, not all of which will be
interesting in practice.
The overview:

Show the user some schematics:

Data Sets #1 ... #6
        |
        v
Data Integration ----> Pre-Processing ----> Data Mining ----> Post-Processing ----> Output Knowledge
Motivation

“We believe that ideally a combination of subjective and
objective approaches should be used to try to solve the
very hard problem of returning interesting knowledge to
the user.”
Objective approaches: clearly, the algorithms and
computer processing give us high-level knowledge.
Subjective approaches: the user helps to define and re-define
those formulas or IF-THEN rules that are of interest to the
knowledge discovery process.
Data Mining Task: Classification

Many of the KD procedures are to predict the value (the class) of a
user-specified goal attribute based on the predicting attributes.
E.g., IF (Unpaid_Loan = “No”) AND (Overdrafts = “Yes”)
THEN (Credit = “Bad”)
For a comprehensive discussion of how to measure the predictive
accuracy of classification rules we are referred to Hand, D.J.,
Construction and Assessment of Classification Rules [34].
Tom Mitchell’s book also has good information.
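A minimal sketch (my illustration, not from the paper) of how such a prediction rule classifies one record; the attribute names follow the example above:

```python
# Applying the example IF-THEN rule to a record held as a dict.

def classify(record):
    """Return the predicted class for one record, or None if the rule
    does not fire."""
    if record["Unpaid_Loan"] == "No" and record["Overdrafts"] == "Yes":
        return "Bad"    # THEN (Credit = "Bad")
    return None         # no rule matched; a default class could be used

print(classify({"Unpaid_Loan": "No", "Overdrafts": "Yes"}))  # -> Bad
```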
Data Mining: Other Tasks

Dependence Modelling: an extension or
generalization of classification.
Clustering: discovering groups; unsupervised
learning.
Discovery of Association Rules: more than one
item in the consequent is possible; the
classification task may be asymmetric with
respect to the predicting attributes and the
consequent attribute.
The Knowledge Discovery Process

Data integration
Data cleansing
Discretization: transforming a “continuous” attribute into a
discrete one, for example “low”, “medium”, “high” (see the
sketch after this list).
Attribute Selection: selecting a set of attributes relevant for
classification.
The motivation for attribute selection may be obvious: it has
been found that irrelevant attributes can somehow “confuse”
the data mining algorithm, leading to the discovery of
inaccurate or useless knowledge. (A wrapper method can
help select attributes.)
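A minimal sketch of discretization, assuming simple equal-width bins (the survey does not prescribe a particular method; this is just one way):

```python
# Map a continuous attribute onto the labels "low", "medium", "high"
# using three equal-width bins between lo and hi.

def discretize(value, lo, hi):
    width = (hi - lo) / 3
    if value < lo + width:
        return "low"
    if value < lo + 2 * width:
        return "medium"
    return "high"

incomes = [12_000, 45_000, 90_000]
print([discretize(x, 0, 100_000) for x in incomes])  # ['low', 'medium', 'high']
```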
Discovered Knowledge Postprocessing

Two main motivations:
 First, when the discovered rule set is large, we want to
simplify it. (The user may or may not help at this stage.)
Some other techniques will be addressed.
 Second, we often want to extract a subset of interesting
rules from among all the discovered ones. We will look at
some objective methods later on (GA), but subjective
methods involving a user/data-miner collaboration may
also be important.
Genetic Algorithms (GA) for Rule Discovery

Michigan approach: the population consists of individuals (“chromosomes”)
where each individual encodes a single prediction rule.
Pittsburgh approach: each individual encodes a set of prediction rules.
Pluses and minuses: the Pittsburgh approach directly takes into account
rule interaction when computing the fitness function of an individual, but
it leads to syntactically longer individuals. In the Michigan
approach the individuals are simpler and syntactically shorter, which
simplifies the design of genetic operators; however, rule interactions in
the Michigan approach are not taken into account. (A sketch contrasting
the two encodings follows.)
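An illustrative sketch, using a rule structure of my own invention (the paper does not fix a data structure), of how the two encodings differ:

```python
# Two toy prediction rules (hypothetical attribute names).
rule_1 = {"IF": [("Unpaid_Loan", "No"), ("Overdrafts", "Yes")], "THEN": "Bad"}
rule_2 = {"IF": [("Unpaid_Loan", "No"), ("Overdrafts", "No")], "THEN": "Good"}

# Michigan: each GA individual encodes ONE rule; the population as a
# whole forms the rule set, so rule interaction lies outside the fitness.
michigan_population = [rule_1, rule_2]

# Pittsburgh: each GA individual encodes a WHOLE rule set, so one
# individual's fitness can account for interactions among its rules.
pittsburgh_population = [[rule_1, rule_2], [rule_2]]
```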
Take the rule: IF cond_1 AND cond_2 AND ... AND cond_n THEN class = c_i

Representation of the rule antecedent (the IF part)
Representation of the rule consequent (the THEN part)
The Rule Antecedent (Using GA)

Often there is a conjunction of conditions.
Usually a binary encoding is used.
A given attribute can take on k discrete values. Its encoding can
then consist of k bits (“on” or “off”). This allows for internal disjunctions:
0 0 1 1 1 0 1 1 0 0 ......... 0
All of an attribute’s bits can be turned into “1”s in order to “turn off”
the condition: every value then satisfies it, effectively removing it
from the rule. (A sketch of this encoding follows.)
Non-binary encoding is possible, but variable-length individuals may
arise, and crossover may have to be modified to cope with
variable-length individuals.
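A minimal sketch of the k-bits-per-attribute encoding; the attribute and its values are hypothetical:

```python
# "Marital_Status" takes k = 4 discrete values; a 1-bit means that value
# satisfies the condition (internal disjunction).

VALUES = ["single", "married", "divorced", "widowed"]

def condition_matches(bits, value):
    """True if the attribute value is among those switched on."""
    return bits[VALUES.index(value)] == 1

cond = [1, 0, 1, 0]       # IF Marital_Status IN {single, divorced}
print(condition_matches(cond, "single"))   # True
print(condition_matches(cond, "married"))  # False

all_on = [1, 1, 1, 1]     # every value matches: the condition is "turned off"
print(all(condition_matches(all_on, v) for v in VALUES))  # True
```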
Representing the Rule Consequent (Predicted Class)

Three ways of representing the predicted class (the THEN part):
 First, encode it in the genome of an individual (possibly
making it subject to evolution).
 Second, associate all individuals of the population with the
same predicted class, which is never modified during the
running of the algorithm.
 Third, choose the predicted class most suitable for a rule (in a
deterministic way) as soon as the corresponding rule
antecedent is formed (e.g., the class that maximizes fitness).

The author believes the third possibility to be the most
sound overall.
Genetic Operators for Rule Discovery

Selection: each individual represents a single rule (Michigan
approach). An approach called REGAL can be used: individuals
to be “mated” are “elected” by training examples (using a fitness
operator or a probabilistic model).
Generalizing/specializing crossover: the basic idea of this special
kind of crossover is to generalize or specialize a given rule,
depending on whether it is currently overfitting or underfitting
the data. (A sketch follows this list.)
Generalizing/specializing-condition operator: the g/s of a rule
can also be done in a way independent of crossover, by tweaking
the antecedent conditions, especially if continuous conditions exist.
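For the binary encoding shown earlier, one simple realization of generalizing/specializing crossover is bitwise OR / bitwise AND; treat this as an illustrative sketch rather than the survey’s exact operator:

```python
# With the k-bit encoding, turning more bits on makes a condition MORE
# general, so OR generalizes and AND specializes.

def generalizing_crossover(parent_a, parent_b):
    """Child satisfies whatever either parent satisfied (bitwise OR)."""
    return [a | b for a, b in zip(parent_a, parent_b)]

def specializing_crossover(parent_a, parent_b):
    """Child satisfies only what both parents satisfied (bitwise AND)."""
    return [a & b for a, b in zip(parent_a, parent_b)]

p1 = [1, 0, 1, 0]
p2 = [0, 0, 1, 1]
print(generalizing_crossover(p1, p2))  # [1, 0, 1, 1]  (more general rule)
print(specializing_crossover(p1, p2))  # [0, 0, 1, 0]  (more specific rule)
```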
Fitness Function for Rule Discovery

Remember: accuracy, comprehensibility and interestingness.
How do we get these 3 rule-quality criteria incorporated
into a fitness function?
Let a rule be IF A THEN C, then calculate the
confidence factor CF = TP / (TP + FP) using the confusion matrix:

                          Actual Class
                          C       not C
Predicted Class   C       TP      FP
                  not C   FN      TN

COMP = completeness measure = TP / (TP + FN)
Fitness = CF * COMP = TP^2 / [(TP + FP)(TP + FN)]
Fitness = w1 * (CF * COMP) + w2 * (Simp), where Simp is a
measure of rule simplicity (0 < Simp < 1) and w1 and w2 are weights.
(A small numeric sketch follows.)
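A numeric sketch of the fitness computation above; the counts and weights are made up:

```python
TP, FP, FN, TN = 40, 10, 20, 30

cf = TP / (TP + FP)       # confidence factor = 0.8
comp = TP / (TP + FN)     # completeness      ~ 0.667
fitness = cf * comp       # CF * COMP         ~ 0.533

w1, w2, simp = 0.8, 0.2, 0.9   # assumed weights and simplicity score
weighted = w1 * (cf * comp) + w2 * simp   # ~ 0.607
print(round(fitness, 3), round(weighted, 3))
```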
Genetic Algorithms (GAs) for Pre-processing

“The use of GAs for attribute selection seems natural.
The main reason is that the major source of difficulty in
attribute selection is attribute interaction, and one of
the strengths of GAs is that they usually cope well with
attribute interactions.”
We can use a very simple genetic encoding where each
individual represents a candidate attribute subset: a
string of m binary genes, where m is the number of
attributes and each gene can take on a “0” or “1”
(a sketch follows).
Then follow the usual crossover and mutation procedures.
A GA can also be used with a nearest-neighbour algorithm
(NNA) to “tweak” it for better results.
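A sketch of the m-gene binary encoding with a wrapper-style fitness; the accuracy estimator is a stand-in for actually training, say, a nearest-neighbour classifier on the selected attributes:

```python
import random

M = 8                                     # number of attributes

def random_individual():
    return [random.randint(0, 1) for _ in range(M)]  # gene 1 = keep attribute

def estimate_accuracy(selected_attributes):
    """Placeholder for the wrapper step: in practice, train and
    cross-validate the classifier on just these attributes."""
    return random.random()

def fitness(individual):
    selected = [i for i, gene in enumerate(individual) if gene == 1]
    return estimate_accuracy(selected)

def crossover(a, b):
    point = random.randrange(1, M)        # one-point crossover
    return a[:point] + b[point:]

def mutate(individual, rate=0.1):
    return [1 - g if random.random() < rate else g for g in individual]
```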
Genetic Algorithms (GAs) for Post-processing

GAs can be used in the post-processing step when an
ensemble of classifiers (e.g. rule sets) has been created.
Generating an ensemble of classifiers is useful since it has been
shown that in several cases an ensemble of classifiers has better
predictive accuracy than a single classifier.
A fitness function may be created using weights for each classifier
in the ensemble. (A user may help.) There are also GA schemes to
optimize the weights of the classifiers. (A weighted-vote sketch follows.)
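A sketch of a weighted-vote ensemble whose per-classifier weights a GA could evolve (an individual would simply be the weight list); the classifiers here are stubs:

```python
from collections import defaultdict

def weighted_vote(classifiers, weights, record):
    """Each classifier votes for a class; votes are weighted."""
    votes = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        votes[clf(record)] += w
    return max(votes, key=votes.get)

# Three toy classifiers standing in for discovered rule sets.
clfs = [lambda r: "Good", lambda r: "Bad", lambda r: "Good"]
weights = [0.2, 0.9, 0.3]                 # one candidate GA individual
print(weighted_vote(clfs, weights, {}))   # "Bad" (0.9 outweighs 0.2 + 0.3)
```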
Genetic Programming (GP) for Rule Discovery

Individual representation: attributes are often numeric.
Functions such as +, -, *, /, <, >, =, AND, OR, ... are used,
as well as input arguments (the attributes).
An individual is often represented as a tree diagram.
Once we apply the functions in the internal nodes of a
GP individual to the values of the attributes in the leaf
nodes of the individual, the system computes a
numerical value that is output at the root of the tree,
as in the sketch below.
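A sketch of evaluating such a GP individual, assuming a simple tuple-based tree encoding of my own (a node is an (operator, left, right) tuple, an attribute name, or a constant):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       ">": operator.gt, "<": operator.lt,
       "AND": lambda a, b: a and b, "OR": lambda a, b: a or b}

def evaluate(node, record):
    if isinstance(node, tuple):      # internal node: apply its function
        op, left, right = node
        return OPS[op](evaluate(left, record), evaluate(right, record))
    if isinstance(node, str):        # leaf: look up the attribute's value
        return record[node]
    return node                      # leaf: constant

# (Income + Savings) > 50000 AND Age < 30 -- output at the root
tree = ("AND", (">", ("+", "Income", "Savings"), 50000), ("<", "Age", 30))
print(evaluate(tree, {"Income": 40000, "Savings": 20000, "Age": 25}))  # True
```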
Discovering comprehensible rules using GP: these
rules could be similar to GA rules, but there are some
other measures, such as:
Simplicity = (MaxNodes - 0.5*NumNodes - 0.5) / (MaxNodes - 1)

This could lead to the discovery of short, simple rules, which
may be required in the medical field.
Genetic Programming (GP) for Data Pre-Processing

A major problem in attribute construction is that the search space
tends to be huge. If the search can be accomplished, especially
with the relational operator “>”, then many good candidate
operations may evolve.
There are also GA/GP hybrid methods for pre-processing.
Conclusions:
In his chapter on evolutionary algorithms Alex Freitas has shown
us where the emphasis should be laid in knowledge learning. His
goal of “transparency” of pre-processing, rule learning and
post-processing methods should be taken into account. A user would
like to know where the classifier or the IF-THEN rule came from,
and whether he can help influence the process to get a more
intelligent result.
High Classification Accuracy Does Not Imply
Effective Genetic Search (Tim Kovacs; Manfred Kerber)

The authors are publishing experimental results which they
believe will help clear up some of the limitations of GA methods. In
particular they examine XCS, a popular classifier system
which uses genetic algorithms.
The paper by K+K refers us to work by Stewart Wilson (1995)
entitled “Classifier Fitness Based on Accuracy”. In XCS each
classifier maintains a prediction of expected payoff, but the
classifier’s fitness is given by a measure of the prediction’s
accuracy. Wilson’s example shows some individuals in the
population P:

Table of population P (p = prediction, є = prediction error,
F = fitness parameter):

Classifier    p     є     F
#011:01      43   .01    99
11##:00      32   .13     9
#0##:11      14   .05     5
XCS and XCS-NGA are compared

The authors’ XCS-NGA (XCS with no GA) is XCS modified so that
genetic search does not operate on the initial rule population. In all
other respects XCS-NGA functions as XCS.
XCS classifies data points by a vote among the rules which match
them, with each vote weighted by the rule’s fitness. In this way a
data point matched by both a low-accuracy rule and a high-accuracy
rule is given the classification of the high-accuracy rule. (A voting
sketch follows this list.)
In XCS, the rules (region shapes and sizes) are adapted by the GA.
XCS-NGA lacks a GA, so its region shapes and sizes do not
change.
XCS-NGA relies on there being enough rules to adequately solve
the rule improvement problem (rule discovery) by random chance.
Roughly speaking, XCS-NGA’s approach is to generate many
random rules and ignore those which happen to have low
accuracy.
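A sketch of fitness-weighted voting among matching rules, reusing the classifiers from Wilson’s example; the rule format and the simple fitness-only weighting follow the description above:

```python
def matches(condition, inp):
    """A ternary condition matches if every non-# bit equals the input."""
    return all(c in ("#", b) for c, b in zip(condition, inp))

def classify(rules, inp):
    votes = {}
    for condition, action, fitness in rules:
        if matches(condition, inp):
            votes[action] = votes.get(action, 0.0) + fitness
    return max(votes, key=votes.get)

rules = [("#011", "01", 99.0),   # high-fitness rule
         ("#0##", "11", 5.0)]    # low-fitness rule that also matches
print(classify(rules, "1011"))   # "01": the high-accuracy rule wins the vote
```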
High Accuracy Implications of XCS-NGA

The authors have obtained very good results with XCS-NGA. They
agree that this alternative has its limitations, but they want to
address the publication of classification accuracy as a goal unto
itself.
Many published papers use the 6-bit multiplexer function only.
(Strings are of length L = k + 2^k, so that if k = 2 bits then L = 6. For a
70-bit multiplexer we have k = 6 bits and L = 70.) A sketch of the
multiplexer function follows.
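A small sketch of the k-bit multiplexer test function (my implementation of the standard definition): the first k bits form an address that selects one of the remaining 2^k data bits:

```python
def multiplexer(bits, k):
    address = int("".join(bits[:k]), 2)   # first k bits -> index
    return bits[k + address]              # output the addressed data bit

# 6-bit multiplexer (k = 2, L = 6): address "10" = 2 selects data bit 2.
print(multiplexer(list("101101"), 2))     # data bits are "1101" -> "0"
```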
Because of its random nature, XCS-NGA does better with
more initial rules, and can then give excellent results.
Their claim is that only those studies which assert effective
genetic search based on results with small functions
are demonstrated invalid by their results with XCS-NGA.
A More Powerful Metric for Evaluating Genetic Search

This metric is symbolized by %[O], where %[O] = the proportion of
the optimal solution present in the rule population on a given time step.
This metric has been shown to have greater discriminatory power
than the performance metric introduced by Wilson.
Wilson’s “performance” is defined as a moving average of the
proportion of the last n trials in which the system has responded
with the correct action (n is traditionally equal to 50).
There are often 400 rules to start with. XCS-NGA does better
with more rules.
%[O] is better able to discern the progress of genetic search than
the performance metric.
The new metric extends the utility of small tests.
%[O] has disadvantages, including the need to compute the optimal
solution in advance, as well as the computational expense of
evaluating it. (A small sketch of the metric follows.)
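A minimal sketch of computing %[O] as defined above; the rule strings are made-up stand-ins for a known optimal set [O]:

```python
def percent_optimal(population, optimal_set):
    """Fraction of the optimal rule set present in the population."""
    present = sum(1 for rule in optimal_set if rule in population)
    return present / len(optimal_set)

optimal = {"#011:01", "11##:00", "#0##:11", "01##:10"}
population = {"#011:01", "11##:00", "1111:11"}
print(percent_optimal(population, optimal))   # 0.5 (2 of 4 optimal rules)
```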
Finally, replacing the GA with an iterative random rule generator
would provide a baseline against which to compare genetic search.