Transcript Document

CpSc 810: Machine Learning
Genetic Algorithm
Copyright Notice
Most slides in this presentation are adapted from the textbook's slides and various other sources. The copyright belongs to the original authors. Thanks!
Overview of Genetic Algorithm (GA)
GA is a learning method motivated by
analogy to biological evolution.
GAs search the hypothesis space by
generating successor hypotheses which
repeatedly mutate and recombine parts of
the best currently known hypotheses.
In Genetic Programming (GP), entire computer programs are evolved to meet certain fitness criteria.
Biological Evolution
Lamarck and others:
Species “transmute” over time
Darwin and Wallace:
Consistent, heritable variation among individuals
in population
Natural selection of the fittest
Mendel and genetics:
A mechanism for inheriting traits
Genotype->phenotype mapping
Genetic Algorithms
The algorithm operates by iteratively
updating a pool of hypotheses, called the
population.
On each iteration, all members of the
population are evaluated according to the
fitness function.
A new population is then generated by probabilistically selecting the most fit individuals from the current population.
Some of these selected individuals are carried forward into the next-generation population intact.
Others are used as the basis for creating new
offspring individuals by applying genetic
operations.
GA
GA(Fitness, Threshold, p, r, m)
Initialize population: generate p hypotheses at random -> P.
Evaluate: for each h in P, compute Fitness(h).
While max_h Fitness(h) < Threshold do:
Select: probabilistically select (1-r)p members of P and add them to the new generation Pnew, where

$$\Pr(h_i) = \frac{Fitness(h_i)}{\sum_{j=1}^{|P|} Fitness(h_j)}$$

Crossover: probabilistically select (r*p)/2 pairs of hypotheses from P according to the same Pr(hi). For each pair <h1, h2>, produce two offspring by applying the crossover operator. Add all offspring to Pnew.
Mutate: choose m% of the members of Pnew with uniform probability. For each, invert one randomly selected bit in its representation.
Update: P <- Pnew.
Evaluate: for each h in P, compute Fitness(h).
Return the hypothesis from P that has the highest fitness.
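To make the loop concrete, here is a minimal Python sketch of the algorithm above, assuming hypotheses are fixed-length bit strings stored as lists. The `bits` parameter and the OneMax fitness in the usage line are illustrative assumptions, not part of the original algorithm.

```python
import random

def ga(fitness, threshold, p, r, m, bits=10):
    """Sketch of GA(Fitness, Threshold, p, r, m) over bit-string hypotheses.
    p: population size; r: fraction replaced by crossover; m: mutation %."""
    population = [[random.randint(0, 1) for _ in range(bits)] for _ in range(p)]

    def select():
        # Fitness-proportionate (roulette wheel) selection;
        # assumes at least one hypothesis has nonzero fitness.
        weights = [fitness(h) for h in population]
        return random.choices(population, weights=weights, k=1)[0]

    while max(fitness(h) for h in population) < threshold:
        # Carry (1 - r) * p probabilistically selected members forward intact.
        new_pop = [select()[:] for _ in range(round((1 - r) * p))]
        # Create r * p offspring via single-point crossover.
        for _ in range(round(r * p / 2)):
            h1, h2 = select(), select()
            point = random.randrange(1, bits)
            new_pop.append(h1[:point] + h2[point:])
            new_pop.append(h2[:point] + h1[point:])
        # Mutate m percent of the new population: flip one random bit each.
        for h in random.sample(new_pop, round(m / 100 * len(new_pop))):
            h[random.randrange(bits)] ^= 1
        population = new_pop
    return max(population, key=fitness)

# Illustrative usage (assumed OneMax task: fitness = number of 1 bits):
best = ga(fitness=sum, threshold=10, p=50, r=0.6, m=5, bits=10)
```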
Representing Hypotheses
In GAs, hypotheses are often represented by bit
strings so that they can be easily manipulated
by genetic operators such as mutation and
crossover.
Examples:

Represent
(Outlook = Overcast ∨ Rain) ∧ (Wind = Strong)
by the bit string

  Outlook  Wind
  011      10

Represent
IF Wind = Strong THEN PlayTennis = yes
by the bit string

  Outlook  Wind  PlayTennis
  111      10    10
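A minimal Python sketch of this encoding; the attribute value orderings below are assumptions chosen to match the bit strings shown:

```python
# One bit per attribute value; a 1 means that value is acceptable.
OUTLOOK = ["Sunny", "Overcast", "Rain"]
WIND = ["Strong", "Weak"]

def encode(outlooks, winds):
    """Encode an attribute constraint as one concatenated bit string."""
    return ([1 if v in outlooks else 0 for v in OUTLOOK] +
            [1 if v in winds else 0 for v in WIND])

def satisfies(bits, outlook, wind):
    """True when the bit for the example's value is set for every attribute."""
    return (bits[OUTLOOK.index(outlook)] == 1 and
            bits[len(OUTLOOK) + WIND.index(wind)] == 1)

h = encode({"Overcast", "Rain"}, {"Strong"})  # -> [0,1,1, 1,0], i.e. 011 10
print(satisfies(h, "Rain", "Strong"))         # -> True
```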
Genetic Operators
Crossover techniques produce two new offspring from two parent strings by copying selected bits from each parent. The choice of which parent contributes the bit for position i is determined by an additional string called the crossover mask.
Single-point Crossover: Mask example: 11111000000
Two-point Crossover. Mask example: 00111110000
Uniform Crossover. Mask example: 10011010011
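A short sketch of mask-driven crossover, with one mask generator per technique; offspring 1 takes its bit from parent 1 wherever the mask is 1, matching the examples above:

```python
import random

def crossover(h1, h2, mask):
    """Offspring 1 takes bit i from h1 where mask[i] is 1, else from h2;
    offspring 2 makes the complementary choice at every position."""
    o1 = [a if m else b for a, b, m in zip(h1, h2, mask)]
    o2 = [b if m else a for a, b, m in zip(h1, h2, mask)]
    return o1, o2

def single_point_mask(n):
    point = random.randrange(1, n)                   # e.g. 11111000000
    return [1] * point + [0] * (n - point)

def two_point_mask(n):
    i, j = sorted(random.sample(range(n + 1), 2))    # e.g. 00111110000
    return [0] * i + [1] * (j - i) + [0] * (n - j)

def uniform_mask(n):
    return [random.randint(0, 1) for _ in range(n)]  # e.g. 10011010011
```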
Genetic Operators
Mutation techniques produce small random changes to the bit string.
Point Mutation
Select Most Fit Hypothesis
A simple measure for modeling the probability that a
hypothesis will be selected is given by the fitness
proportionate selection (or roulette wheel selection):
$$\Pr(h_i) = \frac{Fitness(h_i)}{\sum_{j=1}^{p} Fitness(h_j)}$$
This simple measure can lead to crowding
Tournament Selection
Pick h1, h2 at random with uniform probability
With probability p, select the more fit
Rank Selection
Sort all hypotheses by fitness
Probability of selection is proportional to rank
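Minimal sketches of both alternatives; the tournament probability p = 0.75 is an assumed illustrative value:

```python
import random

def tournament_select(population, fitness, p=0.75):
    """Pick two hypotheses at random; with probability p return the fitter."""
    h1, h2 = random.sample(population, 2)
    fitter, weaker = (h1, h2) if fitness(h1) >= fitness(h2) else (h2, h1)
    return fitter if random.random() < p else weaker

def rank_select(population, fitness):
    """Sort by fitness; selection probability is proportional to rank."""
    ranked = sorted(population, key=fitness)   # rank 1 = least fit
    ranks = range(1, len(ranked) + 1)
    return random.choices(ranked, weights=ranks, k=1)[0]
```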
In classification tasks, the Fitness function typically
has a component that scores the classification
accuracy over a set of provided training examples.
Other criteria can be added (e.g., complexity or
generality of the rule)
GABIL (DeJong et al. 1993)
Learns disjunctive sets of propositional rules; competitive with C4.5.
Fitness: $Fitness(h) = (correct(h))^2$, where correct(h) is the percent of training examples classified correctly.
Representation (see the sketch after this slide):

IF a1 = T ∧ a2 = F THEN c = T; IF a2 = T THEN c = F

is represented by

  a1  a2  c    a1  a2  c
  10  01  1    11  10  0
Genetic operators:
Standard mutation operators
Extended two point crossover
Want variable-length rule sets.
Want only well-formed bitstring hypotheses.
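A sketch of how such a rule-set bitstring can be interpreted and scored; the tuple encoding, the first-match classification, and the default class are assumptions for illustration:

```python
def rule_matches(rule, example):
    """rule: per-attribute bit tuples, e.g. [(1, 0), (0, 1)] for a1=T ^ a2=F;
    example: per-attribute value indices (0 = T, 1 = F). A rule matches when
    the bit for the example's value is set for every attribute."""
    return all(bits[value] == 1 for bits, value in zip(rule, example))

def classify(ruleset, example, default=0):
    """Return the class of the first matching rule (default if none fires)."""
    for rule, cls in ruleset:
        if rule_matches(rule, example):
            return cls
    return default

def gabil_fitness(ruleset, examples):
    """Fitness(h) = (correct(h))^2, with correct(h) as a fraction here."""
    correct = sum(classify(ruleset, x) == y for x, y in examples)
    return (correct / len(examples)) ** 2

# The rule set from the slide: IF a1=T ^ a2=F THEN c=T; IF a2=T THEN c=F
ruleset = [([(1, 0), (0, 1)], 1),   # 10 01 1
           ([(1, 1), (1, 0)], 0)]   # 11 10 0
```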
Crossover with Variable-Length Bit-strings
Start with:

h1:  a1  a2  c    a1  a2  c
     10  01  1    11  10  0
h2:  a1  a2  c    a1  a2  c
     01  11  0    10  01  0

1. Choose crossover points for h1, e.g., after bits 1 and 8.
2. Now restrict the crossover points in h2 to those that produce bitstrings with well-defined semantics, e.g., <1,3>, <1,8>, <6,8>.

Let d1 and d2 denote the distances from the leftmost and rightmost of the two crossover points in h1 to the rule boundary immediately to its left. Then the crossover points in h2 must have the same d1 and d2 values.

If we choose <1,3>, the result is:

h3:  a1  a2  c
     11  10  0
h4:  a1  a2  c    a1  a2  c    a1  a2  c
     00  01  1    11  11  0    10  01  0
GABIL Extensions
Add new genetic operators, also applied probabilistically:
1. AddAlternative: generalize the constraint on ai by changing a 0 to a 1.
2. DropCondition: generalize the constraint on ai by changing every 0 to a 1.
Also add new fields to the bitstring to determine whether these operators may be applied to a given hypothesis:

  a1  a2  c    a1  a2  c    AA  DC
  01  11  0    10  01  0    1   0
So now the learning strategy also evolves
GABIL Results
Performance of GABIL comparable to
symbolic rule/tree learning methods C4.5,
ID5R, AQ14
Average performance on a set of 12
synthetic problems:
GABIL without AA and DC operators: 92.1%
GABIL with AA and DC operators: 95.2%
Symbolic learning methods ranged from 91.2% to 96.6%.
Hypothesis Space Search
GA search can move very abruptly (as
compared to Backpropagation, for example),
replacing a parent hypothesis by an offspring
that may be radically different from the parent.
The problem of Crowding: when one individual
is more fit than others, this individual and
closely related ones will take up a large fraction
of the population.
Solutions:
Use tournament or rank selection instead of roulette selection.
Fitness sharing: the measured fitness of an individual is reduced by the presence of other, similar individuals in the population (a minimal sketch follows this list).
Restriction on the kinds of individuals allowed to recombine to form offspring.
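The slide leaves the sharing function unspecified; a common concrete choice (an assumption here, following Goldberg-style sharing) is a triangular kernel over a distance measure such as Hamming distance, with niche radius sigma:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(h, population, fitness, distance=hamming, sigma=3.0):
    """Divide raw fitness by a niche count that grows with the number of
    similar individuals (h is its own distance-0 neighbor, so niche >= 1)."""
    niche = sum(max(0.0, 1.0 - distance(h, g) / sigma) for g in population)
    return fitness(h) / niche
```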
The Schema Theorem [Holland, 75]
Definition: A schema is any string
composed of 0s, 1s and *s where *
means ‘don’t care’.
Example: schema 0*10 represents
strings 0010 and 0110.
Characterize the population by the number of instances representing each possible schema.
Consider Just Selection
$\bar{f}(t)$ = average fitness of the population at time t
$m(s, t)$ = number of instances of schema s in the population at time t
$\hat{u}(s, t)$ = average fitness of the instances of s at time t
$n$ = population size

Probability of selecting h in one selection step:

$$\Pr(h) = \frac{f(h)}{\sum_{i=1}^{n} f(h_i)} = \frac{f(h)}{n\,\bar{f}(t)}$$

Probability of selecting an instance of s in one step:

$$\Pr(h \in s) = \sum_{h \in s \cap p_t} \frac{f(h)}{n\,\bar{f}(t)} = \frac{\hat{u}(s,t)}{n\,\bar{f}(t)}\, m(s,t)$$

Expected number of instances of s after n selections:

$$E[m(s, t+1)] = \frac{\hat{u}(s,t)}{\bar{f}(t)}\, m(s,t)$$
Schema Theorem
The full schema theorem provides a lower bound on the expected frequency of schema s, as follows:

$$E[m(s, t+1)] \ge \frac{\hat{u}(s,t)}{\bar{f}(t)}\, m(s,t)\left(1 - p_c\,\frac{d(s)}{l-1}\right)(1 - p_m)^{o(s)}$$

where
$p_c$ = probability of the single-point crossover operator
$p_m$ = probability of the mutation operator
$l$ = length of the individual bit strings
$o(s)$ = number of defined (non-"*") bits in s
$d(s)$ = distance between the leftmost and rightmost defined bits in s
The Schema Theorem: More fit schemas will tend to
grow in influence, especially schemas containing a
small number of defined bits (i.e., containing a large
number of *s), and especially when these defined
bits are near one another within the bit string.
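As a quick numerical check of the two survival factors, take the earlier schema s = 0*10 with string length l = 4:

$$o(s) = 3, \qquad d(s) = 4 - 1 = 3, \qquad \left(1 - p_c\,\frac{d(s)}{l-1}\right)(1 - p_m)^{o(s)} = (1 - p_c)\,(1 - p_m)^{3}$$

Its defined bits span the whole string, so it is maximally vulnerable to single-point crossover, and each of its three defined bits risks mutation.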
Genetic Programming
Genetic programming is a form of
evolutionary computation in which the
individuals in the evolving population are
computer programs rather than bit strings
Population of programs represented by
trees
Crossover
(Figure omitted: GP crossover swaps randomly chosen subtrees between two parent program trees.)
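A minimal sketch of subtree crossover, representing programs as nested Python lists of the form [op, arg1, arg2, ...]; this representation is an assumption, and GP systems often use s-expressions or typed trees instead:

```python
import copy
import random

def subtrees(tree, path=()):
    """Yield (path, node) for every node; children of a list node sit at
    indices 1 and up (index 0 is the operator symbol)."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return copy.deepcopy(new)
    tree = list(tree)
    tree[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tree

def gp_crossover(parent1, parent2):
    """Swap one randomly chosen subtree of each parent, giving two offspring."""
    p1, s1 = random.choice(list(subtrees(parent1)))
    p2, s2 = random.choice(list(subtrees(parent2)))
    return replace_at(parent1, p1, s2), replace_at(parent2, p2, s1)

# e.g. gp_crossover(['+', ['*', 'x', 'x'], 3], ['-', 'x', 1])
```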
Models of Evolution and Learning I:
Lamarckian Evolution [Late 19th C]
Proposition: experiences of a single organism directly affect the genetic makeup of its offspring.
Assessment: This proposition is wrong:
the genetic makeup of an individual is
unaffected by the lifetime experience of
one’s biological parents.
However: Lamarckian processes can
sometimes improve the effectiveness of
computerized genetic algorithms.
Models of Evolution and
Learning II: Baldwin Effect
Assume
Individual learning has no direct influence on
individual DNA
But ability to learn reduces need to “hard wire”
traits in DNA
Then
Ability of individuals to learn will support more
diverse gene pool
Because learning allows individuals with various “hard
wired” traits to be successful
More diverse gene pool will support faster
evolution of gene pool
Individual learning (indirectly) increases
rate of evolution
Models of Evolution and
Learning II: Baldwin Effect
Plausible example:
New predator appears in environment
Individuals who can learn (to avoid it) will be
selected
Increase in learning individuals will support more
diverse gene pool
Resulting in faster evolution
Possibly resulting in new non-learned traits such
as instinctive fear of predator
Computer Experiments on
Baldwin Effect
Evolve simple neural networks
Some network weights are fixed during the individual's lifetime, others are trainable
Genetic makeup determines which are fixed and
their weight values
Results
With no individual learning, population failed to
improve over time
When individual learning allowed
Early generations: population contained many
individuals with many trainable weights
Later generations: higher fitness, while number of
trainable weights decreased
Summary: Evolutionary programming
Conduct randomized, parallel, hill-climbing
search through H
Approach learning as optimization problem
(optimize fitness)
Nice feature: evaluation of Fitness can be
very indirect
Consider learning rule set for multistep decision
making
No issue of assigning credit/blame to individual
steps
Parallelizing Genetic Algorithms
GAs are naturally suited to parallel implementation. Different approaches have been tried:
Coarse grain: subdivides the population into distinct groups of individuals (demes) and conducts a GA search in each deme. Transfer between demes occurs (though infrequently) by a migration process in which individuals from one deme are copied or transferred to other demes (a minimal sketch follows this list).
Fine grain: one processor is assigned per individual in the population, and recombination takes place among neighboring individuals.
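A minimal coarse-grain (island-model) sketch; `init_deme` and `step` are assumed callbacks, where step(pop) runs one generation of an ordinary GA (such as the loop sketched earlier) and migration uses a simple ring topology:

```python
def island_ga(fitness, init_deme, step, demes=4, generations=100,
              migrate_every=10, migrants=2):
    """Run an independent GA in each deme; every migrate_every generations,
    copy each deme's best individuals into the next deme in a ring."""
    populations = [init_deme() for _ in range(demes)]
    for g in range(1, generations + 1):
        populations = [step(pop) for pop in populations]  # parallelizable
        if g % migrate_every == 0:
            for i in range(demes):
                best = sorted(populations[i], key=fitness,
                              reverse=True)[:migrants]
                dest = populations[(i + 1) % demes]
                dest.sort(key=fitness)                    # weakest first
                dest[:migrants] = [h[:] for h in best]    # replace weakest
    return max((h for pop in populations for h in pop), key=fitness)
```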