CS 478 - Machine Learning
Genetic Algorithms (I)
Darwin’s Origin of Species: Basic Principles (I)
Individuals survive based on their ability to adapt to the pressures
of their environment (i.e., their fitness)
Fitter individuals tend to have more offspring, thus driving the
population as a whole towards favorable traits
During reproduction, the traits found in parents are passed on to
their offspring
In sexual reproduction, the chromosomes of the offspring are a mix
of those of their parents
The traits of offspring are partially inherited from their parents and
partially the result of new genes/traits created during the process of
reproduction
Nature produces individuals with differing traits
Over long periods, variations can accumulate, producing entirely
new species whose traits make them especially suited to particular
ecological niches
Darwin’s Origin of Species: Basic Principles (II)
Evolution is effected via two main genetic mechanisms:
Crossover:
    Take 2 candidate chromosomes
    “Randomly” choose 1 or 2 crossover points
    Swap the respective components to create 2 new chromosomes
Mutation:
    Choose a single offspring
    Randomly change some aspect of it
Intuition
Essentially a pseudo-random walk through
the population with the aim of maximizing
some fitness function
From a starting population:
Crossover ensures exploitation
Mutation ensures exploration
GAs are based on these principles
Natural vs Artificial
Natural                    Artificial
Individual                 Candidate solution
Population                 Set of candidate solutions
Fitness                    Measure of quality of solutions
Chromosome                 Encoding of candidate solutions
Gene                       Part of the encoding of a solution
Crossover and mutation     Search operators
Natural selection          Re-use of good (sub)solutions
Phenotype vs. Genotype
In Genetic Algorithms (GAs), there is a clear distinction
between phenotype (i.e., the actual individual or
solution) and genotype (i.e., the individual's encoding or
chromosome). The GA, as in nature, acts on genotypes
only. Hence, the natural process of growth must be
implemented as a genotype-to-phenotype decoding.
The original formulation of genetic algorithms relied on a
binary encoding of solutions, where chromosomes are
strings of 0s and 1s. Individuals can then be anything so
long as there is a way of encoding/decoding them using
binary strings.
Simple GA
One often distinguishes between two types of genetic algorithms,
based on whether there is a complete or partial replacement of the
population between generations (i.e., whether there is overlap or
not between generations).
When there is complete replacement, the GA is said to be generational,
whilst when replacement is only partial, the GA is said to be steady-state.
If you look carefully at the algorithms below, you will notice that even the
generational GA gives only partial replacement when cloning takes place
(i.e., cloning causes overlap between generations). Moreover, if steady-state
replacement is performed on the whole population (rather than on a proportion
of fittest individuals), then the GA is generational.
Hence, the distinction is more a matter of how reproduction takes
place than a matter of overlap.
Generational GA
Randomly generate a population of chromosomes
While (termination condition not met)
    Decode chromosomes into individuals
    Evaluate fitness of all individuals
    Select fittest individuals
    Generate new population by cloning, crossover and mutation
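A minimal Python sketch of this loop, assuming hypothetical user-supplied `decode` and `fitness` functions, fitness-proportionate selection, 1-point crossover and bit-flip mutation (all discussed later in these slides):

```python
import random

def generational_ga(pop_size, chrom_len, fitness, decode, generations=100,
                    p_cross=0.7, p_mut=0.01):
    # Randomly generate a population of binary chromosomes
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Decode chromosomes into individuals and evaluate their fitness
        scores = [fitness(decode(c)) for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # Select fitter individuals with higher probability
            p1, p2 = random.choices(pop, weights=scores, k=2)
            if random.random() < p_cross:
                cut = random.randint(1, chrom_len - 1)        # 1-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]                                  # cloning
            # Mutation: flip each bit with a small probability
            child = [b ^ 1 if random.random() < p_mut else b for b in child]
            new_pop.append(child)
        pop = new_pop                                          # complete replacement
    best = max(pop, key=lambda c: fitness(decode(c)))
    return decode(best)
```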
Steady-state GA
Randomly generate a population of chromosomes
While (termination condition not met)
    Decode chromosomes into individuals
    Evaluate fitness of all individuals
    Select fittest individuals
    Produce offspring by crossover and mutation
    Replace weakest individuals with offspring
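The steady-state variant differs only in the inner step: offspring replace the weakest members instead of the whole population. A sketch of one such step, under the same assumptions as above:

```python
import random

def steady_state_step(pop, fitness, decode, p_mut=0.01):
    # Evaluate fitness of all individuals
    scores = [fitness(decode(c)) for c in pop]
    # Select fitter individuals as parents (fitness-proportionate)
    p1, p2 = random.choices(pop, weights=scores, k=2)
    # Produce one offspring by 1-point crossover and bit-flip mutation
    cut = random.randint(1, len(p1) - 1)
    child = [b ^ 1 if random.random() < p_mut else b
             for b in p1[:cut] + p2[cut:]]
    # Replace the weakest individual with the offspring
    weakest = min(range(len(pop)), key=lambda i: scores[i])
    pop[weakest] = child
```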
Genetic Encoding / Decoding
We focus on binary encodings of solutions
We first look at single parameters (i.e.,
single gene chromosomes) and then
vectors of parameters (i.e., multi-gene
chromosomes)
Integer Parameters
Let p be the parameter to be encoded. There are three distinct cases to consider:
p takes values from {0, 1, ..., 2^N - 1} for some N
    Then p can be encoded directly by its equivalent binary representation
p takes values from {M, M+1, ..., M + 2^N - 1} for some M and N
    Then (p - M) can be encoded directly by its equivalent binary representation
p takes values from {0, 1, ..., L-1} for some L such that there exists no N for which L = 2^N
    Then there are two possibilities: clipping or scaling
Clipping
Clipping consists of taking N = ⌊log2(L)⌋ + 1 bits and encoding all parameter
values 0 ≤ p ≤ L-2 by their equivalent binary representation, letting all
other N-bit strings serve as encodings of p = L-1.
For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L=6. Then
N = ⌊log2(6)⌋ + 1 = 3.
Here, not only is 101 an (expected) encoding of p = L-1 = 5, but so are 110 and 111
Advantages: easy to implement.
Disadvantages: strong representational bias, i.e., all parameter values
between 0 and L-2 have a single encoding, whilst the single parameter value
L-1 has 2^N - L + 1 encodings.
p    encoding
0    000
1    001
2    010
3    011
4    100
5    101
5    110
5    111
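A small Python sketch of clipping for this case (L not a power of 2, so N = ⌊log2(L)⌋ + 1); the helper names are illustrative:

```python
import math

def clip_encode(p, L):
    # N = floor(log2(L)) + 1 bits; values 0..L-1 keep their plain binary code
    N = int(math.log2(L)) + 1
    return format(p, f'0{N}b')

def clip_decode(bits, L):
    # any code word whose value exceeds L-1 is clipped to L-1
    return min(int(bits, 2), L - 1)

# Example from the slide: L = 6, N = 3; '101', '110' and '111' all decode to 5
print(clip_decode('110', 6))   # 5
```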
Scaling
Scaling consists of taking N = ⌊log2(L)⌋ + 1 bits and encoding p by the
binary representation of an integer value e such that p = ⌊e(L-1)/(2^N - 1)⌋
For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L=6. Then
N = ⌊log2(6)⌋ + 1 = 3.
Here, the binary encodings are not generally numerically equivalent to the
integer values they code
Advantages: easy to implement and smaller representational bias than clipping
(each value of p has 1 or 2 encodings, with double encodings evenly spread
over the values of p)
Disadvantages: more computation needed and still a small representational bias
p    encoding
0    000
0    001
1    010
2    011
2    100
3    101
4    110
5    111
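A Python sketch of scaling for the same example; the decoder takes the integer part of e(L-1)/(2^N - 1) as in the table above, and the encoder simply searches for a code word that decodes back to p, which is one straightforward (if not the most efficient) way to implement it:

```python
import math

def scale_decode(bits, L):
    N = len(bits)
    e = int(bits, 2)
    # p = floor(e * (L-1) / (2^N - 1))
    return (e * (L - 1)) // (2 ** N - 1)

def scale_encode(p, L):
    N = int(math.log2(L)) + 1
    # pick the smallest code word e that decodes back to p
    for e in range(2 ** N):
        if (e * (L - 1)) // (2 ** N - 1) == p:
            return format(e, f'0{N}b')

# Example from the slide: L = 6; p = 2 is encoded as '011', and '100' also decodes to 2
print(scale_encode(2, 6), scale_decode('100', 6))   # 011 2
```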
Real-valued Parameters (I)
Real values may be encoded as fixed point
numbers or integers via scaling and
quantization
If p ranges over [min, max], then p is encoded by
the binary representation of the integer part of:
    (2^N - 1)(p - min) / (max - min)
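A Python sketch of this fixed-point encoding; the function names are illustrative:

```python
def real_encode(p, lo, hi, N):
    # quantize p in [lo, hi] to the integer part of (2^N - 1)(p - lo)/(hi - lo)
    e = int((2 ** N - 1) * (p - lo) / (hi - lo))
    return format(e, f'0{N}b')

def real_decode(bits, lo, hi):
    N = len(bits)
    return lo + int(bits, 2) * (hi - lo) / (2 ** N - 1)

# Example: encode 0.3 on 8 bits over [0, 1]
print(real_encode(0.3, 0.0, 1.0, 8))       # '01001100' (76 = int(0.3 * 255))
print(real_decode('01001100', 0.0, 1.0))   # ~0.298
```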
Real-valued Parameters (II)
Real values may also be encoded using
thermometer encoding
Let T be an integer greater than 1
Thermometer encoding of real values on T bits consists of normalizing all
real values to the interval [0,1] and converting each normalized value x to
a bit-string of ⌊xT⌋ 1s (i.e., xT rounded down) followed by trailing 0s as
needed.
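A Python sketch of thermometer encoding, assuming the value has already been normalized to [0, 1]:

```python
import math

def thermometer_encode(x, T):
    # floor(x*T) leading 1s followed by trailing 0s, on T bits total
    ones = min(int(math.floor(x * T)), T)
    return '1' * ones + '0' * (T - ones)

# Example: x = 0.75 on T = 4 bits -> floor(3.0) = 3 leading 1s
print(thermometer_encode(0.75, 4))   # '1110'
```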
Vectors of Parameters
Vectors of parameters are encoded on multi-gene
chromosomes by combining the encodings of each individual
parameter
Let ei=[bi0, ..., biN] be the encoding of the ith of M parameters
There are two possibilities for combining the ei 's onto a
chromosome:
Concatenating: Here, individual encodings simply follow each
other in some pre-defined order, e.g., [b10, ..., b1N, ..., bM0, ...,
bMN]
Interleaving: Here, the bits of each individual encoding are
interleaved, e.g., [b10, ..., bM0, ..., b1N, ..., bMN]
The order of parameters in the vector (resp., genes on the
chromosome) is important, especially for concatenated
encodings
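A Python sketch of the two combination schemes, assuming each parameter has already been encoded as a bit-string of the same length:

```python
def concatenate(encodings):
    # [b10..b1N, b20..b2N, ..., bM0..bMN]: genes follow each other in order
    return ''.join(encodings)

def interleave(encodings):
    # [b10, b20, ..., bM0, b11, ..., bMN]: take the k-th bit of every gene in turn
    return ''.join(bits[k] for k in range(len(encodings[0])) for bits in encodings)

genes = ['101', '010', '111']
print(concatenate(genes))   # 101010111
print(interleave(genes))    # 101011101
```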
Gray Coding (I)
A Gray code represents each number in the
sequence of integers 0, 1, ..., 2^N - 1 as a binary
string of length N, such that adjacent integers
have representations that differ in only one bit
position
A number of different Gray codes exist. One
simple algorithm to produce Gray codes starts
with all bits set to 0 and successively flips the
right-most bit that produces a new string
Gray Coding (II)
Integer Value    Binary coding    Gray coding
0                000              000
1                001              001
2                010              011
3                011              010
4                100              110
5                101              111
6                110              101
7                111              100
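The sequence in the table is the standard binary-reflected Gray code, which can also be computed directly; a Python sketch:

```python
def binary_to_gray(b):
    # XOR each bit of b with the bit to its left (b is a non-negative integer)
    return b ^ (b >> 1)

def gray_to_binary(g):
    # cumulative XOR of all right-shifts of g undoes the conversion
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Reproduces the table above for N = 3
for i in range(8):
    print(i, format(i, '03b'), format(binary_to_gray(i), '03b'))
```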
Gray Coding (III)
Advantages: random bit-flips (e.g., during mutation) are more likely
to produce small changes (i.e., there are no Hamming cliffs since
adjacent integers' representations differ by exactly one bit).
Disadvantages: big changes are rare but bigger than with binary
codes.
For example, consider the string 001. There are 3 possible bit flips
leading to the strings 000, 011 and 101
With standard binary encoding, 2 of the 3 flips lead to relatively large
changes (from 001(=1) to 011(=3) and from 001(=1) to 101(=5),
respectively)
With Gray coding, 2 of the 3 flips produce small changes (from 001(=1)
to 000(=0) and from 001(=1) to 011(=2), respectively)
However, the less probable (1 out of 3) flip from 001 to 101 produces a
bigger change under Gray coding (to 6) than under standard binary
encoding (to 5)
GA operators
We will restrict our discussion to binary
strings
The basic GA operators are:
Selection
Crossover
Mutation
Selection
Selection is the operation by which chromosomes are selected for
reproduction
Chromosomes corresponding to individuals with a higher fitness
have a higher probability of being selected
There are a number of possible selection schemes (we discuss some
here)
Fitness-based selection makes the following assumptions:
There exists a known quality measure Q for the solutions of the
problem
Finding a solution can be achieved by maximizing Q
For all potential solutions (good or bad), Q is positive.
A chromosome's fitness is taken to be the quality measure of the
individual it encodes
Fitness-proportionate Selection
This selection scheme is the most widely used in GAs
Let fi be the fitness value of individual i and let favg be
the average population fitness
Then, the probability of an individual i being selected is
given by:
pi = fi / (f1 + f2 + ... + fN) = (1/N) · (fi / favg)
Roulette Wheel
Fitness-proportionate selection (FPS) can be
implemented with the roulette-wheel algorithm
A wheel is constructed with markers
corresponding to fitness values
For each fitness value fi, the size of the marker
(i.e., the proportion of the wheel's
circumference) associated to fi is given by pi as
defined above
Hence, when the wheel is spun, the probability
of the roulette landing on fi (and thus selecting
individual i) is given by pi, as expected
Vector Representation
A vector v of M elements from {1, ..., N} is
constructed so that each individual i in {1, ...,
N} has M·pi entries in v
A random index r from {1, ..., M} is selected and
individual v(r) is selected
Example:
4 individuals such that f1=f2=10, f3=15 and f4=25
If M=12, then v=(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
Generate r=6, then individual v(6)=3 is selected
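A Python sketch of the vector-based implementation; M·pi is rounded to an integer here, so (as noted in the discussion later) the accuracy of the resulting probabilities depends on M:

```python
import random

def vector_select(fitnesses, M):
    total = sum(fitnesses)
    # build v so that individual i appears roughly M * pi times
    v = []
    for i, f in enumerate(fitnesses, start=1):
        v.extend([i] * round(M * f / total))
    # pick a random index into v and return the individual stored there
    return v[random.randrange(len(v))]

# Example from the slide: f = (10, 10, 15, 25), M = 12
print(vector_select([10, 10, 15, 25], 12))
```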
Cumulative Distribution
A random real-valued number r in [0, f1 + f2 + ... + fN]
is chosen and individual i, such that
    f1 + ... + f(i-1) ≤ r < f1 + ... + fi
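A Python sketch of the cumulative-distribution implementation; the binary search over the running sums is one way to find i efficiently:

```python
import bisect
import random
from itertools import accumulate

def cumulative_select(fitnesses):
    # cumulative sums f1, f1+f2, ..., f1+...+fN
    cum = list(accumulate(fitnesses))
    # draw r uniformly in [0, f1 + ... + fN)
    r = random.random() * cum[-1]
    # index i such that cum[i-1] <= r < cum[i] (0-based)
    return bisect.bisect_right(cum, r)

print(cumulative_select([10, 10, 15, 25]))
```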
Discussion (I)
The implementation based on cumulative distribution is effective but
relatively inefficient, whilst the implementation based on vector
representation is efficient but its effectiveness depends on M (i.e., the
value of M determines the level of quantization of the pi's and thus
accuracy depends on M).
Assume that N individuals have to be selected for reproduction. The
expected number of copies of each individual in the mating pool is:
Ni = N · pi = fi / favg
Hence, individuals with above-average fitness tend to have more than one
copy in the mating pool, whilst individuals with below-average fitness tend
not to be copied. This leads to problems with FPS.
Discussion (II)
Premature convergence
    Assume an individual X with fi >> favg but fi << fmax is produced in an
    early generation. As Ni >> 1, the genes of X quickly spread all over the
    population. At that point, crossover cannot generate any new solutions
    (only mutation can) and favg << fmax forever.
Stagnation
    Assume that at the end of a run (i.e., in the later generations) all
    individuals have a relatively high and similar fitness, i.e., fi is
    almost fmax for all i. Then, Ni is almost 1 for all i and there is
    virtually no selective pressure.
Both of these problems can be solved with fitness
scaling techniques
Fitness Scaling
Essentially, fitness values are scaled down at the beginning and scaled up
towards the end
There are 3 general scaling methods:
Linear scaling
    f is replaced by f' = a·f + b, where a and b are chosen such that:
        f'avg = favg (i.e., the scaled average is the same as the raw average)
        f'max = c·favg (c is the number of expected copies desired for the best individual; usually c=2)
    The scaled fitness function may take on negative values if there are a few
    bad individuals with fitness much lower than favg and favg is close to fmax.
    One solution is to arbitrarily assign the value 0 to all negative fitness values.
Sigma truncation
    f is replaced by f' = f - (favg - c·σ), where σ is the population standard
    deviation, c is a reasonable multiple of σ (usually 1 ≤ c ≤ 3) and negative
    results are arbitrarily set to 0. Truncation removes the problem of scaling
    to negative values. (Note that truncated fitness values may also be scaled
    if desired)
Power law scaling
    f is replaced by f' = f^k for some suitable k. This method is not used very
    often. In general, k is problem-dependent and may require dynamic change to
    stretch or shrink the range as needed
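A Python sketch of linear scaling and sigma truncation as described above; the coefficients a and b are derived from the two conditions f'avg = favg and f'max = c·favg:

```python
import statistics

def linear_scale(fitnesses, c=2.0):
    favg, fmax = statistics.mean(fitnesses), max(fitnesses)
    if fmax == favg:                      # all fitnesses equal: nothing to scale
        return list(fitnesses)
    # solve a*favg + b = favg and a*fmax + b = c*favg
    a = favg * (c - 1) / (fmax - favg)
    b = favg * (1 - a)
    # negative scaled values are arbitrarily set to 0
    return [max(0.0, a * f + b) for f in fitnesses]

def sigma_truncate(fitnesses, c=2.0):
    favg = statistics.mean(fitnesses)
    sigma = statistics.pstdev(fitnesses)  # population standard deviation
    return [max(0.0, f - (favg - c * sigma)) for f in fitnesses]
```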
Rank Selection
All individuals are sorted by increasing values of their fitness
Then, each individual is assigned a probability pi of being selected from some prior
probability distribution
Typical distributions include:
    Linear: Here, pi = a·i + b
    Negative exponential: Here, pi = a·e^(b·i) + c
Rank selection (RS) has little biological plausibility. However, it has the
following desirable features:
    No premature convergence. Because of the ranking and the probability
    distribution imposed on it, even less fit individuals will be selected
    (e.g., let there be 3 individuals such that f1=90, f2=7, f3=3, and
    pi = -0.4i + 1.3. With FPS, p1=0.9 >> p2=0.07 and p3=0.03, so that
    individual 1 comes to saturate the population. With RS, p1=0.9, p2=0.5
    and p3=0.1, so that individual 2 is also selected).
    No stagnation. Even at the end, N1 ≈ N2 ≈ ... (similar argument to above).
    Explicit fitness values not needed. To order individuals, only the ability
    of comparing pairs of solutions is necessary.
However: rank selection introduces a reordering overhead and makes a theoretical
analysis of convergence difficult.
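A Python sketch of rank selection with a simple linear distribution over ranks; the weights below (weight = rank) are illustrative, and any pi = a·i + b with a > 0 would do:

```python
import random

def rank_select(fitnesses):
    # sort individuals by increasing fitness; rank 1 = worst, rank N = best
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    weights = [0.0] * len(fitnesses)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = rank
    # selection only needs the ordering, never the raw fitness values
    return random.choices(range(len(fitnesses)), weights=weights, k=1)[0]

print(rank_select([90, 7, 3]))   # index 0 most likely, but 1 and 2 are still selected sometimes
```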
Tournament Selection
Tournament selection can be viewed as a noisy version
of rank selection.
The selection process is two-stage:
Select a group of N (N ≥ 2) individuals
Select the individual with the highest fitness from the group and
discard all others
Tournament selection inherits the advantages of rank
selection. In addition, it does not require global
reordering and is more naturally-inspired.
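A Python sketch of tournament selection, where k is the group (tournament) size:

```python
import random

def tournament_select(population, fitnesses, k=2):
    # pick a random group of k individuals and keep only the fittest of the group
    group = random.sample(range(len(population)), k)
    best = max(group, key=lambda i: fitnesses[i])
    return population[best]
```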
Elitist Selection
The idea behind elitism is that at least one copy of the
best individual in the population is always passed on to
the next generation.
The main advantage is that convergence is guaranteed
(i.e., if the global maximum is discovered, the GA
converges to that maximum). By the same token,
however, there is a risk of being trapped in a local
maximum.
One alternative is to save the best individual so far in
some kind of register and, at the end of each run, to
designate it as the solution instead of using the best of
the last generation.
1-point Crossover
Here, the chromosomes of the parents are cut at
some randomly chosen common point and the
resulting sub-chromosomes are swapped
For example
P1=1010101010 and P2=1110001110
Crossover point between the 6th and 7th bits
Then the offspring are:
O1=1010101110
O2=1110001010
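A Python sketch of 1-point crossover on bit-strings; placing the cut after the 6th bit reproduces the example above:

```python
import random

def one_point_crossover(p1, p2, cut=None):
    if cut is None:
        cut = random.randint(1, len(p1) - 1)   # random common cut point
    # swap the tails of the two parents
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

print(one_point_crossover('1010101010', '1110001110', cut=6))
# ('1010101110', '1110001010')
```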
2-point Crossover
Here, the chromosomes are thought of as rings with the
first and last gene connected (i.e., wrap-around
structure)
The rings are cut in two sites and the resulting sub-rings
are swapped
For example
P1=1010101010 and P2=1110001110
Crossover points are between the 2nd and 3rd bits, and between
the 6th and 7th bits
Then the offspring are:
O1=1110101110
O2=1010001010
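A Python sketch of 2-point crossover; swapping the segment between the two cut points is equivalent to the ring view described above, and the cuts (2, 6) reproduce the example:

```python
import random

def two_point_crossover(p1, p2, cuts=None):
    if cuts is None:
        cuts = sorted(random.sample(range(1, len(p1)), 2))
    c1, c2 = cuts
    # swap the sub-strings between the two cut points
    o1 = p1[:c1] + p2[c1:c2] + p1[c2:]
    o2 = p2[:c1] + p1[c1:c2] + p2[c2:]
    return o1, o2

print(two_point_crossover('1010101010', '1110001110', cuts=(2, 6)))
# ('1010001010', '1110101110') -- the two offspring from the example
```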
Uniform Crossover
Here, each gene of the offspring is
selected randomly from the corresponding
genes of the parents
For example
P1=1010101010 and P2=1110001110
Then the offspring could be:
O=1110101110
Note: produces a single offspring
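A Python sketch of uniform crossover, producing a single offspring as noted above:

```python
import random

def uniform_crossover(p1, p2):
    # each gene is taken from one parent or the other with equal probability
    return ''.join(random.choice(pair) for pair in zip(p1, p2))

print(uniform_crossover('1010101010', '1110001110'))
```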
Mutation
Mutation consists of making (usually
small) alterations to the values of one or
more genes in a chromosome
In binary chromosomes, it consists of
flipping random bits of the genotype. For
example, 1010101010 may become
1011101010 if the 4th bit is flipped.
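A Python sketch of bit-flip mutation, where each bit of a binary chromosome is flipped independently with a small probability:

```python
import random

def mutate(chromosome, p_mut=0.01):
    # flip each bit of the binary string with probability p_mut
    return ''.join(b if random.random() >= p_mut else ('1' if b == '0' else '0')
                   for b in chromosome)

print(mutate('1010101010', p_mut=0.3))
```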