Estimation of Distribution Algorithm based on Markov Random Fields

Siddhartha Shakya
School of Computing
The Robert Gordon University
24-02-2006

Outline

• From GAs to EDAs
• Probabilistic Graphical Models in EDAs
  – Bayesian networks
  – Markov Random Fields
• Fitness modelling approach to estimating and sampling MRF in EDA
  – Gibbs distribution, energy function and modelling the fitness
  – Estimating parameters (fitness modelling approach)
  – Sampling MRF (several different approaches)
• Conclusion

Genetic Algorithms (GAs)

• Population-based optimisation technique
• Based on Darwin's theory of evolution
• A solution is encoded as a set of symbols known as a chromosome
• A population of solutions is generated
• Genetic operators are then applied to the population to produce the next generation, which replaces the parent population

Simple GA simulation

GA to EDA

Simple EDA simulation
(Figure: a worked simulation. A population of 5-bit chromosomes is shown together with the univariate marginal probabilities estimated at each position, e.g. $p(x_5 = 1)$, with values moving from 0.5 towards 0 or 1 as the population converges. New solutions are sampled from the factorised distribution

    $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i)$  )

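To make the step concrete, here is a minimal Python sketch of this univariate estimate-and-sample loop, assuming a OneMax-style fitness (sum of bits) and truncation selection; all names and sizes are illustrative, not from the talk.

```python
import random

def umda_step(population, n_select):
    """One univariate EDA step: estimate marginals from the fittest
    solutions, then sample a new population of the same size."""
    # Truncation selection with OneMax fitness (sum of bits)
    selected = sorted(population, key=sum, reverse=True)[:n_select]
    n = len(selected[0])
    # p[i] estimates the marginal p(x_i = 1) in the selected set
    p = [sum(s[i] for s in selected) / len(selected) for i in range(n)]
    # Sample the factorised JPD p(x) = prod_i p(x_i)
    return [[1 if random.random() < p[i] else 0 for i in range(n)]
            for _ in range(len(population))]

pop = [[random.randint(0, 1) for _ in range(5)] for _ in range(20)]
for _ in range(10):
    pop = umda_step(pop, n_select=10)
```
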
Joint Probability Distribution (JPD)

• Solution as a set of random variables: $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$
• Joint Probability Distribution (JPD): $p(\mathbf{x}) = p(x_1, x_2, \ldots, x_n)$
• Exponential in the number of variables, therefore not feasible to calculate in most cases: $2^n$ parameters (marginal probabilities) for $x_i \in \{0, 1\}$
• Needs simplification!!

Factorisation of JPD

• Univariate model: no interaction; the simplest model
  $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i), \quad x_i \in \mathbf{x}$
• Bivariate model: pairwise interaction
  $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_j), \quad x_i, x_j \in \mathbf{x}$
• Multivariate model: interaction of more than two variables
  $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{x}_A), \quad x_i \in \mathbf{x},\ \mathbf{x}_A \subseteq \mathbf{x}$

Typical estimation and sampling of JPD in EDAs

• Learn the interactions between variables in the solution
• Learn the probabilities associated with interacting variables
• This specifies the JPD: p(x)
• Sample the JPD (i.e. the learned probabilities)

Probabilistic Graphical Models

• An efficient tool to represent the factorisation of the JPD
• A marriage between probability theory and graph theory
• Consists of two components:
  – Structure
  – Parameters
• Two types of PGM:
  – Directed PGM (Bayesian networks)
  – Undirected PGM (Markov Random Fields)

Directed PGM (Bayesian networks)

• Structure: a Directed Acyclic Graph (DAG). In the running example, X1 and X2 are the parents of X3, and X3 is the parent of X4 and X5.
• Independence relationship: a variable is conditionally independent of the rest of the variables given its parents
• Parameters: conditional probabilities, one table per node:
  $p(x_1 = 0),\ p(x_1 = 1)$
  $p(x_2 = 0),\ p(x_2 = 1)$
  $p(x_3 \mid x_1, x_2)$ for every combination of $x_1, x_2, x_3 \in \{0, 1\}$
  $p(x_4 \mid x_3)$ and $p(x_5 \mid x_3)$ for every combination of values

Bayesian networks

• The factorisation of the JPD encoded in terms of conditional probabilities (for the example network) is
  $p(\mathbf{x}) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_3)\,p(x_5 \mid x_3)$
• JPD for a BN in general, where $\pi_i$ denotes the parents of $x_i$:
  $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid \pi_i)$

Estimating a Bayesian network

• Estimate the structure (the DAG over X1..X5)
• Estimate the parameters (the conditional probability tables)
• This completely specifies the JPD:
  $p(\mathbf{x}) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_3)\,p(x_5 \mid x_3)$
• The JPD can then be sampled

BN based EDAs

1. Initialise parent solutions
2. Select a set from the parent solutions
3. Estimate a BN from the selected set
   a. Estimate structure
   b. Estimate parameters
4. Sample the BN to generate the new population
5. Replace the parents with the new set and go to 2 until the termination criterion is satisfied

How to estimate and sample BN in EDAs

• Estimating structure
  – Score + search techniques
  – Conditional independence tests
• Estimating parameters
  – Trivial in EDAs: the dataset is complete
  – Estimate the probabilities of parents before children
• Sampling
  – Probabilistic Logic Sampling (sample parents before children), e.g. for
    $p(\mathbf{x}) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_3)\,p(x_5 \mid x_3)$

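As an illustration of Probabilistic Logic Sampling on this example network, here is a hedged Python sketch; the CPT numbers are invented placeholders, and the only point is the ancestral order (parents sampled before children).

```python
import random

def bernoulli(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def sample_bn():
    # Roots first: x1 and x2 have no parents
    x1 = bernoulli(0.5)                      # placeholder p(x1 = 1)
    x2 = bernoulli(0.5)                      # placeholder p(x2 = 1)
    # Child x3 conditioned on its parents (x1, x2)
    p_x3 = {(0, 0): 0.1, (0, 1): 0.7,        # placeholder p(x3 = 1 | x1, x2)
            (1, 0): 0.6, (1, 1): 0.9}
    x3 = bernoulli(p_x3[(x1, x2)])
    # Leaves x4 and x5 conditioned on x3
    x4 = bernoulli(0.8 if x3 else 0.2)       # placeholder p(x4 = 1 | x3)
    x5 = bernoulli(0.3 if x3 else 0.7)       # placeholder p(x5 = 1 | x3)
    return (x1, x2, x3, x4, x5)
```
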
BN based EDAs

• A well-established approach in EDAs: BOA, EBNA, LFDA, MIMIC, COMIT, BMDA
• References:
  – Larrañaga and Lozano 2002
  – Pelikan 2002

Markov Random Fields (MRF)

• Structure: an undirected graph. In the running example, the maximal cliques are {X1, X2, X3}, {X2, X3, X4}, {X2, X5} and {X3, X6}.
• Local independence: a variable is conditionally independent of the rest of the variables given its neighbours
• Global independence: two sets of variables are conditionally independent of each other if there is a third set that separates them
• Parameters: potential functions defined on the cliques
  $\psi_1(x_1, x_2, x_3),\ \psi_2(x_2, x_3, x_4),\ \psi_3(x_2, x_5),\ \psi_4(x_3, x_6)$

Markov Random Field

• The factorisation of the JPD encoded in terms of potential functions over the maximal cliques is
  $p(\mathbf{x}) = \frac{1}{Z}\,\psi(x_1, x_2, x_3)\,\psi(x_2, x_3, x_4)\,\psi(x_2, x_5)\,\psi(x_3, x_6)$
  where
  $Z = \sum_{\mathbf{x}} \psi(x_1, x_2, x_3)\,\psi(x_2, x_3, x_4)\,\psi(x_2, x_5)\,\psi(x_3, x_6)$
  is the partition function
• JPD for an MRF in general, with $m$ maximal cliques $c_i$:
  $p(\mathbf{x}) = \frac{1}{Z} \prod_{i=1}^{m} \psi(c_i)$

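A small Python sketch of why $Z$ is the costly part: evaluating it exactly means summing the product of potentials over all $2^n$ configurations, here brute-forced for the six-variable example with an arbitrary placeholder potential.

```python
from itertools import product

def psi(*vals):
    """Placeholder potential: any positive function of its clique."""
    return 1.0 + sum(vals)

def unnormalised(x):
    x1, x2, x3, x4, x5, x6 = x
    return psi(x1, x2, x3) * psi(x2, x3, x4) * psi(x2, x5) * psi(x3, x6)

# Partition function: sum over all 2^6 binary configurations
Z = sum(unnormalised(x) for x in product([0, 1], repeat=6))

def p(x):
    """Normalised joint probability of one configuration."""
    return unnormalised(x) / Z

print(p((0, 1, 0, 1, 0, 1)))
```
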
Estimating a Markov Random Field

• Estimate the structure from data
• Estimate the parameters
  – Requires the potential functions to be numerically defined
• This completely specifies the JPD:
  $p(\mathbf{x}) = \frac{1}{Z}\,\psi(x_1, x_2, x_3)\,\psi(x_2, x_3, x_4)\,\psi(x_2, x_5)\,\psi(x_3, x_6)$,
  $Z = \sum_{\mathbf{x}} \psi(x_1, x_2, x_3)\,\psi(x_2, x_3, x_4)\,\psi(x_2, x_5)\,\psi(x_3, x_6)$
• The JPD can then be sampled
  – No specific order (not a DAG), so sampling is a bit problematic

MRF in EDA

• Has recently been proposed as an estimation of distribution technique in EDA
  – Shakya et al. 2004, 2005
  – Santana et al. 2003, 2005

MRF based EDA

1. Initialise parent solutions
2. Select a set from the parent solutions
3. Estimate an MRF from the selected set
   a. Estimate structure
   b. Estimate parameters
4. Sample the MRF to generate the new population
5. Replace the parents with the new solutions and go to 2 until the termination criterion is satisfied

How to estimate and sample MRF in EDA

• Learning structure
  – Conditional independence tests (MN-EDA, MN-FDA)
  – Linkage detection algorithm (LDFA)
• Learning parameters
  – Junction tree approach (FDA)
  – Junction graph approach (MN-FDA)
  – Kikuchi approximation approach (MN-EDA)
  – Fitness modelling approach (DEUM)
• Sampling
  – Probabilistic Logic Sampling (FDA, MN-FDA)
  – Probability vector approach (DEUMpv)
  – Direct sampling of Gibbs distribution (DEUMd)
  – Metropolis sampler (Is-DEUMm)
  – Gibbs sampler (Is-DEUMg, MN-EDA)

Fitness modelling approach

• Hammersley-Clifford theorem: the JPD for any MRF follows the Gibbs distribution
  $p(\mathbf{x}) = \frac{e^{-U(\mathbf{x})/T}}{Z}$   (a)
• The energy of the Gibbs distribution in terms of potential functions over the cliques:
  $U(\mathbf{x}) = \sum_{i=1}^{m} u(c_i)$
• Assuming the probability of a solution is proportional to its fitness:
  $p(\mathbf{x}) = \frac{f(\mathbf{x})}{Z}$   (b)
• From (a) and (b), a model of the fitness function, the MRF fitness model (MFM), is derived (taking T = 1):
  $-\ln(f(\mathbf{x})) = U(\mathbf{x})$, or $-\ln(f(\mathbf{x})) = \sum_{i=1}^{m} u(c_i)$

MRF fitness model (MFM)

$-\ln(f(\mathbf{x})) = U(\mathbf{x}) = \sum_{i=1}^{m} u(c_i)$

• Properties:
  – Completely specifies the JPD for the MRF
  – Negative relationship between fitness and energy, i.e. minimising energy = maximising fitness
• Tasks:
  – Need to find the structure for the MRF
  – Need to numerically define the clique potential functions

MRF fitness model (MFM)

• Let us start with the simplest model, the univariate model; this eliminates structure learning :)
• For the univariate model there are n singleton cliques
• For each singleton clique, assign a potential function $u_i(x_i) = \alpha_i x_i$, where $\alpha_i$ is the MRF parameter for variable $x_i$
• The corresponding MFM:
  $-\ln(f(\mathbf{x})) = U(\mathbf{x}) = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n$
• In terms of the Gibbs distribution:
  $p(\mathbf{x}) = \frac{e^{-\sum_{i=1}^{n} \alpha_i x_i / T}}{Z}$

Estimating MRF parameters using MFM

$-\ln(f(\mathbf{x})) = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n$

• Each chromosome gives us a linear equation
• Applying this to a set of selected solutions gives us a system of linear equations
• Solving it gives us an approximation to the MRF parameters
• Knowing the MRF parameters completely specifies the JPD
• The next step is to sample the JPD

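A minimal numpy sketch of this estimation step, assuming the $-1/+1$ bit encoding used later in the talk and a least-squares solution of the (generally over- or under-determined) system; the function name is illustrative.

```python
import numpy as np

def estimate_mrf_params(selected, fitness):
    """selected: list of 0/1 chromosomes; fitness: their f(x) > 0.
    Each row encodes one equation -ln f(x) = a1*x1 + ... + an*xn."""
    A = 2.0 * np.array(selected, dtype=float) - 1.0  # map 0/1 -> -1/+1
    b = -np.log(np.array(fitness, dtype=float))      # right-hand sides
    # lstsq solves the system in the least-squares sense (via SVD)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# e.g. the two equations from the DEUMpv simulation later in the talk:
alpha = estimate_mrf_params([[0, 1, 1, 1, 1], [1, 0, 1, 0, 1]], [4.0, 3.0])
```
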
General DEUM framework

Distribution Estimation Using MRF algorithm (DEUM)

1. Initialise parent population P
2. Select set D from P (can use D = P !!)
3. Build an MFM, fit it to D to estimate the MRF parameters
4. Sample the MRF to generate the new population
5. Replace P with the new population and go to 2 until the termination criterion is satisfied

How to sample MRF

• Probability vector approach
• Direct Sampling of Gibbs Distribution
• Metropolis sampling
• Gibbs sampling

Probability vector approach to sample MRF

$-\ln(f(\mathbf{x})) = U(\mathbf{x}) = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n$

• Minimise U(x) to maximise f(x)
• To minimise U(x), each $\alpha_i x_i$ should be minimal
• This suggests: if $\alpha_i$ is negative then the corresponding $x_i$ should be positive
• We could get an optimal chromosome for the current population just by looking at the signs of the $\alpha_i$
• However, the current population does not always contain enough information to generate the optimum
• So we use the sign of each $\alpha_i$ to update a vector of probabilities

DEUM with probability vector (DEUMpv)

Updating Rule

For i = 1 to n do
  if $\alpha_i < 0$ then $p_i \leftarrow p_i(1 - \lambda) + \lambda$
  if $\alpha_i > 0$ then $p_i \leftarrow p_i(1 - \lambda)$

• Uses the sign of each MRF parameter to direct the search towards the value of the respective variable that minimises the energy U(x)
• The learning rate $\lambda$ controls convergence

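The rule reads directly as code; a sketch, with lr standing for the learning rate $\lambda$ (the value 0.2 in the comment is inferred from the simulation that follows, not stated on the slide).

```python
def update_probability_vector(p, alpha, lr):
    """Shift each p_i = p(x_i = 1) towards the bit value that
    minimises energy: alpha_i < 0 favours x_i = 1, alpha_i > 0
    favours x_i = 0."""
    for i in range(len(p)):
        if alpha[i] < 0:
            p[i] = p[i] * (1.0 - lr) + lr   # push p(x_i = 1) up
        elif alpha[i] > 0:
            p[i] = p[i] * (1.0 - lr)        # push p(x_i = 1) down
    return p

# With p = [0.5]*5, alpha = [0.05, -0.05, -0.625, -0.05, -0.625] and
# lr = 0.2 this gives [0.4, 0.6, 0.6, 0.6, 0.6], matching the
# simulation below.
```
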
Simulation of DEUMpv
(Figure: a worked example. A probability vector initialised to (0.5, 0.5, 0.5, 0.5, 0.5) samples a population of 5-bit chromosomes with OneMax-style fitness values, e.g. 01111 with f = 4 and 10101 with f = 3. With bits encoded as -1/+1, each selected chromosome gives one linear equation of the MFM:

    $-\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = -\ln 4 \approx -1.4$
    $\alpha_1 - \alpha_2 + \alpha_3 - \alpha_4 + \alpha_5 = -\ln 3 \approx -1.1$

Solving the system gives $\alpha \approx (0.05, -0.05, -0.625, -0.05, -0.625)$, and the updating rule moves the probability vector to (0.4, 0.6, 0.6, 0.6, 0.6).)

Results

• OneMax Problem

Results

• F6 function optimisation

Results

• Trap 5 function
• Deceptive problem
• No solution found

Sampling MRF

• Probability vector approach
• Direct sampling of Gibbs distribution
• Metropolis sampling
• Gibbs sampling

Direct Sampling of Gibbs distribution

• In the probability vector approach, only the sign of each MRF parameter is used
• However, one could directly sample from the Gibbs distribution and make use of the values of the MRF parameters
• One could also use the temperature coefficient to manipulate the probabilities

Direct Sampling of Gibbs distribution

$p(\mathbf{x}) = \frac{e^{-U(\mathbf{x})/T}}{Z}, \qquad U(\mathbf{x}) = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n$

The univariate marginal for $x_i \in \{-1, 1\}$:

$p(x_i = 1) = \frac{e^{-(U(\mathbf{x})|x_i = 1)/T}}{\sum_{x_i \in \{-1, 1\}} e^{-(U(\mathbf{x})|x_i)/T}} = \frac{e^{-(U(\mathbf{x})|x_i = 1)/T}}{e^{-(U(\mathbf{x})|x_i = 1)/T} + e^{-(U(\mathbf{x})|x_i = -1)/T}}$

Since all terms not involving $x_i$ cancel:

$p(x_i = 1) = \frac{1}{1 + e^{2\alpha_i/T}}, \qquad p(x_i = -1) = \frac{1}{1 + e^{-2\alpha_i/T}}$

Direct Sampling of Gibbs distribution

$p(x_i = 1) = \frac{1}{1 + e^{2\alpha_i/T}}, \qquad p(x_i = -1) = \frac{1}{1 + e^{-2\alpha_i/T}}$

• The temperature coefficient has an important role
• Decreasing T cools each probability towards either 1 or 0, depending on the sign and value of $\alpha_i$
• This forms the basis for the DEUM based on direct sampling of the Gibbs distribution (DEUMd)

DEUM with direct sampling (DEUMd)

1. Generate an initial population, P, of size M
2. Select the N fittest solutions, N ≤ M
3. Calculate the MRF parameters $\alpha_i$
4. Generate M new solutions by sampling the univariate distribution
   $p(\mathbf{x}) = \frac{e^{-\sum_{i=1}^{n} \alpha_i x_i / T}}{Z} \;\Rightarrow\; p(x_i = 1) = \frac{1}{1 + e^{2\alpha_i/T}}, \quad \frac{1}{T} = \beta = g\tau$
   (g is the generation number and $\tau$ the cooling rate)
5. Replace P with the new population and go to 2 until complete

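Putting the pieces together, a hedged end-to-end sketch of DEUMd on OneMax; the population sizes, the cooling schedule $1/T = g\tau$ and the +1 fitness shift (to keep $f > 0$ for the logarithm) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, N, tau = 20, 60, 30, 2.0   # bits, population, selected, cooling rate

def fitness(pop):
    """OneMax, shifted by 1 so that ln f is always defined."""
    return pop.sum(axis=1) + 1.0

pop = rng.integers(0, 2, size=(M, n))
for g in range(1, 50):
    f = fitness(pop)
    sel = pop[np.argsort(f)[-N:]]                   # N fittest solutions
    A = 2.0 * sel - 1.0                             # bits -> -1/+1
    alpha, *_ = np.linalg.lstsq(A, -np.log(fitness(sel)), rcond=None)
    beta = g * tau                                  # 1/T grows: cooling
    p1 = 1.0 / (1.0 + np.exp(2.0 * alpha * beta))   # p(x_i = 1)
    pop = (rng.random((M, n)) < p1).astype(int)     # sample new population
```
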
DEUMd simulation
(Figure: a worked example. From the selected chromosomes (e.g. 01111 with f = 4, 10101 with f = 3), the system of equations

    $-\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = -\ln 4 \approx -1.4$
    $\alpha_1 - \alpha_2 + \alpha_3 - \alpha_4 + \alpha_5 = -\ln 3 \approx -1.1$

gives $\alpha \approx (0.05, -0.05, -0.625, -0.05, -0.625)$. Sampling each bit from $p(x_i = 1) = \frac{1}{1 + e^{2\alpha_i/T}}$ gives the marginal vector shown as (0.4, 0.6, 0.6, 0.6, 0.6), from which a new population (e.g. 01111, 10111, ...) is drawn.)

Experimental results

• OneMax Problem

F6 function optimization

Plateau Problem (n=180)

Checker Board Problem (n=100)

Trap function of order 5 (n=60)

Experimental results
Checker
Board
Fitness
Evaluation
EqualProducts
Fitness
Evaluation
Colville
Fitness
Evaluation
Six
Peaks
Fitness
Evaluation
24-02-2006
GA
UMDA
PBIL
DEUMd
254.68 ±
233.79 ±
243.5 ±
254.1 ±
(4.39)
(9.2)
(8.7)
(5.17)
427702.2 ±
50228.2 ±
191476.8 ±
33994 ±
(1098959.3)
(9127)
(37866.65)
(13966.75)
211.59 ±
5.03 ±
9.35 ±
2.14 ±
(1058.47)
(18.29)
(43.36)
(6.56)
1000000 ±
1000000 ±
1000000 ±
1000000 ±
(0)
(0)
(0)
(0)
0.61 ±
40.62 ±
2.69 ±
0.61 ±
(1.02)
(102.26)
(2.54)
(0.77)
1000000 ±
62914.56 ±
1000000 ±
1000000 ±
(0)
(6394.58)
(0)
(0)
99.1 ±
98.58 ±
99.81 ±
100 ±
(9)
(3.37)
(1.06)
(0)
49506 ±
121333.76 ±
58210 ±
26539 ±
(4940)
(14313.44)
(3659.15)
(1096.45)
Siddhartha Shakya
47
Analysis of Results

• For univariate problems (OneMax), given a population size of 1.5n, P = D and T → 0, the solution was found in a single generation
• For problems with low-order dependency between variables (Plateau and CheckerBoard), performance was significantly better than that of other univariate EDAs
• For the deceptive problems with higher-order dependency (Trap function and Six Peaks), DEUMd was deceived, but by slowing the cooling rate it was able to find the solution for Trap of order 5
• For the problems where the optimum was not known, performance was comparable to that of GA and other EDAs, and better in some cases

Cost-Benefit Analysis (the cost)

• Polynomial cost of estimating the distribution, compared to the linear cost of other univariate EDAs
• Cost to compute univariate marginal frequencies: $O(nN)$
• Cost to compute the SVD (for an $N \times n$ system):
  $N > n \Rightarrow O(n^2 N)$
  $N < n \Rightarrow O(n N^2)$
  $N \approx n \Rightarrow O(n^3)$

Cost-Benefit Analysis (the benefit)

• DEUMd can significantly reduce the number of fitness evaluations
• The quality of solutions was better for DEUMd than for the other compared EDAs
• DEUMd should be tried on problems where the increased solution quality outweighs the computational cost

Sampling MRF

• Probability vector approach
• Direct Sampling of Gibbs Distribution
• Metropolis sampling
• Gibbs sampling

Example problem: 2D Ising Spin Glass

Given the coupling constants J, find the value of each spin that minimises the energy H, and build the MRF fitness model for it.

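The formulas on this slide are images that did not survive the transcript. For reference, the standard form of the 2D Ising spin-glass energy over spins $s_i \in \{-1, +1\}$ with nearest-neighbour couplings is (an assumption about notation, not a quote from the talk):

```latex
H(\mathbf{s}) = -\sum_{\langle i, j \rangle} J_{ij}\, s_i\, s_j,
\qquad s_i \in \{-1, +1\}
```

The MRF fitness model then plays the role of $U(\mathbf{x})$ for this energy, so that minimising H corresponds to maximising fitness.
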
Metropolis Sampler

Difference in Energy

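These two slides are figures in the transcript; a generic Python sketch of the Metropolis step they describe is below, with the energy difference dU supplied as a function (for the univariate MFM it is $-2\alpha_i x_i$; for the Ising model it involves only the flipped spin's neighbour couplings). All names are illustrative.

```python
import math
import random

def metropolis(x, delta_U, T=1.0, steps=10000):
    """x: list of -1/+1 spins; delta_U(x, i): change in energy U
    if x[i] were flipped. Accept downhill moves always, uphill
    moves with probability exp(-dU/T)."""
    for _ in range(steps):
        i = random.randrange(len(x))
        dU = delta_U(x, i)
        if dU <= 0 or random.random() < math.exp(-dU / T):
            x[i] = -x[i]   # accept the flip
    return x

# For the univariate MFM U(x) = sum_i alpha_i * x_i:
#   delta_U = lambda x, i: -2.0 * alpha[i] * x[i]
```
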
DEUM with Metropolis sampler

Results

Sampling MRF

• Probability vector approach
• Direct Sampling of Gibbs Distribution
• Metropolis sampling
• Gibbs sampling

Conditionals from Gibbs distribution

For the 2D Ising spin glass problem:

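The conditional itself is an image in the transcript; under the energy form assumed above, a single spin's conditional given its neighbours $N(i)$ works out to (a sketch, not necessarily the talk's exact expression):

```latex
p\!\left(s_i = 1 \mid s_{N(i)}\right)
  = \frac{1}{1 + \exp\!\left( -\frac{2}{T} \sum_{j \in N(i)} J_{ij}\, s_j \right)}
```
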
Gibbs Sampler

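A hedged Python sketch of one Gibbs-sampler sweep over an L x L spin grid using that conditional; the coupling layout (separate horizontal and vertical coupling arrays, periodic boundaries) is an assumption for illustration.

```python
import math
import random

def gibbs_sweep(s, J_h, J_v, T=1.0):
    """One sweep over an L x L grid of -1/+1 spins s.
    J_h[i][j] couples (i, j) with (i, j+1); J_v[i][j] couples
    (i, j) with (i+1, j); boundaries are periodic."""
    L = len(s)
    for i in range(L):
        for j in range(L):
            # Local field: coupling-weighted sum over the 4 neighbours
            field = (J_h[i][j] * s[i][(j + 1) % L]
                     + J_h[i][(j - 1) % L] * s[i][(j - 1) % L]
                     + J_v[i][j] * s[(i + 1) % L][j]
                     + J_v[(i - 1) % L][j] * s[(i - 1) % L][j])
            p_up = 1.0 / (1.0 + math.exp(-2.0 * field / T))
            s[i][j] = 1 if random.random() < p_up else -1
    return s
```
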
DEUM with Gibbs sampler

Results

Summary

• From GA to EDA
• PGM approach to modelling and sampling the distribution in EDA
• DEUM: an MRF approach to modelling and sampling
• Learn structure: no structure learning so far (fixed models are used)
• Learn parameters: fitness modelling approach
• Sample MRF:
  – Probability vector approach
  – Direct sampling of Gibbs distribution
  – Metropolis sampler
  – Gibbs sampler
• Results are encouraging, and there is a lot more to explore