Bayesian Optimization Algorithm, Decision Graphs, and Occam`s

Download Report

Transcript Bayesian Optimization Algorithm, Decision Graphs, and Occam`s

Bayesian Optimization
Algorithm, Decision Graphs,
and Occam’s Razor
Martin Pelikan, David E. Goldberg,
and Kumara Sastry
IlliGAL Report No. 2000020
May 2000.
Abstract
• The use of various scoring metrics for
Bayesian networks.
• The use of decision graphs in Bayesian
networks to improve the performance of
the BOA.
• BDe metric for Bayesian networks with
decision graphs.
Bayesian Networks
• Two basics components in Bayesian Networks
– A scoring metric for discriminates the networks
– A search algorithm for finding the best scoring metric
value
• BOA (in previous works)
– The complexity of the considered models was
bounded by the maximum number of incoming edges
into any node.
– To search the space of networks, a simple greedy
algorithm was used due to its efficiency.
Bayesian-Dirichlet Metric
• BDe metric combines the prior knowledge about
the problem and the statistical data from a given
data set.
p( B) p( D | B)
• Bayes theorem p( B | D) 
p ( D)
• The higher the p(B|D), the more likely the
network B is a correct model of the data.
 Bayesian scoring metric, or the posterior
probability
– Even more, we use a fixed data set D.
Bayesian-Dirichlet Metric
• p(B) : prior probability of the network B
p( B)  c 
c : normalizat ion constant
 : constant factor penalizing the network for each unmatched edge with the prior network
 : symmetric difference between B and the prior network
• BDe metric gives preference to simpler networks
– But, it’s not enough!
Bayesian-Dirichlet Metric
• p(B|D)
– Data is a multinomial sample
– Parameters are independent
1. The parameters associated with each variable are
independent (global parameter independence)
2. The parameters associated with each instance of
the parents of a variable are independent (local
parameter independence)
– Dirichlet distribution
– No missing data (complete data)
Bayesian-Dirichlet Metric
n 1
(m( i ))
(m( i ))
p( D | B)  



(
m
(

)

m
(

))
i 0  i
i  0  i ( m( i )  m( i ))
i
i
n 1
• Often referred to K2 metric
Minimum Description Length
Metric
length( B, D)  length( B)  length( X ,  )  length( D | B)

 n 
 
length( B)    log 2 n  log 2 
i 0 
 |  i | 
n 1
1
length( X ,  )  log 2 N  2| i |
2
i 0
n 1
n 1
length( D | B)   N  p( xi ,  i ) log 2 p( xi |  i )
i  0 xi
i
• Not good for using prior information
Constructing a Network
• Constructing a best network is NPcomplete.
• Most of the commonly used metrics can be
decomposed into independent terms each
of which corresponds to one variable.
• Empirical results show that more
sophisticated search algorithms do not
improve the obtained result significantly.
Decision Graphs in Bayesian
Networks
• The use of local structures as decision
trees, decision graphs, and default tables
to represent equalities among parameters
was proposed
• The network construction algorithm takes
an advantage of using decision graphs by
directly manipulating the network
structure through the graphs.
Decision Graphs
• A decision graph is an extension of a
decision tree in which each non-root node
can have multiple parents.
Advantages of Decision Graph
• Much less parents can be used to
represent a model
• Learning more complex class of models,
called Bayesian multinets
• Performs smaller and more specific steps
what results in better models with respect
to their likelihood.
• Network complexity measure can be
incorporated into the scoring metir
Bayesian Score for Networks
with Decision Graphs
(m( xi , i, l )  m( xi , i, l ))
(m(i, l ))
p ( D | B)  



(
m
(
i
,
l
)

m
(
i
,
l
))
(m( xi , i, l ))
i  0 lLi
xi
n 1
Li : the set of leaves in the dicision graph
m( i ) : the number of instances in D which end up
the traversal through t he graph
m : the prior knowledge
Operators on Decision Graphs
split
merge
Constructing BN with DG
1. Initialize a decision graph Gi for each
node xi to a graph containing only a
single leaf.
2. Initialize the network B into an empty
network.
3. Choose the best split or merge that does
not result in a cycle in B.
4. If the best operator does not improve the
score, finish.
Constructing BN with DG
5. Execute the chosen operator
6. If the operator was a split, update the
network B by adding a new edge.
7. Go to (3)
Experiments
• One-max
• 3-deceptive
• Spin-glass
• Graph bisection