Learning: Structure


Structure Learning in Bayesian Networks
Eran Segal, Weizmann Institute
Structure Learning Motivation

- Network structure is often unknown
- Purposes of structure learning
  - Discover the dependency structure of the domain
    - Goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations
    - Set expectations: at best, we can recover the structure up to the I-equivalence class
  - Density estimation
    - Estimate a statistical model of the underlying distribution and use it to reason with and predict new instances
Advantages of Accurate Structure
X1
Spurious edge
X1
X2
Y
X2
Y
Missing edge
X1
X2
Y
 Increases number of
fitted parameters
 Cannot be compensated
by parameter estimation
 Wrong causality and
domain structure
assumptions
 Wrong causality and
domain structure
assumptions
Structure Learning Approaches

- Constraint based methods
  - View the Bayesian network as representing dependencies
  - Find a network that best explains the dependencies
  - Limitation: sensitive to errors in single dependencies
- Score based approaches
  - View learning as a model selection problem
    - Define a scoring function specifying how well the model fits the data
    - Search for a high-scoring network structure
  - Limitation: super-exponential search space
- Bayesian model averaging methods
  - Average predictions across all possible structures
  - Can be done exactly (in some cases) or approximately
Constraint Based Approaches

- Goal: find the best minimal I-Map for the domain
  - G is an I-Map for P if I(G) ⊆ I(P)
  - G is a minimal I-Map if deleting any edge from G renders it no longer an I-Map
  - G is a P-Map for P if I(G) = I(P)
- Strategy
  - Query the distribution for independence relationships that hold between sets of variables
  - Construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps

- Reverse factorization theorem:

  P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i \mid Pa(X_i))  ⟹  G is an I-Map of P

- Algorithm for constructing a minimal I-Map
  - Fix an ordering of the nodes X_1, ..., X_n
  - Select the parents of X_i as a minimal subset of {X_1, ..., X_{i-1}} such that Ind(X_i ; {X_1, ..., X_{i-1}} − Pa(X_i) | Pa(X_i))
- (Outline of) proof that the result is a minimal I-map
  - I-map: the factorization holds by construction
  - Minimal: by construction, removing any one edge destroys the factorization
- Limitations
  - Independence queries involve a large number of variables
  - Construction involves a large number of queries (2^{i-1} subsets for X_i)
  - We do not know the ordering, and the resulting network is sensitive to it
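A minimal sketch of the construction above, assuming access to a hypothetical independence oracle ind(x, rest, pa) that answers whether Ind(x ; rest | pa) holds in P. It enumerates candidate parent sets from smallest to largest, which makes the 2^{i-1}-subset cost of the limitation above explicit; the smallest separating subset found is in particular a minimal one.

```python
# Sketch only: `order` and `ind` are assumptions made for illustration.
from itertools import combinations

def minimal_imap(order, ind):
    """Return {variable: parent set} for the given ordering.

    order : list of variable names X_1, ..., X_n
    ind   : callable ind(x, rest, pa) -> bool, the independence oracle
    """
    parents = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        chosen = set(predecessors)          # trivially valid parent set
        # Try subsets from smallest to largest; the first subset that renders
        # X_i independent of the remaining predecessors is kept.
        for size in range(len(predecessors) + 1):
            found = False
            for pa in combinations(predecessors, size):
                rest = [v for v in predecessors if v not in pa]
                if ind(x, rest, list(pa)):
                    chosen = set(pa)
                    found = True
                    break
            if found:
                break
        parents[x] = chosen
    return parents
```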
Constructing P-Maps

- Simplifying assumptions
  - The network has bounded in-degree d per node
  - The oracle can answer independence queries over up to 2d+2 variables
  - The distribution P has a P-Map
- Algorithm
  - Step I: find the skeleton
  - Step II: find the immoralities (v-structures)
  - Step III: direct the constrained edges
Step I: Identifying the Skeleton

- For each pair X, Y, query all candidate sets Z for Ind(X ; Y | Z)
  - X–Y is in the skeleton if no such Z is found
  - If the graph in-degree is bounded by d, the running time is O(n^{2d+2})
    - Since if no direct edge exists, Ind(X ; Y | Pa(X), Pa(Y))
- Reminder: if there is no Z for which Ind(X ; Y | Z) holds, then X→Y or Y→X is in G*
  - Proof: assume no such Z exists, but G* has neither X→Y nor Y→X
  - Then we can find a set Z such that every path from X to Y is blocked
  - Then G* implies Ind(X ; Y | Z), and since G* is a P-Map, the independence holds in P
  - Contradiction
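A minimal sketch of Step I under the slide's assumptions (in-degree bounded by d, and a hypothetical independence oracle ind(x, y, z) for Ind(X ; Y | Z)); the variable list, oracle, and returned data structures are illustrative, not a fixed API.

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Return (skeleton_edges, separating_sets)."""
    skeleton = set()
    sepset = {}                              # witnesses, reused in Step II
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = False
        # Conditioning sets of size at most 2d suffice, e.g. Pa(X) ∪ Pa(Y).
        for size in range(min(len(others), 2 * d) + 1):
            for z in combinations(others, size):
                if ind(x, y, set(z)):
                    sepset[frozenset((x, y))] = set(z)
                    separated = True
                    break
            if separated:
                break
        if not separated:
            skeleton.add(frozenset((x, y)))  # no separating set: keep edge X–Y
    return skeleton, sepset
```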
Step II: Identifying Immoralities

- For each pair X, Y, query candidate triplets X, Z, Y with X–Z and Z–Y in the skeleton and X, Y not adjacent
  - Orient X→Z←Y if no W containing Z is found with Ind(X ; Y | W)
  - If the graph in-degree is bounded by d, the running time is O(n^{2d+3})
    - Since if such a W exists with Ind(X ; Y | W) and X–Z–Y is not an immorality, then Z ∈ W
- Reminder: if there is no W such that Z ∈ W and Ind(X ; Y | W), then X→Z←Y is an immorality
  - Proof: assume no such W exists, but X–Z–Y is not an immorality
  - Then either X→Z→Y, or X←Z←Y, or X←Z→Y exists
  - In each of these cases, the path X–Z–Y is blocked by conditioning on Z
  - Then, since X and Y are not connected, we can find a W that includes Z such that Ind(X ; Y | W)
  - Contradiction
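A minimal sketch of Step II. Instead of re-querying all sets W, it uses the standard shortcut of checking the separating sets recorded by the skeleton sketch above; the `skeleton` and `sepset` inputs are the illustrative structures from that sketch.

```python
from itertools import combinations

def find_immoralities(variables, skeleton, sepset):
    immoralities = []
    for x, y in combinations(variables, 2):
        if frozenset((x, y)) in skeleton:
            continue                              # X, Y adjacent: not a candidate
        for z in variables:
            if z in (x, y):
                continue
            if frozenset((x, z)) in skeleton and frozenset((z, y)) in skeleton:
                # X - Z - Y with X, Y non-adjacent: an immorality unless Z
                # appears in the set that separated X and Y in Step I.
                if z not in sepset.get(frozenset((x, y)), set()):
                    immoralities.append((x, z, y))   # orient X -> Z <- Y
    return immoralities
```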
Answering Independence Queries

- Basic query
  - Determine whether two variables are independent
  - Well studied question in statistics
- Common to frame the query as hypothesis testing
  - Null hypothesis H_0: the data was sampled from P*(X, Y) = P*(X) P*(Y)
  - Need a procedure that will accept or reject the hypothesis
  - χ² test to assess the deviance of the data from the hypothesis:

    d_{\chi^2}(D) = \sum_{x,y} \frac{(M[x,y] - M \hat{P}(x) \hat{P}(y))^2}{M \hat{P}(x) \hat{P}(y)}

  - Alternatively, use the mutual information between X and Y
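A minimal sketch of the χ² independence test above. The `counts` argument is a hypothetical |X| × |Y| contingency table of counts M[x, y]; the p-value comes from the χ² distribution with (|X|−1)(|Y|−1) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def chi2_independence_test(counts, alpha=0.05):
    counts = np.asarray(counts, dtype=float)
    M = counts.sum()
    p_x = counts.sum(axis=1) / M              # marginal P_hat(x)
    p_y = counts.sum(axis=0) / M              # marginal P_hat(y)
    expected = M * np.outer(p_x, p_y)         # M * P_hat(x) * P_hat(y)
    d = ((counts - expected) ** 2 / expected).sum()
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    p_value = chi2.sf(d, dof)
    return d, p_value, p_value >= alpha       # True -> accept H0 (independence)

# Example: a weakly dependent 2x2 table
print(chi2_independence_test([[30, 20], [20, 30]]))
```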
Score Based Approaches

- Strategy
  - Define a scoring function for each candidate structure
  - Search for a high-scoring structure
- Key: choice of scoring function
  - Likelihood based scores
  - Bayesian based scores
Likelihood Scores

- Goal: find (G, θ) that maximize the likelihood
  - Score_L(G : D) = \log P(D \mid G, \hat{\theta}_G), where \hat{\theta}_G is the MLE of the parameters for G
  - Find G that maximizes Score_L(G : D)
- Example: G_0 has X and Y disconnected; G_1 has the edge X→Y

  Score_L(G_0 : D) = \sum_m [ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m]} ]
  Score_L(G_1 : D) = \sum_m [ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m] \mid x[m]} ]

  Score_L(G_1 : D) - Score_L(G_0 : D)
    = \sum_m [ \log \hat{\theta}_{y[m] \mid x[m]} - \log \hat{\theta}_{y[m]} ]
    = \sum_{x,y} M[x,y] \log \hat{\theta}_{y|x} - \sum_y M[y] \log \hat{\theta}_y
    = M \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y|x) - M \sum_y \hat{P}(y) \log \hat{P}(y)
    = M \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y|x) - M \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y)
    = M \cdot I_{\hat{P}}(X, Y)
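A small numerical check of the identity above: for the two-variable case, the difference between the maximum-likelihood scores of X→Y and of the empty network equals M · I_hat(X, Y). The data here is synthetic and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1000
x = rng.integers(0, 2, size=M)
y = (x ^ (rng.random(M) < 0.3)).astype(int)   # Y depends on X with noise

counts = np.zeros((2, 2))
for xi, yi in zip(x, y):
    counts[xi, yi] += 1
p_xy = counts / M
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Score_L(G1) - Score_L(G0): log P_hat(y|x) vs log P_hat(y), summed over the data
score_diff = sum(np.log(p_xy[xi, yi] / p_x[xi]) - np.log(p_y[yi])
                 for xi, yi in zip(x, y))

# M * I_hat(X, Y) computed from the empirical joint distribution
mi = sum(p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if p_xy[i, j] > 0)

print(score_diff, M * mi)   # the two numbers agree up to floating point error
```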
General Decomposition

- The likelihood score decomposes as:

  Score_L(G : D) = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa_{X_i}^G) - M \sum_{i=1}^n H_{\hat{P}}(X_i)

- Proof:

  Score_L(G : D) = \sum_{i=1}^n \sum_{u_i \in Val(Pa_{X_i}^G)} \sum_{x_i} M[x_i, u_i] \log \hat{\theta}_{x_i \mid u_i}

  For each i (writing U_i = Pa_{X_i}^G):

  \frac{1}{M} \sum_{u_i} \sum_{x_i} M[x_i, u_i] \log \hat{\theta}_{x_i \mid u_i}
    = \sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \hat{P}(x_i \mid u_i)
    = \sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \left( \frac{\hat{P}(x_i, u_i)}{\hat{P}(u_i) \hat{P}(x_i)} \, \hat{P}(x_i) \right)
    = \sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \frac{\hat{P}(x_i, u_i)}{\hat{P}(u_i) \hat{P}(x_i)} + \sum_{x_i} \hat{P}(x_i) \log \hat{P}(x_i)
    = I_{\hat{P}}(X_i ; U_i) - H_{\hat{P}}(X_i)
General Decomposition

- The likelihood score decomposes as:

  Score_L(G : D) = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa_{X_i}^G) - M \sum_{i=1}^n H_{\hat{P}}(X_i)

- The second term does not depend on the network structure and is thus irrelevant for selecting between two structures
- The score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
- After some manipulation one can show:

  Score_L(G : D) = -M \, H_{\hat{P}}(X_1, ..., X_n) - M \sum_{i=1}^n I_{\hat{P}}(X_i ; \{X_1, ..., X_{i-1}\} - Pa_{X_i}^G \mid Pa_{X_i}^G)

- That is, the score measures to what extent the Markov assumptions implied by G are valid in the empirical distribution
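A minimal sketch of the decomposition above: the likelihood score of a structure computed as M · Σ_i I(X_i ; Pa_i) − M · Σ_i H(X_i) from empirical counts. The `data` rows (tuples of discrete values) and the `parents` dictionary are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def empirical_mi_and_entropy(data, i, parent_idxs):
    """Return (I_hat(X_i ; Pa_i), H_hat(X_i)) from the data matrix."""
    M = len(data)
    joint = Counter((row[i], tuple(row[j] for j in parent_idxs)) for row in data)
    px = Counter(row[i] for row in data)
    pu = Counter(tuple(row[j] for j in parent_idxs) for row in data)
    mi = sum((c / M) * np.log((c / M) / ((px[x] / M) * (pu[u] / M)))
             for (x, u), c in joint.items())
    h = -sum((c / M) * np.log(c / M) for c in px.values())
    return mi, h

def likelihood_score(data, parents):
    """parents: dict {variable index: list of parent indices}."""
    M = len(data)
    score = 0.0
    for i, pa in parents.items():
        mi, h = empirical_mi_and_entropy(data, i, pa)
        score += M * mi - M * h
    return score

# Usage with a toy 3-variable dataset and the structure X0 -> X1, X0 -> X2
data = [(0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 0)]
print(likelihood_score(data, {0: [], 1: [0], 2: [0]}))
```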
Limitations of Likelihood Score

[Figure: G_0 with X and Y disconnected vs. G_1 with the edge X→Y]

  Score_L(G_1 : D) - Score_L(G_0 : D) = M \cdot I_{\hat{P}}(X, Y)

- Since I_{\hat{P}}(X, Y) ≥ 0, Score_L(G_1 : D) ≥ Score_L(G_0 : D)
- Adding arcs always helps
- Maximal scores are attained for the fully connected network
- Such networks overfit the data (i.e., fit the noise in the data)
Avoiding Overfitting

- Classical problem in machine learning
- Solutions
  - Restricting the hypothesis space
    - Limits the overfitting capability of the learner
    - Example: restrict the number of parents or the number of parameters
  - Minimum description length
    - Description length measures complexity
    - Prefer models that compactly describe the training data
  - Bayesian methods
    - Average over all possible parameter values
    - Use prior knowledge
Bayesian Score

  P(G \mid D) = \frac{P(D \mid G) \, P(G)}{P(D)}

- P(D | G): marginal likelihood
- P(G): prior over structures
- P(D): marginal probability of the data; does not depend on the network
- Bayesian score: Score_B(G : D) = \log P(D \mid G) + \log P(G)
Marginal Likelihood of Data Given G

- Bayesian score: Score_B(G : D) = \log P(D \mid G) + \log P(G)
- The marginal likelihood integrates the likelihood against the prior over parameters:

  P(D \mid G) = \int_{\Theta_G} P(D \mid G, \Theta_G) \, P(\Theta_G \mid G) \, d\Theta_G

- Note the similarity to the maximum likelihood score, with the key difference that ML takes the maximum of the likelihood, whereas here we average it over the parameter space
Marginal Likelihood: Binomial Case

- Assume a sequence of M coin tosses
- By the chain rule for probabilities:

  P(x[1], ..., x[M]) = P(x[1]) \cdot ... \cdot P(x[M] \mid x[1], ..., x[M-1])

- Recall that for Dirichlet priors:

  P(x[m+1] = H \mid x[1], ..., x[m]) = \frac{M_H^m + \alpha_H}{m + \alpha_H + \alpha_T}

  where M_H^m is the number of heads in the first m examples. Therefore:

  P(x[1], ..., x[M]) = \frac{[\alpha_H (\alpha_H + 1) \cdots (\alpha_H + M_H - 1)] \, [\alpha_T (\alpha_T + 1) \cdots (\alpha_T + M_T - 1)]}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)}
Marginal Likelihood: Binomial Case

  P(x[1], ..., x[M]) = \frac{[\alpha_H (\alpha_H + 1) \cdots (\alpha_H + M_H - 1)] \, [\alpha_T (\alpha_T + 1) \cdots (\alpha_T + M_T - 1)]}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)}

- Simplify using \Gamma(x+1) = x \Gamma(x), so that \alpha (\alpha + 1) \cdots (\alpha + M - 1) = \frac{\Gamma(\alpha + M)}{\Gamma(\alpha)}:

  P(x[1], ..., x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_H + M_H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + M_T)}{\Gamma(\alpha_T)}

- For multinomials with a Dirichlet prior:

  P(x[1], ..., x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \prod_{i=1}^k \frac{\Gamma(\alpha_i + M[x^i])}{\Gamma(\alpha_i)}
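A minimal sketch of the closed-form Dirichlet-multinomial marginal likelihood above, computed in log space with scipy's gammaln for numerical stability. The counts and hyperparameters in the example are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alphas):
    """log P(x[1..M]) for counts M[x^i] under a Dirichlet(alphas) prior."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    alpha, M = alphas.sum(), counts.sum()
    return (gammaln(alpha) - gammaln(alpha + M)
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# Example: 3 heads and 4 tails with a Dirichlet(1, 1) (uniform) prior
print(np.exp(log_marginal_likelihood([3, 4], [1.0, 1.0])))   # 1/280
```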
Marginal Likelihood Example

[Plot: (log P(D)) / M as a function of M (0 to 50) for an actual experiment with P(H) = 0.25, comparing Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors]
Marginal Likelihood: BayesNets

- Network structure determines the form of the marginal likelihood
- Data (7 samples):

  sample:  1  2  3  4  5  6  7
  X:       H  T  T  H  T  H  H
  Y:       H  T  H  H  T  T  H

- Network 1: X and Y disconnected. Two Dirichlet marginal likelihoods:
  - P(X[1], ..., X[7])
  - P(Y[1], ..., Y[7])
Marginal Likelihood: BayesNets

- Network structure determines the form of the marginal likelihood
- Same data as above
- Network 2: X → Y. Three Dirichlet marginal likelihoods:
  - P(X[1], ..., X[7])
  - P(Y[1], Y[4], Y[6], Y[7])  (the Y values of the samples with X = H)
  - P(Y[2], Y[3], Y[5])  (the Y values of the samples with X = T)
Idealized Experiment

- P(X = H) = 0.5
- P(Y = H | X = H) = 0.5 + p
- P(Y = H | X = T) = 0.5 - p

[Plot: (log P(D)) / M as a function of M (1 to 1000, log scale), comparing the independent model with dependent models for p = 0.05, 0.10, 0.15, 0.20]
Marginal Likelihood: BayesNets

- The marginal likelihood has the form:

  P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha_{pa_i^G})}{\Gamma(\alpha_{pa_i^G} + M[pa_i^G])} \prod_{x_i} \frac{\Gamma(\alpha_{x_i, pa_i^G} + M[x_i, pa_i^G])}{\Gamma(\alpha_{x_i, pa_i^G})}

  where each inner factor is the Dirichlet marginal likelihood for the sequence of values of X_i when X_i's parents take a particular value, and \alpha_{pa_i^G} = \sum_{x_i} \alpha_{x_i, pa_i^G}
- M(..) are the counts from the data
- \alpha(..) are the hyperparameters for each family
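A minimal sketch of the family-by-family form above: the log marginal likelihood of a family is the sum of Dirichlet marginal likelihoods, one per parent configuration. The data layout, the `alpha_fn` hyperparameter callable, and the uniform Dirichlet(1,1) choice in the usage example are assumptions for illustration.

```python
import numpy as np
from collections import Counter, defaultdict
from scipy.special import gammaln

def log_marginal_likelihood(counts, alphas):
    counts, alphas = np.asarray(counts, float), np.asarray(alphas, float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

def family_log_ml(data, i, parent_idxs, values, alpha_fn):
    """Sum of Dirichlet marginal likelihoods of X_i, one per parent configuration.

    alpha_fn(x_i_value, parent_config) returns the hyperparameter alpha(x_i, pa_i).
    Parent configurations never seen in the data contribute a factor of 1 and
    can be skipped.
    """
    grouped = defaultdict(Counter)
    for row in data:
        pa = tuple(row[j] for j in parent_idxs)
        grouped[pa][row[i]] += 1
    total = 0.0
    for pa, cnt in grouped.items():
        counts = [cnt[v] for v in values]
        alphas = [alpha_fn(v, pa) for v in values]
        total += log_marginal_likelihood(counts, alphas)
    return total

# Usage on the 7-sample X, Y table above (H=1, T=0), network X -> Y,
# with uniform Dirichlet(1, 1) hyperparameters for every family.
data = [(1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1)]
log_ml = (family_log_ml(data, 0, [], (0, 1), lambda v, pa: 1.0)
          + family_log_ml(data, 1, [0], (0, 1), lambda v, pa: 1.0))
print(log_ml)   # log P(D | X -> Y)
```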
Priors

- Bayesian score: Score_B(G : D) = \log P(D \mid G) + \log P(G)
- Structure prior P(G)
  - Uniform prior: P(G) ∝ constant
  - Prior penalizing the number of edges: P(G) ∝ c^{|G|} with 0 < c < 1
  - The normalizing constant is the same across networks and can thus be ignored
Priors

- Bayesian score: Score_B(G : D) = \log P(D \mid G) + \log P(G)
- Parameter prior P(Θ | G): the BDe prior
  - M0: equivalent sample size
  - B0: a network representing the prior probability of events
  - Set \alpha(x_i, pa_i^G) = M0 \cdot P(x_i, pa_i^G \mid B0)
    - Note: pa_i^G need not be the parents of X_i in B0
    - Compute P(x_i, pa_i^G \mid B0) using standard inference in B0
  - BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M' and P'
Bayesian Score: Asymptotic Behavior

- For M → ∞, a network G with Dirichlet priors satisfies:

  \log P(D \mid G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G) + O(1)

  where Dim(G) is the number of independent parameters in G
- The approximation is called the BIC score:

  Score_{BIC}(G : D) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G)
                     = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa_{X_i}) - M \sum_{i=1}^n H_{\hat{P}}(X_i) - \frac{\log M}{2} Dim(G)

- The score exhibits a tradeoff between fit to the data and model complexity
  - The mutual information term grows linearly with M, while the complexity term grows only logarithmically with M
  - As M grows, more emphasis is given to the fit to the data
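A minimal sketch of the BIC score above for discrete CPDs: the maximum-likelihood log-likelihood, summed family by family, minus (log M / 2) · Dim(G). The `data`, `parents`, and `cardinalities` inputs are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def family_loglik(data, i, pa):
    """Maximum-likelihood log-likelihood contribution of the family of X_i."""
    joint = Counter((tuple(r[j] for j in pa), r[i]) for r in data)
    marg = Counter(tuple(r[j] for j in pa) for r in data)
    return sum(c * np.log(c / marg[u]) for (u, _), c in joint.items())

def bic_score(data, parents, cardinalities):
    """parents: {variable index: list of parent indices};
    cardinalities: {variable index: number of values}."""
    M = len(data)
    loglik = sum(family_loglik(data, i, pa) for i, pa in parents.items())
    dim = sum((cardinalities[i] - 1) * int(np.prod([cardinalities[j] for j in pa]))
              for i, pa in parents.items())       # independent parameters
    return loglik - 0.5 * np.log(M) * dim

data = [(0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 0)]
print(bic_score(data, {0: [], 1: [0], 2: [0]}, {0: 2, 1: 2, 2: 2}))
```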
Bayesian Score: Asymptotic Behavior

- For M → ∞, a network G with Dirichlet priors satisfies:

  \log P(D \mid G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G) + O(1)

- The Bayesian score is consistent
  - As M → ∞, the true structure G* maximizes the score
    - Spurious edges do not contribute to the likelihood and are penalized
    - Required edges are added, since the likelihood term grows linearly in M while the model complexity penalty grows only logarithmically in M
Summary: Network Scores

- Likelihood, MDL, and (log) BDe scores all have the form

  Score(G : D) = \sum_i Score(X_i \mid Pa_{X_i}^G : D)

- BDe requires assessing a prior network
  - Can naturally incorporate prior knowledge
- BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
- All are score-equivalent
  - G I-equivalent to G'  ⟹  Score(G) = Score(G')
Optimization Problem

- Input:
  - Training data
  - Scoring function (including priors, if needed)
  - Set of possible structures
    - Including prior knowledge about structure
- Output:
  - A network (or networks) that maximize the score
- Key property:
  - Decomposability: the score of a network is a sum of per-family terms

    Score(G : D) = \sum_i Score(X_i \mid Pa_{X_i}^G : D)
Learning Trees

- Trees: at most one parent per variable
- Why trees?
  - Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm)
  - Sparse parameterization: avoids overfitting while adapting to the data
Learning Trees

- Let p(i) denote the parent of X_i, or 0 if X_i has no parent
- We can write the score as:

  Score(G : D) = \sum_i Score(X_i \mid Pa_i)
               = \sum_{i : p(i) > 0} Score(X_i \mid X_{p(i)}) + \sum_{i : p(i) = 0} Score(X_i)
               = \sum_{i : p(i) > 0} [ Score(X_i \mid X_{p(i)}) - Score(X_i) ] + \sum_i Score(X_i)

  The first sum is the improvement over the "empty" network; the second is the score of the "empty" network.
- Score = sum of edge scores + constant
Learning Trees

- Algorithm
  - Construct a graph with vertices 1, ..., n
  - Set the weight w(i→j) = Score(X_j | X_i) - Score(X_j)
  - Find a tree (or forest) with maximal weight
    - This can be done with standard algorithms in low-order polynomial time by building the tree greedily (Kruskal's maximum spanning tree algorithm)
- Theorem: the procedure finds the tree with maximal score
- When the score is the likelihood, w(i→j) is proportional to I(X_i ; X_j). This is known as the Chow & Liu method
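A minimal sketch of the Chow & Liu procedure above: weight each pair of variables by its empirical mutual information and take a maximum-weight spanning tree with Kruskal's algorithm (greedy edge addition plus a small union-find). The `data` list of discrete assignments is a hypothetical input.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pairwise_mi(data, i, j):
    M = len(data)
    joint = Counter((r[i], r[j]) for r in data)
    pi = Counter(r[i] for r in data)
    pj = Counter(r[j] for r in data)
    return sum((c / M) * np.log(c * M / (pi[a] * pj[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(data, n_vars):
    edges = sorted(((pairwise_mi(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))                 # union-find for Kruskal
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path compression
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:                        # greedily add heaviest edges
        ri, rj = find(i), find(j)
        if ri != rj:                             # skip edges that create a cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree                                  # undirected: direction is arbitrary

data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 1)]
print(chow_liu_tree(data, 3))
```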
Learning Trees: Example

[Figure: tree learned from data sampled from the Alarm network, drawn over the Alarm variables (MINVOLSET, PULMEMBOLUS, INTUBATION, VENTLUNG, SAO2, CATECHOL, HR, BP, ...), with correct edges and spurious edges marked]

- Not every edge in the tree is in the original network
- Tree direction is arbitrary: we cannot learn about arc direction
Beyond Trees

- The problem is not easy for more complex networks
  - Example: when allowing two parents, a greedy algorithm is no longer guaranteed to find the optimal network
  - In fact, no efficient algorithm exists
- Theorem: finding the maximal scoring network structure with at most k parents per variable is NP-hard for k > 1
Fixed Ordering

- For any decomposable scoring function Score(G : D) and ordering ≺, the maximal scoring network has:

  Pa_{X_i}^G = \arg\max_{U \subseteq \{X_j : X_j \prec X_i\}} Score(X_i \mid U : D)

  (since the choice at X_i does not constrain the other choices)
- For a fixed ordering, we therefore have n independent problems
- If we bound the in-degree per variable by d, the complexity is exponential in d
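A minimal sketch of learning with a fixed ordering: each variable independently picks the highest-scoring parent set among its predecessors, with in-degree bounded by d. The `family_score(x, parent_tuple)` callable is a hypothetical decomposable family score (e.g., a family BIC or BDe term).

```python
from itertools import combinations

def best_parents_given_order(order, d, family_score):
    """order: variables in the ≺ order; family_score(x, parents) -> float."""
    parents = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        candidates = (pa for k in range(min(d, len(predecessors)) + 1)
                      for pa in combinations(predecessors, k))
        parents[x] = max(candidates, key=lambda pa: family_score(x, pa))
    return parents
```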
Heuristic Search

- We address the problem by using heuristic search
- Define a search space:
  - nodes are possible structures
  - edges denote adjacency of structures
- Traverse this space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...
Heuristic Search

- Typical operations:

[Figure: a network over A, B, C, D and its neighbors in the search space, each obtained by a single edge change]
Exploiting Decomposability

[Figure: neighboring structures over A, B, C, D that differ by a single edge change]

- Caching: to update the score after a local change, we only need to rescore the families that were changed in the last move
Greedy Hill Climbing

- Simplest heuristic local search
- Start with a given network
  - empty network
  - best tree
  - a random network
- At each iteration
  - Evaluate all possible changes
  - Apply the change that leads to the best improvement in score
  - Reiterate
- Stop when no modification improves the score
- Each step requires evaluating O(n) new changes
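A minimal sketch of greedy hill-climbing with the usual add / delete / reverse edge operators, starting from the empty network. The `family_score(x, parents)` callable is a hypothetical decomposable family score; for brevity the sketch rescores all families at each candidate, whereas a real implementation would cache family scores as described on the decomposability slide.

```python
from itertools import permutations

def is_acyclic(parents):
    seen, stack = set(), set()
    def visit(v):
        if v in stack: return False           # back edge -> cycle
        if v in seen: return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v); seen.add(v)
        return ok
    return all(visit(v) for v in parents)

def total_score(parents, family_score):
    return sum(family_score(x, frozenset(pa)) for x, pa in parents.items())

def greedy_hill_climb(variables, family_score):
    parents = {x: set() for x in variables}   # start from the empty network
    best = total_score(parents, family_score)
    while True:
        best_cand, best_s = None, best
        for x, y in permutations(variables, 2):
            for op in ("add", "delete", "reverse"):
                cand = {v: set(p) for v, p in parents.items()}
                if op == "add" and x not in cand[y]:
                    cand[y].add(x)
                elif op == "delete" and x in cand[y]:
                    cand[y].discard(x)
                elif op == "reverse" and x in cand[y]:
                    cand[y].discard(x); cand[x].add(y)
                else:
                    continue
                if not is_acyclic(cand):
                    continue
                s = total_score(cand, family_score)
                if s > best_s + 1e-12:        # keep the best strict improvement
                    best_cand, best_s = cand, s
        if best_cand is None:                 # no modification improves the score
            return parents, best
        parents, best = best_cand, best_s
```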
Greedy Hill Climbing Pitfalls

- Greedy hill-climbing can get stuck in:
  - Local maxima
    - All one-edge changes reduce the score
  - Plateaus
    - Some one-edge changes leave the score unchanged
    - Happens because equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both
  - Random restarts
  - TABU search
Equivalence Class Search

- Idea
  - Search the space of equivalence classes
  - Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
- Advantages
  - The PDAG space has fewer local maxima and plateaus
  - There are fewer PDAGs than DAGs
Equivalence Class Search

- Evaluating changes is more expensive
  - In addition to the search, we need to instantiate and score a consistent DAG

[Figure: original PDAG over X, Y, Z; the operation "Add Y—Z" yields a new PDAG, from which a consistent DAG is instantiated and scored]

- These algorithms are more complex to implement
Learning Example: Alarm Network

[Plot: KL divergence to the true distribution as a function of M (0 to 5000 samples), comparing learning with the true structure (BDe, M' = 10) against learning with unknown structure (BDe, M' = 10)]
Model Selection

- So far, we focused on a single model
  - Find the best scoring model
  - Use it to predict the next example
- Implicit assumption: the best scoring model dominates the weighted sum
- Pros:
  - Valid with many data instances
  - We get a single structure
  - Allows for efficient use in our tasks
- Cons:
  - We are committing to the independencies of a particular structure
  - Other structures might be as probable given the data
Model Selection

- Density estimation
  - One structure may suffice, if its joint distribution is similar to other high scoring structures
- Structure discovery
  - Define features f(G) (e.g., an edge, a sub-structure, a d-separation query)
  - Compute

    P(f \mid D) = \sum_G f(G) \, P(G \mid D)

  - Still requires summing over exponentially many structures
Model Averaging Given an Order

- Assumptions
  - Known total order ≺ of the variables
  - Maximum in-degree d for every variable
- Marginal likelihood:

  P(D \mid \prec) = \sum_{G \in \mathcal{G}_{d,\prec}} P(D \mid G) \, P(G \mid \prec)
                  = \sum_{G \in \mathcal{G}_{d,\prec}} \prod_i \exp( FamScore_B(X_i \mid Pa_{X_i}^G : D) )
                  = \prod_i \sum_{U \in \{U : U \subseteq \{X_j : X_j \prec X_i\}, |U| \le d\}} \exp( FamScore_B(X_i \mid U : D) )

  - The first step uses the decomposability assumption on the prior P(G | ≺)
  - The last step holds since, given the ordering ≺, the parent choices are independent
- Cost per family: O(n^d); total cost: O(n^{d+1})
Model Averaging Given an Order

- Posterior probability of a general feature:

  P(f \mid \prec, D) = \frac{P(f, D \mid \prec)}{P(D \mid \prec)} = \frac{\sum_{G \in \mathcal{G}_{d,\prec}} f(G) \, P(D \mid G) \, P(G \mid \prec)}{P(D \mid \prec)}

- Posterior probability of a particular choice of parents:

  P(Pa_{X_i}^G = U \mid D, \prec) = \frac{\exp( FamScore_B(X_i \mid U : D) )}{\sum_{U' : U' \subseteq \{X_j \prec X_i\}, |U'| \le d} \exp( FamScore_B(X_i \mid U' : D) )}

- Posterior probability of a particular edge:

  P(X_j \in Pa_{X_i}^G \mid D, \prec) = \frac{\sum_{U : X_j \in U, \, U \subseteq \{X_k \prec X_i\}, \, |U| \le d} \exp( FamScore_B(X_i \mid U : D) )}{\sum_{U : U \subseteq \{X_k \prec X_i\}, \, |U| \le d} \exp( FamScore_B(X_i \mid U : D) )}

  (the terms for all other families cancel out)
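A minimal sketch of the exact edge posteriors above for a fixed ordering: enumerate all parent sets U of the predecessors with |U| ≤ d and form the ratio of exp(FamScore) sums in log space. The `family_score` callable returning FamScore_B(X_i | U : D) is a hypothetical input.

```python
import numpy as np
from itertools import combinations
from scipy.special import logsumexp

def edge_posteriors(order, d, family_score):
    """Return {(X_j, X_i): P(X_j in Pa(X_i) | D, order)}."""
    post = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        subsets = [pa for k in range(min(d, len(predecessors)) + 1)
                   for pa in combinations(predecessors, k)]
        scores = [family_score(x, pa) for pa in subsets]
        log_denom = logsumexp(scores)                    # sum over all U
        for xj in predecessors:
            with_xj = [s for pa, s in zip(subsets, scores) if xj in pa]
            post[(xj, x)] = (0.0 if not with_xj
                             else float(np.exp(logsumexp(with_xj) - log_denom)))
    return post
```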
Model Averaging

- We cannot assume that the order is known
- Solution: sample from the posterior distribution P(G | D)
  - Estimate feature probabilities by

    P(f \mid D) \approx \frac{1}{K} \sum_{k=1}^K f(G_k)

  - Sampling can be done by MCMC
  - Sampling can be done over orderings
Notes on Learning Local Structures

- Define the score with local structures
  - Example: with tree CPDs, the score decomposes by leaves
- The prior may need to be extended
  - Example: with tree CPDs, add a penalty for the tree structure of each CPD
- Extend the search operators to local structure
  - Example: with tree CPDs, we need to search over the tree structure
  - Can be done by a local encapsulated search or by defining new global operations
Search: Summary

- Discrete optimization problem
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~10 min), thanks to:
    - Decomposability
    - Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
  - Example: learning trees