Learning Factor Graphs in Polynomial Time & Sample Complexity
Pieter Abbeel
Daphne Koller
Andrew Y. Ng
Stanford University
Overview
First polynomial time & sample complexity learning algorithm for factor graphs, a superset of Bayesian networks and Markov networks.
Applicable to any factor graph of bounded factor size and connectivity, including intractable networks (e.g., grids).
New technical ideas:
- Parameter learning: closed-form, using a parameterization with low-dimensional frequencies only.
- Structure learning: results about guaranteed-approximate Markov blankets from sample data.
Factor graph distributions
Bayesian network → factor graph: 1 factor per conditional probability table.
Markov random field → factor graph: 1 factor per clique.
Factor graph distributions
A factor graph defines a distribution
P(x_{1:n}) = \frac{1}{Z} \prod_j f_j(x_{C_j}),
where each f_j is a factor over a set of variables C_j \subseteq \{X_1, \ldots, X_n\}, x_{C_j} is the instantiation x_{1:n} restricted to C_j, and Z is the partition function.
Example: a factor graph over variables 1, 2, 3 with factor nodes {1}, {2}, {3}, {1,2}, {2,3}; each factor node connects to the variable nodes it involves.
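As a quick illustration (a toy example with made-up numbers, not from the slides), this example factor graph can be written down directly and its unnormalized probability and partition function computed by brute force:

```python
# Toy factor graph over variables 1, 2, 3 with factors on {1}, {2}, {3}, {1,2}, {2,3}.
# Factor values are made up for illustration only.
from itertools import product

# Each factor maps an assignment of its variables (a tuple of 0/1 values) to a weight.
factors = {
    (1,): {(0,): 1.0, (1,): 2.0},
    (2,): {(0,): 1.0, (1,): 1.0},
    (3,): {(0,): 3.0, (1,): 1.0},
    (1, 2): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0},
    (2, 3): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0},
}

def unnormalized(x):
    """Product of factor values for a full assignment x = {variable: value}."""
    p = 1.0
    for scope, table in factors.items():
        p *= table[tuple(x[v] for v in scope)]
    return p

# Partition function Z by brute-force enumeration (feasible only for tiny examples).
Z = sum(unnormalized(dict(zip((1, 2, 3), vals))) for vals in product((0, 1), repeat=3))
```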
Related work
| Target distr. (structure/parameter learning) | True distr. | Samples | Time    | Graceful degradation | Ref.      |
| ML tree (structure)                          | any         | poly    | poly    | yes                  | [1]       |
| ML bounded tree-width (structure)            | any         | poly    | NP-hard | yes                  | [2]       |
| Bounded tree-width (structure)               | same        | poly    | poly    | no                   | [3]       |
| Factor graph (parameter)                     | same        | poly    | poly    | yes                  | this work |
| Factor graph (structure)                     | same        | poly    | poly    | yes                  | this work |
Our work gives the first poly time & sample complexity solutions for parameter estimation & structure learning of factor graphs.
Current practice for parameter learning: maximum likelihood. Expensive, and applies only to tractable networks.
Current practice for structure learning: local-search heuristics or heuristic learning of a bounded tree-width model. Slow to evaluate, and no performance guarantees. [4],[5],[6],[7],[8]
[1] Chow&Liu, 1968; [2] Srebro, 2001; [3] Narasimhan&Bilmes, 2004; [4] Della Pietra et al., 1997;
[5] McCallum, 2003; [6] Malvestuto, 1991; [7] Bach&Jordan, 2002; [8] Deshpande et al., 2001
Parameter learning
Canonical parameterization
Consider a factor graph over 16 variables (variable nodes 1–16).
The Hammersley-Clifford theorem gives a canonical parameterization of the distribution in terms of canonical factors.
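A sketch of the standard canonical parameterization, written relative to a fixed default assignment \bar{x} (the notation here is assumed, not copied from the slide); the product ranges over the subfactors C of the given structure:

\log f^*_C(x_C) = \sum_{U \subseteq C} (-1)^{|C \setminus U|} \log P(x_U, \bar{x}_{-U}), \qquad P(x_{1:n}) = \prod_C f^*_C(x_C).

Each term is the probability of a full instantiation that agrees with \bar{x} outside C.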
Canonical factors
No lower-order interactions, by inclusion-exclusion. Each canonical factor:
- takes the complete interaction over its variables,
- subtracts the lower-order interactions,
- compensates for double counting,
- uses frequencies only,
- and has an equal number of + and − terms.
Closed-form parameter learning? Not yet: the frequencies P(X_{1:16} = (x_1, x_2, 0, …, 0)) involve full instantiations and are thus expensive to estimate from samples.
Markov blanket canonical factors
For the positive and negative terms of a canonical factor:
- transform each term to a conditional probability;
- the common terms cancel;
- apply conditional independence;
- only low-dimensional distributions remain.
(MB: Markov blanket.)
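A hedged sketch of the reduction to low-dimensional conditionals (notation assumed): every instantiation appearing in a canonical factor agrees with the default assignment \bar{x} outside C, so

\log P(x_U, \bar{x}_{-U}) = \log P(x_U, \bar{x}_{C \setminus U} \mid \bar{x}_{-C}) + \log P(\bar{x}_{-C}).

The second term is the same for every U \subseteq C, and since the canonical factor has an equal number of + and − terms, it cancels in the alternating sum. By the Markov property, P(x_U, \bar{x}_{C \setminus U} \mid \bar{x}_{-C}) = P(x_U, \bar{x}_{C \setminus U} \mid \bar{x}_{MB(C)}), which involves only the variables in C \cup MB(C).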
Markov blanket canonical factors
{C_j*}: all subfactors of the given structure.
Each Markov blanket canonical factor is computed from the distribution over C_j* and MB(C_j*):
- low-dimensional distributions;
- efficient estimation from samples.
Parameter learning algorithm:
- Estimate the Markov blanket canonical factors from data.
- Return their product as the learned distribution.
Theorem. The parameter learning algorithm
- runs in polynomial time,
- uses a polynomial number of samples,
- and guarantees that the KL divergence between the true and learned distributions is small with high probability.
No dependence on the tree-width of the network!
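A minimal Python sketch of this closed-form estimator, assuming binary variables, samples represented as dicts mapping variable names to 0/1 values, and known Markov blankets; the function names and the smoothing scheme are illustrative, not from the paper:

```python
# Sketch of closed-form parameter learning via Markov blanket canonical factors.
from itertools import chain, combinations
from math import log

def subsets(s):
    """All subsets of an iterable, as tuples (including the empty set)."""
    s = tuple(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def cond_prob(samples, target, given, smoothing=1e-6):
    """Empirical P(target assignment | given assignment), with additive smoothing."""
    match_given = [s for s in samples if all(s[v] == val for v, val in given.items())]
    match_both = [s for s in match_given if all(s[v] == val for v, val in target.items())]
    return (len(match_both) + smoothing) / (len(match_given) + 2 * smoothing)

def mb_canonical_factor(samples, C, MB, x_C, default=0):
    """log f*_C(x_C): inclusion-exclusion over subsets U of C; each term is a
    low-dimensional conditional probability given the default assignment on MB(C)."""
    given = {v: default for v in MB}
    log_f = 0.0
    for U in subsets(C):
        target = {v: (x_C[v] if v in U else default) for v in C}
        sign = (-1) ** (len(C) - len(U))
        log_f += sign * log(cond_prob(samples, target, given))
    return log_f
```

The learned (unnormalized) log-probability of a full assignment is then the sum of mb_canonical_factor over all subfactors of the given structure; no global inference is needed, which is why the guarantee does not depend on tree-width.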
Graceful degradation
Theorem. When the true distribution is a factor graph G but parameter learning is run with the structure of a different factor graph G', the additional error consists of two terms:
- Canonical factors capture only the residual highest-order interactions; the error is small when the subfactors are in G.
- The Markov blankets are taken in the given factor graph G' rather than in G; if they are a good approximation of the true Markov blankets, the error is small. (See structure learning.)
Structure learning
Assume factor size k.
Naive approach: take as structure all factors of size k, then run parameter learning. Does this solve structure learning? NO:
estimating the Markov blanket canonical factors requires knowledge of the Markov blankets,
but if we knew the Markov blankets, the structure learning problem would already be solved.
Recovering the Markov blankets
With the true distribution, a Markov blanket criterion recovers the true Markov blankets.
From sample data, we can at best recover approximate Markov blankets.
Key for parameter learning: the desired property of an approximate Markov blanket is that conditioning on it gives (approximately) the same conditional distribution as conditioning on the true Markov blanket.
Conditional entropy
Conditional entropy: H(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y).
For any candidate Markov blanket Y: by conditional independence, C is independent of Y given its true Markov blanket MB(C), so H(C \mid MB(C), Y) = H(C \mid MB(C)).
What about H(C \mid Y)? Conditioning reduces entropy: for any X, Y, Z, H(X \mid Y, Z) \le H(X \mid Y).
Thus H(C \mid MB(C)) = H(C \mid MB(C), Y) \le H(C \mid Y): the true Markov blanket minimizes conditional entropy over candidate blankets.
With the true distribution, the conditional-entropy criterion recovers the true Markov blankets. What does it give from sample data?
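A small sketch of the empirical criterion, assuming samples are dicts mapping variable names to values (names are illustrative):

```python
# Empirical conditional entropy H(C | Y), used as the Markov blanket criterion.
from collections import Counter
from math import log

def empirical_conditional_entropy(samples, C, Y):
    """Estimate H(C | Y) = H(C, Y) - H(Y) from empirical counts over the samples."""
    n = len(samples)
    joint = Counter(tuple(s[v] for v in (*C, *Y)) for s in samples)
    marg = Counter(tuple(s[v] for v in Y) for s in samples)
    h_joint = -sum((c / n) * log(c / n) for c in joint.values())
    h_marg = -sum((c / n) * log(c / n) for c in marg.values())
    return h_joint - h_marg
```

A candidate Markov blanket for C is then a small set Y that (approximately) minimizes this quantity.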
Conditional entropy
Theorem. Conditional entropy satisfies the desired approximate Markov blanket property: for any \epsilon > 0, if a candidate set looks like a Markov blanket for C according to conditional entropy, then it can be used as the Markov blanket for learning, at the cost of only a small additional error.
Theorem. Empirical conditional entropy estimates are a good approximation of the true conditional entropy, even with a polynomial number of samples.
Structure learning algorithm
Assume factor size k and Markov blanket size b.
For all subsets of variables C_j* of size k:
- find a Markov blanket from the empirical conditional entropy;
- estimate the Markov blanket canonical factor from data (parameter learning).
Discard factors that are close to the trivial "all ones" factor (simplify structure).
Return the product of the remaining factors as the learned model (see the sketch below).
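A hedged sketch of this loop, reusing the illustrative helpers from the earlier sketches (empirical_conditional_entropy and mb_canonical_factor); the brute-force search over candidate blankets, the triviality threshold, and all names are assumptions:

```python
# Sketch of the structure-learning loop over candidate subfactors.
from itertools import combinations, product

def learn_structure(samples, variables, k, b, triviality_threshold=0.05, default=0):
    """For each candidate subfactor of size <= k, pick the size-<=b blanket minimizing
    empirical conditional entropy, estimate its Markov blanket canonical factor, and
    keep it only if it is far from the trivial all-ones factor."""
    learned = {}
    for size in range(1, k + 1):
        for C in combinations(variables, size):
            rest = [v for v in variables if v not in C]
            # Candidate Markov blanket: best subset of size <= b by conditional entropy.
            candidates = (Y for r in range(b + 1) for Y in combinations(rest, r))
            MB = min(candidates, key=lambda Y: empirical_conditional_entropy(samples, C, Y))
            # Estimate log f*_C for every assignment of the (binary) variables in C.
            log_factor = {
                assignment: mb_canonical_factor(samples, C, MB, dict(zip(C, assignment)), default)
                for assignment in product((0, 1), repeat=size)
            }
            # Keep only non-trivial factors (far from the all-ones, i.e. zero-log, factor).
            if max(abs(v) for v in log_factor.values()) > triviality_threshold:
                learned[C] = log_factor
    return learned
```

The exhaustive enumeration of candidate blankets of size up to b is what makes the complexity exponential in the Markov blanket size, as noted in the theorem and in the conclusions.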
Structure learning theorem
Assume fixed factor size k and Markov blanket size b.
Theorem. The structure learning algorithm
- runs in polynomial time,
- uses a polynomial number of samples,
- and guarantees that the KL divergence between the true and learned distributions is small with high probability.
Note: the computational and sample complexity are exponential in the factor size and Markov blanket size; bounded connectivity implies bounded factor and Markov blanket size. No dependence on the tree-width of the network!
Graceful degradation
Theorem. Let G be the factor graph of the true distribution. When the true distribution has max factor size > k or max Markov blanket size > b, the additional error consists of three terms:
- Canonical factors capture only the residual highest-order interactions; the error is small when the true interactions of order > k are small.
- If the recovered Markov blankets are a good approximation of the true Markov blankets, this error is small.
- Factors that are trivial in the true distribution may be estimated as non-trivial because their Markov blanket size is larger than b.
Consequences for Bayesian networks
A Bayesian network converts to a factor graph with one factor per conditional probability table (factor nodes for CPTs, variable nodes for variables).
Bounded fan-in and fan-out imply bounded factor size and bounded Markov blanket size.
Given samples from a Bayesian network distribution P_BN with unknown structure, factor graph structure learning returns a factor graph distribution P with small D(P_BN \| P).
Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data.
Related work
Finding the highest-scoring, bounded in-degree Bayesian network is NP-hard (Chickering, Meek & Heckerman, 2003).
Our algorithm recovers a factor graph representation only; the (difficult) acyclicity constraint is avoided.
Learning a factor graph (not a Bayesian network) gives efficient learning of the distribution from finite data.
Note: Spirtes, Glymour & Scheines (2000) and Chickering & Meek (2002) do recover Bayesian network structure, but only with access to the true distribution (infinite sample size).
Discussion and conclusion
First polynomial time & polynomial sample complexity learning algorithm for factor graphs.
Applicable to any factor graph of bounded factor size and connectivity, including intractable networks (e.g., grids).
Practical drawbacks of the proposed algorithm:
- Parameters are estimated from only a small fraction of the data.
- Structure learning enumerates all possible Markov blankets, so its complexity is exponential in the Markov blanket size.
Done. Additional and outdated slides follow.
Detailed theorem statements
Parameter learning theorem
Structure learning theorem
Learning factor graphs in polynomial time & sample complexity
Factor graphs: a superset of Markov networks (MNs) and Bayesian networks (BNs).
Markov network → factor graph: 1 factor per clique. Bayesian network → factor graph: 1 factor per conditional probability table.
Current practice in Markov network learning:
- parameter learning: maximum likelihood, only applicable in tractable MNs;
- structure learning: local-search heuristics or heuristic learning of a bounded tree-width model. No performance guarantees.
Finding the highest-scoring BN is NP-hard (Chickering et al., 2003).
Pieter Abbeel, Daphne Koller and Andrew Y. Ng
Learning factor graphs in polynomial time & sample complexity
First polynomial time & sample complexity learning algorithm for factor graphs.
Applicable to any factor graph of bounded factor size and connectivity, including intractable networks (e.g., grids).
New technical ideas:
- Parameter learning: closed-form, using a parameterization with low-dimensional frequencies only.
- Structure learning: results about guaranteed-approximate Markov blankets from sample data.
Pieter Abbeel, Daphne Koller and Andrew Y. Ng
Relation to Narasimhan & Bilmes (2004)

|                                     | Narasimhan & Bilmes (2004) | This paper |
| Independent of tree-width           | NO                         | YES        |
| Independent of Markov blanket size  | YES                        | NO         |
| Graceful degradation result         | NO                         | YES        |

n x n grid: tree-width = n+1, Markov blanket size = 6.
n-star graph: tree-width = 2, Markov blanket size = n.
Canonical parameterization
Canonical parameterization (2)
Canonical parameterization (3)
Markov blanket canonical factors
Markov blanket canonical parameterization
Approximate Markov blankets
Structure learning algorithm
Structure learning algorithm