Transcript Mossel

Some Recent Progress in Combinatorial
Statistics
Elchanan Mossel,
UC Berkeley + Weizmann Institute
http://www.stat.berkeley.edu/~mossel
What is Combinatorial Statistics?
• "Combinatorial Statistics" deals with inference of discrete parameters such as graphs, trees, or partitions,
• with the goal of providing algorithms with guarantees on sampling complexity, computational complexity, and success probability.
• Alternatively: sampling or complexity lower bounds.
Example 1: Graphical Model reconstruction
Markov Random Fields / Graphical Models
• A common model for stochastic networks:
  • a bounded-degree graph G = (V, E);
  • weight functions Ψ_C : Σ^|C| → R_{≥0} for every clique C;
  • nodes v are assigned values σ_v in an alphabet Σ;
  • the distribution over states σ ∈ Σ^V is given by
      Pr[σ] ∝ ∏_C Ψ_C(σ_u : u ∈ C),
    where the product goes over all cliques C.
[Figure: example graph on nodes 1, 2, 3, 4]
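To make the definition above concrete, here is a minimal Python sketch (my illustration, not code from the talk). The edge potentials `psi`, the alphabet {0, 1}, and the weight 0.5 are made-up assumptions; for simplicity the cliques are just the edges of the 4-node example graph.

```python
import itertools
import math

def unnormalized_prob(sigma, edges, psi):
    """Pr[sigma] is proportional to the product of the clique (here: edge) potentials."""
    p = 1.0
    for (u, v) in edges:
        p *= psi[(u, v)](sigma[u], sigma[v])
    return p

# Hypothetical example: a 4-node cycle with ferromagnetic-style edge potentials.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
psi = {e: (lambda a, b: math.exp(0.5 if a == b else -0.5)) for e in edges}

alphabet = [0, 1]
# Normalizing constant by brute-force enumeration (feasible only for tiny graphs).
Z = sum(unnormalized_prob(s, edges, psi)
        for s in itertools.product(alphabet, repeat=4))
sigma = (0, 0, 1, 1)
print(unnormalized_prob(sigma, edges, psi) / Z)   # Pr[sigma]
```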
Reconstruction task for Markov random fields
• Suppose we can obtain independent samples
from the Markov random field
• Given observed data at the nodes, is it possible
to reconstruct the model (network)?
• Important: Do we see data at all of the nodes or
only a subset?
Reconstruction problem
Problem: Given k independent samples of σ = (σ_1, …, σ_n) observed at all the nodes, find the graph G.
(Given activity at all the nodes, can the network be reconstructed?)
• Restrict attention to graphs with max degree d.
• A structure estimator is a map Ĝ from the k samples to a graph on n nodes of max degree d.
Questions:
1. How many samples k are required (asymptotically) in order to reconstruct MRFs with n nodes and max degree d with probability approaching 1, i.e. Pr[Ĝ = G] → 1?
2. Want an efficient algorithm for reconstruction.
Related work
• Tree Markov fields can be reconstructed efficiently (even with hidden nodes).
  [Erdös,Steel,Szekely,Warnow,99], [Daskalakis,Mossel,Roch,06]
• PAC setup: [Abbeel,Koller,Ng, '06] produce a factorized distribution that is εn-close in Kullback-Leibler divergence to the true distribution.
  • No guarantee to reconstruct the correct graph.
  • Running time and sampling complexity are n^{O(d)}.
• A more restricted problem is studied by [Wainwright,Ravikumar,Lafferty, '06]: restricted to the Ising model, sample complexity O(d^5 log n), and a difficult-to-verify convergence condition; the technique is based on L1 regularization. Moreover, it works for graphs, not for graphical models (clique potentials are not allowed).
• Subsequent to our results, [Santhanam,Wainwright, '08] determine the information-theoretic sampling complexity, and [Wainwright,Ravikumar,Lafferty, '08] get O(d log n) sampling (restricted to Ising models; still no checkable guarantee for convergence).
Related work
Method                | Abbeel et al.                   | Wainwright et al.            | Bresler et al.
----------------------|---------------------------------|------------------------------|---------------
Generative model      | General MRF                     | Ising (edge potentials only) | General MRF
Reconstruct           | Distribution of small KL dist.  | Graph                        | Graph
Additional conditions | No                              | Yes (very hard to check)     | No
Running time          | n^d                             | n^5                          | n^d
Sampling complexity   | poly(n)                         | d^5 log n (later: d log n)   | d log n
Reconstructing General Networks - New Results
Theorem (Bresler-M-Sly-08; lower bound on sample complexity):
• In order to recover G of max degree d, at least c d log n samples are needed, for some constant c.
• The proof follows by "counting the # of networks" and information-theoretic lower bounds (a rough version of the count appears after this slide).
• More formally: given any prior distribution which is uniform over degree-d graphs (no restrictions on the potentials), recovering the correct graph with probability ≥ 2^{-n} requires at least c d log n samples.
Theorem (Bresler-M-Sly-08; asymptotically optimal algorithm):
• If the distribution is "non-degenerate", c d log n samples suffice to reconstruct the model with probability ≥ 1 − 1/n^{100}, for some (other) constant c.
• Running time is n^{O(d)}.
(The sampling complexity is tight up to a constant factor; whether the running time is optimal is unknown.)
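A back-of-the-envelope version of the counting step referenced above (my gloss only; the paper's argument is more careful and also handles the 2^{-n} success probability): there are roughly n^{Θ(dn)} graphs of max degree d on n nodes, while k samples over a fixed alphabet Σ reveal at most k n log|Σ| bits, so

```latex
% Heuristic counting bound (not the exact argument of Bresler-M-Sly-08):
\log \#\{\text{graphs on } n \text{ nodes, max degree } d\}
  = \Theta(d\, n \log n)
  \;\le\; k\, n \log|\Sigma|
\quad\Longrightarrow\quad
k = \Omega\!\left(\frac{d \log n}{\log|\Sigma|}\right)
  = \Omega(d \log n) \text{ for constant } |\Sigma|.
```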
Intuition Behind Algorithms
• Observation: knowing the graph is the same as knowing the neighborhoods.
• But a neighborhood is determined by the Markov property.
• The same intuition is behind the work of Abbeel et al.
"Algorithm" (a toy sketch follows below):
Step 1. Compute empirical probabilities, which are concentrated around the true probabilities.
Step 2. For each node, simply test the Markov property of each candidate neighborhood.
Main challenge: show that non-degeneracy ⇒ the algorithm works.
Example 2: Reconstruction of MRF with Hidden nodes
• In many applications only some of the nodes can be observed: the visible nodes W ⊆ V.
• The Markov random field over the visible nodes is σ_W = (σ_w : w ∈ W).
• Is reconstruction still possible?
• What does "reconstruction" even mean?
[Figure: example graph on nodes 1, 2, 3, 4, only some of which are visible]
Reconstruction versus distinguishing
• It is easy to construct many models that lead to the same distribution (statistical unidentifiability).
• Assuming "this is not a problem", are there computational obstacles to reconstruction?
• In particular: how hard is it to distinguish
statistically different models?
Distinguishing problems
• Let M1, M2 be two models with hidden nodes
PROBLEM 1
• Can you tell if M1 and M2 are statistically close or
far apart (on the visible nodes)?
PROBLEM 2
• Assuming M1 and M2 are statistically far apart and
given access to samples from one of them, can
you tell where the samples came from?
Hardness result with hidden nodes
• In Bogdanov-M-Vadhan-08:
Problems 1 and 2 are intractable (in the
worst case) unless NP = RP
• Conversely, if NP = RP then distinguishing (and
other forms of reconstruction) are achievable
A possible objection
• The “hard” models M1, M2 describe distributions
that are not efficiently samplable
• But if nature is efficient, we never need to worry
about such distributions!
Distinguishing problem for samplable distributions
PROBLEM 3
• If M1 and M2 are statistically far apart and given
access to samples from one of them, can you tell
where the samples came from, assuming M1 and
M2 are efficiently samplable?
• Theorem: Problem 3 is intractable unless computational zero knowledge is trivial.
  – We don't know if this is tight.
Example 3: Consensus Ranking and the Mallows Model
Problem: Given a set of rankings {π_1, π_2, …, π_N}, find the consensus ranking (or central ranking)
  σ_0 = argmin_σ Σ_i d(σ, π_i),
where d is a distance on the set of permutations of n objects.
The most natural choice is d_K, the Kendall distance.
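For concreteness, a small Python sketch (my illustration, not from the talk) of the Kendall distance d_K and the consensus objective, with brute force over all permutations serving only as a specification for tiny n:

```python
import itertools

def kendall_distance(sigma, pi):
    """d_K(sigma, pi): number of pairs ordered differently by the two rankings.
    Rankings are tuples listing the items from first to last."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_p = {x: i for i, x in enumerate(pi)}
    return sum(1 for a, b in itertools.combinations(sigma, 2)
               if (pos_s[a] < pos_s[b]) != (pos_p[a] < pos_p[b]))

def consensus_by_brute_force(samples):
    """Central ranking: argmin over sigma of sum_i d_K(sigma, pi_i).
    Brute force over all n! permutations, feasible only for tiny n."""
    items = samples[0]
    return min(itertools.permutations(items),
               key=lambda s: sum(kendall_distance(s, p) for p in samples))

# Hypothetical usage on three rankings of four items.
samples = [(0, 1, 2, 3), (1, 0, 2, 3), (0, 1, 3, 2)]
print(consensus_by_brute_force(samples))   # e.g. (0, 1, 2, 3)
```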
The Mallows Model
• Exponential family model in β:
    P(π | σ_0) = Z(β)^{-1} exp(−β d_K(π, σ_0))
• β = 0: the uniform distribution over permutations.
• β > 0: σ_0 is the unique mode of P_{β,σ_0}.
Theorem [Meila,Phadnis,Patterson,Bilmes 07]:
Consensus ranking (i.e. ML estimation of σ_0 for constant β) can be solved exactly by a branch-and-bound (B&B) algorithm.
• The B&B algorithm can take (super-)exponential time in some cases.
• It seems to perform well on simulated data.
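As a sanity check on the two bullets above, a brute-force Python sketch (illustrative only; Z(β) is computed by enumeration here, although it has a closed form):

```python
import itertools
import math

def d_K(sigma, pi):
    """Kendall distance: number of discordant pairs between two rankings."""
    ps = {x: i for i, x in enumerate(sigma)}
    pp = {x: i for i, x in enumerate(pi)}
    return sum((ps[a] < ps[b]) != (pp[a] < pp[b])
               for a, b in itertools.combinations(sigma, 2))

def mallows_pmf(sigma0, beta, n):
    """P(pi | sigma0) = exp(-beta * d_K(pi, sigma0)) / Z(beta), by enumeration."""
    perms = list(itertools.permutations(range(n)))
    weights = {p: math.exp(-beta * d_K(p, sigma0)) for p in perms}
    Z = sum(weights.values())
    return {p: w / Z for p, w in weights.items()}

# beta = 0 gives the uniform distribution; beta > 0 makes sigma0 the unique mode.
pmf = mallows_pmf(tuple(range(4)), beta=1.0, n=4)
assert max(pmf, key=pmf.get) == tuple(range(4))
```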
Related work
ML estimation:
• [Fligner&Verducci 86] introduce the generalized Mallows model and its ML estimation.
• [Fligner&Verducci 88] (FV) heuristic for σ_0 estimation.
Consensus ranking:
• [Cohen,Schapire,Singer 99] Greedy algorithm (CSS)
  + improvement by finding strongly connected components
  + missing data (not all π_i rank all n items)
• [Ailon,Newman,Charikar 05] Randomized algorithm with a guaranteed 11/7-factor approximation (ANC)
• [Mathieu 07] (1+ε)-approximation, time O(n^6/ε + 2^{2^{O(1/ε)}})
• [Davenport,Kalagnanan 03] Heuristics based on edge-disjoint cycles
• [Conitzer,D,K 05] Exact algorithm based on integer programming, better bounds
Efficient Sorting of the Mallows Model
• [Braverman-Mossel-09]: Given r independent samples from the Mallows model, the ML solution can be found exactly (!) in time n^b, where
    b = 1 + O((β r)^{-1}),
  with high probability (say ≥ 1 − n^{-100}).
Sorting the Mallows Model
• [Braverman-M-09]: The optimal order can be found in polynomial time and O(n log n) queries.
• Proof ingredient 1: "statistical properties" of the generated permutations π_i in terms of the original order σ_0. With high probability:
    Σ_x |π_i(x) − σ_0(x)| = O(n),   max_x |π_i(x) − σ_0(x)| = O(log n).
• Additional ingredient: a dynamic programming algorithm that finds σ given a starting point in which each element is at most k away, with running time O(n 2^{6k}). (A toy sketch of such a windowed DP follows below.)
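To make the windowed dynamic-programming idea concrete, here is a toy Python sketch (my own illustration, not the authors' algorithm): it minimizes the total Kendall distance to the samples over permutations that keep every element within k positions of its place in a given starting order. The helper `kendall_cost_matrix` and the exact state encoding are assumptions; only the n · exp(O(k)) scaling is meant to mirror the bound quoted above.

```python
from functools import lru_cache

def kendall_cost_matrix(samples, items):
    """w[a][b] = number of samples that rank b before a, i.e. the price
    paid for putting a before b in the output ranking."""
    pos = [{x: i for i, x in enumerate(s)} for s in samples]
    return {a: {b: sum(p[b] < p[a] for p in pos) for b in items if b != a}
            for a in items}

def bounded_displacement_consensus(samples, start, k):
    """Toy windowed DP: returns min over sigma of sum_i d_K(sigma, pi_i),
    over permutations sigma placing every element within k positions of its
    position in `start`.  States are (next output position, placed elements
    in the local window), roughly n * 2^(2k+1) of them.  The argmin
    permutation could be recovered by standard backtracking."""
    n = len(start)
    idx = {x: j for j, x in enumerate(start)}   # start position of each element
    w = kendall_cost_matrix(samples, start)
    INF = float("inf")

    @lru_cache(maxsize=None)
    def best(i, placed):
        if i == n:
            return 0
        lo = max(0, i - k)
        # candidates for output position i: unplaced elements starting within k of i
        cands = [x for x in start[lo:min(n, i + k + 1)] if x not in placed]
        res = INF
        for e in cands:
            new_placed = set(placed) | {e}
            # the element starting at position i - k can never be placed later;
            # prune branches where it is still missing
            if i - k >= 0 and start[i - k] not in new_placed:
                continue
            # every still-unplaced element f ends up after e, costing w[e][f]
            cost = sum(w[e][f] for f in start[lo:] if f not in placed and f != e)
            # keep only elements of the next window in the memo key
            key = frozenset(x for x in new_placed if idx[x] >= max(0, i + 1 - k))
            res = min(res, cost + best(i + 1, key))
        return res

    return best(0, frozenset())

# Hypothetical usage: three noisy rankings of five items, window k = 2.
samples = [(0, 1, 2, 3, 4), (1, 0, 2, 4, 3), (0, 2, 1, 3, 4)]
print(bounded_displacement_consensus(samples, start=(0, 1, 2, 3, 4), k=2))
```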
Example 4: MCMC Diagnostics
• Sampling from a joint probability distribution π on Ω.
• Construct a Markov chain X0, X1, … which converges to π.
• Multidimensional Integration
• Bayesian inference (sampling from “posterior distributions” of the model)
• Computational physics,
• Computational biology,
• Vision, Image segmentation.
Convergence Diagnostics
• A method to determine t such that P^t(X0, ·) is "sufficiently close" to π.
- MCMC in Practice, Gilks, Richardson & Spiegelhalter
Are we there yet?
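To pin down what "sufficiently close" and the threshold T(1/4) used in the results below mean, here is a small Python illustration on an explicit chain (a lazy walk on a cycle, chosen purely for illustration). For chains on {0,1}^n given only implicitly, this brute-force computation is of course infeasible, which is exactly the point of the hardness results.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two distributions given as vectors."""
    return 0.5 * np.abs(p - q).sum()

def mixing_time(P, x0, eps=0.25, t_max=10**5):
    """Smallest t with ||P^t(x0, .) - pi||_TV <= eps, for a small explicit chain.
    pi is taken as the leading left eigenvector of P (stationary distribution)."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    dist = np.zeros(P.shape[0])
    dist[x0] = 1.0
    for t in range(1, t_max + 1):
        dist = dist @ P
        if tv_distance(dist, pi) <= eps:
            return t
    return None

# Lazy random walk on a cycle of length 20, started at state 0.
n = 20
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] += 0.25
    P[i, (i + 1) % n] += 0.25
print(mixing_time(P, x0=0))   # T_{X0}(1/4) for this tiny chain
```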
Convergence Diagnostics in the Literature
• Visual inspection of functions of samples.
  - MCMC in Practice, Gilks, Richardson & Spiegelhalter
• [Raftery-Lewis] Bounding the variance of estimates of quantiles of functions of parameters.
• [Gelman-Rubin] Convergence achieved if the separate empirical distributions are approximately the same as the combined distribution.
• [Cowles-Carlin] Survey of 13 diagnostics designed to identify specific types of convergence failure; recommend combinations of strategies: running parallel chains, computing autocorrelations.
Diagnostic Algorithm
Diagnostic algorithm A:
Input: a Markov chain P on Ω ⊆ {0,1}^n and a time t such that either
  • T(1/4) < t, or
  • T(1/4) > 100t.
Output:
  • "Yes" if T(1/4) < t
  • "No" if T(1/4) > 100t
Results - The General Case
Theorem [Bhatnagar-Bogdanov-M-09]: Given
1. a Markov chain P on Ω ⊆ {0,1}^n, and
2. a time t such that either
   • T(1/4) < t, or
   • T(1/4) > 100t,
it is a PSPACE-complete problem to distinguish the two.
PSPACE-complete problems
• [Crasmaru-Tromp] Ladder problems in Go: does black have a winning strategy?
• PSPACE-hard problems are "much harder" than NP-hard problems.
Results - Polynomial Mixing
Theorem 2 [Bhatnagar-Bogdanov-M-09]: Given
1. a Markov chain P on Ω ⊆ {0,1}^n with a promise that T(1/4) < poly(n), and
2. a time t such that either
   • T(1/4) < t, or
   • T(1/4) > 100t,
it is as hard as any coNP problem to distinguish the two.
[Illustration: a CNF formula (x0 ∨ x3 ∨ x7) ∧ (x2 ∨ x7 ∨ xn) ∧ … ∧ (…) corresponds to a chain with T(1/4) < t, while (x6 ∨ x3 ∨ x2) ∧ (x0 ∨ x7 ∨ x1) ∧ … ∧ (…) corresponds to a chain with T(1/4) > 100t.]
Results - Mixing from a Given State
Theorem 3 [Bhatnagar-Bogdanov-M-09]: Given
1. a Markov chain P on Ω ⊆ {0,1}^n such that T(1/4) < poly(n),
2. a starting state X0, and
3. a time t such that either
   • T_{X0}(1/4) < t, or
   • T_{X0}(1/4) > 100t,
a diagnostic algorithm can be used to distinguish whether two efficiently samplable distributions are close or far in statistical distance.
[Sahai-Vadhan]: the latter problem is "Statistical Zero Knowledge"-complete.
[Figure: two circuits C1 and C2 mapping random inputs r_1, …, r_n and r'_1, …, r'_n to outputs a_1, …, a_m]
Comb. Stat. vs. Common ML Approaches
• Some features of Combinatorial Statistics in comparison to ML:
• +: Using explicit well defined models.
• +: Explicit non-asymptotic running time / sampling bounds.
• - : Often not "statistically efficient".
• - : Inputs must be generated from the correct distribution, or close to it
  (but often the problem is NP-hard otherwise, so this is true of all methods).
• +/- : Need to understand the combinatorial structure of each new problem, versus applying general statistical techniques where the structure is hidden in programming/design etc.
Conclusions and Questions
• [Cowles-Carlin] Survey of 13 diagnostics designed to identify specific types of
convergence failure. Each can fail to detect the type of convergence it was
designed for.
• [Arora-Barak] Efficient algorithms are not believed to exist for PSPACE-complete, coNP-complete, or SZK-complete problems.
• Diagnostic algorithms do not exist for large classes of MCMC algorithms, like Gibbs samplers, unless there are efficient algorithms for PSPACE or coNP or SZK.
• For what classes of samplers are convergence diagnostics possible?
• Hardness of testing convergence for more restricted classes of samplers.
Collaborators
Nayantara Bhatnagar, Berkeley
Allan Sly, Berkeley
Guy Bresler, Berkeley
Andrej Bogdanov, Hong Kong
Mark Braverman, Microsoft
Salil Vadhan, Harvard
Thanks !!