1
OPTIMIZATION WITH PARITY CONSTRAINTS:
FROM BINARY CODES TO DISCRETE INTEGRATION
Stefano Ermon*, Carla P. Gomes*,
Ashish Sabharwal+, and Bart Selman*
*Cornell University
+IBM Watson Research Center
UAI - 2013
2
High-dimensional integration
• High-dimensional integrals in statistics, ML, physics
• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning
• Curse of dimensionality:
[Figure: n-dimensional hypercube of side L, containing L, L^2, L^3, L^4, …, L^n units of volume]
• Quadrature involves a weighted sum over an exponential number of items (e.g., units of volume)
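For example, a grid with L = 10 points per dimension in n = 100 dimensions already contains 10^100 cells, far more than any explicit sum can visit.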
3
Discrete Integration
• We are given
• A set of 2^n items
• Non-negative weights w
[Figure: 2^n items whose size visually represents their weight, e.g., 1, 4, …, 0, 5]
• Goal: compute total weight
• Compactly specified weight function:
• factored form (Bayes net, factor graph, CNF, …)
• Example 1: n = 2 variables, sum over 4 items with weights 5, 0, 2, 1, given in factored form (see the brute-force sketch at the end of this slide)
Goal: compute 5 + 0 + 2 + 1 = 8
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 items (intractable)
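A minimal brute-force sketch of Example 1; the two factors below are hypothetical (chosen so that the four item weights come out to 5, 0, 2, 1), but any Bayes net / factor graph / weighted CNF is summed the same way:

from itertools import product

# Hypothetical factored weight: w(x1, x2) = f1(x1) * f2(x1, x2)
f1 = {0: 1.0, 1: 2.0}
f2 = {(0, 0): 5.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.5}

def weight(x1, x2):
    return f1[x1] * f2[(x1, x2)]

# Discrete integration: sum the weight of all 2^n assignments.
# Tractable only for tiny n; for n = 100 there are 2^100 items.
total = sum(weight(x1, x2) for x1, x2 in product([0, 1], repeat=2))
print(total)  # 5.0 + 0.0 + 2.0 + 1.0 = 8.0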
4
Hardness
[Figure: complexity classes from Easy to Hard: P, NP, PH, P^#P, PSPACE, EXP]
• 0/1 weights case:
• Is there at least a "1"? → SAT
• How many "1"s? → #SAT
• NP-complete vs. #P-complete. Much harder
• General weights:
• Find heaviest item (combinatorial optimization, MAP)
• Sum weights (discrete integration)
• [ICML-13] WISH: Approximate Discrete Integration
via Optimization. E.g., partition function via MAP
inference
• MAP inference often fast in practice:
• Relaxations / bounds
• Pruning
5
WISH: Integration by Hashing and
Optimization
[Algorithm diagram: outer loop over the n variables; repeat log(n) times: MAP inference on the model augmented with random parity constraints; aggregate the MAP inference solutions]
[Figure: original graphical model over n binary variables σ ∈ {0,1}^n; the augmented model adds parity check nodes enforcing A σ = b (mod 2)]
The algorithm requires only O(n log n) MAP queries to
approximate the partition function within a constant factor
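A minimal sketch of this loop, assuming a generic map_solver(model, A, b) that returns max_x w(x) subject to A x = b (mod 2); the names and the repetition count T are illustrative, not the authors' code:

import numpy as np

def wish(model, n, map_solver, T=10, seed=0):
    """Estimate the partition function Z of `model` over n binary variables.

    map_solver(model, A, b) must return max_x w(x) subject to A x = b (mod 2);
    T plays the role of the "repeat log(n) times" inner loop.
    """
    rng = np.random.default_rng(seed)
    # i = 0: unconstrained MAP, the mode M0 (empty constraint system).
    medians = [map_solver(model, np.zeros((0, n), dtype=int), np.zeros(0, dtype=int))]
    for i in range(1, n + 1):
        vals = []
        for _ in range(T):
            A = rng.integers(0, 2, size=(i, n))   # i random parity constraints
            b = rng.integers(0, 2, size=i)
            vals.append(map_solver(model, A, b))
        medians.append(np.median(vals))
    # Aggregate: M0 + sum_i M_i * 2^(i-1)
    return medians[0] + sum(M * 2 ** (i - 1) for i, M in enumerate(medians[1:], start=1))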
6
Visual working of the algorithm
• How it works:
[Figure: the function to be integrated, then the same function with 1, 2, 3, …, n random parity constraints added; each level is repeated log(n) times]
Estimate: mode M0 + median M1 × 1 + median M2 × 2 + median M3 × 4 + …
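In symbols, the estimate is M0 + Σ_{i=1..n} Mi · 2^(i-1), where Mi is the median of the MAP values obtained with i random parity constraints; this is the quantity analyzed in the [ICML-13] guarantees on the next slide.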
7
Accuracy Guarantees
• Theorem [ICML-13]: With probability at least 1- δ (e.g.,
99.9%) WISH computes a 16-approximation of the
partition function (discrete integral) by solving Θ(n log n)
MAP inference queries (optimization).
• Theorem [ICML-13]: Can improve the approximation
factor to (1+ε) by adding extra variables and factors.
• Example: factor 2 approximation with 4n variables
• Remark: faster than enumeration only when combinatorial
optimization is efficient
8
Summary of contributions
• Introduction and previous work:
• WISH: Approximate Discrete Integration via Optimization.
• Partition function / marginalization via MAP inference
• Accuracy guarantees
• MAP Inference subject to parity constraints:
• Tractable cases and approximations
• Integer Linear Programming formulation
• New family of polynomial time (probabilistic) upper and lower
bounds on partition function that can be iteratively tightened (will
reach within constant factor)
• Sparsity of the parity constraints:
• Techniques to improve solution time and bounds quality
• Experimental improvements over variational techniques
9
MAP INFERENCE WITH
PARITY CONSTRAINTS
Hardness, approximations, and bounds
10
Making WISH more scalable
• Would approximations to the optimization (MAP inference
with parity constraints) be useful? YES
• Bounds on MAP (optimization) translate to bounds on the
partition function Z (discrete integral)
• Lower bounds (local search) on MAP → lower bounds on Z
• Upper bounds (LP, SDP relaxation) on MAP → upper bounds on Z
• Constant-factor approximations on MAP → constant factor on Z
• Question: Are there classes of problems where we can
efficiently approximate the optimization (MAP inference) in
the inner loop of WISH?
11
Error correcting codes
Communication over a noisy channel
[Figure: Alice sends x = 0100|1 (data bits 0100 plus a redundant parity check bit 0 XOR 1 XOR 0 XOR 0 = 1) over a noisy channel; Bob receives y = 0110|1, whose data bits give parity 0 XOR 1 XOR 1 XOR 0 = 0 ≠ 1]
• Bob: There has been a transmission error! What was the
message actually sent by Alice?
• Must be a valid codeword
• As close as possible to received message y
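A minimal Python sketch of this single parity bit check, using the bit strings from the figure (the variable names are ours):

def parity(bits):
    p = 0
    for b in bits:
        p ^= b          # XOR: 0 if an even number of 1s, 1 otherwise
    return p

sent_bits     = [0, 1, 0, 0]          # data bits of x
check_bit     = parity(sent_bits)     # redundant parity check bit = 1
received_bits = [0, 1, 1, 0]          # data bits of y (one bit flipped)
print(check_bit, parity(received_bits))  # 1 vs 0: Bob detects a transmission error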
12
Decoding a binary code
• Max-likelihood decoding
[Figure: the transmission x = 0100|1 → noisy channel → y = 0110|1 from the previous slide]
[Figure: left, the ML-decoding graphical model: a noisy channel model relating y to x, plus parity check nodes forcing the transmitted string x to be a codeword; right, our more general case: a more complex probabilistic model over x with the same parity check nodes]
• MAP inference is NP-hard to approximate within any constant factor [Stern, Arora, ..]
• Yet LDPC codes are routinely decoded in practice: 10GBase-T Ethernet, Wi-Fi 802.11n, digital TV, ..
• Our problem: max w(x) subject to A x = b (mod 2)
• Equivalent to MAP inference on the augmented model
13
Decoding via Integer Programming
• MAP inference subject to parity constraints encoded as
an Integer Linear Program (ILP):
• Standard MAP encoding
• Compact (polynomial) encoding by Yannakakis for parity constraints (the parity polytope); a simplified encoding is sketched at the end of this slide
• LP relaxation: relax integrality constraint
• Polynomial time upper bounds
• ILP solving strategy: cuts + branching + LP relaxations
• Solve a sequence of LP relaxations
• Upper and lower bounds that improve over time
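As a minimal sketch (not the compact encoding from the paper): one simple way to put a parity constraint into an ILP is an integer slack variable per row, shown here with the open-source PuLP modeler, a simplified linear (unary-weight) objective, and placeholder values for theta, A and b:

# pip install pulp
import pulp

n = 4
theta = [5.0, -2.0, 3.0, 1.0]          # placeholder unary weights (log-potentials)
A = [[1, 1, 0, 1], [0, 1, 1, 1]]       # placeholder parity constraint matrix
b = [1, 0]

prob = pulp.LpProblem("map_with_parity", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
# Integer slack turns "selected bits sum to b[j] mod 2" into a linear constraint:
#   sum_{i: A[j][i] = 1} x_i = b[j] + 2 * k_j,  k_j a non-negative integer.
k = [pulp.LpVariable(f"k{j}", lowBound=0, cat="Integer") for j in range(len(A))]

prob += pulp.lpSum(theta[i] * x[i] for i in range(n))    # simplified MAP objective
for j, row in enumerate(A):
    prob += pulp.lpSum(x[i] for i in range(n) if row[i]) == b[j] + 2 * k[j]

prob.solve()                           # branch & cut over a sequence of LP relaxations
print([int(v.value()) for v in x], pulp.value(prob.objective))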
14
Iterative bound tightening
Polynomial time upper and lower bounds on MAP that are iteratively
tightened over time
• Recall: bounds on optimization (MAP) → (probabilistic) bounds on
the partition function Z. New family of bounds.
• WISH: When MAP is solved to optimality (LowerBound =
UpperBound), guaranteed constant factor approximation on Z
15
SPARSITY OF THE PARITY
CONSTRAINTS
Improving solution time and bounds quality
16
Inducing sparsity
• Observations:
• Problems with sparse A x = b (mod 2) are empirically easier to
solve (similar to Low-Density Parity Check codes)
• Quality of LP relaxation depends on A and b, not just on the
solution space. Elementary row operations (e.g., sum 2 equations)
do not change solution space but affect the LP relaxation.
1) Reduce A x = b (mod 2) to row-echelon form with Gaussian elimination (linear equations over the finite field GF(2)); a sketch follows below
2) Greedy application of elementary row operations
[Figure: parity check nodes for the matrix A in row-echelon form, and an equivalent but sparser set of parity check nodes after greedy row operations]
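A minimal sketch of step 1, assuming A and b are plain Python lists of 0/1 values (the greedy sparsification of step 2 is not shown):

def gf2_row_echelon(A, b):
    """Reduce the parity system A x = b (mod 2) to row-echelon form in place.

    All arithmetic is XOR, so elementary row operations leave the
    solution set of the system unchanged.
    """
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        # Find a pivot row with a 1 in column c.
        pivot = next((i for i in range(r, rows) if A[i][c]), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        b[r], b[pivot] = b[pivot], b[r]
        # Eliminate column c from the rows below (XOR of rows).
        for i in range(r + 1, rows):
            if A[i][c]:
                A[i] = [a ^ p for a, p in zip(A[i], A[r])]
                b[i] ^= b[r]
        r += 1
        if r == rows:
            break
    return A, b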
17
Improvements from sparsity
• Quality of LP relaxations significantly improves
• Finds integer solutions faster (better lower bounds)
[Figure: improvements from sparsification using the IBM CPLEX ILP solver on a 10x10 Ising grid; the upper bound improves, and without sparsification the solver fails at finding integer solutions (lower bounds)]
18
Generating sparse constraints
• WISH based on Universal Hashing:
• Randomly generate A in {0,1}^(i×n), b in {0,1}^i (see the sketch at the end of this slide)
[Figure: the i×n system A x = b (mod 2); we optimize over solutions of these parity constraints]
• Then A x + b (mod 2) is:
• Uniform over {0,1}^i
• Pairwise independent: given distinct variable assignments x and y, the events A x = b (mod 2) and A y = b (mod 2) are independent.
• Suppose we generate a sparse matrix A
• At most k variables per parity constraint (up to k ones per row of A)
• A x+b (mod 2) is still uniform, not pairwise independent anymore
• E.g. for k=1, A x = b mod 2 is equivalent to fixing i variables. Lots of
correlation. (Knowing A x = b tells me a lot about A y = b)
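A minimal numpy sketch of generating dense (fully random) and k-sparse parity constraints; the function names are ours, and the MAP solver that would consume A and b is omitted:

import numpy as np

rng = np.random.default_rng(0)

def dense_parity_constraints(i, n):
    """Uniform A in {0,1}^(i x n) and b in {0,1}^i (pairwise-independent hashing)."""
    return rng.integers(0, 2, size=(i, n)), rng.integers(0, 2, size=i)

def sparse_parity_constraints(i, n, k):
    """Each of the i rows has at most k ones: easier MAP, weaker guarantees."""
    A = np.zeros((i, n), dtype=int)
    for j in range(i):
        A[j, rng.choice(n, size=k, replace=False)] = 1
    return A, rng.integers(0, 2, size=i)

def satisfies(A, b, x):
    """True if the assignment x satisfies A x = b (mod 2)."""
    return np.array_equal(A @ x % 2, b)

A, b = dense_parity_constraints(3, 10)
x = rng.integers(0, 2, size=10)
print(satisfies(A, b, x))   # each fixed x survives with probability 2^-3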
19
Using sparse parity constraints
• Theorem: With probability at least 1- δ (e.g., 99.9%)
WISH with sparse parity constraints computes an
approximate lower bound of the partition function.
• PRO: “Easier” MAP inference queries
• For example, random parity constraints of length 1 (= on a single
variable). Equivalent to MAP with some variables fixed.
• CON: We lose the upper bound part. Output can
underestimate the partition function.
• CON: No constant factor approximation anymore
20
MAP with sparse parity constraints
• Evaluation of MAP inference with sparse parity constraints
[Figure: results on a 10x10 attractive Ising grid and a 10x10 mixed Ising grid]
• ILP and Branch&Bound outperform message-passing (BP,
MP and MPLP)
21
Experimental results
• ILP provides probabilistic upper and lower bounds
that improve over time and are often tighter than
variational methods (BP, MF, TRW)
22
Experimental results (2)
• ILP provides probabilistic upper and lower bounds
that improve over time and are often tighter than
variational methods (BP, MF, TRW)
23
Conclusions
• [ICML-13] WISH: Discrete integration reduced to small
number of optimization instances (MAP)
• Strong (probabilistic) accuracy guarantees
• MAP inference is still NP-hard
• Scalability: Approximations and Bounds
• Connection with max-likelihood decoding
• ILP formulation + sparsity (Gauss sparsification & uniform hashing)
• New family of probabilistic polynomial time computable
upper and lower bounds on partition function. Can be
iteratively tightened (will reach within a constant factor)
• Future work:
• Extension to continuous integrals and variables
• Sampling from high-dimensional probability distributions
24
Extra slides