
Algorithmic Tools Applied to Some Machine Learning and Inference Problems
David Karger
Conclusion



Algorithms research has produced a huge set
of tools that can be applied to efficiently
solving computational problems
Any time one can precisely define a
computational problem, it’s worth looking in
the algorithms toolbox to see if it can be
solved efficiently
So, define some problems and talk to your
local algorithmicist!
Outline

Demonstrate by case studies: applying
algorithmic tools to four distinct
learning/inference problems





Near neighbor searching
MDPs with nonrecurring rewards
Decoding turbo codes
Learning low-treewidth graphical models
Each is a well defined computational problem
Approach

4 separate talks?



Commonality not in tools (too many to enumerate)
Rather, in reductionist / analogical approach
Common heuristics in search for right tools


By analogy to similar problems
By properties of problem, e.g.




Self-reducibility ---recursion, sampling
Use of paths, trees, cliques, etc.
Natural integer programming formulations
By judicious simplification
Tools we’ll use today











Random sampling
Self-reducibility
Amortization
Dynamic programming
Combinatorial graph theory
Constant-factor approximation algorithms for NP-hard problems
Max flow
Approximation-preserving reductions
Locally checkable structures
Linear programming relaxations
Randomized rounding of relaxed solutions
Lost in Translation

Algorithms designed for “clean” problems


Measures of performance sometimes artificial



Often require simplifying assumptions from
real-world problems
“Constant approximation factor” is nice in theory, but the actual constant matters in practice
So, algorithm may or may not be useful
But even if not, can provide insight into
practical ways of attacking the problem
A Data Structure for
Nearest Neighbors on
Manifolds
Random Sampling
Greedy algorithms
Local Search
Memoization
Motivation



Common theme in machine learning: find
high-fidelity embedding of high-dimensional
data into a lower-dimensional representation
Isomap [TdSL] and locally linear embedding
[RS] postulate data on low-dimensional
manifold in high dimensional space
To reconstruct manifold, both find each
node’s “near neighbors” by brute force


Quadratic time in number of points
Improve with fast near-neighbor queries?
The Original Motivation




Peer to peer systems
Nodes need to find “nearby” nodes for fast
communication/data access
Internet is (kind-of) low dimensional
This near neighbor data structure works well
Near Neighbor Search



Many optimal data structures in low
dimensions (computational geometry)
But high dimensional problem hard
Subject of much recent attention in
algorithms community



[Kleinberg]---Two algorithms for NNS in high
dimension: STOC 1997
[DISV] et al workshop on near neighbor search:
NIPS 2003
But focus on approximate near neighbors
First Try


Manifold is low dimensional
Use low dimensional NN
solutions?


E.g. Voronoi diagram
Problem:



Voronoi diagram relies on geometry of space
But we don’t know geometry (coordinate system)
until after we construct manifold
So to reconstruct, are limited to using distance,
not geometry
Some Generic Ideas

Greedy approach




Take query q and closest point p found so far
Arrange to find something closer to q than p
Repeat till done
Local Search




Want point closer to q
Must be pretty close to p
So examine points near p till
find one close to q
Repeat with new point
(Figure: local search step — current point p at distance d from query q; improved point p’ at distance d/2.)
Some More Ideas

Random Sampling





For fast search, look only at a few candidates
Good candidates depend on query point
For any choice, “adversary” can pick bad q
So choose random set---probably good for all
Memoization



Picking random points near p is a NN search!
Pick in advance, as part of building data structure
E.g., for every possible radius (!), store (a few)
random candidates in that ball around p
Random Sampling

When will it work?




Evenly distributed low-dimensional point sets
E.g. a grid of points
E.g., random sample from low-dim. manifold
On grid, if p and q at distance d, then
Ball(q,d/2) is 1/16 of points in Ball(p,2d)


So, random point has 1/16 chance of improving
And trying 16 random points raises odds to 60-40
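A quick way to sanity-check the 1/16 figure is a small Monte-Carlo estimate; in two dimensions it is just the area ratio (d/2)²/(2d)² = 1/16. The sketch below is ours, purely for illustration (function name and parameters are not from the talk).

```python
import random


def improvement_probability(d=40, trials=100_000, seed=0):
    """Estimate Pr[a uniform point of Ball(p, 2d) lies in Ball(q, d/2)]
    for grid points, with p = (0, 0) and q = (d, 0) under Euclidean distance."""
    rng = random.Random(seed)
    hits = total = 0
    while total < trials:
        x = rng.randint(-2 * d, 2 * d)
        y = rng.randint(-2 * d, 2 * d)
        if x * x + y * y > (2 * d) ** 2:
            continue                      # rejection-sample from Ball(p, 2d)
        total += 1
        if (x - d) ** 2 + y ** 2 <= (d // 2) ** 2:
            hits += 1
    return hits / total                   # roughly 0.0625 in two dimensions
```

So roughly one candidate in sixteen improves the current point, which is where the 16-sample, 60-40 estimate comes from.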
A Big Fast Data Structure?


At each point p, store 16 random points from
balls of every possible radius around p
Resolve query by iterating improvements





Look in right size ball around current point
If lucky, find point closer to q, can look in smaller
ball next time
If unlucky, end up with bigger ball
But luck is more likely---drift in right direction
Expected ball size shrinks by const factor

So, O(log n) steps get to “tiny” ball---at answer
Time to Get Real

Storing samples for all radii expensive





Worry about dependence



OK to be sloppy, and store only for powers of 2
Means always have ball of roughly correct size
But only log n balls needed per point
So O(n log n) space overall
What if search cycles back to point already seen?
“random samples” are no longer random
And how find random samples in first place?
Metric Skip Lists

Consider randomly ordering list of points



For each power-of-two d, point p records next 16
points in list within distance d
These points are a random sample from ball
Search inspects points forward in list




From current p, check 16 following in current d
Search always moves forward
So always considering “new” points
So no cycles, and samples independent in every
step (mostly)
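Below is a minimal Python sketch of the structure just described. It is ours, not the talk's: the construction is the naive quadratic one (the real, incremental construction is on the next slide), the stopping rule is heuristic, and all names are made up.

```python
import math
import random


def build_metric_skip_list(points, dist, k=16, seed=0):
    """Random list order plus, for each position and each power-of-two radius,
    the next k points in the list within that distance (built naively here)."""
    rng = random.Random(seed)
    order = list(points)
    rng.shuffle(order)
    diam = max(dist(a, b) for a in order for b in order)
    levels = max(1, int(math.ceil(math.log2(max(diam, 1.0)))) + 1)
    radii = [2.0 ** i for i in range(levels)]
    nxt = [{r: [j for j in range(i + 1, len(order))
                if dist(order[i], order[j]) <= r][:k]
            for r in radii}
           for i in range(len(order))]
    return order, radii, nxt


def near_neighbor(q, order, radii, nxt, dist):
    """Greedy search: look at the stored forward samples in a ball of roughly
    the current distance to q, and move to any sample that is closer."""
    best = 0                                    # head of the random list
    while True:
        d = dist(q, order[best])
        r = next((x for x in radii if x >= d), radii[-1])
        closer = [j for j in nxt[best][r] if dist(q, order[j]) < d]
        if not closer:
            return order[best]                  # no stored sample improves
        best = closer[0]
```

On evenly distributed low-dimensional data each move lands in a ball of constant-factor smaller expected size, which is the O(log n) query argument above; this simplified sketch only returns a heuristically good neighbor.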
Construction

Randomized Incremental Construction



Common idea from computational geometry
Adding points to data structure in random order is
often easy, yields good data structure
Build list by prepending points in random order




To set new pointers, must find next 16 points within
(each) distance d of new point
But this is like NN query (given the way we search)
Find answer by slight variation on NN search
List so far constructed makes this easy
Summary

In any low-dimensional space with evenly
distributed points, metric skip lists solve
nearest neighbor queries





O(n log n) space
O(n log n) time to build data structure
O(log n) time to query
Constant depends on dimensionality of space
Application: by building this structure, can
find NNs for Isomap/LLE in O(n log n) time
Experiments



Data structure is simple---no hidden
“theoretical” aspects
Implementation for P2P search works well
Experiments on some simple manifolds [JT]
work well too
Improvements [KL04]




There are better ways to pick “improvement
candidates” for local search
Assouad’s doubling dimension is (the log of) the maximum number of diameter-d/2 balls needed to cover any diameter-d ball
Pick one candidate from each ball in cover
Get same bounds as above, but


Applies to unevenly distributed points
Deterministic
Markov Decision
Processes with
Nonrecurring Rewards
Approximation algorithms
Reductions to related problems
Dynamic programming
A Robot Navigation Problem




Robot to deliver packages
Goal to deliver as quickly as possible
Sounds like traveling salesman problem?
Mismatches


Robot may not go where it plans to (sensor error,
motor control error, battery failure….)
Some packages matter more
Formulate as
Markov Decision Process




Graph with rewards r_v on states (vertices) v, travel times (lengths) on edges
From each node, choice of actions (each a probability distribution on next vertex)
Choosing sequence of actions produces a random path through graph
If arrive at vertex v at time t, receive discounted reward γ^t r_v where γ < 1


Motivates getting there quickly
Goal: maximize total discounted reward
MDP Discounting

Reward received each time vertex is visited




So plain value of infinite path can be infinite
Discounting means total reward is bounded by a
geometric series, so bounded
Alternative: consider average reward per unit time
Other reasons for discounting



Inflation (money in future less value than now)
Uncertainty (what if something happens before I
collect future prize?)
Mathematical elegance
Solution


Fixing action at each state produces a Markov chain with transition probabilities p_vw
Can compute expected discounted value V_v if start at state v:
V_v = r_v + Σ_w p_vw γ^(t(v,w)) V_w

Choosing actions to optimize this recurrence
is polynomial time solvable


Linear programming
Dynamic programming (like shortest paths)
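As a concrete sketch of this recurrence (ours, not from the talk): value iteration over a finite state set, with a max over actions added to make it the optimization version. The assumed data layout is `actions[v]`, a dict mapping each action to a list of (next state, probability, travel time) triples.

```python
def value_iteration(states, actions, reward, gamma, iters=500):
    """Iterate V_v = r_v + max_a sum_w p_a(v, w) * gamma**t(v, w) * V_w."""
    V = {v: 0.0 for v in states}
    for _ in range(iters):
        # evaluate the right-hand side with the old V, then rebind
        V = {v: reward[v] +
                max((sum(p * gamma ** t * V[w] for w, p, t in outcomes)
                     for outcomes in actions[v].values()), default=0.0)
             for v in states}
    return V
```

With γ < 1 and strictly positive travel times the update is a contraction, so the iterates converge to the optimal values.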
Solving the wrong problem

Package can only be delivered once


One solution: expand state space




So, incorrect to get reward each time reach target
Vertex represents where I am and where I have
been before (what packages already delivered)
Reward nonzero only on states where current
location not included in list of previously visited
Now apply MDP algorithm
Problem: expanded problem size exponential
in original input
This is one Instance of a
General Problem [KL]


Often, MDP has state space with nice small
implicit description but huge explicit
description
How do we accomplish MDP optimization on
such instances?
Tackle an easier problem

Problem has two novel elements for TOC



We will ignore second issue




Discounting of reward based on arrival time
Probability distribution on outcome of actions
In practice, robot can control errors
Even first issue by itself is hard and interesting
First step towards solving whole problem
Frantic Salesman Problem

Given rewards, travel times, and discount factor,
find a path maximizing total discounted reward
Approximation Algorithms

FSP is NP-complete (thus, so is more
general MDP-type problem)



Reduction from minimum latency TSP
So intractable to solve exactly
Goal: approximation algorithm that is
guaranteed to achieve at least some constant
(<1) multiple of the best possible discounted
reward
TOC Toolbox


Goal seems to be to find a “short” path that
visits “lots” of reward
Relates to previously studied k-TSP problem



Given a root vertex v, find a path of minimum total
length that starts at v and visits vertices with
(undiscounted) prize at least k
Constant factor approximation algorithm known
for undirected graphs (so we assume this too)
i.e., can find path of at most a constant (>1)
multiple of minimum possible total edge length
Mismatch




Constant factor approximation doesn’t
exponentiate well
Suppose optimum solution reaches some reward r at time t for discounted reward γ^t r
Constant factor approximation would reach within time 2t for reward γ^(2t) r
Result: get only γ^t fraction of optimum discounted reward, not a constant fraction.
Idea:
Change Objective Function

Modify k-TSP to approximate prize collected
instead of length: orienteering problem




Assume tour of length l collecting prize p
Find tour of length l collecting prize p/2
Avoids changing length, so exponentiation doesn’t
hurt
Drawback: no constant factor approximation
previously known


Flipping objective/feasibility transforms problem
(Our techniques end up resolving this too)
Idea: Upper Bounds

General tool for approximation algorithms




Show close to something no solution can beat
Let d_v denote shortest path distance to v
Define the prize at v as p_v = γ^(d_v) r_v

Max discounted reward possibly collectable at v

So max conceivable reward Σ p_v
Potential greedy algorithm: take shortest path to one max-prize vertex
Gets at least 1/n of optimum
Compare to Upper Bound

If given path reaches v at time t_v, define excess at v as e_v = t_v − d_v


Difference between shortest path and chosen one
Then discounted reward at v is γ^(e_v) p_v
Idea: good solution need not bother visiting
nodes at large excess


If excess large and node still worth visiting, prize
must be huge
So forget current path, just go straight to huge
prize without discounting
Formalize

Fact: excess only increases as traverse path


Without loss of generality, assume γ = 1/4


Excess reflects lost time; can’t make it up
Just scale edge lengths
Claim: at least ½ of optimum path’s
discounted reward R is collected before
path’s excess rises to ½



Let w be first vertex with e_w > 1/2
Suppose more than R/2 reward follows w
Show contradiction
Path Improvement


e_w > 1/2 but more than R/2 reward follows w
Shortcut directly to w then traverse optimum


reduces all excesses after w by 1/2
so improves discounts by (1/γ)^(1/2) = 2
so doubles discounted reward collected
but this was more than R/2: contradiction
(Figure: excess along the optimum path — 0 at the start, 1/2 at w, 1 1/2 at the end.)
Discount Discounting


We showed large excess can be ignored
But if excess is small, discounting by excess
can be ignored!




(discounted) reward ~ (undiscounted) prize
So, just find path with small excess
maximizing amount of (undiscounted) prize
Gives path with (discounted) reward ~ prize
Of course, min-excess gives min-distance

But, may be better off approximating excess
Improvement on k-TSP:
Approximate Excess


Recall discounted reward at v is γ^(e_v) p_v
Prefix of optimum discounted reward path:




Find a path of approximately (3 times)
minimum excess and prize R/2


Has discounted reward Σ γ^(e_v) p_v > R/2
So has prize Σ p_v > R/2
And has no vertex with large excess
(we can guess R/2)
Excesses at most 3/2, so γ^(e_v) p_v > p_v/8

So discounted reward on found path > R/8
The Downside



Min-excess problem more useful
But harder to solve
Approximating min-distance does not
approximate min-excess
(Figure: the optimum path has length 1+ε, hence excess ε; a path of length 2 has excess 1.)
Exactly Solvable Case:
monotonic paths


Suppose optimum goes through vertices in
strictly increasing distance order from root
Then can find optimum by dynamic program


Just as can solve longest path in an acyclic graph
Build table: is there a monotonic path from v
with length l and prize p?


To answer, look for a u after v with a path of length l − d_vu and prize p − p_v
Works because monotonic path won’t go back
through v
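A sketch of that table in one convenient reformulation (ours): instead of asking whether a (length, prize) pair is achievable, compute the maximum prize of a monotonic path from the root within a length budget. It assumes integer lengths, an all-pairs shortest-path table `dist`, and a prize dict; all names are hypothetical.

```python
from functools import lru_cache


def max_prize_monotone(root, verts, dist, prize, budget):
    """Best prize of a path from `root` that visits vertices in strictly
    increasing distance from `root` and has total length at most `budget`."""
    order = sorted(verts, key=lambda v: dist[root][v])   # monotone vertex order

    @lru_cache(maxsize=None)
    def best(i, remaining):
        v = order[i]
        val = prize[v]                                   # stop here
        for j in range(i + 1, len(order)):
            u = order[j]
            if dist[root][u] > dist[root][v] and dist[v][u] <= remaining:
                val = max(val, prize[v] + best(j, remaining - dist[v][u]))
        return val

    return best(0, budget)                               # order[0] is the root
```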
Dynamic Program
(Figure: example graph with edge lengths, each vertex annotated with the (length, prize) pairs reachable by monotonic paths.)
Approximable case:
wiggly paths


Length of path to v is t_v = d_v + e_v
If e_v > d_v then t_v > e_v > t_v/2


i.e., take twice as long as necessary to reach end
So if approximate t_v to constant factor, also approximate e_v to twice that constant factor
But finding approximately optimum t_v is k-TSP problem
problem

Constant factor approximation known
Decompose into easy cases
(Figure: optimum path split into alternating monotone and wiggly segments.)
Divides into independent problems
> 2/3 of each wiggly path is excess
Decomposition Analysis

2/3 of each wiggly path is excess




That excess accumulates into whole path
So, total excess of wiggly paths upper bounded
by excess of whole path
Conclude total length of wiggly paths upper
bounded by 3/2 of path excess
Use k-TSP to find approximately shortest
wiggles collecting right amount of prize


Approximates length, so approximates excess
Over all wiggly parts, approximates total excess
Dynamic program

For each pair of vertices and each
(discretized) prize value, find




Shortest monotonic path collecting desired prize
Approximately shortest wiggly path collecting
desired prize
Note: polynomially many subproblems
Use dynamic programming to find optimum
pasting together of subproblems
Summary


Showed maximum discount prize can be
approximated by minimum excess path
Showed how to approximate min-excess path



Also solves “orienteering problem”
Also solves “dual” of k-TSP where length is fixed
Also solves “tree versions” of all these problems,
e.g. prize-approximate k-MST
Open Questions

Directed graphs?




We used k-TSP, only solved for undirected
For directed, even standard TSP has no known
constant factor approximation
We only use k-TSP/undirectedness in wiggly parts
Stochastic actions?


Stochastic seems to imply directed
Special case: forget rewards.

Given choice of actions, choose actions to minimize
cover time of graph
Decoding Turbo Codes
via Linear Programming
Linear Programming Relaxations
Basic Coding Problem

Goal: transmit message across noisy channel
that randomly perturbs parts of message



Binary symmetric channel: each transmitted bit
flipped independently with probability p < ½
Approach also works for AWGN channel
Method: introduce redundancy in messages
to cope with perturbations


Compute encoding function mapping each information word ũ ∈ {0,1}^k to codeword ỹ ∈ {0,1}^n
Receive and decode randomly perturbed ŷ
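A binary symmetric channel is easy to simulate; the helper below is ours, just for experimenting with the encoders and decoders discussed later.

```python
import random


def bsc(codeword, p, seed=None):
    """Flip each transmitted bit independently with probability p."""
    rng = random.Random(seed)
    return [bit ^ (rng.random() < p) for bit in codeword]

# e.g. bsc([0, 1, 1, 0, 1, 0], p=0.1)
```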
Definitions
(Figure: block diagram — information word ũ of length k → Encoder → codeword ỹ of length n → Noisy Channel with bit error rate p → corrupt codeword ŷ → Decoder → decoded codeword y and decoded info word u. Word error rate: is u = ũ?)
Decoding

Received perturbed codeword usually not a
codeword.


Need rule for picking “best” decoding possibility
Performance measure: probability of giving
wrong answer


Known as word error rate (WER)
Won’t be zero, since with nonzero probability
channel replaces codeword with a different, valid
codeword (at that point, only natural to answer
with wrong information word)
Maximum Likelihood Decoding

Specific decoding rule





Choose info word whose codeword maximizes
probability of received word Pr(ŷ | y)
Assuming binary symmetric channel errors, this is
just codeword at minimum hamming distance
from received word
More generally, linear cost function on codewords
Under general assumptions, this is the “best”
way to decode
Drawback: generally NP-complete
Turbo Codes

A particular encoding approach



Numerous fast heuristics for decoding



Introduced in 1993 [Berrou, Glavieux, Thitimajshima]
Simple linear time encoding (state machine)
E.g. belief propagation
But may not converge
Fantastic in practice, but unclear why


“distance” of code is bad (multiple codewords with
few different bits: easily confused with each other)
Some asymptotic analysis of “random turbo codes”
Decoding by
Linear Programming

Describe polynomial time linear-programming
approach to decoding (arbitrary) turbo codes



Precise analysis for RA(2) codes



Generalizes to LDPC codes
Certifies when finds ML codeword
Repeat accumulate are special type of turbo
2 means rate ½ : two code bits per info bit
Based on analysis, explanation of how to
build good RA(2) code

Gives code with inverse-polynomial error bound
ML Decoding as
Integer Linear Programming






Given corrupted codeword ŷ, want to find codeword y of minimum Hamming distance
Note |y − ŷ| = Σ_i (1 − 2ŷ_i) y_i + constant (a linear function)
Code C ⊆ {0,1}^n
Polytope P as convex hull of C
ML decoding wants minimum-cost vertex of P
Errors in channel change ŷ, perturb objective
Good code: small objective perturbations
leave optimum vertex unchanged
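On a tiny code the linear objective can be checked directly with an exponential-time decoder (purely illustrative, ours; `is_codeword` is a hypothetical membership test for C):

```python
from itertools import product


def ml_decode(received, is_codeword):
    """Minimise sum_i (1 - 2*received[i]) * y[i] over codewords y, which is
    the same as minimising Hamming distance to the received word."""
    cost = [1 - 2 * r for r in received]
    best, best_val = None, float("inf")
    for y in product((0, 1), repeat=len(received)):
        if is_codeword(y):
            val = sum(c * b for c, b in zip(cost, y))
            if val < best_val:
                best, best_val = y, val
    return best
```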
Linear Programming
Relaxation

Optimizing over P intractable



Generally no tractable way to work with
constraints (facets) defining P
Find a tractable polytope “similar” to P,
optimize over that instead
New polytope will generally have additional,
non-integral vertices
(Figure: relaxing the polytope P to a larger polytope Q ⊇ P.)
Properties of a Good
Relaxation

Relaxed LP should be tractable



Relaxed polytope Q should contain original P



Preferably few constraints
Combinatorial structure to aid solution
Ensures that true optimum is feasible in relaxation
So Q optimum is lower bound on P optimum
Relaxation should be “tight”



Vertices of Q should be “close” to P
Increases hope that optimum in Q will be valid in P
Makes it easier to “round” Q solution into P
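As a generic illustration (not the decoding LP itself, and assuming SciPy is available), relaxing a 0/1 program to its box constraints and solving the LP looks like this; the relaxed optimum bounds the integral optimum from below.

```python
from scipy.optimize import linprog

# Toy integer program: minimise c·x over x in {0,1}^3 with x1 + x2 + x3 <= 2.
# Relaxation Q: keep the linear constraint, replace {0,1} by the box [0, 1].
c = [1.0, -2.0, 3.0]
res = linprog(c, A_ub=[[1, 1, 1]], b_ub=[2],
              bounds=[(0, 1)] * 3, method="highs")
print(res.x, res.fun)   # LP optimum; in general it may sit at a fractional vertex
```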
Repeated Accumulate Codes



Particular type of turbo code
Accumulator: accumulates sum of bits mod 2
RA(m) code





Make m copies of info word (total km bits)
Apply given fixed permutation to bits
Feed resulting sequence through accumulator
Output sequence of accumulator values (km outputs)
We focus on RA(2) code
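A sketch of the RA(2) encoder just described (ours; the interleaver here is a random permutation fixed by a seed, standing in for the code's given fixed permutation):

```python
import random


def ra2_encode(info_bits, perm=None, seed=0):
    """Repeat each info bit twice, permute, then output the accumulator states."""
    repeated = [b for b in info_bits for _ in range(2)]     # m = 2 copies
    if perm is None:
        perm = list(range(len(repeated)))
        random.Random(seed).shuffle(perm)                   # fixed interleaver
    permuted = [repeated[i] for i in perm]
    code, acc = [], 0
    for b in permuted:
        acc ^= b                                            # accumulate mod 2
        code.append(acc)
    return code

# e.g. ra2_encode([0, 1, 1]) gives 6 code bits: two per info bit, rate 1/2
```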
Trellis
(Figure: trellis for an RA(2) code on info bits ũ_1, ũ_2, ũ_3 and code bits ỹ_1, …, ỹ_6.)
Circles represent state of accumulator
Each info bit appears at two transitions
Sequence of states is codeword
Encoding with Trellis
(Figure: encoding example traced on the trellis.)
Encode 011
Output 010110
Decoding with Trellis
(Figure: decoding example on the trellis with received bits marked, e.g. ỹ_1 = 1, ỹ_2 = 0.)
At each step, codeword says which state should transition to
Read off ũ_i label on transition arc
Handling Errors

Not every path through trellis is a codeword



Need both occurrences of each info bit to agree
(cause same transition)
Call such a path agreeable
Want path with fewest transitions not
matching received codeword


Give cost 0 to transitions matching received word,
cost 1 to transitions not matching
Shortest agreeable path is ML codeword
Relaxation


Shortest agreeable path is NP-complete
Relax to min-cost agreeable flow



Flow is convex combination of paths (exponential
number of variables, but tractable)
Alternatively, poly-size LP based on balance of
incoming and outgoing flow
Agreeability is a constraint that certain groups of
transition edges carry same amount of flow
Properties of Relaxation

Any agreeable path is an agreeable flow




Conclude: correct answer is feasible in relaxation
Conclude: if min-cost agreeable flow is a path,
then it is shortest agreeable path
Certificate property: min-cost agreeable flow
decoder will know it has found ML codeword
Relaxation is tractable to solve


Can directly solve via LP
But actually exists reduction to standard min-cost
flow, so can use specialized fast algorithms
Performance


When is correct codeword (path) the optimum
for MCAF?
Intuition from residual graphs in flow


Build new graph by subtracting from each edge’s
capacity amount of flow currently on edge
A minimum cost flow is optimal if and only if there
is no negative cost cycle in the residual graph
(i.e., of positive residual capacity)


Can push flow around cycle, reduce cost of flow
Says local optimum is global optimum
Generalize to MCAF


True codeword is optimal if no negative cost
agreeable cycle in residual graph
What does agreeable cycle look like?




Must diverge from correct codeword path at some
point (traverses other label)
Then traverses same labels as correct for a while
Then may return to codeword path (again by
traversing “opposite” label
Each time traverses opposite label at some bit,
agreeability requires it also traverse opposite label
at “other copy” of that bit
Trellis
(Figure: the same trellis, with the all-zeros codeword path and a deviating cycle.)
Suppose all 0s sent
If cycle uses 1-edge at second layer, must also use 1-edge at 4th layer
An Easier Representation:
the Tanner Graph

Draw path along codeword

Edge cost −1 if received bit flipped, +1 otherwise
Add “matching” edge between vertices corresponding to copies of same info bit

All matching-edge costs 0
Cycles in residual graph correspond to cycles
in this graph G, and have same cost


Hamiltonian edge = use same transition as
codeword
Matching edge = use opposite transition
Picture
(Figure: Tanner graph for the running example — the codeword path through the copies of the ũ_i, with matching edges joining the two copies of each info bit; edge costs +1, −1, and 0 as described above.)
Connection


Circle paths == simple cycles in trellis
Matching edges for agreement
(Figure: side-by-side correspondence between a cycle in the Tanner graph and a cycle in the trellis.)
Main Theorem
The true codeword is the min-cost
agreeable flow solution iff there is no
negative cost cycle in the Tanner graph G
Suggests Good Codes





Recall edge cost is -1 if bit flipped, which
occurs with probability < ½
Intuition: if cycle is large, unlikely that more
than half its edges flip
So, good idea to build graph in which all cycles are large
Achieve by building graph with larger
girth=length of shortest cycle
Erdos gives (Path+Matching) graph with girth
log n (and this is best possible)
Analysis


Theorem: Using RA(2) code from Erdos graph, if p < 2^(−4(ε+log 24)/2) then WER = Pr[negative cycle] < n^(−ε)
Proof:




Negative cycle has length at least log n
Break into subpaths of length log n
One must be negative
What is Pr[negative length-log n path]?


probability at most n^(−2−ε)
Degree 3 graph, so only n^2 paths of length log n
Add up
Experiments
Summary





Combinatorial algorithm for decoding RA(2)
codes
Analysis of its error probability
“Recommended code” based on analysis
That code has polynomially small WER
Experiments show good performance in
practice


But slow, since solving LP
This approach gives insight, not algorithms
Extensions


Discovered that Wainwright’s “Tree Reweighted Max Product” is solving the dual of our LP
Thus, TRMP has same performance


Gives a belief-propagation flavored decoder with same performance as LP decoder
Techniques extend to arbitrary LDPC codes


Relaxation by intersecting “parity polytopes” for
each parity check
LP decoder (but flow decoder doesn’t extend)
Breaking Result [FMSSW]



Analysis of LP decoder for LDPC codes
Proof that when code based on expander
graph, LP decoder handles constant fraction
of bits corrupted
Gives proof of exponentially small error rate
for polynomial-time decoder of LDPC codes.
Learning Markov
Random Fields
Randomized Rounding
Overview





Fundamentals of Markov networks
Maximum likelihood markov network
structure as a maximum hypertree problem
Tree-width and hypertrees
An approximation algorithm for maximum
hypertrees
Reducing maximum hypertrees to/from
maximum likelihood markov networks
Density Estimation




T observations x^1, …, x^T
Each x^t = (x^t_1, …, x^t_n) is a vector of n values
Estimate joint probability distribution
P(X_1, …, X_n) from which samples were taken.
(Assume observations i.i.d)
Maximum likelihood approach

Postulate “best fit” is maximum likelihood
distribution


Distribution that maximizes likelihood of data
To avoid over-fitting, limit choice to within
some (parametric) class.


Equivalent to projecting onto our class to find
“nearest neighbor” of empirical distribution
Our approach applies to this general distribution
projection problem
Markov Random Networks




Representations of joint distributions with
limited dependence.
Variables x1…xn
Graph with variables as vertices
Edges represent dependencies


Conditioned on (any specific values of) its
neighbors, xi is independent of all other variables.
More generally, two sides of any separator are
independent conditioned on separator
Problems

Markov net inference


Markov net learning


Given data and specific Markov net, find
parameter settings that best fit data
Given data, find Markov net structure that best fits
data (under best parameter settings)
For us, “best fits”=“maximum likelihood”
Hammersley Clifford Theorem

Cliques in Markov network are “special”: no
restriction on their marginal distribution

A distribution P is a Markov network over graph G iff P factorizes over the cliques of G:

P(x) ∝ ∏_{h ∈ Clique(G)} φ_h(x_h)

Note φ_h assigns a value to each possible setting x_h of values of variables in clique h
Value of HC




HC gives a “concise” representation of any
MRF probability distribution function
Only need to specify clique potentials
If variables take s values, then potential on size-k clique represented by s^k values
If cliques are small (constant size) then
representation is small (linear size)
Limits of HC

Cannot use HC factorization to compute
important quantities




Normalization of φ (can’t even tell if needed)
Marginal probability distributions (even 1 variable)
Conditional probability distributions (ditto)
Maximum likelihood parameter settings (finding φ to fit data)
Triangulated Markov Networks

No minimal cycles of more than three vertices
(Figure: example triangulated graph on six variables X1, …, X6.)
Benefits of Triangulation

Efficient (linear time) exact calculations





Marginals
Conditional distributions
Just about anything else (via canonical dynamic
program)
Explicit Hammersley-Clifford factorization
Efficient inference: calculation of maximum likelihood φ to fit observed data
Efficient Calculation

Triangulated graph has elimination ordering



Canonical dynamic program



Order of deleting vertices such that when vertex is
deleted, its surviving neighbors form a clique
Run backwards, adds each vertex to a clique
Memoize values on each clique (eg, distribution)
Gives necessary info to add new vertex
Memo-table size exponential in clique sizes

Happy if max clique size small
HC Factorization of
Triangulated Graphs
P_X(x) = ∏_{cliques h} φ_h(x_h)   (as before)

φ_h(x_h) = P(x_h) / ∏_{h' ⊂ h} φ_{h'}(x_{h'})   (new: explicit φ)
ML Inference on Triangulated
Graphs



Fixed network given
Choose parameters (potentials φ) to maximize likelihood of observations
In triangulated network, do so by making
marginals correct on cliques


i.e., want derived P(xh) equal to empirical
distribution P*(xh) for each clique h
Achieve by plugging P* into explicit formula for φ
Inference on Triangulated
Graphs
P_X(x) = ∏_{cliques h} φ*_h(x_h)   (as before)

φ*_h(x_h) = P*(x_h) / ∏_{h' ⊂ h} φ*_{h'}(x_{h'})   (new: explicit φ)
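For the smallest triangulated example, a three-variable chain X0 - X1 - X2 (cliques: the vertices and the two edges), the plug-in rule can be written out directly; the sketch below is ours, with a hypothetical data format of sample tuples.

```python
from collections import Counter


def fit_chain_mrf(samples):
    """ML Markov network for the fixed chain X0 - X1 - X2: plug empirical
    marginals P* into phi_h(x_h) = P*(x_h) / prod over sub-cliques.
    The resulting product is P*(x0,x1) * P*(x1,x2) / P*(x1)."""
    T = len(samples)
    single = [Counter(s[i] for s in samples) for i in range(3)]   # P*(x_i)
    pair01 = Counter((s[0], s[1]) for s in samples)               # P*(x0, x1)
    pair12 = Counter((s[1], s[2]) for s in samples)               # P*(x1, x2)

    def P(x0, x1, x2):
        phi0, phi1, phi2 = (single[0][x0] / T, single[1][x1] / T,
                            single[2][x2] / T)
        phi01 = (pair01[(x0, x1)] / T) / (phi0 * phi1) if phi0 * phi1 else 0.0
        phi12 = (pair12[(x1, x2)] / T) / (phi1 * phi2) if phi1 * phi2 else 0.0
        return phi0 * phi1 * phi2 * phi01 * phi12

    return P

# e.g. model = fit_chain_mrf([(0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0)])
```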
Triangulation

Non-triangulated Markov network can be
triangulated by adding edges


Adding edges removes independence
constraints, so broadens class of models


Find large minimal cycles, add chords
So, only increases “fit” to data (maximum
likelihood)
And, makes computations tractable
Treewidth


Could just add all edges (complete graph is
triangulated)
Drawbacks:



Dynamic programs exponential in clique size
Number of model parameters exponential in
clique size: leads to over-fitting
Treewidth of a graph: minimum over all
triangulations, of the maximum clique size of
the triangulation, minus one
Markov Net Learning



Given data, wish to find MRN of tree-width at
most k that maximizes likelihood of observed
data
Equivalently, since triangulation can only
increase likelihood, wish to find maximum
likelihood triangulated graph with clique size
at most k+1
We call such a graph a k-hyperforest

If maximal, k-hypertree
Computing (Log-)Likelihood
log Pr[x^1, …, x^T] = log ∏_t Pr[x^t] = Σ_t log Pr[x^t]
= Σ_t Σ_{h ∈ Clique(G)} log φ_h(x^t_h)
= T · Σ_{h ∈ Clique(G)} Σ_{x_h} P*(x_h) log φ_h(x_h)
Additive Weights

For triangulated G, HC gives explicit formulation for maximum likelihood φ:

φ*_h(x_h) = P*(x_h) / ∏_{h' ⊂ h} φ*_{h'}(x_{h'})

Key: each φ*_h is independent of graph choice!
Set w_h = Σ_{x_h} P*(x_h) log φ*_h(x_h)
Then log Pr[x^1, …, x^T] = T · Σ_{h ∈ Clique(G)} w_h
New formulation



Max-likelihood value of G is just sum of
weights of cliques it contains
So, given weights on cliques of size up to
k+1, want to find a hypertree (triangulated
graph of treewidth at most k) containing
maximum weight of cliques
This is the maximum hypertree problem
Chow Liu (1968): k=1

Treewidth 1 is just a tree




Edges are cliques of size 2
Weight w_h on an edge turns out to be the mutual information between the variables on its endpoints
Maximum likelihood tree is the maximum spanning tree with w(v, u) = I(X_v ; X_u)
Polynomial time solvable
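A sketch of the k = 1 case as stated above (ours): estimate pairwise mutual information from the data, then take a maximum spanning tree, here with plain Kruskal and union-find; samples are assumed to be tuples of discrete values.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(samples, i, j):
    """Empirical I(X_i; X_j) from a list of sample tuples."""
    T = len(samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    pij = Counter((s[i], s[j]) for s in samples)
    return sum((c / T) * math.log((c / T) / ((pi[a] / T) * (pj[b] / T)))
               for (a, b), c in pij.items())


def chow_liu_tree(samples):
    """Maximum spanning tree under edge weights w(i, j) = I(X_i; X_j)."""
    n = len(samples[0])
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))

    def find(v):                                  # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, w))
    return tree
```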
Larger Treewidths

Theorem: for k >1, maximum hypertree
problem is NP-complete



Reduction from SAT
[S]: conclude ML MRN NP-complete
So, seek approximation algorithm:
Given optimum has value w, find (in polynomial time) some solution with value at least w/a
We give an algorithm with a = 8^k · k! · (k+1)!
“Constant” for any fixed k
Idea:
Locally Testable Structure




“Treewidth k” is a global constraint, hard to
aim for
Define windmill, an object with local
characterization
Every hypertree contains a windmill with at
least 1/(k+1)! of its weight
Algorithm to find windmill of approximately (factor 1/(8^k k!)) maximum weight
Star graphs
Covering 11 of the 15 edges
of a tree with disjoint stars:
Partitioning a tree into two
sets of disjoint stars
Conclusion: some set of disjoint stars contains
at least 1/2 the edge weight of any given tree.
Windmills




A k-windmill is
defined by a
depth-k rooted tree
Its hyperedges are
all the paths from
the root
It has treewidth k
1-windmill is star
Windmill Theorem


Windmill farm: collection of disjoint windmills
Theorem: any weighted k-hypertree contains
a k-windmill-farm with at least a 1/(k+1)!
fraction of the weight



k-color the hypertree so no edge has repeated
colors
All edges that get colors “in same order” form a
windmill farm
Only k! orders
Idea: Randomized Rounding

Already saw idea of ILP relaxations



Define integer linear program solving problem (not
convex)
Ignore integrality constraints; solve LP (convex)
Now, want integral solution


Round fractional solution to integral by setting
variable of value 0 < x < 1 equal to 1 with
probability x and 0 otherwise
Gives integer solution where all constraints
satisfied in expectation
Special Case:
Layered 2-windmills


Variable y_uv set to 1 if u is root and v is child of u
Variable z_uvx set to 1 if x is a child of v, which is a child of u

Objective function: maximize Σ w_uvx z_uvx
Consistency constraint: z_uvx ≤ y_uv
Single-parent constraint: Σ_u y_uv ≤ 1

Rounding





Consider constraint Σ_u y_uv ≤ 1
Means the y_uv form a probability distribution on choices of parent for v
Choose parents according to this distribution
Objective function was Σ w_uvx z_uvx
We can “keep” w_uvx if y_uv was set to 1
So expected kept value is Σ w_uvx y_uv ≥ Σ w_uvx z_uvx


i.e., rounded solution matches LP objective value!
(white lie)
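A sketch of just the rounding step (ours; the LP values are assumed to arrive as a dict y keyed by (u, v) pairs): each vertex samples its parent from the distribution the LP provides.

```python
import random


def round_parents(y, seed=0):
    """For each v, sum_u y[(u, v)] <= 1, so the y values plus a 'no parent'
    slack form a distribution over v's parent; draw from it."""
    rng = random.Random(seed)
    parent = {}
    for v in {v for (_, v) in y}:
        r, acc = rng.random(), 0.0
        parent[v] = None                       # slack: no parent chosen
        for (u, vv), p in y.items():
            if vv != v:
                continue
            acc += p
            if r < acc:
                parent[v] = u
                break
    return parent
```

In expectation each w_uvx is kept with probability y_uv ≥ z_uvx, which is the calculation on the slide (up to the "white lie" about x's own parent).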
Open problems

We have a constant factor approximation for
constant k, but it’s a pretty bad constant!





Two separate reasons for badness
Windmill theorem may be very loose
(but we have examples of gap exceeding k)
Better windmill approximations?
Multilevel facility location?
Use idea of “restricted, locally testable class of MRNs” to find other, practical and tractable subsets
Hardness of approximation?
Conclusion
Near Neighbor Structure

Concepts





Random sampling
Memoization
Greedy improvement
Local search
Impact




Simple
Good constants
Good experimental performance
So likely useful in practice
One-shot MDPs

Concepts




Approximation Algorithms
Reductions to related problems
Dynamic Programming
Impact




Solved an open theory problem
Unlikely to be a good practical performer
But suggests heuristics
And poses good next questions
Turbo Decoding

Concepts




LP relaxations
Flows
Graph theory
Impact



Probably don’t want to use LP in DSP chips
Strengthens theoretical grounds/explanations for
good performance of certain codes
Provides insight for improving codes and
decoding algorithms
Hypertrees

Concepts


Randomized rounding of LP relaxations
Impact



Unlikely to be useful in practice
Possible application to branch and bound
optimization methods
Suggests some useful “special” classes of low-treewidth graphs (windmills) that may be more tractable to construct
Working with Algorithms


Many NIPS problems would be trivial given
infinite computational resources
Good algorithms can simulate such resources



To apply algorithmic methods, want well-defined algorithmic problem


Large collection of available tools
Large collection of people who know them
Avoid defining problem via algorithmic solution
Dialogue can help define true problem
Even oversimplified solution may give insight
Acknowledgements

Near Neighbor Searching
With Matthias Ruhl

MDPs
With Avrim Blum, Shuchi Chawla, Terran Lane, Adam Meyerson, Maria Minkoff

Turbo Decoding
With Jon Feldman, Martin Wainwright

Learning Markov Models
With Nati Srebro
Conclusion


Lots of interesting NIPS problems!
Techniques from theoretical computer
science can be applied



Toolbox of prior approximation algorithms
Combinatorial structure of problems
Wanted: more problems

Value in both definition and solution
http://theory.lcs.mit.edu/~karger/Talks/NIPS.ppt