Preconditioning in Expectation
Richard Peng
MIT
Joint with Michael Cohen (MIT), Rasmus Kyng (Yale),
Jakub Pachocki (CMU), and Anup Rao (Yale)
CMU theory seminar, April 5, 2014
RANDOM SAMPLING
• Collection of many objects
• Pick a small subset of them
GOALS OF SAMPLING
• Estimate quantities
• Approximate higher
dimensional objects
• Use in algorithms
SAMPLE TO APPROXIMATE
• ε-nets / cuttings
• Sketches
• Graphs
• Gradients
This talk: matrices
NUMERICAL LINEAR ALGEBRA
• Linear system in n x n matrix
• Inverse is dense
• [Concus-Golub-O'Leary `76]:
incomplete Cholesky, drop entries
HOW TO ANALYZE?
• Show sample is good
• Concentration bounds
• Scalar: [Bernstein `24] [Chernoff `52]
• Matrices: [AW`02] [RV`07] [Tropp `12]
THIS TALK
• Directly show algorithm
using samples runs well
• Better bounds
• Simpler analysis
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
HOW TO DROP ENTRIES?
• Entry-based representation is hard to work with
• Group entries together
• Symmetric with positive entries:
adjacency matrix of a graph
SAMPLE WITH GUARANTEES
• Sample edges in graphs
• Goal: preserve size of all cuts
• [BK`96] graph sparsification
• Generalization of expanders
DROPPING ENTRIES/EDGES
• L: graph Laplacian
• 0-1 vector x: |x|_L^2 = size of the cut between the 0s and the 1s
Unit weight case:
|x|_L^2 = Σ_uv (x_u – x_v)^2
Matrix norm:
|x|_P^2 = x^T P x
DECOMPOSING A MATRIX
L = Σ_uv L_uv, where edge uv contributes the 2×2 block
[ 1 -1 ]
[ -1 1 ]
on rows/columns u and v, giving |x|_L^2 = Σ_uv (x_u – x_v)^2
• Sample based on positive representations
• P = Σ_i P_i, with each P_i P.S.D.
• Graphs: one P_i per edge
• P.S.D.: multi-variate version of positive
(A small numpy sketch of this decomposition follows below.)
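A minimal numpy sketch of this decomposition, assuming a unit-weight 4-cycle as an illustrative graph (not one from the talk): the Laplacian is assembled as a sum of per-edge P_i blocks, and x^T L x counts the edges cut by a 0-1 vector x.

import numpy as np

# Unit-weight 4-cycle (illustrative graph).
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# L = sum over edges uv of the 2x2 block [[1, -1], [-1, 1]] placed on rows/columns u, v.
L = np.zeros((n, n))
for u, v in edges:
    P_uv = np.zeros((n, n))
    P_uv[u, u] = P_uv[v, v] = 1.0
    P_uv[u, v] = P_uv[v, u] = -1.0
    L += P_uv

# For a 0-1 indicator x: x^T L x = sum over edges of (x_u - x_v)^2 = number of cut edges.
x = np.array([1.0, 1.0, 0.0, 0.0])   # cuts edges (1,2) and (3,0)
print(x @ L @ x)                     # 2.0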
MATRIX CHERNOFF BOUNDS
P = Σ_i P_i, with each P_i P.S.D.
Can sample Q with O(n log n ε^{-2}) rescaled P_i's s.t.
P ≼ Q ≼ (1 + ε) P
≼ : Loewner's partial ordering,
A ≼ B ⇔ B – A positive semidefinite
CAN WE DO BETTER?
• Yes, [BSS `12]: O(n ε^{-2}) is possible
• Iterative, cubic time construction
• [BDM `11]: extends to general matrices
DIRECT APPLICATION
Find Q very close to P
Solve problem on Q
Return answer
For ε accuracy, need P ≼ Q ≼ (1 + ε) P
Size of Q depends inversely on ε
ε^{-1} is the best that we can hope for
USE INSIDE ITERATIVE METHODS
Find Q somewhat similar to P
Solve problem on P
using Q as a guide
• [AB `11]: crude samples
give good answers
• [LMP `12]: extensions to
row sampling
ALGORITHMIC VIEW
• Crude approximations are ok
• But need to be efficient
• Can we use [BSS `12]?
SPEED UP [BSS `12]
• Expander graphs, and more
• ‘i.i.d. sampling’ variant related
to the Kadison-Singer problem
MOTIVATION
• One dimensional sampling:
• moment estimation,
• pseudorandom generators
• Rarely need w.h.p.
• Dimensions should be disjoint
MOTIVATION
• Randomized coordinate descent
for electrical flows [KOSZ`13,LS`13]
• ACDM from [LS `13] improves
various numerical routines
RANDOMIZED COORDINATE DESCENT
• Related to stochastic optimization
• Known analyses when Q = Pj
• [KOSZ`13][LS`13] can be viewed
as ways of changing bases
OUR RESULT
For numerical routines, a random
Q gives the same performance as
[BSS`12], in expectation
IMPLICATIONS
• Similar bounds to ACDM from [LS `13]
• Recursive Chebyshev iteration
([KMP`11]) runs faster
• Laplacian solvers in ~ m log^{1/2} n time
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ITERATIVE METHODS
Find Q s.t. P ≼ Q ≼ 10 P
Use Q as guide to solve
problem on P
• [Gauss, 1823] Gauss-Seidel iteration
• [Jacobi, 1845] Jacobi iteration
• [Hestenes-Stiefel `52] conjugate gradient
[RICHARDSON 1910]
x(t + 1) = x(t) + (b – Px(t))
• Fixed point: b – Px(t) = 0
• Each step: one matrix-vector multiplication
ITERATIVE METHODS
• Multiplication is easier than division,
especially for matrices
• Use verifier to solve problem
1D CASE
Know: 1/2 ≤ p ≤ 1,
so 1 ≤ 1/p ≤ 2
• 1 is a ‘good’ estimate
• Bad when p is far from 1
• Estimate of error: 1 - p
ITERATIVE METHODS
• 1 + (1 – p) = 2 – p is more accurate
• Two terms of Taylor expansion
• Can take more terms
ITERATIVE METHODS
1/p = 1 + (1 – p) + (1 – p)^2 + (1 – p)^3 + …
Generalizes to matrix settings:
P^{-1} = I + (I – P) + (I – P)^2 + …
[RICHARDSON 1910]
x(0) = Ib
x(1) = (I + (I – P))b
x(2) = (I + (I – P)(I + (I – P)))b
…
x(t + 1) = b + (I – P) x(t)
• Error of x(t): (I – P)^t b
• Geometric decrease if P is close to I
(A minimal sketch of this iteration follows below.)
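A minimal numpy sketch of this iteration, assuming the normalization 1/2 I ≼ P ≼ I used later in the talk; the random P, the right-hand side b, and the iteration count are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# A random P with eigenvalues in [1/2, 1], so that |I - P|_2 <= 1/2.
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
P = U @ np.diag(rng.uniform(0.5, 1.0, n)) @ U.T
b = rng.standard_normal(n)

# Richardson iteration: x(t+1) = b + (I - P) x(t); error shrinks by |I - P|_2 per step.
x = np.zeros(n)
for _ in range(100):
    x = b + x - P @ x

print(np.linalg.norm(x - np.linalg.solve(P, b)))   # essentially machine precision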
OPTIMIZATION VIEW
Residue: r(t) = x(t) – P^{-1}b
Error: |r(t)|_2^2
• Quadratic potential function
• Goal: walk down to the bottom
• Direction given by gradient
DESCENT STEPS
• Step from x(t) to x(t+1) may overshoot
• Need a smooth function
MEASURE OF SMOOTHNESS
x(t + 1) = b + (I – P) x(t)
Note: b = PP^{-1}b
r(t + 1) = (I – P) r(t)
|r(t + 1)|_2 ≤ |I – P|_2 |r(t)|_2
MEASURE OF SMOOTHNESS
• |I – P|_2: smoothness of |r(t)|_2^2
• Distance between P and I
• Related to eigenvalues of P
1/2 I ≼ P ≼ I ⇒ |I – P|_2 ≤ 1/2
MORE GENERAL
• Convex functions
• Smoothness / strong convexity
This talk: only quadratics
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ILL-POSED PROBLEMS
Example: P = diag(0.8, 0.1)
• Smoothness differs across directions
• Progress limited by the steeper parts
PRECONDITIONING
• Solve a similar problem Q
• Transfer steps across to P
PRECONDITIONED RICHARDSON
• Optimal step down the energy function of Q given by Q^{-1}
• Equivalent to solving Q^{-1}Px = Q^{-1}b
PRECONDITIONED RICHARDSON
x(t + 1) = x(t) + Q^{-1}(b – Px(t))
Residue:
r(t + 1) = (I – Q^{-1}P) r(t)
|r(t + 1)|_P = |(I – Q^{-1}P) r(t)|_P
(A sketch of this iteration follows below.)
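A minimal sketch of preconditioned Richardson, assuming Q is available explicitly and applied by a direct solve; the matrices P and Q below are illustrative, not the talk's construction.

import numpy as np

def preconditioned_richardson(P, Q, b, iters=50):
    """x(t+1) = x(t) + Q^{-1}(b - P x(t)); converges when I - Q^{-1}P is a contraction."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + np.linalg.solve(Q, b - P @ x)
    return x

# Illustrative example: Q is a mild perturbation of P, so P ≼ Q ≼ (1 + small) P.
rng = np.random.default_rng(1)
n = 30
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)
Q = P + 0.05 * np.diag(np.diag(P))
b = rng.standard_normal(n)
print(np.linalg.norm(preconditioned_richardson(P, Q, b) - np.linalg.solve(P, b)))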
CONVERGENCE
Improvements depend on |I – P^{1/2}Q^{-1}P^{1/2}|_2
• If P ≼ Q ≼ 10 P, error halves in O(1) iterations
• How to find a good Q?
MATRIX CHERNOFF
P = Σ_i P_i
Q = Σ_i s_i P_i, where s has small support
• Take O(n log n) (rescaled) P_i's with probability ~ trace(P_i P^{-1})
• Matrix Chernoff ([AW`02], [RV`07]): w.h.p. P ≼ Q ≼ 2P
Note: Σ_i trace(P_i P^{-1}) = n
(A sampling sketch follows below.)
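A minimal numpy sketch of this sampling scheme, assuming P = Σ_i a_i a_i^T with rank-one P_i's: probabilities proportional to trace(P_i P^{-1}) (leverage scores), roughly n log n ε^{-2} i.i.d. rescaled samples, then a spectral check. The data and the leading constant are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

# P = sum_i P_i with each P_i = a_i a_i^T rank-one and P.S.D. (illustrative data).
n, m = 20, 400
A = rng.standard_normal((m, n))
P = A.T @ A

# Probabilities ~ trace(P_i P^{-1}) = a_i^T P^{-1} a_i; these sum to n.
Pinv = np.linalg.inv(P)
tau = np.einsum('ij,jk,ik->i', A, Pinv, A)
p = tau / tau.sum()

# Keep ~ n log n / eps^2 i.i.d. samples, rescaling each kept P_i by 1/(N p_i).
eps = 0.5
N = int(4 * n * np.log(n) / eps**2)
idx = rng.choice(m, size=N, p=p)
Q = sum(np.outer(A[i], A[i]) / (N * p[i]) for i in idx)

# Relative approximation: eigenvalues of P^{-1/2} Q P^{-1/2} should all be near 1.
w, V = np.linalg.eigh(P)
Pih = V @ np.diag(w ** -0.5) @ V.T
print(np.linalg.eigvalsh(Pih @ Q @ Pih))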
WHY THESE PROBABILITIES?
Example: P = diag(0.8, 0.1)
• trace(P_i P^{-1}): matrix 'dot product'
• If P is diagonal: trace(P_i P^{-1}) = 1 for all i
• Need all entries
Overhead of concentration:
union bound on dimensions
IS CHERNOFF NECESSARY?
Example: P = [[1, 0], [0, 1]], sample Q = [[1, 0], [0, 0]]
• P: diagonal matrix
• Missing one entry: unbounded approximation factor
BETTER CONVERGENCE?
• [Kaczmarz `37]: random projections
onto small subspaces can work
• Better (expected) behavior than
what matrix concentration gives!
HOW?
• Will still make progress in good directions
• Can have (finite) badness if they are orthogonal to the goal
QUANTIFY DEGENERACIES
• Have some D ≼ P 'for free'
• D = λ_min(P) I (minimum eigenvalue)
• D = a tree when P is a graph
• D = crude approximation / rank certificate
REMOVING DEGENERACIES
• 'Padding' to remove degeneracy
• If D ≼ P and 0.5 P ≼ Q ≼ P, then
0.5 P ≼ D + Q ≼ 2P
ROLE OF D
• Implicit in proofs of matrix Chernoff, as well as [BSS`12]
• Splitting of P in numerical analysis
• D and P can be very different
MATRIX CHERNOFF
• Let D ≼ 0.1 P, t = trace(PD^{-1})
• Take O(t log n) samples with probability ~ trace(P_i D^{-1})
• Q ← D + (rescaled) samples
• W.h.p. P ≼ Q ≼ 2 P
WEAKER REQUIREMENT
Q only needs to do well in
some directions, on average
EXPECTED CONVERGENCE
• Let t = trace(PD^{-1})
• Take rand[t, 2t] samples, w.p. ~ trace(P_i D^{-1})
• Add (rescaled) results to D to form Q
There exists a constant c s.t. for any r,
E[|(I – c Q^{-1}P) r|_P] ≤ 0.99 |r|_P
(A sketch of this construction follows below.)
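A minimal sketch of this construction, assuming P = Σ_i a_i a_i^T, taking D = 0.1 λ_min(P) I as the matrix available 'for free', and using an untuned step size c = 0.1; it only illustrates the shape of the scheme, not the paper's exact constants.

import numpy as np

rng = np.random.default_rng(3)

# P = sum_i a_i a_i^T; a crude D ≼ 0.1 P chosen as a multiple of the identity.
n, m = 20, 300
A = rng.standard_normal((m, n))
P = A.T @ A
d = 0.1 * np.linalg.eigvalsh(P)[0]
D = d * np.eye(n)

# Probabilities ~ trace(P_i D^{-1}) = |a_i|^2 / d; t = trace(P D^{-1}).
scores = (A * A).sum(axis=1) / d
t = int(np.trace(P) / d)
p = scores / scores.sum()

# Q = D + rand[t, 2t] rescaled samples, used as a preconditioner for one step.
k = int(rng.integers(t, 2 * t + 1))
idx = rng.choice(m, size=k, p=p)
Q = D + sum(np.outer(A[i], A[i]) * scores.sum() / (k * scores[i]) for i in idx)

# One step of preconditioned Richardson with a small fixed step size c.
c = 0.1
r = rng.standard_normal(n)
r_new = r - c * np.linalg.solve(Q, P @ r)
print(np.sqrt(r_new @ P @ r_new) / np.sqrt(r @ P @ r))   # typically below 1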
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ASIDE
Matrix Chernoff
• f(Q) = exp(P^{-1/2}(P – Q)P^{-1/2})
• Show decrease in
relative eigenvalues
Iterative methods:
• f(x) = |x – P^{-1}b|_P
• Show decrease in
distance to solution
Goal: combine these analyses
SIMPLIFYING ASSUMPTIONS
• P = I (by normalization)
• tr(P_i D^{-1}) = 0.1, 'unit weight'
• Expected value of picking a P_i at random: (1/t) I
DECREASE
Step: r' = (I – Q^{-1}P)r = (I – Q^{-1})r
New error: |r'|_P = |(I – Q^{-1})r|_2
Expand:
|r'|_2^2 – |r|_2^2 = –2|r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2
(A quick numerical check of this expansion follows below.)
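A quick numpy check of this expansion under the normalization P = I, with an arbitrary positive definite Q and a random r.

import numpy as np

rng = np.random.default_rng(4)

# With P = I, the step is r' = (I - Q^{-1}) r.
n = 10
B = rng.standard_normal((n, n))
Q = B @ B.T + np.eye(n)          # an arbitrary positive definite Q
Qinv = np.linalg.inv(Q)
r = rng.standard_normal(n)

r_new = r - Qinv @ r
lhs = r_new @ r_new - r @ r
rhs = -2 * (r @ Qinv @ r) + r @ Qinv @ Qinv @ r   # -2|r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2
print(np.isclose(lhs, rhs))      # True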
DECREASE:
–2|r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2 ≤ –0.1 |r|_2^2
• I ≼ Q ≼ 1.1 I would imply:
• 0.9 I ≼ Q^{-1}
• Q^{-2} ≼ I
• But also Q^{-3} ≼ I, etc.
• Don't need the 3rd moment
RELAXATIONS
E_Q[–2|r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2]
• Only need Q^{-1} and Q^{-2}
• By linearity, suffices to:
• Lower bound E_Q[Q^{-1}]
• Upper bound E_Q[Q^{-2}]
TECHNICAL RESULT
Assumption: Σ_i P_i = I,
trace(P_i D^{-1}) = 0.1
• Let t = trace(D^{-1})
• Take rand[t, 2t] uniform samples
• Add (rescaled) results to D to form Q
• 0.9 I ≼ E[Q^{-1}]
• E[Q^{-2}] ≼ O(1) I
E[Q^{-1}]
• 0.5 I ≼ E[Q^{-1}] follows from the matrix
arithmetic-harmonic mean inequality ([ST`94])
• Need: upper bound on E[Q^{-2}]
E[Q^{-2}] ≼ O(1) I?
(figure: Q_j^{-1} and Q_j^{-2} tracked over samples j = 0, …, t, …, 2t)
• Q^{-2} is the gradient of Q^{-1}
• More careful tracking of Q^{-1} gives info on Q^{-2} as well!
TRACKING Q^{-1}
• Q: start from D, add rand[t, 2t] random (rescaled) P_i's
• Track the inverse of Q under rank-1 perturbations
Sherman-Morrison formula:
(A + M)^{-1} = A^{-1} – A^{-1} M A^{-1} / (1 + tr(A^{-1} M))
(A quick numerical check follows below.)
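A quick numpy check of the rank-one case of Sherman-Morrison stated above (for M = u u^T, tr(A^{-1}M) = u^T A^{-1} u); the matrices are arbitrary.

import numpy as np

rng = np.random.default_rng(5)

# Rank-one update M = u u^T of a positive definite A.
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)
u = rng.standard_normal(n)
M = np.outer(u, u)

# (A + M)^{-1} = A^{-1} - A^{-1} M A^{-1} / (1 + tr(A^{-1} M)).
Ainv = np.linalg.inv(A)
update = Ainv - (Ainv @ M @ Ainv) / (1 + np.trace(Ainv @ M))
print(np.allclose(update, np.linalg.inv(A + M)))   # True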
BOUNDING Q^{-1}: DENOMINATOR
Current matrix: Q_j, sample: R
Q_{j+1}^{-1} = Q_j^{-1} – Q_j^{-1} R Q_j^{-1} / (1 + tr(Q_j^{-1} R))
• D ≼ Q_j ⇒ Q_j^{-1} ≼ D^{-1}
• tr(Q_j^{-1}R) ≤ tr(D^{-1}R) ≤ 0.1 for any R, so
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – 0.9 Q_j^{-1} E[R] Q_j^{-1}
BOUNDING Q^{-1}: NUMERATOR
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – 0.9 Q_j^{-1} E[R] Q_j^{-1}
• R: a random rescaled P_i
• Assumption: each sample has E[R] = (1/t) P = (1/t) I
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
AGGREGATION
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
D = Q_0 → Q_1 → Q_2 → …
• Q_j is also random
• Need to aggregate the choices of R into a bound on E[Q_j^{-1}]
HARMONIC SUMS
HrmSum(X, a) = 1/(1/X + 1/a)
• Use harmonic sums of matrices
• Matrix functionals
• Similar to the Stieltjes transform in [BSS`12]
• Proxy for the -2th power
• Well behaved under expectation:
E_X[HrmSum(X, a)] ≤ HrmSum(E[X], a)
(see the check below)
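A quick scalar check of the 'well behaved under expectation' property: X ↦ HrmSum(X, a) is concave, so by Jensen's inequality E[HrmSum(X, a)] ≤ HrmSum(E[X], a). The distribution of X here is arbitrary.

import numpy as np

rng = np.random.default_rng(6)

def hrm_sum(x, a):
    """Harmonic sum 1 / (1/x + 1/a)."""
    return 1.0 / (1.0 / x + 1.0 / a)

# Concavity means averaging first can only give a smaller value.
a = 2.0
X = rng.exponential(scale=3.0, size=100_000)   # arbitrary positive random variable
print(hrm_sum(X, a).mean() <= hrm_sum(X.mean(), a))   # True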
HARMONIC SUM
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
|u|_{M^{-1}}^2 – (1/t)|u|_{M^{-2}}^2 ≤ HrmSum(|u|_{M^{-1}}^2, t)
E[|u|_{Q_{j+1}^{-1}}^2] ≤ 0.1 |u|_{Q_j^{-1}}^2 + 0.9 HrmSum(|u|_{Q_j^{-1}}^2, t)
Initial condition + telescoping sum
gives E[Q_t^{-1}] ≼ O(1) I
E[Q^{-2}] ≼ O(1) I
(figure: Q_j^{-1} and Q_j^{-2} tracked over j = 0, …, t, …, 2t)
• Q^{-2} is the gradient of Q^{-1}:
(0.9/t) Q_j^{-2} ≼ Q_j^{-1} – E_R[Q_{j+1}^{-1}]
• (0.9/t) Σ_{j=t}^{2t-1} E[Q_j^{-2}] ≼ E[Q_t^{-1}] – E[Q_{2t}^{-1}] ≼ O(1) I
• A random j from [t, 2t] is good!
SUMMARY
Un-normalize:
• 0.5 P ≼ E[PQ^{-1}P]
• E[PQ^{-1}PQ^{-1}P] ≼ 5P
One step of preconditioned Richardson:
E[|(I – 0.1 Q^{-1}P) r|_P^2]
= |r|_P^2 – 0.2 E[|r|_{PQ^{-1}P}^2] + 0.01 E[|r|_{PQ^{-1}PQ^{-1}P}^2]
≤ |r|_P^2 – 0.1 |r|_P^2 + 0.05 |r|_P^2 = 0.95 |r|_P^2
MORE GENERAL
• Works for some convex functions
• Sherman-Morrison replaced by
inequality, primal/dual
FUTURE WORK
• Expected convergence of
• Chebyshev iteration?
• Conjugate gradient?
• Same bound without D
(using pseudo-inverse)?
• Small error settings
• Stochastic optimization?
• More moments?
THANK YOU!
Questions?