Numerical Linear Algebra in the Streaming Model

Ken Clarkson - IBM
David Woodruff - IBM
The Problems
Given an n x d matrix A and an n x d’ matrix B, we want estimators for
• The matrix product A^T B
• The matrix X* minimizing ||AX-B||
– A slightly generalized version of linear regression
• Given an integer k, the matrix A_k of rank k minimizing ||A-A_k||
• We consider the Frobenius matrix norm: the square root of the sum of
squares of the entries
General Properties of Our Algorithms
• 1 pass over matrix entries, given in any order (allow
multiple updates)
• Maintain compressed versions or “sketches” of matrices
• Do small work per entry to maintain the sketches
• Output result using the sketches
• Randomized approximation algorithms
• Since we minimize space complexity, we restrict matrix
entries to O(log(nc)) or O(log(nd)) bits
Matrix Compression Methods
• In a line of similar efforts…
– Element-wise sampling [AM01], [AHK06]
– Row / column sampling: pick small random
subset of the rows, columns, or both [DK01],
[DKM04], [DMM08]
– Sketching / Random Projection: maintain a
small number of random linear combinations
of rows or columns [S06]
– Usually more than 1 pass
• Here: sketching
Outline
• Matrix Product
• Linear Regression
• Low-rank approximation
An Optimal Matrix Product Algorithm
• A and B have n rows and a total of c columns, and we want to
estimate A^T B so that ||Est - A^T B|| ≤ ε ||A|| ||B||
• Let S be an n x m sign (Rademacher) matrix
– Each entry is +1 or -1 with probability ½
– m small, set to O(ε^{-2} log(1/δ))
– Entries are O(log(1/δ))-wise independent
• Observation:
– E[A^T S S^T B / m] = A^T E[S S^T] B / m = A^T B
• Wouldn’t it be nice if all the algorithm did was maintain
S^T A and S^T B, and output A^T S S^T B / m?
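To make the estimator concrete, here is a minimal numpy sketch (not the authors' implementation; the dimensions, the constant inside m, and the fully independent sign matrix S are illustrative assumptions): maintain S^T A and S^T B, then report (A^T S)(S^T B)/m.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d2, eps = 1000, 20, 30, 0.1
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

m = int(6 / eps**2)                          # m = O(eps^-2 log(1/delta)); constant is illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))     # sign (Rademacher) matrix

StA, StB = S.T @ A, S.T @ B                  # the only sketches the algorithm stores
est = StA.T @ StB / m                        # estimate of A^T B

err = np.linalg.norm(est - A.T @ B)          # Frobenius norm of the error
print(err <= eps * np.linalg.norm(A) * np.linalg.norm(B))   # typically True
```

The observation above says the estimate is unbiased; the tail estimate on the next slide controls how far it typically strays.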
An Optimal Matrix Product Algorithm
• This does work, and we are able to improve the previous
dependence on m:
• New Tail Estimate: for δ, ε > 0, there is m = O(ε^{-2} log(1/δ)) so that
Pr[ ||A^T S S^T B / m - A^T B|| > ε ||A|| ||B|| ] ≤ δ
(again ||C|| = [Σ_{i,j} C_{i,j}^2]^{1/2})
• Follows from bounding the O(log(1/δ))-th moment of
||A^T S S^T B / m - A^T B||
Efficiency
• Easy to maintain sketches given updates (see the update sketch below)
– O(m) time per update, O(mc log(nc)) bits of space for S^T A and S^T B
– Improves Sarlos’ algorithm by a log c factor
– Sarlos’ algorithm is based on the JL Lemma
• JL preserves all entries of A^T B up to an additive error,
whereas we only preserve the overall error
– Can compute [A^T S][S^T B]/m via fast rectangular matrix multiplication
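A hedged sketch of the streaming update: an update (i, j, delta), meaning A[i, j] += delta, touches only column j of S^T A, so the work is O(m). Storing S explicitly as below is an illustrative shortcut; the algorithm instead generates the O(log(1/δ))-wise independent entries of S on the fly to stay within O(mc log(nc)) bits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 1000, 20, 100
S = rng.choice([-1.0, 1.0], size=(n, m))

StA = np.zeros((m, d))                 # the sketch we maintain

def update(i, j, delta):
    """Apply A[i, j] += delta to the sketch: O(m) work per update."""
    StA[:, j] += delta * S[i, :]

# feed the entries of some matrix A in arbitrary order, repeats allowed
A = rng.standard_normal((n, d))
for i in range(n):
    for j in range(d):
        update(i, j, A[i, j])

print(np.allclose(StA, S.T @ A))       # True: the sketch equals S^T A
```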
Matrix Product Lower Bound
• Our algorithm is space-optimal for constant δ
– a new lower bound
• Reduction from a communication game
– Augmented Indexing, players Alice and Bob
– Alice has a random x ∈ {0,1}^s
– Bob has a random i ∈ {1, 2, …, s}
• also x_{i+1}, …, x_s
• Alice sends Bob one message
• Bob should output x_i with probability at least 2/3
• Theorem [MNSW]: the message must be Ω(s) bits on average
Lower Bound Proof
• Set s := Θ(c ε^{-2} log(cn))
• Alice makes matrix U
– Uses x_1 … x_s
• Bob makes matrices U’ and B
– Uses i and x_{i+1}, …, x_s
• Alg input will be A := U + U’ and B
– A and B are n x c/2
• Alice:
– Runs streaming matrix product Alg on U
– Sends Alg state to Bob
– Bob continues Alg with A := U + U’ and B
• A^T B determines x_i with probability at least 2/3
– By choice of U, U’, B
– Solving Augmented Indexing
• So the space of Alg must be Ω(s) = Ω(c ε^{-2} log(cn)) bits
Lower Bound Details
• U = [U^{(1)}; U^{(2)}; …; U^{(log(cn))}; 0s]
– Each U^{(k)} is a Θ(ε^{-2}) x c/2 submatrix with entries in {-10^k, 10^k}
– U^{(k)}_{i,j} = 10^k if the matched entry of x is 0, else U^{(k)}_{i,j} = -10^k
• Bob’s index i corresponds to U^{(k*)}_{i*,j*}
• U’ is such that A = U + U’ = [U^{(1)}; U^{(2)}; …; U^{(k*)}; 0s]
– U’ is determined from x_{i+1}, …, x_s
• A^T B is the i*-th row of U^{(k*)}
• ||A|| ≈ ||U^{(k*)}|| since the entries of A are geometrically increasing
• ε^2 ||A||^2 ||B||^2, the squared error, is small, so most entries of the
approximation to A^T B have the correct sign
Outline
• Matrix Product
• Linear Regression
• Low-rank approximation
Linear Regression
• The problem: min_X ||AX-B||
• The X* minimizing this is X* = A^- B, where A^- is the pseudoinverse of A
• Every matrix A = U Σ V^T via the singular value decomposition (SVD)
– If A is n x d of rank k, then
• U is n x k with orthonormal columns
• Σ is a k x k diagonal matrix with positive diagonal
• V is d x k with orthonormal columns
• A^- = V Σ^{-1} U^T
• Normal Equations: A^T A X = A^T B for the optimal X
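A small numpy check of these facts (the random matrices and dimensions are illustrative assumptions): build A^- from the SVD and verify that X* = A^- B satisfies the normal equations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d2 = 50, 5, 3
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T
A_pinv = Vt.T @ np.diag(1.0 / sigma) @ U.T             # A^- = V Sigma^{-1} U^T
X_star = A_pinv @ B                                    # minimizer of ||AX - B||

print(np.allclose(A.T @ A @ X_star, A.T @ B))          # normal equations hold
print(np.allclose(X_star, np.linalg.pinv(A) @ B))      # agrees with numpy's pseudoinverse
```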
Linear Regression
• Let S be an n x m sign matrix, m = O(d ε^{-1} log(1/δ))
• The algorithm (sketched in code below):
– Maintain S^T A and S^T B
– Return X’ solving min_X ||S^T(AX-B)||
– Space is O(d^2 ε^{-1} log(1/δ)) words
– Improves Sarlos’ space by a log c factor
– Space is optimal via a new lower bound
• Main claim: with probability at least 1-δ,
||AX’-B|| ≤ (1+ε)||AX*-B||
– That is, the relative error of X’ is small
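A minimal sketch of this algorithm (the constant folded into m, the noise level, and the dimensions are illustrative assumptions): solve the regression problem on the sketches S^T A and S^T B, and compare against the exact least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d2, eps = 2000, 10, 2, 0.1
A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, d2)) + 0.1 * rng.standard_normal((n, d2))

m = int(10 * d / eps)                                 # m = O(d eps^-1 log(1/delta)); constant illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))
StA, StB = S.T @ A, S.T @ B                           # all the algorithm keeps

X_prime = np.linalg.lstsq(StA, StB, rcond=None)[0]    # min_X ||S^T(AX - B)||
X_star = np.linalg.lstsq(A, B, rcond=None)[0]         # exact minimizer, for comparison

print(np.linalg.norm(A @ X_prime - B))                # should be within (1+eps) of ...
print(np.linalg.norm(A @ X_star - B))                 # ... the optimal residual
```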
Regression Analysis
• Why should X’ solving min_X ||S^T(AX-B)|| be good?
• S^T approximately preserves AX-B for a fixed X
• If this worked for all X, we’re done
• S^T must preserve norms even for X’, which is chosen using S
• First reduce to showing that ||A(X*-X’)|| is small
• Use the normal equation A^T A X* = A^T B
– Implies ||AX’-B||^2 = ||AX*-B||^2 + ||A(X’-X*)||^2 (checked numerically below)
• Bounding ||A(X’-X*)||^2 is equivalent to bounding ||U^T A(X’-X*)||^2,
where A = U Σ V^T from the SVD and U is an orthonormal basis of the
columnspace of A
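A quick numerical check of the Pythagorean identity above (an illustrative random instance, not from the talk): since A^T(AX* - B) = 0, the cross term vanishes for any X’.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, d2 = 200, 6, 2
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

X_star = np.linalg.lstsq(A, B, rcond=None)[0]            # optimal X*
X_prime = X_star + 0.1 * rng.standard_normal((d, d2))    # any other X'

lhs = np.linalg.norm(A @ X_prime - B) ** 2
rhs = np.linalg.norm(A @ X_star - B) ** 2 + np.linalg.norm(A @ (X_prime - X_star)) ** 2
print(np.isclose(lhs, rhs))                              # True
```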
Regression Analysis Continued
• Bounding ||β||^2 := ||U^T A(X’-X*)||^2
– ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||
• Normal equations in sketch space imply
(S^T A)^T (S^T A) X’ = (S^T A)^T (S^T B)
• U^T S S^T U β = U^T S S^T A(X’-X*)
= U^T S S^T A(X’-X*) + U^T S S^T(B-AX’)   (second term is 0 by the sketch-space normal equations)
= U^T S S^T(B-AX*)
• ||U^T S S^T U β / m|| = ||U^T S S^T(B-AX*)/m||
≤ (ε/k)^{1/2} ||U|| ||B-AX*||   (new tail estimate)
= ε^{1/2} ||B-AX*||   (since ||U|| = k^{1/2})
Regression Analysis Continued
• Hence, for ||β||^2 := ||U^T A(X’-X*)||^2,
– ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||
≤ ε^{1/2} ||B-AX*|| + ||U^T S S^T U β / m - β||
• Recall the spectral norm: ||A||_2 = sup_x ||Ax|| / ||x||
• Implies ||CD|| ≤ ||C||_2 ||D||
• ||U^T S S^T U β / m - β|| ≤ ||U^T S S^T U / m - I||_2 ||β||
• Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves the
lengths of all vectors in a k-dimensional subspace
• ||U^T S S^T U β / m - β|| ≤ ||β|| / 2
• So ||β|| ≤ 2 ε^{1/2} ||AX*-B||
• ||AX’-B||^2 = ||AX*-B||^2 + ||β||^2 ≤ ||AX*-B||^2 + 4ε ||AX*-B||^2
Regression Lower Bound
• Tight Ω(d^2 ε^{-1} log(nd)) space lower bound
• Again a reduction from augmented indexing
• This time more complicated
• Embed ε^{-1} log(nd) independent regression subproblems into the hard instance
– Uses deletions and the geometrically growing property, as in the
matrix product lower bound
• Choose the entries of A and b so that the algorithm’s output x encodes
some entries of A
Regression Lower Bound
• Lower bound of (d2) already tricky because of
bit complexity
• Natural approach:
– Alice has random d x d sign matrix A-1
– b is a standard basis vector ei
– Alice computes A = (A-1)-1 and puts it into the stream.
Solution x to minx ||Ax=b|| is i-th column of A-1
– Bob can isolate entries of A-1, solving indexing
• Wrong: A has entries that can be exponentially
small!
Regression Lower Bound
• We design A and b together (Aug. Index)
    [ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ] [x_1]   [   0    ]
    [ 0     1     A_{2,3}  A_{2,4}  A_{2,5} ] [x_2]   [A_{2,4} ]
    [ 0     0        1     A_{3,4}  A_{3,5} ] [x_3] = [A_{3,4} ]
    [ 0     0        0        1     A_{4,5} ] [x_4]   [   1    ]
    [ 0     0        0        0        1    ] [x_5]   [   0    ]

• Back-substitution gives x_5 = 0, x_4 = 1, x_3 = 0, x_2 = 0, x_1 = -A_{1,4},
so the output encodes the entry A_{1,4} (verified in the sketch below)
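Back-substitution on this instance can be checked directly; the A_{i,j} values below are arbitrary stand-ins (in the hard instance they are sign entries determined by Alice's string), so this is only an illustration of the algebra.

```python
import numpy as np

rng = np.random.default_rng(5)
Aij = {(i, j): float(rng.choice([-1.0, 1.0])) for i in range(1, 6) for j in range(i + 1, 6)}

A = np.eye(5)                                 # unit upper-triangular matrix
for (i, j), v in Aij.items():
    A[i - 1, j - 1] = v

b = np.array([0.0, Aij[(2, 4)], Aij[(3, 4)], 1.0, 0.0])
x = np.linalg.solve(A, b)

print(np.allclose(x, [-Aij[(1, 4)], 0.0, 0.0, 1.0, 0.0]))   # x encodes A_{1,4}
```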
Outline
• Matrix Product
• Regression
• Low-rank approximation
Best Low-Rank Approximation
• For any matrix A and integer k, there is a matrix A_k of rank k that is
closest to A among all matrices of rank k
• Since the rank of A_k is k, it is the product C D^T of two k-column
matrices C and D
– A_k can be found from the SVD (singular value decomposition),
with C = U_k and D = V_k Σ_k
– This is a good compression of A
– LSI, PCA, recommendation systems, clustering
Best Low-Rank Approximation
• Previously, nothing was known for 1-pass low-rank approximation with
relative error
– Even for k = 1, the best upper bound was O(nd log(nd)) bits
– Problem 28 of [Mut]: can one get sublinear space?
• We get 1 pass and O(kε^{-2}(n + dε^{-2}) log(nd)) space
• Update time is O(kε^{-4}), so total work is O(Nkε^{-4}),
where N is the number of non-zero entries of A
• A new space lower bound shows this is optimal up to a factor of 1/ε
Best Low-Rank Approximation and S^T A
• The sketch S^T A holds information about A
• In particular, there is a rank-k matrix A_k’ in the rowspace of S^T A
nearly as close to A as the closest rank-k matrix A_k
– The rowspace of S^T A is the set of linear combinations of its rows
• That is, ||A-A_k’|| ≤ (1+ε)||A-A_k||
• Why is there such an A_k’?
Low-Rank Approximation via Regression
• Apply the regression results with A → A_k and B → A
• The X’ minimizing ||S^T(A_k X - A)|| has
||A_k X’ - A|| ≤ (1+ε)||A_k X* - A||
• But here X* = I, and X’ = (S^T A_k)^- S^T A
• So the matrix A_k X’ = A_k (S^T A_k)^- S^T A:
– Has rank k
– Lies in the rowspace of S^T A
– Is within 1+ε of the smallest distance of any rank-k matrix
(checked numerically below)
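A small numpy check of this claim (illustrative dimensions and sketch size; A is a noisy rank-k matrix by construction): A_k (S^T A_k)^- S^T A has rank at most k, its rows lie in the rowspace of S^T A, and its distance to A is close to ||A - A_k||.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k, m = 500, 100, 3, 60
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]                     # best rank-k approximation

S = rng.choice([-1.0, 1.0], size=(n, m))
Ak_prime = Ak @ np.linalg.pinv(S.T @ Ak) @ (S.T @ A)    # rank <= k, rows in rowspace of S^T A

print(np.linalg.norm(A - Ak_prime))                     # close to ...
print(np.linalg.norm(A - Ak))                           # ... the optimal rank-k error
```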
Low-Rank Approximation in 2 Passes
• Can’t use A_k (S^T A_k)^- S^T A without finding A_k
• Instead: maintain S^T A (see the 2-pass sketch below)
– Can show that if G^T has orthonormal rows, then the best rank-k
approximation to A in the rowspace of G^T is A’_k G^T, where A’_k is
the best rank-k approximation to AG
– After the 1st pass, compute an orthonormal basis G^T for the
rowspace of S^T A
– In the 2nd pass, maintain AG
– Afterwards, compute A’_k and A’_k G^T
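A hedged numpy sketch of the 2-pass algorithm (the dimensions, sketch size m, and the synthetic near-rank-k input are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k, m = 500, 100, 3, 60
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

S = rng.choice([-1.0, 1.0], size=(n, m))
StA = S.T @ A                                      # pass 1: maintain S^T A
G = np.linalg.qr(StA.T)[0]                         # columns of G span the rowspace of S^T A

AG = A @ G                                         # pass 2: maintain AG

U, s, Vt = np.linalg.svd(AG, full_matrices=False)  # A'_k = best rank-k approximation to AG
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] @ G.T      # output A'_k G^T

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # exact best rank-k approximation, for comparison
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - approx), np.linalg.norm(A - Ak))
```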
Low-Rank Approximation in 2 Passes
• A’_k is the best rank-k approximation to AG
• For any rank-k matrix Z,
||AGG^T - A’_k G^T|| = ||AG - A’_k||
≤ ||AG - Z||
≤ ||AGG^T - ZG^T||
• For all Y: (AGG^T - YG^T)(A - AGG^T)^T = 0, so we can apply the
Pythagorean Theorem twice:
||A - A’_k G^T||^2 = ||A - AGG^T||^2 + ||AGG^T - A’_k G^T||^2
≤ ||A - AGG^T||^2 + ||AGG^T - ZG^T||^2
= ||A - ZG^T||^2
1 Pass Algorithm
• With high probability, (*): ||AX’-B|| ≤ (1+ε)||AX*-B||, where
– X* minimizes ||AX-B||
– X’ minimizes ||S^T A X - S^T B||
• Apply (*) with A → AR and B → A (R a random sign matrix with a small
number of columns), so X’ minimizes ||S^T(ARX - A)||
• So X’ = (S^T AR)^- S^T A has ||ARX’ - A|| ≤ (1+ε) min_X ||ARX - A||
• The columnspace of AR contains a (1+ε)-approximation to A_k
• So ||ARX’ - A|| ≤ (1+ε) min_X ||ARX - A|| ≤ (1+ε)^2 ||A - A_k||
• Key idea: ARX’ = AR(S^T AR)^- S^T A is
– easy to compute in 1 pass with small space and fast update time
– behaves like A (similar to the SVD)
– use it instead of A in our 2-pass algorithm!
1 Pass Algorithm
• Algorithm (sketched in code below):
– Maintain AR and S^T A
– Compute AR(S^T AR)^- S^T A
– Let G^T be an orthonormal basis for the rowspace of S^T A, as before
– Output the best rank-k approximation to AR(S^T AR)^- S^T A in the
rowspace of S^T A
• Same as the 2-pass algorithm, except we don’t need a second pass to
project A onto the rowspace of S^T A
• Analysis is similar to that for regression
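A hedged numpy sketch of the 1-pass algorithm (R is taken here as a d x m_R sign matrix, and all dimensions and sketch sizes are illustrative assumptions). The rank-k step works only with small factors, since AR(S^T AR)^- S^T A G = AR W with W small; the full-size output is formed here only for comparison.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k, m_S, m_R = 500, 100, 3, 60, 20
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

S = rng.choice([-1.0, 1.0], size=(n, m_S))
R = rng.choice([-1.0, 1.0], size=(d, m_R))
AR, StA = A @ R, S.T @ A                                  # the only sketches maintained

G = np.linalg.qr(StA.T)[0]                                # orthonormal basis of rowspace(S^T A)
W = np.linalg.pinv(S.T @ AR) @ (StA @ G)                  # small: m_R x m_S
U, s, Vt = np.linalg.svd(AR @ W, full_matrices=False)     # rank-k step on AR(S^T AR)^- S^T A G
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] @ G.T             # rank-k output in rowspace of S^T A

U, s, Vt = np.linalg.svd(A, full_matrices=False)          # exact A_k, for comparison
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - approx), np.linalg.norm(A - Ak))
```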
A Lower Bound
[Figure: the hard instance. Alice encodes her binary string x in a matrix A
made of blocks of k rows and k ε^{-1} columns, with entries of geometrically
growing magnitude (±10, ±100, ±1000, ±10000, …); the remaining n-k rows are 0s.
Bob, given index i and x_{i+1}, x_{i+2}, …, deletes the blocks after the block
containing his index.]
• Error is now dominated by the block of interest
• Bob also inserts a k x k identity submatrix into the block of interest
Lower Bound Details
[Figure: the block of interest, a k x (k ε^{-1}) block; Bob inserts into it a
k x k identity submatrix scaled by a large value P; the remaining n-k rows are 0s.]
• Show that any rank-k approximation must err on all of the shaded region
• So a good rank-k approximation likely has the correct sign on Bob’s entry
Concluding Remarks
• Space bounds are tight for product, regression
– Sharpen prior upper bounds
– Prove optimal lower bounds
• Space bounds are off by a factor of ε^{-1} for low-rank approximation
– First sub-linear (and near-optimal) 1-pass algorithm
– We have better upper bounds for restricted cases
• Improve the dependence on ε in the update time
• Lower bounds for multi-pass algorithms?