
Randomized Algorithms for
Low-Rank Matrix Decomposition
Ben Sapp, WPE II
May 6, 2011
1
Low-Rank Decomposition:
The Goal
Approximate A ($m \times n$) by a product of two thin matrices, $A \approx B\,C$, where $B$ is $m \times k$, $C$ is $k \times n$, and the target rank $k \ll \min(m, n)$.
2
Advantages
• Requires mk+nk numbers to represent the matrix, instead
of mn. I.e., compression (see the worked example below).
• Fewer numbers means less storage space and faster matrix
multiplication.
• In many applications, exposes the structure of the data.
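For example (hypothetical sizes): with m = 10,000, n = 10,000 and k = 20, the factored form needs mk + nk = 400,000 numbers instead of mn = 100,000,000, i.e., a 250x saving.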
3
Exposing structure: an example
Let’s say your matrix is full of points in n-dimensional space which have
some underlying linear structure:
(Figure: a cloud of points in $\mathbb{R}^n$ lying near a 2-dimensional plane.)
This data can be described almost exactly in 2 dimensions, i.e., by a rank-2 matrix: one factor is a
description of the 2 axes in $\mathbb{R}^n$, and the other holds the
coefficients (embedding) of the points in that 2-dimensional basis.
4
Formulation:
The Fixed-Rank Problem
• Given: $A \in \mathbb{R}^{m \times n}$ and a target rank k
• Find: $\hat{A}_k = \arg\min_{\hat{A}} \|A - \hat{A}\|_F$
• Subject to: $\mathrm{rank}(\hat{A}) \le k$
Non-convex constraint set
But, global optimum exists
And there is a known solution…
5
Classical Solution:
The Fixed-Rank Problem
Singular Value Decomposition of A: $A = U \Sigma V^T$
Optimal Solution: truncated SVD, $\hat{A}_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
(proof to come later)
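As a concrete reference point, here is a minimal Matlab sketch of the truncated SVD using the built-in svds (toy data; variable names are illustrative):

k = 25;
A = randn(1000, 300);        % placeholder data
[U,S,V] = svds(A, k);        % top-k singular triplets
Ak = U*S*V';                 % best rank-k approximation
err = norm(A - Ak, 'fro');   % optimal Frobenius error for rank k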
6
Truncated SVD Properties
• Properties (via power method):
– O(mnk) running time, which is OK, but…
– O(k) passes through the data
– Expects cheap random access
– Iterative
• Issues:
– Datasets are huge! (Netflix dataset: 2GB; FERET Face DB: 5GB; Wiki English: 14GB; Watson KB: 1TB)
– Data access is expensive
– Architecture is parallel and decentralized
7
Low-Rank Algorithm Desiderata
• 1-2 passes over the matrix as pre-processing /
examination step.
• Remainder of the work sub-O(mn), and should
depend on desired rank k, rather than ambient
dimensions mxn.
• Trade off accuracy with computation time
• Decentralized, simple and numerically stable
8
Randomization to the Rescue!
Randomized meta-algorithm:
1. Given A (mxn), randomly obtain Y (mxs or sxs) in 1 pass.
2. Compute an exact SVD of Y in O(ms²) or O(s³).
3. Use Y's decomposition to approximate A's decomposition.
✓ 1-2 passes over the matrix in a pre-processing/examination step.
✓ Remainder of the work is sub-O(mn), depending on the underlying rank
rather than the ambient dimensions.
✓ Trade off accuracy with computation time.
✓ Decentralized, simple and numerically stable.
9
Outline
• Introduction
• Linear Algebra preliminaries and intuition
• The algorithms
– SampleCols
– AdaSample
– RandProj
• A comparison
10
Singular Value Decomposition
Any real rectangular matrix A can be factored into the form $A = U \Sigma V^T$.
(Diagrams: the shapes of the factors for the tall & skinny case, m > n, and the short & fat case, m < n.)
11
SVD Properties
U and V are unitary matrices with mutually orthonormal
columns: $U^T U = I$ and $V^T V = I$.
These columns are called the left and right singular vectors:
$U = [u_1 \cdots u_m]$, the left singular vectors
$V = [v_1 \cdots v_n]$, the right singular vectors
12
SVD Properties
$\Sigma$ contains the singular values on the diagonal, in order:
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$
which correspond to the left and right singular vectors:
$A v_i = \sigma_i u_i$, $A^T u_i = \sigma_i v_i$
13
SVD Properties
For a vector $x \in \mathbb{R}^n$, the geometric interpretation of $Ax = U \Sigma V^T x$:
1. Rotate x by the rotation matrix $V^T$.
2. Scale the result along the coordinate axes by the singular values.
3. Either discard n-m dimensions (m < n), or zero-pad with m-n zeros (m > n), to map to $\mathbb{R}^m$.
4. Rotate the result by U.
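A small Matlab sketch of these four steps on a toy example (sizes are arbitrary):

A = randn(5,3);       % m = 5 > n = 3
[U,S,V] = svd(A);     % A = U*S*V'
x = randn(3,1);
y = V'*x;             % 1. rotate by V'
y = S*y;              % 2.-3. scale by the singular values and zero-pad to R^m
y = U*y;              % 4. rotate by U
norm(y - A*x)         % ~ 0: same as applying A directly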
14
Fundamental Subspaces
Define range(A) as the set of all vectors which A maps to:
$\mathrm{range}(A) = \{ Ax : x \in \mathbb{R}^n \}$
synonymous: range(A) is the linear subspace spanned by the columns of A
If A has rank k, then $\{u_1, \ldots, u_k\}$ are an orthonormal basis for range(A).
(Figure: the basis vectors $u_1, u_2, u_3$ spanning range(A).)
15
Frobenius norm
• Equivalent definitions:
$\|X\|_F^2 = \sum_{i,j} X_{ij}^2 = \mathrm{tr}(X^T X) = \sum_i \sigma_i^2(X)$
in Matlab: sum(X(:).^2)
• Is unitarily invariant: $\|U X V\|_F = \|X\|_F$ for unitary U, V,
since tr(XY) = tr(YX)
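A quick Matlab check of these identities (random test matrix; the orthogonal factors are generated only for illustration):

X = randn(6,4);
f1 = sum(X(:).^2);              % squared Frobenius norm
f2 = trace(X'*X);
f3 = sum(svd(X).^2);            % sum of squared singular values
[Qu,Ru] = qr(randn(6));         % random orthogonal U
[Qv,Rv] = qr(randn(4));         % random orthogonal V
f4 = norm(Qu*X*Qv,'fro')^2;     % unitary invariance
% f1, f2, f3, f4 agree up to round-off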
16
The Optimal Low-Rank Solution
Theorem (Eckart-Young, 1936).
Let $A = U \Sigma V^T$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Then
$\|A - \hat{A}\|_F$ over all $\hat{A}$ with $\mathrm{rank}(\hat{A}) \le k$
is minimized by the truncated SVD $\hat{A}_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$,
with optimal value $\|A - \hat{A}_k\|_F^2 = \sum_{i > k} \sigma_i^2$.
17
The Optimal Low-Rank Solution
Proof.
First, the Frobenius norm is unitarily invariant:
$\|A - A'\|_F = \|U^T (A - A') V\|_F = \|\Sigma - U^T A' V\|_F$
Thus, since $\Sigma$ is diagonal, the term $U^T A' V$ should be diagonal too.
To make the right term diagonal, we can construct A' of the form $A' = U D V^T$ with D diagonal.
18
The Optimal Low-Rank Solution
Proof (continued).
At this point, we have $\|A - A'\|_F^2 = \sum_i (\sigma_i - d_i)^2$.
Since rank(A') is at most k, at most k of the $d_i$ can be non-zero.
Conclude: the error is minimized by setting $d_i = \sigma_i$ for $i \le k$ and $d_i = 0$ otherwise.
In summary: $\hat{A}_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ and $\|A - \hat{A}_k\|_F^2 = \sum_{i > k} \sigma_i^2$.
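A numeric sanity check of the theorem in Matlab (toy sizes chosen arbitrarily):

A = randn(50,30)*randn(30,40);            % an arbitrary 50x40 matrix
k = 5;
[U,S,V] = svd(A,'econ');
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';       % truncated SVD
sig = diag(S);
err_opt = sqrt(sum(sig(k+1:end).^2));     % predicted optimal error
norm(A - Ak,'fro') - err_opt              % ~ 0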
19
Orthogonal Projections
• Projection of a vector x onto a unit vector q: $q q^T x$
• Projection of a matrix A onto an orthonormal basis
Q = [q_1 … q_k]: $Q Q^T A$
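In Matlab both projections are one-liners (a sketch with arbitrary sizes):

x = randn(8,1);  q = randn(8,1);  q = q/norm(q);   % unit vector
xp = q*(q'*x);               % projection of x onto q
A = randn(8,5);
[Q,R] = qr(randn(8,3),0);    % an orthonormal basis Q (8x3)
Ap = Q*(Q'*A);               % projection of the columns of A onto span(Q)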
20
Randomized Meta-algorithm
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Form a lower dimensional Y by sampling s rows and/or columns
from A, or by applying s random projections. Y is mxs or sxs.
2. Compute the top k left singular vectors of Y to form the best rank-k
basis Q for range(Y).
3. Project A onto the subspace spanned by Q: $\hat{A}_k = Q Q^T A$
21
The Main Idea
Meta-Algorithm
1. Form Y from A randomly
2. Get optimal rank-k basis Q for span of Y via SVD
3. Project A onto Q
Bounds: How far is $Q Q^T A$ from the optimal rank-k approximation $A_k$?
22
Outline
• Introduction
• Linear Algebra preliminaries and intuition
• The algorithms
– SampleCols
– AdaSample
– RandProj
• A comparison
23
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | ? | ? | ?
SampleRowsCols | ? | ? | ?
AdaSample | ? | ? | ?
RandProj | ? | ? | ?
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
24
Sampling Rows & Columns
• Simple idea: too much data to deal with?
Subsample!
• But, sample proportional to squared magnitude
(Figure: long, high-magnitude columns are more informative; short, low-magnitude columns are not so useful.)
25
First pass: SampleCols
[Frieze, Kannan & Vempala, 1998]
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Sample s columns from A: A(:,i1),…,A(:,is), proportional to their squared
magnitude: $p_i = \|A(:,i)\|^2 / \|A\|_F^2$
2. Form Y (mxs) from the sampled columns, rescaling each by $1/\sqrt{s\,p_i}$.
3. Compute Q = [q1 … qk], the top k left singular vectors of Y.
4. Project A onto the subspace spanned by Q: $\hat{A}_k = Q Q^T A$
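A minimal Matlab sketch of SampleCols under these definitions (not the deck's exact code; randsample is from the Statistics Toolbox):

function Ahat = sampleCols(A, k, s)
p = sum(A.^2, 1);                          % squared column magnitudes
p = p / sum(p);
idx = randsample(numel(p), s, true, p);    % weighted sampling of s columns
pi_s = p(idx);  pi_s = pi_s(:)';           % probabilities of the drawn columns, 1 x s
Y = bsxfun(@times, A(:,idx), 1./sqrt(s*pi_s));   % rescaled columns, m x s
[Q,Sy,Vy] = svds(Y, k);                    % Q = top-k left singular vectors of Y
Ahat = Q*(Q'*A);                           % project A onto span(Q)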
26
Running time: SampleCols
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Sample s columns from A: A(:,i1),…,A(:,is), proportional to their squared
magnitude. O(mn)
2. Form Y from the rescaled sampled columns. O(ms)
3. Compute Q = [q1 … qk], the top k left singular vectors of Y. O(ms²)
4. Project A onto the subspace spanned by Q: $\hat{A}_k = Q Q^T A$
Total running time: O(mn + ms²)
27
Analysis: SampleCols
• How different is A from Y on average?
• Let’s start by analyzing a randomized
matrix-vector multiplication algorithm:
Exact: $Ax = \sum_{i=1}^{n} A(:,i)\, x_i$
Randomly use s cols: pick s column indices $i_1, \ldots, i_s$ with probabilities $p_i$ and estimate
$Ax \approx \frac{1}{s} \sum_{t=1}^{s} \frac{A(:,i_t)\, x_{i_t}}{p_{i_t}}$
28
Analysis: SampleCols
Random matrix-vector multiplication:
Exact: $Ax = \sum_i A(:,i)\, x_i$
Randomized: let the random variable X take on value $A(:,i)\, x_i / p_i$
with probability $p_i$.
Then $\mathbb{E}[X] = Ax$ and $\mathbb{E}\|X\|^2 = \sum_i \|A(:,i)\|^2 x_i^2 / p_i$.
How do we easily bound this variance?
Set $p_i = \|A(:,i)\|^2 / \|A\|_F^2$.
Then $\mathbb{E}\|X\|^2 = \|A\|_F^2 \sum_i x_i^2 = \|A\|_F^2 \|x\|^2$,
and $\mathrm{Var}(X) = \mathbb{E}\|X - Ax\|^2 \le \|A\|_F^2 \|x\|^2$.
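A small Matlab illustration of this estimator (toy sizes; randsample is from the Statistics Toolbox):

A = randn(200,500);  x = randn(500,1);
p = sum(A.^2,1) / norm(A,'fro')^2;           % sampling probabilities
s = 50;
idx = randsample(numel(p), s, true, p);      % s column indices
w = 1 ./ (s * p(idx));                       % importance weights
est = A(:,idx) * (x(idx) .* w(:));           % (1/s) sum_t A(:,i_t) x_{i_t} / p_{i_t}
norm(est - A*x) / norm(A*x)                  % relative error shrinks as s grows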
29
Analysis: SampleCols
Handle on the variance of matrix-vector multiplication:
$\mathbb{E}\|X - Ax\|^2 \le \|A\|_F^2\, \|x\|^2$
With s samples (averaging s independent copies), the variance gets s times better:
$\mathbb{E}\|X - Ax\|^2 \le \|A\|_F^2\, \|x\|^2 / s$
Let's extend the idea to matrix-matrix multiplication, randomly choosing col i
from A and the corresponding row i from matrix B: define the random variable
$Z = A(:,i)\, B(i,:) / p_i$
(Figure: the outer product of a sampled column of A with the corresponding row of B.)
Then $\mathbb{E}[Z] = AB$.
30
Analysis: SampleCols
Handle on the variance of matrix-matrix multiplication:
$\mathbb{E}\|Z - AB\|_F^2 \le \|A\|_F^2\, \|B\|_F^2 / s$ (with s samples)
Now, let's look at the variance when $B = A^T$. Let $Z = Y Y^T$, where Y holds the sampled,
rescaled columns. Then plugging into the above we have
$\mathbb{E}\|Y Y^T - A A^T\|_F^2 \le \|A\|_F^4 / s$
and we have the following lemma:
Lemma (Distortion from sampling).
When Y is a sampled version of A as in SampleCols, then
$\mathbb{E}\|A A^T - Y Y^T\|_F \le \|A\|_F^2 / \sqrt{s}$
31
Analysis: SampleCols
Lemma (Distortion from sampling).
When Y is a sampled version of A as in SampleCols, then
$\mathbb{E}\|A A^T - Y Y^T\|_F \le \|A\|_F^2 / \sqrt{s}$
We need one more lemma, which quantifies the distortion
when projecting one matrix onto another's range…
32
Analysis: SampleCols
Lemma (Distortion from projection).
Given any two matrices A and B, let Q be a top-k basis of range(B). Then
$\|A - Q Q^T A\|_F^2 \le \|A - A_k\|_F^2 + 2\sqrt{k}\, \|A A^T - B B^T\|_F$
i.e., the error of projecting A onto the best k-basis of B differs from the error of
projecting A onto its own best k-basis by at most the last term.
(Figure: the subspaces range(A) and range(B).)
33
Analysis: SampleCols
Lemma (Distortion from sampling).
When Y is a sampled version of A as in SampleCols, then
$\mathbb{E}\|A A^T - Y Y^T\|_F \le \|A\|_F^2 / \sqrt{s}$
Lemma (Distortion from projection).
Given any two matrices A and B, let Q be a basis of the top k left singular
vectors of B. Then
$\|A - Q Q^T A\|_F^2 \le \|A - A_k\|_F^2 + 2\sqrt{k}\, \|A A^T - B B^T\|_F$
Taking the expectation of the second lemma w.r.t. the sampled columns, we obtain a
bound for SampleCols:
Theorem (SampleCols average Frobenius error).
SampleCols finds a rank-k matrix $\hat{A}_k$ such that
$\mathbb{E}\|A - \hat{A}_k\|_F^2 \le \|A - A_k\|_F^2 + 2\sqrt{k/s}\, \|A\|_F^2$
34
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | O(mn + ms²) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
SampleRowsCols | ? | ? | ?
AdaSample | ? | ? | ?
RandProj | ? | ? | ?
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
35
One step further: SampleRowsCols
[Frieze, Kannan & Vempala, 2004]
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Sample s columns from A: A(:,i1),…,A(:,is), proportional to their squared
magnitude. Scale and form Y, mxs.
2. Sample s rows from Y: Y(i1,:),…,Y(is,:), proportional to their squared
magnitude. Scale and form W, sxs.
3. Compute V = [v1 … vs], the top s right singular vectors of W.
4. Compute Q = [q1 … qk], where $q_i = Y v_i / \|Y v_i\|$.
5. Project A onto the subspace spanned by Q: $\hat{A}_k = Q Q^T A$
36
Running time: SampleRowsCols
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Sample s columns from A: A(:,i1),…,A(:,is), proportional to their squared
magnitude. Scale and form Y, mxs. O(mn)
2. Sample s rows from Y: Y(i1,:),…,Y(is,:), proportional to their squared
magnitude. Scale and form W, sxs. O(ms)
3. Compute V = [v1 … vs], the top s right singular vectors of W. O(s³)
4. Compute Q = [q1 … qk], where $q_i = Y v_i / \|Y v_i\|$. O(ms²)
5. Project A onto the subspace spanned by Q: $\hat{A}_k = Q Q^T A$
Total running time:
O(mn + s³)
37
Analysis: SampleRowsCols
(Diagram: the SampleRowsCols pipeline. Sample columns of A to form Y; sample rows of Y to form W; compute a basis for the rows of W; convert from the row basis of W to a column basis of Y; project A onto Q. SampleCols instead computes the exact column basis of Y.)
38
Analysis: SampleRowsCols
Lemma (Good left projections from good right projections).
Let $q_i = Y v_i / \|Y v_i\|$ as in SampleRowsCols. Then the error of projecting A onto
Q = [q1 … qk] exceeds the optimal rank-k error by an additive term controlled by
$\|Y^T Y - W^T W\|_F$.
It turns out this additive error dominates the errors incurred from other steps of
the algorithm, thus we can bound the algorithm as follows:
Theorem (SampleRowsCols average Frobenius error).
SampleRowsCols finds a rank-k matrix $\hat{A}_k$ whose expected squared Frobenius error
exceeds the optimal $\|A - A_k\|_F^2$ by an additive term of order $\sqrt{k/s}\, \|A\|_F^2$.
39
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | O(mn + ms²) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
SampleRowsCols | O(mn + s³) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
AdaSample | ? | ? | ?
RandProj | ? | ? | ?
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
40
SampleCols is easy to break
• Error is additive, and we have no control over $\|A\|_F$.
• Consider:
(Figure: the bulk of the points lies along one dominant direction, plus a few
important outlier points off it.)
This data has a near-perfect rank-2 decomposition.
But, SampleCols will almost surely miss the outliers!
41
Improvement: AdaSample
• Sample some and form a basis.
• Next round of sampling should be
proportional to the residual part of A not
captured by the current sample.
42
AdaSample
[Deshpande, Rademacher, Vempala & Wang, 2006]
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Start with empty sample set S = { }, E := A.
2. For t = 1 to T
   a. Pick a subset St of s rows of A, with row i chosen according to $p_i \propto \|E(i,:)\|^2$.
   b. Update S := S ∪ St.
   c. Update $E := A - \pi_{\mathrm{span}(S)}(A)$, the residual of A not captured by the current sample.
3. Return $\hat{A}_k = \pi_{\mathrm{span}(S),\,k}(A)$, the best rank-k approximation of A within span(S).
43
Analysis: AdaSample
Let’s look at one iteration of AdaSample:
Lemma (AdaSample, one round).
Let A be mxn and $L \subseteq \mathbb{R}^n$ be a linear subspace, with residual $E = A - \pi_L(A)$, and S a
collection of s rows of A sampled proportional to $\|E(i,:)\|^2$. Then
$\mathbb{E}\,\|A - \pi_{L \cup S,\, k}(A)\|_F^2 \le \|A - A_k\|_F^2 + \frac{k}{s}\, \|E\|_F^2$
where $\pi_{L,k}(A)$ is the optimal projection of A onto subspace L with rank k.
Proof is similar to the derivation of the bound on SampleCols.
44
Analysis: AdaSample
Lemma (AdaSample, one round).
Let A be mxn and $L \subseteq \mathbb{R}^n$ be a linear subspace, with residual $E = A - \pi_L(A)$, and S a
collection of s rows of A sampled proportional to $\|E(i,:)\|^2$. Then
$\mathbb{E}\,\|A - \pi_{L \cup S,\, k}(A)\|_F^2 \le \|A - A_k\|_F^2 + \frac{k}{s}\, \|E\|_F^2$
Applying this lemma T times:
Theorem (AdaSample average Frobenius error).
After T > 1 iterations, AdaSample obtains a sample S such that
$\mathbb{E}\,\|A - \pi_{\mathrm{span}(S),\,k}(A)\|_F^2 \le \frac{1}{1 - k/s}\, \|A - A_k\|_F^2 + \left(\frac{k}{s}\right)^T \|A\|_F^2$
45
Running Time: AdaSample
At iteration t, we need to:
(Figure: the current basis for S, alongside A.)
1. Extend the orthonormal basis for S to an orthonormal basis for S ∪ St.
Orthogonalizing s new vectors against the st existing orthogonal, nx1 vectors: O(nts²).
2. Project A onto the new portion of the basis spanned by St, which takes O(mns).
So the iterative process takes $\sum_{t=1}^{T} O(nts^2 + mns) = O(ns^2T^2 + mnsT)$.
Finally, we need to compute singular vectors to obtain our final rank-k basis,
taking O(ms²T²). Total running time: O(mnsT + s²T²(m+n))
46
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | O(mn + ms²) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
SampleRowsCols | O(mn + s³) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
AdaSample | O(mnsT + s²T²(m+n)) | 2T | $\frac{1}{1-k/s}\|A - A_k\|_F^2 + (k/s)^T \|A\|_F^2$
RandProj | ? | ? | ?
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
47
Recap so far
• Previous algorithms attempted to capture
range(A) by (intelligently) sampling rows
and/or columns.
• What if instead, we probed A with a variety
of vectors to get a feel for range(A)?
48
RandProj: Geometric intuition
49
RandProj
[Halko, Martinsson & Tropp, 2010]
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Draw a random test matrix $\Omega$, nxs, with i.i.d. Gaussian entries.
2. Form $Y = A\,\Omega$, mxs.
3. Compute an orthonormal basis Q for the range of Y via SVD.
4. Return $\hat{A}_k = Q Q^T A$.
50
Running time: RandProj
Input: Matrix A, mxn, target rank k, number of samples s
Output: $\hat{A}_k$, which approximately solves $\min_{\mathrm{rank}(\hat{A}) \le k} \|A - \hat{A}\|_F$
1. Draw a random test matrix $\Omega$, nxs.
2. Form $Y = A\,\Omega$. O(mns)
3. Compute an orthonormal basis Q for the range of Y via SVD. O(ms²)
4. Return $\hat{A}_k = Q Q^T A$.
Total Running Time:
O(mns + ms²)
51
Analysis: RandProj
Lemma (Random projection error bound).
Partition the SVD of A so that $\Sigma_1$ holds the top k singular values, $\Sigma_2$ the rest, and
$V_1$, $V_2$ the corresponding right singular vectors.
Let $\Omega_1 = V_1^T \Omega$ and $\Omega_2 = V_2^T \Omega$. Then
$\|A - Q Q^T A\|_F^2 \le \|\Sigma_2\|_F^2 + \|\Sigma_2\, \Omega_2\, \Omega_1^{\dagger}\|_F^2$
The first term is the optimal error; the second is an extra cost proportional to the tail
singular values: "wasted sampling".
52
Analysis: RandProj
Lemma (Random projection error bound): take expectations w.r.t. the random $\Omega$.
Theorem (RandProj average Frobenius error).
RandProj finds a rank-k matrix $\hat{A}_k$ such that
$\mathbb{E}\,\|A - \hat{A}_k\|_F \le \left(1 + \frac{k}{s-k-1}\right)^{1/2} \left(\sum_{j>k} \sigma_j^2\right)^{1/2}$
53
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | O(mn + ms²) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
SampleRowsCols | O(mn + s³) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
AdaSample | O(mnsT + s²T²(m+n)) | 2T | $\frac{1}{1-k/s}\|A - A_k\|_F^2 + (k/s)^T \|A\|_F^2$
RandProj | O(mns + ms²) | 2 | $\left(1 + \frac{k}{s-k-1}\right)^{1/2} \|A - A_k\|_F$
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
54
RandProj refinement
• Can combine RandProj with power
iterations: use $Y = (A A^T)^q A\, \Omega$ instead of $Y = A\, \Omega$.
– Drives the noise from the tail n-k singular
values down exponentially fast in q.
– But pay in running time & # of passes:
O((q+1)mns + ms²) & 2q passes through the data.
55
Comparing the algorithms
Method | Running Time | # Passes | Error w.h.p.
SampleCols | O(mn + ms²) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
SampleRowsCols | O(mn + s³) | 2 | $\|A - A_k\|_F^2 + O(\sqrt{k/s})\, \|A\|_F^2$
AdaSample | O(mnsT + s²T²(m+n)) | 2T | $\frac{1}{1-k/s}\|A - A_k\|_F^2 + (k/s)^T \|A\|_F^2$
RandProj | O(mns + ms²) | 2 | $\left(1 + \frac{k}{s-k-1}\right)^{1/2} \|A - A_k\|_F$
RandProj+power | O((q+1)mns + ms²) | 2q | approaches $\|A - A_k\|_F$ exponentially in q
Exact partial SVD | O(mnk) | O(k) | $\|A - A_k\|_F$ (optimal)
57
Outline
• Introduction
• Linear Algebra preliminaries and intuition
• The algorithms
– SampleCols
– AdaSample
– RandProj
• A comparison
58
Which one is best?
Fix k and s, and assume the bounds are tight.
(Figure: a conceptual error-vs-time plot. SampleRowsCols, O(mn + s³), is cheapest but has the
largest error; SampleCols, O(mn + ms²), is next; AdaSample, O(mnsT + s²T²(m+n)), approaches
the optimal error with more rounds of adaptive sampling; RandProj, O((q+1)mns + ms²),
approaches the optimal error with more power iterations.)
59
Caveats
• Bounds are not tight
• Relative scaling of the time-vs-error axes
depends on m, n, s, k, and the singular value spectrum of A
• Which is best in practice?
60
Experiments
function [U,S,V] = sampleColsRows(A,s)
%subsample columns of A, then rows of the result
Y = subsample(A,s);
W = subsample(Y',s)';
%SVD of W'*W
[Uw,S,Vw] = svd(W'*W);
S = sqrt(S);
U = Y*Vw;
[U,R] = qr(U);
V = (U'*A)';
V = bsxfun(@times,V,1./sqrt(sum(V.^2)));

function [Xsub] = subsample(X,s)
%sample s columns of X, weighted by magnitude, then rescale
p_col = sqrt(sum(X.^2));
p_col = p_col / sum(p_col);
colidx = sample_from_weights(p_col,s);   % helper: weighted sampling of indices
Xsub = X(:,colidx);
Xsub = bsxfun(@times,Xsub,...
    1./sqrt(p_col(colidx)))/sqrt(s);

function [U,S,V] = adaSample(A,s,T)
S = [];
E = A;
for t=1:T
    %sample columns, proportional to the residual E
    p = sqrt(sum(E.^2,1));
    p = p / sum(p);
    idx = sample_from_weights(p,s);
    St = A(:,idx);
    S = [S St];
    %orthogonalize & project
    [Q,R] = qr(S,0);
    proj_A_on_Qt = Q*(Q'*A);
    E = A - proj_A_on_Qt;
end
%wrap up
B = Q'*A;
[U,S,V] = svds(B,s*T);
U = Q*U;

function [U,S,V] = randProj(A,s,q)
Omega = randn(size(A,2),s);
Y = A*Omega;
for i=1:q
    %power iteration
    Y = A'*Y;
    Y = A*Y;
end
[Q,R] = qr(Y,0);
B = Q'*A;
[U0,S,V] = svds(B,s);
U = Q*U0;
61
Experiments
Eigenfaces from Labeled Faces in the Wild
• 13,233 images, each 96x96 pixels, collected online in 2007
• A is a 13233x9216 matrix, 975.6Mb double precision
62
Eigenfaces
(Diagram: A, examples x pixels, factors as "loadings" (examples x k) times a k-dim. face basis (k x pixels).)
We want the top k = 25 eigenfaces.
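For instance, assuming the images are already stacked as the rows of A (13233x9216), the approximate eigenfaces could be computed with the randProj function from the previous slide's code; the choice s = 75 and q = 1 matches the summary slide:

s = 75;  q = 1;  k = 25;
[U,S,V] = randProj(A, s, q);               % A is examples x pixels
eigenfaces = V(:, 1:k);                    % top-k pixel-space basis vectors
imagesc(reshape(eigenfaces(:,1), 96, 96)); % view one eigenface as an image
colormap gray;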
63
(Plot: approximation error vs. time in seconds; Matlab's svds(A,25) reaches the exact answer in ~4.5 minutes.)
64
(Plot: approximation error vs. time in seconds for SampleRowsCols, RandProj, and AdaSample.)
65
(Plot: the same comparison of SampleRowsCols, RandProj, and AdaSample on a log(Time) axis; the exact answer arrives in ~4 minutes.)
66
(Plot: the log(Time) comparison with RandProj + power iterations added for q = 1, 2, 3; RandProj with one power iteration reaches near-exact error in 4.6 secs, versus ~4 minutes for the exact answer.)
67
Summary
Exact eigenfaces in 4+ minutes
Approximate eigenfaces in 4 seconds
(75 random projections + 1 power iteration)
68
Conclusion
• Classical truncated SVD
– ill-suited for large datasets
• Randomized algorithms
– allow an error-vs-computation tradeoff
– require only a few passes through the data
– simple and robust
69
Thanks!
70
LFW: svds takes ~8 GB of memory, uses up to all 8 cores, and takes 263 secs for k=25
on X = 13233 x 9216 (96x96 images)
(The Matlab code for sampleColsRows, adaSample, and randProj is the same as on the Experiments slide above.)
71