Sketching as a Tool for Numerical Linear Algebra
David Woodruff
IBM Almaden
Massive data sets
Examples
Internet traffic logs
Financial data
etc.
Algorithms
Want nearly linear time or less
Usually at the cost of a randomized approximation
Regression analysis
Regression
Statistical method to study dependencies between
variables in the presence of noise.
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Example: Ohm's law V = R ⋅ I
[Scatter plot of measured current (I) vs. voltage (V) data]
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Example: Ohm's law V = R ⋅ I
Find the linear function that best fits the data
[Scatter plot of measured current (I) vs. voltage (V) data with a fitted line]
Regression analysis
Linear Regression
Statistical method to study linear dependencies between
variables in the presence of noise.
Standard Setting
One measured variable b
A set of predictor variables a_1, …, a_d
Assumption:
  b = x_0 + a_1 x_1 + … + a_d x_d + e
e is assumed to be noise and the x_i are model parameters we want to learn
Can assume x_0 = 0
Now consider n observations of b
Regression analysis
Matrix form
Input: an n × d matrix A and a vector b = (b_1, …, b_n)
  n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
Consider the over-constrained case, when n ≫ d
Can assume that A has full column rank
Regression analysis
Least Squares Method
Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i − <A_{i*}, x>)^2
  A_{i*} is the i-th row of A
Certain desirable statistical properties
Regression analysis
Geometry of regression
We want to find an x that minimizes |Ax-b|_2
The product Ax can be written as A_{*1} x_1 + A_{*2} x_2 + … + A_{*d} x_d, where A_{*i} is the i-th column of A
This is a linear d-dimensional subspace
The problem is equivalent to computing the point of the column space of A nearest to b in l2-norm
Regression analysis
Solving least squares regression via the normal equations
How to find the solution x to min_x |Ax-b|_2?
Equivalent problem: min_x |Ax-b|_2^2
Write b = Ax' + b', where b' is orthogonal to the columns of A
Cost is |A(x-x')|_2^2 + |b'|_2^2 by the Pythagorean theorem
x is an optimal solution if and only if A^T(Ax-b) = A^T(Ax-Ax') = 0
Normal Equation: A^T Ax = A^T b for any optimal x
  x = (A^T A)^{-1} A^T b
If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum norm solution x
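As a quick illustration of the normal equations above, here is a minimal numpy sketch (my own example, not part of the lecture; the synthetic A and b are assumptions for illustration):

```python
import numpy as np

# Synthetic over-constrained instance (n >> d), assumed only for illustration
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Normal equations: x = (A^T A)^{-1} A^T b (valid since A has full column rank)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Moore-Penrose pseudoinverse gives the minimum-norm solution in general
x_pinv = np.linalg.pinv(A) @ b

print(np.linalg.norm(x_normal - x_pinv))  # ~0 here, since A has full column rank
```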
Moore-Penrose Pseudoinverse
• Any optimal solution x has the form A^- b + (I − V'V'^T) z, where V' corresponds to the rows i of V^T for which Σ_{i,i} > 0
• Why?
• Because A(I − V'V'^T)z = 0, so A^- b + (I − V'V'^T)z is a solution. This is a (d − rank(A))-dimensional affine space, so it spans all optimal solutions
• Since A^- b is in the column span of V', by the Pythagorean theorem,
  |A^- b + (I − V'V'^T)z|_2^2 = |A^- b|_2^2 + |(I − V'V'^T)z|_2^2 ≥ |A^- b|_2^2
Time Complexity
Solving least squares regression via the normal equations
Need to compute x = A^- b
Naively this takes nd^2 time
Can do nd^{1.376} using fast matrix multiplication
But we want much better running time!
Sketching to solve least squares regression
How to find an approximate solution x to min_x |Ax-b|_2?
Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
Draw S from a k x n random family of matrices, for a value k << n
Compute S*A and S*b
Output the solution x' to min_x |(SA)x-(Sb)|_2
  x' = (SA)^- Sb
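A minimal sketch-and-solve example in numpy (my own illustration of the recipe above; the Gaussian S, the choice k = d/ε^2, and the test data are assumptions):

```python
import numpy as np

def sketch_and_solve(A, b, eps=0.5, seed=None):
    """Approximate least squares: sketch with a Gaussian S (k = d/eps^2 rows),
    then solve the small problem min_x |(SA)x - (Sb)|_2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    k = int(np.ceil(d / eps**2))
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))   # i.i.d. N(0, 1/k) entries
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 30))
b = rng.standard_normal(5000)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_apx = sketch_and_solve(A, b, eps=0.5, seed=2)
print(np.linalg.norm(A @ x_apx - b) / np.linalg.norm(A @ x_opt - b))  # ≈ 1 + O(eps)
```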
How to choose the right sketching matrix S?
Recall: output the solution x' to min_x |(SA)x-(Sb)|_2
Lots of matrices work
S is a d/ε^2 x n matrix of i.i.d. normal random variables
To see why this works, we introduce the notion of a subspace embedding
Subspace Embeddings
• Let k = O(d/ε^2)
• Let S be a k x n matrix of i.i.d. normal N(0,1/k) random variables
• For any fixed d-dimensional subspace, i.e., the column space of an n x d matrix A
  – W.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
• Entire column space of A is preserved
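A quick numerical check of this property (my own illustration, not from the lecture): if U is an orthonormal basis for the column space of A, then |SAx|_2 = (1±ε)|Ax|_2 for all x exactly when every singular value of SU lies in [1−ε, 1+ε].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 20000, 10, 0.25
A = rng.standard_normal((n, d)) @ np.diag(10.0 ** np.arange(d))  # ill-conditioned A
U, _ = np.linalg.qr(A)              # orthonormal basis for the column space of A

k = int(4 * d / eps**2)             # k = O(d/eps^2); the constant 4 is an assumption
S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))

sv = np.linalg.svd(S @ U, compute_uv=False)
print(sv.min(), sv.max())           # all singular values should lie within 1 ± eps
```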
Why is this true?
Subspace Embeddings – A Proof
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume columns of A are orthonormal (since we prove this for all x)
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random variables
  – First property: for two independent random variables X and Y, with X drawn from N(0, a^2) and Y drawn from N(0, b^2), X+Y is drawn from N(0, a^2 + b^2)
X+Y is drawn from N(0, a^2 + b^2)
• The probability density function f_Z of Z = X+Y is the convolution of the probability density functions f_X and f_Y:
  f_Z(z) = ∫ f_X(z − x) f_Y(x) dx
• f_X(x) = (1/(a(2π)^{.5})) e^{−x^2/(2a^2)} and f_Y(y) = (1/(b(2π)^{.5})) e^{−y^2/(2b^2)}
• Calculation (completing the square):
  e^{−(z−x)^2/(2a^2)} e^{−x^2/(2b^2)} = e^{−z^2/(2(a^2+b^2))} ⋅ e^{−(x − b^2 z/(a^2+b^2))^2 / (2(ab)^2/(a^2+b^2))}
• Hence
  f_Z(z) = (1/((2π)^{.5}(a^2+b^2)^{.5})) e^{−z^2/(2(a^2+b^2))} ⋅ ∫ ((a^2+b^2)^{.5}/((2π)^{.5} ab)) e^{−(x − b^2 z/(a^2+b^2))^2/(2(ab)^2/(a^2+b^2))} dx
• The remaining integral is the density of a Gaussian distribution, so it equals 1
Rotational Invariance
• Second property: if u, v are vectors with <u, v> = 0, then <g,u> and <g,v> are independent, where g is a vector of i.i.d. N(0,1/k) random variables
• Why?
• If g is an n-dimensional vector of i.i.d. N(0,1) random variables, and R is a fixed matrix, then the probability density function of Rg is
  f(x) = (1/(det(RR^T)^{.5} (2π)^{d/2})) ⋅ e^{−x^T (RR^T)^{−1} x / 2}
  – RR^T is the covariance matrix
  – For a rotation matrix R, the distribution of Rg and of g are the same
Orthogonal Implies Independent
• Want to show: if u, v are vectors with <u, v> = 0, then <g,u> and <g,v> are independent, where g is a vector of i.i.d. N(0,1/k) random variables
• Choose a rotation R which sends u to αe_1, and sends v to βe_2
• <g, u> = <gR, R^T u> = <h, αe_1> = αh_1
• <g, v> = <gR, R^T v> = <h, βe_2> = βh_2
  where h is a vector of i.i.d. N(0, 1/k) random variables
• Then h_1 and h_2 are independent by definition
Where were we?
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random variables
• Proof: The rows of SA are independent
  – Each row is: <g, A_1>, <g, A_2>, …, <g, A_d>
  – First property implies the entries in each row are N(0,1/k), since the columns A_i have unit norm
  – Since the columns A_i are orthonormal, the entries in a row are independent by our second property
Back to Subspace Embeddings
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume columns of A are orthonormal
• Can also assume x is a unit vector
• SA is a k x d matrix of i.i.d. N(0,1/k) random variables
• Consider any fixed unit vector x ∈ R^d
• |SAx|_2^2 = Σ_{i∈[k]} <g_i, x>^2, where g_i is the i-th row of SA
• Each <g_i, x>^2 is distributed as N(0, 1/k)^2
• E[<g_i, x>^2] = 1/k, and so E[|SAx|_2^2] = 1
How concentrated is |SAx|_2^2 about its expectation?
Johnson-Lindenstrauss Theorem
• Suppose h_1, …, h_k are i.i.d. N(0,1) random variables
• Then G = Σ_i h_i^2 is a χ^2-random variable
• Apply known tail bounds to G:
  – (Upper) Pr[ G ≥ k + 2(kx)^{.5} + 2x ] ≤ e^{−x}
  – (Lower) Pr[ G ≤ k − 2(kx)^{.5} ] ≤ e^{−x}
• If x = ε^2 k/16, then Pr[ G ∈ k(1 ± ε) ] ≥ 1 − 2e^{−ε^2 k/16}
• If k = Θ(ε^{−2} log(1/δ)), this probability is 1-δ
• Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
This only holds for a fixed x, how to argue for all x?
Net for Sphere
• Consider the sphere S^{d−1}
• Subset N is a γ-net if for all x ∈ S^{d−1}, there is a y ∈ N such that |x − y|_2 ≤ γ
• Greedy construction of N
  – While there is a point x ∈ S^{d−1} of distance larger than γ from every point in N, include x in N
• The sphere of radius γ/2 around every point in N is contained in the sphere of radius 1 + γ around 0^d
• Further, all such spheres are disjoint
• The ratio of the volume of the d-dimensional sphere of radius 1 + γ to that of the d-dimensional sphere of radius γ/2 is (1 + γ)^d/(γ/2)^d, so |N| ≤ (1 + γ)^d/(γ/2)^d
Net for Subspace
• Let M = {Ax | x in N}, so |M| ≤ (1 + γ)^d/(γ/2)^d
• Claim: For every x in S^{d−1}, there is a y in M for which |Ax − y|_2 ≤ γ
• Proof: Let x' in S^{d−1} be such that |x − x'|_2 ≤ γ
  Then |Ax − Ax'|_2 = |x − x'|_2 ≤ γ, using that the columns of A are orthonormal. Set y = Ax'
Net Argument
• For a fixed unit x, Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
• For a fixed pair of unit x, x', the quantities |SAx|_2^2, |SAx'|_2^2, |SA(x − x')|_2^2 are all within 1 ± ε of their un-sketched counterparts with probability 1 − 2^{−Θ(d)}
• |SA(x − x')|_2^2 = |SAx|_2^2 + |SAx'|_2^2 − 2<SAx, SAx'>
• |A(x − x')|_2^2 = |Ax|_2^2 + |Ax'|_2^2 − 2<Ax, Ax'>
  – So Pr[ <Ax, Ax'> = <SAx, SAx'> ± O(ε) ] = 1 − 2^{−Θ(d)}
• Choose a ½-net M = {Ax | x in N} of size 5^d
• By a union bound, for all pairs y, y' in M, <y, y'> = <Sy, Sy'> ± O(ε)
• Condition on this event
• By linearity, if this holds for y, y' in M, then for αy, βy' we have <αy, βy'> = αβ<Sy, Sy'> ± O(ε)αβ
Finishing the Net Argument
• Let y = Ax for an arbitrary x ∈ S^{d−1}
• Let y_1 ∈ M be such that |y − y_1|_2 ≤ γ
• Let α be such that |α(y − y_1)|_2 = 1
  – α ≥ 1/γ (could be infinite)
• Let y_2' ∈ M be such that |α(y − y_1) − y_2'|_2 ≤ γ
• Set y_2 = y_2'/α. Then |y − y_1 − y_2|_2 ≤ γ/α ≤ γ^2
• Repeat, obtaining y_1, y_2, y_3, … such that for all integers i,
  |y − y_1 − y_2 − … − y_i|_2 ≤ γ^i
• Implies |y_i|_2 ≤ γ^{i−1} + γ^i ≤ 2γ^{i−1}
Finishing the Net Argument
• Have y_1, y_2, y_3, … such that y = Σ_i y_i and |y_i|_2 ≤ 2γ^{i−1}
• |Sy|_2^2 = |S Σ_i y_i|_2^2
    = Σ_i |Sy_i|_2^2 + 2 Σ_{i,j} <Sy_i, Sy_j>
    = Σ_i |y_i|_2^2 + 2 Σ_{i,j} ( <y_i, y_j> ± O(ε) |y_i|_2 |y_j|_2 )
    = |Σ_i y_i|_2^2 ± O(ε)
    = |y|_2^2 ± O(ε)
    = 1 ± O(ε)
• Since this held for an arbitrary y = Ax for unit x, by linearity it follows that for all x, |SAx|_2 = (1±ε)|Ax|_2
Back to Regression
• We showed that S is a subspace embedding, that is, simultaneously for all x, |SAx|_2 = (1±ε)|Ax|_2
What does this have to do with regression?
Subspace Embeddings for
Regression
• Want x so that |Ax-b|_2 ≤ (1+ε) min_y |Ay-b|_2
• Consider the subspace L spanned by the columns of A together with b
• Then for all y in L, |Sy|_2 = (1±ε)|y|_2
• Hence, |S(Ax-b)|_2 = (1±ε)|Ax-b|_2 for all x
• Solve argmin_y |(SA)y − (Sb)|_2
• Given SA, Sb, can solve in poly(d/ε) time
Only problem: computing SA takes O(nd^2) time
How to choose the right sketching matrix S? [S]
S is a Subsampled Randomized Hadamard Transform
S = P*H*D
D is a diagonal matrix with +1, -1 on the diagonal
H is the Hadamard transform
P just chooses a random (small) subset of rows of H*D
S*A can be computed in O(nd log n) time
Why does it work?
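Before the analysis, here is a small numpy illustration of the S = P*H*D construction (my own sketch; the padding to a power of 2, the in-place Walsh-Hadamard transform, and the sampling constants are assumptions, not lecture code):

```python
import numpy as np

def fwht(X):
    """In-place fast Walsh-Hadamard transform along the rows of X.
    X has shape (n, d) with n a power of 2; cost is O(nd log n)."""
    n = X.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            b = X[i + h:i + 2 * h]
            X[i:i + h] = a + b
            X[i + h:i + 2 * h] = a - b
        h *= 2
    return X

def srht_sketch(A, k, seed=None):
    """Return k sampled, rescaled rows of H*D*A (an SRHT-style sketch of A)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 1 << (n - 1).bit_length()               # pad n up to the next power of 2
    B = np.zeros((m, d))
    B[:n] = A
    B *= rng.choice([-1.0, 1.0], size=(m, 1))   # D: random signs on the diagonal
    fwht(B)                                     # H: Hadamard transform
    B /= np.sqrt(m)                             # normalize so that HD is a rotation
    rows = rng.integers(0, m, size=k)           # P: sample k rows uniformly
    return B[rows] * np.sqrt(m / k)             # rescale so norms are preserved in expectation

A = np.random.default_rng(0).standard_normal((4096, 8))
SA = srht_sketch(A, k=512, seed=1)
print(np.linalg.norm(SA.T @ SA - A.T @ A) / np.linalg.norm(A.T @ A))  # small relative error
```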
Why does this work?
We can again assume the columns of A are orthonormal
It suffices to show |SAx|_2^2 = |PHDAx|_2^2 = 1 ± ε for all x
HD is a rotation matrix, so |HDAx|_2^2 = |Ax|_2^2 = 1 for any x
Notation: let y = Ax
Flattening Lemma: For any fixed y,
  Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
Proving the Flattening Lemma
Flattening Lemma: Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
Let C > 0 be a constant. We will show, for a fixed i in [n],
  Pr[ |(HDy)_i| ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2nd)
If we show this, we can apply a union bound over all i
(HDy)_i = Σ_j H_{i,j} D_{j,j} y_j
(Azuma-Hoeffding) Pr[ |Σ_j Z_j| > t ] ≤ 2e^{−t^2/(2 Σ_j β_j^2)}, where |Z_j| ≤ β_j with probability 1
Z_j = H_{i,j} D_{j,j} y_j has 0 mean
|Z_j| ≤ |y_j|/n^{.5} = β_j with probability 1
Σ_j β_j^2 = 1/n
Pr[ |Σ_j Z_j| > C log^{.5}(nd/δ) / n^{.5} ] ≤ 2e^{−C^2 log(nd/δ)/2} ≤ δ/(2nd)
Consequence of the Flattening Lemma
Recall the columns of A are orthonormal
HDA has orthonormal columns
The Flattening Lemma implies |HDAe_i|_∞ ≤ C log^{.5}(nd/δ) / n^{.5} with probability 1 − δ/(2d) for a fixed i ∈ [d]
With probability 1 − δ/2, |e_j HDA e_i| ≤ C log^{.5}(nd/δ) / n^{.5} for all i,j
Given this, |e_j HDA|_2 ≤ C d^{.5} log^{.5}(nd/δ) / n^{.5} for all j
(Can be optimized further)
Matrix Chernoff Bound
Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0,
  Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
  (here |W|_2 = sup |Wx|_2 / |x|_2)
Let V = HDA, and recall V has orthonormal columns
Suppose P in the S = PHD definition samples uniformly with replacement. If row i is sampled in the j-th sample, then P_{j,i} = (n/s)^{.5}, and is 0 otherwise
Let Y_i be the i-th sampled row of V = HDA
Let X_i = I_d − n ⋅ Y_i^T Y_i
E[X_i] = I_d − n ⋅ Σ_j (1/n) V_j^T V_j = I_d − V^T V = 0_d
|X_i|_2 ≤ |I_d|_2 + n ⋅ max_j |e_j HDA|_2^2 = 1 + n ⋅ C^2 log(nd/δ) ⋅ (d/n) = Θ(d log(nd/δ))
Matrix Chernoff Bound
Recall: let Y_i be the i-th sampled row of V = HDA, and X_i = I_d − n ⋅ Y_i^T Y_i
E[X^T X] + I_d = I_d + I_d − 2n E[Y_i^T Y_i] + n^2 E[Y_i^T Y_i Y_i^T Y_i]
  = 2I_d − 2I_d + n^2 Σ_i (1/n) v_i^T v_i v_i^T v_i
  = n Σ_i v_i^T v_i |v_i|_2^2
Define Z = n Σ_i v_i^T v_i ⋅ C^2 log(nd/δ) ⋅ (d/n) = C^2 d log(nd/δ) ⋅ Σ_i v_i^T v_i
Note that E[X^T X] + I_d and Z are real symmetric, with non-negative eigenvalues
Claim: for all vectors y, we have y^T E[X^T X] y + y^T y ≤ y^T Z y
Proof: y^T E[X^T X] y + y^T y = n Σ_i (y^T v_i^T v_i y) |v_i|_2^2 = n Σ_i <v_i, y>^2 |v_i|_2^2, while
  y^T Z y = n Σ_i (y^T v_i^T v_i y) C^2 log(nd/δ) ⋅ (d/n) = n Σ_i <v_i, y>^2 C^2 log(nd/δ) ⋅ (d/n),
  and |v_i|_2^2 ≤ C^2 d log(nd/δ)/n for all i
Hence, |E[X^T X]|_2 ≤ |E[X^T X] + I_d|_2 + |I_d|_2 = |E[X^T X] + I_d|_2 + 1 ≤ |Z|_2 + 1 ≤ C^2 d log(nd/δ) + 1 = O(d log(nd/δ))
Matrix Chernoff Bound
Hence, |E[X^T X]|_2 = O(d log(nd/δ))
Recall: (Matrix Chernoff) Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0, Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
So Pr[ |I_d − (PHDA)^T PHDA|_2 > ε ] ≤ 2d ⋅ e^{−sε^2 / Θ(d log(nd/δ))}
Set s = d log(nd/δ) log(d/δ) / ε^2 to make this probability less than δ/2
SRHT Wrapup
Have shown |I_d − (PHDA)^T PHDA|_2 < ε using the Matrix Chernoff Bound, with s = d log(nd/δ) log(d/δ)/ε^2 samples
Implies for every unit vector x,
  |1 − |PHDAx|_2^2| = |x^T x − x^T (PHDA)^T (PHDA) x| < ε,
so |PHDAx|_2^2 ∈ 1 ± ε for all unit vectors x
Considering the column span of A adjoined with b, we can again solve the regression problem
The time for regression is now only O(nd log n) + poly(d log(n)/ε). Nearly optimal in the matrix dimensions (n >> d)
Faster Subspace Embeddings S [CW,MM,NN]
CountSketch matrix
Define a k x n matrix S, for k = O(d^2/ε^2)
S is really sparse: a single randomly chosen non-zero entry per column, e.g.,

  [ 0  0  1  0  0  1  0  0 ]
  [ 1  0  0  0  0  0  0  0 ]
  [ 0  0  0 -1  1  0 -1  0 ]
  [ 0 -1  0  0  0  0  0  1 ]

nnz(A) is the number of non-zero entries of A
Can compute S ⋅ A in nnz(A) time!
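A minimal numpy sketch of applying a CountSketch matrix (my own helper, named countsketch, which a few later sketches below reuse; it is an assumption, not lecture code). For a dense A it simply does bucketed, signed row sums; with a sparse A one would stream over the non-zeros to get nnz(A) time.

```python
import numpy as np

def countsketch(A, k, rng=None):
    """Compute S @ A for a CountSketch S with k rows: each column of S has a
    single random ±1 entry, so S @ A is a bucketed, signed sum of the rows of A."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    h = rng.integers(0, k, size=n)             # bucket (row of S) for each row of A
    sign = rng.choice([-1.0, 1.0], size=n)     # random sign for each row of A
    SA = np.zeros((k, d))
    np.add.at(SA, h, sign[:, None] * A)        # unbuffered scatter-add into buckets
    return SA

rng = np.random.default_rng(0)
A = rng.standard_normal((100000, 10))
SA = countsketch(A, k=2000, rng=1)
print(SA.shape, np.linalg.norm(SA.T @ SA - A.T @ A) / np.linalg.norm(A.T @ A))
```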
Simple Proof [Nguyen]
Can assume columns of A are orthonormal
Suffices to show |SAx|_2 = 1 ± ε for all unit x
For regression, apply S to [A, b]
SA is a 2d^2/ε^2 x d matrix
Suffices to show |A^T S^T SA − I|_2 ≤ |A^T S^T SA − I|_F ≤ ε
Matrix product result shown below:
  Pr[ |CS^T SD − CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
Set C = A^T and D = A.
Then |A|_F^2 = d and (# rows of S) = 3d^2/(δε^2)
Matrix Product Result [Kane, Nelson]
Show: Pr[ |CS^T SD − CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
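A quick numerical illustration of this approximate matrix product guarantee (my own check, reusing the countsketch helper defined earlier; the sizes and the δ = 1 version of the bound are assumptions):

```python
import numpy as np
# assumes the countsketch(...) helper defined earlier

rng = np.random.default_rng(0)
n = 50000
A = rng.standard_normal((n, 15))
B = rng.standard_normal((n, 20))

k = 3000                                          # number of rows of S
SAB = countsketch(np.hstack([A, B]), k, rng=1)    # one sketch S applied to [A, B]
SA, SB = SAB[:, :15], SAB[:, 15:]

err = np.linalg.norm(SA.T @ SB - A.T @ B)
bound = np.sqrt(3.0 / k) * np.linalg.norm(A) * np.linalg.norm(B)  # Frobenius norms
print(err, bound)                                 # err is typically well below the bound
```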
From Vectors to Matrices
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
Proof: For a random scalar X, let |X|_p = (E|X|^p)^{1/p}
Sometimes we consider X = |T|_F for a random matrix T, i.e., | |T|_F |_p = (E |T|_F^p)^{1/p}
Can show |⋅|_p is a norm
Minkowski's Inequality: |X + Y|_p ≤ |X|_p + |Y|_p
For unit vectors x, y, need to bound |<Sx, Sy> − <x, y>|_ℓ
From Vectors to Matrices
For unit vectors x, y,
  |<Sx, Sy> − <x, y>|_ℓ
  = (1/2) | (|Sx|_2^2 − 1) + (|Sy|_2^2 − 1) − (|S(x−y)|_2^2 − |x−y|_2^2) |_ℓ
  ≤ (1/2) ( | |Sx|_2^2 − 1 |_ℓ + | |Sy|_2^2 − 1 |_ℓ + | |S(x−y)|_2^2 − |x−y|_2^2 |_ℓ )
  ≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2^2 ⋅ ε δ^{1/ℓ} )
  ≤ 3 ε ⋅ δ^{1/ℓ}
By linearity, for arbitrary x, y,
  |<Sx, Sy> − <x, y>|_ℓ / (|x|_2 |y|_2) ≤ 3 ε ⋅ δ^{1/ℓ}
Suppose A has d columns and B has e columns. Let the columns of A be A_1, …, A_d and the columns of B be B_1, …, B_e
Define X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> )
Then
  |A^T S^T SB − A^T B|_F^2 = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
From Vectors to Matrices
Have shown: for arbitrary x, y, |<Sx, Sy> − <x, y>|_ℓ / (|x|_2 |y|_2) ≤ 3 ε ⋅ δ^{1/ℓ}
For X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> ):
  |A^T S^T SB − A^T B|_F^2 = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
So
  | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2} = | Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2 |_{ℓ/2}
  ≤ Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j}^2 |_{ℓ/2}
  = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j} |_ℓ^2
  ≤ (3 ε δ^{1/ℓ})^2 Σ_{i,j} |A_i|_2^2 |B_j|_2^2
  = (3 ε δ^{1/ℓ})^2 |A|_F^2 |B|_F^2
Since E[ |A^T S^T SB − A^T B|_F^ℓ ] = | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2}^{ℓ/2}, by Markov's inequality,
  Pr[ |A^T S^T SB − A^T B|_F > 3ε |A|_F |B|_F ] ≤ (1/(3ε |A|_F |B|_F))^ℓ ⋅ E[ |A^T S^T SB − A^T B|_F^ℓ ] ≤ δ
Result for Vectors
Show: Pr[ |CS^T SD − CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
Just need to show that the CountSketch matrix S satisfies the JL property and bound the number k of rows
CountSketch Satisfies the JL Property
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
We show this property holds with ℓ = 2. First, let us consider ℓ = 1
For the CountSketch matrix S, let
  h: [n] -> [k] be a 2-wise independent hash function
  σ: [n] -> {−1, 1} be a 4-wise independent hash function
Let δ(E) = 1 if event E holds, and δ(E) = 0 otherwise
E[ |Sx|_2^2 ] = Σ_{j∈[k]} E[ ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ]
  = Σ_{j∈[k]} Σ_{i1,i2∈[n]} E[ δ(h(i1) = j) δ(h(i2) = j) σ_{i1} σ_{i2} ] x_{i1} x_{i2}
  = Σ_{j∈[k]} Σ_{i∈[n]} E[ δ(h(i) = j)^2 ] x_i^2
  = Σ_{j∈[k]} (1/k) Σ_{i∈[n]} x_i^2 = |x|_2^2
CountSketch Satisfies the JL Property
E[ |Sx|_2^4 ] = E[ ( Σ_{j∈[k]} ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ) ⋅ ( Σ_{j'∈[k]} ( Σ_{i'∈[n]} δ(h(i') = j') σ_{i'} x_{i'} )^2 ) ]
  = Σ_{j1,j2,i1,i2,i3,i4} E[ σ_{i1} σ_{i2} σ_{i3} σ_{i4} δ(h(i1) = j1) δ(h(i2) = j1) δ(h(i3) = j2) δ(h(i4) = j2) ] x_{i1} x_{i2} x_{i3} x_{i4}
We must be able to partition {i1, i2, i3, i4} into equal pairs
Suppose i1 = i2 = i3 = i4. Then necessarily j1 = j2. Obtain Σ_{j,i} (1/k) x_i^4 = |x|_4^4
Suppose i1 = i2 and i3 = i4 but i1 ≠ i3. Then get Σ_{j1,j2,i1,i3} (1/k^2) x_{i1}^2 x_{i3}^2 = |x|_2^4 − |x|_4^4
Suppose i1 = i3 and i2 = i4 but i1 ≠ i2. Then necessarily j1 = j2. Obtain Σ_j (1/k^2) Σ_{i1,i2} x_{i1}^2 x_{i2}^2 ≤ (1/k) |x|_2^4. Obtain the same bound if i1 = i4 and i2 = i3.
Hence, E[ |Sx|_2^4 ] ∈ [ |x|_2^4, |x|_2^4 (1 + 2/k) ] = [1, 1 + 2/k]
So, E_S ( |Sx|_2^2 − 1 )^2 = E[ |Sx|_2^4 ] − 2 E[ |Sx|_2^2 ] + 1 ≤ 1 + 2/k − 2 + 1 = 2/k. Setting k = 2/(ε^2 δ) finishes the proof
Where are we?
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T SB − A^T B|_F^2 ≥ 9 ε^2 |A|_F^2 |B|_F^2 ] ≤ δ
We showed CountSketch has the JL property with ℓ = 2 and k = 2/(ε^2 δ)
The matrix product result we wanted was:
  Pr[ |CS^T SD − CD|_F^2 ≤ (3/(δk)) ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
We are now done with the proof that CountSketch is a subspace embedding
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Affine Embeddings
Want to solve min_X |AX − B|_F^2, where A is tall and thin with d columns, but B has a large number of columns
Can't directly apply subspace embeddings
Let's try to show |SAX − SB|_F = (1 ± ε) |AX − B|_F for all X and see what properties we need of S
Can assume A has orthonormal columns
Let B* = AX* − B, where X* is the optimum
|S(AX − B)|_F^2 − |SB*|_F^2 = |SA(X − X*) + S(AX* − B)|_F^2 − |SB*|_F^2
  = |SA(X − X*)|_F^2 + 2 tr[ (X − X*)^T A^T S^T S B* ]
  ∈ |SA(X − X*)|_F^2 ± 2 |X − X*|_F |A^T S^T S B*|_F   (using tr(CD) ≤ |C|_F |D|_F)
  ∈ |SA(X − X*)|_F^2 ± 2ε |X − X*|_F |B*|_F   (if we have approximate matrix product)
  ∈ |A(X − X*)|_F^2 ± ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )   (subspace embedding for A)
Affine Embeddings
We have |S(AX − B)|_F^2 − |SB*|_F^2 ∈ |A(X − X*)|_F^2 ± ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
The normal equations imply that |AX − B|_F^2 = |A(X − X*)|_F^2 + |B*|_F^2
Also, |SB*|_F^2 = (1 ± ε) |B*|_F^2 (this holds with constant probability)
Hence,
  |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 )
  ∈ ± ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
  ⊆ ± ε( |A(X − X*)|_F + |B*|_F )^2
  ⊆ ± 2ε( |A(X − X*)|_F^2 + |B*|_F^2 )
  = ± 2ε |AX − B|_F^2
(here |X − X*|_F = |A(X − X*)|_F since A has orthonormal columns)
Affine Embeddings
Know: |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 ) ∈ ± 2ε |AX − B|_F^2
Know: |SB*|_F^2 = (1 ± ε) |B*|_F^2
So
  |S(AX − B)|_F^2 = (1 ± 2ε) |AX − B|_F^2 + |SB*|_F^2 − |B*|_F^2
  = (1 ± 2ε) |AX − B|_F^2 ± ε |B*|_F^2
  = (1 ± 3ε) |AX − B|_F^2
Completes the proof of the affine embedding!
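A small numerical illustration of the affine embedding guarantee (my own sketch, reusing the countsketch helper defined earlier; the sizes are assumptions): solving the sketched problem min_X |SAX − SB|_F gives a solution whose true cost is close to optimal even though B has many columns.

```python
import numpy as np
# assumes the countsketch(...) helper defined earlier

rng = np.random.default_rng(0)
n, d, m = 20000, 10, 200                  # B has many (m) columns
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))

X_opt, *_ = np.linalg.lstsq(A, B, rcond=None)     # optimal multi-response solution

k = 2000
SAB = countsketch(np.hstack([A, B]), k, rng=1)    # sketch [A, B] with one S
SA, SB = SAB[:, :d], SAB[:, d:]
X_sk, *_ = np.linalg.lstsq(SA, SB, rcond=None)

print(np.linalg.norm(A @ X_sk - B) / np.linalg.norm(A @ X_opt - B))  # ≈ 1 + O(eps)
```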
Affine Embeddings: Missing Proofs
Claim: |A + B|_F^2 = |A|_F^2 + |B|_F^2 + 2 Tr(A^T B)
Proof: |A + B|_F^2 = Σ_i |A_i + B_i|_2^2
  = Σ_i ( |A_i|_2^2 + |B_i|_2^2 + 2 <A_i, B_i> )
  = |A|_F^2 + |B|_F^2 + 2 Tr(A^T B)
Affine Embeddings: Missing Proofs
Claim: Tr(AB) ≤ |A|_F |B|_F
Proof: Tr(AB) = Σ_i <A^i, B_i>   for rows A^i of A and columns B_i of B
  ≤ Σ_i |A^i|_2 |B_i|_2   by Cauchy-Schwarz for each i
  ≤ ( Σ_i |A^i|_2^2 )^{1/2} ( Σ_i |B_i|_2^2 )^{1/2}   another Cauchy-Schwarz
  = |A|_F |B|_F
Affine Embeddings: Homework Proof
Claim: |SB*|_F^2 = (1 ± ε) |B*|_F^2 with constant probability if the CountSketch matrix S has k = O(1/ε^2) rows
Proof:
|SB*|_F^2 = Σ_i |SB*_i|_2^2
By our analysis for CountSketch and linearity of expectation,
  E[ |SB*|_F^2 ] = Σ_i E[ |SB*_i|_2^2 ] = |B*|_F^2
E[ |SB*|_F^4 ] = Σ_{i,j} E[ |SB*_i|_2^2 |SB*_j|_2^2 ]
By our CountSketch analysis, E[ |SB*_i|_2^4 ] ≤ |B*_i|_2^4 (1 + 2/k)
For the cross terms, see Lemma 40 in [CW13]
Low rank approximation
A is an n x d matrix
Think of n points in R^d
E.g., A is a customer-product matrix
  A_{i,j} = how many times customer i purchased item j
A is typically well-approximated by a low rank matrix
  E.g., it has high rank because of noise
Goal: find a low rank matrix approximating A
  Easy to store, data more interpretable
What is a good low rank approximation?
Singular Value Decomposition (SVD)
Any matrix A = U ⋅ Σ ⋅ V^T
  U has orthonormal columns
  Σ is diagonal with non-increasing positive entries down the diagonal
  The rows of V^T are orthonormal
  (|C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2})
Rank-k approximation: A_k = U_k ⋅ Σ_k ⋅ V_k^T
  A_k = argmin_{rank k matrices B} |A − B|_F
  The rows of V_k^T are the top k principal components
Computing A_k exactly is expensive
Low rank approximation
Goal: output a rank k matrix A’, so that
|A-A'|_F ≤ (1+ε) |A-A_k|_F
Can do this in nnz(A) + (n+d)*poly(k/ε) time [S,CW]
nnz(A) is the number of non-zero entries of A
Solution to low-rank approximation [S]
Given n x d input matrix A
Compute S*A using a random matrix S with k/ε << n rows. S*A takes random linear combinations of the rows of A
[Figure: the wide matrix A is compressed to the much smaller matrix SA]
Project the rows of A onto SA, then find the best rank-k approximation to the points inside of SA.
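An illustrative numpy version of this two-step algorithm (my own sketch of the idea with a Gaussian S and untuned constants; the helper name and test matrix are assumptions):

```python
import numpy as np

def sketch_low_rank(A, k, eps=0.5, seed=None):
    """Rank-k approximation via sketching: project the rows of A onto rowspace(SA),
    then take the best rank-k approximation of the projected points.
    Returns factors (L, R) with A ≈ L @ R."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = int(np.ceil(k / eps)) + k           # number of sketch rows, roughly k/eps
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A                               # m x d sketch
    Q = np.linalg.qr(SA.T)[0]                # d x m orthonormal basis of rowspace(SA)
    AQ = A @ Q                               # coordinates of the projected rows, n x m
    U, s, Vt = np.linalg.svd(AQ, full_matrices=False)
    L = U[:, :k] * s[:k]                     # best rank-k approx of A @ Q, factored
    R = Vt[:k] @ Q.T                         # map back to the original d coordinates
    return L, R

rng = np.random.default_rng(0)
A = rng.standard_normal((3000, 60)) @ rng.standard_normal((60, 400))  # near low rank
A += 0.01 * rng.standard_normal(A.shape)
L, R = sketch_low_rank(A, k=20, seed=1)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - (U[:, :20] * s[:20]) @ Vt[:20])
print(np.linalg.norm(A - L @ R) / best)      # ≈ 1 + O(eps)
```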
What is the matrix S?
S can be a k/ε x n matrix of i.i.d. normal random variables
[S] S can be a k/ε x n Fast Johnson Lindenstrauss Matrix
  Uses the Fast Fourier Transform
[CW] S can be a poly(k/ε) x n CountSketch matrix, e.g.,

  [ 0  0  1  0  0  1  0  0 ]
  [ 1  0  0  0  0  0  0  0 ]
  [ 0  0  0 -1  1  0 -1  0 ]
  [ 0 -1  0  0  0  0  0  1 ]

S ⋅ A can be computed in nnz(A) time
Why do these Matrices Work?
Consider the regression problem min_X |A_k X − A|_F
Write A_k = WY, where W is n x k and Y is k x d
Let S be an affine embedding
Then |SA_k X − SA|_F = (1 ± ε) |A_k X − A|_F for all X
By the normal equations, argmin_X |SA_k X − SA|_F = (SA_k)^- SA
So, |A_k (SA_k)^- SA − A|_F ≤ (1 + ε) |A_k − A|_F
But A_k (SA_k)^- SA is a rank-k matrix in the row span of SA!
Let's formalize why the algorithm works now…
Why do these Matrices Work?
min_{rank-k X} |XSA − A|_F^2 ≤ |A_k (SA_k)^- SA − A|_F^2 ≤ (1 + ε) |A − A_k|_F^2
By the normal equations,
  |XSA − A|_F^2 = |XSA − A(SA)^- SA|_F^2 + |A(SA)^- SA − A|_F^2
Hence,
  min_{rank-k X} |XSA − A|_F^2 = |A(SA)^- SA − A|_F^2 + min_{rank-k X} |XSA − A(SA)^- SA|_F^2
Can write SA = UΣV^T in its SVD, where U, Σ are k x k and V^T is k x d
Then,
  min_{rank-k X} |XSA − A(SA)^- SA|_F^2 = min_{rank-k X} |XUΣ − A(SA)^- UΣ|_F^2 = min_{rank-k Y} |Y − A(SA)^- UΣ|_F^2
Hence, we can just compute the SVD of A(SA)^- UΣ
But how do we compute A(SA)^- UΣ quickly?
Caveat: projecting the points onto SA is slow
Current algorithm:
1. Compute S*A
2. Project each of the rows of A onto S*A
3. Find the best rank-k approximation of the projected points inside of the rowspace of S*A
Bottleneck is step 2
[CW] Approximate the projection
  Fast algorithm for approximate regression: min_{rank-k X} |X(SA) − A|_F^2
  Can solve with affine embeddings: min_{rank-k X} |X(SA)R − AR|_F^2
Want nnz(A) + (n+d)*poly(k/ε) time
Using Affine Embeddings
We know we can just output argmin_{rank-k X} |XSA − A|_F^2
Choose an affine embedding R:
  |XSAR − AR|_F^2 = (1 ± ε) |XSA − A|_F^2 for all X
Note: we can compute AR and SAR in nnz(A) time
Can just solve min_{rank-k X} |XSAR − AR|_F^2
  min_{rank-k X} |XSAR − AR|_F^2 = |AR(SAR)^- SAR − AR|_F^2 + min_{rank-k X} |XSAR − AR(SAR)^- SAR|_F^2
Compute min_{rank-k Y} |Y − AR(SAR)^- SAR|_F^2 using the SVD, which takes (n + d) poly(k/ε) time
Necessarily, Y = XSAR for some X. Output Y(SAR)^- SA in factored form. We're done!
Low Rank Approximation Summary
1. Compute SA
2. Compute SAR and AR
3. Compute min_{rank-k Y} |Y − AR(SAR)^- SAR|_F^2 using the SVD
4. Output Y(SAR)^- SA in factored form
Overall time: nnz(A) + (n+d)poly(k/ε)
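A compact numpy sketch of this four-step pipeline (my own illustrative version: CountSketch-style S via the countsketch helper from earlier, a Gaussian R, and untuned sketch sizes, all of which are assumptions):

```python
import numpy as np
# assumes the countsketch(...) helper defined earlier

def fast_low_rank(A, k, m_s=None, m_r=None, seed=None):
    """Sketch-based rank-k approximation; returns (L, SA) with A ≈ L @ SA."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m_s = m_s or 10 * k                                # rows of S (untuned)
    m_r = m_r or 20 * k                                # columns of R (untuned)
    SA = countsketch(A, m_s, rng=rng)                  # step 1: m_s x d
    R = rng.standard_normal((d, m_r)) / np.sqrt(m_r)   # affine-embedding-style R
    AR, SAR = A @ R, SA @ R                            # step 2
    B = AR @ np.linalg.pinv(SAR) @ SAR                 # project AR onto rowspace(SAR)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)   # step 3: best rank-k of B
    Y = (U[:, :k] * s[:k]) @ Vt[:k]
    L = Y @ np.linalg.pinv(SAR)                        # step 4: Y (SAR)^- SA, factored
    return L, SA

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 30)) @ rng.standard_normal((30, 300))
A += 0.01 * rng.standard_normal(A.shape)
L, SAfac = fast_low_rank(A, k=10, seed=1)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - (U[:, :10] * s[:10]) @ Vt[:10])
print(np.linalg.norm(A - L @ SAfac) / best)   # a modest factor above 1 for these sizes
```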
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
High Precision Regression
Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
Our algorithms all have running time poly(d/ε)
Goal: Sometimes we want running time poly(d)*log(1/ε)
Want to make A well-conditioned
  κ(A) = sup_{|x|_2 = 1} |Ax|_2 / inf_{|x|_2 = 1} |Ax|_2
Lots of algorithms' time complexity depends on κ(A)
Use sketching to reduce κ(A) to O(1)!
Small QR Decomposition
Let S be a (1 + ε_0)-subspace embedding for A
Compute SA
Compute a QR-factorization, SA = QR^{-1}, where Q has orthonormal columns
Claim: κ(AR) ≤ (1+ε_0)/(1−ε_0)
  For all unit x, (1 − ε_0)|ARx|_2 ≤ |SARx|_2 = 1   (since SAR = Q has orthonormal columns)
  For all unit x, (1 + ε_0)|ARx|_2 ≥ |SARx|_2 = 1
  So κ(AR) = sup_{|x|_2=1} |ARx|_2 / inf_{|x|_2=1} |ARx|_2 ≤ (1+ε_0)/(1−ε_0)
Finding a Constant Factor Solution
Let S be a (1 + ε_0)-subspace embedding for AR
Solve x_0 = argmin_x |SARx − Sb|_2
Time to compute AR and x_0 is nnz(A) + poly(d) for constant ε_0
Iterate: x_{m+1} ← x_m + R^T A^T (b − AR x_m)
AR(x_{m+1} − x*) = AR( x_m + R^T A^T (b − ARx_m) − x* )
  = (AR − AR R^T A^T AR)(x_m − x*)
  = U(Σ − Σ^3)V^T (x_m − x*),
  where AR = UΣV^T is the SVD of AR
|AR(x_{m+1} − x*)|_2 = |(Σ − Σ^3)V^T(x_m − x*)|_2 = O(ε_0) |AR(x_m − x*)|_2
|ARx_m − b|_2^2 = |AR(x_m − x*)|_2^2 + |ARx* − b|_2^2
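A numpy sketch of this sketch-to-precondition approach (my own illustration, reusing the countsketch helper from earlier; the sketch size, iteration count, and starting point x = 0 are assumptions — the slides start from the sketched solution x_0):

```python
import numpy as np
# assumes the countsketch(...) helper defined earlier

def high_precision_regression(A, b, iters=30, sketch_rows=None, seed=None):
    """Precondition with a sketch-based QR, then iterate x <- x + (AR)^T (b - AR x)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    k = sketch_rows or 30 * d * d
    SA = countsketch(A, k, rng=rng)
    Q, R_qr = np.linalg.qr(SA)            # SA = Q R_qr; with R = R_qr^{-1}, SAR = Q
    R = np.linalg.inv(R_qr)
    AR = A @ R                            # well-conditioned since S is a subspace embedding
    x = np.zeros(d)                       # the slides start from x0 = argmin |SARx - Sb|
    for _ in range(iters):
        x = x + AR.T @ (b - AR @ x)       # x_{m+1} = x_m + R^T A^T (b - AR x_m)
    return R @ x                          # undo the change of variables

rng = np.random.default_rng(0)
A = rng.standard_normal((100000, 20)) @ np.diag(10.0 ** np.linspace(0, 4, 20))
b = rng.standard_normal(100000)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x = high_precision_regression(A, b, seed=1)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(A @ x_opt - b))  # ≈ 1 to many digits
```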
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Leverage Score Sampling
This is another subspace embedding, but it is based on sampling!
If A has sparse rows, then SA has sparse rows!
Let A = UΣV^T be an n x d matrix with rank d, written in its SVD
Define the i-th leverage score ℓ_i of A to be |U_{i,*}|_2^2
What is Σ_i ℓ_i? (It is d, since the columns of U are orthonormal)
Let q_1, …, q_n be a distribution with q_i ≥ βℓ_i/d, where β is a parameter
Define the sampling matrix S = D ⋅ Ω^T, where D is k x k and Ω is n x k
  Ω is a sampling matrix, and D is a rescaling matrix
For each column j of Ω, D, independently, and with replacement, pick a row index i in [n] with probability q_i, and set Ω_{i,j} = 1 and D_{j,j} = 1/(q_i k)^{.5}
Leverage Score Sampling
Note: leverage scores do not depend on the choice of orthonormal basis U for the columns of A
Indeed, let U and U' be two such orthonormal bases
Claim: |e_i U|_2^2 = |e_i U'|_2^2 for all i
Proof: Since both U and U' have column space equal to that of A, we have U = U'Z for a change of basis matrix Z
Since U and U' each have orthonormal columns, Z is a rotation matrix (orthonormal rows and columns)
Then |e_i U|_2^2 = |e_i U'Z|_2^2 = |e_i U'|_2^2
Leverage Score Sampling gives a Subspace Embedding
Want to show for S = D ⋅ Ω^T that |SAx|_2^2 = (1 ± ε)|Ax|_2^2 for all x
Writing A = UΣV^T in its SVD, this is equivalent to showing |SUy|_2^2 = (1 ± ε)|Uy|_2^2 = (1 ± ε)|y|_2^2 for all y
As usual, we can just show that with high probability, |U^T S^T SU − I|_2 ≤ ε
How can we analyze U^T S^T SU?
(Matrix Chernoff) Let X_1, …, X_k be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/k) Σ_{j∈[k]} X_j. For any ε > 0,
  Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−kε^2/(σ^2 + γε/3)}
(here |W|_2 = sup |Wx|_2 / |x|_2; since W is symmetric, |W|_2 = sup_{|x|_2=1} |x^T Wx|)
Leverage Score Sampling gives a Subspace Embedding
Let i(j) denote the index of the row of U sampled in the j-th trial
Let X_j = I_d − U_{i(j)}^T U_{i(j)} / q_{i(j)}, where U_{i(j)} is the j-th sampled row of U
The X_j are independent copies of a symmetric matrix random variable
E[X_j] = I_d − Σ_i q_i ⋅ U_i^T U_i / q_i = I_d − I_d = 0_d
|X_j|_2 ≤ |I_d|_2 + |U_{i(j)}^T U_{i(j)}|_2 / q_{i(j)} ≤ 1 + max_i |U_i|_2^2 / q_i ≤ 1 + d/β
E[X^T X] = I_d − 2 E[ U_{i(j)}^T U_{i(j)} / q_{i(j)} ] + E[ U_{i(j)}^T U_{i(j)} U_{i(j)}^T U_{i(j)} / q_{i(j)}^2 ]
  = Σ_i U_i^T U_i U_i^T U_i / q_i − I_d ≤ (d/β) Σ_i U_i^T U_i − I_d ≤ (d/β − 1) I_d,
  where A ≤ B means x^T A x ≤ x^T B x for all x
Hence, |E[X^T X]|_2 ≤ d/β − 1
Applying the Matrix Chernoff Bound
(Matrix Chernoff) Let X_1, …, X_k be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/k) Σ_{j∈[k]} X_j. For any ε > 0,
  Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−kε^2/(σ^2 + γε/3)}
(here |W|_2 = sup |Wx|_2 / |x|_2; since W is symmetric, |W|_2 = sup_{|x|_2=1} |x^T Wx|)
Here γ = 1 + d/β and σ^2 = d/β − 1
Recall X_j = I_d − U_{i(j)}^T U_{i(j)} / q_{i(j)}, and recall how we generated S = D ⋅ Ω^T: for each column j of Ω, D, independently, and with replacement, pick a row index i in [n] with probability q_i, and set Ω_{i,j} = 1 and D_{j,j} = 1/(q_i k)^{.5}
This implies W = I_d − U^T S^T SU, so
  Pr[ |I_d − U^T S^T SU|_2 > ε ] ≤ 2d ⋅ e^{−kε^2 β / Θ(d)}
Set k = Θ(d log d / (βε^2)) and we're done.
Fast Computation of Leverage Scores
Naively, we would need to compute an SVD to obtain the leverage scores
Suppose instead we compute SA for a subspace embedding S
Let SA = QR^{-1} be such that Q has orthonormal columns
Set ℓ'_i = |e_i AR|_2^2
Since AR has the same column span as A, AR = UT^{-1} for some change of basis T^{-1}
  (1 − ε)|ARx|_2 ≤ |SARx|_2 = |x|_2
  (1 + ε)|ARx|_2 ≥ |SARx|_2 = |x|_2
  (1 ± O(ε))|x|_2 = |ARx|_2 = |UT^{-1}x|_2 = |T^{-1}x|_2
So ℓ_i = |e_i ART|_2^2 = (1 ± O(ε)) |e_i AR|_2^2 = (1 ± O(ε)) ℓ'_i
But how do we compute AR? We want nnz(A) time
Fast Computation of Leverage Scores
ℓ_i = (1 ± O(ε)) ℓ'_i
It suffices to set this ε to be a constant
Setting ℓ'_i = |e_i AR|_2^2 directly takes too long
Let G be a d x O(log n) matrix of i.i.d. normal random variables
For any vector z, Pr[ |zG|_2^2 = (1 ± 1/2) |z|_2^2 ] ≥ 1 − 1/n^2
Instead set ℓ'_i = |e_i ARG|_2^2
Can compute all of these in (nnz(A) + d^2) log n time
Can solve regression in nnz(A) log n + poly(d(log n)/ε) time
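A short numpy sketch of this approximation (my own illustration, reusing the countsketch helper from earlier; the sketch size and the 4 log n columns of G are assumptions): take R from a QR factorization of SA and estimate the row norms of AR through a small Gaussian matrix G.

```python
import numpy as np
# assumes the countsketch(...) helper defined earlier

def approx_leverage_scores(A, seed=None):
    """Constant-factor approximations ℓ'_i = |e_i A R G|_2^2 to the leverage scores."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    SA = countsketch(A, 10 * d * d, rng=rng)       # subspace embedding of A (rough size)
    R = np.linalg.inv(np.linalg.qr(SA)[1])         # SA = Q R_qr; take R = R_qr^{-1}
    G = rng.standard_normal((d, int(np.ceil(4 * np.log(n)))))
    ARG = A @ (R @ G)                              # n x O(log n); AR itself is never formed
    return np.sum(ARG**2, axis=1) / G.shape[1]     # squared row norms, rescaled

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 10))
A[0] = 50.0                                        # one obviously high-leverage row
U, _, _ = np.linalg.svd(A, full_matrices=False)
exact = np.sum(U**2, axis=1)
approx = approx_leverage_scores(A, seed=1)
print(exact[0], approx[0], np.median(approx / exact))  # rough agreement is expected
```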
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Distributed Low Rank Approximation
Distributed low rank approximation
We have fast algorithms for low rank approximation, but
can they be made to work in a distributed setting?
Matrix A distributed among s servers
For t = 1, …, s, we get a customer-product matrix from
the t-th shop stored in server t. Server t’s matrix = At
Customer-product matrix A = A1 + A2 + … + As
Model is called the arbitrary partition model
More general than the row-partition model in which each
customer shops in only one shop
The Communication Model
[Figure: a Coordinator communicating with Server 1, Server 2, …, Server s]
• Each player talks only to a Coordinator via 2-way communication
• Can simulate arbitrary point-to-point communication up to a factor of 2 (and an additive O(log s) factor per message)
Communication cost of low rank approximation
Input: n x d matrix A stored on s servers
Server t has n x d matrix At
A = A1 + A2 + … + As
Assume entries of At are O(log(nd))-bit integers
Output: Each server outputs the same k-dimensional space W
C = A1 PW + A2 PW + … + As PW , where PW is the projector onto W
|A-C|_F ≤ (1+ε)|A-A_k|_F
Application: k-means clustering
Resources: Minimize total communication and computation.
Also want O(1) rounds and input sparsity time
Work on Distributed Low Rank Approximation
[FSS]: First protocol for the row-partition model.
O(sdk/ε) real numbers of communication
Don’t analyze bit complexity (can be large)
SVD Running time, see also [BKLW]
[KVW]: O(skd/ε) communication in arbitrary partition model
[BWZ]: O(skd) + poly(sk/ε) words of communication in
arbitrary partition model. Input sparsity time
Matching Ω(skd) words of communication lower bound
Variants: kernel low rank approximation [BLSWX], low rank
approximation of an implicit matrix [WZ], sparsity [BWZ]
Outline of Distributed Protocols
[FSS] protocol
[KVW] protocol
[BWZ] protocol
Constructing a Coreset [FSS]
Let A = UΣV^T be its SVD
Let m = k + k/ε
Let Σ_m agree with Σ on the first m diagonal entries, and be 0 otherwise
Claim: For all projection matrices Y = I − X onto (n-k)-dimensional subspaces,
  |Σ_m V^T Y|_F^2 + c = (1 ± ε) |AY|_F^2,
  where c = |A − A_m|_F^2 does not depend on Y
We can think of S as U_m^T, so that SA = U_m^T UΣV^T = Σ_m V^T is a sketch
Constructing a Coreset
Claim: For all projection matrices Y = I − X onto (n-k)-dimensional subspaces,
  |Σ_m V^T Y|_F^2 + c = (1 ± ε) |AY|_F^2, where c = |A − A_m|_F^2 does not depend on Y
Proof: Write A = UΣ_m V^T + U(Σ − Σ_m)V^T = A_m + (A − A_m). Then
  |AY|_F^2 = |A_m Y|_F^2 + |(A − A_m)Y|_F^2 = |Σ_m V^T Y|_F^2 + |(A − A_m)Y|_F^2
and
  |(A − A_m)Y|_F^2 = |A − A_m|_F^2 − |(A − A_m)X|_F^2 = c − |(Σ − Σ_m)V^T X|_F^2
Hence
  |Σ_m V^T Y|_F^2 + c − |AY|_F^2 = |(Σ − Σ_m)V^T X|_F^2
  ≤ |(Σ − Σ_m)V^T|_2^2 ⋅ |X|_F^2 ≤ σ_{m+1}^2 ⋅ k ≤ ε ⋅ σ_{m+1}^2 (m − k + 1)
  ≤ ε Σ_{i∈{k+1,…,m+1}} σ_i^2 ≤ ε |A − A_k|_F^2 ≤ ε |AY|_F^2,
using that AX has rank at most k, so |AY|_F = |A − AX|_F ≥ |A − A_k|_F
Unions of Coresets
Suppose we have matrices A^1, …, A^s and construct Σ_m^1 V^{T,1}, Σ_m^2 V^{T,2}, …, Σ_m^s V^{T,s} as in the previous slide, together with c^1, …, c^s
Then Σ_i ( |Σ_m^i V^{T,i} Y|_F^2 + c^i ) = (1 ± ε) |AY|_F^2, where A is the matrix formed by concatenating the rows of A^1, …, A^s
Let B be the matrix obtained by concatenating the rows of Σ_m^1 V^{T,1}, Σ_m^2 V^{T,2}, …, Σ_m^s V^{T,s}
Suppose we compute B = UΣV^T, and compute Σ_m V^T and c = |B − B_m|_F^2
Then |Σ_m V^T Y|_F^2 + c + Σ_i c^i = (1 ± ε) |BY|_F^2 + Σ_i c^i = (1 ± O(ε)) |AY|_F^2
So Σ_m V^T and the constant c + Σ_i c^i are a coreset for A
[FSS] Row-Partition Protocol
[Figure: Coordinator communicating with servers holding P^1 ∈ R^{n_1 × d}, P^2 ∈ R^{n_2 × d}, …, P^s ∈ R^{n_s × d}]
Server t sends the top k/ε + k principal components of P^t, scaled by the top k/ε + k singular values Σ^t, together with c^t
Coordinator returns the top k principal components of [Σ^1 V^1; Σ^2 V^2; …; Σ^s V^s]
Problems:
1. sdk/ε real numbers of communication
2. bit complexity can be large
3. running time for the SVDs [BKLW]
4. doesn't work in the arbitrary partition model
This is an SVD-based protocol. Maybe our random matrix techniques can improve communication just like they improved computation? The [KVW] protocol will handle problems 2, 3, and 4
[KVW] Arbitrary Partition Model Protocol
Inspired by the sketching algorithm presented earlier
Let S be one of the k/ε x n random matrices discussed; S can be generated pseudorandomly from a small seed
Coordinator sends the small seed for S to all servers
Server t computes SA^t and sends it to the Coordinator
Coordinator sends Σ_{t=1}^s SA^t = SA to all servers
There is a good k-dimensional subspace inside of SA. If we knew it, the t-th server could output the projection of A^t onto it
Problems:
  Can't output the projection of A^t onto SA, since the rank is too large
  Could communicate this projection to the coordinator, who could find a k-dimensional space, but the communication depends on n
Fix:
  Instead of projecting A onto SA, recall we can solve min_{rank-k X} |A(SA)^T X SA − A|_F^2
  Let T_1, T_2 be affine embeddings, and solve min_{rank-k X} |T_1 A(SA)^T X SA T_2 − T_1 A T_2|_F^2
  (the optimization problem is small and has a closed-form solution)
  Everyone can then compute XSA and then output k directions
[KVW] protocol
Phase 1: Learn the row space of SA
[Figure: the optimal k-dimensional space inside the row space of SA]
cost ≤ (1+ε)|A-A_k|_F
[KVW] protocol
Phase 2: Find an approximately optimal space W inside of SA
[Figure: the optimal space in SA and the approximate space W in SA]
cost ≤ (1+ε)^2 |A-A_k|_F
[BWZ] Protocol
Main Problem: communication is O(skd/ε) + poly(sk/ε)
We want O(skd) + poly(sk/ε) communication!
Idea: use projection-cost preserving sketches [CEMMP]
Let A be an n x d matrix
If S is a random k/ε^2 x n matrix, then there is a constant c ≥ 0 so that for all k-dimensional projection matrices P:
  |SA(I − P)|_F + c = (1 ± ε) |A(I − P)|_F
[BWZ] Protocol
Let S be a k/ε^2 x n projection-cost preserving sketch
Let T be a d x k/ε^2 projection-cost preserving sketch
Server t sends SA^t T to the Coordinator
Coordinator sends back SAT = Σ_t SA^t T to the servers
Each server computes the k/ε^2 x k matrix U of the top k left singular vectors of SAT
  (Intuitively, U looks like the top k left singular vectors of SA, so U^T SA looks like the top k right singular vectors of SA; the top k right singular vectors of SA work because S is a projection-cost preserving sketch, and using SAT in place of SA works because T is a projection-cost preserving sketch)
Server t sends U^T SA^t to the Coordinator
Coordinator returns the space U^T SA = Σ_t U^T SA^t to output
[BWZ] Analysis
Let W be the row span of U^T SA, and let P be the projection onto W
Want to show |A − AP|_F ≤ (1 + ε) |A − A_k|_F
Since T is a projection-cost preserving sketch,
  (*)  |SA − SAP|_F ≤ |SA − UU^T SA|_F + c_1 ≤ (1 + ε) |SA − (SA)_k|_F
Since S is a projection-cost preserving sketch, there is a scalar c > 0 so that for all k-dimensional projection matrices P,
  |SA − SAP|_F + c = (1 ± ε) |A − AP|_F
Add c to both sides of (*) to conclude |A − AP|_F ≤ (1 + ε) |A − A_k|_F
Conclusions for Distributed Low Rank Approximation
[BWZ] Optimal O(sdk) + poly(sk/ε) communication protocol for low
rank approximation in arbitrary partition model
Handle bit complexity by adding Tao/Vu noise
Input sparsity time
2 rounds, which is optimal [W]
Optimal data stream algorithms improving on [CW, L, GP]
Communication of other optimization problems?
Computing the rank of an n x n matrix over the reals
Linear Programming
Graph problems: Matching
etc.
Additional Time-Permitting Topics
Will cover some recent topics at a research-level (many
details omitted)
Weighted Low Rank Approximation
Regression and Low Rank Approximation with M-Estimator Loss Functions
Finding Heavy Hitters in a Data Stream optimally