Sketching as a Tool for Numerical Linear Algebra
Lecture 1
David Woodruff
IBM Almaden
Massive data sets
Examples
Internet traffic logs
Financial data
etc.
Algorithms
Want nearly linear time or less
Usually at the cost of a randomized approximation
Regression analysis
Regression
Statistical method to study dependencies between
variables in the presence of noise.
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Example: Ohm's law V = R ∙ I
[Figure: scatter plot of measured current (x-axis, 0–150) against voltage (y-axis, 0–250)]
Regression analysis
Linear Regression
Statistical method to study linear dependencies
between variables in the presence of noise.
Example: Ohm's law V = R ∙ I
Find linear function that best fits the data
[Figure: scatter plot of measured current (x-axis, 0–150) against voltage (y-axis, 0–250)]
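To make the example concrete, here is a minimal least-squares fit on synthetic Ohm's-law data (hypothetical values, illustrative only; numpy assumed):

import numpy as np

# Synthetic Ohm's-law data: V = R*I + noise, with a hypothetical "true" R = 1.5
rng = np.random.default_rng(0)
I = np.linspace(1, 150, 50)                  # measured currents
V = 1.5 * I + rng.normal(0, 10, I.shape)     # measured voltages, corrupted by noise

# Least-squares fit of a line through the origin: minimize sum_i (V_i - R*I_i)^2
A = I.reshape(-1, 1)                         # design matrix with a single predictor
R_hat, *_ = np.linalg.lstsq(A, V, rcond=None)
print("estimated resistance:", R_hat[0])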
Regression analysis
Linear Regression
Statistical method to study linear dependencies between
variables in the presence of noise.
Standard Setting
One measured variable b
A set of predictor variables a_1, …, a_d
Assumption:
b = x_0 + a_1 x_1 + … + a_d x_d + e
e is assumed to be noise and the xi are model
parameters we want to learn
Can assume x0 = 0
Now consider n observations of b
Regression analysis
Matrix form
Input: an n x d matrix A and a vector b = (b_1, …, b_n)
n is the number of observations; d is the number of
predictor variables
Output: x* so that Ax* and b are close
Consider the over-constrained case, when n >> d
Can assume that A has full column rank
Regression analysis
Least Squares Method
Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i – <A_{i*}, x>)^2
Ai* is i-th row of A
Certain desirable statistical properties
Regression analysis
Geometry of regression
We want to find an x that minimizes |Ax-b|2
The product Ax can be written as
A_{*1} x_1 + A_{*2} x_2 + ... + A_{*d} x_d
where A_{*i} is the i-th column of A
This is a linear d-dimensional subspace
The problem is equivalent to computing the point of the
column space of A nearest to b in l2-norm
Regression analysis
Solving least squares regression via the normal equations
How to find the solution x to min_x |Ax-b|_2 ?
Equivalent problem: min_x |Ax-b|_2^2
Write b = Ax' + b', where b' is orthogonal to the columns of A
Cost is |A(x-x')|_2^2 + |b'|_2^2 by the Pythagorean theorem
x is optimal if and only if A^T(Ax-b) = A^T(Ax-Ax') = 0
Normal Equation: A^T A x = A^T b for any optimal x
x = (A^T A)^{-1} A^T b
If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum-norm solution x
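As a quick illustration (a minimal sketch on synthetic data; numpy assumed), the normal-equations solution, a least-squares solver, and the pseudoinverse agree when A has full column rank:

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
A = rng.standard_normal((n, d))              # full column rank with probability 1
b = rng.standard_normal(n)

# Normal equations: x = (A^T A)^{-1} A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Equivalent, and numerically preferable in practice
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Moore-Penrose pseudoinverse: gives the minimum-norm least-squares solution,
# and still works when the columns are linearly dependent
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(x_normal, x_lstsq), np.allclose(x_lstsq, x_pinv))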
Moore-Penrose Pseudoinverse
• Any optimal solution x has the form A^- b + (I − V'V'^T) z, where V' corresponds to the rows i of V^T for which Σ_{i,i} > 0 (writing the SVD A = UΣV^T)
• Why?
• Because A (I − V'V'^T) z = 0, so A^- b + (I − V'V'^T) z is a solution. This is a (d − rank(A))-dimensional affine space, so it spans all optimal solutions
• Since A^- b is in the column span of V', by the Pythagorean theorem,
  |A^- b + (I − V'V'^T) z|_2^2 = |A^- b|_2^2 + |(I − V'V'^T) z|_2^2 ≥ |A^- b|_2^2
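A small numerical check of this characterization (synthetic rank-deficient A; numpy assumed): every x of the form A^- b + (I − V'V'^T) z attains the same residual, and A^- b has the smallest norm among them.

import numpy as np

rng = np.random.default_rng(2)
n, d, r = 50, 6, 4                       # A will have rank r < d
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = rng.standard_normal(n)

x_min = np.linalg.pinv(A) @ b            # x = A^- b, the minimum-norm solution

# V' = right singular vectors with nonzero singular value
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vp = Vt[s > 1e-10].T                     # d x rank(A)
P = np.eye(d) - Vp @ Vp.T                # projector onto the null space of A

z = rng.standard_normal(d)
x_other = x_min + P @ z                  # another optimal solution

# Same residual, but no smaller norm (Pythagorean theorem)
print(np.linalg.norm(A @ x_min - b), np.linalg.norm(A @ x_other - b))
print(np.linalg.norm(x_min) <= np.linalg.norm(x_other))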
Time Complexity
Solving least squares regression via the normal equations
Need to compute x = A^- b
Naively this takes O(nd^2) time
Can do O(nd^{1.376}) using fast matrix multiplication
But we want much better running time!
Sketching to solve least squares regression
How to find an approximate solution x to min_x |Ax-b|_2 ?
Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2
with high probability
Draw S from a k x n random family of matrices, for a
value k << n
Compute S*A and S*b
Output the solution x' to min_x |(SA)x-(Sb)|_2
x' = (SA)^- Sb
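A minimal sketch-and-solve illustration (numpy assumed; an i.i.d. Gaussian S, one family that works and is discussed on the next slides):

import numpy as np

rng = np.random.default_rng(3)
n, d, eps = 10000, 20, 0.5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Gaussian sketching matrix with k = O(d/eps^2) rows
k = int(4 * d / eps**2)
S = rng.standard_normal((k, n)) / np.sqrt(k)

# Solve the much smaller k x d regression problem instead of the n x d one
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

cost = lambda x: np.linalg.norm(A @ x - b)
print(cost(x_sketch) / cost(x_opt))      # typically at most about 1 + eps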
How to choose the right sketching matrix S?
Recall: output the solution x' to min_x |(SA)x-(Sb)|_2
Lots of matrices work
S is a d/ε^2 x n matrix of i.i.d. Normal random variables
To see why this works, we
introduce the notion of a
subspace embedding
Subspace Embeddings
• Let k = O(d/ε^2)
• Let S be a k x n matrix of i.i.d. normal N(0,1/k) random variables
• For any fixed d-dimensional subspace, i.e., the column space of an n x d matrix A
  – W.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
• Entire column space of A is preserved
Why is this true?
Subspace Embeddings – A Proof
• Want to show |SAx|2 = (1±ε)|Ax|2 for all x
• Can assume columns of A are orthonormal
(since we prove this for all x)
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k)
random variables
– First property: for two independent random variables X and Y, with X drawn from N(0,a^2) and Y drawn from N(0,b^2), we have X+Y is drawn from N(0, a^2 + b^2)
• Probability density function f_Z of Z = X+Y is the convolution of the probability density functions f_X and f_Y:
  f_Z(z) = ∫ f_Y(z-x) f_X(x) dx
• f_X(x) = (1/(a (2π)^.5)) e^{-x^2/(2a^2)},  f_Y(y) = (1/(b (2π)^.5)) e^{-y^2/(2b^2)}
• Calculation: e^{-(z-x)^2/(2b^2)} · e^{-x^2/(2a^2)} = e^{-z^2/(2(a^2+b^2))} · e^{-(x - a^2 z/(a^2+b^2))^2 / (2 (ab)^2/(a^2+b^2))}
• Density of Gaussian distribution:
  f_Z(z) = (1/(2π a b)) ∫ e^{-(z-x)^2/(2b^2)} e^{-x^2/(2a^2)} dx
        = (1/((2π)^.5 (a^2+b^2)^.5)) e^{-z^2/(2(a^2+b^2))} · ∫ ((a^2+b^2)^.5/((2π)^.5 a b)) e^{-(x - a^2 z/(a^2+b^2))^2/(2 (ab)^2/(a^2+b^2))} dx
        = (1/((2π)^.5 (a^2+b^2)^.5)) e^{-z^2/(2(a^2+b^2))},
  since the remaining integrand is a Gaussian density and integrates to 1. This is the density of N(0, a^2+b^2).
Rotational Invariance
• Second property: if u, v are vectors with <u, v> = 0,
then <g,u> and <g,v> are independent, where g is a
vector of i.i.d. N(0,1/k) random variables
• Why?
• If g is an n-dimensional vector of i.i.d. N(0,1) random variables, and R is a fixed matrix, then the probability density function of Rg is
  f(x) = (1/( det(RR^T)^.5 (2π)^{d/2} )) · e^{ -x^T (RR^T)^{-1} x / 2 }
  – RR^T is the covariance matrix
– For a rotation matrix R, the distribution of Rg
and of g are the same
Orthogonal Implies Independent
• Want to show: if u, v are vectors with <u, v> = 0, then
<g,u> and <g,v> are independent, where g is a vector of
i.i.d. N(0,1/k) random variables
• Choose a rotation R which sends u to αe1 , and sends v
to βe2
• < g, u > = < Rg, Ru > = < h, α e_1 > = α h_1
• < g, v > = < Rg, Rv > = < h, β e_2 > = β h_2
where h = Rg is a vector of i.i.d. N(0, 1/k) random variables
• Then h1 and h2 are independent by definition
Where were we?
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random
variables
• Proof: The rows of SA are independent
– Each row of SA is: < g, A_1 >, < g, A_2 >, …, < g, A_d >, where g is the corresponding row of S
– First property implies the entries in each row are
N(0,1/k) since the columns Ai have unit norm
– Since the columns Ai are orthonormal, the entries in a
row are independent by our second property
Back to Subspace Embeddings
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume columns of A are orthonormal
• Can also assume x is a unit vector
• SA is a k x d matrix of i.i.d. N(0,1/k) random variables
• Consider any fixed unit vector x ∈ R^d
• |SAx|_2^2 = Σ_{i∈[k]} < g_i, x >^2, where g_i is the i-th row of SA
• Each < g_i, x >^2 is distributed as the square of an N(0, 1/k) random variable
• E[< g_i, x >^2] = 1/k, and so E[ |SAx|_2^2 ] = 1
How concentrated is SAx 22 about its expectation?
Johnson-Lindenstrauss Theorem
• Suppose h_1, …, h_k are i.i.d. N(0,1) random variables
• Then G = Σ_i h_i^2 is a χ^2 random variable
• Apply known tail bounds to G:
  – (Upper) Pr[ G ≥ k + 2(kx)^.5 + 2x ] ≤ e^{-x}
  – (Lower) Pr[ G ≤ k − 2(kx)^.5 ] ≤ e^{-x}
• If x = ε^2 k/16, then Pr[ G ∈ k(1 ± ε) ] ≥ 1 − 2e^{-ε^2 k/16}
• If k = Θ(ε^{-2} log(1/δ)), this probability is 1-δ
• Since k · |SAx|_2^2 is distributed as G, with our k = Θ(d/ε^2) we get Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{-Θ(d)}
This only holds for a fixed x, how to argue for all x?
Net for Sphere
• Consider the sphere S^{d-1}
• Subset N is a γ-net if for all x ∈ S^{d-1}, there is a y ∈ N such that |x − y|_2 ≤ γ
• Greedy construction of N:
  – While there is a point x ∈ S^{d-1} of distance larger than γ from every point in N, include x in N
• The sphere of radius γ/2 around every point in N is contained in the sphere of radius 1+γ around 0^d
• Further, all such spheres are disjoint
• The ratio of the volume of a d-dimensional sphere of radius 1+γ to that of a d-dimensional sphere of radius γ/2 is (1+γ)^d/(γ/2)^d, so |N| ≤ (1+γ)^d/(γ/2)^d
Net for Subspace
• Let M = {Ax | x in N}, so |M| ≤ (1+γ)^d/(γ/2)^d
• Claim: For every x in S^{d-1}, there is a y in M for which |Ax − y|_2 ≤ γ
• Proof: Let x' in N be such that |x − x'|_2 ≤ γ.
  Then |Ax − Ax'|_2 = |x − x'|_2 ≤ γ, using that the columns of A are orthonormal. Set y = Ax'
Net Argument
• For a fixed unit x, Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{-Θ(d)}
• For a fixed pair of unit x, x': |SAx|_2^2, |SAx'|_2^2, |SA(x − x')|_2^2 are all (1 ± ε) times |Ax|_2^2, |Ax'|_2^2, |A(x − x')|_2^2, with probability 1 − 2^{-Θ(d)}
• |SA(x − x')|_2^2 = |SAx|_2^2 + |SAx'|_2^2 − 2 < SAx, SAx' >
• |A(x − x')|_2^2 = |Ax|_2^2 + |Ax'|_2^2 − 2 < Ax, Ax' >
  – So Pr[ < Ax, Ax' > = < SAx, SAx' > ± O(ε) ] = 1 − 2^{-Θ(d)}
• Choose a ½-net M = {Ax | x in N} of size 5^d
• By a union bound, for all pairs y, y' in M,
  < y, y' > = < Sy, Sy' > ± O(ε)
• Condition on this event
• By linearity, if this holds for y, y' in M, then for αy, βy' we have
  < αy, βy' > = αβ < Sy, Sy' > ± O(ε αβ)
Finishing the Net Argument
• Let y = Ax for an arbitrary x ∈ S^{d-1}
• Let y_1 ∈ M be such that |y − y_1|_2 ≤ γ
• Let α be such that |α(y − y_1)|_2 = 1
  – α ≥ 1/γ (could be infinite)
• Let y_2' ∈ M be such that |α(y − y_1) − y_2'|_2 ≤ γ
• Set y_2 = y_2'/α. Then |y − y_1 − y_2|_2 ≤ γ/α ≤ γ^2
• Repeat, obtaining y_1, y_2, y_3, … such that for all integers i,
  |y − y_1 − y_2 − … − y_i|_2 ≤ γ^i
• Implies |y_i|_2 ≤ γ^{i-1} + γ^i ≤ 2γ^{i-1}
Finishing the Net Argument
• Have y_1, y_2, y_3, … such that y = Σ_i y_i and |y_i|_2 ≤ 2γ^{i-1}
• |Sy|_2^2 = |S Σ_i y_i|_2^2
    = Σ_i |Sy_i|_2^2 + 2 Σ_{i<j} < Sy_i, Sy_j >
    = Σ_i |y_i|_2^2 + 2 Σ_{i<j} < y_i, y_j > ± O(ε) Σ_{i,j} |y_i|_2 |y_j|_2
    = |Σ_i y_i|_2^2 ± O(ε)
    = |y|_2^2 ± O(ε)
    = 1 ± O(ε)
  (using Σ_{i,j} |y_i|_2 |y_j|_2 ≤ (Σ_i 2γ^{i-1})^2 = O(1) for γ = ½)
• Since this held for an arbitrary y = Ax for unit x, by linearity it follows that for all x, |SAx|_2 = (1±ε)|Ax|_2
Back to Regression
• We showed that S is a subspace
embedding, that is, simultaneously for all x,
|SAx|2 = (1±ε)|Ax|2
What does this have to do with regression?
Subspace Embeddings for Regression
• Want x so that |Ax-b|_2 ≤ (1+ε) min_y |Ay-b|_2
• Consider subspace L spanned by columns of A
together with b
• Then for all y in L, |Sy|2 = (1± ε) |y|2
• Hence, |S(Ax-b)|2 = (1± ε) |Ax-b|2 for all x
• Solve argminy |(SA)y – (Sb)|2
• Given SA, Sb, can solve in poly(d/ε) time
Only problem is computing SA takes O(nd^2) time
How to choose the right sketching matrix S? [S]
S is a Subsampled Randomized Hadamard Transform
S = P*H*D
D is a diagonal matrix with +1, -1 on diagonals
H is the Hadamard transform
P just chooses a random (small) subset of rows of H*D
S*A can be computed in O(nd log n) time
Why does it work?
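A rough sketch of applying S = P*H*D (numpy assumed; the fwht and srht_sketch helpers below are illustrative, not from the lecture): rows are padded to a power of 2, H is applied as a fast Walsh-Hadamard transform costing O(n log n) per column of A, and P samples s rows with replacement; the sqrt(m/s) rescaling of the sampled rows is an assumption chosen so that E[(SA)^T SA] = A^T A.

import numpy as np

def fwht(x):
    # Illustrative helper: fast Walsh-Hadamard transform along axis 0
    # (length must be a power of 2), normalized to be orthonormal.
    x = x.copy()
    m = x.shape[0]
    h = 1
    while h < m:
        for i in range(0, m, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(m)

def srht_sketch(A, s, rng):
    # Illustrative helper: subsampled randomized Hadamard transform, s x d sketch of A.
    n, d = A.shape
    m = 1 << (n - 1).bit_length()            # pad the rows up to a power of 2
    Ap = np.vstack([A, np.zeros((m - n, d))])
    D = rng.choice([-1.0, 1.0], size=m)      # random +/-1 signs on the diagonal of D
    HDA = fwht(D[:, None] * Ap)              # H * D * A in O(m d log m) time
    rows = rng.integers(0, m, size=s)        # P: sample s rows with replacement
    return np.sqrt(m / s) * HDA[rows]        # rescaling assumed so the sketch is unbiased

# Tiny usage check: the sketch roughly preserves the column norms of A
rng = np.random.default_rng(4)
A = np.linalg.qr(rng.standard_normal((4096, 8)))[0]   # orthonormal columns
SA = srht_sketch(A, s=512, rng=rng)
print(np.round(np.linalg.norm(SA, axis=0), 2))        # all close to 1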
Why does this work?
We can again assume columns of A are orthonormal
It suffices to show |SAx|_2^2 = |PHDAx|_2^2 = 1 ± ε for all unit x
HD is a rotation matrix, so |HDAx|_2^2 = |Ax|_2^2 = 1 for any unit x
Notation: let y = Ax
Flattening Lemma: For any fixed y,
  Pr[ |HDy|_∞ ≥ C log^.5(nd/δ) / n^.5 ] ≤ δ/(2d)
Proving the Flattening Lemma
Flattening Lemma: Pr[ |HDy|_∞ ≥ C log^.5(nd/δ) / n^.5 ] ≤ δ/(2d)
Let C > 0 be a constant. We will show for a fixed i in [n],
  Pr[ |(HDy)_i| ≥ C log^.5(nd/δ) / n^.5 ] ≤ δ/(2nd)
If we show this, we can apply a union bound over all i
(HDy)_i = Σ_j H_{i,j} D_{j,j} y_j
(Azuma-Hoeffding) Pr[ |Σ_j Z_j| > t ] ≤ 2e^{-t^2/(2 Σ_j β_j^2)}, where |Z_j| ≤ β_j with probability 1
Z_j = H_{i,j} D_{j,j} y_j has 0 mean
|Z_j| ≤ |y_j| / n^.5 = β_j with probability 1
Σ_j β_j^2 = 1/n (for unit y)
Pr[ |Σ_j Z_j| > C log^.5(nd/δ) / n^.5 ] ≤ 2e^{-C^2 log(nd/δ)/2} ≤ δ/(2nd)
Consequence of the Flattening Lemma
Recall columns of A are orthonormal
HDA has orthonormal columns
Flattening Lemma implies |HDAe_i|_∞ ≤ C log^.5(nd/δ) / n^.5 with probability 1 − δ/(2d), for a fixed i ∈ [d]
With probability 1 − δ/2:
  |e_j^T HDA e_i| ≤ C log^.5(nd/δ) / n^.5 for all i, j
  |e_j^T HDA|_2 ≤ C d^.5 log^.5(nd/δ) / n^.5 for all j
(Can be optimized further)
Matrix Chernoff Bound
Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0,
  Pr[ |W|_2 > ε ] ≤ 2d · e^{-sε^2/(σ^2 + γε/3)}
(here |W|_2 = sup_x |Wx|_2 / |x|_2)
Let V = HDA, and recall V has orthonormal columns
Suppose P in the S = PHD definition samples s rows uniformly with replacement. If row i is sampled in the j-th sample, then P_{j,i} = (n/s)^.5, and the rest of row j of P is 0
Let Y_i be the i-th sampled row of V = HDA
Let X_i = I_d − n · Y_i^T Y_i
E[X_i] = I_d − n · (1/n) Σ_j V_j^T V_j = I_d − V^T V = 0_d, where V_j is the j-th row of V
|X_i|_2 ≤ |I_d|_2 + n · max_j |e_j^T HDA|_2^2 ≤ 1 + n · C^2 d log(nd/δ) / n = Θ(d log(nd/δ))
Matrix Chernoff Bound
Recall: let Y_i be the i-th sampled row of V = HDA, and let X_i = I_d − n · Y_i^T Y_i
E[X^T X] = I_d − 2n E[Y_i^T Y_i] + n^2 E[Y_i^T Y_i Y_i^T Y_i]
  = I_d − 2 I_d + n^2 Σ_i (1/n) V_i^T V_i V_i^T V_i
  = −I_d + n Σ_i |V_i|_2^2 V_i^T V_i
  ≼ ( n · max_i |V_i|_2^2 − 1 ) I_d
  = ( C^2 d log(nd/δ) − 1 ) I_d
Hence, |E[X^T X]|_2 = O(d log(nd/δ))
Recall: (Matrix Chernoff) Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0, Pr[ |W|_2 > ε ] ≤ 2d · e^{-sε^2/(σ^2 + γε/3)}
Pr[ |I_d − (PHDA)^T (PHDA)|_2 > ε ] ≤ 2d · e^{-sε^2 / Θ(d log(nd/δ))}
Set s = Θ( d log(nd/δ) · log(d/δ) / ε^2 ) to make this probability less than δ
SRHT Wrapup
Have shown |I_d − (PHDA)^T (PHDA)|_2 < ε using the Matrix Chernoff Bound and with s = d log(nd/δ) log(d/δ) / ε^2 samples
Implies for every unit vector x,
  | 1 − |PHDAx|_2^2 | = | x^T x − x^T (PHDA)^T (PHDA) x | < ε,
so |PHDAx|_2^2 ∈ 1 ± ε for all unit vectors x
Considering the column span of A adjoined with b, we can again solve the regression problem
The time for regression is now only O(nd log n) + poly(d log(n) / ε). Nearly optimal in matrix dimensions (n >> d)
Faster Subspace Embeddings S [CW,MM,NN]
CountSketch matrix
Define a k x n matrix S, for k = O(d^2/ε^2)
S is really sparse: a single randomly chosen non-zero entry per column, equal to ±1
Example (k = 4, n = 8):
[  0  0  1  0  0  1  0  0
   1  0  0  0  0  0  0  0
   0  0  0 -1  1  0 -1  0
   0 -1  0  0  0  0  0  1 ]
nnz(A) is number of non-zero entries of A
Can compute S ⋅ A in nnz(A) time!
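A minimal CountSketch sketch (numpy assumed; the countsketch helper below is illustrative): column i of S has a single ±1 in a uniformly random row, so applying S amounts to scatter-adding the signed rows of A into k buckets in O(nnz(A)) time.

import numpy as np

def countsketch(A, k, rng):
    # Illustrative helper: apply a CountSketch matrix S (k rows; one random +/-1
    # per column of S) to A. Each row of A is added, with a sign, to one row of SA.
    n, d = A.shape
    h = rng.integers(0, k, size=n)          # hash bucket for each of the n coordinates
    s = rng.choice([-1.0, 1.0], size=n)     # random sign for each coordinate
    SA = np.zeros((k, d))
    np.add.at(SA, h, s[:, None] * A)        # scatter-add the signed rows of A
    return SA

# Usage: k = O(d^2 / eps^2) rows suffice for a subspace embedding
rng = np.random.default_rng(5)
n, d, eps = 30000, 10, 0.5
A = rng.standard_normal((n, d))
SA = countsketch(A, k=int(2 * d**2 / eps**2), rng=rng)
x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # close to 1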
Simple Proof [Nguyen]
Can assume columns of A are orthonormal
Suffices to show |SAx|2 = 1 ± ε for all unit x
For regression, apply S to [A, b]
SA is a 2d^2/ε^2 x d matrix
Suffices to show |A^T S^T S A – I|_2 ≤ |A^T S^T S A – I|_F ≤ ε
Matrix product result shown below:
  |C S^T S D – CD|_F^2 ≤ [2/(# rows of S)] · |C|_F^2 |D|_F^2
Set C = A^T and D = A. Then |A|_F^2 = d and (# rows of S) = 2d^2/ε^2, giving |A^T S^T S A – I|_F^2 ≤ ε^2 (using A^T A = I)
Matrix Product Result
Show: |C S^T S D – CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 |D|_F^2
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ · δ
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T S B − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
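A quick numerical illustration of the approximate matrix product guarantee (numpy assumed; a dense Gaussian S, which also satisfies the JL moment property, is used here for brevity rather than CountSketch):

import numpy as np

rng = np.random.default_rng(6)
n, d, e, k = 5000, 8, 12, 1000
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, e))

S = rng.standard_normal((k, n)) / np.sqrt(k)     # k x n Gaussian sketch

err = np.linalg.norm((S @ A).T @ (S @ B) - A.T @ B, ord='fro')
bound = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
print(err / bound)                               # small, on the order of 1/sqrt(k)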
From Vectors to Matrices
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T S B − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
Proof: For a random scalar X, let |X|_p = ( E|X|^p )^{1/p}
Sometimes consider X = |T|_F for a random matrix T, i.e., | |T|_F |_p = ( E |T|_F^p )^{1/p}
Can show |·|_p is a norm
Minkowski's Inequality: |X + Y|_p ≤ |X|_p + |Y|_p
For unit vectors x, y, need to bound | <Sx, Sy> − <x, y> |_ℓ
From Vectors to Matrices
For unit vectors x, y,
  | <Sx, Sy> − <x, y> |_ℓ
    = (1/2) | ( |Sx|_2^2 − 1 ) + ( |Sy|_2^2 − 1 ) − ( |S(x−y)|_2^2 − |x−y|_2^2 ) |_ℓ
    ≤ (1/2) ( | |Sx|_2^2 − 1 |_ℓ + | |Sy|_2^2 − 1 |_ℓ + | |S(x−y)|_2^2 − |x−y|_2^2 |_ℓ )
    ≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2^2 · ε δ^{1/ℓ} )
    ≤ 3 ε δ^{1/ℓ}   (using |x−y|_2^2 ≤ 4 for unit x, y)
By linearity, for arbitrary x, y,
  | <Sx, Sy> − <x, y> |_ℓ / ( |x|_2 |y|_2 ) ≤ 3 ε δ^{1/ℓ}
Suppose A has d columns and B has e columns. Let the columns of A be A_1, …, A_d and the columns of B be B_1, …, B_e
Define X_{i,j} = (1/(|A_i|_2 |B_j|_2)) · ( <SA_i, SB_j> − <A_i, B_j> )
Then |A^T S^T S B − A^T B|_F^2 = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
From Vectors to Matrices
Have shown: for arbitrary x, y,  | <Sx, Sy> − <x, y> |_ℓ / ( |x|_2 |y|_2 ) ≤ 3 ε δ^{1/ℓ}
For X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ( <SA_i, SB_j> − <A_i, B_j> ):
  |A^T S^T S B − A^T B|_F^2 = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
Then
  | |A^T S^T S B − A^T B|_F^2 |_{ℓ/2} = | Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2 |_{ℓ/2}
    ≤ Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j}^2 |_{ℓ/2}
    = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j} |_ℓ^2
    ≤ ( 3 ε δ^{1/ℓ} )^2 Σ_{i,j} |A_i|_2^2 |B_j|_2^2
    = ( 3 ε δ^{1/ℓ} )^2 |A|_F^2 |B|_F^2
Since E[ |A^T S^T S B − A^T B|_F^ℓ ] = | |A^T S^T S B − A^T B|_F^2 |_{ℓ/2}^{ℓ/2}, by Markov's inequality,
  Pr[ |A^T S^T S B − A^T B|_F > 3ε |A|_F |B|_F ] ≤ ( 1/(3ε |A|_F |B|_F) )^ℓ · E[ |A^T S^T S B − A^T B|_F^ℓ ] ≤ δ
Result for Vectors
Show: |C S^T S D – CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 |D|_F^2
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ · δ
(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[ |A^T S^T S B − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
Just need to show that the CountSketch matrix S satisfies the JL property and bound the number k of rows
CountSketch Satisfies the JL Property
(JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ · δ