SDM06 TUTORIAL: Randomized Algorithms
for Matrices and Massive Data Sets
Petros Drineas
Michael W. Mahoney
CS - RPI
Yahoo! Research
Tutorial given at SIAM Data Mining Meeting April 22, 2006
(Most recent copy) available at:
http://www.cs.yale.edu/homes/mmahoney
http://www.cs.rpi.edu/~drinep
Randomized Linear Algebra Algorithms
Goal: To develop and analyze (fast) Monte Carlo algorithms for performing
useful computations on large (and later not so large!) matrices and tensors.
• Matrix Multiplication
• Computation of the Singular Value Decomposition
• Computation of the CUR Decomposition
• Testing Feasibility of Linear Programs
• Least Squares Approximation
• Tensor computations: SVD generalizations
• Tensor computations: CUR generalization
Such computations generally require time which is superlinear in the number
of nonzero elements of the matrix/tensor, e.g., O(n³) for n x n matrices.
2
Example: the CUR decomposition
Algorithmic Motivation: To speed up computations in applications where extremely large data
sets are modeled by matrices and, e.g., O(n²) space and O(n³) time is not an option.
Structural Motivation: To reveal novel structural properties of the datasets, given sufficient
computational time, that are useful in applications.
[Figure: A ≈ C·U·R, with C a matrix of O(1) columns of A, R a matrix of O(1) rows of A, and U a carefully chosen small matrix.]
Why? Given a sample consisting of a few
columns (C) and a few rows (R) of A, we can
compute U and “reconstruct” A as CUR. If
the sampling probabilities are not “too bad”,
we get provably good accuracy.
Goal: make ||A-CUR|| small.
Why? (Algorithmic) After making two
passes over A, we can compute provably
good C, U, and R and store them
(“sketch”) instead of A: O(m+n) vs. O(n²)
space.
Why? (Structural) Given sufficient time,
we can find C, U and R such that A – CUR
is “very” small. This might lead to better
understanding of the data.
3
Applications of such algorithms
Matrices arise, e.g., since m objects (documents, genomes, images, web
pages), each with n features, may be represented by an m x n matrix A.
• Covariance Matrices
• Latent Semantic Indexing
• DNA Microarray Data
• Eigenfaces and Image Recognition
• Similarity Queries
• Matrix Reconstruction
• LOTS of other data applications!!
More generally,
• Linear and Nonlinear Programming Applications
• Design of Approximation Algorithms
• Statistical Learning Theory Applications
4
Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method
5
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
6
The Pass Efficient Model
Motivation: Amount of disk/tape space has increased enormously; RAM and
computing speeds have increased less rapidly.
• Can store large amounts of data, but
• Cannot process these data with traditional algorithms.
In the Pass-Efficient Model:
• Data are assumed to be stored on disk/tape.
• Algorithm has access to the data via a pass over the data.
• Algorithm is allowed additional RAM space and additional computation time.
An algorithm is pass-efficient if it uses a small constant number of passes and
sublinear additional time and space to compute a description of the solution.
Note: If data are an m x n matrix A, then algorithms which require additional
time and space that is O(m+n) or O(1) are pass-efficient.
7
Random Sampling
Random Sampling and Randomized Algorithms:
• Better complexity properties (randomization as a resource).
• Simpler algorithms and/or analysis (maybe de-randomize later).
Uniform Sampling:
• Typically things work in expectation, but poor variance properties.
Non-uniform Sampling:
• With “good” probabilities, can make the variance small.
Randomized Algorithms for Linear Algebra problems:
A “sketch” consisting of a small number of judiciously chosen and
randomly sampled rows and columns (or elements) is sufficient for
provably rapid and efficient approximation of many matrix operations.
8
Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method
9
Approximating Matrix Multiplication …
(D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’06)
Problem Statement
Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B,
OR, equivalently,
Approximate the sum of n rank-one matrices.
A·B = Σi=1,…,n A(i) B(i), where A(i) is the i-th column of A and B(i) is the i-th row of B;
each term in the summation is a rank-one matrix.
10
…by random sampling
A·B ≈ Σt=1,…,s A(jt) B(jt) / (s·pjt), where A(jt) is the jt-th column of A and B(jt) is the jt-th row of B.
Algorithm
1. Fix a set of probabilities pi, i=1,…,n, summing up to 1.
2. For t=1 up to s, set jt = i, where Pr(jt = i) = pi
   (pick s terms of the sum, with replacement, with respect to the pi).
3. Approximate AB by the sum of the s terms, after scaling.
11
Random sampling (cont’d)
Keep only the terms j1, j2, …, js of the sum, i.e., the corresponding columns of A and rows of B.
12
The algorithm (matrix notation)
• Create C and R by performing s i.i.d. trials, with replacement.
• For t=1 up to s, pick a column A(jt) and a row B(jt) with probability Pr(jt = i) = pi.
• Include A(jt)/(s·pjt)^(1/2) as a column of C, and B(jt)/(s·pjt)^(1/2) as a row of R.
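As a concrete illustration, here is a minimal NumPy sketch of the sampling procedure above. The function name is ours, and the probabilities used (proportional to |A(i)|·|B(i)|) are one "good" nonuniform choice, assumed here for illustration; any probabilities summing to 1 fit the same template.

```python
import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by s sampled, rescaled rank-one terms."""
    n = A.shape[1]
    # Probabilities proportional to |A^(i)| * |B_(i)| (a nonuniform choice assumed here).
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)   # s i.i.d. trials, with replacement
    # Keep column A^(jt) and row B_(jt), each rescaled by 1/sqrt(s * p_jt).
    C = A[:, idx] / np.sqrt(s * p[idx])
    R = B[idx, :] / np.sqrt(s * p[idx])[:, None]
    return C @ R                                     # approximates A @ B
```

With s much smaller than n, forming C @ R costs O(smp) rather than O(nmp) for dense matrices, matching the running time quoted in the lemmas that follow.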
13
Simple Lemmas
The input matrices are given in “sparse unordered representation”; e.g.,
their non-zero entries are presented as triples (i, j, Aij) in any order.
• The expectation of CR (element-wise) is AB.
• Our nonuniform sampling minimizes the variance of the estimator.
• It is easy to implement the sampling in two passes.
• If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time.
• It requires O(sm+sp) RAM space.
• It does not tamper with the sparsity of the matrices.
14
Error Bounds
For the above algorithm,
For the above algorithm, with probability at least 1-δ,
• This is a relative error bound if ||AB||F = Ω(||A||F ||B||F), i.e., if there is “not much
cancellation” in the multiplication.
• We removed the expectation (by applying a martingale argument) and so have an
extra log(1/δ) factor.
• Markov’s inequality would also remove the expectation, introducing an extra 1/δ
factor.
15
Special case: B = AT
If B = AT, then the sampling probabilities are pi = |A(i)|² / ||A||F².
Also, R = CT, and the error bounds are
16
Special case: B = AT (cont’d)
(Rudelson & Vershynin ’04, Vershynin ’04)
Improvement for the spectral norm bound for the special case B = AT.
• Uses a result of M. Rudelson for random vectors in isotropic position.
• Tight concentration results can be proven using Talagrand’s theory.
• The sampling procedure is slightly different; s columns/rows are kept in
expectation, i.e., column i is picked with probability:
17
Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method
18
Singular Value Decomposition (SVD)
U (V): orthogonal matrix containing the left (right) singular vectors of A.
S: diagonal matrix containing the singular values of A.
1. Exact computation of the SVD takes O(min{mn², m²n}) time.
2. The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).
19
Rank k approximations (Ak)
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.
Sk: diagonal matrix containing the top k singular values of A.
Also, Ak=UkUkTA.
Ak is a matrix of rank k such that ||A-Ak||2,F is minimized over all rank k
matrices!
This property is very useful in the context of Principal Component Analysis.
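For reference, a one-function NumPy sketch of Ak (an illustration only; it computes the full SVD, i.e., the O(min{mn², m²n}) route mentioned above, and the function name is ours):

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation A_k = U_k S_k V_k^T, via the exact SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```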
20
Approximating SVD in O(n) time
(D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D. Kannan, & M. TR ’04, SICOMP ’06)
Given: m x n matrix A
• Sample c columns from A and rescale to form the m x c matrix C.
• Compute the m x k matrix Hk of the top k left singular vectors of C.
Structural Theorem: For any probabilities and number of columns:
||A-HkHkTA||2,F2 ≤ ||A-Ak||2,F2 + 2√k||AAT-CCT||F
Algorithmic Theorem: If pi = |A(i)|²/||A||F² and c = Ω(k/ε²), then:
||A-HkHkTA||2,F² ≤ ||A-Ak||2,F² + ε||A||F².
Proof: via matrix multiplication theorem and matrix perturbation theory.
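A minimal NumPy sketch of the column-sampling approximation just described; the function name is ours, and c and k are chosen by the caller:

```python
import numpy as np

def sampled_left_singular_vectors(A, c, k, rng=np.random.default_rng(0)):
    """Sample c columns of A with p_i = |A^(i)|^2 / ||A||_F^2, rescale them to
    form C, and return H_k, the top k left singular vectors of C."""
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, "fro") ** 2
    p = p / p.sum()                      # guard against floating-point drift
    idx = rng.choice(n, size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])  # m x c, rescaled sampled columns
    Hk = np.linalg.svd(C, full_matrices=False)[0][:, :k]
    return Hk

# A is then approximated by Hk @ (Hk.T @ A).
```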
21
Example of randomized SVD
[Figure: the original 512 x 512 image matrix A (left) and the matrix C of sampled, rescaled columns (right).]
Compute the top k left singular vectors of the matrix C and store them
in the 512-by-k matrix Hk.
22
Example of randomized SVD (cont’d)
[Figure: the original image matrix A (left) and its approximation HkHkTA (right).]
A and HkHkTA are close.
23
Element-wise sampling
(Achlioptas & McSherry, STOC ’01, JACM ’05)
The Algorithm in 2 lines:
• To approximate a matrix A, keep a few elements of the matrix (instead of rows or
columns) and zero out the remaining elements.
• Compute a rank k approximation to this sparse matrix (using Lanczos methods).
More details:
Let pij ∈ [0,1] for all i,j. Create the matrix S from A such that Sij = Aij/pij with probability pij, and Sij = 0 otherwise.
||A-S||2 is bounded ⇒ (i) the singular values of A and S are close, and (ii, under
additional assumptions) the top k left (right) singular vectors of S span a subspace
that is close to the subspace spanned by the top k left (right) singular vectors of A.
24
Element-wise sampling (cont’d)
Approximating singular values fast:
• Zero out (a large number of) elements of A, scale the remaining ones appropriately.
• Compute the singular values of the resulting sparse matrix using iterative techniques.
• (Good choice for pij: pij = s·Aij²/Σi,j Aij², where s denotes the expected number of
elements that we seek to keep in S.)
• Note: Each element is kept or discarded independently of the others.
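A minimal NumPy sketch of this element-wise sparsification step (dense arrays are used only to keep the sketch short, and capping pij at 1 is our assumption for entries whose computed probability exceeds 1):

```python
import numpy as np

def elementwise_sparsify(A, s, rng=np.random.default_rng(0)):
    """Keep entry (i, j) independently with prob. p_ij = s*A_ij^2 / sum(A^2)
    (capped at 1), rescale kept entries by 1/p_ij, zero out the rest."""
    p = np.minimum(1.0, s * A ** 2 / np.sum(A ** 2))
    keep = rng.random(A.shape) < p
    S = np.zeros_like(A, dtype=float)
    S[keep] = A[keep] / p[keep]
    return S   # E[S] = A; the top-k SVD of the sparse S approximates that of A
```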
Similar ideas have been used to:
• explain the success of Latent Semantic Indexing (LSI): (Papadimitriou, Raghavan, Tamaki,
Vempala, PODS ’98 and Azar, Fiat, Karlin, McSherry, Saia, STOC ’01)
• design recommendation systems: (Azar, Fiat, Karlin, McSherry, Saia STOC ’01 )
• speedup kernel computations:
(Achlioptas, McSherry, and Schölkopf, NIPS ’02)
25
Element-wise vs. row/column sampling
Quite different techniques!
• Row/column sampling preserves subspace/structural properties of the matrices.
• Element-wise sampling explains how adding noise and/or quantizing the elements of a
matrix perturbs its singular values/vectors.
• The two techniques should be complementary!
Some similarities and differences:
• Similar error bounds.
• Element-wise sampling is doable in one pass!
• Running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods.
• Element-wise methods do not seem amenable to many of the extensions we will
present.
26
Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method
27
A novel CUR matrix decomposition
(D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’06)
Create an approximation to the original matrix of the following form:
[Figure: A ≈ C·U·R, with C a matrix of O(1) columns of A, R a matrix of O(1) rows of A, and U a carefully chosen small matrix.]
An application: Given a query vector x, instead of computing A · x, compute
CUR · x to identify its nearest neighbors.
28
The CUR decomposition
Given a large m-by-n matrix A (stored on disk), compute a decomposition
CUR of A such that:
1. C consists of c = O(k/ε²) columns of A.
2. R consists of r = O(k/ε²) rows of A.
3. C (R) is created using importance sampling, e.g., columns (rows) are picked in
   i.i.d. trials with respect to probabilities proportional to their squared Euclidean lengths.
4. C, U, R can be stored in O(m+n) space, after making two passes through the
   entire matrix A, using O(m+n) additional space and time.
5. The product CUR satisfies (with high probability):
   ||A - CUR||2,F ≤ ||A - Ak||2,F + ε||A||F.
29
Computing U
Intuition: (which can be formalized - see later)
The CUR algorithm expresses every row of the matrix A as a linear
combination of a small subset of the rows of A.
• This small subset consists of the rows in R.
• Given a row of A – say A(i) – the algorithm computes a good fit for the row A(i)
  using the rows in R as the basis, by approximately solving a least-squares problem:
But, only c = O(1) elements of the i-th row are given as input.
So, we only approximate the optimal vector u instead of computing it exactly.
Actually, the pass-efficient CUR decomposition approximates the approximation.
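The following NumPy sketch illustrates the flavor of a CUR construction with length-squared sampling. For simplicity it computes U as C⁺AR⁺, which requires access to all of A; the pass-efficient algorithm described above instead approximates U from the sampled intersection of C and R, so treat this only as an illustration (the function name is ours):

```python
import numpy as np

def cur_sketch(A, c, r, rng=np.random.default_rng(0)):
    """Length-squared sampling of c columns and r rows of A.
    U is computed here as pinv(C) @ A @ pinv(R) for simplicity (needs all of A)."""
    m, n = A.shape
    fro2 = np.linalg.norm(A, "fro") ** 2
    pc = np.linalg.norm(A, axis=0) ** 2 / fro2        # column probabilities
    pr = np.linalg.norm(A, axis=1) ** 2 / fro2        # row probabilities
    cols = rng.choice(n, size=c, replace=True, p=pc / pc.sum())
    rows = rng.choice(m, size=r, replace=True, p=pr / pr.sum())
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R    # A is approximated by C @ U @ R
```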
30
Error bounds for CUR
Assume Ak is the “best” rank k approximation to A (through SVD).
Then, if we pick O(k/2) rows and O(k/2) columns,
If we pick O(1/2) rows and O(1/2) columns,
31
Previous CUR-type decompositions
Goreinov, Tyrtyshnikov, & Zamarashkin (LAA ’97, …)
  • C: columns that span max volume; U: W+; R: rows that span max volume.
  • Existential result; error bounds depend on ||W+||2; spectral norm bounds!
Berry, Stewart, & Pulatova (Num. Math. ’99, TR ’04, …)
  • C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A-CUR||F.
  • No a priori bounds; A must be known to construct U; solid experimental performance.
Williams & Seeger (NIPS ’01, …)
  • C: uniformly at random; U: W+; R: uniformly at random.
  • Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
D., Kannan & M. (SODA ’03, TR ’04, SICOMP ’06)
  • C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths.
  • “Sketching” massive matrices; provable, a priori, bounds; explicit dependency on A – Ak.
D., M., & Muthukrishnan (TR ’06)
  • C: depends on singular vectors of A; U: (almost) W+; R: depends on singular vectors of C.
  • (1+ε) approximation to A – Ak; computable in low polynomial time (suffices to compute SVD(A)).
32
(For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)
Lower Bounds
Question: How many queries does a sampling algorithm need to approximate a
given function accurately with high probability?
Lower bounds for the low rank matrix approximation
problem and the matrix reconstruction problem.
(Ziv Bar-Yossef ’03, ’04)
• Any sampling algorithm that w.h.p. finds a good low rank approximation requires Ω(m+n) queries.
• Even if the algorithm is given the exact weight distribution over the columns of a matrix, it will still
require Ω(k/ε⁴) queries.
• Finding a matrix D such that ||A-D||F ≤ ε||A||F requires Ω(mn) queries, and finding a D
such that ||A-D||2 ≤ ε||A||2 requires Ω(m+n) queries.
Applied to our results:
• The LinearTimeSVD algorithm is optimal w.r.t. ||·||F bounds.
• The ConstantTimeSVD algorithm is optimal w.r.t. ||·||2 bounds up to poly factors.
• The CUR algorithm is optimal for constant ε.
33
Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method
34
CUR application: Data Mining
Database: An m-by-n matrix A, e.g., m (> 10⁶) objects and n (> 10⁵) features.
Queries: Given a new object x, find similar objects (nearest neighbors) in A.
Closeness: Two normalized objects x and d are “close” if xT·d = cos(x,d) is high.
• Given a query vector x, the matrix product A·x computes all the angles/distances.
Key observation: The exact value xT·d might not be necessary.
• The feature values in the vectors are set by coarse heuristics.
• It is in general enough to see if xT·d > Threshold.
Algorithm: Given a query vector x, compute CUR · x to identify nearest neighbors.
Theorem: We have a bound on the worst-case error of using CUR·x instead of A·x:
35
CUR application: Genetic Microarray Data
Exploit structural properties of CUR in biological applications:
[Figure: the genes-by-experimental-conditions data matrix.]
Find a “good” set of genes and arrays to include in C and R?
Provable and/or heuristic strategies are acceptable.
Common in Biological/Chemical/Medical applications of PCA:
• Explain the singular vectors, by mapping them to meaningful biological processes.
• This is a “challenging” task (think: reification) !
CUR is a low-rank decomposition in terms of the data that practitioners understand.
• Use it to explain the data and do dimensionality reduction, classification, clustering.
Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).
36
CUR application: Recommendation Systems
(D., Raghavan, & Kerenidis, STOC ’02)
The problem: m customers and n products; Aij is the (unknown) utility of
product j for customer i.
The goal: recreate A from a few samples to recommend high utility products.
• (KRRT98): Assuming strong clustering of the products, competitive algorithms even with only 2 samples/customer.
• (AFKMS01): Assuming sampling of Ω(mn) entries of A and a gap requirement, accurately recreate A.
Question: Can we get competitive performance by sampling o(mn) elements?
Answer: Apply the CUR decomposition:
[Figure: CUR for recommendations: sample a few products for every customer (purchases, small surveys) and a few customers in full (“guinea pigs”).]
37
Kernel-CUR Motivation
Kernel-based learning methods to extract non-linear structure:
• Choose features to define a (dot product) space F.
• Map the data, X, to F by Φ: X -> F.
• Do classification, regression, and clustering in F with linear methods (SVMs,GPs,SVD).
If the Gram matrix G, where Gij = kij = (Φ(X(i)), Φ(X(j))), is dense but has low numerical
rank, then calculations of interest need O(n²) space and O(n³) time:
• matrix inversion in GP prediction,
• quadratic programming problems in SVMs,
• computation of eigendecomposition of G.
Relevant recent work using low-rank methods:
• (Williams and Seeger, NIPS ’01, etc.): Nystrom method for out-of-sample extensions.
• (Achlioptas, McSherry, and Schölkopf, NIPS ’02): randomized kernels.
38
Kernel-CUR Decomposition
(D. & M., COLT ’05, TR ’05, JMLR ‘05)
Input: n x n SPSD matrix G, probabilities {pi, i=1,…,n}, c ≤ n, and k ≤ c.
Algorithm:
• Let C be the n x c matrix containing c randomly sampled columns of G.
• Let W be the c x c matrix containing the intersection of C and CT.
Theorem: Let pi = Gii² / Σi Gii². If c = O(k log(1/δ)/ε⁴), then w.p. at least 1-δ,
If c = O(log(1/δ)/ε⁴), then with probability at least 1-δ,
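A minimal NumPy sketch of this construction (the function name and the small-eigenvalue cutoff are ours; the rank-k pseudoinverse of W is formed from W's eigendecomposition, since G, and hence W, is symmetric):

```python
import numpy as np

def kernel_cur(G, c, k, rng=np.random.default_rng(0)):
    """Sample c columns of the SPSD matrix G with p_i ~ G_ii^2, form the
    rescaled C and the c x c intersection W, and return C @ pinv_k(W) @ C.T."""
    n = G.shape[0]
    d = np.diag(G)
    p = d ** 2 / np.sum(d ** 2)
    idx = rng.choice(n, size=c, replace=True, p=p / p.sum())
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                              # n x c, rescaled columns
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)   # c x c intersection
    # Rank-k pseudoinverse of the symmetric matrix W.
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:k]
    keep = vals[top] > 1e-12 * np.abs(vals).max()      # drop (near-)zero eigenvalues
    Vk, lk = vecs[:, top][:, keep], vals[top][keep]
    Wk_pinv = (Vk / lk) @ Vk.T
    return C @ Wk_pinv @ C.T                           # approximates G
```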
39
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
40
Datasets modeled as tensors
Tensors: (naively, a dataset subscripted by multiple indices) appear both in Math and CS.
• Represent high dimensional functions.
• Connections to complexity theory (e.g., the complexity of matrix multiplication).
• Statistical applications (e.g., ICA, HOS, etc.).
• Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging)
Problem: There does not exist a definition of tensor rank (and associated
tensor SVD) with the – nice – properties found in the matrix case.
• (Lek-Heng Lim ’05: strong impossibility results!)
Common heuristic: “unfold” the tensor along a mode and apply Linear Algebra.
• We will do this, but note that this kills the essential tensor structure.
41
Datasets modeled as tensors (cont’d)
Goal: Extract structure from a tensor dataset A (naively, a dataset
subscripted by multiple indices) using a small number of samples.
[Figure: an m x n x p tensor A, with modes 1, 2, and 3.]
• Tensor rank (the minimum number of rank-one tensors) is NP-hard to compute.
• Tensor α-rank (“unfold” along the α-th mode and take the matrix SVD) is a commonly-used heuristic.
Randomized-Tensor-CUR: unfold along a “distinguished” mode and reconstruct.
Randomized-Tensor-SVD: unfold along every mode and choose columns.
(Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD,” TR05.)
42
The TensorCUR algorithm (3-modes)
• Choose the preferred mode α (e.g., time).
• Pick a few representative “slabs” (let R denote the tensor of the sampled slabs).
• Use only information in a small number of representative “fibers” (let C denote the
  tensor of sampled fibers and U a low-dimensional encoding tensor).
• Express the remaining slabs as linear combinations of the basis of sampled slabs.
43
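The sketch below illustrates these steps for a 3-mode tensor with mode 3 distinguished. Uniform sampling of slabs and fibers is used only to keep the code short (the tutorial's algorithm samples them non-uniformly), and the unfolding convention and function name are ours:

```python
import numpy as np

def tensor_cur_mode3(A, num_slabs, num_fibers, rng=np.random.default_rng(0)):
    """Unfold the m x n x p tensor A along mode 3, keep a few full slabs and
    a few fibers, and rebuild every slab by least squares on the kept slabs."""
    m, n, p = A.shape
    M = A.reshape(m * n, p).T                   # p x (mn) unfolding: row t = slab t
    slabs = rng.choice(p, size=num_slabs, replace=False)
    fibers = rng.choice(m * n, size=num_fibers, replace=False)
    R = M[slabs, :]                             # sampled slabs, kept in full
    # Fit each slab's entries on the sampled fibers to the sampled slabs' entries.
    X, *_ = np.linalg.lstsq(R[:, fibers].T, M[:, fibers].T, rcond=None)
    M_approx = X.T @ R                          # every slab as a combination of sampled slabs
    return M_approx.T.reshape(m, n, p)
```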
Tensor-CUR application: Hyperspectral Image Analysis
(with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved images or spectrally-resolved images
of medical interest using a small number of samples (images and/or pixels).
Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset
subscripted by multiple indices) or as a matrix (whose columns have internal structure that is
not modeled).
[Figure: ca. 500 x ca. 500 pixels at each of 128 frequencies, i.e., an m x n x p tensor A or an mn x p matrix A.]
Note: The chosen images are a dictionary from the data to express every image.
Note: The chosen pixels are a dictionary from the data to express every pixel.
44
45
Sampling hyperspectral data
Sample slabs depending on total absorption.
• For example, absorption at two pixel types:
Sample fibers uniformly (since intensity depends on stain).
46
Eigen-analysis of slabs and fibers
47
Look at the exact (65-th) slab.
48
The (65-th) slab approximately reconstructed
This slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.
49
Tissue Classification - Exact Data
50
Tissue Classification - Ns=12 & Nf=1000
51
Tensor-CUR application: Recommendation Systems
Important Comment:
• Utility is an ordinal, not a cardinal, concept.
• Compare products; don’t assign utility values.
Recommendation Model Revisited:
• Every customer has an n-by-n matrix (whose entries are +1/-1) representing pairwise product comparisons.
• There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.
• Extract the “structure” of this tensor.
52
Application to Jester Joke Recommendations
Use just the 14,140 “full” users who rated all 100 Jester jokes.
For each user, convert the utility vector to 100 x 100 pair-wise preference matrix.
Choose, e.g., 300 slabs/users, and a small number of fibers/comparisons.
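A small sketch of the ordinal encoding step (treating ties as 0 is our assumption; the slides only mention +1/-1 entries):

```python
import numpy as np

def pairwise_preferences(ratings):
    """One user's ratings -> n x n pairwise preference matrix:
    +1 if item i is preferred to item j, -1 if the reverse, 0 for ties."""
    r = np.asarray(ratings, dtype=float)
    return np.sign(r[:, None] - r[None, :])

# Stacking one such matrix per user gives the n-by-n-by-m preference tensor.
```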
[Figure: histogram of the number of users (0-8000) vs. successful recommendations out of 10, for 300 sampled slabs/users and 10,000, 100, 75, or 50 sampled fibers.]
53
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
54
Modeling data as matrices
People studying data:
• put the data onto a graph or into a vector space,
• even if the data don’t naturally or obviously live there,
• and perform graph operations or vector space operations
• to extract information from the data.
Such data often have structure unrelated to the graphical or linear
algebraic structure implicit in the modeling.
• This non-modeled structure is difficult to formalize.
Practitioners often have extensive field-specific intuition about the data.
• This intuition is often used to choose “where the data live.”
• The choice of where the data live may capture non-modeled structure.
55
Modeling data as matrices (cont’d)
Matrices often arise since n objects (“documents,” genomes, images, web
pages), each with m features, may be represented by an m x n matrix A.
Such data matrices often have structure:
• for linear structure, SVD or PCA is often used,
• for non-linear structure, kernel, e.g., diffusion-based, methods used,
• other structures include sparsity, nonnegativity, etc.
Note: We know what the rows/columns “mean” from the application area.
Goal: Develop principled provably-accurate algorithmic methods such that:
• they are agnostic with respect to any particular field,
• one can fruitfully couple them to the field-specific intuition,
• they perform well on complex non-toy data sets.
56
SVD and low-rank approximations
Theorem: Let A be an m x n matrix of rank ρ. Truncate the SVD of A by keeping k ≤ ρ terms:
Ak = Uk Sk VkT. This gives the “best” rank-k approximation to A.
Interesting properties of truncated SVD:
• Used in data analysis via Principal Components Analysis (PCA) .
• The rows of Uk (= UA,k) are NOT orthogonal and are NOT unit length.
• The lengths/Euclidean norms of the rows of Uk capture a notion of information
dispersal.
• Gives a low-rank approximation with a very particular structure (rotate-rescale-rotate).
• Best at capturing Frobenius (and other) norm.
• Problematic w.r.t. sparsity, interpretability, etc.
57
Problems with SVD/Eigen-Analysis
Problems arise since structure in the data is not respected by mathematical
operations on the data:
• Reification - maximum variance directions are just that.
• Interpretability - what does a linear combination of 6000 genes mean?
• Sparsity - is destroyed by orthogonalization.
• Non-negativity - is a convex and not a linear algebraic notion.
The SVD gives two bases to diagonalize the matrix.
• Truncating gives a low-rank matrix approximation with a very particular structure.
• Think: rotation-with-truncation; rescaling; rotation-back-up.
Question: Do there exist “better” low-rank matrix approximations?
• “better” structural properties for certain applications.
• “better” at respecting relevant structure.
• “better” for interpretability and informing intuition.
58
Exactly and approximately rank-k matrices
Theorem: Let the m x n matrix A be exactly rank k. Then:
• There exist k linear combinations of columns, rows, etc. such that:
A = Ak = UkSkVkT.
• There exist k actual columns and rows of A, permuted, such that A can be written exactly as CUR.
Take-home message: Low-rank structure IS redundancy in columns/rows.
Theorem: Let the m x n matrix A be approximately rank k. Then:
A ≈ Ak = UkSkVkT is the “best” approximation.
Question: Can we express approximately rank-k matrices in terms of their
actual columns and/or rows?
59
Dictionaries for data analysis
Discrete Cosine Transform (DCT):
• fj = Σn=0,…,N-1 xn cos[πj(n+1/2)/N]
• the basis is fixed.
• O(N²) or O(N log N) computation to determine coefficients.
Singular Value Decomposition (SVD):
• A = Σi=1,…,ρ σi U(i) V(i)T = Σi=1,…,ρ σi A[i]
• O(N³) computation to determine basis and coefficients.
Many other more complex/expensive procedures depending on the application.
Question: Can actual data points and/or feature vectors be the dictionary?
• “Core-sets” on graphs.
• “CUR-decompositions” on matrices.
60
CX and CUR matrix decompositions
Recall: Matrices are about their rows and columns.
Recall: Low-rank matrices have redundancy in their rows and columns.
Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in
terms of a small number of columns of the original matrix A (e.g., the projection PC A = CC+A).
Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed
in terms of a small number of columns and rows of the original matrix A.
[Figure: A ≈ C·U·R, with C a matrix of O(1) columns of A, R a matrix of O(1) rows of A, and U a carefully chosen small matrix.]
61
Dictionaries & the SVD
A = U S VT = Σi=1,…,ρ σi U(i) V(i)T,
• where U(i), V(i) = eigen-cols and eigen-rows.
Approximate: A(j) ≈ Σi=1,…,k zij U(i)
• by minzij || A(j) - Σi=1,…,k zij U(i) ||2.
Z = UkT A --> A ≈ Ak = (Uk UkT) A
• project onto the space of the top k eigen-cols.
Z = Sk VkT --> A ≈ Ak = Uk (Sk VkT)
• approximate every column of A in terms of a small number of eigen-rows and a low-dimensional encoding matrix Sk.
62
Dictionaries & columns and rows
A ≈ CUR = Σij uij C(i) R(j), where U = W+ and W = intersection of C and R,
• where C(i), R(j) = actual-cols and actual-rows.
Approximate: A(j) ≈ Σi=1,…,c yij C(i)
• by minyij || A(j) - Σi=1,…,c yij C(i) ||2.
Y = C+ A --> A ≈ PC A = (CC+) A
• project onto the space of those c actual-cols.
Y ≈ W+ R --> A ≈ PC A ≈ C (W+ R)
• approximate every column of A in terms of a small number of actual-rows and a low-dimensional encoding matrix U = W+.
63
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
64
Problem formulation (1 of 3)
Never mind columns and rows - just deal with columns (for now) of the matrix A.
• Could ask to find the “best” k of n columns of A.
• Combinatorial problem - the trivial algorithm takes n^k time.
• Probably NP-hard if k is not fixed.
Let’s ask a different question.
• Fix a rank parameter k.
• Let’s over-sample columns by a little (e.g., k+3, 10k, k², etc.).
• Try to get close (additive error or relative error) to the “best” rank-k approximation.
65
Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k²), or …, columns s.t.:
||A-CC+A||2,F < ||A-Ak||2,F + ε||A||F
Ans: Yes - and can find them in O(m+n) space and time after two passes over the data! (DFKVV99, DKM04)
Ques: Do there exist O(k), or O(k²), or …, columns such that
||A-CC+A||2,F < (1-ε)^(-1) ||A-Ak||2,F + ε^t ||A||F
Ans: Yes - and can find them in O(m+n) space and time after t passes over the data! (RVW05, DM05)
Ques: Do there exist, and can we find, O(k), or O(k²), or …, columns such that
||A-CC+A||F < (1+ε) ||A-Ak||F
Ans: Yes, they exist - existential proof - no non-exhaustive algorithm given! (RVW05, DRVW06)
Ans: ...
66
Problem formulation (3 of 3)
Ques: Do there exist O(k), or O(k²), or …, columns and rows such that
||A-CUR||2,F < ||A-Ak||2,F + ε||A||F
Ans: Yes - lots of them, and can find them in O(m+n) space and time after two passes
over the data! (DK03, DKM04)
Note: “lots of them” since these are randomized Monte Carlo algorithms!
Ques: Do there exist O(k), or O(k²), or …, columns and rows such that
||A-CUR||F < (1+ε) ||A-Ak||F
Ans: …
67
Theorem: Relative-Error CUR
Fix any k, , . Then, there exists a Monte Carlo algorithm that uses
O(SVD(Ak)) time to find C and R and construct U s.t.:
holds with probability at least 1-, by picking
c = O( k2 log(1/) / 2 ) columns, and
r = O( k4 log2(1/) / 6 ) rows.
(Current theory work: we can improve the sampling complexity to c,r=O(k poly(1/, 1/)).)
(Current empirical work: we can usually choose c,r ≤ k+4.)
(Don’t worry about : choose =1 if you want!)
68
L2 Regression problems
First consider overconstrained problems, n >> d.
• Typically, there is no x such that Ax = b.
• Can generalize to non-overconstrained problems if rank(A) = k.
We seek sampling-based algorithms for approximating l2 regression.
• Nontrivial structural insights in overconstrained problems.
• Nontrivial algorithmic insights for non-overconstrained problems.
69
Creating an induced subproblem
Algorithm
1. Fix a set of probabilities pi, i=1,…,n, summing up to 1.
2. Pick r indices from {1,…,n} in r i.i.d. trials, with respect to the pi’s.
3. For each sampled index j, keep the j-th row of A and the j-th element of b;
   rescale both by (1/(r pj))^(1/2).
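A minimal NumPy sketch of this induced-subproblem construction; uniform probabilities are the default here only for brevity, whereas the theory below requires probabilities satisfying certain conditions (the function name is ours):

```python
import numpy as np

def sampled_l2_regression(A, b, r, p=None, rng=np.random.default_rng(0)):
    """Sample r rows of (A, b) with probabilities p, rescale by 1/sqrt(r*p_j),
    and solve the induced least-squares subproblem."""
    n = A.shape[0]
    p = np.full(n, 1.0 / n) if p is None else np.asarray(p) / np.sum(p)
    idx = rng.choice(n, size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    x_tilde, *_ = np.linalg.lstsq(A[idx] * scale[:, None], b[idx] * scale, rcond=None)
    return x_tilde
```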
70
The induced subproblem
[Figure: the sampled rows of A (rescaled) and the sampled elements of b (rescaled) form the smaller induced regression problem.]
71
Our main L2 Regression result
If the pi satisfy certain conditions, then with probability at least 1-δ, the solution of the
induced subproblem approximates the optimal solution; the bound on the solution vector itself
involves κ(A), the condition number of A.
The sampling complexity is r = O(d²).
(New improvement: we can reduce the sampling complexity to r = O(d).)
72
Conditions for the probabilities
The conditions that the pi must satisfy, for some β1, β2, β3 ∈ (0,1], involve:
• the lengths of the rows of the matrix of left singular vectors of A, and
• the component of b not in the span of the columns of A.
Small βi ⇒ more sampling.
The sampling complexity is:
73
Rows of left singular vectors
What do the lengths of the rows of the n x d matrix U = UA “mean”?
Consider possible n x d matrices U of d left singular vectors:
• In|k = k columns from the identity: row lengths = 0 or 1; In|k x -> x.
• Hn|k = k columns from the n x n Hadamard (real Fourier) matrix: row lengths all equal; Hn|k x -> maximally dispersed.
• Uk = k columns from any orthogonal matrix: row lengths between 0 and 1.
Lengths of the rows of U = UA correspond to a notion of information dispersal:
where in Rm the Σ information in A is sent, not what the Σ information is.
74
Comments on L2 Regression
Main point: The relevant information for l2 regression if n >> d is contained
in an induced subproblem of size O(d²)-by-d.
In O(nd²) = O(SVD(A)) = O(SVD(Ad)) time we can easily compute pi’s that
satisfy all three conditions, with β1 = β2 = β3 = 1/3.
• Too expensive in practice for this over-constrained problem!
• NOT too expensive when applied to CX and CUR matrix problems!!
Key observation (FKV98, DK01, DKM04, RV04): US is almost orthogonal,
and we can bound the spectral and the Frobenius norms of UST US – I.
NOTE: K. Clarkson in SODA 2005 analyzed sampling-based algorithms for
overconstrained l1 regression (p = 1) problems.
75
L2 Regression and CUR Approximation
Extended L2 Regression Algorithm:
• Input: m x n matrix A, m x p matrix B, and a rank parameter k.
• Output: n x p matrix X approximately solving minX ||Ak X - B||F.
• Algorithm: Randomly sample r = O(d²) or r = O(d) rows from Ak and B, and
  solve the induced sub-problem: Xopt = Ak+ B ≈ (S Ak)+ S B.
• Complexity: O(SVD(Ak)) time and space.
Corollary 1: Approximately solve minX ||AkT X - AT||F to get columns C such
that ||A-CC+A||F ≤ (1+ε) ||A-Ak||F.
Corollary 2: Approximately solve minX ||C X - A||F to get rows R such that
||A-CUR||F ≤ (1+ε) ||A-CC+A||F.
76
Theorem: Relative-Error CUR
Fix any k, , . Then, there exists a Monte Carlo algorithm that uses
O(SVD(Ak)) time to find C and R and construct U s.t.:
holds with probability at least 1-, by picking
c = O( k2 log(1/) / 2 ) columns, and
r = O( k4 log2(1/) / 6 ) rows.
(Current theory work: we can improve the sampling complexity to c,r=O(k poly(1/, 1/)).)
(Current empirical work: we can usually choose c,r ≤ k+4.)
(Don’t worry about : choose =1 if you want!)
77
Subsequent relative-error algorithms
November 2005: Drineas, Mahoney, and Muthukrishnan
• The first relative-error low-rank matrix approximation algorithm.
• O(SVD(Ak)) time and O(k²) columns for both CX and CUR decompositions.
January 2006: Har-Peled
• Used ε-nets and VC-dimension arguments on optimal k-flats.
• O(mn k² log(k)) - “linear in mn” time to get a (1+ε) approximation.
March 2006: Deshpande and Vempala
• Used a volume sampling - adaptive sampling procedure of RVW05, DRVW06.
• O(Mk/ε) ≈ O(SVD(Ak)) time and O(k log(k)) columns for a CX-like decomposition.
April 2006: Drineas, Mahoney, and Muthukrishnan
• Improved the DMM November 2005 result to O(k log(k)) columns.
78
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
79
CUR data application: DNA tagging-SNPs
(data from K. Kidd’s lab at Yale University, joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms: the most common type of genetic variation in the
genome across different individuals.
They are known locations at the human genome where two alternate nucleotide bases
(alleles) are observed (out of A, C, G, T).
individuals
SNPs
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
There are ∼10 million SNPs in the human genome, so this table could have ~10 million columns.
80
Recall “the” human genome
• Human genome ≈ 3 billion base pairs
• 30,000 – 40,000 genes
• The functionality of 97% of the genome is unknown.
• BUT: individual differences (polymorphic variation) at ≈ 1 b.p. per thousand.
individuals
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG
GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA
SNPs occur quite frequently within the genome allowing the tracking of disease genes
and population histories.
Thus, SNPs are effective markers for genomic research.
81
Focus at a specific locus and assay the observed alleles: C and T on the two copies of a chromosome (father, mother).
SNP: exactly two alternate alleles appear.
An individual could be:
- Heterozygotic (in our study, CT = TC)
individuals
SNPs
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
82
Focus at a specific locus and assay the observed alleles: C and C on the two copies of a chromosome (father, mother).
SNP: exactly two alternate alleles appear.
An individual could be:
- Heterozygotic (in our study, CT = TC)
- Homozygotic at the first allele, e.g., C
individuals
SNPs
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
83
Focus at a specific locus and assay the observed alleles: T and T on the two copies of a chromosome (father, mother).
SNP: exactly two alternate alleles appear.
An individual could be:
- Heterozygotic (in our study, CT = TC)
- Homozygotic at the first allele, e.g., C
- Homozygotic at the second allele, e.g., T
individuals
SNPs
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
84
Why are SNPs important?
Genetic Association Studies:
• Locate causative genes for common
complex disorders (e.g., diabetes, heart
disease, etc.) by identifying association
between affection status and known SNPs.
• No prior knowledge about the function of
the gene(s) or the etiology of the disorder
is necessary.
Biology and Association Studies: The subsequent investigation of candidate genes
that are in physical proximity with the associated SNPs is the first step towards
understanding the etiological “pathway” of a disorder and designing a drug.
Data Analysis and Association Studies: Susceptibility alleles (and genotypes
carrying them) should be more common in the patient population.
85
SNPs carry redundant information
• Key observation: non-random relationship between SNPs.
• Human genome is organized into a block-like structure.
• Strong intra-block correlations.
• We can focus only on “tagSNPs”.
• Among different populations (eg., European, Asian, African, etc.), different
patterns of SNP allele frequencies or SNP correlations are often observed.
• Understanding such differences is crucial in order to develop the “next generation”
of drugs that will be “population specific” (eventually “genome specific”) and not just
“disease specific”.
86
Funding …
• Mapping the whole genome sequence of a single individual is very expensive.
• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.
HapMap project (~$100,000,000 funding from NIH and other sources):
• Map all 10,000,000 SNPs for 270 individuals from 4 different populations (YRI,
CEU, CHB, JPT), in order to create a “genetic map” to be used by researchers.
• Also, funding from pharmaceutical companies, NSF, the Department of Justice*,
etc.
* Is it possible to identify the ethnicity of a suspect from his DNA?
87
Research directions
Research questions (working within a population), and why they matter:
(i) Are different SNPs correlated, within or across populations?
    - Understand structural properties of the human genome.
(ii) Find a “good” set of tagging-SNPs capturing the diversity of a chromosomal region of the human genome.
    - Save time/money by assaying only the tSNPs and predicting the rest.
(iii) Find a set of individuals that capture the diversity of a chromosomal region.
    - Save time/money by running (drug) tests only on the cell lines of the selected individuals.
(iv) Is extrapolation feasible?
Existing literature
Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and
co-occurrences.
Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very difficult to generalize.
Exhaustive and semi-exhaustive algorithms in order to pick “good” ht-SNPs that have small LD distance with all other
SNPs.
Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.
88
The DNA SNP data
• Samples from 38 different populations.
• Average size 50 subjects/population.
• For each subject 63 SNPs were assayed, from a region in chromosome 17 called SORCS3,
≈ 900,000 bases long.
• We are in the process of analyzing HapMap data as well as 3 more regions assayed by
Kidd’s lab (with Asif Javed).
89
N > 50
N: 25 ~ 50
Finns
Kom Zyrian
Yakut
Khanty
Irish
European, Mixed
Danes
Chuvash
Russians
African Americans
Jews, Ashkenazi
Adygei
Druze
Samaritans
Pima, Arizona
Cheyenne
Chinese,
Hakka
Japanese
Han
Chinese, Taiwan
Cambodians
Jews, Yemenite
Ibo
Maya
Atayal
Hausa
Yoruba
Pima, Mexico
Ami
Biaka
Jews, Ethiopian
Mbuti
Ticuna
Micronesians
Chagga
Nasioi
Surui
Karitiana
Africa
Europe
NW Siberia
NE Siberia
SW Asia
E Asia
N America
S America
Oceania
90
Encoding the data
SNPs
individuals
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1
How?
• Exactly two nucleotides (out of A, G, C, T) appear in each column.
• Thus, the two alleles might be both equal to the first one (encode by +1), both
  equal to the second one (encode by -1), or different (encode by 0).
Notes
• Order of the alleles is irrelevant, so TG is the same as GT.
• Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes)
  from encoding GG to -1 and TT to +1.
(Flipping the signs of the columns of a matrix does not affect our techniques.)
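A small sketch of this encoding (the choice of reference allele per SNP is arbitrary, per the note above; the function name and input format are ours):

```python
import numpy as np

def encode_genotypes(rows):
    """rows[i][j] is a 2-character genotype string (e.g. "AG") for individual i,
    SNP j.  Homozygous in the reference allele -> +1, homozygous in the other
    allele -> -1, heterozygous -> 0.  The reference allele of a SNP is taken to
    be the first allele seen in that column (an arbitrary choice)."""
    n_ind, n_snp = len(rows), len(rows[0])
    X = np.zeros((n_ind, n_snp))
    for j in range(n_snp):
        ref = rows[0][j][0]
        for i in range(n_ind):
            a, b = rows[i][j][0], rows[i][j][1]
            X[i, j] = 0 if a != b else (+1 if a == ref else -1)
    return X

# Example: encode_genotypes([["AG", "CT"], ["GG", "TT"], ["AA", "CC"]])
#          -> [[ 0,  0], [-1, -1], [ 1,  1]]
```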
91
Evaluating (linear) structure
For each population:
• We ran SVD to determine the “optimal” number k of eigenSNPs covering 90% of the variance.
  If we pick the top k left singular vectors we can express every column (i.e., SNP) of A as a
  linear combination of the left singular vectors, losing 10% of the data.
• We ran CUR to pick a small number (e.g., k+2) of columns of A and express every column
  (i.e., SNP) of A as a linear combination of the picked columns, losing 10% of the data.
SNPs
individuals
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1
92
93
Predicting SNPs within a population
Split the individuals in two sets: training and test.
Given a small number of SNPs for all individuals (tagging-SNPs), and all SNPs
for individuals in the training set, predict the unassayed SNPs.
Tagging-SNPs are selected using only the training set.
[Figure: the SNPs-by-individuals matrix. “Training” individuals, chosen uniformly at random: for a few subjects we are given all SNPs. SNP sample: for all subjects we are given a small number of SNPs.]
94
95
96
Predicting SNPs across populations
Given all SNPs for all individuals in population X, and a small number of tagging-SNPs for population Y, predict all unassayed SNPs for all individuals of Y.
Tagging-SNPs are selected using only the training set.
Training set: individuals in X.
Test set: individuals in Y.
A: contains all individuals in both X and Y.
[Figure: the SNPs-by-individuals matrix. For all individuals in population X we are given all SNPs; for all subjects we are given a small number of SNPs (the SNP sample).]
97
Transferability of tagging SNPs
[Figure: IN: population A → Select tSNPs → OUT: set of tSNPs → Assay the tSNPs in population B → IN: population A & the assayed tSNPs in B → Reconstruct SNPs → OUT: the unassayed SNPs in B.]
99
100
Keeping both SNPs and individuals
Given a small number of SNPs for all individuals, and all SNPs for some
judiciously chosen individuals, predict the values of the remaining SNPs.
[Figure: the SNPs-by-individuals matrix. “Basis” individuals, judiciously chosen: for a few subjects we are given all SNPs. SNP sample: for all subjects we are given a small number of SNPs.]
101
102
103
Overview (2/2)
• Tensor-based data sets
• Tensor-CUR
• Hyperspectral data
• Recommendation systems
• From Very-Large to Medium-Sized Data
• Relative-error CX and CUR Matrix Decompositions
• L2 Regression Problems
• Application to DNA SNP Data
• Conclusions and Open Problems
104
Conclusions & Open Problems
• Impose other structural properties in CUR-type decompositions
• Non-negativity
• Element quantization, e.g., to 0,1,-1
• Block-SVD type structure
• Robust heuristics and robust extensions
• Especially for noisy data
• L1 norm bounds
• Extension to different statistical learning problems
  • Matrix reconstruction
  • Regression
  • Classification, e.g., SVM
  • Clustering
105
Conclusions & Open Problems
• Relate to traditional numerical linear algebra
• Gu and Eisenstat - deterministically find well-conditioned columns
• Goreinov and Tyrtyshnikov - volume-maximization and conditioning criteria
• Stewart - backward error analysis
• Empirical evaluation of different sampling probabilities
• Uniform
• Non-uniform; norms of rows/columns or of left/right singular vectors
• Others, depending on the problem to be solved?
• Use CUR and CW+CT for improved interpretability for data matrices
• Structural vs. algorithmic issues
106
Workshop on “Algorithms for Modern Massive Data Sets”
(http://www.stanford.edu/group/mmds/)
@ Stanford University and Yahoo! Research, June 21-24, 2006
Objective:
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists, computational and applied mathematicians, statisticians, and
practitioners to promote cross-fertilization of ideas.
Organizers:
G. Golub, M. W. Mahoney, L-H. Lim, and P. Drineas.
Sponsors:
NSF, Yahoo! Research, Ask!.
107