Transcript
Computational Mathematics
for Large-scale Data Analysis
Alexander Gray
Georgia Institute of Technology
College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
The FASTlab
Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS
Exponential growth in dataset sizes

Instruments
[Figure: exponential growth of instrument capabilities, 1985-2010, log scale from 1.0E+02 to 1.0E+07]
[Science, Szalay & J. Gray, 2001]

Data: CMB Maps
1990  COBE        1,000
2000  Boomerang   10,000
2002  CBI         50,000
2003  WMAP        1 Million
2008  Planck      10 Million

Data: Local Redshift Surveys
1986  CfA    3,500
1996  LCRS   23,000
2003  2dF    250,000
2005  SDSS   800,000

Data: Angular Surveys
1970  Lick   1M
1990  APM    2M
2005  SDSS   200M
2010  LSST   2B
Astrophysicist
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?
R. Nichol, Inst. Cosmol. Gravitation
A. Connolly, U. Pitt Physics
C. Miller, NOAO
R. Brunner, NCSA
G. Djorgovsky, Caltech
G. Kulkarni, Inst. Cosmol. Gravitation
D. Wake, Inst. Cosmol. Gravitation
R. Scranton, U. Pitt Physics
M. Balogh, U. Waterloo Physics
I. Szapudi, U. Hawaii Inst. Astronomy
G. Richards, Princeton Physics
A. Szalay, Johns Hopkins Physics
Machine learning/statistics guy
• Kernel density estimator O(N²)
• n-point spatial statistics O(Nⁿ)
• Nonparametric Bayes classifier O(N²)
• Support vector machine O(N³)
• Nearest-neighbor statistics O(N²)
• Gaussian process regression O(N³)
• Hierarchical clustering O(N³)
Astrophysicist: But I have 1 million points!
Uncertainty quantification
Many types – all generally lead to more computational expense:
• Uncertainty of statistical model: cross-validation, bootstrap
• Uncertainty of simulation: run many times, estimate a model each time
• Uncertainty of simulation vs. real data: expensive spatial statistics
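To make that expense concrete, here is a minimal bootstrap sketch (illustrative Python, not from the talk; bootstrap_std_err is a hypothetical helper): any uncertainty estimate of this kind refits the statistic B times, multiplying whatever the base cost is.

```python
# Minimal bootstrap sketch (not from the talk): estimating the uncertainty
# of a statistic refits it on B resamples, multiplying the base cost by B.
import numpy as np

def bootstrap_std_err(data, statistic, B=200, seed=0):
    """Standard error of `statistic` via B bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    return np.std(reps, ddof=1)

data = np.random.default_rng(1).normal(size=100_000)
print(bootstrap_std_err(data, np.median))   # ~200x the cost of one median
```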
How to do Stats/Machine Learning
on Very Large Survey Datasets?
1. Choose the appropriate
statistical task and method
for the scientific question
2. Use the fastest algorithm
and data structure for the
statistical method
3. Apply method to the data!
10 data analysis problems, and scalable tools we'd like for them

1. Querying (e.g. characterizing a region of space):
   – spherical range-search O(N)
   – orthogonal range-search O(N)
   – k-nearest-neighbors O(N)
   – all-k-nearest-neighbors O(N²)

2. Density estimation (e.g. comparing galaxy types):
   – mixture of Gaussians
   – kernel density estimation O(N²)
   – L2 density tree [Ram and Gray, in prep]
   – manifold kernel density estimation O(N³) [Ozakin and Gray 2008, to be submitted]
   – hyper-kernel density estimation O(N⁴) [Sastry and Gray 2008, submitted]
10 data analysis problems, and scalable tools we'd like for them

3. Regression (e.g. photometric redshifts):
   – linear regression O(D²)
   – kernel regression O(N²)
   – Gaussian process regression/kriging O(N³)

4. Classification (e.g. quasar detection, star-galaxy separation):
   – k-nearest-neighbor classifier O(N²)
   – nonparametric Bayes classifier O(N²)
   – support vector machine (SVM) O(N³)
   – non-negative SVM O(N³) [Guan and Gray, in prep]
   – false-positive-limiting SVM O(N³) [Sastry and Gray, in prep]
   – separation map O(N³) [Vasiloglou, Gray, and Anderson 2008, submitted]
10 data analysis problems, and scalable tools we'd like for them

5. Dimension reduction (e.g. galaxy or spectra characterization):
   – principal component analysis O(D²)
   – non-negative matrix factorization
   – kernel PCA O(N³)
   – maximum variance unfolding O(N³)
   – co-occurrence embedding O(N³) [Ozakin and Gray, in prep]
   – rank-based manifolds O(N³) [Ouyang and Gray 2008, ICML]
   – isometric non-negative matrix factorization O(N³) [Vasiloglou, Gray, and Anderson 2008, submitted]

6. Outlier detection (e.g. new object types, data cleaning):
   – by density estimation, by dimension reduction
   – by robust Lp estimation [Ram, Riegel and Gray, in prep]
10 data analysis problems, and scalable tools we'd like for them

7. Clustering (e.g. automatic Hubble sequence):
   – by dimension reduction, by density estimation
   – k-means
   – mean-shift segmentation O(N²)
   – hierarchical clustering ("friends-of-friends") O(N³)

8. Time series analysis (e.g. asteroid tracking, variable objects):
   – Kalman filter O(D²)
   – hidden Markov model O(D²)
   – trajectory tracking O(Nⁿ)
   – Markov matrix factorization [Tran, Wong, and Gray 2008, submitted]
   – functional independent component analysis [Mehta and Gray 2008, submitted]
10 data analysis problems, and scalable tools we'd like for them

9. Feature selection and causality (e.g. which features predict star/galaxy):
   – LASSO regression
   – L1 SVM
   – Gaussian graphical model inference and structure search
   – discrete graphical model inference and structure search
   – 0-1 feature-selecting SVM [Guan and Gray, in prep]
   – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]

10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
   – minimum spanning tree O(N³)
   – n-point correlation O(Nⁿ)
   – bipartite matching/Gaussian graphical model inference O(N³) [Waters and Gray, in prep]
How to do Stats/Machine Learning
on Very Large Survey Datasets?
1. Choose the appropriate
statistical task and method
for the scientific question
2. Use the fastest algorithm
and data structure for the
statistical method
3. Apply method to the data!
Core computational problems
What are the basic mathematical operations, or bottleneck subroutines, that we should focus on developing fast algorithms for?
Core computational problems
Aggregations, GNPs, graphs, linear algebra, optimization, integration
• Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends"), by dimension reduction, by density estimation
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching
Simple example: spherical range search.
How can we compute this efficiently?
kd-trees: the most widely-used space-partitioning tree
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
[Figure sequence: a kd-tree, levels 1 through 6]
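As a concrete reference point, here is a minimal kd-tree construction sketch (my illustration; details such as the median split on the widest dimension and the leaf size are assumptions, in the spirit of [Bentley 1975] and [Friedman, Bentley & Finkel 1977]). Later sketches reuse this Node/build_kdtree.

```python
# Minimal kd-tree construction sketch. Nodes keep index lists plus a
# bounding box, which is all the pruning rules below need.
import numpy as np

class Node:
    def __init__(self, X, idx):
        self.idx = idx                          # indices of the points in this node
        self.lo = X[idx].min(axis=0)            # bounding-box corners
        self.hi = X[idx].max(axis=0)
        self.left = self.right = None

def build_kdtree(X, idx=None, leaf_size=16):
    idx = np.arange(len(X)) if idx is None else idx
    node = Node(X, idx)
    if len(idx) > leaf_size:
        dim = np.argmax(node.hi - node.lo)      # split the widest dimension
        order = idx[np.argsort(X[idx, dim])]
        mid = len(idx) // 2
        node.left = build_kdtree(X, order[:mid], leaf_size)
        node.right = build_kdtree(X, order[mid:], leaf_size)
    return node
```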
Range-count recursive algorithm
[Figure sequence: the recursion walks the kd-tree; a node whose box lies entirely inside the query radius is pruned by inclusion (counted wholesale), and a node whose box lies entirely outside is pruned by exclusion]
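A sketch of that recursion (again illustrative; it reuses Node and build_kdtree from the kd-tree sketch above): exclusion discards a node whose box is entirely outside the radius, inclusion counts a node wholesale when its box is entirely inside.

```python
# Recursive range-count with the two prunes from the slides. min_dist and
# max_dist bound the distance from the query to anything in a node's box.
import numpy as np

def min_dist(q, node):
    return np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))

def max_dist(q, node):
    return np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))

def range_count(X, q, node, r):
    if min_dist(q, node) > r:          # exclusion prune: nothing can be within r
        return 0
    if max_dist(q, node) <= r:         # inclusion prune: everything is within r
        return len(node.idx)
    if node.left is None:              # leaf: check the points directly
        return int(np.sum(np.linalg.norm(X[node.idx] - q, axis=1) <= r))
    return range_count(X, q, node.left, r) + range_count(X, q, node.right, r)
```

On clustered data most of the tree is pruned near the root, which is where the practical speedup comes from.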
The fastest practical algorithm [Bentley 1975]. Our algorithms can use any tree.
How to do Stats/Machine Learning
on Very Large Survey Datasets?
1. Choose the appropriate
statistical task and method
for the scientific question
2. Use the fastest algorithm
and data structure for the
statistical method
3. Apply method to the data!
All-k-nearest-neighbors
• Applications: querying, adaptive density estimation first step [Budavari et al., in prep]
• Naïve cost: O(N²)
• Fastest algorithm: Generalized Fast Multipole Method, aka multi-tree methods
  – Technique: use two trees simultaneously
  – [Gray and Moore 2001, NIPS; Riegel, Boyer and Gray, to be submitted]
  – O(N), exact
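A minimal two-tree sketch of the idea (my simplification, not the FASTlab implementation; it reuses Node/build_kdtree from the kd-tree sketch): a node pair (Q, R) is pruned when R's bounding box cannot beat the worst current neighbor bound of any query in Q.

```python
# Dual-tree all-nearest-neighbors sketch (k = 1 for brevity). Correctness
# holds regardless of pruning; the prune is what removes the O(N^2) work.
import numpy as np

def node_min_dist(a, b):               # min distance between two bounding boxes
    gap = np.maximum(a.lo - b.hi, 0) + np.maximum(b.lo - a.hi, 0)
    return np.linalg.norm(gap)

def all_nn(X, Q, R, best_d, best_i):
    if node_min_dist(Q, R) > best_d[Q.idx].max():
        return                               # prune: R cannot improve any query in Q
    if Q.left is None and R.left is None:    # leaf-leaf: brute force the block
        D = np.linalg.norm(X[Q.idx][:, None] - X[R.idx][None], axis=2)
        D[np.equal.outer(Q.idx, R.idx)] = np.inf   # a point is not its own neighbor
        j = D.argmin(axis=1)
        d = D[np.arange(len(Q.idx)), j]
        hit = d < best_d[Q.idx]
        best_d[Q.idx[hit]] = d[hit]
        best_i[Q.idx[hit]] = R.idx[j[hit]]
        return
    for q in ([Q] if Q.left is None else [Q.left, Q.right]):
        for r in ([R] if R.left is None else [R.left, R.right]):
            all_nn(X, q, r, best_d, best_i)

# usage:
# root = build_kdtree(X)
# best_d = np.full(len(X), np.inf); best_i = np.full(len(X), -1)
# all_nn(X, root, root, best_d, best_i)
```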
Kernel density estimation
• Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
  – Technique: Hermite operators, Monte Carlo
  – [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
  – O(N), relative error guarantee
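The Monte Carlo half of that idea fits in a few lines (a toy version of my own, loosely echoing the sampling idea in [Holmes, Gray and Isbell 2007], not the actual algorithm): estimate each query's kernel sum from a random subsample of the references, trading a probabilistic error guarantee for O(N·m) work with sample size m.

```python
# Toy Monte Carlo kernel-sum sketch: the mean kernel value over a random
# subsample is an unbiased estimate of the mean over all N references.
import numpy as np

def kde_mc(queries, refs, h, m=256, seed=0):
    rng = np.random.default_rng(seed)
    sample = refs[rng.integers(0, len(refs), size=m)]   # m << N references
    d2 = ((queries[:, None] - sample[None]) ** 2).sum(axis=2)
    kernel_mean = np.exp(-d2 / (2 * h * h)).mean(axis=1)
    norm = (2 * np.pi * h * h) ** (queries.shape[1] / 2)
    return kernel_mean / norm           # estimated density at each query

X = np.random.default_rng(2).normal(size=(100_000, 2))
print(kde_mc(X[:5], X, h=0.2))
```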
Nonparametric Bayes classification (or KDA)
• Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree bounding with hybrid tree expansion
  – Technique: bound each posterior
  – [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
  – O(N), exact
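What the classifier computes is simple to state (a naive sketch of my own below): a kernel density estimate per class, weighted by the class prior. The dual-tree algorithm gets the same answer by bounding each posterior and stopping as soon as the winning class cannot change.

```python
# Naive kernel discriminant analysis sketch: classify by prior-weighted
# class-conditional KDEs; each prediction costs O(N) kernel evaluations.
import numpy as np

def kernel_sum(q, X, h):
    d2 = ((X - q) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * h * h)).mean() / ((2 * np.pi * h * h) ** (X.shape[1] / 2))

def kda_predict(q, classes, h):
    """classes: list of (points_array, prior) pairs; returns the argmax class."""
    return int(np.argmax([prior * kernel_sum(q, X, h) for X, prior in classes]))
```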
Friends-of-friends
• Applications: cluster-finding, 2-sample
• Naïve cost: O(N³)
• Fastest algorithm: dual-tree Boruvka algorithm
  – Technique: similar to all-k-nn
  – [March and Gray, to be submitted]
  – O(N log N), exact
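For orientation, here is plain Boruvka (a brute-force illustration of my own): each round, every component adds its shortest outgoing edge, so only O(log N) rounds are needed. The dual-tree version in [March and Gray] replaces the brute-force scan for each component's nearest outside point with tree pruning.

```python
# Plain Boruvka MST sketch (brute force). Each round every component merges
# along its shortest outgoing edge; a minimum spanning tree has N-1 edges.
import numpy as np

def boruvka_mst(X):
    n = len(X)
    comp = np.arange(n)                        # component label per point
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    edges = []
    while len(edges) < n - 1:
        for c in list(np.unique(comp)):
            inside = comp == c
            if not inside.any():
                continue                       # label already merged this round
            sub = D[np.ix_(inside, ~inside)]
            i, j = np.unravel_index(sub.argmin(), sub.shape)
            a = np.flatnonzero(inside)[i]
            b = np.flatnonzero(~inside)[j]
            if comp[a] != comp[b]:             # may have merged earlier this round
                edges.append((a, b, D[a, b]))
                comp[comp == comp[b]] = comp[a]
    return edges
```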
n-point correlations
• Applications: 2-sample, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
• Naïve cost: O(Nⁿ)
• Fastest algorithm: n-tree algorithms
  – Technique: use n trees simultaneously
  – [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray, to be submitted]
  – O(N^(log n)), exact
3-point runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
naïve: 5×10⁹ sec. (~150 years)
multi-tree: 55 sec. (exact)
Scaling with n: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N²)
Now fast!
(each now either very fast or, conjecturally, as fast as possible)
• Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends")
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching
The end
• You can now use many of the state-of-the-art data analysis methods on large datasets
  – MLPACK software
  – CAS Analytics in SQL Server
• Computational astronomy workshop and large-scale data analysis workshop coming soon

Alexander Gray [email protected]
(email is best; webpage sorely out of date)
Characterization of an entire distribution?
The 2-point correlation function:
"How many pairs have distance < r?"

\sum_{i}^{N} \sum_{j \neq i}^{N} I(\|x_i - x_j\| < r)
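The sum above is exactly a pair count, so any fast pair-counting routine computes it. As a library stand-in (not the FASTlab code), scipy's cKDTree.count_neighbors does the tree-pruned count:

```python
# Pair counting for the 2-point statistic. The naive double loop is O(N^2);
# cKDTree.count_neighbors computes the same count with tree pruning.
import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).random((50_000, 3))
r = 0.05

tree = cKDTree(X)
pairs = tree.count_neighbors(tree, r) - len(X)   # subtract the i == j self-pairs
print(pairs)    # ordered pairs (i, j != i) with ||x_i - x_j|| <= r
```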
The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics

2pcf definition:  dP = \lambda^2 \, dV_1 \, dV_2 \, [1 + \xi(r)]

3pcf definition:  dP = \lambda^3 \, dV_1 \, dV_2 \, dV_3 \, [1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12}, r_{23}, r_{13})]

[Figure: triangle with side lengths r_1, r_2, r_3]

Standard model: n>0 terms should be zero!
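For context (a standard construction in the field, not from these slides): in practice \xi(r) is estimated from pair counts over the data catalog D and a random catalog R, which is exactly why fast pair counting is the bottleneck. The Landy–Szalay estimator, for example:

\hat{\xi}(r) = \frac{DD(r) - 2\,DR(r) + RR(r)}{RR(r)}

where DD, DR, RR are the (suitably normalized) data–data, data–random, and random–random pair counts at separation r.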
3-point correlation
"How many triples have pairwise distances < r?"

\sum_{i}^{N} \sum_{j \neq i}^{N} \sum_{k \neq j, i}^{N} I(\delta_{ij} < r_1) \, I(\delta_{jk} < r_2) \, I(\delta_{ki} < r_3)
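Transcribed directly (my illustration), the sum is a triple loop, O(N³), and hopeless at survey scale; the n-tree recursion on the next slides is what makes it tractable.

```python
# Direct O(N^3) transcription of the triple sum; the n-tree algorithm
# below gets the same count without enumerating every triple.
import numpy as np

def naive_3pt(X, r1, r2, r3):
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n, count = len(X), 0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            for k in range(n):
                if k in (i, j):
                    continue
                count += (D[i, j] < r1) and (D[j, k] < r2) and (D[k, i] < r3)
    return count
```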
How can we count n-tuples efficiently?
"How many triples have pairwise distances < r?"
Use n trees! [Gray & Moore, NIPS 2000]
"How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?"

count{A,B,C} = ?

Split a node and recurse on its children:
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

Exclusion (some pair of nodes is entirely farther apart than r):
count{A,B,C} = 0!

Inclusion (every pair of nodes is entirely within r):
count{A,B,C} = |A| x |B| x |C|
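Putting the three rules together (a simplified sketch of my own, reusing Node/build_kdtree from the kd-tree sketch; one shared radius r instead of (r1, r2, r3), with self-pair and symmetry corrections omitted as on the slides):

```python
# Three-tree triangle counting with the slides' three rules: exclusion,
# inclusion, and split-the-largest-node.
import numpy as np

def box_min(a, b):                 # closest two boxes can be
    g = np.maximum(a.lo - b.hi, 0) + np.maximum(b.lo - a.hi, 0)
    return np.linalg.norm(g)

def box_max(a, b):                 # farthest two boxes can be
    return np.linalg.norm(np.maximum(np.abs(a.hi - b.lo), np.abs(b.hi - a.lo)))

def count3(X, A, B, C, r):
    sides = [(A, B), (B, C), (C, A)]
    if any(box_min(u, v) > r for u, v in sides):
        return 0                                       # exclusion: count = 0!
    if all(box_max(u, v) <= r for u, v in sides):
        return len(A.idx) * len(B.idx) * len(C.idx)    # inclusion: |A| x |B| x |C|
    big = max((A, B, C), key=lambda n: len(n.idx))
    if big.left is None:                               # all leaves: brute-force block
        adj = lambda U, V: (np.linalg.norm(X[U.idx][:, None] - X[V.idx][None],
                                           axis=2) < r).astype(np.int64)
        return int(np.einsum('ij,jk,ki->', adj(A, B), adj(B, C), adj(C, A)))
    halves = (big.left, big.right)
    return sum(count3(X, *(h if n is big else n for n in (A, B, C)), r)
               for h in halves)
```

Started with A, B, and C all at the root, this counts ordered triples, including degenerate ones with repeated points; subtracting those and dividing out the symmetry recovers the triple sum above. The two prunes are what remove the Nⁿ blow-up in practice.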