FAST Lab Group Meeting 4/11/06

Fast Algorithms & Data Structures
for Visualization and Machine
Learning on Massive Data Sets
Alexander Gray
Fundamental Algorithmic and Statistical Tools Laboratory (FASTlab)
Computational Science and Engineering Division
College of Computing
Georgia Institute of Technology
The FASTlab
• Arkadas Ozakin: Research scientist, PhD theoretical physics
• Dong Ryeol Lee: PhD student, CS + Math
• Ryan Riegel: PhD student, CS + Math
• Parikshit Ram: PhD student, CS + Math
• William March: PhD student, Math + CS
• James Waters: PhD student, Physics + CS
• Nadeem Syed: PhD student, CS
• Hua Ouyang: PhD student, CS
• Sooraj Bhat: PhD student, CS
• Ravi Sastry: PhD student, CS
• Long Tran: PhD student, CS
• Michael Holmes: PhD student, CS + Physics (co-supervised)
• Nikolaos Vasiloglou: PhD student, EE (co-supervised)
• Wei Guan: PhD student, CS (co-supervised)
• Nishant Mehta: PhD student, CS (co-supervised)
• Wee Chin Wong: PhD student, ChemE (co-supervised)
• Abhimanyu Aditya: MS student, CS
• Yatin Kanetkar: MS student, CS
Goal
• New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
• New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK
Plotting high-D in 2-D
Dimension reduction beyond PCA:
manifolds, embedding, etc.
Goal
• New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
• New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK
Isometric
Non-negative Matrix Factorization
• NMF maintains the interpretability of the components of data such as images, text, or spectra (SDSS)
• However, as a low-D display it is in general not faithful to the original distances
• Isometric NMF [Vasiloglou, Gray, Anderson, to be submitted SIAM DM 2008] preserves both distances and non-negativity; geometric programming formulation
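A minimal sketch of the motivation, not the Isometric NMF geometric-programming formulation itself: plain multiplicative-update NMF on hypothetical synthetic data, followed by a check of how well the resulting low-D coordinates preserve the original pairwise distances (the property IsoNMF is designed to enforce). All data sizes and names below are illustrative assumptions.

```python
# Sketch: plain NMF (Lee & Seung multiplicative updates) on synthetic
# nonnegative data, then a check of how well the low-D coordinates W
# preserve the original pairwise distances. Illustrates the motivation
# for Isometric NMF; this is NOT the geometric-programming formulation.
import numpy as np

rng = np.random.default_rng(0)
N, D, k = 200, 50, 2                              # points, ambient dim, display dim
X = rng.random((N, k)) @ rng.random((k, D))       # nonnegative, roughly low-rank data

W = rng.random((N, k))                            # low-D coordinates (rows)
H = rng.random((k, D))                            # nonnegative components
eps = 1e-9
for _ in range(500):                              # multiplicative updates for ||X - WH||_F^2
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

def pdists(A):
    sq = (A ** 2).sum(1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0))

d_hi, d_lo = pdists(X), pdists(W)
iu = np.triu_indices(N, 1)
print("distance correlation (original vs. NMF coords):",
      np.corrcoef(d_hi[iu], d_lo[iu])[0, 1])
```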
Rank-based Embedding
• Suppose you don’t really have meaningful or
reliable distances… but you can say “A and B are
farther apart than A and C”, e.g. in document
relevance
• It is still possible to make an embedding! In fact
there is some indication that using ranks is more
stable than using distances
• Can be formulated using hyperkernels; becomes
either an SDP or a QP [Ouyang and Gray, ICML
2008]
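A minimal sketch of embedding from ordinal comparisons alone, on hypothetical synthetic triplets; it uses a plain hinge-loss stochastic-gradient scheme rather than the hyperkernel SDP/QP formulation of the paper.

```python
# Sketch: embed points from ordinal comparisons only. Each triplet
# (a, b, c) encodes "a and b are farther apart than a and c".
# Generic hinge-loss SGD illustration, not the hyperkernel SDP/QP.
import numpy as np

rng = np.random.default_rng(1)
N, dim, margin, lr = 100, 2, 1.0, 0.01

# Hypothetical ground truth, used only to generate consistent triplets.
truth = rng.normal(size=(N, 5))
def farther(a, b, c):
    return (np.linalg.norm(truth[a] - truth[b]) >
            np.linalg.norm(truth[a] - truth[c]))

triplets = []
while len(triplets) < 5000:
    a, b, c = rng.choice(N, size=3, replace=False)
    triplets.append((a, b, c) if farther(a, b, c) else (a, c, b))

X = rng.normal(scale=0.1, size=(N, dim))          # embedding to be learned
for epoch in range(20):
    for a, b, c in triplets:
        d_ab, d_ac = X[a] - X[b], X[a] - X[c]
        # hinge loss: want ||X[a]-X[b]||^2 >= ||X[a]-X[c]||^2 + margin
        if d_ac @ d_ac - d_ab @ d_ab + margin > 0:
            X[a] -= lr * 2 * (d_ac - d_ab)
            X[b] -= lr * 2 * d_ab
            X[c] += lr * 2 * d_ac

viol = sum((X[a] - X[c]) @ (X[a] - X[c]) >= (X[a] - X[b]) @ (X[a] - X[b])
           for a, b, c in triplets)
print("fraction of triplets still violated:", viol / len(triplets))
```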
Density-preserving Maps
• Preserving densities is statistically more
meaningful than preserving distances
• Might allow more reliable conclusions
from the low-D display about clustering and
outliers
• DC (difference of convex functions) formulation (Ozakin and Gray, to be submitted AISTATS 2008)
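To illustrate the quantity such a map tries to conserve (not the DC formulation itself), here is a sketch that compares k-NN density estimates before and after an ordinary 2-D PCA projection, on hypothetical data; a density-preserving map would make the two estimates agree.

```python
# Sketch: compare k-NN density estimates in the original space and in a
# 2-D PCA projection. A density-preserving map would make these agree;
# this only illustrates the quantity being preserved.
import math
import numpy as np

rng = np.random.default_rng(2)
N, D, k = 300, 10, 10
X = np.vstack([rng.normal(0, 1, (N // 2, D)),     # two clusters
               rng.normal(4, 1, (N // 2, D))])

def knn_density(A, k):
    n, d = A.shape
    sq = (A ** 2).sum(1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0))
    r_k = np.sort(dist, axis=1)[:, k]             # distance to k-th neighbor
    vol = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / (n * vol)

Xc = X - X.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T                                 # 2-D PCA projection

hi, lo = np.log(knn_density(X, k)), np.log(knn_density(Y, k))
print("correlation of log-densities (high-D vs. 2-D):",
      np.corrcoef(hi, lo)[0, 1])
```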
Co-occurrence Embedding
• Consider InBio data: 3M occurrences of species in
Costa Rica
• Densities are not reliable as the sampling strategy
is unknown
• But the overlap of two species’ densities (co-occurrence) may be more reliable
• How can distribution distances be embedded
(Syed, Ozakin, Gray to be submitted ICML 2009)?
Goal
• New displays for high-dimensional data
– Isometric non-negative matrix factorization
– Rank-based embedding
– Density-preserving maps
– Co-occurrence embedding
• New algorithms for scaling them to big datasets
– Distances: Generalized Fast Multipole Method
– Dot products: Cosine Trees and QUIC-SVD
– MLPACK
Computational problem
• Such manifold methods are expensive: typically O(N³)
– But it is big datasets that are often the most
important to visually summarize
• What are the underlying computations?
– All-k-nearest-neighbors
– Kernel summations
– Eigendecomposition
– Convex optimization
Computational problem
• Such manifold methods are expensive: typically O(N³)
– But it is big datasets that are often the most
important to visually summarize
• What are the underlying computations?
– All-k-nearest-neighbors (distances)
– Kernel summations (distances)
– Eigendecomposition (dot products)
– Convex optimization (dot products)
Distances:
Generalized Fast Multipole Method
• Generalized N-body Problems (Gray and
Moore NIPS 2000; Riegel, Boyer, and Gray
TR 2008) include:
– All-k-nearest-neighbors
– Kernel summations
– Force summations in physics
– A very large number of bottleneck statistics and machine learning computations
• Defined using category theory
Distances:
Generalized Fast Multipole Method
• There exists a generalization (Gray and
Moore NIPS 2000; Riegel, Boyer, and Gray
TR 2008) of the Fast Multipole Method
(Greengard and Rokhlin 1987) which:
– specializes to each of these problems
– is the fastest practical algorithm for these
problems
– elucidates general principles for such problems
• Parallel: THOR (Tree-based Higher-Order
Reduce)
Distances:
Generalized Fast Multipole Method
• Elements of the GFMM:
– A spatial tree data structure, e.g. kd-trees,
metric trees, cover-trees, SVD trees, disk trees
– A tree expansion pattern
– Tree-stored cached statistics
– An error criterion and pruning criterion
– A local approximation/pruning scheme with
error bounds, e.g. Hermite expansions, Monte
Carlo, exact pruning
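A compact illustration of this pattern on one concrete instance, Gaussian kernel summation: a simple bounding-box tree, cached node counts, and a pruning rule with a per-reference error bound. This is a generic sketch of the dual-tree idea, not the actual GFMM machinery (series expansions, Monte Carlo pruning, THOR); all parameters and data are illustrative assumptions.

```python
# Sketch of the dual-tree pattern on one instance: Gaussian kernel summation
# f(q) = sum_r exp(-||q - r||^2 / (2 h^2)). Bounding boxes give kernel
# upper/lower bounds for a (query node, reference node) pair; if the bounds
# are tight enough, the whole pair is pruned with a midpoint approximation.
import numpy as np

class Node:
    def __init__(self, idx, pts, leaf_size=16):
        self.idx, self.lo, self.hi = idx, pts[idx].min(0), pts[idx].max(0)
        self.left = self.right = None
        if len(idx) > leaf_size:
            d = np.argmax(self.hi - self.lo)                 # split widest dimension
            order = idx[np.argsort(pts[idx, d])]
            mid = len(order) // 2
            self.left, self.right = Node(order[:mid], pts), Node(order[mid:], pts)

def box_dist(a, b):
    """Min and max distance between two axis-aligned boxes."""
    gap = np.maximum(0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.sqrt((gap ** 2).sum()), np.sqrt((span ** 2).sum())

def dual_tree(qn, rn, Q, R, sums, h, tau):
    dmin, dmax = box_dist(qn, rn)
    k_hi = np.exp(-dmin ** 2 / (2 * h ** 2))                 # kernel bounds
    k_lo = np.exp(-dmax ** 2 / (2 * h ** 2))
    if (k_hi - k_lo) / 2 <= tau:                             # prune: midpoint approx,
        sums[qn.idx] += len(rn.idx) * (k_hi + k_lo) / 2      # error <= tau per reference
        return
    if qn.left is None and rn.left is None:                  # base case: exact
        d2 = ((Q[qn.idx][:, None, :] - R[rn.idx][None, :, :]) ** 2).sum(-1)
        sums[qn.idx] += np.exp(-d2 / (2 * h ** 2)).sum(1)
        return
    for q_child in ([qn] if qn.left is None else [qn.left, qn.right]):
        for r_child in ([rn] if rn.left is None else [rn.left, rn.right]):
            dual_tree(q_child, r_child, Q, R, sums, h, tau)

rng = np.random.default_rng(3)
Q, R, h, tau = rng.random((500, 3)), rng.random((500, 3)), 0.2, 1e-4
sums = np.zeros(len(Q))
dual_tree(Node(np.arange(len(Q)), Q), Node(np.arange(len(R)), R), Q, R, sums, h, tau)

exact = np.exp(-((Q[:, None, :] - R[None, :, :]) ** 2).sum(-1) / (2 * h ** 2)).sum(1)
print("max abs error:", np.abs(sums - exact).max(), "  bound:", tau * len(R))
```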
kd-trees:
most widely-used space-partitioning tree
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
[Figure slides: a kd-tree shown level by level, levels 1–6]
Example: Generalized histogram
$\hat{f}(q) = \sum_j I(\lVert q - x_j \rVert < h)$
bandwidth h, query point q
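In code, the generalized histogram is just a range count; a naive version on hypothetical data looks like the sketch below (the tree-based recursion that follows avoids the full scan per query).

```python
# Naive generalized histogram / range count: f_hat(q) = #{ x_j : ||q - x_j|| < h }.
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((10000, 2))                        # reference points x_j
q, h = np.array([0.5, 0.5]), 0.1                  # query point, bandwidth
f_hat = np.sum(np.linalg.norm(X - q, axis=1) < h)
print(f_hat)
```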
Range-count recursive algorithm (animation over several slides; tree nodes are pruned by inclusion or by exclusion)
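A minimal kd-tree sketch of this recursion on hypothetical data: a node whose bounding box lies entirely inside the query ball is counted without descending (inclusion pruning), and a node entirely outside it is discarded (exclusion pruning).

```python
# Range-count recursion on a kd-tree: prune a node when its bounding box lies
# entirely inside the query ball (count all its points) or entirely outside it
# (count none); otherwise recurse.
import numpy as np

class KDNode:
    def __init__(self, idx, pts, leaf_size=32):
        self.idx, self.lo, self.hi = idx, pts[idx].min(0), pts[idx].max(0)
        self.left = self.right = None
        if len(idx) > leaf_size:
            d = np.argmax(self.hi - self.lo)                 # split widest dimension
            order = idx[np.argsort(pts[idx, d])]
            mid = len(order) // 2
            self.left, self.right = KDNode(order[:mid], pts), KDNode(order[mid:], pts)

def range_count(node, pts, q, h):
    dmin = np.linalg.norm(np.maximum(0, np.maximum(node.lo - q, q - node.hi)))
    dmax = np.linalg.norm(np.maximum(np.abs(node.lo - q), np.abs(node.hi - q)))
    if dmin >= h:                        # exclusion: box entirely outside the ball
        return 0
    if dmax < h:                         # inclusion: box entirely inside the ball
        return len(node.idx)
    if node.left is None:                # leaf: check points exactly
        return int(np.sum(np.linalg.norm(pts[node.idx] - q, axis=1) < h))
    return (range_count(node.left, pts, q, h) +
            range_count(node.right, pts, q, h))

rng = np.random.default_rng(5)
X = rng.random((20000, 2))
q, h = np.array([0.5, 0.5]), 0.1
root = KDNode(np.arange(len(X)), X)
print(range_count(root, X, q, h),
      int(np.sum(np.linalg.norm(X - q, axis=1) < h)))        # agrees with brute force
```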
Fastest practical algorithm [Bentley 1975]; our algorithms can use any tree.
Dot products:
Cosine Trees and QUIC-SVD
QUIC-SVD (Holmes, Gray, Isbell NIPS 2008)
• Cosine Trees: Trees for dot products
• Use Monte Carlo within cosine trees to achieve best-rank approximation with user-specified relative error
• Very fast, but with probabilistic bounds
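QUIC-SVD itself samples within a cosine tree; the sketch below is only a generic randomized low-rank SVD on hypothetical data, shown to give the flavor of sampling-based SVD with a relative-error check, not the algorithm from the paper.

```python
# Sketch: a generic randomized low-rank SVD with a relative-error check.
# QUIC-SVD uses Monte Carlo sampling guided by a cosine tree instead of a
# Gaussian sketch; this only illustrates the sampling-based-SVD idea.
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(2000, 60)) @ rng.normal(size=(60, 300))  # roughly low-rank matrix
rank, eps = 60, 1e-2                                          # target rank, relative error

Omega = rng.normal(size=(A.shape[1], rank + 10))   # random sketch (with oversampling)
Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for the range
U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
U = Q @ U_small

A_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print("relative Frobenius error:", rel_err, "  target:", eps)
```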
Dot products:
Cosine Trees and QUIC-SVD
Uses of QUIC-SVD:
• PCA, KPCA eigendecomposition
• Working on: fast interior-point convex
optimization
Bigger goal: make all the best
statistical/learning methods efficient!
• Ground rules:
– Asymptotic speedup as well as practical speedup
– Arbitrarily high accuracy with error guarantees
– No manual tweaking
– Really works (validated in a big real-world problem)
• Treating entire classes of methods
– Methods based on distances (generalized N-body problems)
– Methods based on dot products (linear algebra)
– Soon: Methods based on discrete structures (combinatorial/graph
problems)
• Watch for MLPACK, coming Dec. 2008
– Meant to be the equivalent of linear algebra’s LAPACK
So far: fastest algs for…
• 2000 all-nearest-neighbors (1970)
• 2000 n-point correlation functions (1950)
• 2003, 05, 06 kernel density estimation (1953)
• 2004 nearest-neighbor classification (1965)
• 2005, 06, 08 nonparametric Bayes classifier (1951)
• 2006 mean-shift clustering/tracking (1972)
• 2006 k-means clustering (1960s)
• 2007 hierarchical clustering/EMST (1960s)
• 2007 affinity propagation/clustering (2007)
So far: fastest algs for…
• 2008 principal component analysis* (1930s)
• 2008 local linear kernel regression (1960s)
• 2008 hidden Markov models* (1970s)
• Working on:
– linear regression, Kalman filters (1960s)
– Gaussian process regression (1960s)
– Gaussian graphical models (1970s)
– Manifolds, spectral clustering (2000s)
– Convex kernel machines (2000s)
Some application highlights so far…
• First large-scale dark energy confirmation (top Science breakthrough of 2003)
• First large-scale cosmic magnification confirmation
(Nature, 2005)
• Integration into Google image search (we think),
2005
• Integration into Microsoft SQL Server, 2008
• Working on:
– Integration into Large Hadron Collider pipeline, 2008
– Fast IP-based spam filtering (Secure Comp, 2008)
– Fast recommendation (Netflix)
To find out more:
• Best way: [email protected]
• Mostly outdated:
www.cc.gatech.edu/~agray