CS685 : Special Topics in Data Mining, UKY


Dimensionality Reduction
Part 2: Nonlinear Methods
Comp 790-090
Spring 2007
The University of Kentucky
Why Dimensionality Reduction
• Two approaches to reduce the number of features
  – Feature selection: select the salient features by some criteria
  – Feature extraction: obtain a reduced set of features by a transformation of all features
• Data visualization and exploratory data analysis also need to reduce dimension
  – Usually reduce to 2D or 3D
Deficiencies of Linear Methods
• Data may not be best summarized by a linear combination of features
  – Example: PCA cannot discover the 1D structure of a helix
[Figure: 3D plot of a helix; the curve is intrinsically one-dimensional, but no linear projection recovers that structure]
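To make the deficiency concrete, here is a minimal sketch (mine, not from the slides) that samples a helix with NumPy and fits scikit-learn's PCA; the variance spreads across all three principal axes, so no one-dimensional linear projection summarizes the curve:

import numpy as np
from sklearn.decomposition import PCA

# Sample a 3D helix: intrinsically 1D, parameterized by t.
t = np.linspace(0, 8 * np.pi, 1000)
X = np.column_stack([np.cos(t), np.sin(t), 0.05 * t])

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)
# All three ratios are well away from zero: the helix's variance
# cannot be captured by a single linear direction.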
Intuition: how does your brain
store these pictures?
Brain Representation
• Every pixel?
• Or perceptually meaningful structure?
  – Up-down pose
  – Left-right pose
  – Lighting direction
So your brain successfully reduced the high-dimensional inputs to an intrinsically 3-dimensional manifold!
Manifold Learning
• Discover low-dimensional representations (smooth manifolds) for data in high dimension.
  – Latent space Y: y_i ∈ R^d; observed space X: x_i ∈ R^N
• Linear approaches (PCA, MDS)
• Non-linear approaches (Isomap, LLE, others)
Linear Approach - PCA
• PCA finds linear projections of the input data onto a subspace.
Linear Methods
• Linear methods for dimensionality reduction
  – PCA: rotate data so that the principal axes lie in directions of maximum variance
  – MDS: find coordinates that best preserve pairwise distances
Motivation
• Linear dimensionality reduction doesn't always work
• Data violates the underlying "linear" assumptions
  – Data is not accurately modeled by "affine" combinations of measurements
  – Structure of the data, while apparent, is not simple
  – In the end, linear methods do nothing more than "globally transform" (rotate, translate, and scale) all of the data; sometimes what's needed is to "unwrap" the data first
What does PCA Really Model?
• Principal Component Analysis assumptions
• Mean-centered distribution
  – What if the mean itself is atypical?
• Eigenvectors of covariance
  – Basis vectors aligned with successive directions of greatest variance
• Classic 1st-order statistical model
  – Distribution is characterized by its mean and variance (Gaussian hyperspheres)
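For concreteness, these assumptions can be read directly off a from-scratch PCA (a minimal NumPy sketch, not from the slides): mean-center, form the covariance, and take its eigenvectors.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # toy data; rows are samples

Xc = X - X.mean(axis=0)                # assume a mean-centered distribution
C = np.cov(Xc, rowvar=False)           # covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order

axes = eigvecs[:, ::-1]                # basis vectors by decreasing variance
Y = Xc @ axes[:, :2]                   # project onto the top two axes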
Nonlinear Approaches - Isomap
Josh Tenenbaum, Vin de Silva, John Langford, 2000
• Construct the neighbourhood graph G.
• For each pair of points in G, compute shortest-path distances (geodesic distances).
• Apply classical MDS to the geodesic distances.
[Figure: Euclidean distance vs. geodesic distance]
Sample points with Swiss Roll
• Altogether there are 20,000 points in the "Swiss roll" data set. We sample 1,000 of the 20,000.
Construct neighborhood graph G
K-nearest neighborhood (K = 7)
D_G is a 1000 × 1000 matrix holding the Euclidean distances between neighboring points (figure A)
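A hedged sketch of this step with scikit-learn (make_swiss_roll and kneighbors_graph are that library's functions; the sample size and K follow the slides):

from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Sparse 1000 x 1000 matrix: entry (i, j) holds the Euclidean distance
# only when j is among the K = 7 nearest neighbors of i.
G = kneighbors_graph(X, n_neighbors=7, mode='distance')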
Compute all-pairs shortest paths in G
Now D_G is a 1000 × 1000 geodesic distance matrix between arbitrary pairs of points along the manifold (figure B)
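Continuing the sketch, SciPy's graph routines approximate the geodesic distances (assuming the sparse graph G built above; method='D' selects Dijkstra):

from scipy.sparse.csgraph import shortest_path

# Dense 1000 x 1000 matrix of shortest-path (approximate geodesic)
# distances between every pair of sampled points.
D_G = shortest_path(G, method='D', directed=False)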
Use MDS to embed the graph in R^d
Find a d-dimensional Euclidean space Y (figure C) that preserves the pairwise distances.
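Classical MDS on a distance matrix reduces to an eigendecomposition of the double-centered squared distances; a minimal NumPy sketch (D_G as above, target dimension d = 2):

import numpy as np

n = D_G.shape[0]
J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
B = -0.5 * J @ (D_G ** 2) @ J                 # double-centered Gram matrix

eigvals, eigvecs = np.linalg.eigh(B)          # ascending order
top = eigvals.argsort()[::-1][:2]             # indices of the 2 largest eigenvalues
Y = eigvecs[:, top] * np.sqrt(eigvals[top])   # 2D embedding coordinates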
Isomap
• Key observation: on a manifold, distances are measured using geodesic distances rather than Euclidean distances.
[Figure: two points with a small Euclidean distance but a large geodesic distance along the manifold]
Problem: How to Get Geodesics
• Without knowledge of the manifold it is difficult to compute the geodesic distance between points
• It is difficult even if you know the manifold
• Solution
  – Use a discrete geodesic approximation
  – Apply a graph algorithm to approximate the geodesic distances
Dijkstra’s Algorithm
• Efficient solution to the all-pairs shortest-path problem
• Greedy breadth-first algorithm
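A minimal single-source Dijkstra in Python's standard library (run it from every vertex to get all pairs; the adjacency-list format is an assumption, not from the slides):

import heapq

def dijkstra(adj, source):
    """adj maps u -> [(v, weight), ...]; returns shortest distances from source."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                          # stale queue entry, skip
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd                  # found a shorter path to v
                heapq.heappush(heap, (nd, v))
    return dist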
Isomap algorithm
– Compute a fully-connected neighborhood of points for each item
  • Can be k nearest neighbors or ε-ball
  • Neighborhoods must be symmetric
  • Test that the resulting graph is fully connected; if not, increase either K or ε
– Calculate pairwise Euclidean distances within each neighborhood
– Use Dijkstra's algorithm to compute shortest paths from each point to non-neighboring points
– Run MDS on the resulting distance matrix (an end-to-end sketch follows below)
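The whole pipeline is also available off the shelf; a short sketch with scikit-learn's Isomap, whose n_neighbors and n_components correspond to K and d here:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Neighborhood graph + shortest paths + MDS, in one estimator.
Y = Isomap(n_neighbors=7, n_components=2).fit_transform(X)
print(Y.shape)   # (1000, 2)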
Isomap Results
• Find a 2D embedding of the 3D S-curve (also shown for LLE)
• Isomap does a good job of preserving metric structure (not surprising)
• The affine structure is also well preserved
Residual Fitting Error
Neighborhood Graph
More Isomap Results
Isomap Failures
• Isomap also has problems on closed manifolds of arbitrary topology
Local Linear Embeddings
• First Insight
  – Locally, at a fine enough scale, everything looks linear
Local Linear Embeddings
• First Insight
  – Find an affine combination of the "neighborhood" about a point that best approximates it
Finding a Good Neighborhood
• This is the remaining "Art" aspect of nonlinear methods
• Common choices
• ε-ball: find all items that lie within an epsilon ball of the target item as measured under some metric
  – Best if the density of items is high and every point has a sufficient number of neighbors
• K-nearest neighbors: find the k closest neighbors to a point under some metric
  – Guarantees all items are similarly represented (see the comparative sketch below)
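Both rules have direct scikit-learn counterparts; a small comparative sketch (the radius value is an illustrative assumption):

from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# ε-ball: neighbor counts vary with local density (may leave points isolated).
G_eps = radius_neighbors_graph(X, radius=2.0, mode='distance')

# K-nearest neighbors: every point gets exactly K neighbors.
G_knn = kneighbors_graph(X, n_neighbors=7, mode='distance')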
Characteristics of a Manifold
[Figure: a manifold M embedded in R^n; a point z on M has a local coordinate x = (x1, x2) in R^2]
Locally it is a linear patch.
Key: how to combine all the local patches together?
LLE: Intuition
• Assumption: the manifold is approximately "linear" when viewed locally, that is, in a small neighborhood
  – The approximation error, ε(W), can be made small
• The local neighborhood is enforced by the constraint W_ij = 0 if z_i is not a neighbor of z_j
• A good projection should preserve this local geometric property as much as possible
LLE: Intuition
We expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
Each point can be written as a linear combination of its neighbors.
The weights are chosen to minimize the reconstruction error.
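Written out (a standard statement of the LLE reconstruction objective; the sum-to-one and neighborhood constraints appear on the following slides):

\min_{W}\; \varepsilon(W) \;=\; \sum_{i} \Bigl\lVert x_i - \sum_{j} W_{ij}\, x_j \Bigr\rVert^{2},
\qquad \text{s.t. } \sum_{j} W_{ij} = 1,\quad W_{ij} = 0 \text{ if } x_j \text{ is not a neighbor of } x_i.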
LLE: Intuition
• The weights that minimize the reconstruction errors are invariant to rotation, rescaling, and translation of the data points.
  – Invariance to translation is enforced by adding the constraint that the weights sum to one.
  – The weights characterize the intrinsic geometric properties of each neighborhood.
• The same weights that reconstruct the data points in D dimensions should reconstruct them on the manifold in d dimensions.
  – Local geometry is preserved
LLE: Intuition
Low-dimensional embedding: minimize Φ(Y) = Σ_i || y_i − Σ_j W_ij y_j ||², where the W_ij form the i-th row of W.
Use the same weights from the original space.
Local Linear Embedding (LLE)
• Assumption: the manifold is approximately "linear" when viewed locally, that is, in a small neighborhood
• The approximation error, ε(W), can be made small
• Meaning of W: a linear representation of every data point by its neighbors
  – This is an intrinsic geometrical property of the manifold
• A good projection should preserve this geometric property as much as possible
Constrained Least Squares problem
Compute the optimal weights for each point individually:
  min_w || x − Σ_j w_j x_j ||²  subject to  Σ_j w_j = 1
where the sum runs over the neighbors of x, and w_j = 0 for all non-neighbors of x.
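A hedged NumPy sketch of one point's weights (the Lagrange-multiplier solution of this constrained least-squares problem reduces to solving a local Gram system and normalizing; the regularization term anticipates the numerical issues discussed later):

import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """x: (D,) point; neighbors: (K, D) array of its K nearest neighbors."""
    Z = neighbors - x                        # shift the point to the origin
    C = Z @ Z.T                              # K x K local Gram matrix
    C += reg * np.trace(C) * np.eye(len(C))  # regularize if ill-conditioned
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                       # enforce the sum-to-one constraint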
Finding a Map to a Lower Dimensional Space
• Y_i in R^k: projected vector for X_i
• The geometrical property is best preserved if the error below is small:
  Φ(Y) = Σ_i || Y_i − Σ_j W_ij Y_j ||²   (use the same weights computed above)
• Y is given by the eigenvectors of the lowest d non-zero eigenvalues of the matrix A = (I − W)ᵀ(I − W)
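A small sketch of that eigenproblem (assuming a dense weight matrix W whose rows sum to one; the bottom eigenvector is the trivial constant vector with eigenvalue ~0 and is discarded):

import numpy as np

def lle_embed(W, d=2):
    n = W.shape[0]
    I = np.eye(n)
    A = (I - W).T @ (I - W)                # matrix whose bottom eigenvectors we need
    eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues
    return eigvecs[:, 1:d + 1]             # skip the constant vector; take next d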
Numerical Issues
• Numerical problems can arise in computing LLEs
• The least-squares covariance matrix that arises in computing the weight matrix W can be ill-conditioned
  – Regularization: rescale the measurements by adding a small multiple of the identity to the covariance matrix (as in the weight sketch above)
• Finding small singular (eigen)values is not as well conditioned as finding large ones; the small ones are subject to numerical precision errors and tend to get mixed
  – Good (but slow) solvers exist; you have to use them
Results
• The resulting parameter vector, y_i, gives the coordinates associated with the item x_i
• The d-th embedding coordinate is formed from the orthogonal vector associated with the d-th smallest non-zero singular value of A.
Reprojection
• Often, for data analysis, a parameterization is enough
• For interpolation and compression we might want to map points from the parameter space back to the "original" space
• No perfect solution, but a few approximations
  – Delaunay-triangulate the points in the embedding space, find the triangle that the desired parameter setting falls into, compute its barycentric coordinates, and use them as weights (sketched below)
  – Interpolate by using a radially symmetric kernel centered about the desired parameter setting
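A hedged SciPy sketch of the first approximation (names are illustrative; Y holds the 2D embedding coordinates, X the original high-dimensional points, and q is assumed to lie inside the convex hull of Y):

import numpy as np
from scipy.spatial import Delaunay

def reproject(q, Y, X):
    """Map a query point q in the embedding back toward the original space."""
    tri = Delaunay(Y)                  # triangulate the embedded points
    s = tri.find_simplex(q)            # triangle containing q (-1 if outside)
    verts = tri.simplices[s]           # indices of the triangle's vertices
    T = tri.transform[s]               # affine map to barycentric coordinates
    b = T[:2] @ (q - T[2])
    bary = np.append(b, 1 - b.sum())   # full barycentric weight vector
    return bary @ X[verts]             # weighted combination of original points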
LLE Example
• 3-D S-curve manifold with points color-coded
• Compute a 2-D embedding
  – The local affine structure is well maintained
  – The metric structure is okay locally, but can drift slowly over the domain (this causes the manifold to taper)
More LLE Examples
LLE Failures
• Does not work on closed manifolds
• Cannot recognize topology
Summary
• Non-Linear Dimensionality Reduction Methods
  – These methods are considerably more powerful, and more temperamental, than linear methods
  – Applications of these methods are a hot area of research
• Comparisons
  – LLE is generally faster, but more brittle, than Isomap
  – Isomap tends to work better on smaller data sets