Dimensionality Reduction
CS 685: Special Topics in Data Mining
Spring 2009
Jinze Liu
The University of Kentucky
Overview
• What is Dimensionality Reduction?
– Simplifying complex data
– Using dimensionality reduction as a Data Mining
“tool”
– Useful for both “data modeling” and “data analysis”
– Tool for “clustering” and “regression”
• Linear Dimensionality Reduction Methods
– Principal Component Analysis (PCA)
– Multi-Dimensional Scaling (MDS)
• Non-Linear Dimensionality Reduction
What is Dimensionality Reduction?
• Given N objects, each with M measurements, find
the best D-dimensional parameterization
• Goal: Find a “compact parameterization” or
“Latent Variable” representation
Given $N$ examples $x_i \in \mathbb{R}^M$, find $\tilde{x} \approx f(p)$, where $p \in \mathbb{R}^D$ and $D < M$.
• Underlying assumptions of DimRedux
– Measurements over-specify the data, M > D
– The number of measurements exceeds the number of
“true” degrees of freedom in the system
– The measurements capture all of the significant variability
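As a minimal illustration (a sketch, not from the slides; the map A and the noise level are assumed), the following MATLAB snippet generates data with M = 3 measurements per object that are driven by a single latent parameter, i.e. D = 1:

% Sketch: M = 3 measurements per object, but only D = 1 true degree of freedom.
p = rand(1, 500);                      % latent variable, one value per object
A = [2; -1; 0.5];                      % assumed linear map from p to measurements
X = A * p + 0.01 * randn(3, 500);      % x ~ f(p) plus a little measurement noise
% A good DimRedux method should recover a 1-D parameterization close to p.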
Uses for DimRedux
• Build a “compact” model of the data
– Compression for storage, transmission, & retrieval
– Parameters for indexing, exploring, and organizing
– Generate “plausible” new data
• Answer fundamental questions about data
– What is its underlying dimensionality?
How many degrees of freedom are exhibited?
How many “latent variables”?
– How independent are my measurements?
– Is there a projection of my data set where
important relationships stand out?
DimRedux in Data Modeling
• Data Clustering - Continuous to
Discrete
– The curse of dimensionality: the
sampling density is proportional to $N^{1/p}$.
– Need a mapping to a lower-dimensional
space that preserves “important”
relations
• Regression Modeling –
Continuous to Continuous
– A functional model that
generates input data
– Useful for interpolation
• Embedding Space
Today’s Focus
• Linear DimRedux methods
– PCA – Pearson (1901); Hotelling (1935)
– MDS – Torgerson (1952), Shepard (1962)
• “Linear” Assumption
$\tilde{x} \approx Mp$, where the matrix $M$ is $m \times d$
– Data is a linear function of the parameters (latent variables)
$x(\alpha) = \alpha x_1 + (1-\alpha)x_2$  (“line-like”)
$x(\alpha,\beta) = \alpha x_1 + \beta x_2 + (1-\alpha-\beta)x_3$  (“plane-like”)
– Data lies on a linear (Affine) subspace
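A quick sketch of the “plane-like” case (the anchor points x1, x2, x3 below are arbitrary choices, not from the slides): every sample is an affine combination of three anchors, so the data lies on a 2-D affine subspace of 3-D space.

% Sketch: "plane-like" data built from affine combinations of three anchors.
x1 = [0; 0; 1]; x2 = [1; 0; 0]; x3 = [0; 1; 0];   % arbitrary anchor points
a = rand(1, 200); b = rand(1, 200) .* (1 - a);    % weights with a + b <= 1
X = x1 * a + x2 * b + x3 * (1 - a - b);           % every column lies on the plane
scatter3(X(1,:), X(2,:), X(3,:), 'filled');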
PCA: What problem does it solve?
• Minimizes “least-squares” (Euclidean) error
– The D-dimensional model provided by PCA has the
smallest Euclidean error of any D-parameter linear
model.
$$\min \sum_{i=1}^{n} \left( x_i - \tilde{x}_i \right)^2$$
where $\tilde{x}_i$ is the model predicted by the D-dimensional PCA.
• Projects data s.t. the variance is maximized
• Find an optimal “orthogonal” basis set for
describing the given data
Principal Component Analysis
• Also known to engineers as the Karhunen-Loève
Transform (KLT)
• Rotate data points to align successive axes with
directions of greatest variance
– Subtract the mean from the data
– Normalize variance along each direction, and reorder
according to the variance magnitude from high to low
– Normalized variance direction = principal component
• Eigenvectors of system’s Covariance Matrix
$$C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T$$
$$(C - \lambda_i I)\, e_i = 0$$
permute to order eigenvectors in descending order
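A minimal MATLAB sketch of these steps (variable names are assumed; X holds one object per column), including the reordering step mentioned above:

% PCA sketch: center, form the covariance matrix, eigendecompose, sort.
n  = size(X, 2);
Xc = X - mean(X, 2) * ones(1, n);          % subtract the mean
C  = (Xc * Xc') / (n - 1);                 % covariance matrix, same as cov(Xc')
[E, L] = eig(C);                           % columns of E are eigenvectors
[lam, idx] = sort(diag(L), 'descend');     % order by variance, high to low
E = E(:, idx);                             % principal components
P = E' * Xc;                               % data expressed in the new basis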
Simple PCA Example
• Simple 3D example
>> x = rand(2, 500);                                     % 500 samples of a 2-D latent variable
>> z = [1,0; 0,1; -1,-1] * x + [0;0;1] * ones(1, 500);   % embed the samples in 3-D (they lie on a plane)
>> m = (100 * rand(3,3)) * z + rand(3, 500);             % apply a random linear map and add noise
>> scatter3(m(1,:), m(2,:), m(3,:), 'filled');           % visualize the 3-D point cloud
Simple PCA Example (cont)
>> mm = (m - mean(m')' * ones(1, 500));    % subtract the mean from each column
>> [E,L] = eig(cov(mm'));                  % eigenvectors/eigenvalues of the covariance
>> E
E =
    0.8029   -0.5958    0.0212
    0.1629    0.2535    0.9535
    0.5735    0.7621   -0.3006
>> L
L =
  172.2525         0         0
         0  116.2234         0
         0         0    0.0837
>> newm = E' * (m - mean(m')' * ones(1, 500));   % rotate the data into the eigenbasis
>> scatter3(newm(1,:), newm(2,:), newm(3,:), 'filled');
>> axis([-50,50, -50,50, -50,50]);
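Continuing the example, a short sketch (assuming the columns of E are ordered so that the two largest-variance directions come first) shows that reconstructing from only two components loses almost nothing, since the discarded eigenvalue is ~0.08:

>> E2 = E(:, 1:2);                                        % keep the top two components
>> recon = E2 * (E2' * mm) + mean(m')' * ones(1, 500);    % project, then reconstruct
>> err = sum(sum((m - recon).^2)) / 499                   % total squared error / (n-1)
% err is approximately the discarded eigenvalue (~0.0837)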
Simple PCA Example (cont)
PCA Applied to Reillumination
• Illumination can be modeled as an additive
linear system.
$R_{xy}(w_i)$
Simulating New Lighting
• We can simulate the appearance of a model under new illumination by
combining images taken from a set of basis lights
• We can then capture real-world lighting and use it to modulate our basis
lighting functions
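A sketch of this idea in MATLAB (the array basis, holding one image per basis light, and the weights w are assumed names, not from the slides):

% Relighting sketch: the new image is a weighted sum of the basis-light images.
% basis(:,:,i) = image of the scene lit only by basis light i (assumed array)
% w(i)         = strength of basis light i in the captured real-world lighting
newImage = zeros(size(basis, 1), size(basis, 2));
for i = 1:size(basis, 3)
    newImage = newImage + w(i) * basis(:, :, i);   % illumination adds linearly
end
imagesc(newImage); axis image; colormap gray;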
Problems
• There are too many basis lighting functions
– These have to be stored in order to use them
– The resulting lighting model can be huge, in particular
when representing high frequency lighting
– Lighting differences can be very subtle
• The cost of modulation is excessive
– Every basis image must be scaled and added together
– Each image requires a high-dynamic range
• Is there a more compact representation?
– Yes, use PCA.
PCA Applied to Illumination
• More than 90% of the variance is captured in the first
five principal components
• Generate new illumination by combining only 5 basis
images
Results Video
MDS: What problem does it solve?
• Takes as input a dissimilarity matrix M,
containing pairwise dissimilarities between
N-dimensional data points
• Finds the best D-dimensional linear
parameterization compatible with M
• (in other words, outputs a projection of
data in D-dimensional space where the
pairwise distances match the original
dissimilarities as faithfully as possible)
Multidimensional Scaling (MDS)
• Dissimilarities can be metric or non-metric
• Useful when absolute measurements are
unavailable; uses relative measurements
• Computation is invariant to dimensionality of
data
An example: map of the US
• Given only the distances between a set of cities
An example: map of the US
• MDS finds suitable coordinates for the points in a
space of the specified dimension.
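A sketch of this (using synthetic “city” coordinates rather than real mileage data, and assuming the MATLAB Statistics Toolbox for pdist and cmdscale): only the pairwise distances are handed to MDS, and it recovers the map up to rotation, reflection, and translation.

% MDS sketch: recover a 2-D layout from pairwise distances alone.
truePos = rand(10, 2) * 1000;            % hypothetical city locations
D = squareform(pdist(truePos));          % pairwise distance matrix -- the only input
Y = cmdscale(D);                         % classical (metric) MDS
scatter(Y(:,1), Y(:,2), 'filled');       % recovered map, correct up to a rigid motion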
MDS Properties
• Parameterization is not unique
– Axes are meaningless
– Not surprising, since Euclidean transformations and
reflections preserve distances between points
• Useful for visualizing relationships in high
dimensional data.
– Define a dissimilarity measure
– Map to a lower-dimensional space using MDS
• A common preprocessing step before cluster analysis
– Aids in understanding patterns and relationships in
data
• Widely used in marketing and psychometrics
Dissimilarities
• Dissimilarities are distance-like quantities
that satisfy the following conditions:
1) $\delta_{ij} \ge 0$
2) $\delta_{ii} = 0$  (self-similarity)
3) $\delta_{ij} = \delta_{ji}$  (symmetry)
• A dissimilarity is metric if, in addition, it
satisfies:
$\delta_{ij} \le \delta_{ik} + \delta_{kj}$ for all $k$
“The triangle inequality”
Relating MDS to PCA
• Special case: when distances are Euclidean
• PCA = eigendecomposition of the covariance
matrix $M^TM$
• Convert the pair-wise distance matrix to
the covariance matrix
How to get $M^TM$ from Euclidean Pairwise Distances
[Figure: a triangle with vertices $i$, $j$, $k$ and side lengths $d_{ij}$, $d_{ki}$, $d_{kj}$; the law of cosines and the definition of the dot product relate squared pairwise distances to dot products]
• Eigendecomposition of $B$ gives $V S V^T$
• $V S^{1/2}$ = matrix of new coordinates
Algebraically…
$$B_{ij} = -\frac{1}{2}\left( D_{ij}^2 \;-\; \frac{1}{n}\sum_{k=1}^{n} D_{ik}^2 \;-\; \frac{1}{n}\sum_{k=1}^{n} D_{kj}^2 \;+\; \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} D_{kl}^2 \right)$$
– $D_{ij}^2$: the squared distance between points $p_i$ and $p_j$
– The *row average*: the average squared distance from $p_i$ to the other points
– The *column average*: the average squared distance from $p_j$ to the other points
– The *matrix average*: the average over all pairs of squared distances
So we have “centered” the matrix.
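A quick numeric check of this identity (a sketch; D is any Euclidean distance matrix, built here from random points, and pdist/squareform assume the Statistics Toolbox): the element-wise double centering above agrees with the matrix form $-\tfrac{1}{2}HD^2H$ used on the next slide.

% Check: element-wise double centering of squared distances equals -1/2 * H * D.^2 * H
D  = squareform(pdist(rand(8, 3)));            % any Euclidean distance matrix
n  = size(D, 1);   D2 = D.^2;
H  = eye(n) - ones(n) / n;                     % centering matrix
B1 = -0.5 * H * D2 * H;
rowm = mean(D2, 2); colm = mean(D2, 1); allm = mean(D2(:));
B2 = -0.5 * (D2 - rowm * ones(1, n) - ones(n, 1) * colm + allm);
max(abs(B1(:) - B2(:)))                        % ~1e-15: the two forms agree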
MDS Mechanics
• Given a dissimilarity matrix $D$, the MDS model is
computed as follows:
$$B = -\tfrac{1}{2}\, H D^2 H$$
$$(B - \lambda_i I)\, e_i = 0$$
• where $H$, the so-called “centering” matrix, is
$$H = I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^T$$
• MDS coordinates are given by (in order of decreasing
eigenvalue $\lambda_i$):
$$p_i = \left(\sqrt{\lambda_1}\, e_{1,i},\ \sqrt{\lambda_2}\, e_{2,i},\ \dots\right)$$
MDS Stress
• The residual variance of B (i.e., the sum of the remaining
eigenvalues) indicates the goodness of fit of the selected
d-dimensional model
• This term is often called MDS “stress”
• Examining the residual variance gives an indication of the
inherent dimensionality
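Continuing the previous sketch (lam holds the eigenvalues of B in decreasing order), the residual variance for each candidate dimension can be plotted to look for the inherent dimensionality:

% Residual variance ("stress") of a d-dimensional model, for d = 1..n.
residual = sum(lam) - cumsum(lam);         % eigenvalues left out of the model
plot(1:numel(lam), residual, '-o');
xlabel('model dimension d'); ylabel('residual variance');
% A sharp drop followed by a flat tail suggests the inherent dimensionality.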
Reflectance Modeling Example
The top row of white, grey, and black balls has the same
“physical” reflectance parameters; the bottom row, however, is
“perceptually” more consistent.
• From Pellacini et al., “Toward a Psychophysically-Based
Light Reflection Model for Image Synthesis,” SIGGRAPH
2000
• Objective – Find a perceptually meaningful
parameterization for reflectance modeling
Reflectance Modeling Example
• User Task – Subjects were presented with 378 pairs of
rendered spheres and asked to rate their difference in
“glossiness” on a scale of 0 (no difference) to 100.
• A 27 × 27 dissimilarity matrix was constructed and
MDS was applied
Reflectance Modeling Example
• Parameters of a 2-D embedding space were determined
• Two axes of “gloss” were established
Limitations of Linear methods
• What if the data does not lie
within a linear subspace?
• Do all convex combinations of
the measurements generate
plausible data?
• Low-dimensional non-linear manifold embedded in a
higher-dimensional space
• Next time: Nonlinear
Dimensionality Reduction
Summary
• Linear dimensionality reduction tools are widely
used for
– Data analysis
– Data preprocessing
– Data compression
• PCA transforms the measurement data s.t.
successive directions of greatest variance are
mapped to orthogonal axis directions (bases)
– A d-dimensional embedding space
(parameterization) can be established by modeling the
data using only the first d of these basis vectors
– Residual modeling error is the sum of the remaining
eigenvalues
Summary (cont)
• MDS finds a d-dimensional parameterization that
best preserves a given dissimilarity matrix
– Resulting model can be Euclidean transformed to align
data with a more intuitive parameterization
– A d-dimensional embedding space
(parameterization) is established by modeling the
data using only the first d coordinates of the scaled
eigenvectors
– Residual modeling error (MDS stress) is the sum of the
remaining eigenvalues
– If a Euclidean (metric) dissimilarity matrix is used for MDS,
the resulting d-dimensional model will match the PCA
weights for the same-dimensional model
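A sketch checking that last point (assuming the Statistics Toolbox for pdist and cmdscale): classical MDS on Euclidean distances reproduces the PCA scores up to a sign flip of each axis.

% PCA scores vs. classical MDS on Euclidean distances -- they match up to sign.
X  = randn(100, 5);                          % 100 objects, 5 measurements
Xc = X - ones(100, 1) * mean(X, 1);          % centered data, one object per row
[U, S, V] = svd(Xc, 'econ');                 % PCA via the SVD
pcaScores = Xc * V;                          % PCA coordinates
mdsCoords = cmdscale(squareform(pdist(X)));  % classical MDS coordinates
max(max(abs(abs(pcaScores) - abs(mdsCoords))))   % ~0 (axes may differ only in sign)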