Principal Component Analysis
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
– The criterion for feature reduction can differ depending on the problem setting.
• Unsupervised setting: minimize the information loss
• Supervised setting: maximize the class discrimination
Given a set of data points of p variables
$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$
compute the linear transformation (projection)
$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d \quad (d \ll p).$
What is feature reduction?
Linear transformation from the original data $X \in \mathbb{R}^{p \times n}$ to the reduced data $Y \in \mathbb{R}^{d \times n}$:
$G \in \mathbb{R}^{p \times d}: \quad X \mapsto Y = G^T X \in \mathbb{R}^{d \times n}$
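As a small illustration of this mapping, here is a minimal NumPy sketch; the matrix G below is just a random orthonormal basis standing in for the one PCA would compute.

import numpy as np

p, d, n = 100, 3, 50            # original dimension, reduced dimension, number of points
X = np.random.randn(p, n)       # data matrix, one observation per column

# A hypothetical orthonormal G (p x d); PCA would pick its columns from
# eigenvectors of the covariance matrix, as derived later in the lecture.
G, _ = np.linalg.qr(np.random.randn(p, d))

Y = G.T @ X                     # reduced data
print(Y.shape)                  # (3, 50)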
High-dimensional data
• Gene expression
• Face images
• Handwritten digits
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
• Supervised
– Linear Discriminant Analysis (LDA)
• Semi-supervised
– Research topic
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
What is Principal Component Analysis?
• Principal component analysis (PCA)
– Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
– Retains most of the sample's information.
– Useful for the compression and classification of data.
• By information we mean the variation present in the sample, given by the correlations between the original variables.
– The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Geometric picture of principal components (PCs)
• The 1st PC $z_1$ is a minimum distance fit to a line in the $X$ space.
• The 2nd PC $z_2$ is a minimum distance fit to a line in the plane perpendicular to the 1st PC.
PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.
Algebraic definition of PCs
Given a sample of n observations on a vector of p variables
$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$
define the first principal component of the sample by the linear transformation
$z_{1j} = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$
where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$, with $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$, is chosen such that $\mathrm{var}[z_1]$ is maximum.
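As a toy illustration of this definition (a sketch; the sample and the unit vector a1 below are made up, and a1 is not yet the optimal direction), each score is just the dot product $a_1^T x_j$:

import numpy as np

# Toy sample: n = 5 observations of p = 2 variables, stored as columns.
X = np.array([[2.0, 0.5, -1.0, 3.0, -2.5],
              [1.0, 0.0, -0.5, 2.0, -1.5]])

a1 = np.array([0.8, 0.6])      # some unit vector (a1^T a1 = 1), not yet the optimal one
z1 = a1 @ X                    # scores z_{1j} = sum_i a_{i1} x_{ij}, one per observation
print(z1, z1.var())            # PCA chooses a1 to make this variance as large as possible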
Algebraic derivation of PCs
To find $a_1$, first note that
$\mathrm{var}[z_1] = E\left[(z_1 - \bar{z}_1)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\left(a_1^T x_i - a_1^T \bar{x}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1 = a_1^T S a_1,$
where
$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$
is the covariance matrix and
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
is the mean.
In the following, we assume the data is centered, i.e., $\bar{x} = 0$.
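A quick numerical check of the identity $\mathrm{var}[z_1] = a_1^T S a_1$ (a sketch on random data; a1 is an arbitrary unit vector):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 200))            # p = 4 variables, n = 200 observations (columns)
Xc = X - X.mean(axis=1, keepdims=True)   # center the data

S = Xc @ Xc.T / Xc.shape[1]              # covariance matrix S = (1/n) sum (x_i - xbar)(x_i - xbar)^T
a1 = np.array([0.5, 0.5, 0.5, 0.5])      # any unit vector

z1 = a1 @ Xc
print(z1.var())                          # empirical variance of the scores
print(a1 @ S @ a1)                       # a1^T S a1 -- matches the line above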
Algebraic derivation of PCs
Assume $\bar{x} = 0$. Form the matrix
$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$
then
$S = \frac{1}{n} X X^T.$
Obtain the eigenvectors of S by computing the SVD of X:
$X = U \Sigma V^T.$
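This relationship can be checked numerically (a sketch): if $X = U \Sigma V^T$, then $S = \frac{1}{n} X X^T = U \left(\Sigma^2 / n\right) U^T$, so the columns of U are eigenvectors of S with eigenvalues $\sigma_i^2 / n$.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 100))
X = X - X.mean(axis=1, keepdims=True)    # centered, so S = (1/n) X X^T
n = X.shape[1]

S = X @ X.T / n
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

eigvals_from_svd = sigma**2 / n                         # eigenvalues of S are sigma_i^2 / n
eigvals_direct = np.linalg.eigvalsh(S)[::-1]            # descending order
print(np.allclose(eigvals_from_svd, eigvals_direct))    # True; columns of U are the eigenvectors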
Algebraic derivation of PCs
To find $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let λ be a Lagrange multiplier:
$L = a_1^T S a_1 - \lambda (a_1^T a_1 - 1)$
$\frac{\partial L}{\partial a_1} = S a_1 - \lambda a_1 = 0$
$(S - \lambda I_p) a_1 = 0$
Therefore $a_1$ is an eigenvector of S corresponding to the largest eigenvalue $\lambda_1$.
Algebraic derivation of PCs
To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2]$ subject to $\mathrm{cov}[z_2, z_1] = 0$ (uncorrelated) and $a_2^T a_2 = 1$.
First note that
$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$
then let λ and φ be Lagrange multipliers, and maximize
$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1.$
Algebraic derivation of PCs
$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1$
$\frac{\partial L}{\partial a_2} = S a_2 - \lambda a_2 - \phi a_1 = 0$
Left-multiplying by $a_1^T$ and using the constraint $a_1^T a_2 = 0$ gives $\phi = 0$, so
$S a_2 = \lambda a_2 \quad \text{and} \quad \lambda = a_2^T S a_2.$
Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda_2$ is the second largest.
In general,
$\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.$
• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
Algebraic derivation of PCs
• Main steps for computing PCs (see the sketch after this list)
– Form the covariance matrix S.
– Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
– Use the first d eigenvectors $\{a_i\}_{i=1}^{d}$ to form the d PCs.
– The transformation G is given by $G = [a_1, a_2, \ldots, a_d]$.
A test point $x \in \mathbb{R}^p \mapsto G^T x \in \mathbb{R}^d$.
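A compact sketch of these steps in NumPy (assuming the data matrix has one observation per column; the helper name pca_transform is illustrative):

import numpy as np

def pca_transform(X, d):
    """Return (G, Y): the p x d transformation and the d x n reduced data."""
    Xc = X - X.mean(axis=1, keepdims=True)        # center the data
    S = Xc @ Xc.T / Xc.shape[1]                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    G = eigvecs[:, order[:d]]                     # first d eigenvectors: G = [a_1, ..., a_d]
    return G, G.T @ Xc                            # a test point x maps to G^T x

# Usage on random data
X = np.random.randn(10, 200)
G, Y = pca_transform(X, d=2)
print(G.shape, Y.shape)                           # (10, 2) (2, 200)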
Optimality property of PCA
Dimension reduction:
$X \in \mathbb{R}^{p \times n} \;\mapsto\; Y = G^T X \in \mathbb{R}^{d \times n}$
Reconstruction:
$Y = G^T X \in \mathbb{R}^{d \times n} \;\mapsto\; \tilde{X} = G (G^T X) \in \mathbb{R}^{p \times n}$
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:
$\min_{G \in \mathbb{R}^{p \times d}} \; \|X - G(G^T X)\|_F^2 \quad \text{subject to} \quad G^T G = I_d,$
where $\|X - \tilde{X}\|_F^2$ is the reconstruction error.
PCA projection minimizes the reconstruction error among all linear projections of size d.
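A quick numerical sanity check of this claim (a sketch comparing the PCA choice of G against a random orthonormal G of the same size):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 300))
X = X - X.mean(axis=1, keepdims=True)
d = 5

# PCA choice of G: first d eigenvectors of the covariance matrix
_, eigvecs = np.linalg.eigh(X @ X.T / X.shape[1])
G_pca = eigvecs[:, ::-1][:, :d]

# A competing orthonormal G of the same size, chosen at random
G_rand, _ = np.linalg.qr(rng.normal(size=(20, d)))

def recon_error(G):
    return np.linalg.norm(X - G @ (G.T @ X), 'fro')**2

print(recon_error(G_pca), recon_error(G_rand))   # the PCA error is the smaller of the two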
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA for image compression
Reconstructions using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
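A sketch of how such reconstructions could be produced (assuming a grayscale image stored as a 2-D array; the random array below stands in for a real image, and d = 100 is omitted because this toy image has only 64 columns):

import numpy as np

def compress(img, d):
    """Reconstruct a 2-D grayscale image from its first d principal components."""
    mean = img.mean(axis=1, keepdims=True)
    C = img - mean
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    G = U[:, :d]                          # top-d eigenvectors of the column covariance
    return G @ (G.T @ C) + mean           # rank-d reconstruction

img = np.random.rand(64, 64)              # stand-in for a real image
for d in (1, 2, 4, 8, 16, 32, 64):
    err = np.linalg.norm(img - compress(img, d))
    print(d, round(err, 3))               # error shrinks as d grows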
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation
– May not be effective for nonlinear data
• Solution: apply a nonlinear transformation to a potentially very high-dimensional space:
$\Phi : x \mapsto \Phi(x)$
• Computational efficiency: apply the kernel trick.
– Requires that PCA can be rewritten in terms of dot products:
$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
(More on kernels later.)
Nonlinear PCA using Kernels
Rewrite PCA in terms of dot products.
Assume the data has been centered, i.e., $\sum_i x_i = 0$.
The covariance matrix S can be written as $S = \frac{1}{n} \sum_i x_i x_i^T$.
Let v be an eigenvector of S corresponding to a nonzero eigenvalue λ:
$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v \;\Longrightarrow\; v = \frac{1}{n\lambda} \sum_i (x_i^T v)\, x_i.$
Eigenvectors of S lie in the space spanned by all data points.
Nonlinear PCA using Kernels
$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v \;\Longrightarrow\; v = \frac{1}{n\lambda} \sum_i (x_i^T v)\, x_i$
The covariance matrix can be written in matrix form:
$S = \frac{1}{n} X X^T, \quad \text{where } X = [x_1, x_2, \ldots, x_n].$
Write $v = \sum_i \alpha_i x_i = X \alpha$. Then
$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha$
$\frac{1}{n} (X^T X)(X^T X)\alpha = \lambda (X^T X)\alpha$
$\frac{1}{n} (X^T X)\alpha = \lambda \alpha$
Any benefits?
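The benefit is that the n x n matrix $\frac{1}{n} X^T X$ involves only inner products between data points. A numerical sketch confirming that its eigenvalues match the nonzero eigenvalues of S, with the eigenvector of S recovered as $v = X\alpha$:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))             # p = 50 variables, only n = 8 points
X = X - X.mean(axis=1, keepdims=True)
n = X.shape[1]

S = X @ X.T / n                          # p x p covariance matrix
K0 = X.T @ X / n                         # n x n matrix of inner products

lam, alpha = np.linalg.eigh(K0)
lam, alpha = lam[::-1], alpha[:, ::-1]   # descending order
v = X @ alpha[:, 0]                      # candidate eigenvector of S
v = v / np.linalg.norm(v)

print(np.allclose(S @ v, lam[0] * v))    # True: same top eigenvalue, eigenvector recovered from alpha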
Nonlinear PCA using Kernels
Next consider the feature space $\Phi : x \mapsto \Phi(x)$:
$S^{\Phi} = \frac{1}{n} X^{\Phi} (X^{\Phi})^T, \quad \text{where } X^{\Phi} = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)].$
$v = \sum_i \alpha_i \Phi(x_i) = X^{\Phi} \alpha \;\Longrightarrow\; \frac{1}{n} (X^{\Phi})^T X^{\Phi} \alpha = \lambda \alpha$
The (i,j)-th entry of $(X^{\Phi})^T X^{\Phi}$ is $\Phi(x_i)^T \Phi(x_j)$.
Apply the kernel trick:
$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
K is called the kernel matrix.
$\frac{1}{n} K \alpha = \lambda \alpha$
Nonlinear PCA using Kernels
• Projection of a test point x onto v:
$\Phi(x)^T v = \Phi(x)^T \sum_i \alpha_i \Phi(x_i) = \sum_i \alpha_i \Phi(x)^T \Phi(x_i) = \sum_i \alpha_i K(x, x_i)$
An explicit mapping is not required here.
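Putting the pieces together, a minimal kernel PCA sketch (assuming a Gaussian RBF kernel with an arbitrary width gamma; for brevity it omits the centering of the kernel matrix in feature space that a complete implementation would include):

import numpy as np

def rbf(a, b, gamma=0.5):
    # Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))                       # n = 30 points in 2-D, one per row here

n = len(X)
K = np.array([[rbf(X[i], X[j]) for j in range(n)] for i in range(n)])

lam, alpha = np.linalg.eigh(K / n)                 # solve (1/n) K alpha = lambda alpha
alpha_1 = alpha[:, -1]                             # coefficients for the top component

x_test = np.array([0.3, -0.7])
projection = sum(alpha_1[i] * rbf(x_test, X[i]) for i in range(n))
print(projection)                                  # Phi(x)^T v, without ever forming Phi(x)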
Reference
• Principal Component Analysis. I.T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf, et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.