pca - The University of Kansas
Download
Report
Transcript pca - The University of Kansas
EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Administrative
Project design is due Oct 30th
~3 weeks from now
Include the following items in the document
The goal of the project
A brief introduction of the overall project
A list of background materials that will be covered in the final
report
A high level design of your project
A testing plan
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
Overview
Gain insights of high dimensional space by projection
pursuit (feature reduction).
PCA: Principle components analysis
A data analysis tool
Mathematical background
PCA and gene expression profile analysis
briefly
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide3
A Group of Related Techniques
Unsupervised
Principal Component Analysis (PCA)
Latent Semantic Indexing (LSI): truncated SVD
Independent Component Analysis (ICA)
Canonical Correlation Analysis (CCA)
Supervised
Linear Discriminant Analysis (LDA)
Semi-supervised
Research topic
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide4
Rediscovery – Renaming of PCA
Statistics:
Principal Component Analysis (PCA)
Social Sciences:
Factor Analysis (PCA is a subset)
Probability / Electrical Eng:
Karhunen – Loeve expansion
Applied Mathematics:
Proper Orthogonal Decomposition (POD)
Geo-Sciences:
Empirical Orthogonal Functions (EOF)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide5
An Interesting Historical Note
The 1st (?) application of PCA to Functional Data
Analysis:
Rao, C. R. (1958) Some statistical methods for comparison of
growth curves, Biometrics, 14, 1-17.
1st Paper with “Curves as Data” viewpoint
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide6
What is Principal Component Analysis?
Principal component analysis (PCA)
Reduce the dimensionality of a data set by finding a new set of
variables, smaller than the original set of variables
Retains most of the sample's information.
Useful for the compression and classification of data.
By information we mean the variation present in the
sample, given by the correlations between the original
variables.
The new variables, called principal components (PCs), are
uncorrelated, and are ordered by the fraction of the total
information each retains.
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide7
A Geometric Picture
z1
z
• the 1st PC
is the line in the space such that the “projected”
1
data set has the largest total variance
z
• the 2nd PC z2 is the line, orthogonal to 1, to capture the
remaining total variance
PCs are a series of linear fits to a sample, each orthogonal to all
the previous.
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide8
Connect Math to Graphics
2-d Toy Example
Feature Space
Object Space
Data Points (Curves) are columns of data matrix, X
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide9
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
Sample Mean, X
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide10
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
Residuals from Mean = Data - Mean
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide11
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
Recentered Data = Mean Residuals, shifted to 0
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide12
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
PC1 Direction = η = Eigenvector (w/ biggest λ)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide13
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
Centered Data
9/25/2006
Data Types
PC1 Projection
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Residual
slide14
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
PC2 Direction = η = Eigenvector (w/ 2nd biggest λ)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide15
Connect Math to Graphics (Cont.)
2-d Toy Example
Feature Space
Object Space
Centered Data
9/25/2006
Data Types
PC2 Projection
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Residual
slide16
Connect Math to Graphics (Cont.)
Note for this 2-d Example:
PC1 Residuals = PC2 Projections
PC2 Residuals = PC1 Projections
(i.e. colors common across these pics)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide17
PCA and Complex Data Analysis
Data set is a set of curves
How to find clusters?
Treat curves as points in a high dimensional space
Applications in gene expression profile analysis
Zhao, X., Marron, J.S. and
Wells, M.T. (2004) The
Functional Data View of
Longitudinal Data, Statistica
Sinica, 14, 789-808
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide18
N-D Toy Example
Upper left shows the
mean. Upper right is
residuals from mean.
Lower left is
projections of the
mean residuals in the
PC1 direction.
Lower right is
further residuals
from PC1
projections.
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide19
Yeast Cell Cycle Data
Central question:
Which genes are “periodic” over 2 cell cycles?
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide20
Yeast Cell Cycle Data, PCA analysis
Periodic
genes?
Naïve
approach:
Simple PCA
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide21
Yeast Cell Cycle Data, FDA View
Central question: which genes are “periodic” over 2
cell cycles?
Naïve approach: Simple PCA
Doesn’t work
No apparent (2 cycle) periodic structure?
Eigenvalues suggest large amount of “variation”
PCA finds “directions of maximal variation”
Often, but not always, same as “interesting directions”
Here need better approach to study periodicities
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide22
Yeast Cell Cycles, Freq. 2 Proj.
PCA on
Freq. 2
Periodic
Component
Of Data
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide23
PCA for 2D Surfaces
2-d M-Rep Example:
Corpus Callosum
Atoms
Spokes
Implied
Boundary
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide24
Pros and Cons
PCA works for
Multi-dimensional Gaussian distribution
It doesn’t work for
Gaussian mixtures
Data in non-Euclidian spaces
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide25
Detailed Look at PCA
Three important (and interesting) viewpoints:
Mathematics
Numerics
Statistics
1st: Review linear alg. and multivar. prob.
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide26
Review of Linear Algebra
Vector Space:
set of “vectors”, x ,
and “scalars” (coefficients or an element in a field),
“closed” under “linear combination” (
ai x i in space)
a
i
For example:
x1
d
x : x1 ,..., xd
x
d
“ d dim Euclid’n space”
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide27
Subspace
Subspace:
subset that is again a vector space which is closed under linear
combination
Examples:
lines through the origin
planes through the origin
all linear combos of a subset of vector (= a hyperplane through
origin)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide28
Basis
Basis of subspace: set of vectors that
span, i.e. everything is a lin. com. of them
are linearly indep’t, i.e. lin. Com. is unique
Example:
“unit vector basis” in
d
1 0
0
0
1
,
,...,
0
1
0 0
e.g. x
1
1
0
0
0
1
x2
x1 x 2 x d 0
0
0
1
xd
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide29
Basis Matrix
Basis Matrix, of subspace of
d
v ,..., v
Given a basis:
1
n
create matrix of columns:
vn1
v11
vn
v
v
1d
nd d n
B v1
Then “linear combo” is a matrix multiplicat’n:
n
a v
i 1
9/25/2006
Data Types
i
i
Ba
a1
a
a
n
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide30
Linear Transformation
Aside on matrix multiplication: (linear transformation) for
matrices
a1,1 a1, m
A
a
a
k ,m
k ,1
b1,1 b1, n
B
b
b
m,n
m,1
Define the “matrix product”
m
a1,i bi ,1
i 1
AB
m
a k ,i bi ,1
i 1
a
b
1, i i , n
i 1
m
a
b
k ,i i , n
i 1
m
(“inner products” of columns with rows)
(composition of linear transformations)
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide31
Matrix Trace
For a square matrix
• Define
m
tr ( A) ai ,i
a1,1 a1, m
A
a
a
m,m
m,1
i 1
• Trace commutes with matrix multiplication:
tr AB tr BA
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide32
Dimension
Dimension of subspace (a notion of “size”):
number of elements in a basis (unique)
dim d d (use basis above)
Example
Dimension of a line is 1
Dimension of a plane is 2
Dimension is “degrees of freedom”
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide33
Vector Norm
•
in
•
Idea: “length” of the vector
d,
1/ 2
2
x x j
j 1
d
•
x x
t
1/ 2
x
x
“length normalized vector”:
(has length one, thus on surf. of unit sphere)
•
get “distance” as:
d x , y x y
9/25/2006
Data Types
x y x y
t
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide34
Inner Product
Inner (dot, scalar) product:
• for vectors
d
x and y , x, y x j y j x y
• related to norm, via
t
j 1
x
x, x x x
t
• measures “angle between x and y ” as:
x, y
1
anglex, y cos
x y
• key to “orthogonality”,
t
x
y
cos 1
xt x yt y
i.e. “perpendicul’ty”:
x y if and only if x, y 0
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide35
Orthonormal Basis
Orthonormal basis
v1 ,..., vn :
• All ortho to each other, i.e. vi , vi ' 0 ,
• All have length 1, i.e. vi , vi 1 ,
fori i '
for i 1,..., n
n
• “Spectral Representation”: x ai vi whereai x, vi
check:
x, v i
i 1
n
n
a
i '1
• Matrix notation: x B a
i'
vi ' , vi a i ' vi ' , vi a i
i '1
t
wherea x B
t
t
a
B
x
i.e.
a is called “transform (e.g. Fourier, wavelet) of x ”
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide36
Vector Projection
Projection of a vector x onto a subspace V :
• Idea: member of V that is closest to x
(i.e. “approx’n”)
• Find PV x V that solves: min x v (“least squa’s”)
vV
• General solution in d : for basis matrix BV
PV x BV B BV BVt x
1
t
V
• So “proj’n operator” is “matrix mult’n”:
(thus projection is another linear operation)
PV BV B BV BVt
t
V
9/25/2006
Data Types
1
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide37
Vector Projection (cont)
v1 ,..., vn :
Projection using orthonormal basis
BVt BV I nn
• Basis matrix is “orthonormal”:
v1 , v1
v1t
v1 vn
v ,v
vt
n
n 1
• So
PV x BV BVt x =
v1 , vn
vn , vn
1 0
0 1
Recon(Coeffs of x “in V
dir’n”)
• For “orthogonal complement”, V ,
x PV x PV x
and
• Parseval inequality:
9/25/2006
Data Types
x PV x PV x
2
2
n
P x x x, vi
2
Mining Biological Data
V
KU EECS 800, Luke Huan, Fall’06
2
i 1
2
n
2
ai2 a
i 1
2
slide38
Random Vectors
Given a “random vector”
X1
X
X
d
A “center” of the distribution is the mean vector,
EX 1
EX
EX
d
A “measure of spread” is the covariance matrix:
cov X 1 , X d
var X 1
cov( X )
cov X , X
var
X
d
1
d
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide39
Empirically
Given a random sample X 1 ,..., X n , estimate the
theoretical mean
, with the sample mean:
X1
1 n
ˆ X X i
X n i 1
d
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide40
Empirically (Cont.)
And estimate the “theoretical cov.”
cov.”:
n
2
X
X
i1 1
i 1
1
ˆ
n
n 1
X id X d X i1 X 1
i 1
9/25/2006
Data Types
, with the “sample
X
X
X
X
i1 1 id
d
n
i 1
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
n
2
X
X
id
d
i 1
slide41
With Linear Algebra
Outer product representation:
2
X i1 X 1
n
ˆ 1
n 1 i 1
X id X d X i1 X 1
X i1 X 1 X id X d
2
X id X d
ˆ 1 X i X X i X t X~X~ t
,
n 1 i 1
n
where:
9/25/2006
Data Types
~
X
1
X 1 X X n X d n
n Mining
1 Biological Data
KU EECS 800, Luke Huan, Fall’06
slide42
PCA as an Optimization Problem
Find “direction of greatest variability”:
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide43
Applications of PCA
Eigenfaces for recognition. Turk and Pentland. 1991.
Principal Component Analysis for clustering gene expression data.
Yeung and Ruzzo. 2001.
Probabilistic Disease Classification of Expression-Dependent
Proteomic Data from Mass Spectrometry of Human Serum. Lilien.
2003.
Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data
View of Longitudinal Data, Statistica Sinica, 14, 789-808
9/25/2006
Data Types
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide44