Transcript Document
Bioinformatics
Other data reduction techniques
Kristel Van Steen, PhD, ScD
([email protected])
Université de Liege - Institut Montefiore
2008-2009
Acknowledgements
Material based on:
work from Pradeep Mummidi
class notes from Christine Steinhoff
Outline
Intuition behind PCA
Theory behind PCA
Applications of PCA
Extensions of PCA
Multidimensional scaling MDS (not to be
confused with MDR)
Intuition behind PCA
Introduction
Most scientific and industrial data are multivariate (and the data sets are often huge).
Is all the data useful?
If not, how do we quickly extract only the useful information?
Problem
When we use traditional techniques, it is not easy to extract useful information from multivariate data:
1) Many bivariate plots are needed.
2) Bivariate plots, however, mainly represent correlations between variables (not samples).
Visualization Problem
It is not easy to visualize multivariate data:
- 1D: dot
- 2D: bivariate plot (i.e. X-Y plane)
- 3D: X-Y-Z plot
- 4D: ternary plot with a color code / tetrahedron
- 5D, 6D, etc.: ???
As the number of variables increases, the data space becomes harder and harder to visualize.
Basics of PCA
PCA is useful when we need to extract useful information from multivariate data sets.
The technique reduces the dimensionality of the data, so trends in multivariate data become easier to visualize.
Variable Reduction Procedure
Principal component analysis is a variable reduction
procedure. It is useful when you have obtained data on a
number of variables (possibly a large number of variables),
and believe that there is some redundancy in those variables.
Redundancy means that some of the variables are correlated
with one another, possibly because they are measuring the
same construct.
Because of this redundancy, you believe that it should be
possible to reduce the observed variables into a smaller
number of principal components (artificial variables) that will
account for most of the variance in the observed variables.
What is a Principal Component?
A principal component can be defined as a linear combination of optimally weighted observed variables.
This definition is based on how subject scores on a principal component are computed.
Example: a 7-item measure of job satisfaction.
General Formula
Below is the general form for the formula to compute scores on the
first component extracted (created) in a principal component
analysis:
C1 = b11(X1) + b12(X2) + ... + b1p(Xp)
where
C1 = the subject's score on principal component 1 (the first component extracted)
b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1
Xp = the subject's score on observed variable p.
For example, assume that component 1 in the present study
was the “satisfaction with supervision” component. You could
determine each subject’s score on principal component 1 by
using the following fictitious formula:
C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
Obviously, a different equation, with different regression
weights, would be used to compute subject scores on
component 2 (the satisfaction with pay component). Below is
a fictitious illustration of this formula:
C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)
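As a small illustrative sketch (not part of the original example; the item scores below are made up), the two fictitious equations can be evaluated in R as weighted sums:

# Hypothetical weights for components 1 and 2 (7 observed variables)
b1 <- c(.44, .40, .47, .32, .02, .01, .03)   # "satisfaction with supervision"
b2 <- c(.01, .04, .02, .02, .48, .31, .39)   # "satisfaction with pay"

# One subject's (made-up) responses to the 7 questionnaire items
x <- c(5, 6, 5, 4, 2, 3, 2)

# A component score is simply the weighted sum of the observed variables
C1 <- sum(b1 * x)
C2 <- sum(b2 * x)
c(C1 = C1, C2 = C2)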
Number of Components Extracted
If a principal component analysis were performed on data from the 7-item job satisfaction questionnaire, you might expect that only two components would be created. However, such an impression would not be entirely correct.
In reality, the number of components extracted in a principal
component analysis is equal to the number of observed
variables being analyzed.
However, in most analyses, only the first few components
account for meaningful amounts of variance, so only these
first few components are retained, interpreted, and used in
subsequent analyses (such as in multiple regression
analyses).
Characteristics of principal components
The first component extracted in a principal component analysis
accounts for a maximal amount of total variance in the observed
variables.
Under typical conditions, this means that the first component will
be correlated with at least some of the observed variables. It
may be correlated with many.
The second component extracted will have two important
characteristics. First, this component will account for a maximal
amount of variance in the data set that was not accounted for by
the first component.
Under typical conditions, this means that the second
component will be correlated with some of the observed
variables that did not display strong correlations with
component 1.
The second characteristic of the second component is that it
will be uncorrelated with the first component. Literally, if you
were to compute the correlation between components 1 and
2, that correlation would be zero.
The remaining components that are extracted in the analysis
display the same two characteristics: each component
accounts for a maximal amount of variance in the observed
variables that was not accounted for by the preceding
components, and is uncorrelated with all of the preceding
components.
Generalization
A principal component analysis proceeds in this fashion, with
each new component accounting for progressively smaller and
smaller amounts of variance (this is why only the first few
components are usually retained and interpreted).
When the analysis is complete, the resulting components will
display varying degrees of correlation with the observed
variables, but are completely uncorrelated with one another.
References
http://support.sas.com/publishing/pubcat/chaps/55129.pdf
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
http://www.cis.hut.fi/jhollmen/dippa/node30.html
Theory behind PCA
Theory behind PCA
Linear Algebra
OUTLINE
What do we need from "linear algebra" for understanding principal component analysis?
•Standard deviation, Variance, Covariance
•The Covariance matrix
•Symmetric matrix and orthogonality
•Eigenvalues and Eigenvectors
•Properties
Motivation
[Figure: proteins 1 and 2 measured for 200 patients, shown as a scatter plot of Protein 1 against Protein 2.]
Motivation
[Figure: a microarray experiment yields a data matrix of 22,000 genes x 200 patients. How do we visualize it? Which genes are important? For which subgroup of patients?]
Motivation
[Figure: the same genes x patients data matrix, reduced to around 10 dimensions.]
Basics for Principal Component Analysis
•Orthogonal/Orthonormal
•Some Theorems...
•Standard deviation, Variance, Covariance
•The Covariance matrix
•Eigenvalues and Eigenvectors
Standard Deviation
The standard deviation is roughly the average distance from the data points to their mean.
MEAN: mean = (X1 + X2 + ... + Xn) / n
SD: s = sqrt( sum_i (Xi - mean)^2 / (n-1) )
Example:
Measurement 1: 0, 8, 12, 20 -> mean 10, SD 8.33
Measurement 2: 8, 9, 11, 12 -> mean 10, SD 1.83
Variance
The variance is the square of the standard deviation:
Var = sum_i (Xi - mean)^2 / (n-1) = s^2
Example:
Measurement 1: 0, 8, 12, 20 -> mean 10, SD 8.33, Var 69.33
Measurement 2: 8, 9, 11, 12 -> mean 10, SD 1.83, Var 3.33
Covariance
Standard deviation and variance are 1-dimensional: they describe one variable at a time.
How much do the dimensions vary from the mean with respect to each other?
Covariance is measured between 2 dimensions:
cov(X,Y) = sum_i (Xi - mean(X)) (Yi - mean(Y)) / (n-1)
We easily see that if X = Y we end up with the variance.
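These quantities can be checked with a minimal R sketch on the two example measurements above:

m1 <- c(0, 8, 12, 20)
m2 <- c(8, 9, 11, 12)

mean(m1); mean(m2)                 # both 10
sd(m1); sd(m2)                     # 8.33 and 1.83
var(m1); var(m2)                   # 69.33 and 3.33

cov(m1, m2)                        # covariance between the two measurements
all.equal(cov(m1, m1), var(m1))    # covariance of a variable with itself = its variance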
Covariance Matrix
Let X = (X1, ..., Xp)^t be a random vector.
Then the covariance matrix of X, denoted by Cov(X), is the pxp matrix with entries
Cov(X)_ij = cov(Xi, Xj).
The diagonals of Cov(X) are the variances var(Xi).
In matrix notation,
Cov(X) = E[ (X - E[X]) (X - E[X])^t ].
The covariance matrix is symmetric.
Symmetric Matrix
Let A = (a_ij) be a square matrix of size nxn. The matrix A is symmetric if a_ij = a_ji for all i, j (equivalently, A = A^t).
Orthogonality/Orthonormality
Example: <v1, v2> = <(1 0), (0 1)> = 1*0 + 0*1 = 0
[Figure: the two unit vectors (1 0) and (0 1) drawn at a right angle.]
Two vectors v1 and v2 for which <v1, v2> = 0 holds are said to be orthogonal.
Unit vectors which are orthogonal are said to be orthonormal.
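A minimal R check of this example:

v1 <- c(1, 0)
v2 <- c(0, 1)

sum(v1 * v2)      # inner product <v1, v2> = 0, so v1 and v2 are orthogonal
sqrt(sum(v1^2))   # length of v1 is 1
sqrt(sum(v2^2))   # length of v2 is 1: orthogonal unit vectors are orthonormal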
Eigenvalues/Eigenvectors
Let A be an nxn square matrix and x an nx1 column vector. Then a (right) eigenvector of A is a nonzero vector x such that
A x = lambda x
for some scalar lambda.
lambda is the eigenvalue; x is the corresponding eigenvector.
Procedure:
Finding the eigenvalues (the lambdas): solve det(A - lambda I) = 0
Finding the corresponding eigenvectors: for each lambda, solve (A - lambda I) x = 0
R: eigen(matrix)
Matlab: eig(matrix)
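As a quick illustration (the 2x2 matrix below is arbitrary), this is what the procedure looks like in R:

# Arbitrary 2x2 symmetric matrix
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)
e$values    # eigenvalues (lambdas): 3 and 1
e$vectors   # corresponding eigenvectors, one per column

# Check the defining property A x = lambda x for the first eigenpair
A %*% e$vectors[, 1]
e$values[1] * e$vectors[, 1]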
Some Remarks
If A and B are matrices whose sizes are such that the given operations are defined, and c is any scalar, then:
(A^t)^t = A
(A + B)^t = A^t + B^t
(cA)^t = c A^t
(AB)^t = B^t A^t
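These identities are easy to verify numerically; a small R sketch with arbitrary random matrices:

A <- matrix(rnorm(6), nrow = 2)    # 2x3
B <- matrix(rnorm(6), nrow = 2)    # 2x3
C <- matrix(rnorm(12), nrow = 3)   # 3x4

all.equal(t(t(A)), A)
all.equal(t(A + B), t(A) + t(B))
all.equal(t(2 * A), 2 * t(A))
all.equal(t(A %*% C), t(C) %*% t(A))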
Now,…
We now have enough definitions to go through the procedure for performing principal component analysis.
Theory behind PCA
Linear algebra applied
OUTLINE
What is principal component analysis good for?
Principal Component Analysis: PCA
•The basic Idea of Principal Component Analysis
•The idea of transformation
•How to get there ? The mathematics part
•Some remarks
•Basic algorithmic procedure
Idea of PCA
•Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set
of multivariate data in terms of a set of uncorrelated variables
•We typically have a data matrix of n observations on p correlated variables x1,x2,…xp
•PCA looks for a transformation of the xi into p new variables yi that are uncorrelated
Idea
[Figure: the data matrix X, with genes x1, ..., xp as rows and patients 1, ..., n as columns.]
The dimension is high. So how can we reduce the dimension?
Simplest way: take the first one, two, or three variables, plot them and discard the rest.
Obviously a very bad idea.
Transformation
We want to find a transformation that involves ALL columns, not only the first ones.
So find a new basis, ordered such that the first component carries almost ALL the information of the whole dataset.
We are looking for a transformation of the data matrix X (pxn) of the form
Y = a^t X = a1 X1 + a2 X2 + ... + ap Xp
for some weight vector a = (a1, ..., ap).
Transformation
What is a reasonable choice for the weight vector a?
Remember: we wanted a transformation that maximizes "information",
that is, one that captures the "variance in the data".
Maximize the variance of the projection of the observations on the Y variables!
Find a such that Var(a^t X) is maximal.
The matrix C = Var(X) is the covariance matrix of the Xi variables.
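A small R sketch of this idea on synthetic data (observations in rows, as is usual in R; the data are made up): the first eigenvector of the covariance matrix gives a projection whose variance is at least as large as that of any single original variable.

set.seed(1)
# Synthetic data: 200 observations of 2 correlated variables
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.4)
X  <- cbind(x1, x2)

C <- cov(X)                  # covariance matrix of the variables
a <- eigen(C)$vectors[, 1]   # first eigenvector = unit-length weight vector

var(X %*% a)                 # variance of the projection on this direction ...
var(x1); var(x2)             # ... is at least as large as either original variance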
Transformation
Can we intuitively see this in a picture?
[Figure: a cloud of points with two candidate projection directions, one labelled "Good" and one labelled "Better" at capturing the spread of the data.]
Transformation
[Figure: the data cloud with the first principal component PC1 and, orthogonal to it, the second principal component PC2.]
How do we get there?
[Figure: the data matrix X with genes x1, ..., xp as rows and patients 1, ..., n as columns.]
X is a real-valued pxn matrix.
Cov(X) is a real-valued pxp matrix or nxn matrix:
-> decide whether you want to analyse patient groups or gene groups.
How do we get there?
Let's decide for genes:
Cov(X) =
  [ v(x1)      c(x1,x2)   ...   c(x1,xp) ]
  [ c(x1,x2)   v(x2)      ...   c(x2,xp) ]
  [ ...                                  ]
  [ c(x1,xp)   c(x2,xp)   ...   v(xp)    ]
where v(.) denotes a variance and c(.,.) a covariance.
How do we get there?
Some features of Cov(X):
•Cov(X) is a symmetric pxp matrix
•The diagonal terms of Cov(X) are the variances of the genes across patients
•The off-diagonal terms of Cov(X) are the covariances between gene vectors
•Cov(X) captures the correlations between all possible pairs of measurements
•In the diagonal terms, by assumption, large values correspond to interesting dynamics
•In the off-diagonal terms, large values correspond to high redundancy
How do we get there?
The principal components of X are the eigenvectors of Cov(X).
Assume we can "manipulate" X a bit; let's call the result Y.
Y should be manipulated in a way that makes it a bit more optimal than X was.
What does optimal mean? It means:
the off-diagonal terms of Cov(Y) (the covariances) should be SMALL,
and the diagonal terms of Cov(Y) (the variances) should be LARGE.
In other words: Cov(Y) should be diagonal, with large values on the diagonal.
How do we get there?
The manipulation is a change of basis to orthonormal vectors,
ordered in a way that the most important comes first (hence "principal") ...
How do we put this in mathematical terms?
Find orthonormal P such that
Y=PX
With Cov(Y) diagonalized
Then the rows of P are the principal components of X
How do we get there?
Y = PX
Cov(Y) = 1/(n-1) Y Y^t
       = 1/(n-1) (PX)(PX)^t
       = 1/(n-1) P X X^t P^t
       = 1/(n-1) P (X X^t) P^t
       = 1/(n-1) P A P^t,   with A := X X^t
How do we get there?
A is symmetric.
Therefore there is a matrix E of eigenvectors and a diagonal matrix D such that
A = E D E^t
Now define P to be the transpose of the matrix E of eigenvectors:
P := E^t
Then we can write A as
A = P^t D P
How do we get there?
Now we can go back to our covariance expression:
Cov(Y) = 1/(n-1) P A P^t
       = 1/(n-1) P (P^t D P) P^t
       = 1/(n-1) (P P^t) D (P P^t)
How do we get there?
The inverse of an orthogonal matrix is its transpose (by definition):
P^(-1) = P^t
In our context that means:
Cov(Y) = 1/(n-1) (P P^(-1)) D (P P^(-1))
       = 1/(n-1) D
How do we get there?
P diagonalizes Cov(Y), where P is the transpose of the matrix of eigenvectors of X X^t.
The principal components of X are the eigenvectors of X X^t (that is, the rows of P).
The ith diagonal value of Cov(Y) is the variance of X along p_i (along the ith principal component).
Essentially we need to compute
EIGENVALUES -> explained variance
and
EIGENVECTORS -> principal components
of the covariance matrix of the original matrix X.
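A short R sketch verifying this numerically on simulated data (the dimensions are arbitrary): after projecting the centered data onto the rows of P, the covariance of the projections is diagonal, with the eigenvalues on the diagonal.

set.seed(2)
X <- matrix(rnorm(5 * 50), nrow = 5)    # 5 variables (rows) x 50 observations
X <- X - rowMeans(X)                    # center each variable

C <- X %*% t(X) / (ncol(X) - 1)         # covariance matrix (p x p)
P <- t(eigen(C)$vectors)                # rows of P = principal components

Y <- P %*% X                            # projected data
round(Y %*% t(Y) / (ncol(X) - 1), 10)   # Cov(Y): diagonal, with the eigenvalues of C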
Some Remarks
•If you multiply one variable by a scalar you get different results
•This is because PCA uses the covariance matrix (and not the correlation matrix)
•PCA should therefore be applied to data that have approximately the same scale in each variable
•The relative variance explained by each PC is given by eigenvalue / sum(eigenvalues)
•When to stop? For example: keep enough PCs to reach a cumulative variance explained by the PCs of > 50-70%
•Kaiser criterion: keep PCs with eigenvalues > 1 (both stopping rules are sketched below)
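A short R sketch of these two stopping rules on simulated data (note that the Kaiser criterion assumes standardized variables, i.e. eigenvalues of the correlation matrix):

set.seed(3)
X  <- matrix(rnorm(100 * 6), ncol = 6)   # 100 observations x 6 variables
ev <- eigen(cor(X))$values               # eigenvalues of the correlation matrix

ev / sum(ev)                             # relative variance explained per PC
cumsum(ev) / sum(ev)                     # cumulative variance explained
which(cumsum(ev) / sum(ev) > 0.7)[1]     # number of PCs needed to reach 70%
sum(ev > 1)                              # Kaiser criterion: PCs with eigenvalue > 1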
Some Remarks
If variables have very heterogeneous variances, we standardize them.
The standardized variables Xi*:
Xi* = (Xi - mean) / standard deviation
The new variables all have the same variance, so each variable gets the same weight.
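In R this standardization is what scale() does; a minimal sketch on simulated data with very different scales:

set.seed(4)
X <- cbind(a = rnorm(50, sd = 1), b = rnorm(50, sd = 100))   # very different scales

Xs <- scale(X)        # subtract the column means, divide by the column SDs
apply(Xs, 2, var)     # every standardized variable now has variance 1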
REMARKS
•PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by
rejecting low variance features
•PCA is only powerful if the biological question is related to the highest variance in the dataset
Algorithm
Data = (Data.old - mean) / sqrt(variance)
Cov(Data) = 1/(N-1) Data * Data^t
Find eigenvectors/eigenvalues of Cov(Data) (function in R: eigen, in Matlab: eig) and sort by decreasing eigenvalue
Eigenvectors: the principal components, the rows of P
Eigenvalues: the explained variances
Project the original data: Y = P * Data
Plot as many components as necessary (a worked R sketch of these steps follows)
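A compact R sketch of this recipe on simulated data (variables in rows and observations in columns, to match the notation above); the last line compares the eigenvalues with the variances reported by R's built-in prcomp as a sanity check:

set.seed(5)
Data.old <- matrix(rnorm(4 * 30), nrow = 4)   # 4 variables x 30 observations

# 1. Standardize each variable (row)
Data <- (Data.old - rowMeans(Data.old)) / apply(Data.old, 1, sd)

# 2. Covariance matrix of the standardized variables
C <- Data %*% t(Data) / (ncol(Data) - 1)

# 3. Eigenvectors and eigenvalues (eigen returns them in decreasing order)
e <- eigen(C)
P <- t(e$vectors)                             # principal components as rows

# 4. Project the data and keep as many components as necessary
Y <- P %*% Data

# Sanity check: eigenvalues should match the component variances from prcomp
pr <- prcomp(t(Data.old), scale. = TRUE)
round(e$values - pr$sdev^2, 10)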
Applications of PCA
Applications
Include:
Image Processing
Microarray Experiments
Pattern Recognition
OUTLINE
Principal component analysis in bioinformatics
Example 1
Lefkovits et al.
[Figure: the data matrix X with spots x1, ..., xp as rows and clones 1, ..., n as columns.]
X is a real-valued pxn matrix.
They want to analyse the relatedness of clones, so Cov(X) is a real-valued nxn matrix.
They use the correlation matrix (which, on top of the covariances, divides by the standard deviations).
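A one-line R reminder of that relation (on made-up data): the correlation matrix is the covariance matrix of the standardized data.

set.seed(6)
X <- matrix(rnorm(20 * 5), ncol = 5)   # 20 observations x 5 variables
all.equal(cor(X), cov(scale(X)))       # TRUE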
Lefkovits et al.
Example 2
Yang et al.
[Figure: PCA plot of the Yang et al. experiments, with samples labelled babo, tkv, and control.]
Ulloa-Montoya et al.
[Figure: PCA plot from Ulloa-Montoya et al., with multipotent adult progenitor cells, pluripotent embryonic stem cells, and mesenchymal stem cells.]
Yang et al.
But: we only see the different experiments.
If we do it the other way round, that is, analysing the genes instead of the experiments, we see a grouping of genes.
But we never see both together.
So, can we somehow relate the experiments and the genes?
That means: group genes whose expression might be explained by the respective experimental group (tkv, babo, control)?
This leads into "correspondence analysis".
Extensions of PCA
Difficult example
Non-linear PCA
Kernel PCA
(http://research.microsoft.com/users/Cambridge/nicolasl/papers/eigen_dimred.pdf)
PCA in feature space
Side remark
Summary of kernel PCA
Multidimensional Scaling (MDS)
Common stress functions