Transcript Document

Principal Component Analysis
PCA Example: FOODS
• 20 food products
• 16 European countries
• Data: percentage of households in each country using each product

Country   Gr_coffee  Inst_coffee  Tea  Sweetener  Biscuit  Pe_Soup  Ti_soup  In_Potat  Fro_Fish  …
Germany   90         49           88   19         57       51       19       21        27
Italy     …
France    …
PCA Example: Red Sox
 Dataset: 110 years of Red Sox performance data
 Question: do pitcher and batter ages matter for performance?
Redundancy
• Two arbitrary measurements, r1 and r2
• Redundancy increases from low in (a) to high in (c)
• In (c) the data can be represented by a single variable
• Spread along the best-fit line reflects the covariance between the two variables
Transform
• Linear transformation
• (x, y) in the Cartesian coordinate system
• The same point becomes (a, b) in another coordinate system
• Assuming a linear transformation:
• a = f(x, y) = x*c11 + y*c12
• b = g(x, y) = x*c21 + y*c22
$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$
For a review of matrices, see www.cs.uml.edu/~kim/580/review_matrix.pdf
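As a minimal sketch in Python/NumPy (the matrix entries and the point are made-up values, not from the slides), the linear transformation above is just a matrix-vector product:

```python
import numpy as np

# A hypothetical 2x2 transform matrix [[c11, c12], [c21, c22]]
C = np.array([[2.0, 1.0],
              [1.0, 3.0]])

point = np.array([4.0, 5.0])   # (x, y) in Cartesian coordinates

# (a, b) = C @ (x, y): a = x*c11 + y*c12, b = x*c21 + y*c22
a, b = C @ point
print(a, b)                    # 13.0 19.0
```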
Eigenvector
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 11 \\ 5 \end{pmatrix}$$

$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
• Eigenvector – a vector that the matrix projects onto the same direction; only its length is scaled
• Eigenvectors of a symmetric square matrix are orthogonal
• Each eigenvector is associated with an eigenvalue, the scaling factor (4 in the example above)
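The example above can be checked with NumPy; np.linalg.eig is one standard way to compute eigenpairs (it returns unit-length eigenvectors as columns, so (3, 2) appears normalized):

```python
import numpy as np

M = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# (1, 3) is not an eigenvector: M @ (1, 3) = (11, 5), a different direction
print(M @ np.array([1.0, 3.0]))   # [11.  5.]

# (3, 2) is an eigenvector with eigenvalue 4: M @ (3, 2) = (12, 8) = 4*(3, 2)
print(M @ np.array([3.0, 2.0]))   # [12.  8.]

vals, vecs = np.linalg.eig(M)
print(vals)                       # eigenvalues 4 and -1 (order may vary)
print(vecs)                       # columns proportional to (3, 2) and (1, -1)
```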
Transform
• http://www.ams.org/samplings/feature-column/fcarc-svd
$$\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 3x \\ y \end{pmatrix}$$
Transform
• http://www.ams.org/samplings/feature-column/fcarc-svd
$$M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \quad \text{(symmetric)}$$
Eigenvectors
• Mv = λᵢv, where λᵢ is a scalar and the eigenvectors v are orthogonal (for symmetric M)
• Non-symmetric example:

$$M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$$
• For orthogonal unit vectors v1 and v2, a general vector x has coefficients given by projecting onto the unit vectors:
  x = (v1 · x) v1 + (v2 · x) v2
• The matrix-vector product is then of the form:
  Mx = (v1 · x) Mv1 + (v2 · x) Mv2 = (v1 · x) σ1u1 + (v2 · x) σ2u2
• => M = UΣV^T
• SVD (Singular Value Decomposition): M = UΣV^T, with
  M: m×n; U: m×m; Σ: m×n; V: n×n
• The columns of U are the eigenvectors of MM^T
• The columns of V are the eigenvectors of M^TM
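A short NumPy check of these relations, using the non-symmetric M above (eigenvectors are unique only up to sign, and np.linalg.eigh orders eigenvalues ascending while SVD orders singular values descending, hence the abs and column reversal):

```python
import numpy as np

M = np.array([[1.0, 1.0],
              [0.0, 1.0]])      # the non-symmetric example above

U, s, Vt = np.linalg.svd(M)     # M = U @ diag(s) @ Vt

# Reconstruction check
print(np.allclose(M, U @ np.diag(s) @ Vt))             # True

# Columns of U are eigenvectors of M M^T (up to sign and ordering)
valsU, vecsU = np.linalg.eigh(M @ M.T)
print(np.allclose(np.abs(U), np.abs(vecsU[:, ::-1])))  # True
```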
Transform = New Coordinate
• What would make a good transform matrix for PCA?
• → The eigenvectors of the data's covariance matrix (next)
Covariance
Mean, Variance, Covariance
• X = (x1, x2, …, xn); Y = (y1, y2, …, yn)
• E[X] = ∑i xi / n; E[Y] = ∑i yi / n
• Variance = (st. dev.)²:
  V[X] = ∑i (xi − E[X])² / (n − 1)
• Covariance:
  cov[X, Y] = ∑i (xi − E[X])(yi − E[Y]) / (n − 1)
Covariance Matrix
• For three variables X, Y, Z:

  cov[X,X]  cov[X,Y]  cov[X,Z]
  cov[Y,X]  cov[Y,Y]  cov[Y,Z]
  cov[Z,X]  cov[Z,Y]  cov[Z,Z]

• cov[X,X] = V[X]
• cov[X,Y] = cov[Y,X]
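A small NumPy illustration (the X, Y, Z samples are made-up values); np.cov uses the same n−1 denominator as the formulas above:

```python
import numpy as np

# Hypothetical samples of three variables X, Y, Z
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([1.0, 3.0, 2.0, 5.0])
Z = np.array([9.0, 7.0, 4.0, 1.0])

data = np.vstack([X, Y, Z])          # rows = variables

S = np.cov(data)                     # 3x3 covariance matrix, (n-1) denominator
print(S)
print(S[0, 0], np.var(X, ddof=1))    # cov[X,X] equals V[X]
print(np.allclose(S, S.T))           # symmetric: cov[X,Y] == cov[Y,X]
```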
PCA
Numerical Example
  X     Y
  2.5   2.4
  0.5   0.7
  2.2   2.9
  1.9   2.2
  3.1   3.0
  2.3   2.7
  2.0   1.6
  1.0   1.1
  1.5   1.6
  1.1   0.9

• Means: E[X] = 1.81, E[Y] = 1.91
• After adjusting (subtracting the means):

  X(adj)  Y(adj)
   0.69    0.49
  -1.31   -1.21
   0.39    0.99
   0.09    0.29
   1.29    1.09
   0.49    0.79
   0.19   -0.31
  -0.81   -0.81
  -0.31   -0.31
  -0.71   -1.01
• Covariance matrix:

$$\mathrm{cov} = \begin{pmatrix} 0.6165 & 0.6154 \\ 0.6154 & 0.7166 \end{pmatrix}$$

• Eigenvalues: solve

$$\begin{vmatrix} 0.6165 - \lambda & 0.6154 \\ 0.6154 & 0.7166 - \lambda \end{vmatrix} = 0$$

  λ₁ = 1.2840, λ₂ = 0.0491

• Eigenvectors (computed from the matrix above): v₁ ≈ (0.678, 0.735) for λ₁ = 1.2840 and v₂ ≈ (−0.735, 0.678) for λ₂ = 0.0491
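The whole numerical example can be reproduced with a few lines of NumPy; np.cov and np.linalg.eigh recover the covariance matrix, eigenvalues, and eigenvectors above:

```python
import numpy as np

# Data from the numerical example above
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Mean-adjust each variable (means 1.81 and 1.91)
x_adj = x - x.mean()
y_adj = y - y.mean()

# Covariance matrix with the (n-1) denominator
cov = np.cov(x_adj, y_adj)
print(cov)            # approx [[0.6166 0.6154], [0.6154 0.7166]]

# eigh is appropriate since cov is symmetric; eigenvalues come out ascending
vals, vecs = np.linalg.eigh(cov)
print(vals)           # approx [0.0491, 1.2840]
print(vecs)           # columns are the principal directions
```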
Example: Amino Acid (AA) - Basic
Clustering of AAs
 How many clusters?
 Use 4 AA groups:
 Good for the acidic and basic groups
 P falls in the polar group
 The nonpolar group is widely spread
 Similarities between AAs determine the ease of substitutions
 Some alignment tools show similar AAs in colors
 Needs a more systematic approach
Physico-Chemical Properties
 Physico-chemical properties of AAs determine protein structures:
(1) Size (volume)
(2) Partial volume
 Measures the expanded volume in solution when dissolved
(3) Bulkiness
 The ratio of side-chain volume to its length: the average cross-sectional area of the side chain
(4) pH of the isoelectric point of the AA (pI)
(5) Hydrophobicity
(6) Polarity index
(7) Surface area
(8) Fraction of area
 The fraction of the accessible surface area that is buried in the interior in a set of known crystal structures
Amino acid     3L   1L   Vol.  Bulk  Pol.  pI    Hydro  Surf.  Frac.
Alanine        Ala  A    67    11.5  0.0   6.0   1.8    113    0.74
Arginine       Arg  R    148   14.3  52.0  10.8  -4.5   241    0.64
Asparagine     Asn  N    96    12.3  3.4   5.4   -3.5   158    0.63
Aspartic acid  Asp  D    91    11.7  49.7  2.8   -3.5   151    0.62
Cysteine       Cys  C    86    13.5  1.5   5.1   2.5    140    0.91
Glutamine      Gln  Q    114   14.5  3.5   5.7   -3.5   189    0.62
Glutamic acid  Glu  E    109   13.6  49.9  3.2   -3.5   183    0.62
Glycine        Gly  G    48    3.4   0.0   6.0   -0.4   85     0.72
Histidine      His  H    118   13.7  51.6  7.6   -3.2   194    0.78
Isoleucine     Ile  I    124   21.4  0.1   6.0   4.5    182    0.88
Leucine        Leu  L    124   21.4  0.1   6.0   3.8    180    0.85
Lysine         Lys  K    135   13.7  49.5  9.7   -3.9   211    0.52
Methionine     Met  M    124   16.3  1.4   5.7   1.9    204    0.85
Phenylalanine  Phe  F    135   10.8  0.4   5.5   2.9    218    0.88
Proline        Pro  P    90    17.4  1.6   6.3   -1.6   143    0.64
Serine         Ser  S    73    9.5   1.7   5.7   -0.8   122    0.66
Threonine      Thr  T    93    15.8  1.7   5.7   -0.7   146    0.70
Tryptophan     Trp  W    163   21.7  2.1   5.9   -0.9   259    0.85
Tyrosine       Tyr  Y    141   18.0  1.6   5.7   -1.3   229    0.76
Valine         Val  V    105   21.6  0.1   6.0   4.2    160    0.86
Mean                     109   15.4  13.6  6.0   -0.5   175    0.74
Color legend:
Red: acidic
Orange: basic
Green: polar (hydrophilic)
Yellow: non-polar (hydrophobic)
PCA of AAs
 How to incorporate the different properties in order to group similar AAs?
 Visual clustering with volume and pI
PCA
 Given an N×P matrix (e.g., 20×7), each row represents a P-dimensional data point
 Each data point is
  Scaled and shifted to the origin
  Rotated to spread out the points as much as possible
Scaling
 For property j, compute the average and the s.d.:
  μj = ∑i xij / N,  σj² = ∑i (xij − μj)² / N
 Since each property has a different scale and mean, define normalized variables:
  zij = (xij − μj) / σj
 zij measures the deviation from the mean for each property, with mean 0 and s.d. 1
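A minimal sketch of this scaling step in NumPy (the 3×3 matrix is a made-up stand-in for the N×P data; np.std defaults to the N denominator, matching σj above):

```python
import numpy as np

# Hypothetical N x P data matrix: rows = data points, columns = properties
X = np.array([[67.0, 11.5, 6.0],
              [148.0, 14.3, 10.8],
              [96.0, 12.3, 5.4]])

mu = X.mean(axis=0)        # mu_j = sum_i x_ij / N
sigma = X.std(axis=0)      # sigma_j with the N denominator

Z = (X - mu) / sigma       # z_ij = (x_ij - mu_j) / sigma_j
print(Z.mean(axis=0))      # ~0 for every property
print(Z.std(axis=0))       # 1 for every property
```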
PCA
 New orthogonal coordinate system
 Find vj = (vj1, vj2, …, vjP) such that
  ∑k vik vjk = 0 for i ≠ j (orthogonal) and ∑k vjk² = 1 (unit length)
 vj represents a new coordinate vector
 Data points in the z coordinate system become
  yij = ∑k zik vjk
 The new y coordinate system is a rotation of the z coordinate system
 vjk turns out to be related to the correlation coefficient
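A sketch of this rotation step, assuming the principal directions are taken as eigenvectors of the covariance of the scaled data (random numbers stand in for the real 20×7 AA matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 7))        # stand-in for the scaled 20x7 AA matrix
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Principal directions v_j: eigenvectors of the covariance of the z's
C = np.cov(Z, rowvar=False)
vals, V = np.linalg.eigh(C)
V = V[:, ::-1]                      # sort columns by decreasing eigenvalue

# y_ij = sum_k z_ik v_jk  -- rotate each point into the new coordinates
Y = Z @ V
print(Y[:, :2])                     # first two principal components per point
```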
PCA
 Correlation coefficient Cij:
  Cij = ∑k (zik − mi)(zjk − mj) / (P si sj)  (mi, si: mean and s.d. of the i-th row)
  −1 ≤ Cij ≤ 1
 Results in an N×N similarity matrix Sij; the 7×7 correlation matrix of the properties is shown below
        Vol    Bulk   Polar  pI     Hyd    SA     FrA
Vol     1.00   0.73   0.24   0.37   -0.08  0.99   0.18
Bulk           1.00   -0.20  0.08   0.44   0.64   0.49
Polar                 1.00   0.27   -0.69  0.29   -0.53
pI                           1.00   -0.20  0.36   -0.18
Hyd                                 1.00   -0.18  0.84
SA                                         1.00   0.12
FrA                                               1.00
Clustering
 A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution
 Criteria needed:
  Closeness between sequences
  The number of clusters
 Hierarchical clustering – connectivity-based
 k-means – centroid-based
Hierarchical Clustering
Hierarchical Clustering Algorithm
 Initially, each point forms its own cluster
 Join the two clusters with the highest similarity to form a single larger cluster
 Recompute similarities between all clusters
 Repeat the two steps above until all points are connected into clusters
 Criteria for similarity?
  Use the scaled coordinates z
  Vector zi runs from the origin to data point i, with length |zi|² = ∑k zik²
  Use the cosine of the angle between two points as the similarity:
   cos θij = ∑k zik zjk / (|zi| |zj|)
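The cosine similarity criterion in a few lines of Python (the two test vectors are made-up values):

```python
import numpy as np

def cosine_similarity(zi, zj):
    """cos(theta_ij) = sum_k z_ik z_jk / (|z_i| |z_j|)."""
    return np.dot(zi, zj) / (np.linalg.norm(zi) * np.linalg.norm(zj))

# 45 degrees between (1, 0) and (1, 1): cosine ~ 0.707
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```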
N elements, n×n distance matrix d

Hierarchical_Clustering(d, n)
  Form n clusters, each with 1 element
  Construct a graph T by assigning an isolated vertex to each cluster
  while there is more than 1 cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    Compute the distance from C to all other clusters
    Add a new vertex C to T and connect it to the vertices for C1 and C2
    Remove the rows and columns of d for C1 and C2, and add a row and column for C
  return T
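In practice the same procedure is available off the shelf; a sketch with SciPy, using cosine distance and average (group-average) linkage on random stand-in data rather than the actual AA properties:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
Z_data = rng.normal(size=(20, 7))   # stand-in for the scaled AA property matrix

# Agglomerative clustering: cosine distance, group-average linkage
T = linkage(Z_data, method='average', metric='cosine')

# Cut the tree into 6 clusters, as in the AA example
labels = fcluster(T, t=6, criterion='maxclust')
print(labels)
```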
k-means Clustering
 The number of clusters, k, is known ahead of time
 Minimize the squared error between data points and the k cluster centers

k-means Clustering Problem
 Given n data points, find k center points minimizing the squared error distortion
  d(V, X) = ∑i d(vi, X)² / n
 Input: a set V of n data points and a parameter k
 Output: a set X of k center points minimizing d(V, X) over all possible choices of X

 No polynomial-time algorithm is known
 Heuristic – the Lloyd algorithm (sketched below):
  Initially partition the n points arbitrarily into k clusters, then move points between clusters
  Converges to a local minimum, and may move many points in each iteration
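A minimal NumPy sketch of the Lloyd heuristic (the function name and the toy data are illustrative, not from the slides):

```python
import numpy as np

def lloyd_kmeans(V, k, iters=100, seed=0):
    """Lloyd heuristic: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    X = V[rng.choice(len(V), size=k, replace=False)]   # arbitrary initial centers
    labels = np.zeros(len(V), dtype=int)
    for _ in range(iters):
        # assignment step: each point moves to its nearest center (many may move)
        d = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the centroid of its cluster
        new_X = np.array([V[labels == j].mean(axis=0) if np.any(labels == j)
                          else X[j] for j in range(k)])
        if np.allclose(new_X, X):                      # converged: local minimum
            break
        X = new_X
    return X, labels

# Toy data: two well-separated blobs
rng = np.random.default_rng(2)
V = np.vstack([rng.normal(loc=0.0, size=(10, 2)),
               rng.normal(loc=5.0, size=(10, 2))])
centers, labels = lloyd_kmeans(V, k=2)
print(centers)
print(labels)
```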
k-means Clustering
 Consider the possible partitions P of the n elements into k clusters
 Each partition has a cost(P)

Progressive_Greedy_k-means(n)
  Select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to C reduces cost(P)
          if Δ(i → C) > bestChange
            bestChange ← Δ(i → C)
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P

 Moves one point in each iteration
Dynamic Modeling in Chameleon
 Similarity between clusters is determined by
  Relative interconnectivity (RI)
  Relative closeness (RC)
 Pairs with high RI and RC are selected to merge
Hierarchical Clustering - Cluto
 Generates a set of clusters within clusters
 The result can be arranged as a tree: each node is where two smaller clusters join
 CLUTO package with cosine and group-average rules

 Red/green indicates values significantly higher/lower than the average; dark colors are close to the average
 Clustering of properties: the properties can be ordered, illustrating groups of properties that are correlated (numbered clusters below)
1. Red on both the pI and polarity scales
2. Green on hydrophobicity and pI (can be separated into two smaller clusters)
3. Green on volume and surface area
4. C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus difficult to interchange with other residues)
5. Hydrophobic
6. The two largest AAs
 6 clusters:

Cluster  Property         AAs
1        Basic            K, R, H
2        Acid and amide   E, D, Q, N
3        Small            P, T, S, G, A
4        Cysteine         C
5        Hydrophobic      V, L, I, M, F
6        Large, aromatic  W, Y
Dayhoff Clustering - 1978
 In the PAM matrix, the probabilities of pairs of amino acids appearing together were considered
 Pairs of amino acids that tend to appear together are grouped into a cluster
 Six clusters:
  (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)
 Contrast with the clusters obtained via hierarchical clustering:
  (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)
The genetic code (1st base down, 2nd base across, 3rd base in the last column), with the Dayhoff clusters (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY) overlaid on the codon blocks:

1st\2nd    U        C        A          G          3rd
U        UUU F    UCU S    UAU Y      UGU C        U
         UUC F    UCC S    UAC Y      UGC C        C
         UUA L    UCA S    UAA Stop   UGA Stop     A
         UUG L    UCG S    UAG Stop   UGG W        G
C        CUU L    CCU P    CAU H      CGU R        U
         CUC L    CCC P    CAC H      CGC R        C
         CUA L    CCA P    CAA Q      CGA R        A
         CUG L    CCG P    CAG Q      CGG R        G
A        AUU I    ACU T    AAU N      AGU S        U
         AUC I    ACC T    AAC N      AGC S        C
         AUA I    ACA T    AAA K      AGA R        A
         AUG M    ACG T    AAG K      AGG R        G
G        GUU V    GCU A    GAU D      GGU G        U
         GUC V    GCC A    GAC D      GGC G        C
         GUA V    GCA A    GAA E      GGA G        A
         GUG V    GCG A    GAG E      GGG G        G
Murphy, Wallqvist, Levy, 2000
 To study protein folding
 Used the BLOSUM50 similarity matrix
 Determined correlation coefficients between similarity-matrix elements for all pairs of AAs
  e.g., C_AV = (∑i M_A,i M_V,i) / √[(∑i M_A,i M_A,i)(∑i M_V,i M_V,i)], with the summation over i taken over the 20 AAs
 Group the two AAs with the highest CC, then either add the AA with the next-highest CC to an existing group or start a new group
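A sketch of the correlation coefficient computation in NumPy; a random symmetric matrix stands in for BLOSUM50 (a real run would load the actual matrix), and the row indices are hypothetical:

```python
import numpy as np

def blosum_cc(M, a, b):
    """Cosine-style correlation between rows a and b of a similarity matrix M:
    C_ab = sum_i M[a,i]*M[b,i] / sqrt(sum_i M[a,i]^2 * sum_i M[b,i]^2)."""
    num = np.dot(M[a], M[b])
    den = np.sqrt(np.dot(M[a], M[a]) * np.dot(M[b], M[b]))
    return num / den

# Stand-in 20x20 similarity matrix with BLOSUM-like integer scores
rng = np.random.default_rng(3)
M = rng.integers(-4, 12, size=(20, 20)).astype(float)
M = (M + M.T) / 2                    # make it symmetric, like BLOSUM

print(blosum_cc(M, a=0, b=17))       # e.g., correlation between two AA rows
```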