Conformational Space
Download
Report
Transcript Conformational Space
Conformational Space
Conformational Space
Conformation of a molecule: specification of
the relative positions of all atoms in 3D-space,
Typical parameterizations:
List of coordinates of atom centers
List of torsional angles (e.g., the f-y-c for
a protein)
Conformational space:
Space of all conformations
Conformational Space
qj
qi
qN-1
q2
qN
q1
Conformational Space
q0
q1
qn
q4
q3
Relation to Robotics/Graphics
q0
q1
q2
qn
t(t)
q4
Configuration space
q3
Need for a Metric
Simulation and sampling techniques can
produce millions of conformations
Which conformations are similar?
Which ones are close to the folded one?
Do some conformations form small
clusters (e.g. key intermediates while
folding)?
Metric in Conformational Space
A metric over conformational space C is a
function:
d: c,c’ C d(c,c’) +{0}
such that:
d(c,c’) = 0 c = c’
d(c,c’) = d(c’,c)
d(c,c’) + d(c’,c”) d(c,c”)
(non-degeneracy)
(symmetry)
(triangle inequality)
But not all metrics are “good”
Euclidean metric:
d(c,c’) =
Si=1,...,n(|fi-fi’|2+ |yi-yi’|2)
Metric in Conformational Space
A “good” metric should measure how
well the atoms in two conformations can
be aligned
Usual metrics: cRMSD, dRMSD
RMSD
Given two sets of n points in 3
A = {a1,…,an} and B = {b1,…,bn}
The RMSD between A and B is:
RMSD(A,B) = [(1/n)Si=1,…,n||ai-bi||2]1/2
where ||ai-bi|| denotes the Euclidean
distance between ai and bi in 3
RMSD(A,B) = 0 iff ai = bi for all i
cRMSD
Molecule M with n atoms a1,…,an
Two conformations c and c’ of M
ai(c) is position of ai when M is at c
cRMSD(c,c’) is the minimized RMSD
between the two sets of atom centers:
minT[(1/n)Si=1,…,n||ai(c) – T(ai(c’))||2]1/2
where the minimization is over all
possible rigid-body transform T
cRMSD
cRMSD verifies triangle inequality
cRMSD takes linear time to compute
Often, cRMSD is restricted to a subset
of atoms, e.g., the Ca atoms on a
protein’s backbone
Representation Restricted
to Ca Atoms
Protein 1tph
- The positions of AA residue centers (Cα atoms)
mainly determine the structure of a protein.
- In structural comparison, people usually work
only on the backbone of Cα atoms, and neglect
the other atoms.
Possible project: Design a method for
efficiently finding nearest neighbors in
a sampled conformation space of a
protein, using the cRMSD metric.
dRMSD
Molecule M with n atoms a1,…,an
Two conformations c and c’ of M
{dij(c)}: nn symmetrical intra-molecular
distance matrix in M at c
dRMD(c, c’) is :
[(1/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c) – dij(c’))2]1/2
{dij} is usually restricted to a subset of atoms,
e.g., the Ca atoms on a protein’s backbone
Intra-Molecular Distance Matrix
Distances between Ca pairs of a protein with 142 residues.
Darker squares represent shorter distances.
Intra-Molecular Distance Matrix
45
40
85
1
Distances between Ca pairs of a protein with 142 residues.
Darker squares represent shorter distances.
Intra-Molecular Distance Matrix
dRMSD
Molecule M with n atoms a1,…,an
Two conformations c and c’ of M
{dij(c)}: nn symmetrical intra-molecular
distance matrix in M at c
dRMSD(c, c’) =
[(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c) – dij(c’))2]1/2
{dij} is usually restricted to a subset of atoms,
e.g., the Ca atoms on a protein’s backbone
dRMSD
Molecule M with n atoms a1,…,an
Two conformations c and c’ of M
{dij(c)}: nn symmetrical intra-molecular
distance matrix in M at c
dRMSD(c, c’) =
[(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c) – dij(c’))2]1/2
{dij} is usually restricted to a subset of atoms,
e.g., the Ca atoms on a protein’s backbone
Advantage: No aligning transform
Drawback: Takes quadratic time to compute
Is dRMSD a metric?
dRMSD(c, c’) =
[(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c) – dij(c’))2]1/2
is a metric in the n(n-1)/2-dimensional space, where a
conformation c is represented by {dij(c)}
But, in this representation, the same point represents
both a conformation and its mirror image
k-Nearest-Neighbors Problem
Given a set S of
conformations of a
protein and a query
conformation c, find the k
conformations in S most
similar to c (w.r.t. cRMSD,
dRMSD, other metric)
Can be done in time O(N(log k + L)) where:
- N = size of S
- L = time to compare two conformations
k-Nearest-Neighbors Problem
The total time needed to compute the k
nearest neighbors of every conformation
in S is O(N2(log k + L))
Much too long for large datasets where
N ranges from 10,000’s to millions!!!
Can be improved by:
1. Reducing L
2. More efficient algorithm (e.g., kd-tree)
kd-Tree
In a d-dimensional space, where d>2, range searching for a point takes O(dn1-1/d)
k-Nearest-Neighbors Problem
Idea: simplify protein’s description
Assume that each conformation is described by
the coordinates of the n Ca atoms
cRMSD O(n) time
dRMSD O(n2) time
This representation is highly
redundant
Proximity along the chain entails spatial
proximity
d 3l
Atoms can’t bunch up, hence far away
atoms along the chain are on average
spatially distant
ci
cj
m-Averaged Approximation
Cut the backbone into fragments of m
Ca atoms
Replace each fragment by the centroid
of the m Ca atoms
Simplified cRMSD and dRMSD
3n coordinates
3n/m coordinates
Evaluation: Test Sets
[Lotan and Schwarzer, 2003]
8 diverse proteins (54 -76 residues)
Decoy sets of N =10,000 conformations from the
Park-Levitt set [Park et al, 1997]
Correlation:
m
cRMSD
dRMSD
3
0.99
0.96-0.98
4
0.98-0.99
0.94-0.97
6
0.92-0.99
0.78-0.93
9
0.81-0.98
0.65-0.96
12
0.54-0.92
0.52-0.69
Higher correlation for random sets ( greater savings)
Running Times
Further Reduction for dRMSD
1) Stack m-averaged distance matrices
as vectors of a matrix A
N
r
1 n n
r 1
2 m m
A
Vector ai of elements of
distance matrix of
ith conformation (i = 1 to N)
1
dRMSDm (c,c
ai -aj
i j )=
r
2
Further Reduction for dRMSD
1) Stack m-averaged distance matrices
as vectors of a matrix A
2) Compute the SVD A = UDVT
SVD Decomposition
N
r
A
(rxN)
=
Vector aj of elements of
distance matrix of
jth conformation (j = 1 to N)
U
(rxr)
D
(rxr)
Diagonal matrix
Orthonormal
(rotation) matrix
VT
(rxN)
SVD Decomposition
N
r
A
(rxN)
=
Vector aj of elements of
distance matrix of
jth conformation (j = 1 to N)
U
(rxr)
s1
s2 0
0
sr
VT
(rxN)
Diagonal matrix
s1 s2 ... sr 0
(singular values)
Orthonormal
(rotation) matrix
SVD Decomposition
N
r
A
(rxN)
=
Vector aj of elements of
distance matrix of
jth conformation (j = 1 to N)
U
(rxr)
VT
D
(rxr)
(rxN)
vjT
vkT
Diagonal matrix
Orthonormal
(rotation) matrix
Matrix with
orthonormal rows
vi and vj are orthogonal
unit Nx1 vectors
SVD Decomposition
N
A
(rxN)
r
=
U
(rxr)
D
(rxr)
y
X
VT
(rxN)
Representation of
A in space (X,Y)
1
dRMSDm (c,c
)=
ai -aj
i j
r
Y
does not depend on the
coordinate system!
r-dimensional space r 1 n n 1
2 m m
x
2
SVD Decomposition
N
r
s1
s2
A
(rxN)
=
U
(rxr)
v1 T
v2 T
s3
sr
D
(rxr)
VT
(rxN)
||s1v1|| ||s2v2|| ...
SVD Decomposition
N
r
s1
s2
A
(rxN)
=
U
(rxr)
v1 T
v2 T
s3
sr
vpT
D
(rxr)
VT
(rxN)
p principal
components
SVD Decomposition
N
r
s1
A
(rxN)
=
U
(rxr)
v1 T
v2 T
s2
sp
vpT
0
D
(rxr)
VT
(rxN)
p principal
components
Further Reduction for dRMSD
1) Stack m-averaged distance matrices
as vectors of a matrix A
2) Compute the SVD A = UDVT
3) Project onto p principal components
Correlation
between dRMSD and dRMSD4PC
dRMSD is
PC
4
reduced to
summing up 12 to
20 terms
(instead of ~ 80 to
200, since the
proteins have 54 to
76 amino acids)
Complexity of SVD
SVD of rxN matrix, where N > r, takes
O(r2N) time
Here r ~ (n/m)2
So, time complexity is O(n4N)
Would be too costly without m-averaging
Evaluation for 1CTF Decoy Sets
[Lotan and Schwarzer, 2003]
N = 100,000, k = 100, 4-averaging, 16 PCs
70% correct, with furthest NN off by 20%
Brute-force:
84 h
Brute-force + m-averaging:
4.8 h
Brute-force + m-averaging + PC:
41 min
kD-tree + m-averaging + PC:
19 min
Speedup greater than x200
6k approximate NNs contain all true k NNs
Use m-averaging and PC reduction as
fast filters