Ppt appendix on automated structure comparison

Appendix: Automated Methods for Structure Comparison
• Basic problem: how are any two given structures to be automatically compared in a meaningful way?
• How are distant relationships to be recognized?
Program   Method
DALI      distance matrix comparison (basis for FSSP structural classification)
SSAP      dynamic programming (used in CATH to classify topologies)
VAST      convert secondary structures to vectors and align vectors
Structure comparison is pretty easy when two proteins are very similar
• When two proteins are so similar that the sequences can be reliably aligned, say >35% identical, structure comparison can proceed from the sequence alignment:
1. Align the sequences
   sequence 1: YIREV-GKL
   sequence 2: YITQVRNKA
2. Superpose the structures to minimize the RMSD for equivalent residue pairs in the alignment
Note: the structures shown on the slide do not correspond to the sequences above.
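The two steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function names `equivalent_pairs` and `superpose_rmsd` are hypothetical, the coordinates are synthetic rather than real structures, and the superposition uses the SVD-based Kabsch algorithm, which is one standard way to minimize RMSD (the slides do not say which method is meant).

```python
import numpy as np

def equivalent_pairs(aln1, aln2, gap="-"):
    """Return (i, j) residue-index pairs for alignment columns without a gap."""
    pairs, i, j = [], 0, 0
    for a, b in zip(aln1, aln2):
        if a != gap and b != gap:
            pairs.append((i, j))
        if a != gap:
            i += 1  # advance in sequence 1 only when a residue is present
        if b != gap:
            j += 1  # advance in sequence 2 only when a residue is present
    return pairs

def superpose_rmsd(P, Q):
    """Minimum RMSD between paired coordinate sets P and Q (each N x 3), Kabsch method."""
    P0 = P - P.mean(axis=0)              # remove translation
    Q0 = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P0.T @ Q0)  # SVD of the 3 x 3 covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # proper rotation mapping P0 onto Q0
    diff = (R @ P0.T).T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Step 1: equivalent residues from the alignment shown above.
pairs = equivalent_pairs("YIREV-GKL", "YITQVRNKA")

# Synthetic C-alpha coordinates (assumption: made up for illustration only).
rng = np.random.default_rng(0)
ca1 = rng.normal(scale=5.0, size=(8, 3))           # 8 residues, like sequence 1
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
moved = ca1 @ rot.T + np.array([10.0, -3.0, 2.0])  # same shape, rotated and shifted
ca2 = np.insert(moved, 5, moved[4] + 1.0, axis=0)  # extra residue at the gap position

# Step 2: superpose on the equivalent pairs only.
idx1 = [i for i, _ in pairs]
idx2 = [j for _, j in pairs]
rmsd = superpose_rmsd(ca1[idx1], ca2[idx2])
print(f"{len(pairs)} equivalent pairs, RMSD after superposition = {rmsd:.3f} Å")
```

Because the second coordinate set is just a rotated and shifted copy of the first with one inserted residue, the RMSD over the equivalent pairs is essentially zero; with real structures it would reflect genuine structural divergence.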
It is harder when the proteins are very different...
• If one cannot align the sequences reliably, how does one establish which residues, if any, play equivalent structural roles in the two proteins?
• The answer is to attempt to align the structures directly in such a way that structural equivalencies in the two proteins are revealed.
• We will discuss how the distance-matrix based algorithm of DALI solves this problem.
Distance Matrices
• 2D representation of 3D structure
• plot sequence against itself
• identify pairs of residues which are close in space to each other
• usually the distance between C-alpha carbons is used
• close residue pairs appear as dark parts of the matrix
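A minimal sketch of how such a matrix could be computed from C-alpha coordinates with NumPy. The function names, the four made-up coordinates, and the 8 Å contact cutoff are assumptions chosen for illustration; the slides do not specify a cutoff.

```python
import numpy as np

def distance_matrix(ca):
    """Pairwise Euclidean distances between C-alpha atoms (N x 3 -> N x N)."""
    diff = ca[:, None, :] - ca[None, :, :]   # broadcast to N x N x 3
    return np.sqrt((diff ** 2).sum(axis=-1))

def contact_map(dist, cutoff=8.0):
    """True where two residues are closer than the cutoff (the 'dark' cells)."""
    return dist < cutoff

# Made-up coordinates in Angstroms: three residues in a line, then a turn.
ca = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0],
               [7.6, 3.8, 0.0]])

dist = distance_matrix(ca)
contacts = contact_map(dist)
print(np.round(dist, 1))
print(contacts.astype(int))
```

The matrix is symmetric with zeros on the diagonal, so plotting either triangle carries the full information.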
Distance matrices
Different substructures, such as secondary or supersecondary structures, give rise to distinct patterns in the matrix, e.g. antiparallel vs. parallel beta-sheets.
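The difference between the two sheet patterns can be seen in a toy model. The sketch below builds two idealized, perfectly straight strands (the 3.3 Å rise per residue, 4.8 Å strand spacing, and 5 Å cutoff are rough assumed values, and there is no connecting loop) and lists which residue pairs are in contact. Parallel strands give contacts with constant j - i (a line parallel to the main diagonal); antiparallel strands give contacts with constant i + j (a line perpendicular to it).

```python
import numpy as np

def strand(n, y, reverse=False, rise=3.3):
    """Idealized straight strand of n residues along x at height y."""
    x = np.arange(n) * rise
    if reverse:
        x = x[::-1]          # chain runs in the opposite direction
    return np.column_stack([x, np.full(n, y), np.zeros(n)])

def inter_strand_contacts(ca, n, cutoff=5.0):
    """Contacts (i, j) with i in the first strand and j in the second."""
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    return [(i, j) for i in range(n) for j in range(n, 2 * n)
            if dist[i, j] < cutoff]

n = 5
parallel = np.vstack([strand(n, 0.0), strand(n, 4.8)])
antiparallel = np.vstack([strand(n, 0.0), strand(n, 4.8, reverse=True)])

par = inter_strand_contacts(parallel, n)
anti = inter_strand_contacts(antiparallel, n)

print("parallel     j - i:", sorted({j - i for i, j in par}))
print("antiparallel i + j:", sorted({i + j for i, j in anti}))
```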
In principle, one could recognize structural similarity in two proteins by comparing patterns in distance matrices, but it’s not that simple.
Problem: two structures with the same topology may differ in the precise location of secondary structure elements along the sequence, i.e. loop lengths may differ.
Same fold, different matrices.
Or two common architectures may differ in connectivity (topology)...
Both are three-stranded antiparallel beta-sheets. How might we compare their distance matrices to reveal this similarity?
DALI algorithm
• Not useful to compare entire matrices.
• Instead, chop distance matrices into all possible submatrices of 6x6 amino acids.
• Compare this set of submatrices for pattern similarities rather than comparing the entire matrix.
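The submatrix idea can be sketched as follows. This is not DALI's implementation: the mean absolute distance difference used here is a simple stand-in for DALI's own similarity score, the exhaustive search is only practical for toy inputs, and the names `submatrix` and `best_match` are hypothetical. Protein B is protein A with three extra residues prepended, so every fragment of A reappears in B shifted by three positions.

```python
import numpy as np

def distance_matrix(ca):
    """Pairwise C-alpha distances (N x 3 -> N x N)."""
    return np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

def submatrix(dist, i, j, size=6):
    """Distances between the fragment starting at i and the fragment starting at j."""
    return dist[i:i + size, j:j + size]

def best_match(dist_a, dist_b, i, j, size=6):
    """Find the fragment pair (k, l) in B whose submatrix best resembles A's (i, j)."""
    target = submatrix(dist_a, i, j, size)
    n = dist_b.shape[0] - size + 1
    best, best_diff = None, np.inf
    for k in range(n):
        for l in range(k + size, n):     # non-overlapping fragments, k before l
            diff = np.abs(submatrix(dist_b, k, l, size) - target).mean()
            if diff < best_diff:
                best, best_diff = (k, l), diff
    return best, best_diff

# Synthetic chains (assumption: random-walk coordinates, not real proteins).
rng = np.random.default_rng(1)
ca_a = np.cumsum(rng.normal(scale=2.0, size=(20, 3)), axis=0)
extra = rng.normal(scale=2.0, size=(3, 3)) - 20.0
ca_b = np.vstack([extra, ca_a])          # B = 3 extra residues + copy of A

match, diff = best_match(distance_matrix(ca_a), distance_matrix(ca_b), 2, 10)
print("A fragments (2, 10) best match B fragments", match, "difference", diff)
```

Because distances within a structure do not change under rotation or translation, the comparison needs no prior superposition of the two proteins.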
1. Identify a pair of matching submatrices within the two matrices.
   Make an initial sequence alignment from this match...
2. Identify a second pair which overlaps the first (contains one common structural element).
3. Combine overlapping pairs.
   This gives an overall alignment of structurally equivalent sequence regions.
4. Rearrange and “collapse” the matrix according to the aligned regions of the sequence.
   Now the common structural elements are aligned, as are the structurally equivalent residues in the sequence!
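Step 4 can be illustrated with a small NumPy sketch. The setup is an assumption made for the example: protein B contains the same three spatial segments as protein A, but in a different chain order and with an extra loop, and the aligned segments are taken as already known (in DALI they would come from steps 1 to 3). The helper name `collapse` is hypothetical.

```python
import numpy as np

def distance_matrix(ca):
    """Pairwise C-alpha distances (N x 3 -> N x N)."""
    return np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

def collapse(dist, segments):
    """Keep only aligned residues, in alignment order.

    segments is a list of (start, length) tuples, one per aligned region.
    """
    idx = np.concatenate([np.arange(s, s + n) for s, n in segments])
    return dist[np.ix_(idx, idx)]        # reorder rows and columns together

# Synthetic coordinates (assumption: random points, not a real fold).
rng = np.random.default_rng(2)
ca_a = rng.normal(scale=6.0, size=(18, 3))   # A: segments s1, s2, s3
loop = rng.normal(scale=6.0, size=(4, 3))
# B: same pieces in chain order s2, loop, s1, s3 (different connectivity).
ca_b = np.vstack([ca_a[6:12], loop, ca_a[0:6], ca_a[12:18]])

dist_a, dist_b = distance_matrix(ca_a), distance_matrix(ca_b)

segs_a = [(0, 6), (6, 6), (12, 6)]       # s1, s2, s3 in A
segs_b = [(10, 6), (0, 6), (16, 6)]      # where s1, s2, s3 sit in B

collapsed_a = collapse(dist_a, segs_a)
collapsed_b = collapse(dist_b, segs_b)
print("raw shapes:", dist_a.shape, dist_b.shape)
print("collapsed matrices equal:", np.allclose(collapsed_a, collapsed_b))
```

The raw matrices have different sizes and different patterns, but once rows and columns are reordered by the alignment the two collapsed matrices can be compared cell by cell.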
All together now...
The Power of DALI
• DALI is quite powerful because it can recognize architectural similarities even when topologies are different.
• It is also flexible because it can be made more topologically restrictive (i.e. no swapping of segments in chain allowed) to focus on closer relationships.
FSSP uses DALI alignments to classify structures
all PDB entries (8320)
  -> eliminate similar sequences
representative set of structures (947)
  -> divide into domains
representative set of domains (1484)
  -> align domains with DALI!
  -> group domains into fold types (clusters of similar structures) and make set of representatives of each fold
set of representatives of each fold (540)
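The grouping step can be pictured with a deliberately simple sketch. It is not FSSP's actual clustering procedure: it uses single-linkage grouping with a union-find structure, the Z-score threshold of 8.0 is an arbitrary assumption, and the domain names and pairwise scores are invented purely for illustration.

```python
def cluster_by_z(names, z_scores, threshold=8.0):
    """Single-linkage grouping: join two domains if their pairwise Z >= threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), z in z_scores.items():
        if z >= threshold:
            parent[find(a)] = find(b)    # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())

# Invented domain names and pairwise Z-scores (illustration only).
names = ["d1", "d2", "d3", "d4", "d5"]
z_scores = {
    ("d1", "d2"): 12.0,
    ("d2", "d3"): 9.5,
    ("d1", "d3"): 6.0,
    ("d4", "d5"): 20.1,
    ("d3", "d4"): 2.0,
}

clusters = cluster_by_z(names, z_scores)
representatives = [group[0] for group in clusters]   # one member per cluster
print(clusters)
print(representatives)
```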
Judging DALI alignments
• Z-score: how much better than average is the alignment, i.e. how many standard deviations from the mean of a distribution of alignments of random pairs of proteins. >16 very close, 8-16 pretty close, <8 not so close.
• RMSD: root mean square deviation of alpha carbons for the matching portion of the structures.
• LALI: length of alignment (recognizably matching portion of the structures)
• LSEQ2: total length of the sequence being matched.
• %IDE: % sequence identity between the two sequences
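The Z-score definition above amounts to a one-line formula, sketched here in Python. The background scores are made-up numbers, the use of the sample standard deviation is an assumption, and the handling of the exact boundary values 8 and 16 is a choice made for the example; how DALI estimates its background distribution is not described in the slides.

```python
import statistics

def z_score(score, random_scores):
    """Standard deviations by which a score exceeds the mean of random-pair scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)   # sample standard deviation (assumption)
    return (score - mean) / sd

def closeness(z):
    """Rule of thumb from the slide: >16 very close, 8-16 pretty close, <8 not so close."""
    if z > 16:
        return "very close"
    if z >= 8:
        return "pretty close"
    return "not so close"

# Made-up alignment scores for random protein pairs (illustration only).
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(30.0, background)
print(f"Z = {z:.2f}: {closeness(z)}")
```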
If you go into FSSP and search for a particular structure, you’ll get an output of its best DALI alignments with other structures.
STRID2  Z     RMSD  LALI  LSEQ2  %IDE  PROTEIN
1plc    24.4  0.0   99    99     100   Plastocyanin (cu2+, ph 6.0)
2pcy    23.4  0.2   99    99     100   Apo-plastocyanin (pH 6.0)
1bqk    12.1  2.0   89    124    29    pseudoazurin
1aac    11.0  1.9   84    104    24    amicyanin
1ibzA   9.1   2.5   83    111    19    nitrosocyanin
1qhqA   8.3   2.4   87    139    29    auracyanin
1rcy    8.2   2.5   90    151    17    rusticyanin biological_unit
1qniA   7.7   2.2   78    572    19    nitrous-oxide reductase
1kcw    7.1   2.4   81    1017   17    ceruloplasmin biological_unit
2cuaA   7.0   2.2   80    122    15    cua fragment
1nwpA   6.7   3.1   85    128    24    azurin
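One way to read this output programmatically is sketched below. The numbers are copied from the listing above; the whitespace-separated layout, the Z >= 8 filter, and the "coverage" quantity (LALI divided by LSEQ2) are assumptions introduced for the example and are not FSSP fields.

```python
# Columns: STRID2, Z, RMSD, LALI, LSEQ2, %IDE (values copied from the listing).
rows = """\
1plc   24.4 0.0 99   99 100
2pcy   23.4 0.2 99   99 100
1bqk   12.1 2.0 89  124  29
1aac   11.0 1.9 84  104  24
1ibzA   9.1 2.5 83  111  19
1qhqA   8.3 2.4 87  139  29
1rcy    8.2 2.5 90  151  17
1qniA   7.7 2.2 78  572  19
1kcw    7.1 2.4 81 1017  17
2cuaA   7.0 2.2 80  122  15
1nwpA   6.7 3.1 85  128  24
"""

hits = []
for line in rows.splitlines():
    strid, z, rmsd, lali, lseq2, ide = line.split()
    hits.append({
        "strid": strid,
        "z": float(z),
        "rmsd": float(rmsd),
        "lali": int(lali),
        "lseq2": int(lseq2),
        "ide": int(ide),
    })

# Hits at or above the "pretty close" threshold quoted earlier.
close_hits = [h["strid"] for h in hits if h["z"] >= 8]

# Fraction of the matched protein covered by the alignment.
coverage = {h["strid"]: h["lali"] / h["lseq2"] for h in hits}

print("Z >= 8:", close_hits)
for strid, frac in coverage.items():
    print(f"{strid:6s} aligned fraction of matched chain: {frac:.0%}")
```

The coverage values make one feature of the listing easier to see: for long chains such as 1kcw and 1qniA, the aligned region is only a small part of the matched protein.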