Review of "Neutral networks in protein space:
a computational study based on knowledge-based potentials of mean force" and "Exploring
protein sequence space using knowledge-based potentials"
Todd Taylor
George Mason University
School of Computational Sciences
Summary of Prosa II Potential
W(x,)= W[xi, xj, |i-j|; dij ] +  V[xi; (i) ]
i<j
W[xi, xj, |i-j|; dij ] = additive pair contribution
 = C or
C
x = AA sequence
 = structure
a and b = amino acids: a at xi and b at xj
|i-j| = separation in sequence of a and b
dij = Euclidean distance between  atoms of a and b
Summary of Prosa II Potential Continued
Z-score = ( W(x,) - W(x)
) / w(x)
W(x)=average energy over all structures
w(x) = standard deviation of energy over all structures
V[xi; (i) ] = surface term
 = C or
C
x = AA sequence
a and b = amino acids: a at xi and b at xj
 = the number of protein atoms in a sphere centered at xi
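The energy sum and Z-score summarized on these two slides can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the PROSA II implementation: `pair_term`, `surface_term`, `dist`, and `burial` are hypothetical stand-ins for the tabulated potentials, inter-atom distances, and atom counts, none of which are given in the slides. Using the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def total_energy(
    seq: str,
    dist: Sequence[Sequence[float]],
    burial: Sequence[int],
    pair_term: Callable[[str, str, int, float], float],
    surface_term: Callable[[str, int], float],
) -> float:
    """Pair terms summed over i < j, plus one surface term per residue.

    pair_term and surface_term are hypothetical callables standing in for
    the knowledge-based potential tables.
    """
    n = len(seq)
    pair_sum = sum(
        pair_term(seq[i], seq[j], j - i, dist[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    surface_sum = sum(surface_term(seq[i], burial[i]) for i in range(n))
    return pair_sum + surface_sum


def z_score(energy: float, reference_energies: Sequence[float]) -> float:
    """(energy - mean) / standard deviation over the reference structures."""
    mu = mean(reference_energies)
    sigma = pstdev(reference_energies)  # assumption: population std. dev.
    if sigma == 0:
        raise ValueError("reference energies have zero spread")
    return (energy - mu) / sigma


# Toy check with made-up numbers (not PROSA output).
toy_energy = total_energy(
    "AB",
    dist=[[0.0, 1.0], [1.0, 0.0]],
    burial=[2, 3],
    pair_term=lambda a, b, sep, d: d,
    surface_term=lambda a, count: float(count),
)
toy_z = z_score(-1.0, [1.0, 2.0, 3.0])
print(toy_energy, toy_z)
```

A sequence whose energy on the target structure lies well below the average over the reference structures gets a strongly negative Z-score, which is the sense in which "low" scores are good here.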
Low Prosa Z-scores Correspond to
Native Structures
Definition of Adaptive Walk
Pick a structure and the corresponding "wild type" sequence.
This structure is what your sequences will "adaptively walk"
toward.
Pick some other sequence with the same AA frequencies as
globular proteins generally. Compute the PROSA Z-score for
this sequence on the above structure. If it is not lower than the
wild type Z-score, generate one-residue mutations until you
find one that has a lower Z-score than its parent sequence.
Lower Z-scores are more significant. The sequence-structure
alignments that PROSA scores are ungapped.
Repeat until you find sequences with Z-scores below the wild
type.
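The adaptive walk described above might look roughly like the following Python sketch. It is not the authors' code: the `z_score` argument is a hypothetical callable standing in for a PROSA evaluation on the target structure, `toy_z_score` is a made-up scoring function used only so the example runs, and the `max_trials` cap is an added safeguard. The sketch also starts from a fixed sequence and does not reproduce the globular-protein amino acid frequencies.

```python
import random
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutant(seq: str, rng: random.Random) -> str:
    """Return a copy of seq with one randomly chosen residue changed."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def adaptive_walk(
    start: str,
    z_score: Callable[[str], float],
    wild_type_z: float,
    rng: random.Random,
    max_trials: int = 100_000,
) -> List[str]:
    """Accept only mutants that lower the score; stop below wild_type_z."""
    path = [start]
    current, current_z = start, z_score(start)
    for _ in range(max_trials):
        if current_z < wild_type_z:
            break  # reached a score below the wild type
        candidate = point_mutant(current, rng)
        candidate_z = z_score(candidate)
        if candidate_z < current_z:
            current, current_z = candidate, candidate_z
            path.append(current)
    return path


# Toy stand-in for PROSA (hypothetical): mismatches to a hidden optimum.
HIDDEN_OPTIMUM = "ACDEFGHIKL"


def toy_z_score(seq: str) -> float:
    return float(sum(a != b for a, b in zip(seq, HIDDEN_OPTIMUM)))


walk = adaptive_walk("LLLLLLLLLL", toy_z_score, wild_type_z=2.0,
                     rng=random.Random(0))
print(len(walk) - 1, "accepted mutations; final score", toy_z_score(walk[-1]))
```

Every accepted step differs from its parent at exactly one position and has a strictly lower score, which is what makes the walk "adaptive".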
Definition of Neutral Walk
Start with a sequence found by an adaptive walk that has a
Z-score at least as low as the wild type and that is therefore
assured (at least for the purposes of this paper) of folding to
the same structure as the wild type sequence.
Make one-residue mutations until you find a second
sequence that has a Z-score at least as low as the wild type.
This becomes the current sequence.
Repeat until you hit a dead end and cannot find a mutant
with a sufficiently low Z-score.
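A minimal Python sketch of such a neutral walk follows. Several details are assumptions rather than anything stated in the slides: the walk never revisits a sequence, it tries all one-residue mutants in random order before declaring a dead end, and `max_steps` caps its length. The `z_score` argument is a hypothetical stand-in for a PROSA evaluation, and `toy_z_score` is a made-up function used only to make the example runnable.

```python
import random
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(seq: str) -> List[str]:
    """All sequences that differ from seq at exactly one position."""
    return [
        seq[:i] + aa + seq[i + 1:]
        for i in range(len(seq))
        for aa in AMINO_ACIDS
        if aa != seq[i]
    ]


def neutral_walk(
    start: str,
    z_score: Callable[[str], float],
    wild_type_z: float,
    rng: random.Random,
    max_steps: int = 25,
) -> List[str]:
    """Step between one-residue mutants scoring at least as low as wild type."""
    path = [start]
    visited = {start}  # assumption: never step back onto a visited sequence
    current = start
    for _ in range(max_steps):
        candidates = single_mutants(current)
        rng.shuffle(candidates)
        next_seq = next(
            (c for c in candidates
             if c not in visited and z_score(c) <= wild_type_z),
            None,
        )
        if next_seq is None:
            break  # dead end: no acceptable unvisited mutant
        visited.add(next_seq)
        path.append(next_seq)
        current = next_seq
    return path


# Toy stand-in for PROSA (hypothetical): mismatches to a hidden optimum.
HIDDEN_OPTIMUM = "ACDEFGHIKL"


def toy_z_score(seq: str) -> float:
    return float(sum(a != b for a, b in zip(seq, HIDDEN_OPTIMUM)))


neutral_path = neutral_walk(HIDDEN_OPTIMUM, toy_z_score, wild_type_z=3.0,
                            rng=random.Random(1))
print(len(neutral_path), "sequences visited")
```

Unlike the adaptive walk, a step here does not need to improve the score; it only has to stay at or below the wild type threshold.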
Definition of Hamming Distance in the
Context of these Papers
The authors use the term Hamming distance even though
their sequences come from the 20 letter AA alphabet. Here,
Hamming distance means the number of positions at which the two
sequences have different letters. Sequence identity is
1 - (Hamming distance / sequence length).
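As a small worked example of this definition, the Python sketch below computes the Hamming distance and the corresponding sequence identity for two equal-length sequences. The two example sequences are made up for illustration and do not come from the papers.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (ungapped)")
    return sum(x != y for x, y in zip(a, b))


def sequence_identity(a: str, b: str) -> float:
    """1 - (Hamming distance / sequence length)."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return 1.0 - hamming_distance(a, b) / len(a)


# Made-up sequences differing at two of eight positions (T/S and I/L).
seq_1 = "MKTAYIAK"
seq_2 = "MKSAYLAK"
print(hamming_distance(seq_1, seq_2))   # 2
print(sequence_identity(seq_1, seq_2))  # 0.75
```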
Prosa II Z-scores Along Adaptive Walks
Hamming Distances Between Neutral Sequences
HP Patterns in Neutral Sequences
HP Profiles of Highly Designable Sequences
Secondary Structure of Neutral Sequences
Data from Closest Approach Walks
Df is the Hamming distance between the pairs of final sequences.
D1 and D2 are the Hamming distances between the wild type and dead end
sequences in walks 1 and 2.
N is the average Hamming distance between dead end sequences from all
runs, and n is the number of residues in the proteins.
Results for Adaptive and Neutral Walks
Surprisingly, many sequences with Z-scores much better
than wild type were found.
Neutral networks seem to be very extensive and sequences
tend to have low sequence identity with each other. The
average Hamming distance between neutral net sequences
is comparable to the distance between random sequences.
Neutral network studies of RNA secondary structure indicate
that the nets typically permeate all of sequence space: there is
"shape space covering", i.e., the distance is usually small
from any randomly picked sequence to some other sequence
that folds to any arbitrary structure you might pick. The
authors claim their results indicate the same is true for
proteins.
More Results for Adaptive and Neutral Walks
As a check that neutral net sequences could actually fold to
the structure, the authors did secondary structure prediction
on the novel sequences and checked them against the
known secondary structure assignments of the wild type
sequence. The rates of agreement for neutral sequences
with good Z-scores were high.
Reduced alphabet neutral sequences (HP and ADLG) have
higher sequence identity than 20 letter sequences but still
seem to permeate sequence space.
Results from Closest Approach Walks and
Janus Protein
The Hamming distance between wild type sequences in
adjacent nets is large. The Hamming distance between the
sequences on the border between two nets is small, ~5
mutations.
The Janus sequence of Dalal has 50% sequence identity with
1PGB and 43% identity with 1ROP. 1PGB and 1ROP have
very different structures, and Janus folds to the same structure
as 1ROP. The structure of Janus was correctly predicted by
PROSA. In addition, neutral walks generated several other
sequences that have high sequence homology to 1PGB but are
predicted to fold to the structure of 1ROP.
Interesting Points Raised by This Work
The authors found sequences with Z-scores many standard
deviations below the wild type. Is this due to inaccuracies in
the PROSA potential or is stability beyond some threshold not
strongly selected for? Is robustness to mutation optimized at
the expense of stability?
It is not stated in the Babajide papers what fraction of
one-residue mutants was rejected at each point in the neutral
walk, but you can guesstimate ~80%+ from one figure. This
fraction would correspond to the fraction of mutations that are
deleterious (at least deleterious due to disruption of correct
folding) and could presumably be checked experimentally.
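One way to measure the rejected fraction directly, rather than reading it off a figure, is to score every one-residue mutant of a sequence and count those above the wild type threshold. The Python sketch below does this under stated assumptions: `z_score` is a hypothetical callable standing in for a PROSA evaluation, and `toy_z_score` is a made-up function used only so the example runs. Its numbers say nothing about the papers' results.

```python
from typing import Callable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rejected_fraction(
    seq: str,
    z_score: Callable[[str], float],
    wild_type_z: float,
) -> float:
    """Fraction of all one-residue mutants scoring above wild_type_z."""
    total = 0
    rejected = 0
    for pos, old_aa in enumerate(seq):
        for new_aa in AMINO_ACIDS:
            if new_aa == old_aa:
                continue
            mutant = seq[:pos] + new_aa + seq[pos + 1:]
            total += 1
            if z_score(mutant) > wild_type_z:
                rejected += 1
    return rejected / total


# Toy stand-in for PROSA (hypothetical): mismatches to a hidden optimum.
def toy_z_score(seq: str) -> float:
    return float(sum(a != b for a, b in zip(seq, "ACDE")))


# "ACDA" has one mismatch. Mutating any of its three matching positions
# raises the score to 2, so 57 of the 76 mutants exceed the threshold of 1.
print(rejected_fraction("ACDA", toy_z_score, wild_type_z=1.0))  # 0.75
```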
More Interesting Points Raised by This Work
The closest approach walks indicate that sequences at the
"edges" of neutral nets are separated by only ~5 mutations, but
the wild type sequences are widely separated. What is the
topology of protein sequence neutral nets? How rapidly do the
Z-scores change as you leave the neutral net, i.e., for the 1 or
2 residue mutants near the neutral sequences on the boundary
of the net?
Sequences in a neutral network tend to have low sequence
identity, as low as 10-15%. In structural genomics papers you
often see statements like "the functions of 30-50% of putative
proteins from complete genomes cannot be inferred due to low
sequence homology with known proteins". Might it be true that
most globular folds have already been found and many
sequenced but unidentified proteins are remote neutral net
neighbors of existing sequences?
References
Babajide A, Farber R, Hofacker IL, Inman J, Lapedes AS,
Stadler PF (2001): Exploring protein sequence space using
knowledge-based potentials. J Theor Biol. 212(1):35-46.
Babajide A, Hofacker IL, Sippl MJ, Stadler PF (1997): Neutral
networks in protein space: a computational study based on
knowledge-based potentials of mean force. Fold Des.
2(5):261-9.