Identification of Domains using Structural Data

Niranjan Nagarajan
Department of Computer Science
Cornell University
Assorted Definitions of Domains
• Subsequences that can fold independently
into a stable structure.
• Structurally compact substructures.
• Functionally well-defined building blocks.
• Evolutionarily conserved and reused
fragments.
Protein Structural Domain
Identification
William R. Taylor
Basic Algorithm
• Initial Assignment of Labels
– Sequential residue numbering
• Update of Labels
• Termination Condition
– Mean squared deviation of average between
successive cycles < 10^-6 or number of
iterations > (length of protein)/2
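The initialization and stopping rule above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names are hypothetical, and the "mean squared deviation" is assumed here to mean the per-residue squared change in labels averaged over the chain.

```python
def initial_labels(n_residues):
    """Sequential residue numbering: residue i starts with label i (1-based)."""
    return [float(i) for i in range(1, n_residues + 1)]


def should_stop(prev_labels, labels, iteration, tol=1e-6):
    """Return True when the labels have settled or the iteration cap is passed.

    Assumption: the deviation is the mean of squared per-residue label
    changes between two successive cycles.
    """
    n = len(labels)
    msd = sum((a - b) ** 2 for a, b in zip(labels, prev_labels)) / n
    # Cap taken from the slide: more than (length of protein)/2 iterations.
    return msd < tol or iteration > n / 2
```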
Update Formula
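• $S_i^{t+1} = S_i^t + \mathrm{step}(t+1) \cdot \mathrm{sign}\big(\sum_j f(S_i^t, S_j^t)\big)$ for all $i$.
• $\mathrm{sign}(x) = 1$ if $x > 0$, $-1$ if $x < 0$, $0$ if $x = 0$.
• $f(S_i^t, S_j^t) =$
– $r/d_{ij}$ if $S_j^t > S_i^t$ and $d_{ij} < r$.
– $-r/d_{ij}$ if $S_j^t < S_i^t$ and $d_{ij} < r$.
– $0$ otherwise.
• $\mathrm{step}(x) =$
– $1$ if $x < N/2$.
– $2(N-x)/N$ if $N/2 \le x < N$.
– $0$ otherwise.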
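One cycle of the update formula might look like the sketch below. It rests on several assumptions not stated on the slide: all labels are updated synchronously from the previous cycle's values, d_ij is the Euclidean distance between residue coordinates, and N is passed in as the parameter `n` because the slide does not define it. The function names are illustrative.

```python
import math


def sign(x):
    """1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)


def step(x, n):
    """Full step for the first half of the schedule, then a linear decay to 0."""
    if x < n / 2:
        return 1.0
    if x < n:
        return 2.0 * (n - x) / n
    return 0.0


def f(s_i, s_j, d_ij, r):
    """Inverse-distance weighted vote of neighbor j on residue i's label."""
    if d_ij >= r or s_j == s_i:
        return 0.0
    return r / d_ij if s_j > s_i else -r / d_ij


def update_labels(labels, coords, r, t, n):
    """Compute the labels for cycle t+1 from the labels at cycle t.

    Assumes distinct coordinates, so d_ij > 0 whenever i != j.
    """
    new_labels = []
    for i, (s_i, p_i) in enumerate(zip(labels, coords)):
        total = 0.0
        for j, (s_j, p_j) in enumerate(zip(labels, coords)):
            if j != i:
                total += f(s_i, s_j, math.dist(p_i, p_j), r)
        new_labels.append(s_i + step(t + 1, n) * sign(total))
    return new_labels
```

For three residues on a line at unit spacing with r = 1.5, the two end labels move toward the middle one, which receives balanced votes and stays put.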
Example
• Full lines indicate protein backbone.
• Neighboring residues within radius r are connected by
dashed lines.
• Connections between i and i + 2 have been omitted for
clarity.
• Label evolution is done without inverse distance
weighting.
Refinements
• Median-based smoothing with a window
size of 21 to reclaim short loops of 10 or
fewer residues.
• Small domains are reassigned using the
weighted mean label values of their neighbors
(weights are given by f).
• Domain recalculation is repeated at most
five times.
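The median smoothing step can be illustrated as follows. This is a sketch under assumptions: the slide does not say how chain ends are handled, so the window is simply truncated there, and `median_smooth` is a hypothetical name.

```python
from statistics import median


def median_smooth(labels, window=21):
    """Replace each label by the median of the labels in a centered window.

    Assumption: near the chain ends the window is truncated rather than padded.
    """
    half = window // 2
    n = len(labels)
    return [
        median(labels[max(0, i - half): min(n, i + half + 1)])
        for i in range(n)
    ]
```

With a window of 21, a run of 10 or fewer deviating residues is always outvoted by at least 11 surrounding residues, which is why loops of that length are absorbed into the enclosing domain while a run of 11 survives at its center.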
Preserving -sheets
• Matrix B of possible -sheet interactions
between residues generated based on
distance data and heuristics.
• Weighted mean heuristic used to generate
initial assignment of labels with the
averaging being iterated to convergence.
• Post-processing is also applied to badly broken
β-sheets.
Self-testing with fake homologs
• Fake homologs generated by smoothing
– Replacing the central atom of each consecutive triple by the triple's average.
– Process repeated five times.
• Domain assignments compared and
similarity evaluated based on overlap score.
• r optimized for best overlap score.
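The smoothing used to build a fake homolog can be sketched as below. The assumptions are that triples are consecutive positions along the chain, that the two terminal atoms are left in place, and that each pass is computed from the previous pass's coordinates. The overlap score is not defined on the slide, so it is not sketched here. Function names are illustrative.

```python
def smooth_once(coords):
    """Replace each interior point by the mean of itself and its two neighbors."""
    if len(coords) < 3:
        return list(coords)
    smoothed = [coords[0]]  # assumption: chain ends stay fixed
    for prev, cur, nxt in zip(coords, coords[1:], coords[2:]):
        smoothed.append(
            tuple((a + b + c) / 3.0 for a, b, c in zip(prev, cur, nxt))
        )
    smoothed.append(coords[-1])
    return smoothed


def fake_homolog(coords, rounds=5):
    """Apply the triple-averaging pass repeatedly (five times on the slide)."""
    for _ in range(rounds):
        coords = smooth_once(coords)
    return coords
```

Running the unchanged domain assignment on `fake_homolog(coords)` and on `coords` gives two assignments whose agreement can then be scored for a given r.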
Extension to Multiple Structures
• Algorithm is simultaneously run on
structures corresponding to a multiple
sequence alignment.
• Labels are synchronized to the average of
the labels at a position after each iteration.
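The synchronization step could look like the sketch below. The alignment representation (one column per aligned position, holding a residue index per structure or `None` for a gap) and the choice to average only over structures present in a column are assumptions made for illustration.

```python
def synchronize(labels_per_structure, alignment):
    """Set the labels at each aligned position to their average.

    labels_per_structure: one list of labels per structure.
    alignment: list of columns; column[s] is the residue index in
    structure s at that position, or None for a gap (assumed encoding).
    """
    synced = [list(labels) for labels in labels_per_structure]
    for column in alignment:
        present = [(s, idx) for s, idx in enumerate(column) if idx is not None]
        if not present:
            continue
        mean = sum(labels_per_structure[s][idx] for s, idx in present) / len(present)
        for s, idx in present:
            synced[s][idx] = mean
    return synced
```

Calling this after every label update on each structure keeps the aligned residues moving together, so the domain boundaries end up at equivalent positions.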