Algorithms for Structural Comparison and Statistical

Download Report

Transcript Algorithms for Structural Comparison and Statistical

Algorithms for Structural Comparison and
Statistical Analysis of 3D Protein Motifs
Brian Y. Chen, Viacheslav Y. Fofanov,
David M. Kristensen, Marek Kimmel,
Olivier Lichtarge, Lydia E. Kavraki
Motivation
• Understanding the function
of proteins is a fundamental
purpose of biology
• Experimental determination of protein
function is expensive and time consuming
• Algorithms for computational function
prediction could guide and accelerate
protein function discovery process
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
A Computational Approach
• Comparative Analysis
– Focus: Algorithms for Comparative Analysis
• What is similar about proteins with similar
function?
– Sequence – same components?
– Geometry – same structure?
– Dynamics – same motion?
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
A Computational Approach
• Comparative Analysis
– Focus: Algorithms for Comparative Analysis
• What is similar about proteins with similar
function?
– Sequence – same components?
– Geometry – same structure? (Same Chemistry)
– Dynamics – same motion?
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
What do we need?
• A motif for comparison
– Representative of Biological function
• An algorithm for comparison
– Search for Geometric and Chemical similarity
• Statistical analysis
– Classification of results
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Outline
• Evolutionary Trace (ET)
– A source of biologically relevant motifs using
evolutionary data
• Match Augmentation (MA)
– An algorithm for identifying geometric similarity
• Statistical analysis
– Statistically determined geometric thresholds
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
The Evolutionary Trace (ET)
Evolutionary Trace
A
G
A
G
G
G
A
G
A
A
Y
K
T
A
T
K
Y
K
A
R
R
R
K
R
K
R
Y
.
I
.
L
I
I
.
L
.
E
G
D
F
Y
A
D
C
E
C
C
C
C
C
C
C
C
C
W
K
W
L
L
K
W
L
W
position 1 2 3 4 5 6 7
Structure
G T R I A C K
alignment
G
G
G
G
+
Y
Y
T
A
R
R
R
K
I
L
L
I
G
C
F
Y
C
C
C
C
K
L
L
L
A A A . E C W
tree
A K K . D C W
A K R . D C W
A K Y . E C W
consensus X - - X - C X
rank 2 - - 4 - 1 3
rank 4
Functional site
Lichtarge et al, JMB 1996; Lichtarge et al, JMB 1997; Lichtarge et al, PNAS 1996; Sowa et al, NSB 2001
ET Clusters Functionally Relevant
Cluster Type
Galectin CRD
Dihydropteroate Synthase
Trp1 domain of Hop
Ligand
binding site
ET clusters
Structural Epitope
ET Clusters
: Yellow = ligand,
: Yellow = ligand,
Blue = Residues within 5Å of the ligand
Red = Largest Cluster, Other colors = trace residues
http://imgen.bcm.tmc.edu/molgenlabs/lichtarge/trace_of_the_week/traces.html
Geometric Motifs
• Trace Clusters are functionally relevant
• A source for geometric motifs
?
– Given a protein structure:
• Geometric Motifs
Function
• Same Amino Acids
• Same Geometry and Chemistry
– Does the protein have the same function?
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Outline
• Evolutionary Trace (ET)
– A source of biologically relevant motifs using
evolutionary data
• Match Augmentation (MA)
– An algorithm for identifying geometric similarity
• Statistical analysis
– Statistically determined geometric thresholds
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Geometric Comparison Algorithms
• Geometric Hashing
Wolfson H.J. et al. IEEE Comp. Sci. Eng., 4(4):10–21,1997.
• JESS
Barker J.A. et al. Bioinformatics, 19(13):1644-9, 2003.
• PINTS
Stark A. et al. Journal of Molecular Biology, 326:1307-16, 2003.
• Many Others
– webFEATURE, DALI, CE, SSAP…
• Our method: Match Augmentation
– Integrate Structural and Evolutionary data
– Efficient application of hashing and depth first search
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
PSB 2005
Geometric Comparison Strategy
• Biological Input:
– A structure of a functional site (Motif)
– A protein structure with unknown function (Target)
• Geometric Search:
– Find target atoms geometrically similar to motif
atoms, similar atoms and amino acids (Match)
• Output:
– Match of atoms with greatest geometric similarity
• Might potentially identify a similar functional site in the target
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Motifs: Structure & Evolution Data
• Structure of a Functional Site
– Points in three dimensions (3D)
taken from atom coordinates
(motif point)
– Labeled by residue and atom identity
– Alternate residues from mutation
{G,C,T}
Ca
• Support for complex active sites
4
– Priority-ranked motif points
• Functional relevance
3
1
2
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Input Data: Targets
• Targets
– Points in 3D taken from atom
coordinates of whole protein
structures (target points)
– Labeled by residue and atom identity
– No Alternate residues
– No ranking
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
{Y}
Cb
Search: Matching Criteria
• Geometric Similarity
<e
– points are within e when optimally
superimposed
• Label Compatibility
– Target residue label is a member of
Alternative Residues
– Atom labels identical
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
{S,L,T} Ca
{S}
Ca
Matches
Motif
• Matches correlate motif points to
target points
– Bijection
– Fulfill Geometric and Label Criteria
• Geometric Similarity measured by
Least Root Mean Squared Distance
(LRMSD)
Target
• The match we seek:
– Bijection of all motif points
– Smallest LRMSD of all matches considered
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match
Match Augmentation at a Glance
• Design Principle:
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
Input
Seed
Matching
Augmentation
Output
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match Augmentation at a Glance
• Design Principle:
Input
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
3
1
Seed
Matching
Augmentation
2
Match High Ranked Points
Output
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match Augmentation at a Glance
• Design Principle:
Input
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
3
1
Augmentation
2
3
1
Expand matches to rest of Motif
Seed
Matching
Output
2
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match Augmentation at a Glance
• Design Principle:
Input
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
3
1
Augmentation
2
3
1
Expand matches to rest of Motif
Seed
Matching
Output
2
4
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match Augmentation at a Glance
• Design Principle:
Input
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
1 3
Augmentation
2
3
1
Expand matches to rest of Motif
Seed
Matching
2
5
4
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Output
Filtering Completed Matches
• Augmentation implements a depth first search:
• Data is stored in heap of matches
• Final output: match with smallest LRMSD
LRMSD:
2.41
No more points
to match
Matches Sorted
by LRMSD
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Final Output
LRMSD: 0.87
Match Augmentation Summary
• Hybrid Algorithm
– Seed Matching: Hashing
– Augmentation: Depth First Search
• Finds matches to motifs within target
structures
– Final output: match with smallest LRMSD
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Testing MA on Biological data
• Data Set
– 12 motifs selected from residues surrounding enzymatic active
sites
– 73 targets, each evolutionarily related to one of the motifs
– Details: www.cs.rice.edu/~brianyc/papers/PSB2005/
• Experimental Protocol
– Search for each motif within every target.
• Matches of evolutionarily related motif-target pairs are “HPs” (BLUE)
• Matches of unrelated motif-target pairs are “NHPs” (RED)
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Match Augmentation Conclusions
• Match Augmentation is accurate
– Identifies cognate active sites in 95.4% of
evolutionarily related proteins
• Match Augmentation is very efficient
– Matches can be found in a fraction of a second
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Outline
• Evolutionary Trace (ET)
– A source of biologically relevant motifs using
evolutionary data
• Match Augmentation (MA)
– An algorithm for identifying geometric similarity
• Statistical analysis
– Statistically determined geometric thresholds
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Evaluating Statistical Significance
• Hypothesis Testing Framework:
– H0: Motif and Target are functionally unrelated
– HA: Motif and Target are functionally related
• Reject H0 for a given match only if the
match is unusual under H0.
• Problem: how do we evaluate the H0 for a
given match?
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
The “Usual” H0 distribution
• The set of matches between the motif and all
functionally unrelated targets
• Previous methods approximate this distribution:
– JESS
• Matches are compared to a reference population of motifs is
partially ordered by degree of occurrence
– PINTS
• Approximate the distribution of matches with an artificial
curve parameterized by motif size and residue content.
• MA can calculate this distribution explicitly by
computing matches to the entire PDB
B. Chen,•V. Fofanov,
D. Kristensen,
Kimmel,
O. Lichtarge,
L. Kavraki
JESS: Barker
J.A. etM.al.
Bioinformatics,
19(13):1644-9,
2003.
Algorithms
Structural
and Statistical
Analysis
of 3D
Protein Motifs
• for
PINTS:
StarkComparison
A. et al. Journal
of Molecular
Biology,
326:1307-16,
2003.
A Distribution of Match LRMSDs
Frequency
Smoothed
Frequency
Unsmoothed
0
1
2
LRMSD
3
4
0
1
2
LRMSD
3
4
5
• LRMSD distribution of matches with entire PDB
– Almost all known protein structures
– Almost no functional relation to a our motifs
• Reasonable H0 Distribution
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
How unusual is our match?
• We want: the probability of observing a match
with lower LRMSD than given match
1.2
A: Area left of line
matches with lower LRMSD
estimated density
1.0
B: Area under curve
0.4Frequency
0.6
0.8
matches total
A
p-Value:
B
0.0
0.2
p=
0
1
2
LRMSD
RMSD
3
4
Match LRMSD
5
A
B
• Apply P-value to
reject H0
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Statistical Significance
• Result: Data driven statistical significance value (p-value)
– No dependence on approximations like previous work
• p-value of a match tells us the probability of observing
another match with lower LRMSD, with a functionally
unrelated target
• Apply p-value to reject H0
• Do matches identifying cognate active sites (HPs) have
low p-values? (i.e. Can we reject H0 for HPs?)
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Testing our Statistical Analysis
• Distributions of matches over the PDB can be calculated
efficiently
– 12:48 on a single machine, on average
• Do not have to scan the entire PDB to accurately
determine the H0 distribution
– 5% random sample accurate enough
– Reduces sample time to 0:38, on average
• Matches of cognate active sites (HPs) are statistically
significant (low p-values)
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Conclusions
• Match Augmentation is accurate and extremely efficient
– Correctly identifies cognate active sites (HPs)
– Identifies matches in fractions of a second
• Algorithmic efficiency enables detailed Statistical Analysis
– Explicitly calculate H0 distribution without dependence on
approximated H0 distributions
– Matches of cognate active sites (HPs) are statistically significant
– Significance threshold translates into useful motif-specific LRMSD
thresholds
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Special Thanks
• Kavraki Group
–
–
–
–
–
–
–
–
–
David Schwarz
Amarda Shehu
Allison Heath
Hernan Stamati
Anne Christian
Drew Bryant
Amanda Cruess
Brad Dodson
Jessica Wu
• Lichtarge Lab
–
–
–
–
David Kristensen
Dan Morgan
Ivana Mihalek
Hui Yao
• Kimmel Group
– Viacheslav
Fofanov
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
• Funding
– NSF
– NLM
5T15LM07093
– March of Dimes
– Whitaker
Foundation
– Sloan
Foundation
– VIGRE
– AMD