Direct-Coupling Analysis (DCA)
Direct-Coupling Analysis (DCA) and Its
Applications in Protein Structure and
Protein-Protein Interaction Prediction
Wang Yang
2014.1.3
Outline
1. The molecular co-evolution phenomenon
2. Applications of co-evolution in protein structure
prediction and PPI prediction
3. Co-evolution measurement:
− Local model: mutual information (MI) as a measure of coupling.
− Global model: direct coupling analysis (DCA).
4. Principle of direct coupling analysis (DCA)
5. Summary
2
Molecular Co-evolution
What is molecular co-evolution?
Two (or more) genes/proteins/residues:
1) exert selective pressures on each other
2) evolve in response to each other
• Molecular co-evolution can be due to specific co-adaptation
between the two co-evolving elements, where changes in one of
them are compensated by changes in the other, or to a less
specific external force affecting the evolutionary rates of both
elements to a similar degree.
• Co-evolutionary signatures between proteins serve as markers of
physical interactions and/or functional relationships.
• For this reason, computational methods have emerged for studying
co-evolution at the protein or residue level, so as to predict features
such as protein-protein interactions, residue contacts within
protein structures and protein functional sites.
3
Native contact by co-evolution analysis
Co-evolution information for protein structure prediction
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.;
Sander, C. PLoS One 2011, 6, e28766.
4
Use Co-evolution to predict protein 3D structure
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14, 249-61.
5
Groups of co-evolving residues are implicated in functional
specificity and structure–function coordination
Specificity-determining positions (SDPs) are groups of positions that coordinately
mutate in the context of subfamily divergence.
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14, 249-61.
6
Co-evolution measurement
Local statistical model: calculate the correlation of each residue pair (i, j) in
the multiple sequence alignment independently.
Mutual Information (MI):
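For reference, a standard form of the MI between alignment columns i and j. The notation is assumed from Morcos et al. (2011) rather than taken from this slide: f_i(A) is the frequency of amino acid A in column i, and f_ij(A,B) is the frequency of the pair (A,B) in columns i and j.

```latex
MI_{ij} = \sum_{A,B} f_{ij}(A,B)\,\ln\frac{f_{ij}(A,B)}{f_i(A)\,f_j(B)}
```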
Global statistical model: the coupling of the pair i and j depends on the rest of
the alignment. The aim is to compute a set of direct residue couplings that best
explains all pair correlations observed in the multiple sequence alignment.
Direct Information (DI):
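For reference, DI as defined in Morcos et al. (2011), with the notation assumed from that paper: it has the same form as MI, but the observed pair frequencies are replaced by a two-site model P^dir that keeps only the direct coupling e_ij. The auxiliary fields (written with a tilde) are chosen so that the single-site marginals of P^dir equal f_i and f_j.

```latex
DI_{ij} = \sum_{A,B} P^{\mathrm{dir}}_{ij}(A,B)\,
          \ln\frac{P^{\mathrm{dir}}_{ij}(A,B)}{f_i(A)\,f_j(B)},
\qquad
P^{\mathrm{dir}}_{ij}(A,B) = \frac{1}{Z_{ij}}
          \exp\!\big(e_{ij}(A,B) + \tilde h_i(A) + \tilde h_j(B)\big)
```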
7
Shortcomings of Local statistical model
Correlation in amino acid substitution may arise from direct as
well as indirect interactions. Local covariance methods are unable
to distinguish between direct and indirect correlation.
However:
1. All direct interactions are contained in the local correlations.
2. All detected correlations in substitutions are generated by the set of direct interactions.
[Figure: three residues A, B and C]
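A toy illustration of the problem, as a minimal sketch. It assumes a synthetic three-column alignment (not real protein data) in which column B is a noisy copy of A and column C is a noisy copy of B, so A and C are coupled only through B. The function name, the four-letter alphabet and the 80% copy rate are all hypothetical choices.

```python
import numpy as np

def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns of equal length."""
    col_i = np.asarray(col_i)
    col_j = np.asarray(col_j)
    mi = 0.0
    for a in np.unique(col_i):
        f_a = np.mean(col_i == a)
        for b in np.unique(col_j):
            f_b = np.mean(col_j == b)
            f_ab = np.mean((col_i == a) & (col_j == b))
            if f_ab > 0:  # empty cells contribute nothing
                mi += f_ab * np.log(f_ab / (f_a * f_b))
    return mi

rng = np.random.default_rng(0)
alphabet = np.array(list("ACDE"))  # toy 4-letter alphabet (assumption)
M = 5000                           # number of toy sequences (assumption)

def noisy_copy(col, keep=0.8):
    """Keep each symbol with probability `keep`, otherwise draw a random one."""
    keep_mask = rng.random(col.size) < keep
    return np.where(keep_mask, col, rng.choice(alphabet, size=col.size))

col_a = rng.choice(alphabet, size=M)
col_b = noisy_copy(col_a)   # direct coupling A-B
col_c = noisy_copy(col_b)   # direct coupling B-C, none between A and C

mi_ab = mutual_information(col_a, col_b)
mi_ac = mutual_information(col_a, col_c)
print(f"MI(A,B) = {mi_ab:.2f}, MI(A,C) = {mi_ac:.2f}")
```

MI(A,C) comes out clearly above zero even though the only path from A to C runs through B, which is exactly the direct/indirect ambiguity described on this slide.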
8
Direct information VS. Mutual information
Intradomain contacts prediction using DI and MI pairs.
9
Direct information VS. Mutual information
Intradomain contact (≤ 8 Å) prediction using DI and MI pairs.
Interdomain contact prediction using DI and MI pairs.
Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, D. S.; Sander,
C.; Zecchina, R.; Onuchic, J. N.; Hwa, T.; Weigt, M. Proc Natl Acad
Sci U S A 2011, 108, E1293-301.
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.;
Hwa, T. Proc Natl Acad Sci U S A 2009, 106, 67-72.
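Comparisons of this kind are usually summarised as the fraction of top-ranked pairs that are true contacts. A minimal sketch of that measure follows. The 8 Å cutoff is the one quoted on this slide; the function name, the minimum sequence separation and the toy distance matrix are assumptions made for illustration.

```python
import numpy as np

def contact_precision(scores, distances, top_n, cutoff=8.0, min_separation=5):
    """Fraction of the top_n highest-scoring pairs with distance <= cutoff.

    scores, distances: symmetric (L x L) arrays.
    min_separation: smallest |i - j| considered (hypothetical default).
    """
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_separation)      # pairs with j - i >= min_separation
    order = np.argsort(scores[i, j])[::-1][:top_n]   # best-scoring pairs first
    hits = distances[i[order], j[order]] <= cutoff
    return hits.mean()

# Toy example: six residues on a straight line, 3.8 Å apart.
coords = 3.8 * np.arange(6)
distances = np.abs(coords[:, None] - coords[None, :])
scores = -distances                  # closer pairs score higher
scores[0, 5] = scores[5, 0] = 10.0   # one deliberate false positive

precision = contact_precision(scores, distances, top_n=4, min_separation=2)
print(precision)
```

In this toy case three of the four top-ranked pairs lie within 8 Å, and the planted false positive accounts for the remaining one.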
10
Principle of DCA
The goal is to find a minimal set of pair interactions that, through
transitivity, reproduces all the observed pair correlations.
More precisely, we seek a general model, the full joint probability
distribution P(A1…AL) that a particular amino acid
sequence A1…AL is a member of the family under
consideration, such that the marginal probabilities Pij(Ai,Aj) for pair
occurrences are consistent with the observations from the MSA:
Where:
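The consistency condition can be written as below. The notation is assumed from Morcos et al. (2011): f_i and f_ij are the empirical single-site and pair frequencies of the MSA, in the simplest case plain counts over the M aligned sequences (the cited papers additionally reweight sequences and add pseudocounts).

```latex
P_{ij}(A_i,A_j) \equiv \sum_{\{A_k \,:\, k \neq i,j\}} P(A_1,\dots,A_L) = f_{ij}(A_i,A_j),
\qquad
P_i(A_i) \equiv \sum_{\{A_k \,:\, k \neq i\}} P(A_1,\dots,A_L) = f_i(A_i)
```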
11
Maximum-entropy Modeling
Information is the reduction of uncertainty.
When you have only limited information, the best and safest
guess is to model all that is known and assume nothing about
what is unknown:
− satisfy the set of constraints that must hold
− among the distributions that do, choose the most “uniform” one,
i.e. the one with maximum entropy
For our case:
Constraints:
Maximum S:
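In symbols, assuming the standard Shannon entropy and the frequency notation of Morcos et al. (2011), the quantity to maximize and the constraints it is subject to are:

```latex
S = -\sum_{A_1,\dots,A_L} P(A_1,\dots,A_L)\,\ln P(A_1,\dots,A_L)
\quad\text{subject to}\quad
\sum_{A_1,\dots,A_L} P(A_1,\dots,A_L) = 1,\;\;
P_i(A) = f_i(A),\;\;
P_{ij}(A,B) = f_{ij}(A,B)
```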
12
Why can maximum-entropy modeling find the direct
interactions (couplings)?
Strictly speaking, we are not finding the direct couplings!
What we observe in the MSA (mutual information) is self-redundant.
We are finding the minimum set of couplings from which all
observed correlations can be deduced, i.e. the model that reflects
all our observations.
Because we keep all pairwise couplings as low as possible,
indirect couplings are removed! No additional assumption
(information) is added to the system, so our guess carries the
lowest risk!
13
Maximization of the entropy
Constraints:
Our goal is to maximize S while satisfying all the constraints:
Lagrange multipliers
Where Z is the partition function:
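A sketch of the derivation, under the assumption that the slide follows the standard treatment: one Lagrange multiplier is introduced per constraint (α for normalisation, h_i(A) for the single-site marginals, e_ij(A,B) for the pair marginals), and the Lagrangian is made stationary with respect to each P(A_1,…,A_L).

```latex
\mathcal{L} = S
  + \alpha\Big(\sum_{A_1,\dots,A_L} P(A_1,\dots,A_L) - 1\Big)
  + \sum_{i}\sum_{A} h_i(A)\,\big(P_i(A) - f_i(A)\big)
  + \sum_{i<j}\sum_{A,B} e_{ij}(A,B)\,\big(P_{ij}(A,B) - f_{ij}(A,B)\big)

\frac{\partial \mathcal{L}}{\partial P(A_1,\dots,A_L)} = 0
\;\Longrightarrow\;
P(A_1,\dots,A_L) = \frac{1}{Z}\exp\Big(\sum_{i<j} e_{ij}(A_i,A_j) + \sum_{i} h_i(A_i)\Big)

Z = \sum_{A_1,\dots,A_L} \exp\Big(\sum_{i<j} e_{ij}(A_i,A_j) + \sum_{i} h_i(A_i)\Big)
```

The sum in Z runs over all possible sequences of length L, so it cannot be evaluated exactly for realistic protein lengths.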
• MCMC sampling
• Message passing sampling
• Mean field approximation
• …
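Of the approaches listed, the mean-field approximation is the simplest to sketch: the couplings are estimated as the negative inverse of the connected correlation matrix, and DI is then computed pair by pair, following the approach of Morcos et al. (2011). This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the small pseudocount and the toy three-column chain alignment A-B-C are all hypothetical, and no sequence reweighting is applied.

```python
import numpy as np

def mean_field_couplings(msa, q, pseudocount=0.05):
    """Mean-field couplings e_ij from an integer-coded MSA (M x L, states 0..q-1)."""
    M, L = msa.shape
    onehot = np.eye(q)[msa]                                # (M, L, q)
    fi = onehot.mean(axis=0)                               # (L, q)
    fij = np.einsum("mia,mjb->ijab", onehot, onehot) / M   # (L, L, q, q)
    # Pseudocount regularisation keeps the correlation matrix invertible.
    fi = (1 - pseudocount) * fi + pseudocount / q
    fij = (1 - pseudocount) * fij + pseudocount / q**2
    for i in range(L):
        fij[i, i] = np.diag(fi[i])                         # same-site blocks
    # Connected correlations; the last state is dropped to fix the gauge.
    C = fij - fi[:, None, :, None] * fi[None, :, None, :]
    n = L * (q - 1)
    C = C[:, :, :q - 1, :q - 1].transpose(0, 2, 1, 3).reshape(n, n)
    C_inv = np.linalg.inv(C).reshape(L, q - 1, L, q - 1).transpose(0, 2, 1, 3)
    e = np.zeros((L, L, q, q))
    e[:, :, :q - 1, :q - 1] = -C_inv
    return e, fi

def direct_information(e, fi, n_iter=200):
    """DI for every pair: two-site model with coupling e_ij and marginals f_i, f_j."""
    L, q = fi.shape
    di = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            W = np.exp(e[i, j])
            u, v = np.ones(q), np.ones(q)
            for _ in range(n_iter):        # iterative scaling to match the marginals
                u = fi[i] / (W @ v)
                v = fi[j] / (W.T @ u)
            P = u[:, None] * W * v[None, :]
            P /= P.sum()
            di[i, j] = di[j, i] = np.sum(P * np.log(P / np.outer(fi[i], fi[j])))
    return di

# Toy chain A-B-C: B is a noisy copy of A, C is a noisy copy of B.
rng = np.random.default_rng(0)
M, q = 20000, 4

def noisy_copy(col, keep=0.8):
    keep_mask = rng.random(col.size) < keep
    return np.where(keep_mask, col, rng.integers(0, q, size=col.size))

a = rng.integers(0, q, size=M)
b = noisy_copy(a)
c = noisy_copy(b)
msa = np.stack([a, b, c], axis=1)

e, fi = mean_field_couplings(msa, q)
di = direct_information(e, fi)
print(np.round(di, 3))
```

On this toy chain the DI of the indirect pair (A, C) should come out far smaller than that of the direct pairs (A, B) and (B, C). The small pseudocount suits this large four-state toy alignment; real protein families generally need heavier regularisation.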
14
Example 1: Use DI to predict the protein-protein
binding interface
Low MI implies low DI, but high MI does not necessarily imply high DI.
15
Example 1: Use DI to predict the protein-protein
binding interface
High-DI pairs correspond to physical interactions.
Direct Information is inversely correlated with the residue
distance of pairs in the Spo0B/Spo0F co-crystal structure.
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. Proc Natl Acad Sci U S A 2009, 106, 67-72.
16
Example 2: Use DI to predict protein 3D structure
without using template information
Top-ranked predicted structures can make correct contacts in the absence of
constraints and avoid incorrect contacts in spite of false-positive constraints.
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.;
Sander, C. PLoS One 2011, 6, e28766.
17
How many distance constraints are needed for
fold prediction?
18
When would it have been possible to fold from
sequence?
19
Summary
1. Until recently, co-evolution information had not been used effectively.
Local statistical models are not good enough.
2. DCA is a powerful global statistical model for finding direct interactions.
3. However, high-scoring pairs are not always spatially close. They can
arise from statistical background noise (e.g. low statistical resolution in
the empirical correlations due to an insufficient number of proteins in
the family, or global correlations from phylogenetic bias in the
frequency counts) or from functional constraints, such as those
imposed by protein-protein or protein-ligand interactions.
4. Even so, DCA can be used to improve current structure prediction
and refinement, and to identify potential binding proteins!
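One common remedy for the phylogenetic bias mentioned in point 3 is to down-weight groups of near-identical sequences before counting frequencies. The sketch below assumes the scheme used in the DCA literature, where each sequence is weighted by the inverse of the number of sequences at or above an identity threshold. The function name, the 0.8 default threshold and the toy alignment are illustrative assumptions.

```python
import numpy as np

def sequence_weights(msa, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of sequences at least this similar to it)."""
    msa = np.asarray(msa)
    # Fraction of identical positions for every pair of sequences (M x M).
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    neighbours = (identity >= identity_threshold).sum(axis=1)  # counts the sequence itself
    return 1.0 / neighbours

# Toy alignment: the first two sequences are identical, the others are distinct.
msa = np.array([
    [0, 1, 2, 3, 0],
    [0, 1, 2, 3, 0],
    [0, 1, 2, 0, 1],
    [3, 2, 1, 0, 2],
])
weights = sequence_weights(msa)
m_eff = weights.sum()   # effective number of sequences
print(weights, m_eff)
```

The duplicated pair shares a total weight of one, so the four sequences count as three effective sequences when frequencies are computed.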
Thank you very much!
20