Transcript slides

CRB Journal Club
February 13, 2006
Jenny Gu
Selected for a Reason
•
Residues are selected by evolution for a reason, but conservation alone
does not distinguish between function and stability.
•
To disentangle functional from structural constraints, predicted
sequence profiles generated for structural stability are compared to
naturally occurring sequence profiles.
•
Incorporates two additional measures, free energy and sequence
profile difference, alongside residue conservation to identify
functional residues.
Datasets
•
Enzyme Active Site Set
(Suspicious. What about cross fold validation?)
Conservation Score Distribution
Sequence conservation score
calculated by SCORECONS with
multiple sequence alignment from
MUSCLE
A) All Residue Sites
B) Enzyme Active Site Set
Calculating Difference between Profiles
Designed Sequence Profiles
Rosetta design program
Generate 40 protein sequences
that are stable for the structure. Align with
PSI-BLAST to generate a position-
specific scoring matrix (PSSM)
Natural Sequence Profiles
PSSM from PSI-BLAST
Euclidean distance rescaled between:
0 (high similarity)
1 (low similarity)
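A minimal sketch of the profile comparison, assuming each profile is a list of per-position rows with one value per amino acid, and assuming the rescaling is a simple min-max over positions (the slides do not say how the rescaling was done). The function names and the toy three-column rows are hypothetical; real PSSM rows have 20 columns.

```python
import math

def profile_distances(natural, designed):
    """Euclidean distance between matching rows of two sequence profiles."""
    if len(natural) != len(designed):
        raise ValueError("profiles must cover the same positions")
    return [math.dist(n_row, d_row) for n_row, d_row in zip(natural, designed)]

def rescale_unit(values):
    """Min-max rescale to [0, 1] (assumed scheme, not confirmed by the slides)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy profiles: 3 positions, 3 columns each (illustration only).
natural = [[0.9, 0.05, 0.05], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]
designed = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]

scaled = rescale_unit(profile_distances(natural, designed))
# Position 0 has identical rows (0, high similarity);
# position 2 differs the most (1, low similarity).
```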
Difference between Natural and Designed Sequence Profiles
A) All residues in active sites.
B) Functional residues in active sites
Differences between profiles are
rescaled such that:
0 - High Similarity
1 - Low similarity
In other words:
Selection for function vs. stability
0 - Low selection
1 - High selection
Calculating Native/Optimal Residue Energy Difference
1. Use the Rosetta ΔG module to calculate free energy changes for each of the 20
amino acid substitutions at each position.
2. Compare to the native ΔG. If functional constraints are imposed, there
should be a large gap between the two ΔG values.
Rosetta ΔG
Originally developed to identify binding interface hot spots.
Model based on all-atom rotamer description of side chains with energy
function dominated by Lennard Jones interactions, solvation
interactions, and hydrogen bonding.
Distribution of Free Energy Difference
Difference between free energy of
naturally occurring residue and
energetically most favorable
residue. (kcal/mol)
A) All residues in active sites.
B) Functional residues in active sites.
In other words:
Positions with smaller differences
have been selected for stability.
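A minimal sketch of the native/optimal energy gap at one position, assuming the per-residue energies (kcal/mol) have already been computed elsewhere. The function name and the toy energies are hypothetical, and only three residue types are shown instead of 20.

```python
def native_optimal_gap(energies, native):
    """Return (gap, best_residue) for one position.

    energies: dict mapping amino acid letter -> energy in kcal/mol.
    gap = energy of native residue minus energy of the most favorable residue.
    """
    best_residue = min(energies, key=energies.get)
    return energies[native] - energies[best_residue], best_residue

# Toy energies for one position (illustration only, not real Rosetta output).
position_energies = {"D": 2.5, "L": -1.0, "A": 0.0}

gap_functional, best = native_optimal_gap(position_energies, "D")
gap_stable, _ = native_optimal_gap(position_energies, "L")
```

A large gap (native "D" here) suggests the native residue is kept for a reason other than stability; a gap of zero (native "L") means the native residue is already the most favorable one.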
Residue Classification
Combine:
1. Sequence Conservation
2. Profile Difference (Natural vs. Designed)
3. Residue Free Energy Changes (Natural vs Optimal)
To classify functional vs. nonfunctional residues.
A logistic regression (linear model) module was used to determine the weights for
the input features.
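A minimal sketch of how three such features could be combined with logistic regression, written in plain Python with batch gradient descent. This is not the module used in the paper; the function names, learning rate, epoch count, and toy training rows are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Probability that a residue is functional."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy rows: (conservation, profile difference, scaled energy gap). Made-up values.
rows = [
    [0.90, 0.8, 0.9], [0.80, 0.9, 0.7], [0.95, 0.7, 0.8],  # functional
    [0.90, 0.1, 0.1], [0.30, 0.2, 0.2], [0.50, 0.3, 0.1],  # nonfunctional
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
p_functional = predict(weights, bias, [0.9, 0.8, 0.9])
p_conserved_only = predict(weights, bias, [0.9, 0.1, 0.1])
```

In the toy data, the fourth row is highly conserved but not functional, which is the kind of false positive that conservation alone cannot rule out.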
Classification Performance
Largest improvement observed with
free energy measures.
Inclusion of profile difference with
free energy measures resulted in minor
improvements.
Combined measures reduce false
positives.
Chymosin B
Sequence Conservation Only
Combined Measures
Arginine Kinase
Sequence Conservation Only
Combined Measures
Testing Generality
Dataset 2 includes ligand
binding sites
Comparison to another predictor
Sources of Error
•
Sensitivity to multiple alignment quality.
•
Loop regions are difficult to align.
•
Functionally important residues can contribute to stability.
•
Suggested Improvements:
•
Better multiple sequence alignments.
•
Spatial clustering of high scoring residues.
•
Introducing backbone flexibility into energy calculations.
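One way the spatial clustering suggestion could look, as a rough sketch only: group residues whose score passes a cutoff and that lie within a distance cutoff of another selected residue. The function name, both cutoffs, and the toy coordinates are assumptions; the slides do not specify a clustering method.

```python
import math

def cluster_high_scoring(coords, scores, score_cutoff=0.5, distance_cutoff=8.0):
    """Single-linkage clusters of residues with score >= score_cutoff.

    coords: list of (x, y, z) per residue; scores: list of per-residue scores.
    Returns lists of residue indices, largest cluster first.
    """
    unvisited = {i for i, s in enumerate(scores) if s >= score_cutoff}
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(coords[current], coords[j]) <= distance_cutoff]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                frontier.append(j)
        clusters.append(sorted(cluster))
    return sorted(clusters, key=len, reverse=True)

# Toy coordinates in angstroms and made-up scores (illustration only).
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (30, 0, 0), (4, 1, 0)]
scores = [0.9, 0.8, 0.2, 0.9, 0.7]

clusters = cluster_high_scoring(coords, scores)
```

An isolated high-scoring residue (index 3 here) ends up in its own cluster and could be treated as a likelier false positive than residues that group together.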
Other Approaches - Extracting from Sequence Design
1) Design procedure based on Monte Carlo simulation of amino acid
substitution process.
2) Fixed substitutions based on scoring function from template structure
and multiple alignment of homologs.
Other Approaches - Using Protein Homology Information
1) Identify a high degree of conservation between homologous proteins.
2) Use information theory to identify positions where environment-specific
substitution tables make poor predictions of the overall amino acid
substitution pattern.
3) Identify residues with highly conserved positions when structures of a
homologous family are superposed.
Interest in this Paper
•
Distinguishing between functional and structural constraints.
•
Designing sequences and subsequent profiles allows us to explore an
enlarged sequence space that is not captured by natural sequences.
Questions:
From an evolutionary perspective:
1) How does structure limit the exploration of sequence space?
2) How is sequence space expanded with structure change?
3) How do selective pressures for molten globules, flexible regions, and
disordered structures impact the sequence space?
Current Domain Coverage of Genome
Current perspective:
Ab initio structure evolution is difficult
now that a system of checks and balances
is in place.
Evolution of the current protein repertoire
is largely attributed to recombination of
existing folds.
Reaching beyond structural genomics? ….
•
With known structures:
•
Use a Hidden Markov Model (HMM) or profile of each domain to identify it in
genomes.
•
Evolutionary plasticity is greater for loop regions than for the core.
•
Work has been done in this area.
•
With unknown structures:
•
Can we design a structure not currently in the PDB and identify it in nature?
With structures that nature “hasn’t seen before”.
•
De novo structure designed in 2003.
•
Maybe it already exists in nature; we just don’t know about it yet.
•
And if it doesn’t exist, is it just a proof of principle, or can we actually do
something with it?