Identifying and Modeling
Selection Pressure
(a review of three papers)
Rose Hoberman
BioLM seminar
Feb 9, 2004
Today
• McClellan and McCracken: Estimating the Influence of
Selection on the Variable Amino Acid Sites of the
Cytochrome b Protein Functional Domains
• Dagan et al: Ratios of Radical to Conservative Amino
Acid Replacement are Affected by Mutational and
Compositional Factors and May Not be Indicative of
Positive Darwinian Selection
• Halpern and Bruno: Evolutionary Distances for Protein-Coding
Sequences: Modeling Site-Specific Residue Frequencies
Types of Selection
• negative purifying selection
– non-synonymous codon changes are selected
against
• neutral selection
– non-synonymous changes in codons have an
equivalent probability of elimination or fixation
• positive diversifying selection
– non-synonymous codon changes are selected for
Identifying Regions Under
Selective Pressure
• ds/dn << 1 (positive selection) and ds/dn >> 1 (purifying
selection) are commonly used as indicators
• synonymous substitutions become
saturated more quickly than nonsynonymous ones
• compare conservative/radical substitution
ratio to expected distribution under neutral
model
A “conservative” definition
• Cluster amino acids according to physicochemical properties
– Charge
– Volume
– Polarity
– Grantham’s distance
– ...
• Within-class = conservative
• Across-class = radical
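The within-class versus across-class rule can be sketched in a few lines. This is a minimal illustration that assumes a three-way charge grouping (positive R/H/K, negative D/E, everything else neutral); the grouping and the function name `classify_replacement` are illustrative choices, not taken from any of the three papers.

```python
# Assumed charge classes: positive (R, H, K), negative (D, E), neutral (the rest).
CHARGE_CLASS = {}
for aa in "RHK":
    CHARGE_CLASS[aa] = "positive"
for aa in "DE":
    CHARGE_CLASS[aa] = "negative"
for aa in "ACFGILMNPQSTVWY":
    CHARGE_CLASS[aa] = "neutral"


def classify_replacement(aa_from, aa_to, classes=CHARGE_CLASS):
    """Label an amino acid replacement under a given class map (hypothetical helper)."""
    if aa_from == aa_to:
        return "identical"
    if classes[aa_from] == classes[aa_to]:
        return "conservative"  # within-class
    return "radical"           # across-class


print(classify_replacement("L", "M"))  # both neutral
print(classify_replacement("M", "R"))  # neutral to positive
```

Swapping in a different class map (volume, polarity) changes which replacements count as radical, which is why the choice of property matters for the tests discussed later.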
Assessing Substitution Rates
• 2 sequences
– average over all possible pathways between two
codons
– TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg)
• Many sequences
– Build a phylogenetic tree
– Infer most likely ancestral sequences
– Count synonymous and nonsynonymous
substitutions
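For the two-sequence case, averaging over pathways can be sketched as follows. This is a simplified illustration, assuming the standard genetic code, that only shortest pathways are considered, and that pathways passing through a stop codon are discarded; the function name `average_changes` is hypothetical.

```python
from itertools import permutations

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def average_changes(codon1, codon2):
    """Average (synonymous, nonsynonymous) step counts over all shortest
    pathways from codon1 to codon2, skipping pathways that hit a stop codon."""
    diff = [i for i in range(3) if codon1[i] != codon2[i]]
    syn_total, nonsyn_total, n_paths = 0, 0, 0
    for order in permutations(diff):
        current, syn, nonsyn, valid = codon1, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == "*":
                valid = False  # assumed rule: discard pathways through stops
                break
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if valid:
            syn_total += syn
            nonsyn_total += nonsyn
            n_paths += 1
    if n_paths == 0:
        return (0.0, 0.0)
    return (syn_total / n_paths, nonsyn_total / n_paths)


print(average_changes("TTG", "AGA"))  # (0.75, 2.25)
```

The pathway on the slide, TTG - ATG - AGG - AGA, is one of the four valid orderings for this pair; the other two orderings pass through the stop codon TGA and are dropped under the assumption above.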
Cytochrome b Gene Evolution
• Matrix and Transmembrane regions have
comparable rates of change
• Intermembrane region has lower rate of change
(McClellan and McCracken)
Group Non-Syn Mutations
• 5 Properties
• 4 Groups
• Neutral model
– based only on codon frequencies
• Chi-squared test
– observed vs. expected (given domain amino acid frequencies)
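The observed-versus-expected comparison is a Pearson chi-squared statistic. The sketch below uses toy counts invented purely for illustration (they are not data from the paper) and four categories to mirror the four groups on the slide; `chi_squared` is a hypothetical helper.

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic for paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy numbers, made up for illustration only.
observed = [30, 10, 40, 20]
expected_freq = [0.25, 0.25, 0.25, 0.25]   # stand-in for a neutral model
total = sum(observed)
expected = [f * total for f in expected_freq]

stat = chi_squared(observed, expected)
dof = len(observed) - 1
print(stat, dof)  # 20.0 3
```

With 3 degrees of freedom the 5% critical value is about 7.81, so these toy counts would reject the stand-in neutral expectation. In the real analysis the expected counts would come from the neutral model described on the slide.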
Question
• Do factors unrelated to selection affect the
radical/conservative ratio?
– nucleotide frequencies
• e.g. GC content
– transition/transversion ratio
• transitions (A<->G and C<->T) are more common than transversions
– distances between amino acids
• genetic code
– codon biases
• due to tRNA availability, energy usage, or pathogen avoidance
– amino acid frequencies
• ??
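One way to probe how the transition/transversion ratio and the genetic code interact is to enumerate every single-nucleotide change between sense codons and tally its effect. This is a rough sketch, assuming the standard genetic code and a charge-based definition of conservative; `tally_single_changes` is a hypothetical name and no particular outcome is asserted here.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
PURINES = {"A", "G"}
# Assumed charge classes; amino acids not listed are treated as neutral ("0").
CHARGE = {**{aa: "+" for aa in "RHK"}, **{aa: "-" for aa in "DE"}}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def tally_single_changes():
    """Count effects of all single-nucleotide changes between sense codons."""
    effects = ("synonymous", "conservative", "radical")
    counts = {kind: dict.fromkeys(effects, 0)
              for kind in ("transition", "transversion")}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                new_aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
                if new_aa == "*":
                    continue  # changes to stop codons are ignored here
                kind = "transition" if is_transition(codon[pos], base) else "transversion"
                if new_aa == aa:
                    effect = "synonymous"
                elif CHARGE.get(aa, "0") == CHARGE.get(new_aa, "0"):
                    effect = "conservative"
                else:
                    effect = "radical"
                counts[kind][effect] += 1
    return counts


for kind, row in tally_single_changes().items():
    print(kind, row)
```

Comparing the conservative-to-radical proportions in the two rows gives a feel for how a mutational bias alone could shift the ratio, before any selection is involved.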
An Initial Test
• 3 proteins: Hemoglobin, Interleukin,
Ribosomal protein
• Simulated neutral evolution using
substitution matrix built from pseudogenes
• Tested for selection pressure
– volume/polarity: 100% FP
– grantham: 13-21% FP
– charge: 0% FP
(Dagan et al)
Simulation Study
• Generate virtual ancestral sequence
– 300 nt long
• Set mutational/compositional parameters
• Simulate evolution (ROSE software)
– 50 substitutions
• Calculate conservative/radical ratio
• Each parameter set simulated 50 times
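The steps above can be sketched as a toy simulation. This is not the ROSE software and not the authors' procedure: it assumes a uniformly random ancestral sequence, a single bias parameter `kappa` (the weight of the transition relative to each transversion), rejection of changes that create stop codons, and a charge-based definition of conservative. All names are hypothetical.

```python
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
PURINES = {"A", "G"}
CHARGE = {**{aa: "+" for aa in "RHK"}, **{aa: "-" for aa in "DE"}}


def random_sense_codon(rng):
    """Draw a uniformly random codon that is not a stop codon."""
    while True:
        codon = "".join(rng.choice(BASES) for _ in range(3))
        if CODON_TABLE[codon] != "*":
            return codon


def mutate_base(base, kappa, rng):
    """Pick a new base; the transition gets weight kappa, each transversion 1."""
    others = [b for b in BASES if b != base]
    weights = [kappa if (b in PURINES) == (base in PURINES) else 1.0
               for b in others]
    return rng.choices(others, weights=weights)[0]


def simulate(n_codons=100, n_subs=50, kappa=2.0, seed=0):
    """Apply n_subs neutral substitutions to a 3*n_codons nt sequence."""
    rng = random.Random(seed)
    codons = [random_sense_codon(rng) for _ in range(n_codons)]
    counts = {"synonymous": 0, "conservative": 0, "radical": 0}
    done = 0
    while done < n_subs:
        i, pos = rng.randrange(n_codons), rng.randrange(3)
        old = codons[i]
        new = old[:pos] + mutate_base(old[pos], kappa, rng) + old[pos + 1:]
        if CODON_TABLE[new] == "*":
            continue  # assumed rule: reject changes that create a stop codon
        a, b = CODON_TABLE[old], CODON_TABLE[new]
        if a == b:
            counts["synonymous"] += 1
        elif CHARGE.get(a, "0") == CHARGE.get(b, "0"):
            counts["conservative"] += 1
        else:
            counts["radical"] += 1
        codons[i] = new
        done += 1
    return counts


# 50 replicates per parameter set, as on the slide.
runs = [simulate(kappa=2.0, seed=s) for s in range(50)]
print(runs[0])
```

Repeating this for several values of `kappa` (or other compositional parameters) and comparing the conservative and radical counts across parameter sets is the kind of input an ANOVA would then take.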
ANOVA
Conclusion
• Many composition and mutation factors
influence conservative/radical ratio
• Poor indicator of positive selection
Correlation or Causation?
• Many factors are correlated, but direction
of causation is undetermined
– transitions more likely to cause conservative
changes than transversions
– codon bias can influence nucleotide frequencies
– purifying selective pressure will reduce the rate of
change
• Generative models which model many of
these relevant factors
Generative Models of Gene/Protein
Evolution
• Infer relative distances between
sequences
• Build a phylogenetic tree
• Infer which positions are under positive
selective pressure
• Find additional homologous proteins
• Identify co-varying sites
Modeling Evolutionary Processes
• Most models
– homogeneous, time-reversible Markov models
• Simplest models
– DNA mutation models
– nucleotide frequencies
– transition/transversion ratio
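A DNA model with just these two ingredients can be written as a 4x4 rate matrix. The sketch below assumes one standard parameterisation (the HKY85 form, chosen here for illustration; the slide does not name a specific model), in which the rate from x to y is the frequency of y, multiplied by `kappa` when the change is a transition.

```python
def hky_rate_matrix(freqs, kappa):
    """Build an unnormalised HKY85-style rate matrix as nested dicts.

    freqs: equilibrium nucleotide frequencies, e.g. {"A": 0.25, ...}
    kappa: transition/transversion rate ratio
    """
    bases = "ACGT"
    purines = {"A", "G"}
    q = {x: {} for x in bases}
    for x in bases:
        for y in bases:
            if x != y:
                is_ts = (x in purines) == (y in purines)
                q[x][y] = (kappa if is_ts else 1.0) * freqs[y]
        # Diagonal entry makes each row sum to zero.
        q[x][x] = -sum(q[x][y] for y in bases if y != x)
    return q


freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
Q = hky_rate_matrix(freqs, kappa=2.0)

# Time reversibility (detailed balance): pi_x * q_xy == pi_y * q_yx
for x in "ACGT":
    for y in "ACGT":
        assert abs(freqs[x] * Q[x][y] - freqs[y] * Q[y][x]) < 1e-12
```

The detailed-balance check at the end is what "time-reversible" means in practice for these models.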
Too Simplistic
• positions within codons are not independent
– codon or amino acid models
• parameters not sufficient to explain different
rates of change between specific characters
– empirical substitution matrix (e.g. PAM)
• site-specific rates of change
– use a gamma distribution to model variation in rates
• equilibrium frequencies are also site-specific
– due to functional or structural constraints
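Gamma-distributed site rates are commonly set up with mean 1 so that only the shape parameter varies. The sketch below assumes that convention (shape `alpha`, scale `1/alpha`) and uses Python's standard `random.gammavariate`; the function name `sample_site_rates` is hypothetical.

```python
import random


def sample_site_rates(n_sites, alpha, seed=0):
    """Draw per-site relative rates from a gamma distribution with mean 1.

    Small alpha gives strong rate heterogeneity (many slow sites, a few
    fast ones); large alpha approaches equal rates at every site.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = sample_site_rates(20000, alpha=0.5)
print(sum(rates) / len(rates))  # close to 1 by construction
```

Note that this only rescales a shared substitution process per site; it does not give each site its own equilibrium frequencies, which is the gap the last bullet points at.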
Halpern & Bruno 1998
A codon-based model of evolution
1. site-invariant DNA-based mutation model
2. site-specific amino acid level selection model
$p_{ab}$ = probability of mutation
$f_{ab}^{i}$ = probability of fixation at site i
$r_{ab}^{i} = k \, p_{ab} \, f_{ab}^{i}, \quad b \neq a$
$r_{aa}^{i} = -\sum_{b,\, b \neq a} r_{ab}^{i}$
Halpern & Bruno 1998
• Assumptions
– most importantly, selectional pressures are constant at a
given position for all lineages over all times
– sites independent
– Markov process is reversible
• Does not model
– selection at the codon level
• codon bias
• DNA or RNA structural requirements
– uncertainty in MSA
Calculating fixation rates
s = relative fitness of b to a
N = population size
$f_{ab} \propto \frac{2s}{1 - e^{-2Ns}}$
$\frac{f_{ab}}{f_{ba}} = e^{2Ns}$
(Kimura 1962)
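A quick numerical check of the fixation relationship, assuming the small-s form 2s / (1 - e^{-2Ns}) for the fixation rate of b replacing a (up to a constant factor), and the same expression with -s for the reverse change. The function name `fixation_weight` is hypothetical.

```python
import math


def fixation_weight(s, n):
    """Relative fixation rate 2s / (1 - exp(-2*N*s)).

    As s -> 0 this tends to 1/N, the neutral value, so that limit is
    returned directly to avoid dividing zero by zero.
    """
    if abs(s) < 1e-12:
        return 1.0 / n
    return 2.0 * s / -math.expm1(-2.0 * n * s)


s, n = 0.001, 1000
forward = fixation_weight(s, n)    # b replacing a
backward = fixation_weight(-s, n)  # a replacing b
print(forward / backward, math.exp(2 * n * s))  # the two values agree
```

The ratio depends only on the product Ns, which is what later lets the model express fixation rates through equilibrium frequencies without knowing s or N separately.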
Fixation rates in terms of equilibrium
rates and mutation probabilities
s = relative fitness of b to a
N = population size
$f_{ab} \propto \frac{2s}{1 - e^{-2Ns}}$
$\frac{f_{ab}}{f_{ba}} = e^{2Ns} = \frac{\pi_b p_{ba}}{\pi_a p_{ab}}$
(Kimura 1962)
A Simpler
Formulation
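$r_{ab} = k \, p_{ab} \, \frac{\ln\left(\frac{\pi_b p_{ba}}{\pi_a p_{ab}}\right)}{1 - \frac{\pi_a p_{ab}}{\pi_b p_{ba}}}$
note: $\frac{r_{ab}}{r_{ba}} = \frac{\pi_b}{\pi_a}$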
• p is estimated from nucleotide frequencies and the
transition/transversion ratio
• π represents the frequency of each codon, and is
approximated via amino acid and nucleotide frequencies
• model ignores:
– site-specific nucleic acid selection effects (e.g. from RNA
structure)
– codon bias
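The rate formula can be evaluated directly for a pair of states. This is a minimal sketch, assuming the form r_ab = k p_ab ln(x) / (1 - 1/x) with x = (π_b p_ba) / (π_a p_ab), and taking the neutral limit r_ab = k p_ab when x = 1; the function name `hb_rate` and the numbers are illustrative only.

```python
import math


def hb_rate(p_ab, p_ba, pi_a, pi_b, k=1.0):
    """Substitution rate a -> b from mutation probabilities and
    equilibrium frequencies (hypothetical helper)."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0, rel_tol=1e-9):
        # Equally fit states: ln(x) / (1 - 1/x) -> 1, so the rate is k * p_ab.
        return k * p_ab
    return k * p_ab * math.log(x) / (1.0 - 1.0 / x)


# Illustrative values, not taken from the paper.
p_ab, p_ba = 0.2, 0.1
pi_a, pi_b = 0.3, 0.1

r_ab = hb_rate(p_ab, p_ba, pi_a, pi_b)
r_ba = hb_rate(p_ba, p_ab, pi_b, pi_a)
print(pi_a * r_ab, pi_b * r_ba)  # equal flux in both directions
```

The two printed fluxes match, which is the reversibility property the model assumes. In the full model a and b range over codons and π is specific to each site.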
Model Fallout
• Amount of “flux” between two codons
depends on their relative fitness
• Rates are not explicitly modeled, but...
– maximum substitution rate will be when all codons
are equally fit
– synonymous codons will have highest flux
– because of degeneracy of 3rd position changes,
they will be most frequent
Parameter Estimation
• Ideal
– estimate parameters simultaneously from large data set
• What they did
– nucleotide frequencies: from observed frequencies
– transition/transversion ratio: using existing nucleotide-based methods
– equilibrium amino acid frequencies:
• estimate number of times each amino acid was introduced at each
position (based on phylogenetic tree but ignores genetic code)
• add pseudocounts
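Turning per-site counts into equilibrium frequencies with pseudocounts can be sketched as below. The uniform pseudocount is an assumption made for illustration; the slide does not say which pseudocount scheme the authors used, and `site_frequencies` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(counts, pseudocount=1.0):
    """Estimate amino acid frequencies at one site.

    counts: estimated number of times each amino acid was introduced
            at this position (missing amino acids count as zero).
    pseudocount: assumed uniform value added to every amino acid so
                 that unobserved residues keep a nonzero frequency.
    """
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total
            for aa in AMINO_ACIDS}


# Toy counts for a single site, invented for illustration.
freqs = site_frequencies({"L": 8, "I": 2})
print(freqs["L"], freqs["W"])
```

Without the pseudocount, any amino acid absent from the alignment column would get frequency zero, and a rate formula that takes a logarithm of frequency ratios would be undefined for it.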
Evaluation
• Their hypothesis:
– methods that only model differing rates will
underestimate more remote divergence times
• Test hypothesis on simulated data
– given an MSA
• estimate the tree (multiplied branch lengths by 6.0)
• estimate amino acid frequencies
– arbitrarily choose mutational parameters
– stochastically generate sequences (how many?)
Predicting Distances Between
Sequences
[Figure: four panels of estimated vs. true distances]
A: DNA model (learned?)
B: DNA model with site-rate variation
C: this model with simulation parameters
D: this model with parameters estimated from simulated data
x axis: estimated distances
y axis: true distances (based on simulation)
Conclusions
• failing to model selection effects leads to
substantial underestimation of longer distances
• possible to estimate equilibrium amino acid
frequencies from realistic data sets with an
accuracy sufficient for estimating distances
between highly divergent sequences
• model accounts for heterogeneity of rates in a
novel, and more biologically realistic way
• model parameters could in theory be estimated
simultaneously using ML or Bayesian estimation