Number of substitutions
between two protein-coding genes
Dan Graur
1
Handout
2
Computing the number of
substitutions between two protein-coding sequences is more
complicated, because a distinction
should be made between
synonymous and nonsynonymous
substitutions.
3
Number of synonymous substitutions / Number of synonymous sites
Number of nonsynonymous substitutions / Number of nonsynonymous sites
4
5
Aims:
1. Compute two numerators: The
numbers of synonymous and
nonsynonymous substitutions.
2. Compute two denominators: The
numbers of synonymous and
nonsynonymous sites.
6
Difficulties with denominator:
1. The classification of a site changes with time: For
example, the third position of CGG (Arg) is
synonymous. However, if the first position changes
to T, then the third position of the resulting codon,
TGG (Trp), becomes nonsynonymous.
7
Difficulties with denominator:
2. Many sites are neither completely synonymous
nor completely nonsynonymous. For example, a
transition in the third position of GAT (Asp), to
GAC (Asp), is synonymous, while a transversion to
GAG (Glu) or GAA (Glu) alters the amino acid.
8
Difficulties with numerator:
1. The classification of the
change depends on the order in
which the substitutions occurred.
9
Difficulties with numerator:
1. When two homologous codons differ from each other
by two or more substitutions, the order of the
substitutions must be known in order to classify them
as synonymous or nonsynonymous.
Example: CCC in sequence 1 and CAA in sequence 2.
Pathway I:
CCC (Pro) → CCA (Pro) → CAA (Gln)
1 synonymous and 1 nonsynonymous
Pathway II:
CCC (Pro) → CAC (His) → CAA (Gln)
2 nonsynonymous
10
Difficulties with numerator:
2. Transitions occur with different
frequencies than transversions.
3. Whether a substitution is synonymous depends
on the type of mutation. Transitions result
in synonymous substitutions more frequently
than transversions do.
11
Methods: Miyata & Yasunaga (1980)
and Nei & Gojobori (1986)
1. Classification of sites. Consider a
particular position in a codon. Let i be
the number of possible synonymous
changes at this site. Then this site is
counted as i/3 synonymous and (3 –
i)/3 nonsynonymous.
12
In TTT (Phe), the first two positions are
nonsynonymous, because no synonymous change
can occur in them, and the third position is 1/3
synonymous and 2/3 nonsynonymous because one
of the three possible changes is synonymous.
13
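The site-counting rule above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `BASES`, `AMINO`, `CODE` and `site_counts` are made up for the example, the standard genetic code is assumed, and a change that produces a stop codon is simply counted as nonsynonymous (an assumption, since the slides do not say how stop codons are handled).

```python
from fractions import Fraction
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def site_counts(codon):
    """Return (synonymous, nonsynonymous) site counts for one codon.

    For each position, i of the 3 possible changes are synonymous,
    so the position counts as i/3 synonymous and (3 - i)/3 nonsynonymous.
    Assumption: changes to a stop codon count as nonsynonymous.
    """
    syn = Fraction(0)
    for pos in range(3):
        i = sum(
            1
            for b in BASES
            if b != codon[pos]
            and CODE[codon[:pos] + b + codon[pos + 1:]] == CODE[codon]
        )
        syn += Fraction(i, 3)
    return syn, 3 - syn


syn, nonsyn = site_counts("TTT")
print(syn, nonsyn)
```

For TTT this gives 1/3 synonymous and 8/3 nonsynonymous sites: two fully nonsynonymous positions plus 2/3 from the third position, in line with the example above.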
2. Count the number of
synonymous and nonsynonymous
sites in each sequence and
compute the averages between the
two sequences. The average
number of synonymous sites is NS
and that of nonsynonymous sites
is NA.
14
3. Classify nucleotide differences
into synonymous and
nonsynonymous differences.
15
• For two codons that differ by only one
nucleotide, the difference is easily
inferred. For example, the difference
between the two codons GTC (Val) and
GTT (Val) is synonymous, while the
difference between the two codons GTC
(Val) and GCC (Ala) is nonsynonymous.
16
17
• For two codons that differ by two or
more nucleotides, the estimation problem
is more complicated, because we need to
determine the order in which the
substitutions occurred.
18
Pathway (1) requires one synonymous and one
nonsynonymous change, whereas pathway (2)
requires two nonsynonymous changes.
19
There are two approaches to deal with multiple
substitutions at a codon:
20
The unweighted method: Average the numbers of the different
types of substitutions for all the possible scenarios. For example, if
we assume that the two pathways are equally likely, then the number
of nonsynonymous differences is (1 + 2)/2 = 1.5, and the number of
synonymous differences is (1 + 0)/2 = 0.5.
21
The weighted method: Employ a priori criteria to assign the
probability of each pathway. For instance, if the weight of pathway 1
is 0.9, and the weight for pathway 2 is 0.1, then the number of
nonsynonymous differences between the two codons is (0.9 × 1) +
(0.1 × 2) = 1.1, and the number of synonymous differences is
(0.9 × 1) + (0.1 × 0) = 0.9.
22
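The two approaches can be sketched in Python by enumerating every order in which the differing positions could have changed. This is only an illustration under stated assumptions: `count_differences` and its `weights` parameter are hypothetical names, the standard genetic code is assumed, and pathways that pass through a stop codon are not excluded, which a fuller treatment would need to address.

```python
from itertools import permutations, product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(codon1, codon2, weights=None):
    """Return (synonymous, nonsynonymous) differences between two codons.

    Each pathway is one ordering of the positions at which the codons differ.
    weights=None gives the unweighted method (all pathways equally likely).
    Otherwise, weights must list one probability per pathway, in the order
    produced by itertools.permutations over the differing positions.
    Assumption: pathways through stop codons are not filtered out.
    """
    diff_pos = [k for k in range(3) if codon1[k] != codon2[k]]
    paths = list(permutations(diff_pos))
    if weights is None:
        weights = [1 / len(paths)] * len(paths)
    syn = nonsyn = 0.0
    for order, w in zip(paths, weights):
        current = codon1
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODE[nxt] == CODE[current]:
                syn += w      # amino acid unchanged by this step
            else:
                nonsyn += w   # amino acid changed by this step
            current = nxt
    return syn, nonsyn


# Unweighted: CCC vs CAA has two pathways (via CAC or via CCA).
print(count_differences("CCC", "CAA"))
# Weighted: 0.1 for the pathway via CAC, 0.9 for the pathway via CCA.
print(count_differences("CCC", "CAA", weights=[0.1, 0.9]))
```

For CCC and CAA, the positions that differ are the second and third, so the first pathway enumerated changes the second position first (via CAC, pathway 2 in the text) and the second changes the third position first (via CCA, pathway 1).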
23
4. The numbers of synonymous
and nonsynonymous differences
between the two protein-coding
sequences are MS and MA,
respectively.
24
The number of synonymous differences
per synonymous site is
pS = MS/NS
The number of nonsynonymous
differences per nonsynonymous site is
pA = MA/NA
25
If we take into account the effect of
multiple hits at the same site, we can
make corrections by using Jukes and
Cantor's formula:
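The formula itself did not survive in this transcript. As a hedged sketch, the Python below assumes the standard Jukes and Cantor one-parameter correction, d = –(3/4) ln(1 – 4p/3), applied separately to pS and pA. The function names and the input numbers are hypothetical and chosen only for illustration.

```python
import math


def proportion(differences, sites):
    """Differences per site, e.g. pS = MS/NS or pA = MA/NA."""
    return differences / sites


def jukes_cantor(p):
    """Correct a proportion of differences p for multiple hits.

    Assumes the standard Jukes-Cantor form d = -(3/4) * ln(1 - 4p/3),
    which is only defined for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor correction requires 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical counts, for illustration only.
p_s = proportion(10, 100)   # MS = 10, NS = 100
p_a = proportion(5, 300)    # MA = 5,  NA = 300
print(jukes_cantor(p_s), jukes_cantor(p_a))
```

The corrected value is always at least as large as the observed proportion, and the correction grows as p approaches 0.75.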
26
27
28
Number of Amino-Acid Replacements
between Two Proteins
• The observed proportion of different amino
acids between the two sequences (p) is
p = n/L
• n = number of amino acid differences
between the two sequences
• L = length of the aligned sequences.
29
30
Number of Amino-Acid Replacements
between Two Proteins
The Poisson model is used to convert p into the
number of amino acid replacements per site between
two sequences (d):
d = –ln(1 – p)
The variance of d is estimated as
V(d) = p/[L(1 – p)]
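The Poisson correction and its variance can be computed directly from n and L. The sketch below is a minimal Python illustration; the function name `poisson_distance` and the example numbers are hypothetical, and the variance is read as p divided by the product L(1 – p).

```python
import math


def poisson_distance(n, L):
    """Return (p, d, V(d)) for n amino acid differences over L aligned sites.

    p = n/L, d = -ln(1 - p), V(d) = p / (L * (1 - p)).
    Undefined when every site differs (p = 1).
    """
    p = n / L
    if p >= 1:
        raise ValueError("distance is undefined when p >= 1")
    d = -math.log(1 - p)
    var = p / (L * (1 - p))
    return p, d, var


# Hypothetical example: 20 differences over 100 aligned sites.
p, d, var = poisson_distance(20, 100)
print(p, d, var)
```

As p grows, d increases faster than p and its variance grows too, so the estimate becomes less reliable for highly divergent sequences.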
31
Saturation