Research on Mitochondrial Genomes
Lectures for 4Y03
Paul Higgs
Dept. of Physics, McMaster University, Hamilton, Ontario.
Supported by
Canada Research Chairs
and BBSRC
1. Building a database for mitochondrial genomes.
2. Large scale – gene order evolution.
3. Medium scale – sequence evolution. Molecular phylogenetics.
4. Small scale – mutation and selection. Variation in base and amino acid frequencies. Codon usage.
5. Genetic code evolution.
People:
1. Wenli Jia, Bin Tang, Daniel Jameson
2. Howsun Jow, Magnus Rattray, Cendrine
Hudelot, Vivek Gowri-Shankar, Xiaoguang
Yang
3. Wei Xu, Daniel Jameson
4. Daniel Urbina, Wenli Jia.
5. Supratim Sengupta
Mitochondria are organelles
inside eukaryotic cells.
They are the site of oxidative
phosphorylation and ATP
synthesis.
They contain their own genome
distinct from the DNA in the
nucleus.
Typical animal mitochondrial genomes
are short and circular (~16,000 bases).
They usually contain:
2 rRNAs
22 tRNAs
13 proteins
An example of a GenBank file: complete mitochondrial genome of the Alligator

LOCUS       NC_001922   16646 bp   DNA   circular   VRT   20-SEP-2002
DEFINITION  Alligator mississippiensis mitochondrion, complete genome.
ACCESSION   NC_001922
VERSION     NC_001922.1 GI:5835540
KEYWORDS    .
SOURCE      mitochondrion Alligator mississippiensis (American alligator)
  ORGANISM  Alligator mississippiensis
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Crocodylidae; Alligatorinae; Alligator.
REFERENCE   1 (bases 1 to 16646)
  AUTHORS   Janke,A. and Arnason,U.
  TITLE     The complete mitochondrial genome of Alligator mississippiensis and
            the separation between recent archosauria (birds and crocodiles)
  JOURNAL   Mol. Biol. Evol. 14 (12), 1266-1272 (1997)
  MEDLINE   98066357
   PUBMED   9402737
FEATURES             Location/Qualifiers
     source          1..16646
                     /organism="Alligator mississippiensis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:8496"
                     /tissue_type="liver"
                     /dev_stage="adult"
     rRNA            1..976
                     /product="12S ribosomal RNA"
     tRNA            977..1044
                     /product="tRNA-Val"
                     /anticodon=(pos:1009..1011,aa:Val)
     rRNA            1046..2635
                     /product="16S ribosomal RNA"
     tRNA            2636..2710
                     /product="tRNA-Leu"
                     /note="codons recognized: UUR"
                     /anticodon=(pos:2672..2674,aa:Leu)
     gene            2711..3676
                     /gene="ND1"
     CDS             2711..3676
                     /gene="ND1"

        1 caacagactt agtcctggtc ttttcattag ctagtactca acttatacat gcaagcatcc
       61 gcgaaccagt gagaacaccc tacaagtctg acagacgaat ggagccggca tcaggcacat
      121 caaccgatag cccaaaacgc ctagcccagc cacaccccca agggtctcag cagtgattaa
      181 ccttaaacca taagcgaaag cttgatttag ttagagtaga tatagaggcg gtcaactctc
      241 gtgccagcaa ccgcggttag acgaaaacct caagttaatt gacaaacggc gtaaattgtg
      301 gctagaactc tatctccccc attagtgcag atacggtatc acagtagtga taaacttcat
      361 cacaccgcaa acatcaacac aaaactggcc ctaatctcaa agatgtactc gattccacga
      421 aagctgagaa acaaactggg attagatacc ccactatgct cagcccttaa cattggtgta
      481 gtacacaaca gactaccctc gccagagaat tacgagcccc gcttaaaact caaaggactt
      541 gacggcactt taaacccccc tagaggagcc tgtcctataa tcgacagtac acgttacacc
      601 cgaccacctt tagcctactc agtctgtata ccgccgtcgc aagcccgtcc catttgaggg
      661 aaacaaaacg cgcgcaacag ctcaaccgag ctaacacgtc aggtcaaggt gcagccaaca
      721 aggtggaaga gatgggctac attttctcaa catgtagaaa tattcaacgg agagccctat
      781 gaaatacagg actgtcaaag ccggatttag cagtaaactg ggaaagaata cctagttgaa
      841 gtcggtaacg aagtgcgtac acaccgcccg tcaccctcct cgaacccaac aaaatgccca
      901 aacaacaggc acaatgttgg gcaagatggg gaaagtcgta acaaggtaag cgtaccggaa
      961 ggtgcacttg gaacatcaaa atgtagctta aatttaaagc attcagttta cacctgaaaa
     1021 agtcccacca tcggaccatt ttgaaaccca tatctagccc tacctccttt caacatgctt
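A feature table of this kind can be read with a few lines of code. The following is a minimal sketch, not a full GenBank parser: it assumes simple `start..end` locations and quoted qualifiers, as in the lines shown on the slide, and it ignores `complement(...)` and `join(...)` locations that occur in real files. The function name `parse_features` and the abridged `RECORD` text are illustrative only.

```python
import re

# A few feature-table lines in the style of the record above (abridged).
RECORD = """\
     rRNA            1..976
                     /product="12S ribosomal RNA"
     tRNA            977..1044
                     /product="tRNA-Val"
     rRNA            1046..2635
                     /product="16S ribosomal RNA"
     CDS             2711..3676
                     /gene="ND1"
"""

# Feature keys start in column 6; qualifiers are indented further and start with "/".
FEATURE = re.compile(r"^\s{5}(\w+)\s+(\d+)\.\.(\d+)\s*$")
QUALIFIER = re.compile(r'^\s+/(\w+)="([^"]*)"')


def parse_features(text):
    """Return one dict per feature: type, start, end, plus any quoted qualifiers."""
    features = []
    for line in text.splitlines():
        m = FEATURE.match(line)
        if m:
            features.append({"type": m.group(1),
                             "start": int(m.group(2)),
                             "end": int(m.group(3))})
            continue
        q = QUALIFIER.match(line)
        if q and features:
            # Attach the qualifier to the most recently seen feature.
            features[-1][q.group(1)] = q.group(2)
    return features


features = parse_features(RECORD)
for f in features:
    length = f["end"] - f["start"] + 1
    print(f["type"], f["start"], f["end"], length, f.get("product") or f.get("gene"))
```

For production use, an established parser is a safer choice than a hand-written regular expression.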
OGRe (= Organellar Genome Retrieval) is a relational database, available at http://ogre.mcmaster.ca
More than 800 complete animal mitochondrial genomes.
Efficient means of storage and retrieval of information. Uses PostgreSQL
Schema defines relationships between different types of information.
Tables and their columns:
fileindex: genome_code, offset, filename
citations: genome_code, medline_code
species: species_code, classification, group_name, latin_name, common_name
classification: group_name, parent, alternative_parent, has_children
genome: genome_code, species_code, genome_type, ncbiac, description, ncbidate, accession_date, lastmodified, lastmodifiedby, notes, genome_length, genetic_code, a_content, c_content, g_content, t_content, final, gene_order, gene_order_notrna, rna_a_content, rna_c_content, rna_g_content, rna_t_content
codon_usage: genome_code, codon, strand, usage
feature: feature_id, genome_code, type, feature_name, notes, description, alignment_file
feature_location: feature_id, start, stop, strand
trna: feature_id, amino_acid, anticodon, codon
feature_descriptions: feature_name, description
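To make the idea of linked tables concrete, here is a small sketch that rebuilds four of the tables with a handful of their columns and runs one join. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the key values `ALLMIS` and `ALLMISMT` are made up for the example; the actual OGRe keys, column types and constraints are not given in the slides.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; only a few columns are kept.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE species (species_code TEXT PRIMARY KEY, latin_name TEXT, common_name TEXT);
CREATE TABLE genome (genome_code TEXT PRIMARY KEY, species_code TEXT,
                     ncbiac TEXT, genome_length INTEGER);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, genome_code TEXT,
                      type TEXT, feature_name TEXT);
CREATE TABLE feature_location (feature_id INTEGER, start INTEGER, stop INTEGER, strand TEXT);
""")

# Hypothetical keys; the coordinates come from the alligator GenBank record.
con.execute("INSERT INTO species VALUES ('ALLMIS', 'Alligator mississippiensis', 'American alligator')")
con.execute("INSERT INTO genome VALUES ('ALLMISMT', 'ALLMIS', 'NC_001922', 16646)")
con.executemany("INSERT INTO feature VALUES (?, 'ALLMISMT', ?, ?)",
                [(1, "rRNA", "12S rRNA"), (2, "tRNA", "tRNA-Val")])
con.executemany("INSERT INTO feature_location VALUES (?, ?, ?, '+')",
                [(1, 1, 976), (2, 977, 1044)])

# Follow the relationships species -> genome -> feature -> feature_location.
rows = con.execute("""
SELECT f.feature_name, l.start, l.stop
FROM species s
JOIN genome g ON g.species_code = s.species_code
JOIN feature f ON f.genome_code = g.genome_code
JOIN feature_location l ON l.feature_id = f.feature_id
WHERE s.latin_name = ?
ORDER BY l.start
""", ("Alligator mississippiensis",)).fetchall()
print(rows)
```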
The OGRe front page: http://ogre.mcmaster.ca
Sequence information for OGRe is taken from GenBank. We aim to keep up to
date with publicly available animal mitochondrial genomes.
Species may be selected individually
from an alphabetical list
Or taxa may be selected from a
hierarchy. Here the Arthropods
have been expanded and the
Myriapods and Crustaceans have
been selected
Large Scale – Evolution of Gene Order in Whole Genomes
On the OGRe web site, a visual comparison can be made of any two selected
species. Colour is used to indicate conserved blocks of genes.
Alligator and Bird genomes differ
by interchange of two tRNA genes
(red and yellow)…
…and by translocation of the two
genes in the blue block.
Genome reshuffling mechanisms
Inversions:
A B C D  ->  A -C -B D
Translocations:
A (B C) D  ->  A D B C
Duplications and deletions:
A B C D  ->  A B C B C D  ->  A C B D  (the first B and the second C are deleted)
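The three mechanisms can be written as operations on a list of signed genes. This is an illustrative sketch only: genes are strings, a leading `-` marks the opposite strand, the genome is treated as a linear list for simplicity, and the function names are invented for the example.

```python
def flip(gene):
    """Switch a gene to the other strand: 'B' <-> '-B'."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def invert(order, i, j):
    """Reverse the segment order[i:j] and flip the strand of each gene in it."""
    segment = [flip(g) for g in reversed(order[i:j])]
    return order[:i] + segment + order[j:]


def translocate(order, i, j, k):
    """Cut out order[i:j] and reinsert it at index k of the remaining genes."""
    block = order[i:j]
    rest = order[:i] + order[j:]
    return rest[:k] + block + rest[k:]


def duplicate(order, i, j):
    """Tandem duplication of the segment order[i:j]."""
    return order[:j] + order[i:j] + order[j:]


def delete(order, indices):
    """Delete the genes at the given indices."""
    return [g for n, g in enumerate(order) if n not in indices]


genome = ["A", "B", "C", "D"]
print(invert(genome, 1, 3))          # ['A', '-C', '-B', 'D']
print(translocate(genome, 1, 3, 2))  # ['A', 'D', 'B', 'C']
dup = duplicate(genome, 1, 3)        # ['A', 'B', 'C', 'B', 'C', 'D']
print(delete(dup, {1, 4}))           # ['A', 'C', 'B', 'D']
```

The last two lines reproduce the duplication followed by deletion of the first B and the second C, which leaves B and C in swapped positions without any inversion or translocation.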
Example of an inversion
Example of a translocation
The T and –P genes are duplicated in
Cordylus warreni.
If the first T and the second –P were
deleted, the relative position of T
and –P would change.
Sometimes things go crazy ….
Drosophila and Thrips are both insects
yet there are 30 breakpoints for only 37 genes
i.e. almost nothing in common.
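A breakpoint count can be sketched as follows. The assumptions here are that both genomes are circular, contain the same genes, and are written as signed gene lists (a leading `-` marks the opposite strand); an adjacency counts as conserved if the same two genes are neighbours in the same relative orientation when either strand is read. This is one common definition and may differ in detail from the one used in OGRe.

```python
def neg(gene):
    """Switch a gene to the other strand: 'B' <-> '-B'."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def adjacencies(order):
    """All neighbouring gene pairs of a circular genome, read on both strands."""
    n = len(order)
    adj = set()
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        adj.add((a, b))
        adj.add((neg(b), neg(a)))  # the same adjacency read on the other strand
    return adj


def breakpoints(order1, order2):
    """Number of adjacencies in order1 that are not present in order2."""
    adj2 = adjacencies(order2)
    n = len(order1)
    return sum((order1[k], order1[(k + 1) % n]) not in adj2 for k in range(n))


ancestor = ["A", "B", "C", "D"]
inverted = ["A", "-C", "-B", "D"]   # one inversion
moved = ["A", "D", "B", "C"]        # one translocation
print(breakpoints(ancestor, ancestor), breakpoints(ancestor, inverted),
      breakpoints(ancestor, moved))
```

In this toy case a single inversion creates two breakpoints and a single translocation creates three.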
OGRe contains gene orders as strings. This allows searching and comparison.
231 unique gene orders have been found in 858 species.
The standard vertebrate order is shared by 398 species (including humans).
There are many other species with unique gene orders.
Some species conserve gene order over 100s of millions of years. Others get
scrambled in a few million.
Still to do (new project) :
- estimate relative rates of different rearrangement processes
- predict most likely ancestral gene orders
- use gene order evidence in phylogenetics
Medium Scale – Sequence Alignments and Phylogenetics
Part of sequence alignment of Mitochondrial Small Sub-Unit rRNA
Full gene is length ~950
11 Primate species with mouse as outgroup
Mouse      : CUCACCAUCUCUUGCUAAUUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUAAAAAGG-UAUUAAAGUAAGCAAAAGA : 78
Lemur      : CUCACCACUUCUUGCUAAUUCAACUUAUAUACCGCCAUCCCCAGCAAACCCUAUUAAGGCCC-CAAAGUAAGCAAAAAC : 78
Tarsier    : CUUACCACCUCUUGCUAAUUCAGUCUAUAUACCGCCAUCUUCAGCAAACCCUAAUAAAGGUUUUAAAGUAAGCACAAGU : 79
SakiMonkey : CUUACCACCUCUUGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCUA-UAAUGACAGUAAAGUAAGCACAAGU : 76
Marmoset   : CUCACCACGUCUAGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCCU-UAAUGAUUGUAAAGUAAGCAGAAGU : 76
Baboon     : CCCACCCUCUCUUGCU----UAGUCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACGAAGUGAGCGCAAAU : 75
Gibbon     : CUCACCAUCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACAAAGGCUAUAAAGUAAGCACAAAC : 75
Orangutan  : CUCACCACCCCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCCACGAAGUAAGCGCAAAC : 75
Gorilla    : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACGAAGGCCACAAAGUAAGCACAAGU : 75
PygmyChimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Chimp      : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Human      : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAGU : 75
69 Mammals with
complete mitochondrial
genomes.
Used two models
simultaneously
Total of 3571 sites
= 1637 single sites
+ 967 pairs (1637 + 2 × 967 = 3571)
Hudelot et al. 2003
Afrotheria / Laurasiatheria
Striking examples of convergent evolution
Arthropod phylogenetics
Very difficult due to strong variation in rates of evolution between species.
tRNA tree – branch lengths optimized on fixed consensus topology
Long branch species are problematic if tree is not fixed.
Taxa in the tree (scale bar 0.1): Terebratulina, Katharina, Limulus, Heptathela, Ornithoctonus, Habronattus, Varroa, Carios, Ornithodoros moubata, Ornithodoros porcinus, Rhipicephalus, Amblyomma, Haemaphysalis, Ixodes holocyclus, Ixodes hexagonus, Ixodes persulcatus, Scutigera, Lithobius, Thyropygus, Narceus, Speleonectes, Vargula, Hutchinsoniella, Tigriopus, Armillifer, Argulus, Tetraclita, Pollicipes, Penaeus, Cherax, Portunus, Panulirus, Pagurus, Artemia, Triops, Daphnia, Tetrodontophora, Gomphiocephalus, Tricholepidion, Locusta, Aleurodicus, Triatoma, Philaenus, Thrips, Lepidopsocid, Heterodoxus, Pyrocoelia, Tribolium, Crioceris, Apis, Melipona, Ostrinia, Antheraea, Bombyx, Anopheles, Drosophila, Chrysomya.
Images courtesy of University of
Nebraska, Dept. of Entomology.
http://entomology.unl.edu/images/
protein tree – branch lengths optimized on fixed consensus topology
Same species are on long branches in proteins as in RNAs
Taxa in the tree (scale bar 0.1): Terebratulina, Katharina, Limulus, Heptathela, Ornithoctonus, Habronattus, Varroa, Carios, Ornithodoros moubata, Ornithodoros porcinus, Rhipicephalus, Amblyomma, Haemaphysalis, Ixodes holocyclus, Ixodes hexagonus, Ixodes persulcatus, Scutigera, Lithobius, Thyropygus, Narceus, Speleonectes, Vargula, Hutchinsoniella, Tigriopus, Armillifer, Argulus, Tetraclita, Pollicipes, Penaeus, Cherax, Portunus, Panulirus, Pagurus, Artemia, Triops, Daphnia, Tetrodontophora, Gomphiocephalus, Tricholepidion, Locusta, Aleurodicus, Triatoma, Philaenus, Thrips, Lepidopsocid, Heterodoxus, Pyrocoelia, Tribolium, Crioceris, Apis, Melipona, Ostrinia, Antheraea, Bombyx, Anopheles, Drosophila, Chrysomya.
Images courtesy of University of
Nebraska, Dept. of Entomology.
http://entomology.unl.edu/images/
Relative rate test for sequence evolution - Templeton
Three aligned sequences with 0 known to be
the outgroup. Test whether rates of evolution
in branch 1 and branch 2 are equal.
Calculate:
m1 = number of sites where 0 and 2 are the
same and 1 is different.
m2 = number of sites where 0 and 1 are the
same and 2 is different.
$$\chi^2_m = \frac{(|m_1 - m_2| - 1)^2}{m_1 + m_2}$$
Should follow a chi squared distribution with one degree of freedom.
Many pairs of related species found to have different rates in the
mitochondrial sequences.
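The test can be sketched in a few lines. The statistic used below is the continuity-corrected form (|m1 − m2| − 1)² / (m1 + m2); two choices in the code are assumptions of this sketch rather than part of the slide: sites containing a gap are skipped, and the corrected difference is not allowed to go below zero. The p-value for one degree of freedom is obtained from the complementary error function, so no external library is needed.

```python
import math


def relative_rate_test(seq0, seq1, seq2):
    """Relative rate test on three aligned sequences; seq0 is the outgroup.

    Returns (m1, m2, chi2, p), with p taken from a chi-squared
    distribution with one degree of freedom.
    """
    m1 = m2 = 0
    for a, b, c in zip(seq0, seq1, seq2):
        if "-" in (a, b, c):          # assumption: ignore sites with gaps
            continue
        if a == c and b != a:         # change on branch 1
            m1 += 1
        elif a == b and c != a:       # change on branch 2
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    # Continuity-corrected statistic, clamped so that m1 == m2 gives 0.
    chi2 = max(abs(m1 - m2) - 1, 0) ** 2 / (m1 + m2)
    # Survival function of chi-squared with 1 d.o.f.: erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p


# Toy alignment: branch 1 has 8 unique changes, branch 2 has 1.
outgroup = "AAAAAAAAAA"
species1 = "CCCCCCCCAA"
species2 = "AAAAAAAAAC"
m1, m2, chi2, p = relative_rate_test(outgroup, species1, species2)
print(m1, m2, chi2, round(p, 4))
```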
Gene Order sometimes gives evidence of phylogenetic relationships
The gene order of the ancestral
arthropod is thought to be the
same as that of the horseshoe
crab Limulus.
Image courtesy of Marine Biology Lab, Woods
Hole. www.mbl.edu/animals/Limulus
The same translocation of tRNA-Leu is found in insects and crustaceans but not
myriapods and chelicerates. Strong argument for the group Pancrustacea (=
insects plus crustaceans)
Moderately rearranged
Completely scrambled
Species ranked according to breakpoint distance from ancestor.
The tRNA and Protein columns are sequence distances.

Species                          Breakpoints  Inversions  Dup/Del  tRNA   Protein
Very High
Tigriopus japonicus              35           32          0        2.15   1.34
Heterodoxus macropus             35           32          0        1.39   1.83
Thrips imaginis                  32           29          1        1.34   1.32
High
Pollicipes polymerus             22           16          2        0.69   0.59
Cherax destructor                20           16          0        0.54   0.57
Tetraclita japonica              20           16          0        0.66   0.57
Argulus americanus               20           18          0        0.72   1.12
Speleonectes tulumensis          19           16          1        0.83   0.93
Apis mellifera                   19           16          0        0.84   1.50
Hutchinsoniella macracantha      18           16          0        0.86   0.87
Pagurus longicarpus              18           12          0        0.65   0.45
Vargula hilgendorfii             17           15          0        0.79   1.41
Lepidopsocid RS-2001             17           16          0        0.60   0.59
Habronattus oregonensis          16           14          0        1.48   1.09
Ornithoctonus huwena             15           13          0        1.95   1.23
Scutigera coleoptrata            15           15          0        0.48   0.44
Melipona bicolor                 14           8           2        0.93   1.66
Varroa destructor                14           12          0        0.83   1.09
Armillifer armillatus            13           12          0        0.85   1.73
Medium
Narceus annularus                9            9           0        0.63   0.58
Thyropygus sp.                   9            9           0        0.49   0.46
Aleurodicus dugesii              8            5           1        1.04   1.54
Anopheles gambiae                8            6           0        0.41   0.47
Tetrodontophora bielanensis      8            6           0        0.77   0.70
Artemia franciscana              7            5           0        0.63   0.64
Rhipicephalus sanguineus         7            6           0        0.82   0.96
Amblyomma triguttatum            7            6           0        0.88   1.00
Haemaphysalis flava              7            6           0        0.82   0.96
Locusta migratoria               6            5           0        0.38   0.52
Bombyx mori                      6            5           0        0.51   0.54
Portunus trituberculatus         6            5           0        0.51   0.44
Ostrinia furnacalis              6            5           0        0.49   0.48
Tribolium castaneum              6            5           0        0.55   0.53
Antheraea pernyi                 6            5           0        0.50   0.54
Low
Chrysomya putoria                4            2           1        0.36   0.42
Tricholepidion gertschi          3            2           0        0.44   0.39
Daphnia pulex                    3            2           0        0.62   0.51
Pyrocoelia rufa                  3            2           0        0.52   0.77
Drosophila melanogaster          3            2           0        0.37   0.42
Panulirus japonicus              3            2           0        0.58   0.53
Triatoma dimidiata               3            2           0        0.59   0.50
Lithobius forficatus             3            3           0        1.13   0.61
Philaenus spumarius              3            2           0        0.69   0.58
Gomphiocephalus hodgsoni         3            2           0        0.69   0.62
Penaeus monodon                  3            2           0        0.34   0.32
Crioceris duodecimpunctata       3            2           0        0.55   0.58
Triops cancriformis              3            2           0        0.42   0.40
Limulus polyphemus               0            0           0        0.36   0.40
Ixodes persulcatus               0            0           0        0.72   0.82
Ixodes holocyclus                0            0           0        0.76   0.83
Ixodes hexagonus                 0            0           0        0.74   0.90
Carios capensis                  0            0           0        0.70   0.79
Ornithodoros porcinus            0            0           0        0.67   0.86
Heptathela hangzhouensis         0            0           0        0.76   0.87
Ornithodoros moubata             0            0           0        0.68   0.88
R =0.99
R =0.53
R =0.59
R =0.69
Highly rearranged genomes have highly divergent sequences.
Rates of sequence evolution and genome rearrangement are correlated.
Both are very non-clocklike.
Breakpoint category    tRNA distance (min / mean / max)    protein distance (min / mean / max)
Very High              1.33 / 1.62 / 2.14                  1.32 / 1.50 / 1.83
High                   0.48 / 0.86 / 1.94                  0.44 / 0.99 / 1.73
Moderate               0.38 / 0.63 / 1.04                  0.43 / 0.69 / 1.54
Low                    0.34 / 0.60 / 1.13                  0.32 / 0.62 / 0.90
tRNA only High         0.66 / 1.01 / 1.94                  0.57 / 1.15 / 1.73
tRNA only Mod/Low      0.34 / 0.60 / 1.13                  0.32 / 0.63 / 1.54
There are many species where only tRNAs have changed position.
Species with highly reshuffled tRNAs have high rates of sequence evolution in
both tRNAs and proteins.
Relative rate of genome rearrangement (Xu et al 2006)
Three gene orders with 0 known to be the
outgroup. Test whether rates of rearrangement
in branch 1 and branch 2 are equal.
Calculate:
n1 = number of gene couples in 0 and 2 but
not in 1 – i.e. New breakpoint in 1
n2 = number of gene couples in 0 and 1 but
not in 2 – i.e. New breakpoint in 2
$$\chi^2_n = \frac{(|n_1 - n_2| - 1)^2}{n_1 + n_2}$$
Should follow a chi squared distribution with one degree of freedom.
We took pairs where there was a significant difference in rearrangement
rates ($\chi^2_n$ was large) and showed that there was a significant difference in
substitution rates too ($\chi^2_m$ was large).
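The gene-order version of the test can be sketched in the same way as the sequence version. The assumptions here are that genomes are circular signed gene lists (a leading `-` marks the opposite strand), that a "gene couple" is a pair of neighbouring genes regardless of which strand is read, and that the statistic is the continuity-corrected form (|n1 − n2| − 1)² / (n1 + n2), clamped at zero. The function names are illustrative.

```python
import math


def neg(gene):
    """Switch a gene to the other strand: 'B' <-> '-B'."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def couples(order):
    """Set of neighbouring gene pairs in a circular genome, strand-independent."""
    n = len(order)
    out = set()
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        # (a, b) and (-b, -a) are the same couple; store one canonical form.
        out.add(min((a, b), (neg(b), neg(a))))
    return out


def rearrangement_rate_test(g0, g1, g2):
    """Relative rate test on gene orders; g0 is the outgroup."""
    c0, c1, c2 = couples(g0), couples(g1), couples(g2)
    n1 = len((c0 & c2) - c1)   # couples lost only in lineage 1
    n2 = len((c0 & c1) - c2)   # couples lost only in lineage 2
    if n1 + n2 == 0:
        return n1, n2, 0.0, 1.0
    chi2 = max(abs(n1 - n2) - 1, 0) ** 2 / (n1 + n2)
    p = math.erfc(math.sqrt(chi2 / 2))   # chi-squared, 1 degree of freedom
    return n1, n2, chi2, p


outgroup = ["A", "B", "C", "D", "E", "F"]
lineage1 = ["A", "-C", "-B", "D", "E", "F"]   # one inversion
lineage2 = ["A", "B", "C", "D", "E", "F"]     # unchanged
n1, n2, chi2, p = rearrangement_rate_test(outgroup, lineage1, lineage2)
print(n1, n2, chi2, round(p, 3))
```

With only one inversion the counts are far too small for significance; real comparisons involve all 37 genes.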
Good Guys
Bad Guys
Gene order is sometimes a strong phylogenetic marker
but the Bad Guys are problematic in gene order analysis as well as phylogenetics.
Why does the evolutionary rate speed up in these isolated groups of species?
Why do tRNA genes move more frequently?
What are the relative rates of inversion and translocation?
Credits:
Daniel Jameson/ Bin Tang – Database design and management
Daniel Urbina – Base and Amino Acid Frequencies
Wei Xu – Gene Order Analysis and Arthropod Phylogenies
Small Scale Evolution –
Variation in Frequencies of Bases and Amino Acids
The two strands of DNA are complementary.
G G C AAAAT
C C GTTTTA
Freq of A on one strand = Freq of T on the other
Freq of C on one strand = Freq of G on the other
If the two strands are subject to the same mutational processes then the freq of
any base should be equal (statistically) on both strands.
This means that A = T and C = G on any one strand.
In this case base frequencies can be described by a single variable: G+C content.
BUT – mitochondrial genomes have an asymmetrical replication process. The two
strands are not equivalent.
The frequencies of bases on the two strands are not equal.
On any one strand the frequencies of the four bases may vary independently.
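The complementarity argument can be checked on the slide's own eight-base example. This sketch computes base frequencies on a strand and on its reverse complement, plus two simple skew measures; the skew definitions (A−T)/(A+T) and (G−C)/(G+C) are common conventions that are added here for illustration and are not taken from the slides.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def base_freqs(seq):
    """Frequency of each of the four bases in a sequence."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in "ACGT"}


def skews(seq):
    """AT skew and GC skew of one strand; both are 0 if the strands are equivalent."""
    c = Counter(seq)
    at = (c["A"] - c["T"]) / (c["A"] + c["T"])
    gc = (c["G"] - c["C"]) / (c["G"] + c["C"])
    return at, gc


strand = "GGCAAAAT"
other = reverse_complement(strand)
f1, f2 = base_freqs(strand), base_freqs(other)
print(other, f1, f2)
print(skews(strand))
```

On this toy strand A is far more frequent than T, so the AT skew is strongly positive; the other strand has exactly the opposite skew.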
Mitochondrial genome
replication
Figure from
Faith & Pollock
(2003)
Genetics
Rank genes in order of increasing time spent single stranded
COI < COII < ATP8 < ATP6 < COIII < ND3 < ND4L < ND4 < ND1 < ND5 <ND2 < Cytb
ND6 is on the other strand
The Genetic Code maps the 64 DNA codons to the 20 amino acids.
(This version applies to Vertebrate Mitochondria)
Rows: first position. Columns: second position. The four lines in each row give the third position (T, C, A, G). Numbers 1 to 21 label the amino acid groups.

FIRST       SECOND POSITION                                        THIRD
POSITION    T            C            A              G             POSITION
T           TTT F 1      TCT S 6      TAT Y 10       TGT C 17      T
            TTC F        TCC S        TAC Y          TGC C         C
            TTA L 2      TCA S        TAA Stop       TGA W 18      A
            TTG L        TCG S        TAG Stop       TGG W         G
C           CTT L        CCT P 7      CAT H 11       CGT R 19      T
            CTC L        CCC P        CAC H          CGC R         C
            CTA L        CCA P        CAA Q 12       CGA R         A
            CTG L        CCG P        CAG Q          CGG R         G
A           ATT I 3      ACT T 8      AAT N 13       AGT S 20      T
            ATC I        ACC T        AAC N          AGC S         C
            ATA M 4      ACA T        AAA K 14       AGA Stop      A
            ATG M        ACG T        AAG K          AGG Stop      G
G           GTT V 5      GCT A 9      GAT D 15       GGT G 21      T
            GTC V        GCC A        GAC D          GGC G         C
            GTA V        GCA A        GAA E 16       GGA G         A
            GTG V        GCG A        GAG E          GGG G         G
4-codon families where the third position is synonymous
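The four-codon families can be found mechanically from the code table. The sketch below builds the vertebrate mitochondrial code from a 64-letter string (bases ordered T, C, A, G, first position changing slowest, `*` for Stop) and lists every first-two-base prefix whose four codons all give the same amino acid. The helper `ffd_third_bases`, which collects third-position bases at such codons in a coding sequence, is a hypothetical illustration of how FFD sites might be extracted.

```python
from itertools import product

BASES = "TCAG"
# Vertebrate mitochondrial code: ATA = M, TGA = W, AGA/AGG = Stop.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def four_fold_families(code):
    """Map each two-base prefix to its amino acid if all four codons agree."""
    families = {}
    for first, second in product(BASES, repeat=2):
        aas = {code[first + second + third] for third in BASES}
        if len(aas) == 1:
            families[first + second] = aas.pop()
    return families


def ffd_third_bases(cds, families):
    """Third-position bases of codons that belong to a four-codon family."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [c[2] for c in codons if c[:2] in families]


ffd = four_fold_families(CODE)
print(ffd)
print(ffd_third_bases("ATGCTAGCCACGTTT", ffd))
```

Eight families are found: S (TCN), L (CTN), P, R, T, V, A and G.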
Base frequencies at FFD sites in each gene (averaged over mammals)
Deamination: C to U and A to G on the heavy strand
Base frequencies at FFD sites are controlled by mutation.
Base frequencies at 1st and 2nd positions are influenced by mutation and selection
Model fitting (Data from Fish) – assume a fraction of fixed sites and a fraction of
neutral sites.
Selection at 1st position is weaker than at 2nd
Mutation pressure is sufficient to cause change in amino acid frequencies.
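Genetic code table (vertebrate mitochondrial). Rows: first position. Columns: second position. Each cell lists the amino acids for third position T, C, A, G.

         T            C            A                  G
T        F F L L      S S S S      Y Y Stop Stop      C C W W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I M M      T T T T      N N K K            S S Stop Stop
G        V V V V      A A A A      D D E E            G G G G

Amino acid numbering: 1 F, 2 L, 3 I, 4 M, 5 V, 6 S (TCN), 7 P, 8 T, 9 A, 10 Y, 11 H, 12 Q, 13 N, 14 K, 15 D, 16 E, 17 C, 18 W, 19 R, 20 S (AGY), 21 G.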
Slopes of the amino
acid freq v base freq
show the response of
the amino acid to
mutational pressure.
Black = fish
White = mammals
Amino acids in the
first two columns of
the code have larger
slopes.
Physical Properties of Amino Acids
            Vol.   Bulk.   Polarity   pI      Hyd.1   Hyd.2   Surface Area   Fract. Area
Ala   A     67     11.50   0.00       6.00    1.8     1.6     113            0.74
Arg   R     148    14.28   52.00      10.76   -4.5    -12.3   241            0.64
Asn   N     96     12.28   3.38       5.41    -3.5    -4.8    158            0.63
Asp   D     91     11.68   49.70      2.77    -3.5    -9.2    151            0.62
Cys   C     86     13.46   1.48       5.05    2.5     2.0     140            0.91
Gln   Q     114    14.45   3.53       5.65    -3.5    -4.1    189            0.62
Glu   E     109    13.57   49.90      3.22    -3.5    -8.2    183            0.62
Gly   G     48     3.40    0.00       5.97    -0.4    1.0     85             0.72
His   H     118    13.69   51.60      7.59    -3.2    -3.0    194            0.78
Each Amino Acid is a point in 8-d space.
dij = Euclidean distance between a.a. i and j in 8-d space.
[Figure: points plotted against axes y1, y2, y3.]
Principal Component Analysis Projects the 8-d space into the two
‘most important’ dimensions.
Big
Small
Hydrophobic
Hydrophilic
Responsiveness measures how much an amino acid frequency varies in
response to mutational pressure
= Root mean square of 8 slopes for each amino acid (i.e. 4 bases x 2
data sets)
Genetic code table (vertebrate mitochondrial). Rows: first position. Columns: second position. Each cell lists the amino acids for third position T, C, A, G.

         T            C            A                  G
T        F F L L      S S S S      Y Y Stop Stop      C C W W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I M M      T T T T      N N K K            S S Stop Stop
G        V V V V      A A A A      D D E E            G G G G
Proximity measures how similar the
neighbouring amino acids are in the
genetic code = Mean of 1/d for
accessible amino acids
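e.g.
$$\mathrm{Prox}(T) = \frac{1}{24}\left(\frac{2}{d_{TI}} + \frac{2}{d_{TM}} + \frac{6}{d_{TS}} + \frac{4}{d_{TP}} + \frac{4}{d_{TA}} + \frac{2}{d_{TN}} + \frac{2}{d_{TK}} + 2 \times 0\right)$$
The 24 is the number of non-synonymous single-base neighbours of the four Thr codons; the last term is the two neighbouring Stop codons, which contribute 0.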
Responsiveness and Proximity are highly correlated.
R = 0.87 (p < 10^-6)
An amino acid frequency responds to mutational pressure more easily if
there are neighbouring amino acids with similar physical properties.
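The weights in the Prox(T) example (2, 2, 6, 4, 4, 2, 2 and two Stops, 24 in total) can be recovered by counting single-base neighbours in the vertebrate mitochondrial code. The sketch below does that count and then evaluates proximity for a user-supplied distance function. The distances themselves come from the 8-d property space and are not reproduced in the slides, so the constant distance used in the demonstration is a placeholder only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Vertebrate mitochondrial code, '*' = Stop.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbour_counts(aa):
    """Count non-synonymous single-base neighbours of all codons for one amino acid."""
    counts = Counter()
    for codon, a in CODE.items():
        if a != aa:
            continue
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                nb = CODE[codon[:pos] + b + codon[pos + 1:]]
                if nb != aa:
                    counts[nb] += 1
    return counts


def proximity(aa, dist):
    """Mean of 1/d over accessible amino acids; Stop neighbours contribute 0."""
    counts = neighbour_counts(aa)
    total = sum(counts.values())
    return sum(n / dist(aa, nb) for nb, n in counts.items() if nb != "*") / total


thr = neighbour_counts("T")
print(dict(thr), sum(thr.values()))
# Placeholder distance: every pair of amino acids is 2.0 apart.
print(proximity("T", lambda a, b: 2.0))
```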
Urbina et al. (2006) J. Mol. Evol.
Homo sapiens Strand = + 3624 codons

F UUU  69     S UCU  29     Y UAU  35     C UGU   5
F UUC 139     S UCC  99     Y UAC  89     C UGC  17
L UUA  65     S UCA  81     * UAA   4     W UGA  90
L UUG  11     S UCG   7     * UAG   3     W UGG   9

L CUU  65     P CCU  37     H CAU  18     R CGU   6
L CUC 167     P CCC 119     H CAC  79     R CGC  26
L CUA 276     P CCA  52     Q CAA  82     R CGA  28
L CUG  42     P CCG   7     Q CAG   8     R CGG   0

I AUU 112     T ACU  50     N AAU  29     S AGU  11
I AUC 196     T ACC 155     N AAC 131     S AGC  37
M AUA 165     T ACA 132     K AAA  84     * AGA   1
M AUG  32     T ACG  10     K AAG   9     * AGG   0

V GUU  22     A GCU  39     D GAU  12     G GGU  16
V GUC  45     A GCC 123     D GAC  51     G GGC  87
V GUA  61     A GCA  79     E GAA  63     G GGA  61
V GUG   8     A GCG   5     E GAG  15     G GGG  19
Frequency ratios
$$r(X_2 Y_3) = \frac{p(X_2 Y_3)}{q(X_2)\, q(Y_3)}$$

Codon bias seems to be a dinucleotide mutational effect in mitochondria, rather than an effect of translational selection.

Fish - 23
UU 1.250    CU 0.939    GU 0.605
UC 0.756    CC 1.205    GC 0.878
UA 1.030    CA 0.938    GA 1.145
UG 1.274    CG 0.554    GG 1.891

Mammals - 23
UU 0.939    CU 1.101    GU 0.763
UC 0.743    CC 1.163    GC 1.005
UA 1.136    CA 0.906    GA 1.027
UG 1.433    CG 0.552    GG 1.654

Fish - 31
UU 0.933    CU 1.162    AU 0.907    GU 0.911
UC 0.918    CC 1.371    AC 0.739    GC 0.839
UA 1.096    CA 0.849    AA 1.135    GA 0.758
UG 1.049    CG 0.609    AG 1.228    GG 1.499

Mammals - 31
UU 0.855    CU 1.082    AU 0.996    GU 1.115
UC 0.994    CC 1.363    AC 0.797    GC 0.873
UA 1.206    CA 0.945    AA 0.974    GA 0.776
UG 0.856    CG 0.546    AG 1.293    GG 1.369

CpG effect.... (increased rate of C to U mutations in CG dinucleotides. Expect high UG and CA)
DNA binding proteins....
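The frequency ratio for positions 2 and 3 can be computed directly from a table of codon counts. In this sketch p(X2Y3) is the fraction of codons with X at the second and Y at the third position, and q(X2), q(Y3) are the single-position base frequencies. The four codon counts are invented toy numbers, and the slides do not say which subset of codons was used for the published ratios, so this shows only the arithmetic.

```python
from collections import Counter


def ratios_23(codon_counts):
    """r(X2Y3) = p(X2Y3) / (q(X2) * q(Y3)) from a dict of codon -> count."""
    total = sum(codon_counts.values())
    pair = Counter()    # counts of dinucleotides at positions 2-3
    second = Counter()  # counts of bases at position 2
    third = Counter()   # counts of bases at position 3
    for codon, n in codon_counts.items():
        pair[codon[1:]] += n
        second[codon[1]] += n
        third[codon[2]] += n
    return {xy: (n / total) / ((second[xy[0]] / total) * (third[xy[1]] / total))
            for xy, n in pair.items()}


# Toy counts (not real data): CG is rarer than its base frequencies predict.
toy_counts = {"ACG": 10, "ACA": 30, "AUG": 20, "AUA": 40}
r = ratios_23(toy_counts)
for xy in sorted(r):
    print(xy, round(r[xy], 3))
```

A ratio below 1 means the dinucleotide is under-represented relative to independent positions, as for CG in all four tables above.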
Changes in tRNA content of genomes from bacteria to mitochondria
Number of tRNA genes with each anticodon, and total number of tRNA genes.

                              Ala            Asp    Gln        Pro            Ser-UCN        Total
Anticodon                     GGC UGC CGC    GUC    UUG CUG    GGG UGG CGG    GGA UGA CGA
Epsilon Proteobacteria
Campylobacter jejuni          1   3   0      2      1   0      0   1   0      1   1   0      42
Helicobacter pylori J99       1   1   0      1      1   0      1   1   0      1   1   0      36
Gamma Proteobacteria
Pseudomonas aeruginosa        2   4   0      4      1   0      1   1   1      1   1   1      63
Vibrio parahaemolyticus       1   4   0      6      6   0      0   3   0      1   4   0      126
Haemophilus influenzae        1   3   0      3      2   0      0   2   0      1   2   0      56
Buchnera aphidicola           1   1   0      1      1   0      0   1   0      1   1   0      32#
Blochmannia floridanus        0   1   0      1      1   0      0   1   0      0   1   1      36#
Wigglesworthia glossinidia    1   1   0      1      1   0      0   1   0      1   1   0      34#
Escherichia coli K12          2   3   0      3      2   2      1   1   1      2   1   1      86
Alpha Proteobacteria
Agrobacterium tumefaciens     1   4   0      2      1   1      1   1   1      1   1   1      53
Sinorhizobium meliloti        1   3   1      2      1   1      1   1   1      1   1   1      51
Rickettsia prowazekii         1   1   0      1      1   0      0   1   0      1   1   0      33#
Wolbachia (D. mel)            0   1   0      1      1   0      0   1   0      1   1   0      34#
Caulobacter crescentus        1   2   1      2      1   0      1   1   1      1   1   1      51
Mitochondria
Reclinomonas americana        0   1   0      1      1   0      0   1   0      0   1   0      26
Homo sapiens                  0   1   0      1      1   0      0   1   0      0   1   0      22
Only one type of tRNA remains for each codon family in human mitochondria.
Still need 2 tRNAs for Leu and Ser. Therefore 22 in total.
# denotes intracellular parasite or endosymbiont. Small size genomes in
bacteria also have reduced numbers of tRNAs.
Evolution of the Genetic Code:
Before and After the LUCA
1. The genetic code evolved to its canonical form before the Last
Universal Common Ancestor of Archaea, Bacteria and
Eukaryotes - >3 billion years ago. It appears to be highly
optimized. How did it get to be this way?
2. Numerous small changes have occurred to the canonical code
since then. What is the mechanism of codon reassignment?
Codon Reassignment – The Genetic code is variable in mitochondria
(and also some cases of other types of genomes)
Canonical genetic code. Rows: first position. Columns: second position. Each cell lists the amino acids for third position U, C, A, G.

         U            C            A                  G
U        F F L L      S S S S      Y Y Stop Stop      C C Stop W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I I M      T T T T      N N K K            S S R R
G        V V V V      A A A A      D D E E            G G G G

Reassignments marked on the table:
CUN Leu to Thr
AGR Arg to Ser to Stop/Gly
UGA Stop to Trp
AUA Ile to Met
CGN Arg to unassigned
etc.....
But how can this happen? It should be disadvantageous.
Example 1: AUA was reassigned from Ile to Met during the early evolution of the
mitochondrial genome.
Before
       Codon       Anticodon       Notes
Ile    AUU, AUC    GAU             G in the wobble position of the tRNA-Ile can pair with U and C in the third codon position.
Ile    AUA         k2CAU           Bacteria and some protist mitochondria possess another tRNA-Ile with a modified base that translates AUA only.
Met    AUG         CAU             The tRNA-Met translates AUG only.

After
       Codon       Anticodon       Notes
Ile    AUU, AUC    GAU
Met    AUA, AUG    UAU or f5CAU    In animal mitochondria the k2CAU tRNA has been deleted. There is a gain of function of the tRNA-Met by a mutation or a base modification.
Example 2: UGA was reassigned from Stop to Trp many times
(12 times in mitochondria).
Before
        Codon       Anticodon    Notes
Stop    UGA         RF           Release Factor recognizes UGA codon.
Trp     UGG         CCA          Normal tRNA-Trp translates only UGG codons.

After
        Codon       Anticodon    Notes
Trp     UGA, UGG    UCA          In animal mitochondria (and elsewhere) there is a gain of function of the tRNA-Trp via mutation or base modification so that it translates both UGG and UGA.
The GAIN-LOSS framework
(Sengupta & Higgs, Genetics 2005)
LOSS = deletion or loss of function of a tRNA or RF
GAIN = gain of a new tRNA or a gain of function of an existing one.
Two routes lead from the initial code to the new code:
Initial Code. No Problem.
  -> GAIN -> Ambiguous codon. Selective disadvantage. -> LOSS -> New Code.
  -> LOSS -> Unassigned codon. Selective disadvantage. -> GAIN -> New Code.
New Code. Selective disadvantage because codons are used in wrong places.
  -> Mutations in coding sequences -> New Code. Codons now used in right places. No Problem.
Note – the strength of the selective disadvantage depends on the number of times the codon is used. There is no disadvantage if the codon disappears.
Four possible mechanisms of codon reassignment.
1. Codon Disappearance - The codon disappears. The order of the gain and
loss is irrelevant.
For the other three mechanisms the codon does not disappear.
2. Ambiguous Intermediate – The gain happens before the loss. There is a
period when the gain is fixed in the population and translation is
ambiguous.
3. Unassigned Codon – The loss happens before the gain. There is a period
when the loss is fixed in the population and the codon is unassigned.
4. Compensatory Change – The gain and loss are fixed in the population
simultaneously (although they do not arise at the same time). There is
no intermediate period between the old and the new codes. - cf. theory
of compensatory substitutions in RNA helices.
Sengupta & Higgs (2005) showed that all four mechanisms work in a
population genetics simulation
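The four definitions differ only in whether the codon disappears and in the order in which the gain and the loss become fixed. A small sketch makes the decision rule explicit. The function name, its arguments and the use of abstract fixation times are all assumptions for illustration; this is not the population genetics simulation of Sengupta & Higgs.

```python
def classify_reassignment(codon_disappears, gain_fixed_at, loss_fixed_at):
    """Name the reassignment mechanism from the order of fixation events.

    codon_disappears: True if the codon is absent from the genome during the change.
    gain_fixed_at, loss_fixed_at: times at which the gain and the loss are fixed
    in the population (any comparable numbers).
    """
    if codon_disappears:
        return "Codon Disappearance"       # order of gain and loss is irrelevant
    if gain_fixed_at < loss_fixed_at:
        return "Ambiguous Intermediate"    # gain first: translation is ambiguous
    if loss_fixed_at < gain_fixed_at:
        return "Unassigned Codon"          # loss first: codon is unassigned
    return "Compensatory Change"           # both fixed simultaneously


print(classify_reassignment(True, 5, 2))
print(classify_reassignment(False, 1, 4))
print(classify_reassignment(False, 4, 1))
print(classify_reassignment(False, 3, 3))
```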
Summary of Codon Reassignments in Mitochondria
Columns: codon reassignment | no. of times | can this be explained by GC -> AU mutation pressure? | change in no. of tRNAs | is mispairing important? | mechanism

UAG: Stop -> Leu               | 2     | G -> A at 3rd pos. | +1 | No                        | CD
UAG: Stop -> Ala               | 1     | G -> A at 3rd pos. | +1 | No                        | CD
UGA: Stop -> Trp               | 12    | G -> A at 2nd pos. | 0  | Possibly. CA at 3rd pos.  | CD
CUN: Leu -> Thr                | 1     | C -> U at 1st pos. | 0  | No                        | CD
CGN: Arg -> Unass              | 5     | C -> A at 1st pos. | -1 | No                        | CD
AUA: Ile -> Met or Unassigned  | 3 / 5 | No                 | -1 | Yes. GA at 3rd pos.       | UC
AAA: Lys -> Asn                | 2     | No                 | 0  | Yes. GA at 3rd pos.       | AI
AAA: Lys -> Unass              | 1     | No                 | 0  | Possibly. GA at 3rd pos.  | UC or AI
AGR: Arg -> Ser                | 1     | No                 | -1 | Yes. GA at 3rd pos.       | UC
AGR: Ser -> Stop               | 1     | No                 | 0  | No                        | AI(b)
AGR: Ser -> Gly                | 1     | No                 | +1 | No                        | AI(b)
UUA: Leu -> Stop               | 1     | No                 | 0  | No                        | UC or AI
UCA: Ser -> Stop               | 1     | No                 | 0  | No                        | UC or AI

CD = Codon Disappearance, AI = Ambiguous Intermediate, UC = Unassigned Codon.
CD mechanism explains disappearance of stop codons because they are
rare initially. Only a few examples of CD for sense codons. UC and AI are
important for sense codons.
Three examples in yeasts (Mutation pressure GC to AU)
CUN is rare (replaced by UUR)
Canonical genetic code. Rows: first position. Columns: second position. Each cell lists the amino acids for third position U, C, A, G.

         U            C            A                  G
U        F F L L      S S S S      Y Y Stop Stop      C C Stop W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I I M      T T T T      N N K K            S S R R
G        V V V V      A A A A      D D E E            G G G G
CUN Leu to Thr
CGN is rare (replaced by AGR)
CGN Arg codons become
unassigned.
AUA and AUU common and
AUC is rare
Nevertheless AUA is reassigned
to Met. Codon does not disappear
Leu and Arg codons in yeasts
Codon Disappearance causes reassignments
Species    Leu CUN    Leu UUR    Arg CGN    Arg AGR
S          53         192        7          33
Y.         44         618        0**        75
C          3          279        12         29
C          132        397        47         26
C          66         547        39         45
P          25         714        18         67
K          0          286        0**        48
C          11*        294        1**        45
S          33*        333        7          49
S          19*        274        0**        40
S          22*        300        0**        46
* CUN = Thr. Unusual tRNA-Thr present instead of tRNA-Leu
** CGN = unassigned. tRNA-Arg is deleted
AUA Ile to Met in Yeasts
codon      anticodon
AUU Ile    GAU
AUC Ile    “
AUA Ile    k2CAU
AUG Met    CAU

Codon Usage
Species    AUU    AUC    AUA    AUG    AUA is    tRNA
J          133    40     32     48     Ile       k2CAU
O          161    34     0      57     Absent    none
P          113    39     49     51     Ile       k2CAU

Species    AUU    AUC    AUA    AUG    AUA is    tRNA
C          119    81     229    100    Ile       k2CAU
C          303    32     193    117    Ile       k2CAU
P          274    18     562    105    Ile       k2CAU
K          213    16     7      63     ?         none
C          207    21     16     73     Met       C*AU
S          239    31     60     73     Met       C*AU
S          203    7      101    56     Met       C*AU
S          218    11     95     70     Met       C*AU
Reassignments in Metazoa
[Figure: phylogeny of Porifera, Cnidaria, Arthropoda, Nematoda, Lophotrochozoa, Platyhelminthes, Echinodermata, Hemichordata, Urochordata, Cephalochordata and Craniata, with the following events marked on branches.]
- Loss of tRNA-Ile(CAU) but AUA remains Ile
- Loss of tRNA-Arg(UCU) and AGR : Arg -> Ser
- Loss of many tRNAs + import from cytoplasm
- AUA : Ile -> Met
- AGR : Ser -> Stop
- AGR : Ser -> Gly
- AAA : Lys -> Asn
- AAA : Lys -> unassigned
AGR in Metazoa – One loss of tRNA-Arg with several responses.
Canonical genetic code. Rows: first position. Columns: second position. Each cell lists the amino acids for third position U, C, A, G.

         U            C            A                  G
U        F F L L      S S S S      Y Y Stop Stop      C C Stop W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I I M      T T T T      N N K K            S S R R
G        V V V V      A A A A      D D E E            G G G G
codon      anticodon
AGU Ser    GCU
AGC Ser    “
AGA Arg    UCU
AGG Arg    “
AGR can become
(i) Ser/Unass (e.g. Arthropods)
(ii) Stop (e.g. Vertebrates)
(iii) Gly (e.g. Urochordates)
Evolution of the canonical code - Before the LUCA
The canonical code seems to be optimized to reduce the effects of
translational and mutational errors.
Neighbouring codons code for similar amino acids.
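Woese’s polar requirement scale
Measure difference between amino acid properties by how far apart they are on this scale.
[Figure: axis running from 5 to 13, with the amino acids in order of increasing polar requirement: C, L, I, F, W, M, Y, V, P, T, A, S, G, H, Q, R, N, K, E, D.]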
Cost function g(a,b) for replacing amino acid a by amino acid b
e.g. difference in Polar Requirement
$$E = \frac{\sum_i \sum_j r_{ij}\, g(a_i, a_j)}{\sum_i \sum_j r_{ij}}$$
rij = rate of mistaking codon i for codon j
= 1 for single position mistakes, 0 otherwise
E = measure of error associated with a code
Generate random codes by permuting the 20 amino acids in the code table
E is smaller for the canonical code than for almost all random codes.
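The comparison with random codes can be sketched in a few dozen lines. Several choices below are assumptions of this sketch and not taken from the slides: g(a, b) is the absolute difference in polar requirement, all single-base mistakes are weighted equally (r = 1), pairs involving a Stop codon are skipped, and random codes are made by permuting the 20 amino acids among the existing codon blocks while Stop codons stay fixed. The polar requirement values are the commonly quoted ones for Woese's scale. With equal weights and only 1000 random codes, the sketch cannot reproduce the one-in-a-million figure, which relies on weighted error rates.

```python
import random
from itertools import product

BASES = "UCAG"
# Canonical code, '*' = Stop; first position changes slowest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CANONICAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Woese's polar requirement (commonly quoted values).
POLAR = {"A": 7.0, "R": 9.1, "N": 10.0, "D": 13.0, "C": 4.8, "Q": 8.6, "E": 12.5,
         "G": 7.9, "H": 8.4, "I": 4.9, "L": 4.9, "K": 10.1, "M": 5.3, "F": 5.0,
         "P": 6.6, "S": 7.5, "T": 6.6, "W": 5.2, "Y": 5.4, "V": 5.6}


def error(code):
    """E = mean cost g over all single-position codon mistakes (Stops skipped)."""
    total, pairs = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                nb = code[codon[:pos] + b + codon[pos + 1:]]
                if nb == "*":
                    continue
                total += abs(POLAR[aa] - POLAR[nb])  # assumed cost function g
                pairs += 1
    return total / pairs


def random_code(rng):
    """Permute the 20 amino acids among the canonical codon blocks."""
    aas = sorted(POLAR)
    shuffled = aas[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(aas, shuffled))
    mapping["*"] = "*"
    return {codon: mapping[aa] for codon, aa in CANONICAL.items()}


rng = random.Random(0)
e_real = error(CANONICAL)
e_random = [error(random_code(rng)) for _ in range(1000)]
f = sum(e < e_real for e in e_random) / len(e_random)
print(round(e_real, 3), round(sum(e_random) / len(e_random), 3), f)
```

Here `f` estimates the fraction of random codes that do better than the canonical code under these simplified assumptions.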
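[Figure: distribution p(E) of E over random codes, with Ereal in the lower tail; f is the fraction of random codes with E below Ereal.]
f ~ 10^-6: one in a million codes is better (Freeland and Hurst)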
Principal Component Analysis Projects the 8-d space into the two
‘most important’ dimensions.
Big
Small
Hydrophobic
Hydrophilic
Modified codes show that the Canonical code could have changed as it
evolved – not completely a frozen accident.
Possibility of competition between organisms with different codes – natural
selection.
Early codes had <20 amino acids (???). Gradual increase in complexity.
Increased repertoire of amino acids gives more protein functions.
Order of addition –
Astrobiology - which amino acids were common on early Earth?
Prebiotic synthesis of amino acids
Amino acids are found in
• Meteorites
• Atmospheric chemistry experiments (Miller-Urey)
• Hydrothermal synthesis
• Icy dust grains in space
Rank amino acids in order of decreasing frequency in 12 observations.
Derive mean ranking.
G A D E V S I L P T (found non-biologically - early amino acids)
K R H F Q N Y W C M (not found non-biologically – late amino acids)
Early and Late amino acids are determined by thermodynamics
Positions of early and late amino acids....
What does this mean?
Canonical genetic code. Rows: first position. Columns: second position. Each cell lists the amino acids for third position U, C, A, G.

         U            C            A                  G
U        F F L L      S S S S      Y Y Stop Stop      C C Stop W
C        L L L L      P P P P      H H Q Q            R R R R
A        I I I M      T T T T      N N K K            S S R R
G        V V V V      A A A A      D D E E            G G G G
Maybe only 2nd position was
relevant initially.
Late amino acids took over
codons previously assigned
to amino acids with similar
properties.
Other points –
Column structure suggests that translational errors were more important
than mutational errors (tRNA structure/RNA world)
Precursor-product pairs tend to be neighbours (but doubts over
statistical significance). Maybe late amino acids took over codons
previously assigned to their biochemical precursors.
Direct chemical interactions between RNA motifs and amino acids
(“stereochemical theory”). In vitro selection experiments suggest
binding sites of aptamers preferentially contain codon and anticodon
sequences.