Nothing in (computational) biology makes
sense except in the light of evolution
after Theodosius Dobzhansky (1973)
A brief history and some central principles of
evolutionary (computational) genomics
A brief timeline of genomics

Year | Event | Ref.
1962 | The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl) | [940]
1965 | Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers) | [169]
1970 | Needleman-Wunsch algorithm for global protein sequence alignment | [602]
1976 | First RNA genome sequence (MS2 phage) determined directly from RNA (Walter Fiers) | -
1977 | New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage φX174 sequence | [549,739]
1977 | First software for sequence analysis (Roger Staden) | [792]
1977 | Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers) | [899]
1981 | Smith-Waterman algorithm for local protein sequence alignment | [779]
1981 | Human mitochondrial genome sequenced | [28]
1981 | The concept of a sequence motif (Russell Doolittle) | [181]
1982 | GenBank Release 3 made public | -
1982 | Phage λ genome sequenced (Fred Sanger and coworkers) | [738]
1983 | The first practical sequence database searching algorithm (John Wilbur and David Lipman) | [886]
1985 | FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman) | [517]
1986 | Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers) | [105]
1987 | First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg) | [311]
1988 | National Center for Biotechnology Information (NCBI) created at NIH/NLM | -
1988 | EMBnet network for database distribution created | -
1990 | BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers) | [20]
1991 | EST: expressed sequence tag sequencing (Craig Venter and coworkers) | [4]
1994 | Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers) | [69,70,469]
1994 | SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers) | [586]
1995 | First bacterial genomes completely sequenced | [228,238]
1996 | First archaeal genome completely sequenced | [127]
1996 | First eukaryotic genome (yeast) completely sequenced | [286]
1997 | Introduction of gapped BLAST and PSI-BLAST | [22]
1997 | COGs: Evolutionary classification of proteins from complete genomes | [823]
1998 | Worm genome, the first multicellular genome, (nearly) completely sequenced | [834]
1999 | Fly genome (nearly) completely sequenced | [3]
2001 | Human genome (nearly) completely sequenced | [484,864]
J. Mol Biol 1982 Dec 25;162(4):729-73
Nucleotide sequence of bacteriophage lambda DNA.
Sanger F, Coulson AR, Hong GF, Hill DF, Petersen GB.
The DNA in its circular form contains 48,502 pairs of nucleotides.
…
Open reading frames were identified and, where possible, ascribed
to genes by comparing with the previously determined genetic map.
The reading frames for 46 genes were clearly identified…
There are about 20 other unidentified reading frames that may code
for proteins.
…
Neither protein sequence comparison nor homology is mentioned in
this paper.
Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins

Gene product | Evolutionary conservation | Structure, domain architecture | Predicted function, reference
A (TerL) | Bacteriophages, herpesviruses | A modified P-loop ATPase domain, distantly related to a vast class of helicases | ATPase subunit of the terminase, involved in DNA packaging in phage head
C | Bacteria and archaea | ClpP protease domain | Minor capsid protein, cleaves the scaffold protein during maturation
K | Bacteria, archaea and eukaryotes | Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal NLPC domain (uncharacterized domain found in bacterial lipoproteins) | Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [675]
Ea31 | Scattered distribution in bacteria and archaea | Endo VII-colicin domain | Predicted nuclease of the McrA (HNH) family [49]
Ea59 | Bacteria, archaea and eukaryotes | P-loop ATPase domain of the ABC class | Predicted ATPase [292]
Exo (RedX) | Bacteria, archaea, eukaryotes, viruses | Exonuclease domain, distantly related to a broad variety of nucleases | A nuclease involved in phage recombination and late rolling-circle replication
CI | Bacteria, archaea | N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family | Transcription repressor of genes required for lytic development
Cro | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Transcription repressor of early genes
O | Bacteria, archaea | Helix-turn-helix DNA-binding domain | DNA-binding protein involved in the initiation of replication
Ren | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290 | Bacteria, archaea, eukaryotes | PP-loop ATPase domain | Predicted ATP pyrophosphatase, role in phage replication unknown [100]
Nin221 | Bacteria, archaea, eukaryotes | Calcineurin-like serine/threonine protein phosphatase domain | Protein phosphatase, role in phage replication unknown [446]
[Figure: Growth of the number of completely sequenced genomes, 1994 to 2002. Y-axis ticks: 0 to 100. Series: Bacteria, Archaea, Eukaryotes, Total.]
Figure 1.2. The current state of annotation of some genomes.
The data were derived from the original genome sequencing papers
Homology: common ancestry of genes or portions thereof
(a qualitative notion as opposed to similarity)
[Diagram: gene tree across Species 1, Species 2 and Species 3, with orthologs and paralogs marked]
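The distinction between orthologs and paralogs can be illustrated with a small sketch. It assumes the standard definitions (orthologs diverge at a speciation event, paralogs at a duplication event) and a gene tree whose internal nodes are already labeled with their event type. The tree, the gene names and the function names are hypothetical and chosen only for illustration.

```python
# A gene tree as nested tuples: (event, left, right) for internal nodes,
# where event is "speciation" or "duplication"; leaves are gene names.
# Hypothetical toy tree: one duplication that predates the divergence of
# three species, so each species carries a copy A and a copy B.
TREE = (
    "duplication",
    ("speciation", ("speciation", "sp1_A", "sp2_A"), "sp3_A"),
    ("speciation", ("speciation", "sp1_B", "sp2_B"), "sp3_B"),
)


def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have it as their last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(TREE)
for pair, label in sorted((sorted(p), l) for p, l in relations.items()):
    print(pair[0], pair[1], label)
```

In this toy tree every A/B pair is paralogous, including pairs from different species, because the duplication precedes all speciation events.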
Evolution by Gene Duplication (Susumu Ohno, 1970)
Gene duplication with subsequent diversification is the principal path to innovation in evolution
Table 2.2. Expansion of signaling domains in C. elegans (a)

Species | Proteins | Ser/Thr/Tyr kinase | Ser/Thr/Tyr phosphatase | BRCT | SH3 | VWA | WD40
C. elegans | 19,100 | 435 | 112 | 26 | 58 | 65 | 127
S. cerevisiae | 6,500 | 116 | 14 | 10 | 24 | 3 | 110
E. coli | 4,289 | 3 | 1 | 1 | 1 | 4 | 0
B. subtilis | 4,100 | 4 | 0 | 1 | 6 | 5 | 0
M. tuberculosis | 3,918 | 13 | 1 | 1 | 0 | 4 | 4
Synechocystis | 3,169 | 12 | 0 | 1 | 3 | 4 | 2
A. fulgidus | 2,420 | 4 | 0 | 0 | 0 | 2 | 0
M. thermoautotrophicum | 1,869 | 4 | 0 | 0 | 0 | 2 | 0
M. jannaschii | 1,715 | 4 | 2 | 2 | 0 | 0 | 1
A. aeolicus | 1,522 | 0 | 0 | 3 | 1 | 0 | 0

(a) The data are from ref. [675]. Domain abbreviations are as in the SMART database (see §3.3): BRCT, BRCA1 C-terminal domain; SH3, Src homology 3 domain; VWA, von Willebrand factor A domain; WD40, Trp,Asp-repeat domain.
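One way to separate domain expansion from sheer proteome size is to express each count in Table 2.2 per 1,000 proteins. The sketch below does this for three species and a subset of the domains. The counts are transcribed from the table on the assumption that its values are listed row by row; the function and variable names are illustrative only.

```python
# Proteome sizes and domain counts transcribed from Table 2.2.
# Assumption: the flattened table values are listed row by row.
PROTEOME = {"C. elegans": 19100, "S. cerevisiae": 6500, "E. coli": 4289}
DOMAINS = {
    "C. elegans": {"kinase": 435, "SH3": 58, "VWA": 65, "WD40": 127},
    "S. cerevisiae": {"kinase": 116, "SH3": 24, "VWA": 3, "WD40": 110},
    "E. coli": {"kinase": 3, "SH3": 1, "VWA": 4, "WD40": 0},
}


def per_thousand(count, total):
    """Number of domain-containing proteins per 1,000 encoded proteins."""
    return 1000.0 * count / total


def density(species, domain):
    """Normalized domain count for one species and one domain."""
    return per_thousand(DOMAINS[species][domain], PROTEOME[species])


for species in PROTEOME:
    row = ", ".join(f"{d}: {density(species, d):.2f}" for d in DOMAINS[species])
    print(f"{species}: {row}")
```

Comparing the normalized values between species shows which domain families grew faster than the proteome as a whole and which merely scaled with it.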
[Figure: number of proteins in COGs and not in COGs for each genome. Y-axis: number of proteins, 0 to 6000. Genomes: Sce, Bbu, Mtu, Ssp, Cpn, Mpn, Pho, Hpy, Eco, Bsu, Tpa, Jhp, Mja, Ctr, Afu, Mth, Hin, Rpr, Mge, Tma, Aae.]
The majority of the proteins in each prokaryote, but only ~1/3 of yeast proteins,
belong to COGs (ancient conserved families)
MOST OF THE COGs ARE REPRESENTED ONLY IN A SMALL NUMBER OF CLADES
MAJOR ROLE OF HORIZONTAL GENE TRANSFER AND CLADE-SPECIFIC GENE LOSS IN EVOLUTION
[Diagram: ancestor, speciation, descendants; gene loss marked on two branches]
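The observation that most COGs occur in only a few clades rests on phyletic patterns, that is, the presence or absence of each COG across clades. A minimal sketch of how such patterns might be tallied is shown below. The COG identifiers, clade names and data are hypothetical toy values, not taken from the COG database.

```python
from collections import Counter

# Hypothetical toy phyletic patterns: COG id -> clades in which it is present.
PATTERNS = {
    "COG_a": {"clade_1", "clade_2", "clade_3"},
    "COG_b": {"clade_1"},
    "COG_c": {"clade_3"},
    "COG_d": {"clade_1", "clade_2"},
    "COG_e": {"clade_2"},
}


def clade_count_histogram(patterns):
    """Count how many COGs are represented in exactly n clades, for each n."""
    return Counter(len(clades) for clades in patterns.values())


hist = clade_count_histogram(PATTERNS)
for n in sorted(hist):
    print(f"{hist[n]} COG(s) present in {n} clade(s)")
```

A histogram skewed toward small clade counts is the kind of pattern that horizontal gene transfer and clade-specific gene loss are invoked to explain.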
Non-orthologous displacement: two unrelated (or distantly
related) proteins for the same essential function
Figure 2.3. Structural alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT)
and lysozymes from E. coli bacteriophages (1AM7) and T4 (1L92).
153L .GEKLC.VE.PAVIAGIISRESHAG..KVLK....NGWGD...R..........
3LZT gLDNYRgYS.LGNWVCAAKFESNFN.........tQATNR...N..........
1AM7 .mvEIN.NQrKAFLDMLAWSEGTDngrQKTRnhgyDVIVGgelftdysdhprkl
1L92 ..........MNIFEMLRIDEG...........lrlKIYKdteG..........

153L ........GNGFGLMQVDKRSH...............KP........QG..TWN
3LZT .....tdgsTDYGILQINSRWWcndgrtpgsrnlcniPC........SAllSSD
1AM7 vtlnpklkSTGAGRYQLLSRWW...............DayrkqlglkDF..SP.
1L92 ........YYTIG.IGHLLT.........kspslnaakseldkaigrntngvIT

153L .GEVHITQGTTILINF.IKTIQK...KFPS.WTKD..QQLKGGISAYNAGAGNVR
3LZT ITASVNCAKKIVSDG.N........................GMNAWV.......
1AM7 ..KSQDAVALQQIKERgALPM...........idR..GDIRQAIDRCSN....iw
1L92 .KDEAEKLFNQDVDAA.VRGILRnakLKPVyDSLDavRRAAIINMVFQMGETGVA

153L .SYARMDIGT....................THDDYANDVV....ARAQYYKQHGY
3LZT ................................awRNRCK...gTDVQAWIRGCr
1AM7 .aslpGAGY...................gqfEHKA.DSLI....AKFKEAGgtvr
1L92 .gftnslrmlqqkrwdeaavnlaksrwynqTPNRAkrvittfrtgtwDAYK....
Structure-based sequence alignment of goose lysozyme (153L), chicken egg white lysozyme (3LZT) and
lysozymes from E. coli bacteriophages (1AM7) and T4 (1L92).
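Sequence similarity in an alignment like this one is a quantity that can be measured, in contrast to homology, which is a qualitative inference. A minimal sketch of one common measure, percent identity over aligned columns, is given below. It assumes that '.' marks a gap and that letter case can be ignored when comparing residues; both are assumptions about the notation, and the two rows are hypothetical toy sequences rather than rows from the figure.

```python
GAP = "."  # assumption: '.' marks a gap, as in the alignment rows above


def percent_identity(row_a, row_b):
    """Percent identity over columns where both rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            continue  # skip columns with a gap in either row
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy rows (hypothetical, not taken from Figure 2.3).
row_1 = "MKV.LEAGT"
row_2 = "MRVALE.GS"
print(f"{percent_identity(row_1, row_2):.1f}% identity")
```

The choice of denominator (columns without gaps, as here, versus the full alignment length or the shorter sequence) changes the value, so any reported identity should state which convention was used.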
Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic activity); the rest of the protein serves largely as a structural scaffold
Significant sequence conservation is evidence of homology
Proteins with different structural folds can perform the same function (non-orthologous displacement)
Proteins (domains) with the same fold are most likely to be homologous
Convergence does not produce significant sequence or structural similarity