Transcript worm
Why Is Sequence Comparison Useful?
Lipman, David (NIH/NLM/NCBI)
Almost 100 Trillion BLAST comparisons per quarter (10/01)
[Chart: BLAST comparisons per quarter, by quarter from 1998 to 2001; vertical axis from 0.E+00 to 1.E+14]
Rapid similarity searches of nucleic acid and
protein data banks.
Wilbur WJ, Lipman DJ.
Proc Natl Acad Sci U S A 1983 Feb;80(3):726-30
With the development of large data banks of protein and nucleic acid
sequences, the need for efficient methods of searching such banks for
sequences similar to a given sequence has become evident. We present an
algorithm for the global comparison of sequences based on matching k-tuples
of sequence elements for a fixed k. The method results in substantial
reduction in the time required to search a data bank when compared with
prior techniques of similarity analysis, with minimal loss in sensitivity. The
algorithm has also been adapted, in a separate implementation, to produce
rigorous sequence alignments. Currently, using the DEC KL-10 system, we
can compare all sequences in the entire Protein Data Bank of the National
Biomedical Research Foundation with a 350-residue query sequence in less
than 3 min and carry out a similar analysis with a 500-base query sequence
against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base
in less than 2 min.
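The k-tuple idea in the abstract above can be illustrated with a short sketch. This is not the Wilbur-Lipman implementation: it only indexes the k-tuples of a query, scans a second sequence, and counts shared k-tuples per diagonal (offset between the two sequences). The function names and the choice of k = 2 are assumptions made for illustration.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject offset minus query offset).

    Returns (diagonal, count) for the diagonal with the most matching
    k-tuples, or (None, 0) if the sequences share no k-tuple.
    """
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    if not hits:
        return None, 0
    diagonal = max(hits, key=hits.get)
    return diagonal, hits[diagonal]


# First 60 aligned residues of v-sis and PDGF, as quoted in the alignment
# shown later in this transcript.
v_sis = "QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK"
pdgf = "QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR"

diagonal, count = best_diagonal(v_sis, pdgf, k=2)
print(f"best diagonal: {diagonal}, shared 2-tuples on it: {count}")
```

Only exact k-tuple matches are counted here, so a search scans each sequence once instead of filling a full alignment matrix; a rigorous alignment would then be needed only for the best-scoring candidates.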
Cancer Gene Meets Its Match
NY Times July 3, 1983
“…a serendipitous computer search…”
Waterfield MD et al., Nature 1983 Jul 7;304(5921):35-39
Doolittle RF et al., Science 1983 Jul 15;221(4607):275-277
v-sis: 6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
QGDPIPEELY+MLS HSIRSFDDLQRLL GD G+EDGAELDLNMTRSHSGGELESLARG+
PDGF : 10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69
v-sis: 66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
RSLGSL++AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF : 70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129
v-sis: 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF : 130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189
v-sis: 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
AKT Q+RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF : 190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230
V-sis and Platelet-Derived Growth Factor (PDGF)
An earlier, more subtle discovery…
Viral src gene products are related to the catalytic chain of
mammalian cAMP-dependent protein kinase.
Barker WC, Dayhoff MO. PNAS 1982 May;79(9):2836-2839
Query: 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
           Y+  +V    +LHS  +++ DLKP N+LI +Q   +++DFG +++   ++GR  +   +
Sbjct: 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184
Query: 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
           GT  + APEI+  +      D ++ G+ +++M     P ++ +P  +   +V+  +R
Sbjct: 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240
Biology not Algorithms
- compare proteins, not DNA
- must detect similar amino acids not just identities
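The difference between counting identities and counting similar amino acids can be sketched as below. The residue groups are a coarse simplification chosen only for illustration; they are an assumption of this sketch, and a real comparison would use a full substitution matrix instead. The two strings are the first aligned block of the src/protein kinase alignment shown earlier in this transcript.

```python
# Coarse physicochemical groups, chosen only for illustration (an assumption
# of this sketch, not a published substitution matrix).
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("STNQ"),  # polar, uncharged
]


def is_similar(a, b):
    """True if both residues fall in the same illustrative group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(seq1, seq2):
    """Count identical and similar-but-not-identical aligned columns.

    Columns containing a gap character ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    identical = similar = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    return identical, similar


query = "YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC"
sbjct = "YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG"

identical, similar = identity_and_similarity(query, sbjct)
print(f"identical columns: {identical}, similar columns: {similar}")
```

For a distant relationship like this one, the similar columns add signal that an identity-only count would miss.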
How often would one find matches?
How many protein families would there be?
In 1983, only a small percentage of the genes
from the genomes of a number of evolutionarily
distant organisms (e.g., human, fly, yeast, E. coli)
were available.
Unexpected similarities should be extremely rare.
Estimating number of protein families
Earliest Estimates of Number of
Protein Families - ~1000
• Zuckerkandl,E. (1974) Accomplissement et perspectives de la
paleogenetique chimique. In: Ecole de Roscoff –1974, p. 69. Paris:CNRS.
“The appearance of new structures and functions in proteins during evolution”, J.
Mol. Evol. 7, 1-57 (1975).
• Dayhoff, M.O. (1974) Federation Proceedings 33, 2314.
“The origin and evolution of protein superfamilies”, Fed.Proc. 35, 2132-2138
(1976).
Margaret Dayhoff
Atlas of Protein Sequence and Structure,
Vol. 5, Supplement 3 (1978) pg. 10:
“It has been estimated that in humans there
are approximately 50,000 proteins of
functional or medical importance. … A
landmark of molecular biology will occur
when one member of each superfamily has
been elucidated. At the present rate of 25 per
year, this will take less than 15 years.”
Hubris, the Genome Project, and
Protein Families
Chothia, C. (1992). One thousand families for the
molecular biologist. Nature, 357, 543-544.
Green P, Lipman D, Hillier L, Waterston R, States D, and
Claverie JM (1993). Ancient Conserved Regions in
New Gene Sequences and the Protein Databases.
Science, 259, 1711-1716.
ACR = similarity detected between sequences from
distantly related organisms
1992: What new families do we get from
the genome projects?
Set          N      Coding sequences   Seq. with ACRs   ACRs
human ESTs   2644   600-1200           197 (16-33%)     103
worm ESTs    1472   1370               570 (42%)        240
worm genes   234    234                74 (32%)         59
yeast ORFs   182    182                43 (24%)         35
Sets compared             Matching sequences   ACRs   ACRs in database
worm ESTs, human ESTs     77, 66               34     31 (91%)
worm ESTs, yeast ORFs     23, 13               9      8 (89%)
worm genes, human ESTs    17, 17               12     12 (100%)
worm genes, yeast ORFs    6, 4                 4      3 (75%)
human ESTs, yeast ORFs    14, 13               10     10 (100%)
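The percentages in the two tables on this slide can be rechecked from the counts. The sketch below assumes that the "Seq. with ACRs" percentages are taken over the coding-sequence counts and that the "ACRs in database" percentages are taken over the ACRs found in each comparison; the helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# (sequences with ACRs, coding sequences) per set, from the first table.
# Assumption: the denominator is the coding-sequence count.
with_acrs = {
    "worm ESTs": (570, 1370),
    "worm genes": (74, 234),
    "yeast ORFs": (43, 182),
}

# Human ESTs are given with a range of 600-1200 coding sequences,
# so the percentage is a range as well.
human_range = (percent(197, 1200), percent(197, 600))

# (ACRs already in the database, ACRs found) per comparison, second table.
in_database = {
    "worm ESTs, human ESTs": (31, 34),
    "worm ESTs, yeast ORFs": (8, 9),
    "worm genes, human ESTs": (12, 12),
    "worm genes, yeast ORFs": (3, 4),
    "human ESTs, yeast ORFs": (10, 10),
}

for name, (part, whole) in with_acrs.items():
    print(f"{name}: {percent(part, whole)}% of coding sequences have ACRs")
print(f"human ESTs: {human_range[0]}-{human_range[1]}%")
for name, (part, whole) in in_database.items():
    print(f"{name}: {percent(part, whole)}% of ACRs already in the database")
```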
Cumulative growth in number of proteins & number of
conserved domains (from Geer, L., Bryant, S., & Ostell, J.)
[Figure: two curves plotted against year, 1960 to 2000: Protein Sequences (axis "Number of Proteins", 0.0 to 1.2*10^6) and Conserved Domain Families (axis "% Families Hit", 0 to 100). Annotations: "Dayhoff 10% of superfamilies" and "Green et al. 85% of ACRs".]
Why so few families and why do
they evolve slowly?
Structural View
Thermodynamics: Finkelstein, AV,
“Why are the same protein folds
used to perform different
functions?” FEBS 325, pp. 23-28
(1993)
Constraints Due To Biological
Function May Be More Important
Compare pairs of sequences from related classes of proteins
– All sequences should at least share structural similarity
– Divergence times for all sequences should be approximately the same
– Sequences within a class share function but sequences between
classes have differing function
A degree of within-class similarity greater than between-class similarity
indicates the importance of constraints due to biological function.
[Diagram labels: One gene; Gene duplication; Functional divergence; Last universal common ancestor; eukaryotes; prokaryotes]
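The within-class versus between-class comparison described on this slide can be sketched as follows. The sequences are made-up, pre-aligned toy strings used only to show the bookkeeping; they are not real proteins, and plain fractional identity stands in for whatever similarity measure the actual analysis used.

```python
from itertools import combinations, product


def identity(a, b):
    """Fraction of identical positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Made-up toy data (hypothetical): two classes of related sequences.
classes = {
    "class_1": ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"],
    "class_2": ["MRTGWIPK", "MRTGWLPK"],
}

# Average over all pairs drawn from the same class.
within = mean(
    identity(a, b)
    for seqs in classes.values()
    for a, b in combinations(seqs, 2)
)

# Average over all pairs drawn from two different classes.
between = mean(
    identity(a, b)
    for s1, s2 in combinations(classes.values(), 2)
    for a, b in product(s1, s2)
)

print(f"mean within-class identity:  {within:.2f}")
print(f"mean between-class identity: {between:.2f}")
```

If divergence times are about the same for all pairs, a within-class average well above the between-class average points to functional constraint rather than elapsed time.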
Example from the Aminoacyl-tRNA
synthetases (aaRS) (from E. Koonin & Y. Wolf)
essential enzymes responsible for incorporation of amino acids into proteins
• Two unrelated classes of aaRS, each includes
10 aaRS related to each other
• The last universal common ancestor (LUCA) of
modern life forms already had at least 17 aaRS
• The duplication leading to aaRS of different
specificities must have occurred during a relatively
short period of early evolution
• The post-LUCA evolution of aaRS took much
longer than the early phase when the specificities
were established. However, the changes that
occurred after the aaRS were locked in their
specificities are small compared to the changes
traced to the early phase
Orthologs … (from S. Bryant)
Paralogs … (from S. Bryant)
Example from the Aminoacyl-tRNA
Synthetases (aaRS) (from E. Koonin & Y. Wolf)
[Figure: four panels (ArgRS, HisRS, ValRS, TrpRS), each with a horizontal axis from 0.00 to 0.40 and a vertical axis from 0.0 to 1.0]
Exceptions: glutamine/glutamate, asparagine/aspartate & tryptophan/tyrosine
How many human genes?
80,000 Antequera F & Bird A, “Number of CpG islands and genes in
human and mouse”, PNAS 90, 11995-11999 (1993).
120,000 Liang F et al., “Gene Index analysis of the human genome
estimates approximately 120,000 genes”, Nat. Gen. 25, 239-240 (2000).
35,000 Ewing B & Green P, “Analysis of expressed sequence tags
indicates 35,000 human genes”, Nat. Gen. 25, 232-234 (2000).
28,000-34,000 Roest Crollius H et al., “Estimate of human gene number
provided by genome-wide analysis using Tetraodon nigroviridis DNA
sequence”, Nat. Gen. 25, 235-238 (2000).
41,000-45,000 Das M et al., “Assessment of the Total Number of Human
Transcription Units”, Genomics 77, 71-78 (2001).
How many human genes with ACRs?
(from S. Resenchuk, T. Tatusova, L. Wagner, A. Souvorov)
12,245 characterized mRNAs from RefSeq
78% have ACR, i.e., hit outside vertebrates at E < 10e-6 (9,496/12,245)
90% of these have corresponding GenomeScan predictions which
also have ACR (8501/9496)
20,245 GS models for entire human genome have ACR
15,573 GS models after correction for splitting (20,245/1.3)
17,300 estimated human genes with ACRs ( ~15,573/.9)
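The chain of arithmetic on this slide can be retraced step by step. The sketch below uses only the counts quoted above; the variable names are hypothetical, and it follows the slide in dividing by the rounded recovery rate (0.9) instead of the unrounded fraction.

```python
# Counts and correction factor as quoted on the slide.
refseq_mrnas = 12_245        # characterized mRNAs from RefSeq
refseq_with_acr = 9_496      # of those, with a hit outside vertebrates
gs_also_with_acr = 8_501     # of those, GenomeScan prediction also has an ACR
gs_models_with_acr = 20_245  # GenomeScan (GS) models with an ACR, whole genome
splitting_factor = 1.3       # correction for genes split into several models

acr_fraction = refseq_with_acr / refseq_mrnas    # about 78%
gs_recovery = gs_also_with_acr / refseq_with_acr  # about 90%

# Correct the genome-wide model count for splitting.
corrected_models = round(gs_models_with_acr / splitting_factor)

# Scale up for ACR-containing genes that GenomeScan misses, using the
# recovery rate rounded to one decimal place as on the slide.
genes_with_acr = round(corrected_models / round(gs_recovery, 1))

print(f"{acr_fraction:.0%} of RefSeq mRNAs have an ACR")
print(f"{gs_recovery:.0%} of those are recovered by GenomeScan")
print(f"{corrected_models:,} GS models after correction for splitting")
print(f"{genes_with_acr:,} estimated human genes with ACRs")
```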
How many human genes?
17,303 estimated human genes with ACRs
Now use comparative genomics…
Organism     ACRs/genes     %
S. cerev.    4022/6306      63%
S. pombe     4846/6593      73%
A. thal.     14443/24605    58%
C. elegans   11598/20850    55%
D. mela.     10469/14335    73%
17,303/.55 = ~31,500 Total Human Genes
More complicated than that!
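The extrapolation on this slide divides the estimated number of human genes with ACRs by the fraction of genes with ACRs in a reference genome. The slide uses the 55% figure from C. elegans; the sketch below also applies the other percentages from the table, purely to show how much the total depends on that choice. The function name is hypothetical.

```python
genes_with_acr = 17_303  # estimated human genes with ACRs, from the slide

# Percentage of genes with ACRs per organism, as rounded on the slide.
acr_percent = {
    "S. cerev.": 63,
    "S. pombe": 73,
    "A. thal.": 58,
    "C. elegans": 55,
    "D. mela.": 73,
}


def total_genes(with_acr, percent_with_acr):
    """Scale the ACR-containing gene count up to a total gene count."""
    return round(with_acr * 100 / percent_with_acr)


estimates = {
    organism: total_genes(genes_with_acr, pct)
    for organism, pct in acr_percent.items()
}

for organism, estimate in estimates.items():
    print(f"using {organism} ({acr_percent[organism]}%): {estimate:,} total genes")
```

The spread across reference genomes is one concrete reason the single-ratio estimate is only a starting point.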
Conservation, expression level, protein length,
& exon number
EST #         0           0-20         0-200        >200         All
RefSeq #      396         2716         9454         2791         12,245
RS + ACR      240 (61%)   1718 (63%)   7049 (75%)   2447 (88%)   9496 (78%)
GS + ACR      158 (66%)   1424 (83%)   6256 (89%)   2245 (92%)   8501 (90%)
Prot. Len.    319         419          486          517          493
Avg. exon #   3.82        6.25         8.78         10.38        9.15
23,600 revised est. human genes with ACRs (~15,573/.66)
43,000 upper bound on est. total human genes (23,600/.55)
35,000 is more reasonable bound with this approach
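The revised figures above follow from two divisions. The sketch below reproduces them under one assumption: that the 0.66 is the 66% GenomeScan recovery shown in the EST # = 0 column, i.e. the rate for the least-expressed genes. Variable names are hypothetical.

```python
corrected_gs_models = 15_573  # GS models with ACRs after splitting correction
gs_recovery_no_est = 0.66     # assumed: the 66% from the EST # = 0 column
acr_fraction = 0.55           # fraction of genes with ACRs used on the slide

# Revised number of human genes with ACRs, rounded to the nearest hundred
# as on the slide.
revised_genes_with_acr = round(corrected_gs_models / gs_recovery_no_est, -2)

# Upper bound on total human genes, before rounding to the nearest thousand.
upper_bound = revised_genes_with_acr / acr_fraction

print(f"revised genes with ACRs: {revised_genes_with_acr:,.0f}")
print(f"upper bound on total genes: {round(upper_bound, -3):,.0f}")
```

Using the lowest recovery rate for every gene overcorrects for the well-expressed majority, which is consistent with the slide treating 43,000 as an upper bound.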
The relationship of protein
conservation and sequence length
• Lipman DJ, Souvorov A, Koonin EV, Panchenko
AR, Tatusova TA
• BMC Evol Biol. 2002 2:20
[Figure: histograms of protein count (axis "Number") against protein length (axis "Length", 0 to 1000), with conserved and nonconserved series and the "Structural domains" length range marked. Panels:
E. coli (4279 proteins; panel also labeled "Salmonella Set")
Archaeoglobus fulgidus (2420 proteins)
Yeast (6305 proteins)
Drosophila (2390 proteins)
Human (14538 proteins)]
[Figure: E. coli (4279 proteins), Number against Length (0 to 1000), conserved and nonconserved series. Panel A: E-value 1.e-3. Panel B: E-value 1.e-9.]
[Figure: Fraction and Contact density against Length (0 to 1000) for Archaeoglobus fulgidus and Escherichia coli]
Acknowledgements
Steve Bryant
Greg Schuler
Lewis Geer
Alex Souvorov
Alex Kondrashov
Tatiana Tatusova
Eugene Koonin
Lukas Wagner
Jim Ostell
Yuri Wolf
Sergei Resenchuk
Phil Murphy (NIAID)
& all my colleagues at NCBI and NIH