Why Is Sequence Comparison Useful?
David Lipman (NIH/NLM/NCBI)
Almost 100 Trillion BLAST comparisons per quarter (10/01)
[Figure: BLAST comparisons per quarter, 1998-2001; y-axis from 0.E+00 to 1.E+14]
Rapid similarity searches of nucleic acid and
protein data banks.
With the development of large data banks of protein and
nucleic acid sequences, the need for efficient methods of
searching such banks for sequences similar to a given
sequence has become evident. We present an algorithm
for the global comparison of sequences based on
matching k-tuples of sequence elements for a fixed k.
The method results in substantial reduction in the time
required to search a data bank when compared with prior
techniques of similarity analysis, with minimal loss in
sensitivity. The algorithm has also been adapted, in a
separate implementation, to produce rigorous sequence
alignments. Currently, using the DEC KL-10 system, we
can compare all sequences in the entire Protein Data
Bank of the National Biomedical Research Foundation
with a 350-residue query sequence in less than 3 min
and carry out a similar analysis with a 500-base query
sequence against all eukaryotic sequences in the Los
Alamos Nucleic Acid Data Base in less than 2 min.
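The k-tuple idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the published implementation: the function names (`ktuple_index`, `diagonal_counts`, `best_diagonal`) and the default k = 2 are assumptions made for the example. It indexes every k-tuple of the query, looks up each k-tuple of the database sequence, and counts matches per diagonal (subject position minus query position).

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_counts(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject start minus query start)."""
    index = ktuple_index(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        # .get avoids adding new keys to the defaultdict during lookup
        for i in index.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return counts


def best_diagonal(query, subject, k=2):
    """Return (diagonal, number of k-tuple matches) for the richest diagonal."""
    counts = diagonal_counts(query, subject, k)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


if __name__ == "__main__":
    # Hypothetical toy input: the subject is the query shifted right by two.
    print(best_diagonal("QGDPIPEELY", "AAQGDPIPEELY", k=2))
```

Only the lookup table is built per query, so each database sequence is scanned once; a diagonal with many hits is a candidate region for a more careful alignment.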
Cancer Gene Meets Its Match
NY Times July 3, 1983
“…a serendipitous computer search…”
v-sis:   6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
           QGDPIPEELY+MLS HSIRSFDDLQRLL GD G+EDGAELDLNMTRSHSGGELESLARG+
PDGF :  10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69
v-sis:  66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
           RSLGSL++AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF :  70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129
v-sis: 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
           CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF : 130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189
v-sis: 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
           AKT Q+RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF : 190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230
V-sis and Platelet-Derived Growth Factor (PDGF)
An earlier, more subtle discovery…
Viral src gene products are related to the catalytic chain
of mammalian cAMP-dependent protein kinase.
Barker WC, Dayhoff MO. PNAS 1982 May;79(9):2836-2839
Query: 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
           Y+  +V    +LHS  +++ DLKP N+LI +Q   +++DFG +++   ++GR  +   +
Sbjct: 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184
Query: 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
           GT  + APEI+  +      D ++ G+ +++M     P ++ +P  +   +V+  +R
Sbjct: 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240
Biology not Algorithms
- compare proteins, not DNA
- must detect similar amino acids not just identities
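The second point, counting similar amino acids rather than identities alone, can be illustrated with a small sketch. The similarity groups below are a toy assumption chosen for the example; they are not a published substitution matrix such as PAM or BLOSUM, and the function names are hypothetical.

```python
# Toy groups of chemically similar amino acids (an illustrative assumption,
# not a published substitution matrix).
SIMILAR_GROUPS = [
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("ST"),    # small hydroxyl
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
    set("NQ"),    # amide
]


def classify(a, b):
    """Label one aligned residue pair as identical, similar, or different."""
    if a == b:
        return "identical"
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return "similar"
    return "different"


def summarize(seq1, seq2):
    """Count identical, similar, and different positions in an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    counts = {"identical": 0, "similar": 0, "different": 0}
    for a, b in zip(seq1, seq2):
        counts[classify(a, b)] += 1
    return counts


if __name__ == "__main__":
    # First 11 residues of the second v-sis / PDGF alignment row shown earlier.
    print(summarize("RSLGSLSVAEP", "RSLGSLTIAEP"))
```

An identity-only score would treat the S/T and V/I positions as mismatches, whereas a similarity-aware score credits them, which matters most for distant relationships such as the src / protein kinase example.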
How often would one find matches?
How many protein families would there be?
In 1983, the databases held only a small
percentage of the genes from the
genomes of a number of
evolutionarily distant
organisms (e.g. human, fly,
yeast, E. coli).
Unexpected similarities should be
extremely rare.
Estimating number of protein families
Earliest Estimates of Number
of Protein Families - ~1000
• Zuckerkandl,E. (1974) Accomplissement et
perspectives de la paleogenetique chimique. In: Ecole
de Roscoff –1974, p. 69. Paris:CNRS.
“The appearance of new structures and functions in
proteins during evolution”, J. Mol. Evol. 7, 1-57 (1975).
• Dayhoff, M.O. (1974) Federation Proceedings 33,
2314.
“The origin and evolution of protein superfamilies”,
Fed.Proc. 35, 2132-2138 (1976).
Margaret Dayhoff
Atlas of Protein Sequence and Structure,
Vol. 5, Supplement 3 (1978) pg. 10:
“It has been estimated that in humans
there are approximately 50,000 proteins
of functional or medical importance. …
A landmark of molecular biology will
occur when one member of each
superfamily has been elucidated. At
the present rate of 25 per year, this will
take less than 15 years.”
Hubris, the Genome Project,
and Protein Families
Chothia, C. (1992). One thousand families for
the molecular biologist. Nature, 357, 543-544.
Green P, Lipman D, Hillier L, Waterston R,
States D, and Claverie JM (1993). Ancient
Conserved Regions in New Gene Sequences
and the Protein Databases. Science, 259, 1711-1716.
ACR = similarity detected between sequences
from distantly related organisms
1992: What new families do we get
from the genome projects?
Set          N      Coding sequences   Seq. with ACRs   ACRs
human ESTs   2644   600-1200           197 (16-33%)     103
worm ESTs    1472   1370               570 (42%)        240
worm genes   234    234                74 (32%)         59
yeast ORFs   182    182                43 (24%)         35

Sets compared             Matching sequences   ACRs   ACRs in database
worm ESTs, human ESTs     77, 66               34     31 (91%)
worm ESTs, yeast ORFs     23, 13               9      8 (89%)
worm genes, human ESTs    17, 17               12     12 (100%)
worm genes, yeast ORFs    6, 4                 4      3 (75%)
human ESTs, yeast ORFs    14, 13               10     10 (100%)
Cumulative growth in number of proteins &
number of conserved domains
[Figure: number of protein sequences (left axis, 0.0 to 1.2*10^6) and % of conserved domain families hit (right axis, 0 to 100) by year, 1960-2000. Annotations: Dayhoff, 10% of superfamilies; Green et al., 85% of ACRs.]
Why so few families and why do
they evolve slowly?
Structural View
Thermodynamics:
Finkelstein, AV, “Why are
the same protein folds
used to perform different
functions?” FEBS 325,
pp. 23-28 (1993)
Constraints Due To Biological
Function May Be More Important
Compare pairs of sequences from related classes of proteins
– All sequences should at least share structural similarity
– Divergence times for all sequences should be approximately the same
– Sequences within a class share function but sequences between classes have differing function
A degree of within-class similarity greater than between-class similarity indicates the importance of constraints due to biological function.
[Diagram: one gene, gene duplication, functional divergence, last universal common ancestor, then eukaryotes and prokaryotes]
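The within-class versus between-class comparison on this slide can be sketched as below. This is only an illustration under stated assumptions: the sequences are made-up, pre-aligned toy strings, percent identity stands in for whatever similarity measure was actually used, and the names `identity`, `within_between`, `class_A`, and `class_B` are hypothetical.

```python
from itertools import combinations, product


def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def within_between(classes):
    """Return (mean within-class identity, mean between-class identity).

    `classes` maps a class label to a list of aligned sequences; each class
    needs at least two sequences and there must be at least two classes.
    """
    within = [
        identity(a, b)
        for seqs in classes.values()
        for a, b in combinations(seqs, 2)
    ]
    between = [
        identity(a, b)
        for (_, seqs1), (_, seqs2) in combinations(classes.items(), 2)
        for a, b in product(seqs1, seqs2)
    ]
    return mean(within), mean(between)


if __name__ == "__main__":
    # Made-up toy data: two classes, two "species" each.
    toy = {
        "class_A": ["ACDEFG", "ACDEFH"],
        "class_B": ["ACKLMN", "ACKLMQ"],
    }
    w, b = within_between(toy)
    print(f"within-class: {w:.2f}, between-class: {b:.2f}")
```

If the within-class value is clearly higher than the between-class value even though all divergence times are roughly equal, that gap is what the slide attributes to functional constraint.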
Example from the Aminoacyl-tRNA
synthetases (aaRS) (from E. Koonin & Y. Wolf)
• Two unrelated classes of aaRS, each includes 10 aaRS related to each other
• The last universal common ancestor (LUCA) of modern life forms already had at least 17 aaRS
• The duplication leading to aaRS of different specificities must have occurred during a relatively short period of early evolution
• The post-LUCA evolution of aaRS took much longer than the early phase when the specificities were established. However, the changes that occurred after the aaRS were locked in their specificities are small compared to the changes traced to the early phase
Orthologs … (from S. Bryant)
Paralogs … (from S. Bryant)
Example from the Aminoacyl-tRNA
Synthetases (aaRS) (from E. Koonin & Y. Wolf)
[Figure: four panels (ArgRS, HisRS, ValRS, TrpRS); x-axis 0.00 to 0.40, y-axis 0.0 to 1.0]
Exceptions glutamine/glutamate, asparagine/aspartate & tryptophan/tyrosine
How many human genes?
80,000 Antequera F & Bird A, “Number of CpG islands and genes in human and mouse”, PNAS 90, 11995-11999 (1993).
120,000 Liang F et al., “Gene Index analysis of the human genome estimates approximately 120,000 genes”, Nat. Gen. 25, 239-240 (2000).
35,000 Ewing B & Green P, “Analysis of expressed sequence tags indicates 35,000 human genes”, Nat. Gen. 25, 232-234 (2000).
28,000-34,000 Roest Crollius, H. et al., “Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence”, Nat. Gen. 25, 235-238 (2000).
41,000-45,000 Das M et al., “Assessment of the Total Number of Human Transcription Units”, Genomics 77, 71-78 (2001).
How many human genes with ACRs?
(from S. Resenchuk, T. Tatusov, L. Wagner, A. Souvorov)
12,245 characterized mRNAs from RefSeq
78% have ACR, i.e., hit outside vertebrates at
E <10e-6 ( 9,496/12,245)
90% of these have corresponding GenomeScan
predictions which also have ACR (8501/9496)
20,245 GS models for entire human genome
have ACR
15,573 GS models after correction for
splitting (20,245/1.3)
17,300 estimated human genes with ACRs
( ~15,573/.9)
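The chain of corrections on this slide can be reproduced with a few lines of arithmetic. The sketch below only restates the slide's numbers; the function and variable names are made up for the example, and rounding the GenomeScan recovery rate to 0.90 follows the slide's "/.9".

```python
def estimate_genes_with_acr(gs_models_with_acr, split_factor, gs_recovery):
    """Correct GenomeScan (GS) model counts for gene splitting and missed genes.

    Returns (models after split correction, estimated genes with ACRs).
    """
    corrected_models = gs_models_with_acr / split_factor
    return corrected_models, corrected_models / gs_recovery


# Numbers taken from the slide.
refseq_acr_fraction = 9496 / 12245     # RefSeq mRNAs with an ACR
gs_recovery = round(8501 / 9496, 2)    # GS predictions recovering them, 0.9 as on the slide

corrected_models, genes_with_acr = estimate_genes_with_acr(
    gs_models_with_acr=20245,
    split_factor=1.3,
    gs_recovery=gs_recovery,
)

if __name__ == "__main__":
    print(f"RefSeq mRNAs with ACR: {refseq_acr_fraction:.0%}")
    print(f"GS models after split correction: {corrected_models:.0f}")
    print(f"Estimated human genes with ACRs: {genes_with_acr:.0f}")
```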
How many human genes?
17,303 estimated human genes with ACRs
Now use comparative genomics…
             S. cerev.   S. pombe    A. thal.      C. elegans    D. mela.
ACRs/genes   4022/6306   4846/6593   14443/24605   11598/20850   10469/14335
             63%         73%         58%           55%           73%
17,303/.55 = ~31,500 Total Human Genes
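The extrapolation to a total gene count can be checked as follows. This is a hedged sketch: the counts come from the table above, the variable names are invented for the example, and reading the slide's ".55" as the lowest ACR fraction among the five genomes (C. elegans) is an assumption about how that divisor was chosen.

```python
# (genes with ACRs, total genes) per genome, from the slide's table.
acr_counts = {
    "S. cerevisiae": (4022, 6306),
    "S. pombe": (4846, 6593),
    "A. thaliana": (14443, 24605),
    "C. elegans": (11598, 20850),
    "D. melanogaster": (10469, 14335),
}

# Truncating to whole percent reproduces the percentages shown on the slide.
acr_percent = {name: int(100 * acr / genes) for name, (acr, genes) in acr_counts.items()}

# Assumption: the slide divides by the lowest observed fraction.
lowest_species = min(acr_percent, key=acr_percent.get)
lowest_fraction = acr_percent[lowest_species] / 100

human_genes_with_acr = 17303
total_human_genes = human_genes_with_acr / lowest_fraction

if __name__ == "__main__":
    for name, pct in acr_percent.items():
        print(f"{name}: {pct}%")
    print(f"Lowest fraction: {lowest_species} ({lowest_fraction})")
    print(f"Estimated total human genes: {total_human_genes:.0f}")
```

Dividing by the smallest fraction gives the largest total, so under this reading the figure behaves like an upper-leaning estimate rather than a midpoint.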
More complicated than that!
Conservation, expression level, protein
length, & exon number
EST #         0           0-20         0-200        >200         All
RefSeq #      396         2716         9454         2791         12,245
RS + ACR      240 (61%)   1718 (63%)   7049 (75%)   2447 (88%)   9496 (78%)
GS + ACR      158 (66%)   1424 (83%)   6256 (89%)   2245 (92%)   8501 (90%)
Prot. Len.    319         419          486          517          493
Avg. exon #   3.82        6.25         8.78         10.38        9.15
23,600 revised est. human genes with ACRs (~15,573/.66)
43,000 upper bound on est. total human
genes (23,600/.55)
35,000 is a more reasonable bound with this approach
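The revised figures depend on which GenomeScan recovery rate is used. The sketch below recomputes them from the slide's numbers; the names are hypothetical, and the 0.66 rate is the "GS + ACR" value for the EST # = 0 column of the table above.

```python
corrected_gs_models = 15573    # GS models with ACRs after split correction
recovery_unexpressed = 0.66    # GS + ACR rate for RefSeq genes with 0 ESTs
acr_fraction = 0.55            # fraction of genes with ACRs used on the earlier slide

# Revised count of genes with ACRs, assuming the lowest recovery rate applies.
revised_genes_with_acr = corrected_gs_models / recovery_unexpressed

# The slide rounds to 23,600 before extrapolating to all genes.
rounded_genes_with_acr = round(revised_genes_with_acr, -2)
upper_bound_total = rounded_genes_with_acr / acr_fraction

if __name__ == "__main__":
    print(f"Revised genes with ACRs: {revised_genes_with_acr:.0f}")
    print(f"Upper bound on total genes: {upper_bound_total:.0f}")
```

Both divisors are the least favorable values in their tables, which is why the result reads as an upper bound rather than a best estimate.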
The relationship of protein
conservation and sequence length
• Lipman DJ, Souvorov A, Koonin EV,
Panchenko AR, Tatusova TA
• BMC Evol Biol. 2002 2:20
[Figure: protein length distributions (x-axis: Length, 0 to 1000; y-axis: Number) for E-coli (4279 proteins), Archaeoglobus fulgidus (2420 proteins), Yeast (6305 proteins), Drosophila (2390 proteins), and Human (14538 proteins)]
[Figure: E-coli (4279 proteins), Number vs. Length; panel A at E-value 1.e-3, panel B at E-value 1.e-9]
[Figure: Contact density and Fraction vs. Length (0 to 1000)]
Acknowledgements
Steve Bryant
Lewis Geer
Alex Kondrashov
Eugene Koonin
Jim Ostell
Sergei Resenchuk
Greg Schuler
Alex Souvorov
Tatiana Tatusov
Lukas Wagner
Yuri Wolf
Phil Murphy (NIAID)
& all my colleagues at NCBI and NIH