SCOP (the Structural Classification of Proteins) Database
This contains all proteins of known structure, and their domains,
classified in terms of their structural and evolutionary relationships,
i.e. their fold, superfamily and family.
The current version places proteins and domains in one of
1232 superfamilies.
SUPERFAMILY Database
The current version contains
(a) 7,924 hidden Markov models that represent the 1232 SCOP
superfamilies.
(b) the matches made to the sequences of 157 genomes.
Effectiveness of Sequence Comparison Procedures
Structural Assignments to Genome Sequences

Genome         Number of    Sequences       Amino acids     Number of
               Sequences    assigned (%)    assigned (%)    Superfamilies
Human          32,035       64              46              904
Drosophila     18,484       60              42              828
Yeast          6,356        53              40              653
Arabidopsis    28,786       62              45              877
A. fulgidus    2,420        64              57              464
E. coli        4,279        64              55              649
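
To give a feel for what the percentages mean in absolute terms, the short sketch below converts the "sequences assigned (%)" column into approximate sequence counts. The figures come from the table above (with the yeast row read as 6,356 sequences and 53%); the names `ASSIGNMENTS` and `approx_assigned` are illustrative only, and because the slide's percentages are rounded, the counts are estimates rather than exact values.

```python
# Figures from the "Structural Assignments to Genome Sequences" table:
# genome -> (number of sequences, percentage of sequences with an assignment).
# The names used here are illustrative, not part of SUPERFAMILY itself.
ASSIGNMENTS = {
    "Human": (32035, 64),
    "Drosophila": (18484, 60),
    "Yeast": (6356, 53),
    "Arabidopsis": (28786, 62),
    "A. fulgidus": (2420, 64),
    "E. coli": (4279, 64),
}


def approx_assigned(genome: str) -> int:
    """Approximate number of sequences with at least one structural assignment."""
    total, percent = ASSIGNMENTS[genome]
    return round(total * percent / 100)


for name, (total, _) in ASSIGNMENTS.items():
    print(f"{name}: ~{approx_assigned(name)} of {total} sequences assigned")
```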
Distribution of Known Superfamilies

Distribution                   Number of Superfamilies
Prokaryotes                    121
Eukaryotes                     263
Eukaryotes and Prokaryotes     749
Not in current genomes         99
Total                          1232
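
As a quick consistency check on the distribution above, the sketch below confirms that the four categories add up to the 1232 SCOP superfamilies and expresses each as a share of the total. The variable and function names are illustrative.

```python
# Counts from the "Distribution of Known Superfamilies" slide.
DISTRIBUTION = {
    "Prokaryotes": 121,
    "Eukaryotes": 263,
    "Eukaryotes and Prokaryotes": 749,
    "Not in current genomes": 99,
}
TOTAL = 1232  # total number of SCOP superfamilies quoted on the slide


def share(category: str) -> float:
    """Percentage of all superfamilies that fall in the given category."""
    return 100 * DISTRIBUTION[category] / TOTAL


# The four categories partition the total.
assert sum(DISTRIBUTION.values()) == TOTAL

for category in DISTRIBUTION:
    print(f"{category}: {share(category):.1f}%")
```

On these figures, about 61% of the known superfamilies occur in both eukaryotes and prokaryotes.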
The Proportions of Families Shared by Pairs of Genomes

Proportion of families (%) common to genomes in different kingdoms

Genomes             Number of        hs    dm    sc    at    af    ec
                    Superfamilies
Human (hs)          838              -     86    69    80    43    55
Drosophila (dm)     745              95    -     74    85    46    58
Yeast (sc)          607              95    91    -     95    55    66
Arabidopsis (at)    762              88    83    75    -     49    64
A. fulgidus (af)    417              86    83    79    89    -     85
E. coli (ec)        612              75    71    65    79    58    -
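
The matrix above is not symmetric, which makes sense if each cell is read as the percentage of the row genome's superfamilies that also occur in the column genome. That reading is an assumption inferred from the numbers, not something the slide states. Under it, both directions of a pair should imply roughly the same number of shared superfamilies, which the sketch below checks. All names are illustrative.

```python
from itertools import combinations

# Number of superfamilies per genome, from the slide.
SUPERFAMILIES = {"hs": 838, "dm": 745, "sc": 607, "at": 762, "af": 417, "ec": 612}

# SHARED_PERCENT[row][col]: assumed to be the percentage of the row genome's
# superfamilies that are also found in the column genome.
SHARED_PERCENT = {
    "hs": {"dm": 86, "sc": 69, "at": 80, "af": 43, "ec": 55},
    "dm": {"hs": 95, "sc": 74, "at": 85, "af": 46, "ec": 58},
    "sc": {"hs": 95, "dm": 91, "at": 95, "af": 55, "ec": 66},
    "at": {"hs": 88, "dm": 83, "sc": 75, "af": 49, "ec": 64},
    "af": {"hs": 86, "dm": 83, "sc": 79, "at": 89, "ec": 85},
    "ec": {"hs": 75, "dm": 71, "sc": 65, "at": 79, "af": 58},
}


def implied_shared(row: str, col: str) -> float:
    """Number of shared superfamilies implied by one cell of the matrix."""
    return SUPERFAMILIES[row] * SHARED_PERCENT[row][col] / 100


# Compare the two directions for every pair of genomes.
for a, b in combinations(SUPERFAMILIES, 2):
    print(f"{a}/{b}: {implied_shared(a, b):.0f} vs {implied_shared(b, a):.0f}")
```

For most pairs the two estimates agree to within a few superfamilies, as expected from rounded percentages. The human/Drosophila pair is the exception, differing by about 13.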
The Power Law Distribution of Family Sizes

Genomes        Total            Total number     Number of large         Number of
               superfamilies    of domains in    superfamilies that      superfamilies
                                superfamilies    form a third of all     with one
                                                 domains                 member
Human          904              43,471           6                       124
Drosophila     828              17,293           15                      165
Yeast          653              4,794            18                      188
Arabidopsis    877              25,732           14                      143
A. fulgidus    464              2,203            16                      242
E. coli        649              4,049            21                      195
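
The two summary columns in this table can be derived from a list of superfamily sizes: sort the sizes from largest to smallest, count how many are needed to reach a third of all domains, and count the superfamilies with a single member. The sketch below shows one way to do this. It is an illustration under stated assumptions, not the authors' procedure, and the `toy_sizes` list is made-up data chosen only to show a skewed distribution.

```python
def largest_covering_fraction(sizes, fraction=1 / 3):
    """Return how many of the largest families are needed to hold
    at least `fraction` of all members."""
    target = sum(sizes) * fraction
    covered = 0
    # Walk through the families from largest to smallest, accumulating members.
    for count, size in enumerate(sorted(sizes, reverse=True), start=1):
        covered += size
        if covered >= target:
            return count
    return 0  # only reached when `sizes` is empty


def singletons(sizes):
    """Number of families that have exactly one member."""
    return sum(1 for size in sizes if size == 1)


# Hypothetical family sizes (not from the slides): a few large families
# and many small ones, as in a power-law-like distribution.
toy_sizes = [40, 20, 10, 5, 5, 3, 2, 1, 1, 1, 1, 1]

print(largest_covering_fraction(toy_sizes))  # families needed for a third of members
print(singletons(toy_sizes))                 # families with a single member
```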
Divergence of Structure and Sequence in Homologous Proteins
Domain Combinations of Proteins in Genomes
Bacteria:
two-thirds of proteins have two or more domains.
Eukaryotes:
three-quarters of proteins have two or more domains.
Combinations of Domains of Known Structure* in 71 Genomes and PDB

                  Combinations in genomes          Combinations with known
                                                   atomic structure
Type of           Number of       Proportion of    Number of       Proportion of
combinations      combinations    genome           combinations    genome
                                  sequences (%)                    sequences (%)
Total             3474            100              463             67
Supradomains      1734            91               374             65
Most used         174             62               138             54

*In genomes and in PDB these domains come, respectively, from 778 and 455 superfamilies
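
One way to read the table above is to ask what share of each type of combination already has a known atomic structure. The sketch below computes this, assuming the three rows of figures correspond to "Total", "Supradomains" and "Most used" in the order the slide lists them. The names are illustrative.

```python
# (combinations seen in genomes, combinations with known atomic structure),
# assuming the rows appear in the order Total, Supradomains, Most used.
COMBINATIONS = {
    "Total": (3474, 463),
    "Supradomains": (1734, 374),
    "Most used": (174, 138),
}


def percent_with_structure(kind: str) -> float:
    """Percentage of combinations of this type that have a known structure."""
    in_genomes, with_structure = COMBINATIONS[kind]
    return 100 * with_structure / in_genomes


for kind in COMBINATIONS:
    print(f"{kind}: {percent_with_structure(kind):.1f}% have a known structure")
```

On this reading, only about 13% of all combinations have a known structure, but about 79% of the most used ones do.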
PSI Questions
1 and 2. How many structures are needed to cover prokaryotic and
eukaryotic families?
Prokaryotes: 870 families form ~55% of the residues in proteomes
Eukaryotes: 1000 families form ~45% of the residues in proteomes
3. Would a strategy based on Pfam be effective?
~7200 Pfam families cover ~53% of Swiss-Prot/TrEMBL residues
5. Should we target protein complexes?
Also target novel combinations of domains of known structure
7. Is the sampling of multiple family members useful?
Essential: members of divergent families can have very different structures
8. What fraction of eukaryotic proteins have prokaryote homologues?
~3/4; but domain combinations can be very different.
Goga Apic
Matthew Bashton
Julian Gough
Sarah Kummerfeld
Martin Madera
Sarah Teichmann
Christine Vogel