Three-state per-residue accuracy

Classification: understanding the diversity and principles of
protein structure and function
MCSG 2001 structures
Protein structure classification




Main reference: Robert B. Russell (2002)
Classification of Protein Folds. Molecular
Biotechnology 20:17-28.
Importance: central to studies of protein
structure, function, and evolution
Philosophy: phyletic vs. phenetic
Method: structure comparison + human
knowledge
Philosophy of classification
Phyletic: based on phylogenetic
relationship
 Phenetic: based on the study of
phenomena (phenomenological)

Classification Unit: Domain, a
LEGO piece
Ranganathan
From domain to assembly





Domains are shuffled, duplicated and fused to
make proteins
On average, a domain is 173 a.a. long,
compared to 466 a.a. for a yeast protein
Most of the natural domain sequences assume
one of a few thousand folds, of which ~1000 are
already known
no satisfactory estimate yet for the number of
macromolecular complexes
On average, a yeast complex may consist of 7.5
proteins
Sali et al. 2003
Distribution of Protein size
Swiss-prot
Structural vs. functional
domain
Russian doll: a conceptual
problem
Singh
Approaches
Hierarchical
 Based on the types and arrangements of
secondary structures
 Unit (level): domain
 Domain assignment
- structural vs. functional (fold or function
in isolation)
- automated assignment methods
(structure vs. sequence)

A. P. Singh
Assignment of Class
All α or all β (could be subjective)
 α/β (βαβ unit) or α+β
 Other classes
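
A minimal sketch of how a coarse class label might be derived from secondary-structure content. It assumes a per-residue string in which 'H' marks helix and 'E' marks strand; the function name and the 15% cutoff are hypothetical and are not taken from SCOP or CATH. Moving the cutoff changes the label, which is one way to see why class assignment can be subjective. The sketch cannot separate α/β from α+β, since that depends on how the strands and helices are arranged.

```python
from collections import Counter


def assign_class(ss: str, min_frac: float = 0.15) -> str:
    """Assign a coarse structural class from a secondary-structure string.

    ss uses 'H' for helix and 'E' for strand; any other character is coil.
    min_frac is a hypothetical cutoff chosen only for illustration.
    """
    if not ss:
        raise ValueError("empty secondary-structure string")
    counts = Counter(ss)
    helix_frac = counts["H"] / len(ss)
    strand_frac = counts["E"] / len(ss)
    has_helix = helix_frac >= min_frac
    has_strand = strand_frac >= min_frac
    if has_helix and has_strand:
        # alpha/beta vs. alpha+beta needs strand order and orientation,
        # which a composition-only check does not see.
        return "mixed alpha and beta"
    if has_helix:
        return "all-alpha"
    if has_strand:
        return "all-beta"
    return "other"
```

For example, `assign_class("HHHHHHHHCC")` gives "all-alpha", while a string with both helix and strand above the cutoff is only reported as mixed.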

Class assignment could be
subjective
All-alpha structures
All-beta structures
Superoxide dismutase
Alpha/beta structures
Closed barrel
Open twisted sheet
β-α-β motif
(barrel)
(sheet)
α/β vs. α+β
Assignment of Fold
Defined by the number, type, and
arrangement of SSEs
 Connectivity (e.g. circular
permutation, scrambled proteins)

Assignment of Superfamily
Homologous even in the absence of
significant sequence similarity
- certain level of structural similarity
- unusual structural features
- low but significant sequence similarity
from structural alignment
- key active site residues
- sequence similarity bridges
 Divergence vs. convergence

Divergent vs. convergent
evolution
Divergent evolution: descent from a
common ancestor; become different
due to mutation
 Convergent evolution: no common
ancestor; become similar due to
functional or physical constraint

Anti-freeze protein:
convergent evolution
crystal.biochem.queensu.ca
Homologous fold
Ranganathan
Analogous fold
Ranganathan
Analogous or homologous?
[Figure: scallop myosin regulatory domain (C chain) and aldehyde
oxidoreductase (A chain), with N- and C-termini labeled]
Assignment of Family

significant sequence similarity
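
As a rough illustration of what "sequence similarity" can mean in practice, the sketch below computes percent identity over two sequences that have already been aligned. The function name is hypothetical, '-' is assumed to mark a gap, and identity is measured over gap-free columns only, which is one of several conventions in use. Any cutoff applied to the result is the caller's choice.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both inputs must come from the same alignment and have equal length.
    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y)
        for x, y in zip(aligned_a, aligned_b)
        if x != "-" and y != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)
```

With the toy alignment `"ACDE-FG"` and `"ACDQWFG"`, five of the six gap-free columns match.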
Classification databases
SCOP
- careful assignment of evolutionary
relationships; homologous vs. analogous
 CATH
- A:architecture
 FSSP
- a list of structural neighbors

CATH
Class: SSE composition
& packing
Architecture: overall
shape of domain, ignore
SSE connectivity
Topology (Fold):
consider connectivity
Homologous superfamily:
a common ancestor
Singh
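
The four CATH levels can be pictured as a path from Class down to Homologous superfamily. The sketch below assumes each domain carries a dotted four-number code in C.A.T.H order and reports the deepest level at which two domains still agree. The function name is hypothetical and the codes in the usage note are illustrative inputs only.

```python
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def deepest_shared_level(code_a: str, code_b: str):
    """Return the deepest CATH level shared by two dotted codes, or None.

    Codes are compared left to right; comparison stops at the first
    level where the two numbers differ.
    """
    shared = None
    for level, a, b in zip(LEVELS, code_a.split("."), code_b.split(".")):
        if a != b:
            break
        shared = level
    return shared
```

Two codes such as `"3.40.50.300"` and `"3.40.50.720"` agree down to Topology but not at the Homologous superfamily level, so the function returns "Topology".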
Classification databases
CATH: Class, Architecture, Topology, and Homologous superfamily,
a hierarchical classification of protein domain structures
http://www.biochem.ucl.ac.uk/bsm/cath_new/
SCOP: Structural Classification Of Proteins:
augmented manual classification
http://scop.mrc-lmb.cam.ac.uk/scop/
FSSP: Fold classification based on Structure-Structure alignment of Proteins
http://www2.ebi.ac.uk/dali/fssp/
Genome-scale structure analysis
Curr. Opin. Str. Biol., 2003
genome-scale structure
annotation
Some statistics






80% of sequence families belong to 400 folds
(top 10 folds account for 40% of sequence
families)
>60% of genes encode multi-domain proteins
(80% for eukaryotes)
~50,000 protein families and ~150,000
singletons
structural superfamilies ~1800 (+/-50) and
~10,000 unifolds
50-60% of distant homologs (<25% seq. id.)
can be recognized by profile-based sequence
comparison methods (e.g. psi-blast, HMM, etc)
50-60% of the enzymes in yeast and E. coli are
common, and >80% of pathways are shared
superfolds, superfamilies, supersites
TIM barrel, Rossmann-like,
ferredoxin-like, β-propellers, 4-helix
bundle, Ig-like, β-jelly rolls,
oligonucleotide/oligosaccharide
binding (OB) fold, SH3-like.
 Structure -> function (only 50%
correct)

Structure implicates function?
Assessing the Progress of
Structural Genomics Projects
1 Nov. 2002, Science
Target Tracking by PDB
(Sep 2002)
PDB content growth (May
2005)
Some statistics





11 SG consortia contributed 316 non-redundant PDB
entries comprising 459 CATH and 393
SCOP domains.
14% of the targets have a homolog
(>30% sequence identity) solved by
another consortium
67% of SG domains in CATH are unique vs.
21% of non-SG domains.
19% and 11% contributed new
superfamilies and new folds, respectively.
Allow new and reliable homology models
for 9287 non-redundant gene sequences
in 208 completely sequenced genomes.
PSI Structure Statistics 2002-2003
                                  PSI    PDB
Unique structures (30% seq. ID)   70%    10%
New folds                         12%     3%

NIGMS Protein Structure Initiative
Average total cost per structure
Phase                     Year    Cost
PSI Pilot phase           01      $650 K
                          02      $400 K
                          03      $240 K
                          04      ?
                          05      $100 K (goal)
PSI-2 Production phase    06-10   $50 K (goal)
Comparison                        ~$250-300 K
Pilot phase centers: 7, later 9
PSI Pilot Phase -- Lessons Learned
1. Structural genomics pipelines can be
constructed and scaled up
2. High throughput operation works for many
proteins
3. Genomic approach works for structures
4. Bottlenecks remain for some proteins
5. A coordinated, 5-year target selection
policy must be developed
6. Homology modeling methods need
improvement
PSI-2 Production Phase (2005)


Interacting network for high throughput
protein structure determination with three
components
 Large-scale centers for protein structure
production of selected targets
 Specialized centers for technology
development leading to high throughput
structure determination of difficult proteins
 Specialized centers for protein structures
relevant to disease (other NIH Institutes
and Centers)
Included in NIH Structural Biology Roadmap
plans
Computational structural
genomics
Summary table
Fold occurrence matrix
Common
Folds
Unique
Folds
Main findings







Folds can be assigned to ~25% of ORFs and ~20% of amino
acids for the 20 genomes
>80% of SCOP folds identified in one of the 20 organisms
Worm and E. coli have the most distinct folds
Level of gene duplication (2.4 folds in MG, 32 in worm)
higher than observed based on sequence only
Top three most common folds: P-loop NTP hydrolase, the
ferredoxin fold, TIM-barrel
Unique folds tend to be those involved in cell defense (e.g.
toxins)
Common folds tend to be more “symmetrical”
Fold evolution
Insertion, deletion, substitution
α-helix & β-sheet substitution in
Rossmann-fold like proteins
A path from all-β to all-α proteins
Circular Permutation (CP)
[Figure: segments A, B, C, D with N- and C-termini, before and
after circular permutation]
..A..B..C..D..
..C..D..A..B..
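
The segment orders above can be checked with a small sketch. It treats each protein as an idealized ordered list of segment labels and asks whether one order is a rotation of the other. The function name is hypothetical, and real circular permutations are detected from sequence or structure alignments rather than from clean labels like these.

```python
def is_circular_permutation(order_a, order_b) -> bool:
    """Return True if order_b is a rotation of order_a.

    Each argument is an ordered collection of segment labels, for
    example "ABCD" or ["A", "B", "C", "D"]. Identical orders count
    as the trivial rotation.
    """
    a, b = list(order_a), list(order_b)
    if len(a) != len(b):
        return False
    if not a:
        return True
    # Every rotation of a appears as a window of length n in a + a.
    doubled = a + a
    n = len(a)
    return any(doubled[i:i + n] == b for i in range(n))
```

Here `is_circular_permutation("ABCD", "CDAB")` holds, while a scrambled order such as `"ACBD"` does not qualify.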
Circular permutation example
1nls (Concanavalin)
1led (Lectin)
Strand invasion/withdrawal
Hairpin flips/swaps
Sickle-cell hemoglobin confers
resistance to malaria
Hemoglobin &
sickle cell anemia
Lethal legos as killer clumps
The inherited form of Lou
Gehrig's disease--familial
amyotrophic lateral sclerosis
(FALS)--causes a decay of the
motor neurons in the spinal
cord and brain, a devastating
loss of bodily control, and
death within 2 to 5 years.
Elam et al. Nat. Str. Biol., 2003
