Transcript Gene3D

Protein Family Resources and Protocols for Structural
and Functional Annotation of Genome Sequences
Domain structures
Domain structure predictions
Classification of protein and domain families
Sequence to function
Structure to function
CATH: 40,000 domain entries
Gene3D
Fold Group (1100)
Homologous Superfamily (2100)
Sequence Family
~100,000 domains of known structure in CATH
~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated
Gene3D: Domain structure annotations in genome sequences
~5 million protein sequences from 560 completed genomes and UniProt
scan against library of HMM models and sequences for CATH, Pfam and NewFam superfamilies
~2 million domain sequences assigned to CATH superfamilies
Gene3D
(1) Cluster ~5 million sequences into protein superfamilies
>200,000 protein superfamilies
(2) Map domains onto the sequences using HMM technology
(CATH & Pfam domains)
~10,000 domain superfamilies
(2100 of known structure)
Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP
[Figure: genes with structural annotation (%, 0-100) by organism (Arabidopsis, C.elegans, Drosophila, Human, Mouse, Yeast); series: HMM prediction, threading prediction; labels: Gene3D, Genthreader]
Annotation levels for an average genome
[Figure: bar from 0 to 100% of sequences, divided at 50%]
Upper part: many belonging to small species specific families; many predicted to be transmembrane
Lower part: predicted to belong to structural superfamilies using HMM or threading techniques
Target selection strategy for PSI-2
Adam Godzik (JCSG), Andras Fiser (NYSGC), Burkhard Rost (NESG)
[Figure: percentage of domain sequences (0-100) against families ordered by size (0-6000); regions: known structure (CATH MEGA), unknown structure (BIG-Pfam)]
Correlation of sequence and structural variability of CATH families with the number of different functional groups
Superfamily Variation: Structure/Sequence
[Figure: Structural Diversity (0-120) against Sequence Families (0-150); points keyed by number of GO terms (0-25, 26-50, 51-100, 101-200, 201+); labelled superfamilies: 3.40.50.300, 2.60.40.10, 3.40.50.720, 1.10.10.10, 2.40.50.140, 3.40.50.150; further label: Population in genomes (x 1000)]
Structural diversity in the CATH Domain Superfamily
P-loop hydrolases
Cutinase
Cocaine esterase
Acetylcholinesterase
Sequence to function
Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC Levels)
[Figure: number of CATH enzyme superfamilies (0-160) and number of domain relatives (0.0E+00 to 4.5E+05) against the sequence identity threshold for 90% conservation, binned from 11-20% to 91-100% sequence identity; annotation: highly variable families]
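One way to read "sequence identity threshold for 90% conservation of enzyme function (to 3 EC levels)" is sketched below. This is an illustration only, not the procedure used for the slide: the function names, the 10% threshold steps and the rule "lowest threshold at which at least 90% of pairs above it agree" are all assumptions.

```python
def ec3(ec):
    """Truncate an EC number such as '3.1.1.7' to its first three levels."""
    return ".".join(ec.split(".")[:3])


def conservation_threshold(pairs, target=0.9, thresholds=range(10, 100, 10)):
    """Return the lowest identity threshold (in %) at which the required
    fraction of relative pairs share the same EC number to three levels.

    pairs: list of (percent_identity, ec_a, ec_b) for pairs of relatives
    within one family (hypothetical input format).
    Returns None if no threshold reaches the target.
    """
    for t in thresholds:
        selected = [(a, b) for ident, a, b in pairs if ident >= t]
        if not selected:
            continue  # no pairs at or above this identity level
        conserved = sum(1 for a, b in selected if ec3(a) == ec3(b))
        if conserved / len(selected) >= target:
            return t
    return None


# Made-up pairs for a single family.
pairs = [
    (95, "3.1.1.7", "3.1.1.8"),
    (72, "3.1.1.1", "3.1.1.3"),
    (65, "3.1.1.1", "3.1.1.74"),
    (45, "3.1.1.1", "3.1.3.2"),
    (33, "3.1.1.1", "2.7.1.1"),
]
print(conservation_threshold(pairs))
```

A family whose threshold comes out high under this kind of calculation would fall among the "highly variable families" in the sense used on the slide.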
N-Fold Increase in Functional Annotation for Sequences in Gene3D
[Figure: N-fold increase in coverage (0-8) for Gene3D (6.8%), H.sapiens (5%), A.thaliana (2.7%), C.elegans (1.1%), B.anthracis (3.7%); series: Domain - Family specific cut-off; Domain - 50/80 and 40/80 cut-offs if identical MDA; annotations: general thresholds, family specific thresholds]
Gene3D
Get an XML version of this page
Link to UniProt
Links to different levels in the Gene3D protein family
Link to InterPro
Links to GO
Links to KEGG
Links to CATH/Pfam
“S” - indicates you can search the term against Gene3D
Functional information from GO, COGS, KEGG, EC, FunCat,
MINT, IntAct, ComplexDB
Functional annotation of structures using EC, GO, KEGG, FunCat resources
[Figure: PSI PDBs and Non-PSI PDBs broken down by number of annotation terms (0 terms, 1 term, 2 terms, 3 terms, 4 terms)]
Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins
Tree Determinants - Valencia
Evolutionary Trace - Lichtarge
Funshift - Sonnhammer
SCI-PHY - Sjolander
Methods exploiting information on sequence conserved residue positions
multiple sequence alignment of relatives from functional group
Structural model
Putative functional site
1 = highly conserved, 0 = unconserved
Scorecons - Thornton
Protein Keys - Sander
Score conservation for each position in the alignment using an entropy measure
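A minimal sketch of an entropy-based conservation score on the 0 to 1 scale described above (1 = highly conserved, 0 = unconserved). It is not the Scorecons or Protein Keys implementation. The function names, the decision to ignore gaps and the normalisation by log(min(20, number of residues)) are assumptions, and residue similarity and sequence weighting are not considered.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20):
    """Return 1 - normalised Shannon entropy for one alignment column.

    Gap characters ('-' and '.') are ignored. A column with a single
    residue type scores 1.0; a column where every residue differs
    scores 0.0.
    """
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log(k / n) for k in counts.values())
    max_entropy = math.log(min(alphabet_size, n))
    if max_entropy == 0:
        return 1.0  # only one residue observed, nothing to vary
    return 1.0 - entropy / max_entropy


def alignment_conservation(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation("".join(col)) for col in zip(*alignment)]


# Toy alignment of four relatives from one functional group.
alignment = ["ACDE", "ACDF", "ACEG", "AC-H"]
scores = alignment_conservation(alignment)
print([round(s, 2) for s in scores])
```

Columns scoring near 1 are the candidates to map onto the structural model when looking for a putative functional site.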
GeMMA: Compares sequence profiles (HMMs) between subfamilies
sequence subfamily (80% seq. id)
Superfamily of known structure (CATH)
putative structure-function group
clusters sequence relatives predicted to have similar structures/functions even at low levels of sequence identity
GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark
[Figure: four bar charts comparing SCI-PHY and GeMMA on the Amidohydrolase, Crotonase, Enolase, Haloacid dehalogenase and Vicinal oxygen chelate superfamilies: Purity (0-100, high is best); Edit distance (0-25, low is best); VI distance (0-1.6, low is best); Deviation from no. (2-6)]
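Two of the benchmark measures named above, purity and VI (variation of information) distance, can be sketched as follows. The exact definitions used in the benchmark are not given in the slides, so this assumes purity is the percentage of predicted clusters whose members all carry one reference annotation, and VI is H(predicted | reference) + H(reference | predicted) in natural-log units. All names are hypothetical.

```python
import math
from collections import Counter


def purity(predicted, reference):
    """Percentage of predicted clusters containing a single reference label.

    predicted: dict sequence id -> predicted cluster id
    reference: dict sequence id -> gold-standard functional label
    """
    labels_in_cluster = {}
    for seq, cluster in predicted.items():
        labels_in_cluster.setdefault(cluster, set()).add(reference[seq])
    pure = sum(1 for labels in labels_in_cluster.values() if len(labels) == 1)
    return 100.0 * pure / len(labels_in_cluster)


def vi_distance(predicted, reference):
    """Variation of information between the two partitions (0 = identical)."""
    n = len(predicted)
    size_pred = Counter(predicted.values())
    size_ref = Counter(reference[s] for s in predicted)
    joint = Counter((predicted[s], reference[s]) for s in predicted)
    vi = 0.0
    for (p, r), n_pr in joint.items():
        vi -= (n_pr / n) * (math.log(n_pr / size_pred[p]) + math.log(n_pr / size_ref[r]))
    return vi


# Toy example: function B is split into two predicted clusters.
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
split = {"s1": 1, "s2": 1, "s3": 2, "s4": 3}
merged = {"s1": 1, "s2": 1, "s3": 1, "s4": 1}
print(purity(split, reference), round(vi_distance(split, reference), 3))
print(purity(merged, reference), round(vi_distance(merged, reference), 3))
```

Over-splitting keeps purity high but is penalised by the VI distance, while merging everything is penalised by both, which is one reason to report the measures side by side.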
Functional annotation coverage using different strategies
Annotation (EC number) coverage of MEGA family 3.90.1200.10
[Figure: coverage of family / superfamily (%, 0-80) by source of annotation: database (experimental) annotations; annotations inherited within S60 clusters (inherit functions at 60% seq. id.); annotations inherited within GeMMA functional subfamilies (inherit functions by GeMMA)]
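The idea of raising annotation coverage by inheriting functions within clusters can be sketched as below. This is a simplified illustration, not the Gene3D protocol: the input format, the function name and the rule that a cluster only passes on a function when all of its annotated members agree are assumptions.

```python
def coverage_with_inheritance(clusters, annotations):
    """Return (direct_coverage, inherited_coverage) as percentages.

    clusters: dict cluster id -> list of sequence ids (e.g. S60 clusters
              or functional subfamilies)
    annotations: dict sequence id -> EC number, for annotated sequences only
    """
    total = sum(len(members) for members in clusters.values())
    direct = 0
    covered = 0
    for members in clusters.values():
        annotated = [s for s in members if s in annotations]
        direct += len(annotated)
        ec_numbers = {annotations[s] for s in annotated}
        if len(ec_numbers) == 1:
            covered += len(members)    # every member inherits the one function
        else:
            covered += len(annotated)  # none or conflicting: no inheritance
    return 100.0 * direct / total, 100.0 * covered / total


# Toy family of eight sequences in three clusters.
clusters = {"c1": ["a", "b", "c"], "c2": ["d", "e"], "c3": ["f", "g", "h"]}
annotations = {"a": "2.7.1.1", "d": "3.1.1.1", "e": "3.1.1.2"}
print(coverage_with_inheritance(clusters, annotations))
```

Larger clusters that still contain a single function give the biggest gain, which is the motivation for comparing fixed-identity clusters with functional subfamilies.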
Protein interactions and gene networks
Gene3D Biominer Methods
• Phylotuner: Correlation of domain occurrence profiles
• GOSS: Semantic similarity calculation between protein pairs
• CODA: Domain fusion analysis
• HiPPI: Homology inheritance of protein-protein physical interaction data
• GECO: Correlation of gene expression data
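As an illustration of the first method in the list, correlating domain occurrence profiles, the sketch below computes a Pearson correlation between the copy numbers of two domain superfamilies across a set of genomes. It is not the Phylotuner implementation; the choice of Pearson correlation, the profile format and the made-up superfamily names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length occurrence profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / math.sqrt(var_x * var_y)


# Copies of each (made-up) superfamily in five genomes, in the same order.
profiles = {
    "sf_A": [1, 0, 2, 0, 3],
    "sf_B": [2, 0, 4, 0, 6],
    "sf_C": [0, 3, 0, 2, 0],
}
print(round(pearson(profiles["sf_A"], profiles["sf_B"]), 3))
print(round(pearson(profiles["sf_A"], profiles["sf_C"]), 3))
```

Superfamilies that are gained and lost together across genomes score close to 1 and become candidates for a functional association.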
Structure to function
Methods for Assessing Structural Novelty
CATHEDRAL - structure comparison
Redfern et al. PLoS Comput Biol 2007
[Figure: proportion of correct fold (0.86-1) against rank (1-20) by structure similarity score, for CATHEDRAL, CE, LSQMAN, DALI and STRUCTAL]
Structural clusters in the Aminoacyl tRNA synthetases-like family
Aminoacyl tRNA synthetases
Gln-hydrolyzing synthases
DNA-binding, stress-related
Nucleotidyl-transferases
Argininosuccinate lyases
Galectin binding superfamily 2.60.120.200
1bkzA00
1dypA00
Identifying functional groups in domain superfamilies
Aminoacyl tRNA synthetases-like
Deoxyribodipyrimidine photo-lyases: 1dnpA00
Nucleotidylyltransferases: 1ej2A00
AA tRNA synthetase, Class I: 1n3lA01
Electron transfer flavoprotein: 1o97D01
Exploiting 3D Templates to Represent Functional Relatives
JESS - Thornton
GASP - Babbitt
SPASM - Kleywegt
PINTS - Russell
DRESPAT - Sarawagi
pvSOAR - Joachimiak
SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment
Laskowski and Thornton
green and purple - identical residues; orange and white - similar residues
FLORA: 3D templates for functional groups
From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.
localFLORA: single site
globalFLORA: multiple sites
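The template idea described above can be illustrated with a much simplified sketch that is not the FLORA algorithm. It assumes each structure is given as one coordinate per aligned position, uses inter-residue distances as a stand-in for vectors, and keeps residue pairs whose distance varies little within the functional subgroup but differs from the rest of the superfamily. The function names and both cut-offs are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pair_distances(coords):
    """Distances between all pairs of aligned positions in one structure."""
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i, j in combinations(range(len(coords)), 2)
    }


def template_pairs(subgroup, others, max_spread=0.5, min_shift=2.0):
    """Pick position pairs that are conserved and distinctive for a subgroup.

    subgroup, others: lists of structures, each a list of (x, y, z) tuples
    with one entry per aligned position (same length everywhere).
    """
    sub = [pair_distances(c) for c in subgroup]
    background = [pair_distances(c) for c in others]
    selected = []
    for pair in sub[0]:
        d_sub = [d[pair] for d in sub]
        d_bg = [d[pair] for d in background]
        conserved = pstdev(d_sub) <= max_spread
        distinctive = abs(mean(d_sub) - mean(d_bg)) >= min_shift
        if conserved and distinctive:
            selected.append(pair)
    return selected


# Toy data: three aligned positions, two structures per group.
subgroup = [[(0, 0, 0), (3, 0, 0), (0, 4, 0)], [(0, 0, 0), (3.1, 0, 0), (0, 4, 0)]]
others = [[(0, 0, 0), (3, 0, 0), (0, 9, 0)], [(0, 0, 0), (3, 0, 0), (0, 8, 0)]]
print(template_pairs(subgroup, others))
```

Restricting the candidate positions to one site or spreading them over the whole domain would correspond loosely to the local and global variants named above.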
FLORA: Performance in recognising functionally related homologues
[Figure: Coverage (%) (0.6-1) against Rank (0-10) for Local FLORA and Global FLORA]
Benchmark of 36 diverse enzyme groups (from 12 families)
Performance of FLORA
Benchmarked on 36 large enzyme families
FLORA: 3D Templates for Structure-Function Groups in Domain Families
1q77A00: Unknown function (MCSG)
1o97D01: Electron transfer flavoprotein
1dnpA01: Deoxyribodipyrimidine photo-lyases
1ej2A00: Nucleotidylyltransferases
1n3lA01: AA tRNA synthetases
http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/
Sequence scans
Sequence search vs PDB
Fold and structural motifs
n-residue templates
SSM fold search
Enzyme active sites
Sequence search vs Uniprot
Surface clefts
Ligand binding sites
Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc)
Residue conservation
DNA binding sites
Superfamily HMM library
DNA-binding HTH motifs
Reverse templates
Gene neighbours
Nest analysis
Function Prediction for Proteins of ‘Putative’ or Unknown Function
Class            Sequence Evidence   Structure Evidence   Sequence + Structure   Neither Successful
Putative (57)    53                  44                   41                     1
Unknown (132)    95*                 69*                  57*                    25
* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’
structural data provides relatively more information for proteins about which there is less knowledge
these predictions need to be experimentally validated
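The counts above can be broken down with a little arithmetic. The sketch assumes that the "Sequence Evidence" and "Structure Evidence" counts each include the proteins counted under "Sequence + Structure". That reading is an assumption, although it is consistent with each row summing to its class total (57 and 132). The function name is hypothetical.

```python
def breakdown(total, seq, struct, both, neither):
    """Split a row of counts into exclusive categories.

    seq and struct are assumed to include the 'both' proteins.
    """
    seq_only = seq - both
    struct_only = struct - both
    # Inclusion-exclusion check: the exclusive categories must add up.
    assert seq_only + struct_only + both + neither == total
    return {
        "sequence_only": seq_only,
        "structure_only": struct_only,
        "both": both,
        "neither": neither,
        "structure_only_pct": round(100 * struct_only / total, 1),
    }


putative = breakdown(57, 53, 44, 41, 1)
unknown = breakdown(132, 95, 69, 57, 25)
print(putative)
print(unknown)
```

Under this reading, structure is the only source of a prediction for 3 of the 57 putative proteins and for 12 of the 132 proteins of unknown function, which is in line with the remark that structural data contributes relatively more where less is known.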