PSI - European Bioinformatics Institute

InterPro
Sandra Orchard
EBI is an Outstation of the European Molecular Biology Laboratory.
Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and UniProtKB/Swiss-Prot against date, 5-Jan-04 to 5-Jan-10; y-axis from 0 to 14,000,000]
• Given a set of uncharacterised sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
1. Pairwise alignment approaches (e.g. BLAST)
• Good at recognising similarity between closely related sequences
• Perform less well at detecting divergent homologues
2. The protein signature approach
• Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins
• We can then use these models to infer relationships with previously characterised sequences
• This is the approach taken by protein signature databases
What are protein signatures?
[Diagram: Protein family/domain → Multiple sequence alignment → Build model → Search UniProt → Significant match → Mature model → Protein analysis]
Example sequences:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
Diagnostic approaches (sequence-based)
• Single motif methods: regex patterns (PROSITE)
• Multiple motif methods: identity matrices (PRINTS)
• Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
Patterns
[Diagram: Sequence alignment → Define pattern (motif) → Extract pattern sequences → Build regular expression → Pattern signature (PS00000)]
Example regular expression:
C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
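The pattern notation used here can be translated into an ordinary regular expression. The following is a minimal sketch in Python, not the PROSITE implementation: the function name prosite_to_regex and the toy sequence are assumptions made for illustration, and the N- and C-terminal anchors (< and >) of the full syntax are deliberately not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: x (any residue), [ABC] (allowed residues), {ABC} (forbidden
    residues) and repeat counts such as (2) or (1,3).
    Not handled in this sketch: the '<' and '>' terminal anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        if not element:
            continue
        # Split an optional repeat count, e.g. "x(2)" -> ("x", "2").
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # [ABC] sets and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
regex = prosite_to_regex(PATTERN)
print(regex)  # CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C

# Toy sequence invented for this example, not a real protein.
print(bool(re.search(regex, "MKCCAGGCSAAALAAACQ")))
```

A forbidden position such as {P} becomes a negated character class, which is how a pattern can exclude a closely related subfamily that carries that residue.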
Patterns
Advantages
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines:
C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• High false positive/false negative rate
Patterns mostly target functional residues: active sites, PTM sites, disulfide bridges, binding sites
Fingerprints
[Diagram: Sequence alignment → Define motifs (Motif 1, Motif 2, Motif 3) → Extract motif sequences → Weight matrices → Fingerprint signature (PR00000); motifs must occur in the correct order (1, 2, 3) with the correct spacing]
The significance of motif context
• Identify small conserved regions in proteins
• Several motifs together characterise a family
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (order and interval)
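The order and interval constraints can be illustrated with a short check. This is a simplified sketch and not how PRINTS scores fingerprints: it assumes the individual motif hits have already been located, and the function name fingerprint_matches, the hit coordinates and the allowed gap ranges are all hypothetical.

```python
from typing import Sequence, Tuple


def fingerprint_matches(
    hits: Sequence[Tuple[int, int]],
    intervals: Sequence[Tuple[int, int]],
) -> bool:
    """Check that motif hits occur in the expected order and spacing.

    hits[i]      -- (start, end) of motif i+1 in the sequence, end exclusive
    intervals[i] -- (min_gap, max_gap) allowed between motif i+1 and motif i+2
    """
    if len(intervals) != len(hits) - 1:
        raise ValueError("need one interval per pair of neighbouring motifs")
    for (_, prev_end), (next_start, _), (lo, hi) in zip(hits, hits[1:], intervals):
        gap = next_start - prev_end
        if gap < 0:
            return False   # motifs overlap or are out of order
        if not lo <= gap <= hi:
            return False   # motifs are in order but wrongly spaced
    return True


# Hypothetical hits for Motif 1, Motif 2 and Motif 3.
allowed = [(5, 20), (10, 30)]
print(fingerprint_matches([(10, 16), (30, 36), (60, 66)], allowed))  # True
print(fingerprint_matches([(30, 36), (10, 16), (60, 66)], allowed))  # False
```

The second call fails because Motif 2 precedes Motif 1, even though each motif was found individually; this is the extra context that a single-motif method cannot use.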
Profiles & HMMs
[Diagram: Sequence alignment → Define coverage (whole protein or entire domain) → Use entire alignment for domain or protein → Build model → Profile or HMM signature]
• Models insertions and deletions
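As a rough illustration of building a model from an entire alignment, the sketch below derives a simple position-specific scoring profile from the two example sequences shown earlier. It rests on several assumptions: a uniform background frequency, a pseudocount of 1, and ungapped scoring only. It does not model insertions and deletions, which a real profile or HMM does, and the function names are invented for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum per-position scores for an ungapped sequence of profile length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


alignment = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
]
profile = build_profile(alignment)
print(score(profile, alignment[0]) > score(profile, "W" * 22))  # True
```

Conserved columns, such as the shared P, V, C and G residues, contribute positive scores, while residues never seen in a column are penalised. A two-sequence alignment is far too small for a real model and is used here only to keep the example short.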
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• Pfam: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• Gene3D: models correspond to CATH domains
Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
InterPro Entry
• Groups similar signatures together
• Adds extensive annotation
• Links to other databases
• Structural information and viewers
• Hierarchical classification
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other Families
Parent/Child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
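The comparison of protein hits can be expressed with set operations. This is a minimal sketch of the two rules only, not InterPro's curation procedure: the function names and the protein identifiers are hypothetical, and the other two criteria (existing member database hierarchies and curator knowledge) are not captured by code like this.

```python
def is_valid_child(parent_hits: set, child_hits: set) -> bool:
    """A child entry's protein matches should be a subset of its parent's."""
    return child_hits <= parent_hits


def siblings_disjoint(sibling_hit_sets) -> bool:
    """Sibling entries should not have any protein matches in common."""
    seen = set()
    for hits in sibling_hit_sets:
        if seen & hits:
            return False
        seen |= hits
    return True


# Hypothetical protein identifiers, invented for illustration.
parent = {"protA", "protB", "protC", "protD"}
child_1 = {"protA", "protB"}
child_2 = {"protC"}

print(is_valid_child(parent, child_1))        # True
print(siblings_disjoint([child_1, child_2]))  # True
print(is_valid_child(parent, {"protE"}))      # False
```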
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy
InterPro Entry: adds extensive annotation
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics
InterPro Entry: links to other databases
UniProt
KEGG ... Reactome ... IntAct ...
UniProt taxonomy
PANDIT ... MEROPS ... Pfam clans ...
PubMed
InterPro Entry: structural information and viewers
• PDB: 3-D structures
• SCOP: structural domains
• CATH: structural domain classification
Searching InterPro
• Protein family membership
• Domain organisation
• Domains, repeats & sites
• GO terms
InterProScan access
Interactive:
http://www.ebi.ac.uk/Tools/pfa/iprscan/
Webservice (SOAP and REST):
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap
Download:
ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/