Presentation - European Bioinformatics Institute

Protein function and classification
www.ebi.ac.uk/interpro
Hsin-Yu Chang
www.ebi.ac.uk
Protein classification could help scientists
to gain information about protein functions.
Greider and Blackburn discovered telomerase in 1984 and were awarded the
Nobel Prize in 2009. Which model organism did they use for this study?
1. Tetrahymena
2. Saccharomyces cerevisiae
3. Mouse
4. Human
1984 Discovery of telomerase (Greider and Blackburn)
1989 Telomere hypothesis of cell senescence (Szostak)
1995 Clone hTR
1995/1997 Clone hTERT
1997 Telomerase knockout mouse
1998 Ectopic expression of telomerase in normal human epithelial cells causes the extension of their lifespan
1999/2000… Telomerase/telomere dysfunctions and cancer
A single Tetrahymena cell has 40,000 telomeres, whereas a human cell only has 92.
Gilson and Ségal-Bendirdjian, Biochimie, 2010.
Therefore, classifying proteins into families
and identifying protein homologues can help
scientists gather more information about
their favourite proteins.
However, in the lab, what do we usually do
to analyse protein sequences and find out
their functions?
How can we annotate ProteinA?
>ProteinA
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVEL
TCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLND
RADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEV
QLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRS
PRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEF
KIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGS
GELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQ
MGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEV
NLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLP
TWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRR
QAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
What I used to do:
• Protein BLAST
• Publications - textbooks or papers
• UniProt
• PDB
• Specialized protein databases such as SGD, the human
protein atlas, etc.
BLAST
(Basic Local Alignment Search Tool)
: compares protein sequences to sequence
databases and calculates the statistical significance
of matches.
BLAST
Advantages:
• Relatively fast
• User friendly
• Very good at recognising similarity between closely related sequences
Drawbacks:
• Sometimes struggles with multi-domain proteins
• Less useful for weakly similar sequences (e.g., divergent homologues)
Using BLAST to find clues of protein functions
-when it goes well
Pairwise alignment of two proteins:
CD4 from two closely-related species
Using BLAST to find clues of protein functions
-when it does not give you much information
Because BLAST performs local pairwise alignment, it:
• Cannot encode the information found in a multiple
sequence alignment that shows you conserved sites.
Using pairwise alignment could miss out on conserved residues
60S acidic ribosomal protein P0: multiple sequence alignment
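To make the point about conserved sites concrete, here is a minimal Python sketch. It uses a made-up toy alignment (not the ribosomal protein P0 alignment from the slide), and the function names are hypothetical. It reports the alignment columns in which every sequence carries the same residue, which is information that only becomes visible once several sequences are aligned together.

```python
from collections import Counter

# Hypothetical toy alignment: four aligned sequences of equal length ('-' = gap).
toy_msa = [
    "MKV-LGDWA",
    "MRVALGEWS",
    "MKI-LGDWT",
    "MQVSLGNWA",
]

def column_conservation(msa):
    """Return, per column, the most common residue and its frequency
    (fraction of all sequences, gaps counted as non-matching)."""
    result = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(msa)))
    return result

def conserved_positions(msa, threshold=1.0):
    """Return 1-based alignment columns whose top residue reaches the threshold."""
    return [
        i + 1
        for i, (_, freq) in enumerate(column_conservation(msa))
        if freq >= threshold
    ]

print(conserved_positions(toy_msa))
```

Lowering `threshold` (for example to 0.75) also reports columns that are mostly, but not perfectly, conserved.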
An alternative approach: protein signature search
• Construction of a multiple sequence alignment (MSA)
from characterised protein sequences.
• Modelling the pattern of conserved amino acids at
specific positions within an MSA.
• Use these models to infer relationships with the
characterised sequences
• This is the approach taken by protein signature
databases
Three different protein signature approaches
Starting from a sequence alignment:
• Patterns: single motif methods
• Fingerprints: multiple motif methods
• Profiles & hidden Markov models (HMMs): full alignment methods
Protein databases that use signature approaches
[Figure: member databases grouped by method (hidden Markov models, fingerprints, profiles including HAMAP, patterns) and by focus (structural domains; functional annotation of families/domains; protein features (sites))]
Patterns
Patterns are usually directed against functional sequence features such
as: active sites, binding sites, etc.
Sequence alignment
Motif (pattern sequences):
ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS
Regular expression: [AC]-x-V-x(4)-{ED}
Pattern signature: PS00000
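The pattern on this slide can be tried out directly. The sketch below is an illustration only: `prosite_to_regex` is a hypothetical helper that handles just the elements used here (residue sets in `[...]`, exclusions in `{...}`, `x` for any residue, and repeat counts in parentheses), not the full PROSITE pattern syntax. It translates the pattern into a Python regular expression and checks it against the four motif sequences.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    # Normalise: drop spaces and accept en dashes as element separators.
    elements = pattern.replace(" ", "").replace("–", "-").split("-")
    parts = []
    for element in elements:
        # Separate an optional repeat count such as x(4) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            token = core                      # single residue or [..] set
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)

PATTERN = "[AC] – x -V- x(4) - {ED}"
regex = prosite_to_regex(PATTERN)

motifs = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
for motif in motifs:
    print(motif, bool(re.fullmatch(regex, motif)))
```

A sequence that ends in E or D, or that lacks V at the third position, is rejected outright, which illustrates why patterns are strict but inflexible.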
Patterns
Advantages:
• Strict: a pattern has very little variability and can produce highly accurate matches
Drawbacks:
• Simple but less flexible
Fingerprints
Fingerprints:
a multiple motif approach
Sequence alignment
Define motifs: Motif 1, Motif 2, Motif 3
Motif sequences
Weight matrices
Fingerprint signature: PR00000
The significance of motif context
• Identify small conserved regions in proteins
• Several motifs characterise a family
[Figure: motifs 1, 2 and 3, annotated with their order and the interval between them]
Fingerprints
• Good at modelling the often small differences between closely
related proteins
• Distinguish individual subfamilies within protein families,
allowing functional characterisation of sequences at a high level
of specificity
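The idea that motif order and interval both matter can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the three motifs, their gap limits and the test sequences are invented, and regular expressions stand in for the weight matrices that real fingerprints use.

```python
import re

# Hypothetical fingerprint: (name, motif regex, maximum gap after the previous motif).
# Real fingerprints score each motif with a weight matrix; regexes keep the sketch short.
FINGERPRINT = [
    ("motif1", r"G.GK[ST]", None),
    ("motif2", r"D..G", 20),
    ("motif3", r"NK.D", 20),
]

def match_fingerprint(sequence, fingerprint):
    """Return [(name, start, end)] if all motifs occur in order and within
    their gap limits, otherwise None.

    Simplification: only the first occurrence of each motif after the
    previous hit is considered.
    """
    hits = []
    position = 0
    for name, regex, max_gap in fingerprint:
        m = re.compile(regex).search(sequence, position)
        if m is None:
            return None                      # motif missing or out of order
        if max_gap is not None and m.start() - position > max_gap:
            return None                      # interval too long
        hits.append((name, m.start(), m.end()))
        position = m.end()
    return hits

in_order = "MAAGSGKSTLLLADTAGQEEYAAALNKVDLL"
shuffled = "NKVDAAGSGKSTLLLADTAG"
print(match_fingerprint(in_order, FINGERPRINT))
print(match_fingerprint(shuffled, FINGERPRINT))
```

The second sequence contains all three motifs, but the third one precedes the first, so the fingerprint as a whole does not match.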
Profiles & HMMs
Sequence alignment (whole protein or entire domain)
Define coverage: use entire alignment of domain or protein family
Build model (profile or HMM)
Profile or HMM signature
Profiles
Start with a multiple sequence alignment
Amino acids at each position in the alignment are scored according to the frequency with which they occur
Scores are weighted according to evolutionary distance using a BLOSUM matrix
• Good at identifying homologues
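As a rough illustration of frequency-based scoring, the sketch below builds a small position-specific scoring matrix from the four motif sequences shown on the Patterns slide. It is a simplified stand-in, not the method used by any profile database: it assumes a uniform background and a flat pseudocount, and it omits the BLOSUM-based weighting mentioned above. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

motif_alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]

def build_profile(msa, pseudocount=1.0):
    """Return one dict per column mapping each amino acid to a log-odds score.

    Assumptions: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*msa):
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the per-position scores of an ungapped segment."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))

def best_window(profile, sequence):
    """Slide the profile along a sequence; return (best score, 0-based start)."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

profile = build_profile(motif_alignment)
print(best_window(profile, "MMMALVKLISGMMM"))
```

Unlike a pattern, the profile gives every residue at every position a score, so a sequence with an unseen residue is penalised rather than rejected.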
HMMs
Start with a multiple sequence alignment
Amino acid frequencies at each position in the alignment and their transition probabilities are encoded
Insertions and deletions are also modelled
• Can model very divergent regions of alignment
• Very good at identifying evolutionarily distant homologues
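The sketch below shows, under simplifying assumptions, how emission frequencies and transition probabilities might be counted from a gapped alignment. The toy alignment, the 50% gap cut-off for choosing match columns, and all function names are assumptions for illustration; pseudocounts and sequence weighting, which real HMM tools apply, are left out.

```python
from collections import Counter, defaultdict

# Hypothetical toy alignment ('-' = gap); the third column is mostly gaps.
toy_msa = [
    "AC-GT",
    "ACAGT",
    "A--GT",
    "AC-GT",
]

def match_columns(msa, max_gap_fraction=0.5):
    """Flag columns with fewer than max_gap_fraction gaps as match columns."""
    return [
        sum(r == "-" for r in col) / len(col) < max_gap_fraction
        for col in zip(*msa)
    ]

def state_path(sequence, is_match):
    """Translate one aligned sequence into states: B, then M/D/I + index, then E."""
    path, k = ["B"], 0
    for residue, match_col in zip(sequence, is_match):
        if match_col:
            k += 1
            path.append(("M" if residue != "-" else "D") + str(k))
        elif residue != "-":
            path.append("I" + str(k))   # residue in a non-match column = insertion
    return path + ["E"]

def transition_probabilities(msa):
    """Estimate transition probabilities by counting state-to-state moves."""
    is_match = match_columns(msa)
    counts = defaultdict(Counter)
    for sequence in msa:
        path = state_path(sequence, is_match)
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

def match_emissions(msa):
    """Residue frequencies for each match column (no pseudocounts)."""
    emissions = []
    for col, keep in zip(zip(*msa), match_columns(msa)):
        if keep:
            residues = [r for r in col if r != "-"]
            counts = Counter(residues)
            emissions.append({aa: n / len(residues) for aa, n in counts.items()})
    return emissions

transitions = transition_probabilities(toy_msa)
emissions = match_emissions(toy_msa)
print(transitions["M1"])
print(transitions["M2"])
```

In this toy example the deletion in the third sequence and the insertion in the second each get their own state and probability, which is what lets an HMM tolerate length variation that a fixed-width profile cannot.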
www.ebi.ac.uk/interpro
The aim of InterPro
Protein sequences
• Family entry: description, proteins matched and more information.
• Domain entry: description, proteins matched and more information.
• Site entry: description, proteins matched and more information.
What is InterPro?
• InterPro is an integrated sequence analysis resource
• It combines predictive models (known as signatures)
from different databases
• It provides functional analysis of protein sequences by
classifying them into families and predicting domains and
important sites
Facts about InterPro
• First release in 1999
• 11 partner databases
• Adds annotation to UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >85 million Gene Ontology (GO) mappings to >24 million
distinct UniProtKB sequences
• 50,000 unique visitors to the web site per month
• > 2 million sequences searched online per month, plus offline searches
with the downloadable version of the software
InterPro signature integration process
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which
sequences they match
• Curators manually inspect the matches before integrating the
signatures into InterPro
• Signatures representing the same entity are integrated together
• Relationships between entries are traced, where possible
• Curators add literature referenced abstracts, cross-refs to other
databases, and GO terms
http://www.ebi.ac.uk/interpro/
How can we annotate ProteinA by using InterPro?
>ProteinA
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVEL
TCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLND
RADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEV
QLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRS
PRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEF
KIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGS
GELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQ
MGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEV
NLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLP
TWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRR
QAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Search using protein sequences
InterPro entry types
Family
Proteins that share a common evolutionary origin, as reflected in their
related functions, sequences or structure. Ex. Telomerase family.
Domain
Distinct functional, structural or sequence units that may exist in a
variety of biological contexts. Ex. DNA binding domain.
Repeats
Short sequences typically repeated within a protein. Ex. Tubulin
binding repeats in microtubule associated protein Tau.
Sites
PTM, active site, binding site, conserved site.
Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.
[Figure: InterPro entry pages, labelled with type, name, identifier, contributing signatures, relationships, description, references and GO terms]
InterPro family and domain relationships
Family relationships in InterPro:
Interleukin-15/Interleukin-21 family (IPR003443)
  Interleukin-15 (IPR020439)
    Interleukin-15 Avian (IPR020451)
    Interleukin-15 Fish (IPR020410)
    Interleukin-15 Mammal (IPR020466)
  Interleukin-21 (IPR028151)
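A parent/child hierarchy like this one is easy to represent as a simple lookup table. The sketch below is illustrative only: the identifiers come from the slide, but the parent assignments are an assumption read off the entry names (the slide's tree layout did not survive extraction), and `lineage` is a hypothetical helper, not part of any InterPro tool.

```python
# Entry names as given on the slide.
NAME = {
    "IPR003443": "Interleukin-15/Interleukin-21 family",
    "IPR020439": "Interleukin-15",
    "IPR028151": "Interleukin-21",
    "IPR020451": "Interleukin-15 Avian",
    "IPR020410": "Interleukin-15 Fish",
    "IPR020466": "Interleukin-15 Mammal",
}

# Assumed parent of each entry, inferred from the entry names.
PARENT = {
    "IPR020439": "IPR003443",
    "IPR028151": "IPR003443",
    "IPR020451": "IPR020439",
    "IPR020410": "IPR020439",
    "IPR020466": "IPR020439",
}

def lineage(entry_id):
    """Return the chain of entries from the given entry up to the root family."""
    chain = [entry_id]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def children(entry_id):
    """Return the direct child entries of the given entry, sorted by identifier."""
    return sorted(child for child, parent in PARENT.items() if parent == entry_id)

for entry in lineage("IPR020466"):
    print(entry, NAME[entry])
```

A match to a specific child entry therefore implies membership of every entry above it in the chain.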
InterPro relationships: domains
Protein kinase-like domain
  Protein kinase catalytic domain
    Serine/threonine kinase catalytic domain
    Tyrosine kinase catalytic domain
A brief diversion into the Gene Ontology...
Inconsistency in naming of biological concepts
English is not a very precise language
• Same name for different concepts
• Different names for the same concept
An example: Taction, Tactition and Tactile sense all refer to
Sensory perception of touch; GO:0050975
Gene Ontology
• Unify the representation of gene and gene product attributes across
species
• Allow cross-species and/or cross-database comparisons
The Gene Ontology
• A way to capture biological knowledge in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
[Figure: hierarchy running from less specific concepts at the top to more specific concepts at the bottom]
www.ebi.ac.uk/QuickGO
The Concepts in GO
1. Molecular Function
• protein kinase activity
• insulin receptor activity
2. Biological Process
• Cell cycle
• Microtubule cytoskeleton organisation
3. Cellular Component
Example terms: GO:0006955 Immune response (biological process); GO:0016020 membrane (cellular component)
Search using keywords
Summary
• Protein classification could help scientists to gain information about protein functions.
• BLAST is fast and easy to use but has its drawbacks.
• Alternative approach: protein signature databases build models (protein signatures) by using different methods (patterns, fingerprints, profiles and HMMs).
• InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.
Why use InterPro?
• Large amounts of manually curated data
  • 35,634 signatures integrated into 25,214 entries
  • Cites 38,877 PubMed publications
• Large coverage of protein sequence space
• Regularly updated
  • ~ 8 week release schedule
  • New signatures added
  • Scanned against latest version of UniProtKB
Caution
• InterPro is a predictive protein signature database - results are
predictions, and should be treated as such
• InterPro entries are based on signatures supplied to us by our
member databases
....this means no signature, no entry!
And one more thing…..
We need your feedback!
missing/additional references
reporting problems
requests
EBI support page.
The InterPro Team:
Alex Mitchell
Craig McAnulla
Siew-Yit Yong
Amaia Sangrador
Hsin-Yu Chang
Sarah Hunter
Sebastien Pesseat
Gift Nuka
Matthew Fraser
Maxim Scheremetjew
Louise Daugherty
Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.sanger.ac.uk/
Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI : Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI
YouTube: EMBLMedia
The BLOSUM (BLOcks SUbstitution Matrix) matrix is a
substitution matrix used for sequence alignment of
proteins. BLOSUM matrices are used to score alignments
between evolutionarily divergent protein sequences.