Transcript Document

Protein signatures,
classification
and
functional analysis
Menu
• Introduction: some definitions
• How to model domains ?
– Pattern
– Profile
– HMM
• Domain/family databases (InterPro…)
Protein domain/family: some definitions
• Most proteins have « modular » conserved
structures
• Estimation: ~ 3 domains / protein
-> Predicting the domain content of an unknown protein
sequence may help to find a ‘function’
…Estimation: ~ 80% of proteins have at least one ‘known’ domain
Number of domains per protein
~100 protein sequences
with 50 domains
http://prodom.prabi.fr/prodom/current/archives/2006.1/stat.html
CSA_PPIASE
Cys 181: active site residue
Binding cleft (motif)
Example of conserved regions (PPID family)
- 1 CSA_PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) (domain)
- 3 TPR repeats (tetratricopeptide repeat).
- 1 active site
- Binding cleft (motif)
InterPro scan results
?
General definitions of conserved sequence signatures
Conserved regions in biological sequences can be classified into 5 different groups:
• Domains: specific combination of secondary structures organized into a
characteristic three-dimensional structure or fold.
• Families: groups of proteins that have the same domain arrangement or that are
conserved along the whole sequence.
• Repeats: structural units always found in two or more copies that assemble in a
specific fold. Assemblies of repeats might also be thought of as domains.
• Motifs: regions of domains containing conserved active or binding residues, or
short conserved regions present outside domains that may adopt a folded
conformation only in association with their binding ligands.
• Sites: functional residues (active sites, disulfide bridges, post-translationally
modified residues).
What makes Bee special?
Measures of Conservation
• Identity: Proportion of pairs of identical residues between two aligned
sequences. Generally expressed as a percentage. This value depends on
how the two sequences are aligned.
• Similarity: Proportion of pairs of similar residues between two aligned
sequences. Whether two residues are similar is determined by a substitution
matrix (e.g. BLOSUM62). This value depends strongly on the scoring
system used.
• !!! But not Homology: Two sequences are homologous if and only if they
have a common ancestor. This is not a measure of conservation and there
is no percentage of homology! (It's either yes or no). Homologous
sequences do not necessarily serve the same function, nor are they always
highly similar: structure may be conserved while sequence is not.
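As a rough illustration of the two measures above, the sketch below computes percent identity and percent similarity for two sequences that are already aligned. The similarity groups are a hand-picked, hypothetical stand-in for a real substitution matrix such as BLOSUM62, and counting only gap-free aligned pairs in the denominator is one convention among several.

```python
# Minimal sketch: identity and similarity of two pre-aligned sequences.
# SIMILAR_GROUPS is an illustrative assumption, NOT derived from BLOSUM62.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"),
                  set("NQ"), set("ST"), set("AG")]


def is_similar(a, b):
    """Identical residues, or residues in the same hand-picked group, count as similar."""
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def conservation(seq1, seq2, gap="-"):
    """Return (percent identity, percent similarity) over gap-free aligned pairs."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        return 0.0, 0.0
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


# Hypothetical aligned pair: 7 gap-free positions, 4 of them identical.
identity, similarity = conservation("MKV-LFYD", "MRVALWYE")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

Neither number says anything about homology: both only describe this particular alignment under this particular scoring choice.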
How to measure ‘conservation’ ?
Pairwise vs multiple sequence alignments
Blast vs modelled MSA
Detect conservation using pairwise alignments
A popular way to identify similarities between proteins is to perform a
pairwise alignment (Blast, Fasta).
When the identity is higher than 40% this method gives good results.
However, the weakness of the pairwise alignment is that no distinction is made
between an amino acid at a crucial position (like an active site) and an amino acid
with no critical role (not enough information).
Domain Family databases
Murcia 2011
13
Pairwise alignment
Detect conservation using MSA
• A multiple sequence alignment (MSA) gives a more general view of a
conserved region by providing a better picture of the most
conserved residues, which are usually essential for the protein
function.
• MSA contains higher information content
than pairwise alignments
How to use MSA to look for
conservation ?
-> 1- Model MSA using various methods
-> 2- ‘Align’ the model with your sequence (InterPro scan…)
Methods to Build Models of MSA
• Consensus:
– Consensus, Patterns
• Profile:
– Position Specific Scoring Matrices (PSSMs),
– Generalized Profiles,
– Hidden Markov Models (HMMs),
– PSI-BLAST.
…a specific pattern or PSSM/profile is called a descriptor,
descriptor motif, discriminator or predictor
Why do we need models of MSA?
Why do we need classifiers ?
• to summarize in a single “descriptor” the differences and similarities observed in each
column of the MSA;
• to use the model/descriptor to search for similar sequences;
• to classify similar sequences;
• to correctly align important residues and detect variations in active sites and other
important regions of a protein (e.g. SNPs);
• to build databases of models/descriptors which can be used to annotate new
proteomes…
• MSA models are more sensitive than Blast (pairwise alignment)
• …
Consensus - pattern
Consensus Sequences
• Useful to detect protein belonging to a specific family or a
protein domain; much less useful at the DNA level due to the
small alphabet (4 letters) and the low sequence conservation
of DNA sequence elements (except for the detection of
enzyme restriction sites).
• Patterns do not attempt to describe a complete domain or
protein family, but simply try to identify the most important
residue combinations, such as the catalytic site of an
enzyme.
• They focus on the most highly conserved residues in a
protein family (motifs, sites).
Use of pattern
• Patterns are used to describe small functional regions:
– Enzyme catalytic sites;
– Prosthetic group attachment sites (heme, PLP, biotin, etc.);
– Amino acids involved in binding a metal ion;
– Cysteines involved in disulfide bonds;
– Regions involved in binding a molecule (ATP, calcium, DNA etc.)
or a protein.
– N-glycosylation sites
How to Build a PROSITE Pattern
• Start with a multiple sequence alignment (MSA)
Consensus Sequences:
PROSITE Patterns syntax
The PROSITE patterns are described using the following conventions:
ex: <M-R-[DE]-x(2,4)-[ALT]-{AM}
1. The standard IUPAC one-letter codes for the amino acids are used.
2. The symbol `x' is used for a position where any amino acid is accepted.
3. Ambiguities are indicated by listing the acceptable amino acids for a given position, between
square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
4. Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids
that are not accepted at a given position. For example:{AM} stands for any amino acid except
Ala and Met.
5. Each element in a pattern is separated from its neighbor by a ‘-’.
6. Repetition of an element of the pattern can be indicated by following that element with a
numerical value or, if it is a gap ('x'), by a numerical range between parentheses.
Examples:
x(3) corresponds to x-x-x
x(2,4) corresponds to x-x or x-x-x or x-x-x-x
A(3) corresponds to A-A-A
Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element.
7. When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either
starts with a `<' symbol or respectively ends with a `>' symbol.
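The conventions above map almost one-to-one onto regular expressions. The following is a minimal sketch of that translation, assuming only the syntax elements listed on this slide; the function name `prosite_to_regex` is hypothetical and is not part of ScanProsite or any PROSITE tool, and rarer constructs are deliberately rejected.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or a range (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")
    regex = ""
    if pattern.startswith("<"):          # rule 7: anchored at the N-terminus
        regex += "^"
        pattern = pattern[1:]
    anchored_end = pattern.endswith(">")  # rule 7: anchored at the C-terminus
    if anchored_end:
        pattern = pattern[:-1]
    for element in pattern.split("-"):    # rule 5: elements separated by '-'
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if high is not None and core != "x":
            raise ValueError("a range is only valid with 'x' (rule 6)")
        if core == "x":                   # rule 2: any amino acid
            regex += "."
        elif core.startswith("{"):        # rule 4: excluded amino acids
            regex += "[^" + core[1:-1] + "]"
        else:                             # rules 1 and 3: residue or [set]
            regex += core
        if low is not None:               # rule 6: repetition
            regex += "{" + low + ("," + high if high else "") + "}"
    if anchored_end:
        regex += "$"
    return regex


example = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
print(example)  # ^MR[DE].{2,4}[ALT][^AM]
```

A translated pattern can then be used with `re.search` on a protein sequence; the sequence either matches or it does not, with no score.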
You can also automatically build a pattern (from
an MSA) by using the Pratt or Splash software:
http://www.expasy.org/tools/pratt/
http://www.research.ibm.com/splash/
Automatically discovered patterns are usually different
from those designed by a human expert with
knowledge of the biochemical literature
• http://www.expasy.org/tools/scanprosite/
Advantage and Limitation of
PROSITE Patterns
• Advantages:
– Efficient for the identification of sites or short motifs.
– Intelligible to any user: you don’t need to be an expert in
bioinformatics to read or build a consensus sequence.
• Limitation:
– The regular expression syntax is too rigid to represent
highly divergent domains
(one mismatch is enough to eliminate a match).
PSSM
Position specific
scoring matrix
Position Specific Scoring Matrix (PSSM)
• A PSSM or a profile is based on the frequencies of each residue at a
specific position in a MSA.
• The MSA is converted into a matrix where a score is given to each
amino acid at each position of the MSA according to the observed
frequency (positive scores for expected amino acids and negative
scores for unexpected ones).
Construction of a PSSM
1: weight the sequences of the MSA (e.g. with algorithms based on a
phylogenetic tree)
2: count the number of occurrences of the different amino
acids (or bases) at each position of the alignment
3: derive the preliminary matrix (calculate the
frequencies)
4: correct for sample bias using a substitution matrix (PAM,
Blosum etc.): in proteins some mismatches are more
acceptable than others.
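The four steps can be sketched in a few lines. This is a simplified illustration, not the procedure used by any particular tool: step 1 defaults to equal weights, step 4 uses a flat pseudocount instead of a substitution matrix, and scores are log-odds against a uniform background. The alignment `msa` is made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, weights=None, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    if weights is None:
        weights = [1.0] * len(msa)        # step 1, simplified: equal weights
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same aligned length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = Counter()                # step 2: weighted residue counts
        for seq, weight in zip(msa, weights):
            if seq[col] in AMINO_ACIDS:   # gaps and unknown symbols are skipped
                counts[seq[col]] += weight
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            # steps 3 and 4: frequency with a flat pseudocount, so that
            # unobserved residues get a finite negative score
            freq = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm


# Hypothetical 7-sequence alignment, 4 columns wide.
msa = ["FKLL", "FKAF", "YKLL", "FRLY", "FKIY", "LKLF", "FKLW"]
pssm = build_pssm(msa)
print(round(pssm[0]["F"], 2), round(pssm[0]["Y"], 2), round(pssm[0]["A"], 2))
```

With a flat pseudocount every unobserved residue gets the same score in a column. Replacing it with a substitution-matrix-based correction, as step 4 describes, is what lets a similar unobserved residue score higher than a dissimilar one.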
Profiles
[Figure: alignment of seven sequences (Sequence 1 to Sequence 7) shown next to the
profile (or weight matrix) derived from it: residue frequency at each position in the alignment]
• Column where F is most frequent: phenylalanine has the highest score.
• Column where L and Y have equal frequency: they receive different scores.
Leucine is aliphatic (dissimilar from F); tyrosine and phenylalanine are
both aromatic (similar).
Profiles
Profiles score frequency
• Highest frequency aa -> highest score
• Lower frequency aa -> lower score
• Similar aa not in alignment -> even lower score
• Dissimilar aa not in alignment -> very low score
** In a pattern this position would be [FLY] -> equal frequency
Search a Database With a PSSM
• The sequence (MCFVNRFYSFCMP) is ‘aligned’ to the PSSM:
Pos    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
1     12  -41  -20    5  -25  -42  -18  -18   33  -12  -12  -19  -41   42    9    2    9   16  -61  -11
2    -23  -54   -5  -24  -37  -19  -45   -3    7  -35  -38   59  -41  -12  -42   10   65  -17  -68  -15
3    -13  -62  -14    4  -53   78  -36  -65  -15  -64  -49  -14  -48    9    5  -10  -11  -63  -61  -42
4    -36  -68  -63  -36   60  -63  -38  -14  -47    3  -21  -52  -53  -34  -58  -39  -45  -26  138   36
5    -22  -60  -54  -24    6  -43    0   30   13    0  -22  -27  -59   55   -9  -38  -11   37  -57   12
6    -35  -46  -18   14   -9  -51  -12  -19   34  -39  -28   36  -45   44   -9   -3   41  -27  -24   17
7    -33  -58   37   -6  -16  -39  -21   61  -23   -1  -28   -6  -58  -17  -54  -20   -9   14  -12   11
Searching algorithm: sliding windows. At each position of the sliding window the
score is obtained by summing the score of all columns
Best score: 16+59+5+60+12-3-16=133
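The sliding-window search can be reproduced with the matrix from this slide. The sketch below assumes the 20 score columns are in the order A, C, D, ..., Y and that the 7 rows are the PSSM positions; the function name `scan` is illustrative.

```python
COLUMNS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the slide's matrix

PSSM = [
    [12, -41, -20, 5, -25, -42, -18, -18, 33, -12, -12, -19, -41, 42, 9, 2, 9, 16, -61, -11],
    [-23, -54, -5, -24, -37, -19, -45, -3, 7, -35, -38, 59, -41, -12, -42, 10, 65, -17, -68, -15],
    [-13, -62, -14, 4, -53, 78, -36, -65, -15, -64, -49, -14, -48, 9, 5, -10, -11, -63, -61, -42],
    [-36, -68, -63, -36, 60, -63, -38, -14, -47, 3, -21, -52, -53, -34, -58, -39, -45, -26, 138, 36],
    [-22, -60, -54, -24, 6, -43, 0, 30, 13, 0, -22, -27, -59, 55, -9, -38, -11, 37, -57, 12],
    [-35, -46, -18, 14, -9, -51, -12, -19, 34, -39, -28, 36, -45, 44, -9, -3, 41, -27, -24, 17],
    [-33, -58, 37, -6, -16, -39, -21, 61, -23, -1, -28, -6, -58, -17, -54, -20, -9, 14, -12, 11],
]


def scan(sequence, pssm, columns=COLUMNS):
    """Score every ungapped window of the sequence against the PSSM."""
    index = {aa: i for i, aa in enumerate(columns)}
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum, position by position, the score of the residue found there.
        score = sum(pssm[pos][index[aa]] for pos, aa in enumerate(window))
        hits.append((score, start, window))
    return hits


best = max(scan("MCFVNRFYSFCMP", PSSM))
print(best)  # (133, 3, 'VNRFYSF')
```

The best window starts at the fourth residue (0-based offset 3) and its score is the sum given on the slide: 16+59+5+60+12-3-16 = 133.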
Advantages and limitations of PSSMs
• Advantages:
– The score makes it possible to estimate the quality of
the match.
– The method is relatively fast and simple to implement.
• Limitations:
– Indels are forbidden: long regions cannot be modelled.
PSSM: Fingerprints
• To overcome the gap limitation of PSSMs, two or more
PSSMs can be used to describe long regions. The
combination of several PSSMs is called a ‘fingerprint’.
• The PRINTS database is a collection of annotated fingerprints
(useful to define sub-families)
Generalized profiles
• A generalized profile is an extension
of the PSSM, in which we introduce
position specific deletion and
insertion penalties.
Generalized Profiles
The following information is stored in any generalized
profile:
• Each position is called a match state. A score for every
residue is defined at every match state (M), just as in the
PSSM.
• Each match state can be omitted in the alignment, by what
is called a deletion state (D), which receives a position-dependent penalty.
• Insertions of variable length are possible between any two
adjacent match (or deletion) states. These insertion states
(I) are given a position-dependent penalty that might also
depend upon the inserted residues.
• A couple of additional parameters adapt the
behaviour of the profile at its extremities: they can force it
to match the whole domain or allow partial matches.
Example of a Generalized Profile
ID   ZF_RING_2; MATRIX.
AC   PS50089;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2004 (INFO UPDATE).
DE   Zinc finger RING-type profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=43;
MA   /DISJOINT: DEFINITION=PROTECT; N1=5; N2=39;
.
.
.
MA   /DEFAULT: D=-20; I=-20; B1=0; E1=-10; MI=-105; MD=-105; IM=-105; DM=-105; M0=-5;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='C'; M=-10,-20,119,10,0,-20,-30,10,-30,-30,-20,-20,-20,-40,-30,-30,-10,-10,-10,-50,-30,-30;
MA   /M: SY='P'; M=-1,-9,-21,-10,-4,-17,-14,-10,-11,-8,-14,-8,-6,4,-5,-10,0,-1,-10,-27,-14,-6;
MA   /M: SY='I'; M=-7,-27,-24,-32,-25,-1,-32,-25,32,-22,16,15,-21,-23,-19,-21,-17,-7,25,-21,-3,-24; D=-3;
MA   /I: I=-3; DM=-16;
MA   /M: SY='C'; M=-10,-20,119,-30,-30,-20,-30,-30,-30,-30,-20,-20,-19,-40,-30,-30,-10,-10,-10,-50,-30,30;
MA   /M: SY='L'; M=-10,-12,-17,-14,-9,-1,-19,-7,-7,-9,2,2,-11,-21,-8,-7,-12,-8,-7,-17,1,-9;
MA   /M: SY='E'; M=-8,9,-22,12,17,-24,-13,-3,-23,2,-20,-15,5,-11,6,-2,3,-2,-19,-29,-15,11;
MA   /M: SY='E'; M=-7,-4,-23,-4,1,-16,-17,-8,-12,-2,-12,-8,-2,-5,-3,-3,-3,-2,-11,-25,-10,-2;
MA   /M: SY='F'; M=-10,-19,-24,-21,-13,7,-24,-11,4,-15,6,7,-16,-13,-12,-13,-15,-9,-2,-12,6,-13;
.
.
.
//
Align the generalized profile with a sequence…
(Dynamic programming, ~Smith-Waterman algorithm)
Algorithm and Software to build and use
Generalized Profiles
• Pftools is a package to perform the different steps of the
construction of a profile and to search a database of protein
(or DNA) sequences with a profile.
– http://www.isrec.isb-sib.ch/ftp-server/pftools
• Searching algorithm: dynamic programming (similar to the Smith-Waterman algorithm).
-> guaranteed to find the optimal local alignment with respect
to the scoring system being used (which includes the
substitution matrix and the gap-scoring scheme)
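To make the dynamic programming idea concrete, here is a minimal Smith-Waterman-style sketch that aligns a sequence to a small profile and returns the best local score. It is not the pftools algorithm: it uses a single linear gap penalty for both insertions and deletions, whereas a generalized profile stores position-specific penalties. The toy profile, the default mismatch score and all names are assumptions.

```python
def local_profile_score(sequence, match_scores, gap_penalty=3, mismatch=-4):
    """Best local alignment score of a sequence against a profile.

    match_scores[i] maps a residue to its score at profile position i;
    residues not listed get the default `mismatch` score.
    """
    n, m = len(match_scores), len(sequence)
    # H[i][j]: best score of a local alignment ending at profile position i
    # and sequence residue j (row 0 and column 0 stay at zero).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            residue = sequence[j - 1]
            diag = H[i - 1][j - 1] + match_scores[i - 1].get(residue, mismatch)
            delete = H[i - 1][j] - gap_penalty   # profile position skipped
            insert = H[i][j - 1] - gap_penalty   # extra residue in the sequence
            H[i][j] = max(0, diag, delete, insert)
            best = max(best, H[i][j])
    return best


# Hypothetical 3-position profile preferring C, then A, then H.
toy_profile = [{"C": 10}, {"A": 5}, {"H": 10}]
exact = local_profile_score("GGCAHGG", toy_profile)      # C-A-H, no gap
inserted = local_profile_score("GGCAQHGG", toy_profile)  # one inserted residue
deleted = local_profile_score("GGCHGG", toy_profile)     # position A skipped
print(exact, inserted, deleted)  # 25 22 17
```

Making `gap_penalty` a per-position list (one value for deletions, one for insertions) would be the natural next step towards the position-specific penalties described for generalized profiles.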
Statistical Significance of Sequence
Similarities
• Each method (except patterns) gives a similarity score between the
query sequence and the subject sequence or the model.
• One needs to estimate whether this raw score can occur by chance. This is done
with the E-value, or expected value.
• The E-value is the number of matches with a score equal to or greater
than the observed score that are expected to occur by chance.
An E-value of 1 is considered not to be significant.
An E-value of 0.1 is possibly significant.
An E-value of 0.01 is most likely significant.
• Pitfall: The E-value depends on the size of the searched database, as the
number of false positives expected above a given score threshold usually
increases proportionally with the size of the database.
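The pitfall can be made explicit with a small calculation. The sketch below assumes that scores of unrelated sequences follow an extreme value (Gumbel) distribution with location `mu` and scale `lam`, and that the E-value is the per-sequence probability multiplied by the database size. The parameter values are illustrative only.

```python
import math


def p_value(score, mu, lam):
    """P(S >= score) for one unrelated sequence under a Gumbel distribution."""
    # 1 - exp(-x), computed with expm1 so that tiny probabilities keep precision.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def e_value(score, mu, lam, database_size):
    """Expected number of chance matches scoring at least `score`."""
    return database_size * p_value(score, mu, lam)


# Illustrative parameters (assumptions, not fitted to any real model).
MU, LAM = -28.9, 0.238
small_db = e_value(25.0, MU, LAM, database_size=100_000)
large_db = e_value(25.0, MU, LAM, database_size=200_000)
print(f"{small_db:.3f} -> {large_db:.3f}")
```

The same raw score becomes half as convincing when the database doubles, which is why E-values from searches against databases of different sizes should not be compared directly.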
Advantage and Limitation of Generalized
Profiles
• Strengths:
– Very sensitive to detect similarities (close to the twilight
zone).
– Good scoring system.
• Weaknesses:
– Require some expertise to use efficiently.
– Very CPU expensive.
HMM
Generalized Profiles can be represented
in a probabilistic framework named
Hidden Markov Models (HMMs).
HMM profiles
• Each position in an HMM consists of a Match, Insert and Deletion state
• Parameters describing an HMM profile:
– Emission probability: the probability of emitting an amino acid ‘x’ while in
state q (amino acid emission probabilities are evaluated from observed
frequencies, as for a PSSM).
– Transition probability:
3 states: Match (M), Deletion (D), Insertion (I).
Transitions: M->I, M->D, I->M, I->D …
Transition probabilities are evaluated from observed transition
frequencies.
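How both kinds of parameters come out of observed frequencies can be shown on a toy alignment. This sketch is deliberately reduced: every alignment column is treated as a match column, so only Match and Deletion states appear (no Insert states), and no pseudocounts or sequence weights are applied. The alignment and function name are hypothetical.

```python
from collections import Counter


def estimate_parameters(msa, gap="-"):
    """Estimate match emissions and M/D transition probabilities per column."""
    length = len(msa[0])
    emissions = []
    for col in range(length):
        # Emission probabilities: residue frequencies among non-gap symbols.
        counts = Counter(seq[col] for seq in msa if seq[col] != gap)
        total = sum(counts.values())
        emissions.append({aa: n / total for aa, n in counts.items()} if total else {})
    transitions = []
    for col in range(length - 1):
        # A residue means the sequence used the Match state, a gap the Deletion state.
        counts = Counter()
        for seq in msa:
            src = "D" if seq[col] == gap else "M"
            dst = "D" if seq[col + 1] == gap else "M"
            counts[(src, dst)] += 1
        probs = {}
        for (src, dst), n in counts.items():
            src_total = sum(v for (s, _), v in counts.items() if s == src)
            probs[f"{src}->{dst}"] = n / src_total  # normalise per source state
        transitions.append(probs)
    return emissions, transitions


# Hypothetical 4-sequence alignment; the second sequence has a deletion.
emissions, transitions = estimate_parameters(["ACD", "A-D", "ACE", "GCD"])
print(emissions[0])    # {'A': 0.75, 'G': 0.25}
print(transitions[0])  # {'M->M': 0.75, 'M->D': 0.25}
```

Real profile HMM builders also decide which columns become insert states and smooth all counts with priors, so these raw frequencies are only the starting point.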
Hidden Markov Models (HMM)
[Figure: profile HMM architecture with match states M1 to M10, insert states I1 to I9
and delete states D2 to D9, connected by transitions]
M = match state
I = insert state
D = delete state
Each position in an HMM consists of a
Match, Insert and Deletion state
HMMER HMM Profile
NAME  ig
ACC   PF00047.15
LENG  65
GA    25.1 13.4
TC    25.1 13.4
NC    25.0 25.0
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -28.914425 0.238245
HMM        A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
        m->m   m->i   m->d   i->m   i->i   d->m   d->d   b->m   m->e
         -16      *  -6461
M 1    -2647  -5115   -567    223  -5436   3047    164  -5186  -1236  -2912  -4204    524  -2643   -554   -178   -319   -622  -4737  -5298  -4615      1
I       -149   -500    233     43   -381    399    106   -626    210   -466   -720    275    394     45     96    359    117   -369   -294   -249
          -1 -11609 -12651   -894  -1115   -701  -1378    -16      *
2       -972   -498    831   1649  -5434    884    766  -2367     62  -5129     -1    765  -1114   1363  -2178    904  -1046  -4735  -5296  -1449      2
        -149   -500    233     43   -381    399    106   -626    210   -466   -720    275    394     45     96    359    117   -369   -294   -249
          -1 -11609 -12651   -894  -1115   -701  -1378      *      *
3      -1011  -5113    411   -343  -1695  -2365    989  -5184     60   -699     50   -278    850   -148    400   1625   1230   -856  -5296  -4613      3
        -149   -500    233     43   -381    399    106   -626    210   -466   -720    275    394     45     96    359    117   -369   -294   -249
          -1 -11609 -12651   -894  -1115   -701  -1378      *      *
HMM Profile software
• HMMER is a package to build and use HMMs
(http://hmmer.janelia.org/)
Used by the Pfam, SMART and TIGRFAM databases.
• SAM is a similar package
(http://www.cse.ucsc.edu/research/compbio/sam.html).
Used by SCOP Superfamily and Gene3D.
Advantage and Limitation of HMM Profiles
• Advantage:
Solid theoretical basis: more efficient than generalized
profile to estimate insertion and deletion penalties.
Other advantages and limitations just like generalized
profiles.
Generalized Profiles and HMM Profiles
• The format of generalized profiles is equivalent to that
of HMM profiles.
• It is easy to convert a generalized profile into an HMM profile
without losing information:
– htop program: converts an HMM profile (HMMER) into a
generalized profile.
– ptoh program: converts a generalized profile into an HMM
profile (HMMER).
Domain/Family
databases
MSA models are stored in databases
(Prosite, PRINTS, Pfam
…and
InterPro…)
Signatures Methods
• Pattern
• Fingerprint
• Sequence clustering
• Profile
• HMM
InterPro scan results
[Figure: InterProScan output. Labels: part of the protein sequence which has been
‘recognized’ by different modelled MSAs; protein folding; InterPro hits;
InterPro domain architecture]
PROSITE
• PROSITE is a database containing patterns and
generalized profiles.
• http://www.expasy.org/prosite
• Contains ~1300 patterns and ~1000 generalized profiles.
• Good documentation.
• PROSITE is also used to annotate UniProtKB/Swiss-Prot.
PROSITE Documentation Page
PROSITE Pattern Page
ID   ZF_RING_1; PATTERN.
AC   PS00518;
DT   DEC-1991 (CREATED); JUN-1994 (DATA UPDATE); DEC-2005 (INFO UPDATE).
DE   Zinc finger RING-type signature.
PA   C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA].
NR   /RELEASE=48.7,204086;
NR   /TOTAL=354(354); /POSITIVE=352(352); /UNKNOWN=0(0); /FALSE_POS=2(2);
NR   /FALSE_NEG=375; /PARTIAL=2;
CC   /TAXO-RANGE=??E?V; /MAX-REPEAT=1;
CC   /VERSION=1;
DR   Q02084, A33_PLEWA  , T; Q09654, ARD1_CAEEL , T; P36406, ARD1_HUMAN , T;
DR   Q8BGX0, ARD1_MOUSE , T; P36407, ARD1_RAT   , T; O76924, ARI2_DROME , T;
DR   O95376, ARI2_HUMAN , T; Q9Z1K6, ARI2_MOUSE , T; Q99728, BARD1_HUMAN, T;
DR   O70445, BARD1_MOUSE, T; Q9QZH2, BARD1_RAT  , T; Q9NZS9, BFAR_HUMAN , T;
DR   Q8R079, BFAR_MOUSE , T; Q5PQN2, BFAR_RAT   , T; Q96CA5, BIRC7_HUMAN, T;
...
DR   P18541, ZNFP_LYCVA , N; P19326, ZNFP_LYCVP , N; P19325, ZNFP_LYCVT , N;
DR   Q88470, ZNFP_TACV  , N; Q8NEG5, ZSWM2_HUMAN, N; Q9D9X6, ZSWM2_MOUSE, N;
DR   Q6UY11, EGFL9_HUMAN, F; P30735, VE6_MNPV   , F;
3D   1BOR; 1CHC; 1FBV; 1G25; 1JM7; 1RMD;
DO   PDOC00449;
//
PROSITE profile Page
Scanprosite Web Page
Scan Prosite Output
The PROSITE database is now complemented by a series of
rules that can give more precise information about specific
residues.
ProRule
Pfam
• The largest collection of curated domains and families
(~10000).
• Very good descriptors (Few false positives and false
negatives).
• But ~3000 motifs have less than 10 matches on UniProtKB.
• Uses HMM profiles (HMMER3).
• http://pfam.sanger.ac.uk/
Pfam entry page
SMART
• ~ 800 descriptors.
• Concentrates on large domain families and the identification
of new domains.
• Uses HMM profiles (HMMER2).
• Weak annotation.
• Good tools for genomic analysis.
• http://smart.embl.de/smart/set_mode.cgi?NORMAL=1
SMART homepage
ProDom
• ProDom is a database of protein domain families generated
automatically from the global comparison of all available
protein sequences (last release in 2008 !!).
• Descriptors are built with PSI-BLAST
• No annotation
• http://prodom.prabi.fr/prodom/current/html/home.php
• Used to define new Pfam families
Family databases: PRINTS
• Fingerprints are combinations of ungapped PSSMs. As gaps are not
allowed, they are usually directed against well-conserved short
motifs.
• The PRINTS database is specialised in subfamily classification.
(The GPCR family was divided into more than 100 sub-families)
• Contains 12,000 motifs.
• http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PRINTS homepage
Other family databases
• PANTHER: was developed to annotate the human genome.
Contains a lot of models for mammalian proteins, but very
few for plant, fungi or bacteria. Family/subfamily
classification, more than 5000 families and 25 000
subfamilies. Automatically generated.
http://www.pantherdb.org
• PIRSF: good annotation for functional residues. ~30000
automatically generated HMM profiles.
http://pir.georgetown.edu/pirsf/
• TIGRFAM only for prokaryotic proteins. 3500 HMM
profiles
http://www.tigr.org/TIGRFAMs/
Scop superfamily and CATH
• Scop Superfamily and CATH are structural domain databases using HMM profiles.
Hierarchical classification of domains.
• Use HMM profiles (SAM).
• Domain boundaries are semi-automatically extracted.
• Very sensitive methods (often more matches for a given domain than Pfam or
PROSITE).
• Useful for structure prediction but dangerous for functional prediction. Tends to
group domains that are structurally related but have no functional relationship.
(ex: TPR repeat: only alpha helices. SCOP or CATH TPR repeat profiles pick up a lot
of conserved regions rich in alpha helices but not evolutionarily linked to TPR)
• http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
• http://www.cathdb.info/
InterPro
integrates MSA models from various
databases
and
organizes them and their annotation so
relationships emerge.
InterPro
• InterPro is an attempt to group a number of protein
databases: Pfam, PROSITE, PRINTS, ProDom, SMART,
TIGRFAM, SCOP Superfamily, Gene3D.
• http://www.ebi.ac.uk/interpro
• InterPro tries to have and maintain a high quality annotation.
• The database and a stand-alone package are available to
locally run a complete InterPro analysis.
• ftp://ftp.ebi.ac.uk/pub/databases/interpro/
InterProScan
InterProScan Output
InterPro protein coverage
96.0% of UniProtKB/SwissProt
78.6% of UniProtKB/TrEMBL
Never forget that:
• The computational sequence analysis tools are naïve about real biology and the
complex relationships between molecular elements and proteins.
•
Therefore we should be critical about what we can achieve with such
computational sequence analysis tools.
•
So, again, be critical… and understand the biology.
Many thanks to
• Lorenzo Cerruti
• Nicolas Hulo
• Jennifer McDonald
• And you !
Further Reading
• Durbin, Eddy, Krogh, Mitchison. Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic acids.
Cambridge University Press, 1998.
• Attwood TK, Parry-Smith DJ. Introduction to
bioinformatics. Addison Wesley Longman Limited, 1999
• Krogh A, Brown M, Mian IS, Sjolander K, Haussler D.
Hidden Markov models in computational biology. Applications
to protein modeling. J Mol Biol. 1994 Feb 4;235(5):1501-31.
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang
Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.