Unification of sequence and structure analysis of proteins

Download Report

Transcript Unification of sequence and structure analysis of proteins

Protein Classification and Metaorganization.
Methods for Global Organization of
the Protein Universe
Golan Yona
Department of Computer Science
Cornell University
Golan Yona, Cornell University
Ismb02
Outline
•Introduction
•Sequence-based classifications of proteins
•Methods
•Sequence comparison algorithms
•Mathematical models of protein families
•Databases
•Domain-based protein sequence classification
•Protein-wise protein sequence classification
•Alternative representations of proteins
•Structure-based classifications of proteins
•Structure comparison algorithms
•Databases
•Other classifications
Golan Yona, Cornell University
Ismb02
What are proteins ?
• Structural framework (keratin, collagen)
• Transport and storage of small molecules (hemoglobin)
• Transmit information (hormones, receptors)
• Antibodies
• Blood clotting factors
• Enzymes
The protein is created in the cell as a unique sequence
of amino acids
AC M V L
L E
C
V
Golan Yona, Cornell University
Ismb02
Sequence
ACMVLLCEVEKYP…
folding
Structure
Function
Golan Yona, Cornell University
?????
Ismb02
Ismb02
Background and Problem definition
About
protein sequences are known
today (non-redundant database).
This number keeps rapidly growing
(large scale sequencing projects).
!
The function of 40-50% of the new proteins is unknown.
Understanding biological function is important for:
• Study of fundamental biological processes
• Drug design
• Genetic engineering
Golan Yona, Cornell University
Ismb02
Basic approach
•Calculate pairwise distances/similarities with sequences of
known proteins
Pairwise similarities are computed using the rigorous
dynamic programming algorithm (Smith-Waterman), or
heuristic algorithms such as Fasta and Blast.
V
MCTKKLLVYD
VRSM
Golan Yona, Cornell University
KLLLGF E
Ismb02
Database search
•A sequence is compared with all sequences in the database
MFGPVRACAATFD...
MCTKKLVTD....
RVSMLILLGFH.....
KAACGCGLMDRFYVVLVV....
.....
Its properties are extrapolated from neighboring sequences
(nearest neighbor approach to learning and generalization)
Golan Yona, Cornell University
Ismb02
BLASTP 2.2.2 [Jan-08-2002]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",
Nucleic Acids Res. 25:3389-3402.
Query= nr|001640000880 trembl: (Q935W2) DNA mismatch repair protein
(Fragment).
(164 letters)
Database: /local/databases/nr/nr
933,075 sequences; 292,319,403 total letters
Searching..................................................done
Score
E
(bits)
Value
nr|008510000010 swissprot: (Q99XL8) DNA mismatch repair protein ...
316
5e-86
nr|001640000880 trembl: (Q935W2) DNA mismatch repair protein (Fr...
298
1e-80
nr|001640000882 trembl: (Q936C5) DNA mismatch repair protein (Fr...
290
3e-78
nr|001640000877 trembl: (Q933P8) DNA mismatch repair protein (Fr...
283
4e-76
nr|001640000883 trembl: (Q936C9) DNA mismatch repair protein (Fr...
280
5e-75
nr|001350001337 trembl: (Q9ETZ4) DNA mismatch repair protein (Fr...
248
2e-65
nr|001350001345 trembl: (Q9EVZ2) DNA mismatch repair protein (Fr...
247
3e-65
nr|001350000805 trembl: (Q933M4) Putative DNA mismatch repair en...
246
7e-65
Sequences producing significant alignments:
Golan Yona, Cornell University
Ismb02
BLASTP 2.2.2 [Jan-08-2002]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",
Nucleic Acids Res. 25:3389-3402.
Query= nr|001640000690 trembl: (Q22102) Hypothetical 19.0 kDa protein.
(164 letters)
Database: /local/databases/nr/nr
933,075 sequences; 292,319,403 total letters
Searching..................................................done
Score
E
(bits)
Value
nr|001640000690 trembl: (Q22102) Hypothetical 19.0 kDa protein.
310
5e-84
nr|009250000047 trembl: (Q9GUG1) Hypothetical 104.9 kDa protein.
230
4e-60
nr|009660000030 trembl: (Q20175) Hypothetical 110.0 kDa protein.
225
1e-58
nr|008350000033 trembl: (Q17506) Hypothetical 95.8 kDa protein.
152
8e-37
nr|011390000015 trembl: (O76601) H02F09.4 protein.
152
1e-36
nr|002650001251 trembl: (Q9XUR0) T05F1.12 protein.
148
2e-35
nr|002140000359 trembl: (O16440) Hypothetical 24.8 kDa protein.
139
2e-32
nr|002210001099 trembl: (Q9N5M4) Hypothetical 25.9 kDa protein.
135
1e-31
Sequences producing significant alignments:
Golan Yona, Cornell University
Ismb02
BLASTP 2.2.2 [Jan-08-2002]
Query= nr|001640000685 trembl: (Q19708) F22B5.4 protein.
(164 letters)
Database: /local/databases/nr/nr
933,075 sequences; 292,319,403 total letters
Searching..................................................done
Sequences producing significant alignments:
Score
E
(bits)
Value
153
8e-37
nr|001670000430 trembl: (P90860) F36A2.7 protein.
>nr|001670000430 trembl: (P90860) F36A2.7 protein.
Length = 167
Score =
153 bits (386), Expect = 8e-37
Identities = 83/157 (52%), Positives = 108/157 (67%), Gaps = 2/157 (1%)
Query: 8
RKTMFRFVLSRNASTSNVPSPARIQLKKPAEAGHFQYSRNWSRDPRFVKVAIQKGDTPYQ 67
+ T+ R VL RNAS S +
+K
+ G F+Y R+ SRD R+
A + GDT
+
Sbjct: 13
KNTIARIVLVRNAS-SGLTLKHEQTIKINDQQGFFKYQRDVSRDTRYSNPA-KPGDTAAR 70
Query: 68
FLVRRLGHAYEVYPLFVLTAAWFVLFCSASYWSFGKAEIWLDRSNSKAPWDWERLRDTYW 127
F+ R+LGHAYE+YPLF L A W VLF
Sbjct: 71
++SF KAEIWLDRS++ APWDWER+R+ YW
FMFRKLGHAYEIYPLFGLLAIWCVLFGYTVWYSFEKAEIWLDRSHTVAPWDWERIRNNYW 130
Golan Yona, Cornell University
Ismb02
Smith Waterman
(Adv. App. Math 1981)
Based on the dynamic programming algorithm. Finds the
maximal local similarity given the parameters for point
mutations s(ai,bj) and the gap penalties g
The optimal alignment of the a1a2a3….ai with b1b2b3….bj must end in
one of three possible ways
ai OR ai OR bj
-
bj
If S(i,j) is the best score over all possible alignments of a1a2a3….ai
with b1b2b3….bj then
S(i,j) = MAX{S(i-1,j-1)+s(ai,bj), S(i-1,j)-g, S(i,j-1)-g}
Golan Yona, Cornell University
Ismb02
Smith Waterman
(Adv. App. Math 1981)
Implementation:
0...
j
..m
0
.
S(i-1,j-1)+s(ai,bj)
S(i-1,j)-g
i
.
n
Golan Yona, Cornell University
S(i,j-1)-g
Ismb02
FASTA
(Pearson & Lipman, PNAS 1988)
Create a hash table with all k-tuples that appear in the query
sequence, index vectors and an offset vector. Run a bounded
dynamic programming around the best diagonal/region
Library sequence 0 . . .
j
..m
0
.
i
.
n
Golan Yona, Cornell University
Ismb02
BLAST and Gapped-BLAST
(Altschul et al., JMB 1990, Altschul et al. NAR 1997)
For each word length k in the query generate all k-tuples with
score > T. Use these k-tuples to search the library sequence.
Extend the seeds to generate HSPs (high scoring segment
pairs without gaps). Combine HSPs
0...
j
..m
0
.
i
.
n
Golan Yona, Cornell University
Ismb02
PSI-BLAST
(Altschul et al. NAR 1997)
• Iterative procedure that uses a position-specific scoring matrix.
• Generate a profile P = p1p2…pn from hits detected at iteration I,
to search the database in iteration I+1, until convergence.
• The profile stores the probabilities to observe the 20 amino acids
at each position along the query sequence (derived from hits with
related sequences, after weighting and addition of pseudo-counts)
1
2
.
.
20
Golan Yona, Cornell University
1 2
n
p1 p2 …
pn
Ismb02
Limitations of sequence comparison
In many cases sequences have diverged to the extent that their
sequence similarity is undetectable with the above algorithms.
More than 50% of the 1,000,000 proteins that are known
today have an unidentified function
Iterative procedures such as PSI-BLAST are more powerful, but
are more susceptible to parameter tuning and have higher false
positive rate
Use reliable mathematical models of protein families to
classify new genes
Golan Yona, Cornell University
Ismb02
Protein classification
Sub-families (rab, ras, ran, rac, rho, ARF1, transducin,..)
More
specific
Common
Families (G-proteins, Motor proteins, DNA helicases,..)
evolutionary
origin
Super-families (P-loop containing nucleotide triphosphate
hydrolases)
Fold families (P-loop containing hydrolases,
beta/alpha tim barrel)
Classes (alpha and beta, all-alpha, ...)
Golan Yona, Cornell University
Ismb02
Why is protein classification important?
Protein classification can help to elucidate the function of new genes.
Comparing a sequence with a database of protein families is more
effective than a standard database search.
• Each family is represented as a model
 regular expression
MCTKKLVTD....
 profile
 HMM
• Those models can capture subtle similarities.
Family1
Family2
Family3
Protein classification is useful in structure and function prediction,
and especially important in large-scale annotation efforts.
Golan Yona, Cornell University
Ismb02
Databases of protein families
Domain-based clusterings:
• Prosite
(Bairoch & Bucher ISMB 1994, Laurent et al. NAR 2002).
• Pfam
(Sonnhammer et al. Proteins 1997, Bateman et al. NAR 2002).
• ProDom
(Gouzy et al. Computers and Chemistry 1999, Corpet et al. NAR 2000).
• Prints
(Attwood et al. NAR 2002).
• Domo
(Gracy & Argos Bioinformatics 1998).
• Blocks
(Henikoff et al. NAR 1999).
Protein-based clusterings:
• ProtoMap
(Yona et at. Proteins 1999, Yona et al. NAR 2000).
• Cogs
(Tatusov et al. Science 1997, Tatusov et al. NAR 2001).
• Systers
(Krause & Vingron Bioinformatics 1998, Krause et al. NAR 2002).
• PIR
(Barker et al. NAR 2000, Wu et al. NAR 2002).
Structural classifications:
• SCOP
(Murzin et al. JMB 1995).
• CATH
(Orengo et al. Structure 1997).
• FSSP
(Holm & Sander NAR 1999).
Significant overlap, but no two databases are identical.
Golan Yona, Cornell University
Ismb02
Mathematical representations of protein
families
• Regular expression
• Position specific scoring matrices (profiles)
• Hidden Markov Models
• Probabilistic suffix trees
• Sparse Markov transducers
Golan Yona, Cornell University
Ismb02
Regular expression
G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C
GAY_LDPAKSC
GMF_VEPNTSC
GKRFVRPFHNC
GKSTVDPFHNC
G - a match with amino acid G
x – a match with any amino acid
x(i,j) – a match with any amino acid at least i times and no more
than j times
[MLIV] – a match with any amino acid of the four M, L, I or V
{K,H} – a match with any amino acid other than K and H
Golan Yona, Cornell University
Ismb02
Regular expression - examples
ACTININ_1.
[EQ]-x(2)-[ATV]-[FY]-x(2)-W-x-N.
HOMEOBOX_1.
[LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x[LIVM]-x(4)-[LIV]-[RKNQESTAIY][LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)[RKNAIMW].
HELIX_LOOP_HELIX.
[DENSTAP]-[KR]-[LIVMAGSNT]{FYWCPHKR}-[LIVMT]-[LIVM]-x(2)[STAV]-[LIVMSTACKR]-x-[VMFYH][LIVMTA]-{P}-{P}-[LIVMRKHQ].
Golan Yona, Cornell University
Ismb02
Profile
(Gribskov et al. PNAS 1987)
GAA_TGAAGTC
GGT_TTAGGCC
GGCAATAGGGC
GTTAATAACAC
Position specific scoring matrix
A
C
G
T
0
0
1
0
0.25
0
0.5
0.25
0.25
0.25
0
0.5
1
0
0
0
0.5
0
0
0.5
0
0
0.25
0.75
1
0
0
0
0.5
0
0.5
0
0
0.25
0.75
0
0.25
0.25
0.25
0.25
0
1
0
0
Add pseudo counts
Na(position s)= Ntotal(s)*
Final score:
score(a/position s) =
Golan Yona, Cornell University
Σamino
Σamino
acid i
acid i
p(a/i) p(i/s)
p(i/s) score(i,a)
Ismb02
Hidden Markov Models
v1 v
a22
2
w2
a11
a12
b11 w1
v1
v2
b14
v3 v4
v3
v4
a23
w3
v1
v2 v
3 v4
•Hidden states w and observations v
w(t) – the state at time t
•Transition probabilities
P(wj(t+1)/wi(t)) = aij
•Emission probabilities
P(vk(t)/wj(t)) = bjk
•Transition and emission probabilities are learned from training data
Golan Yona, Cornell University
Ismb02
Hidden Markov Models (cont.)
a11
A
b11 w1
C
b14
G T
a22
a12
w2
a21 A
C G
T
A simple intron/exon model for DNA:
•Hidden states are w1=exon and w2=intron.
•Observations (visible symbols) are A, C, G and T
Golan Yona, Cornell University
Ismb02
Hidden Markov Models (cont.)
•HMMs used to model protein families are profile HMMs with three
different types of hidden states: Match (M), delete (D) and insertion (I)
states (Krogh et al. JMB 1996)
•The observations (visible symbols) are the amino acids
Golan Yona, Cornell University
Ismb02
Hidden Markov Models (cont.)
Given a sequence of length T, what is the probability that the sequence
was generated by the model?
Answer: Sum over all possible paths that can generate this sequence
Define the forward variable αj(t) to be the probability to observe the
sequence Vt = v(1)v(2)…v(t) and to reach hidden state j at time t
Note: P(VT) = Σj αj(T)
The HMM forward algorithm:
Initialize: αj(1) = P0(j) bjk
where vk=v(1)
Iteration: for t=1 till T-1
αj(t+1) = Σi αi(t) aij bjk
Golan Yona, Cornell University
where vk=v(t+1)
Ismb02
Hidden Markov Models (cont.)
What is the most probable sequence of hidden states that generated it?
The Viterbi algorithm:
Define the variable δj(t) to be the probability to observe the sequence
Vt = v(1)v(2)…v(t) and to reach hidden state j at time t maximized over all
sequences of hidden states of length t-1.
Define the pointer variable pathj(t) that contains the pointer to the most
probable hidden state at time t-1 from which we can reach state j at time t
Initialize: δj(1) = P0(j) bjk where vk=v(1)
pathj(1) = 0
Iteration: for t=1 till T-1
δj(t+1) = maxi {δi(t) aij } bjk
where vk=v(t)
pathj(t+1) = arg maxi {δi(t) aij}
Output:
probability P = maxi {δi(T)}
final state wT = arg maxi {δi(T)} and trace back
Golan Yona, Cornell University
Ismb02
Probabilistic Suffix Trees
(Ron et al. Machine Learning 1996)
•Identify short significant patterns (contiguous segments)
•Does not require multiple alignment
•Induces a probability distribution on the next symbol to appear right after
the segment (short term memory)
•Variable memory length
•More efficient than order L Markov chains
•Longer memory length compared to first-order HMMs, and easier to learn
Golan Yona, Cornell University
Ismb02
Probabilistic Suffix Trees (cont)
The learning algorithm
Initialize: let tree T consist of a single root node (with empty label)
let S = {σ/ σ in Σ and P(σ) > Pmin }
Building the PST: while S is not empty, pick s in S
•
Remove s from S
•
If there exists a symbol σ in Σ such that
P(σ/s) > Tmin
and
P(σ/s)
>r
or < 1/r
P(σ/suf(s))
then add to T the node s and all the nodes on the path to s from the
deepest node of T that is a suffix of s
{
•
If |s| < L then add the strings {σs / σ in Σ and P(σs) > Pmin } to S
Finalize: smooth tree probabilities
Golan Yona, Cornell University
Ismb02
Databases of protein domains
•
•
•
•
•
•
•
•
•
Prosite
Pfam
Blocks
ProDom
Prints
Domo
InterPro
Smart
eMotif
Golan Yona, Cornell University
http://www.expasy.ch/prosite/
http://www.sanger.ac.uk/Software/Pfam/
http://www.blocks.fhcrc.org/
http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
http://www.infobiogen.fr/services/domo/
http://www.ebi.ac.uk/interpro/
http://smart.embl-heidelberg.de/
http://dna.stanford.edu/identify
Ismb02
Motifs and domains
• Motif: a simple combination of a few consecutive secondary structure
elements with a specific geometric arrangement (e.g., helix-loop-helix).
Some, but not all motifs are associated with a specific biological
function.
• Domain: the fundamental unit of structure folding and evolution. It
combines several secondary elements and motifs, not necessarily
contiguous, which are packed in a compact globular structure. A
domain can fold independently into a stable 3D structure, and it has a
specific function.
• Domain family: proteins that share a
domain (possibly in combination with
other domains)
• Protein family: proteins that have the same combination of domains
Golan Yona, Cornell University
Ismb02
General approaches
• Motif based databases
• Prosite, Prints, Blocks, eMotif, InterPro
• Domain-based databases
• Pfam, ProDom, Domo, Smart
• Manual/Semi-manual
• Prosite
• Semi-automatically
• Pfam, Smart
• Fully automatic
• ProDom, Blocks, Domo, eMotif
• Use different models (regular expressions, profiles, HMMs)
• Based on each other
Golan Yona, Cornell University
Ismb02
Prosite
• A dictionary of functional and structural motifs and domains
• Valuable biological information on each family
• Each motif/domain/family is represented as a regular expression, a
rule or a profile
G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C
OR
A
C
G
T
1
0
0
1
0
2
0.25
0
0.5
0.25
3
0.25
0.25
0
0.5
4
1
0
0
0
5
0.5
0
0
0.5
6
0
0
0.25
0.75
7
1
0
0
0
8
0.5
0
0.5
0
9
0
0.25
0.75
0
10
0.25
0.25
0.25
0.25
11
0
1
0
0
• Models are generated from (usually published) multiple alignments,
manually calibrated to ensure selectivity and sensitivity
• Patterns do not always cover complete domains whereas profiles
usually span the whole domain
• Some families have more than one signature pattern and /or profiles
• Pattern length can vary from a few amino acids to several hundreds
• As of June 2002 contains 1564 patterns and profiles describing
1144 families or domains
Golan Yona, Cornell University
Ismb02
Prosite - example
Profile:
ID
AC
MA
MA
MA
MA
MA
CC
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
TEST; MATRIX.
PS50999;
/DISJOINT: DEFINITION=PROTECT; N1=4; N2=27;
/NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.1126; R2=0.02183468; TEXT='NScore';
/CUT_OFF: LEVEL=0; SCORE=110; N_SCORE=8.5; MODE=1;
/DEFAULT: M0=-8; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
/I: B1=0; BI=-105; BD=-105;
A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, Z;
/M: SY='R'; M=-12, -9,-30, -9, -6,-24, 10, -8,-33, 16,-24,-13, 0,-19, 0, 36, -7,-13,-23,-20,-17, -6;
/M: SY='R'; M= -2, -4,-24,-10, -4,-19,-16, -6,-14, 8,-14, -7, 4,-17, 4, 23, -2, -3,-12,-23,-12, -3;
/M: SY='G'; M= -5, -7,-27, -8, -7,-18, 18,-15,-30, 7,-26,-14, -1,-16, -8, 0, -2,-11,-21,-19,-15, -7;
/M: SY='G'; M= -9, 4,-27, -2, -9,-24, 24, -6,-32, 0,-27,-17, 16,-20, -6, 12, 0,-12,-27,-25,-21, -9;
/M: SY='K'; M=-11, -1,-30, -1, 9,-29,-20, -9,-30, 48,-29,-10, 0,-11, 10, 34,-10,-10,-20,-20,-10, 9;
/M: SY='K'; M= -7, -6,-24, -7, -1,-19, -4,-12,-20, 9,-19,-10, -5,-15, -4, 5, -3, -1,-11,-21,-10, -3;
/M: SY='V'; M= -1,-30,-12,-31,-30, 0,-31,-30, 32,-21, 11, 11,-29,-29,-29,-21,-11, -1, 48,-29, -9,-30;
/M: SY='T'; M= -1, -3, 7,-13,-13,-11,-21,-21,-13,-13,-11,-11, -3,-14,-13,-13, 16, 42, -1,-33,-13,-13;
/M: SY='V'; M= -6,-18,-19,-21,-16, -5,-26,-20, 9,-11, 8, 4,-15,-21,-13, -6, -7, 6, 12,-25, -7,-16;
/M: SY='V'; M= -4,-30,-17,-33,-29, 1,-33,-29, 35,-24, 17, 14,-27,-27,-26,-23,-15, -4, 40,-26, -6,-29;
/M: SY='S'; M= -3, 4,-19, 1, 7,-22,-12, -5,-20, 3,-22,-14, 9,-11, 10, 5, 16, 13,-16,-31,-15, 8;
/M: SY='G'; M= -2, 0,-28, -4,-16,-28, 56,-14,-36,-16,-30,-20, 12,-20,-16,-16, 2,-16,-30,-24,-28,-16;
/M: SY='L'; M=-10,-29,-21,-33,-23, 14,-31,-21, 23,-28, 36, 20,-27,-28,-21,-21,-25, -9, 15,-17, 3,-23;
/M: SY='E'; M= -6, 20,-26, 26, 30,-31,-12, -1,-28, 3,-23,-20, 11, -7, 12, -4, 4, -6,-26,-32,-19, 21;
/M: SY='L'; M= 5,-15,-18,-19, -8, -8,-19,-15, 3,-10, 10, 10,-16,-18, -9,-12, -9, -2, 5,-23, -9, -8;
/M: SY='F'; M= -6, -7,-21,-11,-12, 21,-18,-10,-12,-10, -9, -9, -4,-21,-17,-11, -5, -5, -9, -9, 11,-13;
/I:
I= -6; MD=-32;
/M: SY='G'; M= -2, 3,-25, 7, 2,-27, 18,-10,-27, -9,-18,-14, 0,-14, -2,-12, -1,-12,-22,-23,-20, 0; D=-6;
/I:
I= -6; MI=-32; IM=-32; DM=-32;
/M: SY='I'; M= 0,-21,-20,-25,-20, 4,-25,-20, 17,-22, 16, 12,-19,-22,-18,-21,-15, -7, 14,-19, -3,-20;
/M: SY='D'; M= -7, 27,-27, 36, 21,-32, -3, -3,-32, 0,-25,-23, 14, -9, 2, -9, 2, -9,-27,-34,-21, 12;
/M: SY='L'; M= -9,-23,-27,-21,-11, -8,-27,-18, 6,-20, 14, 5,-22, 8, -9,-18,-18, -6, -4,-23,-10,-12;
/M: SY='K'; M=-11, 2,-27, 3, 10,-22,-20, -4,-24, 23,-21,-11, 0,-12, 10, 16, -4, -2,-19,-19, -3, 9;
/M: SY='K'; M= 2, -2,-20, -3, 5,-22,-13,-12,-21, 16,-22,-12, 0,-10, 3, 6, 8, 5,-13,-27,-14, 4;
/M: SY='L'; M= -5,-28,-19,-31,-23, 10,-28,-21, 21,-26, 32, 18,-26,-27,-21,-20,-22, -8, 17,-19, -1,-22;
/M: SY='A'; M= 31,-14, 0,-22,-15,-17, -2,-22, -8,-15, -6, -8,-13,-17,-15,-21, 2, -3, 1,-24,-19,-15;
/M: SY='A'; M= 16, -5,-17, -9, -2,-22, -7, -3,-20, 5,-21,-11, -1,-12, 0, 1, 10, 0,-11,-26,-13, -2;
/M: SY='E'; M= 3, 4,-23, 5, 13,-23,-16, -9,-13, -3,-12, -7, -2, -9, 5,-10, 1, -2,-12,-27,-15, 9;
/I: E1=0; IE=-105; DE=-105;
Golan Yona,
Courtesy
of Nicolas
CornellHulo
University
Pattern:
K-x-V-[TC]-x-[IVL]-x(2)-[LMF]
Ismb02
Prints
• A database of gene- and domain-family fingerprints in the form
of collections of short motifs aligned without gaps
• Fingerprints comprise several motifs from different parts of a
multiple alignment
• Discrimination power increases with the number of motifs in a
fingerprint
• Start from a seed alignment of a few sequences. Search the
database and refine the fingerprint based on the set of sequences
that match the fingerprint completely (add sequences and/or
add/remove motifs), until convergence
• Partial match – possibly subfamily members
• As of July 2002, contains ~10,500 motifs grouped into 1750
families
Golan Yona, Cornell University
Ismb02
Prints - a fingerprinting overview
annotation
PRINTS
N
Golan Yona,
Courtesy
of Terri
Cornell
Attwood
University
C
Ismb02
Prints - example
SUMMARY INFORMATION
37 codes involving
0 codes involving
0 codes involving
0 codes involving
0 codes involving
1 codes involving
0 codes involving
8
7
6
5
4
3
2
elements
elements
elements
elements
elements
elements
elements
COMPOSITE FINGERPRINT INDEX
8| 37
37
37
37
37
37
37
37
7|
0
0
0
0
0
0
0
0
6|
0
0
0
0
0
0
0
0
5|
0
0
0
0
0
0
0
0
4|
0
0
0
0
0
0
0
0
3|
1
0
0
0
1
1
0
0
2|
0
0
0
0
0
0
0
0
-+----------------------------------------|
1
2
3
4
5
6
7
8
True positives..
PRIO_COLGU
PRIO_MACFA
PRIO_CEREL
PRIO_GORGO
PRIO_PANTR
PRIO_HUMAN
PRIO_SHEEP
PRIO_CALJA
PRIO_BOVIN
PRIO_ATEPA
PRIO_SAISC
PRIO_PREFR
O75942
PRIO_CAPHI
PRIO_CEBAP
PRIO_FELCA
PRP1_TRAST
PRIO_RABIT
PRIO_PIG
PRIO_CANFA
PRIO_CRIGR
Q15216
PRIO_RAT
PRIO_CERAE
PRIO_MUSVI
PRIO_MESAU
PRIO_MOUSE
PRIO_TRIVU
Subfamily: Codes involving 3 elements
Subfamily True positives..
PRIO_CHICK
Golan Yona,
Courtesy
of Terri
Cornell
Attwood
University
PRIO_ODOHE
O46648
PRP2_BOVIN
PRIO_PONPY
PRIO_CAMDR
PRP2_TRAST
PRIO_CRIMI
PRIO_MUSPF
O46593
Ismb02
Prints - website
Golan Yona,
Courtesy
of Terri
Cornell
Attwood
University
Ismb02
ProDom
• A database of automatically generated protein domains
• Old procedure:
– Identify high-scoring segment pairs (HSPs) using BLAST for all-versusall comparison of SwissProt
– Construct homologous segment sets by transitive closure of overlapping
HSPs
– Break into domains
• New procedure:
– Construct homologous segment sets by PSI-BLAST recursive search
• Generate a multiple alignment and a consensus sequence for each
family
• May break real domains into smaller domains (more than
305,000 domain families as of January 2002)
Golan Yona, Cornell University
Ismb02
ith iteration
DB
query
internal repeat detection
yes
no
query
PSI-BLAST
no match
matches
repeat matches
DB changes
remove newly found domains
split modified sequences
sort by size
(i+1)th iteration
DB
query
Golan Yona,
Courtesy
of Daniel
CornellKahn
University
Ismb02
ProDom - website
Domain analysis
• Which proteins contain a given domain?
• Which proteins share a homologous domain with a given protein?
Domain
motif
Links to other
representations of the family
Summary alignment
Parameters to define
sub-family levels
Golan Yona,
Courtesy
of Daniel
CornellKahn
University
Ismb02
ProDom - website
• Synthetic views of alignments and trees
• Homology search with graphical output
Golan Yona,
Courtesy
of Daniel
CornellKahn
University
Ismb02
•Links to SWISS-PROT, TREMBL, PROSITE & PDB
•Links to 3-D modelling with SWISS-MODEL
Golan Yona,
Courtesy
of Daniel
CornellKahn
University
Ismb02
Blocks
• A collection of blocks (short ungapped alignments)
• Each represent a conserved region of a protein family (not
necessarily associated with a known function)
• Blocks are generated from multiple alignments of groups of
related proteins.
• Groups are derived from protein families in Prosite, Smart, Pfam,
ProDom
• Used only the SWISSPROT entries to define the blocks (to avoid
false positive sequences from TrEMBL)
• Complementary to PRINTS
• As of August 2001 contains 8656 blocks representing 2101
groups
Golan Yona, Cornell University
Ismb02
Blocks (cont)
• Detection:
– Identify short motifs using the algorithm by Smith 1990
GAYKLDPAKSC
GMFMVEPNTSC
GKRFVRPFHNC
GKSTVDPFHNC
– Short patterns are then extended in both directions until the similarity
score drops. Maximal length of a block is 60
DKGAYKLDPAKSCNL
HSGMFMVEPNTSCHF
GLGKRFVRPFHNCWM
MMGKSTVDPFHNCHD
• Assembly
– Create an ordered set of non overlapping blocks (“path”) that occur in a
critical number of sequences
DKGAYKLDPAKSCNL
HSGMFMVEPNTSCHF
GLGKRFVRPFHNCWM
MMGKSTVDPFHNCHD
KHAYKLDAKSNLL
VHMFMVDNTSHFL
LHKRFVDFHSWMM
KKVLDEAKSCN
KLILEENTSCH
SLFLREFHNCW
AAVLDEFHNCH
SKLLEEFSECH
• Scoring
– Score individual blocks based on length, similarity, #sequences, etc
– Score a path based on scores of the constituent blocks and #sequences
Golan Yona, Cornell University
Ismb02
Pfam
• A database of Hidden Markov Models for protein domains/families generated
semi-automatically
• Start from a seed multiple alignment (published or from other databases, such as
Prosite and ProDom).
• The alignment usually represent complete domains (confirmed with structural data
when available).
• After manual inspection a profile HMM is built from the alignment
• The HMM is used to search the database. If a true member is missed, then it is
added to the seed alignment and the process is repeated.
• A full alignment is created for all members of the family. If it is not correct the
process is repeated with another seed alignment or another alignment method
GAY_LDPAKSC
GMF_VEPNTSC
GKRFVRPFHNC
GKSTVDPFHNC
build
search
Database
refine
• As of May 2002, contains 3849 families, domains, repeats and motifs (in addition
Pfam B derived from ProDom)
Golan Yona, Cornell University
Ismb02
Domo
• Combines compositional and local similarity search followed by multiple
sequence alignments
• Start by detecting global similarities from the comparison of amino acid and
dipeptide composition of each protein.
• Group similar proteins into clusters
• Pick one representative from each cluster and compile into a suffix tree
• Compare the suffix tree with itself
local similarities
• Cluster local similarities and align (without gaps)
• Analyze alignments to detect domain boundaries and split
• Create multiple alignments
• Iterative family extension using automatically generated regular expressions
• Contains 8877 entries
Golan Yona, Cornell University
Ismb02
InterPro
• Combined resource of proteins domain and
motifs from ProSite, ProDom, Prints, Smart and
Pfam
Golan Yona, Cornell University
Ismb02
Databases of protein clusters
•
•
•
•
Cogs
Systers
PIR
ProtoMap
http://www.ncbi.nlm.nih.gov/COG/
http://systers.molgen.mpg.de/
http://wwwnbrf.georgetown.edu/pirwww/pirhome.shtml
http://protomap.cornell.edu/
What are the differences?
Golan Yona, Cornell University
Ismb02
General approach
Guide lines:
•Two proteins are homologous if they evolved from the same
ancestor protein.
•We can deduce homology based on statistically significant
sequence similarities
•Homology is a transitive relation
Golan Yona, Cornell University
Ismb02
Transitive closure leads to false connections
•Similarity is not transitive and it does not necessarily entail
homology
Main traps:
1) Overestimates of the statistical significance of similarity scores.
Chance similarities
2) Multi-domain proteins
protein 1
protein 2
protein 3
Golan Yona, Cornell University
Ismb02
Systers
•
•
•
•
•
Run all-against-all comparison (Smith-Waterman) of database sequences
Apply the single linkage clustering algorithm to construct hierarchy of all sequneces
Derive superfamilies from the structure of the hierarchy
Construct distance graph for each superfamily
Cut weak connections in each superfamily distance graph to derive family clusters
290,811 non-redundant sequences (without inclusions and duplicates) from Swiss-Prot,
TrEMBL, PIR, wormbase, FlyBase, etc. (as of August 2000)
cluster into 64,282 superfamilies,
split into 82,449 family clusters (including 55,182 single sequence clusters of mostly
fragmental sequences),
Seed
annotated with Pfam domains
1e-80
1e-72
1e-57
1e-42
1e-38
1e-31
1e-14
1e-08
1e-01
Golan Yona, Cornell University
Seed
1e-85
1e-76
1e-40
1e-36
1e-22
1e-02
Ismb02
Systers
Single Linkage
Hierarchy
Superfamilies
Superfamily
Distance Graph
Family
Cluster
Superfamilies as well as family clusters are derived
from the structure generated by the data itself
 no need for a user defined static cutoff
Golan Yona,
Courtesy
of Antje
Cornell
Krause
University
Ismb02
COGs
• Based on comparison of 43 complete genomes (as of May 2002)
• Apply single linkage clustering algorithm to get
• COGs - clusters of orthologous proteins (genes in different species
that evolved from the same ancestral protein) or orthologous sets of
paralogs (genes from the same genome which are related by
duplication).
• Each COG consists of at least 3 species.
• Start by forming minimal COGs (triangles)
• Merge COGs that share a side edge
• Refine – split COGs that were incorrectly merged due to multi-domain
proteins or other reasons
• COGNITOR: a tool to add new proteins to COGs on the basis of
genome-specific best hits; used to incorporate new genomes.
• Total of 3307 COGs as of May 2002 (including 74,059 of 104,101
proteins encoded in the analyzed genomes)
Golan Yona, Cornell University
Ismb02
ProtoMap
A graph-based approach: sequence space is represented as a directed
graph where vertices = protein sequences and edges represent relationships
between proteins. The width (weight) of an edge reflects the significance.
1 Starting from a very high resolution classification
Only connections of very high statistical significance (10-100) are considered.
The graph splits into connected components
Myoglobin
Hemoglobin alpha
Leghemoglobin
Hemoglobin beta
e-value > 10-100
e-value < 10-100
2 Lower the threshold (consider weaker similarities). Merge together clusters that are
strongly connected
obtain the classification at the next level of confidence
This process is repeated at thresholds 10-95 ,10-90,10-85 ,10-80 ..... 10-5 ,10-0 (=1)
Golan Yona, Cornell University
Ismb02
The clustering algorithm (cont.)
rejected
Phase I: identify pairs of clusters candidate
rejected
for merging (local test)
phase II: identify groups of clusters which
are strongly connected (global test)
This process is repeated at thresholds 10-95 ,10-90,10-85 ,10-80 ..... 10-5 ,10-0 (=1)
365,174 proteins are classified into 18,140 clusters (+29,000 singletons)
Golan Yona, Cornell University
Ismb02
ProtoMap - website
Results of the search on keyword “electron transport”.
Golan Yona, Cornell University
Ismb02
Results of the search on sequence ID “acha_human”
Golan Yona, Cornell University
Ismb02
Browsing a cluster (part1 – list)
Golan Yona, Cornell University
Ismb02
Browsing a cluster (part2 –higher level constituents)
Golan Yona, Cornell University
Ismb02
Higher level constituents - cont.
Golan Yona, Cornell University
Ismb02
Alternative representations
• Is there a unique universal representation of
proteins?
Golan Yona, Cornell University
Ismb02
Alternative representations
• Studies based on dipeptide composition - a 400 dimensional
vector
• van Heel 1991:
– Compared sequences by means of the correlation of their
representations (advantages..)
– Linear projection and clustering (k-means) into fixed number of
clusters
• Ferran et al. 1994
– Used a two-layer neural network to classify sequences based on
their dipeptide composition into a fixed number of clusters (trained
using the Kohonen learning algorithm)
– Meaning of weight vectors..
Golan Yona, Cornell University
Ismb02
Alternative representations
• Combination of compositional properties and other
physical/chemical properties (Hobohm & Sander 1995)
– Composition of the 20 amino acids
– Composition of different groups of amino acids (polar, charged,
aromatic, bulky, etc)
– Isoelectric point
– Molecular weight
– Average hydrophobicity
– Length
– Dipeptide composition (used reduced representation)
• The combination is weighted so as to achieve maximal
separation over the train set
• Can induce alternative similarity measures between proteins
and used to search for close relatives
Golan Yona, Cornell University
Ismb02
Alternative representations
Casari, et al. (1995) A method to predict functional residues in proteins
Input: a multiple-sequence alignment
- each sequence is converted to a vector of size (20 * L) where L is length of the
alignment
Generation of of N x (20*L) matrix
- one sequence produces a vector of dimensions 20*L
-
N sequences to produce N vectors of dimension 20*L
Use Principal Component Analysis
-
calculate the covariance matrix- indicates how different dimensions are correlated to one
another
largest eigenvalues and corresponding eigenvectors are the principal components
take the three largest (the largest of which represents consensus sequence)
project their 20*L dimensional data onto those 3 dimensions (minimize covariance by
projection onto a new basis of the eigenvectors of the covariance matrix)
- this can be used to predict a protein subfamily for a given protein and functional residues
Golan Yona, Cornell University
Ismb02
Structural classifications
• SCOP
• CATH
• FSSP
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
http://www.ebi.ac.uk/dali/fssp/fssp.html
Structure comparison algorithms
•Dali
•CE
•Structal
Golan Yona, Cornell University
Ismb02
Dali
(Holm & Sander JMB 1993)
• Comparison of protein structures using 2D
distance matrices to represent 3D structures
ROP
CYC b56
1
CYC b56
2
3
ROP
Golan Yona, Cornell University
Ismb02
Dali (cont.)
• Assignment of equivalent residue pairs
Golan Yona, Cornell University
Ismb02
Dali (cont.)
L
L
S = i=1
S S
f(i,j)
j=1
• The similarity score of two structures
• i and j are labeled pairs of equivalent (matched) residues
(i.e. i = iA,iB).
A
B
 f = similarity measure based on Ca-Ca distances d ij and d ij
• Largest S corresponds to optimal set of equivalencies.
• Rigid similarity scores:
fR(i,j) = q R – | dAij – dBij |
q R = zero level of similarity
• Elastic similarity score:
d*ij = the average of dAij and dBij
q E = tolerance of 20% deviation
w(r) = envelope function = exp(-r2/a2)
Golan Yona, Cornell University
fE(i,j) =
A – dB |
|
d
* )
ij
ij
E
w(d
ij
q *
d ij
qE
Ismb02
CE
(Shindyalov & Bourne, Protein Eng. 1998)
Protein Structure Alignment by Incremental Combinatorial
Extension (CE) of the Optimal Path
Define alignment fragment pair (AFP) as a continuous segment of protein A
aligned against a continuous segment of protein B (without gaps).
•An alignment is a path of AFPs s.t. for every two consecutive AFPs there may
be gaps inserted into either A or B, but not into both. That is, for every two
consecutive AFPs i and i+1
piA1  piA  m and
piB1  piB  m
or
piA1  piA  m
and
piB1  piB  m
or
piA1  piA  m
and
piB1  piB  m
where piA is the starting position of AFP i in protein A
Golan Yona, Cornell University
Ismb02
CE
What is a “good”AFP?
Define the distance between two different AFPs i and j as:
1 m
A
A
B
B
Dij   d A ( pi  k  1, p j  m  k )  d B ( pi  k  1, p j  m  k )
m k 1
dA(p,q) represents the distance between the alpha carbon atoms at positions p
and q in protein A.
Protein B
Dij
i
j
i
j
Protein A
If you already have n-1 AFPs and consider adding the n-th AFN, do so only if
(1) Dnn  D0
1 n 1
( 2)
Din  D1

n  1 i 0
Golan Yona, Cornell University
1
(3) 2
n
n
n
 D
i 0 j 0
ij
 D1
Ismb02
CE (cont.)
1.
2.
3.
Select an initial AFP.
Build an alignment path by incrementally adding “good” AFPs that
satisfy the conditions of paths
Repeat step (2) until the proteins are completely matched, or until no
good AFPs remain.
Protein B
Protein A
4.
To assess the significance of the alignment, compare it to the
alignment of a random pairs of structures, and compute the Z-score
based on the RMSD and number of gaps in the final alignment.
Golan Yona, Cornell University
Ismb02
Structal
(Levitt & Gerstein, PNAS 1998)
An initial equivalence is chosen, based on matching the ends of the two
structures.
Repeat until convergence:
•
Superimpose the two structures so as to minimize the RMS, given the
equivalence
•
Given the superposition, calculate the distances dij between any atom i
in the first protein and any atom j in the second protein
•
Transform distances into similarities sij = M/[1+ (dij/d0)2] where
M=20 and d0 = 2.24A
•
Apply dynamic programming to define a new set of equivalences
Golan Yona, Cornell University
Ismb02
Structal (cont)
2) Superimpose to
minimize RMS
1) Alignment fixed
4) Use dynamic prog.
to find the best set
of equivalences
3) Calculate distances
between all atoms
5) Superimpose given
the new alignment
6) Recalculate distances
between all atoms
Golan Yona, Cornell University
Ismb02
SCOP
Structural classification of proteins with 5 level hierarchy:
Domains: the individual entries
Family: homologous proteins with significant sequence similarity
Superfamily: protein families that share weak sequence similarity but
with conserved functional residues (e.g. in active sites) – believed to
be evolutionary related
Fold: protein superfamilies that share he same fold (not necessarily due to
common evolutionary ancestry)
Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins,
small proteins
The classification is based on manual analysis by experts (Dr. Alexy
Murzin)
As of May 2002, 7 main classes, 686 folds, 1073 superfamilies, 1827
families
Golan Yona, Cornell University
Ismb02
CATH
Structural classification of proteins with 5 level hierarchy:
Protein chains: the individual entries
Homologous superfamily: proteins with highly similar structures and
functions.
Topology: clusters according to the topological connections and
numbers of secondary structures.
Architecture: describes the gross orientation of secondary
structures, independent of connectivities (assigned manually).
Class: derived from secondary structure content, is assigned for
more than 90% of protein structures automatically.
The assignments of structures to topology families and homologous
superfamilies are made by sequence and structure
comparisons.
As of Jan 2002, 8 main classes, 46 architectures, 1453 topologies,
more than 2000 superfamilies.
Golan Yona, Cornell University
Ismb02
FSSP
Structural classification of proteins into a tree hierarchy:
Protein domains: the individual entries (defined using the algorithm of
Holm and Sander 1994)
Start with all-vs-all structure comparison of protein domains
Domains are clustered automatically into clusters using the single linkage
algorithm based on the z-scores of the structure similarity scores
3242 families of more than 30,000 structures as of June 2002
Golan Yona, Cornell University
Ismb02
Combined sequence and structure studies
•
•
•
•
•
•
Han & Baker 1996
–
Multiple alignments of protein families
–
Create profiles
–
Cluster profile columns to clusters
–
Each cluster (signature pattern) is mapped to a structural motif
Rigoutsos et al. 1999
–
Analysis of 17 complete genomes
–
Unsupervised pattern discovery (represented as extended regular
expressions) – 1D dictionary
–
Locate motifs in the PDB database and align the corresponding
substructures
–
If the RMSD is below 2.5A enter to 3D dictionary
Elofsson & Shonnhammer 1999
–
Mapping of Pfam domains to SCOP domains and vice versa.
Kasuya and Thoronton 1999
Salamov et al. 1999
Jonassen et al. 2000
Golan Yona, Cornell University
Ismb02
BioSpace
(Yona & Levitt, ISMB 2000)
MCAVSST...
RASWIILTC...
DYEEPEERT...
Type1 clusters (structure-based)
Golan Yona, Cornell University
Type2 clusters
Ismb02
Summary
Many different classification systems
•
Different source databases, different objects
•
Different representations
•
Different methodologies
Significant overlap, but differences are too fundamental to
accommodate in a single classification system.
Is there a universal definition and characterization of domain
families and protein families?
Golan Yona, Cornell University
Ismb02
Acknowledgments
Terri Attwood (prints)
Jerome Gracy (domo)
Steven Henikoff (blocks)
Nicolas Hulo (prosite)
Daniel Kahn (prodom)
Eugene Koonin (cogs)
Antje Krause (systers)
Erik Sonnhammer (Pfam)
Chris Chau, Nick DeNunzio, Leonid Meyerguz
Golan Yona, Cornell University
Ismb02