PPT - CABM Structural Bioinformatics Laboratory

Download Report

Transcript PPT - CABM Structural Bioinformatics Laboratory

Some Specific
“Informatics” tools
of Bioinformatics
• Databases














NCBI GenBank- Protein and DNA sequence
NCBI Human Map - Human Genome Viewer
NCBI Ensembl - Genome browsers for
human, mouse, zebra fish, mosquito
TIGR - The Institute for Genome Research
SwissProt - Protein Sequence and Function
ProDom - Protein Domains
Pfam - Protein domain families
ProSite - Protein Sequence Motifs
Protein Data Base (PDB) - Coordinates for
Protein 3D structures
SCOP Database- Domain structures
organized into evolutionary families
HSSP - Domain database using Dali
FlyBase
WormBase
PubMed / MedLine
• Sequence Alignment Tools






Clustal
FASTA
Simple Blast
Gapped Blast
PSI-Blast
Hidden Markov Models
• 3D Structure Alignments /
Classifications





Dali
VAST
PRISM
CATH
SCOP
Community Assembly Through Adaptive
Radiation in Hawaian Spiders.
R. Gillespie. Science (2004): 303: 356
Phylogeny of spiny leg spider
clade based on combined
mitochondrial cytochrome oxidase
I, 12S ribosomal DNA, and 16S
ribosomal DNA sequences
Sequence-based Dendrograms
It is widely used in:
- Phylogenetic analysis
- Prediction of protein
secondary/tertiary structure
- Finding diagnostic patterns to
characterize protein families
- Detecting new homologies
between new genes
and established sequence
families
Multiple Sequence
Alignments
- Practically useful methods only since
1987
- Before 1987 they were constructed by
hand
- The basic problem: no dynamic
programming approach can be used
- First useful approach by D. Sankoff
(1987) based on phylogenetics
(LEFT, adapted from Sonhammer et al. (1997).
“Pfam,” Proteins 28:405-20. ABOVE, G Barton
AMAS web page)
4 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
- One of the most essential tools in
molecular biology
- Most multiple alignments based on this approach
- Initial guess for a phylogenetic tree based on pairwise alignments
- Built progressively starting with most closely related sequences
- Follows branching order in phylogenetic tree
- Sufficiently fast
- Sensitive
- Algorithmically heuristic, no mathematical property associated with the
alignment
- Biologically sound, it is common to derive alignments which are impossible to
improve by eye
(adapted from Sonhammer et al. (1997). “Pfam,” Proteins 28:405-20)
5 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Progressive Multiple Alignments
6 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Popular
Multiple
Alignment
Programs
7 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Clustering approaches for multiple
sequence alignment:
All vs All Transitive Relationships
SGMPLVSANHGVTG-------MPVSAFTVILS--KAYPA---VGCPHPIYEILYNRQQHY
----------ALTG-------MPVSAFTVILS--KAYPG---ATVPIKFDKILYNRQQHY
----------GGPA-------YEMPAFTAELT--APFPP---VGGPVKFNKLLYNGRQNY
HAYAGKKGKHGGPA-------YEMPAFTAELT--VPFPP---VGAPVKFDKLLYNGRQNY
----------ELSA-------HATPAFTAVLT--SPLPA---SGMPVKFDRTLYNGHSGY
----GTPGRKGEPGE---AAYMYRSAFSVGLETRVTVP-----NVPIRFTKIFYNQQNHY
------RGPKGPPGE---SVEQIRSAFSVGLFPSRSFPP---PSLPVKFDKVFYNGEGHW
-------GPPGPPGMTVNCHSKGTSAFAVKAN--ELPPA---PSQPVIFKEALHDAQGHF
----------NIRD-------QPRPAFSAIRQ---NPMT---LGNVVIFDKVLTNQESPY
--------------D---YRATQKVAFSALRTINSPLR----PNQVIRFEKVITNANENY
--------------D---YKATQKIAFSATRTINVPLR----RDQTIRFDHVITNMNNNY
--------------V---RSGSAKVAFSAIRSTNHEPSEMSNRTMIIYFDQVLVNIGNNF
---ENALAPDFSKGS---YRYAPMVAFFASHTYGMTIP------GPILFNNLDVNYGASY
.* .
:
:
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
DPRSGIFTCKIPGIYYFSYHVHVKGT--HVWVGLYKNGTP-TMYTY---DEYSKGYLDTA
DPRTGIFTCRIPGLYYFSYHVHAKGT--NVWVALYKNGSP-VMYTY---DEYQKGYLDQA
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-VMYTY---DEYKKGFLDQA
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-MMYTY---DEYKKGFLDQA
NPATGIFTCPVGGVYYFAYHVHVKGT--NVWVALYKNNVP-ATYTY---DEYKKGYLDQA
DGSTGKFYCNIPGLYYFSYHITVYMK--DVKVSLFKKDKA-VLFTY---DQYQEKNVDQA
DPTLNKFNVTYPGVYLFSYHITVRNR--PVRAALVVNGVR-KLRTR---DSLYGQDIDQA
DLATGVFTCPVPGLYQFGFHIEAVQR--AVKVSLMRNGTQ-VMERE---AEAQDG-YEHI
QNHTGRFICAVPGFYYFNFQVISKWD--LCLFIKSSSGGQ-PRDSLSFSNTNNKGLFQVL
EPRNGKFTCKVPGLYYFTYHASSRGN---LCVNLVRGRDRDSMQKVVTFCDYAQNTFQVT
EPRSGKFTCKVPGLYYFTYHASSRGN---LCVNLMRGRER--AQKVVTFCDYAYNTFQVT
DSERSTFIAPRKGIYSFNFHVVKVYNRQTIQVSLMLNGWP----VISAFAGDQDVTREAA
TPRTGKFRIPYLGVYVFKYTIESFSA--HISGFLVVDGIDKLAFESEN-INSEIHCDRVL
. *
* * * :
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
SGSAIMELTENDQVWLQLPNA-ESNGLYSSEYVHSSFSGFLVAPM------SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI------SGSAVLLLRPGDRVFLQMPSE-QAAGLYAGQYVHSSFSGYLLYPM------SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM------SGGAVLQLRPNDQVWVQIPSD-QANGLYSTEYIHSSFSGFLLCPT------SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN----SNLALLHLTDGDQVWLETLR--DWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN------AGGTVLQLRRGDEVWIEKDP--AKGRIYQGTEADSIFSGFLIFPS------TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD------TGGMVLKLEQGENVFLQATD---KNSLLGMEGANSIFSGFLLFPD------SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL------TGDALLELNYGQEVWLRLAK----GTIPAKFPPVTTFSGYLLYRT------. :: :
: . :
* *:*.
Clustal
Alignment
8 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
- Motif: a short signature pattern identified in
the conserved region of the multiple alignment
- Profile: frequency of each amino acid at each position is
estimated
Profiles
Motifs
HMMs
- HMM: Hidden Markov Model, a generalized profile in
rigorous mathematical terms
Can get more
sensitive
searches with
these multiple
alignment
representations
(Run the profile
against the DB.)
Structure Sequence
2hhb
1ecd
1mbd
Core
Core
HAHU
- D - - - M P N A L S A L S D L H A H K L - F - - R V D P V N K L L S H C L L V T L A A H
HADG
- D - - - L P G A L S A L S D L H A Y K L - F - - R V D P V N K L L S H C L L V T L A C H
HATS
- D - - - L P T A L S A L S D L H A H K L - F - - R V D P A N K L L S H C I L V T L A C H
HABOKA
- D - - - L P G A L S D L S D L H A H K L - F - - R V D P V N K L L S H S L L V T L A S H
HTOR
- D - - - L P H A L S A L S H L H A C Q L - F - - R V D P A S Q L L G H C L L V T L A R H
HBA_CAIMO
- D - - -
HBAT_HO
- E - - - L P R A L S A L R H R H V R E L - L - - R V D P A S Q L L G H C L L V T P A R H
G G ICE3
P - - - N I E A D V N T F V A S H K P RG - L - N - - T H DQ N N F R A G F V S Y M K A H
CTTEE
P - - - N I G K H V D A L V A T H K P R G - F - N - - T H A Q N N F R A A F I A Y L K G H
I A G A L S K L S D L H A Q K L - F - - R V D P V N K F L G H C F L V V V A I H
GGICE1
P - - - T I L A K A K D F G K S H K S R A - L - T - - S P A Q D N F R K S L V V Y L K G A
MY WHP
- K - G H H E A E L K P L A Q S H A T K H - L - H K I P I K Y E F I S E A I I H V L H S R
MYG_CASFI
- K - G H H E A E I K P L A Q S H A T K H - L - H K I P I K Y E F I S E A I I H V L Q S K
MYHU
- K - G H H E A E I K P L A Q S H A T K H - L - H K I P V K Y E F I S E C I I Q V L Q S K
MYBAO
- K - G H H E A E I K P L A Q S H A T K H - L - H K I P V K Y E L I S E S I I Q V L Q S K
Consensus Profile
<
- c - - d L P A E h p A h p h ? H A ? K h - h - d c h p h c Y p h h S ? C h L V v L h p p
<
<
<
9 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Fuse multiple alignment into:
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
SGSAIMELTENDQVWLQLPNA-ESNGLYSSEYVHSSFSGFLVAPM------SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI------SGSAVLLLRPGDRVFLQMPSE-QAAGLYAGQYVHSSFSGYLLYPM------SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM------SGGAVLQLRPNDQVWVQIPSD-QANGLYSTEYIHSSFSGFLLCPT------SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN----SNLALLHLTDGDQVWLETLR--DWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN------AGGTVLQLRRGDEVWIEKDP--AKGRIYQGTEADSIFSGFLIFPS------TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD------TGGMVLKLEQGENVFLQATD---KNSLLGMEGANSIFSGFLLFPD------SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL------TGDALLELNYGQEVWLRLAK----GTIPAKFPPVTTFSGYLLYRT------:: :
:
:
* *:*
10 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Motifs
- several proteins are grouped together by similarity
searches
- they share a conserved motif
- motif is stringent enough to retrieve the family members
from the complete protein database
- PROSITE: a collection of motifs (1135 different motifs)
PKC_PHOSPHO_SIT
E
Protein kinase C
phosphorylation
site
[ST]-x-[RK]
Post-translational
modifications
RGD
Cell attachment
sequence
R-G-D
Domains
SOD_CU_ZN_1
Copper/Zinc
superoxide
dismutase
[GA]-[IMFAT]-H-[LIVF]-Hx(2)-[GP]-[SDG]-x-[STAGDE]
Enzymes_Oxidoreduc
tases
THIOL_PROTEASE_
ASN
Eukaryotic thiol
(cysteine)
proteases active
site
[FYCH]-[WI]-[LIVT]-x-[KRQAG]N-[ST]-W-x(3)-[FYW]-G-x(2)-G[LFYW]-[LIVMFYG]-x-[LIVMF]
Enzymes_Hydrolases
TNFR_NGFR_1
TNFR/CD27/30/4
0/95 cysteine-rich
region
C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)C-x(2,3)-C-x(7,11)-C-x(4,6)[DNEQSKP]-x(2)-C
Receptors
11 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Motifs
· Each element in a pattern is separated from its neighbor by a “-”.
· The symbol “x” is used for a position where any amino acid is accepted.
· Ambiguities are indicated by listing the acceptable amino acids for a given
position, between brackets “[]”.
· Ambiguities are also indicated by listing between a pair of braces “{}” the
amino acids that are not accepted at a given position.
· Repetition of an element of the pattern is indicated by with a numerical
value or a numerical range between parentheses following that element.
Cons.
Cys
Cons A
V
-1
D
0
V
0
D
0
P
3
C
5
A
2
s
2
n
-1
p
0
c
5
L
-5
N
-4
g
1
G
6
T
3
C
5
I
-6
d
-4
i
0
g
1
L
-5
E
0
S
3
Y -14
T
0
C
5
R
0
C
5
P
0
P
1
G
4
y -22
S
1
G
5
E
2
R
-5
C
5
E
0
T
-4
D
0
I
0
D
-4
C
-2
-14
-13
-20
-18
115
-7
-12
-15
-18
115
-14
-16
-16
-17
-10
115
-13
-19
-6
-13
-11
-20
-13
-9
-10
115
-13
115
-14
-18
-19
-6
-9
-20
-20
-17
115
-26
-11
-18
-10
-15
D
-9
-1
-9
18
1
-32
-2
3
4
-7
-32
-17
12
7
0
0
-32
-19
8
-8
0
-20
14
4
-25
-6
-32
0
-32
-8
-3
3
-35
-3
1
10
0
-32
20
-13
5
-2
-1
E
-5
-1
-7
11
3
-30
-2
2
4
-6
-30
-9
5
1
-7
2
-30
-11
6
-6
0
-14
10
3
-22
-1
-30
2
-30
-4
0
-4
-31
-1
-8
12
1
-30
25
-8
4
-1
-2
F
-13
-16
-15
-34
-26
-8
-21
-25
-19
-17
-8
0
-20
-35
-49
-21
-8
0
-15
-4
-20
0
-33
-28
31
-11
-8
-19
-8
-15
-24
-48
55
-14
-52
-31
-16
-8
-34
-1
-24
-17
-13
G
-18
-10
-10
0
-9
-20
-5
0
-7
-11
-20
-25
0
29
59
-12
-20
-28
-13
-11
-3
-23
5
3
-34
-16
-20
-11
-20
-17
-13
53
-43
-7
66
-7
-13
-20
-5
-21
-11
-14
-16
H
-2
0
-6
4
-5
-13
-4
0
3
0
-13
-5
24
0
-13
-3
-13
-5
5
-5
-3
-9
0
0
10
-2
-13
1
-13
0
-3
-11
11
0
-14
0
8
-13
6
2
-1
-3
-3
I
-5
-12
-5
-26
-14
-11
-12
-18
-16
-17
-11
4
-24
-31
-41
-5
-11
8
-17
3
-12
9
-25
-18
-5
-7
-11
-12
-11
-7
-12
-40
-1
-10
-45
-19
-16
-11
-25
0
-11
-4
-8
K
-2
0
-5
7
-1
-28
-2
0
2
-5
-28
-5
5
-1
-10
1
-28
-4
0
-5
-3
-11
2
2
-17
-1
-28
4
-28
-1
1
-7
-25
-2
-11
6
9
-28
10
-4
2
-1
-5
L
-7
-13
-7
-27
-14
-15
-13
-18
-16
-15
-15
8
-25
-31
-41
-11
-15
6
-16
1
-13
8
-26
-20
0
-9
-15
-13
-15
-7
-13
-40
6
-12
-44
-20
-16
-15
-25
-1
-14
-9
-6
M
-4
-8
-5
-20
-12
-9
-9
-13
-10
-14
-9
8
-18
-23
-32
-5
-9
8
-12
2
-8
7
-19
-13
-1
-5
-9
-8
-9
-5
-10
-31
4
-7
-35
-15
-11
-9
-17
0
-9
-4
-4
N
-3
1
-6
15
-1
-18
0
4
7
-5
-18
-12
25
12
3
1
-18
-12
5
-5
0
-14
11
6
-14
-3
-18
3
-18
-4
-2
5
-21
0
4
5
5
-18
9
-6
1
0
-1
P
-5
-3
-4
0
12
-31
-1
3
-6
28
-31
-14
-10
-10
-14
-4
-31
-17
-9
-8
-7
-17
-9
-6
-13
-9
-31
-8
-31
6
15
-13
-34
-7
-16
4
-11
-31
-4
-14
-6
-11
-7
Q
-1
0
-4
7
1
-24
0
1
3
-2
-24
-1
6
0
-9
1
-24
-4
2
-4
0
-9
4
3
-13
0
-24
4
-24
0
2
-7
-20
0
-10
7
7
-24
16
-3
2
0
-2
R
-3
-2
-6
4
-4
-22
-3
-1
0
-5
-22
-5
2
-1
-9
-1
-22
-5
-2
-6
-5
-14
0
1
-15
-1
-22
5
-22
-2
0
-7
-21
-4
-10
2
15
-22
5
-5
0
-4
-7
S
0
0
-1
6
2
1
2
7
2
0
1
-7
4
4
5
6
1
-9
-1
-2
2
-8
3
6
-14
1
1
1
1
0
2
4
-22
4
4
4
-1
1
3
-4
0
0
-3
T
0
0
0
2
0
-5
1
4
0
-1
-5
-5
1
-3
-9
11
-5
-4
-1
0
0
-4
0
3
-13
3
-5
1
-5
1
1
-7
-20
4
-11
2
-1
-5
0
0
0
2
-2
V
-1
-8
-1
-19
-9
0
-7
-12
-11
-13
0
2
-19
-23
-29
0
0
6
-13
4
-7
7
-19
-12
-7
-4
0
-8
0
-3
-8
-29
-7
-5
-33
-13
-13
0
-18
0
-6
-1
-6
W
-24
-26
-27
-38
-37
-10
-30
-30
-23
-26
-10
-15
-26
-32
-39
-33
-10
-12
-24
-14
-29
-17
-34
-32
17
-16
-10
-23
-10
-26
-33
-39
43
-24
-40
-38
-18
-10
-38
-15
-34
-29
-27
Y
-10
-9
-14
-21
-22
-5
-17
-16
-10
-9
-5
-5
-2
-23
-38
-18
-5
-1
-5
-6
-16
-5
-22
-20
44
-8
-5
-13
-5
-10
-19
-36
63
-9
-40
-22
-6
-5
-23
0
-18
-14
-12
Gap
100
100
100
100
100
100
100
25
25
25
25
100
100
50
100
100
100
100
31
31
31
100
100
100
100
100
100
100
100
100
100
100
50
100
100
100
100
100
100
100
100
100
100
12 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
EGF Profile Generated for SEARCHWISE
Prosite Pattern -- EGF like pattern
-
Bone morphogenic protein 1 (BMP-1), a protein which induces cartilage and bone formation.
Caenorhabditis elegans developmental proteins lin-12 (13 copies) and glp-1 (10 copies).
Calcium-dependent serine proteinase (CASP) which degrades the extracellular matrix proteins type ….
Cell surface antigen 114/A10 (3 copies).
Cell surface glycoprotein complex transmembrane subunit .
Coagulation associated proteins C, Z (2 copies) and S (4 copies).
Coagulation factors VII, IX, X and XII (2 copies).
Complement C1r/C1s components (1 copy).
Complement-activating component of Ra-reactive factor (RARF) (1 copy).
Complement components C6, C7, C8 alpha and beta chains, and C9 (1 copy).
Epidermal growth factor precursor (7-9 copies).
13 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
A sequence of about thirty to forty amino-acid residues long found in the sequence
of epidermal growth factor (EGF) has been shown [1 to 6] to be present, in a more or less conserved form, in a large
number of other, mostly animal proteins. The proteins currently known to contain one or more copies of an EGF-like pattern
are listed below.
+-------------------+
+-------------------------+
|
|
|
|
x(4)-C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6)-C-x(2)-G-a-x(0,21)-G-x(2)-C-x
|
|
************************************
+-------------------+
'C': conserved cysteine involved in a disulfide bond.
'G': often conserved glycine
'a': often conserved aromatic amino acid
'*': position of both patterns.
'x': any residue
-Consensus pattern: C-x-C-x(5)-G-x(2)-C
[The 3 C's are involved in disulfide bonds]
http://www.expasy.ch/sprot/prosite.html
Coverage v Error Rate (ROC Graph)
Coverage (roughly,
fraction of sequences
that one confidently
“says something”
about)
sensitivity=tp/p=tp/(tp+fn)
Thresh=10
Thresh=20
Thresh=30
Different score thresholds
Two “methods” (red is more
effective)
100%
Error rate (fraction of the
“statements” that are false positives)
[Specificity = tn/n =tn/(tn+fp)]
error rate = 1-specificity = fp/n
14 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
100%
• The Significance of Similarity Scores Decreases with
Database Growth




The score between any pair of sequence pair is constant
The number of database entries grows exponentially
The number of nonhomologous entries >> homologous entries
Greater sensitivity is required to detect homologies
 Score of 100 might rank as best in database of 1000 but only in top100 of database of 1000000
DB-1
DB-2
15 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Significance Depends
on Database Size
• Low Complexity Regions
 Different Statistics for matching
AAATTTAAATTTAAATTTAAATTTAAATTT
than
ACSQRPLRVSHRSENCVASNKPQLVKLMTHVKDFCV
 Automatic Programs Screen These Out (SEG)
 Identify through computation of sequence entropy in a window of a
given size
H =  f(a) log2 f(a)
• Also, Compositional Bias
 Matching A-rich query to A-rich DB vs. A-poor DB
LLLLLLLLLLLLL
16 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Low-Complexity Regions
Computational Complexity
• Basic NW Algorithm is
O(n2) (in speed)
• O(n2) in memory too!
• Improvements can
(effectively) reduce
sequence comparison to
O(n) in both
A
A
B
C
N
R
Q
C
L
C
R
P
M
1
Y
1
C
1
1
Y
1
1
N
M
Y
1
R
5
4
3
3
2
2
0
0
C
3
3
4
3
3
3
3
4
3
3
1
0
0
K
3
3
3
3
3
3
3
3
3
2
1
0
0
C
2
2
3
2
2
2
2
3
2
3
1
0
0
R
2
1
1
1
1
2
1
1
1
1
2
0
0
B
1
2
1
1
1
1
1
1
1
1
1
0
0
P
0
0
0
0
0
0
0
0
0
0
0
1
0
N’
17 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
 M x N squares to fill
 At each square need to
look back (M’+N’) “black”
squares to find max in block
 M x N x (M’+N’) -> O(n3)
 However, max values in
block can be cached, so
algorithm is really only
O(n2)
N
M’
• Hash table of short words in the query sequence
• Go through DB and look for matches in the query
hash (linear in size of DB)
• K-tuple determines word size (k-tup 1 is single aa)
• by Bill Pearson
_
VLICT =
VLICTAVLMVLICTAAAVLICTMSDFFD
18 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
FASTA
(Adapted from D Brutlag)
19 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Join together query
lookups into
diagonals and then
a full alignment
• BLAST employs substitution matrix which specifies a
score s(i,j) for aligning each pair of amino acids.
• Indexes of query “words” (also tried indexing DB)
• Starts with all overlapping words (3 residues) from
query
• Scans DB for “hits” that score at least T when aligned
with some word in the query sequence
• Extends High Scoring Pairs (HSPs) left and right to
maximal length, until the running alignment score
drops to Smax - X
• Basic Blast does not permit gaps in alignments
• Extension time accounts for > 90% of execution time;
desirable to reduce the number of extensions
performed
Basic
Blast
20 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Altschul, S., et al Lipman, D. J. (1990). Basic
local alignment search tool. J. Mol. Biol.
215, 403-410
• Extend hash hits into High
Scoring Segment Pairs (HSPs)
with match values S > T.
• Stop extension when total
score drops below Smax - X
• Parameters T and X define
coverage and specificity.
• Extension is O(N). This takes
most of the time in Blast
21 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Blast: Extension
of Hash Hits
Blasting
against
the DB
Number of
hash hits is
proportional
to O(N*M*D),
where N is
the query
size, M is the
average DB
seq. size, and
D is the size
of the DB
22 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
• In simplest Blast algorithm, find best scoring segment in each
DB sequence
• Statistics of these scores determine significance
23 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Blast2:
Gapped
Blast
Gapped Blast
• An HSP of interest is much longer than a single word pair, and may
therefore entail multiple hits on the same diagonal and within a relatively
short distance of one another. “Islands of Certainty”
• Choose window of length A, and invoke extension only when two nonoverlapping hits are found within a distance A of one another on the
same diagonal (see Fig 2 of Altshul et al, 1997)
• If the HSP generated has a normalized score of at least Sg, then a
gapped extension is triggered
• Runs ~ 3X faster than Original Blast, with higher sensitivity and
coverage
• Gapped Extension on
Diagonals with two Hash
Hits
25 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Blast2:
Gapped Blast
PSI Blast: Iterated Application of
BLAST to Position-Specific Matrices
• Database searches using position-specific score matrices (profiles or
motifs) are often much better able to detect weak relationships that are
database searches that use a simple sequence as a query
• Compile a set of N sequences hit by Gapped-BLAST with E-val <
Threshold (default 0.01)
• The Query is used as a Master to Construct a Multiple Sequence
Alignment (MSA).
• An N x 21 (21st “residue” is a gap) Position-Specific Score Matrix is
then computed from the MSA
• Use the new Position-Specific Score Matrix to rerun Gapped-BLAST,
generate MSA, compile new Position Specific Score Matrix
• Iterate
• Automatically builds
profile and then
searches with this
27 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Y-Blast
Parameters: overall
threshold, inclusion
threshold, inerations
Iteration
Scheme
Speed
Sensitivity
Blast
FASTA
SmithWaterman
PSI-Blast
Profiles
HMMs
28 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
PSI-Blast
Problems with Profiles
• Complicated Models with many Free Parameters
• What are the best ways to set the position-specific
scores
• How to score gaps and insertions
• Many to many problems of transitivity
• How to combine structural information and MSA
information
1D Hidden Markov Models: General statistical
modeling technique for “linear” problems like time
series and sequences;
commonly used in speech recognition, submarine
warfare
Hidden Markov Models (HMMs)
Mathematical Model that generates a series of
sequences
Mathematical Model <--> Series of Sequences
Training: Process of parameterizing the HMM, from a
series of sequences
Unlike Profiles, HMMs can be trained from a set of
unaligned example sequences, producing a multiple
sequence alignment (MSA) in the process
Hidden Markov Models (HMMs)
HMM is composed of several “States” (columns in a
multiple alignment)
Each State “emits” a symbol (residue) according
“symbol emission probabilities” (0.1 E, 0.15 R, …)
After emitting a symbol, state i will change to state j with
some “state transition probability
Markov Chain: Choice of next state to occupy depends
on the current state. However, this state is not
observed, as we only observe the symbol it emits. The
underlying “sequence of states” is “Hidden”
HMMs
Starting from an initial state, a sequence of symbols is
generated by moving from state to state until an end state is
reached.
(Figures from Eddy, Curr. Opin. Struct. Biol.)
32 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Hidden Markov Model:
- a composition of finite number of states,
- each corresponding to a column in a multiple alignment
- each state emits symbols, according to symbol-emission
probabilities
D
0.5
A
0.2
0.3
C
E
0.4
0.5
B
MM
0.1
0.8
Path:
A
D
B
C
Probability = Init(A)*0.5*0.3*0.4
33 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Markov Models
Hidden Markov models
Probability = ?
Two coin toss
0.1
H: 0.5
H: 0.3
T: 0.5
T: 0.7
Fair
0.9
Biased
HMM
34 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
The path is unknown (hidden): H H T H T T H T T T
0.93
R: 0.1
R: 0.4
G: 0.2
G: 0.4
B: 0.7
B: 0.2
0.01
0.2
0.5
R: 0.1
G: 0.1
B: 0.8
e.g. observed:
RRR
Probability of a given sequence =
Sum probability over ALL paths giving that sequence
35 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
More HMMs
Extra
36 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Result: HMM sequence profile
Extra
37 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Different topologies:
Hidden Markov Models (HMMs)
Mathematical Model that generates a series of
sequences
Mathematical Model <--> Series of Sequences
Training: Process of parameterizing the HMM, from a
series of sequences
Unlike Profiles, HMMs can be trained from a set of
unaligned example sequences, producing a multiple
sequence alignment (MSA) in the process
Hidden Markov Models (HMMs)
Delete State - Emits nothing, allowing a column to be skipped
Insert State - Inserts residue
Delete / Insert States have their own state transition probabilities
which are trained from the sequences
HMM provides a statistically justifiable treatment of insertions and
deletions
In standard profiles, insertions/deletions are difficulty to be certain
of; e.g. alternate paths in Dynamic Programming algorithm
Since the placement of insertions and deletions is a major problem
in recognizing highly divergent sequences, the recasting of
profiles as HMMs promises a significant increase in the power of
MSA analysis to recognize distantly related homologues
HMMs, Profiles,
Motifs, and Multiple
Alignments used to
define modules
(Figures from Branden & Tooze)
•Several motifs (b-sheet, beta-alpha-beta, helix-loop-helix) combine to form a compact globular
structure termed a domain or tertiary structure
•A domain is defined as a polypeptide chain or part of a chain that can independently fold into a stable
tertiary structure
•Domains are also units of function (DNA binding domain, antigen binding domain, ATPase domain,
etc.)
40 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Modules
•Another example of the helix-loop-helix
motif is seen within several DNA binding
domains including the homeobox proteins
which are the master regulators of
development