cbb752-mg-spr11

Download Report

Transcript cbb752-mg-spr11

Mark Gerstein, Yale University
gersteinlab.org/courses/452
(last edit in spring '11, not including in-class edits)
1 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
BIOINFORMATICS
Multiple Sequences
Multiple Alignment Topics
• Multiple Alignment
• Motifs
- Fast identification methods
• Profile Patterns
- Refinement via EM
- Gibbs Sampling
• HMMs
• Applications
• Issues: site independence
- BoCaTFBS
Do not reproduce without permission
2-
Lectures.GersteinLab.org
(c) '09
- Module DBs
- Regression vs expression
It is widely used in:
- Phylogenetic analysis
- Prediction of protein
secondary/tertiary structure
- Finding diagnostic patterns to
characterize protein families
- Detecting new homologies
between new genes
and established sequence
families
Multiple Sequence
Alignments
- Practically useful methods only since
1987
- Before 1987 they were constructed by
hand
- The basic problem: no dynamic
programming approach can be used
- First useful approach by D. Sankoff
(1987) based on phylogenetics
(LEFT, adapted from Sonhammer et al. (1997).
“Pfam,” Proteins 28:405-20. ABOVE, G Barton
AMAS web page)
3 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
- One of the most essential tools in
molecular biology
- Most multiple alignments based on this approach
- Initial guess for a phylogenetic tree based on pairwise alignments
- Built progressively starting with most closely related sequences
- Follows branching order in phylogenetic tree
- Sufficiently fast
- Sensitive
- Algorithmically heuristic, no mathematical property associated with the
alignment
- Biologically sound, it is common to derive alignments which are impossible to
improve by eye
(adapted from Sonhammer et al. (1997). “Pfam,” Proteins 28:40520)
4 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Progressive Multiple Alignments
5 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Clustering approaches for multiple
sequence alignment
C1Q Example
6 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Ca28_Human
ELSAHATPAFTAVLTSPLPASGMPVKFDRTLYNGHSGYNPATGIFTCPVGGVYYFAYHVH
VKGTNVWVALYKNNVPATYTYDEYKKGYLDQASGGAVLQLRPNDQVWVQIPSDQANGLYS
TEYIHSSFSGFLLCPT
C1qb_Human
DYKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCKVPGLYYFTYHA
SSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLG
MEGANSIFSGFLLFPD
Cerb_Human
VRSGSAKVAFSAIRSTNHEPSEMSNRTMIIYFDQVLVNIGNNFDSERSTFIAPRKGIYSF
NFHVVKVYNRQTIQVSLMLNGWPVISAFAGDQDVTREAASNGVLIQMEKGDRAYLKLERG
NLMGGWKYSTFSGFLVFPL
COLE_LEPMA.264
RGPKGPPGESVEQIRSAFSVGLFPSRSFPPPSLPVKFDKVFYNGEGHWDPTLNKFNVTYP
GVYLFSYHITVRNRPVRAALVVNGVRKLRTRDSLYGQDIDQASNLALLHLTDGDQVWLET
LRDWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
HP27_TAMAS.72
GPPGPPGMTVNCHSKGTSAFAVKANELPPAPSQPVIFKEALHDAQGHFDLATGVFTCPVP
GLYQFGFHIEAVQRAVKVSLMRNGTQVMEREAEAQDGYEHISGTAILQLGMEDRVWLENK
LSQTDLERGTVQAVFSGFLIHEN
HSUPST2_1.95
GIQGRKGEPGEGAYVYRSAFSVGLETYVTIPNMPIRFTKIFYNQQNHYDGSTGKFHCNIP
GLYYFAYHITVYMKDVKVSLFKKDKAMLFTYDQYQENNVDQASGSVLLHLEVGDQVWLQV
YGEGERNGLYADNDNDSTFTGFLLYHDTN
2.HS27109_1
ENALAPDFSKGSYRYAPMVAFFASHTYGMTIPGPILFNNLDVNYGASYTPRTGKFRIPYL
GVYVFKYTIESFSAHISGFLVVDGIDKLAFESENINSEIHCDRVLTGDALLELNYGQEVW
LRLAKGTIPAKFPPVTTFSGYLLYRT
4.YQCC_BACSU
VVHGWTPWQKISGFAHANIGTTGVQYLKKIDHTKIAFNRVIKDSHNAFDTKNNRFIAPND
GMYLIGASIYTLNYTSYINFHLKVYLNGKAYKTLHHVRGDFQEKDNGMNLGLNGNATVPM
NKGDYVEIWCYCNYGGDETLKRAVDDKNGVFNFFD
5.BSPBSXSE_25
ADSGWTAWQKISGFAHANIGTTGRQALIKGENNKIKYNRIIKDSHKLFDTKNNRFVASHA
GMHLVSASLYIENTERYSNFELYVYVNGTKYKLMNQFRMPTPSNNSDNEFNATVTGSVTV
PLDAGDYVEIYVYVGYSGDVTRYVTDSNGALNYFD
SGMPLVSANHGVTG-------MPVSAFTVILS--KAYPA---VGCPHPIYEILYNRQQHY
----------ALTG-------MPVSAFTVILS--KAYPG---ATVPIKFDKILYNRQQHY
----------GGPA-------YEMPAFTAELT--APFPP---VGGPVKFNKLLYNGRQNY
HAYAGKKGKHGGPA-------YEMPAFTAELT--VPFPP---VGAPVKFDKLLYNGRQNY
----------ELSA-------HATPAFTAVLT--SPLPA---SGMPVKFDRTLYNGHSGY
----GTPGRKGEPGE---AAYMYRSAFSVGLETRVTVP-----NVPIRFTKIFYNQQNHY
------RGPKGPPGE---SVEQIRSAFSVGLFPSRSFPP---PSLPVKFDKVFYNGEGHW
-------GPPGPPGMTVNCHSKGTSAFAVKAN--ELPPA---PSQPVIFKEALHDAQGHF
----------NIRD-------QPRPAFSAIRQ---NPMT---LGNVVIFDKVLTNQESPY
--------------D---YRATQKVAFSALRTINSPLR----PNQVIRFEKVITNANENY
--------------D---YKATQKIAFSATRTINVPLR----RDQTIRFDHVITNMNNNY
--------------V---RSGSAKVAFSAIRSTNHEPSEMSNRTMIIYFDQVLVNIGNNF
---ENALAPDFSKGS---YRYAPMVAFFASHTYGMTIP------GPILFNNLDVNYGASY
.* .
:
:
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
DPRSGIFTCKIPGIYYFSYHVHVKGT--HVWVGLYKNGTP-TMYTY---DEYSKGYLDTA
DPRTGIFTCRIPGLYYFSYHVHAKGT--NVWVALYKNGSP-VMYTY---DEYQKGYLDQA
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-VMYTY---DEYKKGFLDQA
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-MMYTY---DEYKKGFLDQA
NPATGIFTCPVGGVYYFAYHVHVKGT--NVWVALYKNNVP-ATYTY---DEYKKGYLDQA
DGSTGKFYCNIPGLYYFSYHITVYMK--DVKVSLFKKDKA-VLFTY---DQYQEKNVDQA
DPTLNKFNVTYPGVYLFSYHITVRNR--PVRAALVVNGVR-KLRTR---DSLYGQDIDQA
DLATGVFTCPVPGLYQFGFHIEAVQR--AVKVSLMRNGTQ-VMERE---AEAQDG-YEHI
QNHTGRFICAVPGFYYFNFQVISKWD--LCLFIKSSSGGQ-PRDSLSFSNTNNKGLFQVL
EPRNGKFTCKVPGLYYFTYHASSRGN---LCVNLVRGRDRDSMQKVVTFCDYAQNTFQVT
EPRSGKFTCKVPGLYYFTYHASSRGN---LCVNLMRGRER--AQKVVTFCDYAYNTFQVT
DSERSTFIAPRKGIYSFNFHVVKVYNRQTIQVSLMLNGWP----VISAFAGDQDVTREAA
TPRTGKFRIPYLGVYVFKYTIESFSA--HISGFLVVDGIDKLAFESEN-INSEIHCDRVL
. *
* * * :
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
SGSAIMELTENDQVWLQLPNA-ESNGLYSSEYVHSSFSGFLVAPM------SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI------SGSAVLLLRPGDRVFLQMPSE-QAAGLYAGQYVHSSFSGYLLYPM------SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM------SGGAVLQLRPNDQVWVQIPSD-QANGLYSTEYIHSSFSGFLLCPT------SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN----SNLALLHLTDGDQVWLETLR--DWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN------AGGTVLQLRRGDEVWIEKDP--AKGRIYQGTEADSIFSGFLIFPS------TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD------TGGMVLKLEQGENVFLQATD---KNSLLGMEGANSIFSGFLLFPD------SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL------TGDALLELNYGQEVWLRLAK----GTIPAKFPPVTTFSGYLLYRT------. :: :
: . :
* *:*.
Clustal
Alignment
7 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
- Local Minimum Problem
- Parameter Choice Problem
1. Local Minimum Problem
- It stems from greedy nature of alignment
(mistakes made early in alignment cannot be
corrected later)
- A better tree gives a better alignment
(UPGMA neighbour-joining tree method)
2. Parameter Choice Problem
• - It stems from using just one set of parameters
(and hoping that they will do for all)
8 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Problems with Progressive
Alignments
9 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Domain Problem in Mult. Alignment
- Motif: a short signature pattern identified in
the conserved region of the multiple alignment
- Profile: frequency of each amino acid at each position is
estimated
Profiles
Motifs
HMMs
- HMM: Hidden Markov Model, a generalized profile in
rigorous mathematical terms
Can get more
sensitive
searches with
these multiple
alignment
representations
(Run the profile
against the DB.)
Structure Sequence
Core
Cor
e
Core
2hhb
<
1ecd
<
1mbd
<
Consensus Profile
<
10 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Fuse multiple alignment into:
Multiple Alignment
Do not reproduce without permission
11 -
Lectures.GersteinLab.org
(c) '09
motifs
• ChIP-on-chip or ChIP-seq: Immunoprecipitate DNATF complexes, then either hybridize them to a
microarray chip or sequence them.
• List promoter regions of co-regulated genes.
• SELEX: Systematic Evolution of Ligands by
Exponential Enrichment (or in vitro selection). A library
of random oligonucleotides are bound to a purified
protein, then the bound ones are identified.
[Adapted from C Bruce, CBB752 '09]
12 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Examples of when you would want to find motifs -Finding TF-binding sequences
• Given a collection of binding sites, develop a
representation of those sites that can be used to
search new sites and reliably predict where additional
binding sites occur.
• Given a set of sequences known to contain binding
sites for a common factor, but not knowing where the
sites are, discover the location of the sites in each
sequence and a representation of the protein.
[Adapted from C Bruce, CBB752 '09]
13 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Two problems in motif analysis
Two classes of motif discovery
algorithms
• Multiple alignment methods.
– Return PWM; use local search techniques such as
Gibbs sampling or EM
• Deterministic combinatorial algorithms based
on word frequency counts.
– Search for various sized sequences exhaustively
and evaluate significance.
[Adapted from C Bruce, CBB752 '09]
M
M
C
O
L
1
0
A
1
_
1
.
4
8
3S
G
S
A
I
M
E
L
T
E
N
D
Q
V
W
L
Q
L
P
N
A
E
S
N
G
L
Y
S
S
E
Y
V
H
S
S
F
S
G
F
L
V
A
P
M
C
a
1
x
_
C
h
i
c
k
S
G
S
A
V
I
D
L
M
E
N
D
Q
V
W
L
Q
L
P
N
S
E
S
N
G
L
Y
S
S
E
Y
V
H
S
S
F
S
G
F
L
F
A
Q
I
S
1
5
4
3
5
S
G
S
A
V
L
L
L
R
P
G
D
R
V
F
L
Q
M
P
S
E
Q
A
A
G
L
Y
A
G
Q
Y
V
H
S
S
F
S
G
Y
L
L
Y
P
M
C
A
1
8
_
M
O
U
S
E
.
5
9
7S
G
S
A
V
L
L
L
R
P
G
D
Q
V
F
L
Q
N
P
F
E
Q
A
A
G
L
Y
A
G
Q
Y
V
H
S
S
F
S
G
Y
L
L
Y
P
M
C
a
2
8
_
H
u
m
a
n
S
G
G
A
V
L
Q
L
R
P
N
D
Q
V
W
V
Q
I
P
S
D
Q
A
N
G
L
Y
S
T
E
Y
I
H
S
S
F
S
G
F
L
L
C
P
T
M
M
3
7
2
2
2
_
1
.
9
8 S
G
S
V
L
L
H
L
E
V
G
D
Q
V
W
L
Q
V
Y
G
D
G
D
H
N
G
L
Y
A
D
N
V
N
D
S
T
F
T
G
F
L
L
Y
H
D
T
N
C
O
L
E
_
L
E
P
M
A
.
2
6
4S
N
L
A
L
L
H
L
T
D
G
D
Q
V
W
L
E
T
L
R
D
W
N
G
X
Y
S
S
S
E
D
D
S
T
F
S
G
F
L
L
Y
P
D
T
K
K
P
T
A
M
H
P
2
7
_
T
A
M
A
S
.
7
2 S
G
T
A
I
L
Q
L
G
M
E
D
R
V
W
L
E
N
K
L
S
Q
T
D
L
E
R
G
T
V
Q
A
V
F
S
G
F
L
I
H
E
N
S
1
9
0
1
8
A
G
G
T
V
L
Q
L
R
R
G
D
E
V
W
I
E
K
D
P
A
K
G
R
I
Y
Q
G
T
E
A
D
S
I
F
S
G
F
L
I
F
P
S
C
1
q
b
_
M
o
u
s
e
T
G
G
V
V
L
K
L
E
Q
E
E
V
V
H
L
Q
A
T
D
K
N
S
L
L
G
I
E
G
A
N
S
I
F
T
G
F
L
L
F
P
D
C
1
q
b
_
H
u
m
a
n
T
G
G
M
V
L
K
L
E
Q
G
E
N
V
F
L
Q
A
T
D
K
N
S
L
L
G
M
E
G
A
N
S
I
F
S
G
F
L
L
F
P
D
C
e
r
b
_
H
u
m
a
n
S
N
G
V
L
I
Q
M
E
K
G
D
R
A
Y
L
K
L
E
R
G
N
L
M
G
G
W
K
Y
S
T
F
S
G
F
L
V
F
P
L
2
.
H
S
2
7
1
0
9
_
1
T
G
D
A
L
L
E
L
N
Y
G
Q
E
V
W
L
R
L
A
K
G
T
I
P
A
K
F
P
P
V
T
T
F
S
G
Y
L
L
Y
R
T
:
:: : :
**
:
*
15 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Motifs
- several proteins are grouped together by similarity
searches
- they share a conserved motif
- motif is stringent enough to retrieve the family members
from the complete protein database
- PROSITE: a collection of motifs (1135 different motifs)
Prosite Pattern -- EGF like pattern
-
Bone morphogenic protein 1 (BMP-1), a protein which induces cartilage and bone formation.
Caenorhabditis elegans developmental proteins lin-12 (13 copies) and glp-1 (10 copies).
Calcium-dependent serine proteinase (CASP) which degrades the extracellular matrix proteins type ….
Cell surface antigen 114/A10 (3 copies).
Cell surface glycoprotein complex transmembrane subunit .
Coagulation associated proteins C, Z (2 copies) and S (4 copies).
Coagulation factors VII, IX, X and XII (2 copies).
Complement C1r/C1s components (1 copy).
Complement-activating component of Ra-reactive factor (RARF) (1 copy).
Complement components C6, C7, C8 alpha and beta chains, and C9 (1 copy).
Epidermal growth factor precursor (7-9 copies).
16 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
A sequence of about thirty to forty amino-acid residues long found in the sequence
of epidermal growth factor (EGF) has been shown [1 to 6] to be present, in a more or less conserved form, in a large
number of other, mostly animal proteins. The proteins currently known to contain one or more copies of an EGF-like pattern
are listed below.
+-------------------+
+-------------------------+
|
|
|
|
x(4)-C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6)-C-x(2)-G-a-x(0,21)-G-x(2)-C-x
|
|
************************************
+-------------------+
'C': conserved cysteine involved in a disulfide bond.
'G': often conserved glycine
'a': often conserved aromatic amino acid
'*': position of both patterns.
'x': any residue
-Consensus pattern: C-x-C-x(5)-G-x(2)-C
[The 3 C's are involved in disulfide bonds]
http://www.expasy.ch/sprot/prosite.html
PKC_PHOSPHO_SIT
E
Protein kinase C
phosphorylation
site
[ST]-x-[RK]
Post-translational
modifications
RGD
Cell attachment
sequence
R-G-D
Domains
SOD_CU_ZN_1
Copper/Zinc
superoxide
dismutase
[GA]-[IMFAT]-H-[LIVF]-Hx(2)-[GP]-[SDG]-x-[STAGDE]
Enzymes_Oxidoreduc
tases
THIOL_PROTEASE_
ASN
Eukaryotic thiol
(cysteine)
proteases active
site
[FYCH]-[WI]-[LIVT]-x-[KRQAG]N-[ST]-W-x(3)-[FYW]-G-x(2)-G[LFYW]-[LIVMFYG]-x-[LIVMF]
Enzymes_Hydrolases
TNFR_NGFR_1
TNFR/CD27/30/4
0/95 cysteine-rich
region
C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)C-x(2,3)-C-x(7,11)-C-x(4,6)[DNEQSKP]-x(2)-C
Receptors
17 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Motifs
· Each element in a pattern is separated from its neighbor by a “-”.
· The symbol “x” is used for a position where any amino acid is accepted.
· Ambiguities are indicated by listing the acceptable amino acids for a given
position, between brackets “[]”.
· Ambiguities are also indicated by listing between a pair of braces “{}” the
amino acids that are not accepted at a given position.
· Repetition of an element of the pattern is indicated by with a numerical
value or a numerical range between parentheses following that element.
Enumerative techniques
• dictionary-based methods count the number of
occurrences of all n-mers in the target sequences, and
calculate which ones are most overrepresented.
• a number of similar overrepresented words may be
combined into a more flexible motif description.
• Alternatively, one can search the space of all
degenerate consensus sequences up to a given length,
for example, using IUPAC codes for 2-nucleotide or 3nucleotide degenerate positions in the motif
• WEEDER describes a motif as a consensus sequence
and an allowed number of mismatches, and uses an
efficient suffix tree representation to find all such
motifs in the target sequences
[Adapted from C Bruce, CBB752 '09]
IUPAC
Code
Meaning
G
G
A
A
T
T
C
C
R
G or A
Y
T or C
M
A or C
K
G or T
S
G or C
W
A or T
H
A or C or T
B
G or T or C
V
G or C or A
D
G or A or T
N
G or A or T or C
Consensus-based methods
• Enumerate all the oligos of (or up to) a given length, in order to determine
which ones appear, with possible substitutions, in a significant fraction of
the input sequences, and finally to rank them according to statistical
measure of significance.
• Drawbacks:
– For motif length of m, there are 4m candidates to enumerate. O(4m) execution time.
– Too slow.
• Motif search can be accelerated by pre-processing the data in an indexing
structure, such as a suffix tree.
[Adapted from C Bruce, CBB752 '09]
Multiple Alignment
Do not reproduce without permission
20 -
Lectures.GersteinLab.org
(c) '09
Profiles
Profile : a position-specific scoring matrix composed of 21
columns and N rows (N=length of sequences in multiple
alignment)
What happens with gaps?
5
21 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Profiles
Cons.
Cys
Cons A
V
-1
D
0
V
0
D
0
P
3
C
5
A
2
s
2
n
-1
p
0
c
5
L
-5
N
-4
g
1
G
6
T
3
C
5
I
-6
d
-4
i
0
g
1
L
-5
E
0
S
3
Y -14
T
0
C
5
R
0
C
5
P
0
P
1
G
4
y -22
S
1
G
5
E
2
R
-5
C
5
E
0
T
-4
D
0
I
0
D
-4
C
-2
-14
-13
-20
-18
115
-7
-12
-15
-18
115
-14
-16
-16
-17
-10
115
-13
-19
-6
-13
-11
-20
-13
-9
-10
115
-13
115
-14
-18
-19
-6
-9
-20
-20
-17
115
-26
-11
-18
-10
-15
D
-9
-1
-9
18
1
-32
-2
3
4
-7
-32
-17
12
7
0
0
-32
-19
8
-8
0
-20
14
4
-25
-6
-32
0
-32
-8
-3
3
-35
-3
1
10
0
-32
20
-13
5
-2
-1
E
-5
-1
-7
11
3
-30
-2
2
4
-6
-30
-9
5
1
-7
2
-30
-11
6
-6
0
-14
10
3
-22
-1
-30
2
-30
-4
0
-4
-31
-1
-8
12
1
-30
25
-8
4
-1
-2
F
-13
-16
-15
-34
-26
-8
-21
-25
-19
-17
-8
0
-20
-35
-49
-21
-8
0
-15
-4
-20
0
-33
-28
31
-11
-8
-19
-8
-15
-24
-48
55
-14
-52
-31
-16
-8
-34
-1
-24
-17
-13
G
-18
-10
-10
0
-9
-20
-5
0
-7
-11
-20
-25
0
29
59
-12
-20
-28
-13
-11
-3
-23
5
3
-34
-16
-20
-11
-20
-17
-13
53
-43
-7
66
-7
-13
-20
-5
-21
-11
-14
-16
H
-2
0
-6
4
-5
-13
-4
0
3
0
-13
-5
24
0
-13
-3
-13
-5
5
-5
-3
-9
0
0
10
-2
-13
1
-13
0
-3
-11
11
0
-14
0
8
-13
6
2
-1
-3
-3
I
-5
-12
-5
-26
-14
-11
-12
-18
-16
-17
-11
4
-24
-31
-41
-5
-11
8
-17
3
-12
9
-25
-18
-5
-7
-11
-12
-11
-7
-12
-40
-1
-10
-45
-19
-16
-11
-25
0
-11
-4
-8
K
-2
0
-5
7
-1
-28
-2
0
2
-5
-28
-5
5
-1
-10
1
-28
-4
0
-5
-3
-11
2
2
-17
-1
-28
4
-28
-1
1
-7
-25
-2
-11
6
9
-28
10
-4
2
-1
-5
L
-7
-13
-7
-27
-14
-15
-13
-18
-16
-15
-15
8
-25
-31
-41
-11
-15
6
-16
1
-13
8
-26
-20
0
-9
-15
-13
-15
-7
-13
-40
6
-12
-44
-20
-16
-15
-25
-1
-14
-9
-6
M
-4
-8
-5
-20
-12
-9
-9
-13
-10
-14
-9
8
-18
-23
-32
-5
-9
8
-12
2
-8
7
-19
-13
-1
-5
-9
-8
-9
-5
-10
-31
4
-7
-35
-15
-11
-9
-17
0
-9
-4
-4
N
-3
1
-6
15
-1
-18
0
4
7
-5
-18
-12
25
12
3
1
-18
-12
5
-5
0
-14
11
6
-14
-3
-18
3
-18
-4
-2
5
-21
0
4
5
5
-18
9
-6
1
0
-1
P
-5
-3
-4
0
12
-31
-1
3
-6
28
-31
-14
-10
-10
-14
-4
-31
-17
-9
-8
-7
-17
-9
-6
-13
-9
-31
-8
-31
6
15
-13
-34
-7
-16
4
-11
-31
-4
-14
-6
-11
-7
Q
-1
0
-4
7
1
-24
0
1
3
-2
-24
-1
6
0
-9
1
-24
-4
2
-4
0
-9
4
3
-13
0
-24
4
-24
0
2
-7
-20
0
-10
7
7
-24
16
-3
2
0
-2
R
-3
-2
-6
4
-4
-22
-3
-1
0
-5
-22
-5
2
-1
-9
-1
-22
-5
-2
-6
-5
-14
0
1
-15
-1
-22
5
-22
-2
0
-7
-21
-4
-10
2
15
-22
5
-5
0
-4
-7
S
0
0
-1
6
2
1
2
7
2
0
1
-7
4
4
5
6
1
-9
-1
-2
2
-8
3
6
-14
1
1
1
1
0
2
4
-22
4
4
4
-1
1
3
-4
0
0
-3
T
0
0
0
2
0
-5
1
4
0
-1
-5
-5
1
-3
-9
11
-5
-4
-1
0
0
-4
0
3
-13
3
-5
1
-5
1
1
-7
-20
4
-11
2
-1
-5
0
0
0
2
-2
V
-1
-8
-1
-19
-9
0
-7
-12
-11
-13
0
2
-19
-23
-29
0
0
6
-13
4
-7
7
-19
-12
-7
-4
0
-8
0
-3
-8
-29
-7
-5
-33
-13
-13
0
-18
0
-6
-1
-6
W
-24
-26
-27
-38
-37
-10
-30
-30
-23
-26
-10
-15
-26
-32
-39
-33
-10
-12
-24
-14
-29
-17
-34
-32
17
-16
-10
-23
-10
-26
-33
-39
43
-24
-40
-38
-18
-10
-38
-15
-34
-29
-27
Y
-10
-9
-14
-21
-22
-5
-17
-16
-10
-9
-5
-5
-2
-23
-38
-18
-5
-1
-5
-6
-16
-5
-22
-20
44
-8
-5
-13
-5
-10
-19
-36
63
-9
-40
-22
-6
-5
-23
0
-18
-14
-12
Gap
100
100
100
100
100
100
100
25
25
25
25
100
100
50
100
100
100
100
31
31
31
100
100
100
100
100
100
100
100
100
100
100
50
100
100
100
100
100
100
100
100
100
100
22 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
EGF Profile Generated for SEARCHWISE
M(p,a) = chance of finding amino acid a at position p
Msimp(p,a) = number of times a occurs at p divided by number of sequences
However, what if don’t have many sequences in alignment? Msimp(p,a)
might be baised. Zeros for rare amino acids. Thus:
Mcplx(p,a)= b=1 to 20 Msimp(p,b) x Y(b,a)
Y(b,a): Dayhoff matrix for a and b amino acids
S(p,a) ~ a=1 to 20 Msimp(p,a) ln Msimp(p,a)
23 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Profiles
formula
for
position
M(p,a)
H(p,a) = - a=1 to 20 f(p,a) log2 f(p,a),
where f(p,a) = frequency of amino acid a occurs at position p ( Msimp(p,a) )
Say column only has one aa (AAAAA):
H(p,a) = 1 log2 1 + 0 log2 0 + 0 log2 0 + … = 0 + 0 + 0 + … = 0
Say column is random with all aa equiprobable (ACD..ACD..ACD..):
Hrand(p,a) = .05 log2 .05 + .05 log2 .05 + … = -.22 + -.22 + … = -4.3
Say column is random with aa occurring according to probability found in
the sequence databases (ACAAAADAADDDDAAAA….):
Hdb(a) = - a=1 to 20 F(a) log2 F(a),
where F(a) is freq. of occurrence of a in DB
Hcorrected(p,a) = H(p,a) – Hdb(a)
24 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Profiles
formula for
entropy
H(p,a)
[Adapted from C Bruce, CBB752 '09]
MacIsaac & Fraenkel, 2006
25 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Scanning for Motifs with PWMs
Automatically builds
profile and then
searches with this
Also PHI-blast
26 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Y-Blast
Parameters: overall •
threshold, inclusion
threshold, interations
•
Cor
e
Speed
Iteration
Scheme
Sensitivity
Semi-supervised learning
Blast
FASTA
SmithWaterman
PSI-Blast
Profiles
HMMs
Convergence vs explosion (polluted profiles)
27 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
PSI-Blast
Multiple Alignment
Do not reproduce without permission
28 -
Lectures.GersteinLab.org
(c) '09
EM
Probabilistic Approaches
• Expectation Maximization: Search the PWM
space randomly
• Gibbs sampling: Search sequence space
randomly.
[Adapted from C Bruce, CBB752 '09]
Expectation-Maximization (EM) algorithm
•
•
Used in statistics for finding maximum likelihood estimates of parameters in
probabilistic models, where the model depends on unobserved latent variables.
EM alternates between performing
–
–
•
an expectation (E) step, which computes an expectation of the likelihood by including the latent
variables as if they were observed, and
a maximization (M) step, which computes the maximum likelihood estimates of the parameters by
maximizing the expected likelihood found on the E step.
The parameters found on the M step are then used to begin another E step, and
the process is repeated.
[Adapted from C Bruce, CBB752 '09]
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in
the input sequences ]
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples: Gibbs sampler (Lawrence et al.)
MEME (expectation max. / Bailey, Elkan)
ANN-Spec (neural net / Workman, Stormo)
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
Expectation-maximization
foreach subsequence of width W
convert subsequence to a matrix
do {
re-estimate motif occurrences from matrix
EM
re-estimate matrix model from motif occurrences
} until (matrix model stops changing)
end
select matrix with highest score
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541]
Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
Motif occurrences
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
How does EM algorithms work?
Starting from a
single site,
expectation
maximization
algorithms such
as MEME4
alternate between
assigning sites to
a motif (left) and
updating the motif
model (right).
Note that only the
best hit per
sequence is shown
here, although
lesser hits in the
same sequence can
have an effect as
well.
Specifically,
in E step, estimate
location of motif
match. In M step,
find most likely
parameters of motif
model given the
locations.
MEME - a practical program
using EM
• Subsequences which occur in the input DNA sequence are
used as the starting points from which EM converges
iteratively to locally optimal motifs. This increases the
likelihood of finding globally optimal motifs.
• Multiple occurrences of a motif are allowed. Algorithm is
allowed to ignore sequences with no appearance of a
shared motif. So, more resistance to noisy data.
• Motifs are probabilistically erased after they are found, so
more than one motif can be found.
[Adapted from C Bruce, CBB752 '09]
Multiple Alignment
Do not reproduce without permission
37 -
Lectures.GersteinLab.org
(c) '09
Gibbs Sampling
Initialization
• Randomly guess an instance si from each of
t input sequences {S1, ..., St}.
sequence 1
sequence 2
sequence 3
sequence 4
ACAGTGT
TTAGACC
GTGACCA
ACCCAGG
CAGGTTT
sequence 5
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
Gibbs sampler
• Initially: randomly guess an instance si from each
of t input sequences {S1, ..., St}.
• Steps 2 & 3 (search):
– Throw away an instance si: remaining (t - 1) instances
define weight matrix.
– Weight matrix defines instance probability at each
position of input string Si
– Pick new si according to probability distribution
• Return highest-scoring motif seen
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
Sampler step illustration:
ACAGTGT
TAGGCGT
ACACCGT
???????
CAGGTTT
ACAGTGT
TAGGCGT
ACACCGT
ACGCCGT
CAGGTTT
[Adapted from B Noble, GS 541 at UW,
http://noble.gs.washington.edu/~wnoble/genome541/]
A
C
G
T
.45 .45 .45 .05 .05 .05 .05
.25 .45 .05 .25 .45 .05 .05
.05 .05 .45 .65 .05 .65 .05
.25 .05 .05 .05 .45 .25 .85
sequence 4
ACGCCGT:20%
11%
ACGGCGT:52%
Comparison
• Both EM and Gibbs sampling involve iterating
over two steps
• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.
• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the
highest score, if you let it run long enough.
[Adapted from B Noble, GS 541 at UW, http://noble.gs.washington.edu/~wnoble/genome541/]
Multiple Alignment
Do not reproduce without permission
42 -
Lectures.GersteinLab.org
(c) '09
HMMs
Starting from an initial state, a sequence of symbols is
generated by moving from state to state until an end state is
reached.
HMMs
Cor
e
(Figures from Eddy, Curr. Opin. Struct. Biol.)
43 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Hidden Markov Model:
- a composition of finite number of states,
- each corresponding to a column in a multiple alignment
- each state emits symbols, according to symbol-emission
probabilities
44 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Relating Different Hidden Match
States to the Observed Sequence
We see
We don't see
45 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
The Hidden Part of HMMs
46 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Comparison of HMMs to Profiles
• Insertions:
47 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Sequence profile elements
Algorithms
Viterbi maximizes for seq
Forward sums of all possible paths
Forward Algorithm – finds probability P that a model l
emits a given sequence O by summing over all paths that
emit the sequence the probability of that path
Viterbi Algorithm – finds the most probable path through
the model for a given sequence
(both usually just boil down to simple applications of
dynamic programming)
48 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Probability of a path through the model
49 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
HMM algorithms similar to those in
sequence alignment
50 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Seq. Alignment, Struc. Alignment, Threading
EM - expectation
maximization
"roll your own"
model -- dialing in
probabilities
51 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Building the Model
Matching prot fams (pfam)
Predicting sec. struc. (TM, alpha)
Modelling binding sites for TF
(speech recognition)
52 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Applications of HMMs
(Gene Finding)
HMMs, Profiles,
Motifs, and Multiple
Alignments used to
define modules
(Figures from Branden & Tooze)
•Several motifs (b-sheet, beta-alpha-beta, helix-loop-helix) combine to form a compact globular
structure termed a domain or tertiary structure
•A domain is defined as a polypeptide chain or part of a chain that can independently fold into a stable
tertiary structure
•Domains are also units of function (DNA binding domain, antigen binding domain, ATPase domain,
etc.)
53 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Modules
•Another example of the helix-loop-helix
motif is seen within several DNA binding
domains including the homeobox proteins
which are the master regulators of
development
Multiple Alignment
Do not reproduce without permission
54 -
Lectures.GersteinLab.org
(c) '09
Positions Independent
• Limitation of position weight matrix is the assumption
that the positions in the site contribute additively to the
total binding activity.
• Statistical methods (e.g. neural networks) used to
identify which pairs of sites are dependent on each
other.
[Adapted from C Bruce, CBB752 '09]
55 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Independence of bases within motif
[Adapted from C Bruce, CBB752 '09]
Zhou and Liu, 2004
56 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Correlated bases
Using Binding Site
Regions Found by
ChIP-chip to refine
motifs: BoCaTFBS
57
[Wang
et al.,
GenomeBiology,
'06] ('06)]
[Wang
et al.,
GenomeBiology
57 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
• Traditional motif
learners (e.g.
consensus sequences,
profile methods, and
HMMs) only use
positive information
• ChIP-chip & Chip-seq
give vast amount of
negative information
(regions not bound)
• Explicitly use this in
constructing classifier
that refines known
positive motif seeds
• Use sequence of
Alternating Decision
Trees (ADTboost),
which allow explicit
inter-positional
correlations between
nucleotide positions
[Wang
et al., GenomeBiology,
[Wang et
al., GenomeBiology
('06)] '06]
58
58 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Good performance compared to traditional motiffinders but large negative set requires training and
detection cascade for efficiency and balance
• Multiple Alignment
• Motifs
 Fast identification methods
• Profile Patterns
 Refinement via EM
 Gibbs Sampling
• HMMs
• Applications
 Module DBs
 Regression vs expression
• Issues: site independence
 BoCaTFBS
59 (c) Mark Gerstein, 2006, Yale, bioinfo.mbb.yale.edu
Multiple Alignment Topics