Motifs and motif prediction methods I - BIDD

Download Report

Transcript Motifs and motif prediction methods I - BIDD

CZ5226: Advanced Bioinformatics
Lecture 4: Motifs and methods for generating motifs
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
What is a motif?
• A motif is a sequence pattern that occurs
repeatedly in a group of related DNA or
RNA or protein or peptide sequences.
2
Types of motifs and
what they mean
• Motifs in protein sequences
– Structure, function, evolution
• Motifs in DNA and RNA sequences
– Promoters, transcription factor binding sites,
splicing signals
• Motifs in MHC-binding peptides
– Anchor residue positions, TCR recognition
residues
3
Motifs in Protein Sequences
• The leucine zipper may
explain how some
eukaryotic gene
regulatory proteins work.
• L-x(6)-L-x(6)-L-x(6)-L
• The leucine side chains
extending from one
alpha-helix interact with
those from a similar
alpha helix of a second
polypeptide, facilitating
dimerization
4
Motifs in
DNA
Sequences
5
Motifs in DNA Sequences
• Promoter regions, e.g. TATA box
• Transcription factor binding sites, e.g.
Eve in Drosophila:
G-G-T-C-C-T-G-G
• Cis-Regulatory regions
6
Motifs in RNA sequences
Human RNAsplice
junctions
sequence
matrix
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html
7
Motifs in Protein Structures
• Protein structure
patterns can encode
information about
protein function.
• Structure motifs can
be used to improve
multiple alignments of
protein sequences.
8
Active site recognition
EXAMPLE: CATHEPSIN A
PEPTIDASE FAMILY S10
EC # 3.4.16.5
3-D representation
3D profile
(PROCAT)
9
1ac5
438LTFVSVYNASHMVPFDKS455
1ivy
10
419IAFLTIKGAGHMVPTDKP436
Motifs in MHCBinding Peptide
11
Motifs in MHC Binding
Peptides
12
Motifs in MHC Binding Peptides
13
What is the goal and method of
motif detection?
• Perform local multiple sequence alignment
to find consensus sequences and common
sequence patterns (motifs)
14
Macromolecular motif recognition
1-D representation: Primary amino acid sequence
MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPE
NSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTE
VAQSNFEALQDFFRLFPEYKNNKL...
Computational
sequence analysis
Query secondary
databases over the
Internet
http://www.ebi.ac.uk/interpro/
15
The Three Elements of Pattern
Discovery
Pattern discovery requires:
• A pattern language
– This defines what kind of patterns you can find.
• An objective function
– This defines what makes a pattern “interesting”.
• An algorithm
– This defines how to search among the possible
patterns to find the “interesting” ones.
16
The Three Elements of Pattern
Discovery
Pattern discovery requires:
• A pattern language
– This defines what kind of patterns you can find.
• An objective function
– This defines what makes a pattern “interesting”.
• An algorithm
– This defines how to search among the possible
patterns to find the “interesting” ones.
17
Pattern Description Languages
• Regular expressions
• Profiles
• Hidden Markov Models (HMMs)
– Motif HMMs
– Motif-based HMMs
18
Macromolecular motif recognition
single motif
exact regular expression
(PROSITE)
full domain alignment
profile (PROSITE)
residue frequency
multiple motifs
Hidden Markov Model
(Pfam, PROSITE)
matrices (PRINTS)
19
Regular Expressions
• Regular expressions can be used to
describe sequence motifs.
• They use a simple syntax to describe
patterns.
• An example protein pattern:
[DENG]-x-[DEN]-x(0,2)-[DENQK]-[LIVFY]
20
Regular expressions contd.
Basic rules for regular expressions
• Each position is separated by a hyphen “-”
• A symbol X is a regular expression matching itself
• x means ‘any residue’
• [ ] surround ambiguities - a string [XYZ] matches any of the enclosed symbols
• A string [R]* matches any number of strings that match
• { } surround forbidden residues
• ( ) surround repeat counts
Model formation
•Restricted to key conserved features in order to reduce the “noise” level
•Built by hand in a stepwise fashion from multiple alignments
21
Motif modelling methods
Prosite: Regular expressions
CARBOXYPEPT_SER_HIS
[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x[IVAQ]-P-x(3)-[PSA]
Regular expressions represent features by logical
combinations of characters. A regular expression defines
a sequence pattern to be matched.
22
Regular expressions contd.
Regular expressions, such as PROSITE patterns, are matched to
primary amino acid sequences using finite state automata.
“all-or-none”
23
Profiles
• Profiles give weights for each letter.
• Example from TRANSFAC: NF-kappab1
G
G
G
G
A
T
Y
C
C
C
A
0
0
0
2
16
0
0
0
0
0
C
0
0
0
0
1
0
7
16
18
17
G
18
18
18
16
0
3
1
0
0
1
T
0
0
0
0
1
15
10
2
0
0
24
Profiles
• Profiles are usually
created by aligning
multiple instances
of the motif.
• Example: nuclear
hormone receptor
transcription factor
binding site.
25
Motif modelling methods
Prints: Residue frequency matrices
Motif 1
NPESWTNFANMLW
NPYSWVNLTNVLW
REYSWHQNHHMIY
NEGSWISKGDLLF
NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY
NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY
NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY
NPYSWNLIANVLY
NEYRWNKVANVLF
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif
Motif
2 3
FFQHFPEYQTNDFHIAGESYAGHYIP
Motif 4
FFNKFPEYQNRPFYITGESYGGIYVP
WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE
•a collection of protein “fingerprints” that exploit groups of motifs to build
characteristic family signatures
•motifs are encoded in ungapped ”raw” sequence format
•different scoring methods may be superimposed onto the data, e. .g. BLAST
•improved diagnostic reliability
•mutual context provided by motif neighbours
26
Motif modelling methods
Prosite: Profiles
Feature is represented as a matrix with a score for every
possible character.
Matrix is derived from a sequence alignment, e.g.:
F
F
Y
F
F
L
K
K
P
P
K
E
L
A
I
V
V
F
L
F
V
V
L
I
S
G
G
K
A
S
H
Q
Q
E
A
E
C
T
E
A
V
C
L
M
L
I
I
I
L
F
L
L
A
I
V
Q
G
K
D
Q
27
Profiles contd.
Derived matrix:
A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y
-18
-22
-35
-27
60
-30
-13
3
-26
14
3
-22
-30
-32
-18
-22
-10
0
9
34
-10
-33
0
15
-30
-20
-12
-27
25
-28
-15
-6
24
5
9
-8
-10
-25
-25
-18
-1
-18
-32
-25
12
-28
-25
21
-25
19
10
-24
-26
-25
-22
-16
-6
22
-18
-1
-8
-18
-33
-26
14
-32
-25
25
-27
27
14
-27
-28
-26
-22
-21
-7
25
-19
1
8
-22
-7
-9
-26
28
-16
-29
-6
-27
-17
1
-14
-9
-10
11
-5
-19
-25
-23
-3
-26
6
23
-29
-14
14
-23
4
-20
-10
8
-10
24
0
2
-8
-26
-27
-12
Alignment
positions
3
22
-17
-9
-15
-23
-22
-8
-15
-9
-9
-15
-22
-16
-18
-1
2
6
-34
-19
-10
-24
-34
-24
4
-33
-22
33
-27
33
25
-24
-24
-17
-23
-24
-10
19
-20
0
-2
-19
-31
-23
12
-27
-23
19
-26
26
12
-24
-26
-23
-22
-19
-7
16
-17
0
-8
-7
0
-1
-29
-5
-10
-23
0
-21
-11
-4
-18
7
-4
-4
-11
-16
-28
-18
28
Profiles contd.
•Inclusion of all possible information to maximise overall
signal of protein/domain
i. e., a full representation of features in the aligned
sequences
•Able to detect distant relationships with only few well
conserved residues
•
•Position-dependent weights/penalties for all 20 amino acids,
gaps, insertions
•Dynamic programming algorithms for scoring hits
29
Hidden Markov Models (HMM)
• HMMs generalize
the idea of a profile.
• They can model
insertions and
deletions in the
sequence as well as
the letters at
conserved positions.
• Profiles can be seen
as simple HMMs.
30
Macromolecular motif recognition
Pfam and Prosite: Hidden Markov Models
(HMMs)
•Feature is represented by a probabilistic model of
interconnecting match, delete or insert states
•contains statistical information on observed and expected
positional variation - “platonic ideal of protein family”
Di
Ii
B
Mi
E
31
HMM example
A possible HMM for the sequence “ACCY” which is represented as
a sequence of probabilities. The probability of ACCY is shown as a
highlighted path through the model.
P that an amino acid occurs
in a particular state
P of transition
state
32
Motif-Based HMMs
Motif-based HMMs are sequence models made by
combining one or more motif models.
Motif HMM: Motifs are modeled as profile HMMs
without delete or insert states.
Motif HMM
M1
M2
M3
M4
M5
33
A Simple Motif-Based HMM
• Adding emitting states with self-loops, plus
start and end states, turns a motif HMM
into a sequence model.
• The HMM below models sequences with
one occurrence of the motif.
Sequence HMM
Start
Left
Flank
M1
M2
M3
M4
M5
Right
Flank
End
34
Motif-Based HMM for Modeling
Cis-regulatory Regions
With two or more motif
models we can make
more complicated
motif-based HMMs.
This sequence model
captures motifs on the
+ and – strand of DNA.
It does not capture the
order of the motifs.
35
The Three Elements of Pattern
Discovery
Pattern discovery requires:
• A pattern language
– This defines what kind of patterns you can find.
• An objective function
– This defines what makes a pattern “interesting”.
• An algorithm
– This defines how to search among the possible
patterns to find the “interesting” ones.
36
Objective functions for Regular
Expression Patterns
• Possible objective functions are:
– Perfect matches only (no mismatches)
– Allow a given number of mismatches
– Allow a given density of mismatches (or
wildcards).
• To be interesting, the motif must occur a
certain minimum number of times in the
data.
37
Objective functions for profiles
and HMMs
• Profile- and HMM-based motifs are usually
ranked by statistical or informationtheoretical measures:
– Likelihood ratio (eg, forward-backward)
– Information content (relative entropy)
– Maximum a posteriori probability
38
Example for profiles: the likelihood ratio
• Use the profile to compute the likelihood of the
data: Pr(data | profile)
• Use the background model to compute the
likelihood of the data under the background
model: Pr(data | bkgrnd)
• The likelihood is:
Pr(data | profile) / Pr(data | bkgrnd)
39
Objective functions for protein
structure patterns
• Structure motifs are usually evaluated
based on the RMS distance
– between the pattern and each instance, or,
– among all the instances of the pattern.
40
The Three Elements of Pattern
Discovery
Pattern discovery requires:
• A pattern language
– This defines what kind of patterns you can find.
• An objective function
– This defines what makes a pattern “interesting”.
• An algorithm
– This defines how to search among the possible
patterns to find the “interesting” ones.
41
Algorithms for discovering
sequence motifs
• Regular expression searches enumerate
or use seeds.
• Profile/HMM algorithms use Gibbs
sampling or Expectation Maximization
(EM). Forward-Backward is a form of EM.
42
Regular Expression Discovery:
a simple algorithm
• Look for DNA 16-mers where (up to) one wild
card is allowed in the pattern:
– E.g., “T-A-C-X-G-T-A-G-G-C-C-T-A-G-T-T”
16
11
5

1
.
5

10
• There are
possible patterns—a
big number.
• Idea: Instead of enumerating the possible
patterns and counting, just update the counts
of appropriate patterns for each 16-mer that
actually occurs in the data.
43
Regular Expression Discovery:
a simple algorithm (cont’d)
• Run a window of width 16 along the data
and, for each 16-mer in the data, e.g.
“AGGGTAAAAGCCCCCT”, update the
counts of the exact match pattern and
each pattern with one wildcard:
A-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T,
X-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T,
A-X-G-G-T-A-A-A-A-G-C-C-C-C-C-T, etc.
44
Profile discovery algorithms
• Profile discovery algorithms for finding
sequence motifs mostly use either EM
(Expectation Maximization) or Gibbs
sampling.
45
What is Gibbs sampling?
• Stochastic optimization method
• Works well with local multiple alignment without
gaps (motif searching)
• Searches for the statistically most probable motifs
by sampling random positions instead of going
through entire search space
46
What is the program going to do?
1. Ask user for :
a) file containing multiple DNA or protein sequences
b) motif width
c) how many motifs wanted
2. Calculate the background frequencies of
A,C,G,T from all the sequences.
[0.34951456310679613, 0.17799352750809061,
0.21035598705501618, 0.23300970873786409]
47
What is the program going to do?
3. Generate random start positions for the motif
in each sequence.
Example:
10 sequences, 30 bp in length, motif width of 7
start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22]
>> random.uniform(0,ceiling)
where ceiling=len(sequence)-width
48
What is the program going to do?
4. Construct position specific score matrix
from all sequences except one.
Motif Position
A
C
G
T
0
1
2
3
4
5
6
0.6
0
0.7
0
0.5
0.1
0.1
0
0.9
0.2
0.2
0
0.3
0
0.3
0
0
0.7
0.1
0
0.6
0
0
0
0
0.3
0.5
0.2
49
What is the program going to do?
5. Score the left-out sequence according to
the position specific score matrix:
score 
width

k 0
i{A,C,G,T }
Pki
Pi
where Pki  probability of seeing base i at position k
and Pi  probability of seeing base i as background
50
What is the program going to do?
Example:
Use the position specific matrix and background from before:
[A: 0.34951456310679613, C: 0.17799352750809061,
G: 0.21035598705501618, T: 0.23300970873786409]
Motif Position
A
C
G
T
0
1
2
3
4
5
6
0.6
0
0.7
0.2
0.5
0.1
0.1
0
0.9
0.2
0.1
0.3
0.3
0
0.3
0
0
0.7
0.1
0
0.6
0
0
0
0
0.1
0.5
0.2
GATTACA:
0.3
0
0
0
0.5
0.3
0.1






 4.81
0.21 0.35
0.23
0.23
0.35
0.18
0.35
51
What is the program going to do?
6. Randomly generate another start position of the
motif for that left-out sequence.
7. Score that sequence with its new start position.
8. Compare this new score with its original score.
9. If newscore >= oldscore, then jump to that new
start position, else jump to that new start position
with probability = newscore
oldscore
52
What is the program going to do?
10. Start all over again with this
updated start position with
another sequence left out
Do this many many times!
~ 1000 iterations
Gibbs will converge to a stationary
distribution of the start positions
=> a probable alignment of the
multiple sequences
53
Gibbs Sampling Algorithm
54
Gibbs Sampling Algorithm (cont’d)
55
Gibbs Sampling Algorithm (cont’d)
56
Gibbs Sampling Example
• The following slides illustrate Gibbs
sampling to discover a motif in yeast
DNA sequences.
• This example uses a sequence model
that allows multiple sites per
sequence.
• Columns are sampled as well as sites.
57
The Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence
per gene are searched in
Saccharomyces cerevisiae.
58
The Target Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37 (maximum)
59
Initial Seeding
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
MAP score = -10.0
60
Sampling
Add?
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this site
as opposed to without?
TCTCTCTCCA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
61
Continued Sampling
Add?
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this site
as opposed to without?
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
Source: G.M. Church
62
Column Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this new
column structure?
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
********* *
Source: G.M. Church
63
The Best Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37
64
What is the program missing?
• Doesn’t do re-initializations in the middle to get out of local
maxima
• Doesn’t optimize the width (you have to specify width
explicitly)
• Doesn’t do the Bayesian approach – just frequentist (easier
for me and for you to understand!)
• Doesn’t read in fasta files
• Doesn’t do error checking!
• And other things that don’t know they are missing yet!
65
Alternative program?
• Gibbs Motif Sampler
– http://bayesweb.wadsworth.org/gibbs/gib
bs.html
• AlignAce
– http://atlas.med.harvard.edu/cgibin/alignace.pl
66
Protein structure pattern
discovery algorithms
• To discover structure patterns, some
algorithms search the space of structural
motifs.
• We illustrate another approach (called
SPRATT) that searches the space of
sequence motifs to find structure motifs.
67
Structure patterns: SPRATT
• Describe local spatial
neighborhoods as
strings.
• Use sequence pattern
discovery to find
patterns.
• Check if string
similarity reflects
structural similarity.
68
SPRATT Example
69
SPRATT Example (cont’d)
70
SPRATT Example: conclusion
• Now, use the PRATT sequence pattern
discovery algorithm to find any patterns in the Cstrings and N-strings, separately.
• For each pattern found, compute the maximum
RMSD of the overlaid structures of the pattern
instances.
• The final objective function is the ratio of the
information content of the sequence pattern and
the maximum RMSD.
71
Some real-life algorithms
• Sequence pattern discovery
–
–
–
–
–
–
RSA tools (DNA)
MEME (protein, DNA, RNA)
Alignace (DNA, RNA)
Weeder
PRATT (protein, DNA, RNA, other alphabets)
Clover
• Cis-regulatory regions
– MCAST
– Comet, Cister, Cluster-Buster
– Meta-MEME
• Protein structure pattern discovery
– SPRATT
– CE
72