Transcript ppt

Biological Sequence Analysis
Lecture 26, Statistics 246
April 27, 2004
1
Synopsis
Some biological background
A progression of models
Acknowledgements
References
2
The objects of our study
DNA, RNA and proteins: macromolecules which are
unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T
RNA: units are the nucleotide residues A, C, G and U
Proteins: units are the amino acid residues A, C, D, E,
F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of
DNA, RNA and protein molecules are encoded in the
linear sequence of these basic units: their primary
3
structure.
The central dogma
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
4
A protein-coding gene
5
Motifs - Sites - Signals - Domains
For this lecture, I’ll use these terms interchangeably
to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains,
coiled-coil domains, EGF-like domains, signal
peptides, phosphorylation sites, antigenic
determinants, ...
In DNA / RNA we have: enhancers, promoters,
terminators, splicing signals, translation initiation
sites, centromeres, ...
6
Motifs and models
Motifs typically represent regions of structural
significance with specific biological function.
Are generalisations from known examples.
The models can be highly specific.
Multiple models can be used to give higher
sensitivity & specificity in their detection.
Can sometimes be generated automatically from
examples or multiple alignments.
7
The use of stochastic models for motifs
Can be descriptive, predictive or everything
else in between…..almost business as usual.
However, stochastic mechanisms should never
be taken literally, but nevertheless they can be
amazingly useful.
Care is always needed: a model or method can
break down at any time without notice.
Biological confirmation of predictions is almost
always necessary.
8
Transcription initiation in E. coli
RNA polymerase-promotor interactions
In E. coli transcription is initiated at the promotor , whose
9
sequence is recognised by the Sigma factor of RNA polymerase.
Transcription initiation in E. coli, cont.
..YKFSTYATWWIRQAITR..
10
Determinism 1: consensus sequences
 Factor
70
28
Promotor consensus sequence
-35
-10
TTGACA
TATAAT
CTAAA
CCGATAT
Similarly for 32 , 38 and 54.
Consensus sequences have the obvious limitation:
there is usually some deviation from them.
11
The human transcription factor Sp1
has 3 Cys-Cys-His-His zinc finger DNA binding domains
12
Determinism 2: regular expressions
The characteristic motif of a Cys-Cys-His-His zinc finger
DNA binding domain has regular expression
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X is unknown. The 29 a.a. sequence of our
example domain 1SP1 is as follows, clearly fitting the model.
1SP1: KKFACPECPKRFMRSDHLSKHIKTHQNKK
13
Prosite patterns
An early effort at collecting descriptors for
functionally important protein motifs. They
do not attempt to describe a complete
domain or protein, but simply try to identify
the most important residue combinations,
such as the catalytic site of an enzyme.
They use regular expression syntax, and
focus on the most highly conserved residues
in a protein family.
http://au.expasy.org
14
More on Prosite patterns
This pattern, which must be in the N-terminal of the sequence (‘<'), means:
<A - x [ST]
(2)
x(0,1)
-V {LI}
Ala-any- [Ser or Thr]-[Ser or Thr]
- (any or none)-Val-(any but Leu, Ile)
15
Searching with regular expressions
http://www.isrec.isb-sib.ch/software/PATFND_form.html
c.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output
[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
...
gp|AF234161|7188808|01AEB01ABAC4F945 nuclear
protein NP94b [Homo sapiens] Occurences: 2
Position : 514 CYICKASCSSQQEFQDHMSEPQH
Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
........
16
Regular expressions can be limiting
The regular expression syntax is still too rigid to
represent many highly divergent protein motifs.
Also, short patterns are sometimes insufficient with
today’s large databases. Even requiring perfect
matches you might find many false positives. On
the other site some real sites might not be perfect
matches.
We need to go beyond apparently equally likely
alternatives, and ranges for gaps. We deal with the
former first, having a distribution at each position.
17
Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific a.a.distribution.
18
Scaling is by a measure of a position’s information content.
Weight matrix model (WMM) =
Stochastic consensus sequence
A
9
214
C
13
G
22
7
T
5
18
2
193
19
63
142
26
31
29
124
118
52
38
31
Counts from 242 known
8
29
43
70
A
0.04
0.88
0.26
0.59
0.49
0.03
C
0.09
0.03
0.11
0.13
0.21
0.05
G
0.07
0.01
0.12
0.16
0.12
0.02
T
0.80
0.08
0.51
0.13
0.18
0.89
216
sites
Relative frequencies:
2
A -38
19
1
12
10
-48
C -15
-38
-8
-10
-3
-32
G -13
-48
-6
-7
-10
-40
T
-32
8
-9
-6
19
17
1
0
1
10 log2fbl/pb
Informativeness:2+Sbpbllog2pbl
2
3
4
5
6
19
fbl
Interpretation of weight matrix entries
candidate sequence
CTATAATC....
aligned position
123456
Hypotheses:
S=site (and independence)
R=random (equiprobable, independence)
 pr(CTATAA | S) 
log2 
pr(CTATAA | R) 
 .09x.08x.26x.13x.51x.01 


= log2
 .25x.25x.25x.25x.25x.25

= (2+log2.09)+...+(2+log2.01)
1
-15- 32
= + 1 - 9 + 10 - 48
10
Generally,
score sbl = log fbl/pb
l=position, b=base
pb=background frequency
20
Use of the matrix to find sites
C
Move the matrix
along the sequence
and score each
window.
Peaks should
occur at the
true sites.
Of course in general
any threshold will
have some false
positive and false
negative rate.
A
C-38
T
A
T
19
1
12
10 -48
-3 -32
C
-15 -38
-8 -10
G
-13 -48
-6
T
17 -32
8
A
-38
C
A
A
T
sum
-7 -10 -40
-9
19
19
12
10 -48
-15 -38
-8 -10
-3 -32
G
-13 -48
-6
T
17 -32
8
A
-38
C
1
-6
-93
19
+85
-7 -10 -40
-9
1
-6
19
12
10 -48
-15 -38
-8 -10
-3 -32
G
-13 -48
-6
T
17 -32
8
-7 -10 -40
-9
-6
19
-95
21
Profiles
Are a variation of the position specific scoring matrix
approach just described. Profiles are calculated slightly
differently to reflect amino acid substitutions, and the
possibility of gaps, but are used in the same way.
In general a profile entry Mla for location l and amino acid a is
calculated by
Mla = ∑bwlbSab
where b ranges over amino acids, wlb is a weight (e.g. the
observed frequency of a.a. b in position l) and Sab is the (a,b)entry of a substitution matrix (e.g. PAM or BLOSUM)
calculated as a likelihood ratio.
Position specific gap penalties can also be included.
22
Derivation of a profile for Ig domains
PileUp of: @/home/ucsb00/George/.WAG/pileup-8391.8424
Symbol comparison table: GenRunData:pileuppep.cmp
1254
CompCheck:
GapWeight: 3.000
GapLengthWeight: 0.100
pileup.msf
Name:
Name:
Name:
Name:
g1
m3
k2
l3
MSF: 49
g1
m3
k2
l3
1
ELVKAGSSVK
GLVEPGGSLR
LPVTPGEPAS
VSVALGQTVR
Type: P
Len:
Len:
Len:
Len:
MSCKATGYTF
LSCSASGFTF
ISCRSSQSLL
ITCQ.GDSLR
May 27, 1999 13:18
49
49
49
49
SSYE....LY
SAND....MN
DSGDGNTYLN
GYDAA.....
Check: 9167 ..
Check: 2179
Check:
0
Check: 6739
Check: 249
Weight:
Weight:
Weight:
Weight:
WVRQAPGQGL
WVRQAPGKGL
WYLQKAGQSP
WYQQKPGQAP
49
EDLGYISSS
EWLSFIGGS
QLLIYTLSY
LLVIYGRNN
1.00
1.00
1.00
1.00
23
Cons A
B
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y
Z
Gap Len ..
V
15
8
-14
14
21
1
24
-4
19
-4
21
19
3
3
8
-14
4
10
32
-33
-14
14
100
100
L
9 -10
-14
-12
-5
24
-3
-6
21
-5
38
34
-9
17
0
-7
14
5
25
9
-7
-4
100
100
V
20 -20
20
-20
-20
20
20 -30
110
-20
80
60
-30
10 -20
-30
-10
20 150
-80
-10
-20
100
100
T
31
21
-10
25
32 -31
21
4
-3
28
-11
0
18
14
17
6
15
32
0
-33
-24
25
100
100
P
35
-1
-4
0
3 -13
12
2
5
-1
10
12
-3
50
11
0
13
14
17
-30
-25
6
100
100
G
70
60
20
70
50 -60
150 -20
-30
-10
-50
-30
40
30
20
-30
60
40
20 -100
-70
30
100
100
E
22
29
-4
36
40 -33
39
10
-12
11
-18
-11
22
15
32
3
31
11
-4
-32
-31
35
100
100
S
25
14
26
11
11 -23
29
-5
-3
11
-18
-12
12
38
0
6
57
34
1
-10
-28
4
100
100
V
26 -11
-1
-9
-6
16
9 -14
46
-11
45
37
-12
6
-5
-19
-3
11
61
-30
-3
-6
100
100
R
-4
13
-8
7
7 -30
-3
14
-14
49
-22
5
13
16
17
60
27
4
-14
50
-33
12
100
100
! 11
I
-1 -17
-13
-19
-13
46
-21 -16
67
-8
64
58
-19
-13 -11
-12
-13
5
54
-13
6
-11
100
100
S
28
20
43
14
14 -21
40 -13
-3
14
-24
-17
20
27
-7
4
90
38
-3
9
-27
1
100
100
C
30 -40 150
-50
-60 -10
20 -10
20
-60
-80
-60
-30
10 -60
-30
70
20
20 -120 100
-60
100
100
K
4
18
-11
17
17 -32
6
15
-12
40
-17
1
17
15
31
39
24
4
-11
18
-31
24
100
100
A
53
11
19
12
12 -20
31
-6
-1
3
-9
-4
11
21
5
-8
33
17
5
-21
-15
6
30
30
S
28
21
28
19
16 -22
45 -11
-5
8
-21
-14
18
21
-2
-2
60
36
2
-13
-27
6
100
100
G
29
41
-9
53
39 -44
60
9
-16
7
-24
-15
28
15
37
-4
20
14
1
-54
-37
37
100
100
W
2
-4
35
-14
-10
31
1
-4
8
-12
8
-4
1
-8 -23
-12
38
1
-2
43
28
-18
100
100
L
10 -10
-19
-10
-3
29
-3 -10
32
-3
44
41
-6
0
-6
-16
-3
44
32
-3
0
-3
100
100
W 21 -28
-18
-39
-26
57
-30
1
29
-15
53
37
-20
-22 -21
-1
-14
-12
13
68
40
-22
100
100
! 21
S
27
33
18
37
27 -32
50
-4
-10
9
-27
-19
25
18
9
-1
59
18
-3
-20
-29
17
100
100
S
29
8
40
4
4
3
19
-4
-2
-2
-10
-11
11
9
-9
-9
48
11
-2
14
4
-6
100
100
B
12
35
6
33
21 -10
26
14
-10
0
-15
-15
35
-6
10
-11
10
7
-6
-18
3
14
100
100
D
34
47
-20
66
57 -48
39
17
-9
14
-21
-15
32
11
35
-4
15
15
-6
-61
-27
47
100
100
A
31
11
7
14
11 -15
31
-4
-4
-1
-8
-4
8
11
6
-8
14
11
6
-25
-14
7
21
21
N
3
15
-4
10
7
-7
6
7
-4
6
-6
-4
21
0
6
1
4
3
-4
-4
-1
6
21
21
T
6
3
3
3
3
-4
6
-1
3
3
-1
0
3
4
-1
-1
4
21
3
-8
-4
1
21
21
Y
-4
-4
14
-7
-7
19
-10
4
1
-8
4
-1
-1
-11
-8
-8
-6
-4
-1
15
21
-8
21
21
L
-3 -20
-34
-21
-12
45
-20 -11
34
-7
66
62
-17
-12
-3
-10
-17
-3
34
12
8
-8
21
21
N
2
31
4
15
9
4
3
20
-8
4
-9
-11
46
-11
4
-5
4
2
-11
6
18
4
21
21
! 31
W 80 -70 -120 -110 -110 130 -100 -10
-50
10
50
-30
-30
-80 -50
140
30
-60
-80 150 110
-80
100
100
F
-3 -16
38
-22
-22
51
-16
0
38
-25
35
16
-13
-22 -25
-29
-16
-3
44
10
44
-25
100
100
R
-8
3
-29
3
6 -10
-14
23
-3
27
7
24
3
10
32
48
-4
-6
-1
44
-23
19
100
100
Q
20
50
-60
70
70 -80
20
70
-30
40
-10
0
40
30 150
40
-10
-10
-20
-50
-60 110
100
100
A
48
19
-10
19
19 -38
19
0
-6
48
-13
6
19
19
19
16
19
19
0
-22
-29
19
100
100
P
49
8
10
10
10 -47
27
10
-11
6
-18
-11
3
92
20
13
28
23
8
-57
-50
14
100
100
G
70
60
20
70
50 -60
150 -20
-30
-10
-50
-30
40
30
20
-30
60
40
20 -100
-70
30
100
100
Q
11
34
-42
44
44 -55
10
41
-20
44
-10
3
28
18
91
34
-3
-3
-14
-27
-42
68
100
100
G
49
26
20
29
23 -30
66 -11
-11
0
-23
-14
20
22
8
-12
45
22
8
-39
-32
12
100
100
L
13 -13
-22
-13
-6
16
-6
0
19
-6
38
35
-13
38
6
-3
0
6
29
-10
-16
0
100
100
! 41
E
11
22
-38
35
53 -17
12
20
1
11
10
12
16
3
42
0
-1
4
2
-35
-20
47
100
100
L 10 -10
-49
-10
-11
42
-20
-2
16
-4
48
32
-7
-19
0
7
-6
-9
12
21
18
-5
100
100
L
-3 -31
-43
-31
-20
71
-26 -16
61
-20
96
82
-27
-16
-8
-27
-24
-3
66
17
16
-14
100
100
I
15
6
19
6
3
10
20 -15
42
-5
13
11
0
3
-8
-12
26
16
36
-26
-12
-2
100
100
Y 24 -27
56
-42
-38 101
-48
16
15
-44
34
1
-13
-55 -45
-41
-27
-21
-3
81 105
-44
100
100
I
15
5
12
6
3
10
17 -14
46
-5
17
15
-1
2
-8
-15
9
33
40
-38
-11
-1
100
100
S
10
7
-3
6
6
-3
18
-1
1
8
3
12
6
10
6
12
25
7
8
17
-19
4
100
100
S
25
33
21
26
20 -25
45
-2
-11
11
-25
-18
36
17
5
0
60
18
-5
-9
-24
10
100
100
S
11
21
32
9
6
3
15
5
-6
4
-14
-15
29
2
-6
-4
46
8
-9
21
7
-3
100
100
*
12
0
4
6
6
4
22
0
6
6
21
2
6
9
12
7
25
8
10
5
11
0
24
g1 ELVKAGSSVK MSCKATGYTF SSYE....LY WV
m3 GLVEPGGSLR LSCSASGFTF SAND....MN WV
k2 LPVTPGEPAS ISCRSSQSLL DSGDGNTYLN WY
l3
VSVALGQTVR ITCQ.GDSLR GYDAA..... WY
Cons A
13C
14K
15A
16S
30
4
53
28
B
-40
18
11
21
C
D
150
-11
19
28
-50
17
12
19
E
-60
17
12
16
... Gap
Len
100 100
100 100
30 30
100 100
We search with a profile in the same way as we did before.
These days profiles of this kind have been largely replaced by
Hidden Markov Models for sequence searching and alignment.
25
Modelling motifs: the next steps
Missing from the weight matrix models of motifs
and profiles are good ways of dealing with:
• Length distributions for insertions/deletions
• Local and non-local association of amino acids
Hidden Markov Models (HMM) help with the first.
Dealing with the second remains a hard unsolved
problem.
26
Hidden Markov Models
Processes {(St, Ot), t=1,…}, where St is the hidden
state and Ot the observation at time t, such that
pr(St | St-1, Ot-1,St-2 , Ot-2 …)
= pr(St | St-1)
pr(Ot | St ,St-1, Ot-1,St-2 , Ot-2 …) = pr(Ot | St, St-1)
The basics of HMM were laid bare in a series of
beautiful papers by L E Baum and colleagues around
1970, and their formulation has been used almost
unchanged to this day.
27
Hidden Markov Models:extensions
Many variants are now used. For example, the distribution of O
may depend only on previous S, or also on previous O values,
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St ), or
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St , St-1 ,Ot-1) .
Most importantly for us, the times of S and O may be
decoupled, permitting the Observation corresponding to State
time t to be a string whose length and composition depends on
St (and possibly St-1 and part or all of the previous
Observations). This is called a hidden semi-Markov or
generalized hidden Markov model.
28
Some early applications of HMM
finance, but we never saw them
speech recognition
modelling ion channels
In the mid-late 1980s HMMs entered genetics and
molecular biology, and they are now firmly entrenched.
Some current applications of HMM to biology
mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequence
The algorithms
As the name suggests, with an HMM the series
O = (O1,O2 ,O3 ,……., OT) is observed, while
the states S = (S1 ,S2 ,S3 ,……., ST) are not.
There are elegant algorithms for calculating
pr(O|), arg max pr(O|) in certain special
cases, and arg maxS pr(S|O,).
Here  are the parameters of the model, e.g.
transition and observation probabilities.
30
Profile HMM = stochastic regular expressions
M = Match state, I = Insert state, D = Delete state.
To operate, go from left to right. I and M states output
31
amino acids; B, D and E states are “silent”.
How profile HMM are used
Instances of the motif are identified by calculating
log{pr(sequence | M)/pr(sequence | B)},
where M and B are the motif and background HMM.
Alignments of instances of the motif to the HMM are
found by calculating
arg maxstates pr(states | instance, M).
Estimation of HMM parameters is by calculating
arg maxparameterspr(sequences | M, parameters).
In all cases, we use the efficient HMM algorithms.
32
Pfam domain-HMM
Pfam is a library of models of recurrent protein
domains. They are constructed semi-automatically
using profile hidden Markov models.
Pfam families have permanent accession numbers
and contain functional annotation and crossreferences to other databases, while Pfam-B families
are re-generated at each release and are
unannotated.
See http://pfam.wustl.edu/
33
On the Modelling of Finding
Short DNA Motifs
34
Outline
•
•
•
•
Some Biological Background
Models and Modelling
Results and Discussions
Future Work
35
Cartoon of gene structure
(5’ Splice site)
(3’ Splice site)
36
12 examples of 5’splice (donor) sites
exon TCGGTGAGT intron
TGGGTGTGT
CCGGTCCGT
ATG GTAAGA
TCT GTAAGT
CAGGTAGGA
CAGGTAGGG
AAGGTAAGG
AGGGTATGG
TGGGTAAGG
GAGGTTAGT
CATGTGAGT
37
Probability models for short DNA motifs
Short: ~6-20 base pairs
DNA motifs: splice sites, transcription factor binding sites,
translation initiation sites, enhancers, promoters,...
Why probability models?
• to characterize the motifs
• to help identify them
• for incorporation into larger models
Aim: given a number of instances of a DNA sequence
motif, we want a model for the probability of that
motif.
38
Weight matrix models, Staden (1984)
Base
A
C
G
T
-3
33
37
18
12
-2
61
13
12
14
-1 0 +1
10 0 0
3 0 0
80 100 0
7 0 100
+2
53
3
42
2
+3
71
8
12
9
+4
7
6
81
6
+5
16
16
22
46
A weight matrix for donor sites.
Essentially a mutual independence model.
An improvement over the consensus CAGGTAAGT.
But we have to go beyond independence…
39
Beyond independence
Weight array matrices, Zhang & Marr (1993) consider
dependencies between adjacent positions, i.e. (nonstationary) first-order Markov models. The number of
parameters increases exponentially if we restrict to full
higher-order Markovian models.
Variable length Markov models, Rissanen (1986),
Buhlmann & Wyner (1999), help us get over this
problem. In the last few years, many variants have
appeared: all make use of trees.
[The interpolated Markov models of Salzberg et al
40
(1998) address the same problem.]
Some notation
L: length of the DNA sequence motif.
Xi: discrete random variable at position i,
taking values from the set  = {A, C, G, T}.
x = (x1, …, xL): a DNA sequence motif of
length L.
x1j (j>1): the sequence (x1 , x2, …., xj ).
P(x): probability of x.
41
Variable Length Markov Models
(Rissanen (1986), Buhlmann & Wyner (1999))
Factorize P(x) in the usual telescopic way:
L
P(X1  x1 )   P(X l  x l | X
l1
1
x )
l1
1
l 2
l 1 i
l

1
Simplify this using context functions cl : 


i0
l = 2,..L, to
L
P(X1  x1 )   P(X l  x l | c l (X )  c l (x ))
l1
1
l1
1
l 2
42
VLMM cont.
A VLMM for a DNA sequence motif of length L is
specified by
• a distribution for X1, and, for l = 2,…L,
• a constrained conditional distribution for Xl given Xl1,…,X1.
That is, we need L-1 context functions.
43
VLMM: an illustrative example
Full
X2
X3
X2
X2 X1
Pruned
X2
X
X3 3
X2
X2 X1
Pruned context: P(X3|X2=A, X1) = P(X3|X2=A), etc.
44
Sequence dependencies
(interactions) are not always local
3-dimensional folding; DNA, RNA & protein interactions
The methods outlined so far all fail to incorporate long-range
( ≥4 bp) interactions. New model types are needed!
45
Modelling “long-range” dependency
The principal work in this area is Burge & Karlin’s (1997)
maximal dependence decomposition (MDD) model.
Cai et al (2000) and Barash et al (2003) used Bayesian
networks (BN).
Yeo & Burge (2003) applied maximum entropy models.
Ellrott et al (2002) ordered the sequence and applied
Markov models to the motif.
We have adapted this last idea, to give permuted variable
46
length Markov models (PVLMM).
Permuted variable length MMs
(PVLMM)
Let   ( 1 , 2 ,..., L )
be a permutation vector.
By using VLMM, simplify the context terms in the equation:
P ( x)  P ( X 1 
L
x1 )  P( X  l
l 2
 x l |
 l 1
X 1

 l 1
x1 )
The simplified equation has the form:
P ( x)  P ( X 1 
L
x1 )  P( X  l
l 2

 l 1
 l 1
x l | cl ( X 1 )  cl ( x1 ))
47
Part of a context (decision) tree for
position -2 of a splice donor PVLMM
Node #s: counts
Edge #s: split variables
Sequence order: +2
+5
-1
+4
-2
+3
-3
48
Maximal Dependence Decomposition
MDD starts with a mutual independence model as with
WMMs. The data are then iteratively subdivided, at
each stage splitting on the most dependent position,
suitably defined. At the tips of the tree so defined, a
mutual independence model across all remaining
positions is used.
The details can vary according to the splitting criterion
(Burge & Karlin used 2), the actual splits (binary, etc),
and the stopping rule.
However, the result is always a single tree.
49
Parts of MDD trees for splice donors
50
In each case, splits are into the most frequent nt vs the others.
Issues in statistical modelling of
short DNA motifs
In any study of this kind, essential items are:
• the model class (e.g. WMM, PVLMM, or MDD)
• the way we search through the model class (e.g. by
forward selection & simulated annealing)
• the way we compare models when searching (e.g.
by AIC, BIC, or MDL), and finally,
• the way we assess the final model in relation to our
aims (e.g. by cross-validation).
We always need interesting, high-quality datasets.
51
Model assessment:
Stand-alone recognition
M: motif model
B: background model
Given a sequence x = (x1, …, xL), we
predict x to be a motif if
log {P(x | M) / P(x | B)} > c,
for a suitably chosen threshold value c.
52
Model assessment: terms
TP: true positives, FP: false positives
TN: true negatives, FN: false negatives
Sensitivity (sn) and specificity (sp) are:
sn = TP / [TP + FN]
sp = TP / [TP + FP].
53
Transcription factor binding sites
These are of great interest, their signals are very
weak, and we typically have only a few instances.
TRANSFAC (Wingender et al 2000).
We have studied 61 TFs with effective length ≤ 9 and
≥ 20 instances, 2,238 sites in all.
In 25/61 cases we are able to improve upon WMM.
In the remaining, PVLMM performs similarly to WMM.
54
Assessing our models
for TFBS recognition
We randomly inserted each of the 2,238 sites into a background
sequence of length 1,000 simulated from a stationary 3rd order
MM, with parameters estimated from a large collection of human
sequence upstream of genes.
10-fold cross-validation (CV) was used for each TF. At each round
of CV, 90% of the true sites were used to fit the models (PVLMM,
MDD, WMM). We then scanned the sequences containing the
remaining 10% of the sites, and selected the N top-scoring
sequences as putative binding sites, where N was the size of the
remaining 10%. Thus we have sn = sp.
55
TFBS: results for 25/61
56
Three transcription factors
Sensitivity
wmm: 0.37
pvlmm: 0.54
mdd: 0.37
wmm: 0.38
pvlmm: 0.55
mdd: 0.45
wmm: 0.54
pvlmm: 0.72
mdd: 0.64
57
Splice donor sites
Mammalian donor sequences from SpliceDB,
Burset et al (2001).
15,155 canonical donor sites of length 9, with
GT conserved at position 0 and 1.
Mammalian donor site
58
Beginning of the splicing process
Splice donor U1snRNP(RNA+proteins)
Exon
59
Interpretation of PVLMM model selected
Mammalian donor site
Base -3
A
33
C
37
G
18
T
12
-2
61
13
12
14
U1 snRNA G
U
-1 0 +1 +2 +3 +4 +5
10 0 0 53 71 7 16
3 0 0
3 8 6 16
80 100 0 42 12 81 22
7 0 100 2 9 6 46
C
C
A
U
U
C
A60
“Long-range” dependence in the model:
5’/3’ compensation

U1 sn RNA: G
U
C
Optimal permutation: +2
C
+3
A
+4
U
-1
U
C
A
-2 +5 -3 61
Integrating into a gene finder
SLAM (Pachter et al, 2002): a eukaryotic crossspecies gene finder (generalized pair hidden
Markov model (GPHMM)).
Results at the exon level
VLMM
MDD
PVLMM
Sensitivity
360/465
365/465
371/465
Specificity
360/470
365/465
371/464
62
Conclusions
• Systematic framework for probabilistic modeling
of short DNA motifs
• Statistical modeling issues
• Better prediction performance by modeling
local & non-local dependence
• Good biological interpretations
63
Future work
•
•
•
•
Model the dependence in protein motifs
Extend to find de novo motifs
Find a way to deal with indels
Incorporate local sequence information into
PVLMM-based predictions
• Jointly model multiple motifs in one species and
sites across species
64
References
Biological Sequence Analysis
R Durbin, S Eddy, A Krogh and G Mitchison
Cambridge University Press, 1998.
Bioinformatics The machine learning approach
P Baldi and S Brunak
The MIT Press, 1998
Post-Genome Informatics
M Kanehisa
Oxford University Press, 2000
Find short DNA motifs using permuted Markov models
X Zhao, H Huang and T Speed, RECOMB2004.
65
Acknowledgements
Terry Speed, UCB & WEHI
Haiyan Huang, UCB
The SLAM team:
Simon Cawley, Affymetrix
Lior Pachter, UCB
Marina Alexandersson, FCC
Sourav Chatterji, UCB
Mauro Delorenzi (ISREC)
WEHI bioinformatics lab
66