Transcript Document

Discovery of Conserved
Sequence Patterns Using a
Stochastic Dictionary Model
Authors Mayetri Gupta
& Jun S. Liu
Presented by Ellen Bishop
12/09/2003
1
“ofallthewordsinthisunsegmentedphraseth
erearesomehidden”
The challenge is to develop an algorithm
for DNA sequences that can partition
the sequence into meaningful “words”
2
Presentation Outline





Introduction
MobyDick
Stochastic Dictionary-based Data
Augmentation (SDDA)
Algorithm Extensions
Results
3
Introduction

Some new challenges now that there
are publicly available databases of
genome sequences:


How do genes regulate the requirements of
specific cells or for cells to respond to
changes?
How can gene regulatory networks be
analyzed more efficiently?
4
Gene Regulation

Transcription Factors (TF) play a critical
role in gene expression

Enhance it or Inhibit it


Short DNA motifs 17-30 nucleotides long often
correspond to TF binding sites
Build model for TF binding sites given a set
DNA sequences thought to be regulated
together
5
MobyDick





Dictionary building algorithm developed
in 2000 by Bussemaker, Li and Siggia
Decomposes sequences into the most
probable set of words
Start with dictionary of single letters
Test for the concatenation of each pair
of words and its frequency
Update dictionary
6
MobyDick results

Tested on the first 10 chapters of Moby
Dick


4214 unique words, @1600 of them
repeats
Result had 3600 unique words

Found virtually all 1600 repeated words
7
SDDA

Stochastic Dictionary-based Data
Augmentation


Stochastic words represented by probabilistic word
matrix (PWM)
Some definitions

D=dictionary size

=sequence data generated by concatenation of words



D ={M1 .,… MD} the concatenated of words, including
single letters
P =p(M1)…p(MD) probability vector
Ai ={Aik … Ank} denotes the site indicators for
motifs Mk (Aik=1 or 0)
8
SDDA

Some definitions







q=4, also A,G,C,T are the first 4 words in dictionary
={P1 .,… Pk} sequence partition so each part Pi
corresponds to a dictionary word
N() = total number of partitions
NMj()=number of occurrences of word type Mj in the
partition
wj .(j=1…D) denotes word lengths
The D-q motif matrices are denoted by {q+1 … D}=
(D)
If the k th word is width w then its probability matrix is
k = {1k … wk}
9
Probabilistic Word Matrix
A
C
G
T
.85
.05
.1
0
.07
.78
.05
.1
.8
.07
.12
.01
.02
.01
.96
.01
.12
.01
.85
.02
ACAGG=.85*.78*.8*.96*.85=.4328
GCAGA=.1*.78*.8*.96*.12=.0072
10
General idea of Algorithm

So we start with D(1)={A,G,C,T} and estimate
the likelihood of those 4 words in the dataset.
Then we look at any pair of letters, say AT. If
it is over-represented and in comparison to
D(1) then it is added to the dictionary D(2)
and this is repeated for all the pairs.
Consider all the concatenations of all the
pairs of words in D(n) and form a new
dictionary D(n+1) by including those new words
that are over-represented or more abundant
than by chance.
11
SDDA Algorithm
1) Partitioning: sample for words given the current
value of the stochastic word matrix and word usage
probabilities


Do a recursive summation of probabilities to evaluate the
partial likelihood up to every point in the sequence
Li()=P([i-wk+1:j]|)Li-wk()
Words are sampled sequentially backward, starting at the end
of the sequence. Sample for a word starting at position i,
according to the conditional probability
P(Aik=1|Ai+wk,)=P([i:i+wk-1]| k,p)Li-1()/Li+wk-1()
If none of the words are selected then the appropriate single
letter word is assumed & k is decremented by 1.
12
SDDA Algorithm
2) Parameter Update:
Given the partition A,



update the word stochastic matrix  D
update the word probabilities vector P
by sampling their posterior distribution
3) Repeat steps 1 and 2 until convergence, when MAP
(maximum a posteriori) score stops increasing. This
is a method of “scoring” optimal alignment and is
calculated with each iteration.
4) Increase dictionary size D=D+1. Repeat again from
step 1 but now D-1 is a known word matrix
13
Algorithm Extensions




Phase Shift via Metropolis steps
Patterns with variable insertions and
deletions (gaps)
Patterns of unknown widths
Motif detection in the presence of “low
complexity” regions
14
Phase Shift



If 7,19,8,23 are strongest pattern but algorithm
chooses a1=9, a2=21 early on then it is likely
to also choose a3=10,a4=25
Metropolis steps solution
a ={a1 … am} are starting positions for an
occurrence of a motif
 Choose   1 with probability .5 each
 Update the motif position a+ with
probability min{1, p(a+| )/p(a|)
15
Patterns with:
gaps/unknown widths

Gaps - Additional recursive sum in the partitioning
step(1) using





io Insertion-opening probability
ie Insertion-extension probability
Do Deletion-opening probability
De Deletion-extension probability
Unknown Widths - The authors also enhanced their
algorithm to determine the likely pattern width if it is
unspecified.
16
Motif Detection with “low
complexity” regions



AAAAAAA…
CGCGCGCG…
The stochastic dictionary model is
expected to control this by treating
these repeats as a series of adjacent
words
17
Results

Two case studies are provided


Simulated dataset with background
polynucleotide repeats
CRP binding sites
18
Relative performance of the SDDA
compared to BioProspector & AlignAce
SDDA
SDDA
BP
BP
AA
AA
EVAL2
Success
Falsepositive
Success
Falsepositive
Success
Falsepositive
a) .24
1
.07
1
.02
.3
.43
.48
.6
.06
.7
0
0
-
.72
.5
.12
.1
0
0
-
.96
.7
.02
0
-
0
-
b) .24
1
.03
1
.09
.1
.52
.48
.9
.12
.7
.01
.1
.62
.72
.9
.05
.6
0
.1
.36
.96
1
.03
0
-
0
19
Credits

Slide 6,7

Slide 9

Slide 14 Lawrence, C.E., Altschul, S.F.,Boguski, M.S.,
Bussemaker,H.J., Li, H and Siggia, E.D. (2000),
“Building a Dictionary for Genomes:Identification of Presumptive
Regulatory Sites by Statistical Analysis”,Proceedings of the
National Academy of Science USA, 97, 10096-10100.
Liu,J.S., Gupta,M., Liu, X., Mayerhofere, L. and
Lawrence, C.E.,”Statistical Models for Biological Sequence Motif
Discovery”,1-19
Liu,J.S.,Neuwald, A.F., and Wootton,J.C. (1993), “Detecting
Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple
Alignment”, Science, 262,208-214.
20
Bibliography



Bussemaker,H.J., Li, H and Siggia, E.D. (2000), “Building a
Dictionary for Genomes:Identification of Presumptive Regulatory
Sites by Statistical Analysis”,Proceedings of the National
Academy of Science USA, 97, 10096-10100.
Lawrence, C.E., Altschul, S.F.,Boguski, M.S., Liu,J.S.,Neuwald,
A.F., and Wootton,J.C. (1993), “Detecting Subtle Sequence
Signals: A Gibbs Sampling Strategy for Multiple Alignment”,
Science, 262,208-214.
Liu,J.S., Gupta,M., Liu, X., Mayerhofere, L. and Lawrence,
C.E.,”Statistical Models for Biological Sequence Motif
Discovery”,1-19
21