Transcript Tal Frank
Building a dictionary
for genomes
By Harmen J. Bussemaker, Hao Li, and Eric
D. Siggia
Tal Frank
Topics that will be discussed
Biological background
Present the biological problem
Show an algorithm that treats this problem
statistical mechanics methods
Try our algorithm on two well known problems
What we did so far
Human Genome Project(2001)
This article published(2000) :Sequence is not
everything - Lets do some theory
Control over gene expression - when, how much
Control element = Regulator = Sequence motif
Genes are working together = Co-regulated genes
The goals of this work
Identify the Control element
Where are they located ?
Identify co-regulated genes
Multiple control elements
Example : where are the control elements located?
Concepts: directionality , upstream ,in the junk
TACGAXTTCGA
Example: co-regulated genes
naïve approach :TACGAXTTTAAYATGGCA
experimentally :TACGAXTTCGAYATGGCA
To activate set of genes: multiple sequences needed
New terminology
DNA = string of letters
Control element = word
Multiple control element = sentences
Genes and junk = background noise
Example : S: …GAGCXTGGYGCTT……
words = {GA,TG}
sentence = GA.TG
background = genes and junk.
MobyDick algorithm
decipher a ‘‘text’’ consisting of a long
string of letters written in an unknown
language.
Find the words in the text
Find the right spacing
example : D={A,T,AT} S=ATT
P1=A.T.T
P2=AT.T
How would you do it ?
1.Look for repeated substring in the string :
Tal went to Weizmann this morning. When he arrived he
didn’t go to his office, he went to drink a cup of coffee ….
{went, to, he} D (dictionary)
2.Space the text – ooopps Spacing is not that
simple.
e.g.– D={A,T,AT} S=ATT
P1=A.T.T
p1
P2=AT.T
p2
MobyDick Blueprints
1 letter word
S=TAGATAT
Find pw
D={T,A,G}
pw ={pA,pT..}
2 letter word
D={A,TA,…}
S=TAGATAT
Find pw
pw ={pA,pTA.}
No more optional words stop!
Find spacing
S=TA.G.A.TA.T
statistical mechanics in
order to ?
1.How does MobyDick decide {pw}?
2.When does MobyDick add a new
word?
3.Space (parse) the text.
The likelihood function
z pw
k
k:
Nw ( k )
w
a possible spacing
Nw: number of times the word w appears
Example : D=(T,AT,A) S=TATA
k1=T.A.T.A
Z p A2 pT2 p A pT p AT
k2=T.AT.A
Likelihood function - intuition
Z(D,{pw})- partition function: <E>,<N>,<T>,….
Z(D,{pw})- the probability to obtain a
sequence S.
Example : D =(T,AT,A) {pT,pA,pAT}
Question : what is the probability to S=TATA?
1st possibility : T.A.T.A pA*pA*pT*pT
2nd possibility: T.AT.A pT*pAT*pA
p p A p p A pT p AT Z
2
2
T
Finding {pw}
Given : D,S
Maximize Z({pw},D) with respect to {pw}
This {pw} gives the highest probability
to get the given S
Lets find the {pw} !
Definition : N w - average number of the word w
over the different spacings .
Can prove: N w pw ( ) ln Z
pw
Nw
maximize Z- solve: pw
w' N w'
solving : is done by iteration:
pw’
<Nw’>
pw
Enough is enough !!!
When is pw good enough ?
when the new {pw} don’t give higher Z
We say : this method converges !
Other methods don’t converge.
Why finding {pw} using this
way ?
Monte-Carlo methods don’t converge.
Slow method can transform to fast method
Order of complexity O(LDl)
L-the length of the string
D-the size of the dictionary
l-the length of the longest word in D
Add new words ?
Look at dictionary
D={T,A,C,G} S=TATTGA
Compose new word ww’
D={T,A,C,G} S=TATTGA ww’=TA
Check occurrence
D={T,A,C,G} S=TATTGA ww’=TA
[ N _ ww '] [ N w pw' ]?
Yes- add to dictionary
D={T,A,C,G,TA} S=TATTGA
A problem and a bad
solution
The algorithm finds only the words which are
composed from words already in the dictionary.
Example : S=AATATAAA
1st step : S=AATATAAA
D= {A}
2nd step : S=AATATAAA
AT is not a composition of words
Solution: Look for repeated long strings
by consideration the problem
Spacing
Define :
w number of times the word w occurs in
a given spacing.
Quality factor :
Nw
Qw
w
The required condition :
Qw 1
checking the algorithm
Applying on the English novel Moby Dick
Applying on Control elements on the yeast
genome
Not always possible - Voynich manuscript
(1450)
Preparing the book
MobyDick
Call me Ishmael. Some years ago- never mind how long precisely- having little
or no money in my purse, and nothing particular tothought I would sail …
CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavingli
ttleornomoneyinmypurseandnothingparticulartothoughtIwouldsail..
CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlon
Eciselyhavinglittlermsdrornomoneyinmypurseandnothingparticu
artothoughtIwouldsail …
Results- MobyDick
10 first chapters
D={a,b,c….}
Text : 4,214 unique words
2,630 occurred only once
Background – increases L by the factor of 3.
2,450 words found , 700 in English, 40
composite words.
Results- yeast
D={T,A,C,G}
Text : 443 experimentally determined sites
Background – genes and junk
500 words found
114 match the experimentally predictions
Not that good – it is a beginning!
The end