Genome evolution: a sequence

Download Report

Transcript Genome evolution: a sequence

Genome Evolution. Amos Tanay 2010
Genome evolution:
Lecture 11: Transcription factor
binding sites
Genome Evolution. Amos Tanay 2010
Sequence specific transcription factors
•
•
•
Sequence specific transcription factors (TFs) are a critical part of any gene activation
or gene repression machinary
TFs include a DNA binding domain that recognize specifically “regulatory elements” in
the genome.
The TF-DNA duplex is then used to target larger transcriptional structure to the
genomic locus.
Genome Evolution. Amos Tanay 2010
Sequence specificity is represented using consensus
sequences or weight matrices
•
•
•
•
The specificity of the TF binding is central to the understanding of the regulatory
relations it can form.
We are therefore interested in defining the DNA motifs that can be recognize by
each TF.
A simple representation of the binding motif is the consensus site, usually derived by
studying a set of confirmed TF targets and identifying a (partial) consensus.
Degeneracy can be introduced into the consensus by using N letters (matching any
nucleotide) or IUPAC characters (erpresenting pairs of nucleotides, for exampe
W=[A|T], S=[C|G]
A more flexible representation is using weight matrices (PWM/PSSM):
ACGCGT
ACGCGA
ACGCAT
TCGCGA
TAGCGT
•
1
2
3
4
5
6
A
60%
20%
0
0
20%
40%
C
0
80%
0
100%
0
0
G
0
0
100%
0
80%
0
T
40%
0
0
0
0
60%
PWMs are frequently plotted using motif logos, in which the height of the character
correspond to its probability, scaled by the position entropy
Genome Evolution. Amos Tanay 2010
TF binding energy is approximated by weight matrices
We can interpret weight matrices as
energy functions:
E ( s )   wi [ si ]
i
wi [ si ]  log( pi [ si ])
This linear approximation is reasonable
for most TFs.
Leu3 data (Liu and Clarke, JMB 2002)
Genome Evolution. Amos Tanay 2010
TF binding affinity is kinetically important, with possible functional
implications
Ume6
•
s
Stronger prediction
Average PWM energy
11.5
5.5
ChIP ranges
Stronger binding
Tanay. Genome Res 2006
Kalir et al. Science 2001
Genome Evolution. Amos Tanay 2010
TFs are present at only a fraction of their optimal sequence tragets.
Binding is regulated by co-factors, nucleosomes and histone
modifications
An
epigenetic
signature of
promoters
(Heinzman
et al., 2007)
Genome Evolution. Amos Tanay 2010
TFs are present at only a fraction of their optimal sequence tragets.
Binding is regulated by co-factors, nucleosomes and histone
modifications
An epigenetic
signature of
enhancers
(Heinzman et
al., 2007)
Genome Evolution. Amos Tanay 2010
TFs are present at only a fraction of their optimal sequence tragets.
Binding is combinatorially regulated by co-factors, nucleosomes and
histone modifications
Active
Inactive
Barski et al. Cell 2007
Genome Evolution. Amos Tanay 2010
Specific proteins are identifying enhancers
Here are studies of p300 binding in the developing mouse brain
(visel et al. 2009)
Genome Evolution. Amos Tanay 2010
TFBSs are clustered in promoters or in “sequence modules”
•
•
•
•
The distribution of binding sites in the genome is non uniform
In small genomes, most sites are in promoters, and there is a bias toward
nucleosome free region near the TSS
In larger genomes (fly) we observe CRM (cis-regulatory-modules) which are
frequently away from the TSS. These represent enhancers.
A single binding site, without the context of other co-sites, is unlikely to represent a
functional loci
Genome Evolution. Amos Tanay 2010
Constructing a weight matrix from aligned TFBSs is trivial
• This is done by counting (or “voting”)
• Several databases (e.g., TRANSFAC, JASPAR) contain matrices
that were constructed from a set of curated and validated binding
site
• Validated site: usually using “promoter bashing” – testing reported
constructs with and without the putative site
Transfac 7.0/11.3 have 400/830 different PWMs, based on more than
11,000 papers
However, there are no real different 830 matrices outthere – the real
binding repertoire in nature is still somewhat unclear
Genome Evolution. Amos Tanay 2010
Probabilistic interpretation of weight matrices and a
generative model
•
One can think of a weight matrix as a probabilistic model for binding sites:
k
P (m)   Pi (m[i ])
i 1
•
•
This is the site independent model, defining a probability space over k-mers
Given a set of aligned k-mers, we know that the ML motif model is derived by voting
(a set of independent multinomial variables – like the dice case)
•
Now assume we are given a set of sequences that are supposed to include binding
sites (one for each), but that we don’t know where the binding sites are.
In other words the position of the binding site is a hidden variable h.
We introduce a background model Pb that describes the sequence outside of the
binding site (usually a d-order Markov model)
Given complete data we can write down the likelihood of a sequence s as:
•
•
•
|S |
Pback ( s )   Pback ( s[i ] | s[i  d ..i  1]))
i 1
k
P( s, l |  )  Pback ( s ) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
i 1
Genome Evolution. Amos Tanay 2010
Using EM to discover PWMs de-novo
•
Inference of the binding site location posterior:
P(l | s, 1 )  P( s, l |  1 ) /  P( s, i |  1 )
i
•
Note that only k-factors should be computed for each location (Pb(s) is constant))
•
•
Inference of the binding site location posterior:
Note that only k factors should be computed for each location (Pb(s) is constant))
•
Starting with an initial motif model, we can apply a standard EM:
E:
M:
•
P(l | s, 1 )  P( s, l |  1 ) /  P( s, i |  1 )
Pi (c)  
 P(l | s
i
j
,  ) ( s j [l  i ], c)
1
j l  0..| S |
As always with the EM, initializing to reasonable PWM would be critical
Following Baily and Elkan, MEME 1995
Genome Evolution. Amos Tanay 2010
Allowing false positive sequences
•
If we assume some of the sequences may lack a binding site, this should be
incorporated into the model:
k
P( s, l |  )  P(hit ) * Pback ( s ) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
i 1
•
This is sometime called the ZOOPS model (Zero or one positions)
l
hit

s
•
In Bayesian terms:
–
–
•
•
Probability of sequence hit P(hit | S)
Probability of hit at position l = Pr(l|S)
We can consider the PWM parameters as variables in the model
Learning the parameters is then equivalent to inference
Genome Evolution. Amos Tanay 2010
Using Gibbs sampling to discover PWMs de-novo
•
We can use Bibbs sampling to sample the hidden
sites and estimate the PWM
l
hit

P(l j | l1 ,.., l i 1 , l i 1 ,.., l n , S )
s
•
•
This is done by estimating the PWM from all
locations except for the one we sample, and
computing the hit probabilities as shown before
l
hit
Note that we are working with the MAP (Maximum
a-posteriori)  to do the sampling:
s
P(l j |  MAP )

MAP
l
 arg max L(l ,.., l , l ,.., l | S , )
1
i 1
i 1
n
hit

•
But this can be shown to approximate:
s
P(l j | l1 ,.., l i 1 , l i 1 ,.., l n , )
Gibbs: Lawrence et al. Science 1993
Genome Evolution. Amos Tanay 2010
Generalizing PWMs to allow site dependencies: mixture of
PWMs and Trees
Tree motif
Mixture of PWMs
P ( s, l |  ) 
We only change the motif
component of the
likelihood model
Pback ( s )  P( s[l..l  l ] |  )
 k

  Pback ( s[l  i ] | s[l  i  d ..l  i  1]) 
 i 1

Learning the model can become more difficult
This is because computing the ML model parameter from complete
data may be challenging
Barash et al., RECOMB 2003
Genome Evolution. Amos Tanay 2010
Discriminative scores for motifs
•
•
•
So far we used a generative probabilistic model to learn PWMs
The model was designed to generate the data from parameters
We assumed that TFBSs are distributed differently than some fixed background
model
•
If our background model is wrong, we will get the wrong motifs..
•
A different scoring approach try to maximize the discriminative power of the motif
model.
We will not go here into the details of discriminative vs. generative models, but we
shall exemplify the discriminative approach for PWMs.
•
Lousy discriminator
High specificity discriminator
High sensitivity discriminator
Genome Evolution. Amos Tanay 2010
Hypergeometric scores and thresholding PWMs
Number of sequences
 | A |  n | A | 



k  | B | k 

P(| A  B | k ) 
 n 


|
B
|


Hyper geometric probability
(sum for j>=k is the hg p-value)
Positive
True positive
PWM score threshold
For a discriminative score, we need to decide on both the PWM model and the
threshold.
Genome Evolution. Amos Tanay 2010
Exhaustive k-mer search
• A very common strategy for motif finding is to do exhustive k-mer
search.
• Given a set of hits and a set of non hits, we will compute the number
of occurrences of each k-mer in the two sets and report all cases
that have a discriminative score higher than some threshold
• Since k-mers either match or do not match, there is no issue with
the threshold
• For DNA, we will typically scan k=5-8.
• This can be done efficiently using a map/hash:
– Iterate on short sequence windows (of the desired k length)
– For each window, mark the appearance of the k-mer in a table
– Avoid double counting using a second map
• It is easy to generalize such exhaustive approaches to include gaps
or other types of degeneracy.
Genome Evolution. Amos Tanay 2010
Refining k-mers to PWMs using heuristic “EM”
•
•
K-mer scan is an excellent intial step for finding refined weight matrices. For
example, we can use them to initialize an EM.
If we want to find a weight matrix, but want to stick to the discriminative
setting, we can heuristically use and “EM-like” algorithm:
–
–
–
–
Start with a k-mer seed
Add uniform prior to generate a PWM
Compute the optimal PWM threshold (maximal hyper-geometric score)
Restimate the PWM by voting from all PWM true positives
• Consider additional PWM positions
• Bound the position entropies to avoid over-fitting
– Repeat two last steps until fail to improve score
•
There are of course no guarantees for improving the scores, but empirically
this approach works very well.
Genome Evolution. Amos Tanay 2010
High density arrays quantify TF binding preferences and
identify binding sites in high throughput
•
•
Using microarrays (high resolution tiling arrays) we can now map binding sites in a
genome-wide fashion for any genome
The problem is shifting from identifying binding sites to understanding their function
and determining how sequences define them
Harbison et al., Nature 2004
Genome Evolution. Amos Tanay 2010
Genome Evolution. Amos Tanay 2010
Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains
(here just a library of homeodomains, from Berger et al. 2008)
Genome Evolution. Amos Tanay 2010
Profiling binding affinity to the entire k-mer spectrum provide direct
quantification of in-vitro affinity (Badis et al., 2009)
104 TFs
8-mers
Heatmap of 2D hierarchical
agglomerative clustering
analysis of 4740 ungapped 8mers over 104 nonredundant
TFs, with both 8- mers and
proteins clustered using
averaged E-score from the
two different array designs.
If only biology was that simple…
Genome Evolution. Amos Tanay 2010
Discrete and
deterministic “binding
sites” in yeast as
identified by Young,
Fraenkel and colleuges
In fact, binding is rarely deterministic and discrete, and simple wiring is something you
should treat with extreme caution.
Genome Evolution. Amos Tanay 2010
Correlation between PWM predicted binding and ChIP experiments spans high,
medium and low affinity sites
ABF1
14
8
r = 0.42
r = 0.26
2
-2
2
ChIP log(binding ratio)
6
PWM sequence energy
r = 0.28
r = 0.8
PWM sequence energy
PWM sequence energy
r = 0.11
r = 0.74
GCN4
-11
-13
r = 0.42
r = 0.20
-15
-2
2
ChIP log(binding ratio)
6
r = 0.21
r = 0.72
MBP1
- 12 . 5
- 14 . 5
r = 0.42
r = 0.28
- 16 . 5
-2
2
6
ChIP log(binding ratio)
PWM regression exploits variable levels of binding affinity to robustly recover
binding preferences.
Motif regression optimizes the PWM given the overall correlation of
the predicted binding energies and the measured ChIP values vs
  arg max spearman( F (s |  ), vs )
Tanay, GR 2006
Genome Evolution. Amos Tanay 2010
TFBS evolution: purifying selection and conservation
TF1
TF1
Similar function
CACGCGTT
CACGCGTA
Neutral evolution
TF1
Disrupted function
CACGCGTT
CACGAGTT
Low rate
purifying selection
TF2
TF1
Altered function
CACGCGTT
CACACGTT
Low rate
purifying selection
Altered affinity
CACGCGTT
CACACGTT
Rate?
Selection?
Genome Evolution. Amos Tanay 2010
Binding sites conservation
Kellis et al., 2003
Genome Evolution. Amos Tanay 2010
Binding sites conservation: heuristic motif identification
Kellis et al., 2003
Genome Evolution. Amos Tanay 2010
Analyzing k-mer evolutionary dynamics
• Instead of trying to identify conserved motifs try to infer the
evolutionary rate of substitution between pairs of k-mers
• Start from a multiple alignment and reconstruct ancestral sequences
(assuming site independence, or even max parsimony)
• Now estimate the number of substitution between pairs of 8-mers,
compare this number to the number expected by the background
model
• Do it for a lot of sequence, so that statistics on the difference
between observed and expected substitutions can be derived
Genome Evolution. Amos Tanay 2010
Saccharomyces TFBS Selection Network
Inter-island organization in
the Reb1 cluster: selection hints
toward multi modality of Reb1
Nodes: octamers
node
conservation
conserved @ 2SD
conserved @ 3SD
otherwise
Arcs: 1nt substitution
arc Rate
Selection
Normal
neutral
Low
negative
not enough stat
Tanay et al., 2004
Genome Evolution. Amos Tanay 2010
Leu3 selection network
Substitution changing
high affinity to high
affinity motifs
0.3
TF1
0.2
Altered affinity
Motif 1
Motif 2
Rate?
Selection?
High Affinity
(Kd < 60)
Meidum Affinity
(400 > Kd > 60)
0.1
High rate subs.
0
-5 -4 -3 -2 -1 0 1 2 3
log delta affinity
Substitution changing
high affinity to low
affinity motifs
Genome Evolution. Amos Tanay 2010
A simple transcriptional code and its evolutionary
implications
TF5
AAATTT
AATTTT
AAAATT
TF3
GATGAG
GATGCG
GATGAT
TF4
ACGCGT
TCGCGT
ACGCGT
TF1
CACGTG
CACTTG
TF2
TGACTG
TGAGTG
TGACTT
Genome Evolution. Amos Tanay 2010
The Halpren-Bruno model for selection on affinity
The basic notion here is of the relations between sequence, binding and function/fitness
Sequence
Binding energy
Function
E (S )
F (E)
We argued that E(S) can be approximated by a PWM
F(E) is a completely different story, for example:
Is there any function at all to low affinity binding sites?
Is there a difference between very high affinity and plain strong binding sites?
Are all appearances of the site subject to the same fitness landscape?
Genome Evolution. Amos Tanay 2010
The Halpren-Bruno model for selection on affinity
We work on deriving the substitution rate at each position of the binding site, given its observed
stationary frequency. We are assuming that the fitness of the site is defined by multiplying the
fitnesses of each locus. This means fitness is generally linear in the binding energy!
According to Kimura’s theory, an allele with
fitness s and a homogeneous population would
fixate with probability:
Assuming slow mutation rate (which allow us to
assume a homogenous population) and motifs
a and b with relative fitness s the fixation
probabilities (chance of fixation given that
mutation occurred!) are:
If p represent the mutation probability, and p the
stationary distribution, and if we assume the
process as a whole is reversible then:
(Halpern and Bruno, MBE 1998)
p p 
ln  b ba 
p p
f ab   a ab 
p p
1  a ab
p b pba
1  e 2 s
1  e 2 Ns
fitness  1  s, s  1
1  e 2 s
2s
f ab 

 2 Ns
1 e
1  e  2 Ns
1  e2s
 2s
f ab 

1  e 2 Ns 1  e 2 Ns
2s
1  e 2 Ns e 2 Ns  1
2 Ns
f ab / f ba 



e
1  e 2 Ns
 2s
1  e 2 Ns
p b pba f ba
p p
f
 1  b ba  ab  e 2 Ns
p a pab f ab
p a pab f ba
p p 
ln  b ba 
p p
rab  c  pab   a ab 
p p
1  a ab
p b pba
Genome Evolution. Amos Tanay 2010
The Halpren-Bruno model for selection on affinity
The HB model is limited for the study of general sequences.
When restricting the analysis to relatively specific sites, HB is not completely off
Moses et al., 2003
Genome Evolution. Amos Tanay 2010
Testing the general binding energy – fitness correspondence
•
While E(S) is approximated by a PWM, F(E) is
unlikely to be linear
•
Assume that the background probability of a
motif a is P0(a). In detailed balance, and
assuming the fitness of a at functional sites is
F(a), the stationary distribution at sites can be
shown to be:
Expected and observed energy
distribution in E.Coli CRP sites
(left) and background (right)
Q(a)  Po (a)e 2 NF ( a )
•
If we collapse all sites with binding energy E
(and hence the same F(a)=F(E(a))
Q( E )  Po ( E )e 2 NF ( E )
•
The entire genome should behave like a
mixture of background sequance and functional
loci:
W ( E )  (1   ) Po ( E )  Q( E )
•
Inferred F(E), is shown in Orange
So we can try and recover Q(E) and therefore
F(E) from the maximum likelihood parameters
fitting an empirical W(E)
Comparison of CRP energies in
E.coli and S. typhimurium
(Hwa and Gerland, 2000-)
Mustonen and Lassig, PNAS 2005
Genome Evolution. Amos Tanay 2010
More tests for possible conservation of low binding energy
sites
Simulation
S. mikitae
S. cerevisiae
(Neutral, context aware)
High affinity
ΔE
ΔE
..
..
ΔE
ΔE
..
..
1
KS statistics
0.8
0.6
Low affinity
0.4
0.2
0
0
0.25
0.5
Genome Evolution. Amos Tanay 2010
More tests for possible conservation of low binding energy
sites
Binding site
conservation
Conservation
of total
energy
Reb1
S
Conservation score
S
S
60
50
40
30
20
10
0
0
Ume6
Conservation score
20
Cbf1
20
Gcn4
Mbp1
20
20
15
15
15
15
10
10
10
10
5
5
5
5
0
0
0
50
100
binding energy percentile
0
0
50
100
binding energy percentile
50
100
binding energy percentile
0
0
50
binding energy percentile
100
0
50
binding energy percentile
100
Tanay, GR 2006
Genome Evolution. Amos Tanay 2010
Algorithms for discovering conserved sites:
• Assuming PWM models
– Search for loci that behave as predicted by the model
(P(s|TFBS)/P(s|back) > Threshold)
– Search for genomic regions with surprisingly many conserved binding
sites
• Search for an ML PWM
– Search for motifs and conserved sites in aligned sequences
• Assume the phylogeny, alignment, look for a PWM that will optimize the
likelihood of the data
– Search for motifs and conserved sites in unaligned sequences
• Assume the phylogeny, look for an ML PWM
• Search for a general ML evolutionary model (many PWMs)
– Search for a set of PWMs/Motif model that will maximize the likelihood
of the data
Genome Evolution. Amos Tanay 2010
An evolutionary model
P(si>si’|si-1)
Phylo Tree
Neutral model
Motif (PWM?)
TFBS evo model
Substitution probability = background *(fixation)
Substitution probability = HB model
Genome Evolution. Amos Tanay 2010
The PhyME/PhyloGibbs model (Sinha, Blanchette, Tompa 2004,
Sidharthan,Siggia,Nimwegen 2005, based on evo model by Siggia and collegues)
our presentation is a bit generalized and adapted..
•
•
Earlier/Other similar approaches using EM
Evolutionary model:
–
–
–
•
A phylogeny
Neutral independence model (simple tree, probabilities on branches)
A PWM-induced fixation probabilities that are proportional to the matrix weights:
xi  paxi
  paxi  xi wk [ xi ]

Pr( xi | paxi , l (i )  k )  1  
xi  pax

paxi  xi wk [ xi ]

xi

Recall the single species generative model
k
P( s, l |  )  Pback ( s ) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
i 1
•
The generalization to alignments is
similar, but should consider a
phylogeny at each locus
Back
Motif
Back
Genome Evolution. Amos Tanay 2010
Phylogenetic generative model
TFBS
1
2
3
4
xi  paxi
  paxi  xi wk [ xi ]

Pr( xi | paxi , l (i )  k )  1  
xi  pax

paxi  xi wk [ xi ]

xi

Pr( xi | paxi , l (i )  no)   paxi  xi
Joint probability –
product over
independent trees
Inference – for loci modes (l(i))
and for ancestral sequences
(hierachicaly as in Ex 3)
Learning – find ML PWM matrix
Prob of emitting the alignment given the mode
log Pr( s, l | w,  )   Pr(l (i)) Pr( s i | l (i), w,  )
i
Prior of having a motif/background
Pr(l (i) | s, w), Pr( xi | l (i), s, w)
arg max L( w | S )  Pr( S | w)
w
Genome Evolution. Amos Tanay 2010
EM for the PWM parameters
Key point: the positions in the tree are independent once we decide on their “mode”
(background or a certain positions in a binding site)
Complete data: the mode of each position (l(i)). The ancestral sequences (h)
Joint probability:
Pr( s, l | w,  )   Pr(l (i)) Pr( s i | l (i), w,  )
i
The EM target:
Posterior of the missing data
Joint probability

 
 
i
i

Q( w | w' )   Pr(l , h | S , w' ) log   Pr(l (i)) Pr( x j | pax j , w, l (i)) 

l ,h
j
 i 
 
Complete data!

We can use the usual trick and decompose this into a sum of independent terms, one
for each PWM position. For position k we have:



i
i


Pr(
l
(
i
)

k
|
s
,
w
'
)
log(Pr(
l
(
i
)

k
)
Pr(
h
|
l
(
i
)

k
,
s
,
w
'
)
log
Pr(
x
|
pax
,
l
(
i
)

k
,
w
)

i 
h
j
j
k 
 j



 log(  paxi  xi wk [ xi ]) 



i
i
log(Pr( l (i )  k )  Pr(l (i )  k | S , w' )  Pr( x j , x paj | s, w' , l (i )  k )log( 1   
paxi  y w k [ y ]) 
i 
j x ij , x ipaj

i

y! x paj



Genome Evolution. Amos Tanay 2010
EM for the PWM parameters
Average number of times of we observed the variables

 log(  paxi  xi wk [ xi ]) 



i
i
log(Pr( l (i )  k )  Pr(l (i )  k | S , w' )  Pr( x j , x paj | s, w' , l (i )  k )log( 1   
paxi  y w k [ y ]) 
i 
j x ij , x ipaj


y! x ipaj


Log of the optimized parameter
In simple words:
•we are optimizing a weighted sum of log(x*w) or log(1-Sxw)
•The coefficient for the mutations paxj -> xj at PWM positions k is the average
number of times we observe it given the previous parameter set w’
The optimization problem is solved using Lagrange multipliers (to satisfy the constraint
on the wk). But because each parameter appears in terms of two forms (log(x*w) and
log(1-x*w), the solution is not as trivial as the dice case.
Genome Evolution. Amos Tanay 2010
Example for intra-site epistasis
AAACGT
ACGCGT
Assume the following TFBS alignment (where ACGCGT is the motif)
According to the simple TFBS evo model, we
should “pay” twice for the loss of the ACGCGT
site, since each of the two mutations would be
multiplied by a very low fixation probability
ACGCGT
“Double
Loss”
AAACGT
The more realistic scenario involve one mutation
that was under pressure, but then neutrality
ACGCGT
ACGCGT
“Loss”
Multiple possible trajectories can have different
loss/gain/neutrality dynamics
AAGCGT
ACGCGT
“Neutrality”
AAACGT
Assuming Affinity=PWM=Fitness gives the Halpren Burno
model, or the frequency selection approximation
When fitness(PWM) is non linear, we have epistasis which
means problems for the simple loci-independent model
Affinity
Affinity
PWM
PWM
Fitness
Fitness
Genome Evolution. Amos Tanay 2010
An evolutionary model (2)
P(si>si’|si-1)
Phylo Tree
Neutral model
TF Target Sets
Selection factors
(per TF)
To fully capture the effect of binding sites, we need to study a Markov
model over the entire sequence.
Instantaneous rates: background * selection on TFBS changes
Mutation rate = background*selection factor
*selection factor
Genome Evolution. Amos Tanay 2010
Controlling context effects: approximately independent blocks
•The sequence is decomposed into
intervals, each containing one or
more overlapping binding sites, or
sequences that are one mutation
away from a binding site. Such
intervals are called epistatic
intervals
•The joint probability is written as a
product over epistatic intervals
•For each epistatic interval we should
compute exp(Qt)(from, to), where the
rate matrix is large (4d) when d is the
interval size
•This is approximated by (exp(Qt/N))N
Genome Evolution. Amos Tanay 2010
Learning model parameters
•Finding the optimal selection factor is
solved by non-linear optimization of
the likelihood
•Looking for target set is done using a
greedy algorithm.
•The resulted target sets and their
selection factors are analogous to
motifs/PWMs with some additional
evolutionary parameter indicating the
strength of selection (or the
sufficiency of the motif to determine a
functional site)
•As shown to the left, similar motifs
are sometime separated by significant
selection factors, suggesting
functional partitioning.