Transcript Slides

Finding Motifs in Promoter
Regions
Libi Hertzberg
.
Or Zuk
Overview
 Introduction
and Definitions
 P-value Algorithm
 Experimental Results
 Generalizations and future work
2
Transcriptional Regulation
in the cell:
DNA
REGULATOR
Transcription
Factor (TF)
Regulatory
protein
RNA
(singular)
Regulatory polymerase
TF
binds
protein
Promoter motif
Regulatory Regulatory
protein
protein
RNA
polymerase
(complex)
Promoter motif
mRNA of
regulated gene
3
Motif Representation
We need to represent the motif – the TF binding site.
Known binding sites:
There are three known representations:
 Consensus – Most frequent letter in every position
 IUPAC code – Allow all letters with frequency above a
threshold in every position
 Position Specific Weight Matrix – Count number of
occurrences of every letter in every position - More
Informative !
4
An Example:
An alignment of 5 known binding sites of a TF
Position Specific Weight Matrix - F
A
C
T
G
0
1
1
3
1
0
4
0
0
0
5
0
0
3
2
0
0
4
0
1
3
1
1
0
5
Giving a Score to a Potential Binding Site
We
are given a site R=(r1,..., rL). We want to know how
likely it is to be bound by the TF. We compute how well it
fits to the weight matrix of the TF.
We
do this by calculating the Likelihood function of the
site – namely, the probability that it would have been
generated given that it is indeed a binding site of this TF.
6
F=

A
C
0
1
1 0 0 0 3
0 0 3 4 1
T
G
1
3
4 5 2 0 1
0 0 0 1 0
Likelihood(“GATTCC”) = (3/5)*(1/5)*(5/5)*(2/5)*(4/5)*(1/5) = 0.00768
7
The Score
 By
taking log on the likelihood of R we get the
score of R, which is the Loglikelihood of R.
Likelihood(“GATTCC”)
=
(3/5)*(1/5)*(5/5)*(2/5)*(4/5)*(1/5) = 0.00768

score(“GATTCC”) = Loglikelihood(“GATTCC”) =
log(3/5)+log(1/5)+log(5/5)+log(2/5)+log(4/5)+log(1/5) = -4.869
8
The PSSM

From the weight matrix F, we compute a Position Specific
Score Matrix (PSSM) M by :
A 0 1 0 0 0 3
F=
C 1 0 0 3 4 1
T
1 4 5 2 0 1
G 3 0 0 0 1 0

For example, MG,1 = log(3/5)
9
Finding The Motif
We are given a TF and a gene. We want to know if this gene is
regulated by the TF.
Our Input :
 The sequence of the promoter region of the gene
 The PSSM of the TF
A simple Algorithm : Scan the promoter region, and at each
position calculate the score according to the PSSM. Take the
best position (i.e. the one with the highest score) to be the
suspected binding site.
-2.3
-5.2
-4.5
-1.2
-0.5
AAGTTGCCGAGATCGTAGCTATCGATCGATCGACAGCTAAC
10
The Problem
Problem : In any (e.g. random) sequence we will find some
best position (and best score). How do we assign statistical
significance (p-value) to the position and score we have
found.
-2.3
-5.2
-4.5
-1.2
-0.5
AAGTTGCCGAGATCGTAGCTATCGATCGATCGACAGCTAAC
The Goal of
Our Work
Statistical
Evaluation
11
Overview
Introduction and Definitions
 P-value Algorithm
 Experimental Results
 Generalizations and future work

12
What Do We Want To Calculate?
 Let
N be the promoter length, and L the length of
the TF binding site. Suppose we scanned the
promoter and have found that the maximal score
had the value t.
 The p value is : the probability that the maximal
score in a random sequence of length N, will be
above the threshold t.
13
The Algorithm has two Steps :
1. Finding the set of all the sequences of length L,
with a score above the threshold t.
2. Calculate the probability of finding at least one of
those sequences in a random sequence of length
N.
14
Step One :
Finding the sequences
 Let
K = K(t) be the number of sequences of length
L (out of the 4L) with a score above t.
 We have a branch and bound algorithm for
enumerating them in time linear in K.
 Problem : In some cases K might be too large. For
example, Suppose L=20, and only one of a
thousand sequences of length L has a score higher
than t. It means K=420/1000 = 415 ~ 1billion.
15
Approximating K
 If
K is too large we cannot enumerate all the K
sequences, but only try to estimate their number
(i.e. K).
 There are various methods to do so (Gaussian
approximation, Statistical Mechanics, Large
Deviations techniques). We used Generating
Functions method, which proved to be the best.
The method can give both lower and upper
bounds on the correct number K.
16
17
Two Steps :
1. Finding the set of all the sequences of length L,
with score above the threshold t.
2. Calculate the probability of finding at least one of
those sequences in a random sequence of length
N.
18
Step Two : Calculating Probabilities
 We
are given a set of K sequences. We need to
calculate the probability of finding at least one of
them in a random sequence of length N.
 First, let’s consider a simpler problem, where we
have one target R (K = 1).
19
: H – Number of occurrences of R in a
promoter region of length N.
R = AACG
 Define
AAACGGTTGTTACAACGGTTCCTCCAACG
H=3
 Our
p value is : P(H > 0) = 1 – P(H = 0)
20
A Naive Approximation
 At
a specific position, the probability of R appearing
is 1/4L, and the probability of R not appearing is (1 –
1/4L )
 A naive approximation :
We have N-L+1 possible start positions, so :
P(H = 0) ~ (1 – 1/4L )N-L+1
 Problem
: We have neglected correlations !
21
Why Do Correlations Matter ?
ATAA
TAAA
AAAA
TAAC
CAAA
GTAA
TAAG
GAAA
TTAA
TAAT
TAAA
CTAA
TAA
AAAC
AAA
AAAG
AAAT
TAA appears in 8
sequences of length 4
AAA appears in 7
sequences of length 4
P(H > 0) = 8/44
P(H > 0) = 7/44
The Difference is in the self-overlapping pattern of them:
TAA TAA
AAA AAA
TAA
TAA
AAA
AAA
:No Self overlaps
:Maximal number of self overlaps
Less self overlaps
Higher P(H>0)
22
The effect of self overlaps
The Mean of H Is : E(H) = (N-L+1) * (1/4L)
Independent of the specific sequence R.
 Proof:

23
The effect of self overlaps

The correlation between ‘close’ Xi’s
depends on the specific sequence R.
R1
TAAAAA
=
R2
AAAAAA
=
Less self overlaps
Higher P(H>0)
24
Algorithm
 We
have developed a recursive algorithm which
takes into account the correlations. It calculates the
exact value of P(H > 0).
 In the more interesting case, where we have a set
of K target sequences, our method still applies.
 If we assume that the promoter’s DNA is not
random but there are different probabilities for
A,T,C,G, the same algorithm still works.
 Time Complexity : O(N * K log K)
25
Algorithm (Cont.)
 When
K is too large, we don’t know the exact
sequences, but only (an approximation of) their
number.
 What
we do : Take the worst-case scenario (i.e.
highest P(H > 0) possible)
 Highest
p value : fewest overlaps. We assume no
overlaps at all.
26
Algorithm (Cont.)
The case of no overlaps is
possible for K < 4L/L.
 We are usually interested in
much smaller values of K.
Thus, for our case of
interest, the bound we get
is quite tight.

Upper Bound on K
Lower Bound on Overlaps
Upper Bound
on P(H>0) the P value!
27
Gene's Promoter
Regions
Sketch Of
The System :
Scan to get
Max Scores
Transcription
Factors Weight
Matrices
Input
Estimate K
(For each
pair!!)
Large
Small
Enumerate
K sequences
FDR
Bound
pvalue
Calculate
pvalue
Statistical
Evaluation
Output
28
Overview
Introduction and Definitions
 P-value Algorithm
 Experimental Results
 Generalizations and future work

29
A Comparison with Matinspector
We used the Promoter Database of Saccharomyces
cerevisiae. It contains genes and for every gene
the TFs that are known to bind its promoter. We
took 24 Transcription Factors whose PSWM is
known, and 135 promoters of genes which are
known to be bound by at least one of them.
Our Results:
We calculated the p-value (using the algorithm) for
each of the TFs on each of the genes.
30
Our Results
Each gene has 25 p-values of all the TFs. We used the FDR
method to find the statistical significant TFs for every gene.
Here the
threshold for
the FDR is
0.1, but we
will check the
results for a
range of
threshold
values
31
Our Results
After we found statistical significant TFs for every
gene, we compared the results with the data from
the database. There are 2 parameters:
1. False positives rate – TFs that we found as
statistical significant, but are not known to be
bound to the gene.
2. False negatives rate – TFs that are known to be
bound, but we didn’t find.
Lower parameters values
better results

32
Our Results Graph
We calculated the average of these 2 parameters (False
positives rate, false negatives rate) on all the genes, For a
range of FDR threshold parameter values, Q = 0.01,…,0.45
Notice that the
false
positives
rate is very
close to the
FDR
threshold
value
33
Matinspector Results
 For
every gene Matinspector gives the number of
occurrences of every TF in its promoter, and an
estimation of the expected number of occurrences
(re value).
 To compare to our results, we decided to declare a
TF as significant if it was found more times than it is
expected to be found, or, in other words, if the ratio
(expected number)/(number of occurrences) is
lower than a certain threshold.
34
Results Graphs
Matinspector Results
Our Results
In our results the average of the 2 parameters (green) is always lower,
And the false positives rate (red) is always much lower
35
Comparison with Synthetic Data
Synthetic- Synthetic
Real data
40%All
ofpositives
positives
are are
not annotated
annotated
The left graph shows lower error rates then in our true data. The right
graph shows error rates similar to those in our data, thus suggesting an
estimate for the amount of missing real binding sites in the database.
36
Overview
Introduction and Definitions
 P-value Algorithm
 Experimental Results
 Generalizations and future work

37
Markov Models
In true DNA sequences, the nucleotides are not
independent but rather posses statistical dependencies at
close distances.
 To model this, we used a markov model, in which the
distribution of each letter depends on some m previous
ones. m is denoted as the size of the model.
 Example : Each letter depends on it’s previous :

1st\2n
d
A
C
T
G
A
0.37 0.17 0.18 0.28
C
0.32 0.20 0.17 0.31
T
0.30 0.23 0.19 0.27
G
0.25 0.20 0.17 0.37
38
Markov Models (Cont.)
 We
can use this for both the binding site model and
the background (‘random’) model. The more
realistic the model, we hope to get more realistic p
values.
 We obtain a tradeoff. As we increase m :
 Advantages :
• More reliable p values 
• Reduce false positive/negative errors 
 Drawbacks

:
Need more data to represent the model 

Computational complexity increases 
39
Other Possible Directions
 Account
for multiple occurrences: P(H = n).
 Account for combinations of motifs. Find pairs (or
larger groups) which are statistically significant.
 Use close genomes (e.g. human and mouse) in
order to reduce false positives rate.
 Combine with expression data (how ?)
40
The END
Thanks to
Ido Kanter
 Gaddy Getz
 Eytan Domany

41