Localising regulatory elements using statistical
analysis and shortest unique substrings of DNA
Nora Pierstorff (1), Rodrigo Nunes de Fonseca (2), Thomas Wiehe (1)
1 - Institute for Genetics, University of Cologne, Germany, Email: [email protected]
2 - Institute for Developmental Biology, University of Cologne
INTRODUCTION:
To localize regulatory regions, three basic computational
approaches have been followed:
1. Search for binding sites of known transcription factors
   using Position Weight Matrices (PWMs). [1]
2. Search for conserved motifs in upstream regions of
   homologous or coregulated genes. [2]
3. Search for statistically overrepresented motifs. [3]
Our program SHUREG follows the third approach, which is
supported by two hypotheses:
1. Degenerate binding sites lead the transcription factor
   to the binding site.
2. New binding sites can be created easily from
   degenerate binding sites through a few mutations,
   allowing the organism to adapt to environmental changes.
Figure 1a: SHUREG prediction in the giant region
SHUREG - ALGORITHM:
1. Calculation of shustrings (shortest unique
   substrings) at every position relative to a
   surrounding window on the forward and backward strand.
2. Counting of neighbours (exact repeats in the
   surrounding window).
3. Calculation of P-values for each shustring.
4. Smoothing of P-values.
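Step 1 can be illustrated with a minimal Python sketch. It is not the SHUREG implementation: the function names, the brute-force search, and the symmetric window given by `half_window` are assumptions made here for illustration, and the neighbour counting, P-value and smoothing steps are not covered. A substring counts as unique if it occurs exactly once in the window and its reverse complement taken together.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shustring_length(window, i):
    """Length of the shortest substring starting at window[i] that occurs
    exactly once on both strands of the window, or None if none exists."""
    # '#' separates the strands so that no match can span the junction.
    both_strands = window + "#" + reverse_complement(window)
    for length in range(1, len(window) - i + 1):
        if count_occurrences(both_strands, window[i:i + length]) == 1:
            return length
    return None


def shustring_profile(seq, half_window):
    """Shustring length at every position of seq, each computed relative to
    a window of up to half_window bases on either side (assumed layout)."""
    profile = []
    for pos in range(len(seq)):
        start = max(0, pos - half_window)
        window = seq[start:pos + half_window + 1]
        profile.append(shustring_length(window, pos - start))
    return profile


print(shustring_profile("ACGTTGCAAC", half_window=4))
```

The brute-force search is quadratic per window and only meant to make the definition concrete. Note that a substring equal to its own reverse complement is never unique under this two-strand definition.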
Figure 2a: SHUREG prediction in the hairy region
Figure 2b: Ahab prediction in the hairy region
(Panels: Hairy, Drosophila melanogaster, SHUREG and Ahab results; x-axis: nucleotide position, 0 to 14000; annotated in the plots: Stripe 1, Stripe 2, Stripe 5, Stripe 6, Stripe 7 and the translation start site at 10213.)
We applied our program to different
well-explored regions of the Drosophila
melanogaster genome. Our dataset
includes segmentation and dorsal-ventral genes.
We compare our predictions to the results of
AHAB [1], a program that uses PWMs.
Figure 1 shows two predictions for the
giant region. Figure 1a is computed using
SHUREG. Figure 1b is the result of the Ahab
program applied to the same sequence.
Figure 3a: SHUREG prediction in the sog region
(Panel title of the Ahab plot: Sog, Drosophila melanogaster, Ahab results; y-axis: Ahab value.)
WHY SHORTEST UNIQUE SUBSTRINGS?
Analyzing the human (mouse) genome, we found 255 (293)
global shustrings of length 11 bp. [4]
29 (22) of these shustrings are located in 1000 bp upstream regions.
The probability of this distribution is
3.3 x 10^-24 (5.0 x 10^-18).
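The poster does not state how this probability was computed. One common way to assess such an enrichment is a binomial tail probability: if each shustring falls into an upstream region independently with probability p, how likely is it to see at least k of n there? The sketch below assumes that model. The function name and the upstream fraction are hypothetical; the fraction is a placeholder and not a value from the poster, so the printed number is not expected to reproduce the figures above.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


# Counts from the poster: 29 of 255 human shustrings lie in upstream regions.
# ASSUMPTION: the fraction of the genome covered by 1000 bp upstream regions
# is not given on the poster; 0.01 is a placeholder chosen for illustration.
hypothetical_upstream_fraction = 0.01
p_value = binomial_tail(255, 29, hypothetical_upstream_fraction)
print(f"{p_value:.2e}")
```

With the genome-specific upstream fraction substituted for the placeholder, the same call could be repeated for the mouse counts (22 of 293).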
Figure 1b: AHAB prediction in the giant region (y-axis: Ahab value)
ABSTRACT:
Several computational methods for predicting regulatory
regions have been developed in the last few
years. Most of the available methods require
transcription factor binding site matrices to achieve
reasonable results. To avoid the need for such
biological information, we developed a program
named SHUREG that predicts regulatory regions
without any extrinsic information, using only the
sequence itself. By calculating shustrings (shortest unique
substrings) we find statistically overrepresented
motifs, which are assumed to be indicators of
regulatory elements. [3]
RESULTS:
Figure 2a shows the SHUREG prediction
for the regulatory regions of the hairy
gene. Figure 2b shows the corresponding
Ahab prediction.
Figure 3b: AHAB prediction in the sog region using the Dorsal PWM
Figure 3c: AHAB prediction in the sog region using all known PWMs
(x-axis in both panels: nucleotide position, 0 to 14000.)
Figure 3 shows three predictions.
Figure 3a is the SHUREG
prediction for the Dorsal-regulated
enhancer of the sog gene. Figure 3b
shows the Ahab prediction using only
the PWM of the Dorsal binding site.
Figure 3c shows the Ahab prediction
using all known PWMs, simulating the
hypothetical case in which the factors
actually responsible for regulating this
gene are unknown.
DISCUSSION:
Localizing regulatory regions without any extrinsic information is a hard problem. Using the number of overrepresented patterns in a region as
an indicator of regulatory regions is a reasonable measure and can lead to reasonable results. However, it also produces many false positive predictions,
because we find additional overrepresented patterns that cannot be correlated with binding sites. To improve the predictions of our
method, we need more features that distinguish true positive from false positive predictions. We are currently investigating the
conservation of overrepresented motifs between species.
References:
[1] Rajewsky, N., Vergassola, M., Gaul, U., Siggia, E. D. (2002). Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, 3:30.
[2] Bussemaker, H., Li, H., Siggia, E. (2000). Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. PNAS, 97.
[3] Nazina, A., Papatsenko, D. (2003). Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics, 4:65.
[4] Haubold, B., Pierstorff, N., Moeller, F., Wiehe, T. (2005). Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics, 6:123.