Transcript Poster
PreDetector: Prokaryotic Regulatory Element Detector
Samuel Hiard¹, Sébastien Rigali², Séverine Colson², Raphaël Marée¹ and Louis Wehenkel¹
¹ Department of Electrical Engineering and Computer Science & Centre for Biomedical Integrative Genoproteomics (CBIG/GIGA) – University of Liège, Sart-Tilman B28 Liège, Belgium
² Centre for Protein Engineering – University of Liège, Sart-Tilman B6 Liège, Belgium
Abstract
PreDetector is a stand-alone program written in Java. Its ultimate aim is to predict regulatory sites in prokaryotic species. It provides two functionalities.
The first is very similar to Target Explorer [1]. From a set of sequences identified as potential target sites, PreDetector creates a consensus sequence and computes its scoring matrix. The sequence and matrix can be saved to a file and then used to search a selected genome for sequences that are close enough to the consensus sequence. To this end, each locus in the genome is assigned a score according to the similarity measure defined by the matrix. The output of this functionality is filtered with a cut-off score and then passed directly as input to the second functionality.
The second functionality starts by fetching the gene positions of the selected species from the NCBI server. The loci scoring above the cut-off are then classified into four classes, with multiple classes allowed for one element. This gives biologists a better view of the sequences they have discovered.
Matrix Generation
When biologists search for a regulatory motif, they find several potential sequences. We then need a way to obtain a consensus sequence that averages the potential ones.
The first step would be to align the potential sequences. Target Explorer [1] allows sequences of variable length, but PreDetector does not. It simply takes the sequences as is and starts generating the matrix.
The matrix should reflect the fact that a nucleotide with a higher frequency at a given position in the observed set should have a greater impact on the score at that position than nucleotides that are more evenly distributed.
On the other hand, nucleotides with a high expected frequency along the genome should not carry much weight, as they are likely to be found anyway, and conversely.
So, the weight function for a specific nucleotide in the matrix is the following [1]:
$$\mathrm{weight}_{i,j} = \ln\left(\frac{n_{i,j} + p_i}{(N + 1)\, p_i}\right)$$
Consensus search in genome
When the matrix is computed, it can be used to find
similar loci in the genome.
The score for each locus is calculated as the sum of
the values that each base of the sequence has in the
weight matrix.
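The scoring step can be sketched as follows. This is a minimal illustration, not PreDetector's actual code: the function names `score`, `reverse_complement` and `scan` are hypothetical, positions are 0-based by assumption, and the weights are the example matrix given on this poster. Scoring a reverse-strand hit on the reverse complement of the displayed sequence is also an assumption, although it agrees with the scores listed in the poster's example results.

```python
# Example weight matrix from this poster: one row per nucleotide,
# one column per motif position.
WEIGHTS = {
    "A": [0.65, -1.39, -1.39, -1.39, -1.39],
    "C": [0.41, 1.39, -1.39, 0.41, 0.41],
    "G": [-1.39, -1.39, 1.39, 0.41, -1.39],
    "T": [-1.39, -1.39, -1.39, 0.08, 0.65],
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def score(seq, weights=WEIGHTS):
    """Sum the weight of each base of seq at its position in the matrix."""
    return sum(weights[base][j] for j, base in enumerate(seq))


def scan(genome, weights=WEIGHTS):
    """Yield (strand, window, position, score) for every window, both strands.

    Assumption: a reverse-strand hit is scored on the reverse complement
    of the window read on the forward strand.
    """
    width = len(weights["A"])
    for pos in range(len(genome) - width + 1):
        window = genome[pos:pos + width]
        yield ("for", window, pos, score(window, weights))
        yield ("rev", window, pos, score(reverse_complement(window), weights))


if __name__ == "__main__":
    for hit in scan("AACCGGCTT"):
        print(hit)
```

Storing the matrix as one list per nucleotide keeps the score a single sum over positions, which mirrors the description above.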
The four classes
1) Regulatory: The distal is located in the user-specified bounds, and at least one nucleotide is not in a gene
2) Upstream: The distal is facing a start codon and is not in a gene
3) Coding: The distal is in a gene
4) Terminator: At least one nucleotide facing a stop codon, and no start codon
Example: Use the previous matrix to find similar loci on nucleotides 100–200 of gene X of Drosophila melanogaster.
Id  Strand  Seq    Pos  Score
1   for     CCGGC  38   4.01
2   for     CCGAT  20   2.45
3   rev     AGCGC  29   2.45
4   rev     TCCGG  37   2.21
5   for     TCGTT  58   2.12
(Only the first 5 results are shown here.)
[Screen shot: Matrix Generation]
Then, only sequences with a score greater than a user-defined cut-off score are kept. In this example, we could set the cut-off score at 2.40 and keep only the first three elements.
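The cut-off filter can be illustrated on the five results listed in the example. This is only a sketch: the `Hit` record and the `apply_cutoff` function are hypothetical names, and the strict "greater than" comparison follows the wording of the text.

```python
from collections import namedtuple

# Hypothetical record type mirroring the columns of the example results.
Hit = namedtuple("Hit", ["id", "strand", "seq", "pos", "score"])

HITS = [
    Hit(1, "for", "CCGGC", 38, 4.01),
    Hit(2, "for", "CCGAT", 20, 2.45),
    Hit(3, "rev", "AGCGC", 29, 2.45),
    Hit(4, "rev", "TCCGG", 37, 2.21),
    Hit(5, "for", "TCGTT", 58, 2.12),
]


def apply_cutoff(hits, cutoff):
    """Keep only the hits whose score is strictly greater than the cut-off."""
    return [hit for hit in hits if hit.score > cutoff]


kept = apply_cutoff(HITS, 2.40)
print([hit.id for hit in kept])  # [1, 2, 3]
```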
[Screen shot: Search Parameters]
In the weight function:
- n_{i,j} is the observed frequency (number of occurrences) of nucleotide i at position j
- N is the number of sequences in the set
- p_i is the expected frequency of nucleotide i in the genome
PreDetector in two words
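[Diagram: Sequence search (ACGT in …AACGTTTTTACGTCCCCACGT…), Score ≥ Threshold, Results; Genes positions from the NCBI Server, Classification into Regulatory, Upstream, Coding and Terminator]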
Example of matrix
Let’s assume that we have experimentally discovered
these motifs:
ACGTC
ACGGT
CCGCT
in a species known to have 40% GC content, the consensus matrix will then be:
Position   1      2      3      4      5
A        0.65  -1.39  -1.39  -1.39  -1.39
C        0.41   1.39  -1.39   0.41   0.41
G       -1.39  -1.39   1.39   0.41  -1.39
T       -1.39  -1.39  -1.39   0.08   0.65
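As a check on the example, the matrix can be recomputed from the weight function ln((n_{i,j} + p_i) / ((N + 1) p_i)). The sketch below is illustrative only: `weight_matrix` is a hypothetical name, the three motifs are the example sequences as read from the poster, and the 40% GC content is assumed to split evenly (p_C = p_G = 0.2, p_A = p_T = 0.3), which is the choice that reproduces the values above.

```python
import math


def weight_matrix(sequences, background):
    """Compute ln((n_ij + p_i) / ((N + 1) * p_i)) per nucleotide and position.

    sequences:  equal-length motif strings (no alignment is performed)
    background: expected genome frequency p_i of each nucleotide
    """
    n_seqs = len(sequences)
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = {}
    for base, p in background.items():
        row = []
        for j in range(width):
            # n_ij: number of sequences with this nucleotide at position j
            count = sum(1 for seq in sequences if seq[j] == base)
            row.append(math.log((count + p) / ((n_seqs + 1) * p)))
        matrix[base] = row
    return matrix


# Assumptions: motifs as read from the example; 40% GC split evenly.
MOTIFS = ["ACGTC", "ACGGT", "CCGCT"]
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

if __name__ == "__main__":
    matrix = weight_matrix(MOTIFS, BACKGROUND)
    for base in "ACGT":
        print(base, " ".join(f"{w:5.2f}" for w in matrix[base]))
```

For instance, A occurs twice at position 1 in three sequences, so its weight is ln((2 + 0.3) / (4 × 0.3)) ≈ 0.65, while a nucleotide that never occurs at a position gets ln(1/4) ≈ -1.39 whatever its background frequency.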
Classification
When several hits have been found, PreDetector classifies them into four classes: Regulatory, Upstream, Coding and Terminator, allowing multiple classes per element.
To achieve this, PreDetector connects to the NCBI server and downloads the gene positions of the species.
The classes are described in detail under "The four classes".
Conclusion
PreDetector can play an important role in the automatic detection and validation of regulatory elements. It can also be upgraded to handle eukaryotic species.
References
1. Target Explorer: an automated tool for the identification of new target genes for a specified set of transcription factors, Alona Sosinsky, Christopher P. Bonin, Richard S. Mann and Barry Honig, Nucleic Acids Research, 2003, Vol. 31, No. 13, 3589-3592