Prediction of Regulatory Elements
for Non-Model Organisms
Rachita Sharma, Patricia Evans, Virendra Bhavsar
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada E3B 5A3
Introduction
Determining regulatory networks from available data is one of the major challenges in
bioinformatics research. The regulatory network of an organism is represented by a set of genes
and their regulatory relationships, which indicate how a gene or a group of genes affects
(inhibits or activates) the production of other gene products, as shown in Figure 1. Some organisms,
such as yeast, Arabidopsis thaliana (thale cress, a plant) and the fruit fly, are being investigated
very thoroughly by biologists as model organisms.
We are developing a system to predict the regulatory relationships of a non-model organism
(target genome), about which less is known, using information about the
regulatory relationships of a related model organism (source genome). If the organisms are
closely related, then their regulatory relationships are likely to be similar. Differences in the
regulatory relationships between the organisms can be determined by using data from both the
model and non-model organisms. This research started as a part of the bioinformatics
research component of the Canadian Potato Genome Project.
Analysis
The methodology has been implemented for mapping regulatory elements and their regulatory
network. The first step, mapping regulatory elements, has been tested with yeast
(Saccharomyces cerevisiae) as the source genome and Arabidopsis thaliana as the target genome;
the two diverged approximately 1.6 billion years ago. For any pair of genomes, only
some of the transcription factors of one genome can be mapped to the other, since the
evolutionary distance between them leads to many false negatives. In addition, the number of
confirmed mappings between any two genomes is unknown, as it depends on the definition of a
confirmed mapping used in the experiment. The predicted transcription factors are assessed on
the basis of
• how likely a sequence predicted as a transcription factor is to be a transcription factor of the
target genome
• how likely the predicted transcription factor is to correspond to the correct type of transcription
factor from the source genome
Therefore, the predicted transcription factors are compared to a set of 1922 available
transcription factors of the Arabidopsis thaliana genome to determine the actual number of
transcription factors predicted.
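The comparison against the reference set of known transcription factors can be sketched with plain set operations. This is a minimal illustration, not the authors' implementation: the function name `confusion_counts` and the sequence identifiers are hypothetical, and the sketch assumes that predictions and reference entries share one identifier scheme.

```python
def confusion_counts(predicted, reference, universe=None):
    """Count TP/FP/FN (and TN if the full sequence set is given).

    predicted: identifiers predicted to be transcription factors
    reference: identifiers of known transcription factors
    universe:  optional set of all sequences considered (needed for TN)
    """
    predicted = set(predicted)
    reference = set(reference)
    counts = {
        "TP": len(predicted & reference),   # predicted and known
        "FP": len(predicted - reference),   # predicted but not known
        "FN": len(reference - predicted),   # known but missed
    }
    if universe is not None:
        # Sequences that are neither predicted nor known
        counts["TN"] = len(set(universe) - predicted - reference)
    return counts


# Hypothetical identifiers, for illustration only
reference_tfs = {"seq_a", "seq_b", "seq_c"}
predicted_tfs = {"seq_a", "seq_c", "seq_x"}
all_sequences = reference_tfs | predicted_tfs | {"seq_y", "seq_z"}

counts = confusion_counts(predicted_tfs, reference_tfs, all_sequences)
print(counts)
```

In the setting described above, `reference` would hold the 1922 available Arabidopsis thaliana transcription factors, so TP + FN would equal 1922.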
Objectives
• Determine associations between the genes that act as regulatory elements (transcription
factors and target genes) in model and non-model organisms
• Predict the regulatory relationships in a non-model organism
Methodology
• Find transcription factors of the target genome using the available regulatory element
information of the source organism based on
    - Similar sequences (TF-Seq)
    - Same protein domain family (TF-Fam)
    - Same protein domain sub-family (TF-SubFam)
• Map target genes from the source genome to the target genome based on finding
transcription factor binding site motifs (TFBS) in
    - Nucleotide data of the target genome (BS-Seq)
    - Promoter data of the target genome (BS-Prom)
    - Similar target gene sequences of source genome in the target genome (BS-Blast)
    - Nucleotide data of the target genome discarding binding sites located in the
      predicted regions of nucleosome occupancy (BS-Nuc)
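As an illustration of the BS-Prom idea (finding TFBS motifs in promoter sequences), the following sketch scans promoters on both strands for a consensus motif written in IUPAC nucleotide codes. It is a simplified stand-in under stated assumptions: the poster does not say how motifs are represented or matched, and the motif `CACGTK`, the gene names, and the sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile a motif; the lookahead lets overlapping sites be reported."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(promoters, motif):
    """Return {gene: [(position, strand), ...]} for promoters with a site."""
    forward = motif_regex(motif)
    reverse = motif_regex(reverse_complement(motif.upper()))
    hits = {}
    for gene, seq in promoters.items():
        seq = seq.upper()
        positions = [(m.start(), "+") for m in forward.finditer(seq)]
        if reverse.pattern != forward.pattern:  # skip for palindromic motifs
            positions += [(m.start(), "-") for m in reverse.finditer(seq)]
        if positions:
            hits[gene] = sorted(positions)
    return hits


# Hypothetical promoter sequences and motif, for illustration only
promoters = {
    "gene_a": "TTCACGTTAA",
    "gene_b": "GGAACGTGCC",
    "gene_c": "ATATATATAT",
}
sites = find_sites(promoters, "CACGTK")
print(sites)
```

Genes whose promoters contain a site would be candidate target genes of the corresponding transcription factor. A BS-Nuc-style refinement could then drop positions that fall inside predicted nucleosome-occupied regions.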
Gene expression data will be used to further refine the regulatory network and to understand how
the predicted regulatory relationships correspond to the expression levels of the genes in the
data.
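For the sequence-similarity criterion (TF-Seq) listed in the first step above, one plausible realization is to filter BLAST hits by an e-value cut-off such as the 0.1 reported for Figure 3. The sketch below assumes BLAST+ tabular output (`-outfmt 6`), whose default columns place the e-value in the eleventh field. The function name `filter_hits`, the identifiers, and the rows are hypothetical, and treating the cut-off as inclusive is an assumption.

```python
import csv
import io


def filter_hits(tabular_text, max_evalue=0.1):
    """Keep (query, subject) pairs whose best e-value is <= max_evalue.

    Expects BLAST+ tabular output with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue = float(row[10])
        if evalue > max_evalue:
            continue
        key = (query, subject)
        if key not in best or evalue < best[key]:
            best[key] = evalue  # keep the lowest e-value per pair
    return best


# Hypothetical hits: source transcription factors against target sequences
rows = [
    ["srcTF1", "tgtSeq1", "45.0", "200", "110", "2", "1", "200", "5", "204", "1e-30", "150"],
    ["srcTF1", "tgtSeq2", "30.0", "80", "56", "1", "10", "89", "3", "82", "0.5", "28"],
    ["srcTF2", "tgtSeq3", "38.0", "120", "74", "0", "1", "120", "1", "120", "0.05", "40"],
]
sample = "\n".join("\t".join(r) for r in rows)

kept = filter_hits(sample, max_evalue=0.1)
print(kept)
```

Target sequences that survive the filter would be the candidate transcription factors for the source transcription factor named in the query column.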
Results
Transcription factor mapping based on having the same protein domain family (TF-Fam) has better
performance than the other two methods, which are based on sequence similarity (TF-Seq) and on having
the same protein domain sub-family (TF-SubFam), as shown in Figure 2. Also, the transcription factors
predicted are of the correct type, as illustrated in Figure 3, and the sequences with similar
annotation may be part of the false positives. Figure 4 shows that target gene mapping by finding
TFBS motifs in promoters (BS-Prom) has better performance than the other methods. The sequence
similarity in BS-Blast is not useful for mapping target genes, showing that target genes with
similar binding sites do not need to have high sequence similarity. Also, using BS-Nuc to refine
the results of BS-Prom using the Nucleosomes Position Prediction tool does not improve the
performance of the results, showing the effects of the variable position of the
transcription-suppressing nucleosomes.
Figure 1: Example of gene regulatory network (edge labels: Activation, Inhibition)
Figure 2: Number of true positives, false positives, false negatives and true negatives for
transcription factors identified using TF-Seq, TF-Fam, and TF-SubFam (x-axis: Method; y-axis:
Number of sequences)
Figure 3: Number of hit sequences divided into four types (Confirmed, Similar, Other TF and
Not TF) using TF-Seq for BLAST e-value cut-off parameter of 0.1 (x-axis: Types; y-axis: Total
number of hits)
Figure 4: Number of true positives, false positives, false negatives and true negatives for
target genes identified using BS-Seq, BS-Prom, BS-Blast and BS-Nuc (x-axis: Method; y-axis:
Number of sequences)
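Comparisons of methods by their true positive, false positive, false negative and true negative counts, as in Figures 2 and 4, are often summarized with precision, recall and F1. The poster does not state which summary measure was used, so the sketch below is only one possible way to rank methods. The counts are hypothetical placeholders and are not values read from the figures.

```python
def prediction_metrics(tp, fp, fn, tn):
    """Summary measures derived from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for two unnamed methods (NOT taken from the poster)
hypothetical_counts = {
    "method_A": dict(tp=8, fp=2, fn=8, tn=82),
    "method_B": dict(tp=12, fp=6, fn=4, tn=78),
}

scores = {name: prediction_metrics(**c) for name, c in hypothetical_counts.items()}
best = max(scores, key=lambda name: scores[name]["f1"])

for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
print("highest F1:", best)
```

Because true negatives dominate when most sequences are not regulatory elements, precision and recall are usually more informative here than accuracy.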
Conclusion
The results of this work show that TF-Fam and BS-Prom are promising methods for predicting
the regulatory elements of a non-model organism from those of a model organism. These regulatory
elements can then be used to predict the regulatory network of the non-model organism.