Predicting Novel Transcription Factor Binding Sites in Human
Using a Machine Learning Approach
Sonya Liberman1,2, Nir Friedman1 & Hanah Margalit2
1 School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
2 Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem, Israel
Transcription factors (TFs) regulate gene expression by
binding to specific sequences on the DNA. A major
challenge is to expand the known repertoire of TF-target
pairs by identifying novel Transcription Factor Binding
Sites (TFBSs) based on sequence data. One main
difficulty in such computational predictions is the large
number of false positives they generate. Here we
examine the association of five features with TFBS and
show that they differ between true binding sites and
similar sequences that are predicted as binding sites.
Using machine learning approaches, we developed a
computational scheme for TFBS prediction, in which
prediction of sites based on sequence data is subjected
to filtering and further classification according to these
features. This results in a significant reduction in the
number of false positive predictions and enables the
construction of a more accurate transcription regulation
network.
Evolutionary Conservation
Average conservation score:
Known transcription factor binding sites: 0.57
Sites predicted by a motif search tool: 0.24
Known TFBSs are on average more conserved than other predicted sites

Number of neighboring known binding sites of other transcription factors
[Figure: Gene 1 and Gene 2 with known and predicted TFBSs, clustered or scattered; different shapes indicate binding sites for different TFs]
Sites for which distance is less than 200 bp are considered neighbors
Average number of neighboring known TFBSs:
Known transcription factor binding sites: 1.04
Sites predicted by a motif search tool: 0.44
Known TFBSs have on average more neighbors among known TFBSs than other predicted sites do

Training Sets
• Each site was represented as a 5-coordinate vector (X1,X2,X3,X4,X5): conservation, number of neighboring binding sites, number of neighboring sites that fit the motif, distance from TSS, and orientation of transcription factor binding
• A positive set was constructed out of 159 known sites that were also discovered by TestMOTIF
• A negative set was constructed out of 159 randomly chosen sites from the set of new sites predicted by TestMOTIF

Kernels
The sites were classified using 4 different kernels: Gaussian, Linear, Polynomial and Sigmoidal.

Cross-Validation
• A sevenfold cross-validation was performed to evaluate performance using each one of the kernel functions
• Linear kernel achieved best cross-validation results; the sevenfold cross-validation results reported are for the Linear kernel
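
The four kernel families and the sevenfold split described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the poster names neither the classifier package nor the kernel parameters, so the function names and the values of `gamma`, `degree`, and `coef0` below are hypothetical. Only the set size (159 + 159 = 318 sites) and the number of folds come from the text.

```python
import math
import random

def linear_kernel(x, y):
    """Plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree; parameter values are assumptions."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); gamma is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    """tanh(gamma * <x, y> + coef0); parameter values are assumptions."""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)

def k_fold_indices(n_samples, k=7, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

# 159 positive + 159 negative sites = 318 five-coordinate vectors.
folds = k_fold_indices(318, k=7)
for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A kernel classifier would be fitted on train_idx and scored on test_idx here.
    assert len(train_idx) + len(test_idx) == 318
```

Each site lands in exactly one held-out fold, so every site is scored once by a classifier that never saw it during training.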
Sevenfold cross-validation results for Linear kernel:
True Positives: 83.68%, average value 5.98
False Negatives: 16.32%, average value -0.92
True Negatives: 86.82%, average value -2.20
False Positives: 13.18%, average value 0.64

• Transcription is governed by cis-regulatory elements and associated transcription factors
• In order to predict new TFBSs we use motifs of known TFBSs represented by PSSMs
We use a motif search tool (TestMOTIF [6]) that predicts new TFBSs in promoter sequences according to known motifs, and assigns a p-value to each prediction
[Figure: a GENE promoter with a known TFBS and a predicted TFBS]

Number of neighboring sites with a similar sequence
[Figure: a promoter with a known site (Gene 1) and a promoter without a known site (Gene 2); example double-stranded sites AATGATGC/TTACTACG and GCATCATT/CGTAGTAA]
Average number of sites with a similar sequence:
Promoter with a known TFBS: 14.65
Promoter without a known TFBS: 10.52
Known TFBSs tend to be surrounded by other sites that match their motif

Classification Results
• A classifier was trained on the full set of 318 sites and
managed to separate correctly 88.68% of the training data
• All new sites predicted by TestMOTIF were tagged by the
classifier
• The threshold for defining true binding sites was set to a
positive score of 2
• 936/73607 (~1.3%) sites received a score above the
threshold (222 unique pairs of TF and target gene)
• Final set included new target genes for 51 known
transcription factors
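
The thresholding step in the bullets above amounts to a simple filter followed by de-duplication into TF and target gene pairs. The sketch below is illustrative only: the record layout, the TF and gene names, and the example scores are hypothetical. The threshold of 2 and the counts 936 and 73607 are the only values taken from the text.

```python
THRESHOLD = 2.0  # from the text: a score above 2 defines a true binding site

def select_sites(scored_sites, threshold=THRESHOLD):
    """Keep only sites whose classifier score is strictly above the threshold."""
    return [site for site in scored_sites if site["score"] > threshold]

def unique_pairs(sites):
    """Collapse sites into the set of distinct (TF, target gene) pairs."""
    return {(site["tf"], site["gene"]) for site in sites}

# Hypothetical scored predictions (names and scores are made up).
scored = [
    {"tf": "TF_A", "gene": "GENE_1", "score": 5.98},
    {"tf": "TF_A", "gene": "GENE_1", "score": 2.40},
    {"tf": "TF_B", "gene": "GENE_2", "score": 0.64},
    {"tf": "TF_C", "gene": "GENE_3", "score": -2.20},
]

kept = select_sites(scored)
pairs = unique_pairs(kept)

# Fraction of sites above threshold, using the counts reported in the text.
fraction = 936 / 73607
print(len(kept), len(pairs), f"{fraction:.2%}")
```

Several sites can collapse into one pair, which is why 936 sites above the threshold correspond to only 222 unique pairs of TF and target gene.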
Distance from the TSS (Transcription Start Site) of the target gene
[Figure: distribution of the distance of sites from their target genes; axis: position relative to TSS]
• 75% are located within the 400 bp upstream of the TSS
• 61% of sites are located within the 200 bp upstream of the TSS

Orientation of the transcription factor binding
[Figure: the site AACCCA (complementary strand TTGGGT) shown in both orientations, AACCCA/TTGGGT and TGGGTT/ACCCAA, for Gene 1 and Gene 3]
Only a few TFs have a specific binding orientation, e.g.:
• E2F has a defined orientation of upstream binding sites (90% have the same orientation)
• EBOX has a defined orientation of downstream binding sites (86%)
Unfortunately, only a few transcription factors have enough known binding sites to enable reliable statistics.

• Known human TFBSs from TRANSFAC database were mapped onto the human genome
• 210 sites were chosen as a reliable set of known TFBSs
• Promoters of 150 genes were searched for putative binding sites for 98 different TFs
• We predicted ~150,000 statistically significant new sites including 174 out of 210 known TFBSs (~83%)
[Figure: known TFBSs versus predicted sites, with labels True Positives (83%), False Negatives, and, for the remaining predictions, True Positives or False Positives?]
To differentiate between true positive predictions and false positive predictions, the following five features are used:
• Evolutionary conservation
• Number of neighboring known binding sites of other TFs
• Number of neighboring sites with a similar sequence
• Distance from the TSS of the target gene
• Orientation of the transcription factor binding

The Model
To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components

Learning two separate models
The frequencies for the Gaussian components of the mixture and the parameters for each component were learned separately for the set of known sites predicted by TestMOTIF and the set of the new sites [2]

Average log probability of sites given each model:
• Newly predicted sites, given a model built according to newly predicted sites (73607)
• Newly predicted sites, given a model built according to known sites (159)
• Known sites, given a model built according to newly predicted sites (73607)
• Known sites, given a model built according to known sites (159)
Values: 2.623, -88.862, 17.631, 2.296

Classification
New sites predicted by TestMOTIF can be classified according to their probability of being generated by the first or the second set of parameters
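
Classifying a site by which of the two learned models makes it more probable can be sketched as a comparison of log densities. The sketch rests on loud assumptions: the infinite-component GMM from the text is replaced here by small finite mixtures with diagonal covariance, the feature vectors are cut to two coordinates instead of five, and every parameter value and function name is invented for illustration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture(x, components):
    """Log density under a mixture; components are (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    top = max(logs)
    # log-sum-exp keeps the sum numerically stable for very negative values
    return top + math.log(sum(math.exp(value - top) for value in logs))

def classify(x, known_model, new_model):
    """Label x by the model giving it the higher log probability."""
    log_ratio = log_mixture(x, known_model) - log_mixture(x, new_model)
    label = "known-like" if log_ratio > 0 else "new-like"
    return label, log_ratio

# Hypothetical parameters standing in for the two learned models.
known_model = [(0.6, [0.6, 1.0], [0.02, 0.5]), (0.4, [0.5, 2.0], [0.02, 0.5])]
new_model = [(1.0, [0.25, 0.4], [0.05, 0.5])]

label, ratio = classify([0.58, 1.1], known_model, new_model)
```

Working in log space matters here because average log probabilities as low as -88.862, as reported above, correspond to densities too small to compare reliably on a linear scale.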
1. Sinha, S., M. Blanchette, and M. Tompa, PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 2004. 5: p. 170.
2. Neal, R.M., Regression and classification using Gaussian process priors. Oxford University Press, 1998: p. 475-501.
3. Siepel, A., et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 2005. 15(8): p. 1034-50.
4. Li, N. and M. Tompa, Analysis of computational approaches for motif discovery. Algorithms Mol Biol, 2006. 1: p. 8.
5. Jensen, S.T., X.S. Liu, Q. Zhou, and J.S. Liu, Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statistical Science, 2004. 19(1): p. 188-204.
6. Barash, Y., G. Elidan, T. Kaplan, and N. Friedman, CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics, 2005. 21(5): p. 596-600.
• Dr. Yael Altuvia for her help with the feature definition
• Tommy Kaplan for his help with the TestMOTIF tool