A Novel Knowledge Based Method to Predicting Transcription

Download Report

Transcript A Novel Knowledge Based Method to Predicting Transcription

A Novel Knowledge Based Method to
Predicting Transcription Factor Targets
[email protected]
Background
• It is termed as the major aspect of transcription
regulation that transcription factor regulates target genes’
expression. Extensive efforts have been made in
discovering transcription factors’ target genes both in
wet and dry labs. Since transcription factors as well as
their targets may participate the same biological
pathways and share similar biological functions, we
can inference the regulatory relationship by analyzing
the Gene Ontology annotations and potential
transcription factor DNA binding sites.
• Hence, the computational method to predict potential
transcription factor target we developed could be
useful in transcription regulatory mechanism
researches.
Background
• TF-TFBS-TFT triplets
– Transcription factors(TF) regulate transcription factor
target(TFT) through binding to transcription factor
DNA binding sites(TFBS).
Background
• Our predicting strategy
GO encoding
0/1 encoding
TFBS
GO encoding
TFT
Hybridization Space
TF
False
Predictor
True
It is TRUE that Transcription
factors(TF) regulate
transcription factor
target(TFT) through binding
to transcription factor DNA
binding sites(TFBS).
Materials and Methods
•
Positive dataset
– For transcription factors as well as their targets, binding sites, the
original dataset came from TRANSFAC v7.0. Then the original dataset
was filtered as following steps: (1) Remove the TFs with no SwissProt
Accessions, as well as TFTs. (2) Remove TFBSs with length less than 5bp
or longer than 25bp. (3) Finally, positive dataset with 3430 TF-TFBSTFT triplet which covered 143TF, 571TFBS and 1416TFT was built
•
Negative dataset
– Negative dataset was randomly generated based on positive dataset
as following steps; (1) Random number i was generated from uniform
distribution on interval [1,143] , j from interval [1,571] and k from
[1,1416] . (2) The ith TF, jth TFBS and kth TFT was selected from the
positive dataset. Then a new triple was constructed through combining
those three elements. This new triple is ignored if it does appear in
the original positive dataset, otherwise would be pushed in the negative
dataset. (3) Repeat step 1,2 and 3 until the size of negative dataset reached
6860, which is two times that of positive dataset. (3) Finally, a
negative dataset with 6860 TF-TFBS-TFT triples which covered 140TF,
559TFBS and 1317TFT was obtained
Numeric representation system
• TF Gene Ontology representation system
– By using Uniprot2GO mapping provided by GOA Uniprot 34.0 on
November 21st 2005 [ http://www.ebi.ac.uk/GOA/ ], functional
annotations of TFs provided by GO were obtained.
– Each TF can be represented in a 9525D (Dimensional) vector through
using each of the 9525 GO items as the vector base, e.g. for a given TF
that hit a GO item which is the ith number of the 9525 GO items, then
the ith component of the 9525D vector will be set to 1, otherwise 0.
– Thus, the TF sample can be formulated as
 t1 


t
 2 


T 

t
i 




 t9525 
where,
1, hit found
ti  
0, otherwise
Numeric representation system
• TFTs are encoded by using the same approach as TFs
– Each TFT can be represented in a 9525D (Dimensional) vector
through using each of the 9525 GO items as the vector base
 g1 


g
 2 


G 

 gi 




 g9525 
where,
1, hit found
gi  
0, otherwise
Numeric representation system
•
Short nucleotide sequence TFBS are encoded using the 0/1 encoding system which
can be briefed as follows
–
–
–
Firstly, TFBSs with length less than 25bp are extended to exact 25bp through adding
‘N’ suffixes, e.g. ‘CCCCACGTAGCTAGACGTAG’ will be extended to
‘CCCCACGTAGCTAGACGTAGNNNNN’, meanwhile make no change for these TFBSs
with length exact 25bp.
Then, these TFBSs can be represented in a 100D (Dimensional) vector, e.g.
‘ACGTAGCTAGACGTAGCTAGNNNNN’ will be represented in a 100D binary vector as
0010'0010'0010'0010'0001'0010'0100'1000'0001'0100'0010'1000'0001'0100'0001'0010'0100'1
000'0001'0000'0000'0000'0000'0000 , meanwhile each nucleotide was encoded with a 4D
binary vector as
' A ' : 0001
' C ' : 0010

' G ' : 0100
' T ' : 1000

' N ' : 0000
Finally, each TFBS can be formulated as
 d1 


d
2




D

dj 




d 
 100 
where, d can be either 0 or 1.
The hybridization space
• To fascinate predicting the interactions between TF and TFT, a
numeric representation to cover TF-TFBS-TFT triplet is
developed. This can be done as follows. Suppose Tx , Dy
and Gz are the xth TF, yth TFBS, zth TFT, respectively. The x − y
− z TF-TFBS-TFT triplet TDG( x, y, z ) can be expressed as
The hybridization space
1
TDG u, v, z 
TF regulate TFT
through binding
TFBS
Predictor
0
NOT
The predictor
• The Nearest Neighbor Algorithm
– Once the numeric representation is built, the predictor
performed in this contribution can be briefly as follows.
Suppose there are N triplets ( R1 R2 , ..., Ri ,..., RN ) with
known classification label ( L1 L2 , ..., Li,..., LN) , where Li ∈
{true false} and true indicates it is indeed a true triplet that
TF act on TFT through TFBS, and false otherwise.
– Given a novel triplet R , is it true? To investigate this problem,
distance D(R, Ri) (1≤ i ≤ N ) is defined
R  Ri
D  R, Ri   1 
R  Ri
where, Ri · R is the inner-product of R and Ri , ||R||
and ||Ri|| are the modulus of R and Ri , respectively.
The predictor
• The Nearest Neighbor Algorithm
– Once the distance is calculated. The category of R can be
predicted to be same as that of its nearest neighbor.
D  R, Rk   min D  R, R1  , D  R, R2  ,..., D  R, Rk  ,..., D  R, RN 
– If there is a tie, which means there are more than one nearest
neighbor.
Results and Discussion
Jackknife cross-validation test
Correctly Predicted true TF-TFBS-TFT triplets

success rate for positive dataset =
true TF-TFBS-TFT triplets


success rate for negative dataset = Correctly Predicted artificial TF-TFBS-TFT triplets

artificial TF-TFBS-TFT triplets

Dataset
Positive
Negative
Overall
Success rate
2630/2693=97.6%
5100/5337=95.6%
7730/8030=96.3%
Excluding the transcription factors with no GO annotations as well as the transcription
factor targets, finally 19150D (9525+9525+100) vector were built for 2693 true triples and
5337 artificial triplets.
The result is obtained when k is set to 0.5.
Conclusion
• Identifying transcription factor’s targets is one of
the basic researches in transcription regulatory
area. In this contribution, a knowledge based
method was proposed to identify TF-TFT
relationships through integrating Gene Ontology
annotations and transcription factor DNA binding
preference. The predictor we built acquired a
fairly good performance as 96.5%, which
indicates the computational method we
developed could be a useful tool in transcription
regulatory mechanism researches.
Thank you