Exploring Cross-Function Domain Interaction Map

Download Report

Transcript Exploring Cross-Function Domain Interaction Map

An HV-SVM Classifier to Infer
TF-TF Interactions Using Protein
Domains and GO Annotations
Xiao-Li Li
Institute for Infocomm Research, Singapore
Nanyang Technological University, Singapore
Joint work with
Junxiang Lee (National University of Singapore)
Bharadwaj Veeravalli (National University of Singapore)
See-Kiong Ng (Institute for Infocomm Research)
Outlines
•
•
•
•
•
1. Introduction
2. Existing work
3. The proposed techniques
4. Experiments
5. Conclusions
1. Introduction
3 Key Bio-molecules in a cell
• DNA: The blueprint. Encodes all the information
to produce an organism
• RNA: The messenger. Transmits DNA
information to the protein synthesis machinery.
• Proteins: The workers. Targets for most drugs.
The Central Dogma
Transcription
DNA
PPI
Translation
RNA
Protein
Cell
Genes transcribed to RNA: transfer genetic information
from DNA into RNA.
RNA translated to protein: cells build proteins through
protein biosynthesis.
Protein is the machinery of life: Proteins function by
interacting with other proteins and biomolecules.
Genetic Information Flow in a cell
Transcription: DNA to RNA
Translation: RNA to Protein
Transcriptional Regulation
This slide is from Li Shen: CSB Tutorial, UCSD
Combinatorial Nature of TFs
• Recruiting of RNA Polymerase requires the
cooperation of several different transcription
factors.
–
–
–
–
TFIID is first positioned at TATA box of DNA
TFIIA, TFIIB, etc then need to attached to DNA
Next, RNA Polymerase bounds
Other TFs complete the mature of transcription
complex
2. Existing works: Computationally
Infer TF-TF Interactions
• Exploits the abundance of gene expression data,
inferring synergistic relationships between TFs
• A pair of TFs (A, B) is considered to be interacting if
the expression correlation scores of genes binding
both A and B is significantly greater than any set of
gene with binding of either TF A or B alone.
ECS(A, B, G)>>max(ECS(A,G), ECS(B,G))
Pilpel (2001), Nature Genet ; Bussemaker (2001), Nature Genet
Yu, (2003), Trends Genet ; Banerjee, (2003), NAR
Existing works (Cont.)
• Exploit the growing availability of whole genome
sequences. Sophisticated sequence analysis methods
are employed to discover the so-called interacting
motif pairs in the DNA sequences.
• Two motifs are deemed interacting if they cooccurrence in the input DNA promoters are overrepresented and the distance between the two motifs
are significantly different from random expectations.
• The TFs binding to these motifs are then predicted to
be interacting with each other.
Conlon (2003), PNAS; Das (2004), PNAS ; Yu, (2006), NAR;
Existing works (Cont.)
• Exploit the growing availability of large-scale
protein-protein interaction networks.
• The working hypothesis here is that proteins that are
close to each other in the PPI networks are more
likely to be co-regulated by the same set of TFs.
• TFs A and B are considered to be interacting when
the median distance between protein pairs controlled
by both A and B is significantly shorter than protein
pairs controlled by A only or B only.
Nagamine, (2005), NAR
Existing works (Cont.)
• Machine learning algorithms have been used to
exploit the current abundant availability of PPI data
to build models to classify novel protein interactions.
– Feature representation
– SVM, NB, etc
• Most of these works had focused on predicting
generic PPI’s. It would be interesting if we could
also employ a machine learning approach for
predicting the more biologically specific TF-TF
interactions.
Bock (2001), Bioformatics; Wu, (2005), NAR; Ben-Hur, (2005), Bioformatics.
3. The Proposed Technique
• 3.1 TF characterization
• 3.2 SVM classification
• 3.3 Kernel design
– Vertical Pairwise Kernel
– Horizontal Pairwise Kernel
– Combined kernel
• 3.4 Genetic algorithm (GA)
3.1 TF characterization
• We characterize TFs according to their
protein domains and biological attributes,
which include molecular functions,
biological processes, as well as cellular
components.
Protein Domains
• They are evolutionarily conserved modules
of amino acid sub-sequence, i.e. “building
blocks” for constructing different proteins.
• The existence of certain domains in the TFs
could indicate the propensity for the TFs to
interact (DDI).
GO Annotations
• GO provides a common vocabulary that describes
the biological processes, molecular functions and
cellular components for many bio-molecules.
• Physical interactions between TFs require that
they exist in close proximity in a cell.
• Biologically, TFs that have the same molecular
functions, involved in the same biological
processes, and located in the same cellular
components are more likely to interact.
3.2 Support Vector Machines
• SVM is related to statistical learning theory,
and it was first introduced in 1992.
Class 2
• Consider a binary
classification problem
• Many decision boundaries!
• Are all decision boundaries
equally good?
Class 1
V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Large-margin Decision Boundary
• The decision boundary should be as far away as
possible from the data of both classes
• We should maximize the margin,
Class 2
Class 1
m
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi 
{1,-1} be the class label of xi
• The decision boundary should classify all
points correctly 
• The decision boundary can be found by
solving the following constrained optimization
problem
The Dual Problem
•Inner product: a similarity measure between the objects.
•Different kernels can be designed, e.g. using domain
knowledge to better address the problem!!!
• This is a quadratic programming (QP) problem
• Many approaches, such as Loqo, cplex have been proposed
and a global maximum of ai can always be found
– see http://www.numerical.rl.ac.uk/qp/qp.html
A Geometrical Interpretation
a8=0.6 a10=0
a7=0
a5=0
a4=0
a9=0
Class 1
Class 2
a2=0
a1=0.8
a6=1.4
a3=0
3.3 Kernel design:
SVM Default Linear Kernel Kd
• Need to define pairwise kernel (TF pairs, not single TF)
• A simplistic approach is to arrange the TFs in an
alphabetically ordered list. Hence given TF pairs (A, B)
and (C, D), where A<B and C<D, the pairwise kernel is
defined as:
K d (( A, B), (C, D))  ΦA,C  ΦB, D
ΦX ,Y  wd d X dY  wp p X pY  w f f X fY  wc c X cY
T
T
T
T
What is wrong with default kernel?
Vertical Pairwise Kernel Kv
Kv ((A, B)(C, D))  (ΦA,C  ΦB,D )  (ΦA,D  ΦB,C )
Proposed Horizontal Pairwise Kernel Kh
• A pairwise kernel is to represent a pair
explicitly as one object and measure the
similarity directly between both pairs.
A
TF pair AB
TF pair CD
C
B
D
Domain pairs in AB
Domain pairs in CD
Common processes in AB
Common processes in CD
Common functions in AB
Common functions in CD
Common components in AB
Common components in CD
Kh (( A, B), (C, D))  wd d ABdCD  wpp ABpCD  w f f ABfCD  wcc ABcCD
T
T
T
T
Combining to build HV kernel Khv
and learning feature weights
• Combining vertical and horizontal kernels
as following HV kernel
K HV (( A, B)(C, D))  wv Kv (( A, B), (C, D))  wh Kh (( A, B), (C, D))
where Kh and Kv refer to the horizontal and
vertical kernels while wh and wv are their
respective kernel weights.
• How to decide the kernel weights wh and wv
and feature weights
3.4 Genetic algorithm optimizes the
kernel and feature weights
1) Randomly generate an initial population of k chromosomes;
2) Evaluate the fitness, f, of each individual;
3) Form k/2 random pairs from the population and select the
fitter individual of each pair as parents to breed the next
generation. Repeat to obtain a total of k parents;
4) Breed a new generation of offspring through crossover of the
k parents;
5) Perform random mutation on the newly created offspring with
a mutation rate r;
6) Repeat Steps 2-5 for n generations;
7) Select fittest chromosome from the n generations as solution
to the search problem.
4. Experimental Results
• We collected TF-TF interaction data for Homo sapiens
(Human) and Mus Musculus (Mouse) from various
databases, including
–
–
–
–
–
IntAct (http://www.ebi.ac.uk/intact/index.html),
GRIP (http://biodata.mshri.on.ca/grid/servlet/Index/),
MINT (http://mint.bio.uniroma2.it/mint/),
BIND (http://bind.ca/),
DIP (http://dip.doe-mbi.ucla.edu/).
• In all, a total of 3224 TF-TF interaction pairs among
619 TFs were extracted from these public databases.
Positive and Negative Dataset
• The positive dataset: the extracted 3224 TFTF interaction pairs formed.
• The negative dataset: similar size of TFs
was constructed by randomizing pairs of
TFs that do not already exist in the positive
dataset.
• Perform five-fold cross validation to test the
experimental results
Comparison between Feature Combinations
• We conducted an experiment to determine the
effectiveness of the different features in
predicting TF-TF interactions.
• All 15 combinations of the 4 features (D:
domains, P: biological processes, F: molecular
functions, C: cellular components) were tested
over different kernels.
Performance of default kernel Kd for
different feature combinations
90
80
75
70
65
F-measure
Accuracy
AUC
Feature combinations for default SVM kernel
DPFC
PFC
DFC
DPC
DPF
FC
PC
PF
DC
DF
DP
C
F
P
60
D
Performance
85
Performance of different kernels
using all features with equal weights.
95
Kd
Kv
Kh
Kh+v
90
Performance
85
80
75
70
65
F-measure
Accuracy
AUC
Precision
Recall
Optimization of Kernel and Feature Weights
Weights Optimization Using GA
86.00
85.50
F-measure
85.00
84.50
84.00
83.50
max
median
min
83.00
10
20
30
40
50
Applying 0our optimized
HV-SVM
classifier
with
kernel
K
hvo
Generation
on test set achieved 85.7% F-measure, which is 1.0% higher
than kernel Khv and 2.5% higher than randomly assigned
weights for the initial population.
Performance Comparison of Various Classifiers
Classifier
AUC
Accuracy
F-Measure
SVM with Kd
88.98
81.61
81.75
SVM with Kv
89.98
82.57
82.71
SVM with Kh
84.99
76.77
74.13
HV-SVM with Khv
91.80
84.77
84.68
HV-SVM withKhvo
92.76
85.24
85.70
Naïve Bayes
85.23
78.88
79.49
Conclusions
• To understand the complex mechanisms behind
transcription regulation, it is imperative to
unravel the coordinated interactions of the TFs.
• We characterized the TFs using Pfam domains
and GO annotations. Our results that integrating
multiple biological evidences improves the
prediction of TF-TF interactions.
Conclusions (Cont.)
• We specifically designed two novel pairwise
kernels for predicting TF-TF interactions
– The vertical pairwise kernel measures similarity
across individual TFs between two TF pairs
– The horizontal pairwise kernel considers similarity
between two TF pairs by measuring the similarity
between the feature pairs of the two sets of TFs.
– Genetic algorithm was then employed to learn the
kernel and features weights of the kernel
combination to give the best results.
Future Work
• A future area of work would be to incorporate
the current three classes of TF-TF interaction
prediction techniques, namely gene
expression correlation based techniques,
interacting motif based techniques and PPI
network based techniques, into our HV-SVM
machine learning approach for even better
predictions.
Thank you
for your attention!