Journal report: High Resolution Model of Transcription Factor


Journal report: High Resolution
Model of Transcription Factor-DNA Affinities Improve In Vitro
and In Vivo Binding Predictions
Paper by: Phaedra Agius, Aaron Arvey,
William Chang, William Stafford Noble,
Christina Leslie
Memorial Sloan-Kettering Cancer Center, NY
Presented by Yaron Orenstein for ACGT group
meeting, 19 January 2011
Introduction – Biological Background
• Gene regulatory programs are orchestrated by
transcription factors (TFs).
• These proteins usually bind to binding sites
(BSs) in the promoter region and enable or
impede transcription of the gene.
• Accurately modeling the DNA sequence
preferences of TFs is a key piece in unraveling
the regulatory code.
Modeling BSs: PSSM model
• The most popular model to represent binding
sites is the PSSM: position specific scoring
matrix.
     1     2     3     4     5     6
A    0.1   0.8   0     0.7   0.2   0
C    0     0.1   0.5   0.1   0.4   0.6
G    0     0     0.5   0.1   0.4   0.1
T    0.9   0.1   0     0.1   0     0.3
• These motifs may match thousands of sites in
intergenic regions, producing an unreliable list
of potential TF target genes.
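As a rough illustration of how such a matrix is applied, the sketch below scans a sequence and scores each 6-mer window by the product of the per-position probabilities in the example matrix above. The function names and the test sequence are assumptions made for illustration only; the comparison in the paper uses log-odds scores, which this sketch does not compute.

```python
# Example PSSM from the slide: one row per base, one column per motif position.
PSSM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
MOTIF_LEN = 6


def window_probability(window):
    """Probability of one 6-mer under the PSSM (product over positions)."""
    p = 1.0
    for pos, base in enumerate(window):
        p *= PSSM[base][pos]
    return p


def scan(sequence):
    """Return (start, window, probability) for every 6-mer window."""
    hits = []
    for start in range(len(sequence) - MOTIF_LEN + 1):
        window = sequence[start:start + MOTIF_LEN]
        hits.append((start, window, window_probability(window)))
    return hits


def best_hit(sequence):
    """Highest-probability window in the sequence."""
    return max(scan(sequence), key=lambda hit: hit[2])
```

Because a short motif like this is matched independently at every position, long intergenic regions will contain many moderately scoring windows, which is the source of the unreliable target lists mentioned above.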
All possible 8-mers model
• This model contains a list of all possible
8-mers ranked by the TF preference.
• This information can be obtained for
example from PBM data and calculating
an enrichment-score for each 8-mer.
• The disadvantage is clearly its large size
and lack of interpretability. In addition, the
sequence similarities between 8-mers are
not considered.
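A minimal sketch of building such a ranked k-mer list from PBM-style data follows. It ranks k-mers by the median intensity of the probes that contain them, which is a simplified stand-in and not the enrichment score itself. The function names and the toy probes are assumptions, and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def rank_kmers_by_median_intensity(probes, k=8):
    """probes: list of (sequence, intensity). Returns (k-mer, score) best-first."""
    by_kmer = defaultdict(list)
    for seq, intensity in probes:
        for kmer in set(kmers(seq, k)):  # count each probe once per k-mer
            by_kmer[kmer].append(intensity)
    scores = {kmer: median(vals) for kmer, vals in by_kmer.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With k = 8 the table has up to 4^8 = 65,536 entries, each scored independently of its near neighbours, which illustrates both drawbacks listed above.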
Protein Binding Microarray data
• A PBM array contains ~41,000 probe sequences
of length 35 bp each, covering all possible DNA
10-mers.
• For each probe the binding intensity is
reported.
Support vector regression
• Motivation: predict real values based on a
feature set.
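• Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$ with real-valued $y_i$, find a
function f which best predicts y.
• For example, if f is linear, then $f(x) = \langle w, x \rangle + b$,
where w is the vector of feature weights.
• $\frac{1}{2}\|w\|^2$ is minimized under some error
constraints (in the standard formulation, $|y_i - f(x_i)| \le \varepsilon$ up to slack).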
Example for SVR
• A simple way to predict binding intensity from
PBM data based on 8-mer features.
• Use indicator features for each 8-mer:
– 1 if sequence x contains the 8-mer.
– 0 if it does not.
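The indicator encoding above can be sketched as follows. This is only an illustration of the feature map and of a linear prediction f(x) = <w,x>+b; the names `indicator_features` and `linear_predict` are hypothetical, and fitting the weights (the actual SVR training) is not shown.

```python
from itertools import product

K = 8
# Enumerate all 4**8 = 65,536 possible 8-mers and give each a feature index.
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(ALL_KMERS)}


def indicator_features(seq):
    """Binary vector: entry is 1 if the sequence contains that 8-mer, else 0."""
    x = [0] * len(ALL_KMERS)
    for i in range(len(seq) - K + 1):
        x[KMER_INDEX[seq[i:i + K]]] = 1
    return x


def linear_predict(x, w, b):
    """Linear model f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

A 35 bp probe contains at most 28 distinct 8-mers, so each feature vector is extremely sparse; a practical implementation would likely store only the non-zero indices.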
An overview
Methods
• They developed a training strategy for the SVR
model that involves three key components:
1. The choice of kernel.
2. The sampling procedure for selecting the most
informative training sequences.
3. The feature selection method.
The di-mismatch kernel
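• Let $\{\phi_1, \dots, \phi_M\}$ be the set of unique k-mers that
occur in the set of training sequences.
• Define the set of substrings of length k in s (of
length N): $\{s_i\}$, $i = 1, \dots, N-k+1$.
• Then s is represented by the feature vector:
$\left(\sum_i d(s_i, \phi_1), \dots, \sum_i d(s_i, \phi_M)\right)$
• And $d(s_i, \phi_j)$ counts the number of
matching dinucleotides between $s_i$ and $\phi_j$.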
Example for the di-mismatch kernel
• Two non-consecutive pairs of mismatches
lead to a dinucleotide mismatch count of 6
(each pair disrupts 3 overlapping dinucleotides).
• Four consecutive mismatches lead to a count
of 5.
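The two counts above can be reproduced with a short sketch. The 13-mers used here are made up for illustration, and setting the score to zero when more than m dinucleotides mismatch is an assumption about how the mismatch parameter m is applied, not something stated on the slides.

```python
def dinucleotide_mismatches(a, b):
    """Number of positions i where the dinucleotides a[i:i+2] and b[i:i+2] differ."""
    assert len(a) == len(b)
    return sum(a[i:i + 2] != b[i:i + 2] for i in range(len(a) - 1))


def di_mismatch_score(a, b, m):
    """Count of matching dinucleotides; 0 if mismatches exceed m (assumed rule)."""
    mismatches = dinucleotide_mismatches(a, b)
    if mismatches > m:
        return 0
    return (len(a) - 1) - mismatches


def di_mismatch_features(s, feature_kmers, k, m):
    """Feature vector of s: for each feature k-mer, sum the score over all k-mers of s."""
    substrings = [s[i:i + k] for i in range(len(s) - k + 1)]
    return [sum(di_mismatch_score(si, phi, m) for si in substrings)
            for phi in feature_kmers]


REFERENCE = "ACGTACGTACGTA"
TWO_PAIRS = "ACTGACGATCGTA"   # mismatches at positions 2,3 and 7,8 (0-based)
FOUR_RUN = "ACGTCATGACGTA"    # mismatches at positions 4,5,6,7 (0-based)
```

Here `dinucleotide_mismatches(REFERENCE, TWO_PAIRS)` gives 6 and `dinucleotide_mismatches(REFERENCE, FOUR_RUN)` gives 5, so the kernel penalizes scattered mismatches more than a contiguous run with the same number of differing bases.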
Sampling PBM data to obtain an
informative training set
• They selected the set of “positive” training
probes to be those sequences associated with
normalized binding intensities Z ≥ 3.5.
• If there were more than 500, they selected the
top 500 ranked by their binding signals.
• The same number of “negative” training
probes was selected from the other end of the
distribution.
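The selection rule above can be sketched as follows, assuming the probes are available as (sequence, Z-score) pairs. The function name and argument names are hypothetical.

```python
def select_training_probes(probes, z_threshold=3.5, max_per_class=500):
    """probes: list of (sequence, z_score). Returns (positives, negatives)."""
    ranked = sorted(probes, key=lambda p: p[1], reverse=True)
    # Positives: probes at or above the Z threshold, capped at the strongest 500.
    positives = [p for p in ranked if p[1] >= z_threshold][:max_per_class]
    n = len(positives)
    # Negatives: the same number of probes taken from the low end of the ranking,
    # never reusing a probe that was already chosen as a positive.
    remaining = ranked[n:]
    negatives = remaining[-n:] if n else []
    return positives, negatives
```

Taking both classes from the tails of the intensity distribution leaves out the large middle of the array, where the signal is least informative.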
Feature Selection
• They selected the feature set to be those
k-mers that are over-represented in either the
“positive” or the “negative” probe class.
• They computed the mean di-mismatch score
for each k-mer in each class and ranked
features by the difference between these
means.
• They used at most 4000 k-mers.
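A minimal sketch of this ranking is given below. It assumes the absolute difference between class means is used so that k-mers enriched in either class are kept, and it assumes a di-mismatch score that drops to zero above m mismatches; both are assumptions, as are all function names.

```python
def di_mismatch_score(a, b, m):
    """Matching dinucleotides between two k-mers; 0 above m mismatches (assumed)."""
    mismatches = sum(a[i:i + 2] != b[i:i + 2] for i in range(len(a) - 1))
    return 0 if mismatches > m else (len(a) - 1) - mismatches


def mean_feature(kmer, sequences, m):
    """Mean over sequences of the summed di-mismatch score against one k-mer."""
    k = len(kmer)
    totals = [
        sum(di_mismatch_score(s[i:i + k], kmer, m) for i in range(len(s) - k + 1))
        for s in sequences
    ]
    return sum(totals) / len(totals)


def select_features(candidates, positives, negatives, m, max_features=4000):
    """Keep the k-mers whose mean score differs most between the two classes."""
    scored = []
    for kmer in candidates:
        diff = mean_feature(kmer, positives, m) - mean_feature(kmer, negatives, m)
        scored.append((abs(diff), kmer))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [kmer for _, kmer in scored[:max_features]]
```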
Results
• First, they tested how well they predict the
ranking of probe sequences of one PBM array
based on learning from another PBM array.
• They used the “top 100” metric: how many of
the 100 probes with the highest measured
intensity were also ranked in the top 100 by the model.
• They compared to PSSM and E-score (full 8-mer list) models.
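The top-100 metric can be sketched as a set overlap, assuming measured and predicted values are available as dictionaries keyed by probe identifier. The function name is hypothetical.

```python
def top_n_overlap(measured, predicted, n=100):
    """Number of the n highest measured probes that are also in the n highest predicted."""
    top_measured = set(sorted(measured, key=measured.get, reverse=True)[:n])
    top_predicted = set(sorted(predicted, key=predicted.get, reverse=True)[:n])
    return len(top_measured & top_predicted)
```

A value of 100 would mean the model recovers exactly the strongest-binding probes of the held-out array; the metric ignores how the remaining ~41,000 probes are ordered.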
• The left scatter plot shows the detection of the top 100
probes using maximum E-scores (x-axis) and the SVR model
(y-axis) in the prediction of in vitro TF binding preferences.
Each point corresponds to one TF.
• The right panel is similar to the left, but compares the SVR
versus PBM-derived PSSMs for the 114 mouse TFs.
Testing on ChIP-chip data
Prediction of in vivo occupancy
• They computed the binding occupancy using a
sliding 36-mer window for scoring.
• They compared to:
1. PSSM. Log-odds scores were used.
2. E-score over a fixed threshold.
3. E-score based occupancy (using the median
probe intensity of PBM probes containing the
highest-scoring 8-mer pattern).
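The sliding-window scoring can be sketched as below. The `score_window` argument stands for any trained model that maps a 36-mer to a predicted intensity and is a hypothetical placeholder; summarizing a region by its maximum window score is an assumption for this sketch.

```python
def window_profile(sequence, score_window, width=36):
    """Score every window of the given width along the sequence."""
    return [score_window(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


def region_score(sequence, score_window, width=36):
    """Summarize a region by its best window (None if shorter than one window)."""
    profile = window_profile(sequence, score_window, width)
    return max(profile) if profile else None


def toy_scorer(window):
    """Placeholder model for demonstration only: counts G bases in the window."""
    return window.count("G")
```

Plotting `window_profile` along an intergenic region yields a predicted binding profile of the kind shown for Ume6 and Gal4.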
• Predicted binding profile for:
– (left) yeast TF Ume6 along IGR iYFL022C
– (right) yeast TF Gal4 along IGR iYFR026C
• They computed the detection of the top 200
intergenic regions (IGRs) by the top 200 predictions,
where the top 200 “bound” IGRs were determined
by their p-value ranking.
• Prediction of in vivo binding is weak to very poor (due to indirect and competitive
binding as well as other factors).
• Still, in 8 out of 9 examples the SVR method outperforms the occupancy
score method of Zhu et al. (2009).
• Against the PSSM model the result was 6 wins, 1 tie, 2 losses.
Testing on ChIP-seq data
• They selected 1000 confident peak regions
(60bp each) and 1000 “negative” regions from
flanking sequences (60bp regions 300bp away
from the peaks).
• Model performance measured by area under
the ROC curve (AUC), using the maximum SVR
prediction score (over 36-mer windows) to
rank ChIP-seq 60-mers.
• ROC = true positive rate vs. false positive rate.
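The AUC can be computed directly from the two score lists, using the fact that it equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half. The sketch below uses the simple pairwise form, which is fast enough for 1000 peaks against 1000 flanking regions; the function name is hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of peaks from flanking sequence.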
SVRs trained on PBM arrays are able to capture ChIP-seq peaks
better than PSSMs or the occupancy score.
Support Vector Machines
• Here we want to classify the data into binary
classes, i.e. the training set is $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$.
Training discriminative models on
ChIP-seq data
• Trained SVMs using the (k, m) = (13, 5) di-mismatch parameters on
60-mer ChIP-seq peaks (positive sequences) and
flanking negative sequences.
• Evaluation by computing AUCs on the same test
sets of 1000 ChIP-seq peaks and 1000 flanking
negative sequences using 10-fold cross-validation.
• Tested against Weeder and MDscan, which
determine over-represented k-mer and PSSM
motifs, respectively.
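For reference, a 10-fold split of the 2000 labeled sequences can be sketched as follows. The shuffling seed and function name are assumptions, and the sketch does not stratify by class.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sequence is used for testing exactly once, and the reported AUC would typically be averaged over the ten held-out folds.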
• SVMs trained on ChIP-seq data capture sequence information from
the genomic context of ChIP-seq peaks and improve in vivo
prediction performance.
• There was no advantage to training regression models on ChIP-seq
peaks labeled with real-valued occupancy.
PBM experiments may capture in vivo
preference
• To investigate how some PBMs capture 2 different
binding sites, they:
1. Clustered k-mer features based on their co-occurrence
in the training sequences.
2. Projected highly weighted k-mers into 2 dimensions
using principal component analysis (PCA).
• Two clusters were found, each representing a different
motif.
• The SVR was trained on the features of each motif
separately and the AUCs were 0.75 and 0.54.
• K-mers contributing to the (left) Oct4 PBM model and (right) Sox2
ChIP model, where each point represents a 13-mer and is colored
according to its model weight. Star and circle point styles indicate
different clusters.
• For the PBM derived model, the clusters represent the primary and
secondary binding motifs
• For the ChIP-derived model, the clusters correspond to the motifs
for Sox2 and its cofactor Oct4.
Summary
• A flexible new discriminative framework for
learning TF binding models from high
resolution in vitro and in vivo data.
• The SVR/SVM models better predict binding
affinity and thus are more suitable for
representing complex regulatory regions.
Possible directions to continue
1. Training jointly on PBM and ChIP-seq data for
the same TF.
2. Develop multi-task training strategies for
modeling the binding preferences of a class
of structurally related TFs using features of the
amino acid sequence.
3. Combine in vivo TF sequence preference
models with data on chromatin state to
predict TF target genes in new cell types.