A Probabilistic Model to Integrate Microarray and ChIP Data
A Probabilistic Dynamical Model
for Quantitative Inference of the
Regulatory Mechanism of
Transcription
Guido Sanguinetti, Magnus Rattray
and Neil D. Lawrence
Talk plan
• Overview of the problem
• Extending regression
• Introducing dynamics
• Modelling concentrations separately
• What next?
The problem
• The Central Dogma
Genes → (transcription) → mRNA → (translation) → Proteins → Protein interactions → Life
mRNA: easy to measure
Proteins: hard to measure
Protein interactions: COMPLEX!
Specific problem
• Transcription factor genes encode proteins that
promote or repress transcription of other
genes; they play a fundamental role in
gene regulatory networks
• Goal: infer the activity of the transcription
factor proteins (in a given experimental
condition) from the mRNA expression
data.
Why not use the TFs' own expression levels?
• TFs are often expressed at low levels, so their measurements are noisy
• TFs are post-transcriptionally regulated
• TFs interact non-trivially with each other
Current approaches
• Integrate with ChIP-on-chip data
• ChIP-on-chip gives a binary matrix X of
transcription factors binding genes (connectivity
matrix)
• Regress microarray expression data on X
Here b_mt is the transcription factor activity (TFA) of TF m
at time t, monotonically linked to protein
concentrations (Liao et al.; Boulesteix and Strimmer;
Gao et al.; ...)
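The regression step described above can be illustrated with a minimal sketch. It assumes a plain linear model in which the expression matrix Y (genes by time points) equals X times a TFA matrix B (TFs by time points) plus noise, fitted by ordinary least squares. The sizes, the simulated data and the variable names are all hypothetical, and the cited methods use their own, more refined estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tfs, n_times = 50, 3, 8  # hypothetical problem sizes

# Hypothetical binary connectivity matrix X (genes x TFs), of the kind ChIP-on-chip provides.
X = (rng.random((n_genes, n_tfs)) < 0.4).astype(float)

# Hypothetical "true" TFAs (TFs x time points), used only to simulate data.
B_true = rng.normal(size=(n_tfs, n_times))

# Simulated expression data (genes x time points): Y = X B + noise.
Y = X @ B_true + 0.1 * rng.normal(size=(n_genes, n_times))

# Least-squares estimate of the TFAs; each column is one time point.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Every gene bound by a TF enters this fit with the same weight, so a single activity value per TF and time point has to explain all of its targets.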
Problems
• All genes bound by the TF contribute
equally to the estimate of the TFA,
regardless of the regulation type.
• TFAs are gene-independent, but the
influence of a transcription factor varies
from gene to gene (and according to
condition)
• The model is linear (inevitable)
Extending Regression
Modify the regression model to allow different TFAs
for different genes and experiments
Reduce the number of parameters by placing a prior
distribution over the gene-specific TFAs. The choice of
the prior distribution depends on the situation we
model. E.g., for independent samples we may assume
TFAs at different time points to be independent
Introducing dynamics
• To model time series data, we choose a Kalman
filter prior on the rows of B
where
This is equivalent to assuming that TFAs vary smoothly over time
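The equations defining the prior are not preserved in this transcript. As an illustration only, the sketch below assumes a stationary first-order autoregressive (Gauss-Markov) process, the kind of state model that underlies a Kalman filter. The parameters gamma (persistence) and sigma2 (marginal variance) are hypothetical names, not the talk's notation.

```python
import numpy as np

def ar1_covariance(n_times, gamma, sigma2):
    """Covariance of a stationary AR(1) process: cov(b_s, b_t) = sigma2 * gamma**|s - t|."""
    t = np.arange(n_times)
    return sigma2 * gamma ** np.abs(t[:, None] - t[None, :])

# Assumed values: strong persistence between neighbouring time points.
K = ar1_covariance(n_times=6, gamma=0.8, sigma2=1.0)

# The precision (inverse covariance) of an AR(1) process is tridiagonal.
precision = np.linalg.inv(K)

# One draw from the prior: neighbouring time points are positively correlated.
rng = np.random.default_rng(1)
sample = rng.multivariate_normal(np.zeros(6), K)
```

With gamma close to 1, neighbouring time points are highly correlated, which is what a smoothness assumption amounts to. The tridiagonal precision is the kind of structure that makes recursive computation cheap.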
Likelihood function
• Given the model and the prior, we can obtain a likelihood
The likelihood can be estimated efficiently using the sparsity
of the covariance and recursion relations.
Estimating the TFAs
TFAs can be estimated a posteriori using Bayes' theorem
and moment matching
Error bars associated with each TFA are given by
the square roots of the diagonal entries of the
posterior covariance.
Mean TFAs can be obtained by averaging gene-specific TFAs over the target genes.
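To make the posterior estimate and its error bars concrete, here is a minimal sketch for a single gene at a single time point. It assumes a Gaussian prior over that gene's TFAs and Gaussian observation noise, so the posterior follows from standard Gaussian conditioning. The function name, the prior parameters mu and Sigma, the noise variance and the numbers are all hypothetical; the full model also couples time points, which this sketch ignores.

```python
import numpy as np

def gaussian_posterior(x, y, mu, Sigma, noise_var):
    """Posterior over TFAs b for y = x @ b + noise, with prior b ~ N(mu, Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # Posterior precision = prior precision + likelihood precision.
    post_cov = np.linalg.inv(Sigma_inv + np.outer(x, x) / noise_var)
    post_mean = post_cov @ (Sigma_inv @ mu + x * y / noise_var)
    return post_mean, post_cov

x = np.array([1.0, 1.0, 0.0])     # hypothetical row of X: gene bound by TFs 0 and 1 only
mu = np.zeros(3)                  # assumed prior mean
Sigma = np.diag([1.0, 0.5, 1.0])  # assumed prior covariance

post_mean, post_cov = gaussian_posterior(x, y=1.2, mu=mu, Sigma=Sigma, noise_var=0.1)

# Error bars: square roots of the diagonal of the posterior covariance.
error_bars = np.sqrt(np.diag(post_cov))
```

For the TF that does not bind the gene, the posterior equals the prior, because the data carry no information about it. The two binding TFs share the observed expression, and their error bars shrink relative to the prior.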
Testing the model
• We compared our averaged TFAs with the ones
obtained by regression for the Spellman dataset
(Mol. Biol. Cell, 1998), using ChIP data from
Lee et al. (Science, 2002).
The diagrams show the TFA for ACE2p.
...but we also get...
[Figure: four panels showing the TFA for SCW11, YER124C, CTS1 and YKL151C]
...and we can do more!
Gene Name
Maximum TFA with error
YER124C
YHR143W
ACE2=1.1±0.2, FKH2=0.03±0.04
PHO3
AGA1
NDD1=1.6±0.2, FKH2=0.06±0.02
ACE2=1.4±0.2, FKH1=0.011±0.009,
FKH2=0.03±0.04
MBP1=1.5±0.4, SWI4=1.0±0.4, MCM1=0±0.003
• Error bars allow us to determine which regulatory
interactions are significant
• Correlations among TFs can be obtained from Σ
Decoupling action and concentration
• It is not clear in the model whether a high gene-specific
TFA is the result of a high affinity or of a high protein
concentration
• We modify the model to distinguish the effects of protein
concentration and affinity
• Specifically, we model
Estimating the parameters
• Exact inference is no longer possible in this model.
• Approximate inference is performed using a variational
EM algorithm
• This exploits Jensen’s inequality to get a bound on the
log likelihood
Under a factorization assumption on the approximating
distribution q, the E-step becomes exactly solvable via
fixed point equations.
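The bound itself is not preserved in the transcript. In generic form, the Jensen bound used by variational EM can be written as below, where y stands for the observed data, z for all latent variables, θ for the model parameters and q for the approximating distribution; these symbols are assumptions and may differ from the talk's notation.

```latex
\log p(y \mid \theta)
  = \log \int q(z)\, \frac{p(y, z \mid \theta)}{q(z)}\, dz
  \;\ge\; \int q(z)\, \log \frac{p(y, z \mid \theta)}{q(z)}\, dz,
\qquad
q(z) = \prod_i q_i(z_i).
```

The inequality holds because the logarithm is concave. The E-step maximizes the right-hand side over the factors q_i, and the M-step maximizes it over θ.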
Results
The left-hand panel shows the expression level of ACE2
in the yeast cell cycle, the middle panel shows the inferred protein
concentration, and the right-hand panel shows the significance of the activities.
Problems
• ChIP data is notoriously noisy; for example the same
transcription factor (MSN4) in the same conditions (rich
medium) is found to bind 32 genes in Lee et al. and 57
genes in Harbison et al. (the intersection is 20 genes).
• Posterior estimation helps with false positives, not with
false negatives.
• The model is additive (in log space) and doesn’t model
combinatorial effects.
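The disagreement between the two ChIP studies quoted above can be quantified with simple set arithmetic. The sketch uses only the three counts given in the text (32, 57 and 20); the variable names are illustrative.

```python
# Counts quoted in the text for MSN4 targets in rich medium.
n_lee, n_harbison, n_shared = 32, 57, 20

# Genes reported by at least one of the two studies.
n_union = n_lee + n_harbison - n_shared

# Jaccard index: shared targets over all reported targets.
jaccard = n_shared / n_union

# Fraction of each study's targets confirmed by the other.
frac_lee_confirmed = n_shared / n_lee
frac_harbison_confirmed = n_shared / n_harbison

print(f"union={n_union}, jaccard={jaccard:.2f}, "
      f"Lee confirmed={frac_lee_confirmed:.2f}, "
      f"Harbison confirmed={frac_harbison_confirmed:.2f}")
```

Fewer than a third of the genes reported by either study appear in both.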
What next?
• Collaborate with biologists to validate our
predictions on novel data
• Microarray and ChIP data from same lab
should be more consistent
• Use the model results as a starting point
for systems biology modeling
• Introduce combinatorial effects