Transcript document

A Likely(hood) Story: Statistical Inference and
the Problem of Understanding Hi-Throughput
Transcription Factor Binding Assays
Curtis G. Callan, Jr.
Physics Department
Princeton University
Reporting on work with Justin Kinney and Gasper Tkacik
13/1/2007
NIH Center Retreat
1
Acknowledgements
• Collaborators (responsible for most of the good ideas):
Justin Kinney
(PU Grad Student)
Gasper Tkacik
(PU Grad Student)
• Support:
Burroughs-Wellcome Program in Quantitative Biology
NIH Center for Systems Biology at Princeton University
• Reference: J Kinney, G Tkacik & C Callan PNAS104:501(2007)
13/1/2007
NIH Center Retreat
2
TF-DNA Energy Models from Binding Assays
• Transcription factors (TFs) are DNA-binding proteins which regulate
gene transcription: key mechanism in all organisms.
• A quantitative understanding of gene regulation, especially its evolution,
requires a quantitative understanding of TF-DNA interaction, i.e.
sequence-dependent binding energy (SDBE).
• High-throughput experiments can give massive amounts of (rather
noisy) information on TF-DNA binding. Popular examples are
• PBM: protein binding microarrays (in vitro)
• ChIP-chip: chromatin immuno-precipitation microarrays (in vivo)
• Usual goal: use the data to identify the TF binding sites (yes-no answer)
• Our goal: infer quantitative SDBE models from this noisy data. We will
take the statistical inference approach used by physicists to deal with
WMAP/HEP data. Outcome is a probability distribution on model space.
13/1/2007
NIH Center Retreat
3
PBM Assay Overview
(Mukherjee et al)
• Uses dsDNA microarrays to simultaneously assess TF binding to all
intergenic regions of S. cerevisiae.
bind
scan
label
• Fluorescence log-intensity ratios (LIRs) are filtered, averaged over
replicates and normalized to taste. Each sequence si is assigned
some best measured value zi (for i = 1,2, …, N intergenic regions).
• Connection between these measured values and whether a TF is
bound to the region (or not) is very noisy due to the complicated and
loosely-controlled chemistry. How to interpret the data?
•ChIP-chip assay (in vivo) produces similar-looking data.
13/1/2007
NIH Center Retreat
4
Simple Binding Energy (and Binding) Model
• Bases within a site (length L) contribute
additively to the binding energy. Model is
a 4xL “energy matrix” M.
• A stretch of DNA is “bound” if it contains
a site with E<1 (other-wise “unbound”).
Site occupancy modeled by step
function.
• A matrix model M predicts if any DNA
sequence s is bound (x=1) unbound
(x=0).
C
• How does this compare with what is
seen in the expt? Does any choice of M
fit the data?
L=6
Site = TGTGAC
Energy = 0.7 < 1
A
1.2
1.8
5.6
2.5
0.0
6.0
C
3.7
0.0
0.5
1.2
5.2
0.0
G
2.9
0.1
0.8
0.6
1.3
1.4
T
0.0
3.0
0.0
0.0
3.1
3.2
A T G T G A C C T
Region is “bound”
( binary x )
( continuous z )
13/1/2007
NIH Center Retreat
5
Connecting “Theory” and Experiment:
• Fluorescence z of a bound (unbound) region
is probabilistic (due to chemistry, etc.). Summarized by ``error models’’ for the two states:
• Experiment sees only the histogram of net
fluorescence N(z)=N0p(z|0)+N1p(z|1) due to
N0 “unbound” + N1 “bound” genes. Usually try
to discriminate the two states by a “cut” on z.
• How good is a model M ? If it predicts {xi}, the likelihood of data {zi} is:
product over all regions
• Bayes’ Rule then gives the likelihood of the model, given the data:
with model prior p(M)
• Result is prob dist’n on model space: basis for statistical inference.
Good! Small hitch: the actual error model is usually totally unknown!
13/1/2007
NIH Center Retreat
6
Quenching the Error Model: EMA Likelihood
• In ignorance of the true error model, we will average data likelihood
over all error models to get an error-model-averaged (EMA) likelihood.
• To actually do this average, we need to discretize the continuous data
• Bin each region si according to
fluorescence (discretize N(z))
• Find predictions {xi} of model M, record
counts czx per bin (parse bins into
separate binding states)
• EMA likelihood is functional integral:
m bins with
equal #s of
regions ..
C1 1
C1 0
Cm 1
Cm 0
each bin
splits into
two states
Mutual information appears!
• Our binned data yield a simple formula:
Practical algorithm for
evaluating p(M|{zi}) (up to
normalization!)
13/1/2007
NIH Center Retreat
7
Monte Carlo (MCMC) Evaluation of p(M|{zi})
• Random walk in space
of energy models (i.e.
elements of matrix M)
• Random start or …
• Accept/reject steps by
change in p(M|{zi}).
• After “burn-in”, get
model ensemble dist’d
according to p(M|{zi}) !
• Do 107 steps, record
4x103 models at 103
step intervals.
• Calculate statistics by
ensemble averaging
• Test for “burn-in” by
comparing inter- and
intra-run variances
PBM
ChIP-chip
Key result: p(M|{zi}) has a single smooth peak, easily found by MCMC!
13/1/2007
NIH Center Retreat
8
MCMC Results for TF ABF1p (Yeast)
RTCRYNNNNNACGW
Y=C,T
R=A,G
• MCMC generates 40,000 matrices M
sampled from p(M|{zi})using EMA
likelihood.
• All 80 matrix elements have wellsampled distributions (see insets).
• Mean matrix makes perfect sense in
terms of the known motif (more later)
• Distributions are amazingly tight: most
RMSDs ≤ 5% of functional range.
• Meaningful structure, even in the
middle of the binding site, where there
is little specificity.
• That the data imply a smooth probability landscape in the high-dimensional model space is a surprise.
• No one model is the “best” model. We
can now treat model predictions as
clean probabilistic statements.
1/13/2007
NIGMS Center Retreat
9
PBM vs. ChIP-chip Data Analysis
• PBM and ChIP-chip data give very similar
matrices (but with the ChIP-chip cutoff
set to .75 instead of 1).
• Cutoff approximates the chemical
potential of the TF: can vary between
experiments (but not the energy matrix)
• Simple 2 test used to asess the overlap
between the PBM and ChIP-chip
distributions for each matrix element, i.e.
test for consistency.
• Most elements have overlapping
distributions. Only 3 don’t, and those are
outside of the main site.
• Element by element match of mean and
variance between the two analyses is
impressive: No Free Parameters!
• The error models of the two exp’ts are
very different. We infer them from the
data; that the same energy matrix is
inferred is serious consistency check.
13/1/2007
NIH Center Retreat
11
PBM vs. ChIP-chip Binding Predictions
• We declare sites flagged by > 50%
of energy matrices to be putative
binding sites. Can make it tighter.
• Putative ChIP-chip sites are
almost all predicted by the PBM
matrices. As they should be!
• Experimenters identify putative
sites by cutoff. Overlap between
ChIP-chip and PBM is poor.
• Changing the cutoff will only admit
more false positives: improvement
only by decreasing expt’l noise.
• By “understanding” the noise, we
can flag more sites with little false
positive penalty.
General lesson is that noise can be “understood” if the data is “bottlenecked” through a “good” parametric model. That the difference between
in vivo versus in vitro experiment is captured as noise is pleasant surprise.
13/1/2007
NIH Center Retreat
12
Evidence From Evolution
• Direct experimental evidence about TF-DNA binding energy is
limited. But energy is the focus of our method. We need direct hithroughput energy measurements…
• Functional properties of binding sites are governed in large part by
energy: energy, not sequence, should be conserved in evolution if
function is to be maintained
• Suggests that comparing “orthologous” binding sites between species
is an indirect way of assessing whether our energy model is doing the
right thing.
• If the model passes this test, our hundreds of binding sites are the
raw material for a new kind of quantitative study of the dynamics of
evolution.
13/1/2007
NIH Center Retreat
13
Binding Energy is Conserved (Yeast ABF1)
Use the energy model derived by likelihood analysis from S.cerevisiae assays to
assign energies to sites (an a priori genotype-phenotype map):
• 676 (!) intergenic sites in S. cerevisiae
with Abf1p binding energy < 1 have
clear orthologs in S. bayanus
• ~75% of these orthologs also have
energy < 1. Conservation this strong is
clearly not an accident!
• Sites with energy > 1 have no
correlation between energies in the two
genomes: mutation randomizes energy.
• Clear evidence of selection pressure on
bound site energy; pressure on
sequence is indirect.
• Precise genotype-phenotype map is
good starting point for a quantitative
understanding of how TF binding sites
and regulatory networks evolved.
13/1/2007
NIH Center Retreat
14
Energy Imprints Itself on Sequence Evolution
We have a rich yeast phylogenetic tree.
Plus a large collection of orthologous site
pairs. Ask population average questions
about the likelihood of base changes at diff’t
locations in thte ABF1 site. Look for
selection on the basis of energy ….not siteby-site, but on average in the population.
ABF1
Substitution probability pattern
matches structure of energy
matrix (bigger DE’s disfavored).
Pattern evolves with time (from
last common ancestor) as if under
control of common `Hamiltonian’.
13/1/2007
NIH Center Retreat
15
Comments and Conclusions
• For lack of time I didn’t show you lots of nice stuff: hows ensemble
are stable under bin size (m=20 to m=200 regions) and matrix width
(L=14 to L=26 bp), and under discarding random halves of the data.
• I also didn’t show you how it works for other TFs. We’ve looked at
other “broad-acting” TFs (RAP1, REB1, ..) with “similar” results.
• The opportunities for quantitative study of evolution with a large
population of binding sites and a clean quantitative phenotype are
pretty exciting. Its not data analysis, but you can’t resist.
• That a minimalist energy model so accurately describes complex
DNA binding data is a big surprise. Arbitrary sets of 100s of genes
cannot be so regulated; how rare are large sets which could be?
• Different binding sites have different affinities (for good reasons?).
We make specific predictions about how affinities are ordered. How
does this concord with biochemical reality (Maerkl/Quake)?
• The output of this effort is meant to be input to an effort to construct
proper stat mech models of interesting networks. Next step!
13/1/2007
NIH Center Retreat
16
“Equilibrium Thermodynamics” of Site Evolution
Assume: a) site s1 evolves over time into an ensemble {s1} thermal with respect to a
fitness function F(E(s)) b) many equivalent sites [si] of a TF in one species can be
treated as an equivalent ensemble (ergodicity?).
Population genetics (Kimura): raw mutation
rate m is modified by fitness change DF:
F(E))
ABF1 site stats
Pbkgd(E)
Pfunc(E)
13/1/2007
Infer the actual fitness function F(E)
from ABF1 site energy statistics in
S.cerevisiae: Known background m
determines dist’n P0(E) of non-func’l
sites. Then Pfunc(E) = Ptrue(E) - P0(E) .
F(E) = ln(Pfunc(E)) is the “Hamiltonian”
governing stochastic evolution of
ABF1 binding sites (Lassig-Mustonen)
Do MC simulations agree with data
provided by real genomes?
NIH Center Retreat
17