Transcript Slide 1

Propagating Measurement Uncertainty in
Microarray Data Analysis
Magnus Rattray
School of Computer Science
University of Manchester
Combining the strengths of UMIST and
The Victoria University of Manchester
Talk Outline
• Part 1: Affymetrix probe-level analysis
• Probabilistic model for oligonucleotide arrays
• Estimating credibility intervals
• Evaluation on real and spike-in data
• Part 2: Propagating uncertainties
• A general framework for propagating uncertainties
• Example 1: Identifying differentially expressed genes
• Example 2: Modified Principal Component Analysis
Part 1: Affy probe-level analysis
PM – Perfect match DNA probe designed to measure signal
MM – Mismatch DNA probe designed to measure background
Probes for the same gene differ greatly in their binding affinities, e.g.
PM:  83  77  70  982  530  1013  340  1832  464  1111
MM:  86  65  79  489  172  1224  181   985  191   313
~10000-50000 probe-sets with 11-20 PM/MM probe-pairs
Are mismatch probes useful?
• In practice there is specific binding to MM, so some methods ignore MM probes altogether. But…
…if the fraction of specific binding to MM is the same for each chip, this term cancels when computing expression ratios.
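The cancellation can be checked numerically. The sketch below assumes a simple additive model that is not stated on the slide: PM = background + signal and MM = background + phi × signal, where phi is the fraction of specific binding to MM. The function names, `phi` and all numbers are illustrative only.

```python
def pm_mm_difference(signal, background, phi):
    """PM - MM under an assumed additive model (hypothetical, for illustration)."""
    pm = background + signal          # perfect match: background + signal
    mm = background + phi * signal    # mismatch: background + a fraction of the signal
    return pm - mm                    # equals (1 - phi) * signal


def expression_ratio(signal_a, signal_b, background_a, background_b, phi):
    """Ratio of PM - MM between two chips that share the same phi."""
    return (pm_mm_difference(signal_a, background_a, phi)
            / pm_mm_difference(signal_b, background_b, phi))


# The true signal ratio is 800 / 200 = 4; the factor (1 - phi) cancels for any phi < 1.
ratios = [expression_ratio(800.0, 200.0, 90.0, 60.0, phi) for phi in (0.0, 0.2, 0.5)]
print(ratios)
```

If phi differed between chips, the factor (1 - phi) would no longer cancel and the ratio would be biased.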
Probabilistic probe-level analysis
• Most methods return a single expression level estimate
• Probabilistic models provide confidence intervals
• Useful for propagating through higher-level analysis
• Hopefully, this approach will also improve accuracy
A hierarchical Bayesian model (Hein et al. 2005) uses
MCMC for Bayesian parameter estimation, but this can
be prohibitively slow – a more efficient approach is
required.
Gamma model for oligo signal: gMOS
Models (PM,MM) distribution for each probe-set
- PM (background+signal)
- MM (background)
- signal
Mean log-signal
where
Milo et al., Biochemical Society Transactions 31, 6 (2003)
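The slide's equations did not survive in this transcript. As a hedged reconstruction based on the published gMOS formulation rather than on the slide itself, the model can be written with gamma distributions in the shape/rate parameterisation. The symbols $a$, $\alpha$, $b$ and $s$ are assumed notation.

```latex
\begin{align}
  y^{\mathrm{PM}} &\sim \mathrm{Ga}(a + \alpha,\; b) && \text{(background + signal)} \\
  y^{\mathrm{MM}} &\sim \mathrm{Ga}(a,\; b)          && \text{(background)} \\
  s               &\sim \mathrm{Ga}(\alpha,\; b)     && \text{(signal)} \\
  \langle \log s \rangle &= \psi(\alpha) - \log b,
  \qquad \text{where } \psi(x) = \frac{d}{dx}\log\Gamma(x)
\end{align}
```

The last line is the standard expression for the expected logarithm of a gamma variable with shape $\alpha$ and rate $b$.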
Modelling probe affinity: mgMOS
• PM and MM probes have correlated binding affinities
• Use a shared scale parameter […] for probe-pair […]
• Treat scale parameter as a latent variable
• Distribution of PM ([…]) and MM ([…]) is […]
Improves fit to data
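A small simulation illustrates why a shared latent scale parameter produces correlated PM and MM intensities. It is a sketch under assumed distributions (gamma-distributed intensities whose rate `b_j` is itself gamma-distributed with shape `c` and rate `d`). The parameter names and values are hypothetical, not taken from the slide.

```python
import math
import random


def sample_probe_pairs(n_pairs, a, alpha, c, d, seed=0):
    """Draw (PM, MM) pairs that share a latent rate b_j within each probe-pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # random.gammavariate takes (shape, scale), so a rate r becomes scale 1/r.
        b_j = rng.gammavariate(c, 1.0 / d)            # latent rate shared by the pair
        pm = rng.gammavariate(a + alpha, 1.0 / b_j)   # background + signal
        mm = rng.gammavariate(a, 1.0 / b_j)           # background only
        pairs.append((pm, mm))
    return pairs


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


pairs = sample_probe_pairs(2000, a=3.0, alpha=3.0, c=2.0, d=0.01)
log_pm = [math.log(pm) for pm, _ in pairs]
log_mm = [math.log(mm) for _, mm in pairs]
corr = pearson(log_pm, log_mm)
print(corr)  # positive: the shared b_j moves PM and MM together
```

With an independent rate for each probe, the PM and MM draws would be uncorrelated. The positive correlation comes entirely from the shared latent variable.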
Further extensions of the model
• Share binding affinity parameter across multiple chips
• Include fraction of specific binding to MM probe
Probe […] in probe-set […] on chip […]
Parameter […] is unidentifiable
We estimate an empirical prior from spike-in data
Liu et al., Bioinformatics 21, 3637 (2005).
Posterior signal distribution
• We estimate the mean signal over a probe-set as […]
• Only the first term is chip & condition specific
• Distribution of […] gives posterior signal distribution
• We assume a uniform positive prior on […]
• Approximate posterior of […] as truncated Gaussian or using a histogram approach (very similar in practice)
• Percentiles of […] provide percentiles of […]
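The percentile statement relies on the signal estimate being a monotonic function of the chip-specific parameter. The sketch below assumes the form psi(alpha) + log(d) - psi(c) for the mean log-signal, with `c` and `d` shared across chips. That form, the hyper-parameter values and the truncated-Gaussian posterior for `alpha` are assumptions made for illustration, not values from the slide.

```python
import math
import random


def digamma(x):
    """Digamma function psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x            # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def mean_log_signal(alpha, c, d):
    """Assumed form: only the first term depends on the chip-specific alpha."""
    return digamma(alpha) + math.log(d) - digamma(c)


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 100]."""
    index = min(len(sorted_values) - 1, int(q / 100.0 * len(sorted_values)))
    return sorted_values[index]


rng = random.Random(1)
alpha_samples = []
while len(alpha_samples) < 5000:     # truncated Gaussian via rejection of non-positive draws
    draw = rng.gauss(2.0, 1.0)
    if draw > 0.0:
        alpha_samples.append(draw)
alpha_samples.sort()

c, d = 3.0, 0.01                     # hypothetical shared hyper-parameters
signal_samples = sorted(mean_log_signal(a, c, d) for a in alpha_samples)
for q in (5, 50, 95):
    via_alpha = mean_log_signal(percentile(alpha_samples, q), c, d)
    direct = percentile(signal_samples, q)
    print(q, via_alpha, direct)      # the two columns agree because psi is increasing
```

Because the digamma function is increasing, credibility intervals for the signal can be read off from the percentiles of `alpha` without sampling the signal itself.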
Posterior signal distribution
• Posterior becomes more peaked as signal increases
• Normal provides good fit for large signals
• For low signal there is a long left-hand tail due to the fact that we are measuring […]
• Posterior distribution can be used to put credibility
intervals on the estimated expression level
Results: Accuracy on real data
• 5 time-points, 3 replicates & qr-PCR for 14 genes
[Plots: mgMOS, multi-mgMOS]
Method                  Error
GC-RMA                  0.69
MAS 5.0                 0.66
mgMOS (post. median)    0.60
multi-mgMOS             0.60
Hierarchical Bayesian   0.72
RMS error to PCR results
Mouse hair-follicle morphogenesis data from Lin et al., PNAS 101, 15955 (2004).
Importance of credibility intervals
Red boxes show truly differentially expressed genes
Left: Log-ratios used to rank genes
Right: Credibility intervals used to rank genes
1331 up-regulated genes (1.2 to 4-fold), 12679 invariant
Spike-in data from Choe et al., Genome Biology 6, R16 (2005).
Part 2: Propagating uncertainties
• Uncertainties can be propagated as noise […] where […] is diagonal covariance matrix for gene […]
• Use your favourite probabilistic model for […]
• Data is not i.i.d., making parameter estimation tricky
We consider two popular tasks as examples:
(i) Combining replicates and identifying differential expression
(ii) Principal Component Analysis (PCA)
(i) Combining replicates
Simplest model of log-expression […] for replicate […] in conditions […] is a Gaussian: […]
with priors […]
• Parameters are
• Hyper-parameters are
• We can then calculate the probability of the sign of
change in expression level between two conditions:
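A minimal sketch of the idea, under simplifying assumptions that are not on the slide: each replicate's log-expression is Gaussian around the condition mean, with variance equal to a known between-replicate variance plus that replicate's measurement variance. A flat prior on the mean stands in for the slide's priors, so no hyper-parameters are fitted. All names and numbers are hypothetical.

```python
import math


def combine_replicates(values, meas_vars, between_var):
    """Precision-weighted mean and its variance for one gene in one condition."""
    weights = [1.0 / (between_var + v) for v in meas_vars]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, 1.0 / total


def prob_positive_change(mean_a, var_a, mean_b, var_b):
    """P(mu_b > mu_a) for independent Gaussian posteriors on the two condition means."""
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Third treated replicate is very noisy, so it receives a small weight.
control = combine_replicates([5.0, 5.2, 4.9], [0.05, 0.05, 0.05], between_var=0.02)
treated = combine_replicates([6.1, 5.9, 8.0], [0.05, 0.05, 2.0], between_var=0.02)
p_up = prob_positive_change(control[0], control[1], treated[0], treated[1])
print(control, treated, p_up)
```

The noisy replicate pulls the treated mean far less than it would in an unweighted average, which is the practical benefit of carrying the measurement variances forward.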
Hyper-parameter estimation
Likelihood:
Prior:
We wish to optimise the log marginal likelihood:
The integral is intractable, so we use a variational
approximation (popular approach in machine learning).
The resulting optimisation resembles an EM-algorithm.
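For reference, the generic bound behind a variational EM scheme can be written as follows. The symbols $D$ (data), $z$ (latent parameters), $\theta$ (hyper-parameters) and $q$ (approximating distribution) are generic notation assumed here, not the slide's own.

```latex
\begin{align}
  \log p(D \mid \theta)
    &= \log \int p(D, z \mid \theta)\, dz \\
    &\ge \int q(z) \log \frac{p(D, z \mid \theta)}{q(z)}\, dz
     \;=\; \mathcal{F}(q, \theta)
\end{align}
```

An EM-like procedure alternates between maximising $\mathcal{F}$ over $q$ within a restricted (for example, factorised) family with $\theta$ fixed, and maximising $\mathcal{F}$ over $\theta$ with $q$ fixed.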
Variational approximation
E-step:
M-step:
We use a factorised approximation to the posterior:
Results: credibility intervals
Data from Lin et al., PNAS 101, 15955 (2004).
Identifying differential expression
One chip per condition
3 replicates per condition
1331 up-regulated genes (1.2 to 4-fold), 12679 invariant
Spike-in data from Choe et al., Genome Biology 6, R16 (2005).
(ii) Principal Component Analysis
• Popular dimensionality reduction technique
• Project data onto directions of greatest variation
Useful tool for visualising patterns and
clusters within the data set
Usually requires an ad-hoc method for
removing genes with low signal/noise
This example is from Pomeroy et al., Nature 415, 436 (2002):
embryonal tumours of the central nervous system.
Probabilistic PCA
• PCA can be cast as a probabilistic model […] with […]-dimensional latent variables […]
• The resulting data distribution is […]
• Maximum likelihood solution is equivalent to PCA […]
Diagonal […] contains the top […] sample covariance eigenvalues and […] contains associated eigenvectors
Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999).
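The slide's formulas are not preserved in this transcript. For orientation, the standard probabilistic PCA model from the cited paper is given below. The notation ($d$ observed dimensions, $q$ latent dimensions, $W$, $\mu$, $\sigma^2$, $U_q$, $\Lambda_q$, $R$) is the usual one and may differ from the slide's.

```latex
\begin{align}
  y &= W x + \mu + \epsilon,
     \qquad x \sim \mathcal{N}(0, I_q),
     \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_d) \\
  y &\sim \mathcal{N}\!\left(\mu,\; W W^{\top} + \sigma^2 I_d\right) \\
  W_{\mathrm{ML}} &= U_q \left(\Lambda_q - \sigma^2 I_q\right)^{1/2} R,
     \qquad
  \sigma^2_{\mathrm{ML}} = \frac{1}{d - q} \sum_{j = q + 1}^{d} \lambda_j
\end{align}
```

Here $\Lambda_q$ is diagonal and holds the $q$ largest eigenvalues of the sample covariance, $U_q$ holds the corresponding eigenvectors, and $R$ is an arbitrary rotation.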
Relationship to Factor Analysis
• Probabilistic PCA is equivalent to factor analysis with
equal noise for every dimension
• In factor analysis […] for a diagonal covariance matrix […]
• An iterative algorithm (e.g. EM) is required to find parameters if precisions are not known in advance
In our case we want the precision to be gene and
experiment specific – we need a more flexible model
PCA with measurement uncertainty
• If we let the covariance matrix be gene specific then
Probabilistic PCA:
Corrupted data model:
• The log-likelihood is
with
• The maximum likelihood solution for the mean is […] which is no longer the sample mean
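To see why the sample mean is no longer optimal, consider a hedged sketch in assumed notation: each observed profile $\hat{y}_n$ is Gaussian with a common mean $\mu$ but its own covariance $C_n$, for instance $C_n = W W^{\top} + \sigma^2 I + \Sigma_n$ with $\Sigma_n$ the diagonal measurement covariance. These symbols are assumptions, not the slide's.

```latex
\begin{align}
  \mathcal{L} &= -\frac{1}{2} \sum_{n=1}^{N}
     \left[ \log \lvert C_n \rvert
          + (\hat{y}_n - \mu)^{\top} C_n^{-1} (\hat{y}_n - \mu) \right]
     + \text{const} \\
  \frac{\partial \mathcal{L}}{\partial \mu} = 0
  \;&\Rightarrow\;
  \mu_{\mathrm{ML}} = \left( \sum_{n=1}^{N} C_n^{-1} \right)^{-1}
                      \sum_{n=1}^{N} C_n^{-1}\, \hat{y}_n
\end{align}
```

When all $C_n$ are equal the weights cancel and $\mu_{\mathrm{ML}}$ reduces to the sample mean. Otherwise noisier profiles contribute less.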
Likelihood optimisation
• The optimal parameters are solutions to a coupled non-linear set of equations (e.g. […] depends on […])
• Gradients require inversion of large matrices
• An EM-algorithm provides more efficient optimisation
• M-step still requires non-linear optimisation
• Redundant parameterisation of model gives us a significant speed-up
Advantages over standard PCA
• Automatically eliminates influence of consistently
noisy genes, e.g. noisy in all experiments
• Automatically chooses the number of principal components
because noise “explains away” some of the variation
• Down-weights influence of noisy measurements in an
experiment specific way
• Provides error-bars on the reduced dimension
representation of the data
• Can be used to “denoise” expression profiles
Results: Improved visualisation
Under standard PCA 43% of samples are closest to a sample of the same tumour type.
For modified PCA this percentage increases to 71%.
Data from Pomeroy et al., Nature 415, 436 (2002).
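The quoted percentages correspond to a nearest-neighbour agreement score. A minimal sketch of such a score is shown below, assuming squared Euclidean distance in the reduced space. The function name and the toy coordinates are hypothetical and are not the tumour data.

```python
def same_class_neighbour_fraction(points, labels):
    """Fraction of samples whose nearest other sample shares their label."""
    hits = 0
    for i, p in enumerate(points):
        best_j, best_dist = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue                      # a sample is not its own neighbour
            dist = sum((a - b) ** 2 for a, b in zip(p, q))
            if dist < best_dist:
                best_j, best_dist = j, dist
        hits += labels[best_j] == labels[i]
    return hits / len(points)


# Toy 2-D projection: the last "B" sample sits next to the "A" cluster.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.2, 0.1)]
labels = ["A", "A", "B", "B", "B"]
fraction = same_class_neighbour_fraction(points, labels)
print(fraction)
```

Applying the same function to the standard and the modified projections of a labelled data set gives two directly comparable scores.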
Denoising a data set
• We can estimate the uncorrupted data […] from the noisy measurements […] as […]
• Denoised profile approaches original as noise is reduced
• Denoised data improves performance of clustering
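The limiting behaviour in the second bullet can be illustrated with a scalar stand-in for the full model. The sketch assumes a Gaussian prior on the uncorrupted value and additive Gaussian measurement noise, in which case the posterior mean is a shrinkage estimate. The function and its arguments are hypothetical names.

```python
def denoise(y, prior_mean, prior_var, noise_var):
    """Posterior mean of the uncorrupted value under a scalar Gaussian model."""
    gain = prior_var / (prior_var + noise_var)   # tends to 1 as noise_var tends to 0
    return prior_mean + gain * (y - prior_mean)


# As the measurement noise shrinks, the denoised value approaches the observation 3.0.
estimates = [denoise(3.0, 1.0, 1.0, noise_var) for noise_var in (4.0, 1.0, 0.1, 0.0)]
print(estimates)
```

Noisy measurements are pulled towards what the model expects, while precise measurements are left almost unchanged.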
Sanguinetti et al. Bioinformatics 21, 3748 (2005).
Conclusions
• We have developed a computationally efficient
probabilistic model for Affymetrix probe-level analysis.
• The model provides good accuracy and confidence
intervals for gene expression level estimates.
• Measurement uncertainties can be propagated
through an appropriate probabilistic model.
• Example applications to Bayesian t-test and PCA.
• Parameter estimation becomes much more difficult, so
approximate methods are needed.
• The same principle can be applied to other models.
Acknowledgments
Rest of the team:
Xuejun Liu, School of Computer Science, University of Manchester.
Guido Sanguinetti, Marta Milo & Neil Lawrence, Department of Computer
Science, University of Sheffield.
Software: www.bioinf.man.ac.uk/resources/puma
Papers:
Liu et al. “A tractable probabilistic model for Affymetrix probe-level analysis
across multiple chips” Bioinformatics 21, 3637 (2005).
Sanguinetti et al. “Accounting for probe-level noise in principal component
analysis of microarray data” Bioinformatics 21, 3748 (2005).
Supported by a BBSRC award “Improved processing of microarray data with
probabilistic models”