Microarrays and gene expression
Download
Report
Transcript Microarrays and gene expression
Using microarray for identifying
regulatory motifs and analysing time
series gene expression
Outline:
Using Bayesian statistics in classifying
genes in microarray analysis
Combining gene expression data and
sequence data for transcription site
discovery
Time Series Gene Expression analysis
Pietro Lió –
EMBL-EBI and Dept of Computer Science, Cambridge University [email protected] [email protected]
Types of questions
What genes are expressed in a particular cell type of an
organism, at a particular time, under particular
conditions?
Comparison of gene expression between normal and
diseased (e.g., cancerous) cells.
A typical
dimension of such
an array is about 1
inch or less.
The spot diameter
is of the order of
0.1 mm, for some
microarray types
can be even
smaller.
One of the most popular microarray
applications allows to compare gene
expression levels in two different samples,
e.g., the same cell type in a healthy and
diseased state.
Classification of genes in microarray analysis
Microarray statistics can be supervised or
unsupervised.
In the unsupervised case, only the expression data
are available and the goal is mainly to identify
distinct sets of genes with similar expressions,
suggesting that they may be biologically related.
In supervised problems a response measurement is
also available for each sample and the goal of the
experiment is to find sets of genes that, for
example, relate to different kind of diseases, so
that future tissue samples can be correctly
classified.
Bayesian Statistics
Bayesian statistics
models all source of
uncertainty (physical
randomness, subjective
opinions, prior
ignorance, etc) with
probability
distributions and then
trying to find the a
posteriori distribution
of all unknown variable
of interest after
considering the data.
Bayes, T. 1763. An essay towards solving
a problem in the doctrine of chances.
Philosophical Transactions of the Royal
Society of London 53:370-418. Reprinted,
E. S. Pearson and M. G. Kendall (eds.).
1970. Pages 131-153 in Studies in the
History of Statistics and Probability.
Charles Griffin, London.
In Bayesian statistics, evidences in favour of certain parameter θ
is considered :
PrD | Pr
Pr | D
PrD
Inference is based on posterior distribution, Pr(θ|D) which is the
conditional distribution of the parameter given the data, D. It is
a combination of the prior information and the data.
The term Pr(θ) represents the prior distribution on the
parameter. Prior information can be based on previous
experiments, or theoretical consideration. The term Pr(D|θ) can
be a likelihood and the denominator Pr(D) is a normalizing factor.
First write down the likelihood function, i.e. the probability of
the observed data given the unknowns, and multiply it by the a
priori distribution.
Let D denote the observed data and θ the unobserved
parameter.
Through the application of the Bayes theorem we can obtain
the posterior distribution:
Pr | D
PrD |
PrD | d
PrD |
PrD |
PrD
When θ is discrete, the integral is replaced by summation.
Bayesian Learning
Posterior is prior modified by data
Flat (“uninformative prior”)
0
θ
1
MCMC
The most useful method to approximate the posterior
probability is Monte Carlo Markov chain (MCMC).
MCMC, …….
…consider a robot random walking in a landscape with
hills…(these hills will be the genes involved in a disease
or a set of regulatory regions but you will see later)
Imagine a bivariate normal hills and the Markov chain
represents the positions of steps taken by a robot.
hills
A proposed move is always
accepted if it is “uphill” and is
often accepted if it is not too
much “downhill” from the
current position.
RULE: The step would be
rejected if the elevation at
proposed new location is less
than that of the current
location, and the uniform
random number drawn was
greater than the ratio of
new elevation to old
elevation.
The inner and the outer circles represent 50% and 95%
probability contours for the bivariate normal density
The robot visits points in proportion
to their height. Since the height
represents the value of the
probability density function, we can
simply count the dots in order to
obtain an approximation of the
probability corresponding to a given
area. In other words, the necessary
integration of the density function is
approximated by the MCMC run.
BURN IN: the initial steps (the “burn in”) are often discarded in a
MCMC run.
A proposed move is always accepted if it is
“uphill” and is often accepted if it is not too
much “downhill” from the current position.
Formally, in the Metropolis-Hastings (Metrolis
et al., 1953, Hastings, 1970) algorithm, a move
from θ to θ* is accepted with probability
min(r,1), where
P r * | D P r | *
r
P r | D P r * |
Longer runs of the chain result in better approximations
to posterior probabilities, and stopping too soon can have
serious consequences (chain may not have been visited
large portions of the parameters space).
Below is a plot showing the natural logaritm of the posterior
probability density (i.e. the elevation) for each point visited by the
robot. These are often called “history plots” .
Plots like this are often made for MCMC runs to see how well the
chain is “mixing”.
Mixing and MCMCMC
Setting the step size too small
runs the risk of missing the big
picture: the robot may explore
one hill in the landscape very
well but not take steps large
enough to ever traverse the gap
between hills.
MCMCMC (MC)3
The solution is to run several Markov chains
simultaneously, letting one of them (the “cold”) chain to
be the one that counts, and the others (called “the
heated” chains) serve as scouts for the cold chain.
Data set on Golub et al. (Science 286, 531-537,
1999) on cancer classification that has 38 samples
and expression profiles of thousands of genes.
Samples are classified into acute lymphoblastic and
acute myeloid leukemia. 34 further samples available
for validation.
Data set of Alon et al. (PNAS 96, 6745-6750,1999)
on colon cancer classification that has a total of 62
samples and expression profiles of 2000 genes.
The method is based on using a regression setting
and a Bayesian priors to identify the important
genes.
The model assigns a prior probability to all
possible subsets of genes and then proceeds by
searching for those subsets that have high
posterior probability.
This method is able to locate small sets of genes
that lead to a good classification results.
MCMC and microarray
The MCMC involves 2 steps:
(i) A subset of genes is proposed by stochastically
adding or dropping one gene each time.
(ii) This set is then either accepted or rejected with
a probability described by Metropolis and Hastings .
If a new set is accepted, then it is subjected to
further perturbation.
Output: set of genes that differ among
the two types of leukemia
genes
Leukemia classification: posterior probabilities
for single genes in 9 runs
Leukemia classification: renormalized posterior
probabilities for single genes
Some results from leukemia data set
Cystatin A and C belong to cystein protease inhibitors.
Cystein proteases induce apoptosis of leukemia cells.
IL-8 is a chemokine released in response to an inflammatory
stimulus that attracts and activates several types of
leukocytes. IL-8 gene is close to several other genes
involved in cancer, such as the interferon gamma induced
factor.
Adipsin (Complement factor D) is a serine protease
secreted by adipocytes involved in antibacterial host
defense. Note that IL-8 is also released from human
adipose tissue.
Zyxin is a LIM-domain protein localized in focal celladhesion.
Some results from Colon cancer data set
CRP contains 2 copies of LIM domain (acting cell growth
factors); its expression is parallel to that of c-myc that has
an abnormally high expression in a wide variety of human
tumours. It interacts with cytoskeleton and cell- adhesion
proteins.
Myosin regulatory light chain 2 binds calcium ions and
regulate contractile activity of several cell types. It is
present in several EST cDNA extracted from a large
number of tumours.
Regulatory regions discovery using gene
expression data
Related to Reverse engineering of genetic networks
Related genes that have similar expression
profiles are likely to have similar regulation
mechanisms as there must be a reason why their
expression is similar under a variety of
conditions (not always true!).
If we cluster the genes by similarities in their
expression profiles and take sets of promoter
sequences from genes in such clusters, some of
these sets of sequences may contain a ‘pattern’
which is relevant to regulation of these genes
Combining sequence and gene expression
data to find transcription sites
The search for transcription sites often starts from
the assessment of the statistical significance of
over- or under-representation of oligomers in
complete genomes that may indicate phenomena of
positive or negative selection.
Finding the sequences (protein
binding sites) that affect the
gene expression
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …gene1
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …gene2
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …gene3
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …gene4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …gene5
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …gene6
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …gene7
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
Motif
Gene expression
Y X ,
~ N 0, 2
Strategy diagram
Rank all genes by expression and obtain their
upstream sequences
Find motifs from promoters of most induced
and most repressed genes
Score each upstream sequence for matches
to each considered motif
Perform variable selection on the significant
motifs to find group of motifs acting
together to affect expression
MCMC and microarray
The MCMC involves 2 steps:
A new subset of motifs is proposed by
stochastically adding or dropping one
motif each time.
This set is then either accepted or
rejected with a probability described
by Metropolis and Hastings .
Gene expression data, combined with sequence
analysis, seem to represent a promising method to
identify transcription sites.
We consider both yeast sequences and expression
data and outputs statistically significant motifs: the
regions upstream genes contribute to the logexpression level of a gene.
First we have first identified a number of sequence
patterns upstream the yeast genes and then tested
for candidate transcription sites.
Finding all cell-cycle genes from gene expression data
2.5
Periodically
expressed genes2
1.5
Intensity
1
Non periodically
expressed genes0.5
0
0
2
4
6
8
10
12
Time measure
14
16
18
20
Cell-cycle experiments
J.B. Fourier (1768-1830)
time
frequency
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0
2
4
6
8
10
12
14
16
18
20
Regression using Fourier terms
Suppose Yi(t) denotes the expression of gene i at time t.
For all genes the vector Yi(t) was fit first with least squares
to Yi(t)=aiS(t)+biC(t)+Ri(t), where S(t)=sin(2t/T) and
C(t)=cos(2t/T) with T is the period of the cell cycle.
Yi(t) can be decomposed into a periodic component
Zi(t)=aiSi(t)+biCi(t) and a component Ri(t) that is aperiodic.
The proportion of variance explained by the Fourier basis
(Fourier proportion of variance explained -PVE) is the ratio
mi=var(Zi(t))/var(Yi(t)), which can range from 0 to 1.
Values closer to 1 indicate greater sinusoidal expression,
whereas values closer to 0 indicate lack of periodicity or
periodicity with a period that is substancially different.
The fitted waveform Z(t) resembles a sine wave. For each
gene, the phase of Z(t) (equal to time of maximal
expression) was determined.
3.5
1.7
2
1.6
1.8
3
1.5
1.6
2.5
1.4
1.4
2
1.3
1.2
1.5
1.1
1
1
0.8
0.5
0.9
0.6
0
0.8
-0.5
0.4
0.7
0
2
4
6
8
10
12
14
16
18
20
3.5
2
1.8
3
1.6
2.5
1.4
2
1.2
1.5
1
0.8
0.5
0.6
0
-0.5
0.4
0
2
4
6
8
10
12
14
16
18
20
Wavelets
A wavelet is an oscillation that decays quickly
are related to fractals in that the same shapes repeat at
different orders of magnitude.
outperforms a Fourier when the signal under
consideration contains discontinuities and sharp spikes
differ from Fourier methods in that they allow the
localization of a signal in both time and frequency
A wavelet is an oscillation that decays quickly
Wavelet basis are
related to fractals in
that the same shapes
repeat at different
orders of magnitude
A function can be represented
in wavelet domain by the an
infinite series expansion of
dilated and translated version
of a function :
Each wavelet function jk is
multiplied by a wavelet coefficient
J, scale parameter
K, position parameter
Wavelet methods differ from Fourier methods in that they allow
the localization of a signal in both time and frequency
w
f(t)
Impossible!
t
Instant
frequency
t
Fourier
Local Fourier
w
w
t
t
w
When modeling time/frequency
phenomena no precision in time
and frequency at the same time!
t
Wavelet transform
the time/frequency plane is
patched with rectangles of
w different shape
t
Wavelet coefficients: the upper
ones describe gross features of
the signal, the bottom ones
descrive little details
Scale J
Position,k
The sum of the squares
of the wavelet coefficients
at each scale is termed
a scalogram
Denoising and extraction of
components of figure a
Periodic
Analysis of
single time
series
0.3
0.2
Variance
0.1
0
The wavelet variance of
periodically expressed
genes is 100 times larger
than that of non periodic
genes
-0.1
-0.2
-0.3
0
0.5
1
1.5
2
Wavelet scale (fine to coarse)
2.5
3
Non Periodic
-3
x 10
5
4
3
2
Variance
1
Confidence intervals
0
-1
wavelet variance is a scale
-2by scale decomposition of
the variance of a signal.
-3
-4
-5
0
0.5
1
1.5
2
Wavelet scale (fine to coarse)
2.5
3
Replacing ‘global’ variability with variability over scales allows us to investigate
the effects of constraints acting at different time or sequence-length scales
Cell cycle analysis
Reconstruction by an
Anova in time domain
2
2
5
0
0
0
-2
0
0.5
1
-2
0
0.5
1
-5
0
10
10
5
0
0
0
-10
0
0.5
Errors for Anova
in time domain
Reconstruction by an
Anova in wavelet domain
1
-10
10
10
0
0
0
0.5
1
-5
0
Errors for Anova in
wavelet domain
0.5
1
0.5
1
0.5
1
5
0
-10
-20
-10
0
0.5
1
-20
0
0.5
1
-5
0
RESULTS:
Residual errors
are smaller in
wavelet domain
-> BETTER
ESTIMATE OF
VARIANCE
USING
WAVELETS
FDR false discovery rate
Bonferroni method belongs to a class of methods called family-wise type I
error rate (FWE), which is the probability of committing at least one error in
the family of hypotheses. The main problem with classical procedures, such
the Bonferroni correction, which hinder many applications in biology, is that
they tend to have substantially less power than uncorrected procedures.
Recent FEW methods have been used in microarray by Dudoit et al (2002)
The multiplicity problem in microarray data does not require a protection
against even a single type I error, so that the severe loss of power involved
in such protection is not justified. Instead, it may be more appropriate to
emphasize the proportion of errors among the identified differentially
expressed genes.
The expectation of this proportion is the False Discovery Rate (FDR) of
Benjamini and Hochberg (1995). The FDR is the expected proportion of
true null hypotheses rejected out of the total number of null hypotheses
rejected.
Adaptive - FDR procedure
Given a q (for example q=0.05)
Order the p-values
p(1) p(2) … p(m)
Let
r = max (i : p(i) i · q / m )
Reject
H0(1), H0(2), … H0(r)
Benjamini Y, Hochberg Y (1995).
Controlling the false discovery rate: A practical and powerful approach to
multiple testing. Journal of the Royal Statistical Society, Series B, 57:289--300.
Benjamini, Y, Yekutieli D (2003).
The control of the false discovery rate under dependence. Annals of Statistics.
To appear.
Thanks to
Alvis Brazma
Marina Vannucci
Nick Goldman