Transcript slides

LIMMA
Linear Models for Microarray Data
Difficulties with microarray data
• Variability of the expression values differs
between genes
• Non-identical and dependent distribution
between genes
• Multiple testing of tens of thousands of
genes
Correct for multiple comparisons
• Multiple testing corrections: family-wise error
rate, false discovery rate, etc.
• Parallel nature of the inference allows for
compensating possibilities
• Borrowing information from the ensemble of
genes to assist in inference from individual
genes
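One common false discovery rate correction is the Benjamini-Hochberg procedure. The sketch below is a minimal, standard-library illustration of that procedure, not code from the slides; the function name and the five raw p-values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
print(adj)
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) would be called significant.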
Empirical Bayes
• In frequentist methods, a hypothesis is typically
rejected or not rejected without directly
assigning it a probability
• Bayesian methods specify a prior probability,
which is then updated in the light of
new data.
• For Bayesian techniques, the prior distribution is
assigned independently of the data and fixed
before any data are observed.
Empirical Bayes
• Superficially similar to Bayesian methods
in that a prior distribution is assigned.
• However, prior distribution is estimated
from the data
• Therefore Empirical Bayes is often regarded as
a frequentist technique
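The idea of estimating the prior from the data can be sketched with gene-wise variances. The example below shrinks each gene's sample variance toward a prior variance taken from the whole ensemble, using the weighted-average form (d0 * s0^2 + d * s^2) / (d0 + d). It is a simplified illustration under stated assumptions: the prior variance is crudely taken as the mean of the sample variances and the prior degrees of freedom are fixed by hand, whereas limma estimates both from the data with a more careful procedure. All names and numbers are hypothetical.

```python
from statistics import mean


def moderated_variances(sample_vars, residual_df, prior_df):
    """Shrink gene-wise variances toward a prior variance from all genes."""
    # Assumption: a crude prior variance, the mean of all sample variances.
    prior_var = mean(sample_vars)
    return [
        (prior_df * prior_var + residual_df * s2) / (prior_df + residual_df)
        for s2 in sample_vars
    ]


# Hypothetical sample variances for four genes, each with 4 residual
# degrees of freedom (for example 8 arrays minus 4 fitted parameters).
gene_vars = [0.01, 0.20, 0.90, 0.05]
shrunk = moderated_variances(gene_vars, residual_df=4, prior_df=4)
print(shrunk)
```

Very small variances are pulled up and very large ones are pulled down, which stabilises the test statistics for genes measured on only a few arrays.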
LIMMA
• Empirical Bayes techniques have previously
been applied to microarray data
• Those analyses were specific to the experiment and
very difficult to implement
• LIMMA: a simple model with a simple expression
for the posterior odds
• Allows linear modelling to be applied to
microarray data
Estrogen Data
• 2x2 factorial experiment on MCF7 breast cancer
cells using Affymetrix HGU95av2 arrays
• Factors: estrogen (presence/absence) and
length of exposure (10hr/48hr)
• The idea of the study is to identify genes that
respond to estrogen treatment
Read in the Data
• Load in the estrogen data
• Normalise the data
• Define the targets (factors) for the linear
model
Design Matrix
Array  File          Estrogen  Time (hr)
1      low10-1.cel   absent    10
2      low10-2.cel   absent    10
3      high10-1.cel  present   10
4      high10-2.cel  present   10
5      low48-1.cel   absent    48
6      low48-2.cel   absent    48
7      high48-1.cel  present   48
8      high48-2.cel  present   48
• Eight arrays
• Four pairs of replicates
• Four parameters in the linear model
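The step from targets to a design matrix can be sketched as follows. This assumes a group-means parameterisation, one column per estrogen/time combination, which gives the four parameters mentioned above; the slides do not say which parameterisation was actually used, and the variable names are illustrative.

```python
# Targets: (file, estrogen, hours), copied from the table on this slide.
targets = [
    ("low10-1.cel", "absent", 10),
    ("low10-2.cel", "absent", 10),
    ("high10-1.cel", "present", 10),
    ("high10-2.cel", "present", 10),
    ("low48-1.cel", "absent", 48),
    ("low48-2.cel", "absent", 48),
    ("high48-1.cel", "present", 48),
    ("high48-2.cel", "present", 48),
]

# One group label per array, e.g. "absent10" or "present48".
groups = [f"{estrogen}{hours}" for _, estrogen, hours in targets]
levels = sorted(set(groups))

# Design matrix: rows are arrays, columns are groups (1 = array in group).
design = [[1 if g == level else 0 for level in levels] for g in groups]

print(levels)
for (name, _, _), row in zip(targets, design):
    print(name, row)
```

Each row contains a single 1 and each column sums to 2, reflecting eight arrays arranged as four pairs of replicates.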
Contrast Matrix
• Estrogen effect at 10 hours: arrays 3-4 (present)
versus arrays 1-2 (absent)
• Estrogen effect at 48 hours: arrays 7-8 (present)
versus arrays 5-6 (absent)
• Time effect without estrogen: arrays 5-6 (48 hr)
versus arrays 1-2 (10 hr)
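The three comparisons on this slide can be written as contrast vectors over the four group means. The sketch below assumes one coefficient per group, named absent10, absent48, present10 and present48; the short contrast names and the log2 group means for a single gene are invented for illustration.

```python
# Assumed order of the four group-mean coefficients.
levels = ["absent10", "absent48", "present10", "present48"]

# Each contrast is a set of weights on the group means.
contrasts = {
    "E10": {"present10": 1, "absent10": -1},   # estrogen effect at 10 hours
    "E48": {"present48": 1, "absent48": -1},   # estrogen effect at 48 hours
    "Time": {"absent48": 1, "absent10": -1},   # time effect without estrogen
}


def contrast_vector(spec):
    """Expand a contrast into one weight per coefficient, in level order."""
    return [spec.get(level, 0) for level in levels]


# Hypothetical fitted log2 group means for one gene.
group_means = {"absent10": 7.0, "absent48": 7.5,
               "present10": 9.0, "present48": 10.5}

for name, spec in contrasts.items():
    estimate = sum(w * group_means[level] for level, w in spec.items())
    print(name, contrast_vector(spec), estimate)
```

The estimate for each contrast is the log2 fold change for that comparison.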
Differential Expression
• Extract linear model fit for contrasts
• Obtain list of differentially expressed
genes for contrasts
• Look for overlap among differentially
expressed genes
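Looking for overlap amounts to comparing the gene lists as sets, as in a Venn diagram. A minimal sketch, with probe identifiers that are invented placeholders rather than real results:

```python
# Hypothetical lists of differentially expressed probes per contrast.
de_genes = {
    "E10": {"probe_a", "probe_b", "probe_c"},
    "E48": {"probe_b", "probe_c", "probe_d", "probe_e"},
}

shared = de_genes["E10"] & de_genes["E48"]     # respond at both time points
only_10h = de_genes["E10"] - de_genes["E48"]   # respond at 10 hours only
only_48h = de_genes["E48"] - de_genes["E10"]   # respond at 48 hours only

print(sorted(shared), sorted(only_10h), sorted(only_48h))
```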
Linear Model Fit
• logFC - Estimate of the log2-fold-change
corresponding to the effect or contrast
• AveExpr - Average log2-expression for the
probe over all arrays/channels
• t - moderated t-statistic
• P.Value - Raw p-value
• adj.P.Val - p-value adjusted for multiple testing
• B - log odds that the gene is differentially
expressed
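Because B is a log odds, it can be converted to a probability with the logistic function, p = exp(B) / (1 + exp(B)). A small sketch; the function name is illustrative and not part of any package.

```python
import math


def b_to_probability(b):
    """Convert a log-odds score B into a probability."""
    return 1.0 / (1.0 + math.exp(-b))


for b in (-2.0, 0.0, 1.5):
    print(b, round(b_to_probability(b), 3))
```

A B of 0 corresponds to even odds (probability 0.5), and a B of 1.5 corresponds to a probability of roughly 0.82.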
Annotating Data
• Probe arrays can be annotated with
external data
• Multiple sources of gene annotations
Gene Set Enrichment
• All biochemical pathways are determined by sets of
genes
• Gene sets are determined by prior biological knowledge
relating to co-expression, function, location or known
biochemical pathways.
• If a pathway is in any way related to a biological trait
then the co-functioning genes should display a higher
degree of enrichment compared to the rest of the
transcriptome.
• Gene Set Enrichment (GSE) is a computational
technique which determines whether an a priori defined
set of genes shows statistically significant overlap
with a list of genes of interest, such as the
differentially expressed genes
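One simple way to test whether an overlap is statistically significant is a hypergeometric (over-representation) test. The sketch below shows that approach as an illustration; it is not necessarily the method used in these slides, and rank-based enrichment methods work differently. The function name and the counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_genes, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_genes, of which n_in_set belong to the gene set."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_set, n_selected)
    favourable = sum(
        comb(n_in_set, i) * comb(n_genes - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 20 genes on the array, 5 in the gene set,
# 4 called differentially expressed, 3 of those in the gene set.
p = overlap_p_value(20, 5, 4, 3)
print(p)
```

A small p-value suggests that the gene set is over-represented among the selected genes compared with what random selection would give.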
Estrogen receptor (ER) gene set
• If estrogen is present, the estrogen receptors
(the products of ER genes) bind the estrogen
and become activated
• Activated receptors gain the ability to regulate
gene expression, resulting in differential
expression between the cells with and without
estrogen
• This should lead to upregulation of ER genes