Transcript Analysis

Supervised microarray data analysis
Mark van de Wiel
Department of Mathematics and Computer Science
Quality control
•Protocols
•Perform a small scale, well-controlled experiment to assess influence of
experimental factors (Microarrays from different batches, printing tips,
dyes, linearity of the scanner, etc.)
•Continuous factors (temperature, humidity, spot size over time, intensity
of control spot over time) can be monitored with standard control chart
techniques.
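As an illustration of the control chart idea, here is a minimal sketch in Python. It assumes a simplified Shewhart-style chart with limits at the baseline mean plus or minus three sample standard deviations; the function names and the control-spot intensities are hypothetical and not taken from any real experiment.

```python
import statistics

def shewhart_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from in-control baseline readings."""
    center = statistics.mean(baseline)
    sd = statistics.stdev(baseline)  # sample standard deviation
    return center - k * sd, center, center + k * sd

def out_of_control(readings, limits):
    """Return the indices of readings that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(readings) if x < lower or x > upper]

# Hypothetical control-spot intensities (arbitrary units) from a stable period
baseline = [100.2, 99.8, 100.5, 99.6, 100.1, 100.3, 99.9, 100.0]
limits = shewhart_limits(baseline)

# Hypothetical new readings; the third one drifts well above the baseline
new_readings = [100.1, 99.7, 103.5, 100.2]
flagged = out_of_control(new_readings, limits)
print("control limits:", limits)
print("flagged reading indices:", flagged)
```

A flagged reading is a prompt to inspect the corresponding array or batch, not proof that something went wrong.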
Design of the experiment
•Think very carefully about what the biological goals are.
•What software do you have at your disposal to analyse the data?
•Do we need a reference sample or not?
•‘Biological design’: which tissues to combine on an array (cDNA)? With more
than one biological factor: factorial design.
•Dye bias: use a dye-swap.
•Design on the array: negative/positive controls, repeats, how many
genes? Run a pilot study first, and distribute the repeats over experimental factors
(spatial position, printing tips, etc.).
•Save some space on the (cDNA) microarray for assessing variability due
to experimental factors (e.g. print same control gene with several printing
tips)
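The dye-swap item in the list above can be sketched as follows. This is a minimal illustration in Python, assuming the dye bias is additive on the log2 scale and identical on both arrays of the pair; the function name and the numbers are hypothetical.

```python
def dye_swap_log_ratio(m_forward, m_swapped):
    """Combine log2(Cy5/Cy3) ratios from a dye-swap pair of arrays.

    Forward array: sample A labelled with Cy5, sample B with Cy3.
    Swapped array: sample A labelled with Cy3, sample B with Cy5.
    An additive dye bias appears with the same sign in both ratios,
    so it cancels in the difference.
    """
    return (m_forward - m_swapped) / 2.0

# Hypothetical values: true log2(A/B) = 1.0, dye bias = 0.3
true_log_ratio = 1.0
dye_bias = 0.3
m_forward = true_log_ratio + dye_bias    # A/B measured as Cy5/Cy3
m_swapped = -true_log_ratio + dye_bias   # B/A measured as Cy5/Cy3
estimate = dye_swap_log_ratio(m_forward, m_swapped)
print("dye-swap estimate of log2(A/B):", estimate)
```

If the dye bias depends on intensity or differs between the two arrays, it cancels only partly, which is one reason to normalise as well.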
Analysis: Multiple testing (after normalization)
Objective: control the number of falsely selected genes
FWE: Family-wise error rate
• Weak FWE control:
P(falsely select at least one gene i, i = 1, ..., 20,000 | no gene truly expressed) ≤ α
• Strong FWE control:
P(falsely select at least one gene i, i = 1, ..., 20,000 | some genes expressed, some
genes not expressed) ≤ α
FDR: False Discovery Rate
F: Number of false rejections (falsely selected genes)
T: Total number of rejections
FDR control: E[F/T] ≤ α (with F/T taken as 0 when T = 0)
Multiple testing: FWE vs FDR
• Control of FDR implies weak control of FWE
• Advantage of strong control of the FWE: the significance level α is controlled
under all situations
• Disadvantage: less power than FDR control
• FWE-based procedures tend to select fewer genes than FDR-based
procedures
Software:
• Bioconductor: Step-down Westfall-Young (Dudoit et al.), control FDR
and FWE.
• SAM (permutation based ‘control’ of FDR)
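To make the power difference between FWE and FDR control concrete, here is a small sketch in Python. It does not use the Westfall-Young procedure or SAM named above; as stand-ins it uses Holm's step-down procedure (strong FWE control) and the Benjamini-Hochberg step-up procedure (FDR control under independence). The p-values are made up for illustration.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: strong control of the FWE at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject

def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: FDR control at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank  # remember the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values for ten genes
pvalues = [0.001, 0.008, 0.012, 0.019, 0.030, 0.20, 0.45, 0.60, 0.75, 0.90]
n_holm = sum(holm_reject(pvalues))
n_bh = sum(bh_reject(pvalues))
print("genes selected with FWE control (Holm):", n_holm)
print("genes selected with FDR control (BH):  ", n_bh)
```

On these made-up p-values the FWE-based procedure selects fewer genes than the FDR-based one.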
[Figure: SAM plot, observed score (vertical axis) against expected score (horizontal axis). Significant: 249; Delta: 1.03419; median number of false significant: 12.66389.]
SAM
• Developed at Stanford, Tibshirani et al. (Paper: Tusher et al., PNAS
98, 5116-5121). Claim is FDR control.
Plus:
1. Ease of use, add-in to Excel
2. Allows asymmetric cut-offs
Minus:
1. Distribution under the null hypotheses (‘no expression’) needs to be
the same for all genes to guarantee FDR control
2. Combination with k-fold rule: no control of FDR anymore
Solutions:
1. Use (normal) rank scores and a simple rank statistic
2. Explicitly test on k-fold expression; combine with FDR criterion
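The idea behind a permutation-based estimate of the number of false positives can be sketched in Python as follows. This is not the SAM algorithm itself: it assumes a plain difference of group means instead of SAM's moderated statistic, a single symmetric threshold, and a tiny made-up data set small enough to enumerate every relabelling. All names are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

def mean_diff(values, group_a):
    """Difference of means between the samples in group_a and the rest."""
    a = [values[i] for i in group_a]
    b = [v for i, v in enumerate(values) if i not in group_a]
    return mean(a) - mean(b)

def estimate_fdr(data, group_a, threshold):
    """Return (genes called, median false count over relabellings, FDR estimate)."""
    n_samples = len(data[0])
    observed = [abs(mean_diff(row, group_a)) for row in data]
    n_called = sum(s >= threshold for s in observed)
    false_counts = []
    # Enumerate every way of relabelling the samples into two groups
    for relabel in combinations(range(n_samples), len(group_a)):
        relabel = set(relabel)
        stats = [abs(mean_diff(row, relabel)) for row in data]
        false_counts.append(sum(s >= threshold for s in stats))
    med_false = median(false_counts)
    fdr = med_false / n_called if n_called else 0.0
    return n_called, med_false, fdr

# Hypothetical log-expression values: 4 genes x 6 samples (3 per group)
data = [
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],     # clearly different between groups
    [0.1, -0.1, 0.0, 0.05, -0.05, 0.0],
    [0.2, 0.0, -0.2, 0.1, -0.1, 0.0],
    [-0.1, 0.1, 0.0, 0.0, 0.1, -0.1],
]
n_called, med_false, fdr = estimate_fdr(data, {0, 1, 2}, threshold=1.0)
print("called:", n_called, "median false:", med_false, "FDR estimate:", fdr)
```

Because the same threshold is applied to every gene, the estimate is only meaningful if the null distribution of the statistic is comparable across genes, which is the first limitation listed above.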
Modelling vs Normalisation + Testing
• Modelling forces you to state what the assumptions are (linearity,
normality, independence, etc.)
• Normalisation steps may not be commutative
• Non-linearities can be dealt with by normalisation methods
• Advanced modelling requires help of statistician/bio-informatician
Standard approach to modelling: ANOVA. Model has two levels:
1. Normalisation level which includes linear corrections for dye and
microarray effects
2. Gene expression level which includes effects on gene level, including
interactions (interaction of interest is usually gene*variety)
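One common way to write such a two-level ANOVA model is sketched below. The notation is an assumption made here for illustration: y is the intensity of gene g on array a in dye channel d, v(a,d) is the variety hybridised in that channel, and the gene-by-array and gene-by-dye terms are examples of gene-level interactions, not necessarily the ones used in any particular analysis.

```latex
% Level 1 (normalisation): linear corrections for array and dye effects
\log_2 y_{gad} = \mu + A_a + D_d + r_{gad}

% Level 2 (gene expression): gene effect plus gene-level interactions,
% fitted to the residuals r_{gad} from level 1
r_{gad} = G_g + (GA)_{ga} + (GD)_{gd} + (GV)_{g\,v(a,d)} + \varepsilon_{gad}
```

Under this notation the term of interest is (GV), the gene*variety interaction: differences between its levels for a fixed gene estimate differential expression between varieties.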
Software
• Freeware: SAM, Bioconductor
• Specialized commercial software: Spotfire, Genespring, Genesight,
Rosetta
Most contain: normalisation, variance stabilizing transformations, ANOVA,
testing (most do not yet include the advanced multiple testing criteria)
• Statistical software: SAS, S-Plus, SPSS
Much more debugged, long history, better documentation (Often very
unclear what the specialized packages really do.)
Advantages of specialized software: user-friendly, visualisation (nice
pictures), links with databases, annotation
Try several!!!
Bayesian models
+Natural translation to networks (pathways)
+Complex models (linearity is not necessary, interactions)
+Prior biological knowledge can be included
+Nesting of the models (image analysis + normalisation + gene
expression)
+Inference for complex functions of gene expression data is
relatively easy
-No ‘easy’ software
-Computational methods may take time to find reliable estimates
[Figure: example network]
Validation
• Cross-validation: leave some data out and see how well the
left-out values are predicted by the model. (Note that for
normalisation procedures it may be harder to predict the data
from the normalized data.)
• Biological validation (spikes: known concentrations).
Very useful for validating the normalisation procedure or the
model:
1. Pretend that spikes with equal concentrations that are used
under different conditions (different dyes, microarray batches) are
different quantities.
2. Estimate the ratio of the two estimates after normalisation or modelling.
3. The ratio should be approximately equal to 1.
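The three spike-validation steps can be sketched in Python as follows. This is a minimal illustration assuming the spike estimates are normalised log2 intensities and that a tolerance of 10% around 1 counts as acceptable; the tolerance, the function names and the data are hypothetical.

```python
from statistics import mean

def spike_ratio(log2_cond_a, log2_cond_b):
    """Ratio of estimated spike abundance between two conditions.

    Inputs are normalised log2 intensities of the same spike (equal true
    concentration) measured under two conditions, e.g. two dyes.
    """
    return 2 ** (mean(log2_cond_a) - mean(log2_cond_b))

def close_to_one(ratio, tolerance=0.1):
    """True if the ratio is within the chosen tolerance of 1."""
    return abs(ratio - 1.0) <= tolerance

# Hypothetical normalised log2 intensities of one spike in two dye channels
cy3 = [10.0, 10.1, 9.9]
cy5 = [10.05, 9.95, 10.0]
ratio = spike_ratio(cy3, cy5)
print("estimated ratio:", ratio, "acceptable:", close_to_one(ratio))
```

A ratio far from 1 suggests that the normalisation or the model has not removed the dye or batch effect.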
Comparison and meta analysis
•Objective comparisons between methods very much needed!
•Simulations may help (because we know the truth then). Setting
up realistic simulations may be hard!
•Competition between several methods (CAMDA ’03: Lung cancer)
Future goals:
•Methods that allow for combining data from several experiments.
•From relative quantities to absolute quantities.
•Absolute quantities allow for direct comparison between labs.
(otherwise, only if labs have used same reference material etc.)
Useful overview papers, books
Design: Churchill, G.A. (2002) Fundamentals of experimental design for
cDNA microarrays. Nature Genet. 32, 490-495.
Analysis: Slonim, D.K. (2002) From patterns to pathways: gene expression
data analysis comes of age. Nature Genet. 32, 502-508.
Normalisation: Quackenbush, J. (2002) Microarray data normalization and
transformation. Nature Genet. 32, 496-501.
Pitfalls: Simon, R. et al. (2003) Pitfalls in the use of DNA microarray
data for diagnostic and prognostic classification. J. Natl Cancer Inst. 95,
14-18.
Books: Baldi & Hatfield (2002) DNA Microarrays and Gene Expression.
Cambridge University Press.
Speed, T. (2003) Statistical Analysis of Gene Expression Microarray Data.
Chapman & Hall.
Acknowledgement: Nicola Armstrong (EURANDOM)