Transcript Slide 1

Microarrays
Naomi Altman
Dept. of Statistics and
PSU
March 7, 2007
1
Microarrays 100
A Statistician’s Simplification
A microarray is a piece of glass or polymer with
several thousand spots, each of which contains
thousands of copies of a short piece of 1 strand of the
"double helix" of DNA or of cDNA (to be explained).
The rungs of the DNA ladder consists of 2 bound
codons which are designated C, G, A, T. These are
called base pairs. Each spot on a microarray consists
of a piece of one side of the ladder with the attached
base.
C binds only to G.
A binds only to T.
A sample containing an unknown number of the
complementary strands is labeled and hybridized to
the array.
http://www.bioteach.ubc.ca
/MolecularBiology/AMonks
FlourishingGarden/
The response is a measure of the quantity of label for
each spot, which should be proportional to the
number of complementary strands in the sample. 2
Agenda
1. DNA 100 - what we need to know to understand what
a microarray can measure.
2. What can a microarray measure?
3. Where does the material printed on the microarray
come from?
4. What does a microarray experiment "look like" and
where do statistical methods fit in?
5. (Time permitting) Gene expression experiments and
3
the p >> n problem.
DNA 100
A Statistician’s Simplification
Every cell in an organism has the same genetic
material, stored in the double helix of DNA.
In a diploid population, most cells have 2 copies of
each chromosome.
Genes are the part of the DNA that code for
proteins, but there are many other important
features that interest biologists.
http://www.accessexcellence.org/
RC/VL/GG/chromosome.html
4
Transcription (Making RNA)
•transcription factors bind to the
promoter and bind RNA polymerase
•DNA strands separate and
transcription is initiated
•transcription continues in the 3'-5'
direction until the stop codons are
reached
•The completed RNA strand is
released for post-processing
www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm
5
Introns and Exons
promoter
Chromosome
In "higher" organisms, the gene contains
noncoding regions, called introns, and
coding regions called exons.
The introns are spliced out of the mRNA
before translation into protein.
"Splicing variants" can be formed by the cell
selecting combinations of the exons.
The resulting spliced strand is the mRNA.
We can "predict" exons using statistical
algorithms, but the gold standard is that
only exons match mRNA sequences
At each end of the mRNA is an untranslated
region (UTR) which is unique to the gene.
http://biology.unm.edu/ccouncil/Bi
ology_124/Summaries/T&T.html
6
cDNA
RNA is much less stable than DNA.
To preserve the exon sequence, and for printing
microarrays, reverse transcription is used in the
lab to convert the RNA into the complementary
cDNA.
cDNA can be preserved by inserting it into the
genome of a living microbe (cDNA library).
7
DNA 100
A Statistician’s Simplification
DNA is complicated stuff.
Protein-coding regions are called genes.
There are also other functional parts to the DNA, some of which
code for RNA and some of which are regulatory regions - i.e. they
help control how the coding regions are used - e.g. promoters
The supercoiling of the DNA may also control how the coding
regions are used.
As well, there is a lot of DNA which appears to be "junk" - i.e. to
date no function is known. But we keep making new discoveries e.g. some of the "junk" codes for small RNA pieces that are
functional.
8
What can be measured on a microarray?
1. Amount of mRNA expressed by a gene.
2. Amount of mRNA expressed by an exon.
3. Amount of RNA expressed by a region of DNA.
4. Which strand of DNA is expressed.
5. Which of several similar DNA sequences is present in the genome.
6. How many copies of a gene is present in the genome.
7. Where a known protein has bound to the DNA. (ChIP on chip)
9
What can be measured on a microarray?
1. Amount of mRNA expressed by a gene.
gene expression array, exon array, tiling array
2. Amount of mRNA expressed by an exon.
exon array, tiling array
3. Amount of RNA expressed by a region of DNA.
tiling array
4. Which strand of DNA is expressed.
exon array, tiling array
5. Which of several similar DNA sequences is present in the genome.
SNP array
6. How many copies of a gene is present in the genome.
gene expression array, exon array, tiling array
7. Where a known protein has bound to the DNA. (ChIP on chip)
promoter array, tiling array
10
Types of Microarrays
cDNA sequence
UTR
Exon 1 Exon 2
Exon 3 UTR
oligo
exon
exon
exon
cDNA
chromosome sequence
CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA
CCGTTCACA
AAGGCCGTT
CCGTGCACA
AAGGACGTT
tile
tile
tile
tile
tile
tile
tile
tile
tile
SNP
tile
tile
tile
tile
tile
tile
tile
tile
tile
tile
promoter
A cDNA microarray can be made from the unsequenced cDNA library. All the
other types require that the sequence be available.
11
Print Technology
The cDNA or oligo can be:
1. Printed on the slide using an "arraying robot" which deposits a drop of liquid
containing the material at each spot. (gene expression only)
40,000+ spots
2. Oligos (all the same length) can be synthesized on the slide using:
i) inkjet technology
ii) photolithography
1,000,000+ spots
3. There are other technologies that give similar types of results (e.g. "beads").
12
Spotted 2-Channel Array
Spotted arrays are
printed on
coated
microscope
slides.
2 RNA samples
are converted
to cDNA. Each
is labelled with
a different dye.
http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg
13
Format of an Affymetrix Array
http://cnx.rice.edu/content/m12388/latest/figE.JPG
14
Microarray experiments
Print or buy the microarray
Obtain sequence info
select oligos
Print microarray
15
Microarray experiments
Print or buy the microarray
Obtain sequence info
select oligos
sequencing error
assembly error
contamination
unique
similar hybridization rates
Print microarray
16
Microarray experiments
Print or buy the microarray
Obtain sequence info
Create the labeled samples
obtain tissue sample
select oligos
extract RNA
Print microarray
extract mRNA
normalize mRNA
label
17
Microarray experiments
Print or buy the microarray
Obtain sequence info
Create the labeled samples
obtain tissue sample
select oligos
experimental design
-number of biological
replicates
-technical replicates
blocks
sample pooling
extract RNA
Print microarray
extract mRNA
normalize mRNA
label
18
Microarray experiments
Print or buy the microarray
Obtain sequence info
Create the labeled samples
obtain tissue sample
select oligos
extract RNA
extract mRNA
Print microarray
normalize mRNA
hybridize
label
19
Microarray experiments
Print or buy the microrray
Obtain sequence info
Create the labeled samples
obtain tissue sample
select oligos
extract RNA
extract mRNA
Print microarray
normalize mRNA
hybridize
label
hybridization design (multichannel)
20
Microarray experiments
hybridize
scan
process image
detect spots
detect background
compute spot summary
detect bad spots
remove array specific noise
21
Microarray experiments
hybridize
scan
process image
using multiple scans
detect spots
spot detection software
detect background
compute spot summary
detect bad spots
remove array specific noise
pixel mean, median ...
background correction
detection limit
background > foreground
badly printed spots
flaws
normalization
22
Spot Detection
Rafael A Irizarry,
Department of
Biostatistics JHU
[email protected]
http://www.biostat.
jhsph.edu/~ririzarr
http://www.biocon
ductor.org
nci 2002
Adaptive segmentation Fixed circle segmentation
---- GenePix
---- QuantArray
---- ScanAnalyze
Spot uses
morphological opening
23
Gene Expression Microarray experiments
obtain numerical
summary for each gene
or exon on each array
differential expression analysis
sample
classification
clustering genes
and samples
24
Gene Expression Microarray experiments
obtain numerical
summary for each gene
or exon on each array
robust methods to downweight
outliers
data imputation (if needed)
differential expression analysis
sample
classification
discriminant analysis
support vector machines
supervised learning
t-tests, ANOVA
Bayesian versions of above
Fourier analysis of time series
False discovery and nondiscovery
rates
clustering genes
and samples
unsupervised learning
hierarchical clustering
k-means clustering
heatmaps
25
sample clusters show that different regions of
the brain cluster more closely than different species
A heatmap
samples of different regions of the
brain in humans and chimpanzees
gene clusters show that some genes differentiate
among brain regions while other differentiate the
2 species ☺
26
SNP Microarray experiments
obtain numerical
summary for each SNP
estimate SNP frequency
haplotyping
association with subpopulation
27
SNP Microarray experiments
obtain numerical
summary for each SNP
estimate SNP frequency
haplotyping
association with subpopulation
binomial distribution
determine which sets of SNPs come
from each of the 2 chromosomes
association with disease,
ecotype, etc
multivariate analysis
mixture models
28
Tiling Microarray experiments
obtain numerical
summary for each codon
on the chromosome
visualization
29
Tiling Microarray experiments
obtain numerical
summary for each codon
on the chromosome
nonparametric smoothing
visualization
30
p >> n
n=#samples
Usually, we have some type of response on the samples
which may be quantitative (e.g. body mass index,
HDL) or categorical (cancer type, growth stage ...)
p=#measurements
which may be the intensity per gene, exon, locus,
promoter region ...
31
p >> n
n=#samples
Call the response Y typically a n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix
32
p >> n
n=#samples
Call the response Y typically a n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of
the X measures, e.g. k<<n of the p variables.
e.g. Y=Ub + e
where U is the n x k matrix of measurements, b is an
unknown vector of constants and e is random.
33
p >> n
n=#samples
Call the response Y typically a n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of
the p variables.
e.g. Y=Ub + e
where U is the n x k matrix of measurements, b is an unknown vector of constants and e
is random.
If we try to solve Y=Xb and X has rank n, we will always
find an exact solution.
In fact, if we select any submatrix of columns of X with
rank n, we will always find an exact solution, even if
34
those columns are completely independent of U.
p >> n
Another approach is to try to predict using 1 column of X
at a time.
If none of the columns are in U (so that the
corresponding coefficients are 0), then, if we do any
statistical test for b=0 and reject for p-value <a, we will
reject ap of the tests and conclude that the
corresponding spots are associated with Y.
Because usually p>10000, we will make a lot of
mistakes unless a is extremely small.
35
p >> n
All of the special methods for analysis of gene
expression data are developed to solve the p >> n
problem.
36
p >> n
e.g. 2-sample problem
2 conditions: e.g. cancer normal
For gene (exon, locus ...) i, we have n samples with p genes and
observe
Yijk i = gene id
j= condition id
Usual method: 2-sample t-test:
k=sample id
t* 
Yi ,cancer  Yi ,normal
S
2
i ,cancer
n
+
S
2
i , normal
n
37
p >> n
e.g. 2-sample problem
t* 
Yi ,cancer  Yi ,normal
2
i ,cancer
2
i , normal
S
Some ideas to improve selection of S
+
differentially expressed genes:
n
n
1) Force all genes to have the same variability (2-way ANOVA) by
the normalization step.
2) assume that there is a distribution of gene means known in
advance or estimated from the data (Bayesian or empirical
Bayes methods).
3) Use the data to estimate the number of inference errors.
4) Force the data to be normally distributed (within gene) in the
normalization step or use bootstrap or permutation methods
(suitable for fairly large sample size).
38
Unsolved Problems
People are still working on normalization, differential
expression analysis, clustering and classification for
gene expression arrays. There are also problems in
combining data from other sources including
measurements from other platforms, meta-analysis,
and data from the literature.
These problems are not dead, but it will be increasingly
difficult to find new problems without a paradigm
change.
The new arrays (exon, SNP, tiling) will need more new
methodology.
39
Acknowledgements
Francesca Chiaromonte
Floral Genome Project
dePamphilis Lab
Ma Lab
Carlson Lab
McNellis Lab
Pugh Lab
Fedorov Lab
Tony Hua
Xianyun Mao
Bioinformatics Consulting Center
Huck Institute
Diya Zhang
Wenlei Liu
Qing Zhang
Allison Lab (UAB)
40