Transcript Document
Unit 2.3: Introduction to DNA microarrays
Objectives:
-learn the major types and uses of DNA microarrays
-be introduced to the basic principles of analysis of microarray
data
Readings:
Trevino, V., Falciani, F., and Barrera-Saldana, H.A. 2007. DNA
microarrays: a powerful genomic tool for biomedical and clinical
research. Mol Med 13: 527-541.
Kato, H., Saito, K., and Kimura, T. 2005. A perspective on DNA
microarray technology in food and nutritional science. Curr Opin
Clin Nutr Metab Care 8: 516-522.
Genes can be regulated at many levels
Usually, when we speak of gene regulation, we are referring to
transcriptional regulation. The complete set of all genes being
transcribed are referred to as the “transcriptome.”
• transcription
• post transcription (RNA stability)
the “transcriptome”
• post transcription (translational control)
• post translation (not considered gene regulation)
DNA
RNA
TRANSCRIPTION
PROTEIN
TRANSLATION
In the last dozen years, it has become possible to look at
the entire transcriptome in a single experiment!
While there are a number of variations, there are
essentially two basic ways of doing this—using
sequencing-based methods and microarrays. These
have largely replaced older methods such as subtractive
hybridization and differential display.
Sequencing-based methods are very powerful but have
typically been prohibitively expensive. However, with
recent advances in low-cost, high-throughput next
generation sequencing, these methods—referred to as
“RNA-seq”—are becoming more common and may soon
be dominant.
Genomic analysis of gene expression
• Methods capable of giving a “snapshot” of RNA
expression of all genes
• Can be used as diagnostic profile
– Example: cancer diagnosis
• Can show how RNA levels change during
development, after exposure to stimulus, during
cell cycle, etc.
• Provides large amounts of data
• Can help us start to understand how whole
systems function
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
RNA-seq
Although details of the methods vary, the concept behind
RNA-seq is simple:
• isolate all mRNA
• convert to cDNA using reverse transcriptase
• sequence the cDNA
• map sequences to the genome
The more times a given sequence is detected, the more
abundantly transcribed it is. If enough sequences are
generated, a comprehensive and quantitative view of the
entire transcriptome of an organism or tissue can be
obtained.
DNA microarrays
Microarrays may eventully be eclipsed by sequence-based methods,
but meanwhile have become incredibly popular since their inception
in 1995 (Schena et al. (1995) Science 270:467-70).
Microarrays are based on the ability of complementary strands of
DNA (or DNA and RNA) to hybridize to one another in solution
with high specificity.
There are now many variations. We’ll take a quick look at the two
basic types: Affymetrix (high density oligonucleotide) and glass
slide (cDNA, long oligo, etc). Both are conceptually similar, with
differences in manufacture and details of design and analysis.
Basics of microarrays
• DNA attached to solid
support
– Glass, plastic, or nylon
• RNA is labeled
– Usually indirectly
• Bound DNA is the
probe
– Labeled RNA is the
“target”
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Slide-based DNA microarrays
For slide-based microarrays, the probe DNA is affixed
directly to the surface of a glass microscope slide. The probe
DNA can range from a medium-length oligonucleotide (e.g.,
60 nt) to an entire cDNA clone or larger. Oligonucleotide
arrays have become more common and can be obtained from
several different commercial vendors.
The DNA is deposited on the slide by any of a number of
methods, including “printing” with what is essentially an
ink-jet printer and spotting using a robotically controlled set
of fine-tipped metal pins. An example of the latter is seen in
the following slides:
DNA spotting I
• DNA spotting usually
uses multiple pins
• DNA in microtiter
plate
• DNA usually PCR
amplified
• Oligonucleotides can
also be spotted
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Commercial DNA spotter
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Movie of microarray spotting
QuickTime™ and a
YUV420 codec decompressor
are needed to see this picture.
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Slide-based DNA microarrays
In general, slide-based arrays are used to make a direct comparison
between two different RNA samples. These can be a tissue sample
vs. a reference, mutant vs. wild type, treated vs. control, etc. The
microarray provides a readout of the relative differences in
abundance of the RNAs present in each sample.
cell type A
extract
mRNA
cell type B
make
labeled
cDNA
hybridize to
microarray
more in “A”
more in “B”
equal in A & B
cDNA microarrays: key points
•hybridize two samples/chip (i.e., direct comparison of
samples)
• non-standardized production can affect reproducibility (i.e.,
depends a lot on who made them), although there are now
many quality-controlled commercial arrays available
• longer sequences can have cross-hybridization with other
genes
• don’t necessarily need to know all genes in genome: can use
unsequenced ESTs, for instance
Affymetrix GeneChips
• Oligonucleotides
– Usually 20–25 bases in length
– 10–20 different oligonucleotides for each gene
• Oligonucleotides for each gene selected by
computer program to be the following:
– Unique in genome
– Nonoverlapping
• Composition based on design rules
• Empirically derived
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
How microarrays are made:
Affymetrix GeneChips
• Oligonucleotides synthesized on silicon chip
– One base at a time
• Uses process of photolithography
– Developed for printing computer circuits
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Affymetrix GeneChips
Each gene is represented by a “probe set” consisting of 12-20 probes of 25 nt each.
Each probe has a corresponding “mismatch” probe with a single base difference at
the 13th nucleotide. Labeled RNA is hybridized to the array, and a measure of
abundance is calculated based on the amount of hybridization seen for the entire
probe set, correcting for hybridization to the mismatch probes, which indicates
possible non-specific effects.
(12-20/gene)
probe pair
Mismatch probe cells
Affymetrix GeneChips
(12-20/gene)
probe pair
Mismatch probe cells
Affymetrix: key points
• can hybridize only one sample/chip (i.e., no direct
comparisons of 2 samples)
• standardized production tends to give good
reproducibility
• limited amount of probe sequence can be problematic
(miss alternative splices, bias toward one end of
transcript, dependent on good genome annotation), but
can also be helpful in limiting cross- hybridization
Comparison of microarray
hybridization
• Spotted microarrays
– Competitive hybridization
• Two labeled cDNAs hybridized to same slide
• Affymetrix GeneChips
– One labeled RNA population per chip
– Comparison made between hybridization
intensities of same oligonucleotides on different
chips
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Uses of microarrays
• Gene discovery
- tissue profiles
- time course data
- altered genetic backgrounds
• Comparing tissues/genotypes
there are still some inherent difficulties here
• Classification
there’s a lot of promise in medicine (especially cancer
research) for this
Tiled microarrays
So-called tiled microarrays cover a genomic region (or the
whole genome!) at high coverage. Probes are designed to cover
virtually every basepair of the sequence, usually excluding
only simple sequence repeats. In this way, there is no bias
toward known transcribed regions.
genomic sequence
probes on array
probe size and spacing determines the resolution of the array
Other types and uses of microarrays: ChIP-chip
Microarray technology can be combined with other methods for
purposes in addition to looking at transcription (“transcriptional
profiling”). For instance, it can be used along with chromatin
immunoprecipitation (ChIP) to look at proteins bound to DNA
within the cell. This works best with whole genome tiling
arrays and be used to look at transcription factor binding and
post translation modifications to histone proteins.
Other types and uses of microarrays: ChIP-chip
ChIP-chip
Other types and uses of microarrays: RIP-chip
Similar to ChIP-chip but for discovering RNA binding
proteins rather than DNA binding proteins
Tenenbaum, S.A., Carson, C.C., Lager, P.J., Keene, J.D. 2000. Identifying mRNA
subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proc.
Natl. Acad. Sci. 97: 14085–14090.
Other types and uses of microarrays: aCGH
CGH (comparative genomic hybridization) looks at cytogenetic
abnormalities
•genomic DNA hybridized to array
•often uses large clones (e.g., BACs) as array features
Other types and uses of microarrays: PBMs
Protein-binding microarrays can be used to identify transcription
factor binding sequences (motifs)
•double-stranded DNA probes used on array
•purified protein hybridized to array
•detected by antibody to protein or to epitope tag
•can use real genomic sequence or carefully designed
oligonucleotides
•possible to look at all possible 10-mer nucleotide sequences
on a single array!
Berger, M.F. and M.L. Bulyk. 2006. Methods Mol Biol 338: 245-260.
Berger, M.F., A.A. Philippakis, A.M. Qureshi, F.S. He, P.W. Estep, 3rd, and M.L. Bulyk. 2006. Nat
Biotechnol 24: 1429-1435.
Validation of data
There’s no way that all of your microarray data can
be validated.
It’s strongly recommended that any key findings
be verified by independent means.
Northern blots and quantitative RT-PCR are the
typical ways of doing this; real-time, quantitative
RT-PCR is generally the method of choice.
Microarray Experiments
experimental design
statistical processing and analysis
condition 1
condition 2
condition 3
conditions
genes
Experimental Design for Microarrays
There are a number of important experimental design
considerations for a microarray experiment:
•technical vs biological replicates
•amplification of RNA
•dye swaps
•reference samples
Experimental Design for Microarrays
Technical vs biological replicates
•technical replicates are repeat hybridizations using
the same RNA isolate
•biological replicates use RNA isolated from
separate experiments/experimental organisms
Although technical replicates can be useful for
reducing variation due to hybridization, imaging, etc.,
biological replicates are necessary for a properly
controlled experiment
Experimental Design for Microarrays
Amplification of RNA
• linear amplification methods can be used to
increase the amount of RNA so that microarray
experiments can be performed using very small
numbers of cells. It’s not clear to what degree this
affects results, especially with respect to rare
transcripts, but seems to be generally OK if done
correctly
Experimental Design for Microarrays
Dye swaps
When using 2-color arrays, it’s important to hybridize
replicates using a dye-swap strategy in which the
colors (labels) are reversed between the two
replicates. This is because there can be biases in
hybridization intensity due to which dye is used (even
when the sequence is the same).
S1
S2
S1
S2
Experimental Design for Microarrays
Reference samples
•one common strategy is to use a reference sample
in one channel on each array. This is usually
something that will hybridize to most of the
features (e.g., a complex RNA mixture). Using a
reference sample allows comparisons to be made
between different experimental conditions, as each
is compared to the common reference.
S1
R
S2
R
S3
R
compare
S1/R vs. S2/R vs. S3/R
Experimental Design for Microarrays
The bottom line is that you should discuss your
experimental design with a statistician before going
ahead and beginning your experiments. It’s usually
too late and too expensive to change the design once
you’ve begun!
MIAME (Minimal Information
About a Microarray Experiment)
When you publish a microarray experiment, you are expected to make available
the following minimal information. This allows others to evaluate your data and
compare it to other experimental results:
• EXPERIMENT DESIGN
type, factors, number of arrays, reference sample, qc, database
accession (ArrayExpress, GEO)
• SAMPLES USED, PREPARATION AND LABELING
• HYBRIDIZATION PROCEDURES AND PARAMETERS
• MEASUREMENT DATA AND SPECIFICATIONS
quantitations, hardware & software used for scanning and analysis,
raw measurements, data selection and transformation procedures, final
expression data
• ARRAY DESIGN
platform type, features and locations, manufacturing protocols or
commercial p/n
Microarray Experiments
experimental design
statistical processing and analysis
condition 1
condition 2
condition 3
conditions
genes
Analysis of microarray data
• Microarrays can measure the expression of
thousands of genes simultaneously
• Vast amounts of data require computers
• Types of analysis
– Gene-by-gene
• Method: Statistical techniques
– Categorizing groups of genes
• Method: Clustering algorithms
– Deducing patterns of gene regulation
• Method: Under development
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Data Analysis—Don’t try this at home
Microarray analysis is a complex and rapidly evolving
field. Issues include normalization within and among
arrays, limited replication of experiments, and massive
multiple testing (20,000 genes vs 20,000 genes). Each
array platform has its own quirks and requirements.
Although a lot of software packages will do your
analysis for you, working with a true statistician is
highly recommended.
But it’s also important to have a grasp of the basics!
Data Analysis: What genes are
differentially expressed?
• Early days—fold change cutoffs (e.g., 2x difference or better)
• not a very satisfying approach:
-doesn’t take into account variance
-misses any small changes
Here, “A” has a fold change
>2.5, but varies greatly between
replicate experiments. “B” has a
fold change of only 1.75, but
changes reliably each time the
experiment is performed.
Data Analysis: What genes are
differentially expressed?
• Early days—fold change cutoffs (e.g., 2x difference or better)
• not a very satisfying approach:
-doesn’t take into account variance
-misses any small changes
ROC curve for fold change cutoffs
In this example from a
real experiment, you can
see how poorly a fold
change cutoff does at
accurately detecting
most of the changed
genes
Percent True Positives Found (1-false
negative)
70.00%
60.00%
50.00%
40.00%
30.00%
20.00%
10.00%
0.00%
0.00%
5.00%
10.00%
15.00%
False Positive Rate
20.00%
25.00%
Data Analysis: Statistical Considerations
Using statistical tests is usually a better way of determining
which genes are trully changed in your microarray experiment.
The p-value tells you how much confidence to place in your
result.
Many different methods and statistical tests for evaluating
microarray data have been proposed. Your statistician can help
you to determine what will be most effective for you
experiments.
In general, you want to maximize the sensitivity of your
experiment—how many of the truly changed genes you
detect—without sacrificing specificity (i.e., incorrectly thinking
that a gene’s expression is altered).
Data Analysis: Statistical Considerations
Balancing sensitivity and specificity:
Type I error (a, “p-value”)—false positives
Type II error (b)—false negatives
the higher the one, the lower the other
“ROC” (receiver-operator
characteristic) curve
Data Analysis: Statistical Considerations
Microarrays raise unique statistical challenges due to the fact
that they interrogate so many gene simultaneously. This leads to
extreme cases of what statisticians refer to as the multiple
testing problem. The multiple testing problem occurs when even
though the chance of any one of your results being called
significant when it really is not is small, you’re doing so many
statistical tests that it is almost guaranteed that many of your
results will be incorrectly considered significant.
At a = 0.05, 5 of every 100 results might be called significant by
mistake (p<0.05). Normally, this is fine. But with microarrays,
you’re doing up to 20,000 tests per experiment. Thus, you might
erroneously call 1000 genes significantly changed!
Data Analysis: Statistical Considerations
The traditional way to confront this multiple testing
problem (“control the family-wise error rate”) is with
the Bonferroni correction:
Bonferroni correction
set a to desired a/number of tests
so: 0.05/20,000 = 2.5x10-6
That is, instead of using a p-value of 0.05 for each
gene, only consider a result significant if p<2.5x106.
But this is very extreme—you’re not going to find
many genes with such a small p-value!
Data Analysis: Statistical Considerations
But wait: that’s controlling the chance of having any
Type I errors—even just one incorrect result. With
microarrays, it’s often OK to have some false
positives, just not too many. We can therfore control
the False Discovery Rate (FDR).
False Discovery Rate (FDR, or q-value): the expected proportion
of false-positives among the positive results
i.e., at q=0.05, if you call 1000 genes significantly changed, up to
50 of those might be false positives, but the rest will be true
positives.
Often, this is an acceptable trade-off as it allows you
to find many more genes.
Data Analysis: Clustering
What do you do once you have a list of significantly
differentially expressed genes from your experiment?
Sometimes, it’s useful to look at how various subsets
or groups of genes change in different experimental
conditions.
We can do this using different types of clustering
analysis.
Data Analysis: Clustering
• Cluster analysis: dividing samples (genes) into
homogeneous groups based on a set of features
• Steps:
generate expression summaries
measure pairwise distances
cluster
Distance (semi)metrics for pairwise measurements
Euclidean distance
p
d(x, y)
(x y )
2
i
i
i1
Manhattan distance
p
d(x, y) | xi yi |
i1
Correlation coefficient
(xi x )(yi y )
p
r
i1
p
p
(x x ) (y y )
2
i
i1
i
i1
2
Metrics for gene expression
• Need a method to
measure how similar
genes are based on
expression
• Examples
– Euclidean distance
– Pearson correlation
coefficient
Euclidean
distance
Pearson
correlation
coefficient
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Data Analysis: Clustering
• Clustering: supervised and unsupervised methods
• Supervised methods must be trained; examples are
Support Vector Machines (SVM) and Artificial
Neural Networks (ANN)
• Unsupervised methods make no assumptions about
how the data should behave; includes
hierarchical clustering
k-means clustering
self-organizing maps (SOM)
principal component analysis (PCA)
Supervised techniques
• Divide groups of genes
based on sample
properties
• Can predict sample
condition based on gene
expression pattern
• Examples
– Support vector machine
– Nearest neighbor
Support
vector
machine
Nearest
neighbor
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
Data Analysis: Clustering
Hierarchical clustering
At the beginning, each gene is a cluster. In each subsequent
step, the two closest clusters are merged until only one cluster
remains. There are a few different ways of doing this.
conditions
genes
• simple and widely used method
• in large clusters, can lose true representation of expression pattern
• cannot go back—early errors become fixed
Data Analysis: Clustering
k-means clustering
Genes are partitioned into one of k clusters. Assignment is
initially random, then iteratively updated to minimize withincluster distances while maximizing between-cluster differences
• need to decide how many clusters
• computationally intensive
• can be followed up with hierarchical clustering of genes within
each cluster
• non-deterministic (can get different result each time)
• SOMs work differently than k-means, but are similar in overall
concept
Data Analysis: Clustering
Principal component analysis (PCA)
“Although the mathematics is complex, the basic principles are
straightforward. Imagine taking a three dimensional cloud of
data points and rotating it so that you can view it from different
perspectives.You might imagine that certain views would allow
you to better separate the data into groups than other views.
PCA finds those views that give you the best separation of the
data.”
—Quackenbush (2001) Nat Rev Genet 2:418
Data Analysis: Clustering
Data Analysis: Clustering
• Different clustering methods will give you different views of
the data
• There is no “correct” clustering method—clustering is just a
guide that helps you to see the data in different ways
• On the other hand, there are “incorrect” methods—a certain
amount of mathematical validation should be undertaken
• It’s often worth trying a variety of clustering methods to see
what is the most useful for your purposes